A Review of Functional Data Analysis Applying to Time-Course Microarray Data
The first step in FDA is creating an estimate of the gene expression curves from the original (and possibly noisy) raw data. This step is called smoothing and involves representing the expression curves as a linear combination of a finite number of basis functions (e.g. spline, Fourier, wavelets, etc.).
Keywords: estimating the gene expression curves from the original (and possibly noisy) raw data; this step is called smoothing; it represents each expression curve as a linear combination of a finite number of basis functions.
Representing the expression profiles using basis functions allows for the inclusion of non-uniformly sampled data, enables the experimenter to estimate expression values at times different
from those used in the original experiment, allows for the imputation of missing values and facilitates the removal of noise from the measured data.
Keywords: basis functions; non-uniformly sampled data (what does this mean?); imputation of missing values; noise removal.
Once the data have been smoothed, many multivariate techniques which have been extended to
the functional case, e.g. principal components analysis, discriminant analysis and regression analysis, can be applied. These have been used to satisfy some of the main aims in modeling gene expression data, i.e. dimension reduction and clustering to determine groups of co-expressed genes, tests for differential expression between genes across treatment groups, discrimination and classification of genes, etc. and have been shown to have advantages over multivariate approaches.
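Keywords: after smoothing: principal components analysis, discriminant analysis, regression analysis. Aims of gene expression data analysis: dimension reduction; clustering to determine groups of co-expressed genes; differential expression between treatment groups; discrimination and classification of genes.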
It should be noted that many technical details regarding methods of computation etc. have been omitted since this paper constitutes a review of FDA procedures in microarray analyses.
Keywords: technical details; methods of computation; omitted.
Section 5 demonstrates how FDA has been used in time-course microarray analyses to date and describes how FDA can provide additional information about the behavior of gene expression through time.
Keywords: microarray; time-course microarray analysis; to date; additional information.
To date the largest proportion of research papers using FDA techniques have focussed on clustering expression profiles as discussed in Section 5.1. Less work has been carried out in the other main areas of interest, though there has been an increase in the use of FDA techniques in other microarray analyses such as tests for differential expression, discriminating between groups of genes and modeling the relationships between expression profiles.
Keywords: the largest share of the research focuses on functional clustering of gene expression profiles; the other topics appear less often.
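As stated in Ramsay and Silverman (2005), assuming that the error terms are uncorrelated can be unrealistic in an FDA setting since the variance of the errors is likely to change over time or neighboring ε_ij's may be correlated. However, the authors indicate that explicitly modeling variable variance or autocorrelation structure in the errors may not always be necessary if the resulting function estimates are indistinguishable from those obtained from assuming the errors are independent. In any case, it is always pertinent to keep in mind that incorporating more complex error structures may be beneficial and result in better estimates.
Keywords: uncorrelated error terms are unrealistic in FDA, because the error variance changes over time or neighboring observations on the same gene are correlated. However, the authors indicate that explicitly modeling non-constant variance or autocorrelation in the errors is not always necessary if the resulting function estimates are indistinguishable from those obtained assuming independent errors. (So is this the key point to prove? It would be enough to show that the estimates are the same whether or not the errors are assumed independent. But how could that be proved?)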
A key step in FDA is to determine an estimate of the smooth expression curve gi(t) which is achieved via smoothing methods.
Keywords: smoothing methods; smooth expression curve; estimation; key step.
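Smoothing methods represent the discrete expression values as a linear combination of K known functions called basis functions {φ_1(t), ..., φ_K(t)} such that [formula omitted] is a smooth expression curve.
Keywords: smoothing methods; discrete values represented as a linear combination of basis functions, so that the result is a smooth expression curve.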
The basis functions to be used in (2) are chosen to reflect the characteristic behavior of the data, e.g. Fourier basis functions are suitable for periodic data, B-spline basis functions are suitable for non-periodic data, etc.
Keywords: reflect the characteristic behavior of the data.
In addition, it is necessary to estimate the vector of basis function coefficients c_i. One way to estimate c_i is via least squares.
Keywords: the basis function coefficients must be estimated; one method is least squares.
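The least squares step can be sketched as follows. This is a minimal illustration and not the paper's implementation: the Fourier basis, the function names `fourier_basis` and `fit_least_squares`, and the toy time points are all assumptions made for the example. It also shows how the fitted curve can be evaluated at a time that was never sampled, using non-uniform sampling times.

```python
import numpy as np

def fourier_basis(t, n_harmonics, period):
    """Evaluate the basis [1, sin(wt), cos(wt), sin(2wt), ...] at times t."""
    t = np.asarray(t, dtype=float)
    omega = 2.0 * np.pi / period
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(k * omega * t))
        cols.append(np.cos(k * omega * t))
    return np.column_stack(cols)  # shape (len(t), 2 * n_harmonics + 1)

def fit_least_squares(t_obs, y_obs, n_harmonics, period):
    """Estimate the coefficient vector c_i by ordinary least squares."""
    Phi = fourier_basis(t_obs, n_harmonics, period)
    coef, *_ = np.linalg.lstsq(Phi, y_obs, rcond=None)
    return coef

# Hypothetical gene measured at non-uniform times over one cycle of length 10.
t_obs = np.array([0.0, 0.5, 1.5, 2.0, 3.5, 5.0, 6.0, 7.5, 8.0, 9.5])
y_obs = 2.0 + np.sin(2 * np.pi * t_obs / 10.0)  # noise-free toy profile

coef = fit_least_squares(t_obs, y_obs, n_harmonics=2, period=10.0)

# Evaluate the fitted curve at a time point that was not in the experiment.
g_hat = fourier_basis([4.2], 2, 10.0) @ coef
```

With real data the choice of basis and of K matters; here K = 5 is fixed only to keep the example small.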
In this instance the number of basis functions K affects the smoothness of the results. The choice of an optimal value for K is a complex problem and it is difficult to control the amount of smoothing applied to the data. As a result, Ramsay and Silverman (2005) advocate the use of smoothing splines where K = n_i and over-fitting is controlled by adding a penalty term to the optimization problem.
Keywords: number of basis functions; degree of smoothness; a complex problem; smoothing splines are recommended, where the number of basis functions equals the number of observations; over-fitting is controlled by the penalty term.
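The effect of a roughness penalty can be sketched as below. Note the substitution: instead of a smoothing spline with K = n_i, this example keeps a small Fourier basis and penalizes the integrated squared second derivative, because that penalty matrix is diagonal and easy to write down. The names `curvature_penalty`, `fit_penalized` and the smoothing parameter `lam` are assumptions for illustration.

```python
import numpy as np

def fourier_basis(t, n_harmonics, period):
    """Evaluate the basis [1, sin(wt), cos(wt), sin(2wt), ...] at times t."""
    t = np.asarray(t, dtype=float)
    omega = 2.0 * np.pi / period
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols += [np.sin(k * omega * t), np.cos(k * omega * t)]
    return np.column_stack(cols)

def curvature_penalty(n_harmonics, period):
    """Matrix of integrals of phi_j''(t) * phi_k''(t) over one period.

    For this basis the matrix is diagonal: 0 for the constant and
    (k * omega)**4 * period / 2 for the k-th sine and cosine.
    """
    omega = 2.0 * np.pi / period
    diag = [0.0]
    for k in range(1, n_harmonics + 1):
        diag += [(k * omega) ** 4 * period / 2.0] * 2
    return np.diag(diag)

def fit_penalized(t_obs, y_obs, n_harmonics, period, lam):
    """Minimize the residual sum of squares plus lam * c' R c."""
    Phi = fourier_basis(t_obs, n_harmonics, period)
    R = curvature_penalty(n_harmonics, period)
    return np.linalg.solve(Phi.T @ Phi + lam * R, Phi.T @ y_obs)

rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 10.0, 20, endpoint=False)
y_obs = np.sin(2 * np.pi * t_obs / 10.0) + rng.normal(scale=0.3, size=t_obs.size)

R = curvature_penalty(4, 10.0)
c_rough = fit_penalized(t_obs, y_obs, 4, 10.0, lam=0.0)   # plain least squares
c_smooth = fit_penalized(t_obs, y_obs, 4, 10.0, lam=5.0)  # penalized fit

def roughness(c):
    """Integrated squared second derivative of the fitted curve."""
    return float(c @ R @ c)
```

Increasing `lam` can only lower the roughness of the fitted curve; in practice the value would be chosen by a criterion such as cross-validation, which is not shown here.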
Using basis functions has several advantages: facilitating the removal of measurement error, allowing the inclusion of non-uniformly sampled data, enabling the estimation of expression values at times different from those used in the original experiment and allowing for the imputation of missing values.
Keywords: facilitates removal of measurement error; allows non-uniformly sampled data; estimation of expression values at time points other than those observed; imputation of missing values.
Another key advantage of representing the raw gene expression data as gene expression functions is the availability of derivative information. This is particularly useful for the analysis of time-course microarray data since gene expression is part of a biological system and much information about the behavior of a system is contained in the derivatives.
Keywords: raw gene expression data; gene expression functions; derivative information; time-course microarray data; gene expression; biological system; information.
As a result, in some instances smooth estimates of the derivative(s) of the expression curves may be required rather than estimates of the original expression profiles. When the interest is in the estimation of the derivatives, Ramsay and Silverman (2005) suggest altering the penalty term on the RHS of (4) to ensure smoothness of the derivative. For example, if velocity curves are required, the curvature of the velocity curves should be penalized. This equates to changing the penalty term to [formula omitted].
Keywords: smooth estimates of the derivatives rather than of the original expression profiles; changing the penalty.
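Once a curve is written as a basis expansion, its derivative follows from differentiating the basis functions and reusing the same coefficients. The sketch below illustrates this for a Fourier basis fitted by ordinary least squares; the `deriv` argument and the toy profile are assumptions for the example, and no derivative-specific penalty is applied.

```python
import numpy as np

def fourier_basis(t, n_harmonics, period, deriv=0):
    """Fourier basis (deriv=0) or its first derivative (deriv=1) at times t."""
    if deriv not in (0, 1):
        raise ValueError("only deriv=0 or deriv=1 is supported in this sketch")
    t = np.asarray(t, dtype=float)
    omega = 2.0 * np.pi / period
    # The constant basis function has derivative zero.
    cols = [np.ones_like(t) if deriv == 0 else np.zeros_like(t)]
    for k in range(1, n_harmonics + 1):
        w = k * omega
        if deriv == 0:
            cols += [np.sin(w * t), np.cos(w * t)]
        else:
            cols += [w * np.cos(w * t), -w * np.sin(w * t)]
    return np.column_stack(cols)

# Toy expression profile sampled at 12 time points over one cycle.
t_obs = np.linspace(0.0, 10.0, 12, endpoint=False)
y_obs = 1.0 + 2.0 * np.sin(2 * np.pi * t_obs / 10.0)

Phi = fourier_basis(t_obs, 2, 10.0)
coef, *_ = np.linalg.lstsq(Phi, y_obs, rcond=None)

# Velocity curve: same coefficients, differentiated basis.
t_new = np.linspace(0.0, 10.0, 50)
velocity = fourier_basis(t_new, 2, 10.0, deriv=1) @ coef
```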
Functional principal components analysis (FPCA) is the functional analogue of multivariate principal components analysis. FPCA is a very common method used to summarize functional data and is used to identify the characteristic features of a set of functions. It also provides a way of looking at the variance structure, which can often be more informative than a direct examination of the variance-covariance function. Intuitively FPCA determines the main modes of variation in a set of curves. The rth functional principal component (FPC) is the weight function ξ_r(t) chosen to maximize the variance of the functional principal components scores
Keywords: functional principal components analysis; a common method; identifies the characteristic features of a set of functions; provides a way to look at the variance structure, more informative than examining the variance-covariance function directly; intuitively, the modes of variation in a set of curves; weight function; maximizes the variance of the FPC scores.
Each functional principal component ξ_r(t) is a function describing a particular pattern of behavior over the entire time interval. A high positive/negative score on a particular FPC indicates that that gene is exhibiting the behavioral pattern represented by that component. FPCA has been used extensively when analyzing time-course microarray data. It has been applied when clustering expression profiles as discussed in Section 5.1. In addition, since the FPCs form a set of orthonormal basis functions, some authors have used the FPCs to approximate the data (and/or covariate functions) such that [formula omitted], e.g. when using the functional principal components approach to estimate the regression functions as shown in Sections 4 and 5.3.
Keywords: describes a particular pattern of behavior over the whole time interval; a high positive score means the gene exhibits the pattern of that component; common in time-course microarray analysis, e.g. when clustering expression profiles; orthonormal basis functions; approximating the data; covariate functions; estimating regression functions.
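A discretized version of FPCA can be sketched as follows, assuming the smoothed curves have already been evaluated on a common, equally spaced grid. The function name `fpca_on_grid` and the simulated curves are hypothetical; the eigenproblem for the covariance function is approximated by an eigen-decomposition of the sample covariance matrix scaled by the grid spacing.

```python
import numpy as np

def fpca_on_grid(curves, t_grid, n_components):
    """FPCA for curves of shape (n_genes, n_times) on an equally spaced grid."""
    dt = t_grid[1] - t_grid[0]
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    cov = centered.T @ centered / (curves.shape[0] - 1)  # v(s, t) on the grid
    # Multiplying by dt approximates the integral operator with kernel v(s, t).
    eigvals, eigvecs = np.linalg.eigh(cov * dt)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals = eigvals[order]
    # Rescale so that each component has unit integrated square.
    harmonics = eigvecs[:, order].T / np.sqrt(dt)
    # Score of curve i on component r: integral of xi_r(t) * centered x_i(t).
    scores = centered @ harmonics.T * dt
    explained = eigvals / (np.trace(cov) * dt)
    return mean_curve, harmonics, eigvals, scores, explained

# Simulated curves with two dominant modes of variation plus small noise.
rng = np.random.default_rng(1)
t_grid = np.linspace(0.0, 1.0, 101)
a = rng.normal(scale=2.0, size=40)
b = rng.normal(scale=1.0, size=40)
curves = (a[:, None] * np.sin(2 * np.pi * t_grid)
          + b[:, None] * np.cos(2 * np.pi * t_grid)
          + rng.normal(scale=0.05, size=(40, t_grid.size)))

mean_curve, harmonics, eigvals, scores, explained = fpca_on_grid(curves, t_grid, 3)
```

The `explained` vector plays the role of the proportions read off a scree plot; the variance of each column of `scores` equals the corresponding eigenvalue.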
Functional Regression Analysis
Functional linear models attempt to express one dependent variable as a linear combination of other features or measurements. A model can be functional in one of two instances: the dependent variable is a function, or one or more independent variables are functions. Coefficient vectors (as given in standard multivariate regression problems) become coefficient functions β(t), and a key theme in functional regression analysis is estimating β(t) so that the results are interpretable. For example, say we have a scalar response variable y_i and a predictor function x_i(t); then we can write [formula omitted], where the constant term is the intercept and β(t) is the regression coefficient function. When y_i is binary, this becomes a functional logistic regression model. Other functional regression models include the varying coefficient model
Keywords: functional linear model; dependent variable; functional in one of two cases: the dependent variable is a function, or the independent variables are functions; coefficient vectors; standard multivariate analysis; interpretable; scalar response; functional predictor; binary; varying coefficient model.
where the response is now a function and the predictor is a continuous variable whose relationship with y_i(t) changes over time; the concurrent functional model, where both the response and predictor are functions but the value of y_i(t) depends only on the current value of x_i(t); and the non-concurrent model, where the value of y_i(t) is influenced by x_i(t) over all t. In each case there is an issue with under-determination, i.e. there are a finite number of observations to determine the infinite-dimensional β(t). This results in an infinite number of possible solutions for β(t). There are three main ways to overcome this problem. We present the implementation details for the simplest case when the predictor(s) are functions and the response is a scalar. These results can be generalized to the case when both the predictor(s) and the response are functions (see Ramsay and Silverman, 2005, and Ramsay et al., 2009 for implementation details).
Keywords: functional response with a continuous predictor; concurrent functional model; non-concurrent model; in each case a finite number of observations must determine an infinite-dimensional β(t), giving infinitely many solutions; three methods; implementation for the simplest case; the results generalize.
The first method assumes that both the predictor x_i(t) and the regression function β(t) can be represented using a finite number of K_z and K_β basis functions respectively, such that [formula omitted] (we call this the basis function approach). However, if K_β is too large the total number of basis functions may still exceed the number of observations available, while if K_β is too small the resulting estimate of β(t) may miss important features in the data. As a result a second method can be employed (termed here the roughness penalty approach), which involves estimating β(t) using a roughness penalty by minimizing the penalized sum of squares
Keywords: first method: representation with a finite number of basis functions (the basis function approach); if K is too large, the number of basis functions exceeds the number of available observations; if K is too small, the estimate may miss important features. Second method: the roughness penalty approach, which estimates β(t) with a roughness penalty by minimizing ...
where PEN[β(t)] is a penalty suitable for the problem under consideration (e.g. penalizing the second derivative as in (4)). This approach allows for more direct control over smoothing, which reduces the chance that important features are missed by using too few basis functions. The third method regresses y on the first R principal component scores for the functional covariate. This involves expressing the functional covariates as [formula omitted], where x̄(t) denotes the mean curve and f_ir denotes the score of the ith curve on the rth component ξ_r(t). y_i can then be represented by the model [formula omitted] and the coefficient function β(t) can be reconstructed as β(t) = Σ_r β_r ξ_r(t).
Keywords: penalty; suitable for the problem; penalizing the second derivative; direct control over smoothing; reduces the chance of missing important features through using too few basis functions.
Third method: principal component scores; functional covariates; mean curve; scores.
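The third method can be sketched on simulated data as follows: compute the FPC scores of the predictor curves, regress the scalar response on those scores, and rebuild the coefficient function as a weighted sum of the components. Everything in the example (the grid, the two generating functions, the noise-free response, R = 2) is an assumption chosen so that the true coefficient function is known and can be compared with the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 100
t = np.linspace(0.0, 1.0, m, endpoint=False)
dt = t[1] - t[0]

# Two orthonormal functions on [0, 1] generate the predictor curves.
phi1 = np.sqrt(2.0) * np.sin(2 * np.pi * t)
phi2 = np.sqrt(2.0) * np.cos(2 * np.pi * t)
a = rng.normal(scale=2.0, size=n)
b = rng.normal(scale=1.0, size=n)
X = a[:, None] * phi1 + b[:, None] * phi2          # predictor curves, (n, m)

beta_true = 3.0 * phi1 - 1.0 * phi2                # coefficient function
y = 1.0 + (X @ beta_true) * dt                     # y_i = 1 + integral of x_i * beta

# Step 1: FPCA of the predictor curves on the grid.
x_bar = X.mean(axis=0)
Xc = X - x_bar
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov * dt)
n_comp = 2
top = np.argsort(eigvals)[::-1][:n_comp]
xi = eigvecs[:, top].T / np.sqrt(dt)               # components, (n_comp, m)
scores = Xc @ xi.T * dt                            # FPC scores, (n, n_comp)

# Step 2: ordinary regression of y on the first n_comp scores.
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Step 3: rebuild the coefficient function from the score coefficients.
beta_hat = coef[1:] @ xi
```

Because the response here is noise-free and the true coefficient function lies in the span of the two retained components, `beta_hat` reproduces `beta_true`; with real data the number of components trades bias against variance.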
Müller and Yao (2008) have extended this approach to functional additive models, where the linear relationship between y_i and the FPC scores shown in (19) is replaced by an additive relationship as given by [formula omitted], where g_r() is an arbitrary functional relationship. The functions g_r() are estimated using a local linear regression to the data (f_ir, y_i) such that [formula omitted] is minimized with respect to β_0 and β_1, where h_r is the bandwidth and K_1() is a kernel function. This results in more flexible models and allows for the direct examination of the role of each eigenfunction in predicting the response.
Keywords: functional additive model; arbitrary functional relationship; local linear regression; minimization; bandwidth; kernel function; more flexible models; direct examination; eigenfunctions.
Using functional linear models overcomes problems associated with multivariate regression analysis. These include having observations measured at different time points, the high correlation between observations on the same gene, difficulties encountered due to the high dimensionality of both the response and covariate(s) functions (where the number of observations for each gene may exceed the sample size), the need to use multiple testing procedures and incorporating the smoothness of the underlying expression profiles. Using the FDA approach to analyze data also has the advantage of providing derivative information, which greatly extends the power of FDA over multivariate methods. Functional linear models can be used to provide direct examination of relationships between derivatives that could otherwise only be studied indirectly. The use of functional linear models in the study of relationships between derivatives is discussed in Section 6.2. As stated in that section, to date derivative information has not been used extensively when analyzing expression profiles. However, we believe that since gene expression is part of a biological system, modeling such relationships may prove insightful for these types
Keywords: observations at different time points; high correlation between observations on the same gene; high dimensionality; multiple testing procedures; incorporating smoothness; derivative information; power; studying relationships between derivatives.
There are many clustering algorithms available to cluster gene expression data, e.g. k-means, hierarchical clustering, self-organizing maps, fuzzy clustering, Bayesian clustering, multivariate Gaussian mixture models, etc. However, these multivariate methods have their limitations. Some do not account for between time-point correlation or assume the correlation has some specified structure (e.g. autoregressive) that may not be appropriate for microarray data. Others require uniform sampling points for all genes or fail to produce clusters when the number of time points is large (see Wang, Neill, and Miller, 2008 for full discussion). Using FDA techniques has circumvented many of these limitations and much of the work to date using FDA to analyze time-course gene expression data has focussed on cluster analysis. Some early examples include papers by Bar-Joseph, Gerber, Gifford, Jaakkola, and Simon (2002) and Luan and Li (2003), who simultaneously develop a model which represents the mean curve in each cluster, μ_c(t), using a linear combination of K cubic spline or B-spline basis functions respectively (as in (3)) and cluster using the EM algorithm.
Keywords: clustering gene expression data; multivariate Gaussian mixture models; correlation between time points not accounted for; an assumed correlation structure that may not suit microarray data; uniform sampling points; failure to cluster when the number of time points is large; mean curves modeled simultaneously; clustering with the EM algorithm.
As stated in Section 2, it is difficult to choose the optimal value for K and choosing different values can alter the results. To overcome this problem Ma, Castillo-Davis, Zhong, and Liu (2006) model μ_c(t) using smoothing splines, incorporating a penalty into the optimization criterion as shown in (4), and cluster the expression profiles using the rejection-controlled EM (RCEM) algorithm.
Keywords: choosing the optimal K changes the results; smoothing splines; adding a penalty term; the rejection-controlled EM (RCEM) algorithm.
Ma and Zhong (2008) extend this to incorporate additional covariate effects into the clustering algorithm. Wang et al. (2008) propose an agglomerative clustering algorithm for functional data based on a new similarity measure and compare the results with many other clustering approaches such as k-means, self-organizing maps, smoothing spline-based clustering using ssclust (see Ma et al., 2006), Gaussian finite mixture model-based clustering using mclust (see Fraley and Raftery, 2002, Fraley and Raftery, 2006), etc. available in the R statistical programming environment.
Keywords: additional covariate effects in the clustering algorithm; agglomerative clustering algorithm; a new similarity measure; comparison with the results of other clustering algorithms.
Other authors have used functional principal components analysis to cluster time-course expression data. For example Song, Lee, Morris, and Kang (2007) smooth the data, calculate the FPCs as outlined in Ramsay and Silverman (2005), Chapter 8, and cluster the vector of FPC scores using Gaussian finite mixture models and the mclust algorithm. An alternative approach is to cluster expression profiles based on the derivative of the curves. As stated in Section 4, use of the derivative has received little attention in bioinformatics literature to date. The derivative contains information about the shape (change pattern) of the expression profiles and since gene expression can be considered part of a biological system, it could be argued that using the derivative is sensible from a biological perspective. One way of clustering based on the derivative involves smoothing the raw expression data, calculating the first derivative of the expression profiles, determining the FPCs of the derivative and clustering the resulting scores using an appropriate clustering algorithm.
Keywords: functional principal components; smoothing; vector of FPC scores; the mclust algorithm; clustering expression profiles on the derivative of the curves; smooth the raw expression data, calculate the first derivative, obtain the FPC scores, cluster the scores.
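The final step of that pipeline, clustering the vectors of FPC scores, can be sketched as below. This example deliberately swaps in plain k-means for the Gaussian mixture model fitted by mclust, purely to keep the code short and dependency-free. The score matrix is simulated, and the helper names `farthest_point_init` and `kmeans` are hypothetical.

```python
import numpy as np

def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[dist.argmax()])
    return np.array(centers)

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on the rows of `points` (one row of scores per gene)."""
    centers = farthest_point_init(points, k)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = points[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Simulated FPC scores (2 components) for two groups of 20 genes each.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(20, 2))
group_b = rng.normal(loc=[-5.0, 0.0], scale=0.1, size=(20, 2))
scores = np.vstack([group_a, group_b])

labels, centers = kmeans(scores, k=2)
```

The same function could be applied to scores computed from the derivative curves instead of the original curves, which is the comparison described in the text.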
Déjean, Martin, Baccini, and Besse (2007) calculate the multivariate principal components of the matrix resulting from discretizing the first derivative of the smoothed expression profiles. The principal component scores are then clustered using a combination of k-means and hierarchical clustering algorithms. An alternative procedure is suggested by Kim and Kim (2008). These authors cluster change patterns of expression profiles by smoothing the data using a Fourier expansion and calculating the derivatives of the resulting curves using the Fourier coefficients. The Fourier coefficients of the derivative are then clustered using k-means and model-based clustering and the results compared. The authors report that using this method identifies clusters of co-expressed genes not identified by k-means clustering. It could be expected that the derivative may contain different information than the functions and therefore may result in different clustering outcomes. We carried out an initial investigation (results shown in Section 7.1) to examine any differences between clustering results obtained using the FPC scores of the functions versus clustering results obtained using the FPC scores of the derivatives.
Keywords: multivariate principal components; discretized first derivative.
The LHS of Figure 2 displays the scree plot of the eigenvalues of the FPCs estimated using the original expression curves. This plot indicates that the first 3 FPCs should be retained. These FPCs are shown on the RHS of Figure 2 and account for over 94% of the variation in the data. The first 4 FPCs estimated using the first derivative of the expression profiles, as displayed in Figure 3, account for 84% of the variation in the data.
Keywords: scree plot of the FPCA eigenvalues; original expression curves; the first three FPCs are retained and shown in the right-hand panel; they account for ... of the variation; the first four FPCs estimated from the first derivative of the expression profiles are shown in Figure 3 and account for ... of the variation.
Though the scree plot indicates that FPC 5 could also be included in the analysis, the plot of FPC 5 shows that this component contains many oscillations and is therefore most likely attributable to noise. The vectors of scores for these 3 (4) FPCs were then supplied to mclust. The resulting clusters are shown in Figures 4 and 6. The first derivatives of the members of each cluster are given in Figures 5 and 7.
Keywords: the scree plot suggests FPC 5 could be included, but its plot shows too many oscillations, so it is more likely noise; the vectors of FPC scores are passed to mclust.
Table 1 displays a comparison of the classifications for each method. Cluster 1 obtained using the original expression profiles mainly consists of a combination of observations from Clusters 1 and 5 obtained using the first derivative of the expression profiles. The additional separation of these observations into two clusters as achieved using the derivatives appears to differentiate between genes whose expression levels peak almost immediately (typically before time 2) before decreasing for the remainder of the cycle (Cluster 1) and genes with expression levels that exhibit a more gradual increase in expression levels (to a peak value at approximately time 5) before decreasing for the remainder of the cycle (Cluster 5). Cluster 3 for both methods identifies genes whose expression levels change little throughout the cycle, though using the derivatives has identified 35 more genes belonging to this cluster. Clusters 2 and 4 determined using the functions contain 232 genes while Clusters 2 and 4 determined using the derivatives contain 190 genes. Cluster 4 is the main difference between the two methods. Using the derivative has identified genes exhibiting a rapid decrease in expression levels in the initial stage of the cycle before gradually increasing for the remainder of the cycle. Such a cluster is not evident when using the original expression profiles.
Keywords: comparison of the clustering results from the two methods.
The following describes an application of functional regression analysis to a subset of the Drosophila melanogaster dataset analyzed by Arbeitman et al. (2002). The authors identified a group of “strictly maternal” genes, or genes that were expressed in the embryo phase and then re-expressed in the pupal-adult phase of female flies. Therefore we wish to model the relationship between expression levels in the pupal-adult phase, y_i(t), and expression levels in the embryo phase, x_i(t). In this instance both the response and the predictor variables are functions as shown in Figure 8.
Keywords: “strictly maternal” genes; embryo phase; expression levels.
The previous section examines the relationship between expression levels in the
embryo stage and pupal-adult stage of development of female flies. However, the
expression profiles of male flies were also recorded in the pupal-adult stage by
Arbeitman et al. (2002). Figure 10 suggests that female flies (black curves) have
higher expression levels of strictly maternal genes than male flies (red curves) in
the pupal-adult phase.
Keywords: gene expression levels; comparison of males and females.
Critical values for T(t) are determined using a permutation test by randomly shuffling the male (M) and female (F) labels on the curves and calculating the maximum of T(t) using these new labels.
Keywords: critical values; permutation test; random shuffling.
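The permutation procedure can be sketched as follows. The notes do not record the exact form of T(t), so this example assumes a pointwise absolute two-sample t-statistic evaluated on a common time grid; the simulated male and female curves, the number of permutations and the 5% level are likewise assumptions for illustration.

```python
import numpy as np

def t_curve(group_a, group_b):
    """Pointwise absolute two-sample t-statistic on a common time grid."""
    n_a, n_b = len(group_a), len(group_b)
    diff = group_a.mean(axis=0) - group_b.mean(axis=0)
    se = np.sqrt(group_a.var(axis=0, ddof=1) / n_a
                 + group_b.var(axis=0, ddof=1) / n_b)
    return np.abs(diff) / se

def permutation_critical_value(group_a, group_b, n_perm=500, alpha=0.05, seed=0):
    """Upper-alpha quantile of max_t T(t) under random relabeling of curves."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([group_a, group_b])
    n_a = len(group_a)
    max_stats = np.empty(n_perm)
    for p in range(n_perm):
        idx = rng.permutation(len(pooled))          # shuffle the M/F labels
        max_stats[p] = t_curve(pooled[idx[:n_a]], pooled[idx[n_a:]]).max()
    return np.quantile(max_stats, 1.0 - alpha)

# Simulated curves: the groups agree until time 15, then females are higher.
rng = np.random.default_rng(3)
t_grid = np.linspace(0.0, 30.0, 61)
base = np.sin(t_grid / 5.0)
male = base + rng.normal(scale=0.5, size=(10, t_grid.size))
female = (base + 5.0 * (t_grid > 15.0)
          + rng.normal(scale=0.5, size=(10, t_grid.size)))

observed = t_curve(female, male)
crit = permutation_critical_value(female, male)
significant = observed > crit                       # time points where T(t) exceeds it
```

Using the maximum of T(t) over time gives a single critical value that controls the error rate across the whole interval rather than at each time point separately.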
Figure 11 displays the observed test statistic T(t) and corresponding critical values and indicates that there is no significant difference between the expression levels of male and female flies until just after the onset of adulthood, where expression levels of these genes in females are significantly higher than in males.
Keywords: test statistic; corresponding critical values; significant difference; until; onset of adulthood.
A major advantage of FDA over multivariate techniques is that it facilitates the
use of derivative information in the curves. This may be particularly useful in time-
course microarray data analyses since gene expression is part of a biological system
and such systems are typically modeled using differential equations.
Keywords: multivariate analysis; derivative information; time-course microarray data analysis; differential equations.
The first term in this model is proportional to the speed at which the system moves, while the second term represents position-dependent forces.
Keywords: speed at which the system moves; position-dependent forces.
the system tends to exhibit some oscillation that gradually disappears.
Keywords: initially negative, which indicates the initial increase in gene expression.
Figure 13 displays β_1(t) and d(t) for the pupal-adult data. d(t) is initially negative, which corresponds to an initial increase in energy marking the beginning of gene expression, followed by a period when d(t) = 0 and the system is in equilibrium. During this time, expression levels are relatively stable. At approximately time 15 (i.e. the onset of adulthood) d(t) quickly becomes negative. At this point the system is exhibiting some oscillatory behavior corresponding to an increase in energy in the system prior to the large jump in expression levels between times 16 and 21. However, after this initial burst the change in expression levels is quite stable. From time 21 onwards d(t) is positive and β_1(t) is negative, implying that after the sharp increase in expression levels between times 16 and 21, the system contains a very large amount of energy and is behaving like a rapidly oscillating spring. From this simple example it can be seen that PDA gives real insight into the dynamics of expression profiles in this group of genes. We believe that PDA may be particularly useful when examining differences between the expression levels of genes across two groups, e.g. across two treatment conditions, where information about how the behavior of the derivatives differs across groups may give additional information regarding why expression levels differ. All of the above analysis was carried out using the fda package in R.
关键词:
关键词:creating an estimate of the gene expression curves rom the original (and possibly noisy) raw data. 从原始数据得到基因表达曲线的估计。这一步叫做光滑。包括把基因表达曲线表示成有限个基函数的线性组合。
Representing the expression profiles using basis functions allows for the inclusion of non-uniformly sampled data, enables the experimenter to estimate expression values at times different
from those used in the original experiment, allows for the imputation of missing values and facilitates the removal of noise from the measured data.
关键词:基函数,样本函数非一致性(什么意思),填补缺失值。去掉噪音。
Once the data have been smoothed, many multivariate techniques which have been extended to
the functional case, e.g. principal components analysis, discriminant analysis and regression analysis, can be applied. These have been used to satisfy some of the main aims in modeling gene expression data, i.e. dimension reduction and clustering to determine groups of co-expressed genes, tests for differential expression between genes across treatment groups, discrimination and classification of genes, etc. and have been shown to have advantages over multivariate approaches.
关键词:光滑后,主成分分析,判别分析,回归分析。基因表达数据的分析的目的:降维,聚类以决定共同表达基因的组,不同治疗组的基因表达的差分,基因的判别和分类。
It should be noted that many technical details regarding methods of computation etc. have been omitted since this paper constitutes a review of FDA procedures in microarray analyses.
关键词:技术细节,计算的方法,省略
Section 5 demonstrates how FDA has been used in time-course microarray analyses to date and describes how FDA can provide additional infor-mation about the behavior of gene expression through time.
关键词:微阵列,时间的微阵列分析,目前,额外信息
To date the largest proportion of research papers using FDA techniques have focussed on clustering expression profiles as discussed in Section 5.1. Less work has been carried out in the other main areas of interest though there has been an increase in the use of FDA techniques in other microarray analyses such as tests for dierential expression,discriminating between groups of genes and modeling the relationships between expression profiles.
关键词:最大比例的研究,集中在基因表达谱的函数型的聚类分析,其他的出现的少。
As stated in Ramsay and Silverman (2005), assuming that the error terms are uncorrelated can be unrealistic in a FDA setting since the variance of the errors is likely to change over time or neighboring i j’s may be correlated. However, the authors indicate that explicitly modeling variable variance or autocorrelation structure in the errors may not always be necessary if the resulting
function estimates are indistinguishable from those obtained from assuming the errors are independent. In any case, it is always pertinent to keep in mind that incorporating more complex error structures may be beneficial and result in better estimates.
关键词:误差项不相关,但是在FDA中不现实。因为误差的方差随着时间改变。或者同一个基因的相邻的观测相关。然而作者指出,显式的对误差的不同方差或者自相关系数结构进行建模通常不必要,如果函数型估计与假定误差是独立的不同的话没有区别。(所以证明的关键问题是否在这儿?如果能证明,假定不假定误差是独立的,得到的估计都没差别。但是这要如何证明呢)
A key step in FDA is to determine an estimate of the smooth expression curve gi(t) which is achieved via smoothing methods.
关键词:通过光滑方法,光滑表达,估计,关键
Smoothing methods represent the discrete expression values as a linear combination of K known functions called basis functions f 1(t) K(t)g such that 公式(略)
is a smooth expression curve.
关键词:光滑方法,把离散值表示成,基函数的线性组合,使得是个光滑表达曲线
The basis functions to be used in (2) are chosen to re-flect the characteristic behavior of the data, e.g. Fourier basis functions are suitable for periodic data, B-spline basis functions are suitable for non-periodic data, etc.
关键词:反应数据的特征表现,
In addition, it is necessary to estimate the vector of basis function coecients ci. One
way to estimate ci is via least squares
关键词: 求基函数的系数是必要的,一个方法是最小二乘。
In this instance the number of basis functions K affects the smoothness of the results. The choice of an optimal value for K is a complex problem and it is dicult to control the amount of
smoothing applied to the data. As a result, Ramsay and Silverman (2005) advocate
the use of smoothing splines where K = ni and over-fitting is controlled by adding a
penalty term to the optimization problem.
关键词:基函数的个数;光滑的程度,复杂的问题,推荐使用光滑样条,因为光滑样条的基函数的个数等于样本数。过拟合由惩罚项控制。
using basis functions has several advantages: facilitating the removal of measurement error, allowing the inclusion of non-uniformly sampled data, enabling the estimation of expression values at times different from those used in the original experiment and allowing for the imputation of missing values.
关键词:有利于消除误差项,允许不一致的样本数据,与原观测不同时间点的表达值得估计,填充缺失值;
Another key advantage of representing the raw gene expression data as gene expression func-
tions is the availability of derivative information. This is particularly useful for the analysis of time-course microarray data since gene expression is part of a biolog-ical system and much information about the behavior of a system is contained in the derivatives.
关键词:粗的基因表达数据,基因表达函数,导数信息,时间过程微阵列数据,基因表达,生物系统,信息
As a result, in some instances smooth estimates of the derivative(s) of the expression curves may be required rather than estimates of the original expression profiles. When the interest is in the estimation of the derivatives, Ramsay and Silverman (2005) suggest altering the penalty term on the RHS of (4) to ensure smoothness of the derivative. For example if velocity curves are required, the curvature of the velocity curves should be penalized. This equates to changing the
penalty term to公式(略)
关键词:导数的光滑估计,而不是原始基因表达谱的估计,改变惩罚
Functional principal components analysis (FPCA) is the functional analogue of multivariate principal components analysis. FPCA is a very common method used to summarize functional data and is used to identify the characteristic features of a set of functions. It also provides a way of looking at the variance structure, which can often be more informative than a direct examination of the variance-covariance function. Intuitively FPCA determines the main modes of variation in a set of curves. The rth functional principal component (FPC) is the weight function r(t) chosen to maximize the variance of the functional principal components scores
关键词:函数型主成分分析,常用分析方法,识别一组函数的特征,提供一种方法,观察方差的结构,比直接看方差协方差函数更有信息,直观的,变化的模式,用一组曲线,加权函数,最大化函数型主成分得分的方差,
Each functional principal component r(t) is a function describing a particular pattern of behavior over the entire time interval. A high positive/negative score on a particular FPC indicates that that gene is exhibiting the behavioral pattern represented by that component. FPCA has been used extensively when ana-lyzing time-course microarray data. It has been applied when clustering expression profiles as discussed in Section 5.1. In addition, since the FPCs form a set of or-
thonormal basis functions, some authors have used the FPCs to approximate the data (and/or covariate functions) such that e.g. when using the functional principal components approach to estimate the regression functions as shown in Sections 4 and 5.3.
关键词:描述在整个时间区间上的特定行为模式,高的正的得分反映了基因展示了相应得分的行为模式;在分析时间过程微阵列数据很常见。比如在对表达谱数据进行聚类分析时常用;正交基函数,近似数据,协变量函数,估计回归函数
Functional Regression Analysis
Functional linear models attempt to express one dependent variable as a linear combination of other features or measurements. A model can be functional in one of two instances; the dependent variable is a function or one or more independent variables are functions. Coeffcient vectors (as given in standard multivariate regression problems) become coecient functions (t) and a key theme in functional regression analysis is estimating (t) to ensure the results are interpretable. For example, say we have a scalar response variable yi and a predictor function xi(t) then we can write 公式(略)where is the intercept term and (t) is the regression coecient function. When yi is binary, this reduces to a functional logistic regression model. Other functional
regression models include the varying coecient model
关键词:函数线性模型,因变量,其中的一种情况:因变量是函数,自变量为函数;系数向量,标准多元分析,可理解的,标量因变量,函数自变量,二元的,变系数模型,
where the response is now a function and the predictor is a continuous variable whose relationship with yi(t) changes over time; the concurrent functional mode where both the response and predictor are functions but the value of yi(t) depends only on the current value of xi(t) and the non-concurrent model where the value of yi(t) is influenced by xi(t) over all t. In each case there is an issue with under-determination, i.e. there are a finite number of observations to determine
the infinite-dimensional (t). This results in an infinite number of possible solutions for (t). There are three main ways to overcome this problem. We present the implementation details for the simplest case when the predictor(s) are functions and the response is a scalar. These results can be generalized to the case when both the predictor(s) and the response are functions (see Ramsay and Silverman, 2005,Ramsay et al., 2009 for implementation details).
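Keywords: functional response; continuous predictor; concurrent functional model; non-concurrent model; in each case; finite number of observations; infinite-dimensional; infinitely many solutions; three methods; implementation for the simplest case; results can be generalized.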
The first method assumes that both the predictor xi(t) and the regression function β(t) can be represented using a finite number of basis functions, Kz and Kβ respectively, such that (formulas omitted)
(we call this the basis function approach). However, if Kβ is too large, the total number of basis functions may still exceed the number of observations available, while if Kβ is too small, the resulting estimate of β(t) may miss important features in the data. A second method, termed here the roughness penalty approach, can therefore be employed; it estimates β(t) by minimizing the penalized sum of squares
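Keywords: first method; representation by a finite number of basis functions; basis function approach; K too large: the number of basis functions exceeds the number of available observations; K too small: the estimate may miss important features; second method; roughness penalty approach; estimation with a roughness penalty; minimizing the penalized sum of squares.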
where PEN[β(t)] is a penalty suitable for the problem under consideration (e.g. penalizing the second derivative as in (4)). This approach allows more direct control over smoothing, which reduces the chance that important features are missed by using too few basis functions. The third method regresses y on the first R principal component scores for the functional covariate. This involves expressing the functional covariates as (formula omitted), where x̄(t) denotes the mean curve and f_ir denotes the score of curve i on the rth component ξr(t). yi can then be represented by the
model (formula omitted), and the coefficient function β(t) can be reconstructed as β(t) = Σr βr ξr(t).
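Keywords: penalty; suitable for; penalizing the second derivative; direct control over smoothing; reduces the chance that important features are missed; too few basis functions;
third method; principal component scores; functional covariate; mean curve; scores.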
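The second and third methods can be sketched numerically as follows. This is only an illustration under stated assumptions, not the implementation used in the paper or in the fda package: the curves are assumed to be already smoothed and evaluated on a common, equally spaced grid, integrals are replaced by Riemann sums, β(t) is estimated directly on the grid rather than through a basis expansion, and the second-derivative penalty is approximated by second differences. The function names, the smoothing parameter lam and the simulated data are all hypothetical.

```python
import numpy as np

def fit_roughness_penalty(X, y, t, lam):
    """Second method: penalised least squares for beta(t) on an equally spaced grid t."""
    n, m = X.shape
    dt = t[1] - t[0]
    # Riemann sum: the integral of x_i(t) * beta(t) dt is approximated by dt * X @ beta
    Z = np.hstack([np.ones((n, 1)), X * dt])
    # Second differences approximate beta''(t); the intercept is left unpenalised
    D2 = np.diff(np.eye(m), n=2, axis=0) / dt**2
    P = np.zeros((m + 1, m + 1))
    P[1:, 1:] = dt * D2.T @ D2
    coef = np.linalg.solve(Z.T @ Z + lam * P, Z.T @ y)
    return coef[0], coef[1:]          # intercept, beta(t) on the grid

def fit_fpc_regression(X, y, t, R):
    """Third method: regress y on the first R FPC scores, then rebuild beta(t)."""
    dt = t[1] - t[0]
    Xc = X - X.mean(axis=0)           # subtract the mean curve
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    xi = Vt[:R] / np.sqrt(dt)         # eigenfunctions scaled so that sum(xi_r**2) * dt = 1
    scores = Xc @ xi.T * dt           # f_ir: integral of (x_i - mean) * xi_r
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta_t = coef[1:] @ xi            # beta(t) = sum_r beta_r * xi_r(t)
    return coef[0], beta_t, scores

# Simulated stand-in data (assumption): 40 curves on 50 grid points
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
X = rng.normal(size=(40, 50)).cumsum(axis=1) / 7
y = 1.0 + (X * np.sin(2 * np.pi * t)).sum(axis=1) * (t[1] - t[0]) + 0.01 * rng.normal(size=40)

a_pen, b_pen = fit_roughness_penalty(X, y, t, lam=1e-4)
a_fpc, b_fpc, f = fit_fpc_regression(X, y, t, R=4)
```

Larger values of lam give a smoother estimate of β(t), while R plays the role that the number of basis functions plays in the first method.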
Müller and Yao (2008) have extended this approach to functional additive models, where the linear relationship between yi and the FPC scores shown in (19) is replaced by an additive relationship (formula omitted), where gr(·) is an arbitrary functional relationship. The functions gr(·) are estimated by fitting a local linear regression to the data (f_ir, yi) such that (formula omitted) is minimized with respect to β0 and β1, where hr is the bandwidth and K1(·) is a kernel function. This results in more flexible models and allows direct examination of the role of each eigenfunction in predicting the response.
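Keywords: functional additive model; arbitrary functional relationship; local linear regression; minimization; bandwidth; kernel function; more flexible models; direct examination; eigenfunctions.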
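The local linear step can be illustrated with a small sketch. It assumes a Gaussian kernel and a centred response, neither of which is specified in these notes, and the function names are hypothetical. For a target score value f0 it minimizes the kernel-weighted sum of squares of y_i − β0 − β1(f_i − f0) and returns β0 as the estimate of g(f0).

```python
import numpy as np

def local_linear(f_obs, y, f0, h):
    """Estimate g(f0) by kernel-weighted least squares of y on (f_obs - f0)."""
    w = np.exp(-0.5 * ((f_obs - f0) / h) ** 2)        # Gaussian kernel weights (assumption)
    D = np.column_stack([np.ones_like(f_obs), f_obs - f0])
    b0, b1 = np.linalg.solve(D.T @ (D * w[:, None]), D.T @ (w * y))
    return b0                                          # local intercept = fitted g(f0)

def fit_additive_component(f_obs, y, grid, h):
    """Evaluate the estimate of one additive component g_r on a grid of score values."""
    yc = y - y.mean()                                  # centre the response (assumption)
    return np.array([local_linear(f_obs, yc, f0, h) for f0 in grid])
```

Plotting the output of fit_additive_component against the grid for each component r is one way to inspect how that eigenfunction contributes to the response.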
Using functional linear models overcomes problems associated with multivariate regression analysis. These include observations measured at different time points, the high correlation between observations on the same gene, difficulties due to the high dimensionality of both the response and covariate functions (where the number of observations for each gene may exceed the sample size), the need to use multiple testing procedures, and the need to incorporate the smoothness of the underlying expression profiles. Using the FDA approach to analyze data also
has the advantage of providing derivative information, which greatly extends the power of FDA over multivariate methods. Functional linear models can be used to examine directly relationships between derivatives that could otherwise only be studied indirectly. The use of functional linear models in the study of relationships between derivatives is discussed in Section 6.2. As stated in that section, derivative information has not yet been used extensively when analyzing expression profiles. However, we believe that since gene expression is part of a
biological system, modeling such relationships may prove insightful for these types
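Keywords: observations at different time points; high correlation between observations on the same gene; high dimensionality; multiple testing procedures; incorporating smoothness; providing derivative information; power; studying relationships between derivatives.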
There are many clustering algorithms available to cluster gene expression data, e.g. k-means, hierarchical clustering, self-organizing maps, fuzzy clustering, Bayesian clustering, multivariate Gaussian mixture models, etc. However, these multivariate methods have their limitations. Some do not account for correlation between time points or assume the correlation has some specified structure (e.g. autoregressive) that may not be appropriate for microarray data. Others require uniform sampling points for all genes or fail to produce clusters when the number of time points is large (see Wang, Neill, and Miller, 2008 for a full discussion). FDA techniques circumvent many of these limitations, and much of the work to date using FDA to analyze time-course gene expression data has focussed on cluster analysis. Some early examples include papers by Bar-Joseph, Gerber, Gifford, Jaakkola, and Simon (2002) and Luan and Li (2003), who simultaneously develop a model which represents the mean curve in each cluster, c(t), using a linear combination of K cubic spline or B-spline basis functions respectively (as in (3)) and cluster using the EM algorithm.
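Keywords: clustering gene expression data; multivariate Gaussian mixture models; no account taken of correlation between time points; assumed correlation structures that may not suit microarray data; uniform sampling points; failure to cluster when the number of time points is large; modeling the mean curve of each cluster; clustering with the EM algorithm.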
As stated in Section 2, it is difficult to choose the optimal value for K, and choosing different values can alter the results. To overcome this problem, Ma, Castillo-Davis, Zhong, and Liu (2006) model c(t) using smoothing splines, incorporating a penalty into the optimization criterion as shown in (4), and cluster the expression profiles using the rejection-controlled EM (RCEM) algorithm.
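Keywords: choosing the optimal K; different values change the results; smoothing splines; incorporating a penalty term; rejection-controlled EM (RCEM) algorithm.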
Ma and Zhong (2008) extend this to incorporate additional covariate effects into the clustering algorithm. Wang et al. (2008) propose an agglomerative clustering algorithm for functional data based on a new similarity measure and compare the results with many other clustering approaches such as k-means, self-organizing maps, smoothing spline-based clustering using ssclust (see Ma et al., 2006), Gaussian finite mixture model-based clustering using mclust (see Fraley and Raftery, 2002; Fraley and Raftery, 2006), etc., available in the R statistical programming environment.
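Keywords: incorporating additional covariate effects into the clustering algorithm; agglomerative clustering algorithm; a new similarity measure; comparison with the results of other clustering algorithms.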
Other authors have used functional principal components analysis to cluster time-course expression data. For example, Song, Lee, Morris, and Kang (2007) smooth the data, calculate the FPCs as outlined in Ramsay and Silverman (2005), Chapter 8, and cluster the vector of FPC scores using Gaussian finite mixture models and the mclust algorithm. An alternative approach is to cluster expression profiles based on the derivative of the curves. As stated in Section 4, use of the derivative has received little attention in the bioinformatics literature to date. The derivative contains
information about the shape (change pattern) of the expression profiles and, since gene expression can be considered part of a biological system, it could be argued that using the derivative is sensible from a biological perspective. One way of clustering based on the derivative involves smoothing the raw expression data, calculating the first derivative of the expression profiles, determining the FPCs of the derivative and clustering the resulting scores using an appropriate clustering algorithm.
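Keywords: functional principal components; smoothing; vector of principal component scores; mclust algorithm; clustering expression profiles using the derivatives of the curves; smoothing the raw expression data; calculating the first derivative; obtaining the principal component scores; clustering the scores.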
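The derivative-based pipeline described above (smooth, differentiate, compute FPC scores of the derivative, cluster the scores) can be sketched as follows. Several choices here are assumptions made only to keep the example short: a small Fourier basis is used for smoothing because its derivative is available in closed form, the FPCs are obtained from an SVD of the discretized derivative curves, and a plain k-means with a deterministic farthest-point start stands in for mclust. All function names and default values are hypothetical.

```python
import numpy as np

def fourier_basis(t, n_harm, period, deriv=False):
    """Columns 1, sin(kwt), cos(kwt) for k = 1..n_harm, or their first derivatives."""
    w = 2 * np.pi / period
    cols = [np.zeros_like(t) if deriv else np.ones_like(t)]
    for k in range(1, n_harm + 1):
        if deriv:
            cols += [k * w * np.cos(k * w * t), -k * w * np.sin(k * w * t)]
        else:
            cols += [np.sin(k * w * t), np.cos(k * w * t)]
    return np.column_stack(cols)

def derivative_fpc_scores(Y, t, n_harm=2, R=2):
    """Smooth each row of Y, differentiate the fit, return the first R FPC scores."""
    period = t[-1] - t[0]
    B = fourier_basis(t, n_harm, period)
    coef, *_ = np.linalg.lstsq(B, Y.T, rcond=None)           # least squares smoothing
    D = (fourier_basis(t, n_harm, period, deriv=True) @ coef).T  # first derivatives on the grid
    Dc = D - D.mean(axis=0)
    _, _, Vt = np.linalg.svd(Dc, full_matrices=False)        # discretized FPCs of the derivative
    return Dc @ Vt[:R].T

def kmeans(Z, k, n_iter=50):
    """Plain k-means with a farthest-point start (a stand-in for mclust)."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dist, axis=1)
        new = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

Clustering the scores of the original curves instead only requires replacing the derivative basis by the basis itself, which makes the two sets of clusters straightforward to compare.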
Déjean, Martin, Baccini, and Besse (2007) calculate the multivariate principal components of the matrix obtained by discretizing the first derivative of the smoothed expression profiles. The principal component scores are then clustered using a combination of k-means and hierarchical clustering algorithms. An alternative procedure is suggested by Kim and Kim (2008). These authors cluster change patterns of expression profiles by smoothing the data using a Fourier expansion and calculating the derivatives of the resulting curves using the Fourier coefficients.
The Fourier coefficients of the derivative are then clustered using k-means and model-based clustering and the results compared. The authors report that this method identifies clusters of co-expressed genes not identified by k-means clustering. It could be expected that the derivative contains different information from the functions and may therefore result in different clustering outcomes. We carried out an initial investigation (results shown in Section 7.1) to examine any differences between clustering results obtained using the FPC scores of the functions
versus clustering results obtained using the FPC scores of the derivatives.
Keywords: multivariate principal components; discretizing the first derivative.
The LHS of Figure 2 displays the scree plot of the eigenvalues of the FPCs estimated using the original expression curves.
This plot indicates that the first 3 FPCs should be retained. These FPCs are shown on the RHS of Figure 2 and account for over 94% of the variation in the data. The first 4 FPCs estimated using the first derivative of the expression profiles, displayed in Figure 3, account for 84% of the variation in the data.
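Keywords: scree plot (plot of the eigenvalues from the principal components analysis); original expression curves; indicates retaining the first three principal components; the components are shown in the right-hand panel; proportion of variation in the data explained; the first 4 principal components estimated from the first derivative of the expression profiles are shown in Figure 3; proportion of variation explained.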
Though the scree plot indicates that FPC 5 could also be included in the analysis, the plot of FPC 5 shows that this component contains many oscillations and is therefore most likely attributable to noise. The vectors of scores for these 3 (4) FPCs were then supplied to mclust. The resulting clusters are shown in Figures 4 and 6. The first derivatives of the members of each cluster are given in Figures 5 and 7.
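Keywords: the scree plot suggests the component could be included, but the plot of the component shows too many oscillations, so it more likely reflects noise; the vectors of principal component scores are supplied to mclust.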
Table 1 displays a comparison of the classifications for each method. Cluster 1 obtained using the original expression profiles mainly consists of a combination of observations from Clusters 1 and 5 obtained using the first derivative of the expression profiles. The additional separation of these observations into two clusters achieved using the derivatives appears to differentiate between genes whose expression levels peak almost immediately (typically before time 2) before decreasing for the remainder of the cycle (Cluster 1) and genes with expression levels that exhibit
a more gradual increase (to a peak value at approximately time 5) before decreasing for the remainder of the cycle (Cluster 5). Cluster 3 for both methods identifies genes whose expression levels change little throughout the cycle, though using the derivatives has identified 35 more genes belonging to this cluster. Clusters 2 and 4 determined using the functions contain 232 genes, while Clusters 2 and 4 determined using the derivatives contain 190 genes. Cluster 4 is the main difference between the two methods. Using the derivative has identified genes exhibiting a rapid decrease in expression levels in the initial stage of the cycle before gradually increasing for the remainder of the cycle. Such a cluster is not evident when using
the original expression profiles.
Keywords: comparison of the clustering results from the two methods.
The following describes an application of functional regression analysis to a subset
of the Drosophila melanogaster dataset analyzed by Arbeitman et al. (2002). The
authors identified a group of “strictly maternal” genes or genes that were expressed
in the embryo phase and then re-expressed in the pupal-adult phase of female flies.
Therefore we wish to model the relationship between expression levels in the pupal-
adult phase, yi(t), and expression levels in the embryo phase, xi(t). In this instance
both the response and the predictor variables are functions, as shown in Figure 8.
Keywords: “strictly maternal” genes; embryo phase; expression levels.
The previous section examines the relationship between expression levels in the
embryo stage and pupal-adult stage of development of female flies. However, the
expression profiles of male flies were also recorded in the pupal-adult stage by
Arbeitman et al. (2002). Figure 10 suggests that female flies (black curves) have
higher expression levels of strictly maternal genes than male flies (red curves) in
the pupal-adult phase.
Keywords: gene expression levels; comparison of male and female flies.
Critical values for T(t) are determined using a permutation test by randomly shuffling
the male (M) and female (F) labels on the curves and calculating the maximum
of T(t) using these new labels.
Keywords: critical values; permutation test; random shuffling.
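The permutation procedure can be sketched as below. The definition of T(t) is not reproduced in these notes, so the sketch assumes the absolute two-sample t-statistic computed pointwise on a common grid; the function names, the number of permutations and the 0.95 quantile are likewise illustrative assumptions.

```python
import numpy as np

def t_stat(A, B):
    """Pointwise absolute two-sample t-statistic; rows are curves, columns are time points."""
    num = np.abs(A.mean(axis=0) - B.mean(axis=0))
    den = np.sqrt(A.var(axis=0, ddof=1) / len(A) + B.var(axis=0, ddof=1) / len(B))
    return num / den

def max_t_permutation(A, B, n_perm=500, q=0.95, seed=0):
    """Shuffle group labels, record the maximum of T(t), and derive a critical value."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([A, B])
    n_a = len(A)
    observed = t_stat(A, B)
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(pooled))           # random relabelling of the curves
        max_null[b] = t_stat(pooled[idx[:n_a]], pooled[idx[n_a:]]).max()
    crit = np.quantile(max_null, q)                  # critical value for max over t of T(t)
    p_value = (1 + np.sum(max_null >= observed.max())) / (1 + n_perm)
    return observed, crit, p_value
```

Time points where the observed curve exceeds the critical value are those at which the two groups would be declared different; using the maximum over t controls the error rate across the whole time interval rather than pointwise.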
Figure 11 displays the observed test statistic T(t) and corresponding critical values. It indicates that there is no significant difference between the expression levels of male and female flies until just after the onset of adulthood, when expression levels of these genes in females are significantly higher than in males.
Keywords: test statistic; corresponding critical values; significant difference; until; onset of adulthood.
A major advantage of FDA over multivariate techniques is that it facilitates the
use of derivative information in the curves. This may be particularly useful in time-
course microarray data analyses since gene expression is part of a biological system
and such systems are typically modeled using differential equations.
Keywords: multivariate analysis; derivative information; time-course microarray data analysis; differential equations.
The first term in this model is proportional to the speed at which the system moves, while the second term represents position-dependent forces.
Keywords: speed at which the system moves; position-dependent forces.
the system tends to exhibit some oscillation that gradually disappears.
Keywords: initially negative, indicating the initial increase in gene expression.
Figure 13 displays β1(t) and d(t) for the pupal-adult data. d(t) is initially negative, which corresponds to an initial increase in energy marking the beginning of gene expression, followed by a period when d(t) = 0 and the system is in equilibrium. During this time, expression levels are relatively stable. At approximately time 15 (i.e. the onset of adulthood) d(t) quickly becomes negative. At this point the system is exhibiting some oscillatory behavior corresponding to an increase in energy in the system prior to the large jump in expression levels between times 16 and 21. However, after this initial burst the change in expression levels is quite stable. From time 21 onwards d(t) is positive and β1(t) is negative, implying that after the sharp increase in expression levels between times 16 and 21, the system contains a very large amount of energy and is behaving like a rapidly oscillating spring. From this simple example it can be seen that PDA gives real insight into the dynamics of expression profiles in this group of genes. We believe that PDA may be particularly useful when examining differences between the expression levels of genes across two groups, e.g. across two treatment conditions, where information about how the behavior of the derivatives differs across groups may give additional information regarding why expression levels differ. All of the above analysis was carried out using the fda package in R.