This scheme is developed carefully into a package containing binary and source code, which is easy to reconstruct in all platform. This package is available on our website.
4.1 Database
All of the data retrieved from public domain, are reconstructed and organized in sever and manage by MySQL database system.
4.1.1 GEO dataset
We fetched the whole dataset from GEO, including 1790 human dataset and 1623 mouse dataset and 5519 dataset for all species. Our GEO database schema is described in figure 4.1.
Figure 4.1 Database scheme of mirror GEO.
The program assesses each expression profile by querying mysql database with statement that specified its GDS ID storing in attribute sample_id of data_set. By this schema, FNC can easily scan through every entry in GEO dataset. Each expression profile are formatted and standardized during the process of FNC and Time warping.
4.1.2 GO dataset
GO had already provided organized database dumping file for users to download and import to their servers. In our scheme, we focused on the GO term at the level of four including cell-cycle category, since cell-cycle category is also defined in MIPS, which let our prediction comparable to other approaches.
4.1.3 Homologene dataset
Homologene database curated mapping information between genes of different species. The linkage information of human and mouse is presented in the form of text file, which can be easily access by file I/O operation in computer language.
4.2 Implementation of time warping
4.2.1 DTW core
The core program of DTW is adapted from BTW (Boltzmann Time Warping) web server. We take off the Boltzmann pair probabilities estimation sub-routine and maintaining group and one-on-one DTW algorithm in it. DTW is implement like sequence alignment considering weight function according to time periods. We also implement the semi-global alignment algorithm into the core, which just is a modification of normal DTW with different initiation condition. See Figure 4.2.
Figure 4.2 Demonstration of semi-global DTW.
However, because of the dynamic programming nature of the algorithm, the distance function should be able to be calculated in this means (dynamic
programming). We will discuss this in next section.
4.2.2 Group-Euclidean distance function
Considering original formula of Euclidean distance function, and there are two vectors, A (a1, a2, a3, a4) and B (b1, b2, b3, b4), the distance between A and B
( . However, this sort of formula is not suit for dynamic programming
nature of DTW algorithm. The formula should be modified as
∑
−i
i
i a
b )2 ( . It is
obvious that the distance is no longer Euclidean distance. However, in single-gene warping, this is not a problem, if we do not sparing the distance when calculating single gene distance, so
∑
−i
( . This problem can be more serious
when calculating group-Euclidean distance. That is, the distance function design for DTW has its limitation; we will discuss this in chapter discussion.
4.3 Implementation of SVD
4.3.1 SVD core
The function doing SVD is based on a routine by Forsythe et al., which is in turn based on the original routine of Golub and Reinsch[26], found, in various forms, in Wilkinson and Reinsch, in LINPACK, and elsewhere. These references include extensive discussion of the algorithm used. In our implementation, the parameter of the function is adapted to our specific usage.
SVD is a series of matrix operations. Here, we take a low-dimension example as
our demonstration materials. Consider matrix A =
SVD is computing the eigenvalue of matrix ATA. So we implement three matrix operations: transpose, multiplex, and Gauss-Jordan elimination (containing elementary row operations) to complete this step. In this example, we got three eigenvalue of ATA, λ1=3, λ2 = 3, andλ3 = 3. Therefore, matrix A has singular
. With the information of eigenvalue, we can compute
corresponding eigenvectors V(λ) which are the basis of ker(ATA-λ). In our example,
V =
. Next, the most complicate part is computing matrix U, since ui =
1
Because we only have three singular values, but U has four dimension in this case, so we need to compute an additional orthonormal basis. In our implementation, a simple Gauss-Jordan elimination can help to generate a solution; remember that this basis is not unique. By these operations, SVD can be done in polynomial time.
An expression profile which had been normalized and standardized is taken as input,
and the diagonal matrix S is the only thing we concern in following step doing noise filtering.
4.3.2 Noise reduction
The diagonal values of S make up the singular value spectrum. The height of any one singular value is indicative of its importance in explaining the data. More specifically, the square of each singular value is proportional to the variance explained by each singular vector. The relative variances are often plotted. The approach we used is proposed by Everitt and Dunn, the approach based on comparing the relative variance
-1
The S matrix outputted by our SVD function, was calculated the relative variance and filter out any eigengene with relative variance lower than 0.7/n.
4.4 Implementation of FNC
All of the programs are developed under Linux system using C++ with STD library.
The program automatically fetched human expression profile one at a time, and FNC on it. The prediction results were stored and summed up, after thousands of prediction on the same genes but different samples, we can get a more reliable result of the function predication.
4.4.1 Mining by unsupervised approach
The algorithm of this part is described in previous chapter. We maintained a class to
store entire expression profile, which includes also kind of operations, such as standardization and normalization. We leveraged the function in Cluster3[27] to do the hierarchical clustering according to known genes’ GO annotations and prune the edges between clusters, of which have relatively low correlations to other clusters just like the algorithm says.
4.4.2 Predicting category by classification methods
When each GDS profile has been processed by the approach described in last section, the program will maintain a cluster profile to represent the average expression pattern of each cluster of each GO term.
Expression profiles of unknown-function genes are read and calculate the Pearson Correlation with every average expression pattern of each cluster of all functions.
Correlation coefficient higher than λ will be recorded and summed up after all 1790 dataset were predicted by FNC. The top 3 predicted functions of each gene will be regarded as the final prediction of the FNC.
4.5 Statistical analysis
4.5.1 Hypergeometric testing
The routine calculating p-value of hyergeometric testing is implemented based on binomial coefficient subroutine and garmma function. Since the factorial function used in binomial coefficient subroutine is just a gamma function but offset by one. By gamma function, the time complexity computing binomial coefficient is reduced to O(n).
∑
−In hypergeometric formula shown above, N is now the total number of genes with known and predicted function, M is the number of genes including predicted genes which are annotated to interested GO term. n is the number of genes within the group.
By doing so, the p-value can demonstrate the biological significance of this GO term in the group.
4.6 Implementation of visualization
4.6.1 GD library[28]
GD library is a C language library providing convenient function generating graph according to common file formatted, like JPEG and GIF. By the help of GD library, our package can directly generate graphs to demonstrate simple clustering and warping result.
4.6.2 Genesis and graphwarp
Genesis and graphwarp provide even more information about the clustering, and support more analysis. Our package generate corresponding file format for these two programs. The result is shown in Figure 4.3.
Figure 4.3 Visualization of (a)Genesis (b) graphwarp.