Human Genome Database - Data Complex Simulation


4.3 Human Genome Database

The human genome database is collected from NCBI Hs37.2, and duplicate genome sequences are removed before analysis. After removing the duplicates, 23629 different gene sequences remain. The edit distance between each pair of genes is used as their distance.
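As an illustration of the preprocessing described above, the following minimal sketch removes duplicate sequences and fills a pairwise edit-distance matrix. It assumes the unit-cost Levenshtein distance (insertion, deletion, and substitution each cost 1); the text only says "edit distance", so the exact cost model is an assumption, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (assumed cost model), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def pairwise_edit_distances(sequences):
    """Drop duplicate sequences (keeping first occurrences), then build D."""
    unique = list(dict.fromkeys(sequences))
    n = len(unique)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = edit_distance(unique[i], unique[j])
    return unique, D
```

A pure-Python loop like this is only practical for small inputs; for tens of thousands of long gene sequences a compiled implementation would be needed.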

Figure 4-8: Selected embedding results for stock dataset from other manifold learning methods. Panels: (a) Hessian LLE with k = 17; (b) Laplacian eigenmap with ε = 2.5.

4.3.1 Distance Metric

For the human genome database, only the pairwise distances between genes are available; no effective data dimensions can be extracted. Manifold learning methods that require weight computations on data dimensions, such as LLE, Hessian LLE, and LTSA, therefore cannot be applied directly. According to [14], the weight matrix can instead be derived from the distance metric by computing a local Gram matrix G_ij around a specific data point. The suffix 0 denotes the specific data point that is to be represented by its nearest neighbors, and k is the number of neighbors of that point.

D denotes the original distance matrix.

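\[
K_{ij} = -\frac{1}{2}\left( D_{ij}^2 + \frac{1}{k^2}\sum_{l,m} D_{lm}^2 - \frac{1}{k}\sum_{l}\left( D_{il}^2 + D_{lj}^2 \right) \right), \qquad 0 \le i, j, l, m \le k,
\]

\[
G_{ij} = K_{00} - K_{i0} - K_{0j} + K_{ij}.
\]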
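The local Gram matrix can be sketched in NumPy as follows. The sketch assumes the local distance matrix is ordered so that index 0 is the point to be reconstructed and indices 1..k are its neighbors. The helper `lle_weights` and its ridge regularization are assumptions borrowed from the usual LLE recipe, not something the text specifies.

```python
import numpy as np


def local_gram(D_local):
    """Local Gram matrix G (k x k) from a (k+1) x (k+1) distance matrix.

    Index 0 is the point to be represented; indices 1..k are its neighbors.
    """
    D2 = np.asarray(D_local, dtype=float) ** 2
    k = D2.shape[0] - 1
    row = D2.sum(axis=1) / k          # (1/k) * sum_l D_il^2 (D is symmetric)
    total = D2.sum() / k ** 2         # (1/k^2) * sum_{l,m} D_lm^2
    K = -0.5 * (D2 + total - row[:, None] - row[None, :])
    # G_ij = K_00 - K_i0 - K_0j + K_ij, evaluated for all i, j by broadcasting
    G = K[0, 0] - K[:, [0]] - K[[0], :] + K
    return G[1:, 1:]                  # keep neighbor indices 1..k only


def lle_weights(G, reg=1e-3):
    """Reconstruction weights from G (assumed LLE-style ridge regularization)."""
    k = G.shape[0]
    G_reg = G + reg * np.trace(G) / k * np.eye(k)
    w = np.linalg.solve(G_reg, np.ones(k))
    return w / w.sum()                # weights sum to one
```

The centering terms cancel in the combination K_00 - K_i0 - K_0j + K_ij, so G_ij reduces to (D_i0^2 + D_0j^2 - D_ij^2) / 2. For Euclidean data this equals the dot product (x_i - x_0) · (x_j - x_0), which is the quantity LLE normally computes from coordinates.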

For Hessian LLE and LTSA, the computation requires the local tangent space, which is normally obtained from a singular value decomposition of the differences in data dimensions.

The distance metric therefore still has to be converted to dot-product form as a centered Gram matrix. Solving the eigen-decomposition of the centered Gram matrix K is equivalent to the original singular value decomposition and yields the local tangent vectors required for the subsequent matrix computations.

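\[
K_{ij} = -\frac{1}{2}\left( D_{ij}^2 + \frac{1}{k^2}\sum_{l,m} D_{lm}^2 - \frac{1}{k}\sum_{l}\left( D_{il}^2 + D_{lj}^2 \right) \right), \qquad 1 \le i, j, l, m \le k.
\]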
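A minimal NumPy sketch of this conversion follows, assuming the k neighbors of a point are described only by their k x k distance matrix. The function names and the target dimension `d` are illustrative, and negative eigenvalues (possible for non-Euclidean distances) are clipped to zero as a simplifying assumption.

```python
import numpy as np


def centered_gram(D):
    """Centered Gram matrix K from a k x k distance matrix (double centering)."""
    D2 = np.asarray(D, dtype=float) ** 2
    k = D2.shape[0]
    row = D2.sum(axis=1, keepdims=True) / k   # (1/k) * sum_l D_il^2
    col = D2.sum(axis=0, keepdims=True) / k   # (1/k) * sum_l D_lj^2
    total = D2.sum() / k ** 2                 # (1/k^2) * sum_{l,m} D_lm^2
    return -0.5 * (D2 + total - row - col)


def local_tangent_coordinates(D, d):
    """Top-d eigenpairs of K and the implied local coordinates."""
    K = centered_gram(D)
    vals, vecs = np.linalg.eigh(K)            # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:d]
    vals, vecs = vals[top], vecs[:, top]
    # Clip negative eigenvalues (assumption; they occur for non-Euclidean D)
    coords = vecs * np.sqrt(np.maximum(vals, 0.0))
    return vals, vecs, coords
```

For Euclidean inputs, the eigenvalues of K are the squared singular values of the centered coordinate matrix, which is why the eigen-decomposition can stand in for the singular value decomposition.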

4.3.2 Parameter Settings

Since the distances range from less than 100 to several tens of thousands, the parameter t of the diffusion map is set to t = max_{i,j}(D_ij) to obtain reasonable results and avoid numerical precision problems. Because the distances between gene pairs vary so widely, the ε-distance approach yields many single-point components for smaller ε, or accepts almost all connections within a few large components for larger ε. This dataset is therefore not suitable for a pure ε-distance approach, and currently only k-nn is analyzed. The range of k is 4 ≤ k ≤ 50.
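The difference between the two neighborhood rules can be illustrated on a toy distance matrix with a wide range of scales. The sketch below builds both graphs from a precomputed distance matrix and reports connected-component sizes; the function names are hypothetical, and symmetrizing the k-nn graph with a logical OR is an assumption.

```python
import numpy as np


def epsilon_graph(D, eps):
    """Adjacency of the eps-distance graph: connect i and j if D_ij <= eps."""
    A = np.asarray(D, dtype=float) <= eps
    np.fill_diagonal(A, False)
    return A


def knn_graph(D, k):
    """Adjacency of the k-nn graph, symmetrized with OR (assumption)."""
    D = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(D, np.inf)               # a point is not its own neighbor
    idx = np.argsort(D, axis=1)[:, :k]
    A = np.zeros(D.shape, dtype=bool)
    rows = np.repeat(np.arange(D.shape[0]), k)
    A[rows, idx.ravel()] = True
    return A | A.T


def component_sizes(A):
    """Sizes of connected components, largest first (depth-first search)."""
    n = A.shape[0]
    seen = np.zeros(n, dtype=bool)
    sizes = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in np.flatnonzero(A[u] & ~seen):
                seen[v] = True
                stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```

On data whose distances span several orders of magnitude, a small ε leaves distant points isolated while a large ε merges the dense region into one component. The k-nn rule gives every point at least k neighbors regardless of scale.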

4.3.3 Results

Since the edit distances between gene pairs do not fit Euclidean data dimensions, the assumption behind the Gram matrix computation may not hold. The corresponding eigenvalues and embedding results for LLE, Hessian LLE, and LTSA should therefore be regarded only as reference results.
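A small worked example shows how edit distances can break the Euclidean assumption. The four toy strings below are hypothetical and not taken from the dataset; their unit-cost Levenshtein distances are hard-coded. A negative eigenvalue of the centered Gram matrix means that no Euclidean point configuration reproduces these distances.

```python
import numpy as np

# Unit-cost Levenshtein distances among the toy strings
# "AC", "A", "ACG", "TAC" (hypothetical, not from the gene dataset).
# "AC" is at distance 1 from the other three, which are pairwise at distance 2.
D_TOY = np.array([
    [0, 1, 1, 1],
    [1, 0, 2, 2],
    [1, 2, 0, 2],
    [1, 2, 2, 0],
], dtype=float)


def gram_eigenvalues(D):
    """Eigenvalues (descending) of the centered Gram matrix -0.5 * J D^2 J."""
    D2 = np.asarray(D, dtype=float) ** 2
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    K = -0.5 * J @ D2 @ J
    return np.linalg.eigvalsh(K)[::-1]


eigenvalues = gram_eigenvalues(D_TOY)
is_euclidean = eigenvalues.min() > -1e-9      # small tolerance for round-off
```

Here the smallest eigenvalue is -0.25, so `is_euclidean` is false: three points at mutual distance 2 cannot all lie at distance 1 from a fourth point in any Euclidean space.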

Table 4.11: Eigenvalues from manifold learning on human gene dataset.

(Eigenvalue plots for k-nn neighborhood selection, one per approach: LLE, Isomap, Hessian LLE, Laplacian eigenmap, diffusion map, LTSA.)

The eigenvalues obtained for this dataset with the different manifold learning methods under k-nn neighborhood selection are shown in Table 4.11. For LLE and Hessian LLE, change points cannot be identified because the curvature of the eigenvalue curves is unstable. For Isomap, only limited behavior of the dataset can be discovered compared with the earlier full-range analysis, since the second eigenvalue is still rising at k = 50. For Laplacian eigenmap and diffusion map, the eigenvalue curves behave similarly, and the two methods do generate similar embedding results. For LTSA, if every point where the slope changes is considered a change point, there are too many change points to analyze.
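One simple reading of "points with slope change" is the set of neighborhood sizes k at which the eigenvalue curve switches between rising and falling. The sketch below implements that reading; this definition of a change point and the function name are assumptions, since the text does not give a formal criterion.

```python
import numpy as np


def slope_change_points(ks, values):
    """Return the k values where the slope of the curve changes sign."""
    ks = np.asarray(ks)
    values = np.asarray(values, dtype=float)
    slope_sign = np.sign(np.diff(values))
    # Adjacent slopes with opposite signs mark a turning point between them
    turning = np.flatnonzero(slope_sign[1:] * slope_sign[:-1] < 0) + 1
    return ks[turning].tolist()
```

A noisy curve produces a turning point at almost every k under this rule, which matches the observation that LTSA yields too many change points to analyze.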

Although the analysis of the human genome dataset is not easy, some selected results for specific parameters are shown in Figure 4-9. The point color indicates the number of bases in the individual gene: a blue point is a short gene and a red point is a long gene. For LLE with the modification on distance metrics, the embedding result is not structured in terms of base length, so the gene dataset is not suitable for analysis with LLE. The other manifold learning methods fall into three categories according to their embedding results. Isomap computes the all-pairs shortest paths and generates an embedding result similar to a bowl. Hessian LLE and LTSA are based on the tangent space and produce distant groups as the final embedding results. Laplacian eigenmap and diffusion map directly use the selected distance metric and generate another class of embedding results.

In the human genome dataset, the basis of the manifold learning method strongly affects the final embedding results, since the edit distance violates the assumptions of Euclidean space. The different viewpoints offered by the different bases of manifold learning methods can extract more useful information from the dataset.

(a) LLE with k = 35 (b) Isomap with k = 15 (e) Diffusion map with k = 30 (f) LTSA with k = 20

Figure 4-9: Selected embedding results for stock dataset from all manifold learning methods.
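The all-pairs shortest-path step that Isomap relies on can be sketched directly on a precomputed distance matrix. The sketch assumes a symmetrized k-nn graph, Floyd-Warshall for the geodesic distances, and classical MDS for the embedding; the function names are hypothetical, and clipping negative eigenvalues to zero is a simplifying assumption for non-Euclidean input.

```python
import numpy as np


def geodesic_distances(D, k):
    """All-pairs shortest paths over the symmetrized k-nn graph."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    masked = D.copy()
    np.fill_diagonal(masked, np.inf)          # exclude self from neighbors
    idx = np.argsort(masked, axis=1)[:, :k]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    rows = np.repeat(np.arange(n), k)
    G[rows, idx.ravel()] = D[rows, idx.ravel()]
    G = np.minimum(G, G.T)                    # keep an edge if either end picks it
    for m in range(n):                        # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    if not np.isfinite(G).all():
        raise ValueError("k-nn graph is disconnected; increase k")
    return G


def classical_mds(G, n_components=2):
    """Embed a distance matrix by eigen-decomposition of its centered Gram matrix."""
    n = G.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[top], vecs[:, top]
    Y = vecs * np.sqrt(np.maximum(vals, 0.0))  # clip negatives (assumption)
    return Y, vals
```

The cubic cost of Floyd-Warshall limits this sketch to small subsets; a dataset with tens of thousands of genes would need a sparse shortest-path routine instead.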

Chapter 5 Discussion

This chapter discusses several issues, from implementation details to future directions. The first section describes attempts to improve the performance of the manifold learning methods by introducing the generalized eigen-decomposition in place of the standard eigen-decomposition used by most of the manifold learning methods within the framework. The second section examines the relation between current quality measures for embedding results and the eigenvalues generated by the manifold learning methods. These measures are designed to check, for a specific neighborhood relation, whether the embedding results are good enough, whereas the eigenvalues can carry more information about changes in the final embedding results.

In document: Eigenvalue Analysis of Manifold Learning Applied to Complex Data (pages 66-71)