An SC-MDS basic-concept-based method - Discussion of the Optimal Grouping Decision and the Missing Value Problem in Split-and-Combine Multidimensional Scaling

In the third method, we do not focus on how to permute the objects so that all missing parts are removed from the neighborhood of the main diagonal. Instead, we use the basic concept of SC-MDS and combine only the complete parts of the data set to compute the coordinates of the objects. Consider an extreme example. Suppose there is a distance matrix of N objects whose underlying dimensionality is r. All elements of this distance matrix are missing except those in the i-th through (i+r)-th columns and those in the i-th through (i+r)-th rows. This matrix is shown in Figure 13, where gray represents the missing part and white represents the complete part. As shown in the figure, the distance matrix has information only in the white cross region. In other words, we have complete information about these r+1 objects only. Obviously, we cannot find a permutation of the indices (as in Figure 11) such that every two neighboring groups in the chain partially overlap. In this case, the only way to solve the problem is to align each of the other points individually with the r+1 points, which serve as the center group. The split step is shown on the right side of Figure 13. Apart from the r+1 center points, the remaining N-(r+1) points are split into N-(r+1) groups: each remaining point is allocated to its own group, and each group consists of the r+1 points with complete information plus one remaining point, so each group has size r+2. Therefore, after applying MDS to each group, we can combine the groups in the r-dimensional coordinate space through the r+1 overlapping points.
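The split step of this extreme example can be sketched in a few lines. This is only an illustration: the function name `star_groups` and the 0-based indexing are assumptions made here, not part of the original method description.

```python
def star_groups(N, center):
    """Split step for the extreme example (hypothetical helper).

    Every object outside the center group gets its own group, made of
    the r+1 center objects plus that single object.
    """
    center = list(center)
    rest = [p for p in range(N) if p not in center]
    return [center + [p] for p in rest]

# Example: N = 10 objects, r = 3, complete rows/columns i .. i+r with i = 2.
N, r, i = 10, 3, 2
center = list(range(i, i + r + 1))   # the r+1 objects with complete information
groups = star_groups(N, center)
```

Each group shares exactly the r+1 center objects with every other group, which is what allows all groups to be aligned to one common coordinate system.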

We have two remarks here.

Fig. 13: An extreme example of the missing value problem. The square represents a distance matrix, and information is available only in the white cross region.

1. Do not insist on permuting objects. When the data have no missing values (a complete data set), each pair of neighboring groups in the chain has a different overlapping part (see Figure 11). With missing data, however, we do not have sufficient information to do this. Hence, to make good use of the available information, we allow the same overlapping part to occur repeatedly in different groups.

2. When we know the actual dimension of the samples and the overlapping region is larger than that dimension, random permutation plays no role in improving the accuracy of SC-MDS. For the missing value problem, we should instead focus on finding the largest group whose pairwise distances contain no missing values. Using this group as the center with which the other points are combined, we can then carry out SC-MDS to obtain the coordinates of the other points.

How do we use the concept of SC-MDS to deal with the missing value problem in practice?

• Let the set G = {1, 2, ..., N} record the objects that make up the whole distance matrix.

• First, we find the largest complete data group G₁ = {g₁₁, g₁₂, ..., g_{1k₁}}, where each g₁ᵢ is the index of a sample, such that the distance matrix composed of the objects in G₁ has no missing values.

• In the second step, we find a set G₂ that satisfies length(G₁ ∩ G₂) ≥ r + 1 and length(G₂ ∩ (G \ G₁)) ≥ 1, and such that the distance matrix composed of the objects in G₂ has no missing values. Then, we apply the SC-MDS process to the two groups to find the point configuration.

• The next step is to find a set G₃ that satisfies length(G₃ ∩ (G₁ ∪ G₂)) ≥ r + 1 and length(G₃ ∩ (G \ (G₁ ∪ G₂))) ≥ 1, and such that the distance matrix composed of the objects in G₃ has no missing values. Then, we apply an MDS process to group 3 to find the point configuration and use the combine step to align it with group 1 and group 2.

...

• In the i-th step, we find a set G_i that satisfies length(G_i ∩ (G_1 ∪ ... ∪ G_{i−1})) ≥ r + 1 and length(G_i ∩ (G \ (G_1 ∪ ... ∪ G_{i−1}))) ≥ 1, and such that the distance matrix composed of the objects in G_i has no missing values. Then, we apply the MDS process to group i to find the point configuration and use the combine step to align it with G_1 ∪ ... ∪ G_{i−1}.

Then we obtain the spatial configuration of all objects.
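The group search described in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, indices are 0-based, the "largest complete group" is found with the greedy deletion heuristic (which is not guaranteed to return the true largest group), and each later group adds exactly one new object.

```python
import numpy as np

def largest_complete_group(missing, candidates):
    """Greedy heuristic: drop the object with the most missing values
    inside the group until no missing value is left."""
    group = list(candidates)
    while group:
        counts = missing[np.ix_(group, group)].sum(axis=1)
        if counts.max() == 0:
            break
        group.pop(int(counts.argmax()))
    return group

def build_groups(missing, r):
    """Return G_1, G_2, ... such that every group is complete and shares
    at least r+1 objects with the union of the earlier groups."""
    N = missing.shape[0]
    groups = [largest_complete_group(missing, range(N))]
    covered = set(groups[0])
    while len(covered) < N:
        progress = False
        for p in sorted(set(range(N)) - covered):
            # already placed objects whose distance to p is known
            known = [q for q in sorted(covered) if not missing[p, q]]
            overlap = largest_complete_group(missing, known)
            if len(overlap) >= r + 1:
                groups.append(overlap + [p])
                covered.add(p)
                progress = True
        if not progress:
            raise ValueError("missing values are not spread well enough")
    return groups

# Hypothetical 6-object pattern (True marks a missing distance), r = 1.
pairs = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5)]
missing = np.zeros((6, 6), dtype=bool)
for a, b in pairs:
    missing[a, b] = missing[b, a] = True
groups = build_groups(missing, r=1)
```

Because every group after the first is anchored on objects that already have coordinates, the groups can be passed to the combine step in the order in which they are produced.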

The following simple example illustrates the process more clearly.

Assume there are six objects and their distance matrix is given as follows, where a cross marks a missing value.

To get the largest complete data group, we repeatedly delete the object that has the most cross marks (missing values).

G₁ ={1, 4, 5}

To get G₂, we need length(G₁ ∩ G₂) ≥ r + 1, length(G₂ ∩ (G \ G₁)) ≥ 1, and a distance matrix composed of the objects in G₂ that has no missing values. Assume we only need two overlapping objects, and find the objects that have no missing values with at least two objects in G₁ = {1, 4, 5}.

Choose two overlapping objects which have the fewest missing values.

Delete the object which has the most cross marks.

Check whether the distance between object 2 and object 6 is missing.

G₂ = {1, 5, 2}

To get G₃, we need length(G₃ ∩ (G₁ ∪ G₂)) ≥ r + 1, length(G₃ ∩ (G \ (G₁ ∪ G₂))) ≥ 1, and a distance matrix composed of the objects in G₃ that has no missing values.

Choose two overlapping objects which have the fewest missing values.

G₃ ={1, 5, 6}

Repeat the same process as before.

G₄ ={1, 3, 4}

Apply SC-MDS to G₁, G₂, G₃, G₄ in sequence. The tolerance for missing data depends on the number of overlapping points. We performed a simulation with N = 1000 and r = 3. Based on the simulation results, the tolerable ratio of missing values is around 0.3.
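Applying MDS to each group and combining the groups in sequence can be sketched as below. This is a minimal sketch under assumptions: the combine step is written here as an orthogonal Procrustes alignment (rotation or reflection plus translation) on the overlapping points, the function names are hypothetical, and the synthetic seven-point data set with 0-based group indices is for illustration only.

```python
import numpy as np

def classical_mds(D, r):
    """Classical MDS: r-dimensional coordinates from a complete distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    w, V = w[-r:], V[:, -r:]                 # keep the r largest
    return V * np.sqrt(np.clip(w, 0.0, None))

def fit_rigid(src, dst):
    """Rotation/reflection plus translation that maps src onto dst."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    Q = U @ Vt                               # orthogonal Procrustes solution
    return lambda P: (P - mu_s) @ Q + mu_d

def combine_in_sequence(D, groups, r):
    """Place the groups one by one; NaN in D marks a missing distance."""
    X = np.full((D.shape[0], r), np.nan)
    for k, g in enumerate(groups):
        g = np.asarray(g)
        sub = D[np.ix_(g, g)]
        if np.isnan(sub).any():
            raise ValueError("group %d contains missing distances" % k)
        Y = classical_mds(sub, r)
        if k == 0:
            X[g] = Y                         # first group defines the frame
            continue
        placed = ~np.isnan(X[g, 0])          # overlap with earlier groups
        if placed.sum() < r + 1:
            raise ValueError("group %d overlaps in fewer than r+1 points" % k)
        to_common = fit_rigid(Y[placed], X[g[placed]])
        X[g[~placed]] = to_common(Y[~placed])
    return X

# Synthetic check: 7 points in the plane, distances known only inside the groups.
rng = np.random.default_rng(0)
P = rng.normal(size=(7, 2))
D_full = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
groups = [[0, 1, 2, 3], [1, 2, 3, 4], [0, 2, 4, 5], [3, 4, 5, 6]]
known = np.eye(7, dtype=bool)
for g in groups:
    known[np.ix_(g, g)] = True
D_obs = np.where(known, D_full, np.nan)
X = combine_in_sequence(D_obs, groups, r=2)
D_rec = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
```

In this noise-free setting the distances that were hidden in `D_obs` are recovered in `D_rec`, because r+1 overlapping points in general position determine the alignment uniquely.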

However, SC-MDS can also operate successfully when the ratio of missing values is more than 0.3, provided that the missing values are spread well enough. What does "spread well" mean? The first condition is that each column should have fewer than (N − r − 2) missing values, because each object needs at least the information about its relation to the overlapping part. Moreover, each group should have at least r+1 overlapping points connecting it with its center group. For example, if the missing values are located as in Figure 14, there is no overlapping region between the two groups (or the overlapping region is smaller than r + 1), and SC-MDS fails to reconstruct the coordinates from the given distance matrix.
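The first condition can be checked directly on the missing-value pattern. The sketch below takes the bound exactly as stated in the text (fewer than N − r − 2 missing values per column); the function name and the boolean-mask representation are assumptions made for illustration.

```python
import numpy as np

def columns_too_sparse(missing, r):
    """Indices of columns that violate the first 'spread well' condition.

    missing: boolean N x N array, True where a distance is missing.
    A column violates the condition if its number of missing
    off-diagonal values is not smaller than N - r - 2.
    """
    N = missing.shape[0]
    off_diag = missing & ~np.eye(N, dtype=bool)
    per_column = off_diag.sum(axis=0)
    return np.flatnonzero(per_column >= N - r - 2)

# Hypothetical pattern: object 0 has lost its distances to objects 1..4.
missing = np.zeros((6, 6), dtype=bool)
missing[0, 1:5] = True
missing[1:5, 0] = True
bad = columns_too_sparse(missing, r=1)
```

This check covers only the per-column condition; whether each group has r+1 overlapping points with its center group still has to be verified during the group search.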

Fig. 14: Missing values that are not spread well enough to employ SC-MDS.

The following is the variation in time cost for different ratios of missing values. Intuitively, the time cost should increase as the ratio of missing values increases. However, as shown in Figure 15, the time cost increases sharply as the ratio of missing values goes up, and then decreases mildly once the ratio exceeds 0.13. The main reason for this lies in the sorting process. When we try to find the largest complete data group, we choose certain objects as our overlapping points. Then we collect all the objects that have no missing values with these overlapping objects and delete one object from the collection at a time, namely the one with the most missing values, until the distances between all pairs of the remaining objects have no missing values. In this process, the numbers of missing values in the columns of the distance matrix are sorted over and over. Compare the sorting process when the ratio of missing values is 0.13 with the process when it is 0.30: the number of objects that have no missing values with the overlapping objects is larger when the ratio is 0.13 than when it is 0.30, so the collection to be sorted is larger. Hence it costs more time when the ratio of missing values is 0.13 than when it is 0.30.

Fig. 15: Average time cost of SC-MDS with missing values for different ratios of missing values

5 Empirical Study

Fig. 16: Scree plot of SC-MDS and CMDS

The yeast data were obtained from Cho et al., 1998. They record 6457 genes whose expression changes over 17 hours. We keep the 4000 genes that change significantly, evaluated by the ratio of standard error to mean for each gene, and remove the remaining 2457 genes. We apply SC-MDS to these data with 4000 genes. In addition, we randomly remove some values from the distance matrix of these data, use SC-MDS to reconstruct the distance matrix, and evaluate the error by calculating the stress. We then compare the performance of SC-MDS under both conditions. A scree plot is shown in Figure 16: the left panel is the scree plot of SC-MDS, and the right panel is the scree plot of CMDS. It helps us to estimate the hidden dimensionality of the data. Here, we choose r = 19, N_I = 20, N_g = 1.5 × 20 = 30, and a ratio of missing values of 0.2. The stress of SC-MDS without missing values is 2.09 × 10⁻⁹, and the time cost is 1.29 seconds. The stress of SC-MDS with missing values is 0.54, and the time cost is 445.21 seconds. As mentioned above, the tolerable ratio of missing values is strongly related to the sample size and the hidden dimension. In particular, when the missing values are removed randomly from the distance matrix, it is hard to achieve the "well spread" condition described earlier. Consequently, the tolerable ratio of missing values decreases. Figure 17 shows the result of SC-MDS on the data with missing values.
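The evaluation protocol (remove entries at random, reconstruct, measure stress) can be sketched as follows. The stress formula is not given in this section, so the normalized form used here (in the style of Kruskal's stress-1) is an assumption, as are the function names and the small synthetic data set standing in for the yeast data.

```python
import numpy as np

def remove_random_entries(D, ratio, rng):
    """Copy of D with a given ratio of off-diagonal pairs set to NaN (symmetric)."""
    N = D.shape[0]
    iu, ju = np.triu_indices(N, k=1)
    n_missing = int(round(ratio * iu.size))
    pick = rng.choice(iu.size, size=n_missing, replace=False)
    D_miss = D.astype(float).copy()
    D_miss[iu[pick], ju[pick]] = np.nan
    D_miss[ju[pick], iu[pick]] = np.nan
    return D_miss

def stress(D_true, X):
    """Assumed definition: sqrt(sum (d - d_hat)^2 / sum d^2) over all pairs."""
    D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(D_true.shape[0], k=1)
    num = np.sum((D_true[iu] - D_hat[iu]) ** 2)
    return np.sqrt(num / np.sum(D_true[iu] ** 2))

# Synthetic stand-in: 10 points in 3 dimensions, 20% of the pairs removed.
rng = np.random.default_rng(1)
pts = rng.normal(size=(10, 3))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
D_miss = remove_random_entries(D, 0.2, rng)
s_exact = stress(D, pts)   # stress of the true configuration
```

The stress is computed against the full original distance matrix, so it also measures how well the removed distances are reconstructed.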

Fig. 17: Spatial conﬁguration of yeast data with missing values

6 Conclusion

In this article, we try to complete the SC-MDS process. We give suggestions for the parameters of SC-MDS. SC-MDS achieves its best performance when the number of overlapping points, N_I, is at least the dimensionality of the samples plus one, and the group size, N_g, is about 1.51 times N_I. We can also estimate the hidden dimensionality from the variation of the error as the number of overlapping points changes. In addition, the combine step needs a slight revision: when we carry out the combine step, we should take into account the dimensionality of the two groups and use the group with the larger dimensionality as the center group when aligning the two groups. Finally, we prove that the result of SC-MDS is the same as that of CMDS up to a rotation, provided that at least one group has dimensionality equal to the dimensionality of the whole data set.

Another result is the use of SC-MDS to solve the missing value problem. We apply the properties of SC-MDS to deal with incomplete data. The tolerance for missing values improves: the tolerable ratio of missing values rises from 20% (Troyanskaya et al., 2001) to more than 30%.

7 References

1. A. Mead (1992). Review of the Development of Multidimensional Scaling Methods. The Statistician, 41, 27-39.

2. Buja A, Swayne DF, Littman M, Dean N, Hofmann H (2001). XGvis: Interactive data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics.

3. Chalmers M (1996). A linear iteration time layout algorithm for visualizing high-dimensional data. Proceedings of the 7th conference on Visualization, 127-ff.

4. Donald A. Jackson (1993). Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches. Ecology, Vol. 74, No. 8, 2204-2214.

5. Florian Wickelmaier (2003). An Introduction to MDS. Sound Quality Research Unit, Aalborg University, Denmark.

6. G. Golub and C. Van Loan (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.

7. Gale Young and A. S. Householder (1938). Discussion of a Set of Points in Terms of Their Mutual Distances. Psychometrika, Vol. 3, No. 1, 19-22.

8. Jengnan Tzeng, Henry Horng-Shing Lu, and Wen-Hsiung Li (2008). Multidimensional Scaling for Large Genomic Data Sets. BMC Bioinformatics, 9:179.

9. Mark Steyvers (2001). Multidimensional Scaling. Encyclopedia of Cognitive Science, Macmillan Reference Ltd.

10. Morrison A, Ross G, Chalmers M (2003). Fast multidimensional scaling through sampling, springs and interpolation. Information Visualization, 2:68-77.

11. Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, Vol. 17, No. 6, 520-525.

12. Trevor F. Cox and Michael A. A. Cox (2001). Multidimensional Scaling, Second Edition. Chapman & Hall/CRC.
