To cluster apps by their behavior, we first need a method that produces the clusters we want. In this chapter we survey several clustering methods to find one suited to that task. We also compare two dimension-reduction methods for coping with large data and identify the one that is more compatible with our research. In the third part of the related work, we explore sequence analysis as a way to improve our clustering results.
2.1 Clustering methods
In this part, we investigate three machine-learning methods. They are all unsupervised clustering approaches.
2.1.1 K-Means Algorithm
K-means is a conventional clustering algorithm, described in 1979 in [12]. Its goal is to find the partition that minimizes the within-cluster sum of squares: it divides M points in N dimensions into K clusters so that this sum is as small as possible. Because it is simple and fast, k-means has been applied in many fields and has become the classic choice when data need to be clustered. The aim of clustering is to find structure in data, which makes it exploratory in nature. Although the idea of k-means was first proposed in 1955, more than 50 years ago, the algorithm is still widely used.
The core formula is below:
$$J_{MSE} = \sum_{i=1}^{k} \sum_{x_t \in c_i} \| x_t - c_i \|^2$$
where $c_i$ is the geometric centroid of cluster i.
In [15], the authors proposed a two-phase k-means algorithm, named k'-means, that removes the need to fix the number of clusters in advance. They first pre-process the experimental data with the original k-means to obtain an initial clustering. They then use the following indicator function to assign each point $x_t$ to a cluster:
$$I(x_t, i) = \begin{cases} 1, & \text{if } i = \arg\min_{j} d_m(x_t, j),\ j = 1, \dots, N \\ 0, & \text{otherwise} \end{cases}$$
After obtaining the initial k clusters, they choose a unit to be the center of each cluster and re-cluster the data, which yields the final clustering.
In their results, the authors show that this method can cluster data efficiently without the parameter k being set in advance.
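The basic k-means iteration described in this section can be sketched in Python as follows. This is a minimal illustration that assumes NumPy and synthetic two-dimensional data. The function names (`assign`, `j_mse`, `k_means`), the random initialization and the toy data are assumptions made for the example and are not taken from [12] or [15]. Here `assign` plays the role of the indicator $I(x_t, i)$, with the Euclidean distance standing in for $d_m$, and `j_mse` evaluates the objective $J_{MSE}$.

```python
import numpy as np

def assign(points, centers):
    """Index of the nearest center for every point (the indicator I(x_t, i))."""
    # Squared Euclidean distance from every point to every center, shape (M, K).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def j_mse(points, centers, labels):
    """Sum of squared distances between each point and its cluster centroid."""
    return float(((points - centers[labels]) ** 2).sum())

def k_means(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centers with k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = centers.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, assign(points, centers)

# Toy data: two well-separated groups of 50 points each.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
])
centers, labels = k_means(points, k=2)
print("J_MSE =", j_mse(points, centers, labels))
```

Note that k has to be passed explicitly in this sketch, which is exactly the requirement that k'-means in [15] is designed to remove.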
2.1.2 SOM
Among artificial neural network algorithms, the Self-Organizing Map (SOM) has the special property of effectively creating spatially organized “internal representations” of various features of input signals and their abstractions [16]. It uses unsupervised learning to produce a low-dimensional representation of the training samples while preserving the topological properties of the input.
SOM is a sheet-like neural network. Its cells become specifically tuned to various input signal patterns or classes of patterns through an unsupervised learning process.
The SOM process consists of three steps:
1. Select the best-matching cell
2. Adapt the weight vectors
3. Demonstrate the ordering process
In the training phase, SOM determines the winner cell for each input, that is, the cell whose weight vector best matches the input.
By repeating this calculation, SOM keeps updating the winner and eventually obtains the clustering result. In experiments on speech recognition and semantic maps [16], SOM achieved results better than those produced by more conventional methods.
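The first two steps of the SOM process can be sketched in Python as follows. This is a minimal illustration, assuming a commonly used formulation in which the winner is the cell whose weight vector has the smallest Euclidean distance to the input, and the update uses a Gaussian neighbourhood with a linearly decaying learning rate and radius. These choices, the function names and all parameter values are assumptions for the example and may differ from the exact formula used in [16].

```python
import numpy as np

def best_matching_unit(weights, x):
    """Step 1: grid coordinates of the cell whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=2)  # shape (rows, cols)
    return np.unravel_index(dist.argmin(), dist.shape)

def adapt(weights, x, winner, lr, radius):
    """Step 2: pull the winner and its grid neighbours towards x (in place)."""
    rows, cols = np.indices(weights.shape[:2])
    grid_d2 = (rows - winner[0]) ** 2 + (cols - winner[1]) ** 2
    h = np.exp(-grid_d2 / (2.0 * radius ** 2))  # Gaussian neighbourhood on the grid
    weights += lr * h[:, :, None] * (x - weights)

def train_som(data, shape=(5, 5), n_iter=2000, lr0=0.5, radius0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: weight vectors start at random positions in the unit cube.
    weights = rng.random((shape[0], shape[1], data.shape[1]))
    for t in range(n_iter):
        frac = 1.0 - t / n_iter  # decays linearly from 1 towards 0
        x = data[rng.integers(len(data))]
        winner = best_matching_unit(weights, x)
        adapt(weights, x, winner, lr=lr0 * frac, radius=max(radius0 * frac, 0.5))
    return weights

# Toy data: 200 points drawn uniformly from the unit square.
data = np.random.default_rng(1).random((200, 2))
weights = train_som(data)
print(weights.shape)
```

Repeating the two steps over many inputs is what gradually orders the map, so that neighbouring cells end up with similar weight vectors.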
SOM is an excellent tool for exploratory data mining: it can visualize and explore the properties of data. When the number of SOM units is large, similar units must be grouped, that is, clustered, to facilitate quantitative analysis. The authors of [28] proposed a two-level clustering approach that uses SOM followed by the k-means algorithm and compared it with clustering the data directly with k-means. After running both methods, the authors measured computation time and cost to show that the combined SOM and k-means approach is an effective clustering method. They also showed that using SOM as a first abstraction level is not only useful for clustering data but also has some clear advantages.
The most important of these advantages is the reduction in computation cost, especially for hierarchical algorithms that allow clusters of arbitrary size and shape. The experiment indicated that clustering the SOM is a computationally effective alternative to clustering the data directly.
2.1.3 GHSOM
A major advantage of SOM is its strong capability to visualize the topological relationships of high-dimensional input in a low-dimensional view. However, although many applications of SOM have been reported, it still has weaknesses that have not been overcome over the years [21].
First, the network architecture of a SOM has to be determined before the training process, which makes a satisfactory result hard to obtain, especially with large data. Second, SOM cannot represent hierarchical structure: hierarchical relationships are shown in the same representation space and are hard to identify.
To address these limitations, a further method named the Growing Hierarchical Self-Organizing Map (GHSOM) was proposed [21]. The key idea of GHSOM is to use a hierarchical structure of multiple layers, where each layer consists of independent SOMs. At the same time, GHSOM removes the requirement of SOM that the network structure be determined in advance. The flexibility and hierarchical structure of GHSOM effectively improve its ability to produce clustering results that can be visualized. In our research we obtained a useful clustering result, which indicates that GHSOM is a proper clustering method. GHSOM has been used in image recognition, data mining and text mining [3].
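The layered structure can be illustrated with a heavily simplified Python sketch. It is not the algorithm of [21]: every map here is a fixed 2x2 SOM, so the growth of a map within a layer is left out, and only the hierarchical growth is shown. The rule for expanding a unit (its mean quantization error exceeds `tau2` times the error `qe0` of a single unit placed at the data mean) follows a criterion commonly described for GHSOM and is an assumption here, as are all names and parameter values.

```python
import numpy as np

def train_som(data, shape=(2, 2), n_iter=500, seed=0):
    """Tiny SOM trainer returning weights of shape (rows * cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = shape
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Assumption: every unit starts on a randomly drawn data point.
    weights = data[rng.integers(len(data), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = 1.0 - t / n_iter
        x = data[rng.integers(len(data))]
        winner = np.linalg.norm(weights - x, axis=1).argmin()
        grid_d2 = ((grid - grid[winner]) ** 2).sum(axis=1)
        h = np.exp(-grid_d2 / (2.0 * max(frac, 0.3) ** 2))
        weights += 0.5 * frac * h[:, None] * (x - weights)
    return weights

def quantization_error(data, center):
    """Mean distance from the data to a single weight vector."""
    return float(np.linalg.norm(data - center, axis=1).mean())

def grow(data, qe0, tau2=0.3, depth=0, max_depth=3, min_points=10):
    """Train one SOM, then give each poorly fitting unit its own child SOM."""
    weights = train_som(data, seed=depth)
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    owner = dists.argmin(axis=1)  # best-matching unit of every data point
    node = {"depth": depth, "weights": weights, "children": {}}
    for unit in range(len(weights)):
        subset = data[owner == unit]
        if len(subset) < min_points or depth + 1 >= max_depth:
            continue
        if quantization_error(subset, weights[unit]) > tau2 * qe0:
            node["children"][unit] = grow(
                subset, qe0, tau2, depth + 1, max_depth, min_points
            )
    return node

# Toy data: 400 points drawn uniformly from the unit square.
data = np.random.default_rng(1).random((400, 2))
qe0 = quantization_error(data, data.mean(axis=0))
root = grow(data, qe0)
print("units expanded at the top layer:", sorted(root["children"]))
```

Each node of the resulting tree is an independent SOM trained only on the data mapped to its parent unit, so the depth of the hierarchy adapts to the data instead of being fixed beforehand.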
On the application side, GHSOM can be applied in a wide range of fields. For example, the authors of [27] used it to cluster software repositories because of its clustering ability and its visualization of large and complex data sets. The data for this experiment are 273 C/C++ source code files gathered from three well-known textbooks, and the purpose of the research is to cluster this code so that code repositories can be organized better.
Their results confirmed that GHSOM is an effective approach to managing such code and produces useful results.
2.1.4 Comparison of clustering methods
Having surveyed the clustering methods above, we have to choose the one best suited to our research. We found a 2008 study that compares three clustering methods: spectral clustering, SOM and GHSOM. Spectral clustering, in turn, has been compared with the k-means algorithm in [6].
The main difference between the two methods is that k-means is suited to linear data, whereas spectral clustering is suited to non-linear data. Spectral clustering also works differently: it clusters the data using the spectrum of their similarity matrix. After applying k-means and spectral clustering to a data set from an outfitter in Taipei City, the authors of [6] concluded that spectral clustering outperforms k-means.
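The role of the spectrum can be illustrated with a minimal Python sketch that splits data into two clusters. It assumes a Gaussian similarity matrix with a hypothetical width `sigma`, the unnormalised graph Laplacian, and a split by the sign of the eigenvector belonging to the second-smallest eigenvalue. This is one simple variant chosen for illustration and not necessarily the one used in [6]. The toy data are two concentric rings, a non-linear structure that a single centroid per cluster cannot describe.

```python
import numpy as np

def spectral_bipartition(points, sigma=0.5):
    """Split points into two groups using the spectrum of the graph Laplacian."""
    # Gaussian similarity matrix W between all pairs of points.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    # Unnormalised graph Laplacian L = D - W, with D the diagonal degree matrix.
    laplacian = np.diag(w.sum(axis=1)) - w
    # eigh returns eigenvalues in ascending order, so column 1 belongs to
    # the second-smallest eigenvalue.
    _, vectors = np.linalg.eigh(laplacian)
    return (vectors[:, 1] > 0).astype(int)

# Toy data: an inner ring of radius 1 and an outer ring of radius 4.
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
points = np.vstack([1.0 * ring, 4.0 * ring])
labels = spectral_bipartition(points)
print(labels[:60].sum(), labels[60:].sum())
```

Because both rings share the same center, their centroids coincide, which is why a centroid-based method such as k-means cannot separate them, while the similarity graph keeps each ring connected.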
The other two methods are GHSOM and SOM. The author of the 2008 study compared these methods on word-image clustering. The experimental data are a real data set composed of 132,956 word images from the pages of a 14th-century book. During clustering, the author found that GHSOM performed best among all methods, including SOM, in cluster precision, in pattern precision and in the percentage of correct clustering. This result gives us evidence that GHSOM is better than SOM.
From the survey and comparison of clustering methods above, we conclude that GHSOM achieved the best clustering performance.
On the other hand, we found that no one has used SOM and k-means for app analysis, let alone applied these two methods to large data. We therefore decided to try them as well and see whether they give good results.
Thus, in our research, we choose GHSOM as our main method.