2.4 App Analysis and clustering
4.3.1 Iterative GHSOM on 800 apps
National Chengchi University (國立政治大學)
4.3.1 PCA reduction on 800 apps
We chose a new batch of apps from AppBeach. These apps are new data downloaded from the App Store. We use them to show that GHSOM can handle more apps and more attributes. In this part we also applied PCA to the data. The data include 1-sequence and 2-sequence attributes, so we have two sets of results with which to check whether iterative GHSOM produces useful results.
Unfortunately, these data also suffer from a very large number of attributes. For example, the 1-sequence analysis has more than 140,000 attributes, and the 2-sequence analysis has more than 250,000.
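To see why the attribute count grows when moving from 1-sequence to 2-sequence analysis, the following MATLAB sketch counts the distinct attributes of a toy example. It assumes that an n-sequence attribute is a run of n consecutive items in the sequence extracted from an app; this definition, the call names and the variable names are illustrative assumptions, not the extraction code used in the experiment.

```matlab
% Toy sequence extracted from one app (hypothetical call names).
calls = {'open', 'read', 'send', 'read', 'close', 'open', 'send'};

% 1-sequence attributes: each distinct item on its own.
one_seq = unique(calls);

% 2-sequence attributes: each distinct pair of consecutive items.
% strcat joins the two shifted cell arrays element by element.
pairs   = strcat(calls(1:end-1), '>', calls(2:end));
two_seq = unique(pairs);

n1 = numel(one_seq);
n2 = numel(two_seq);
fprintf('1-sequence attributes: %d\n', n1);
fprintf('2-sequence attributes: %d\n', n2);
```

Even in this small example the number of distinct pairs exceeds the number of distinct items, which is the same effect that raises the attribute count in the 2-sequence data.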
First, we apply PCA to these data to reduce the number of attributes to a size that our machines can handle. In this experiment we set the PCA level to 95%, which leaves fewer attributes to analyze. The PCA process is shown below:
After reducing the attributes, we can continue our experiment.
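One common reading of a PCA level of 95% is to keep the smallest number of principal components whose cumulative explained variance reaches 95%. The MATLAB sketch below follows that reading on small synthetic data. The interpretation of the level, the data and all variable names are assumptions made for illustration, not the configuration used in our experiment.

```matlab
% Synthetic data: 20 apps x 50 attributes (hypothetical values).
a = (1:20)';
b = 1:50;
X = a * b + sin(a) * cos(b);

level = 0.95;                              % assumed: share of variance to keep

Xc = bsxfun(@minus, X, mean(X, 1));        % centre every attribute (column)
[U, S, ~] = svd(Xc, 'econ');               % economy-size SVD, Xc = U*S*V'
sv = diag(S);                              % singular values, largest first
explained = sv.^2 / sum(sv.^2);            % variance share per component
k = find(cumsum(explained) >= level, 1);   % smallest k that reaches the level
scores = U(:, 1:k) * S(1:k, 1:k);          % apps expressed in k components

fprintf('%d attributes reduced to %d component(s)\n', size(X, 2), k);
```

Working from the singular values of the centred data avoids forming the attribute-by-attribute covariance matrix, which would be impractical with more than 100,000 attributes.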
4.3.1 Iterative GHSOM on 800 apps
After applying PCA to the 800 apps, we chose three apps as targets in the same way. These targets are TRAVELDOOR, WhereMyDogsAt and SuperLiveHD. Interestingly, when we apply iterative GHSOM, we cannot immediately see what the members of each target's cluster have in common. But once we download the apps from the App Store, we find common functions that are not mentioned in their App Store descriptions. The result tables below illustrate this.
App                                                              Features in description
TRAVELDOOR                                                       map, figures
EmoticonArtTextEmojiPicsUnicodeForFacebookTwitterGoogleFbTumblr
BouncePlanets                                                    game(3D)
ParentingBoxforBabiesandToddlers                                 video, figures
WallpapersHD                                                     figures
SpitakaSpeedlite                                                 game
Table 6: Cluster of app TRAVELDOOR (features in description)
From the descriptions alone, you cannot tell what these apps have in common. But if you download and use them, you will find that they do share functions, as the table below shows:
App                                                              Functions found in use
TRAVELDOOR                                                       Send email
EmoticonArtTextEmojiPicsUnicodeForFacebookTwitterGoogleFbTumblr  Send email
BouncePlanets                                                    game(3D)
ParentingBoxforBabiesandToddlers                                 Send email
WallpapersHD                                                     figures, Send email
SpitakaSpeedlite                                                 game
Table 7: Cluster of app TRAVELDOOR (functions)
Most of them share the function of sending e-mail. The same situation appears for the other two apps; see Table 8 for details.
App Name       Cluster member                    Functions
WhereMyDogsAt  ArtClockLite                      timer, download
               MomentumLITE                      timer
               bodlebook                         timer
SuperLiveHD    PuzzleRings                       game
               MetroBeijingSubway                upload/download, languages
               IHearEweAnimalSoundsforToddlers   game
               AdXTrackingOptOut                 upload, languages
               AIPAMemberApp                     log-in, languages
               AutoHaus                          magazine, download
Table 8: Clusters of apps WhereMyDogsAt and SuperLiveHD
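The observation drawn from Table 7 can also be checked mechanically by counting how often each function occurs among the cluster members. The following MATLAB sketch does this for the TRAVELDOOR cluster, using the functions listed in Table 7; the variable names are illustrative.

```matlab
% Cluster members and their observed functions, copied from Table 7.
members = {'TRAVELDOOR', ...
           'EmoticonArtTextEmojiPicsUnicodeForFacebookTwitterGoogleFbTumblr', ...
           'BouncePlanets', 'ParentingBoxforBabiesandToddlers', ...
           'WallpapersHD', 'SpitakaSpeedlite'};
funcs = {{'Send email'}, {'Send email'}, {'game(3D)'}, ...
         {'Send email'}, {'figures', 'Send email'}, {'game'}};

% Flatten all function labels into one list and count each label.
% No app lists a label twice, so a count equals a number of apps.
allFuncs = [funcs{:}];
[names, ~, idx] = unique(allFuncs);
counts = accumarray(idx(:), 1);

% Report the function shared by the most members.
[best, pos] = max(counts);
fprintf('%s: %d of %d apps\n', names{pos}, best, numel(members));
```

For this cluster the most frequent function is sending e-mail, shared by four of the six members, which matches the reading of Table 7.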
Meanwhile, we performed the 2-sequence analysis to see whether it also demonstrates the effect of iterative GHSOM. The table below shows the results of the 2-sequence analysis.
App Name       Cluster member          Functions
TRAVELDOOR     Oz-heritage             travel
               AngPauPal               calculate cost, link (wedding)
               APNG                    animation, link function
               Ado                     music
WhereMyDogsAt  WhereMyDogsAt
SuperLiveHD    Acimga                  images
               ABUSLifeView            video
               Battery7                record video play time
               RussianpaintingHDFree   images
Table 9: 2-sequence analysis results
At the end of this part, we compare iterative GHSOM with non-iterative GHSOM in terms of execution time and clustering results. First, we compare the execution times for 115 apps and 800 apps. The table below shows the execution times:
                         Iterative time       Non-iterative time
115 apps (without PCA)   103 minutes          98 minutes
800 apps (with PCA)      1 minute 5 seconds   8.288 seconds
Table 10: Execution time comparison for 115 apps and 800 apps
As for the clustering results, we can observe a few characteristics from the earlier results:
Iterative result   Non-iterative result
Detailed           Rough
More accurate      More errors
Slower             Faster
Table 11: Comparison of iterative and non-iterative GHSOM results
Although the execution time for 115 apps was not impressive, the execution time is clearly reduced for 800 apps.
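The size of this reduction can be read off Table 10 directly. As a quick check, the MATLAB sketch below converts the reported times to seconds and computes the ratio between the two experiments. The variable names are illustrative, and because the ratio compares different data sets (115 apps without PCA, 800 apps with PCA) it is only indicative.

```matlab
% Execution times from Table 10 in seconds: [iterative, non-iterative].
t115 = [103 * 60, 98 * 60];   % 115 apps, without PCA
t800 = [60 + 5, 8.288];       % 800 apps, with PCA

% How many times shorter each run is on the PCA-reduced 800-app data.
ratio = t115 ./ t800;

fprintf('iterative:     about %.0f times shorter\n', ratio(1));
fprintf('non-iterative: about %.0f times shorter\n', ratio(2));
```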
Below is the comparison of 1-sequence analysis and 2-sequence analysis.
             Executing time   Executing results
1-sequence   8.288 seconds    More members, more common ground
2-sequence   4.728 seconds    Fewer members, fewer groups
Table 12: Execution time comparison of 1-sequence and 2-sequence analysis
5 Conclusions
In this paper, we proposed an algorithm, iterative GHSOM with PCA, to handle the large number of attributes of many apps and to cluster the apps by their similarity and behavior.
From our experiments and the comparison of results, we conclude that iterative GHSOM can effectively reduce execution time while producing similar clustering results. With PCA, we can overcome the problem of the large number of attributes. With the alternative 2-sequence analysis, we found that the execution time of iterative GHSOM can be reduced much further, and that it can find different clusters if needed. Although it may not produce the same results as the 1-sequence analysis, it can still serve as a reference.
For future work, we may modify our algorithm to make it better suited to apps. If the algorithm proves successful, it could be useful in many research areas. With this research, one can find apps similar to a given target app. We will continue working on this problem and try to obtain better results.
Appendix
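1. GHSOM clustering result of 115 apps
(1) Segment of MATLAB code for converting the original data:
% for i=1:m   (opening of an enclosing loop in the original listing; its closing end is not part of this segment)
fid = fopen(outfile, 'w');               % open the output file for writing
if fid == -1
    error('Cannot open file: %s', outfile);
end
[nrows, ncols] = size(Cube4);            % dimensions of the data matrix to export
% fprintf(fid, '%s ', fullAttrName_115{:});   % disabled: header with attribute names
fprintf(fid, '\n');                      % first line of the file is left empty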
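for row = 1:nrows
    fprintf(fid, '%d ', Cube4(row,:));   % write all values of this row, space-separated
    fprintf(fid, '%s\n', Row{row});      % append the row label and end the line
end
fclose(fid);
clearlist = {'m','n','Row','x','nrows','ncols'};   % temporary variables to remove
clear(clearlist{:});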