

3.2 Same-Interest Application Discovery


國立政治大學 National Chengchi University

We could find a popular application named "Google Translate". Its topic distribution is concentrated almost entirely in topic #5, the language topic (62%), while no other topic exceeds 5%. In contrast, some applications cross two topics, each with a probability of more than 10%.

For example, an application named "Fruits and vegetables flashcards quiz and matching game for toddlers and kids in English" is assigned to the following two topics:

1. Topic 12 (Child) with a probability of 35.1%

2. Topic 5 (Language) with a probability of 12.1%

Its official description confirms that the application helps children learn the names of various fruits and vegetables, which relates it to both topics. The following is its preprocessed description:

help child learn name various fruit vegetable encounter daily life use interactive picture book match game quiz game english language remember parent child first teacher app make child help discover learn name picture use interactive flash card different way feature beautiful eye catch picture professional pronunciation many language simple intuitive navigation apps flashcard match game quiza perfect kid book pronunciation voice early learn phone tablet app specifically design toddler baby mind simple intuitive navigation different picture feature fruit banana orange melon lime mango pineapple strawberry app real picture much easier baby relate compare draw animate image non native english speak app use teach child name common fruit vegetable thereby get good start learn english second language esl may also benefical child autisme asperges continuously expand range theme learn apps game child want get
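The topic-vector representation can be illustrated with a short sketch. The probabilities are the ones quoted above, and the 10% cutoff is the cross-topic criterion mentioned in the text (the helper name `cross_topic` is ours):

```python
# Each application is represented by its LDA topic distribution:
# a sparse map from topic id to probability (values quoted in the text).
apps = {
    "Google Translate": {5: 0.62},
    "Fruits and vegetables flashcards quiz and matching game "
    "for toddlers and kids in English": {12: 0.351, 5: 0.121},
}

def cross_topic(topic_dist, threshold=0.10):
    """Return the topic ids whose probability exceeds the threshold."""
    return sorted(t for t, p in topic_dist.items() if p > threshold)

for name, dist in apps.items():
    topics = cross_topic(dist)
    label = "cross-topic" if len(topics) > 1 else "single-topic"
    print(label, topics, "-", name[:30])
```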

3.2.3 Clustering Same-Interest Applications with GHSOM

After topic modeling with LDA, we can retrieve each application's topic distribution, so every application is identified by a vector of topic probabilities, and applications with similar vectors share similar interests. We cluster these vectors with the GHSOM algorithm, a well-known clustering algorithm notable for its dynamic and hierarchical structure. In AppReco, we run GHSOM in Matlab using the GHSOM Toolbox. Applying GHSOM clustering to the applications yields groups of an appropriate size; the result is shown in Fig. 10, and the procedure is described in detail in the following paragraphs.

Figure 10: GHSOM tree structure for clustering applications.

1. We set τ1 = 0.5 and τ2 = 1.5 for the GHSOM clustering algorithm.

2. Start at layer 0: take the average of all input data as the single weight vector, and then calculate the mean quantization error of this layer, mqe0, given by Eq.(4):

mqe0 = (1/d) · Σx ‖m0 − x‖    (4)

In Eq.(4), d is the number of input data vectors, m0 is the weight vector (the average of all inputs), and x ranges over the input data.

3. Training of the GHSOM starts with its first-layer SOM, after the computation of mqe0. The first-layer map consists of a rather small number of units, such as a grid of 2 × 2 units.

4. Evaluate the resulting SOM map: every unit i that does not fulfill the growth-stopping criterion continues growing into the next layer. Eq.(5) is the stopping criterion:

mqei < τ2 · mqe0    (5)

In Eq.(5), mqei is the mqe of unit i, and τ2 is the parameter controlling the depth of the GHSOM. The smaller τ2 is, the more readily the hierarchy subdivides into additional layers.

5. When growing into the next layer, GHSOM uses the breadth parameter τ1 to control the growth of each map. GHSOM calculates the mqe of the map with Eq.(6), and keeps inserting nodes into the map until Eq.(7) is fulfilled.

MQEm = (1/µ) · Σi mqei    (6)

In Eq.(6), MQEm represents the mean quantization error of map m, and µ is the number of units i contained in map m.

MQEm < τ1 · mqe0    (7)

In Eq.(7), MQEm is the mqe of the map, and τ1 is the parameter controlling the breadth of the GHSOM. If τ1 increases, the threshold τ1 · mqe0 rises, so a map stops growing at a larger MQEm, which yields a relatively small number of nodes. Conversely, if τ1 is reduced, MQEm must fall further before growth stops, leading to a relatively large number of nodes.


6. Repeat steps 4 and 5 until all units of every map fulfill Eq.(5).
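The formulas in Eq.(4)–(7) translate almost directly into code. A minimal NumPy sketch, assuming the input data are the applications' topic-probability vectors stacked row-wise into a matrix, with the thesis's τ1 = 0.5 and τ2 = 1.5 as defaults:

```python
import numpy as np

def mqe0(X):
    """Eq.(4): X is a (d, n_topics) matrix of input vectors; the single
    layer-0 weight vector m0 is the average of all input data."""
    m0 = X.mean(axis=0)
    return np.linalg.norm(X - m0, axis=1).mean()  # (1/d) * sum ||m0 - x||

def map_MQE(unit_mqes):
    """Eq.(6): mean of the per-unit mqe_i over the u units of map m."""
    return sum(unit_mqes) / len(unit_mqes)

def unit_stops_growing(mqe_i, mqe_0, tau2=1.5):
    """Eq.(5): a unit i is not expanded into a deeper layer once this holds."""
    return mqe_i < tau2 * mqe_0

def map_stops_growing(unit_mqes, mqe_0, tau1=0.5):
    """Eq.(7): a map stops inserting new nodes once this holds."""
    return map_MQE(unit_mqes) < tau1 * mqe_0
```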

The flexibility and hierarchical structure of GHSOM make it more capable than SOM of generating hierarchical visual representations and of handling dynamic, growing, and large datasets; prior research has shown that the GHSOM model is able to uncover hierarchical structure, and in recent years GHSOM has been applied to data mining, text mining, and image recognition. Finally, we obtain many sub-clusters, each containing fewer applications that share more similar topics than the first-level clusters. We use this clustering result to compute ratings and recommend applications to the user.
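Steps 1–6 above can also be sketched end-to-end as a recursion. This is an illustrative simplification, not the Matlab GHSOM Toolbox used in AppReco: SOM training is replaced by a k-means-style update, and node insertion is reduced to enlarging the map by one unit at a time.

```python
import numpy as np

def train_map(X, k, iters=20, seed=0):
    """Stand-in for SOM training: k units, k-means-style weight updates."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
        assign = dists.argmin(axis=1)          # best-matching unit per input
        for j in range(k):
            if (assign == j).any():
                W[j] = X[assign == j].mean(axis=0)
    return W, assign

def unit_mqes(X, W, assign):
    """Per-unit mean quantization error mqe_i."""
    return [np.linalg.norm(X[assign == j] - W[j], axis=1).mean()
            if (assign == j).any() else 0.0
            for j in range(len(W))]

def ghsom(X, mqe_ref, tau1=0.5, tau2=1.5, k=4, depth=0, max_depth=3):
    """Grow a map until Eq.(7) holds (breadth), then recursively expand
    every unit that violates Eq.(5) (depth). mqe_ref is mqe0 at the top
    level and the parent unit's mqe_i below it."""
    while True:
        k = min(k, len(X))
        W, assign = train_map(X, k)
        mq = unit_mqes(X, W, assign)
        if sum(mq) / len(mq) < tau1 * mqe_ref or k == len(X):
            break                              # Eq.(7) fulfilled: stop inserting nodes
        k += 1                                 # otherwise insert another node
    clusters = []
    for j in range(len(W)):
        Xj = X[assign == j]
        if depth < max_depth and len(Xj) > 1 and mq[j] >= tau2 * mqe_ref:
            clusters.append(ghsom(Xj, mq[j], tau1, tau2, depth=depth + 1))
        else:
            clusters.append(Xj)                # leaf: one cluster of applications
    return clusters
```

At the top level, `ghsom(X, mqe0)` would be called with X the matrix of topic-probability vectors and mqe0 computed as in Eq.(4); the nested lists mirror the tree structure of Fig. 10.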

3.3 Behavior Analysis

AppReco uses static binary code analysis to check the behaviors of applications. To analyze applications, we collect iOS applications from the iTunes App Store, the official source of iOS applications. However, downloading an application is not enough: Apple has built security mechanisms into the design of iOS so that users cannot directly extract and analyze applications without root permission. Thus, we also decrypt applications before binary code analysis. We discuss these parts in the following sections.
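As a minimal illustration of why decryption must come first: an App Store binary is a Mach-O file whose LC_ENCRYPTION_INFO_64 load command carries a cryptid field that is non-zero while the binary is still encrypted. The sketch below checks that flag for a 64-bit little-endian image; it reflects general Mach-O knowledge, not AppReco's actual decryption pipeline.

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF            # 64-bit little-endian Mach-O magic
LC_ENCRYPTION_INFO_64 = 0x2C        # encryption-info load command

def is_encrypted(macho_bytes):
    """Return True if the image carries LC_ENCRYPTION_INFO_64 with
    cryptid != 0, i.e. the App Store binary is still encrypted."""
    magic, = struct.unpack_from("<I", macho_bytes, 0)
    if magic != MH_MAGIC_64:
        raise ValueError("not a 64-bit little-endian Mach-O image")
    ncmds, = struct.unpack_from("<I", macho_bytes, 16)  # ncmds in mach_header_64
    off = 32                                            # past the 32-byte header
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<II", macho_bytes, off)
        if cmd == LC_ENCRYPTION_INFO_64:
            cryptoff, cryptsize, cryptid = struct.unpack_from("<III", macho_bytes, off + 8)
            return cryptid != 0
        off += cmdsize
    return False
```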
