
3.2 Same-Interest Application Discovery



We use the NLTK package in Python to perform word stemming. After the stemming process, we filter out stop words from the tokens: words that appear frequently in documents but carry little meaning, such as "the," "is," "at," "which," "who," and "I." The following is an excerpt from the description of a weather application, Weather Motion, after preprocessing:

weather motion take mobile weather apps whole new level stun high definition hd graphic sound effect see hear weather indoors head outdoors appadvice com feature macworld appguide re need soothe alarm clock way listen watch weather without go outside weather motion another way experience weather padvance com weather information style stylishipadapps tumblr com weather motion beautiful way check weather information iphone ipod touch application show local condition report top soothe video representation current weather condition description see current forecast weather gorgeous hd video high quality video footage different weather condition completely optimize iphone amaze retina display real time weather information base current location gps wifi wherever world local condition report temperature high low wind speed direction humidity level day local weather forecast every city slide finger screen see forecast temperature
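To make the preprocessing step concrete, the following is a rough sketch of how it could be done with NLTK. The thesis does not specify which NLTK stemmer was used, so the PorterStemmer and the preprocess function name below are illustrative assumptions rather than the exact script used in the experiments.

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # requires nltk.download("stopwords") once

def preprocess(description):
    # Lowercase the text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", description.lower())
    # Stem each token, then drop stop words such as "the", "is", "at".
    stems = [stemmer.stem(token) for token in tokens]
    return [token for token in stems if token not in stop_words]

print(" ".join(preprocess("Weather Motion takes mobile weather apps to a whole new level.")))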

Finally, we generate a corpus from the list of processed descriptions. This corpus then serves as the LDA input for analyzing the topics of all descriptions in the following section.
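As a sketch of this corpus-generation step, the processed descriptions could be written out as shown below. The one-document-per-line file layout with a leading document count is our assumption about JGibbLDA's input format and should be checked against the tool's documentation.

def write_corpus(processed_descriptions, path="corpus.dat"):
    # processed_descriptions: one token list per application description.
    with open(path, "w", encoding="utf-8") as corpus_file:
        # Assumed JGibbLDA-style header: the total number of documents.
        corpus_file.write(f"{len(processed_descriptions)}\n")
        for tokens in processed_descriptions:
            # One whitespace-separated document per line.
            corpus_file.write(" ".join(tokens) + "\n")

write_corpus([
    ["weather", "motion", "take", "mobile", "weather", "apps"],
    ["help", "child", "learn", "name", "fruit", "vegetable"],
])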

3.2.2 Synthesizing Topic with LDA

To analyze the set of topics for the applications, we apply Latent Dirichlet Allocation (LDA) to discover topics. LDA is based on the hypothesis that each document can be viewed as a mixture of various topics. For example, the sentence "learning English through sports communication" might be divided between a language topic and a sports topic in an LDA model.

LDA then relies on a statistical model to discover the latent topics that occur in a collection of unlabeled texts. A "topic" in LDA consists of a set of words that frequently occur together. For example, if a set of sports-related documents is fed into LDA, the result would group words like "baseball", "fastball", "pitcher", and "catcher" into one cluster, and "basketball", "NBA", "dunk", and "FIBA" into another cluster.

Some applications, such as "MLB" and "Fantasy Baseball", would be associated with the former topic; others, like "NBA Game Time" and "Pro Basketball Pocket Reference", would be assigned to the latter topic; and still others, like "ESPN" and "Yahoo Sports", would be connected to both topics. Other applications may belong to neither topic.

There are several algorithms for implementing LDA inference, such as expectation-maximization (EM) and Gibbs sampling. We choose the open-source LDA implementation JGibbLDA as our tool [43]. LDA takes a corpus file and three input parameters:

(1) α, the Dirichlet prior that controls how many latent topics are expected to represent a document

(2) β, the Dirichlet prior that controls how many terms are expected to represent a latent topic

(3) the number of topics, i.e., the overall number of latent topics to be identified in the given corpus

We set the first two parameters following Griffiths and Steyvers' research [44], which discusses how to choose appropriate values for α and β in LDA: α is set to 50 divided by the number of topics, and β is set to 0.1. After some experiments, we fix the number of topics at 25. With these parameter values and the corpus produced in the previous step (the processed descriptions), we can discover the topics of the applications and the topic distribution within each application using LDA.
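Although the experiments use JGibbLDA, the same configuration (25 topics, α = 50/25 = 2.0, β = 0.1) can be sketched in Python with the gensim library. The code below is only an illustrative stand-in for the JGibbLDA run; the variable names and the tiny example corpus are our own.

from gensim import corpora, models

# Token lists produced by the preprocessing step (truncated examples).
processed_docs = [
    ["weather", "motion", "take", "mobile", "weather", "apps"],
    ["help", "child", "learn", "name", "fruit", "vegetable"],
]

num_topics = 25
dictionary = corpora.Dictionary(processed_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

lda = models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    alpha=50.0 / num_topics,  # alpha = 50 / number of topics = 2.0
    eta=0.1,                  # gensim's name for the beta hyperparameter
    passes=10,
    random_state=0,
)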

After the LDA processing, we can collect the word distribution for each topic and the topic distribution for each document. Fig. 8 and Fig. 9 show two topics and their top terms as word clouds.
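Continuing the illustrative gensim sketch above (again, a stand-in for the actual JGibbLDA workflow), the two kinds of distributions could be read out as follows.

# Word distribution of each topic: the top terms behind word clouds
# such as Fig. 8 and Fig. 9.
for topic_id in range(lda.num_topics):
    top_words = [word for word, _ in lda.show_topic(topic_id, topn=5)]
    print(topic_id, top_words)

# Topic distribution of one document (one application description).
for topic_id, probability in lda.get_document_topics(bow_corpus[0], minimum_probability=0.0):
    print(f"topic {topic_id}: {probability:.1%}")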

Once the word distribution of each topic is available, we can assign a name to each topic. For instance, we could assign the name "Medical" to the topic in Fig. 8 and the name "Calendar" to the topic in Fig. 9. Table 1 shows the assigned names and representative words of the topics from one of the experiments.

At the same time, we also retrieve the topic distribution of each application.

Figure 8: Sample topic #1
Figure 9: Sample topic #2

Table 1: Topics of one of the experiments

ID  Assigned Topic Name   Most Representative Words
 1  iOS Device            iphone, ipad, support, touch, device
 2  Video                 video, watch, tv, news, live
 3  Photo                 photo, share, facebook, picture, friend
 4  Design                color, design, create, app, tool
 5  Language              word, english, learn, language, chinese
 6  Fitness               time, weight, body, day, workout
 7  Contact               service, card, phone, call, information
 8  Free Application      app, free, com, get, apps
 9  Weather               weather, time, day, location
10  Live Sports           game, player, play, live, score
11  Music                 music, sound, song, play, sleep
12  Child                 child, learn, fun, baby, kid
13  Food                  food, recipe, store, eat
14  File Commander        support, file, download, view, share
15  Medical               medical, health, include, care, app
16  Love Story            life, name, good, love, people
17  Shopping              shop, buy, find, new, good
18  Newsstand             subscription, issue, purchase, period
19  News                  read, news, new, magazine, content
20  Screen                version, screen, full, set, tap
21  Application           application, please, app, update
22  Taiwan Information    provide, book, taiwan, information
23  Calendar              contact, data, use, message, calendar
24  Finance               information, stock, service, market
25  Travel                map, travel, search, car, information


For instance, consider a popular application named "Google Translate." Its topic distribution is concentrated almost entirely in topic #5, the language topic (62%), with no more than 5% in any other topic. In contrast, some applications span two topics, each with a probability of more than 10%.

For example, an application named "Fruits and vegetables flashcards quiz and matching game for toddlers and kids in English" is assigned to the following two topics:

1. Topic 12 (Child) with a probability of 35.1%

2. Topic 5 (Language) with a probability of 12.1%

From its official description, we observe that the application helps children learn the names of various fruits and vegetables, which relates it to both topics. The following is its preprocessed description:

help child learn name various fruit vegetable encounter daily life use interactive picture book match game quiz game english language remember parent child first teacher app make child help discover learn name picture use interactive flash card different way feature beautiful eye catch picture professional pronunciation many language simple intuitive navigation apps flashcard match game quiza perfect kid book pronunciation voice early learn phone tablet app specifically design toddler baby mind simple intuitive navigation different picture feature fruit banana orange melon lime mango pineapple strawberry app real picture much easier baby relate compare draw animate image non native english speak app use teach child name common fruit vegetable thereby get good start learn english second language esl may also benefical child autisme asperges continuously expand range theme learn apps game child want get
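As a small illustration of how such cross-topic applications could be identified programmatically, the per-application topic distributions from the earlier gensim sketch could be filtered with a probability threshold. The 10% cutoff, the helper name, and the app_names list below are our own assumptions.

def cross_topic_apps(lda, bow_corpus, app_names, threshold=0.10):
    # Return applications whose distribution exceeds the threshold in at
    # least two topics, like the flashcards app in Topic 12 and Topic 5.
    results = []
    for name, bow in zip(app_names, bow_corpus):
        strong_topics = lda.get_document_topics(bow, minimum_probability=threshold)
        if len(strong_topics) >= 2:
            results.append((name, strong_topics))
    return results

app_names = ["Weather Motion", "Fruits and vegetables flashcards quiz"]  # aligned with bow_corpus
for name, topics in cross_topic_apps(lda, bow_corpus, app_names):
    print(name, topics)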
