
2.2.4 Comparison of Recommendation Systems

After listing the above recommendation system methods, we have to choose the most suitable approach for our research. A study by Adomavicius and Tuzhilin [23] in 2005 compared the three recommendation methods and pointed out their respective advantages and disadvantages. The main difference between content-based and collaborative filtering systems is that the former focuses on measuring the similarity between vectors of item content, while the latter calculates the similarity between vectors of actual user-specified ratings. In AppReco, we focus on the similarity among applications, so we choose the content-based approach as our main method.

After surveying several content-based recommendation systems, we found that some of them use topic models to build their recommendation models. For example, Godin, Slavkovikj, De Neve, Schrauwen and Van de Walle [32] used Latent Dirichlet Allocation (LDA) to model the underlying topic assignment of language-classified tweets. They used over 1.8 million tweets as a data set and randomly picked 100 of them for constructing the model. They showed that they could recommend hashtags for tweets in a fully unsupervised manner. In our research, we therefore investigate topic model approaches to construct our recommendation model.

2.3 Topic Model

AppReco uses the topic model technique from natural language processing (NLP) to convert the applications' descriptions into vectors of word appearances. The result of a topic model reflects a collaboratively shared view of an application and the popular vocabulary of the topics used to describe applications. In addition, topic models explain the feature vectors of the data, which allows us to perform clustering. In this part, we investigate three popular topic model methods, all of which are used to dig out latent keywords in a document.
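To make the idea of word-appearance vectors concrete, the following is a minimal Python sketch, not part of AppReco itself, that converts a few invented application descriptions into bag-of-words count vectors with scikit-learn's CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

# Invented application descriptions, used only to illustrate the vectorization step.
descriptions = [
    "Track your daily running distance and calories burned.",
    "Edit photos with filters, stickers and collage layouts.",
    "Plan running routes and share workout photos with friends.",
]

vectorizer = CountVectorizer(stop_words="english")
word_counts = vectorizer.fit_transform(descriptions)  # documents x vocabulary count matrix

print(vectorizer.get_feature_names_out())  # the extracted vocabulary
print(word_counts.toarray())               # one word-appearance vector per description

The resulting matrix is the kind of term-document input that the topic models discussed below operate on.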


2.3.1 Latent Semantic Analysis

Deerwester, Dumais, Furnas, Landauer and Harshman [26] proposed Latent Semantic Analysis (LSA), a famous information retrieval technique, in 1990. The purpose of LSA, also called Latent Semantic Indexing (LSI), is to analyze documents to find the latent meanings or concepts behind them. LSA assumes that terms appearing in similar documents have closely related meanings and a strong association with the same concepts.

However, the problem is difficult because a word often bears multiple meanings. Lexical ambiguity obscures the concept behind a single term, and even native speakers can have a hard time interpreting the received information. For example, the term "case" refers to a criminal case when it is used around "murder", "police", and "victim". In contrast, when it is used around "package", "luggage", and "travel", it probably indicates a suitcase.

In order to solve the ambiguity, LSA first maps both terms and documents into a matrix space and performs the comparison in this space. Then, LSA uses a mathematical technique called Singular Value Decomposition (SVD) to reduce the number of rows while preserving the structural features of the large matrix. The core formula of SVD is Eq.(1).

X = T_0 S_0 D_0^T (1)

In Eq.(1), X represents a t × d matrix of terms and documents, decomposed into the product of three matrices. T_0 and D_0 have orthonormal columns and S_0 is diagonal. In SVD, T_0 and D_0 are the matrices of left and right singular vectors, and S_0 is the diagonal matrix of singular values. The singular values in S_0 are ordered by size and indicate the importance of the corresponding latent dimensions of terms and documents. After SVD decomposes the matrix, LSA keeps only the first k dimensions to obtain a reduced-dimensional representation of the matrix, which emphasizes the strongest relationships and throws away the noise.


Thus, Eq.(1) would be changed to Eq.(2).

X ≈ X̂ = T_k S_k D_k^T (2)

Eq.(2) retains the k largest singular values from Eq.(1), and the product of the resulting matrices is a matrix X̂ that approximates X. The number of dimensions k is very important.

If k is too small, important patterns would be dropped out. In contrast, if k is too big, noise words would creep in. At the end of SVD, T_k becomes the term-topic matrix while D_k becomes the topic-document matrix.
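The following is a minimal numpy sketch of the truncated SVD in Eq.(1) and Eq.(2); the term-document matrix X below is invented for illustration and is not data from this thesis.

import numpy as np

# A small made-up term-document count matrix: rows are terms, columns are documents.
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

# Full SVD, X = T_0 S_0 D_0^T as in Eq.(1).
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values, as in Eq.(2).
k = 2
Tk, Sk, Dkt = T0[:, :k], S0[:k], D0t[:k, :]
X_hat = Tk @ np.diag(Sk) @ Dkt  # low-rank approximation of X

print(np.round(X_hat, 2))  # Tk is the term-topic matrix, Dkt.T the document-topic matrix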

Landauer, Foltz and Laham [33] applied LSA to knowledge acquisition with the Grolier Encyclopedia, a popular academic encyclopedia in the United States. They used 60,768 terms and 30,473 documents from Grolier to construct a term-document matrix. Then, they applied SVD to reduce the high-dimensional matrix and to reconstruct it with a low-dimensional matrix.

Finally, they could compute the latent concepts behind terms with the low-dimensional matrix. They concluded that LSA performs well at around 300 dimensions and comparatively poorly below 100 dimensions, so that optimizing the dimensionality could greatly improve the extraction and representation of LSA.

2.3.2 Probabilistic Latent Semantic Analysis

Hofmann [27] proposed Probabilistic Latent Semantic Analysis (pLSA), also known as Probabilistic Latent Semantic Indexing (pLSI), in 1999. In contrast to LSA, a document is represented as a mixture of topics in pLSA. pLSA uses the aspect model to obtain a low-dimensional representation of the co-occurrence data, associated with hidden variables. The creation of the pLSA model also differs from LSA, which uses a singular value decomposition to reduce a high-dimensional matrix; pLSA is instead based on a mixture decomposition derived from a statistical technique, the latent class model. The model of pLSA is shown in Fig. 5.

Figure 5: Model of pLSA. [1]

In Fig. 5, d and w represent the observable document and word index variables, and z represents the latent variable, the topic underlying each document-word pair. M denotes the number of documents and N denotes the number of words in a document d.

The generative process of pLSA can be divided into 3 steps and is represented as Eq.(3):

P(d, w) = P(d) P(w|d),  where  P(w|d) = Σ_z P(w|z) P(z|d) (3)

1. Select a document d with probability P(d).

2. For each word in document d, select a topic z with probability P(z|d).

3. Select a word w from the previously chosen topic z with probability P(w|z).
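As a tiny worked example of Eq.(3), the following Python snippet uses made-up pLSA parameters (two documents, two topics, three words) to compute the joint probabilities P(d, w).

import numpy as np

P_d = np.array([0.5, 0.5])                # P(d)
P_z_given_d = np.array([[0.8, 0.2],       # P(z|d): rows are documents, columns are topics
                        [0.3, 0.7]])
P_w_given_z = np.array([[0.6, 0.3, 0.1],  # P(w|z): rows are topics, columns are words
                        [0.1, 0.2, 0.7]])

# P(w|d) = sum over z of P(w|z) P(z|d), giving a documents x words matrix.
P_w_given_d = P_z_given_d @ P_w_given_z
# P(d, w) = P(d) P(w|d)
P_dw = P_d[:, None] * P_w_given_d
print(np.round(P_dw, 3))  # each row sums to P(d) = 0.5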

In order to estimate the latent variables in pLSA, Hofmann [27] uses the Expectation Maximization (EM) algorithm. EM can be divided into three steps: 1. compute the posterior probabilities of the latent variable z using the current estimates of the parameters; 2. use the posteriors computed in the previous step to re-estimate the parameters by maximizing the expected likelihood; and 3. repeat steps 1 and 2 until convergence.
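The following is a minimal numpy sketch of these EM iterations; the term-document count matrix is invented and the parameters are randomly initialized, so it only illustrates the update rules rather than reproducing Hofmann's implementation.

import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(4, 6)).astype(float)  # made-up counts: documents x words
D, W = n_dw.shape
K = 2                                                  # number of topics

# Random initialization of P(z|d) and P(w|z), normalized to proper distributions.
P_z_given_d = rng.random((D, K)); P_z_given_d /= P_z_given_d.sum(axis=1, keepdims=True)
P_w_given_z = rng.random((K, W)); P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: posterior P(z|d,w) for every document-word pair (shape D x K x W).
    joint = P_z_given_d[:, :, None] * P_w_given_z[None, :, :]
    P_z_given_dw = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the expected counts.
    expected = n_dw[:, None, :] * P_z_given_dw
    P_w_given_z = expected.sum(axis=0)
    P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)
    P_z_given_d = expected.sum(axis=2)
    P_z_given_d /= P_z_given_d.sum(axis=1, keepdims=True)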

In conclusion, Hofmann showed that pLSA, being built on standard statistical methods, models the data better than traditional LSA, and that pLSA can be applied to language modeling and collaborative filtering.

2.3.3 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA), a type of topic model, was proposed by Blei, Ng and Jordan [1] in 2003. The general idea of LDA is based on the hypothesis that a document may be generated as a mixture of latent topics; in other words, LDA is a generative statistical model. LDA is very similar to pLSA, but LDA places two Dirichlet priors on the model: one on the topic distributions of documents and one on the word distributions of topics. In practice, LDA requires less execution time than pLSA because pLSA has to estimate a large number of parameters. If a user gives a uniform Dirichlet prior to the LDA model, it becomes equivalent to the pLSA model. The model of LDA is shown in Fig. 6.

Figure 6: Model of LDA. [1]

In Fig. 6, the parameter α is the Dirichlet prior of the topic distributions of documents, and β is the Dirichlet prior of the word distributions of topics. K is the number of topics, M is the number of documents, and N is the number of words in a document i. θ is an M × K matrix holding the topic distribution of each document, and ϕ is a K × W matrix holding the word distribution of each topic, where W is the size of the vocabulary. z is the set of topic assignments of the words in document i, and w, the only observable variable, is the set of words in document i.

The following is the generative process of LDA, which is applied to each document in the corpus:

1. Choose N, the number of words in a document i.

2. Choose the topic distribution θ_i of the document, with the Dirichlet prior α.

3. Choose the word distribution ϕ_k of each topic k, with the Dirichlet prior β.

4. For each word j in the document i, choose a topic z_{i,j} from the multinomial distribution θ_i.

5. For each word j in the document i, choose a word w_{i,j} from the multinomial distribution ϕ_{z_{i,j}}.
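The following Python sketch simulates this generative process with an invented vocabulary and illustrative values for the hyper-parameters α and β; it only generates toy documents and performs no inference.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["run", "photo", "route", "filter", "share"]  # invented vocabulary
K, alpha, beta = 2, 0.5, 0.1                          # topics and Dirichlet priors

# Step 3: a word distribution phi_k for each topic, drawn from Dirichlet(beta).
phi = rng.dirichlet([beta] * len(vocab), size=K)

for i in range(3):                            # generate three toy documents
    N = rng.poisson(8) + 1                    # Step 1: number of words in document i
    theta = rng.dirichlet([alpha] * K)        # Step 2: topic distribution theta_i
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)            # Step 4: topic z_{i,j} ~ Multinomial(theta_i)
        w = rng.choice(len(vocab), p=phi[z])  # Step 5: word w_{i,j} ~ Multinomial(phi_{z_{i,j}})
        words.append(vocab[w])
    print(f"document {i}: {' '.join(words)}")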


Repeating this process until convergence yields a steady result model. Blei, Ng and Jordan [1] showed that the LDA model achieves better predictive performance than the mixture of unigrams model and the pLSA model.

2.3.4 Comparison of Topic Models

After investigating these topic model approaches, it is obvious that each approach has its unique strengths and inevitable weaknesses. In order to find the most suitable method for our research, we filter out the less suitable methods step by step through comparison.

Each point of comparison is discussed separately in the following paragraphs.

In terms of statistics, LDA and pLSA have strong enough statistical foundations to depict their models graphically, as in Fig. 5 and Fig. 6. Both are built on the concept of probability, and their advantage is that the topics extracted by these models are readily interpretable. LDA and pLSA thus facilitate the analysis of model output, unlike the uninterpretable directions produced by LSA.

In terms of efficiency, LSA has the lowest execution time of the three because it does not need to estimate model probabilities. LDA is the second most efficient method because it estimates the probabilities quickly with its Dirichlet hyper-parameters as priors on the topic distributions. In contrast, pLSA has to estimate all the probabilities in its equation, so it has the lowest efficiency among the three methods.

After this comparison, we first filter out LSA because of its lack of a statistical model. Second, we filter out pLSA because LDA executes more efficiently. As a result, we use LDA as our topic model method.
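As an illustration of how LDA could be applied to application descriptions, the following is a hedged Python sketch using the gensim library; the tokenized descriptions and the parameter values are invented and are not the actual settings used in AppReco.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Invented, already-tokenized application descriptions.
tokenized_descriptions = [
    ["track", "running", "distance", "calories"],
    ["edit", "photos", "filters", "stickers"],
    ["plan", "running", "routes", "share", "photos"],
]

dictionary = Dictionary(tokenized_descriptions)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_descriptions]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [word for word, _ in words])

In AppReco, an analogous modeling step is performed on the collected application descriptions, as described in the preceding sections.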
