
Summarizing the results and findings of this research, the adaptive-algorithm-based methods of this thesis leave room for further development in the following applications.

Document clustering: clustering differs conceptually from classification. Data are becoming increasingly diverse; in the Reuters-21578 document classification collection used in our experiments, for example, a document rarely belongs to just one category, yet similar documents necessarily share similar features. Grouping documents by concept through latent semantics should therefore benefit subsequent applications.

Relevance feedback: another prominent issue in document retrieval. Retrieval is driven by similarity measurement, but a high similarity score does not guarantee that a document fully satisfies the user's needs. Appropriate user feedback can adjust the retrieval results to better match user preferences. The aspect model adopted in this thesis should be able to incorporate such feedback and further improve retrieval results.

Language modeling: in recent years, language models have been widely studied for document retrieval and have notably improved retrieval performance. How to integrate language models, or even to exploit the query at retrieval time to build more natural, human-friendly retrieval, remains open; the word-to-latent-topic probabilistic representation of PLSA offers room for research in this direction.

In addition, in the experiments of this thesis, the value of K was determined mainly by evaluating perplexity, which is a rather heuristic approach. How to select a better K for a document model, and how to adjust K appropriately for different amounts of data, will therefore be among the important topics for future research.
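The perplexity-based selection of K described above can be sketched as follows. This is a hypothetical illustration in Python, not code from this thesis: `train_plsa` stands in for any PLSA trainer returning the topic-word and document-topic probability matrices, and the held-out counts are assumed to come from the same documents (e.g. a held-out split of each document's words).

```python
import numpy as np

def held_out_perplexity(test_counts, p_w_given_z, p_z_given_d):
    """Perplexity of a PLSA model on held-out counts n(d_i, w_j).

    test_counts: (N, M) count matrix; p_w_given_z: (K, M), rows P(w_j|z_k);
    p_z_given_d: (N, K), rows P(z_k|d_i).
    """
    # Mixture over topics: P(w_j|d_i) = sum_k P(z_k|d_i) P(w_j|z_k)
    p_w_given_d = p_z_given_d @ p_w_given_z              # (N, M)
    mask = test_counts > 0
    log_lik = np.sum(test_counts[mask] * np.log(p_w_given_d[mask]))
    return float(np.exp(-log_lik / test_counts.sum()))

def select_k(train_counts, heldout_counts, candidates, train_plsa):
    """Pick the K with the lowest held-out perplexity."""
    scores = {k: held_out_perplexity(heldout_counts, *train_plsa(train_counts, k))
              for k in candidates}
    return min(scores, key=scores.get), scores
```

A grid such as `candidates = [2, 4, 8, 16]` mirrors the heuristic used in the experiments; a more principled treatment of K remains the open question raised above.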


Appendix: Bayesian Learning for Latent Semantic Analysis

Jen-Tzung Chien, Meng-Sung Wu and Chia-Sheng Wu

Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan 70101, ROC

{chien,jackywu,cswu}@chien.csie.ncku.edu.tw

Abstract

Probabilistic latent semantic analysis (PLSA) is a popular approach to text modeling where the semantics and statistics in documents can be effectively captured. In this paper, a novel Bayesian PLSA framework is presented. We focus on exploiting the incremental learning algorithm for solving the updating problem of new domain articles. This algorithm is developed to improve text modeling by incrementally extracting the up-to-date latent semantic information to match the changing domains at run time. The expectation-maximization (EM) algorithm is applied to resolve the quasi-Bayes (QB) estimate of PLSA parameters. The online PLSA is constructed to accomplish parameter estimation as well as hyperparameter updating. Compared to standard PLSA using maximum likelihood estimate, the proposed QB approach is capable of performing dynamic document indexing and classification. Also, we present the maximum a posteriori PLSA for corrective training. Experiments on evaluating model perplexities and classification accuracies demonstrate the superiority of using Bayesian PLSA.

1. Introduction

Recently, latent semantic analysis (LSA) [6] has been effectively applied to n-gram language modeling [1][5]. The basic idea of LSA is to represent words or documents in a low-dimensional vector space consisting of common semantic factors. Using LSA, all words and documents are mapped to this common semantic space, which is constructed via the singular value decomposition (SVD) of a word-by-document matrix. More attractively, probabilistic LSA (PLSA) models the aspects in documents in a probabilistic way, with the document-word joint distributions expressed conditionally on latent mixtures or topics [10][11]. PLSA text modeling was shown to achieve low perplexity, and a topic-based language model built on PLSA was developed [9]. Furthermore, latent Dirichlet allocation [3] was proposed to address the issue that PLSA estimates aspect models only for documents appearing in the training set. In this study, we strive to develop a Bayesian PLSA aspect model for adaptive text modeling and classification. Our goal is to establish incremental learning and corrective training capabilities for PLSA. Adaptive PLSA modeling is fulfilled to recognize unknown documents with changing domains and topics.

The underlying concept of online PLSA is motivated by the principle of quasi-Bayes (QB) estimation, which has been successfully applied to speech recognition [4][12]. In place of adaptive hidden Markov model (HMM) training for speech recognition, we propose a QB estimate of the PLSA model in which the parameters are estimated by maximizing an approximate posterior distribution, or equivalently the product of the likelihood function of the currently observed documents and a prior density given the updated hyperparameters. The advantage of incremental learning is that model parameters are updated continuously without waiting for a long history of batch documents, so computation and memory requirements can be substantially reduced. After simplification, maximum a posteriori (MAP) PLSA is realized for batch model adaptation.

The expectation-maximization (EM) algorithm [8] is adopted to resolve the missing-data, or latent-variable, problem in Bayesian parameter estimation. In the experiments, we evaluate perplexity and document classification using Bayesian PLSA. The performance improves consistently as the number of adaptation documents increases.

2. Probabilistic Latent Semantic Analysis

PLSA is a general machine learning technique which adopts the aspect model to represent co-occurrence data associated with a topic or hidden variable $z_k \in Z = \{z_1, \ldots, z_K\}$. Let the text corpus $Y$ consist of document-word pairs $(d_i, w_j)$ collected from $N$ documents $d_i \in \{d_1, \ldots, d_N\}$ with a vocabulary of $M$ words $w_j \in \{w_1, \ldots, w_M\}$. The joint probability of an observed pair $(d_i, w_j)$ is generated in the asymmetric parameterization form [11]

$$P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j|z_k)\, P(z_k|d_i), \quad (1)$$

assuming that $d_i$ and $w_j$ are conditionally independent given the associated topic $z_k$. We can accumulate the log likelihood of the overall training data $Y = \{d_i, w_j\}$ as

$$\log P(Y|\theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j), \quad (2)$$

where $n(d_i, w_j)$ is the count of word $w_j$ occurring in document $d_i$ and $\theta = \{P(w_j|z_k), P(z_k|d_i)\}$ is the PLSA parameter set. According to maximum likelihood (ML) estimation, the PLSA parameters are obtained by maximizing the accumulated log likelihood

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \log P(Y|\theta). \quad (3)$$

Due to the latent variable $z_k$ appearing in PLSA, we should apply the EM algorithm to solve the ML parameter estimation. In the E-step, we calculate an expectation function of the new estimate $\hat{\theta} = \{\hat{P}(w_j|z_k), \hat{P}(z_k|d_i)\}$ over the latent variable $z_k$:

$$Q(\hat{\theta}|\theta) = E_Z[\log P(Y, Z|\hat{\theta}) \mid Y, \theta] = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k|d_i, w_j) \log\!\big[\hat{P}(w_j|z_k)\hat{P}(z_k|d_i)\big]. \quad (4)$$

The posterior probability using the current estimate $\theta = \{P(w_j|z_k), P(z_k|d_i)\}$ is expressed by

$$P(z_k|d_i, w_j) = \frac{P(w_j|z_k)\, P(z_k|d_i)}{\sum_{l=1}^{K} P(w_j|z_l)\, P(z_l|d_i)}. \quad (5)$$

In the M-step, we maximize $Q(\hat{\theta}|\theta)$ with respect to $\hat{\theta}$ and find the new ML estimate $\hat{\theta}_{\mathrm{ML}}$ given by [11]

$$\hat{P}_{\mathrm{ML}}(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\, P(z_k|d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m)\, P(z_k|d_i, w_m)}, \quad (6)$$

$$\hat{P}_{\mathrm{ML}}(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j)\, P(z_k|d_i, w_j)}{\sum_{l=1}^{K} \sum_{j=1}^{M} n(d_i, w_j)\, P(z_l|d_i, w_j)}. \quad (7)$$
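The ML EM recursion of Eqs. (4)-(7) can be written compactly in NumPy. The following is a minimal sketch under our own naming, not the authors' implementation: `counts` holds the matrix n(d_i, w_j) with N rows and M columns.

```python
import numpy as np

def plsa_em(counts, K, n_iter=50, seed=0):
    """ML training of PLSA by EM (Eqs. (4)-(7))."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    # Random normalized initialization of P(w_j|z_k) and P(z_k|d_i)
    p_wz = rng.random((K, M)); p_wz /= p_wz.sum(axis=1, keepdims=True)
    p_zd = rng.random((N, K)); p_zd /= p_zd.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step, Eq. (5): P(z_k|d_i,w_j) proportional to P(w_j|z_k) P(z_k|d_i)
        post = p_zd[:, :, None] * p_wz[None, :, :]           # (N, K, M)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # Weighted counts n(d_i,w_j) P(z_k|d_i,w_j)
        stat = counts[:, None, :] * post                     # (N, K, M)
        # M-step, Eq. (6): re-estimate P(w_j|z_k)
        p_wz = stat.sum(axis=0)
        p_wz /= p_wz.sum(axis=1, keepdims=True) + 1e-12
        # M-step, Eq. (7): re-estimate P(z_k|d_i)
        p_zd = stat.sum(axis=2)
        p_zd /= p_zd.sum(axis=1, keepdims=True) + 1e-12
    return p_wz, p_zd
```

The dense (N, K, M) posterior tensor is fine for illustration; a real corpus would use the sparsity of `counts`.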

3. Bayesian Learning for PLSA Modeling

Practically, either the SVD in LSA or the ML parameters (6)-(7) in PLSA should be adapted to deal with the out-of-vocabulary or out-of-domain problem in an information system. We need to trace domain knowledge from new data collections so as to update the LSA or PLSA model; the system robustness can then be enhanced to handle changing document collections. Using LSA, the updating problems can be resolved via folding-in, SVD recomputation or SVD updating [2]. In what follows, we present solutions to the updating issues of the PLSA model via Bayesian learning of model parameters.

3.1. PLSA Updating

Although updating problems have been investigated for the SVD-based LSA framework, similar approaches cannot be applied to the ML-based PLSA framework. To merge up-to-date knowledge into an existing PLSA model, we focus on building adaptive PLSA for document modeling and classification.

Our goal is to use the newly collected documents, called the adaptation data X, to adapt an existing PLSA model to fit the domains of new documents or queries. Importantly, we apply Bayesian theory to develop two new adaptation paradigms for PLSA: (1) corrective training and (2) incremental learning. In [4][12], Bayesian learning has been explored to adjust existing speech HMMs to a new speaker for adaptive speech recognition. Herein, we present maximum a posteriori (MAP) PLSA and quasi-Bayes (QB) PLSA to achieve corrective training and incremental learning for adaptive text mining, respectively.

3.2. MAP Estimation for Corrective Training

According to MAP estimation, the PLSA parameters $\theta$ are estimated by maximizing the posterior probability $P(\theta|X)$, or correspondingly the sum of the logarithms of the likelihood function $P(X|\theta)$ and a prior density $g(\theta)$:

$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta|X) = \arg\max_{\theta} \big[\log P(X|\theta) + \log g(\theta)\big]. \quad (8)$$

The prior density represents the randomness of the probability parameters $\theta = \{P(w_j|z_k), P(z_k|d_i)\}$. Basically, the definition of the prior distribution plays a crucial role in Bayesian learning. As suggested in [7], a conjugate prior is a good choice for Bayesian inference. We will show later two attractive properties of using a conjugate prior: 1) a closed-form solution for rapid learning, and 2) a reproducible prior/posterior pair for incremental learning. The Dirichlet density is known as the conjugate prior for probability parameters, i.e. for multinomial observations [7].

Assuming the variables $P(w_j|z_k)$ and $P(z_k|d_i)$ are independent, the prior density of the overall parameters is expressed by

$$g(\theta) \propto \prod_{k=1}^{K} \left[ \prod_{j=1}^{M} P(w_j|z_k)^{\alpha_{j,k}-1} \prod_{i=1}^{N} P(z_k|d_i)^{\beta_{k,i}-1} \right], \quad (9)$$

where $\varphi = \{\alpha_{j,k}, \beta_{k,i}\}$ are the hyperparameters of the Dirichlet densities. Again, we apply the EM algorithm to iteratively calculate the posterior expectation function $R(\hat{\theta}|\theta)$ (E-step) and maximize it with respect to $\hat{\theta}$ so as to find the new MAP estimate $\hat{\theta}_{\mathrm{MAP}}$ (M-step). Because the optimization is performed subject to the constraints $\sum_{j=1}^{M} \hat{P}(w_j|z_k) = 1$ and $\sum_{k=1}^{K} \hat{P}(z_k|d_i) = 1$, we follow the Lagrange optimization procedure and form the modified expectation function

$$\tilde{R}(\hat{\theta}|\theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k|d_i, w_j) \log\!\big[\hat{P}(w_j|z_k)\hat{P}(z_k|d_i)\big] + \sum_{j=1}^{M} \sum_{k=1}^{K} (\alpha_{j,k}-1) \log \hat{P}(w_j|z_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} (\beta_{k,i}-1) \log \hat{P}(z_k|d_i) + \eta_w \Big(1 - \sum_{j=1}^{M} \hat{P}(w_j|z_k)\Big) + \eta_d \Big(1 - \sum_{k=1}^{K} \hat{P}(z_k|d_i)\Big), \quad (10)$$

where $\eta_d$ and $\eta_w$ are two Lagrange multipliers. We then differentiate $\tilde{R}(\hat{\theta}|\theta)$ with respect to $\hat{P}(w_j|z_k)$ and $\hat{P}(z_k|d_i)$ to obtain the new MAP estimates in closed form:

$$\hat{P}_{\mathrm{MAP}}(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\, P(z_k|d_i, w_j) + (\alpha_{j,k}-1)}{\sum_{m=1}^{M} \big[\sum_{i=1}^{N} n(d_i, w_m)\, P(z_k|d_i, w_m) + (\alpha_{m,k}-1)\big]}, \quad (11)$$

$$\hat{P}_{\mathrm{MAP}}(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j)\, P(z_k|d_i, w_j) + (\beta_{k,i}-1)}{\sum_{l=1}^{K} \big[\sum_{j=1}^{M} n(d_i, w_j)\, P(z_l|d_i, w_j) + (\beta_{l,i}-1)\big]}. \quad (12)$$

This MAP PLSA algorithm is designed for corrective training or batch adaptation, which adapts the existing PLSA parameters $\theta$ to $\theta_{\mathrm{MAP}}$ using the newly collected documents $X$ in batch mode. The new parameters $\theta_{\mathrm{MAP}}$ should outperform $\theta$ when classifying future documents with new topics and terms. Owing to the closed-form solution, no descent algorithm is needed to find the optimal parameters, so rapid adaptation can be achieved.
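Given the E-step statistics, the MAP M-step of Eqs. (11)-(12) only adds the Dirichlet pseudo-counts before normalizing. A minimal sketch, with our own naming: `stat[i, k, j]` is assumed precomputed as n(d_i, w_j) P(z_k | d_i, w_j) via Eq. (5).

```python
import numpy as np

def map_m_step(stat, alpha, beta):
    """MAP re-estimation (Eqs. (11)-(12)).

    stat: (N, K, M) E-step statistics n(d_i,w_j) P(z_k|d_i,w_j);
    alpha: (K, M) Dirichlet hyperparameters alpha_{j,k};
    beta:  (N, K) Dirichlet hyperparameters beta_{k,i}.
    """
    # Eq. (11): add (alpha_{j,k} - 1) to the topic-word statistics
    num_wz = stat.sum(axis=0) + (alpha - 1.0)                # (K, M)
    p_wz = num_wz / num_wz.sum(axis=1, keepdims=True)
    # Eq. (12): add (beta_{k,i} - 1) to the document-topic statistics
    num_zd = stat.sum(axis=2) + (beta - 1.0)                 # (N, K)
    p_zd = num_zd / num_zd.sum(axis=1, keepdims=True)
    return p_wz, p_zd
```

With all hyperparameters equal to 1 (a flat prior), the update reduces exactly to the ML estimates of Eqs. (6)-(7).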

3.3. QB Estimation for Incremental Learning

However, in an online information system, we need to continuously update the system parameters with new words and topics so that the system remains robust for pattern recognition at different epochs. Because the system is updated at each epoch, out-of-date words or documents are gradually faded away, or downdated, from the system parameters. MAP PLSA provides no such updating mechanism for online adaptation. Hereafter, we present quasi-Bayes (QB) estimation to fulfill PLSA incremental learning. In general, at the $n$th epoch, we maximize the posterior probability given the sequence of document blocks $\chi^n = \{X_1, \ldots, X_n\}$ through [4][12]

$$\theta_{\mathrm{QB}}^{(n)} = \arg\max_{\theta} P(\theta|\chi^n) = \arg\max_{\theta} P(X_n|\theta)\, P(\theta|\chi^{n-1}) \approx \arg\max_{\theta} P(X_n|\theta)\, g(\theta|\varphi^{(n-1)}), \quad (13)$$

where the posterior density $P(\theta|\chi^{n-1})$ is approximated by the closest tractable prior density $g(\theta|\varphi^{(n-1)})$ with hyperparameters $\varphi^{(n-1)}$ evolved from the historical documents $\chi^{n-1}$. Attractively, QB estimation provides a recursive learning mechanism for the PLSA parameters $\theta^{(1)}, \ldots, \theta^{(n)}$ from the incrementally observed document blocks $X_1, \ldots, X_n$. At each epoch, we use the current block of documents $X_n$ and the accumulated statistics $\varphi^{(n-1)}$ to update the PLSA parameters to $\theta_{\mathrm{QB}}^{(n)}$. After updating the hyperparameters $\varphi^{(n-1)} \rightarrow \varphi^{(n)}$, the current documents $X_n = \{d_i^{(n)}, w_j^{(n)}\}$ are discarded without storage. Compared to MAP PLSA, the key difference in QB PLSA is the updating of the hyperparameters: substituting $\varphi^{(n-1)} = \{\alpha_{j,k}^{(n-1)}, \beta_{k,i}^{(n-1)}\}$ into (11) and (12) yields the corresponding QB estimate $\theta_{\mathrm{QB}}^{(n)} = \{P_{\mathrm{QB}}(w_j^{(n)}|z_k), P_{\mathrm{QB}}(z_k|d_i^{(n)})\}$. Because the EM algorithm is employed, the updating of the hyperparameters is derived in the E-step of QB estimation.

By introducing the latent variables $Z = \{z_k\}$, the expectation of the logarithm of the posterior distribution, $R(\hat{\theta}^{(n)}|\theta^{(n)})$, is expanded. After careful arrangement, the exponential of the posterior expectation function can be expressed as a new Dirichlet distribution

$$\exp\{R(\hat{\theta}^{(n)}|\theta^{(n)})\} \propto \prod_{k=1}^{K} \left[ \prod_{j=1}^{M} \hat{P}(w_j^{(n)}|z_k)^{\alpha_{j,k}^{(n)}-1} \prod_{i=1}^{N} \hat{P}(z_k|d_i^{(n)})^{\beta_{k,i}^{(n)}-1} \right], \quad (14)$$

with new hyperparameters $\varphi^{(n)} = \{\alpha_{j,k}^{(n)}, \beta_{k,i}^{(n)}\}$ given by

$$\alpha_{j,k}^{(n)} = \sum_{i=1}^{N} n(d_i^{(n)}, w_j^{(n)})\, P(z_k|d_i^{(n)}, w_j^{(n)}) + \alpha_{j,k}^{(n-1)}, \quad (15)$$

$$\beta_{k,i}^{(n)} = \sum_{j=1}^{M} n(d_i^{(n)}, w_j^{(n)})\, P(z_k|d_i^{(n)}, w_j^{(n)}) + \beta_{k,i}^{(n-1)}. \quad (16)$$

Interestingly, a reproducible prior/posterior pair is generated, which builds the updating mechanism of the hyperparameters. In (15) and (16), the new hyperparameters $\alpha_{j,k}^{(n)}$ and $\beta_{k,i}^{(n)}$ are obtained by interpolating the previous hyperparameters $\alpha_{j,k}^{(n-1)}$ and $\beta_{k,i}^{(n-1)}$ with the accumulated statistics of the latent variable $z_k$ related to word $w_j^{(n)}$ and document $d_i^{(n)}$, respectively.
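The hyperparameter recursion of Eqs. (15)-(16) can be sketched as one function per epoch. This is our own simplified illustration: it assumes the document index set is fixed across epochs, which the paper does not require, and reuses the E-step of Eq. (5) on the current block.

```python
import numpy as np

def qb_update(counts_n, p_wz, p_zd, alpha_prev, beta_prev):
    """One QB epoch: fold the E-step statistics of the current document
    block X_n into the Dirichlet hyperparameters (Eqs. (15)-(16)).

    counts_n: (N, M) counts of the current block; p_wz: (K, M); p_zd: (N, K);
    alpha_prev: (K, M); beta_prev: (N, K).
    """
    # E-step posterior on the new block, Eq. (5)
    post = p_zd[:, :, None] * p_wz[None, :, :]               # (N, K, M)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    stat = counts_n[:, None, :] * post                       # (N, K, M)
    # Eq. (15): accumulate topic-word statistics into alpha
    alpha_new = stat.sum(axis=0) + alpha_prev                # (K, M)
    # Eq. (16): accumulate document-topic statistics into beta
    beta_new = stat.sum(axis=2) + beta_prev                  # (N, K)
    return alpha_new, beta_new
```

After the update, the block `counts_n` can be discarded; only the hyperparameters carry the history forward, which is exactly the memory saving claimed for QB over MAP.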

3.4. Estimation of Initial Hyperparameters

For either MAP PLSA or QB PLSA, it is critical to estimate the initial hyperparameters $\varphi^{(0)} = \{\alpha_{j,k}^{(0)}, \beta_{k,i}^{(0)}\}$. Typically, the hyperparameters can be estimated from training data in an empirical Bayes sense. Nevertheless, the estimation of initial hyperparameters remains an open issue in Bayesian learning.

Without loss of generality, we herein adopt the estimation of initial hyperparameters for the Dirichlet density introduced by Huo and Lee [12]:

$$\alpha_{j,k}^{(0)} = 1 + \sum_{i=1}^{N} P(z_k|d_i, w_j), \quad (17)$$

$$\beta_{k,i}^{(0)} = 1 + \sum_{j=1}^{M} P(z_k|d_i, w_j). \quad (18)$$
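Eqs. (17)-(18) amount to adding one to the summed training-set posteriors P(z_k | d_i, w_j). A minimal sketch under our own naming, taking the trained ML parameter matrices as input:

```python
import numpy as np

def init_hyperparameters(p_wz, p_zd):
    """Initial Dirichlet hyperparameters (Eqs. (17)-(18)).

    p_wz: (K, M) rows P(w_j|z_k); p_zd: (N, K) rows P(z_k|d_i).
    """
    # Posterior P(z_k|d_i,w_j) over the training set, Eq. (5)
    post = p_zd[:, :, None] * p_wz[None, :, :]               # (N, K, M)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    alpha0 = 1.0 + post.sum(axis=0)                          # (K, M), Eq. (17)
    beta0 = 1.0 + post.sum(axis=2)                           # (N, K), Eq. (18)
    return alpha0, beta0
```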

4. Experiments

In the experiments, we evaluate the performance of the proposed methods using two public-domain document collections: the MED corpus [10] and Reuters-21578 [3].

Before PLSA modeling, we performed preprocessing stages of stemming and stop word removal for all documents.

4.1. Evaluation of Model Perplexity

When evaluating perplexity, we used the MED corpus, which contains 1,033 medical abstracts with 30 queries and 7,014 unique terms collected from the National Library of Medicine. Without loss of generality, we adopted 433 abstracts for ML PLSA model training and the remaining 600 abstracts for MAP corrective training or QB incremental learning. A query subset served as the test data for evaluating model perplexity. To examine the effect of different numbers of adaptation documents (N), we calculated perplexities for N = 200, 400 and 600. For N = 600, MAP PLSA performed one learning epoch using all adaptation data, while QB PLSA performed three learning epochs with 200 adaptation documents at each epoch. The number of latent variables was fixed at K = 8. Without model adaptation, the baseline perplexity is 146.7. As shown in Figure 1, both MAP PLSA and QB PLSA improve on the baseline greatly, and perplexities are consistently reduced as the number of adaptation documents increases. QB PLSA achieves lower perplexities than MAP PLSA; for N = 600, QB PLSA attained a perplexity of 53.9, better than the 63 of MAP PLSA. In Figure 2, we also compare the computation times (in seconds) of MAP PLSA and QB PLSA for different numbers of adaptation documents, measured on a personal computer with a Pentium IV 2.4 GHz CPU and 1 GB RAM. The computation cost of MAP PLSA grows with the amount of adaptation data, whereas QB PLSA computes parameters using only the adaptation data of the current learning epoch and is therefore efficient at each epoch.

4.2. Experimental Results on Text Classification

We followed the ModApte split of the Reuters-21578 data set, in which 7,195 documents were used for training or adaptation and 2,790 documents were used for classification. We further partitioned the 7,195 documents into 4,270 documents for training and the remaining 2,925 documents for QB incremental adaptation. The corpus contained 13,353 unique words. Text classification was performed on the ten most populous categories. As shown in Figure 3, we report classification accuracies of QB PLSA for N = 975, 1950 and 2925; N = 0 denotes the baseline system. Numbers of latent variables K = 64 and K = 128 are investigated. We find that QB adaptation consistently improves performance as the number of adaptation documents increases. In this case, K = 64 is sufficient for adaptive modeling and performs better than K = 128.

Figure 1: Perplexities for different numbers of adaptation data.

Figure 2: Computation times (sec) using MAP and QB PLSA.

Figure 3: Classification accuracy (%) for different numbers of adaptation data and latent variables.

5. Conclusions

This paper presented an adaptive text modeling and classification approach for PLSA-based information systems. We performed batch adaptation and incremental adaptation through the MAP PLSA and QB PLSA algorithms, respectively, without the need to recompute over, or store, the whole adaptation data. We highlighted the contributions of incremental learning, where new domain knowledge can be continuously merged in while out-of-date words and documents fade away. In this manner, we solved not only the updating problem but also the downdating problem. The experiments on model perplexity and classification accuracy illustrated the performance of MAP PLSA and QB PLSA. In the future, we will develop Bayesian learning for other text modeling approaches. Extensions of PLSA to bigrams and trigrams will be explored. Further work is also required to apply the approach to spoken document classification and retrieval.

6. References

[1] J. R. Bellegarda, "Exploiting latent semantic information in statistical language modeling", Proceedings of the IEEE, vol. 88, no. 8, pp. 1279-1296, 2000.

[2] M. W. Berry, S. T. Dumais and G. W. O'Brien, "Using linear algebra for intelligent information retrieval", SIAM Review, vol. 37, no. 4, pp. 573-595, 1995.

[3] D. M. Blei, A. Y. Ng and M. I. Jordan, "Latent Dirichlet allocation", Journal of Machine Learning Research, vol. 3, no. 5, pp. 993-1022, 2003.

[4] J. T. Chien, "Online hierarchical transformation of hidden Markov models for speech recognition", IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, pp. 656-667, 1999.

[5] J.-T. Chien, M.-S. Wu and H.-J. Peng, "On latent semantic language modeling and smoothing", Proceedings of International Conference on Spoken Language Processing, vol. 2, pp. 1373-1376, 2004.

[6] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.

[7] M. H. DeGroot, Optimal Statistical Decisions, McGraw-Hill, 1970.

[8] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

[9] D. Gildea and T. Hofmann, "Topic based language models using EM", Proceedings of 6th European Conference on Speech Communication and Technology, pp. 2167-2170, 1999.

[10] T. Hofmann, "Probabilistic latent semantic indexing", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57, 1999.

[11] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis", Machine Learning, vol. 42, no. 1, pp. 177-196, 2001.

[12] Q. Huo and C.-H. Lee, "On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate", IEEE Transactions on Speech and Audio Processing, vol. 5, pp. 161-172, 1997.
