
2013 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 22–25, 2013, SOUTHAMPTON, UK

HIERARCHICAL THEME AND TOPIC MODEL FOR SUMMARIZATION

Jen-Tzung Chien and Ying-Lan Chang

Department of Electrical and Computer Engineering

National Chiao Tung University, Hsinchu, Taiwan 30010, ROC

jtchien@nctu.edu.tw

ABSTRACT

This paper presents a hierarchical summarization model to extract representative sentences from a set of documents. In this study, we select the thematic sentences and identify the topical words based on a hierarchical theme and topic model (H2TM). The latent themes and topics are inferred from the document collection. A tree stick-breaking process is proposed to draw the theme proportions for representation of sentences. The structural learning is performed without fixing the number of themes and topics. This H2TM is delicate and flexible in representing words and sentences from heterogeneous documents. Thematic sentences are effectively extracted for document summarization. In the experiments, the proposed H2TM outperforms the other methods in terms of precision, recall and F-measure.

Index Terms— Topic model, structural learning, Bayesian nonparametrics, document summarization

1. INTRODUCTION

As the internet grows rapidly, online web documents have become too redundant to browse and search efficiently. Automatic summarization becomes crucial for browsers to capture themes and concepts in a short time. Basically, there are two kinds of solutions to summarization. Abstraction rewrites a summary for a document, while extraction extracts the representative sentences as a summary. Abstraction is usually difficult and arduous, so we mostly focus on extraction. However, a good summary system should reflect the diverse topics of documents and keep redundancy to a minimum, whereas extraction easily leads to a summary with overly coherent topics.

In the literature, unsupervised learning via probabilistic topic models [1] has been popular for document categorization [2], speech recognition [3], text segmentation [4], and image analysis [1]. The latent semantic topics are learnt from a bag of words. Such a topic model can capture the salient themes embedded in a data collection and work for document summarization [5]. However, the topic model based on latent Dirichlet allocation (LDA) [2] was constructed as a finite-dimensional mixture representation which assumed that 1) the number of topics was fixed, and 2) the topics were independent. The hierarchical Dirichlet process (HDP) [6] and the nested Chinese restaurant process (nCRP) [7][8] were proposed to conduct structural learning and relax these two assumptions.

HDP [6] is a Bayesian nonparametric extension of LDA where the representation of documents is allowed to grow structurally as more data are observed. Each word token within a document is drawn from a mixture model where the hidden topics are shared across documents. A Dirichlet process (DP) is realized to find flexible data partitions or to provide the nonparametric prior over the number of topics for each document. The base measure for the child Dirichlet processes (DPs) is itself drawn from a parent DP. On the other hand, nCRP [7][8] explores topic hierarchies with flexible extension to infinite branches and infinite layers. Each document selects a tree path with nodes containing topics in different sharing conditions. All words in the document are represented by using these topics.

In this study, we develop a hierarchical tree model for representation of sentences from heterogeneous documents. Using this model, each path from the root node to a leaf node covers themes ranging from general to individual. These themes contain coherent information but in varying degrees of sharing. The brother nodes expand the diversity of themes from different sentences. This model does not only group sentences into a node in terms of its theme, but also distinguishes their concepts by means of different levels. A structural stick-breaking process is proposed to draw a subtree path and determine a variety of theme proportions. We conduct the task of multi-document summarization where the sentences are selected across documents with a diversity of themes and concepts. The number of latent components and the dependencies between these components are flexibly learnt from the collected data. Further, the words of the sentences inside a node are represented by a topic model which is drawn by a DP. All the topics from different nodes are shared under a global DP. We propose a Bayesian nonparametric approach to structural learning of latent topics and themes from the observed words and sentences, respectively. This approach is applied for concept-based summarization over multiple text documents.


2. BAYESIAN NONPARAMETRIC LEARNING

2.1. HDP and nCRP

There have been many Bayesian nonparametric approaches developed for discovering a countably infinite number of latent features in a variety of real-world data. Bayesian inference is performed by integrating out the infinitely many parameters. HDP [6] conducts Bayesian nonparametric representation of documents or grouped data where each document or group is associated with a mixture model. Words in different documents share a global mixture model. Using HDP, each document d is associated with a draw from a DP G_d, which determines how much each member of a shared set of mixture components contributes to that document. The base measure of G_d is itself drawn from a global DP G_0 which ensures that there is a set of mixtures shared across data. Each distribution G_d governs the generation of words for a document d. The strength parameter α_0 determines the proportion of a mixture in a document d. The document distribution G_d is generated by G_0 ∼ DP(γ, H) and G_d ∼ DP(α_0, G_0), where {γ, α_0} and H denote the strength parameters and the base measure, respectively. HDP is developed to represent a bag of words from a set of documents through the nonparametric prior G_0. In [7][8], the nCRP was proposed to conduct Bayesian nonparametric inference of topic hierarchies and learn deeply branching trees from a data collection. Using this hierarchical LDA (hLDA), each document was modeled by a path of topics along a random tree where the hierarchically-correlated topics, from global topics to specific topics, were extracted. In general, HDP and nCRP can be implemented by using the stick-breaking process and the Chinese restaurant process. Approximate inference algorithms via Markov chain Monte Carlo (MCMC) [6][7][8] and variational Bayes [9][10] were developed.

2.2. Stick-Breaking Process

The stick-breaking process is designed to implement an infinite mixture model according to a DP. A beta distribution is introduced to draw the variables that break the stick into a left segment and a right segment. A random probability measure G is first drawn from a DP with base measure H using a sequence of beta variates. Using this process, a stick of unit length is partitioned at a random location. The left segment is denoted by θ_1. The right segment is further partitioned at a new location. The partitioned left segment is denoted by θ_2. We continue this process by generating the left segment θ_i and breaking the right segment at each step i. Stick-breaking depends on a random value drawn from H which is seen as the center of the probability measure. The distribution over the sequence of proportions {θ_1, · · · , θ_i} is called the GEM distribution, which provides a distribution over infinite partitions of the unit interval [11]. In [12], a tree stick-breaking process was proposed to infer a tree structure. This method interleaves two stick-breaking procedures. The first has beta variates for depth which determine the size of a given node's partition as a function of depth. The second has beta variates for branch which determine the branching probabilities. Interleaving the two procedures partitions the unit interval into a tree structure.
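For illustration, here is a minimal Python sketch (not from the paper; the function name is hypothetical) of truncated GEM stick-breaking: beta variates successively break off fractions of the remaining stick, yielding a sequence of proportions over the unit interval.

    import numpy as np

    def gem_stick_breaking(alpha, truncation, rng=None):
        """Truncated draw of GEM(alpha) proportions by stick-breaking."""
        rng = np.random.default_rng(rng)
        remaining = 1.0
        theta = []
        for _ in range(truncation):
            nu = rng.beta(1.0, alpha)      # fraction broken off at this step
            theta.append(remaining * nu)   # left segment kept as a proportion
            remaining *= 1.0 - nu          # right segment is broken next
        return np.array(theta)

    theta = gem_stick_breaking(alpha=2.0, truncation=20, rng=0)
    print(theta.round(3), theta.sum())     # proportions over the unit interval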

3. HIERARCHICAL THEME AND TOPIC MODEL

In this study, a hierarchical theme and topic model (H2TM) is proposed for representation of sentences and words from a collection of documents based on Bayesian nonparametrics.

Fig. 1. A tree structure for representation of words, sentences and documents. Thick arrows denote the tree paths c_d drawn for eight sentences of a document d. Dark rectangle, diamonds and circles denote the observed document, sentences and words, respectively. Each sentence s_j is assigned a theme variable ψ_l at a tree node along the tree paths with probability θ_dl, while each word w_i in a tree node is assigned a topic variable φ_k with probability π_lk.

3.1. Model Description

H2TM is constructed by considering the structure of a document where each document consists of a "bag of sentences" and each sentence consists of a "bag of words". Different from the infinite topic model using HDP [6] and the hierarchical topic model using nCRP [7][8], we propose a new tree model for representation of a "bag of sentences" where each sentence has a variable number of words. A two-stage procedure is developed for document representation as illustrated in Figure 1. In the first stage, each sentence s_j of a document is drawn from a mixture model of themes where the themes are shared for all sentences from a collection of documents. The theme model of a document d is composed of the themes along its corresponding tree paths c_d. With a tree structure of themes, the unsupervised grouping of sentences into different layers is constructed. In the second stage, each word w_i of the sentences allocated in a tree node is drawn from an individual mixture model of topics. All topics from different nodes are drawn using a global topic model.

Using H2TM, we assume that the words of the sentences in a tree node given topic k are conditionally independent and drawn from a topic model with infinite topics {φ_k}, k = 1, 2, · · · . The sentences in a document given theme l are conditionally independent and drawn from a theme model with infinite themes {ψ_l}, l = 1, 2, · · · . The document-dependent theme proportions {θ_dl} and the theme-dependent topic proportions {π_lk} are introduced. Given these proportions, each word w_i is drawn from a mixture model of topics Σ_k π_lk · φ_k, while each sentence s_j is sampled from a mixture model of themes Σ_l θ_dl · ψ_l. Since a theme for sentences is represented by a mixture model of topics for words, we accordingly bridge the relation between themes and topics via ψ_l ∼ Σ_k π_lk · φ_k.

3.2. Sentence-Based nCRP

In this study, a sentence-based tree model with infinite nodes and branches is estimated to conduct unsupervised structural learning and select semantically-rich sentences for document summarization. A sentence-based nCRP (snCRP) is proposed to construct a tree model where the root node contains a general theme and the leaf nodes convey specific themes for sentences. Different from the previous word-based nCRP [7][8], where topics along a single tree path are selected to represent all words of a document, the snCRP is exploited to represent all sentences of a document based on themes from multiple tree paths, or equivalently from a subtree path. This is because the variation of themes does exist in heterogeneous documents. The conventional word-based nCRP using the GEM distribution is therefore extended to the snCRP using a tree-based GEM (treeGEM) distribution by considering multiple paths for document representation. A tree stick-breaking process is proposed to draw a subtree path and determine the theme proportions for representation of all sentences in a document.

A new scenario is described as follows. There are an infinite number of Chinese restaurants in a city. Each restaurant has infinite tables. A tourist visits the first (root) restaurant where each of its tables has a card showing the next restaurant, which is arranged in the second layer of this tree. Such visits repeat infinitely. Each restaurant is associated with a tree layer, and each table has its unique label. The restaurants in a city are thus organized into an infinitely-branched and infinitely-deep tree structure. Model construction for H2TM is summarized as follows (a code sketch of this generative process appears after the list):

1. For each theme l
   (a) Draw a topic model φ_k ∼ G_0.
   (b) Draw topic proportions π_l | {α_0, λ_0} ∼ DP(α_0, λ_0).
   (c) The theme model is generated by ψ_l ∼ Σ_k π_lk · φ_k.

2. For each document d ∈ {1, · · · , D}
   (a) Draw a subtree path c_d = {c_dj} ∼ snCRP(γ_s).
   (b) Draw theme proportions over path c_d by tree stick-breaking θ_d | {α_s, λ_s} ∼ treeGEM(α_s, λ_s).
   (c) For each sentence s_j
       i. Choose a theme label z_sj = l | θ_d ∼ Mult(θ_d).
       ii. For each word w_i
           A. Choose a topic label based on the topic proportion of theme l, i.e. z_wi = k | π_l ∼ Mult(π_l).
           B. Draw a word based on topic z_wi by w_i | {z_wi, φ_k} ∼ Mult(φ_zwi).
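The sketch below, written for illustration only and not taken from the paper, walks through this generative process in a heavily truncated form: the infinite tree, DP and treeGEM draws are replaced by a small fixed tree and finite Dirichlet/beta approximations, and all sizes and names (V, K, L, children, tree_theme_proportions, generate_document) are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    # Truncated setup: the model is nonparametric; here all sizes are fixed.
    V, K, L = 50, 8, 7          # vocabulary size, topics, theme nodes
    alpha0 = 1.0                # DP strength for topic proportions (illustrative)

    # Global topics phi_k ~ G0 (symmetric Dirichlet over the vocabulary).
    phi = rng.dirichlet(np.full(V, 0.1), size=K)

    # Theme-dependent topic proportions pi_l ~ DP(alpha0, lambda0), approximated
    # by a finite Dirichlet centred on shared global topic weights lambda0.
    lam0 = rng.dirichlet(np.ones(K))
    pi = rng.dirichlet(alpha0 * lam0 + 1e-6, size=L)

    # A tiny fixed theme tree: node 0 is the root, nodes 1-2 its children, 3-6 leaves.
    children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}

    def tree_theme_proportions(alpha_s, lam_s):
        """Truncated sketch of treeGEM: each node keeps a beta fraction of the
        mass handed down by its parent; the rest is split among its children."""
        theta = np.zeros(L)
        remaining = {0: 1.0}
        for node in range(L):
            mass = remaining.get(node, 0.0)
            nu = rng.beta(alpha_s * lam_s, alpha_s * (1.0 - lam_s))
            theta[node] = mass * nu
            kids = children[node]
            if kids:
                splits = rng.dirichlet(np.ones(len(kids)))
                for kid, frac in zip(kids, splits):
                    remaining[kid] = mass * (1.0 - nu) * frac
        return theta / theta.sum()   # renormalize the truncated proportions

    def generate_document(num_sentences=5, words_per_sentence=8):
        theta_d = tree_theme_proportions(alpha_s=10.0, lam_s=0.35)
        doc = []
        for _ in range(num_sentences):
            l = rng.choice(L, p=theta_d)                          # theme per sentence
            z = rng.choice(K, size=words_per_sentence, p=pi[l])   # topic per word
            words = [int(rng.choice(V, p=phi[k])) for k in z]
            doc.append((int(l), words))
        return doc

    print(generate_document())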

The hierarchical grouping of sentences is accordingly obtained through a nonparametric tree model based on the snCRP. Each tree node stands for a theme. A sentence s_j is determined by a theme model ψ_l. In what follows, we address how the proportions θ_d of themes z_sj = l are drawn for representation of the words w_i of sentences s_j in document d.

Fig. 2. Illustrations for (a) the tree stick-breaking process, and (b) hierarchical theme proportions.

3.3. Tree Stick-Breaking Process

The traditional GEM distribution provides a distribution over infinite proportions. Such a distribution is not suitable for characterizing a tree structure with dependencies between parent nodes and child nodes. To cope with this issue, we present a new snCRP by conducting a tree stick-breaking (TSB) process where the theme proportions along a subtree path are drawn. The subtree path is chosen to reveal a variety of subjects from different sentences, while different levels are built to characterize the hierarchy of aspects. Each sentence is assigned to a node with a theme proportion determined by all nodes in the selected subtree path c_d.


Interestingly, the proposed TSB process is a special realization of the tree-structured stick-breaking process given in [12]. This process is specialized to draw theme proportions θ_d = {θ_dl} for a document d subject to Σ_l θ_dl = 1 based on a tree model with infinite nodes. The TSB process is described as follows. We consider a set consisting of a parent node and its child nodes that are connected as shown by the thick arrows in Figure 2(a). Let l_a denote an ancestor node and l_c = {l_a1, l_a2, · · · } denote its child nodes. TSB is run for each set of nodes {l_a, l_c} in a recursive fashion. Figures 2(a) and 2(b) illustrate how the tree structure in Figure 1 is constructed. Figure 2(b) shows how theme proportions are inferred by TSB. The theme proportion θ_la0 in the beginning child node denotes the initial fragment of node l_a when proceeding with the stick-breaking process for its child nodes l_c. Here, θ_0 = 1 denotes the initial unit length, θ_1 = ν_1 denotes the first fragment of the stick for the root node, and 1 − ν_1 denotes the remaining fragment of the stick. Given the treeGEM parameters {α_s, λ_s}, the beta variable ν_u ∼ Beta(α_s λ_s, α_s(1 − λ_s)) of a child node l_u ∈ Ω_lc is first drawn. The probability of generating this draw is calculated by ν_u ∏_{v=0}^{u−1} (1 − ν_v). This probability is then multiplied by the theme proportion θ_la of the ancestor node l_a so as to find the theme proportions for its child nodes l_u ∈ Ω_lc. We can recursively calculate the theme proportion by

    θ_lu = θ_la · ν_u · ∏_{v=0}^{u−1} (1 − ν_v),   for l_u ∈ Ω_lc.    (1)
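As an illustration only (the helper below is hypothetical and not the authors' implementation), the recursion in (1) can be coded as a pass over a parent node and a truncation of its children, where each child keeps a beta-sized fraction of the stick length passed down by the ancestor proportion.

    import numpy as np

    def tsb_child_proportions(theta_parent, num_children, alpha_s, lam_s, rng=None):
        """Sketch of Eq. (1): theta_lu = theta_la * nu_u * prod_{v<u} (1 - nu_v),
        with nu_u ~ Beta(alpha_s * lam_s, alpha_s * (1 - lam_s))."""
        rng = np.random.default_rng(rng)
        theta_children = []
        remaining = 1.0                     # running product of (1 - nu_v)
        for _ in range(num_children):       # truncation of the infinite branches
            nu = rng.beta(alpha_s * lam_s, alpha_s * (1.0 - lam_s))
            theta_children.append(theta_parent * nu * remaining)
            remaining *= 1.0 - nu
        return theta_children

    # Example: split an ancestor proportion theta_la = 0.7 over five child nodes.
    print(tsb_child_proportions(0.7, num_children=5, alpha_s=100.0, lam_s=0.35, rng=0))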

Therefore, a tree model is constructed without limitation on tree layers and branches. We improve the efficiency of the tree stick-breaking method [12] by adopting a single set of beta parameters {α_s, λ_s} for stick-breaking towards depth as well as branch. Using this process, we draw global theme proportions for sentences in different documents d by using the scaling parameter γ_s and then determine a subtree path for all sentences s_j in document d via the snCRP by c_d = {c_dj} ∼ snCRP(γ_s). A tree stick-breaking process is performed to sample the theme proportions θ_d ∼ treeGEM(α_s, λ_s).

3.4. HDP for Words

After having the hierarchical grouping of sentences based on the snCRP, we treat the words corresponding to a node of theme l as grouped data and conduct HDP using the grouped data from different tree nodes. The topic model is then constructed and utilized to draw individual words. Importantly, each theme is represented by a mixture model of topics ψ_l ∼ Σ_k π_lk · φ_k. HDP is applied to infer the word distributions and topic proportions. The standard stick-breaking process is applied to infer the topic proportions for the DP mixture model based on the GEM distribution. The words of a tree node corresponding to theme l are generated by

    λ_0 ∼ GEM(γ_w),   π_l ∼ DP(α_0, λ_0),   φ_k ∼ G_0
    z_wi | π_l ∼ Mult(π_l),   w_i | {z_wi, φ_k} ∼ Mult(φ_zwi)    (2)

where λ_0 is a global prior for tree nodes, π_l is the topic proportion for theme l, φ_k is the kth topic, and α_0 and γ_w are the strength parameters for the DPs. At last, the snCRP compound HDP is fulfilled to establish the hierarchical theme and topic model (H2TM).
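A rough, truncated Python sketch of Eq. (2), again a simplification rather than the paper's implementation: the global prior λ_0 is a truncated GEM draw, each theme node's topic proportions π_l come from a DP centred on λ_0 (approximated by a finite Dirichlet with parameter α_0·λ_0), and words are multinomial draws from the selected topics; names and sizes are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    K, V = 10, 50                       # topic truncation level, vocabulary size

    def truncated_gem(gamma, K, rng):
        """Truncated GEM(gamma) weights over K components."""
        nu = rng.beta(1.0, gamma, size=K)
        tail = np.concatenate(([1.0], np.cumprod(1.0 - nu[:-1])))
        w = nu * tail
        return w / w.sum()

    lam0 = truncated_gem(gamma=1.0, K=K, rng=rng)   # global topic weights lambda_0
    phi = rng.dirichlet(np.full(V, 0.1), size=K)    # topics phi_k ~ G0

    def node_words(alpha0, num_words):
        """Words for one theme node: pi_l ~ DP(alpha0, lambda_0) (finite approx.)."""
        pi_l = rng.dirichlet(alpha0 * lam0 + 1e-9)
        z = rng.choice(K, size=num_words, p=pi_l)   # topic label per word
        return [int(rng.choice(V, p=phi[k])) for k in z]

    print(node_words(alpha0=1.0, num_words=12))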

4. MODEL INFERENCE

Approximate inference using Gibbs sampling is developed to infer the posterior parameters or latent variables of H2TM. Each latent variable is iteratively sampled from a posterior probability conditioned on the observations and all the other latent variables. We sample tree paths c_d = {c_dj} for different sentences of document d. Each sentence s_j is grouped into a tree node with theme l which is sampled by the proportions θ_d under a subtree path c_d through the snCRP. Each word w_i of a sentence is assigned the topic k which is sampled via HDP.

4.1. Sampling of Tree Paths

A document is treated as "a bag of sentences" for path sampling in the proposed snCRP. To do so, we iteratively sample tree paths c_d for words w_d in document d consisting of sentences {w_dj}. Sampling tree paths is performed according to the posterior probability

    p(c_dj | c_d(−j), w_d, z_sj, ψ_l, γ_s) ∝ p(c_dj | c_d(−j), γ_s) × p(w_dj | w_d(−j), z_sj, c_d, ψ_l)    (3)

where c_d(−j) denotes the paths of all sentences in document d except sentence s_j. The notation "−" denotes the self-exception. In (3), γ_s is the Dirichlet prior parameter for global theme proportions. The first term on the right-hand side (RHS) calculates the probability of choosing a path for a sentence. This probability is determined by applying the CRP [8], where the jth sentence chooses either an occupied path h by

    p(c_dj = h | c_d(−j), γ_s) = f_d(c_dj = h) / (f_d· − 1 + γ_s)

or a new path by

    p(c_dj = new | c_d(−j), γ_s) = γ_s / (f_d· − 1 + γ_s)

where f_d(c_dj = h) denotes the number of sentences in document d that are allocated along tree path h and f_d· is the total number of sentences in document d. Path h is selected for sentence w_dj. The second term on the RHS of (3) can be calculated by referring to [7][8].
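A small illustrative sketch (hypothetical names, not the authors' code) of the CRP prior in the first factor of (3): an occupied path is chosen in proportion to how many of the document's other sentences already sit on it, and a new path in proportion to γ_s; in the full Gibbs step this draw would be reweighted by the likelihood term of (3).

    import numpy as np

    def sample_path_prior(path_counts, gamma_s, rng=None):
        """CRP-style prior over tree paths for one sentence of a document.
        path_counts[h] = number of the document's other sentences on path h."""
        rng = np.random.default_rng(rng)
        counts = np.array(list(path_counts.values()), dtype=float)
        probs = np.concatenate([counts, [gamma_s]])
        probs /= probs.sum()                 # denominator f_d. - 1 + gamma_s
        idx = rng.choice(len(probs), p=probs)
        if idx == len(counts):
            return "new"                     # open a new path in the tree
        return list(path_counts.keys())[idx]

    # Example with three occupied paths.
    print(sample_path_prior({"path-1": 4, "path-2": 2, "path-3": 1}, gamma_s=0.5, rng=0))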

4.2. Sampling of Themes

Given the current path c_dj selected via the snCRP by using the words w_dj, we sample a tree node at a level, or equivalently sample a theme l, according to the posterior probability given the current values of all other variables

    p(z_sj = l | w_d, z_s(−j), c_dj, α_s, λ_s, ψ_l) ∝ p(z_sj = l | z_s(−j), c_dj, α_s, λ_s) p(w_dj | w_d(−j), z_s, ψ_l)    (4)

where z_s = {z_sj, z_s(−j)}. The number of themes is unlimited. The first term on the RHS of (4) is a distribution over levels derived as an expectation of treeGEM, which is implemented via the TSB process and is calculated via a product of beta variables ν_u ∼ Beta(α_s λ_s, α_s(1 − λ_s)) along path c_dj. The second term calculates the probability of sentence w_dj given the theme model ψ_l.

4.3. Sampling of Topics

According to HDP, we apply the stick-breaking construction to draw topics for words in different tree nodes. We view the words {w_dji} of the sentences in a node with theme l as the grouped data. Topic proportions are drawn from DP(α_0, λ_0). Drawing of a topic k for word w_dji, or w_i, depends on the posterior probability

    p(z_wi = k | w_dj, z_w(−i), c_dj, α_0, λ_0, φ_k) ∝ p(z_wi = k | z_w(−i), c_dj, α_0, λ_0) p(w_dji | w_dj(−i), z_w, φ_k).    (5)

Calculating (5) is equivalent to estimating the topic proportion π_lk. The first term on the RHS of (5) is a distribution over topics derived as an expectation of GEM and is calculated via a product of beta variables using Beta(α_0 λ_0, α_0(1 − λ_0)). The second term calculates the probability of word w_dji given the topic model φ_k. Given the current status of the sampler, we iteratively sample each variable conditioned on the rest of the variables. For each document d, the paths c_dj, themes l and topics k are sequentially sampled and iteratively employed to update the corresponding posterior probabilities in the Gibbs sampling procedure. The true posteriors are approximated by running sufficient iterations of Gibbs sampling. The resulting H2TM is implemented for document summarization.
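The overall sweep can be pictured with the following skeleton, written for illustration with placeholder samplers standing in for the posteriors (3)-(5); only the control flow is meant to be informative, and all names are hypothetical.

    import random

    def gibbs_sweep(corpus, state, samplers, num_iterations=40):
        """Skeleton of the H2TM Gibbs sweep: for every document, resample the
        sentence paths (Eq. 3), sentence themes (Eq. 4) and word topics (Eq. 5)."""
        for _ in range(num_iterations):
            for d, doc in enumerate(corpus):
                for j, sentence in enumerate(doc):
                    state["path"][d][j] = samplers["path"](state, d, j)
                    state["theme"][d][j] = samplers["theme"](state, d, j)
                    for i, _word in enumerate(sentence):
                        state["topic"][d][j][i] = samplers["topic"](state, d, j, i)
        return state

    # Toy run with uniform placeholder samplers, just to exercise the control flow.
    corpus = [[["w1", "w2"], ["w3"]], [["w4", "w5", "w6"]]]
    state = {"path":  [[0] * len(doc) for doc in corpus],
             "theme": [[0] * len(doc) for doc in corpus],
             "topic": [[[0] * len(s) for s in doc] for doc in corpus]}
    samplers = {"path":  lambda st, d, j: random.randrange(3),
                "theme": lambda st, d, j: random.randrange(5),
                "topic": lambda st, d, j, i: random.randrange(10)}
    print(gibbs_sweep(corpus, state, samplers, num_iterations=2))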

5. EXPERIMENTS

5.1. Experimental Setup

A series of experiments was conducted to evaluate the proposed H2TM for document summarization. The experiments were performed using DUC (Document Understanding Conference) 2007 (http://duc.nist.gov/). In DUC 2007, there were 45 super-documents where each document contained 25-50 news articles. The total number of sentences in this dataset was 22961. The vocabulary size was 18696 after removing stop words. This corpus provided the reference summaries, which were manually written for evaluation. The automatic summary for DUC was limited to 250 words at most. The NIST evaluation tool, ROUGE (Recall-Oriented Understudy for Gisting Evaluation), was adopted. ROUGE-1 was used to measure the matched unigrams between the reference summary and the automatic summary, and ROUGE-L was used to calculate the longest common subsequence between the two summaries. For simplicity, we constrained tree growing to three layers in our experiments. The initial values of the three-layer H2TM were specified by ψ_l = φ_k = [0.05 0.025 0.0125]^T, λ_0 = λ_s = 0.35, α_s = 100 and γ_s = 0.5.
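As a rough illustration only (a simplified re-implementation, not the official ROUGE toolkit), ROUGE-1 recall, precision and F-measure reduce to clipped unigram-overlap counts between a reference summary and an automatic summary.

    from collections import Counter

    def rouge_1(reference_tokens, candidate_tokens):
        """Simplified ROUGE-1: clipped unigram overlap between reference and candidate."""
        ref, cand = Counter(reference_tokens), Counter(candidate_tokens)
        overlap = sum(min(ref[w], cand[w]) for w in cand)
        recall = overlap / max(sum(ref.values()), 1)
        precision = overlap / max(sum(cand.values()), 1)
        f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
        return recall, precision, f1

    # Toy example with whitespace tokenization.
    print(rouge_1("the tree model groups sentences".split(),
                  "the model groups thematic sentences".split()))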

Table 1. Comparison of recall, precision and F-measure by using H2TM based on four sentence selection methods.

              Recall   Precision   F-measure
  H2TM-root   0.4001   0.3771      0.3878
  H2TM-leaf   0.4019   0.3930      0.3927
  H2TM-MMR    0.4093   0.3861      0.3969
  H2TM-path   0.4100   0.3869      0.3976

5.2. Evaluation for Summarization

We conduct unsupervised structural learning and provide sentence-level thematic information for document summarization. The thematic sentences are selected from a tree structure where the sentences are allocated in the corresponding tree nodes. The tree model was built by running 40 iterations of Gibbs sampling. A tree path contains sentences for a theme in different layers with varying degrees of thematic focus. The thematic sentences are selected by four methods. The first two methods (denoted by H2TM-root and H2TM-leaf) calculate the Kullback-Leibler (KL) divergence between the document model and the sentence models using the sentences which are grouped into the root node and the leaf nodes, respectively. The sentences with small KL divergences are selected. The third method (denoted by H2TM-MMR) applies the maximal marginal relevance (MMR) [13] to select sentences from all possible paths. The fourth method (denoted by H2TM-path) chooses the most frequently-visited path of a document among the different paths and selects the sentences which are closest to the whole document according to their KL divergence.
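For illustration, here is a minimal sketch (using plain unigram distributions as stand-ins for the document and sentence models; names are hypothetical) of the KL-divergence-based selection underlying H2TM-root, H2TM-leaf and H2TM-path: sentences whose distributions are closest to the document distribution are ranked first.

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        """KL(p || q) between two discrete distributions over the same vocabulary."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def select_sentences(doc_dist, sentence_dists, budget=3):
        """Rank candidate sentences by KL(document || sentence) and keep the closest."""
        order = sorted(range(len(sentence_dists)),
                       key=lambda j: kl_divergence(doc_dist, sentence_dists[j]))
        return order[:budget]

    # Toy example over a four-word vocabulary.
    doc = [0.4, 0.3, 0.2, 0.1]
    sents = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4], [0.35, 0.3, 0.25, 0.1]]
    print(select_sentences(doc, sents, budget=2))   # indices of selected sentences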

Table 1 compares the four selection methods for document summarization in terms of recall, precision and F-measure under ROUGE-1. H2TM-path and H2TM-MMR obtain comparable results. These two methods perform better than H2TM-root and H2TM-leaf. H2TM-path obtains the highest F-measure. The sentences along the most frequently-visited path contain the most representative information for summarization. In the subsequent evaluation, H2TM-path is adopted for comparison with other summarization methods.

Table 2 reports the recall, precision and F-measure of document summarization using the Vector Space Model (VSM), sentence-based LDA [5] and H2TM under ROUGE-1 and ROUGE-L. We also show the improvement rates (%) (given in parentheses) of LDA and H2TM over the baseline VSM. LDA was implemented for individual sentences by adopting the Dirichlet parameter α = 10 and fixing the number of topics as 100 and the number of themes as 1000. Using LDA, the model size is fixed. This model size is comparable with that of H2TM, which is determined autonomously by Bayesian nonparametric learning. The comparison between LDA and H2TM is therefore fair under comparable model complexity. In this evaluation, LDA consistently outperforms the baseline VSM in terms of precision, recall and F-measure under the different ROUGE measures. Nevertheless, H2TM further improves over LDA under the different experimental conditions. For the case of ROUGE-1, the improvement rates of F-measure using LDA and H2TM are 8.2% and 20.1%, respectively. The contributions of H2TM come from the flexible model complexity and the structural theme information, which are beneficial for document summarization.

Table 2. Comparison of recall, precision and F-measure and their improvement rates (%) over the baseline system.

                        ROUGE-1                                       ROUGE-L
         Recall         Precision      F-measure       Recall         Precision      F-measure
  VSM    0.3262 (-)     0.3373 (-)     0.3310 (-)      0.2971 (-)     0.3070 (-)     0.3013 (-)
  LDA    0.3372 (3.4)   0.3844 (14.0)  0.3580 (8.2)    0.2982 (0.4)   0.3395 (10.6)  0.3164 (5.0)
  H2TM   0.4100 (25.7)  0.3869 (14.7)  0.3976 (20.1)   0.3695 (24.4)  0.3489 (13.7)  0.3585 (19.0)

6. CONCLUSIONS

This paper addressed a new H2TM for unsupervised learning of the latent structure of grouped data in different levels. A hierarchical theme model was constructed according to a sentence-level nCRP while the topic model was established through a word-level HDP. The snCRP compound HDP was proposed to build the H2TM where each theme was characterized by a mixture model of topics. A delicate document representation using the themes at the sentence level and the topics at the word level was organized. We further presented a TSB process to draw a subtree path for a heterogeneous document and built a hierarchical mixture model of themes according to the snCRP. The hierarchical clustering of sentences was realized. The sentences were allocated in tree nodes and the corresponding words in different nodes were drawn by HDP. The proposed H2TM is a general model which can be applied for unsupervised structural learning of different kinds of grouped data. Experimental results on document summarization showed that H2TM could capture the latent structure from multiple documents and outperform the other methods in terms of recall, precision and F-measure. Further investigations shall be conducted for document classification and information retrieval.

7. REFERENCES

[1] D. Blei, L. Carin, and D. Dunson, “Probabilistic topic models,” IEEE Signal Processing Magazine, vol. 27, no. 6, pp. 55–65, 2010.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. 5, pp. 993–1022, 2003.

[3] J.-T. Chien and C.-H. Chueh, “Dirichlet class language models for speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 3, pp. 482–495, 2011.

[4] J.-T. Chien and C.-H. Chueh, “Topic-based hierarchical segmentation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 1, pp. 55–66, 2012.

[5] Y.-L. Chang and J.-T. Chien, “Latent Dirichlet learning for document summarization,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing, 2009, pp. 1689–1692.

[6] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.

[7] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, “Hierarchical topic models and the nested Chinese restaurant process,” in Advances in Neural Information Processing Systems, 2004.

[8] D. M. Blei, T. L. Griffiths, and M. I. Jordan, “The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies,” Journal of the ACM, vol. 57, no. 2, article 7, 2010.

[9] J. Paisley, L. Carin, and D. Blei, “Variational inference for stick-breaking beta process priors,” in Proc. of International Conference on Machine Learning, 2011.

[10] C. Wang and D. M. Blei, “Variational inference for the nested Chinese restaurant process,” in Advances in Neural Information Processing Systems, 2009.

[11] J. Pitman, “Poisson-Dirichlet and GEM invariant distributions for split-and-merge transformations of an interval partition,” Combinatorics, Probability and Computing, vol. 11, pp. 501–514, 2002.

[12] R. P. Adams, Z. Ghahramani, and M. I. Jordan, “Tree-structured stick breaking for hierarchical data,” in Advances in Neural Information Processing Systems, 2010.

[13] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz, “Multi-document summarization by sentence extraction,” in Proc. of ANLP/NAACL Workshop on Automatic Summarization, 2000.
