Intra-Speaker Topic Modeling for Improved Multi-Party Meeting Summarization with Integrated Random Walk

Yun-Nung Chen and Florian Metze

School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891, USA

{yvchen,fmetze}@cs.cmu.edu

Abstract

This paper proposes an improved approach to extractive summarization of spoken multi-party interaction, in which an integrated random walk is performed on a graph constructed from topical and lexical relations.

Each utterance is represented as a node of the graph, and the edge weights are computed from the topical similarity between utterances, evaluated using probabilistic latent semantic analysis (PLSA), and from word overlap. We model intra-speaker topics by partially sharing the topics from the same speaker in the graph. We perform experiments on both automatically and manually generated transcripts. For automatic transcripts, our results show that intra-speaker topic sharing and the integration of topical and lexical relations help include the important utterances in the summary.

1 Introduction

Speech summarization is an active and important topic of research (Lee and Chen, 2005), because multimedia/spoken documents are more difficult to browse than text or image content. While earlier work was focused primarily on broadcast news content, recent effort has been increasingly directed to new domains such as lectures (Glass et al., 2007; Chen et al., 2011) and multi-party interaction (Banerjee and Rudnicky, 2008; Liu and Liu, 2010).

We describe experiments on multi-party interaction found in meeting recordings, performing extractive summarization (Liu et al., 2010) on transcripts generated by automatic speech recognition (ASR) and by human annotators.

Graph-based methods that compute lexical centrality as a measure of importance for extracting summaries (Erkan and Radev, 2004) have been investigated in the context of text summarization. Other work focuses on maximizing the coverage of summaries through an objective function (Gillick, 2011). Speech summarization carries intrinsic difficulties due to the presence of recognition errors, spontaneous speech effects, and the lack of segmentation. A general approach has been found very successful (Furui et al., 2004), in which each utterance U = t1 t2 ... ti ... tn in the document d, represented as a sequence of terms ti, is given an importance score:

$$I(U, d) = \frac{1}{n} \sum_{i=1}^{n} \big[ \lambda_1 s(t_i, d) + \lambda_2 l(t_i) + \lambda_3 c(t_i) + \lambda_4 g(t_i) \big] + \lambda_5 b(U), \qquad (1)$$

where s(ti, d), l(ti), c(ti), and g(ti) are respectively a statistical measure (such as TF-IDF), a linguistic measure (e.g., different part-of-speech tags are given different weights), a confidence score, and an N-gram score for the term ti; b(U) is calculated from the grammatical structure of the utterance U; and λ1, λ2, λ3, λ4, and λ5 are weighting parameters. For each document, the utterances to be used in the summary are then selected based on this score.
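As an illustration, here is a minimal Python sketch of this scoring scheme, assuming the per-term measures and the utterance-level grammar score are computed elsewhere; all function and variable names are illustrative, not from the authors' implementation.

```python
# A minimal sketch of the importance score in Eq. (1), assuming the
# per-term measures (statistical, linguistic, confidence, N-gram) and
# the utterance-level grammar score b(U) are provided as callables.
def importance_score(terms, doc, s, l, c, g, b_u, lambdas):
    """terms: list of terms t_i in utterance U; doc: document id;
    s, l, c, g: per-term scoring functions; b_u: grammar score b(U);
    lambdas: weights (lambda_1, ..., lambda_5)."""
    l1, l2, l3, l4, l5 = lambdas
    per_term = sum(l1 * s(t, doc) + l2 * l(t) + l3 * c(t) + l4 * g(t)
                   for t in terms)
    return per_term / len(terms) + l5 * b_u
```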

In recent work, Chen et al. (2011) proposed a graphical structure to rescore I(U, d), which models the topical coherence between utterances using a random walk within documents. Similarly, we now use a graph-based approach that considers the importance of terms and the similarity between utterances, where topical and lexical similarity are integrated in the graph, so that utterances topically or lexically similar to more important utterances are given higher scores. Using topical similarity can compensate, to some extent, for the negative effects of recognition errors on similarity evaluated by word overlap. In addition, this paper proposes an approach to modeling intra-speaker topics in the graph to improve meeting summarization (Garg et al., 2009) using information from multi-party interaction, which is not available in lectures or broadcast news.

2 Proposed Approach

We apply word stemming and noise-utterance filtering to the utterances in all meetings. Then we construct a graph to compute the importance of all utterances.


[Figure 1: A simplified example of the graph considered, showing utterance nodes U1–U6, topical edge weights pt(3, 4) and pt(4, 3), and the neighbor sets A4^t = {U3, U5, U6} (outgoing) and B4^t = {U1, U3, U6} (incoming) for node U4.]

We formulate the utterance selection problem as a random walk on a directed graph, in which each utterance is a node and the edges between them are weighted by topical and lexical similarity. The basic idea is that an utterance similar to more important utterances should itself be more important (Chen et al., 2011). We formulate two types of directed edges, topical edges and lexical edges, weighted by topical and lexical similarity respectively. We then keep only the top N outgoing edges with the highest weights from each node, while considering all incoming edges to each node for importance propagation in the graph. A simplified example of such a graph with topical edges is shown in Figure 1, in which A_i^t and B_i^t are the sets of neighbors of node Ui connected by outgoing and incoming topical edges respectively.
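A sketch of this graph construction with top-N pruning of outgoing edges follows, assuming a similarity function sim(i, j) (topical or lexical) is already defined; the names are illustrative.

```python
# Build the directed graph: for each node, keep only the top-N
# outgoing edges by weight. Incoming edges to a node are implied by
# the outgoing edge lists of the other nodes.
def build_edges(n_utts, sim, top_n=10):
    """Return a dict mapping each node i to its top-N outgoing
    neighbors j with weights sim(i, j)."""
    edges = {}
    for i in range(n_utts):
        weights = [(j, sim(i, j)) for j in range(n_utts) if j != i]
        weights.sort(key=lambda x: x[1], reverse=True)
        edges[i] = weights[:top_n]  # keep only top-N outgoing edges
    return edges
```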

2.1 Parameters from PLSA

Probabilistic latent semantic analysis (PLSA) (Hofmann, 1999) has been widely used to analyze the semantics of documents based on a set of latent topics. Given a set of documents {dj, j = 1, 2, ..., J} and all terms {ti, i = 1, 2, ..., M} they include, PLSA uses a set of latent topic variables, {Tk, k = 1, 2, ..., K}, to characterize the "term-document" co-occurrence relationships.

The PLSA model can be optimized with the EM algorithm by maximizing a likelihood function. In this paper we utilize two parameters derived from PLSA: latent topic significance (LTS) and latent topic entropy (LTE) (Kong and Lee, 2011).

Latent Topic Significance (LTS) for a given term ti with respect to a topic Tk can be defined as

$$\mathrm{LTS}_{t_i}(T_k) = \frac{\sum_{d_j \in D} n(t_i, d_j)\, P(T_k \mid d_j)}{\sum_{d_j \in D} n(t_i, d_j)\, \big[ 1 - P(T_k \mid d_j) \big]}, \qquad (2)$$

where n(ti, dj) is the occurrence count of term ti in document dj. A higher LTS_{ti}(Tk) thus indicates that the term ti is more significant for the latent topic Tk.
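A compact sketch of Eq. (2) follows, assuming a fitted PLSA model that provides the posteriors P(Tk | dj) and the term-document counts; the matrix names are illustrative.

```python
# Latent Topic Significance (Eq. 2) as a vectorized computation.
import numpy as np

def lts(counts, p_topic_given_doc):
    """counts: (M terms x J docs) occurrence matrix n(t_i, d_j);
    p_topic_given_doc: (J docs x K topics) matrix P(T_k | d_j).
    Returns an (M x K) matrix of LTS_{t_i}(T_k)."""
    num = counts @ p_topic_given_doc          # sum_j n(t_i,d_j) P(T_k|d_j)
    den = counts @ (1.0 - p_topic_given_doc)  # sum_j n(t_i,d_j)(1 - P(T_k|d_j))
    return num / np.maximum(den, 1e-12)       # guard against division by zero
```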

Latent Topic Entropy (LTE) for a given term ti can be calculated from the topic distribution P(Tk | ti):

$$\mathrm{LTE}(t_i) = -\sum_{k=1}^{K} P(T_k \mid t_i) \log P(T_k \mid t_i), \qquad (3)$$

where the topic distribution P(Tk | ti) can be estimated from PLSA. LTE(ti) measures how strongly the term ti is focused on a few topics, so a lower latent topic entropy implies that the term carries more topical information.
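Eq. (3) is the standard Shannon entropy of the term's topic distribution; a minimal sketch, assuming an (M × K) posterior matrix from PLSA:

```python
# Latent Topic Entropy (Eq. 3) over the topic posterior of each term.
import numpy as np

def lte(p_topic_given_term, eps=1e-12):
    """p_topic_given_term: (M terms x K topics) matrix P(T_k | t_i).
    Returns a length-M vector of LTE(t_i) values."""
    p = np.clip(p_topic_given_term, eps, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)
```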

2.2 Statistical Measures of a Term

The statistical measure s(ti, d) of a term ti in (1) can be defined in terms of LTE(ti) in (3) as

$$s(t_i, d) = \gamma \cdot \frac{n(t_i, d)}{\mathrm{LTE}(t_i)}, \qquad (4)$$

where γ is a scaling factor such that 0 ≤ s(ti, d) ≤ 1; the score s(ti, d) is inversely proportional to the latent topic entropy LTE(ti). Kong and Lee (2011) showed that using s(ti, d) as defined in (4) within (1) outperformed the very successful "significance score" (Furui et al., 2004) in speech summarization, so we use it as our baseline.
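A small sketch of Eq. (4), assuming the paper's constraint that γ scales the scores into [0, 1]; the exact choice of γ is our assumption, as the paper only states the range constraint.

```python
# LTE-based statistical measure (Eq. 4), normalized into [0, 1].
import numpy as np

def statistical_measure(counts_in_doc, lte_values):
    """counts_in_doc: length-M vector n(t_i, d) for one document;
    lte_values: length-M vector LTE(t_i). Returns s(t_i, d)."""
    raw = counts_in_doc / np.maximum(lte_values, 1e-12)
    gamma = 1.0 / raw.max() if raw.max() > 0 else 1.0  # assumed scaling
    return gamma * raw
```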

2.3 Similarity between Utterances

Within a document d, we can first compute the probability that the topic Tk is addressed by an utterance Ui:

$$P(T_k \mid U_i) = \frac{\sum_{t \in U_i} n(t, U_i)\, P(T_k \mid t)}{\sum_{t \in U_i} n(t, U_i)}. \qquad (5)$$

Then an asymmetric topical similarity TopicSim(Ui, Uj) from utterance Ui to Uj (with direction Ui → Uj) can be defined by accumulating LTS_t(Tk) in (2), weighted by P(Tk | Ui), for all terms t in Uj over all latent topics:

$$\mathrm{TopicSim}(U_i, U_j) = \sum_{t \in U_j} \sum_{k=1}^{K} \mathrm{LTS}_t(T_k)\, P(T_k \mid U_i), \qquad (6)$$

where the idea is very similar to the generative probability in information retrieval; we call this the generative significance of Ui given Uj.

Within a document d, the lexical similarity is a measure of the word overlap between utterances Ui and Uj. We compute LexSim(Ui, Uj) as the cosine similarity between the TF-IDF vectors of Ui and Uj, as in the well-known LexRank (Erkan and Radev, 2004). Note that LexSim(Ui, Uj) = LexSim(Uj, Ui), while TopicSim is asymmetric.
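A sketch of Eqs. (5)-(6), assuming utterances are bags of integer term indices, a PLSA posterior matrix P(Tk | t), and the LTS matrix from the earlier sketch; the names are illustrative.

```python
# Asymmetric topical similarity (Eqs. 5-6).
import numpy as np
from collections import Counter

def topic_posterior(utt_terms, p_topic_given_term):
    """Eq. (5): P(T_k | U_i) as a count-weighted average of P(T_k | t)."""
    counts = Counter(utt_terms)
    total = sum(counts.values())
    p = np.zeros(p_topic_given_term.shape[1])
    for t, n in counts.items():
        p += n * p_topic_given_term[t]
    return p / total

def topic_sim(utt_i, utt_j, p_topic_given_term, lts_matrix):
    """Eq. (6): generative significance of U_i given U_j; sums
    LTS_t(T_k) weighted by P(T_k | U_i) over all terms t in U_j."""
    p_i = topic_posterior(utt_i, p_topic_given_term)
    return sum(lts_matrix[t] @ p_i for t in utt_j)
```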

2.4 Intra-Speaker Topic Modeling

We assume a single speaker usually focuses on similar topics, so if an utterance is important, the scores of the utterances from the same speaker should be increased.

We therefore increase the similarity between utterances from the same speaker so that they can share topics:

$$\mathrm{TopicSim}'(U_i, U_j) =
\begin{cases}
\mathrm{TopicSim}(U_i, U_j) \cdot (1 + w), & \text{if } U_i \in S_k \text{ and } U_j \in S_k \\
\mathrm{TopicSim}(U_i, U_j) \cdot (1 - w), & \text{otherwise}
\end{cases} \qquad (7)$$


where Sk is the set of all utterances from speaker k, and w is a weighting parameter for modeling the speaker relation, i.e., the level of topical coherence within a single speaker. In this way, topics from the same speaker can be partially shared.
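A sketch of this adjustment follows, assuming (per Eq. (7) as reconstructed above) that the (1 + w)/(1 − w) factors act multiplicatively on the similarity; `speaker_of` is an illustrative mapping from utterance index to speaker id.

```python
# Intra-speaker topic sharing (Eq. 7): boost topical similarity for
# same-speaker utterance pairs, attenuate it otherwise.
def intra_speaker_topic_sim(i, j, topic_sim_fn, speaker_of, w=0.1):
    base = topic_sim_fn(i, j)
    if speaker_of[i] == speaker_of[j]:
        return base * (1.0 + w)  # partially share topics within a speaker
    return base * (1.0 - w)
```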

2.5 Integrated Random Walk

We modify the random walk (Hsu and Kennedy, 2007; Chen et al., 2011) to integrate the two types of similarity over the graph obtained above. The new score v(i) for node Ui is an interpolation of three scores: the normalized initial importance r(i) of node Ui, and the scores contributed by all neighboring nodes Uj of node Ui, weighted by pt(j, i) and pl(j, i) respectively:

$$v(i) = (1 - \alpha - \beta)\, r(i) + \alpha \sum_{U_j \in B_i^t} p_t(j, i)\, v(j) + \beta \sum_{U_j \in B_i^l} p_l(j, i)\, v(j), \qquad (8)$$

where α and β are the interpolation weights, B_i^t is the set of neighbors connected to node Ui via incoming topical edges, B_i^l is the set of neighbors connected to node Ui via incoming lexical edges, and

$$r(i) = \frac{I(U_i, d)}{\sum_{U_j} I(U_j, d)} \qquad (9)$$

is the normalized importance score of utterance Ui, with I(Ui, d) as in (1). We normalize the topical similarity by the total similarity summed over the set of outgoing edges to produce the weight pt(j, i) for the edge from Uj to Ui in the graph; pl(j, i) is normalized over the lexical edges in the same way.

Equation (8) can be solved iteratively with an approach very similar to that for the PageRank problem (Page et al., 1998). Let v = [v(i), i = 1, 2, ..., L]^T and r = [r(i), i = 1, 2, ..., L]^T be the column vectors of v(i) and r(i) for all utterances in the document, where L is the total number of utterances in the document d and T denotes transpose. Equation (8) then has the vector form

$$v = (1 - \alpha - \beta)\, r + \alpha P_t v + \beta P_l v \qquad (10)$$
$$= \big[ (1 - \alpha - \beta)\, r e^T + \alpha P_t + \beta P_l \big] v = P' v,$$

where Pt and Pl are the L × L matrices of pt(j, i) and pl(j, i) respectively, and e = [1, 1, ..., 1]^T; the second step uses e^T v = 1, since the scores v(i) are normalized to sum to one. It has been shown that the solution v of (10) is the dominant eigenvector of P′ (Langville and Meyer, 2005), i.e., the eigenvector corresponding to the largest absolute eigenvalue of P′. The scores v(i) can then be obtained.
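A power-iteration sketch for solving Eq. (10) follows, assuming matrices Pt and Pl whose (i, j) entries hold pt(j, i) and pl(j, i), and the normalized prior r from Eq. (9); the convergence settings are illustrative.

```python
# Integrated random walk (Eq. 10) solved by power iteration.
import numpy as np

def integrated_random_walk(r, Pt, Pl, alpha=0.45, beta=0.45,
                           n_iter=100, tol=1e-8):
    """r: length-L normalized prior; Pt, Pl: (L x L) matrices with
    entry (i, j) = p_t(j, i) / p_l(j, i). Returns the scores v."""
    v = r.copy()  # start from the prior importance scores
    for _ in range(n_iter):
        v_new = (1 - alpha - beta) * r + alpha * Pt @ v + beta * Pl @ v
        v_new /= v_new.sum()  # keep v summing to one (e^T v = 1)
        if np.abs(v_new - v).sum() < tol:
            break
        v = v_new
    return v
```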

3 Experiments

3.1 Corpus

The corpus used in this research consists of a sequence of naturally occurring meetings, which featured largely overlapping participant sets and topics of discussion. For each meeting, SmartNotes (Banerjee and Rudnicky, 2008) was used to record both the audio from each participant and their notes. The meetings were transcribed both manually and using a speech recognizer; the word error rate is around 44%. In this paper we use 10 meetings held from April to June of 2006. On average, each meeting had about 28 minutes of speech. Across these 10 meetings there were 6 unique participants; each meeting featured between 2 and 4 of these participants (average: 3.7). The total number of utterances is 9837 across the 10 meetings. We split the data into a development set (2 meetings) and a test set (8 meetings); the development set is used to tune parameters such as α, β, and w.

The reference summaries are given by the set of "noteworthy utterances": two annotators manually labelled the degree of "noteworthiness" (three levels) for each utterance, and we extract the utterances with the top level of "noteworthiness" to form the summary of each meeting. In the following experiments, for each meeting, we extract top-scoring utterances amounting to 30% of the terms as the summary.

3.2 Evaluation Metrics

Automated evaluation utilizes the standard DUC evaluation metric ROUGE (Lin, 2004), which computes recall over various n-gram statistics for a system-generated summary against a set of human-generated peer summaries. F-measures for ROUGE-1 (unigrams) and ROUGE-L (longest common subsequence) can be evaluated in the same way, and are used in the following results.
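For illustration only, a minimal unigram-overlap F-measure in the spirit of ROUGE-1; the experiments themselves use the standard ROUGE toolkit (Lin, 2004), not this sketch.

```python
# A toy ROUGE-1-style F-measure over token lists.
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    if not candidate_tokens or not reference_tokens:
        return 0.0
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())  # clipped unigram matches
    p = overlap / len(candidate_tokens)
    r = overlap / len(reference_tokens)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```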

3.3 Results

Table 1 shows the performance achieved by all proposed approaches. In these experiments, the damping factor (1 − α − β) in (8) is empirically set to 0.1. Row (a) is the baseline, which uses the LTE-based statistical measure to compute the importance of utterances, I(U, d). Row (b) is the result considering only lexical similarity; row (c) uses only topical similarity. Row (d) shows the results additionally including speaker information, i.e., TopicSim′(Ui, Uj). Row (e) is the result of the integrated random walk (with α ≠ 0 and β ≠ 0), using parameters optimized on the dev set.

3.3.1 Graph-Based Approach

We can see that the performance after graph-based re-computation, shown in rows (b) and (c), is significantly better than the baseline, shown in row (a), for both ASR and manual transcripts. For ASR transcripts, topical similarity and lexical similarity give similar results. For manual transcripts, topical similarity performs slightly worse than lexical similarity, because manual transcripts do not contain recognition errors, and therefore word overlap can accurately measure the similarity between two utterances.


F-measure                      |   ASR Transcripts   |  Manual Transcripts
                               | ROUGE-1  | ROUGE-L  | ROUGE-1  | ROUGE-L
(a) Baseline: LTE              | 46.816   | 46.256   | 44.987   | 44.162
(b) LexSim (α = 0, β = 0.9)    | 48.940   | 48.504   | 46.540   | 45.858
(c) TopicSim (α = 0.9, β = 0)  | 49.058   | 48.436   | 46.199   | 45.392
(d) Intra-Speaker TopicSim     | 49.212   | 48.351   | 47.104   | 46.299
(e) Integrated Random Walk     | 49.792   | 49.156   | 46.714   | 46.064
MAX RI                         | +6.357   | +6.269   | +4.706   | +4.839

Table 1: Results (%) of all proposed approaches, with the maximum relative improvement (RI) with respect to the baseline.

[Figure 2: The performance (F-measure, ROUGE-1 and ROUGE-L) from integrated random walk with different combination weights, α and β (α + β = 0.9 in all cases), for ASR transcripts.]

However, for ASR transcripts, although topical similarity is not as accurate as lexical similarity, it can compensate for recognition errors, so the two approaches achieve similar performance. Thus, graph-based approaches can significantly improve the baseline results.

3.3.2 Effectiveness of Intra-Speaker Modeling

We find that modeling intra-speaker topics improves the performance (row (c) vs. row (d)), which means speaker information is useful for modeling topical similarity. The experiment shows that intra-speaker modeling helps include the important utterances for both ASR and manual transcripts.

3.3.3 Integration of Topical and Lexical Similarity

Row (e) shows the result of the proposed approach, which integrates topical and lexical similarity into a single graph, considering the two types of relations together.

For ASR transcripts, row (e) is better than rows (b) and (d), which means that, because of recognition errors, topical similarity and lexical similarity model different types of relations. Figure 2 shows the sensitivity of the combination weights for integrated random walk. We can see that topical similarity and lexical similarity are complementary, i.e., they can compensate for each other, improving the performance when the two types of edges are integrated in a single graph. Note that the exact values of α and β do not matter much for the performance.

For manual transcripts, row (e) does not perform better when combining the two types of similarity, which means topical similarity can dominate lexical similarity: without recognition errors, topical similarity models the relations accurately, and additionally modeling intra-speaker topics effectively improves the performance.

In addition, Banerjee and Rudnicky (2008) used supervised learning to detect noteworthy utterances on the same corpus, and achieved ROUGE-1 scores of around 43% for ASR and 47% for manual transcripts. Our unsupervised approach performs better, especially for ASR transcripts.

Note that the performance on ASR transcripts is better than on manual transcripts. Because a higher percentage of recognition errors occurs on "unimportant" words, these words tend to receive lower scores; we can then exclude the utterances with more errors, and achieve better summarization results. Other recent work has also demonstrated better performance for ASR than for manual transcripts (Chen et al., 2011; Kong and Lee, 2011).

4 Conclusion and Future Work

Extensive experiments and evaluation with ROUGE metrics showed that intra-speaker topics can be modeled in topical similarity, and that integrated random walk can combine the advantages of the two types of edges for imperfect ASR transcripts, where we achieved more than 6% relative improvement. We plan to model inter-speaker topics in the graph-based approach in future work.

Acknowledgements

The first author was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080628 to Carnegie Mellon University.

Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the Institute or the U.S. Department of Education.


References

Banerjee, S. and Rudnicky, A. I. 2008. An extractive-summarization baseline for the automatic detection of noteworthy utterances in multi-party human-human dialog. Proc. of SLT.

Chen, Y.-N., Huang, Y., Yeh, C.-F., and Lee, L.-S. 2011. Spoken lecture summarization by random walk over a graph constructed with automatically extracted key terms. Proc. of InterSpeech.

Erkan, G. and Radev, D. R. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, Vol. 22.

Furui, S., Kikuchi, T., Shinnaka, Y., and Hori, C. 2004. Speech-to-text and speech-to-speech summarization of spontaneous speech. IEEE Trans. on Speech and Audio Processing.

Garg, N., Favre, B., Reidhammer, K., and Hakkani-Tür, D. 2009. ClusterRank: A graph-based method for meeting summarization. Proc. of InterSpeech.

Gillick, D. J. 2011. The elements of automatic summarization. PhD thesis, EECS, UC Berkeley.

Glass, J., Hazen, T. J., Cyphers, S., Malioutov, I., Huynh, D., and Barzilay, R. 2007. Recent progress in the MIT spoken lecture processing project. Proc. of InterSpeech.

Hofmann, T. 1999. Probabilistic latent semantic indexing. Proc. of SIGIR.

Hsu, W. and Kennedy, L. 2007. Video search reranking through random walk over document-level context graph. Proc. of MM.

Kong, S.-Y. and Lee, L.-S. 2011. Semantic analysis and organization of spoken documents based on parameters derived from latent topics. IEEE Trans. on Audio, Speech and Language Processing, 19(7): 1875-1889.

Langville, A. and Meyer, C. 2005. A survey of eigenvector methods for web information retrieval. SIAM Review.

Lee, L.-S. and Chen, B. 2005. Spoken document understanding and organization. IEEE Signal Processing Magazine.

Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. Proc. of Workshop on Text Summarization Branches Out.

Liu, F. and Liu, Y. 2010. Using spoken utterance compression for meeting summarization: A pilot study. Proc. of SLT.

Liu, Y., Xie, S., and Liu, F. 2010. Using N-best recognition output for extractive summarization and keyword extraction in meeting speech. Proc. of ICASSP.

Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The PageRank citation ranking: Bringing order to the web. Technical Report, Stanford Digital Library Technologies Project.
