
Improved Speech Summarization and Spoken Term Detection

with Graphical Analysis of Utterance Similarities

Hung-Yi Lee, Yun-Nung Chen, and Lin-Shan Lee
National Taiwan University

E-mail: {tlkagkb93901106, vivian.ynchen}@gmail.com, lslee@gate.sinica.edu.tw

Abstract—We present summarization and spoken term detection (STD) approaches that take into account similarities between utterances to be scored for summary extraction or ranking in STD. A graph is constructed in which each utterance is a node. Similar utterances are connected by edges, with the edge weights representing the degree of similarity. The similarity for summarization is topical similarity; that for STD is feature-space similarity. The score of each utterance for extraction in summarization and ranking in STD is not solely decided by the individual utterance but is influenced by similar utterances on the graph. Experimental results show significant improvements compared with two baselines in terms of the ROUGE evaluation for summarization and mean average precision for STD.

I. INTRODUCTION

In the Internet era, digital network content covers all of the information and activities in human life. The most attractive form of network content is multimedia, which often includes speech. The subjects, topics, and core concepts of such speech information are usually to be found within the content itself. However, because multimedia and spoken documents are merely video or audio signals, they are much more difficult to retrieve and browse: they cannot be easily displayed on-screen, and the user cannot simply "skim through" each one from beginning to end. Hence the importance of speech information retrieval and spoken document summarization in helping users efficiently mine speech content [1].

In general, there are two stages in a speech information retrieval system [2]. In the first stage, the audio content is recognized and transformed into transcriptions or lattices. In the second stage, after the user enters a query, the retrieval engine searches through the recognition output and returns a list of relevant spoken utterances to the user. The discussion in this paper is limited to spoken term detection (STD) [3], in which the query is a term submitted by the user in text form and the system returns a list of spoken utterances containing that term. Summarization is the process of automatically creating a compressed version of a given spoken document that provides useful information for the user. The information content of a summary depends on the user's needs. Here we discuss topic-oriented summaries; we thus focus on extracting the information in the document that is related to the specified topic. As in speech information retrieval, the spoken documents are first transcribed into text using the recognition engine, and then the system selects a number of indicative utterances from

the original spoken documents according to a target summarization ratio, and concatenates them to form a summary.

Both STD and summarization can be considered utterance ranking problems, which rank utterances based on cues found in the utterance set but with different ranking targets.

In STD, the system ranks the utterances over the entire spoken archive based on relevance scores that represent the probability that the query term appears in each utterance. The posterior probability of the query term derived from the lattice is widely used as a relevance score [4], [5]; other confidence measures are also useful [6]. In summarization, each utterance is given an importance score representing how well it represents the document as a whole.

The utterances are selected for the summary in order of their importance scores until the number of terms in the summary reaches the target summarization ratio. The importance score of each utterance is typically based on its grammatical structure as well as various statistical measures, linguistic measures, and confidence scores of the terms in the utterance [7]. In mainstream approaches to STD and summarization, the utterance ranking score relies only on evidence observed in each individual utterance; we believe, however, that the relationships among utterances may yield fruitful information for ranking and therefore should not be ignored.

Although much research has been devoted to ranking instances, additionally taking into account such inter-instance

Fig. 1. The framework for the proposed approach.


relationships significantly complicates matters. These relationships are usually represented as a graph in which each node is an instance; ranks are induced by the scores assigned to the nodes. The relationships between instances are represented by edges, whose weights reflect the degree of relation. "Similarity" between instances is especially useful for ranking. We assume that similar instances should be ranked similarly because they share the same properties. Thus we assign high scores to instances that are connected to other instances with high scores, whereas instances that are similar to other instances with low scores are assigned low scores.

In other words, each instance's score is influenced by other instances that share similar properties. Although similarities between instances are usually helpful in ranking problems, which kinds of similarity are useful as edges for improving ranking performance must be investigated separately for each domain.

We attempt to take into account similarity between utterances in summarization and in STD. In topic-oriented summarization, where the summary is expected to describe specific information about the topic of the document, the utterances in the summary should have similar content and thus similar latent topic distributions. Hence we seek to extract utterances with similar latent topics as the summary. The similarity between utterances considered here is therefore topical similarity in summarization, and an utterance with strong evidence for being selected into the summary increases the selection probability of utterances with similar latent topics. For spoken term detection, the feature-space pseudo-relevance feedback (PRF) techniques proposed earlier [8] have verified that utterances that are highly similar at the feature level to parts of the pseudo-relevant utterances are likely to be relevant. Hence, for STD, similarity between utterances refers to feature-space similarity of the query hypotheses, and the system assigns close relevance scores to utterances characterized by similar features.

We present graph-based methods that take advantage of utterance similarity in STD [9] and summarization [10].

The frameworks of the proposed approaches for STD and summarization are shown in Fig. 1, with summarization on the right. An utterance graph is constructed for each transcribed spoken document. The nodes on the graph are utterances in one document, and topically similar utterances are connected with edges. Each utterance’s importance score depends not only on the statistical measure of the terms in the utterance itself but also on utterances connected to it within the graph.

For STD, on the left in Fig. 1, when the user enters a query, the retrieval engine searches through the lattices to find the utterances containing the query term as the first-pass returned list, which is not shown to the user. A graph is constructed from this list in which each node represents an utterance in the list and the edges represent high feature-space similarity between utterances. The relevance score of each utterance is partially decided by the utterances connected to it, and the list is then re-ranked accordingly.

The proposed approaches for summarization and STD are described in Sections II and III respectively. In Section IV, course lectures are taken as an example in experiments testing the proposed approach for both STD and summarization. In Section V we offer concluding remarks.

II. SUMMARIZATION WITH UTTERANCE SIMILARITY

We introduce a graph-based method that takes into account topical similarity when computing importance scores for the utterances in a spoken document. Although similar graph-based approaches have been used for text summarization [11], [12], previous research on text used only term similarity rather than the latent topic similarity proposed here.

A. Baseline

Here the statistical measures of the terms in utterance x belonging to spoken document d are used to infer the importance score I_d(x):

    I_d(x) = \sum_{t_i \in x} n(t_i, x) s(t_i, d),    (1)

where n(t_i, x) is the occurrence count of term t_i in utterance x, and s(t_i, d) is the statistical measure of term t_i. In this work, s(t_i, d) is defined in two different ways, as described below.
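As a rough illustration (not from the paper itself), the baseline score in (1) can be sketched in Python as follows; the callback `statistical_measure`, the function name, and the pre-tokenized utterance representation are assumptions for the sketch, standing in for whichever s(t_i, d) is chosen below.

```python
from collections import Counter

def baseline_importance(utterance_terms, statistical_measure, doc_id):
    """Baseline importance score I_d(x) of eq. (1): the sum over terms t_i in
    the utterance of n(t_i, x) * s(t_i, d).  `statistical_measure(term, doc_id)`
    is a stand-in for s(t_i, d), e.g. the LTE-based or key-term-based measure."""
    counts = Counter(utterance_terms)                     # n(t_i, x)
    return sum(n * statistical_measure(t, doc_id) for t, n in counts.items())
```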

1) Latent Topic Entropy-based Statistical Measure: Probabilistic latent semantic analysis (PLSA) [13] has been widely used to analyse the semantics of documents based on a set of latent topics. Given a set of documents {d_j, j = 1, 2, ..., J} and all the terms {t_i, i = 1, 2, ..., M} they include, PLSA uses a set of latent topic variables {T_k, k = 1, 2, ..., K} to characterize the "term-document" co-occurrence relationships.

The probability of observing term t_i given document d_j is parameterized as

    P(t_i | d_j) = \sum_{k=1}^{K} P(t_i | T_k) P(T_k | d_j).    (2)

The PLSA model can be optimized with the EM algorithm by maximizing a likelihood function.

The latent topic entropy (LTE), LTE(t_i), of a given term t_i can be calculated as in (3) from the topic distribution P(T_k | t_i) of that term:

    LTE(t_i) = - \sum_{k=1}^{K} P(T_k | t_i) \log P(T_k | t_i),    (3)

where the topic distribution P(T_k | t_i) can be estimated as follows [14], [15]:

    P(T_k | t_i) = \frac{P(t_i | T_k) P(T_k)}{P(t_i)} \simeq \frac{P(t_i | T_k)}{P(t_i)},    (4)

where the probability P(T_k) is omitted because there is as yet no good approach to estimate it, and P(t_i) can be obtained from a large corpus. LTE(t_i) is a measure of how focused the term t_i is on a few topics; a lower latent topic entropy implies the term carries more topical information.
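A minimal sketch of (3) and (4), assuming the PLSA term-topic probabilities P(t_i|T_k) and the corpus unigram probability P(t_i) are already estimated; renormalizing the approximate topic distribution so that the entropy is well defined is an implementation choice, not something specified in the paper.

```python
import numpy as np

def latent_topic_entropy(p_term_given_topics, p_term):
    """LTE(t_i) of eq. (3), with P(T_k | t_i) approximated as in eq. (4).

    p_term_given_topics: array of shape (K,), P(t_i | T_k) for one term t_i
    p_term:              scalar, P(t_i) estimated from a large corpus
    """
    p_topic_given_term = p_term_given_topics / p_term     # eq. (4), P(T_k) omitted
    p_topic_given_term /= p_topic_given_term.sum()        # renormalize to a distribution
    p = p_topic_given_term[p_topic_given_term > 0]        # avoid log(0)
    return float(-(p * np.log(p)).sum())                  # eq. (3)
```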


The statistical measure s(t_i, d) in (1) can be defined based on LTE(t_i) in (3) as

    s_{LTE}(t_i, d) = \frac{n(t_i, d)}{LTE(t_i)},    (5)

where n(t_i, d) is the occurrence count of term t_i in document d. The score s_{LTE}(t_i, d) is inversely proportional to LTE(t_i). Previous work [14], [15] has shown that this measure outperforms the very successful "significance score" in speech summarization [7]. Here we use s_{LTE}(t_i, d) as one of the baselines.

2) Key-Term-based Statistical Measure: Key terms are the terms in a document that carry the core concepts of the content. They are useful for indexing, retrieval, and browsing.

In general there are two types of key terms: keywords (single words) and key phrases (such as "hidden Markov model"). Automatically extracting key terms from spoken content is still a difficult problem, but some initial approaches have been shown to be successful in recent experiments [16]. Such approaches include the use of right/left branching entropy derived from PAT-Trees to extract frequently occurring patterns including two or more words, and identifying or verifying key terms (including key phrases) by prosodic (pitch, duration, energy), lexical, and semantic (from PLSA) features with unsupervised techniques or supervised training. Such automatically extracted key terms are very helpful in summarization.

With key terms thus automatically extracted (with some errors), we can estimate a new latent topic probability P_{KEY}(T_k | d) that is hopefully better than the P(T_k | d) calculated directly from the PLSA model:

    P_{KEY}(T_k | d) = \frac{\sum_{t_i \in key} n(t_i, d) P(T_k | t_i)}{\sum_{k=1}^{K} \sum_{t_i \in key} n(t_i, d) P(T_k | t_i)},    (6)

where key is the set of automatically extracted key terms and P(T_k | t_i) is as in (4). Only the automatically extracted key terms t_i in d are therefore considered, eliminating the influence of other, less significant terms. We then define the statistical measure s(t_i, d) as

    s_{KEY}(t_i, d) = \sum_{k=1}^{K} LTS_{t_i}(T_k) P_{KEY}(T_k | d).    (7)

LTS_{t_i}(T_k), the latent topic significance (LTS) of term t_i with respect to topic T_k, is defined [14], [15] as

    LTS_{t_i}(T_k) = \frac{\sum_{d_j \in D} n(t_i, d_j) P(T_k | d_j)}{\sum_{d_j \in D} n(t_i, d_j) [1 - P(T_k | d_j)]},    (8)

where n(t_i, d_j) is the occurrence count of term t_i in document d_j. In the numerator of (8), this count is weighted by P(T_k | d_j), the likelihood that topic T_k is addressed by document d_j, and then summed over all documents d_j in the PLSA model training corpus D. The numerator is therefore the total count of term t_i attributable to the given topic T_k over the whole PLSA training corpus, as estimated by the PLSA model. The denominator is similar except that it is accumulated over the latent topics other than T_k, so P(T_k | d_j) is replaced with [1 - P(T_k | d_j)]. Thus a higher LTS_{t_i}(T_k) indicates that term t_i is more significant for latent topic T_k.
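A minimal sketch of the key-term-based measure in (6) and (7), assuming the document term counts n(t_i, d), the per-term topic distributions P(T_k|t_i) from (4), and the LTS values of (8) have been precomputed; all function and variable names are illustrative.

```python
import numpy as np

def p_key(doc_counts, key_terms, p_topic_given_term):
    """P_KEY(T_k | d) of eq. (6): the topic distribution of document d
    estimated from its automatically extracted key terms only.

    doc_counts:         dict, term -> n(t_i, d)
    key_terms:          set of automatically extracted key terms
    p_topic_given_term: dict, term -> array (K,) of P(T_k | t_i) as in eq. (4)
    """
    K = len(next(iter(p_topic_given_term.values())))
    acc = np.zeros(K)
    for t in key_terms:
        if t in doc_counts and t in p_topic_given_term:
            acc += doc_counts[t] * p_topic_given_term[t]   # numerator of eq. (6)
    return acc / acc.sum() if acc.sum() > 0 else np.full(K, 1.0 / K)

def s_key(term, doc_topic_dist, lts):
    """s_KEY(t_i, d) of eq. (7): sum over topics of LTS_{t_i}(T_k) * P_KEY(T_k | d).

    doc_topic_dist: array (K,) returned by p_key()
    lts:            dict, term -> array (K,) of LTS_{t_i}(T_k) as in eq. (8)
    """
    return float(np.dot(lts[term], doc_topic_dist))
```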

B. Proposed Approach

To take into account the topical similarity between utterances during summarization, a graph representing the topical similarity between the utterances in document d is first constructed. All the utterances in d are nodes on the graph, and each utterance is connected only to its N_1 most topically similar utterances. An importance score is then assigned to each utterance based on the graph structure. The weights of the edges correspond to the topical similarity between the associated utterances. To estimate the topical similarity between two utterances, we first compute the probability that topic T_k is addressed by utterance x_i,

    P(T_k | x_i) = \frac{\sum_{t \in x_i} n(t, x_i) P(T_k | t)}{\sum_{t \in x_i} n(t, x_i)}.    (9)

The weight of the edge from utterance x_i to x_j (with direction x_i \to x_j) is then defined by accumulating LTS_t(T_k) in (8), weighted by P(T_k | x_i), over all terms t in x_j and all latent topics:

    W_{topic}(x_i, x_j) = \sum_{t \in x_j} \sum_{k=1}^{K} LTS_t(T_k) P(T_k | x_i).    (10)

Fig. 2. A simplified example of a graph whose nodes correspond to utterances. A_i and B_i are the node sets connected respectively by the outgoing and incoming edges of x_i.

A simplified example of such a graph is given in Fig. 2, in which A_i and B_i are the utterance sets connected by the outgoing and incoming edges of utterance x_i, respectively.

Consider document d with utterances {x_i, i = 1, 2, ..., N_d}. The proposed importance scores {I^G_d(x_i), i = 1, 2, ..., N_d} satisfy

    I^G_d(x_i) = (1 - \alpha) \hat{I}_d(x_i) + \alpha \sum_{x_j \in B_i} P_{topic}(j, i) I^G_d(x_j)    (11)

for i = 1, 2, ..., N_d, where B_i is the set of utterances connected to utterance x_i via incoming edges, and \hat{I}_d(x_i) is the normalized importance score

    \hat{I}_d(x_i) = \frac{I_d(x_i)}{\sum_{j=1}^{N_d} I_d(x_j)}.    (12)


Fig. 3. The definition of the "hypothesized region" (the red part) of an utterance x_i and the distance d(x_i, x_j) between two utterances x_i and x_j. The hypothesized region of an utterance x_i is the time span of the word arc in the lattice whose word hypothesis is exactly the query Q and which has the highest posterior probability in the lattice.

I_d(x_j) can use either the LTE-based or the key-term-based statistical measure. P_{topic}(j, i) is the weight of the edge from x_j to x_i, normalized over the outgoing edges of utterance x_j:

    P_{topic}(j, i) = \frac{W_{topic}(x_j, x_i)}{\sum_{x_k \in A_j} W_{topic}(x_j, x_k)},    (13)

where A_j is the set of utterances connected by the outgoing edges of x_j, and \alpha is an interpolation weight between 0 and 1. Thus I^G_d(x_i), the new importance score of x_i, takes into account not only the statistical measures of the terms in x_i but also the importance scores of the utterances that are topically very similar to x_i. The higher the edge weight, that is, the topical similarity, the more influence it has on the importance score of x_i. With the graph-based formulation (11), all utterances in document d are considered jointly rather than individually during summarization. The normalizations in (12) and (13) allow (11) to be formulated as a random walk problem [17], [18]. The theory of random walks guarantees that the set {I^G_d(x_i), i = 1, 2, ..., N_d} satisfying (11) is unique and nonnegative, and it can be found efficiently by the power method [19].

For better results, I^G_d(x_i) is integrated with the baseline I_d(x_i) as

    I'_d(x_i) = I_d(x_i) (I^G_d(x_i))^{\delta_1},    (14)

where \delta_1 is a weighting parameter. The proposed approach uses I'_d(x_i) as the importance score when ranking the utterances for summary selection.
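A minimal sketch of solving (11) by the power method and applying (14); the edge-weight matrix W is assumed to already hold the weights of (10) restricted to each utterance's N_1 nearest neighbors, and the function names, iteration count, and convergence tolerance are illustrative choices.

```python
import numpy as np

def graph_importance(base_scores, W, alpha=0.9, n_iter=100, tol=1e-8):
    """Solve eq. (11) by the power method.

    base_scores: array (N,), baseline importance I_d(x_i) from eq. (1)
    W:           array (N, N), W[i, j] = W_topic(x_i, x_j) of eq. (10),
                 zero where there is no edge (outside the N_1 nearest neighbors)
    Returns the importance scores I_d^G(x_i) satisfying eq. (11).
    """
    N = len(base_scores)
    i_hat = base_scores / base_scores.sum()                   # eq. (12)
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)  # eq. (13)
    scores = np.full(N, 1.0 / N)                              # arbitrary initialization
    for _ in range(n_iter):
        new = (1 - alpha) * i_hat + alpha * (P.T @ scores)    # eq. (11)
        if np.abs(new - scores).sum() < tol:
            return new
        scores = new
    return scores

def integrated_importance(base_scores, graph_scores, delta1=1.0):
    """Final importance score I'_d(x_i) of eq. (14)."""
    return base_scores * graph_scores ** delta1
```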

III. SPOKEN TERM DETECTION WITH UTTERANCE SIMILARITY

When the user submits a query Q, the retrieval engine searches over all of the lattices to find the utterances containing Q, producing the first-pass returned list ranked by the relevance score S_Q(x). The spoken segment set retrieved in the first pass is denoted X_Q. From the success of acoustic feature-space pseudo-relevance feedback (PRF) [8], we know that if an utterance has a "hypothesized region" very similar to those of utterances with high relevance scores, it is more likely to be relevant, and its relevance score should be increased. A hypothesized region (Fig. 3) is the most probable occurrence of query Q in the utterance, i.e., the time span of the word arc in the lattice whose word hypothesis is exactly the query term Q and which has the highest posterior probability in the lattice. As shown in Fig. 3, the distance d(x_i, x_j) between two utterances x_i and x_j given query Q is the dynamic time warping distance [20] between the MFCC sequences corresponding to the time spans of their hypothesized regions. The feature-space similarity between utterances x_i and x_j is defined according to d(x_i, x_j). Here we introduce a graph-based approach for re-ranking the first-pass returned list with this feature-space similarity.
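For illustration, the distance d(x_i, x_j) could be computed with a plain DTW over the MFCC frames of the two hypothesized regions, as sketched below; the segment-based DTW of [20] actually used may differ in its local costs and path constraints, and the length normalization here is an added assumption.

```python
import numpy as np

def dtw_distance(mfcc_a, mfcc_b):
    """DTW distance between two MFCC sequences (frames x dims), used as
    d(x_i, x_j) between the hypothesized regions of two utterances."""
    n, m = len(mfcc_a), len(mfcc_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(mfcc_a[i - 1] - mfcc_b[j - 1])   # Euclidean frame distance
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)        # length-normalized; an implementation choice
```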

A. Baseline

We use as the first baseline the first-pass retrieval result, ranked according to the widely used query posterior probability. PRF is used as the second baseline.

1) First Pass: The relevance score S_Q(x) of utterance x with respect to query Q is defined as

    S_Q(x) = \sum_{word(a) = Q} P(a | x),    (15)

where a is any arc in the lattice of x, word(a) is the word hypothesis of a, and P(a | x) is its posterior probability. The first-pass retrieval result is ranked according to S_Q(x).
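A small sketch of (15), assuming the lattice of an utterance is available as a list of (word hypothesis, posterior probability) arcs; this representation is an assumption, not the actual lattice format used.

```python
def first_pass_score(lattice_arcs, query):
    """Relevance score S_Q(x) of eq. (15): the sum of the posterior probabilities
    of all lattice arcs whose word hypothesis equals the query term Q.

    lattice_arcs: iterable of (word_hypothesis, posterior) pairs for utterance x
    """
    return sum(p for word, p in lattice_arcs if word == query)
```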

2) Pseudo-Relevance Feedback: In PRF, a pseudo-relevant utterance set Y_Q is selected out of the first-pass retrieval result X_Q for query Q, and the similarity between each utterance x_i in the first-pass returned list and the set Y_Q is computed and integrated with the original relevance score. The top N utterances in the returned list (that is, the N utterances with the highest relevance scores) in X_Q are selected as Y_Q. The distance between utterance x_i and set Y_Q is

    D(x_i, Y_Q) = \sum_{x_j \in Y_Q} d(x_i, x_j)^2,    (16)

the total squared distance between x_i and all utterances in Y_Q. The value of D(x_i, Y_Q) is normalized to lie between 0 and 1 as \hat{D}(x_i, Y_Q), and the similarity between x_i and Y_Q is

    SIM(x_i, Y_Q) = 1 - \hat{D}(x_i, Y_Q),    (17)

i.e., 1 minus the normalized distance between x_i and Y_Q. The retrieval result is re-ranked according to S_Q(x_i) SIM(x_i, Y_Q)^{\delta}, where S_Q(x) is defined in (15).
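A sketch of the PRF baseline (16)-(17); the min-max normalization of D(x_i, Y_Q) to [0, 1] and the default parameter values are assumptions, since the paper does not specify them.

```python
import numpy as np

def prf_rescore(first_pass_scores, pairwise_dist, n_pseudo=10, delta=1.0):
    """Pseudo-relevance feedback re-scoring (eqs. (16)-(17)).

    first_pass_scores: array (N,), S_Q(x_i) over the first-pass returned list X_Q
    pairwise_dist:     array (N, N), d(x_i, x_j) between hypothesized regions
    n_pseudo:          N, the number of top-scoring utterances taken as Y_Q
    Returns S_Q(x_i) * SIM(x_i, Y_Q)^delta for re-ranking.
    """
    top = np.argsort(-first_pass_scores)[:n_pseudo]             # pseudo-relevant set Y_Q
    D = (pairwise_dist[:, top] ** 2).sum(axis=1)                # eq. (16)
    D_hat = (D - D.min()) / (D.max() - D.min() + 1e-12)         # normalize to [0, 1] (assumed min-max)
    return first_pass_scores * (1.0 - D_hat) ** delta           # eq. (17) and re-scoring
```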

B. Proposed Approach

To take into account utterance similarities in STD ranking, we first construct a graph representing the feature-space similarities between the first-pass retrieved utterances X_Q. Each utterance in X_Q is a node in the graph, and each utterance (node) is connected to its N_2 most similar utterances in feature space. The weight of the edge from utterance x_i to x_j is

    W_{sim}(x_i, x_j) = 1 - \frac{d(x_i, x_j) - d_{min}}{d_{max} - d_{min}},    (18)


where d_{max} and d_{min} are the largest and smallest values of d(x_i, x_j) over all pairs of utterances in X_Q; note that W_{sim}(x_i, x_j) and W_{sim}(x_j, x_i) are therefore equal. New feature-space similarity-based relevance scores are then obtained based on the graph structure. This graph has the same form as Fig. 2, but its edges instead represent feature-space similarities. The definitions of A_i and B_i are the same as in Section II-B.

The proposed relevance scores {R^G_Q(x_i), x_i \in X_Q} compose the value set satisfying

    R^G_Q(x_i) = (1 - \alpha) \hat{R}_Q(x_i) + \alpha \sum_{x_j \in B_i} P_{sim}(j, i) R^G_Q(x_j)    (19)

for all x_i \in X_Q, where

    \hat{R}_Q(x_i) = \frac{S_Q(x_i)}{\sum_{x_j \in X_Q} S_Q(x_j)}    (20)

is the normalized relevance score of utterance x_i, S_Q(x) is as defined in (15), and X_Q is the first-pass retrieved utterance set. P_{sim}(j, i) is the edge weight W_{sim}(x_j, x_i) normalized over the outgoing edges of utterance x_j on the graph:

    P_{sim}(j, i) = \frac{W_{sim}(x_j, x_i)}{\sum_{x_k \in A_j} W_{sim}(x_j, x_k)},    (21)

where A_j is again the set of utterances connected by the outgoing edges of x_j, and \alpha is an interpolation weight between 0 and 1.

Equation (19) shows that R^G_Q(x_i), the new relevance score of utterance x_i, depends on two factors: the posterior probability of query Q in the lattice of x_i (the first term on the right side of (19)), and the relevance scores of the similar utterances (the second term on the right side). Compared with PRF, which considers only similarities to utterances in the pseudo-relevant set, the proposed approach takes into consideration the relations among all the utterances retrieved. Specifically, PRF only raises the scores of utterances that are connected to other utterances with high relevance scores; the proposed method additionally lowers the relevance scores of utterances that are connected to other utterances with low relevance scores. The proposed approach can therefore be expected to outperform PRF. The normalizations in (20) and (21) formulate (19) as a random walk problem on a graph; as mentioned in Section II-B, the solution {R^G_Q(x_i), x_i \in X_Q} satisfying (19) is unique and nonnegative.

R^G_Q(x_i) is integrated with the original relevance score S_Q(x_i) for re-ranking as

    S'_Q(x_i) = S_Q(x_i) (R^G_Q(x_i))^{\delta_2},    (22)

where \delta_2 is a weighting parameter. The final retrieval result displayed to the user is ranked according to S'_Q(x_i).
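Putting (18)-(22) together, a minimal sketch of the proposed STD re-ranking might look as follows; the neighbor pruning, the small constants guarding divisions, the fixed iteration count, and all names are illustrative choices rather than details from the paper.

```python
import numpy as np

def std_graph_rerank(first_pass_scores, pairwise_dist, n_neighbors=10,
                     alpha=0.9, delta2=1.0, n_iter=100):
    """Graph-based re-ranking for STD (eqs. (18)-(22)).

    first_pass_scores: array (N,), S_Q(x_i) over the first-pass list X_Q (eq. (15))
    pairwise_dist:     array (N, N), DTW distances d(x_i, x_j)
    Returns the final scores S'_Q(x_i) of eq. (22).
    """
    N = len(first_pass_scores)
    off_diag = ~np.eye(N, dtype=bool)
    d_min, d_max = pairwise_dist[off_diag].min(), pairwise_dist[off_diag].max()
    W = 1.0 - (pairwise_dist - d_min) / (d_max - d_min + 1e-12)     # eq. (18)
    np.fill_diagonal(W, 0.0)                                        # no self-edges
    for i in range(N):                                              # keep only the N_2 nearest neighbors
        W[i, np.argsort(-W[i])[n_neighbors:]] = 0.0
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)  # eq. (21)
    r_hat = first_pass_scores / first_pass_scores.sum()             # eq. (20)
    scores = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        scores = (1 - alpha) * r_hat + alpha * (P.T @ scores)       # eq. (19)
    return first_pass_scores * scores ** delta2                     # eq. (22)
```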

IV. EXPERIMENTS

A. Corpus

As the testing archive we used a corpus of 33 hours of recorded lectures for a course offered at National Taiwan University, produced by a single instructor. Used for both retrieval and summarization, this corpus is quite noisy and spontaneous. The lectures were given in Mandarin Chinese (the "host" language) with English (the "embedded" language) terms and phrases embedded within the Mandarin utterances.

B. Summarization

1) Experimental Setup: For summarization experiments, both manual transcriptions (Manual) without word errors and the speech recognition output (ASR) for the lectures were used for testing. For speech recognition, the acoustic model was trained using the maximum likelihood criterion, with 4602 state-tied triphones spanned from 37 monophones, on a corpus of clean Mandarin read speech containing 24.6 hours of data produced by 100 males and 100 females, and was adapted with a 25.2-minute bilingual corpus from the target speaker (the course instructor) [21]. The language model was trained with two other courses offered by the same instructor and was adapted to the course slides. The accuracies of the ASR transcriptions were 78.15% for Mandarin characters, 53.44% for English words, and 76.26% overall. The unsupervised automatic key term extraction approach mentioned in Section II-A2 was used, and both keywords and key phrases were extracted [16]. The key term F-measures for the ASR and manual transcriptions were 52.60% and 55.84% respectively.

To evaluate the performance of the automatically generated summaries, we used the well-known evaluation package ROUGE [22]. The ROUGE-N F-measure (N = 1, 2, 3) and ROUGE-L were used to evaluate the summarization results. We segmented the whole lecture into 155 documents using topic segmentation [23], and extracted the summary for each document. As the test corpus we used 34 out of the 155 documents, for which reference summaries were produced manually. We used 32 topics for PLSA, and set α to 0.9. In the experiments presented below, the summarization ratio was set to 10%, 20%, and 30%. The automatically extracted key phrases (all with more than one word) were taken as individual terms in PLSA modeling and in all subsequent processing.
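For reference, ROUGE F-measures can be computed with, for example, the `rouge-score` Python package as sketched below; this is only an illustrative substitute for the original ROUGE toolkit [22] used in the experiments, and Chinese transcriptions would first need to be word-segmented.

```python
# pip install rouge-score   (illustrative substitute for the original ROUGE toolkit [22])
from rouge_score import rouge_scorer

def rouge_f_measures(reference_summary, system_summary):
    """Return ROUGE-1/2/L F-measures for one document (both inputs are
    whitespace-separated strings of terms)."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
    scores = scorer.score(reference_summary, system_summary)
    return {name: s.fmeasure for name, s in scores.items()}
```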

2) Experimental Results: Fig. 4 shows the ROUGE-N and ROUGE-L results for the ASR ((a)–(d)) and manual ((e)–(h)) transcriptions. In each case the three groups of bars are for the 10%, 20%, and 30% summarization ratios, and within each group the four bars are respectively the LTE-based statistical measure s_{LTE}(t_i, d) in (5), that measure followed by the proposed topical similarity graph (LTE + G), the key-term-based statistical measure in (7) (Key), and that measure followed by the proposed approach (Key + G). In all cases, the key-term-based statistical measure (bar 3) outperformed the LTE baseline (bar 1). Clearly key term knowledge was very helpful, especially for the manual transcriptions. This is probably because in the manual transcriptions all key terms were correctly transcribed (although they were sometimes incorrectly extracted), which allowed more accurate estimation of the key-term-based statistical measures. Similar but slightly smaller improvements were obtained for the ASR transcriptions.

Fig. 4. The results for different choices of parameters: LTE-based (1, 2) or key-term-based (3, 4), with (2, 4) or without (1, 3) the topical similarity graph, for ASR ((a)–(d)) or manual ((e)–(h)) transcriptions at summarization ratios of 10%, 20%, and 30%.

In all cases, the proposed approach considering utterance relationships based on the topical similarity graph improved on the LTE-based statistical measure (bar 2 vs bar 1), except for ASR ROUGE-3 at the 20% summarization ratio. For the ASR transcriptions ((a)–(d)), the proposed approach also improved on the key-term-based statistical measure (bar 4 vs bar 3). However, at the 10% and 30% summarization ratios for the manual transcriptions ((e)–(h)), the proposed approach did not similarly help the key-term-based statistical measure (bar 4 vs bar 3). This may be because for the manual transcriptions, important utterances were already well represented by the key-term-based statistical measure; hence adding extra topical similarity among utterances did not lead to better performance.

C. Spoken Term Detection

1) Experimental Setup: A tri-gram language model trained on news data was used in speech recognition. In order to evaluate the performance of the proposed approach with acoustic models of different degrees of matching, we used three sets of acoustic models:

- A speaker-independent model (SI) trained on 24.6 hours of read speech produced by 100 male and 100 female speakers.

- An MLLR model (MLLR) adapted from the above SI model on 500 utterances taken from the training set of the lecture corpus.

- A speaker-dependent model (SD) trained on 12 hours of the training set of the lecture corpus, all produced by the same speaker as in the retrieval corpus.

For all acoustic models, we trained 4602 state-tied triphones spanned from 37 monophones. The recognition accuracies were 50.26%, 62.55%, and 81.34% respectively for the SI, MLLR, and SD models.

TABLE I
MAP results for the baseline, PRF, and the proposed approach with various acoustic models ("Impr." is the absolute MAP improvement over the first pass).

Methods       SI              MLLR            SD
              MAP    Impr.    MAP    Impr.    MAP    Impr.
First pass    45.47  -        55.54  -        73.52  -
PRF           52.10  6.63     61.59  6.05     75.78  2.26
Proposed      53.42  7.95     63.78  8.24     76.71  3.19

Mean average precision (MAP) was used as the measure for retrieval performance evaluation. A total of 162 Mandarin queries, each a single word, were manually selected for the tests.
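For clarity, MAP can be computed as sketched below, where each query's ranked utterance list is compared against the set of utterances that truly contain the query term; the data structures are assumptions for the sketch.

```python
def average_precision(ranked_utterance_ids, relevant_ids):
    """Average precision for one query: the mean of precision@k over the
    ranks k at which a relevant utterance appears."""
    hits, precisions = 0, []
    for k, uid in enumerate(ranked_utterance_ids, start=1):
        if uid in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(ranked_lists, relevance):
    """MAP over all queries; `relevance[q]` is the set of relevant utterance ids."""
    return sum(average_precision(ranked_lists[q], relevance[q])
               for q in ranked_lists) / len(ranked_lists)
```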

2) Experimental Results: The results of the first-pass retrieval for the three sets of acoustic models are listed in the first row of Table I as the first baseline. Clearly the performance is heavily dependent on the quality of the acoustic model. The second row is PRF (described in Section III-A2), which outperforms the first baseline regardless of the quality of the acoustic model. PRF serves as the second baseline.

The results of integrating the original score in (15) with the proposed scores satisfying (19), as in (22), are shown in the third row of Table I. The integration with the scores derived from utterance similarity yields better performance than the first-pass results for all acoustic models, especially with the poorer acoustic models (SI and MLLR), i.e., when the original relevance scores are less precise. It also clearly outperforms the PRF approach. This shows the effectiveness of taking into account the complete set of relationships among all the utterances retrieved.

V. CONCLUSIONS

We have presented graph-based approaches that take into account utterance similarity for summarization and STD. Both approaches take utterances as nodes on a graph; edge weights for summarization represent topical similarities, while those for STD represent feature-space similarities.

Encouraging results were obtained in the experiments for both tasks.

REFERENCES

[1] L.-S. Lee and B. Chen, "Spoken document understanding and organization," IEEE Signal Processing Magazine, vol. 22, pp. 42–60, 2005.

[2] C. Chelba, T. J. Hazen, and M. Saraclar, "Retrieval and browsing of spoken content," IEEE Signal Processing Magazine, vol. 24, pp. 39–49, 2008.

[3] National Institute of Standards and Technology, The spoken term detection (STD) 2006 evaluation plan, 2006.

[4] C. Chelba and A. Acero, “Position specific posterior lattices for indexing speech,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005.

[5] M. Saraclar, "Lattice-based search for spoken utterance retrieval," in Proceedings of HLT-NAACL, 2004.

[6] D. Wang, J. Tejedor, J. Frankel, S. King, and J. Colas, “Posterior-based confidence measures for spoken term detection,” in ICASSP, 2009.

[7] S. Furui, T. Kikuchi, Y. Shinnaka, and C. Hori, "Speech-to-text and speech-to-speech summarization of spontaneous speech," IEEE Transactions on Speech and Audio Processing, vol. 12, pp. 401–408, 2004.

[8] C.-P. Chen, H.-Y. Lee, C.-F. Yeh, and L.-S. Lee, "Improved spoken term detection by feature space pseudo-relevance feedback," in INTERSPEECH, 2010.


[9] Y.-N. Chen, C.-P. Chen, H.-Y. Lee, C.-A. Chan, and L.-S. Lee, "Improved spoken term detection with graph-based re-ranking in feature space," in ICASSP, 2011.

[10] Y.-N. Chen, Y. Huang, C.-F. Yeh, and L.-S. Lee, "Spoken lecture summarization by random walk over a graph constructed with automatically extracted key terms," under review, 2011.

[11] G. Erkan and D. R. Radev, “Lexrank: graph-based lexical centrality as salience in text summarization,” J. Artif. Int. Res., vol. 22, pp. 457–479, 2004.

[12] J. Otterbacher, G. Erkan, and D. R. Radev, "Biased lexrank: Passage retrieval using random walks with question-based priors," Inf. Process. Manage., vol. 45, pp. 42–54, 2009.

[13] T. Hofmann, “Probabilistic latent semantic analysis,” in UAI, 1999.

[14] S.-Y. Kong and L.-S. Lee, “Improved spoken document summarization using probabilistic latent semantic analysis (PLSA),” in ICASSP, 2006.

[15] S.-Y. Kong and L.-S. Lee, “Improved summarization of chinese spoken documents by probabilistic latent semantic analysis (PLSA) with further analysis and integrated scoring,” in SLT, 2006.

[16] Y.-N. Chen, Y. Huang, S.-Y. Kong, and L.-S. Lee, “Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features,” in SLT, 2010.

[17] L. Lovász, "Random walks on graphs: A survey," 1993.

[18] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," in WWW, 1998.

[19] A. N. Langville and C. D. Meyer, “A survey of eigenvector methods for web information retrieval,” SIAM Rev., vol. 47, pp. 135–161, January 2005.

[20] C.-A. Chan and L.-S. Lee, “Unsupervised spoken term detection with spoken queries using segment-based dynamic time warping,” in InterSpeech, 2010.

[21] C.-F. Yeh, L.-C. Sun, C.-Y. Huang, and L.-S. Lee, “Bilingual acoustic modeling with state mapping and three-stage adaptation for transcribing unbalanced code-mixed lectures,” in ICASSP, 2011.

[22] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Workshop on Text Summarization Branches Out, 2004.

[23] S.-C. Hsu, "Topic segmentation on lecture corpus and its application," M.S. thesis, National Taiwan University, 2008.
