
Prosody-Based Unsupervised Speech Summarization with Two-Layer Mutually Reinforced Random Walk

Sujay Kumar Jauhar, Yun-Nung Chen, and Florian Metze
School of Computer Science, Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213-3891, USA
{sjauhar,yvchen,fmetze}@cs.cmu.edu

Abstract

This paper presents a graph-based model that integrates prosodic features into an unsupervised speech summarization framework without any lexical information. In particular, it builds on previous work using mutually reinforced random walks, in which a two-layer graph structure is used to select the most salient utterances of a conversation. The model consists of one layer of utterance nodes and another layer of prosody nodes. The random walk algorithm propagates scores between layers so that shared information can be used to select the highest-scoring utterance nodes as the summary. A comparative evaluation of our prosody-based model against several baselines on a corpus of academic multi-party meetings reveals that it performs competitively on very short summaries, and better on longer summaries, according to ROUGE scores as well as the average relevance of selected utterances.

1 Introduction

Automatic extractive speech summarization (Hori and Furui, 2001) has garnered considerable interest in the natural language processing research community for its immediate application in making large volumes of multimedia documents more accessible. Several variants of speech summarization have been studied in a range of target domains, including news (Hori et al., 2002; Maskey and Hirschberg, 2003), lectures (Glass et al., 2007; Chen et al., 2011) and multi-party meetings (Banerjee and Rudnicky, 2008; Liu and Liu, 2010; Chen and Metze, 2012b).

Research in speech summarization, unlike its text-based counterpart, carries intrinsic difficulties, which originate in the noisy nature of the data under consideration: imperfect ASR transcripts due to recognition errors, lack of proper segmentation, etc. However, speech also offers some advantages, making it possible to leverage extra-textual information such as emotion and other speaker states through the incorporation of prosodic knowledge into the summarization model.

A study by Maskey and Hirschberg (2005) on the relevance of various levels of linguistic knowledge (including lexical, prosodic and discourse structure) showed that enhancing a summarizer with prosodic information leads to more accurate and informed results.

In this work we extend the model proposed by Chen and Metze (2012c), where a random walk is performed on a lexico-topical graph structure to yield summaries. They exploited intra- and inter-speaker relationships through partial topic sharing to judge the importance of utterances in the context of multi-party meetings. This paper, in contrast, enriches the underlying graph structure with prosodic information, rather than lexico-topical knowledge, to model speaker states and emotions.

Also unlike Maskey and Hirschberg (2005), we model the multimedia document structure as a graph, which allows for both flexibility and expressive power in representation. This graph structure permits the easy incorporation of targeted features into the model, as well as in-depth analyses of the contribution of individual features towards representing speaker information.

To the best of our knowledge this paper presents the first attempt at performing speech summarization using no lexical information in a completely unsupervised setting. Maskey and Hirschberg (2006) use an HMM to perform summarization by relying solely on prosodic features. However, their model, unlike ours, is supervised. The only requirement of the model in this paper is a pre-processing step that segments the audio into "utterances".

While utterance segmentation may be a non-trivial problem, the possibility of an unsupervised speech summarization model that relies solely on acoustic input is advantageous. Importantly, it does not rely on any training data and circumvents the primary difficulty that plagues most speech summarization techniques: the noise introduced into the system by imperfect speech recognition.

We evaluate our model on a dataset consisting of multi-party academic meetings (Chen and Metze, 2012b; Banerjee and Rudnicky, 2008). We perform evaluation using the ROUGE metric for automatic summarization, which counts n-gram overlap between reference and candidate summaries. We also run a post-hoc analysis, which measures the average relevance score of the utterances in a candidate summary.

Evaluation results indicate that our model outperforms a number of baselines across varying experimental settings in all but the shortest summaries. We hence claim that our model is a robust, flexible, and effective framework for unsupervised speech summarization.

The rest of the paper is organized as follows. Section 2 describes the prosodic features encoded in the model and how they are extracted. Section 3 presents the construction of the two-layer graph and the mutually reinforced random walk for propagating information through the graph. Section 4 shows experimental results of applying the proposed model to the dataset of academic meetings and discusses the effects of prosody on summarization. Section 5 concludes.

2 Prosodic Feature Extraction

As previously stated, the only prerequisite of the model proposed in this paper is a segmentation of the input document into chunks dictated by some meaningful notion of utterances. Once the audio has been segmented utterance-wise, the rest of the pipeline is effectively agnostic to all but its acoustic properties.

Given a set of pre-segmented audio files, we extract the following prosodic features from them using PRAAT scripts (Huang et al., 2006).

• Number of syllables and number of pauses.

• Duration time, which is the speaking time including pauses, and phonation time, which is the speaking time excluding pauses.

• Speaking rate and articulation rate, which are the number of syllables divided by the duration time and the phonation time, respectively.

• The average, maximal and minimal fundamental frequencies measured in Hz (which objectify the perceptive notion of pitch).

• The energy measured in Pa²/sec and the intensity measured in dB.

[Figure 1: A simplified example of the two-layer graph, where a prosodic feature Pi is represented as a node in the prosody layer and an utterance Uj is represented as a node in the utterance layer.]

The features above were included in the model because of their possible contribution to the notion of "important utterances" in a dialogue. For example, pitch is intuitively a vocal channel for emotions such as anger or embarrassment; it may thus contribute, via the emotional investment of the speaker, to the importance of her utterances. Similarly, the variation of energy over an utterance determines its perceived loudness, possibly permitting the inference of emphasis or stress placed on particular utterances by speakers. Finally, speech rate often acts as a latent channel for communicating information, with excitement or emphasis implicitly conveyed by a speaker.
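To make this concrete, the sketch below shows per-utterance extraction of several of these features, assuming the parselmouth Python wrapper around Praat rather than the original PRAAT scripts of Huang et al. (2006). The pause count here is a crude intensity-threshold stand-in for the Praat-script pause detector, the syllable-based features (and hence speaking/articulation rates) are omitted, and all names and thresholds are ours.

```python
import numpy as np
import parselmouth  # pip install praat-parselmouth

def prosodic_features(wav_path, start, end, silence_db=-25.0):
    """Prosodic feature dict for the utterance spanning [start, end] seconds."""
    snd = parselmouth.Sound(wav_path).extract_part(from_time=start, to_time=end)
    f0 = snd.to_pitch().selected_array['frequency']
    f0 = f0[f0 > 0]                                # keep voiced frames only
    intensity = snd.to_intensity()
    db = intensity.values.flatten()
    frame_dt = intensity.xs()[1] - intensity.xs()[0]
    speaking = db > db.max() + silence_db          # frames above the silence floor
    duration = snd.get_total_duration()            # speaking time incl. pauses
    phonation = float(speaking.sum()) * frame_dt   # speaking time excl. pauses
    # Pauses approximated as runs of below-threshold frames (crude stand-in).
    n_pauses = int(np.sum(np.diff(speaking.astype(int)) == -1))
    # Simple sum-of-squares approximation of the energy.
    energy = float(np.sum(snd.values ** 2) / snd.sampling_frequency)
    return {
        'n_pauses': n_pauses,
        'duration': duration, 'phonation': phonation,
        'f0_mean': float(f0.mean()) if f0.size else 0.0,
        'f0_max': float(f0.max()) if f0.size else 0.0,
        'f0_min': float(f0.min()) if f0.size else 0.0,
        'intensity_db': float(db.mean()),
        'energy': energy,
    }
```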

3 Two-Layer Mutually Reinforced Random Walk

In this section we describe our method for modelling speech data as a two-layer interconnected graph structure, over which the mutually reinforced random walk algorithm is run for summarization.


Given an input speech document that is suitably segmented into utterance chunks, we construct a linked two-layer graph $G$ containing an utterance set $V_U$ and a prosody set $V_P$. Each node $U_i \in V_U$ corresponds to a single utterance as obtained from the pre-processing "chunking" step. Each node $P_i \in V_P$ represents a single prosodic feature incorporated into the model.

Figure 1 shows a simplified example of such a two-layer graph. Formally, $G = \langle V_U, V_P, E_{UP}, E_{PU} \rangle$, where $V_U = \{U_i\}$, $V_P = \{P_i\}$, $E_{UP} = \{e_{ij} \mid U_i \in V_U, P_j \in V_P\}$, and $E_{PU} = \{e_{ij} \mid P_i \in V_P, U_j \in V_U\}$. Here $E_{UP}$ and $E_{PU}$ are the sets of directional edges between utterance and prosody nodes in the two directions (Cai and Li, 2012).

Based on these sets of directional edges we further define $L_{UP} = [w_{i,j}]_{|V_U| \times |V_P|}$ and $L_{PU} = [w_{j,i}]_{|V_P| \times |V_U|}$. The matrices $L_{UP}$ and $L_{PU}$ encode the directional relationships between utterances and prosodic features: concretely, the entry $w_{i,j}$ of $L_{UP}$ is the value of the prosodic feature $P_j$ extracted from the utterance $U_i$. Row normalization is performed on $L_{UP}$ and $L_{PU}$ (Shi and Malik, 2000); note that, as a consequence, $L_{UP}$ differs from $L_{PU}^T$.
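As an illustration, here is a minimal sketch of constructing $L_{UP}$ and $L_{PU}$ from a raw utterance-by-feature matrix. The per-column min-max scaling is our assumption (the paper specifies only row normalization), added because the raw features live on very different scales (Hz, dB, seconds).

```python
import numpy as np

def build_transition_matrices(W, eps=1e-12):
    """W[i, j] = raw value of prosodic feature P_j extracted from utterance U_i."""
    W = np.asarray(W, dtype=float)
    # Assumed preprocessing: rescale each feature column to [0, 1] so that
    # features measured in different units become comparable.
    W = (W - W.min(axis=0)) / (W.max(axis=0) - W.min(axis=0) + eps)
    L_UP = W / (W.sum(axis=1, keepdims=True) + eps)      # utterance -> prosody
    L_PU = W.T / (W.T.sum(axis=1, keepdims=True) + eps)  # prosody -> utterance
    return L_UP, L_PU
```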

A traditional random walk operates only on a single layer of the graph structure and integrates the initial similarity scores with the scores propagated from other utterance nodes (Chen et al., 2011; Chen and Metze, 2012a; Hsu et al., 2007). The approach adopted in this paper, however, incorporates prosodic information by propagating information between layers based on external mutual reinforcement (Chen and Metze, 2012c).

Effectively, the working of the algorithm stems from two interrelated intuitions. On the one hand, utterances that show more pronounced signs of important prosodic features should themselves be judged more salient. On the other hand, prosodic features that take higher values in salient utterances should themselves be deemed more important.

The advantage of the algorithm is that it is entirely unsupervised and allows for the integration of knowledge-rich, target-specific features. The mathematical formulation of the algorithm is presented as follows.

Given initial scores $F_U^{(0)}$ and $F_P^{(0)}$ for the utterance and prosody nodes respectively, the update rule is given by:

$$
\begin{cases}
F_U^{(t+1)} = (1 - \alpha) F_U^{(0)} + \alpha \cdot L_{UP} F_P^{(t)} \\
F_P^{(t+1)} = (1 - \alpha) F_P^{(0)} + \alpha \cdot L_{PU} F_U^{(t)}
\end{cases}
\tag{1}
$$

Here $F_U^{(t)}$ and $F_P^{(t)}$ integrate the initial importance associated with their respective nodes with the score obtained by between-layer propagation at iteration $t$.

Hence, the scores in each layer are iteratively and mutually updated by the scores from the other layer. In particular, utterances that exhibit more pronounced signs of important prosodic features are progressively scored higher. At the same time, prosodic features that appear with higher values in salient utterances become progressively more important.

For the utterance set, the update rule increments the importance of nodes with the term $L_{UP} F_P^{(t)}$, which can be viewed as the score from linked nodes in the prosody set, weighted by prosodic feature values. Finally, the parameter $\alpha$ encodes the trade-off between the initial utterance weight and information sharing via propagation. The algorithm converges to the fixed point satisfying (2).

$$
\begin{cases}
F_U = (1 - \alpha) F_U^{(0)} + \alpha \cdot L_{UP} F_P \\
F_P = (1 - \alpha) F_P^{(0)} + \alpha \cdot L_{PU} F_U
\end{cases}
\tag{2}
$$

Additionally, $F_U$ has an analytical solution:

$$
\begin{aligned}
F_U &= (1 - \alpha) F_U^{(0)} + \alpha \cdot L_{UP} \big( (1 - \alpha) F_P^{(0)} + \alpha \cdot L_{PU} F_U \big) \\
&= (1 - \alpha) F_U^{(0)} + \alpha (1 - \alpha) L_{UP} F_P^{(0)} + \alpha^2 L_{UP} L_{PU} F_U \\
&= \big( (1 - \alpha) F_U^{(0)} e^T + \alpha (1 - \alpha) L_{UP} F_P^{(0)} e^T + \alpha^2 L_{UP} L_{PU} \big) F_U \\
&= M F_U,
\end{aligned}
\tag{3}
$$

where $e = [1, 1, \dots, 1]^T$ (the third step uses $e^T F_U = 1$, since the scores are normalized to sum to one). The closed-form solution $F_U$ of (3) is the dominant eigenvector of $M$ (Langville and Meyer, 2005).

It may be noted that for the practical implementation of the algorithm, we set the initial scores of the utterance nodes $F_U^{(0)}$ and the prosody nodes $F_P^{(0)}$ to have equal importance. We also empirically set $\alpha = 0.9$ for all our experiments, since several studies have shown that $(1 - 0.9)$ is a proper damping factor (Hsu et al., 2007; Brin and Page, 1998).
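The following is a compact sketch of iterating the update rule (1) to the fixed point (2). The uniform initialization mirrors the equal-importance setting just described; the convergence tolerance and iteration cap are our choices.

```python
import numpy as np

def mutually_reinforced_walk(L_UP, L_PU, alpha=0.9, tol=1e-9, max_iter=1000):
    """Iterate Eq. (1) until the fixed point of Eq. (2) is approximately reached."""
    n_u, n_p = L_UP.shape
    F_U0 = np.full(n_u, 1.0 / n_u)   # equal initial importance for utterances
    F_P0 = np.full(n_p, 1.0 / n_p)   # equal initial importance for prosody nodes
    F_U, F_P = F_U0.copy(), F_P0.copy()
    for _ in range(max_iter):
        F_U_next = (1 - alpha) * F_U0 + alpha * (L_UP @ F_P)
        F_P_next = (1 - alpha) * F_P0 + alpha * (L_PU @ F_U)
        delta = max(np.abs(F_U_next - F_U).max(), np.abs(F_P_next - F_P).max())
        F_U, F_P = F_U_next, F_P_next
        if delta < tol:
            break
    return F_U, F_P  # converged utterance and prosody-node scores
```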


4 Experiments

4.1 Pre-processing – Time Alignment

We have previously stressed that while our model is independent of the lexical representation of an audio document, it does rely on a pre-processing step that chunks the document into individual utterances. This may not be a trivial task.

Speaker diarization (Tranter and Reynolds, 2006) and utterance segmentation (Christensen et al., 2005; Geertzen et al., 2007) are open areas of research in the NLP community. Systems developed for these purposes may be used to produce the initial chunking required by our model. In this paper, however, we do not explore these methods and instead rely on the segmentation obtained from manually produced textual transcripts, in order to study the efficacy of our model in isolation.

A second reason for using textual transcripts is the presentation of the experimental evaluation. This form of data allows for tangible results obtained through evaluation metrics such as ROUGE, which rely on measuring n-gram overlap between reference and candidate summaries. Furthermore, the resulting textual surface forms and summaries are more "semantically" interpretable.

To associate prosodic information with the textual realization of each utterance in a manual transcript, a preprocessing step requires time alignments between the audio and the corresponding text of each utterance. Note that this step is unnecessary when manual transcripts are not present and utterance chunking is obtained by some other, automatic means; the time alignment is then implicitly obtained in the process of utterance segmentation.

To accomplish the alignment in our experimental framework, a speech recognizer is first used to produce an ASR output of the audio document. A by-product of this step is that each recognized token carries an inherent time signature. Using Viterbi alignment between the ASR output and the manual transcription, the time signatures from the audio are projected onto each manually transcribed utterance.

We experimented with Viterbi alignment at a number of different levels of granularity, including token level, character level, and phoneme level (via conversion of the text to a phonetic representation using the CMU pronunciation dictionary (Weide, 1998)). The latter was empirically found to produce the most fine-grained and precise alignments, and was consequently used in all our experiments.
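For intuition, here is a rough token-level stand-in for this projection step built on Python's difflib. This is our simplification; the Viterbi alignment used in the paper, particularly at the phoneme level, is more precise.

```python
import difflib

def project_times(asr_tokens, asr_times, ref_tokens):
    """Copy per-token time stamps from the ASR output onto a manual transcript.

    asr_times[k] is the time signature of asr_tokens[k]; a crude token-level
    stand-in for the Viterbi alignment described above.
    """
    matcher = difflib.SequenceMatcher(a=asr_tokens, b=ref_tokens, autojunk=False)
    times = [None] * len(ref_tokens)
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            times[b + k] = asr_times[a + k]
    return times  # unmatched reference tokens keep None; interpolate as needed
```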

4.2 Corpus

The dataset used in our evaluation is the same one previously employed by Chen and Metze (2012b). It consists of 10 meetings held between April and June 2006, with largely overlapping participants and topics of discussion. There were a total of 6 unique participants, with each meeting involving between 2 and 4 individual speakers. SmartNotes (Banerjee and Rudnicky, 2008) was used to record both the audio and the notes for each meeting.

The average duration of a meeting in the dataset was approximately 28 minutes, and the total number of utterances was 7123. We use only the manual transcripts of the meetings to evaluate our model, although ASR transcripts were used for time alignment.

The reference summaries are produced by selecting the set of the most "noteworthy" utterances. Two annotators manually labelled the degree of "noteworthiness" (on a relevance scale of 1 to 3) for each utterance. We extract all the utterances with a noteworthiness level of 3 to form the reference summary of each meeting.

4.3 Baselines

Several baselines were used for comparison against our model and are described below.

1. Longest: The first baseline simply selects the longest utterances to form a summary of a document (where the length of the extracted summary is based on the desired ratio). We define the length of utterances by the number of tokens they contain.

2. Begin: The second baseline selects the utterances that appear at the beginning of the document.

3. LTE: The third baseline is a summary produced using Latent Topic Entropy (LTE) (Kong and Lee, 2011). This measure essentially estimates the "focus" of an utterance. Hence, theoretically, a lower topic entropy relates to a more topically informative utterance, which in turn translates into a noteworthy utterance to include in a summary.

4. TF-IDF: The final baseline uses basic TF-IDF to measure the importance of each utterance, by taking the averaged TF-IDF score over each of its individual words.

F-measure      ROUGE-1               ROUGE-L
               10%    20%    30%     10%    20%    30%
Longest        34.05  52.48  61.11   33.66  52.10  60.77
Begin          35.45  54.42  64.63   35.28  54.18  64.37
LTE            35.16  54.67  64.97   35.03  54.54  64.76
TFIDF          32.01  51.33  63.11   31.89  51.08  62.84
This Paper     35.33  55.17  65.60   35.09  54.90  65.36

Table 1: ROUGE scores (%) on the multi-party meeting dataset

It may be noted that the topic distribution of words as well as their IDF scores were obtained by computing statistics over all ten meetings in our experimental dataset.
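One possible rendering of the TF-IDF baseline follows, assuming scikit-learn's TfidfVectorizer (an external choice; the paper does not name an implementation). The IDF statistics come from whatever utterance collection is passed in, which should span all ten meetings to match the setup above; the average here is over the distinct words of each utterance.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_utterance_scores(utterances):
    """Score each utterance by the average TF-IDF weight of its words."""
    tfidf = TfidfVectorizer().fit_transform(utterances)   # rows = utterances
    n_terms = np.maximum((tfidf > 0).sum(axis=1).A1, 1)   # distinct words per row
    return np.asarray(tfidf.sum(axis=1)).ravel() / n_terms
```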

4.4 Evaluation Metrics

Our automated evaluation uses the standard DUC (Document Understanding Conference) evaluation metric, ROUGE (Lin, 2004), which measures recall over various n-gram statistics between a system-generated summary and a set of summaries produced by humans. We report F-measures for ROUGE-1 (unigram) and ROUGE-L (longest common subsequence), which are evaluated in exactly the same way.
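As a toy illustration of the unigram variant, here is a simplified unigram-overlap F-measure; actual evaluations should use the ROUGE package of Lin (2004), and the inputs here are assumed to be pre-tokenized word lists.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap F-measure between two token lists (simplified ROUGE-1)."""
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())       # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```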

We also use a post-hoc evaluation metric to measure the average "importance" of the utterances in a summary. This metric assigns a relevance score to a summary by averaging the noteworthiness scores of its utterances, as obtained from human annotators.

4.5 Results and Discussion

We ran each of the baseline summarizers, as well as the system proposed in this paper, to produce 10%, 20% and 30% summaries of each meeting in the dataset. The length of a summary was determined by selecting the top-scored utterances (as ranked by a given system) until the desired ratio of the number of tokens in the summary to the total length of its corresponding meeting was met.
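A sketch of this selection procedure follows; the stop-once-the-budget-is-reached behavior and whitespace tokenization are our reading of the ratio criterion described above.

```python
def extract_summary(utterances, scores, ratio=0.1):
    """Take top-scored utterances until the token budget is met; keep document order."""
    total_tokens = sum(len(u.split()) for u in utterances)
    budget = ratio * total_tokens
    ranked = sorted(range(len(utterances)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in ranked:
        if used >= budget:
            break
        picked.append(i)
        used += len(utterances[i].split())
    return [utterances[i] for i in sorted(picked)]
```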

Evaluation results on the ROUGE metric are presented in Table 1. They reveal that the performance of our prosody-based model is competitive with the other baselines on the shortest 10% summaries; in fact, it ranks second, scoring lower only than the baseline that takes the beginning of a document as the summary. On the longer 20% and 30% summaries, the system outperforms all the baselines.

We believe that in the case of very short summaries, the nature of the data under consideration biases the evaluation in favor of the "begin" baseline. The meetings generally commence with a presentation of an agenda, which contains key terms that are likely to be discussed during the rest of the session. In this scenario, a metric such as ROUGE, which effectively measures n-gram overlap, rewards the "begin" summary for including key terms that appear several times in the gold-standard summaries.

For longer summaries, however, where lexical variation is more pronounced, prosodic information provides a robust source of intelligence for selecting noteworthy utterances. In fact, we are surprised that it outperforms the lexically derived LTE and TF-IDF baselines in all evaluation configurations.

Overall, these results seem to suggest that our model is able to capture latent speaker information and incorporate it effectively into the process of extractive summarization.

We further test this conclusion by conducting a post-hoc analysis, in which we examine the average "importance" of the utterances in the summary produced by a particular system. More specifically, we measure the average relevance score of the utterances, on a scale of 1 to 3, where the score of each utterance is derived from its noteworthiness level as judged by human annotators (Banerjee and Rudnicky, 2008). The results of this analysis are presented in Table 2.

While the "begin" baseline produces the summaries with the highest relevance score for the shortest 10% summaries, our model outperforms all other systems on the longer 20% and 30% summaries. Moreover, it is competitive with the "begin" baseline even on the shortest summaries, and scores higher than the other baselines. These results align with the findings in Table 1.


Avg. Relevance   10%     20%     30%
Longest          2.299   2.272   2.283
Begin            2.464   2.402   2.398
LTE              2.334   2.369   2.367
TFIDF            2.355   2.363   2.375
This paper       2.454   2.422   2.411

Table 2: Average relevance scores on the multi-party meeting dataset

As an auxiliary analysis, we also extract the converged scores of the prosody nodes and rank them in order to analyze their effectiveness. The ranking reveals that the number of pauses in an utterance, its minimum and average pitch, and its intensity tend to be the most predictive features.

In the context of academic meetings, the number of pauses may be indicative of the time a speaker takes to formulate and articulate his/her thoughts. Thus more pauses may indicate utterances that have been more carefully crafted and therefore include more relevant content. Pitch and intensity are generally good measures of important information, because speakers tend to use them to express emotion. This fact has previously been successfully leveraged for key term extraction (Chen et al., 2010).

Conversely, the duration time of an utterance, its number of syllables, and its energy are the least predictive features. With the exception of energy, the other two features can be considered surrogate measures for the length of utterances, paralleling lexically what the "longest" baseline does. This finding corresponds to the results in Tables 1 and 2, which show that this baseline does not produce particularly relevant summaries.

5 Conclusion

Our paper proposes a novel approach to integrating speaker-state information, through the incorporation of prosodic knowledge, into an unsupervised model for extractive speech summarization. We have also presented the first attempt at performing unsupervised speech summarization without using any lexical information.

We have presented experiments on a dataset of academic meetings involving spoken interactions between multiple parties. Evaluation results indicate that our model extracts relevant utterances as summaries, both from the perspective of automatic evaluation metrics such as ROUGE and of a post-hoc metric measuring the average relevance score of the utterances within summaries. In addition, our model compared favorably with a number of heuristic and lexically derived baselines, outperforming them in all but one scenario. This substantiates its claim to being a robust and viable method for completely unsupervised speech summarization.

References

Satanjeev Banerjee and Alexander I. Rudnicky. 2008. An extractive-summarization baseline for the automatic detection of noteworthy utterances in multi-party human-human dialog. In Proceedings of the 2nd IEEE Workshop on Spoken Language Technology (SLT), pages 177–180. IEEE.

Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117.

Xiaoyan Cai and Wenjie Li. 2012. Mutually reinforced manifold-ranking based relevance propagation model for query-focused multi-document summarization. IEEE Transactions on Audio, Speech, and Language Processing, 20:1597–1607.

Yun-Nung Chen and Florian Metze. 2012a. Integrating intra-speaker topic modeling and temporal-based inter-speaker topic modeling in random walk for improved multi-party meeting summarization. In Proceedings of the 13th Annual Conference of the International Speech Communication Association (INTERSPEECH).

Yun-Nung Chen and Florian Metze. 2012b. Intra-speaker topic modeling for improved multi-party meeting summarization with integrated random walk. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 377–381, Montréal, Canada, June. Association for Computational Linguistics.

Yun-Nung Chen and Florian Metze. 2012c. Two-layer mutually reinforced random walk for improved multi-party meeting summarization. In Proceedings of the 4th IEEE Workshop on Spoken Language Technology (SLT).

Yun-Nung Chen, Yu Huang, Sheng-Yi Kong, and Lin-Shan Lee. 2010. Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features. In Spoken Language Technology Workshop (SLT), 2010 IEEE, pages 265–270. IEEE.

Yun-Nung Chen, Yu Huang, Ching-Feng Yeh, and Lin-Shan Lee. 2011. Spoken lecture summarization by random walk over a graph constructed with automatically extracted key terms. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH).

Heidi Christensen, BalaKrishna Kolluru, Yoshihiko Gotoh, and Steve Renals. 2005. Maximum entropy segmentation of broadcast news. IEEE Signal Processing Society Press.

Jeroen Geertzen, Volha Petukhova, and Harry Bunt. 2007. A multidimensional approach to utterance segmentation and dialogue act classification. In Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, Antwerp, pages 140–149.

James R. Glass, Timothy J. Hazen, D. Scott Cyphers, Igor Malioutov, David Huynh, and Regina Barzilay. 2007. Recent progress in the MIT spoken lecture processing project. In Proceedings of the 8th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 2553–2556.

Chiori Hori and Sadaoki Furui. 2001. Advances in automatic speech summarization. RDM, 80:100.

Chiori Hori, Sadaoki Furui, Rob Malkin, Hua Yu, and Alex Waibel. 2002. Automatic speech summarization applied to English broadcast news speech. In Proceedings of ICASSP, volume 1, pages 9–12.

Winston H. Hsu, Lyndon S. Kennedy, and Shih-Fu Chang. 2007. Video search reranking through random walk over document-level context graph. In Proceedings of the 15th International Conference on Multimedia, pages 971–980. ACM.

Zhongqiang Huang, Lei Chen, and Mary Harper. 2006. An open source prosodic feature extraction tool. In Proceedings of the Language Resources and Evaluation Conference.

Sheng-Yi Kong and Lin-Shan Lee. 2011. Semantic analysis and organization of spoken documents based on parameters derived from latent topics. IEEE Transactions on Audio, Speech, and Language Processing.

Amy N. Langville and Carl D. Meyer. 2005. A survey of eigenvector methods for web information retrieval. SIAM Review, 47(1):135–161.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.

Fei Liu and Yang Liu. 2010. Using spoken utterance compression for meeting summarization: A pilot study. In Proceedings of the 3rd IEEE Workshop on Spoken Language Technology (SLT).

Sameer Raj Maskey and Julia Hirschberg. 2003. Automatic summarization of broadcast news using structural features. In Proceedings of Eurospeech.

Sameer Maskey and Julia Hirschberg. 2005. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Proceedings of InterSpeech.

Sameer Maskey and Julia Hirschberg. 2006. Summarizing speech without text using hidden Markov models. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 89–92. Association for Computational Linguistics.

Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.

Sue E. Tranter and Douglas A. Reynolds. 2006. An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1557–1565.

R. L. Weide. 1998. The CMU pronunciation dictionary, release 0.6.
