Sujay Kumar Jauhar, Yun-Nung (Vivian) Chen, Florian Metze
{sjauhar, yvchen, fmetze}@cs.cmu.edu
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
The 6th International Joint Conference on Natural Language Processing, Oct. 14–18, 2013
Outline
• Introduction
  – Motivation
  – Extractive Summarization
• Approach
  – Prosodic Feature Extraction
  – Graph Construction
  – Two-Layer Mutually Reinforced Random Walk
• Experiments
  – Experimental Setup
  – Evaluation Metrics
  – Results
  – Analysis
• Conclusion
Motivation
• Speech Summarization
  – Spoken documents are more difficult to browse than texts
  – A summary makes them easy to browse, saves time, and conveys the key points quickly
• Prosodic Features
  – Speakers may use prosody to implicitly convey the importance of parts of the speech
Extractive Summarization (1/2)
• Extractive Speech Summarization
  – Select the indicative utterances in a spoken document
  – Cascade the selected utterances to form a summary
[Figure: utterances 1 through n; the selected utterances are concatenated into the extractive summary]
Extractive Summarization (2/2)
• Selection of Indicative Utterances
  – Each utterance U_i in a spoken document d is given an importance score I(U_i, d)
  – Select the indicative utterances based on I(U_i, d)
  – The number of utterances selected for the summary is decided by a predefined ratio
• Importance score:
  I(U_i, d) = Σ_{j=1}^{n} n(t_j, U_i) · s(t_j, d)
  where t_j is a term, n(t_j, U_i) is its occurrence count in utterance U_i, and s(t_j, d) is a term statistical measure (e.g., TF-IDF)
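The scoring-and-selection scheme above can be sketched in a few lines. This is a minimal illustration with hypothetical names: document frequency across utterances stands in for a corpus-level IDF, which a real system would estimate from a background corpus.

```python
import math
from collections import Counter

def summarize(utterances, ratio=0.3):
    """Score each utterance by summed TF-IDF of its terms; keep the top ratio."""
    n = len(utterances)
    # document frequency across utterances stands in for corpus statistics
    df = Counter(t for u in utterances for t in set(u))
    idf = {t: math.log(n / df[t]) for t in df}
    # I(U, d) = sum over terms t in U of count(t, U) * s(t, d)
    scores = [sum(c * idf[t] for t, c in Counter(u).items()) for u in utterances]
    k = max(1, round(ratio * n))
    top = sorted(range(n), key=lambda i: -scores[i])[:k]
    return sorted(top)  # cascade selected utterances in original order

utts = [["budget", "meeting", "agenda"],
        ["we", "should", "increase", "the", "budget"],
        ["ok"],
        ["lunch", "plans"]]
print(summarize(utts, ratio=0.5))
```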
Approach
• Prosodic Feature Extraction
• Graph Construction
• Two-Layer Mutually Reinforced Random Walk
Prosodic Feature Extraction
• For each pre-segmented audio file, we extract:
  – number of syllables
  – number of pauses
  – duration time: speaking time including pauses
  – phonation time: speaking time excluding pauses
  – speaking rate: #syllables / duration time
  – articulation rate: #syllables / phonation time
  – fundamental frequency (Hz): avg, max, min
  – energy (Pa²/sec)
  – intensity (dB)
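Given pre-segmented timings, the derived features above reduce to simple arithmetic. A minimal sketch, with hypothetical function and argument names; extracting the raw pitch and intensity tracks themselves would typically be done with a tool such as Praat:

```python
def prosodic_features(n_syllables, pauses, start, end, f0_track, intensity_track):
    """Derive the rate/duration features of one segment from its timings.

    pauses: list of (pause_start, pause_end) in seconds
    f0_track: per-frame pitch values in Hz (0.0 for unvoiced frames)
    intensity_track: per-frame intensity values in dB
    """
    duration = end - start                                     # incl. pauses
    phonation = duration - sum(pe - ps for ps, pe in pauses)   # excl. pauses
    voiced = [f for f in f0_track if f > 0.0]
    return {
        "n_syllables": n_syllables,
        "n_pauses": len(pauses),
        "duration": duration,
        "phonation": phonation,
        "speaking_rate": n_syllables / duration,
        "articulation_rate": n_syllables / phonation,
        "f0_avg": sum(voiced) / len(voiced),
        "f0_max": max(voiced),
        "f0_min": min(voiced),
        "intensity_avg": sum(intensity_track) / len(intensity_track),
    }

feats = prosodic_features(12, [(1.0, 1.5)], 0.0, 4.0,
                          f0_track=[110.0, 120.0, 0.0, 130.0],
                          intensity_track=[60.0, 62.0])
```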
Graph Construction
• Utterance-Layer
  – Each node is an utterance in the meeting document
• Prosody-Layer
  – Each node is a prosodic feature
• Between-Layer Relation
  – The weight of an edge is the normalized value of the prosodic feature extracted from the utterance
[Figure: two-layer graph linking utterance nodes U1–U7 to prosody nodes P1–P6]
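One plausible way to realize those edge weights is to normalize each prosodic feature's raw values across the utterances of a document, so features with different units become comparable. A sketch under that assumption (the exact normalization used in the paper may differ):

```python
import numpy as np

def between_layer_weights(raw):
    """raw: utterances x prosodic-features matrix of raw feature values.

    Normalize each feature (column) over the utterances so each prosody
    node's edge weights to the utterance layer sum to 1."""
    W = np.asarray(raw, dtype=float)
    col_sums = W.sum(axis=0, keepdims=True)
    return W / np.where(col_sums == 0.0, 1.0, col_sums)

W = between_layer_weights([[1.0, 2.0],
                           [3.0, 2.0]])
```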
Two-Layer Mutually Reinforced Random Walk (1/2)
• Mathematical Formulation (α interpolates original importance with propagated scores; L_UP and L_PU denote the normalized between-layer weight matrices)
  F_U^(t+1) = (1 − α) F_U^(0) + α L_UP F_P^(t)
    – utterance scores at the (t+1)-th iteration = original importance of utterances + scores propagated from prosody nodes, weighted by prosodic values
  F_P^(t+1) = (1 − α) F_P^(0) + α L_PU F_U^(t)
    – prosody scores at the (t+1)-th iteration = original importance of prosodic features + scores propagated from utterances, weighted by prosodic values
• Original importance
  – Utterance: equal weight
  – Prosody: equal weight
[Figure: two-layer graph linking utterance nodes U1–U7 to prosody nodes P1–P6]
Two-Layer Mutually Reinforced Random Walk (2/2)
• Mathematical Formulation
  – Utterance node U gets a higher score when more important prosodic features have higher weights on their edges to U
  – Prosody node P gets a higher score when more important utterances have higher weights on their edges to P
→ Learns important utterances and prosodic features jointly, in an unsupervised manner
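The mutual-reinforcement updates can be sketched as the fixed-point iteration below. This is a minimal illustration of the annotated recurrences; the matrix names and the exact normalization are assumptions, not the paper's definitions.

```python
import numpy as np

def mutually_reinforced_walk(W, alpha=0.9, iters=100):
    """W: non-negative utterance x prosody weight matrix
    (no all-zero rows or columns).

    Each layer's scores interpolate original (uniform) importance with
    scores propagated from the other layer through normalized edges."""
    W = np.asarray(W, dtype=float)
    # each prosody node distributes its score over utterances, and vice versa
    L_up = W / W.sum(axis=0, keepdims=True)        # utterances x prosody
    L_pu = (W / W.sum(axis=1, keepdims=True)).T    # prosody x utterances
    n_u, n_p = W.shape
    F_u0 = np.full(n_u, 1.0 / n_u)                 # equal initial weight
    F_p0 = np.full(n_p, 1.0 / n_p)
    F_u, F_p = F_u0.copy(), F_p0.copy()
    for _ in range(iters):
        F_u = (1 - alpha) * F_u0 + alpha * L_up @ F_p
        F_p = (1 - alpha) * F_p0 + alpha * L_pu @ F_u
    return F_u, F_p

# utterance 1 carries stronger prosodic evidence than utterance 0
F_u, F_p = mutually_reinforced_walk([[1.0, 1.0],
                                     [3.0, 3.0]])
```

Because the propagation matrices are column-stochastic in the propagation direction, the total score mass in each layer is preserved across iterations.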
Experiments
• Experimental Setup
• Evaluation Metrics
• Results
• Analysis
Experimental Setup
• CMU Speech Meeting Corpus
  – 10 meetings from 2006/04 – 2006/06
  – #Speakers: 6 in total, 2–4 per meeting
  – WER = 44%
• Reference Summaries
  – Manually labeled by two annotators at three "noteworthiness" levels (1–3)
  – Utterances with level 3 are extracted as reference summaries
• Parameter Setting
  – α = 0.9
  – Extractive summary ratios = 10%, 20%, 30%
Evaluation Metrics
• ROUGE-1
  – F-measure of matched unigrams between the extracted summary and the reference summary
• ROUGE-L (Longest Common Subsequence)
  – F-measure of the matched LCS between the extracted summary and the reference summary
• Average Relevance Score
  – Average noteworthiness score of the extracted utterances
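ROUGE-1's unigram F-measure, for instance, reduces to clipped unigram overlap. A minimal sketch that ignores the stemming and stopword options of the full ROUGE toolkit:

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    """Unigram F-measure between a candidate summary and a reference."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat on the mat".split(),
                 "the cat lay on the mat".split())
```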
Baselines
• Longest: the longest utterances by #tokens
• Begin: the utterances that appear at the beginning
• Latent Topic Entropy (LTE)
  – Estimates the "focus" of an utterance
  – Lower topic entropy indicates a more topically informative utterance
• TFIDF: average TF-IDF score of all words in the utterance
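The LTE baseline's score is simply the entropy of an utterance's latent-topic distribution; how that distribution is estimated (e.g., via PLSA or LDA) is outside this sketch:

```python
import math

def latent_topic_entropy(topic_dist):
    """Entropy of an utterance's topic distribution.

    Lower entropy means the utterance concentrates on fewer topics,
    i.e., it is more topically informative."""
    return -sum(p * math.log(p) for p in topic_dist if p > 0.0)
```

A single-topic utterance has entropy 0, while a uniform distribution over topics maximizes it.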
Results
[Charts: Avg. Relevance, ROUGE-1, and ROUGE-L for Longest, Begin, LTE, TFIDF, and Proposed at each summary ratio]
• For 10% summaries, Begin performs best and the proposed approach performs comparably
• For 20% summaries, the proposed approach outperforms all of the baselines
• For 30% summaries, the proposed approach outperforms all of the baselines
Analysis
• Based on the converged scores of the prosodic features:
  – Predictive features: number of pauses, min pitch, avg pitch, intensity
  – Least predictive features: duration time, number of syllables, energy
Conclusion
• Two-layer mutually reinforced random walk integrates prosodic knowledge into an unsupervised model for speech summarization
• We present the first attempt at performing unsupervised speech summarization without using any lexical information
• The proposed approach outperforms the lexically derived baselines in all but one scenario