Sujay Kumar Jauhar, Yun-Nung (Vivian) Chen, Florian Metze
{sjauhar, yvchen, fmetze}@cs.cmu.edu
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
The 6th International Joint Conference on Natural Language Processing, Oct. 14–18, 2013
Outline
• Introduction
  – Motivation
  – Extractive Summarization
• Approach
  – Prosodic Feature Extraction
  – Graph Construction
  – Two-Layer Mutually Reinforced Random Walk
• Experiments
  – Experimental Setup
  – Evaluation Metrics
  – Results
  – Analysis
• Conclusion
Motivation
• Speech Summarization
  – Spoken documents are more difficult to browse than texts
  – A summary makes them easy to browse, saves time, and conveys the key points quickly
• Prosodic Features
  – Speakers may use prosody to implicitly convey the importance of parts of the speech
Extractive Summarization (1/2)
• Extractive Speech Summarization
  – Select the indicative utterances in a spoken document
  – Cascade the selected utterances to form a summary
[Figure: utterances 1 through n; the selected utterances are concatenated into the extractive summary]
Extractive Summarization (2/2)
• Selection of Indicative Utterances
  – Each utterance U_i in a spoken document d is given an importance score I(U_i, d)
  – Select the indicative utterances based on I(U_i, d)
  – The number of utterances selected for the summary is decided by a predefined ratio
• Importance score:
  I(U_i, d) = Σ_{j=1}^{n} n(t_j, U_i) · s(t_j, d)
  where t_j is a term, n(t_j, U_i) is its occurrence count in utterance U_i, and s(t_j, d) is a term statistical measure (e.g., TF-IDF)
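The scoring-and-selection scheme above can be sketched in a few lines. This is a minimal illustration with hypothetical names: document frequency across utterances stands in for a corpus-level IDF, which a real system would estimate from a background corpus.

```python
import math
from collections import Counter

def summarize(utterances, ratio=0.3):
    """Score each utterance by summed TF-IDF of its terms; keep the top ratio."""
    n = len(utterances)
    # document frequency across utterances stands in for corpus statistics
    df = Counter(t for u in utterances for t in set(u))
    idf = {t: math.log(n / df[t]) for t in df}
    # I(U, d) = sum over terms t in U of count(t, U) * s(t, d)
    scores = [sum(c * idf[t] for t, c in Counter(u).items()) for u in utterances]
    k = max(1, round(ratio * n))
    top = sorted(range(n), key=lambda i: -scores[i])[:k]
    return sorted(top)  # cascade selected utterances in original order

utts = [["budget", "meeting", "agenda"],
        ["we", "should", "increase", "the", "budget"],
        ["ok"],
        ["lunch", "plans"]]
print(summarize(utts, ratio=0.5))
```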
Approach
• Prosodic Feature Extraction
• Graph Construction
• Two-Layer Mutually Reinforced Random Walk
Prosodic Feature Extraction
• For each pre-segmented audio file, we extract:
  – number of syllables
  – number of pauses
  – duration time: speaking time including pauses
  – phonation time: speaking time excluding pauses
  – speaking rate: #syllables / duration time
  – articulation rate: #syllables / phonation time
  – fundamental frequency (Hz): avg, max, min
  – energy (Pa²/sec)
  – intensity (dB)
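Given pre-segmented timings, the derived features above reduce to simple arithmetic. A minimal sketch, with hypothetical function and argument names; extracting the raw pitch and intensity tracks themselves would typically be done with a tool such as Praat:

```python
def prosodic_features(n_syllables, pauses, start, end, f0_track, intensity_track):
    """Derive the rate/duration features of one segment from its timings.

    pauses: list of (pause_start, pause_end) in seconds
    f0_track: per-frame pitch values in Hz (0.0 for unvoiced frames)
    intensity_track: per-frame intensity values in dB
    """
    duration = end - start                                     # incl. pauses
    phonation = duration - sum(pe - ps for ps, pe in pauses)   # excl. pauses
    voiced = [f for f in f0_track if f > 0.0]
    return {
        "n_syllables": n_syllables,
        "n_pauses": len(pauses),
        "duration": duration,
        "phonation": phonation,
        "speaking_rate": n_syllables / duration,
        "articulation_rate": n_syllables / phonation,
        "f0_avg": sum(voiced) / len(voiced),
        "f0_max": max(voiced),
        "f0_min": min(voiced),
        "intensity_avg": sum(intensity_track) / len(intensity_track),
    }

feats = prosodic_features(12, [(1.0, 1.5)], 0.0, 4.0,
                          f0_track=[110.0, 120.0, 0.0, 130.0],
                          intensity_track=[60.0, 62.0])
```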
Graph Construction
• Utterance-Layer
  – Each node is an utterance in the meeting document
• Prosody-Layer
  – Each node is a prosodic feature
• Between-Layer Relation
  – The weight of an edge is the normalized value of the prosodic feature extracted from the utterance
[Figure: two-layer graph linking utterance nodes U1–U7 to prosody nodes P1–P6]
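One plausible way to realize those edge weights is to normalize each prosodic feature's raw values across the utterances of a document, so features with different units become comparable. A sketch under that assumption (the exact normalization used in the paper may differ):

```python
import numpy as np

def between_layer_weights(raw):
    """raw: utterances x prosodic-features matrix of raw feature values.

    Normalize each feature (column) over the utterances so each prosody
    node's edge weights to the utterance layer sum to 1."""
    W = np.asarray(raw, dtype=float)
    col_sums = W.sum(axis=0, keepdims=True)
    return W / np.where(col_sums == 0.0, 1.0, col_sums)

W = between_layer_weights([[1.0, 2.0],
                           [3.0, 2.0]])
```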
Two-Layer Mutually Reinforced Random Walk (1/2)
• Mathematical Formulation (α interpolates original importance with propagated scores; L_UP and L_PU denote the normalized between-layer weight matrices)
  F_U^(t+1) = (1 − α) F_U^(0) + α L_UP F_P^(t)
    – utterance scores at the (t+1)-th iteration = original importance of utterances + scores propagated from prosody nodes, weighted by prosodic values
  F_P^(t+1) = (1 − α) F_P^(0) + α L_PU F_U^(t)
    – prosody scores at the (t+1)-th iteration = original importance of prosodic features + scores propagated from utterances, weighted by prosodic values
• Original importance
  – Utterance: equal weight
  – Prosody: equal weight
[Figure: two-layer graph linking utterance nodes U1–U7 to prosody nodes P1–P6]
Two-Layer Mutually Reinforced Random Walk (2/2)
• Mathematical Formulation
  – Utterance node U gets a higher score when more important prosodic features have higher weights on their edges to U
  – Prosody node P gets a higher score when more important utterances have higher weights on their edges to P
→ Learns important utterances and prosodic features jointly, in an unsupervised manner
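The mutual-reinforcement updates can be sketched as the fixed-point iteration below. This is a minimal illustration of the annotated recurrences; the matrix names and the exact normalization are assumptions, not the paper's definitions.

```python
import numpy as np

def mutually_reinforced_walk(W, alpha=0.9, iters=100):
    """W: non-negative utterance x prosody weight matrix
    (no all-zero rows or columns).

    Each layer's scores interpolate original (uniform) importance with
    scores propagated from the other layer through normalized edges."""
    W = np.asarray(W, dtype=float)
    # each prosody node distributes its score over utterances, and vice versa
    L_up = W / W.sum(axis=0, keepdims=True)        # utterances x prosody
    L_pu = (W / W.sum(axis=1, keepdims=True)).T    # prosody x utterances
    n_u, n_p = W.shape
    F_u0 = np.full(n_u, 1.0 / n_u)                 # equal initial weight
    F_p0 = np.full(n_p, 1.0 / n_p)
    F_u, F_p = F_u0.copy(), F_p0.copy()
    for _ in range(iters):
        F_u = (1 - alpha) * F_u0 + alpha * L_up @ F_p
        F_p = (1 - alpha) * F_p0 + alpha * L_pu @ F_u
    return F_u, F_p

# utterance 1 carries stronger prosodic evidence than utterance 0
F_u, F_p = mutually_reinforced_walk([[1.0, 1.0],
                                     [3.0, 3.0]])
```

Because the propagation matrices are column-stochastic in the propagation direction, the total score mass in each layer is preserved across iterations.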
Experiments
• Experimental Setup
• Evaluation Metrics
• Results
• Analysis
Experimental Setup
• CMU Speech Meeting Corpus
  – 10 meetings from 2006/04 – 2006/06
  – #Speakers: 6 in total, 2–4 per meeting
  – WER = 44%
• Reference Summaries
  – Manually labeled by two annotators at three "noteworthiness" levels (1–3)
  – Utterances with level 3 are extracted as reference summaries
• Parameter Setting
  – α = 0.9
  – Extractive summary ratios = 10%, 20%, 30%
Evaluation Metrics
• ROUGE-1
  – F-measure of matched unigrams between the extracted summary and the reference summary
• ROUGE-L (Longest Common Subsequence)
  – F-measure of the matched LCS between the extracted summary and the reference summary
• Average Relevance Score
  – Average noteworthiness score of the extracted utterances
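ROUGE-1's unigram F-measure, for instance, reduces to clipped unigram overlap. A minimal sketch that ignores the stemming and stopword options of the full ROUGE toolkit:

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    """Unigram F-measure between a candidate summary and a reference."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat on the mat".split(),
                 "the cat lay on the mat".split())
```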
Baselines
• Longest: the longest utterances by #tokens
• Begin: the utterances that appear at the beginning
• Latent Topic Entropy (LTE)
  – Estimates the "focus" of an utterance
  – Lower topic entropy indicates a more topically informative utterance
• TFIDF: average TF-IDF score of all words in the utterance
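The LTE baseline's score is simply the entropy of an utterance's latent-topic distribution; how that distribution is estimated (e.g., via PLSA or LDA) is outside this sketch:

```python
import math

def latent_topic_entropy(topic_dist):
    """Entropy of an utterance's topic distribution.

    Lower entropy means the utterance concentrates on fewer topics,
    i.e., it is more topically informative."""
    return -sum(p * math.log(p) for p in topic_dist if p > 0.0)
```

A single-topic utterance has entropy 0, while a uniform distribution over topics maximizes it.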
Results
[Charts: Avg. Relevance, ROUGE-1, and ROUGE-L for Longest, Begin, LTE, TFIDF, and Proposed at each summary ratio]
• For 10% summaries, Begin performs best and the proposed approach performs comparably
• For 20% summaries, the proposed approach outperforms all of the baselines
• For 30% summaries, the proposed approach outperforms all of the baselines
Analysis
• Based on the converged scores of the prosodic features:
  – Predictive features: number of pauses, min pitch, avg pitch, intensity
  – Least predictive features: duration time, number of syllables, energy
Conclusion
• Two-layer mutually reinforced random walk integrates prosodic knowledge into an unsupervised model for speech summarization
• We present the first attempt at performing unsupervised speech summarization without using any lexical information
• The proposed approach outperforms the lexically derived baselines in all but one scenario