### Definition

• Key Term
  • Higher term frequency
  • Core content
• Two types
  • Keyword
  • Key phrase
• Advantage
  • Indexing and retrieval
  • The relations between key terms and segments of documents

### Introduction

Example key terms in a speech processing lecture: acoustic model, language model, HMM (hidden Markov model), n-gram, phone, bigram

Target: extract key terms from course lectures

### Automatic Key Term Extraction

Archive of spoken documents (speech signal) → ASR → ASR transcriptions

1) Phrase Identification: branching entropy is first used to identify phrases
2) Key Term Extraction: features are extracted, then learning methods (K-means Exemplar, AdaBoost, Neural Network) extract key terms from those features

Output key terms: entropy, acoustic model, ...


### Branching Entropy

How to decide the boundary of a phrase? Example: "hidden Markov model"

• "hidden" is almost always followed by the same word
• "hidden Markov" is almost always followed by the same word
• "hidden Markov model" is followed by many different words, so a boundary is likely after "model"

Define branching entropy to decide a possible boundary.

• Definition of Right Branching Entropy
  • Probability of children $x_i$ of $X$: $P(x_i \mid X) = C(x_i)/C(X)$, where $C(\cdot)$ counts occurrences in the corpus
  • Right branching entropy for $X$: $H_r(X) = -\sum_i P(x_i \mid X) \log P(x_i \mid X)$

### Branching Entropy

• Decision of Right Boundary
  • Find the right boundary located between $X$ and $x_i$ where the branching entropy increases: $H_r(x_i) > H_r(X)$


### Branching Entropy

• Decision of Left Boundary
  • Find the left boundary located between $X$ and $x_i$ by applying the same criterion to the reversed word order ($X$: "model Markov hidden")

Using a PAT tree to implement this.

### Branching Entropy

• Implementation in the PAT tree
  • The probability of children $x_i$ for $X$ and the right branching entropy for $X$ are computed directly from the counts stored in the tree

[PAT tree figure: nodes for "hidden" → "Markov" → {"model", "chain", "state", "distribution", "variable"}]

Example: $X$: "hidden Markov"; $x_1$: "hidden Markov model"; $x_2$: "hidden Markov chain". A computational sketch follows.
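To make this concrete, here is a minimal Python sketch of right branching entropy over a toy token stream. It counts n-gram successors with a plain dictionary instead of a PAT tree, and all names and data are illustrative rather than from the paper.

```python
# Right branching entropy H_r(X) = -sum_i P(x_i|X) log P(x_i|X),
# with successor counts gathered by brute force (a PAT tree would
# store these counts; this sketch only illustrates the math).
import math
from collections import defaultdict

def right_branching_entropy(children_counts):
    """Entropy of the distribution over words that follow a pattern."""
    total = sum(children_counts.values())
    h = 0.0
    for c in children_counts.values():
        p = c / total
        h -= p * math.log(p)
    return h

def right_children(tokens, pattern):
    """Count the words that immediately follow `pattern` in the stream."""
    n = len(pattern)
    counts = defaultdict(int)
    for i in range(len(tokens) - n):
        if tuple(tokens[i:i + n]) == pattern:
            counts[tokens[i + n]] += 1
    return counts

tokens = ("hidden markov model is a hidden markov chain with hidden "
          "markov model states and hidden markov model parameters").split()

for pat in [("hidden",), ("hidden", "markov"), ("hidden", "markov", "model")]:
    h = right_branching_entropy(right_children(tokens, pat))
    print(" ".join(pat), "H_r =", round(h, 3))
# H_r rises sharply once the full phrase "hidden markov model" is
# reached, suggesting a right boundary after "model".
```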

### Automatic Key Term Extraction

After phrase identification, extract prosodic, lexical, and semantic features for each candidate term.

### Feature Extraction

• Prosodic features
  • For each candidate term appearing for the first time

| Feature Name | Feature Description |
| --- | --- |
| Duration (I-IV) | normalized duration (max, min, mean, range) |
| Pitch (I-IV) | F0 (max, min, mean, range) |
| Energy (I-IV) | energy (max, min, mean, range) |

Speakers tend to use longer duration to emphasize key terms: the duration of each phone (e.g. phone "a") is normalized by the average duration of that phone, and 4 values are used for the duration of the term. Higher pitch may represent significant information, and higher energy emphasizes important information. A sketch of these statistics follows.
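As a rough illustration, the sketch below computes the four values (max, min, mean, range) for hypothetical duration, F0, and energy measurements of one term occurrence. The input tracks, phone names, and averages are invented; a real system would obtain them from forced alignment and a pitch tracker.

```python
# A hedged sketch of the per-term prosodic statistics; all data and
# function names are illustrative, not from the paper.
import numpy as np

def four_stats(values):
    """The four prosodic values used per feature: max, min, mean, range."""
    v = np.asarray(values, dtype=float)
    return {"max": v.max(), "min": v.min(),
            "mean": v.mean(), "range": v.max() - v.min()}

def normalized_durations(phone_durations, avg_phone_durations):
    """Each phone's duration normalized by that phone's corpus average."""
    return [d / avg_phone_durations[p] for p, d in phone_durations]

# Hypothetical measurements for one occurrence of a candidate term:
phones = [("h", 0.09), ("ih", 0.12), ("d", 0.07)]   # (phone, seconds)
avg_dur = {"h": 0.06, "ih": 0.08, "d": 0.05}        # corpus averages
f0 = [180.0, 195.0, 210.0, 190.0]                   # Hz per frame
energy = [0.52, 0.61, 0.66, 0.58]                   # per frame

features = {
    "Duration": four_stats(normalized_durations(phones, avg_dur)),
    "Pitch": four_stats(f0),
    "Energy": four_stats(energy),
}
print(features)
```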

### Feature Extraction

• Lexical features

| Feature Name | Feature Description |
| --- | --- |
| TF | term frequency |
| IDF | inverse document frequency |
| TFIDF | tf × idf |
| PoS | the PoS tag |

Some well-known lexical features are used for each candidate term (sketched below).
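A minimal sketch of the TF-IDF computation over toy documents; the exact weighting variant used in the paper is not specified in the slides, so this uses the textbook form.

```python
# Textbook TF-IDF over toy tokenized documents; the documents and
# terms are hypothetical.
import math

docs = [["acoustic model", "hmm", "phone"],
        ["language model", "n-gram", "bigram"],
        ["hmm", "acoustic model", "viterbi"]]

def tf(term, doc):
    return doc.count(term)                 # raw term frequency

def idf(term, docs):
    df = sum(term in d for d in docs)      # document frequency
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("acoustic model", docs[0], docs))  # tf=1, idf=log(3/2)
```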

### Feature Extraction

• Semantic features
  • Probabilistic Latent Semantic Analysis (PLSA): Latent Topic Probability

PLSA relates documents $D_i$ and terms $t_j$ through latent topics $T_k$, with parameters $P(T_k \mid D_i)$ and $P(t_j \mid T_k)$.

[PLSA figure: documents $D_1 \ldots D_N$ linked to terms $t_1 \ldots t_n$ through latent topics $T_1 \ldots T_K$]

| Feature Name | Feature Description |
| --- | --- |
| LTP (I-III) | Latent Topic Probability (mean, variance, standard deviation) |

Key terms tend to focus on limited topics: the latent topic probabilities of a key term describe a peaked probability distribution over topics, while those of a non-key term are spread out. How to use it?

### Feature Extraction

• Semantic features
  • Probabilistic Latent Semantic Analysis (PLSA): Latent Topic Significance

| Feature Name | Feature Description |
| --- | --- |
| LTS (I-III) | Latent Topic Significance (mean, variance, standard deviation) |

Latent Topic Significance is the within-topic to out-of-topic frequency ratio:

$\mathrm{LTS}(t_j, T_k) = \dfrac{\sum_i n(t_j, D_i)\, P(T_k \mid D_i)}{\sum_i n(t_j, D_i)\, \bigl(1 - P(T_k \mid D_i)\bigr)}$

where $n(t_j, D_i)$ is the count of term $t_j$ in document $D_i$. Since key terms tend to focus on limited topics, a key term is highly significant in the few topics it concentrates on.

### Feature Extraction

• Semantic features
  • Probabilistic Latent Semantic Analysis (PLSA): Latent Topic Entropy

| Feature Name | Feature Description |
| --- | --- |
| LTE | term entropy over latent topics |

$\mathrm{LTE}(t_j) = -\sum_k P(T_k \mid t_j) \log P(T_k \mid t_j)$

Key terms tend to focus on limited topics: a key term has lower LTE, a non-key term higher LTE. A sketch of the semantic features follows.
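The sketch below computes LTS and LTE for one term from hypothetical PLSA outputs. Estimating $P(T_k \mid t)$ from the within-topic frequencies is an assumption made for illustration, as are all the numbers.

```python
# Semantic features for one candidate term from assumed PLSA outputs
# P(T_k | D_i) and term counts n(t, D_i); data is hypothetical.
import numpy as np

P_topic_given_doc = np.array([[0.90, 0.10],    # document 0
                              [0.20, 0.80],    # document 1
                              [0.85, 0.15]])   # shape: (docs, topics)
n_term_doc = np.array([3.0, 0.0, 2.0])         # n(t, D_i) for term t

# Latent Topic Significance: within-topic freq. / out-of-topic freq.
within = n_term_doc @ P_topic_given_doc        # per-topic within freq.
out = n_term_doc @ (1.0 - P_topic_given_doc)   # per-topic out freq.
lts = within / out

# One simple estimate of the term's topic distribution, then its
# Latent Topic Entropy:
p_topic_given_term = within / within.sum()
lte = -np.sum(p_topic_given_term * np.log(p_topic_given_term))

# LTP (I-III) / LTS (I-III): mean, variance, standard deviation
ltp_feats = (p_topic_given_term.mean(), p_topic_given_term.var(),
             p_topic_given_term.std())
lts_feats = (lts.mean(), lts.var(), lts.std())
print(lts, lte, ltp_feats, lts_feats)
# This term concentrates on topic 0, so its LTE is low (key-term-like).
```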

### Automatic Key Term Extraction

With features extracted, unsupervised and supervised approaches are used to extract key terms.

### Learning Methods

• Unsupervised learning
  • K-means Exemplar
    • Transform each term into a vector in LTS (Latent Topic Significance) space
    • Run K-means
    • Take the term at the centroid of each cluster as a key term

The terms in the same cluster focus on a single topic, so the candidate terms in a group are related to its key term, and the key term can represent that topic (see the sketch below).
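A minimal sketch of the exemplar selection, assuming each candidate term already has an LTS-space vector. scikit-learn's KMeans is one possible implementation choice, and the terms and vectors are hypothetical.

```python
# K-means Exemplar: cluster terms in LTS space and take the member
# closest to each centroid as that cluster's key term.
import numpy as np
from sklearn.cluster import KMeans

terms = ["acoustic model", "hmm", "phone",
         "language model", "n-gram", "bigram"]
X = np.array([[5.0, 0.2], [4.5, 0.4], [4.0, 0.3],    # topic-1 heavy
              [0.3, 4.8], [0.2, 4.2], [0.4, 4.6]])   # topic-2 heavy

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

key_terms = []
for k, center in enumerate(kmeans.cluster_centers_):
    members = np.where(kmeans.labels_ == k)[0]
    # The exemplar: the member term closest to the cluster centroid.
    exemplar = members[np.argmin(np.linalg.norm(X[members] - center, axis=1))]
    key_terms.append(terms[exemplar])
print(key_terms)  # one key term per cluster / topic
```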

### Learning Methods

• Supervised learning
  • Adaptive Boosting (AdaBoost)
  • Neural Network

Both automatically adjust the weights of the features to produce a classifier; a sketch follows.
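A hedged sketch of the two classifiers on synthetic data; the hyperparameters are illustrative guesses, not the paper's settings.

```python
# AdaBoost and a neural network trained on a synthetic feature matrix X
# (stand-in for prosodic + lexical + semantic features per candidate
# term) with binary key-term labels y.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # 200 candidate terms, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels

ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                   random_state=0).fit(X, y)
print(ada.score(X, y), nn.score(X, y))
```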

### Experiments

• Corpus
  • NTU lecture corpus: Mandarin Chinese with embedded English words
  • Single speaker
  • 45.2 hours
  • Example utterance: 我們的solution是viterbi algorithm (Our solution is the Viterbi algorithm)

### Experiments

• ASR Accuracy

| Language | Mandarin | English | Overall |
| --- | --- | --- | --- |
| Char Acc (%) | 78.15 | 53.44 | 76.26 |

Acoustic model: bilingual (Chinese + English) AM, adapted from a speaker-independent (SI) model with some data from the target speaker. Language model: adaptive trigram LM, interpolating out-of-domain background corpora with the in-domain corpus.

### Experiments

• Reference Key Terms
  • Annotations from 61 students who had taken the course
    • If the $k$-th annotator labeled $N_k$ key terms, each of them received a score of $1/N_k$, and all other terms 0
    • The terms are ranked by the sum of the scores given by all annotators
    • The top $N$ terms are chosen from the list ($N$ is the average $N_k$)
  • $N$ = 154 key terms: 59 key phrases and 95 keywords (a worked sketch of the scoring follows)
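A worked sketch of this scoring scheme on hypothetical annotations; the $1/N_k$ score is reconstructed from the description above.

```python
# Annotator k gives each of the N_k terms they labeled a score of 1/N_k
# and 0 to all others; terms are ranked by total score and the top N
# (average N_k) are kept. Annotations here are hypothetical.
from collections import defaultdict

annotations = [                      # key terms labeled by each annotator
    ["hmm", "viterbi", "acoustic model"],
    ["hmm", "n-gram"],
    ["viterbi", "hmm", "language model", "n-gram"],
]

scores = defaultdict(float)
for labeled in annotations:
    for term in labeled:
        scores[term] += 1.0 / len(labeled)   # 1 / N_k

N = round(sum(len(a) for a in annotations) / len(annotations))  # avg N_k
reference = sorted(scores, key=scores.get, reverse=True)[:N]
print(reference)  # ['hmm', 'n-gram', 'viterbi']
```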

### Experiments

• Evaluation
  • Unsupervised learning: set the number of extracted key terms to $N$
  • Supervised learning: 3-fold cross validation (sketched below)
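A sketch of the supervised evaluation protocol, assuming scikit-learn's 3-fold cross validation with F-measure scoring on synthetic data; the exact split and classifier settings of the paper are not shown on the slides.

```python
# 3-fold cross validation with F-measure, on the same kind of synthetic
# feature matrix used in the earlier supervised-learning sketch.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))             # candidate-term feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy key-term labels

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
f1_scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
print(f1_scores.mean())                    # averaged F-measure over 3 folds
```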

### Experiments

• Feature Effectiveness
  • Neural network for keywords from ASR transcriptions

| Feature set | F-measure (%) |
| --- | --- |
| Pr (Prosodic) | 20.78 |
| Lx (Lexical) | 42.86 |
| Sm (Semantic) | 35.63 |
| Pr + Lx | 48.15 |
| Pr + Lx + Sm | 56.55 |

Each feature set alone gives an F1 between 20% and 42%; prosodic and lexical features are additive; all three sets of features are useful.

### Experiments

• Overall Performance

| Approach | F-measure, manual (%) | F-measure, ASR (%) |
| --- | --- | --- |
| Baseline | 23.38 | 20.78 |
| U: TFIDF | 51.95 | 43.51 |
| U: K-means Exemplar | 55.84 | 52.60 |
| S: AdaBoost (AB) | 62.39 | 57.68 |
| S: Neural Network (NN) | 67.31 | 62.70 |

The baseline uses conventional TFIDF scores without branching entropy, with stop word removal and PoS filtering.

• Branching entropy performs well
• K-means Exemplar outperforms TFIDF
• Supervised approaches are better than unsupervised approaches
• Performance on ASR transcriptions is slightly worse than on manual transcriptions but reasonable
• Supervised learning with a neural network gives the best results

### Conclusion

• We proposed a new approach to extract key terms
• The performance can be improved by
  • Identifying phrases by branching entropy
  • Combining prosodic, lexical, and semantic features
• The results are encouraging

We thank the reviewers for their valuable comments.

NTU Virtual Instructor: http://speech.ee.ntu.edu.tw/~RA/lecture