Multi-Cue Fusion for Semantic Video Indexing
Ming-Fang Weng and Yung-Yu Chuang National Taiwan University
mfueng@cmlab.csie.ntu.edu.tw
Motivation
• The rapidly growing amount of video drives the need for semantic video search
[Diagram: video archives → concept-based video indexing → query-by-concept, bridging the semantic gap in semantic video search]
Semantic Concepts
• Semantic concepts comprehensively characterize the meaning of video content, e.g.,
Airplane Sports Crowd Mountain Building
Goal
• To improve the accuracy of semantic video indexing
• A ranking list of shots according to a confidence measure, one list per concept (e.g., detecting Airplane, detecting Sports)
[Diagram: two ranked lists of six shots each, ordered by detector confidence]
A Typical Approach
• Supervised learning
Training path: Video Archive → Video Segmentation → Feature Extraction → Lexicon Annotation → Training Concept Classifiers
Testing path: Video Archive → Video Segmentation → Feature Extraction → Semantic Concept Prediction → Shot Ranking
A Typical Approach
• Supervised learning
• Main problems:
• The annotation data is not fully utilized
• The labels for all concepts in all shots are predicted independently
Ground Truth Annotation
A lexicon of concepts annotated over a sequence of video shots (training set):

car      1 1 1 1 1 1 1 1
outdoor  1 1 1 1 1 1 1 1
building 1 0 0 1 1 1 1 0
sky      0 1 1 1 0 1 1 0
people   0 0 0 1 0 1 1 0
urban    1 1 0 1 1 1 1 1
The annotations exhibit temporal dependency (along neighboring shots) and contextual correlation (across concepts).
Semantic Concept Prediction
A lexicon of concepts: car, outdoor, building, sky, people, urban
[Diagram: a matrix of per-shot, per-concept prediction scores (e.g., 0.92, 0.56, 0.81, …) over a sequence of video shots from the test set]
Our Views
• Detectors’ predictions form a “noisy image”
in the contextual-temporal domain
[Diagram: the prediction-score matrix over a sequence of test-set shots and the concept lexicon (people, sky, car, urban, building, outdoor), labeled NOISE]
Image Denoising
Input image → image denoising → output image
[Diagram: a denoising model — observation nodes, hidden variable nodes, and prior relationships among neighboring hidden variables]
Energy minimization over the observations and the prior relationships yields the enhanced image.
Main Ideas
• Denoising: Exploit prior relationships among nodes to reduce the noise
[Diagram: the noisy score matrix over test-set shots and the concept lexicon; contextual correlation links concepts within a shot, temporal dependency links neighboring shots]
For Semantic Video Indexing
• Observation = detectors' predictions
• Prior relationships = ?
• Energy function = ?
Outline
• Multi-Cue Fusion Framework
• Modeling High-Order Relationship
• Inference using High-Order Relationship
• Experiments and Results
• Conclusions
System Framework
Training path: Video Archive → Video Segmentation → Feature Extraction → Lexicon Annotation → Training Concept Classifiers → Modeling High-Order Relationships
Testing path: Video Archive → Video Segmentation → Feature Extraction → Semantic Concept Prediction → Inference Using High-Order Relationships → Shot Ranking
Relationship Modeling
• Two issues
– Relationship discovering?
– Relationship representation?
Relationship Representation
• The probability of presence of X
  – given a binary variable Y
  – given two binary variables Y and Z
  – in general, given a partition of the data into S1, S2, S3, …, Sk, Sk+1, …, Sn
Relationship Discovering
• A recursive algorithm selects the variables that are
– Highly correlated to the target variable – Independent of other selected variables
• Chi-square test
– Discovers the hidden associations
– Judges whether a correlation is reliable
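The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the toy annotation data, the helper names, and the 3.84 threshold (the chi-square critical value for p < 0.05 at one degree of freedom) are all assumptions made here.

```python
def chi2_2x2(x, y):
    """Chi-square statistic of a 2x2 contingency table built from
    two binary label sequences of equal length."""
    n = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)      # both present
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)  # only x present
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)  # only y present
    d = n - a - b - c                                    # both absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def most_correlated(target, candidates, threshold=3.84):
    """Pick the candidate concept most correlated with the target,
    or None if nothing passes the chi-square threshold."""
    best, best_score = None, threshold
    for name, labels in candidates.items():
        score = chi2_2x2(target, labels)
        if score > best_score:
            best, best_score = name, score
    return best

# Hypothetical toy annotations: Mountain co-occurs with Sky, not Tree.
mountain = [1, 1, 1, 0, 0, 0, 1, 0]
sky      = [1, 1, 1, 0, 1, 0, 1, 0]
tree     = [0, 1, 0, 1, 0, 1, 0, 1]
print(most_correlated(mountain, {"Sky": sky, "Tree": tree}))  # -> Sky
```

In the recursive algorithm, this selection would then be repeated inside each partition cell induced by the chosen variable.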
• Concept lexicon:
– {Mountain(M), Sky(S), Tree(T), River(R)}
• Annotation data D
• To discover the contextual relationship for Mountain
Toy Example
• Correlation measuring
  – Assume that Sky is the most correlative concept to Mountain
• Data partition according to Sky
• Correlation measuring, within one partition
  – Assume that there is no concept with significant correlation to Mountain
• Correlation measuring, within the other partition
  – Assume that River is the most correlative concept to Mountain
• Data partition according to River
• Conditional probability estimation for each resulting partition cell
• The high-order relationship, under an independence assumption
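The conditional-probability step of the toy example amounts to simple counting per partition cell. A hedged sketch (not the authors' code; the flat partition below simplifies the recursive, tree-shaped partition on the slides, and the annotation data is invented):

```python
from collections import defaultdict

def partition_probabilities(target, splitters):
    """Estimate P(target = 1 | cell) for each cell of the partition
    induced by the selected binary splitter concepts."""
    counts = defaultdict(lambda: [0, 0])       # cell -> [positives, total]
    for i, t in enumerate(target):
        cell = tuple(s[i] for s in splitters)  # e.g., (Sky_i, River_i)
        counts[cell][0] += t
        counts[cell][1] += 1
    return {cell: pos / tot for cell, (pos, tot) in counts.items()}

# Hypothetical toy annotations for Mountain, partitioned by Sky and River.
mountain = [1, 1, 1, 0, 0, 0, 1, 0]
sky      = [1, 1, 1, 0, 1, 0, 1, 0]
river    = [1, 0, 1, 0, 1, 1, 0, 0]
probs = partition_probabilities(mountain, [sky, river])
```

Each cell's estimate then serves as the prior P(Mountain | context) used during inference.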
An Example from Real Data
• Concept lexicon: Columbia374
• Annotation data: TRECVid 2005 devel. set
• The discovered contextual relationship of concept Mountain:
H: hill, P: military_personnel, S: sky, G: group, L: landscape, V: valleys, C: commercial_advertisement, R: river, F: forest, K: rocky_ground, W: waterways, T: trees
[Diagram: a tree that recursively partitions the whole dataset by concept presence/absence (S+/S−, R+/R−, H+/H−, …); internal nodes are splits, leaf nodes are partition cells]
Temporal Relationships
• Discover temporal dependence among neighboring shots
• Similar to the way of discovering the contextual relationships
• Tests the correlation between neighboring shots in temporal order
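As a first-order illustration of such a temporal cue, one can estimate how strongly a concept persists from one shot to the next. This is a hedged sketch with an invented label sequence; the slides' actual relationships come from the chi-square discovery procedure, not from this helper:

```python
def transition_probs(labels):
    """Estimate P(concept present in shot t+1 | its state in shot t)
    from one concept's binary label sequence over consecutive shots."""
    totals = [0, 0]     # transitions observed from state 0 / state 1
    positives = [0, 0]  # next-shot positives from state 0 / state 1
    for prev, nxt in zip(labels, labels[1:]):
        totals[prev] += 1
        positives[prev] += nxt
    return [p / t if t else 0.0 for p, t in zip(positives, totals)]

# A concept that tends to persist across neighboring shots:
probs = transition_probs([1, 1, 1, 1, 0, 0, 0, 1, 1, 1])
```

A large gap between the two probabilities (here 5/6 vs. 1/3) is exactly the kind of temporal dependency the chi-square test would flag as significant.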
Inference using Relationships
[Diagram: a graphical model over a sequence of shots and a lexicon of concepts — prediction scores as observations, hidden variables, contextual cues within a shot, temporal cues across shots]
Energy Function
• Three terms: observed likelihood, contextual relationship, and temporal relationship
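The slide names the three terms but the formula itself was lost in extraction. One plausible quadratic form is sketched below; the notation is assumed here, not taken from the paper: $o_{i,c}$ the detector score for concept $c$ on shot $i$, $h_{i,c}$ the hidden refined score, $p_c(\cdot)$ and $q_c(\cdot)$ the learned contextual and temporal predictors, and $\lambda_x$, $\lambda_t$ weighting parameters.

```latex
E(\mathbf{h}) =
  \sum_{i,c} \bigl(h_{i,c} - o_{i,c}\bigr)^2
  + \lambda_x \sum_{i,c} \bigl(h_{i,c} - p_c(\mathbf{h}_i)\bigr)^2
  + \lambda_t \sum_{i,c} \bigl(h_{i,c} - q_c(\mathbf{h}_{i-1})\bigr)^2
```

The first term keeps the refined scores close to the observations; the other two pull them toward what the contextual and temporal relationships predict.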
Optimization
• Parameter estimation
  – Obtain the concept-dependent parameters from the training corpus with cross-validation
• Energy minimization
  – Use conjugate gradient methods to solve this non-linear function
  – Adopt prediction scores from detectors as an initial guess
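The minimization step can be sketched on a deliberately simplified energy: a data term plus only a temporal smoothness term over one concept's scores (the paper's full energy also has contextual terms, and its parameters are cross-validated). Everything below — the energy, the weight `lam`, and the toy scores — is an assumption for illustration.

```python
def grad(h, o, lam):
    """Gradient of E(h) = sum_i (h_i - o_i)^2 + lam * sum_i (h_{i+1} - h_i)^2."""
    g = [2 * (hi - oi) for hi, oi in zip(h, o)]
    for i in range(len(h) - 1):
        d = 2 * lam * (h[i + 1] - h[i])
        g[i + 1] += d
        g[i] -= d
    return g

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def minimize_cg(o, lam=1.0, iters=20):
    """Fletcher-Reeves conjugate gradient with exact line search
    (valid here because the energy is quadratic)."""
    h = list(o)                  # initial guess: the raw detector scores
    g = grad(h, o, lam)
    d = [-x for x in g]
    for _ in range(iters):
        hd = grad(d, [0.0] * len(d), lam)   # Hessian-vector product
        denom = dot(d, hd)
        if denom <= 1e-12:
            break
        alpha = -dot(g, d) / denom          # exact step along d
        h = [hi + alpha * di for hi, di in zip(h, d)]
        g_new = grad(h, o, lam)
        beta = dot(g_new, g_new) / max(dot(g, g), 1e-12)
        d = [-x + beta * y for x, y in zip(g_new, d)]
        g = g_new
        if dot(g, g) < 1e-18:
            break
    return h

# An isolated spike in the detector scores gets smoothed toward its neighbors.
smoothed = minimize_cg([0.0, 1.0, 0.0], lam=1.0)
```

For this three-shot toy problem the minimizer is [0.25, 0.5, 0.25]: the spike is damped and spread to its temporal neighbors, which is the qualitative effect the denoising view aims for.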
Experimental Settings 1/2
• TRECVid benchmark
• Performance evaluation
– 20 officially selected concepts in TRECVid 2006
– Inferred average precision (infAP) for individual concept performance
– Mean infAP for overall system performance
Dataset         Source           # of Videos   # of Shots
Training data   TV05 devel set   137           43,907
Test data       TV06 test set    259           79,484
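Inferred average precision generalizes average precision to incompletely judged pools; with complete judgments it reduces to the standard measure, sketched here (the ranked relevance list is a made-up example, and this is not the official trec_eval implementation):

```python
def average_precision(ranked_relevance):
    """Standard average precision of a ranked shot list, given binary
    relevance for every rank; assumes the list covers all relevant shots."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / k   # precision at each relevant rank
    return total / hits if hits else 0.0

# Relevant shots ranked 1st and 3rd: AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])
```

Mean infAP, the overall figure reported next, simply averages the per-concept values.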
Experimental Settings 2/2
• Baselines in our experiments
            VIREO-374                        Columbia374
Provider    City U. of Hong Kong             Columbia University
Features    Color moment, wavelet texture,   Edge direction histogram, Gabor,
            keypoint features                grid color moment
Learning    SVMs                             SVMs
Fusion      late average                     late average
Accuracy    high                             medium
Overall performance
Baseline mean infAP:   VIREO-374 0.1542   Columbia374 0.0948

Improvement over each baseline:
                       Method      VIREO-374   Columbia374
Contextual cues only   Liu et al.  0.2%        0.5%
                       MCF         16.7%       19.6%
Temporal cues only     Liu et al.  10.6%       16.9%
                       MCF         14.6%       17.3%
Both cues              Liu et al.  11.2%       18.1%
                       MCF-AC      19.7%       23.3%
                       MCF-EM      27.3%       32.1%

MCF-AC: MCF with average combination
MCF-EM: MCF with energy minimization
Performance of Individual Concepts
[Chart: per-concept infAP (0–0.6) for VIREO-374 baseline, Liu et al., MCF-AC, and MCF-EM]
Comparison with TRECVid 2006 Submissions
[Chart: infAP (0–0.2) comparison with TRECVid 2006 submissions]