
Multi-Cue Fusion for Semantic Video Indexing



(1)

Multi-Cue Fusion for Semantic Video Indexing

Ming-Fang Weng and Yung-Yu Chuang National Taiwan University

mfueng@cmlab.csie.ntu.edu.tw

(2)

Motivation

• The rapidly growing amount of video drives the need for semantic video search

[Diagram: video archives → concept-based video indexing → query-by-concept semantic video search, bridging the semantic gap]

(3)

Semantic Concepts

• Comprehensively characterize the meaning of the video content, e.g.,

Airplane Sports Crowd Mountain Building

(4)

Goal

• To improve the accuracy of semantic video indexing

A ranking list according to the confidence measure

[Figure: for each concept (e.g., Airplane, Sports), detection produces a list of shots ranked by confidence]

(5)

A Typical Approach

• Supervised learning

[Diagram: Video Archive → Video Segmentation → Feature Extraction; training path: Lexicon Annotation → Training Concept Classifiers; testing path: Semantic Concept Prediction → Shot Ranking]

(6)

A Typical Approach

• Supervised learning

Main problems:

• The annotation data is not fully utilized

• The labels for all concepts in all shots are predicted independently

(7)

Ground Truth Annotation

A sequence of video shots (training set)

A lexicon of concepts, annotated over eight training shots:

car      1 1 1 1 1 1 1 1
outdoor  1 1 1 1 1 1 1 1
building 1 0 0 1 1 1 1 0
sky      0 1 1 1 0 1 1 0
people   0 0 0 1 0 1 1 0
urban    1 1 0 1 1 1 1 1

Two cues: temporal dependency (along shots) and contextual correlation (across concepts)

(8)

Semantic Concept Prediction

[Figure: per-shot prediction scores (e.g., 0.92, 0.56, 0.81, …) for each concept in the lexicon (car, outdoor, building, sky, people, urban) over a sequence of test shots]

(9)

Our Views

• Detectors’ predictions form a “noisy image” in the contextual-temporal domain

[Figure: the prediction scores of all concepts over the test shot sequence, arranged as a noisy image]

(10)

Image Denoising

[Diagram: input (noisy) image → image denoising → output image]

(11)

Image Denoising

Observation

Hidden variable

(12)

Image Denoising

[Diagram: MRF model with observation nodes and hidden-variable nodes; edges encode the prior relationship]

Energy minimization: observation + prior relationship → enhanced image
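The energy-minimization view of denoising can be sketched on a 1D signal (a minimal illustration with a made-up quadratic energy, not this work's exact formulation): a data term keeps the estimate close to the observation, and a smoothness term encodes the prior relationship between neighbors.

```python
import numpy as np

def denoise_1d(y, lam=2.0, iters=500, step=0.05):
    """Minimize E(x) = sum((x - y)^2) + lam * sum((x[i] - x[i+1])^2) by gradient descent."""
    x = y.copy()                                   # start from the observation
    for _ in range(iters):
        grad = 2.0 * (x - y)                       # data (observation) term
        grad[:-1] += 2.0 * lam * (x[:-1] - x[1:])  # smoothness (prior) term
        grad[1:] += 2.0 * lam * (x[1:] - x[:-1])
        x -= step * grad
    return x

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, np.pi, 50))
noisy = clean + rng.normal(0.0, 0.2, 50)
denoised = denoise_1d(noisy)
```

On a smooth toy signal like this, the smoothed estimate should land closer to the clean signal than the raw observation does.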

(13)

Main Ideas

Denoising: Exploit prior relationships among nodes to reduce the noise

[Figure: the noisy prediction image over the test shot sequence and the concept lexicon, with temporal dependency along shots and contextual correlation across concepts as the priors]

(14)

For Semantic Video Indexing

Observation = detectors’ predictions

Prior relationships =?

Energy function =?

(15)

Outline

• Multi-Cue Fusion Framework

• Modeling High-Order Relationship

• Inference using High-Order Relationship

• Experiments and Results

• Conclusions

(16)

System Framework

[Diagram: Video Archive → Video Segmentation → Feature Extraction; training path: Lexicon Annotation → Training Concept Classifiers → Modeling High-Order Relationships; testing path: Semantic Concept Prediction → Inference Using High-Order Relationships → Shot Ranking]

(17)

Relationship Modeling

Two issues

– How to discover relationships?

– How to represent relationships?

(18)

Relationship Representation

• The probability of presence of X

(19)

Relationship Representation

• The probability of presence of X

• Given a binary variable Y

(20)

Relationship Representation

• The probability of presence of X

• Given two binary variables Y and Z

(21)

Relationship Representation

• The probability of presence of X

• Given a partition of the data

[Figure: the data partitioned into disjoint subsets S1, S2, S3, …, Sk, Sk+1, …, Sn]

(22)

Relationship Discovering

A recursive algorithm selects variables that are

– Highly correlated to the target variable – Independent of other selected variables

Chi-square test

– Discovers the hidden associations

– Judges whether a correlation is reliable
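The chi-square step can be sketched as follows, with hypothetical annotation vectors and scipy's `chi2_contingency` standing in for whatever implementation the authors used:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical binary annotations for two concepts over 12 training shots.
mountain = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0])
sky = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0])

# 2x2 contingency table counting co-occurrence patterns of the two labels.
table = np.zeros((2, 2), dtype=int)
for m, s in zip(mountain, sky):
    table[m, s] += 1

chi2, p, dof, expected = chi2_contingency(table)
reliable = p < 0.05  # a small p-value judges the correlation reliable
```

The same test, applied recursively to the not-yet-selected concepts, drives the relationship-discovery loop above.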

(23)

Toy Example

• Concept lexicon: {Mountain(M), Sky(S), Tree(T), River(R)}

• Annotation data D

• To discover the contextual relationship for Mountain

(24)

Toy Example

• Correlation measuring

– Assume that Sky is the concept most correlated with Mountain

(25)

Toy Example

• Data partition according to Sky

(26)

Toy Example

• Correlation measuring

– Assume that there is no concept with significant correlation to Mountain

(27)

Toy Example

• Correlation measuring

– Assume that River is the concept most correlated with Mountain

(28)

Toy Example

• Data partition according to River

(29)

Toy Example

• Conditional probability estimation

E.g.,
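The counting involved can be sketched with made-up annotation data and a partition by Sky then River, mirroring the toy example (all numbers here are hypothetical):

```python
import numpy as np

# Hypothetical annotation data: columns are (Mountain, Sky, River) over 10 shots.
D = np.array([
    [1, 1, 1], [1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0],
    [1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1],
])
M, S, R = D[:, 0], D[:, 1], D[:, 2]

def cond_prob(target, mask):
    """Estimate P(target = 1 | shots selected by mask) by counting."""
    return target[mask].mean()

# Conditional probabilities at the leaves of the toy partition:
p_m_given_s1 = cond_prob(M, S == 1)                   # P(M=1 | Sky present)
p_m_given_s0_r1 = cond_prob(M, (S == 0) & (R == 1))   # P(M=1 | no Sky, River)
p_m_given_s0_r0 = cond_prob(M, (S == 0) & (R == 0))   # P(M=1 | no Sky, no River)
```

Each leaf of the partition tree thus stores a simple relative frequency estimated from the annotation data.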

(30)

Toy Example

• The high-order relationship

Independence assumption

(31)

An Example from Real Data

• Concept lexicon: Columbia374

• Annotation data: TRECVid 2005 devel. set

• The discovered contextual relationship of concept Mountain

Legend: H: hill, P: military_personnel, S: sky, G: group, L: landscape, V: valleys, C: commercial_advertisement, R: river, F: forest, K: rocky_ground, W: waterways, T: trees

[Figure: the discovered partition tree for Mountain; starting from the whole dataset, internal nodes split on the presence (+) or absence (−) of concepts such as H, P, L, S, G, V, K, C, R, F, W, and T, ending in leaf nodes]

(32)

Temporal Relationships

• Discover temporal dependence among neighboring shots

• Similar to the way the contextual relationships are discovered

• Tests the correlation between neighboring shots in temporal order
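The temporal test can be sketched with the same chi-square machinery, applied to a concept's labels at adjacent shots (hypothetical label sequence):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical per-shot labels for one concept along the video timeline.
labels = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# Contingency table of (label at shot t, label at shot t+1).
prev, curr = labels[:-1], labels[1:]
table = np.zeros((2, 2), dtype=int)
for a, b in zip(prev, curr):
    table[a, b] += 1

chi2, p, dof, _ = chi2_contingency(table)
# A small p-value would indicate temporal dependence: concepts that persist
# across shots produce heavy (0,0) and (1,1) diagonal counts.
```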

(33)

Inference using Relationships

[Figure: the inference graph over a sequence of shots and a lexicon of concepts; each node carries a prediction score and a hidden variable, linked by contextual and temporal cues]

(34)

Energy Function

[Equation: the energy is the sum of an observed-likelihood term, a contextual-relationship term, and a temporal-relationship term]
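The three terms can be sketched as follows, using an illustrative quadratic form with made-up weights alpha, beta, and gamma; the paper's actual potentials come from the discovered high-order relationships:

```python
import numpy as np

def energy(x, y, ctx_pairs, tmp_pairs, alpha=1.0, beta=0.5, gamma=0.5):
    """E(x) = observed likelihood + contextual + temporal terms (illustrative).

    x: refined scores (concepts x shots); y: detector predictions (same shape);
    ctx_pairs: (concept_i, concept_j) pairs with a contextual link;
    tmp_pairs: (shot_t, shot_u) pairs with a temporal link.
    """
    e = alpha * np.sum((x - y) ** 2)                   # stay close to observations
    for i, j in ctx_pairs:                             # linked concepts should agree
        e += beta * np.sum((x[i, :] - x[j, :]) ** 2)
    for t, u in tmp_pairs:                             # neighboring shots should agree
        e += gamma * np.sum((x[:, t] - x[:, u]) ** 2)
    return e

y = np.array([[0.9, 0.8, 0.2],
              [0.7, 0.9, 0.1]])                        # 2 concepts x 3 shots
e0 = energy(y, y, ctx_pairs=[(0, 1)], tmp_pairs=[(0, 1), (1, 2)])
```

With x equal to the raw predictions the data term vanishes, so any remaining energy comes from disagreement with the contextual and temporal priors.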

(35)

Optimization

Parameter estimation

– Obtain the concept-dependent parameters from the training corpus with cross validation

Energy minimization

– Use conjugate gradient methods to solve this non-linear function

– Adopt prediction scores from detectors as an initial guess
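The minimization step can be sketched with scipy's conjugate-gradient solver on a toy energy (hypothetical detector scores and a simple smoothness prior; the paper's energy is richer):

```python
import numpy as np
from scipy.optimize import minimize

y = np.array([0.9, 0.2, 0.8, 0.7, 0.1])   # toy detector scores for one concept

def energy(x, lam=1.0):
    # Observation term plus a term linking temporally adjacent shots.
    return np.sum((x - y) ** 2) + lam * np.sum(np.diff(x) ** 2)

res = minimize(energy, x0=y, method="CG")  # detector scores as the initial guess
refined = res.x
```

The refined scores trade fidelity to the detector output for temporal smoothness, ending at a lower energy than the initial guess.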

(36)

Experimental Settings 1/2

TRECVid benchmark

Performance evaluation

– 20 officially selected concepts in TRECVid 2006

– Inferred average precision (infAP) for individual concept performance

– Mean infAP for overall system performance

Dataset                          # of Videos  # of Shots
Training data: TV05 devel set    137          43,907
Test data: TV06 test set         259          79,484
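For intuition, ordinary (non-inferred) average precision over a ranked shot list can be computed as below; infAP generalizes this to incompletely judged pools. The relevance list is hypothetical:

```python
def average_precision(ranked_relevance):
    """AP of a fully judged ranked list: mean of precision@k over relevant positions."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits if hits else 0.0

# Hypothetical ranked list (1 = relevant shot) from one concept detector.
ap = average_precision([1, 0, 1, 1, 0, 0, 1, 0])
```

Ranking relevant shots earlier raises each precision@k term and therefore the AP.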

(37)

Experimental Settings 2/2

Baselines in our experiments

            VIREO-374                       Columbia374
Provider    City U. of H. K.                Columbia University
Feature     Color moment, wavelet texture,  Edge direction histogram, Gabor,
            keypoint features               grid color moment
Learning    SVMs                            SVMs
Fusion      late average                    late average
Accuracy    high                            medium

(38)

Overall performance

Baseline      VIREO-374   Columbia374
Mean infAP    0.1542      0.0948

Improvement over baseline:

Contextual cues only   Liu et al.   0.2%    0.5%
                       MCF          16.7%   19.6%
Temporal cues only     Liu et al.   10.6%   16.9%
                       MCF          14.6%   17.3%
Both cues              Liu et al.   11.2%   18.1%
                       MCF-AC       19.7%   23.3%
                       MCF-EM       27.3%   32.1%

MCF-AC: MCF with average combination

MCF-EM: MCF with energy minimization

(39)

Performance of Individual Concepts

[Figure: per-concept infAP (0 to 0.6) of the VIREO-374 baseline, Liu et al., MCF-AC, and MCF-EM across the 20 concepts]

(40)

Comparison with TRECVid 2006 Submissions

[Figure: infAP (0 to 0.2) of the best run of each group in TRECVid 2006, compared with the VIREO-374 and Columbia374 baselines and MCF applied to each]

(41)

Observations

• MCF improved each of the 20 concepts, with gains ranging from 5.9% to 88.1%

• 15 of 20 concepts showed more than 20% improvement

• We achieved ~30% performance gain for two detectors with different levels of accuracy

(42)

Concluding Remarks

• Proposed a general framework that improves the accuracy of semantic concept detection in videos

• MCF has four advantages:

– Scalable

– Annotation data is reused

– Temporal and contextual information are used simultaneously

– Independent of classifiers

(43)

Thank You for Your Attention
