Multi-Cue Fusion for Semantic Video Indexing
Ming-Fang Weng and Yung-Yu Chuang National Taiwan University
mfueng@cmlab.csie.ntu.edu.tw
Motivation
• The rapidly growing amount of video drives the need for semantic video search
[Diagram: video archives → concept-based video indexing → query-by-concept, bridging the semantic gap in semantic video search]
Semantic Concepts
• Semantic concepts comprehensively characterize the meaning of video content, e.g.,
Airplane Sports Crowd Mountain Building
Goal
• To improve the accuracy of semantic video indexing
• A ranking list of shots according to a confidence measure, one list per concept (e.g., detecting Airplane, detecting Sports)
[Diagram: two ranked lists of six shots each, ordered by detector confidence]
A Typical Approach
• Supervised learning
Training path: Video Archive → Video Segmentation → Feature Extraction → Lexicon Annotation → Training Concept Classifiers
Testing path: Video Archive → Video Segmentation → Feature Extraction → Semantic Concept Prediction → Shot Ranking
A Typical Approach
• Supervised learning
• Main problems:
• The annotation data is not fully utilized
• The labels for all concepts in all shots are predicted independently
Ground Truth Annotation
A lexicon of concepts annotated over a sequence of video shots (training set):

car      1 1 1 1 1 1 1 1
outdoor  1 1 1 1 1 1 1 1
building 1 0 0 1 1 1 1 0
sky      0 1 1 1 0 1 1 0
people   0 0 0 1 0 1 1 0
urban    1 1 0 1 1 1 1 1
The annotations exhibit temporal dependency (along neighboring shots) and contextual correlation (across concepts).
Semantic Concept Prediction
A lexicon of concepts: car, outdoor, building, sky, people, urban
[Diagram: a matrix of per-shot, per-concept prediction scores (e.g., 0.92, 0.56, 0.81, …) over a sequence of video shots from the test set]
Our Views
• Detectors’ predictions form a “noisy image”
in the contextual-temporal domain
[Diagram: the prediction-score matrix over a sequence of test-set shots and the concept lexicon (people, sky, car, urban, building, outdoor), labeled NOISE]
Image Denoising
Input image → image denoising → output image
[Diagram: a denoising model — observation nodes, hidden variable nodes, and prior relationships among neighboring hidden variables]
Energy minimization over the observations and the prior relationships yields the enhanced image.
Main Ideas
• Denoising: Exploit prior relationships among nodes to reduce the noise
[Diagram: the noisy score matrix over test-set shots and the concept lexicon; contextual correlation links concepts within a shot, temporal dependency links neighboring shots]
For Semantic Video Indexing
• Observation = detectors' predictions
• Prior relationships = ?
• Energy function = ?
Outline
• Multi-Cue Fusion Framework
• Modeling High-Order Relationship
• Inference using High-Order Relationship
• Experiments and Results
• Conclusions
System Framework
Training path: Video Archive → Video Segmentation → Feature Extraction → Lexicon Annotation → Training Concept Classifiers → Modeling High-Order Relationships
Testing path: Video Archive → Video Segmentation → Feature Extraction → Semantic Concept Prediction → Inference Using High-Order Relationships → Shot Ranking
Relationship Modeling
• Two issues
– Relationship discovering?
– Relationship representation?
Relationship Representation
• The probability of presence of X
  – given a binary variable Y
  – given two binary variables Y and Z
  – in general, given a partition of the data into S1, S2, S3, …, Sk, Sk+1, …, Sn
Relationship Discovering
• A recursive algorithm selects the variables that are
– Highly correlated to the target variable – Independent of other selected variables
• Chi-square test
– Discovers the hidden associations
– Judges whether a correlation is reliable
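The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the toy annotation data, the helper names, and the 3.84 threshold (the chi-square critical value for p < 0.05 at one degree of freedom) are all assumptions made here.

```python
def chi2_2x2(x, y):
    """Chi-square statistic of a 2x2 contingency table built from
    two binary label sequences of equal length."""
    n = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)      # both present
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)  # only x present
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)  # only y present
    d = n - a - b - c                                    # both absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def most_correlated(target, candidates, threshold=3.84):
    """Pick the candidate concept most correlated with the target,
    or None if nothing passes the chi-square threshold."""
    best, best_score = None, threshold
    for name, labels in candidates.items():
        score = chi2_2x2(target, labels)
        if score > best_score:
            best, best_score = name, score
    return best

# Hypothetical toy annotations: Mountain co-occurs with Sky, not Tree.
mountain = [1, 1, 1, 0, 0, 0, 1, 0]
sky      = [1, 1, 1, 0, 1, 0, 1, 0]
tree     = [0, 1, 0, 1, 0, 1, 0, 1]
print(most_correlated(mountain, {"Sky": sky, "Tree": tree}))  # -> Sky
```

In the recursive algorithm, this selection would then be repeated inside each partition cell induced by the chosen variable.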
• Concept lexicon:
– {Mountain(M), Sky(S), Tree(T), River(R)}
• Annotation data D
• To discover the contextual relationship for Mountain
Toy Example
• Correlation measuring
  – Assume that Sky is the most correlative concept to Mountain
• Data partition according to Sky
• Correlation measuring, within one partition
  – Assume that there is no concept with significant correlation to Mountain
• Correlation measuring, within the other partition
  – Assume that River is the most correlative concept to Mountain
• Data partition according to River
• Conditional probability estimation for each resulting partition cell
• The high-order relationship, under an independence assumption
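The conditional-probability step of the toy example amounts to simple counting per partition cell. A hedged sketch (not the authors' code; the flat partition below simplifies the recursive, tree-shaped partition on the slides, and the annotation data is invented):

```python
from collections import defaultdict

def partition_probabilities(target, splitters):
    """Estimate P(target = 1 | cell) for each cell of the partition
    induced by the selected binary splitter concepts."""
    counts = defaultdict(lambda: [0, 0])       # cell -> [positives, total]
    for i, t in enumerate(target):
        cell = tuple(s[i] for s in splitters)  # e.g., (Sky_i, River_i)
        counts[cell][0] += t
        counts[cell][1] += 1
    return {cell: pos / tot for cell, (pos, tot) in counts.items()}

# Hypothetical toy annotations for Mountain, partitioned by Sky and River.
mountain = [1, 1, 1, 0, 0, 0, 1, 0]
sky      = [1, 1, 1, 0, 1, 0, 1, 0]
river    = [1, 0, 1, 0, 1, 1, 0, 0]
probs = partition_probabilities(mountain, [sky, river])
```

Each cell's estimate then serves as the prior P(Mountain | context) used during inference.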
An Example from Real Data
• Concept lexicon: Columbia374
• Annotation data: TRECVid 2005 devel. set
• The discovered contextual relationship of concept Mountain:
H: hill, P: military_personnel, S: sky, G: group, L: landscape, V: valleys, C: commercial_advertisement, R: river, F: forest, K: rocky_ground, W: waterways, T: trees
[Diagram: a tree that recursively partitions the whole dataset by concept presence/absence (S+/S−, R+/R−, H+/H−, …); internal nodes are splits, leaf nodes are partition cells]
Temporal Relationships
• Discover temporal dependence among neighboring shots
• Similar to the way of discovering the contextual relationships
• Tests the correlation between neighboring shots in temporal order
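As a first-order illustration of such a temporal cue, one can estimate how strongly a concept persists from one shot to the next. This is a hedged sketch with an invented label sequence; the slides' actual relationships come from the chi-square discovery procedure, not from this helper:

```python
def transition_probs(labels):
    """Estimate P(concept present in shot t+1 | its state in shot t)
    from one concept's binary label sequence over consecutive shots."""
    totals = [0, 0]     # transitions observed from state 0 / state 1
    positives = [0, 0]  # next-shot positives from state 0 / state 1
    for prev, nxt in zip(labels, labels[1:]):
        totals[prev] += 1
        positives[prev] += nxt
    return [p / t if t else 0.0 for p, t in zip(positives, totals)]

# A concept that tends to persist across neighboring shots:
probs = transition_probs([1, 1, 1, 1, 0, 0, 0, 1, 1, 1])
```

A large gap between the two probabilities (here 5/6 vs. 1/3) is exactly the kind of temporal dependency the chi-square test would flag as significant.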
Inference using Relationships
[Diagram: a graphical model over a sequence of shots and a lexicon of concepts — prediction scores as observations, hidden variables, contextual cues within a shot, temporal cues across shots]
Energy Function
• Three terms: observed likelihood, contextual relationship, and temporal relationship
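The slide names the three terms but the formula itself was lost in extraction. One plausible quadratic form is sketched below; the notation is assumed here, not taken from the paper: $o_{i,c}$ the detector score for concept $c$ on shot $i$, $h_{i,c}$ the hidden refined score, $p_c(\cdot)$ and $q_c(\cdot)$ the learned contextual and temporal predictors, and $\lambda_x$, $\lambda_t$ weighting parameters.

```latex
E(\mathbf{h}) =
  \sum_{i,c} \bigl(h_{i,c} - o_{i,c}\bigr)^2
  + \lambda_x \sum_{i,c} \bigl(h_{i,c} - p_c(\mathbf{h}_i)\bigr)^2
  + \lambda_t \sum_{i,c} \bigl(h_{i,c} - q_c(\mathbf{h}_{i-1})\bigr)^2
```

The first term keeps the refined scores close to the observations; the other two pull them toward what the contextual and temporal relationships predict.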
Optimization
• Parameter estimation
  – Obtain the concept-dependent parameters from the training corpus with cross-validation
• Energy minimization
  – Use conjugate gradient methods to solve this non-linear function
  – Adopt prediction scores from detectors as an initial guess
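The minimization step can be sketched on a deliberately simplified energy: a data term plus only a temporal smoothness term over one concept's scores (the paper's full energy also has contextual terms, and its parameters are cross-validated). Everything below — the energy, the weight `lam`, and the toy scores — is an assumption for illustration.

```python
def grad(h, o, lam):
    """Gradient of E(h) = sum_i (h_i - o_i)^2 + lam * sum_i (h_{i+1} - h_i)^2."""
    g = [2 * (hi - oi) for hi, oi in zip(h, o)]
    for i in range(len(h) - 1):
        d = 2 * lam * (h[i + 1] - h[i])
        g[i + 1] += d
        g[i] -= d
    return g

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def minimize_cg(o, lam=1.0, iters=20):
    """Fletcher-Reeves conjugate gradient with exact line search
    (valid here because the energy is quadratic)."""
    h = list(o)                  # initial guess: the raw detector scores
    g = grad(h, o, lam)
    d = [-x for x in g]
    for _ in range(iters):
        hd = grad(d, [0.0] * len(d), lam)   # Hessian-vector product
        denom = dot(d, hd)
        if denom <= 1e-12:
            break
        alpha = -dot(g, d) / denom          # exact step along d
        h = [hi + alpha * di for hi, di in zip(h, d)]
        g_new = grad(h, o, lam)
        beta = dot(g_new, g_new) / max(dot(g, g), 1e-12)
        d = [-x + beta * y for x, y in zip(g_new, d)]
        g = g_new
        if dot(g, g) < 1e-18:
            break
    return h

# An isolated spike in the detector scores gets smoothed toward its neighbors.
smoothed = minimize_cg([0.0, 1.0, 0.0], lam=1.0)
```

For this three-shot toy problem the minimizer is [0.25, 0.5, 0.25]: the spike is damped and spread to its temporal neighbors, which is the qualitative effect the denoising view aims for.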
Experimental Settings 1/2
• TRECVid benchmark
• Performance evaluation
– 20 officially selected concepts in TRECVid 2006
– Inferred average precision (infAP) for individual concept performance
– Mean infAP for overall system performance
Dataset         Source           # of Videos   # of Shots
Training data   TV05 devel set   137           43,907
Test data       TV06 test set    259           79,484
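Inferred average precision generalizes average precision to incompletely judged pools; with complete judgments it reduces to the standard measure, sketched here (the ranked relevance list is a made-up example, and this is not the official trec_eval implementation):

```python
def average_precision(ranked_relevance):
    """Standard average precision of a ranked shot list, given binary
    relevance for every rank; assumes the list covers all relevant shots."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / k   # precision at each relevant rank
    return total / hits if hits else 0.0

# Relevant shots ranked 1st and 3rd: AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])
```

Mean infAP, the overall figure reported next, simply averages the per-concept values.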
Experimental Settings 2/2
• Baselines in our experiments
            VIREO-374                        Columbia374
Provider    City U. of Hong Kong             Columbia University
Features    Color moment, wavelet texture,   Edge direction histogram, Gabor,
            keypoint features                grid color moment
Learning    SVMs                             SVMs
Fusion      late average                     late average
Accuracy    high                             medium
Overall performance
Baseline mean infAP:   VIREO-374 0.1542   Columbia374 0.0948

Improvement over each baseline:
                       Method      VIREO-374   Columbia374
Contextual cues only   Liu et al.  0.2%        0.5%
                       MCF         16.7%       19.6%
Temporal cues only     Liu et al.  10.6%       16.9%
                       MCF         14.6%       17.3%
Both cues              Liu et al.  11.2%       18.1%
                       MCF-AC      19.7%       23.3%
                       MCF-EM      27.3%       32.1%

MCF-AC: MCF with average combination
MCF-EM: MCF with energy minimization
Performance of Individual Concepts
[Chart: per-concept infAP (0–0.6) for VIREO-374 baseline, Liu et al., MCF-AC, and MCF-EM]
Comparison with TRECVid 2006 Submissions
[Chart: infAP (0–0.2) comparison with TRECVid 2006 submissions]