Homogeneous Segmentation and Classifier Ensemble for Audio Tag Annotation and Retrieval
Hung-Yi Lo, Ju-Chiang Wang, and Hsin-Min Wang
July 20, 2010
Spoken Language Processing Group
Natural Language and Knowledge Processing Lab.
Institute of Information Science, Academia Sinica, Taiwan
http://sovideo.iis.sinica.edu.tw/SLG
Social Tagging to Music
Audio Tag Annotation and Retrieval
Annotating audio clips with tags
• An audio clip is scored using one predictor for each tag (e.g., Female, R&B, Guitar, Metal, Bass); the scores of the tag predictors annotate the clip
Retrieving audio clips using a tag query
• Given a query (e.g., Rock), audio clips are ranked based on the scores of the Rock predictor; the ranking list for the query runs from high relevance to low relevance
Our Contributions
1. Dividing the audio signal into homogeneous segments using an audio novelty curve
2. Each tag predictor is an ensemble classifier combining two classifiers: SVM and AdaBoost
• Ranking Ensemble for audio tag retrieval
• Probability Ensemble for audio tag annotation
Our ranking ensemble won the Audio Tagging Competition in the 2009 Music Information Retrieval Evaluation eXchange (MIREX)
• In terms of tag F-measure and the area under the ROC curve given a tag (for audio retrieval)
Audio Segmentation
• Feature of the Matrix: 13-dim MFCC
• Kernel Type: Gaussian
• Kernel Size: 128 frames
• The prediction score on the whole clip is the average of the scores on each segment.
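The segmentation settings above can be illustrated with a minimal sketch. It assumes the novelty curve is computed by sliding a Gaussian-tapered checkerboard kernel along the diagonal of a frame-level self-similarity matrix; the slides do not give the exact formulation, so the cosine similarity, the `taper` width, and the function names `checkerboard_kernel`, `novelty_curve`, and `clip_score` are all assumptions made for illustration. How novelty peaks are turned into segment boundaries is not stated in the slides and is left out.

```python
import numpy as np

def checkerboard_kernel(size, taper=0.5):
    """Gaussian-tapered checkerboard kernel (size assumed even; taper is an assumed width)."""
    half = size // 2
    offsets = np.arange(-half, half) + 0.5
    gauss = np.exp(-0.5 * (offsets / (taper * half)) ** 2)
    signs = np.sign(offsets)
    # +1 in the two same-side quadrants, -1 in the two cross quadrants
    return np.outer(gauss, gauss) * np.outer(signs, signs)

def novelty_curve(features, kernel_size=128):
    """Novelty per frame from a (n_frames, n_dims) feature matrix, e.g. 13-dim MFCCs."""
    feats = np.asarray(features, dtype=float)
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T  # cosine self-similarity matrix (assumed similarity measure)
    half = kernel_size // 2
    width = 2 * half
    kernel = checkerboard_kernel(width)
    padded = np.pad(sim, half, mode="constant")  # zero padding at the clip edges
    n = sim.shape[0]
    return np.array([
        np.sum(padded[i:i + width, i:i + width] * kernel) for i in range(n)
    ])

def clip_score(segment_scores):
    """Clip-level prediction score: the average of the per-segment scores."""
    return float(np.mean(segment_scores))
```

On a toy input made of two constant, mutually orthogonal blocks of frames, the curve peaks at the first frame of the second block, which is the behavior a segment boundary detector would rely on.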
Audio Feature Extraction Using MIRToolbox
Classes     Features
Dynamics    RMS
Rhythm      Peak and centroids of the fluctuation summary
            Tempo
            Attack slope and attack time of the onset
Timbre      Zero-crossing rate
            Spectral centroid, spread, skewness and kurtosis
            Brightness
            Rolloff with 95% threshold
            Rolloff with 85% threshold
            Spectral entropy and flatness
            Roughness
            Irregularity
            Inharmonicity
            MFCCs, delta-MFCCs, and delta-delta-MFCCs
            Low energy rate
            Spectral flux
Pitch       Pitch
            Chromagram and its centroids and highest peak
Tonality    Key clarity
            Key mode
            Harmonic change
Classification Methods and the Difficulties
The tag predictor is an ensemble that combines the outputs of two classifiers
• SVM: linear SVM implemented by the LIBLINEAR package
• AdaBoost: decision stump as the base learner
Two methods to merge the two prediction scores
1. Ranking Ensemble for the retrieval task
• The scales of the two classifiers' prediction scores are rather different
2. Probability Ensemble for the annotation task
• The scores of different tag predictors (e.g., Female, R&B, Guitar, Metal, Bass) are not comparable
Ranking Ensemble
[Figure: the prediction scores of AdaBoost and SVM are converted into their respective rankings; the average ranking is the merged prediction.]
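The rank-averaging idea shown on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `rank_descending` and `ranking_ensemble`, the tie-breaking rule, and the example scores are all hypothetical.

```python
import numpy as np

def rank_descending(scores):
    """Rank 1 goes to the highest score; ties are broken by input order (an assumption)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def ranking_ensemble(scores_a, scores_b):
    """Average the two classifiers' rankings; a lower value means more relevant."""
    return (rank_descending(scores_a) + rank_descending(scores_b)) / 2.0

# Hypothetical scores for four clips under one tag; the two scales differ widely.
adaboost_scores = [2.0, -1.0, 0.5, 0.1]
svm_scores = [10.0, 30.0, 20.0, 40.0]
merged = ranking_ensemble(adaboost_scores, svm_scores)
```

Because only the order of the scores is used, the different scales of the two classifiers no longer matter when the outputs are merged.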
Probability Ensemble
In the audio annotation task, we need to compare the scores of all tag predictors
• The raw scores of different tag classifiers are not comparable
We transform the output scores of SVM and AdaBoost into probability scores with a sigmoid function:
$$\Pr(y = 1 \mid \mathbf{x}) \approx \frac{1}{1 + \exp(Af + B)}$$
• f: the output score of a classifier
• A, B: parameters that can be learned by solving a regularized maximum likelihood problem
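The sigmoid mapping above can be sketched in a few lines. The sketch assumes that the "regularized maximum likelihood problem" is the usual Platt-style fit with smoothed targets, and it uses plain gradient descent for brevity; the slides specify neither, and the names `fit_sigmoid` and `to_probability`, the learning rate, and the iteration count are hypothetical.

```python
import numpy as np

def fit_sigmoid(scores, labels, n_iter=5000, lr=0.1):
    """Fit A, B in Pr(y=1|x) = 1 / (1 + exp(A*f + B)) by gradient descent."""
    f = np.asarray(scores, dtype=float)
    y = np.asarray(labels)
    n_pos = int(np.sum(y == 1))
    n_neg = len(y) - n_pos
    # Smoothed targets play the role of the regularizer (assumed, Platt-style).
    targets = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    A, B = 0.0, np.log((n_neg + 1.0) / (n_pos + 1.0))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        grad = targets - p  # derivative of the cross-entropy w.r.t. (A*f + B)
        A -= lr * np.mean(grad * f)
        B -= lr * np.mean(grad)
    return A, B

def to_probability(scores, A, B):
    """Map raw classifier scores to probabilities with the fitted sigmoid."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(scores, dtype=float) + B))
```

With this sign convention, A comes out negative whenever higher raw scores go with positive labels, so the mapped probability increases with the raw score. Once every tag predictor outputs a probability, the scores of different tags live on the same scale and can be compared for annotation.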
Model Selection
MIREX evaluates submitted algorithms by 3-fold cross-validation
Inner cross-validation on the training set to determine the classifier parameters
• The cost parameter C in the linear SVM
• The number of base learners in AdaBoost
Re-train the classifiers with the complete training set and the selected parameters
Model selection criterion: AUC-ROC
• Since the class distributions for some tags are imbalanced
[Figure: inner cross-validation nested within the outer cross-validation]
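The inner cross-validation step can be sketched as below. This is an illustration under assumptions: `fit_predict` is a hypothetical callable that stands in for training a classifier with one candidate parameter (for example the SVM cost C or the number of AdaBoost base learners) and scoring the validation fold; the stratified fold assignment, the fold count, and the function names are also assumptions.

```python
import numpy as np

def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def stratified_folds(y, n_folds, rng):
    """Split indices into folds that each contain both classes."""
    folds = [[] for _ in range(n_folds)]
    for cls in (0, 1):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for k, part in enumerate(np.array_split(idx, n_folds)):
            folds[k].extend(part.tolist())
    return [np.array(f) for f in folds]

def select_parameter(fit_predict, candidates, X, y, n_folds=3, seed=0):
    """Pick the candidate with the highest mean AUC-ROC over the inner folds."""
    X, y = np.asarray(X), np.asarray(y)
    folds = stratified_folds(y, n_folds, np.random.default_rng(seed))
    best_param, best_auc = None, -np.inf
    for param in candidates:
        aucs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            aucs.append(auc_roc(y[val], fit_predict(param, X[trn], y[trn], X[val])))
        if np.mean(aucs) > best_auc:
            best_param, best_auc = param, float(np.mean(aucs))
    return best_param, best_auc
```

After selection, the classifier would be re-trained on the complete training set with `best_param`, as the slide describes. AUC-ROC is a natural criterion here because it does not depend on a decision threshold, which matters when a tag has few positive clips.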
MIREX 2009 Results on the MajorMiner Dataset
          Tag F-measure   Tag Accuracy   Tag AUC-ROC   Clip AUC-ROC
No Seg    0.289           0.900          0.782         0.751
Seg       0.311           0.903          0.807         0.774
BP1       0.277           0.868          0.742         0.871
BP2       0.290           0.859          0.761         0.861
CC1       0.209           0.912          0.762         0.882
CC2       0.241           0.905          0.791         0.882
CC3       0.170           0.913          0.721         0.854
CC4       0.263           0.890          0.749         0.854
GP        0.012           0.891          -             -
GT1       0.290           0.850          0.784         0.872
Audio Retrieval: Given a tag query, correct audio clips should be ranked higher
Audio Annotation: Given a clip, correct tags should have higher scores
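The two AUC-ROC columns differ in the direction in which the score matrix is read. The sketch below assumes Tag AUC-ROC is averaged over tags (ranking clips for each tag) and Clip AUC-ROC is averaged over clips (ranking tags for each clip); this reading follows the retrieval and annotation descriptions on the slide, but the exact MIREX computation is not given here, and the function names and toy data are hypothetical.

```python
import numpy as np

def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def mean_auc(labels, scores, per="tag"):
    """labels, scores: (n_clips, n_tags). per='tag' ranks clips; per='clip' ranks tags."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    if per == "clip":
        labels, scores = labels.T, scores.T
    aucs = []
    for j in range(labels.shape[1]):
        col = labels[:, j]
        if 0 < col.sum() < len(col):  # skip columns without both classes
            aucs.append(auc_roc(col, scores[:, j]))
    return float(np.mean(aucs))

# Toy example: three clips, three tags; the second tag's predictor uses a larger scale.
toy_labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
toy_scores = [[0.9, 2.0, 0.1], [0.3, 8.0, 0.4], [0.2, 6.0, 0.7]]
tag_auc = mean_auc(toy_labels, toy_scores, per="tag")
clip_auc = mean_auc(toy_labels, toy_scores, per="clip")
```

In the toy data every tag ranks its clips perfectly, yet the clip-level value drops because one predictor's scores are on a larger scale, which is the comparability problem the annotation task has to deal with.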
MIREX 2009 Results on the Mood Dataset
          Tag F-measure   Tag Accuracy   Tag AUC-ROC   Clip AUC-ROC
No Seg    0.204           0.882          0.667         0.678
Seg       0.219           0.887          0.701         0.704
BP1       0.195           0.837          0.648         0.854
BP2       0.193           0.829          0.632         0.859
CC1       0.172           0.878          0.652         0.849
CC2       0.180           0.882          0.681         0.848
CC3       0.147           0.882          0.629         0.812
CC4       0.183           0.862          0.646         0.812
GP        0.084           0.863          -             -
GT1       0.211           0.823          0.649         0.860
GT2       0.209           0.824          0.655         0.861
HBC       0.063           0.909          0.664         0.861
Extended Experiments
We extensively evaluate the classifiers and the ensemble methods on the downloaded MajorMiner dataset
• MajorMiner is a web-based music labeling game: http://majorminer.org/
Our extended experiments basically follow the MIREX 2009 setup
• Use the same 45 tags and download all the audio clips that are associated with these tags
• The dataset might be slightly different from that used in MIREX 2009
• The resulting audio database contains 2,472 clips
Repeat cross-validation twenty times to reduce variance
Example tags: metal, instrumental, horns, piano, guitar, ambient, saxophone, house, loud, bass, fast, keyboard, vocal, noise, british, solo, electronica, beat, 80s, dance, jazz, drum machine, strings, pop, r&b, female, distortion, voice, rap, male
Results of the Audio Retrieval Task
Mean ± standard deviation
                       Tag AUC-ROC                       Tag F-measure
                       Without Seg.     With Seg.        Without Seg.     With Seg.
AdaBoost               0.7520±0.0026    0.7943±0.0024    0.2856±0.0036    0.3034±0.0051
Linear SVM             0.7848±0.0029    0.7990±0.0030    0.3092±0.0028    0.3169±0.0038
Probability Ensemble   0.7894±0.0030    0.8108±0.0020    0.3163±0.0037    0.3296±0.0039
Ranking Ensemble       0.7997±0.0022    0.8189±0.0017    0.3211±0.0032    0.3332±0.0038
Tag AUC-ROC gain from segmentation: AdaBoost 4.23%, Linear SVM 1.42%, Probability Ensemble 2.14%, Ranking Ensemble 1.92%
The Ranking Ensemble with segmentation is better than AdaBoost without segmentation by 6.69% in Tag AUC-ROC
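The percentage callouts on this slide (4.23%, 1.42%, 2.14%, 1.92%, 6.69%) appear to be absolute differences in mean Tag AUC-ROC, expressed in percent. The short check below recomputes them from the table values; the reading of the callouts is an inference from the numbers, and the variable names are illustrative.

```python
# Mean Tag AUC-ROC from the retrieval table: (without segmentation, with segmentation)
tag_auc = {
    "AdaBoost": (0.7520, 0.7943),
    "Linear SVM": (0.7848, 0.7990),
    "Probability Ensemble": (0.7894, 0.8108),
    "Ranking Ensemble": (0.7997, 0.8189),
}

# Gain from segmentation for each method, in percent (absolute difference times 100)
seg_gain = {
    name: round((with_seg - without_seg) * 100, 2)
    for name, (without_seg, with_seg) in tag_auc.items()
}

# Best system (Ranking Ensemble with segmentation) against AdaBoost without segmentation
best_vs_adaboost = round(
    (tag_auc["Ranking Ensemble"][1] - tag_auc["AdaBoost"][0]) * 100, 2
)
```

The recomputed values match the five callouts, which supports reading them as absolute rather than relative improvements.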
Results of the Audio Annotation Task
Mean ± standard deviation
                       Clip AUC-ROC                      Tag Accuracy
                       Without Seg.     With Seg.        Without Seg.     With Seg.
AdaBoost               0.8627±0.0009    0.8774±0.0009    0.9162±0.0004    0.9184±0.0004
Linear SVM             0.8788±0.0009    0.8828±0.0012    0.9191±0.0004    0.9200±0.0003
Probability Ensemble   0.8788±0.0007    0.8848±0.0007    0.9191±0.0002    0.9201±0.0003
Ranking Ensemble       0.7626±0.0012    0.7814±0.0010    0.9016±0.0004    0.9057±0.0003
The Probability Ensemble with segmentation is better than the Ranking Ensemble with segmentation by 10.34% in Clip AUC-ROC
Conclusion
This paper has presented our methods for audio tag annotation and retrieval
Major contributions:
• Use a novelty curve to divide audio clips into homogeneous segments
• Exploit two classifier ensembles: ranking ensemble and probability ensemble
The ranking ensemble performs very well in the MIREX 2009 audio tag classification task in terms of audio retrieval metrics
• But it does not perform well in terms of audio annotation metrics
The probability ensemble method performs very well in terms of audio annotation metrics