(1)

Homogeneous Segmentation and Classifier Ensemble for Audio Tag Annotation and Retrieval

Hung-Yi Lo, Ju-Chiang Wang, and Hsin-Min Wang

July 20, 2010

Spoken Language Processing Group

Natural Language and Knowledge Processing Lab.

Institute of Information Science, Academia Sinica, Taiwan

http://sovideo.iis.sinica.edu.tw/SLG

(2)

Social Tagging to Music

(3)

Audio Tag Annotation and Retrieval

Annotating audio clips with tags

[Figure: an audio clip is annotated using one predictor for each tag, giving the scores of the tag predictors (Female, R&B, Guitar, Metal, Bass).]

Retrieving audio clips using a tag query

[Figure: given a query (Rock), the audio clips are ranked based on the scores of the Rock predictor, giving a ranking list for the query from high relevance to low relevance.]

(4)

Our Contributions

1. Dividing the audio signal into homogeneous segments using an audio novelty curve

2. Each tag predictor is an ensemble classifier combining two classifiers: SVM and AdaBoost

• Ranking Ensemble for audio tag retrieval

• Probability Ensemble for audio tag annotation

Our ranking ensemble won the audio tagging competition of the 2009 Music Information Retrieval Evaluation eXchange (MIREX)

• In terms of tag F-measure and the area under the ROC curve given a tag (for audio retrieval)

(5)

Audio Segmentation

• Feature of the Matrix: 13-dim MFCC

• Kernel Type: Gaussian

• Kernel Size: 128 frames

• The prediction score on the whole clip is the average of the scores on each segment.
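
The segmentation step above can be sketched as follows. This is a minimal pure-Python illustration that assumes a Foote-style novelty curve: a self-similarity matrix of the frame features is correlated along its diagonal with a Gaussian-tapered checkerboard kernel, and peaks of the resulting curve are taken as segment boundaries. The cosine similarity, the taper width, the peak-picking rule, and all function names are assumptions made for illustration; the slides state only the feature (13-dim MFCC), the kernel type (Gaussian), and the kernel size (128 frames).

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def checkerboard_kernel(size):
    """Gaussian-tapered checkerboard kernel with even side length `size`."""
    half = size // 2
    sigma = half / 2.0  # assumed taper width, not given in the slides
    kernel = []
    for i in range(size):
        row = []
        for j in range(size):
            di, dj = i - half + 0.5, j - half + 0.5
            # +1 inside the two "same side" quadrants, -1 in the cross quadrants
            sign = 1.0 if (di < 0) == (dj < 0) else -1.0
            row.append(sign * math.exp(-(di * di + dj * dj) / (2 * sigma * sigma)))
        kernel.append(row)
    return kernel

def novelty_curve(frames, kernel_size):
    """curve[t] is the novelty of a boundary placed just before frame t."""
    n = len(frames)
    sim = [[cosine(frames[i], frames[j]) for j in range(n)] for i in range(n)]
    kernel = checkerboard_kernel(kernel_size)
    half = kernel_size // 2
    curve = [0.0] * n
    for t in range(half, n - half + 1):
        total = 0.0
        for i in range(kernel_size):
            for j in range(kernel_size):
                total += kernel[i][j] * sim[t - half + i][t - half + j]
        curve[t] = total
    return curve

def pick_boundaries(curve, threshold):
    """Local maxima of the novelty curve above an assumed threshold."""
    return [t for t in range(1, len(curve) - 1)
            if curve[t] > threshold
            and curve[t] > curve[t - 1] and curve[t] >= curve[t + 1]]

def split_segments(n_frames, boundaries):
    """Turn boundary positions into (start, end) frame ranges."""
    edges = [0] + list(boundaries) + [n_frames]
    return list(zip(edges[:-1], edges[1:]))

def clip_score(segment_scores):
    """Clip-level prediction score: the average of the segment scores."""
    return sum(segment_scores) / len(segment_scores)

# Toy input: 8 frames of one "timbre" followed by 8 frames of another.
frames = [[1.0, 0.0]] * 8 + [[0.0, 1.0]] * 8
boundaries = pick_boundaries(novelty_curve(frames, 4), threshold=1.0)
segments = split_segments(len(frames), boundaries)
```

In this toy input the only boundary found lies between the two blocks of frames. A real system would replace the two-dimensional toy vectors with MFCC frames and use the kernel size from the slide.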

(7)

Audio Feature Extraction Using MIRToolbox

Classes     Features

Dynamics    RMS

Rhythm      Peak and centroids of the fluctuation summary
            Tempo
            Attack slope and attack time of the onset

Timbre      Zero-crossing rate
            Spectral centroid, spread, skewness and kurtosis
            Brightness
            Rolloff with 95% threshold
            Rolloff with 85% threshold
            Spectral entropy and flatness
            Roughness
            Irregularity
            Inharmonicity
            MFCCs, delta-MFCCs, and delta-delta-MFCCs
            Low energy rate
            Spectral flux

Pitch       Pitch
            Chromagram and its centroids and highest peak

Tonality    Key clarity
            Key mode
            Harmonic change

(8)

Classification Methods and the Difficulties

The tag predictor is an ensemble that combines the outputs of two classifiers

• SVM: linear SVM implemented by the LIBLINEAR package

• AdaBoost: decision stump as the base learner

Two methods to merge the two prediction scores

1. Ranking Ensemble for the retrieval task

• The scales of the two classifiers’ prediction scores are rather different

2. Probability Ensemble for the annotation task

• The scores of different tag predictors are not comparable

[Figure: scores of the tag predictors for Female, R&B, Guitar, Metal, Bass.]
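
To make the AdaBoost component concrete, here is a minimal sketch of discrete AdaBoost with decision stumps as the base learner, written in plain Python. It is an illustration only: the exhaustive threshold search, the function names, and the toy data are assumptions, and it is not the implementation used by the authors. Labels are assumed to be +1 (tag applies) and -1 (tag does not apply); the real-valued sum returned by `adaboost_score` is the kind of prediction score that the two ensembles merge.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """A decision stump: threshold a single feature."""
    return polarity if row[feature] > threshold else -polarity

def fit_stump(X, y, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(w for row, label, w in zip(X, y, weights)
                          if stump_predict(row, feature, threshold, polarity) != label)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost_fit(X, y, n_rounds):
    """Return a list of (alpha, feature, threshold, polarity) weak learners."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        err, feature, threshold, polarity = fit_stump(X, y, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * label *
                                stump_predict(row, feature, threshold, polarity))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def adaboost_score(model, row):
    """Real-valued prediction score: weighted vote of the stumps."""
    return sum(alpha * stump_predict(row, feature, threshold, polarity)
               for alpha, feature, threshold, polarity in model)

# Toy one-dimensional data (hypothetical, for illustration only).
X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
```

The number of rounds (`n_rounds`) corresponds to the number of base learners, which is a parameter chosen by model selection.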

(9)

Ranking Ensemble

[Figure: worked example. For each audio clip, the prediction scores of AdaBoost and SVM are converted into their respective rankings, and the merged prediction is the average ranking.]
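
The ranking ensemble can be sketched in a few lines of Python. Each classifier's scores are turned into ranks (rank 1 for the highest score), and the merged prediction for a clip is the average of its two ranks, so the differing scales of the two classifiers no longer matter. The scores below are hypothetical values chosen for illustration, and the handling of ties (tied scores share the average of their ranks) is an assumption that the slides do not address.

```python
def ranks(scores):
    """Rank 1 = highest score; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def ranking_ensemble(scores_a, scores_b):
    """Average the two classifiers' ranks; a lower value means more relevant."""
    return [(ra + rb) / 2.0 for ra, rb in zip(ranks(scores_a), ranks(scores_b))]

# Hypothetical scores of three clips for one tag.
adaboost_scores = [1.9, -0.5, 0.3]
svm_scores = [7.1, 12.0, -2.3]
merged = ranking_ensemble(adaboost_scores, svm_scores)
```

Here the first clip is ranked 1 by AdaBoost and 2 by the SVM, so its merged rank is 1.5. Because the merged value depends on the other clips in the list, it orders clips for one tag but is not comparable across tags, which is the motivation for the probability ensemble in the annotation task.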

(10)

Probability Ensemble

In the audio annotation task, we need to compare the scores of all tag predictors

• The raw scores of different tag classifiers are not comparable

We transform the output scores of SVM and AdaBoost into probability scores with a sigmoid function:

\[ \Pr(y = 1 \mid \mathbf{x}) \approx \frac{1}{1 + \exp(Af + B)} \]

• f: the output score of a classifier

• A, B: parameters that can be learned by solving a regularized maximum likelihood problem
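
A minimal sketch of fitting the sigmoid parameters A and B is given below. Several choices here are assumptions, since the slides say only that a regularized maximum likelihood problem is solved: the regularization is done by smoothing the 0/1 targets as in Platt's method, the optimizer is plain gradient descent with a fixed step size, and the final merge averages the two calibrated probabilities. All function names are hypothetical.

```python
import math

def sigmoid_prob(f, A, B):
    """Pr(y = 1 | x) = 1 / (1 + exp(A*f + B)), computed without overflow."""
    z = A * f + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def fit_sigmoid(scores, labels, learning_rate=0.01, n_iter=5000):
    """Fit A and B by gradient descent on the smoothed negative log-likelihood."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Smoothed targets (assumed, Platt-style) act as the regularizer.
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    targets = [t_pos if y == 1 else t_neg for y in labels]
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        grad_A = grad_B = 0.0
        for f, t in zip(scores, targets):
            diff = t - sigmoid_prob(f, A, B)  # derivative of the loss w.r.t. A*f + B
            grad_A += diff * f
            grad_B += diff
        A -= learning_rate * grad_A
        B -= learning_rate * grad_B
    return A, B

def probability_ensemble(p_svm, p_adaboost):
    """Merge two calibrated probabilities; averaging is an assumption."""
    return (p_svm + p_adaboost) / 2.0

# Hypothetical classifier scores and binary labels (1 = tag applies).
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]
A, B = fit_sigmoid(scores, labels)
```

With this parameterization a classifier whose larger scores indicate the tag ends up with a negative A, so the probability increases with the score. Once each classifier's output is on a probability scale, the scores of different tag predictors can be compared for a given clip.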

(11)

Model Selection

MIREX evaluates submitted algorithms by 3-fold cross-validation

Inner cross-validation on the training set to determine the classifier parameters

• The cost parameter C in the linear SVM

• The number of base learners in AdaBoost

Re-train the classifiers with the complete training set and the selected parameters

Model selection criterion: AUC-ROC

• Since the class distributions for some tags are imbalanced

[Figure: the inner cross-validation is nested within the outer cross-validation.]
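
The model selection loop can be sketched as follows. This is a hedged illustration in plain Python: `auc_roc` computes the area under the ROC curve as the fraction of positive/negative pairs that are ordered correctly, and `select_parameter` runs an inner cross-validation over candidate parameter values. The callback `train_and_score`, the interleaved fold assignment, and the number of inner folds are assumptions; the slides do not specify them.

```python
def auc_roc(scores, labels):
    """AUC-ROC as the probability that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k):
    """Assign example i to fold i % k (assumed interleaved split)."""
    return [list(range(i, n, k)) for i in range(k)]

def select_parameter(candidates, X, y, train_and_score, k=3):
    """Return (best parameter, mean inner-CV AUC-ROC).

    train_and_score(param, X_train, y_train, X_val) is a hypothetical
    callback that trains a classifier and returns scores for X_val.
    """
    folds = k_fold_indices(len(X), k)
    best_param, best_auc = None, -1.0
    for param in candidates:
        aucs = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            val_scores = train_and_score(param,
                                         [X[i] for i in train],
                                         [y[i] for i in train],
                                         [X[i] for i in held_out])
            aucs.append(auc_roc(val_scores, [y[i] for i in held_out]))
        mean_auc = sum(aucs) / len(aucs)
        if mean_auc > best_auc:
            best_param, best_auc = param, mean_auc
    return best_param, best_auc

# Toy check with a stand-in "classifier" that just scales the input.
X = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
y = [1 if x > 0.5 else 0 for x in X]
toy_scorer = lambda param, X_train, y_train, X_val: [param * x for x in X_val]
best_param, best_auc = select_parameter([-1.0, 1.0], X, y, toy_scorer)
```

AUC-ROC depends only on how positives are ordered relative to negatives, so it is not dominated by the majority class the way accuracy is for imbalanced tags. After the parameter is selected, the classifier would be re-trained on the complete training set.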

(12)

MIREX 2009 Results on the MajorMiner Dataset

          Tag F-measure   Tag Accuracy   Tag AUC-ROC   Clip AUC-ROC
No Seg    0.289           0.900          0.782         0.751
Seg       0.311           0.903          0.807         0.774
BP1       0.277           0.868          0.742         0.871
BP2       0.290           0.859          0.761         0.861
CC1       0.209           0.912          0.762         0.882
CC2       0.241           0.905          0.791         0.882
CC3       0.170           0.913          0.721         0.854
CC4       0.263           0.890          0.749         0.854
GP        0.012           0.891
GT1       0.290           0.850          0.784         0.872

Audio Retrieval: Given a tag query, correct audio clips should be ranked higher

Audio Annotation: Given a clip, correct tags should have higher scores

(13)

MIREX 2009 Results on the Mood Dataset

          Tag F-measure   Tag Accuracy   Tag AUC-ROC   Clip AUC-ROC
No Seg    0.204           0.882          0.667         0.678
Seg       0.219           0.887          0.701         0.704
BP1       0.195           0.837          0.648         0.854
BP2       0.193           0.829          0.632         0.859
CC1       0.172           0.878          0.652         0.849
CC2       0.180           0.882          0.681         0.848
CC3       0.147           0.882          0.629         0.812
CC4       0.183           0.862          0.646         0.812
GP        0.084           0.863
GT1       0.211           0.823          0.649         0.860
GT2       0.209           0.824          0.655         0.861
HBC       0.063           0.909          0.664         0.861

(14)

Extended Experiments

We extensively evaluate the classifiers and the ensemble methods on the downloaded MajorMiner dataset

• MajorMiner is a web-based music labeling game: http://majorminer.org/

Our extended experiments basically follow the MIREX 2009 setup

• Use the same 45 tags and download all the audio clips that are associated with these tags

• The dataset might be slightly different from that used in MIREX 2009

• The resulting audio database contains 2,472 clips

Repeat cross-validation twenty times to reduce variance

Tags shown on the slide:

metal     instrumental   horns     piano   guitar
ambient   saxophone      house     loud    bass
fast      keyboard       vocal     noise   british
solo      electronica    beat      80s     dance
jazz      drum machine   strings   pop     r&b
female    distortion     voice     rap     male

(15)

Results of the Audio Retrieval Task

Mean ± standard deviation

                       Tag AUC-ROC                        Tag F-measure
                       Without Seg.     With Seg.         Without Seg.     With Seg.
AdaBoost               0.7520±0.0026    0.7943±0.0024     0.2856±0.0036    0.3034±0.0051
Linear SVM             0.7848±0.0029    0.7990±0.0030     0.3092±0.0028    0.3169±0.0038
Probability Ensemble   0.7894±0.0030    0.8108±0.0020     0.3163±0.0037    0.3296±0.0039
Ranking Ensemble       0.7997±0.0022    0.8189±0.0017     0.3211±0.0032    0.3332±0.0038

Absolute gain in Tag AUC-ROC from segmentation: AdaBoost 4.23%, Linear SVM 1.42%, Probability Ensemble 2.14%, Ranking Ensemble 1.92%

The Ranking Ensemble with segmentation is better than AdaBoost without segmentation by 6.69% in Tag AUC-ROC
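
The percentages annotated on this slide agree with the absolute differences between the mean Tag AUC-ROC values, expressed in percentage points. The short check below reproduces that arithmetic from the table values; the variable names are illustrative only.

```python
# Mean Tag AUC-ROC from the table: (without segmentation, with segmentation).
tag_auc = {
    "AdaBoost": (0.7520, 0.7943),
    "Linear SVM": (0.7848, 0.7990),
    "Probability Ensemble": (0.7894, 0.8108),
    "Ranking Ensemble": (0.7997, 0.8189),
}

# Gain from segmentation for each method, in percentage points.
gains = {name: round((with_seg - without_seg) * 100, 2)
         for name, (without_seg, with_seg) in tag_auc.items()}

# Best configuration versus the weakest baseline.
overall = round((tag_auc["Ranking Ensemble"][1] - tag_auc["AdaBoost"][0]) * 100, 2)
```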

(16)

Results of the Audio Annotation Task

Mean ± standard deviation

                       Clip AUC-ROC                       Tag Accuracy
                       Without Seg.     With Seg.         Without Seg.     With Seg.
AdaBoost               0.8627±0.0009    0.8774±0.0009     0.9162±0.0004    0.9184±0.0004
Linear SVM             0.8788±0.0009    0.8828±0.0012     0.9191±0.0004    0.9200±0.0003
Probability Ensemble   0.8788±0.0007    0.8848±0.0007     0.9191±0.0002    0.9201±0.0003
Ranking Ensemble       0.7626±0.0012    0.7814±0.0010     0.9016±0.0004    0.9057±0.0003

With segmentation, the Probability Ensemble is better than the Ranking Ensemble by 10.34% in Clip AUC-ROC

(17)

Conclusion

This paper has presented our methods for audio tag annotation and retrieval

Major contributions:

• Use a novelty curve to divide audio clips into homogeneous segments

• Exploit two classifier ensembles: ranking ensemble and probability ensemble

The ranking ensemble performs very well in the MIREX 2009 audio tag classification task in terms of audio retrieval metrics

• But it does not perform as well in terms of audio annotation metrics
