
CHAPTER 2 A NEW APPROACH FOR CLASSIFICATION OF GENERIC AUDIO

2.4. Summary

In this chapter, we have presented a new method for the automatic classification of generic audio data. A classification accuracy higher than 96% was achieved.

Compared with previous work, two important and distinguishing features of the proposed scheme are its low complexity and short running time. Although the proposed scheme covers a wide range of audio types, its complexity is low because the audio features are easy to compute, which makes online processing possible.

Besides the general audio types such as music and speech tested in existing work, we have taken hybrid-type sounds (speech with music background, speech with environmental noise background, and song) into account. While existing approaches for audio content analysis are normally developed for specific scenarios, the proposed method is generic and model free. Thus, our method can be widely applied to many applications.

CHAPTER 3

A NEW APPROACH FOR AUDIO CLASSIFICATION AND SEGMENTATION USING GABOR WAVELETS AND FISHER LINEAR DISCRIMINATOR

3.1. INTRODUCTION

In recent years, audio, as an important and integral part of many multimedia applications, has gained more and more attention. The rapid increase in the amount of audio data demands an efficient method to automatically segment or classify audio streams based on their content. Many studies on audio content analysis [1-14] have been proposed.

A speech/music discriminator based on thirteen features, including cepstral coefficients, was presented in [3], where four multidimensional classification frameworks are compared to achieve better performance. The approach presented by Saunders [5] uses a simple feature space: it exploits the lopsidedness of the distribution of the zero-crossing rate, which shows a marked rise for speech signals that is not common for music signals. In general, it is not hard to reach a relatively high level of discrimination accuracy for speech and music, since they have quite different properties in both the time and frequency domains.

Besides speech and music, many applications require other kinds of sounds to be taken into consideration. The classifier proposed by Wyse and Smoliar [7] classifies audio signals into "music," "speech," and "others"; it was developed for the parsing of news stories. In [8], audio signals are classified into speech, silence, laughter, and non-speech sounds for the purpose of segmenting discussion recordings of meetings. However, the accuracy of the segmentation resulting from this method varies considerably for different types of recording. Besides the commonly studied audio types such as speech and music, the research in [12-14] has taken into account hybrid-type sounds, e.g., speech with a music background and the singing of a person, which contain more than one basic audio type and usually appear in documentaries or commercials. In [12], 143 features are first studied for their discrimination capability; then cepstral-based features such as Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), etc., are selected to classify audio signals. Zhang and Kuo [14] extracted audio features, including the short-time fundamental frequency and spectral tracks, by detecting peaks in the spectrum. The spectrum is generated from autoregressive (AR) model coefficients, which are estimated from the autocorrelation of the audio signals. A rule-based procedure, which relies on many threshold values, is then applied to classify audio signals into speech, music, song, speech with music background, etc. An accuracy above 90% is reported. However, this method is complex and time-consuming due to the computation of the autocorrelation function. Moreover, the thresholds used in this approach are empirical and become inappropriate when the source of the audio signals changes.

In this chapter, we provide two classifiers: one for speech and music (called two-way), and the other for five classes (called five-way), namely pure speech, music, song, speech with music background, and speech with environmental noise background. Based on the classification results, we also propose a merging algorithm to divide an audio stream into segments of different classes.

One basic issue for content-based classification of audio sounds is feature selection. The selected features should represent the most significant properties of audio sounds, be robust under various circumstances, and be general enough to describe various sound classes. This issue is addressed in the proposed method as follows: first, some perceptual features based on Gabor wavelet filters [15-16] are extracted as initial features; then the Fisher Linear Discriminator (FLD) [17] is applied to these initial features to find the features with the highest discriminative ability.

Note that FLD is a tool for multigroup data classification and dimensionality reduction. It maximizes the ratio of between-class variance to within-class variance in any particular data set to guarantee maximal separability. Experimental results show that the proposed method can achieve a discrimination accuracy of over 98% for the two-way speech/music discriminator and more than 95% for the five-way classifier, which uses the same database as the two-way discriminator. Based on the classification results, we can also identify scene breaks in an audio sequence quite accurately; experimental results show that our method can detect more than 95% of audio type changes. These results demonstrate the capability of the proposed audio features for characterizing the perceptual content of an audio sequence.

The rest of the chapter is organized as follows. In Section 3.2, the proposed method is described in detail. Experimental results and discussion are presented in Section 3.3. Finally, Section 3.4 gives a summary.

3.2. THE PROPOSED METHOD

The block diagram of the proposed method is shown in Fig. 3.1. It is based on the spectrogram and consists of five phases: time-frequency distribution (TFD) generation, initial feature extraction, feature selection, classification, and segmentation.

First, the input audio is transformed into a spectrogram, I(x,y), as described in the multi-resolution short-time Fourier transform section (Chapter 1, Section 1.3.2). Second, for each one-second clip, a set of Gabor wavelet filters is applied to the resulting spectrogram to extract a set of initial features. Third, based on the extracted initial features, the Fisher Linear Discriminator (FLD) is used to select the features with the best discriminative ability and to reduce the feature dimension. Fourth, based on the selected features, a classification method is applied to classify each clip. Finally, based on the classified clips, a segmentation technique is presented to identify scene breaks in each audio stream. In what follows, we describe the details of the proposed method.


Fig. 3.1. Block diagram of the proposed method, where “MB” and “NB” are the abbreviations for “music background” and “noise background”, respectively.

3.2.1 Initial Feature Extraction

Generally speaking, the spectrogram is a good representation for audio since it is often visually interpretable. By observing a spectrogram, we can find that the energy is not uniformly distributed but tends to cluster into certain patterns (see Figs. 3.2 (a) and 3.2 (b)). All curve-like patterns are called tracks [31]. Fig. 3.2 (a) shows that for a music signal, some line tracks corresponding to tones appear in its spectrogram. Fig. 3.2 (b) shows some other patterns in a song spectrogram, including clicks (broadband, short time), noise bursts (energy spread over both time and frequency), and frequency sweeps.

Fig. 3.2. Two examples showing some possible kinds of patterns in a spectrogram. (a) Line tracks corresponding to tones in a music spectrogram. (b) Clicks, noise bursts, and frequency sweeps in a song spectrogram.

Thus, if we can extract features from a spectrogram that represent these patterns, classification should be easy. Smith and Serra [32] proposed a method to extract tracks from an STFT spectrogram; once the tracks are extracted, each track is classified. However, tracks are not well suited for describing some kinds of patterns such as clicks, noise bursts, and so on. To treat all kinds of patterns, a richer representation is required. In fact, these patterns contain various orientations and spatial scales. For example, each pattern formed by lines (see Fig. 3.2 (a)) has a particular line direction (corresponding to orientation) and a particular width between two adjacent lines (corresponding to spatial scale); each pattern formed by curves (see Fig. 3.2 (b)) contains multiple line directions and a particular width between two neighboring curves. Since the Gabor wavelet transform provides an optimal way to extract such orientations and scales [27], in this chapter we use Gabor wavelet functions to extract initial features that represent these patterns. The details are described in the following subsections.

3.2.1.1 Gabor Wavelet Functions and Filters Design

Two-dimensional Gabor kernels are sinusoidally modulated Gaussian functions. Let g(x,y) be the Gabor kernel and G(u,v) its Fourier transform, defined as in [28].
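A commonly used form of the two-dimensional Gabor kernel and its Fourier transform, following Manjunath and Ma and assumed here to correspond to the definition cited from [28], is

g(x,y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\!\left[ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + 2\pi j W x \right],

G(u,v) = \exp\!\left\{ -\frac{1}{2}\left[ \frac{(u-W)^2}{\sigma_u^2} + \frac{v^2}{\sigma_v^2} \right] \right\}, \qquad \sigma_u = \frac{1}{2\pi\sigma_x}, \quad \sigma_v = \frac{1}{2\pi\sigma_y},

where W is the modulation (center) frequency of the kernel.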

Gabor wavelets are sets of Gabor kernels applied to different subbands with different orientations; they are obtained by appropriate dilations and rotations of g(x,y) through a set of generating functions [28]. In these functions, K is the number of orientations, S is the number of scales in the multi-resolution decomposition, and ω_h and ω_l are the highest and the lowest center frequencies, respectively. In this chapter, we set S = 7 and K = 6.
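In the standard Gabor wavelet formulation, which we assume matches the generating functions of [28], these are

g_{mn}(x,y) = a^{-m}\, g(x', y'), \qquad a > 1, \quad m = 0,\ldots,S-1, \quad n = 0,\ldots,K-1,

x' = a^{-m}(x\cos\theta + y\sin\theta), \qquad y' = a^{-m}(-x\sin\theta + y\cos\theta), \qquad \theta = \frac{n\pi}{K},

a = \left( \frac{\omega_h}{\omega_l} \right)^{\frac{1}{S-1}}.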

3.2.1.2 Feature Estimation and Representation

Each spectrogram I(x,y) is filtered with the Gabor kernels g_mn(x,y); the filtered output W_mn(x,y) is obtained by convolving I(x,y) with g*_mn(x,y), where * indicates the complex conjugate. The filtering process is executed with the FFT (fast Fourier transform), that is,

W_{mn}(x,y) = F^{-1}\{\, F\{g_{mn}(x,y)\} \, F\{I(x,y)\} \,\}. \qquad (3.8)

Since peripheral frequency analysis in the ear roughly follows a logarithmic axis, and in order to keep with this property, the entire frequency band [0, Fs/2] is divided into six subbands of unequal width: F1 = [0, Fs/64], F2 = [Fs/64, Fs/32], F3 = [Fs/32, Fs/16], F4 = [Fs/16, Fs/8], F5 = [Fs/8, Fs/4], and F6 = [Fs/4, Fs/2]. In our experiments, the high-frequency components above Fs/4 (i.e., the subband [Fs/4, Fs/2]) are discarded to avoid the influence of noise. Then, for each subband of interest F_i, the directional histogram H_i(m,n) is defined in terms of the pixel counts N_i(m,n),

where m = 0, \ldots, 6 and n = 0, \ldots, 5, and N_i(m,n) is the number of pixels in subband F_i associated with the (m,n)-th Gabor-filtered components. Recall that in our experiments we use seven scales (S = 7), six orientations (K = 6), and five subbands, which results in a 7 × 6 × 5 = 210-dimensional initial feature vector

f = [H_0(0,0), H_0(0,1), \ldots, H_4(6,5)]^T. \qquad (3.13)
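As a rough illustration of this feature-extraction stage, the following NumPy sketch filters a spectrogram with a small Gabor filter bank in the frequency domain, as in Eq. (3.8), and accumulates per-subband scale/orientation histograms. The kernel parameters, the dyadic scale factor, the choice of the dominant (scale, orientation) response per pixel, and the histogram normalization are illustrative assumptions, not the exact definitions used in our experiments.

import numpy as np

def gabor_kernel(shape, scale, orientation, n_orient=6, base_freq=0.4):
    """Hypothetical 2-D Gabor kernel sampled on a grid of the given shape."""
    h, w = shape
    y, x = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    theta = orientation * np.pi / n_orient
    a = 2.0 ** (-scale)                      # assumed dyadic scale factor
    xr = a * (x * np.cos(theta) + y * np.sin(theta))
    yr = a * (-x * np.sin(theta) + y * np.cos(theta))
    sigma = 4.0                              # assumed envelope width
    return a * np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * base_freq * xr)

def initial_features(spectrogram, n_scales=7, n_orient=6, n_subbands=5):
    """Per-subband directional histograms H_i(m, n) stacked into one vector."""
    F_spec = np.fft.fft2(spectrogram)
    responses = np.empty((n_scales, n_orient) + spectrogram.shape)
    for m in range(n_scales):
        for n in range(n_orient):
            g = gabor_kernel(spectrogram.shape, m, n, n_orient)
            # Eq. (3.8): filtering carried out in the frequency domain.
            responses[m, n] = np.abs(np.fft.ifft2(np.fft.fft2(g) * F_spec))
    # Dominant (scale, orientation) index for each time-frequency pixel (an assumption).
    dominant = responses.reshape(n_scales * n_orient, -1).argmax(axis=0).reshape(spectrogram.shape)
    # Logarithmic subbands along the frequency axis (axis 0); top octave is discarded.
    n_bins = spectrogram.shape[0]
    edges = [0] + [n_bins // d for d in (32, 16, 8, 4, 2)]
    feats = []
    for i in range(n_subbands):
        sub = dominant[edges[i]:edges[i + 1], :]
        counts = np.bincount(sub.ravel(), minlength=n_scales * n_orient).astype(float)
        feats.append(counts / max(counts.sum(), 1.0))   # assumed normalization of N_i(m, n)
    return np.concatenate(feats)                         # 7 x 6 x 5 = 210 dimensions

# Example: a random matrix stands in for the spectrogram I(x, y) of a one-second clip.
f = initial_features(np.abs(np.random.randn(128, 100)))
print(f.shape)   # (210,)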

3.2.1.3 Feature Selection and Audio Classification

The initial features are not used directly for classification, since some of them give poor separability among different classes, and including these features would lower the classification performance. In addition, some features are highly correlated, so redundancy is introduced. To remove these disadvantages, in this chapter the Fisher Linear Discriminator (FLD) is applied to the initial features to find uncorrelated features with the highest separability. Before describing FLD, two matrices, the between-class scatter matrix and the within-class scatter matrix, are first introduced. The within-class scatter matrix measures the amount of scatter between items in the same class, and the between-class scatter matrix measures the amount of scatter between classes.

The within-class scatter matrix S_w is computed from the scatter of the samples of each class about their class mean, and the between-class scatter matrix S_b is computed from the scatter of the class means about the overall mean [17]. The optimal projection V_opt maximizes the ratio of the projected between-class scatter to the projected within-class scatter. In our experiments, V_opt is used, and a one-second audio clip is taken as the basic classification unit.
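For completeness, a standard FLD formulation of these quantities, which we assume corresponds to the definitions intended here, is

S_w = \sum_{i=1}^{C} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^T, \qquad S_b = \sum_{i=1}^{C} N_i\, (\mu_i - \mu)(\mu_i - \mu)^T,

V_{opt} = \arg\max_{V} \frac{\left| V^T S_b V \right|}{\left| V^T S_w V \right|},

where C is the number of classes, X_i is the set of samples of class i with N_i members and mean \mu_i, and \mu is the overall mean. The columns of V_opt are the generalized eigenvectors of S_b v = \lambda S_w v corresponding to the C-1 largest eigenvalues.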

Based on V_opt, the initial feature vector of each one-second audio clip in the training and testing data is projected onto the space generated by V_opt to obtain a new feature vector f' of dimension C-1; f' is then used to represent the audio clip. Before classification, it is important to choose a good similarity measure. In our experiments, the Euclidean distance worked better than other measures (e.g., the Mahalanobis distance, covariance, etc.). For each test sample x_j with feature vector f'_j, the Euclidean distance between the test sample and the class center of each class in the space generated by V_opt is evaluated, and the sample is assigned to the class with the minimum distance. That is, x_j is assigned to the class whose center is nearest to f'_j.
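As an illustration of this projection and nearest-class-center rule, a minimal NumPy sketch follows; the class interface (fit/predict), the small ridge term used to keep S_w invertible, and the toy data are assumptions for illustration rather than a reproduction of our actual implementation.

import numpy as np
from scipy.linalg import eigh

class FLDNearestCenter:
    """Fisher Linear Discriminator projection followed by nearest-class-center assignment."""

    def fit(self, X, y, reg=1e-6):
        classes = np.unique(y)
        mu = X.mean(axis=0)
        d = X.shape[1]
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for c in classes:
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            Sw += (Xc - mu_c).T @ (Xc - mu_c)
            Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        # Generalized eigenproblem S_b v = lambda S_w v; the ridge keeps S_w positive definite.
        vals, vecs = eigh(Sb, Sw + reg * np.eye(d))
        self.V = vecs[:, np.argsort(vals)[::-1][:len(classes) - 1]]   # C-1 directions
        self.centers = {c: (X[y == c] @ self.V).mean(axis=0) for c in classes}
        return self

    def predict(self, X):
        Z = X @ self.V
        labels = list(self.centers)
        dists = np.stack([np.linalg.norm(Z - self.centers[c], axis=1) for c in labels], axis=1)
        return np.array(labels)[dists.argmin(axis=1)]   # minimum Euclidean distance

# Toy usage with random 210-dimensional "clip" features and two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 210)), rng.normal(0.5, 1.0, (50, 210))])
y = np.array([0] * 50 + [1] * 50)
model = FLDNearestCenter().fit(X, y)
print(model.predict(X[:5]))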

Fig. 3.3 shows an example of the two-way speech/music discriminator. In the figure, "x" stands for the projected result of a music signal and "o" stands for the projected result of a speech signal. From this figure, we can see that through FLD, music and speech samples can be easily separated. Fig. 3.4 outlines the process of feature selection and classification.

Two problems arise when using the Fisher discriminator. First, the matrices needed for the computation are very large. Second, since we may have fewer training samples than features per sample, the data matrix is rank deficient. To avoid these problems, the eigenvectors and eigenvalues of a rank-deficient matrix can be obtained with a generalized singular value decomposition routine. One simple and fast solution [33] is adopted in this chapter.
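For illustration, one common workaround (not necessarily the speedup technique of [33]) is to project the data onto its leading principal components before applying FLD, so that the within-class scatter becomes full rank. A minimal sketch, assuming NumPy and a feature matrix X with one row per training sample, is:

import numpy as np

def pca_reduce(X, n_components):
    """Return a (d x n_components) projection matrix onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    # Economy-size SVD; rows of Vt are principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components].T

# Usage (hypothetical): P = pca_reduce(X_train, n_train - n_classes); then fit FLD on X_train @ P.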

3.2.1.4 Segmentation

The segmentation step divides an audio sequence into semantic scenes, called "audio scenes," and indexes them as different audio classes. Since the classification stage produces some errors, a reassignment algorithm is first applied to rectify them.

For example, if we detect a pattern like speech-music-speech and the music subpattern lasts a very short time, we can conclude that the music subpattern should actually be speech. First, for each one-second audio clip, a similarity measure between the audio clip and the center of its class is computed in the feature space. If the similarity measure is less than 0.9, the clip is marked as ambiguous. Note that ambiguous clips often arise in transition periods; for example, if a transition happens when speech stops and music starts, each clip in the transition will contain both speech and music information. Then, each ambiguous clip is reassigned to the class of the nearest unambiguous clip. After the reassignment is completed, all neighboring clips with the same class are merged into a segment.

Fig. 3.4. A block diagram of feature selection and classification.

Finally, the length of each audio segment is evaluated. If the length is shorter than a threshold T (T = 3 seconds), each clip in the segment is reassigned to the class of whichever of its two neighboring audio segments has the smaller Euclidean distance between the clip and that segment's class center.
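A minimal sketch of this reassignment and merging procedure is given below, assuming per-clip labels, projected feature vectors, and class centers are available. The cosine similarity used to flag ambiguous clips and the helper names are illustrative assumptions, not the exact measure used in our experiments.

import numpy as np

def segment(labels, feats, centers, sim_threshold=0.9, min_len=3):
    """Reassign ambiguous one-second clips, merge clips into segments, dissolve short segments.

    labels  : per-clip class labels from the classifier
    feats   : per-clip projected feature vectors f'
    centers : dict mapping class -> class center in the projected space
    """
    labels = list(labels)
    n = len(labels)
    sim = [float(np.dot(f, centers[c]) / (np.linalg.norm(f) * np.linalg.norm(centers[c]) + 1e-12))
           for f, c in zip(feats, labels)]
    unambiguous = [i for i in range(n) if sim[i] >= sim_threshold]
    # Step 1: each ambiguous clip takes the class of the nearest (in time) unambiguous clip.
    for i in range(n):
        if sim[i] < sim_threshold and unambiguous:
            labels[i] = labels[min(unambiguous, key=lambda j: abs(j - i))]
    # Step 2: merge neighbouring clips with the same class into (start, end, class) segments.
    segs, start = [], 0
    for i in range(1, n + 1):
        if i == n or labels[i] != labels[start]:
            segs.append([start, i, labels[start]])
            start = i
    # Step 3: segments shorter than min_len seconds are absorbed into the closer neighbouring class.
    for k, (s, e, c) in enumerate(segs):
        if e - s < min_len:
            neighbours = [segs[k - 1][2]] if k > 0 else []
            neighbours += [segs[k + 1][2]] if k + 1 < len(segs) else []
            for i in range(s, e):
                if neighbours:
                    labels[i] = min(neighbours, key=lambda c2: np.linalg.norm(feats[i] - centers[c2]))
    return labels

# Toy usage: a one-clip "music" blip inside a speech run is reassigned to speech.
centers = {0: np.array([1.0]), 1: np.array([5.0])}
feats = [np.array([v]) for v in [1.1, 0.9, 1.0, 4.9, 1.0, 1.1, 0.9]]
print(segment([0, 0, 0, 1, 0, 0, 0], feats, centers))   # -> all clips labelled 0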

3.3. EXPERIMENTAL RESULTS

For comparison purposes, we have collected a set of 700 generic audio pieces (with durations from several seconds to no more than one minute) of different types of sound, according to the collection rule described in [14], as the testing database. Care was taken to obtain a wide variation in each category, and some of the clips are taken from the MPEG-7 content set [23]. The database contains 100 pieces of classical music played with various instruments, 100 other music pieces of different styles (jazz, blues, light music, etc.), 200 pieces of pure speech in different languages (English, Chinese, Japanese, etc.), 200 pieces of songs sung by male, female, or child singers, 50 pieces of speech with background music (e.g., commercials, documentaries), and 50 pieces of speech with environmental noise (e.g., sports broadcasts, news interviews). These shorter audio clips are stored as 16-bit samples at a 44.1 kHz sampling rate in the WAV file format and are used to test the audio classification performance. Note that we take a one-second audio signal as a test unit.

We also collected a set of 15 longer audio pieces recorded from movies, radio, or video programs. These pieces last from several minutes to an hour and contain various types of audio. They are used to test the performance of audio segmentation.

3.3.1 Audio Classification Results

To examine robustness to a variety of audio sources and the accuracy of audio classification, we present two experiments: one is two-way discrimination and the other is five-way discrimination. In the two-way discrimination, we classify the audio set into two categories: music and speech.

In the five-way discrimination, the audio set is classified into five categories: pure speech, pure music, song, speech with music background, and speech with environmental noise background.

Tables 3.1 and 3.2 show the classification results. From these tables, we can see that the proposed classification approach for generic audio data achieves an accuracy rate of over 98% for speech/music discrimination and more than 95% for the five-way classification. Both classifiers use the same testing database. It is worth mentioning that training is done using 50% of randomly selected samples in each audio type, and testing is carried out on the remaining 50%. By changing the training set several times and evaluating the classification rates, we find that the performance is stable and independent of the particular training and test sets. The experiments are carried out on a Pentium II 400 PC running Windows 2000, and processing takes less than one-eleventh of the time required to play the audio clip.
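The repeated random-split protocol can be sketched as follows; the fit_predict callback is a hypothetical wrapper around whichever classifier is being evaluated (for example, the FLD sketch in Section 3.2.1.3), and the stratified 50/50 split mirrors the setup described above.

import numpy as np

def repeated_split_accuracy(X, y, fit_predict, n_runs=5, train_frac=0.5, seed=0):
    """Stratified repeated random 50/50 splits used to check that the accuracy is stable."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        train_idx, test_idx = [], []
        for c in np.unique(y):
            idx = rng.permutation(np.where(y == c)[0])
            cut = int(train_frac * len(idx))
            train_idx.extend(idx[:cut])
            test_idx.extend(idx[cut:])
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(pred == y[test_idx]))
    return np.mean(accs), np.std(accs)

# Hypothetical usage with the FLD sketch:
#   acc, std = repeated_split_accuracy(X, y, lambda Xtr, ytr, Xte: FLDNearestCenter().fit(Xtr, ytr).predict(Xte))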

In our experiments, there are several misclassifications. From Table 3.2, we can see that most errors occur in the speech-with-music-background category. This is because the music or speech component is weak. For comparison, we also cite the efficiency of the existing system described in [14], which also covers the five audio classes considered in our method and uses a database similar to ours. The authors of [14] report that less than one eighth of the time required to play an audio clip is needed to process it, and that their accuracy rates are above 90%.

TABLE 3.1
TWO-WAY CLASSIFICATION RESULTS

Audio Type    Number    Correct Rate
Speech        300       98.17%
Music         400       98.79%

TABLE 3.2
FIVE-WAY CLASSIFICATION RESULTS

Audio Type        Number   Pure Music   Song     Pure Speech   Speech with MB   Speech with NB
Pure Music        200      94.67%       3.21%    1.05%         1.07%            0%
Song              200      0.8%         96.43%   0%            1.97%            0.8%
Pure Speech       200      0%           0.14%    98.40%        0.11%            1.35%
Speech with MB    50       1.01%        4.2%     3.10%         89.62%           2.07%
Speech with NB    50       0.15%        0.71%    1.28%         0.63%            97.23%

3.3.2 Audio Segmentation Results

We tested our segmentation procedure with audio pieces recorded from radio, movies, and video programs. We made a demonstration program for online audio segmentation and indexing, as shown in Fig. 3.5. Fig. 3.5 (a) shows the classification result for a 66-second audio piece recorded from the MPEG-7 data set CD19, which is a Spanish cartoon video called "Don Quijote de la Mancha." Fig. 3.5 (b) shows the result of applying the segmentation method to Fig. 3.5 (a). Besides the above example, we also performed experiments on other audio pieces.

Listed in Table 3.3 are the results of audio segmentation, where the miss rate and the over rate are defined as the ratio between the number of missed segments and the actual number of segments, and the ratio between the number of over-segmented ones and the actual number of segments in the audio streams, respectively. In addition, the error rate is defined as the ratio between the number of segments indexed in error and the actual number of segments in the audio stream.
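Written out as formulas, these definitions are

\text{miss rate} = \frac{N_{miss}}{N_{actual}}, \qquad \text{over rate} = \frac{N_{over}}{N_{actual}}, \qquad \text{error rate} = \frac{N_{error}}{N_{actual}},

where N_{actual} is the actual number of segments in the audio stream, N_{miss} and N_{over} are the numbers of missed and over-segmented segments, and N_{error} is the number of segments indexed in error.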

The first column shows the segmentation result without applying the reassignment process to the classification result, and the second column shows the segmentation result using the reassignment process. The experiments have shown that the proposed scheme achieves satisfactory segmentation and indexing. Using human judgement as the ground truth, our method can detect more than 95% of audio type changes.

Fig. 3.5. Demonstration of audio segmentation and indexing, where "SMB" and "SNB" are the abbreviations for "speech with music background" and "speech with noise background," respectively. (a) Original classification result. (b) Final result after applying the segmentation algorithm to (a).

TABLE 3.3
SEGMENTATION RESULTS

             Without Using Reassignment   Using Reassignment
Miss-Rate    0%                           1.1%
Over-Rate    5.2%                         1.8%
Error-Rate   2.5%                         1.3%

3.4. SUMMARY

In this chapter, we have presented a new method for the automatic classification and segmentation of generic audio data. A classification accuracy higher than 95% was achieved. The proposed scheme can handle a wide range of audio types. Furthermore, its complexity is low because the audio features are easy to compute, which makes online processing possible. The experimental results indicate that the extracted audio features are quite robust.

Besides the general audio types such as music and speech tested in existing work, we have taken into account other types of sounds, including hybrid-type sounds (e.g., speech with music background, speech with environmental noise background, and song). While existing approaches for audio content analysis are normally developed for specific scenarios, the proposed method is generic and model free, and can therefore be applied to a wide range of applications.
