

For the resulting time-frequency representation I(x,y), frequency resolution is coarse and temporal resolution is fine at high frequencies, while temporal resolution is coarse and frequency resolution is fine at low frequencies. This means that I(x,y) matches the characteristics of the human psycho-acoustic system.

1.4 SYNOPSIS OF THE DISSERTATION

The rest of the dissertation is organized as follows. Chapter 2 describes the proposed hierarchical audio classification method. The non-hierarchical audio classification and segmentation method based on Gabor wavelets is proposed in Chapter 3. The proposed method of audio retrieval based on Gabor wavelets is described in Chapter 4. Some conclusions and future research directions are drawn in Chapter 5.

CHAPTER 2

A NEW APPROACH FOR CLASSIFICATION OF GENERIC AUDIO DATA

2.1. INTRODUCTION

Audio classification [1-14] has many applications in professional media production, audio archive management, commercial music usage, content-based audio/video retrieval, and so on. Several audio classification schemes have been proposed. These methods tend to divide audio signals roughly into two major distinct categories: speech and music. Scheirer and Slaney [3] provided such a discriminator.

Based on thirteen features, including cepstral coefficients, four multidimensional classification frameworks are compared to achieve better performance. The approach presented by Saunders [5] uses a simple feature space and performs the discrimination by exploiting the distribution of the zero-crossing rate. In general, speech and music have quite different properties in both the time and frequency domains, so it is not hard to reach a relatively high level of discrimination accuracy. However, two-type classification of audio data is not enough for many applications, such as content-based video retrieval [11]. Recently, video retrieval has become an important research topic. To raise retrieval speed and precision, a video is usually segmented into several scenes [11,14].

In general, neighboring scenes will have different types of audio data. Thus, if we can develop a method to classify audio data, the classified results can be used to assist scene segmentation. Different kinds of videos contain different types of audio data. For example, in documentaries, commercials, or news reports, we can usually find the following audio types: speech, music, speech with musical or environmental noise background, and song.

Wyse and Smoliar [7] presented a method to classify audio signals into “music,” “speech,” and “others.” The method was developed for the parsing of news stories. In [8], audio signals are classified into speech, silence, laughter, and non-speech sounds for the purpose of segmenting discussion recordings in meetings. The above-mentioned approaches were developed for specific scenarios, so only some special audio types are considered. The research in [12-14] has taken more general types of audio data into account. In [12], 143 features are first studied for their discrimination capability. Then, cepstral-based features such as Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), etc., are selected to classify audio signals. The authors concluded that in many cases the selection of features is actually more critical to the classification performance. An accuracy rate of more than 90% is reported. Zhang and Kuo [14] first extracted audio features, including the short-time fundamental frequency and the spectral tracks, by detecting the peaks of the spectrum. The spectrum is generated from autoregressive (AR) model coefficients, which are estimated from the autocorrelation of the audio signal. Then a rule-based procedure, which uses many threshold values, is applied to classify audio signals into speech, music, song, speech with music background, etc. An accuracy rate of more than 90% is also reported. However, the method is time-consuming due to the computation of the autocorrelation function. Moreover, many of the thresholds used in this approach are empirical and become improper when the source of the audio signals changes. To avoid these disadvantages, in this chapter we provide a method that uses only a few thresholds to classify audio data into five general categories: pure speech, music, song, speech with music background, and speech with environmental noise background.

These categories are the basic sets needed in the content analysis of audiovisual data.

The proposed method consists of three stages: feature extraction, coarse-level classification, and fine-level classification. Based on statistical analysis, four effective audio features are first extracted to ensure the feasibility of real-time processing. They are the energy distribution model, the variance and the third moment associated with the horizontal profile of the spectrogram, and the variance of the differences of temporal intervals. Then, coarse-level audio classification based on the first feature is conducted to divide audio signals into two categories: single-type and hybrid-type, i.e., sounds without and with background components, respectively. Finally, each category is further divided into finer subclasses through a Bayesian decision function [15]. The single-type sounds are classified into speech and music; the hybrid-type sounds are classified into speech with environmental noise background, speech with music background, and song. Experimental results show that the proposed method achieves an accuracy rate of more than 96% in audio classification.

The chapter is organized as follows. In Section 2.2, the proposed method will be described. Experimental results and discussion will be presented in Section 2.3.

Finally, a summary will be given in Section 2.4.

2.2. THE PROPOSED METHOD

The system diagram of the proposed audio classification method is shown in Fig. 2.1. It is based on the spectrogram and consists of three phases: feature extraction, coarse-level classification, and fine-level classification. First, an input audio clip is transformed into a spectrogram using the short-time Fourier transform described in Chapter 1, Section 1.3.1, and four effective audio features are extracted. Figs. 2.2(a) – 2.2(e) show five examples of the spectrograms of music, speech with music background, song, pure speech, and speech with environmental noise background, respectively. Then, based on the first feature, coarse-level audio classification is conducted to classify audio signals into two categories: single-type and hybrid-type. Finally, based on the remaining features, each category is further divided into finer subclasses. The single-type sounds are classified into pure speech and music. The hybrid-type sounds are classified into song, speech with environmental noise background, and speech with music background. In the following, the proposed method is described in detail.
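Since every later feature is computed from the spectrogram, the following Python sketch shows one way such a spectrogram might be constructed with a short-time Fourier transform. The frame length, hop size, windowing, and quantization of the dB values to the range 0–255 are illustrative assumptions, not the dissertation's exact settings.

import numpy as np

def spectrogram(signal, frame_len=1024, hop=512):
    """Build a magnitude spectrogram S[frequency_bin, frame] via a
    short-time Fourier transform, quantized to 0..255 so that the
    histogram h(i) of Section 2.2.1.1 can be formed directly.
    Frame length, hop size, and the 80 dB range are illustrative choices."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[t * hop:t * hop + frame_len] * window
                       for t in range(n_frames)], axis=1)
    mag = np.abs(np.fft.rfft(frames, axis=0))        # rows: frequency bins over [0, Fs/2]
    db = 20.0 * np.log10(mag + 1e-10)
    db = np.clip(db, db.max() - 80.0, db.max())      # keep an 80 dB dynamic range
    return np.round(255.0 * (db - db.min()) / (np.ptp(db) + 1e-12))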


Fig. 2.1. Block diagram of the proposed system, where “MB” and “NB” are the abbreviations for “music background” and “noise background”, respectively.


Fig. 2.2. Five spectrogram examples. (a) Music. (b) Speech with music background.

(c) Song. (d) Speech. (e) Speech with environmental noise background.

2.2.1. Feature Extraction Phase

Four kinds of audio features are used in the proposed method: the energy distribution model, the variance and the third moment associated with the horizontal profile of the spectrogram, and the variance of the differences of temporal intervals (which will be defined later). To obtain these features, the spectrogram of the audio signal is constructed first. Based on the spectrogram, these four features are extracted and described as follows.

2.2.1.1 The Energy Distribution Model

For the purpose of characterizing single-type and hybrid-type sounds, i.e., sounds without and with background components, respectively, the energy distribution model is proposed. The histogram of a spectrogram is also called the energy distribution of the corresponding audio signal. In our experiments, we found that two kinds of energy distribution models appear in audio signals: unimodal and bimodal (see Figs. 2.3 (a) and 2.3 (b)). In Fig. 2.3, the horizontal axis represents the spectrogram energy.

For a hybrid-type sound, the energy distribution model is bimodal; otherwise, it is unimodal. Thus, to discriminate single-type sounds from hybrid-type sounds, we only need to detect the type of the corresponding energy distribution model. To do this, for an audio signal, the histogram of its corresponding spectrogram, h(i), is established first. Then, the mean µ and the variance σ2 of h(i) are calculated. In general, if µ approaches the position of the highest peak of h(i), h(i) will be unimodal (see Fig. 2.3 (a)). On the other hand, for a bimodal distribution, if h(i) is divided into two parts at µ, each part will be unimodal (see Fig. 2.3 (b)). Thus, if we find a local peak in each part, these two peaks will not be close. Based on these phenomena, a model decision algorithm is provided and described as follows.

Fig. 2.3. Two examples of the energy distribution models. (a) Unimodal (the histogram of the energy distribution of Fig. 2.2 (a)). (b) Bimodal (the histogram of the energy distribution of Fig. 2.2 (c)).

Algorithm 2.1. Model Decision Algorithm

Input: The spectrogram S(τ, ω) of an audio signal.

Output: The model type, T, and two parameters, T1 and T2.

Step 1. Establish the histogram h(i), i = 0, ..., 255, of S(τ, ω).

Step 8. Set T = bimodal if the following two conditions are satisfied:

Condition 1:

End of Algorithm 2.1.

Through the model decision algorithm described above, the model type of an audio signal can be determined. Note that in addition to the model type, the algorithm also produces two parameters, T1 and T2, which will be used later.
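Only the outline of Algorithm 2.1 survives in this text, so the following Python sketch illustrates one plausible realization of the unimodal/bimodal decision described above: split the histogram at its mean, locate one peak in each half, and compare the two peak positions. The function name model_decision, the threshold peak_gap_ratio, and the choice of returning the two peak positions as T1 and T2 are assumptions made for illustration, not the dissertation's exact steps.

import numpy as np

def model_decision(spectrogram, peak_gap_ratio=0.25):
    """Sketch of the idea behind Algorithm 2.1: decide whether the energy
    distribution of a spectrogram is unimodal or bimodal.
    Returns (model_type, T1, T2); here T1 and T2 are taken to be the two
    detected peak positions, which is an assumption -- the dissertation
    does not spell out how they are defined."""
    # Step 1: quantize the spectrogram energy to 0..255 and build h(i).
    s = spectrogram.astype(np.float64)
    q = np.round(255.0 * (s - s.min()) / (np.ptp(s) + 1e-12)).astype(int)
    h = np.bincount(q.ravel(), minlength=256)

    # Mean of the energy distribution (weighted by histogram counts).
    bins = np.arange(256)
    mu = int(round(np.sum(bins * h) / h.sum()))
    mu = min(max(mu, 1), 254)                 # keep both halves non-empty

    # Find one local peak in each half of the histogram, split at mu.
    p_low = int(np.argmax(h[:mu + 1]))
    p_high = mu + 1 + int(np.argmax(h[mu + 1:]))

    # If the two peaks are well separated and both carry real mass,
    # call the distribution bimodal; otherwise unimodal.
    separated = (p_high - p_low) > peak_gap_ratio * 255
    both_significant = min(h[p_low], h[p_high]) > 0.1 * max(h[p_low], h[p_high])
    model_type = "bimodal" if (separated and both_significant) else "unimodal"
    return model_type, p_low, p_high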

2.2.1.2 The Horizontal Profile Analysis

In this section, we rely on two facts to discriminate audio clips with and without music components. One fact is that if an audio clip contains musical components, we can find many horizontal long-line-like tracks in its spectrogram (see Figs. 2.2 (a) – 2.2 (c)). The other fact is that if an audio clip does not contain musical components, most of the energy in the spectrogram of each frame concentrates on a certain frequency interval (see Figs. 2.2 (d) – 2.2 (e)). Based on these two facts, two novel features will be derived and used to distinguish music from speech.

To obtain these features, the horizontal profile of the audio spectrogram is constructed first. Note that the horizontal profile (see Figs. 2.4 (a) – 2.4 (e)) is defined as the projection of the spectrogram of the audio clip onto the vertical axis. Based on the first fact, for an audio clip with musical components, there will be many peaks in its horizontal profile (see Figs. 2.4 (a) – 2.4 (c)), and the location difference between two adjacent peaks is small and nearly constant. On the other hand, based on the second fact, for an audio clip without musical components, only a few peaks can be found in its horizontal profile (see Figs. 2.4 (d) – 2.4 (e)), and the location difference between any two successive peaks is larger and more variable. Based on the above description, for an audio clip, all peaks, Pi, in its horizontal profile are first extracted, and the location difference, dPi, between any two successive peaks is evaluated. Note that in order to avoid the influence of noise at high frequencies, the frequency components above Fs/4 are discarded, where Fs is the sampling rate.

Then the variance, vdPi, and the third moment, mdPi, of the dPi values are taken as the second and third features and are used to discriminate audio clips with and without music components. Note that the variance and the third moment represent, respectively, the spread and the skewness of the location differences between successive peaks in the horizontal profile. For an audio clip with musical components, the variance and the third moment will be small; for an audio clip without musical components, these two features will be larger.
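A rough Python sketch of the second and third features is given below. The exact peak detector used in the dissertation is not specified, so the simple local-maximum rule here is an assumption.

import numpy as np

def horizontal_profile_features(spectrogram):
    """Sketch of vdPi and mdPi: the variance and third central moment of the
    gaps between successive peaks of the horizontal profile.
    The peak-picking rule (local maximum above the profile mean) is an
    assumption, not the dissertation's exact detector."""
    n_freq = spectrogram.shape[0]                 # rows = frequency bins over [0, Fs/2]
    keep = n_freq // 2                            # discard components above Fs/4
    profile = spectrogram[:keep, :].sum(axis=1)   # projection onto the frequency axis

    # Locate peaks P_i: local maxima above the mean of the profile.
    is_peak = (profile[1:-1] > profile[:-2]) & (profile[1:-1] > profile[2:]) \
              & (profile[1:-1] > profile.mean())
    peaks = np.where(is_peak)[0] + 1

    # Location differences dP_i between successive peaks.
    dP = np.diff(peaks).astype(np.float64)
    if dP.size == 0:
        return 0.0, 0.0
    v_dP = dP.var()                               # spread of the gaps
    m_dP = np.mean((dP - dP.mean()) ** 3)         # third central moment (skewness-like)
    return v_dP, m_dP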


Fig. 2.4. Five examples of the horizontal profiles. (a) – (e) are the horizontal profiles of Figs. 2.2(a) – 2.2(e), respectively.

2.2.1.3 The Temporal Intervals

Up to now, we have provided three features. By processing audio signals with these features, all audio signals can be classified successfully except those in the simultaneous speech-and-music category, which contains two kinds of signals: speech with music background and song. To discriminate these two, a new feature is provided.

One important characteristic to distinguish them is the duration of the music-voice.

The duration of music-voice is defined as the duration during which music and human voice appear simultaneously. That is, two successive music-voice durations are separated by a duration of pure music. For speech with music background, in order to emphasize the message of the talker, the energy contribution of the voice is greater than that of the music. Because such a clip is strongly speech-like, the difference between any two adjacent music-voice durations is variable (see Fig. 2.5 (c)). Conversely, a song is usually melodic and rhythmic, so the difference between any two adjacent music-voice durations is small and nearly constant (see Fig. 2.5 (a)).

By observing the spectrogram in different frequency bands, we can see that music-voice (i.e., segments where speech and music appear simultaneously) has more energy in the neighboring middle frequency bands, while music without voice possesses more energy in the lower frequency band. These phenomena are shown in Fig. 2.5.


Fig. 2.5. Two examples of the filtered spectrogram. (a) The spectrogram of song.

(b) The filtered spectrogram of (a). (c) The spectrogram of speech with music background. (d) The filtered spectrogram of (c).


Based on these phenomena, the duration of each continuous part of simultaneous speech and music in a sound is used to discriminate speech with music background from song. First, a novel feature associated with the temporal interval is derived. A temporal interval is defined as the duration of a continuous music-voice part of a sound. Note that the signal between two adjacent temporal intervals is music without human voice. Based on the energy distribution in different frequency bands described previously, an algorithm is proposed to determine the continuous music-voice parts in a sound. Note that some frequency noise usually exists in an audio clip; this noise contributes to the frequencies with lower energy in the spectrogram. In order to avoid its influence, a filtering procedure is applied in advance to remove those components with lower energy.

Filtering Procedure:

1) Filter out the higher frequency components with lower energy: for the spectrogram of each frame τ, S(τ, ω), find the highest frequency ωh such that S(τ, ωh) > T2, and set S(τ, ω) = 0 for all ω > ωh.

2) Filter the other components:

For ω < ωh, set S(τ, ω) = 0 if S(τ, ω) < T1.

Figs. 2.5 (b) and 2.5 (d) show the filtered spectrograms of Figs. 2.5 (a) and 2.5 (c), respectively. In what follows, we describe how to determine the temporal intervals.
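Assuming the spectrogram is stored as a NumPy array indexed as S[frequency_bin, frame] and that T1 and T2 are the parameters produced by Algorithm 2.1, a minimal sketch of the filtering procedure could look as follows; the handling of step 2 follows the reconstruction above and is therefore an assumption.

import numpy as np

def filter_spectrogram(S, T1, T2):
    """Rough sketch of the two-step filtering procedure.
    S is assumed to be indexed as S[frequency_bin, frame]; T1 and T2 are
    the thresholds produced by the model decision algorithm.  Step 2
    (thresholding with T1) follows the reconstruction in the text and is
    an assumption about the original procedure."""
    S = S.copy()
    n_freq, n_frames = S.shape
    for t in range(n_frames):
        frame = S[:, t]
        # Step 1: find the highest frequency whose energy exceeds T2 and
        # zero out everything above it.
        above = np.where(frame > T2)[0]
        omega_h = above[-1] if above.size else 0
        frame[omega_h + 1:] = 0.0
        # Step 2 (assumed): below omega_h, remove low-energy components.
        low = frame[:omega_h + 1] < T1
        frame[:omega_h + 1][low] = 0.0
    return S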

Note that an audio clip of the simultaneous speech-and-music category contains several temporal intervals and some short periods of background music, each of which separates two temporal intervals (see Fig. 2.5 (a)). To extract the temporal intervals, the entire frequency band [0, Fs/2] is first divided into two subbands of unequal width: [0, Fs/8] and [Fs/8, Fs/2]. Next, for each frame, the ratio of the non-zero part in each subband to the total non-zero part is evaluated; if the ratio is larger than 10%, the subband is marked. Based on the marked subbands, we can then extract the temporal intervals: consecutive frames whose higher subband (i.e., [Fs/8, Fs/2]) is marked are grouped together, and each such group is regarded as a part of music-voice (also called a raw temporal interval). That is, a raw temporal interval is a sequence of frames with higher energy in the higher subband.

Since the results obtained after the filtering procedure are usually sensitive to unvoiced speech and slight breathing, a re-merging process is then applied to the raw temporal intervals. During this process, two neighboring intervals are merged if the distance between them is less than a threshold. Fig. 2.6 shows an example of the re-merging process. Once this step is completed, we obtain a set of temporal intervals, and the duration difference between any two successive intervals is evaluated. Finally, the variance of these differences, vdt, is taken as the last feature.


Fig. 2.6. An example of the re-merging process. (a) Initial temporal intervals. (b) Result after the re-merging process.

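The following Python sketch ties together the subband marking, the grouping of marked frames into raw temporal intervals, the re-merging step, and the final feature vdt. The 10% ratio comes from the text, while the merging threshold merge_gap is an assumed value, since the dissertation only states that a threshold is used.

import numpy as np

def temporal_interval_feature(S_filtered, merge_gap=5):
    """Sketch of the last feature, v_dt.
    S_filtered is assumed to be indexed as S[frequency_bin, frame] and to
    cover [0, Fs/2]; merge_gap (in frames) is an assumed threshold."""
    n_freq, n_frames = S_filtered.shape
    split = n_freq // 4                       # first quarter of the bins covers [0, Fs/8]

    nonzero = S_filtered > 0
    total = nonzero.sum(axis=0).astype(np.float64)
    high = nonzero[split:, :].sum(axis=0)
    marked = np.zeros(n_frames, dtype=bool)
    valid = total > 0
    marked[valid] = (high[valid] / total[valid]) > 0.10   # mark the higher subband

    # Group consecutive marked frames into raw temporal intervals (start, end).
    intervals, start = [], None
    for t, m in enumerate(marked):
        if m and start is None:
            start = t
        elif not m and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, n_frames - 1))

    # Re-merge intervals separated by less than merge_gap frames.
    merged = []
    for iv in intervals:
        if merged and iv[0] - merged[-1][1] < merge_gap:
            merged[-1] = (merged[-1][0], iv[1])
        else:
            merged.append(iv)

    durations = np.array([e - s + 1 for s, e in merged], dtype=np.float64)
    if durations.size < 2:
        return 0.0
    return float(np.var(np.diff(durations)))   # v_dt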

2.2.2. Audio Classification

Since there are some similar properties among most of the five classes considered, it is hard to find distinguishable features for all five classes at once. To address this problem, a hierarchical system is proposed: coarse-level classification is performed first, followed by fine-level classification. To meet the goal of on-line classification, the features described above are computed on the fly as audio data arrive.

2.2.2.1 The Coarse-Level Classification

The aim of coarse-level audio classification is to separate the five classes into two categories such that we can find some distinguishable features in each category.

Based on the energy distribution model, audio signals can first be classified into two categories: single-type and hybrid-type, i.e., without or with background components, respectively. Single-type sounds comprise pure speech and music; hybrid-type sounds comprise song, speech with environmental noise background, and speech with music background.
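In code, the coarse-level split reduces to a check of the model type; the helpers spectrogram and model_decision below are the hypothetical functions sketched earlier in this chapter, not the dissertation's own implementation.

def coarse_level_classify(signal):
    """Coarse-level classification for one audio clip, reusing the
    hypothetical helpers sketched earlier (spectrogram, model_decision)."""
    S = spectrogram(signal)                  # signal: 1-D array of audio samples
    model_type, T1, T2 = model_decision(S)
    # A bimodal energy distribution indicates a hybrid-type sound (with
    # background components); a unimodal one indicates a single-type sound.
    return "hybrid-type" if model_type == "bimodal" else "single-type"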

2.2.2.2 The Fine-Level Classification

The coarse-level classification stage yields a rough classification for audio data.

To get a finer classification result, the fine-level classifier is conducted. Based on the extracted feature vector X, the classifier is designed using a Bayesian approach under the assumption that the distribution of the feature vectors in each class wk is a multidimensional Gaussian distribution Nk(mk, Ck). The Bayesian decision function [15] for class wk, dk(X), has the form

dk(X) = ln P(wk) − (1/2) ln |Ck| − (1/2) (X − mk)' Ck⁻¹ (X − mk),

where P(wk) is the a priori probability of class wk, mk is the mean vector, and Ck is the covariance matrix. For a piece of sound, if its feature vector X satisfies di(X) > dj(X) for all j ≠ i, it is assigned to class wi.

The fine-level classifier consists of two phases. During the first phase, we take (vdPi, mdPi) as the feature vector X and apply the Bayesian decision function to each of the two coarse-level classes separately. Each audio signal of the single-type class is thereby classified as music or pure speech, and no further processing is needed. The same procedure is applied to the hybrid-type sounds, which may be speech with environmental noise background, speech with music background, or song. Speech with environmental noise background is distinguished in this phase, leaving a subclass that contains speech with music background and song. During the second phase, the Bayesian decision function with the feature vdt is applied to this subclass, and each signal is classified as speech with music background or song.
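A minimal sketch of the decision rule reconstructed above is given below; estimating the class means, covariance matrices, and priors from the training half of the data is assumed, and the toy numbers in the usage example are invented for illustration only.

import numpy as np

def bayes_decision(X, means, covs, priors):
    """Minimal sketch of the Bayesian decision rule:
    d_k(X) = ln P(w_k) - 0.5*ln|C_k| - 0.5*(X - m_k)' C_k^{-1} (X - m_k).
    `means`, `covs`, `priors` are lists over the classes; estimating them
    from training data (e.g., sample mean/covariance per class) is assumed."""
    scores = []
    for m, C, p in zip(means, covs, priors):
        diff = X - m
        _, logdet = np.linalg.slogdet(C)          # numerically stable ln|C_k|
        maha = diff @ np.linalg.solve(C, diff)    # (X - m_k)' C_k^{-1} (X - m_k)
        scores.append(np.log(p) - 0.5 * logdet - 0.5 * maha)
    return int(np.argmax(scores))                 # index of the winning class

# Usage example with two hypothetical classes in the (vdPi, mdPi) plane.
means = [np.array([1.0, 0.5]), np.array([6.0, 4.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]
print(bayes_decision(np.array([1.2, 0.4]), means, covs, priors))   # -> 0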

2.3. EXPERIMENTAL RESULTS

For comparison purposes, we have collected a set of 700 generic audio pieces of different types of sound, following the collection rule described in [14], as the testing database. Care was taken to obtain a wide variation in each category, and most of the clips are taken from the MPEG-7 content set [14, 17]. For single-type sounds, there are 100 pieces of classical music played with various instruments, 100 other music pieces of different styles (jazz, blues, light music, etc.), and 200 clips of pure speech in different languages (English, Chinese, Japanese, etc.). For hybrid-type sounds, there are 200 pieces of song sung by males, females, or children, 50 clips of speech with background music (e.g., commercials, documentaries, etc.), and 50 clips of speech with environmental noise (e.g., sports broadcasts, news interviews, etc.). These audio clips (with durations from several seconds to no more than half a minute) are stored as 16-bit samples at a 44.1 kHz sampling rate in the WAV file format.

2.3.1 Classification Results

Tables I and II show the results of the coarse-level classification and the final classification results, respectively. From Table II, it can be seen that the proposed classification approach for generic audio data can achieve an accuracy rate of more than 96% by using the testing database. The training is done using 50% of randomly selected samples in each audio type, and the test is operated on the remaining 50%.

