Chapter 1 Introduction
1.3 Thesis Organization
The thesis is organized as follows. Chapter 2 reviews related work and briefly introduces the audio signal processing techniques needed for later discussion. Chapter 3 gives a detailed analysis of the features used in our system. Chapter 4 discusses the proposed audio classification system, which includes a neural fuzzy inference network and a post-processing stage. Chapter 5 presents the experimental results together with some comments. Chapter 6 summarizes the thesis with concluding remarks and possible future work.
Chapter 2 Background
2.1 Related Works
As mentioned previously, ASC includes many research areas such as speech recognition, music genre classification, speaker identification, and so on. Although research in speech recognition, a domain of ASC, has existed for many years [3], there was no significant research output in other areas of ASC until recent years (after the 1990s). Some related work on this topic is presented in the following paragraphs.
In [4], audio was classified into music, speech and others. For music, the system computes peaks in the magnitude spectrum, and then bases its decision on the average length of time that peaks exist in a narrow frequency region. To separate out speech, the pitch track is examined.
Kimber and Wilcox [5] classified and segmented discussion recordings in meetings into speech, silence, laughter, and nonspeech sound using cepstral coefficients and a hidden Markov model (HMM).
In [6], Pfeiffer et al. presented the analysis of the amplitude, frequency, pitch, onset, offset and frequency transitions of audio signals.
With the analysis results, violence in movie soundtracks can be detected by recognizing shots, cries and explosions. Furthermore, music indexing can be an application of the analysis results.
In [7], automatic retrieval, classification and clustering of musical instruments, sound effects, and environmental sounds were achieved by using statistical values (mean, variance, autocorrelation) of features (pitch, loudness, brightness, and bandwidth). The article also described applications such as audio databases and file systems, audio database browsers, audio editors, and surveillance.
A simple approach to discriminate music from speech was presented by John Saunders [8]. The discriminator used straightforward features such as the energy contour and the zero-crossing rate (ZCR). Experiments were performed with four measures of the skewness of the distribution of ZCR, and 90% correct classification rate was obtained using these features. Improved performance of 98% correct classification rate was reported by including an energy contour dip measure into the discrimination process.
Scheirer and Slaney [9] introduced 13 features for speech/music discrimination. Statistical pattern recognition classifiers such as MAP, GMM, and KNN were evaluated. Using a 2.4-second window, they obtained an error rate of 1.4%. When smaller windows as well as more classes were taken into consideration, the error rate increased.
A method for content-based audio classification and retrieval was presented in [10]. The audio feature vector, named PercCepsL, consisted of an 18-dimensional perceptual feature vector and a 2L-dimensional cepstral feature vector. The perceptual feature vector was composed of the silence ratio, the pitched ratio, the means and standard deviations of total power, 4 subband powers, brightness, bandwidth and pitch. The 2L-dimensional cepstral feature vector came from the L MFCCs. A new pattern classification method called the nearest feature line (NFL) was also reported in this paper. Applying the proposed method to the audio database of 409 sounds from Muscle Fish, NFL+PercCeps8 yielded the lowest error rate of 9.78%.
Zhang and Kuo [11] proposed a heuristic rule-based ASC system.
The system was divided into two stages. They used four features including the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak tracks to achieve classification accuracy of more than 90%.
Lu et al. [12] classified an audio stream into speech, music, environment sound and silence using a robust two-stage audio classification and segmentation method. The features which were selected for classification such as high zero-crossing rate ratio (HZCRR), low short-time energy ratio (LSTER), spectrum flux (SF), band periodicity (BP), noise frame ratio (NFR), and LSP distance measure were described and discussed. An accuracy rate of over 96% was reported.
In [13], an audio clip was classified into five classes (silence, music, background sound, pure speech, and non-pure speech) by using a kernel SVM with a Gaussian radial basis function. The feature set included 8th-order MFCCs, zero-crossing rates (ZCR), short-time energy (STE), sub-band power distribution, brightness, bandwidth, spectrum flux (SF), band periodicity (BP), and noise frame ratio (NFR). The accuracy of the proposed SVM-based method ranged from 87.62% to 96.20% across the individual classes.
Panagiotakis and Tziritas [14] dealt with the characterization of an audio signal and developed a system for speech/music discrimination. They fitted the amplitude distribution, measured by the root mean square (RMS), with the generalized χ² distribution, and used the distribution to segment an audio signal. These segments were then classified into music and speech by using five actual features (normalized RMS variance, the probability of null zero-crossings, joint RMS/ZC measure, silence interval frequency, and maximum mean frequency) derived from two basic characteristics, i.e. the amplitude and the zero-crossings. The proposed system segmented signals with an accuracy rate of about 97% and classified signals with an accuracy rate of about 95%.
Although most of the systems mentioned previously classify general audio signals into various classes such as speech, pure music, song, etc., some systems specifically aim to classify musical genres [15]–[17]. In [18], Tzanetakis and Cook proposed three feature sets, which resulted in a 30-dimensional feature vector, to describe timbral texture, rhythmic content and pitch content. After feature extraction, they used standard statistical pattern recognition classifiers for classification. Several classifiers such as Gaussian classifiers, Gaussian mixture model (GMM) classifiers, and K-nearest neighbor (KNN) classifiers were trained to evaluate the proposed feature sets, and an accuracy rate of 61% for 10 genres was achieved by using GMM classifiers.
It is also worth mentioning that, although the above systems mainly process audio signals on their own, audio segmentation and classification can also be applied to video indexing. Research has shown that the audio track is often more useful than the visual images for indexing films or news programs [19]. In [20], an audio-based approach for video indexing was provided. Minami et al. applied image processing techniques to analyze the spectrogram of audio signals in video, and detected music by image edge detection. After the music components were detected, they were removed before speech detection. Speech detection was then accomplished by a comb filter. After music and speech detection, they used the information to construct two video indexing systems.
In this thesis, we focus on audio classification and segmentation, a critical problem in audio content analysis. Some audio signal processing techniques utilized in the thesis are provided in the following section.
2.2 Introduction to Audio Signal Processing
An audio signal is an extremely useful medium for conveying information. Humans are surrounded by audio signals whenever they are able to listen. In this section, we introduce some important characteristics of audio signals related to audio signal classification, as well as the audio signal processing techniques used to extract information from these characteristics.
2.2.1 The Characteristics of Audio Signals
An audio signal, i.e. sound, is a form of energy. A vibrating object displaces the air particles around it and produces a longitudinal wave that travels at about 343 meters per second. The frequency of a wave refers to how often the particles of the medium vibrate when the wave passes through it, and is measured as the number of complete back-and-forth vibrations of a particle of the medium per unit time.
In addition to frequency, sound has two other important characteristics, amplitude and complexity. These three physical characteristics influence three perceptual characteristics, pitch, loudness, and timbre, respectively. Roughly speaking, humans can tell what kind of sound they hear because each kind of sound has different characteristics. TABLE I lists the relationship [21].
TABLE I Relation between physical and perceptual features.
Physical characteristics   | Amplitude | Frequency | Complexity
Perceptual characteristics | Loudness  | Pitch     | Timbre
In daily life, music and speech are the two main classes of audio signals. From the characteristics discussed above, we can summarize some salient differences between speech and music as follows [22].
Tonality: Music tends to be composed of a multiplicity of tones, each with a unique distribution of harmonics. Speech consists of an alternating sequence of tonal and noise-like segments.
Bandwidth: The frequency of music is up to 20000 Hz while the frequency of speech is limited to 4000 Hz.
Energy sequences: Music usually has more stable energy sequences than speech does.
Some of these characteristics might be helpful to discriminate between these two kinds of audio signals, and they can be extracted using signal processing techniques.
As mentioned previously, an audio signal can be represented as a function of density of air varying with time. Thus, it is a continuous function. In order to be processed in a computer, the function needs to be sampled and digitized, and becomes a discrete-time audio signal. There are two parameters, i.e. the sampling rate and the bit resolution which influence the quality of the digital signal.
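The effect of the two parameters can be sketched in a few lines of Python. This is only an illustration under assumed values: the 440 Hz test tone, the 10 ms duration, and the helper name `sample_and_quantize` are hypothetical and do not come from the system described in this thesis.

```python
import math

def sample_and_quantize(func, duration_s, sampling_rate_hz, bits):
    """Sample func(t), assumed to lie in [-1, 1], and round to signed integer codes."""
    levels = 2 ** (bits - 1) - 1          # largest positive code, e.g. 127 for 8 bits
    n_samples = round(duration_s * sampling_rate_hz)
    return [round(func(n / sampling_rate_hz) * levels) for n in range(n_samples)]

def tone(t):
    """Illustrative continuous-time signal: a 440 Hz sine."""
    return math.sin(2 * math.pi * 440.0 * t)

codes_8bit = sample_and_quantize(tone, 0.01, 8000, 8)     # coarse amplitude grid
codes_16bit = sample_and_quantize(tone, 0.01, 8000, 16)   # finer amplitude grid
```

The sampling rate fixes how many samples represent each second of sound, while the bit resolution fixes how finely each sample's amplitude is represented.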
Any discrete-time audio signal can be created by adding an infinite number of discrete-time sinusoidal signals with different frequencies and amplitudes. That is,
$$ s[n] = \sum_{k} A_k \cos(\omega_k n). \tag{2.1} $$
This implies that we can decompose an audio signal into its component sinusoids. To perform this decomposition, we need Fourier analysis, which will be introduced in the following subsection.
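As a minimal sketch of this idea, the following Python code builds a short signal from two cosines and recovers their amplitudes with a naive discrete Fourier transform. The signal length, the amplitudes, and the choice of frequencies that fall exactly on DFT bins are illustrative assumptions made so that the recovery is exact.

```python
import cmath
import math

def synthesize(amplitudes, omegas, length):
    """Return s[n] = sum_k A_k * cos(omega_k * n) for n = 0..length-1."""
    return [sum(a * math.cos(w * n) for a, w in zip(amplitudes, omegas))
            for n in range(length)]

def dft_magnitudes(signal):
    """Naive O(N^2) DFT; returns |X[k]| for k = 0..N-1."""
    n_points = len(signal)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                    for n, x in enumerate(signal)))
            for k in range(n_points)]

N = 64                                   # illustrative signal length
bins = [4, 10]                           # illustrative DFT bins of the two cosines
amps = [1.0, 0.5]                        # illustrative amplitudes A_k
omegas = [2 * math.pi * b / N for b in bins]

s = synthesize(amps, omegas, N)
mags = dft_magnitudes(s)
# A cosine lying exactly on bin b contributes A * N / 2 to |X[b]|.
recovered = [2 * mags[b] / N for b in bins]
```

Here `recovered` matches `amps` up to floating-point error, which is the decomposition into component sinusoids in its simplest form.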
2.2.2 Audio Signal Processing Techniques
With the development of digital technology such as computers and digital signal processing (DSP), audio signals can not only be sampled, digitized, processed and stored in digital form, but complex algorithms can also be implemented cheaply and quickly. In this section, we discuss short-time analysis of audio signals, which is motivated by their non-stationary nature.
2.2.2.1 Short Time Analysis of Audio Signals [23]
Generally speaking, an audio signal is time-varying. That is, the signal changes rapidly with time. Fig. 2 is an example of a 10-second audio signal. It has a quite large variation and lacks a regular pattern.
Fig. 2 A 10-second audio signal.
As we can see, it is difficult to acquire effective information from this kind of time-varying signal. However, when we examine the signal from a micro standpoint, the signal is stable and has a regular pattern, as illustrated in Fig. 3. The waveform consists of the first 600 points of the signal in Fig. 2.
Fig. 3 The first 600 points of the signal in Fig. 2.
Thus, most audio signal processing techniques assume that the variation of an audio signal over a short time is relatively small. Under this assumption, the small segments of an audio signal are treated as independent of each other, and the properties within a single segment are treated as fixed. Therefore, we can view the small segments as short-time stationary signals. These small segments are called frames. To deal with these frames, short-time processing techniques are adopted.
Most of the short-time processing techniques can be represented mathematically in the form
$$ Q_n = \sum_{m=-\infty}^{\infty} T[x(m)]\, w(n-m). \tag{2.2} $$
The audio signal is subjected to a transform, T[ ], which may be linear or nonlinear. The transform is determined according to what features are to be extracted. Thus, $Q_n$ can be viewed as one of the features that represent the short-time signal. For example, the short-time energy function is defined as
$$ E_n = \frac{1}{N} \sum_{m} \left[ x(m)\, w(n-m) \right]^2 . $$
Here w(n) is a short-time window such as a Gaussian window, a Hamming window, or a Kaiser window. The function of a window is to gently scale the amplitude of the signal to zero at each end, reducing the discontinuity at frame boundaries. Using no windowing function is the same as using a rectangular window.
The windowing functions do not completely remove the frame boundary effects, but they do reduce the effects substantially.
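The short-time energy of a single frame can be sketched as follows. This is an illustration rather than the implementation used in this thesis: the 200-sample test frame and the helper names are assumptions, and the Hamming window uses the common textbook coefficients 0.54 and 0.46.

```python
import math

def hamming(length):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*i / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def short_time_energy(frame, window):
    """Mean of the squared, windowed samples of one frame.

    The window is symmetric, so the time reversal implied by w(n - m)
    does not change the result.
    """
    return sum((x * w) ** 2 for x, w in zip(frame, window)) / len(frame)

# Illustrative frame: 5 full periods of a unit sine in 200 samples.
frame = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
e_rect = short_time_energy(frame, [1.0] * 200)      # no windowing
e_hamming = short_time_energy(frame, hamming(200))  # tapered frame
```

Because the Hamming window attenuates the samples near both ends of the frame, `e_hamming` is smaller than `e_rect` for the same frame.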
When these windowing functions are applied to a signal, it is clear that some information near the frame boundaries is lost. For this reason, a further improvement is to overlap the frames. When each part of the signal is analyzed in more than one frame, information that is lost at a frame boundary is picked up between the boundaries of the next frame.
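Overlapping frames can be produced by advancing the frame start by a hop that is smaller than the frame length. The sketch below is illustrative only: the function name and the tiny frame and hop sizes are assumptions chosen to make the overlap easy to see.

```python
def split_into_frames(signal, frame_length, hop_length):
    """Return frames of frame_length samples whose starts are hop_length apart.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    last_start = len(signal) - frame_length
    return [signal[start:start + frame_length]
            for start in range(0, last_start + 1, hop_length)]

signal = list(range(10))
# A hop of 2 with frames of 4 samples gives 50% overlap.
frames = split_into_frames(signal, frame_length=4, hop_length=2)
```

With these values, every sample except the first two and the last two appears in two frames, so information attenuated near one frame boundary lies closer to the middle of a neighboring frame.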
Fig. 4 illustrates the concept of short-time analysis techniques and a windowing function.
Fig. 4 The concept of short-time analysis and a Hamming window.
Among various types of short-time signal analysis methods, the Short Time Fourier Transform (STFT) is one of the most common and useful, and it can be computed quickly with the Fast Fourier Transform algorithm. The STFT of the nth frame is defined as
$$ X_n\!\left(e^{j\frac{2\pi k}{N}}\right) = \sum_{m=-\infty}^{\infty} x(m)\, w(nL-m)\, e^{-j\frac{2\pi}{N}km}, \quad 0 \le k \le N-1, \tag{2.3} $$
where w(n) is a short-time window, and L is the window length. Many features used in the proposed system are based on the short-time magnitude of the STFT of the signal. The features will be introduced and discussed in detail in the next chapter.
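A minimal sketch of the short-time magnitude spectrum is given below. It uses a naive DFT for clarity, whereas a practical system would use an FFT routine. The assumptions are that the DFT order equals the window length, that the window is symmetric, and that the test signal, hop size, and function name are purely illustrative.

```python
import cmath
import math

def stft_magnitude(signal, window, hop_length):
    """Return |X_n(k)| for each frame n and bin k, using a naive DFT per frame.

    The frame-relative index m is used in the exponent; compared with an
    absolute index this changes only the phase, not the magnitude.
    """
    n_fft = len(window)
    spectra = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = [signal[start + i] * window[i] for i in range(n_fft)]
        spectra.append([
            abs(sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n_fft)
                    for m in range(n_fft)))
            for k in range(n_fft)
        ])
    return spectra

# Illustrative input: a cosine on bin 3 of a 16-point DFT, rectangular window.
signal = [math.cos(2 * math.pi * 3 * n / 16) for n in range(32)]
spectra = stft_magnitude(signal, window=[1.0] * 16, hop_length=8)
peak_bin = max(range(8), key=lambda k: spectra[0][k])
```

For this input each frame has its spectral peak at bin 3, so the magnitude spectrum stays the same from frame to frame, as expected for a stationary tone.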
Chapter 3 Audio Feature Analysis and Selection
It is difficult to classify audio signals directly based on raw data since raw data contain too much information for analysis, and important characteristics are lost in the noise of unreduced data. Thus, it is necessary to reduce the amount of data. This process is called feature extraction, which computes a numerical representation that can be used to characterize a segment of audio. The important information for characterizing a segment of audio is usually in the form of quantities such as frequency, rhythm, pitch and so on. Extracting features or a feature vector (which consists of several features) is the first step in any pattern classification system, as shown in Fig. 5.
A feature vector can be thought of as a short term description of the sound for that particular moment. For example, MFCCs (Mel-Frequency Cepstral Coefficients) characterize the vocal tract resonances and are commonly used in speech recognition.
Fig. 5 Feature extraction and the classification of the features are two major components of pattern classification.
Typically, the feature vectors are extracted from successive overlapping frames. For example, frames of 20 to 40 milliseconds overlapped by 10 milliseconds are often used because the characteristics of the signal are relatively stable within frames of this length.
After representing the raw data with the feature vectors, the audio classification problem can be viewed as a pattern classification problem based on a time series of feature vectors, which are points in a multi-dimensional feature space.
In the thesis, we break a long audio signal into small segments and compute a feature vector for each segment. Each feature vector can therefore be viewed as a point in the feature space, and our goal reduces to classifying these points into different classes.
Since the goal is to classify the points into different classes, the more discriminative the features are, the better the problem is solved. The difficulty, however, is how to find good features that classify audio signals effectively.
As mentioned in the previous chapter, different types of audio signals bear different characteristics. Thus, if we are able to know how the characteristics behave in different types of audio signals, and quantify the characteristics, we can find a good feature for classification. In other words, the knowledge about audio signals is the key point.
The features used in audio signal classification systems are usually divided into two categories: perceptual and physical features [24].
Perceptual features rely on a great deal of perceptual modeling. Physical features are directly related to physical properties of the signal and are easier to define and measure.
In the following sections, we introduce the main features used in our system. All of these features are computed from successive frames of 200 sampling points, taken from 1-second samples that each contain 8000 sampling points. In other words, each frame is 25 milliseconds long, and the sampling rate for the audio signals is 8 kHz.
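The frame arithmetic above can be checked with a short calculation. The frame count per second assumes that successive frames do not overlap, which the text does not state explicitly.

```python
SAMPLING_RATE_HZ = 8000   # samples per second, from the text
FRAME_LENGTH = 200        # samples per frame, from the text

# Duration of one frame in milliseconds.
frame_duration_ms = 1000 * FRAME_LENGTH / SAMPLING_RATE_HZ

# Frames in a 1-second sample, assuming non-overlapping frames.
frames_per_second = SAMPLING_RATE_HZ // FRAME_LENGTH
```

Under this assumption a 1-second sample yields 40 frames of 25 milliseconds each.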
3.1 Zero-Crossing Rate
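The zero-crossing rate (ZCR) of the nth frame is defined as
$$ ZCR_n = \frac{1}{2} \sum_{m} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m), \tag{3.1} $$
where sgn[ ] is the sign function, x(m) is a discrete-time audio signal, and w(n) is a 200-sample rectangular window. In other words, ZCR is how often an audio signal goes through the zero point in a frame.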
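A minimal sketch of a per-frame zero-crossing count is given below. It assumes the common convention that a crossing is a sign change between consecutive samples, with zero treated as positive; the two test tones and the function names are illustrative and not taken from the experiments in this thesis.

```python
import math

def sign(value):
    """Sign convention assumed here: +1 for value >= 0, -1 otherwise."""
    return 1 if value >= 0 else -1

def zero_crossing_rate(frame):
    """Count sign changes between consecutive samples of one frame."""
    # Each sign change contributes |(+1) - (-1)| = 2, hence the division by 2.
    return sum(abs(sign(frame[i]) - sign(frame[i - 1]))
               for i in range(1, len(frame))) // 2

# Illustrative 200-sample frames at 8 kHz: a low tone and a high tone.
low_tone = [math.sin(2 * math.pi * 100 * i / 8000 + 0.3) for i in range(200)]
high_tone = [math.sin(2 * math.pi * 2000 * i / 8000 + 0.3) for i in range(200)]

zcr_low = zero_crossing_rate(low_tone)
zcr_high = zero_crossing_rate(high_tone)
```

The high tone crosses zero far more often than the low tone within the same frame, which is the behavior that separates noise-like unvoiced sounds from voiced sounds.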
The properties of ZCR differ among types of audio signals. Take speech as an example: speech signals consist of alternating voiced and unvoiced sounds. Unvoiced sounds tend to have a higher ZCR, and voiced sounds tend to have a lower ZCR. Thus, the variation of ZCR over a stretch of speech tends to be large. On the other hand, music signals usually have lower variation as well as lower ZCR. In this way, we can not only discriminate unvoiced from voiced speech using ZCR, but also use the variation of ZCR to distinguish between music and speech. The variance of ZCR in a 1-second window is defined as
$$ ZCR\_var = \frac{1}{N} \sum_{n=1}^{N} \left( ZCR_n - \overline{ZCR} \right)^2, \tag{3.2} $$
where $\overline{ZCR}$ is the average of all ZCRs in a 1-second sample.
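The variance in (3.2) can be sketched as follows. The two ZCR sequences are made-up illustrative values, not measurements: one is steady, as might be expected for music, and one alternates between low and high values, as might be expected for speech with voiced and unvoiced sounds.

```python
def zcr_variance(zcr_values):
    """Population variance of the per-frame ZCR values in one 1-second sample."""
    mean = sum(zcr_values) / len(zcr_values)
    return sum((z - mean) ** 2 for z in zcr_values) / len(zcr_values)

steady = [10] * 40            # hypothetical music-like ZCR sequence
alternating = [5, 95] * 20    # hypothetical speech-like ZCR sequence

var_steady = zcr_variance(steady)
var_alternating = zcr_variance(alternating)
```

The steady sequence has zero variance while the alternating one has a large variance, so a single number per second captures the difference in ZCR behavior.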
The ZCR and ZCR_var of different types of audio signals are plotted in Fig. 6. As we can see, the ZCR curve of music is relatively smooth, and ZCR_var is smaller. For speech signals, the ZCR curve varies rapidly, and ZCR_var is relatively larger.
Fig. 6 ZCR and variance of ZCR: (a) ZCR of music, (b) ZCR of speech, (c) ZCR_var of music, (d) ZCR_var of speech.
Another way to show that ZCR_var can discriminate between speech and music effectively is illustrated in Fig. 7. The figure shows the histograms of ZCR_var for speech and music signals. The overlap is quite small. If ZCR_var is used alone to discriminate speech from pure music, the discrimination error rate would be only about 9%.
Fig. 7 ZCR_var histograms for speech and music signals.
The most attractive property of ZCR and the variance of ZCR is their low computational cost. ZCR can be calculated directly in the time domain, so no transformation is needed. This is an important property for systems designed for real-time use. For example, a broadcast monitor that continuously monitors radio content to decide whether the content should be discarded is a real-time system.
Although ZCR and the variance of ZCR are good features for speech/music discrimination, they are not sufficient for other classification tasks. Thus, other features are necessary for further classification.
3.2 Spectrum Flux
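Spectrum flux (SF) measures the average variation of the spectrum between two adjacent frames in a 1-second segment. It is defined as
$$ SF = \frac{1}{(N-1)(K-1)} \sum_{n=1}^{N-1} \sum_{k=1}^{K-1} \left[ \log\!\left( \left| X_n\!\left(e^{j\frac{2\pi k}{K}}\right) \right| + \delta \right) - \log\!\left( \left| X_{n-1}\!\left(e^{j\frac{2\pi k}{K}}\right) \right| + \delta \right) \right]^2, \tag{3.3} $$
where $\left| X_n\!\left(e^{j\frac{2\pi k}{K}}\right) \right|$ is the amplitude of the discrete Fourier transform of the nth frame of the input signal as defined in (2.3), K is the order of the DFT, N is the total number of frames, and δ = 0.000001 is a very small value to avoid calculation overflow.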
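A minimal sketch of a spectrum flux computation is given below. It assumes the commonly used log-magnitude form, in which the squared difference of log(amplitude + δ) between adjacent frames is averaged over frames n = 1..N-1 and bins k = 1..K-1. The index ranges, the normalization, and the tiny example spectra are assumptions made for illustration.

```python
import math

DELTA = 1e-6  # small constant from the text; keeps log() finite at zero amplitude

def spectrum_flux(magnitudes):
    """Mean squared log-magnitude difference between adjacent frames.

    magnitudes[n][k] is the DFT amplitude of frame n at bin k.
    Bin 0 is skipped, following the assumed index range k = 1..K-1.
    """
    n_frames = len(magnitudes)
    n_bins = len(magnitudes[0])
    if n_frames < 2 or n_bins < 2:
        raise ValueError("need at least two frames and two bins")
    total = 0.0
    for n in range(1, n_frames):
        for k in range(1, n_bins):
            diff = (math.log(magnitudes[n][k] + DELTA)
                    - math.log(magnitudes[n - 1][k] + DELTA))
            total += diff * diff
    return total / ((n_frames - 1) * (n_bins - 1))

# Hypothetical amplitude spectra: one that never changes and one that does.
steady = [[1.0, 2.0, 3.0]] * 4
changing = [[1.0, 2.0, 3.0], [1.0, 4.0, 0.5], [1.0, 0.2, 5.0], [1.0, 3.0, 0.1]]

sf_steady = spectrum_flux(steady)
sf_changing = spectrum_flux(changing)
```

A spectrum that stays the same from frame to frame gives a flux of zero, while a spectrum that changes between frames gives a larger value.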
Generally speaking, speech has a larger SF value than pure music, songs, and mixtures of speech and music. This is because the tone tends to vary over a short time when humans speak, whereas a musical note usually remains at the same level for a certain period of time. When people sing, the vocal sound follows the musical note. Thus, the vocal sound also remains at the same level for a certain period of time. The difference between pure music and vocal sound is that vocal sound might last for more than one