Frequency Variation of Top-3 peaks - Audio Feature Analysis and Selection

Chapter 3 Audio Feature Analysis and Selection

3.6 Frequency Variation of Top-3 peaks

Although the features described in the previous sections discriminate well between speech and pure music, their performance is not sufficient for other kinds of classification, such as discrimination between pure music and song. Take ZCR_var, spectrum flux, and normalized RMS variance as examples: their histograms for pure music and song overlap heavily, as shown in Fig. 15, 16, and 17. The solid lines represent the histograms of these three features for pure music, and the dotted lines represent those for song. Clearly, if only these features are employed for pure music/song discrimination, the recognition rate will drop sharply.

Fig. 15 ZCR_var histograms for pure music and song.

Fig. 16 SF histograms for pure music and song.

Fig. 17 Normalized RMS variance histograms for pure music and song.

To address this problem, a feature called frequency variation of top-3 peaks (FVTP) is proposed. FVTP is derived from the observation that the spectral structure of pure music during a note is much more stable than that of song and speech. Fig. 18, 19, and 20 show the spectra of five adjacent frames of pure music, song, and speech, respectively.

As we can see, the three largest peaks in the spectrum of music do not change their locations. On the other hand, the locations of the three largest peaks in the spectrum of song vary significantly. Thus, FVTP is defined as the sum of the variations of frequencies of the three largest peaks over 500 Hz in the spectrum during a note (for music) or a word (for song). That is, FVTP of kth note or word is defined mathematically as

average frequency of the ith peak in a note or word, and N is the number of frames in a note or word. The average of FVTPs of all notes or words in a 1-second sample is then calculated to be the feature, i.e.

$$\mathrm{FVTP} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{FVTP}_k$$
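As a rough illustration of the definition above, the following Python sketch computes an FVTP value with NumPy. It rests on several assumptions that the text does not confirm: "variation" is taken to be the variance of each peak frequency around its average over the N frames of a note or word, the three peaks are the three largest local maxima of the magnitude spectrum above 500 Hz, and they are matched across frames by sorting them in frequency. The function names `top3_peak_freqs`, `fvtp_of_note`, and `fvtp_of_sample` are hypothetical.

```python
import numpy as np

def top3_peak_freqs(frame, fs, fmin=500.0):
    """Frequencies (Hz), in ascending order, of the three largest spectral peaks above fmin.

    Assumes the frame contains at least three local maxima above fmin.
    """
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Interior bins that are local maxima of the magnitude spectrum and lie above fmin.
    idx = np.arange(1, len(mag) - 1)
    is_peak = (mag[idx] > mag[idx - 1]) & (mag[idx] >= mag[idx + 1]) & (freqs[idx] > fmin)
    peaks = idx[is_peak]
    top = peaks[np.argsort(mag[peaks])[-3:]]  # three largest by magnitude
    return np.sort(freqs[top])

def fvtp_of_note(frames, fs):
    """Assumed FVTP_k: sum over the three peaks of the variance of peak frequency across frames."""
    f = np.array([top3_peak_freqs(frame, fs) for frame in frames])  # shape (N, 3)
    return float(np.sum(np.mean((f - f.mean(axis=0)) ** 2, axis=0)))

def fvtp_of_sample(notes, fs):
    """FVTP of a 1-second sample: the mean of FVTP_k over its K notes or words."""
    return float(np.mean([fvtp_of_note(frames, fs) for frames in notes]))
```

Here `notes` is a list with one entry per note or word, each entry being the list of frames belonging to it. Under the variance assumption the result has units of Hz squared.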

Fig. 18 Five adjacent frames of pure music. Panels (a) to (e) show the first to the fifth frame.

Fig. 19 Five adjacent frames of song. Panels (a) to (e) show the first to the fifth frame.

Fig. 20 Five adjacent frames of speech. Panels (a) to (e) show the first to the fifth frame.

To find the boundaries between notes or words within one second, notes or words are segmented by amplitude. First, the average amplitude of the nth frame is calculated as RMS_n, defined in (3.4). For example, a 1-second music waveform with two notes and its RMS_n are illustrated in Fig. 21.

Fig. 21 (a) A 1-second music waveform with two notes. (b) RMS of 40 frames of the signal in (a).

Generally speaking, the RMS value changes suddenly when the audio signal moves from one note to another. Thus, to locate this point, the differences between consecutive RMS values in Fig. 21 are calculated, as illustrated in Fig. 22.

Fig. 22 The differences between 40 RMSs

Then, all local maxima of the RMS differences that are larger than one-fifth of the global maximum of the RMS differences are taken as transition points, as shown in Fig. 23. In this case, only the global maximum of the RMS differences is selected, and it is exactly where the note change happens.

Fig. 23 The transition point is marked by ‘o’.

Finally, the FVTPs of the two notes are computed separately, and their average is taken as the FVTP of the 1-second sample.
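The segmentation steps above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: RMS_n from (3.4) is taken to be the ordinary root mean square of a frame, the 1-second sample is cut into 40 equal non-overlapping frames (as in Fig. 21), and the "differences" are absolute differences between consecutive frames. The function names `frame_rms` and `transition_points` are hypothetical.

```python
import numpy as np

def frame_rms(signal, n_frames=40):
    """RMS of each of n_frames (nearly) equal-length, non-overlapping frames."""
    frames = np.array_split(np.asarray(signal, dtype=float), n_frames)
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def transition_points(rms, ratio=0.2):
    """Frame indices where a new note or word is assumed to start.

    A transition is a local maximum of |rms[n+1] - rms[n]| that exceeds
    ratio (one-fifth by default) times the global maximum of the differences.
    """
    d = np.abs(np.diff(rms))
    if d.size == 0 or d.max() == 0:
        return []
    points = []
    for n in range(d.size):
        left = d[n - 1] if n > 0 else -np.inf
        right = d[n + 1] if n < d.size - 1 else -np.inf
        if d[n] > left and d[n] >= right and d[n] > ratio * d.max():
            points.append(n + 1)  # first frame after the jump
    return points
```

The returned indices split the 40 frames into notes or words, each of which can then be given its own FVTP before averaging.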

FVTP is an effective feature for discriminating pure music from song. Generally speaking, vocal components are prominent in song, so the peaks in the spectrum are usually generated by them. Vocal components in song might last for more than one musical note, and human vocal cords tend to vibrate when singing, so the locations of the top-3 peaks in the spectrum fluctuate constantly. This causes a large FVTP value for song. In contrast, pure music produced by musical instruments normally has a stable spectral structure, which results in a relatively small FVTP value. Fig. 24 shows the FVTP histograms for pure music and for song. As can be seen, the FVTP value of pure music is around 0.1 × 10^6, while the FVTP value of song ranges from about 0.2 × 10^6 to 1 × 10^6. Thus, FVTP is a good discriminator between pure music and song.

Fig. 24 FVTP histograms for pure music signals and song

With the features introduced in the previous sections, feature extraction for audio classification is complete. The next chapter discusses the framework of our audio classification system.

Chapter 4 SONFIN-Based Audio Signal Classification and Segmentation System

The proposed audio classification system consists of four major parts: feature extraction, silence detection, the SONFIN classifier, and post-processing. The framework of the system and the classification flow are introduced in the following sections.

4.1 Neural Fuzzy Inference Network

The main classifier employed in the proposed system is a particular neural fuzzy network named SONFIN [26] (self-constructing neural fuzzy inference network). SONFIN is a general connectionist model of a fuzzy logic system, which is able to find its optimal structure and parameters automatically. Initially, there are no rules in the SONFIN, and rules are created and adapted as on-line learning proceeds via simultaneous structure and parameter learning.

The structure of the SONFIN is shown in Fig. 25. This 6-layered network realizes a fuzzy model of the following form:

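$$\text{Rule } i:\ \text{IF } x_1 \text{ is } A_{i1} \text{ and } \dots \text{ and } x_n \text{ is } A_{in}\ \text{ THEN } y \text{ is } m_{0i} + a_{ji} x_j + \dots \tag{4.1}$$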

where $A_{ij}$ is a fuzzy set, $m_{0i}$ is the center of a symmetric membership function on $y$, and $a_{ji}$ is a consequent parameter. Note that, unlike the traditional TSK model, where all the input variables are used in the output linear equation, only the significant ones are used in the SONFIN. The functions of the nodes in each of the six layers of the SONFIN are described in the following paragraph.

Fig. 25 Network structure of SONFIN.

Each node in Layer 1, which corresponds to one input variable, only transmits input values to the next layer directly. Each node in Layer 2 calculates the membership value that specifies the degree to which an input value belongs to a fuzzy set. Each node in Layer 3 represents one fuzzy logic rule and performs precondition matching of the rule. The number of nodes in Layer 4 is equal to that in Layer 3, and the result (firing strength) calculated in Layer 3 is normalized in this layer. Layer 5 is called the consequent layer. Two types of nodes are used in this layer. The node denoted by a blank circle is the essential node representing a fuzzy set of the output variable. The shaded node is generated only when necessary. One of the inputs to a shaded node is the output delivered from Layer 4, and the other possible inputs are the selected significant input variables from Layer 1. Combining these two types of nodes in Layer 5, we obtain the whole function performed by this layer as the linear equation in the THEN part of the fuzzy logic rule in (4.1). Each node in Layer 6 corresponds to one output variable. The node integrates all the actions recommended by Layer 5 and acts as a defuzzifier to produce the final inferred output.
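The layer functions described above can be sketched as a single forward pass. The sketch makes several assumptions that the text does not state: Gaussian membership functions in Layer 2, the product as the AND operation in Layer 3, and a weighted sum as the Layer 6 defuzzifier. For simplicity every input may appear in the consequent (a zero coefficient removes it), whereas the SONFIN selects only the significant inputs. The function name and parameter layout are hypothetical, and structure and parameter learning are not shown.

```python
import numpy as np

def sonfin_forward(x, centers, widths, m0, a):
    """Forward pass of a TSK-type fuzzy network with R rules and n inputs.

    x: (n,) input vector
    centers, widths: (R, n) parameters of the assumed Gaussian membership functions
    m0: (R,) consequent centers
    a: (R, n) consequent coefficients (zero means the input is unused by that rule)
    """
    x = np.asarray(x, dtype=float)                # Layer 1: pass inputs through
    mu = np.exp(-((x - centers) / widths) ** 2)   # Layer 2: membership values
    firing = np.prod(mu, axis=1)                  # Layer 3: firing strength of each rule
    norm = firing / np.sum(firing)                # Layer 4: normalized firing strengths
    consequent = m0 + a @ x                       # Layer 5: m0_i + sum_j a_ji * x_j
    return float(np.sum(norm * consequent))       # Layer 6: defuzzified output
```

With two rules centered at 0 and 1 and consequents 0 and 1, an input of 0.5 fires both rules equally, so the output is 0.5.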

Two types of learning, i.e. structure learning and parameter learning, are used concurrently to construct the SONFIN. The structure learning includes both the precondition and the consequent structure identification of a fuzzy if-then rule. For the parameter learning, which is based on supervised learning algorithms, the parameters of the linear equations in the consequent parts are adjusted to minimize a given cost function. The SONFIN can be used for normal operation at any time during the learning process, without repeated training on the input-output patterns, when on-line operation is required. There are no rules in the SONFIN initially; rules are created dynamically as learning proceeds upon receiving on-line incoming training data, by performing the following learning processes simultaneously:

(A) Input/output space partitioning, (B) Construction of fuzzy rules, (C) Consequent structure identification, (D) Parameter identification.

Processes A, B, and C belong to the structure learning phase and process D belongs to the parameter learning phase.

4.2 Classification Flow and Post-processing

The proposed audio classification flow is illustrated in Fig. 26. After an audio stream comes in, all input signals are downsampled to an 8 kHz sampling rate and segmented into 1-second subsegments (samples), which are the classification units of the system. Although a subsegment may contain a mixture of two types of audio signals, the dominant type is chosen to index the subsegment.

After the pre-processing, audio features are first extracted. Then, silence segments are detected and indexed by a silence detector according to some features extracted in the previous step. The non-silent sounds are classified into speech segments and segments with music components.

After that, segments with music components are categorized into two groups, namely song and pure music. Different sets of feature vectors are applied in these two stages. In both classifying stages, a post-processing technique is utilized to correct the misclassification according to the property of continuity of an audio stream. The following will describe each of these processes.

Fig. 26 The proposed audio classification flow.
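The two-stage flow can be summarized by the following Python sketch. It is only an outline under stated assumptions: the silence detector, the two classifiers, and the smoothing step are passed in as callables whose names and return values (`"speech"`, `"song"`) are hypothetical, and the label codes 0 to 3 follow the convention used for the final result (0 silence, 1 pure speech, 2 pure music, 3 song).

```python
def classify_stream(samples, is_silence, stage1, stage2, smooth):
    """Label each 1-second sample: 0 silence, 1 speech, 2 pure music, 3 song."""
    # Stage 1: silence (0), speech (1), or "with music components" (provisionally 2).
    labels = [0 if is_silence(s) else (1 if stage1(s) == "speech" else 2)
              for s in samples]
    labels = smooth(labels)
    # Stage 2: split samples with music components into pure music (2) and song (3).
    labels = [(3 if stage2(s) == "song" else 2) if lab == 2 else lab
              for s, lab in zip(samples, labels)]
    return smooth(labels)
```

Passing the components as callables keeps the sketch independent of any particular classifier, so the same outline applies whether SONFIN or a k-NN rule is used in each stage.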

A. Feature Extraction

Audio features including ZCR, ZCR_var, spectrum flux, normalized RMS variance (σ_A²), LSTER, HZCRR, and FVTP, introduced in Chapter 3, are first computed over each 1-second sample to represent it.

B. Silence Detection

Silence segments are detected and indexed by a silence detector based on a robust estimate of signal amplitude obtained from experimental results. If a segment satisfies the criteria, it is indexed as silence, or 0, in our system.

C. Stage 1: Speech and Sound with Music Components Classification

The non-silent sounds are then classified into speech and segments with music components. In this stage, spectrum flux, normalized RMS variance, LSTER, and HZCRR form a feature vector, {SF, σ_A², LSTER, HZCRR}, that represents each audio sample. The SONFIN is then employed for classification.

The classification works well in most cases. However, in some special cases, classification errors might occur. Thus, in order to optimize the classification performance, a post-processing technique is indispensable.

D. Post-processing Technique

As mentioned previously, some classification errors may occur. To deal with this problem, a post-processing technique named “smoothing” is applied to correct them. The main idea of smoothing derives from the fact that a genuine audio stream possesses the property of continuity. That is, there are few abrupt changes in a real audio stream. For example, there should not be a sudden 1-second speech segment in a pure music track, nor a sudden 1-second music segment in a news broadcast.

Fig. 27 The concept of “smoothing”.

Smoothing searches for a discontinuity of 1-second length and sets the index of that sample to the index of the preceding sample. Fig. 27 illustrates the concept of smoothing. The rule can be expressed as

    for i = 1:length(x)-2
        % x(i+1) is a non-silent sample that differs from both of its neighbours
        if x(i+1) ~= x(i) && x(i+1) ~= 0 && x(i+2) ~= x(i+1)
            x(i+1) = x(i);
        end
    end

where x(i) is the index number of the ith input audio segment. In the system, “smoothing” is applied to both classification stages to refine the classification result.
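For readers who want to try the rule on a label sequence, the following is a direct Python transcription of it (0-based indexing instead of 1-based). The function name `smooth` is hypothetical; the condition is the one stated above, including the exemption for silence (label 0).

```python
def smooth(labels):
    """Replace isolated 1-second labels with the label of the preceding sample."""
    x = list(labels)
    for i in range(len(x) - 2):
        # x[i+1] differs from both neighbours and is not silence (0).
        if x[i + 1] != x[i] and x[i + 1] != 0 and x[i + 2] != x[i + 1]:
            x[i + 1] = x[i]
    return x
```

For instance, `smooth([2, 2, 1, 2, 2])` removes the isolated speech label in a music passage, while `smooth([1, 1, 0, 1, 1])` leaves the 1-second silence untouched. Note that the rule as written does not require the two neighbours to agree with each other.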

E. Stage 2: Music and Song Discrimination

In the second classification stage, segments with music components are categorized into two groups, namely song and pure music. FVTP and short-time energy are chosen to form the feature vector instead of {SF, σ_A², LSTER, HZCRR}, because the distributions of the latter features for pure music and song overlap significantly and result in a high classification error. As in the first classification stage, SONFIN and smoothing are employed for classification and refinement, respectively.

F. Segmentation

Technically, segmentation of an audio signal is accomplished once the classification of the 1-second segments and “smoothing” are done. For example, suppose a 100-second audio stream containing silence, speech, pure music, and song is to be segmented. The audio stream is divided into 100 subsegments, each of which is classified into one of the four classes using the proposed classification flow. Next, “smoothing” is applied to correct classification errors. This procedure works well in most cases. The experimental results are provided in the next chapter.
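Since each subsegment is one second long, turning the per-second labels into segments amounts to merging runs of equal labels. A minimal sketch, with the hypothetical function name `labels_to_segments` and an assumed output format of (start second, end second, label) tuples:

```python
def labels_to_segments(labels):
    """Merge consecutive equal labels into (start_second, end_second, label) tuples."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments
```

The end second is exclusive, so a segment `(3, 6, 2)` covers seconds 3, 4, and 5 with label 2.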

Chapter 5 Experimental Results

5.1 Audio Database

In order to evaluate the proposed audio classification system, an audio database was built. The database contains three types of audio signals, i.e. speech, pure music, and song, amounting to 2460 seconds, 2884 seconds, and 1843 seconds, respectively. These data were acquired randomly from language teaching radio programs, TV news, music CD tracks, and MP3 files. They were hand-labeled into the three categories. All the files in the database have an 8000 Hz sample rate, 16-bit resolution, and a single (mono) channel.

5.2 Evaluation with SONFIN and k-NN

As mentioned previously, SONFIN is employed as the main classifier in the proposed system. Moreover, in order to verify the classification flow and the proposed feature vectors, a k-NN decision rule is also applied for classification. Here, a 1-NN decision rule combined with leave-one-out cross-validation is employed. TABLE II presents the experimental results of classification in stage 1 of the proposed system. The various features are first evaluated alone with the 1-NN decision rule combined with leave-one-out cross-validation. The feature vector combining {SF, σ_A², LSTER, HZCRR} has the best classification performance. The last row lists the result obtained with SONFIN in stage 1; the performance is so good that almost all samples are classified correctly.
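The verification procedure, a 1-NN decision rule with leave-one-out cross-validation, can be sketched as follows. The distance measure is not stated in the text; the sketch assumes Euclidean distance, and the function name `loo_1nn_accuracy` is hypothetical.

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-NN classifier (Euclidean distance assumed).

    X: (num_samples, num_features) feature vectors; y: (num_samples,) class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared distances between all samples.
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    # Leaving a sample out means it may not be its own nearest neighbour.
    np.fill_diagonal(d, np.inf)
    nearest = np.argmin(d, axis=1)
    return float(np.mean(y[nearest] == y))
```

Each sample is classified by its nearest neighbour among all the other samples, which is exactly what leaving one sample out and training on the rest reduces to for 1-NN. The full pairwise distance array is convenient for a sketch but grows quadratically with the number of samples, so a larger database would be processed in chunks.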

TABLE II

Classification performance of different features in stage 1 of the proposed system. The “average” column shows the average accuracy rate of all samples, while the other two columns show the accuracy rates of pure speech and “with music components”, respectively.

Columns: Features; Accuracy (%): Average, Pure speech, With music components

The results of stage 2 of the proposed system are listed in TABLE III. It should be noted that these experiments are carried out independently: the “with music components” output of stage 1 is not used as the input of stage 2 in this experiment. In this way, stage 2 can be evaluated without interference from stage 1. TABLE IV, on the other hand, lists the result of stage 2 when the “with music components” output of stage 1 is used as its input.

From TABLE III, we can see that when features are applied alone to discriminate “pure music” from “song”, the proposed feature, FVTP, has the best performance. Furthermore, an accuracy rate of over 90% will be achieved when the proposed feature, FVTP, is combined with a basic feature, Energy.

TABLE III

Classification performance of different features in stage 2. The “average” column shows the average accuracy rate of all samples, while the other two columns show the accuracy rates of pure music and song, respectively.

TABLE IV

Classification performance of stage 2 with the influence of stage 1.

Accuracy (%)

In addition to stage 1 and stage 2 of the proposed system, we also conducted experiments on speech/song discrimination. The experimental result is listed in TABLE V.

TABLE V

Classification performance of speech/song discrimination

Accuracy (%)

For practical audio stream classification and segmentation, the results are illustrated in Fig. 28 and 29. Stage 1 and stage 2 are combined to classify and segment a real audio stream. The first 40 seconds of the clip were recorded from an English language teaching program called Studio Classroom and contain two short musical interludes. The last 32 seconds consist of a 10-second song clip, a 12-second music clip, and another 10-second song clip, in that order.

Figure 28 shows the result of stage 1. The upper plot is the original input audio waveform, which is 72 seconds long, and the middle plot is the classification result without “smoothing”. The lower plot shows the result of the middle plot after “smoothing”. In the middle plot of Fig. 28, we can see that a 1-second segment, indicated by an ellipse, is misclassified. However, it is corrected after “smoothing”, as shown in the lower plot.

Figure 29 shows the final segmentation result where 0 corresponds to silence, 1 corresponds to pure speech, 2 corresponds to pure music, and 3 corresponds to song. In Fig. 29, we can see from the final result that the system successfully classified and segmented the audio stream.

Fig. 28 Result of practical audio stream classification and segmentation in stage 1.

Fig. 29 Result of practical audio stream classification and segmentation in stage 2.

Another practical result, with a slight classification error, is illustrated in Fig. 30. The first 12 seconds of the 18-second audio clip are pure music and the rest is song. In stage 1 (the second and third plots), the performance is good: all segments are correctly classified as “with music components”. In stage 2 (the fourth plot), the last two seconds are song but are misclassified as music. The main reason might be that the vocal components are relatively weak in these two seconds, which leads the system to misclassify them.

Fig. 30 Practical experimental result of music and song.

5.3 Discussion

These experiments show that the proposed classification system and the proposed feature, FVTP, perform well for audio classification and segmentation. Speech/music discrimination achieves a recognition rate of 99% using the proposed system and the combination of features. When it comes to pure music/song classification, most existing features perform poorly, with the exception of FVTP. When FVTP is combined with energy, the quite difficult problem of pure music/song classification can be solved effectively.

It is worth mentioning that, in theory, FVTP should perform even better, according to our experiments on a single musical note and on a single speech or song utterance. A musical note indeed has a quite small FVTP, and a speech or song utterance a relatively large one, as illustrated in Fig. 18, 19, and 20 in Section 3.6. The main reason for the reduced classification accuracy might be that the transition point between notes or utterances is not located precisely enough. This might result in a larger FVTP for pure music or a smaller FVTP for song. Thus, developing a technique that locates the transition point more precisely is part of our future work.

A general k-NN decision rule combined with leave-one-out cross-validation was also applied for verification, and its result was consistent with that of our system, which supports the credibility of the results.

Admittedly, some misclassifications occur under certain circumstances. Nevertheless, the “smoothing” technique performs well for error correction, since real audio streams possess the property of continuity.

Chapter 6 Conclusion

In this thesis, we have presented an audio classification and segmentation system that distinguishes between instrumental music, pure speech, song, and silence.

We have applied signal processing techniques to the signals to obtain discriminative features, which have been analyzed and discussed in detail. In addition to analyzing the existing features, we have also proposed a novel feature named FVTP in order to classify audio signals with musical components into pure music and song with a higher accuracy rate.

The system consists of two main stages. Different sets of features have been applied in each of these two stages of the system. A neural fuzzy inference network named SONFIN has been adopted in the proposed system as the classifier. A simple k-NN decision rule combined
