
Chapter 5 Experimental Results

5.3 Discussion

These experiments show that the proposed classification system and the proposed feature, FVTP, perform well for audio classification and segmentation. Speech/music discrimination achieves a recognition rate of 99% using the proposed system and the combination of features. For pure music/song classification, most existing features perform poorly; FVTP is the exception. When FVTP is combined with energy, the difficult problem of pure music/song classification can be solved effectively.

It is worth mentioning that, in theory, FVTP should have performed even better, judging from our experiments on a single musical note and on a speech or song utterance. The FVTP of a musical note indeed has quite small variation, whereas the FVTP of a speech or song utterance has relatively large variation, as illustrated in Figs. 18, 19, and 20 in Section 3.6. The main cause of the reduced classification accuracy may be that the transition points between notes or utterances are not located precisely enough. This can result in a larger FVTP for pure music or a smaller FVTP for song.

Thus, developing a technique that locates the transition points more precisely is part of our future work.

A general k-NN decision rule combined with leave-one-out cross-validation was also applied for verification, and its results were consistent with those of our system. This agreement supports the reliability of the results.
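The verification idea can be illustrated with a minimal sketch. It assumes Euclidean distance, a plain majority vote, and k = 3; the thesis does not state these choices here, and the function names and the toy two-feature data are hypothetical, not the thesis data or features.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Return the majority label among the k training samples nearest to query."""
    # train is a list of (feature_vector, label) pairs.
    # Euclidean distance is an assumption made for this sketch.
    neighbours = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Classify each sample using all the other samples and return the hit rate."""
    hits = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # hold out sample i
        if knn_predict(rest, features, k) == label:
            hits += 1
    return hits / len(samples)


# Toy two-feature data, invented for illustration only.
toy = [
    ((0.10, 0.20), "music"), ((0.15, 0.25), "music"), ((0.20, 0.15), "music"),
    ((0.80, 0.90), "song"), ((0.85, 0.80), "song"), ((0.90, 0.85), "song"),
]
accuracy = leave_one_out_accuracy(toy, k=3)
print(accuracy)
```

Because every sample is held out exactly once, this estimate uses all of the available data for both training and testing, which is useful when the labelled set is small.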

Admittedly, some misclassifications occur under certain circumstances.

Nevertheless, the “smoothing” technique corrects such errors well, since real audio streams possess the property of continuity.
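One way to picture why continuity helps is a sliding-window majority vote over the per-segment labels. The sketch below is only an assumed stand-in for the smoothing procedure, whose exact rules are not restated in this section; the function name, the window length of 3, and the label sequence are hypothetical.

```python
from collections import Counter


def smooth_labels(labels, window=3):
    """Replace isolated labels by the majority label of a centred window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        # The window is always taken from the original, unsmoothed sequence.
        segment = labels[max(0, i - half): i + half + 1]
        counts = Counter(segment)
        best = max(counts.values())
        if counts[labels[i]] == best:
            # The current label is (one of) the most frequent: keep it.
            smoothed.append(labels[i])
        else:
            # The current label is outvoted by its neighbours: replace it.
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


# Hypothetical classifier output with one isolated "music" decision.
raw = ["speech", "speech", "music", "speech", "speech", "song", "song", "song"]
smoothed = smooth_labels(raw, window=3)
print(smoothed)
```

In this sketch the isolated "music" label is overwritten by its neighbours, while the genuine change from speech to song is preserved because each side of the boundary has support within its own window.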

Chapter 6 Conclusion

In this thesis, we have presented an audio classification and segmentation system that distinguishes among instrumental music, pure speech, song, and silence.

We have applied signal processing techniques to the audio signals to extract effective features, and these features have been analyzed and discussed in detail. In addition to analyzing the existing features, we have proposed a novel feature named FVTP in order to classify audio signals with musical components into pure music and song with higher accuracy.

The system consists of two main stages, and a different set of features is applied in each stage. A neural fuzzy inference network named SONFIN has been adopted as the classifier in the proposed system. A simple k-NN decision rule combined with leave-one-out cross-validation was also applied for verification. A post-processing procedure named “smoothing” is also integrated in the system. Both experiments showed that the classification flow and the proposed feature performed well. The accuracy rate was higher than 90%.

The system can be employed in many applications, such as a front end for existing audio applications, de-advertising, automatic equalization, audio indexing and retrieval, and even audio-based video indexing.

Several directions remain for future work. First, the system should be able to classify audio signals into more categories. More specifically, the design of music genre classifiers or instrument recognition systems is a very interesting topic. We will also try to improve the robustness of the system so that it works well in all kinds of situations.

Finally, combining the proposed audio classification system with our previous audio signal processing systems, such as speech recognition and speaker identification, to form a human-hearing-based audio signal processing system will be an exciting and practical topic for future research.

