
CHAPTER 4 CONTENT-BASED AUDIO RETRIEVAL BASED ON GABOR WAVELET FILTERING

4.2.2 Audio Retrieval and Similarity Measurement

4.2.2.2 Retrieval

For a query audio sequence, $y_q$, with a length of $p$ seconds, it is first divided into $p$ successive one-second clips. That is,

$y_q = [\,y_q^1, y_q^2, \ldots, y_q^p\,]$.    (4.13)

Next, for each clip $y_q^i$ ($i = 1, 2, \ldots, p$) and a candidate audio sequence $y_c$, the similarity measure is first performed and the corresponding grades, $Gd_{q_i,j}$ ($i = 1, 2, \ldots, p$ and $j = 1, 2, \ldots, l$), are evaluated based on Eq. (4.12). According to these grades, the total grade of the candidate clip $y_c^j$ ($j = 1, 2, \ldots, l$), $Gd\_T_{q,j}$, is defined to be the grade accumulated over the $p$ query clips.

According to these total grades, the candidate clips with higher similarity to the query one can be retrieved. For example, the matching clip with the highest similarity to the query one can be retrieved according to the following criterion:

$j^{*} = \arg\max_{j} Gd\_T_{q,j}$,

and extracting the matched positions of the candidate sequence will result in the following retrieved audio sequence: $[\,y_c^{j^{*}}, y_c^{j^{*}+1}, \ldots, y_c^{j^{*}+p-1}\,]$.
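To make the matching procedure above concrete, here is a minimal sketch in Python. It assumes the per-clip grades $Gd_{q_i,j}$ of Eq. (4.12) have already been computed into a p-by-l array; the helper names (total_grades, best_match) and the simple summation over aligned clips are illustrative assumptions rather than the chapter's exact formulation.

```python
import numpy as np

def total_grades(clip_grade, p, l):
    """Accumulate per-clip grades into total grades Gd_T_{q,j}.

    clip_grade[i][j] is the grade of candidate clip j against query clip i
    (Eq. (4.12)).  Here the total grade at candidate position j is taken to
    be the sum of the grades of the p query clips aligned at j; this is an
    assumed realization of the accumulation described in the text.
    """
    gd_t = np.full(l, -np.inf)          # positions that cannot host p clips stay -inf
    for j in range(l - p + 1):          # candidate starting positions
        gd_t[j] = sum(clip_grade[i][j + i] for i in range(p))
    return gd_t

def best_match(clip_grade, p, l):
    """Return the candidate position with the highest total grade (the argmax criterion)."""
    gd_t = total_grades(clip_grade, p, l)
    return int(np.argmax(gd_t)), float(np.max(gd_t))
```

In this form the retrieval criterion above reduces to an argmax over the accumulated grades of all candidate positions.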

In order to show the efficiency of the proposed method, we have collected a set of 150 musical pieces (50 instrumental pieces and 100 songs), with a total length of about three hours and 10,000 phrases, as the testing database. Care was taken to obtain a wide variation in each type, such as varied instruments, different languages (English, Chinese, Japanese, etc.), different singers (male, female, or children), and different styles (jazz, rock, folk, etc.). These audio clips are stored in the WAV file format at 16 bits per sample with a 44.1 kHz sampling rate and are used to test the audio retrieval performance. Note that, to allow comparison, the testing database includes the dataset described in [17, 18], and some of the clips are taken from the MPEG-7 content set [29].

4.3.1 Experiment Results

There are two major factors affecting the performance of the proposed approach, namely, the number of basis functions used and the length of the query example. To examine the performance of the proposed method, we present two experiments. In the first experiment, for each music object in the database, we use its refrain as the query example to retrieve all repeating phrases similar to this refrain. Therefore, 150 queries are performed. This experiment is presented to examine the quality of the proposed retrieval approach with respect to the two above-mentioned major factors.

As for the second experiment, each song in the database has two versions, sung in different languages or by different persons. We use the refrain of one version (e.g., the Chinese version) as the query example to retrieve all repeating phrases similar to this refrain in the other version (e.g., the English version).

In this chapter, the performance is evaluated by the precision rates (Pr) and the recall rates (Re) [30]. Note that the recall rate, Re, and the precision rate, Pr, are defined as

$Re = \frac{N}{T}$  and  $Pr = \frac{N}{K}$,    (4.18)

where N is the number of relevant items retrieved (i.e. correctly retrieved items), T is the total number of relevant items (i.e. correctly retrieved items and the relevant items that have not been retrieved) and K is the total number of the retrieved items.

The recall rate is typically used in conjunction with the precision rate, which measures the fraction of the retrieved patterns that are relevant. Precision and recall can often be traded off; that is, one can achieve a high precision rate at the cost of a low recall rate, or the other way round.
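As a worked illustration of Eq. (4.18), the following small Python function (precision_recall, a hypothetical helper name) computes both rates from a collection of retrieved items and a collection of relevant ground-truth items, mirroring the roles of N, T, and K defined above.

```python
def precision_recall(retrieved, relevant):
    """Compute precision (Pr) and recall (Re) as in Eq. (4.18).

    retrieved : identifiers returned by the system (K items)
    relevant  : ground-truth relevant identifiers (T items)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    n = len(retrieved & relevant)   # N: relevant items actually retrieved
    t = len(relevant)               # T: total number of relevant items
    k = len(retrieved)              # K: total number of retrieved items
    recall = n / t if t else 0.0
    precision = n / k if k else 0.0
    return precision, recall

# Example: when the number of retrieved items K is adjusted to equal T
# (as in the experiments below), precision and recall coincide.
print(precision_recall(["a", "b", "c"], ["a", "c", "d"]))  # (0.666..., 0.666...)
```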

Tables 4.1 and 4.2 show the results of two experiments presented in this chapter.

TABLE 4.1
THE AVERAGE RECALL RATES OF THE FIRST EXPERIMENT

Number of               Query Sample Length
Basis Functions         One second    Two seconds    Three seconds
5                       29%           71%            74%
10                      31%           75%            75%
15                      40%           98%            98%

TABLE 4.2
THE AVERAGE RECALL RATES OF THE SECOND EXPERIMENT

Number of               Query Sample Length
Basis Functions         One second    Two seconds    Three seconds
5                       31%           71%            72%
10                      31%           71%            74%
15                      38%           94%            94%

In our experiments, the number of retrieved patterns was adjusted to the number of relevant patterns, so the precision rate and the recall rate are the same. From Table 4.1, we can see that the above-mentioned two factors affect the performance of the proposed approach: the more basis functions are used, the higher the recall rate; likewise, the longer the query sample, the higher the recall rate. Based on the first experiment, the best setting is to perform retrieval using 15 basis functions and a two-second query sample. From Table 4.2, we can see the same trends as in Table 4.1, except that the recall rates are lower.
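For readers who wish to experiment with the effect of the number of basis functions, the following sketch builds a small bank of one-dimensional Gabor filters and computes a per-clip energy vector. The helper names (gabor_bank, clip_features), the log-spaced center frequencies, the window width, and the log-energy feature are illustrative assumptions and not the exact feature definition used in this chapter.

```python
import numpy as np

def gabor_bank(num_filters, fs=44100, f_lo=100.0, f_hi=8000.0, width=0.01):
    """Build num_filters 1-D Gabor filters with log-spaced center frequencies.

    Each filter is a Gaussian-windowed cosine,
    g(t) = exp(-t^2 / (2*sigma^2)) * cos(2*pi*f*t);
    the frequency range and window width are illustrative choices.
    """
    t = np.arange(-width, width, 1.0 / fs)
    freqs = np.geomspace(f_lo, f_hi, num_filters)
    sigma = width / 3.0
    window = np.exp(-t**2 / (2.0 * sigma**2))
    return [window * np.cos(2.0 * np.pi * f * t) for f in freqs]

def clip_features(clip, filters):
    """Log-energy of the clip filtered by each Gabor basis function."""
    return np.array([np.log(np.sum(np.convolve(clip, h, mode="same")**2) + 1e-12)
                     for h in filters])

# Example: a one-second clip at 44.1 kHz described by 15 Gabor-filter energies.
clip = np.random.randn(44100)
features = clip_features(clip, gabor_bank(15))
print(features.shape)  # (15,)
```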

Besides, by examining the misses in the experiments, with human judgement as the ground truth, we found two major causes. First, in the first experiment, some errors occur in retrieved clips containing a transition; this happens because we simply segment an audio object uniformly into one-second clips rather than pre-dividing it into sequences of audio phrases. In fact, this kind of error can be reduced by increasing the length of the query sequence (i.e., the number of clips) to obtain more related information, or by pre-dividing the audio into phrases. Second, some errors occur because the refrains of some songs are performed at different tempos. From these tables, we can see that the proposed retrieval approach for music data can achieve an accuracy rate of over 96%. The experiments were carried out on a Pentium II 400 MHz PC running Windows 2000; the 150 queries can be processed in less than five seconds over 10,000 phrases. For comparison, we also cite the performance of the existing systems described in [17, 18], which use a database similar to ours; their authors reported accuracy rates of more than 90%.

4.4 SUMMARY

Digital audio signals, especially music, are an important type of media. However, few works have focused on music databases. In this chapter, we have presented a new method for content-based music retrieval that retrieves perceptually similar music pieces in audio documents. In the proposed method, the perceptual features extracted with the Gabor wavelet filters are general enough to match the characteristics of the human auditory system. A retrieval accuracy rate higher than 96% was achieved. Furthermore, the complexity is low because the audio features are easy to compute, which makes online processing possible.

There are several related tasks to be conducted in the future. First, we will work on other types of audio sources, such as sound effects, and on the compressed domain. Second, we will work on developing an automatic segmentation technique to divide musical objects into sequences of phrases.

CHAPTER 5

CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

5.1 Conclusions

The rapid increase in the amount of audio data demands efficient methods to automatically analyze audio signals based on their content. In this dissertation, we have presented three methods to address the problems of audio segmentation, classification, and content-based retrieval.

Besides the general audio types, such as music and speech, tested in existing work, in this dissertation we have taken hybrid-type sounds (speech with music background, speech with environmental noise background, and song) into account. First, we have proposed a hierarchical audio classification method to classify audio data into five general categories: pure speech, music, song, speech with music background, and speech with environmental noise background. These categories are the basic sets needed in the content analysis of audiovisual data. A classification accuracy rate higher than 96% was achieved. The experimental results indicate that the extracted audio features are quite robust.

We also propose a classification-based audio segmentation method based on Gabor wavelets. The proposed method provides two classifiers: one for speech and music (called two-way), and the other for five classes (called five-way), namely pure speech, music, song, speech with music background, and speech with environmental noise background. To make the proposed method robust for a variety of audio sources, we use the Fisher Linear Discriminator to obtain the features with the highest discriminative ability. Based on the classification results, a merging algorithm is provided to divide an audio stream into segments of different classes to achieve segmentation. Experimental results show that the proposed method can achieve an accuracy rate of over 98% for speech and music discrimination, and more than 95% for five-way discrimination. By checking the class types of adjacent clips, we can also identify more than 95% of the audio scene breaks in an audio sequence.
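To illustrate the merging step mentioned above, here is a minimal sketch assuming each one-second clip has already received a class label from the classifier; the function merge_clip_labels and the simple change-point rule are illustrative assumptions, not the exact algorithm used in the proposed method.

```python
def merge_clip_labels(labels):
    """Merge per-clip class labels into segments and expose scene breaks.

    labels : list of class labels, one per one-second clip, e.g.
             ["speech", "speech", "music", "music", "song"]
    Returns a list of (class, start_clip, end_clip) segments; a scene break
    is assumed wherever the class of adjacent clips changes.
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current segment at the end of the list or at a class change.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i - 1))
            start = i
    return segments

# Example: three segments, with scene breaks after clips 1 and 3.
print(merge_clip_labels(["speech", "speech", "music", "music", "song"]))
# [('speech', 0, 1), ('music', 2, 3), ('song', 4, 4)]
```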

Two important and distinguishing features of the above two proposed schemes, compared with previous work, are their complexity and running time. Although the proposed schemes cover a wide range of audio types, their complexity is low because the audio features are easy to compute, which makes online processing possible. Thus, the proposed methods can be widely applied to many audiovisual analysis applications, such as content-based video retrieval.

Finally, we have presented a new method for content-based music retrieval that retrieves perceptually similar music pieces in audio documents. It is based on the query-by-example (QBE) paradigm and allows the user to select a reference passage within an audio file and retrieve perceptually similar passages, such as repeating phrases within a music piece, similar music clips in a database, or the same song sung by different persons or in different languages. First, an audio stream is divided into clips and the frame-based features of each clip are extracted with the Gabor wavelet filters. Then, a similarity measurement technique is provided to perform pattern matching on the resulting sequences of feature vectors. The experimental results demonstrate the capability of the proposed audio features for characterizing the perceptual content of an audio sequence.

5.2 Future Research Directions

Content-based audio analysis is still a new area that is not yet well explored, and there are several possible future research directions. For example, in audio classification and segmentation, we will work on other types of audio sources, such as sound effects, and on the compressed domain. In content-based audio retrieval, we will emphasize query by humming (QBH).

REFERENCES

[1] S. Pfeiffer, S. Fischer, and W. Effelsberg, “Automatic audio content analysis,” in Proc. ACM Multimedia’96, Boston, MA, April 1996, pp. 21-30.

[2] J. Foote, “An overview of audio information retrieval,” ACM Multimedia Systems, vol. 7, no. 1, pp. 2-11, January 1999.

[3] E. Scheirer and M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator,” in Proc. Int. Conf. Acoustics, Speech, Signal Processing’97, Munich, Germany, April 1997, pp. 1331-1334.

[4] S. Rossignol, X. Rodet, J. Soumagne, et al., “Feature extraction and temporal segmentation of acoustic signals,” in Proc. ICMC 98, Ann Arbor, Michigan, 1998, pp. 199-202.

[5] J. Saunders, “Real-time discrimination of broadcast speech/music,” in Proc. Int. Conf. Acoustics, Speech, Signal Processing’96, vol. 2, Atlanta, GA, May 1996, pp. 993-996.

[6] I. Fujinaga, “Machine recognition of timbre using steady-state tone of acoustic instruments,” in Proc. ICMC 98, Ann Arbor, Michigan, 1998, pp. 207-210.

[7] L. Wyse and S. Smoliar, “Toward content-based audio indexing and retrieval and a new speaker discrimination technique,” in Proc. IJCAI’95, Singapore, December 1995.

[8] D. Kimber and L. D. Wilcox, “Acoustic segmentation for audio browsers,” in Proc. Interface Conf., Sydney, Australia, July 1996.

[9] E. Wold, T. Blum, D. Keislar, and J. Wheaton, “Content-based classification, search, and retrieval of audio,” IEEE Multimedia Mag., vol. 3, no. 3, pp. 27-36, Fall 1996.

[10] L. Guojun and T. Hankinson, “A technique towards automatic audio classification and retrieval,” in Proc. Int. Conf. Signal Processing’98, vol. 2, 1998, pp. 1142-1145.

[11] J. S. Boreczky and L. D. Wilcox, “A hidden Markov model framework for video segmentation using audio and image features,” in Proc. Int. Conf. Acoustics, Speech, Signal Processing’98, Seattle, WA, May 1998, pp. 3741-3744.

[12] D. Li, I. K. Sethi, N. Dimitrova, and T. McGee, “Classification of general audio data for content-based retrieval,” Pattern Recognition Letters, vol. 22, no. 5, pp. 533-544, April 2001.

[13] T. Zhang and C.-C. J. Kuo, “Hierarchical classification of audio data for archiving and retrieving,” in Proc. Int. Conf. Acoustics, Speech, Signal Processing’99, vol. 6, 1999, pp. 3001-3004.

[14] T. Zhang and C.-C. J. Kuo, “Audio content analysis for online audiovisual data segmentation and classification,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 4, pp. 441-457, May 2001.

[15] G. Smith, H. Murase, and H. Kashino, “Quick audio retrieval using active search,” in Proc. Int. Conf. Acoustics, Speech, Signal Processing’98, Seattle, WA, May 1998, pp. 3777-3780.

[16] J. Foote, “Content-based retrieval of music and audio,” in Proc. SPIE, Multimedia Storage and Archiving Systems II, vol. 3229, 1997, pp. 138-147.

[17] T. Zhang and C.-C. J. Kuo, “Content-based classification and retrieval of audio,” in Proc. SPIE, Conf. Advanced Signal Processing Algorithms, Architectures, and Implementations VIII, vol. 3461, San Diego, CA, July 1998.

[18] A. Ghias, J. Logan, D. Chamberlin, and B. Smith, “Query by humming: musical information retrieval in an audio database,” in Proc. Int. Conf. ACM Multimedia, San Francisco, CA, 1995, pp. 231-236.

[19] G. Tzanetakis and P. Cook, “Audio information retrieval (AIR) tools,” in Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2000.

[20] K. Martin, E. Scheirer, and B. Vercoe, “Musical content analysis through models of audition,” in Proc. ACM Multimedia Workshop on Content-Based Processing of Music, Bristol, UK, 1998.

[21] C. Spevak and E. Favreau, “Soundspotter – a prototype system for content-based audio retrieval,” in Proc. Int. Conf. Digital Audio Effects, September 2002, pp. 27-32.

[22] G. Tzanetakis, Manipulation, Analysis and Retrieval System for Audio Signals. Ph.D. thesis, Princeton University, 2002.

[23] C. Yang, “MACS: music database retrieval based on spectral similarity,” In IEEE Workshop on Applications of Signal Processing, 2001.

[24] C. Yang, Music database retrieval based on spectral similarity. Stanford University Database Group Technical Report 2001-14, 2001.

[25] S.-T. Bow, Pattern Recognition and Image Preprocessing. Marcel Dekker, 1992.

[26] MPEG Requirements Group, “Description of MPEG-7 content set,” Doc. ISO/MPEG N2467, MPEG Atlantic City Meeting, October 1998.

[27] D. Gabor, “Theory of communication,” Journal of the Institution of Electrical Engineers, vol. 93, pp. 429-457, 1946.

[28] B. S. Manjunath and W. Y. Ma, “Texture features for browsing and retrieval of image data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 837-842, August 1996.

[29] S. Qian and D. Chen, Joint Time-Frequency Analysis: Methods and Applications. Upper Saddle River, NJ: Prentice-Hall, 1996.

[30] E. Zwicker and H. Fastl, Psychoacoustics, Facts and Models. Springer, 1990.

Associates, Inc., 1998.

[32] J. O. Smith and X. Serra, “An analysis/resynthesis program for non-harmonic sounds based on a sinusoidal representation,” in Proc. ICMC 87, Ann Arbor, Michigan, 1987, pp. 290ff.

[33] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, July 1997.

[34] B. S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface. John Wiley, 2002.

[35] M. Casey, “MPEG-7 sound recognition tools,” IEEE Transactions on Circuits and Systems for Video Technology, special issue on MPEG-7, vol. 11, no. 6, pp. 737-747, 2001.

[36] ISO/IEC JTC1/SC29/WG11 Moving Pictures Expert Group, “Information technology – multimedia content description interface – Part 4: Audio, Committee Draft 15938-4,” ISO/IEC, 2000.

[37] ISO/IEC JTC1/SC29/WG11 Moving Pictures Expert Group, “Introduction to MPEG-7,” available from http://www.cselt.it/mpeg.

[38] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.

[39] A. J. Willis and L. Myers, “A cost-effective fingerprint recognition system for use with low-quality prints and damaged fingertips,” Pattern Recognition, vol. 34, no. 2, pp. 255-270, February 2001.

PUBLICATION LIST

We summarize the publication status of the proposed methods and our research status in the following.

(1) Ruei-Shiang Lin and Ling-Hwei Chen, “A New Approach for Classification of Generic Audio Data,” accepted by International Journal of Pattern Recognition and Artificial Intelligence.

(2) Ruei-Shiang Lin and Ling-Hwei Chen, “A New Approach for Audio Classification and Segmentation Using Gabor Wavelet Filtering and Fisher Linear Discriminator,” accepted by International Journal of Pattern Recognition and Artificial Intelligence.

(3) Ruei-Shiang Lin and Ling-Hwei Chen, “Content-based Retrieval of Audio Based on Gabor Wavelet Filtering,” accepted by International Journal of Pattern Recognition and Artificial Intelligence.
