

Chapter 3 Vision-based Conductor Gesture Tracking

3.3   Beat Detection and Analysis

3.3.2   Local-Minimum Algorithm

Figure 3.8 also confirms that a beat always corresponds directly to a downward-to-upward change in direction. Since only the y-axis data needs to be searched for such a change of direction, the x-axis data we collected can be ignored completely.

We assume that the y-axis scale increases from top to bottom, as in image coordinates. Thus, if the y-axis waveform is rising (the tracked target is moving upward), subtracting the previous y position from the current y position produces a negative value. If the waveform is unchanging, the result is zero; if the waveform is falling, the result is positive. Mathematically, this can be expressed as:


\[
\operatorname{sign}(t) =
\begin{cases}
+1, & \text{if } y(t) - y(t-1) \ge 0 \\
-1, & \text{if } y(t) - y(t-1) < 0
\end{cases}
\tag{3.14}
\]

According to the reference scale used, a maximum is characterized by a change of sign(t) from negative to positive; by the same idea, a minimum is characterized by a change of sign(t) from positive to negative. Since a beat directly corresponds to a minimum, the system only has to search for a change of sign(t) from positive to negative. A minimum cannot be detected until the first rising point after it has been acquired, so a slight delay will always be present.
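
To make the detection procedure concrete, the following C++ sketch applies Equation (3.14) to successive y coordinates and reports a beat when the sign changes from positive to negative. It is only an illustrative sketch under the screen-coordinate assumption above; the class and function names (LocalMinimumDetector, onNewPosition) are ours and not taken from the implemented system.

```cpp
// A minimal sketch of the local-minimum beat detector described above.
// Screen coordinates are assumed (y grows from top to bottom), so a beat
// (lowest hand position) appears as a sign change from positive (moving
// down) to negative (moving up).
#include <iostream>
#include <vector>

class LocalMinimumDetector {
public:
    // Feed the tracked y coordinate of the current frame.
    // Returns true when a beat (local minimum of hand height) is confirmed.
    bool onNewPosition(double y) {
        bool beat = false;
        if (hasPrev_) {
            int sign = (y - prevY_ >= 0.0) ? +1 : -1;   // Equation (3.14)
            // Beat: sign switches from positive to negative.  Note the
            // inherent one-frame delay: the minimum is only confirmed once
            // the first rising point after it has been seen.
            if (prevSign_ == +1 && sign == -1)
                beat = true;
            prevSign_ = sign;
        }
        prevY_ = y;
        hasPrev_ = true;
        return beat;
    }

private:
    double prevY_ = 0.0;
    int prevSign_ = 0;
    bool hasPrev_ = false;
};

int main() {
    // Synthetic trajectory: the hand moves down (y grows) and then up again.
    std::vector<double> ys = {100, 120, 140, 160, 150, 130, 110};
    LocalMinimumDetector detector;
    for (size_t t = 0; t < ys.size(); ++t)
        if (detector.onNewPosition(ys[t]))
            std::cout << "beat detected at frame " << t << "\n";
}
```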


Chapter 4

Experimental Results and Discussions

4.1 Overview

We designed a conductor gesture tracking system to help users train their conducting gestures. The system, which realizes the algorithms described in the previous chapter, tracks the target, detects the beat events, and calculates the corresponding correctness rates.

Figure 4.1 A snapshot of our conductor gesture tracking system


Figure 4.1 shows a snapshot of our conductor gesture tracking system. In our experiments, the video inputs are captured from a Logitech QuickCam webcam at a resolution of 320×240 pixels, and the music files are sampled from compact discs: 10 constant-speed songs with different beats per minute (BPM). The BPMs of the music range from 60 to 124, and the lengths of the songs range from 54 s to 85 s.

All the software of our real-time conductor gesture tracking system runs on a personal computer with 1 GB of RAM, using CodeGear C++ Builder 2007 under Windows XP Home SP2. The throughput obtained is 5-8 frames per second. In Sections 4.2 and 4.3, we explain our experimental procedure and then demonstrate the results and their analysis.

Combining Figure 3.1 and Figure 4.1, the relation between the processing procedure and the user-interface demonstration is illustrated in Figure 4.2. Our system tracks a target from an image sequence and provides a robust tracking result. However, computer vision-based tracking is extremely difficult in an unconstrained environment, and many situations may affect the tracking accuracy, such as interference from other objects of the same color in the background, varying illumination conditions, or a tracking target that is too small. In order to simplify tracking, we assume that no other object of the same color interferes with the system.

After the tracking target is chosen, our system starts producing the probability map shown in the upper-right panel of the system UI. On top of the probability map image, we can also see the centroid and the rectangular region of the target tracked by the CAMSHIFT algorithm. At the same time, the trajectory of the target and the DCPs are shown in the lower-right panel of the system UI.

Figure 4.2 Experimental Flowchart

To demonstrate the beat events in real time, we draw the time information of each detected beat event on the waveform of the music file. We define the correct interval of beat event i as [Correct_Time_i − RT, Correct_Time_i + RT], where T is the time at which we detected a beat event, Correct_Time_i is the correct time of beat event i taken from the ground truth (computed from the BPM or from a manually annotated file), and RT is based on the reaction time discussed later. If the time T of a detected beat event falls inside this correct interval, we call it a correct beat detection.
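
The following C++ sketch illustrates this correctness test; the function name isCorrectDetection and the calling convention are illustrative assumptions rather than the actual implementation.

```cpp
// A minimal sketch of the correctness test described above: a detected
// beat time T counts as correct when it falls inside
// [Correct_Time_i - RT, Correct_Time_i + RT] for some ground-truth beat i.
#include <cmath>
#include <iostream>
#include <vector>

bool isCorrectDetection(double detectedTime,                  // T, in seconds
                        const std::vector<double>& groundTruth,
                        double reactionTime) {                // RT, in seconds
    for (double correctTime : groundTruth)
        if (std::fabs(detectedTime - correctTime) <= reactionTime)
            return true;                                      // inside the interval of beat i
    return false;
}

int main() {
    std::vector<double> truth = {0.00, 0.75, 1.50, 2.25};     // e.g. BPM = 80
    std::cout << isCorrectDetection(1.60, truth, 0.256) << " "   // 1: within 256 ms of 1.50
              << isCorrectDetection(1.95, truth, 0.256) << "\n"; // 0: no beat close enough
}
```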


4.2 Experimental Results

We use the Precision, Recall, and F-measure rates to evaluate the correctness of the experimental output. In an Information Retrieval scenario, Precision is the fraction of the retrieved documents that are relevant to the user's information need, and Recall is the fraction of the documents relevant to the query that are successfully retrieved [8]. Following the same idea, we can define Precision and Recall more specifically:

\[
\text{Precision Rate} = \frac{\#\ \text{of correct beats we detected}}{\#\ \text{of beats we detected}}
\tag{4.1}
\]

\[
\text{Recall Rate} = \frac{\#\ \text{of correct beats we detected}}{\#\ \text{of correct beats from the ground truth}}
\tag{4.2}
\]

That is, the Recall rate denotes the percentage of correct detections made by the detection algorithm with respect to all beat events, while the Precision rate is the percentage of correct detections with respect to all declared beat events.

Precision and Recall usually trade off against each other: as precision goes up, recall usually goes down, and vice versa. In order to take both parameters into account, we use the F-measure to combine precision and recall.

\[
F = \frac{(1+\alpha) \cdot \text{precision} \cdot \text{recall}}{\alpha \cdot \text{precision} + \text{recall}}
\tag{4.3}
\]

The F1-measure combines recall and precision with equal weight (i.e., α = 1) as follows:

\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\tag{4.4}
\]
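
As a minimal illustration of Equations (4.1)-(4.4), the following C++ sketch computes the three rates from the beat counts; the counts used in main() are hypothetical and only show the calling convention.

```cpp
// A minimal sketch of Equations (4.1)-(4.4).  The counts would come from
// matching detected beats against the ground truth as described above.
#include <iostream>

struct Scores { double precision, recall, f1; };

Scores evaluate(int correctDetected, int totalDetected, int totalGroundTruth) {
    Scores s{};
    s.precision = totalDetected    > 0 ? 100.0 * correctDetected / totalDetected    : 0.0;  // (4.1)
    s.recall    = totalGroundTruth > 0 ? 100.0 * correctDetected / totalGroundTruth : 0.0;  // (4.2)
    // F1 weights precision and recall equally (alpha = 1 in Eq. (4.3)).
    s.f1 = (s.precision + s.recall) > 0.0
             ? 2.0 * s.precision * s.recall / (s.precision + s.recall)                      // (4.4)
             : 0.0;
    return s;
}

int main() {
    Scores s = evaluate(83, 90, 96);   // hypothetical counts for one sequence
    std::cout << s.precision << " " << s.recall << " " << s.f1 << "\n";
}
```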

We evaluate our system on the same video and audio inputs with the two methods, the K-Curvature algorithm and the Local-Minimum algorithm, respectively. The K-Curvature algorithm has a controlling factor θ, the angle between two consecutive motion vectors. Therefore, we run the experiment twice with two different factors, θ = 60° and θ = 90°, for comparison purposes.
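
The sketch below illustrates, under our own naming and simplifying assumptions, the role this factor plays: the angle between two consecutive motion vectors is computed and compared against θ. It does not reproduce the exact K-Curvature corner test of Chapter 3.

```cpp
// A hedged sketch of how the threshold theta enters the K-Curvature test.
// Only the assumed core check is shown: the angle between two consecutive
// motion vectors is compared against theta.
#include <algorithm>
#include <cmath>
#include <iostream>

const double kPi = 3.14159265358979323846;

struct Vec2 { double x, y; };

// Angle (in degrees) between two consecutive motion vectors.
double angleBetweenDeg(const Vec2& a, const Vec2& b) {
    double na = std::hypot(a.x, a.y), nb = std::hypot(b.x, b.y);
    if (na == 0.0 || nb == 0.0) return 0.0;               // degenerate vector
    double c = (a.x * b.x + a.y * b.y) / (na * nb);
    c = std::max(-1.0, std::min(1.0, c));                 // clamp for acos
    return std::acos(c) * 180.0 / kPi;
}

// A direction-change point is flagged when the angle reaches theta.
bool isDirectionChange(const Vec2& prev, const Vec2& curr, double thetaDeg) {
    return angleBetweenDeg(prev, curr) >= thetaDeg;
}

int main() {
    Vec2 down{0.0, 5.0}, up{0.0, -5.0};                    // down-then-up motion
    std::cout << isDirectionChange(down, up, 60.0) << " "  // 1 (180 deg >= 60 deg)
              << isDirectionChange(down, up, 90.0) << "\n"; // 1 (180 deg >= 90 deg)
}
```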

The experiment contains two parts: Music-based and Vision-based evaluations.

The ground truth for the Music-based evaluation is determined by the constant BPM of the music. For example, if BPM = 80, the time interval between two consecutive beats should be 0.75 s, so the beat time series is {0.00, 0.75, 1.50, 2.25, …} until the music ends. The ground truth for the Vision-based evaluation is determined by a manually annotated file, whose beat intervals may not be as stable as those of the Music-based one.
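
The following small C++ sketch shows how such a constant-BPM ground truth can be generated; it is a minimal illustration, not the tool actually used to build the ground-truth files.

```cpp
// A small sketch of how the Music-based ground truth described above can
// be generated: with a constant BPM the beat interval is 60/BPM seconds,
// so BPM = 80 yields {0.00, 0.75, 1.50, 2.25, ...} until the music ends.
#include <iostream>
#include <vector>

std::vector<double> beatTimesFromBpm(double bpm, double musicLengthSec) {
    std::vector<double> beats;
    const double interval = 60.0 / bpm;              // seconds between beats
    for (int i = 0; i * interval <= musicLengthSec; ++i)
        beats.push_back(i * interval);
    return beats;
}

int main() {
    for (double t : beatTimesFromBpm(80.0, 3.0))
        std::cout << t << " ";                       // 0 0.75 1.5 2.25 3
    std::cout << "\n";
}
```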

In the Vision-based evaluation, we set the ground truth manually at the moments we saw the beat events (the time points of direction change); in the Music-based evaluation, the user conducted when he or she heard the beat events. Because people always need some time to react to a visual or auditory stimulus, it would be overconstrained to require that the time of a detected beat event exactly match the ground truth. We therefore adopt the idea of reaction time (RT), which is usually defined as the time an observer needs to press a button as soon as a light or a sound appears. In the Master's thesis of Lian [42], the mean RT for detecting visual stimuli is 351±44 milliseconds, whereas for sound it is 256±41 milliseconds.


Table 4.1 The overall experimental results with different methods

                           Music-Based Evaluation (RT = 256 ms)      Vision-Based Evaluation (RT = 351 ms)
  Method                   Precision   Recall   F1-measure           Precision   Recall   F1-measure
  K-Curvature (θ = 60°)      90.59      84.43     87.34                89.18      84.03     86.46
  K-Curvature (θ = 90°)      85.43      92.97     89.02                84.69      92.67     88.48
  Local Minimum              95.91      86.98     91.21                93.70      85.30     89.29

Table 4.1 shows the overall experimental results using the different methods, and Figure 4.3 and Figure 4.4 show the F1-measures of the two evaluations. In the Music-based evaluation, the lowest F1-measure is 78.29% and the highest is 96.31%; in the Vision-based evaluation, the lowest and highest F1-measures are 77.26% and 94.81%.

We further analyze the Vision-based evaluation more specifically. Comparing the F1-measures at θ = 60° and θ = 90°, we find that the F1-measure at θ = 90° is higher than that at θ = 60°: when θ changes from 60° to 90°, the increase in the recall rate is larger than the decrease in the precision rate. Also, there is no significant relation between the BPM (the music speed) and the F1-measure, so we can handle music whose BPM is between 60 and 120 without any further support.

Moreover, the reaction time (RT) can be adjusted dynamically based on the reacting ability of the user. Setting a wider RT helps to raise the recall and precision rates, but it may decrease the accuracy of the system.


The following Table 4.2, Table 4.3 and Table 4.4 show the precision rates, recall rates and F1-measures for every test sequence, respectively.

Table 4.2 The precision, recall rates and F1-measures using K-Curvature ( °)

                   Music-Based Evaluation                Vision-Based Evaluation
  Sequence   BPM   Precision   Recall   F1-measure       Precision   Recall   F1-measure
     1        60     89.91     97.01      93.33            86.37     93.94      90.00

Table 4.3 The precision, recall rates and F1-measures using K-Curvature ( °)

                   Music-Based Evaluation                Vision-Based Evaluation
  Sequence   BPM   Precision   Recall   F1-measure       Precision   Recall   F1-measure
     1        60     94.73     87.06      90.74            89.54     83.33      86.33


Table 4.4 The precision, recall rates and F1-measures using Local-Minimum

                   Music-Based Evaluation                Vision-Based Evaluation
  Sequence   BPM   Precision   Recall   F1-measure       Precision   Recall   F1-measure
     1        60     99.40     86.57      92.54            92.17     81.31      86.40

Figure 4.3 and Figure 4.4 show the Vision-based evaluation results in chart form. We discuss the reasons for the failed detections in the Vision-based evaluation in the next section.

Figure 4.3 F-measure results of Vision-Based Evaluation using K-curvature algorithm



Figure 4.4 F-measure results of Vision-Based Evaluation using local-minimum algorithm

4.3 Analysis of the Experimental Results

In this section, we classify our beat detection errors into two categories based on the Vision-based evaluation: false positives and false negatives.

(1) False Positive Error: A false positive is an error in which a test claims something to be positive when that is not the case. In our experiment, the beat detection produces a false positive when it reports a beat event although no beat event occurs at that time. We separate these errors into two situations: a beat event is detected when there is no beat event, and a beat event is detected when the correct beat event has already been detected.

(2) False Negative Error: A false negative is the error of failing to observe something when in truth it is there. In our case, the beat detection produces a false negative when it does not report a beat event although a beat event occurs at that moment.

Table 4.5 The False Positive and False Negative Error Rates

                           False Positive                                    False Negative
  Method                   Non-beat but detected   Duplicate detected        Beats but not detected
  K-Curvature (θ = 60°)           4.72%                   6.10%                     15.97%
  K-Curvature (θ = 90°)           5.37%                   9.93%                      7.34%
  Local Minimum                   4.79%                   1.51%                     14.70%

When false positive errors occur, they fall into the two situations mentioned above: a beat event detected when there is no beat event, and a beat event detected when the correct beat event has already been detected. Non-beat events were detected falsely because the user's trajectory includes some changes of direction that are not beats.

The duplicate detections usually occurred when tracking was lost, either because the target left the scene or because the target vibrated. This kind of situation might be solved by designing a dynamic beat filter that eliminates successive wrong beats whose time interval from the previous beat is too short.
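
A minimal sketch of such a dynamic beat filter is given below. It is only an assumption of how the filter could be realized, since it is proposed rather than implemented; the 300 ms threshold in main() is purely illustrative and could instead be tied to the current tempo (e.g. a fraction of 60/BPM).

```cpp
// A hedged sketch of the proposed dynamic beat filter: a newly detected
// beat is discarded when it follows the previously accepted beat too
// closely (a likely duplicate caused by tracking loss or vibration).
#include <iostream>

class BeatFilter {
public:
    explicit BeatFilter(double minIntervalSec) : minInterval_(minIntervalSec) {}

    // Returns true if the beat detected at time t (seconds) should be kept.
    bool accept(double t) {
        if (hasLast_ && (t - lastBeat_) < minInterval_)
            return false;                 // too close to the previous beat: drop it
        lastBeat_ = t;
        hasLast_  = true;
        return true;
    }

private:
    double minInterval_;
    double lastBeat_ = 0.0;
    bool   hasLast_  = false;
};

int main() {
    BeatFilter filter(0.3);               // reject beats closer than 300 ms
    const double times[] = {0.00, 0.75, 0.80, 1.50, 1.55, 2.25};
    for (double t : times)
        std::cout << t << (filter.accept(t) ? " kept\n" : " dropped\n");
}
```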

The false negative errors occurred mostly with θ = 60° (in the K-Curvature algorithm) or with the Local-Minimum algorithm. They are mainly due to frames being dropped because of the limited processing performance of the program. If we can increase the maximum frame rate we process, this kind of mistake might be solved.


Chapter 5

Conclusion and Future Work

5.1 Conclusion

In this thesis, we have presented an efficient real-time target tracking system for conductor gesture tracking based on the CAMSHIFT algorithm. In addition, in order to extract beat events from the trajectory of the target, we used the K-Curvature algorithm and the Local-Minimum algorithm to interpret different kinds of conductor gestures.

The major part of our framework is based on the CAMSHIFT algorithm, a very simple, computationally efficient colored-object tracker. It is usable as a visual interface and can be incorporated into our system to provide conductor gesture tracking. The CAMSHIFT algorithm handles the following computer-vision problems:

• Irregular object motion: CAMSHIFT scales its search window to the object size, thus naturally handling perspective-induced motion irregularities, so it is suitable for our purpose of detecting changes of direction.

• Distracters: CAMSHIFT ignores objects outside its search window, so other objects do not affect its tracking.

• Lighting variation: Using only hue from the HSV color space and ignoring pixels with very high or very low brightness gives CAMSHIFT wide tolerance to lightness changes.

In other words, we have designed an HCI system that interprets a conductor's gestures and translates these gestures into musical beats, which can be regarded as the backbone of the music. The system does not require active sensing, a special baton, or other constraints on the physical motion of the conductor. Thus, the framework can also be used for human motion analysis and many other applications, such as interactive virtual worlds that allow users to interact with computer systems.

5.2 Future Work

Since CAMSHIFT relies on color distribution alone, errors in color (colored lighting, dim illumination, excessive illumination, etc.) will cause errors in the tracking procedure. More sophisticated trackers use multiple modalities such as feature tracking and motion analysis to compensate for this, but the added complexity would undermine the original design criterion of CAMSHIFT. Other possible improvements include:

• Improve tempo following: the current system cannot react to some of the complex and subtle movements of a professional conductor, which convey more than direction changes. We could replace our beat detection and analysis module with more sophisticated gesture recognition algorithms, so that the module can be adjusted to the user's level of conducting skill.


• Include a time-stretching algorithm: time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. If our system adjusted the music playback speed according to the detected beat events, it would help users understand the conducting speed they performed.

• New application areas: our framework can be applied to several other areas that interface vision with other multimedia. We can estimate not only the accuracy of the beat events of a conducting gesture, but also the movements of a dancer. Building on the previous future-work items, we could design a system for a dancer whose routine is no longer constrained by the tempo of a recording, so that the music would spontaneously react to his or her movements.

In conclusion, since the proposed system is a framework that combines the video and audio processing areas, applications of this technology can help us to examine unexplored areas in interfaces with music and other multimedia. We could build an "interactive karaoke" system in which a user sings along to a recording while the recording adjusts to the user's tempo. Other applications can be implemented following these ideas, including conductor training, live performance, music synthesis control, and so forth. We hope that the flexible and interchangeable modules will make further research easier in the future.


References

[1] T. T. Hewett, et al., "ACM SIGCHI Curricula for Human-Computer Interaction", ACM Press, New York, NY, 1992, ACM Order Number: 608920.
[2] B. A. Myers, "A Brief History of Human-Computer Interaction Technology", Interactions, vol. 5, pp. 44-54, 1998.
[3] M. T. Driscoll, "A Machine Vision System for Capture and Interpretation of Orchestra Conductor's Gestures", M.S. Thesis, May 1999.
[4] E. Lee, I. Grüll, H. Kiel and J. Borchers, "Conga: A Framework for Adaptive Conducting Gesture Analysis", NIME '06: Proceedings of the 2006 Conference on New Interfaces for Musical Expression, pp. 260-265, 2006.
[5] D. Murphy, "Tracking a Conductor's Baton", in S. I. Olsen, Ed., Proceedings of the 12th Danish Conference on Pattern Recognition and Image Analysis, DIKU Technical Report Series, vol. 2003/05, pp. 59-66, Copenhagen, Denmark, August 2003.
[6] R. Behringer, "Conducting Digitally Stored Music by Computer Vision Tracking", AXMEDIS '05: Proceedings of the First International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, p. 271, 2005.
[7] The Church of Jesus Christ of Latter-day Saints, Conducting Course.
[8] Wikipedia, The Free Encyclopedia, http://en.wikipedia.org
[9] M. Lambers, "How Far is Technology from Completely Understanding a Conductor?", 4th Twente Student Conference on IT, Enschede, January 30, 2006.
[10] P. Kolesnik, "A Conducting Gesture Recognition, Analysis and Performance System", M.S. Thesis, McGill University, June 2004.
[11] R. Boulanger and M. Mathews, "The 1997 Mathews Radio-Baton and Improvisation Modes", Proceedings of the 1997 International Computer Music Conference, pp. 395-398, Thessaloniki, Greece, 1997.
[12] Buchla Lightning II. <http://www.buchla.com>
[13] B. Brecht and G. Garnett, "Conductor Follower", Proceedings of the 1995 International Computer Music Conference, pp. 185-186, Banff, Canada, 1995. Available: http://cnmat.berkeley.edu/publication/conductor_follower.
[14] J. Borchers, W. Samminger and M. Mühlhäuser, "Personal Orchestra: Conducting Audio/Video Music Recordings", Proceedings of the Second International Conference on WEB Delivering of Music (WEDELMUSIC '02), 2002.
[15] C. Cascaito and M. M. Wanderley, "Lessons from Long Term Gestural Controller Users", Proceedings of the 4th International Conference on Enactive Interfaces (ENACTIVE '07), pp. 333-336, Grenoble, France, 2007.
[16] F. Tobey and I. Fujinaga, "Extraction of Conducting Gestures in 3D Space", Proceedings of the 1996 International Computer Music Conference, pp. 305-307, San Francisco, 1996.
[17] T. Marrin and J. Paradiso, "The Digital Baton: A Versatile Performance Instrument", Proceedings of the 1997 International Computer Music Conference, pp. 313-316, Thessaloniki, Greece, 1997.
[18] T. Marrin and R. Picard, "The Conductor's Jacket: A Device for Recording Expressive Musical Gestures", Proceedings of the 1998 International Computer Music Conference, pp. 215-219, Ann Arbor, MI, 1998.
[19] T. Marrin, "Inside the Conductor's Jacket: Analysis, Interpretation and Musical Synthesis of Expressive Gesture", Ph.D. Dissertation, MIT Media Lab, February 2000.
[20] T. Ilmonen, "Tracking Conductor of an Orchestra Using Artificial Neural Networks", M.S. Thesis, Helsinki University of Technology, Espoo, Finland, 1999.
[21] T. Ilmonen and T. Takala, "Conductor Following with Artificial Neural Networks", Proceedings of the 1999 International Computer Music Conference, pp. 367-370, Beijing, China, October 1999.
[22] H. Morita, "A Computer Music System that Follows a Human Conductor", Computer, vol. 24, pp. 44-53, 1991.
[23] Light Baton. <http://web.tiscali.it/pcarosi/Lbs.htm>
[24] J. Segen, A. Majumder and J. Gluckman, "Virtual Dance and Music Conducted by a Human Conductor", Eurographics, vol. 19, no. 3, EACG, 1999.
[25] J. Segen, S. Kumar and J. Gluckman, "Visual Interface for Conducting Virtual Orchestra", Proceedings of the 15th International Conference on Pattern Recognition (ICPR '00), vol. 1, pp. 276-279, 2000.
[26] E. Lee, T. Marrin and J. Borchers, "You're the Conductor: A Realistic Interactive Conducting System for Children", NIME '04: Proceedings of the 2004 Conference on New Interfaces for Musical Expression, pp. 68-73, Hamamatsu, Japan, June 3-5, 2004.
[27] E. Lee, M. Wolf and J. Borchers, "Improving Orchestral Conducting Systems in Public Spaces: Examining the Temporal Characteristics and Conceptual Models of Conducting Gestures", Proceedings of the CHI 2005 Conference on Human Factors in Computing Systems, pp. 731-740, Portland, Oregon, April 2-7, 2005.
[28] E. Lee and J. Borchers, "The Role of Time in Engineering Computer Music Systems", NIME '05: Proceedings of the 2005 Conference on New Interfaces for Musical Expression, pp. 204-207, Vancouver, Canada, May 26-28, 2005.
[29] D. Murphy, T. H. Andersen and K. Jensen, "Conducting Audio Files via Computer Vision", Gesture-Based Communication in Human-Computer Interaction: Selected Revised Papers from the 5th International Gesture Workshop, LNAI vol. 2915, pp. 529-540, Genoa, Italy, April 2003.
[30] D. Murphy, "Live Interpretation of Conductors' Beat Patterns", Proceedings of the 13th Danish Conference on Pattern Recognition and Image Analysis, pp. 111-120, Copenhagen, Denmark, 2004.
[31] T. Sim, D. Ng and R. Janakiraman, "VIM: Vision for Interactive Music", Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV '07), pp. 32-32, February 2007.
[32] K. C. Ng, "Music via Motion: Transdomain Mapping of Motion and Sound for Interactive Performances", Proceedings of the IEEE, vol. 92, no. 4, pp. 645-655, April 2004.
[33] W. Hu, T. Tan, L. Wang and S. Maybank, "A Survey on Visual Surveillance of Object Motion and Behaviors", IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 34, pp. 334-352, 2004.
[34] W.-H. Yao, "Mean-Shift Object Tracking Based on a Multi-blob Model", M.S. Thesis, National Chiao Tung University, Taiwan, June 2006.
[35] G. Bradski, "Computer Vision Face Tracking for Use in a Perceptual User Interface", Intel Technology Journal, 2nd Quarter, 1998.
[36] J. G. Allen, R. Y. D. Xu and J. S. Jin, "Object Tracking Using CamShift Algorithm and Multiple Quantized Feature Spaces", Australian Computer Society, Inc., vol. 36, 2004.
[37] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, May 2002.
[38] X. Liu, "Research of the Improved Camshift Tracking Algorithm", Proceedings of the International Conference on Mechatronics and Automation (ICMA 2007), pp. 968-972, 2007.
[39] H. Je, J. Kim and D. Kim, "Vision-Based Hand Gesture Recognition for Understanding Musical Time Pattern and Tempo", The 33rd Annual Conference of the IEEE Industrial Electronics Society (IECON), pp. 2371-2376, Taipei, Taiwan, November 5-8, 2007.
[40] W. S. Rutkowski and A. Rosenfeld, "A Comparison of Corner-Detection Techniques for Chain-Coded Curves", TR-623, Computer Science Center, University of Maryland, 1978.
[41] T. Peli, "Corner Extraction from Radar Images", Proceedings of the 1988 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-88), vol. 2, pp. 1216-1219, 1988.
[42] H.-Y. Lian, "The Effects of Human Factors on Reaction Speed to Visual and Auditory Signals", M.S. Thesis, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan, 2000.
