

1.4   Thesis Organization

Following the introductory chapter, Chapter 2 presents an overview of existing conductor gesture tracking systems and discusses the performance of previous methods. Chapter 3 describes our system framework and details the algorithms we propose for conductor gesture tracking. Chapter 4 presents the experimental results and discussion. Finally, conclusions and future work are given in Chapter 5.


Chapter 2

Literature Review

2.1 Key Terms in Music Conducting

This section describes some music conducting terminology for readers with no prior music experience. It is based on several online conducting course documents [7] and on the Wikipedia encyclopedia [8][9].

2.1.1 Beat

In written music, beats and notes are grouped into measures. The beat of the music is typically indicated with the conductor's right hand, with or without a baton.

The hand traces a shape in the air in each measure, indicating each beat with a change from downward to upward motion. The instant at which the beat occurs is usually marked by a sudden click of the wrist or a change in baton direction (see Figure 2.1).

Figure 2.1 The conductor has to “write” in the air to create different beats [9]


2.1.2 Tempo

Tempo, the Italian word for “time”, is the speed of the fundamental beat and should stay even from beat to beat. The tempo of a piece is typically written at the start of the score, and in modern music it is usually indicated in beats per minute (BPM). The greater the tempo, the more beats must be played per minute and, therefore, the faster the piece must be played.
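For example, a piece marked at 120 BPM allots 60/120 = 0.5 seconds to each beat, whereas a piece at 60 BPM allots a full second to each beat.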

2.1.3 Dynamics (Volume)

In music, dynamics normally refers to the softness or loudness of a sound or note. It is usually communicated by the size of the conducting gestures: the larger the shape, the louder the sound. Changes in dynamics may be signaled with the hand that is not being used to indicate the beat: an upward motion indicates a crescendo and a downward motion indicates a diminuendo or decrescendo. In general, loud dynamic levels are conducted with larger beat patterns, and soft dynamic levels with smaller beat patterns.

2.1.4 Other Musical Elements

Time signatures are figures written on the score at the start of the composition.

Each measure is assigned a meter that tells the musician how many rhythmic beats there are within the measure and what type of note equals one beat [3]. For instance, a measure with a 2/4 meter tells the musician that there are two rhythmic beats in the measure (from the “2” in the numerator) and that quarter notes are the fundamental beats of the measure (from the “4” in the denominator). The number of beats per measure and the time signature usually stay the same from the start of a song to the end, but they may change if the piece switches to a different time signature.

Figure 2.2 Two kinds of 2-beat conducting patterns [3]

(a) A legato style pattern (b) a staccato style pattern

Every meter can have a range of conducting patterns depending on the style of the piece and the type of mood that the conductor is trying to convey to the musicians and audience. The conductor may use flowing gestures as a means of expressing the legato style. Alternatively, a piece such as a march, written in the same meter as a legato piece, would be conducted in a much more angular manner, which is called the staccato style. The terms legato and staccato indicate how much silence is to be left between notes played one after another.

2.2 Reviews of Conductor Gesture Tracking Systems

There are many approaches to designing a system that can understand a human conductor's gestures [10]. In this section, we classify these systems into two categories. Instrumented-baton systems are reviewed in Section 2.2.1, and vision-based conductor gesture tracking systems are reviewed in Section 2.2.2.

2.2.1 Instrumented Batons

Some approaches introduce special instrumented batons as the input of the system.

Max Mathews, often called one of the fathers of computer music, created the first conductor gesture tracking system. His system, the Radio Baton, consisted of two batons that emitted radio waves from their tips and a plate equipped with antennas to receive the signals emitted by the batons [11].

The Buchla Lightning baton series (see Figure 2.3) is another instrumented baton; it senses the position and movement of wands and transforms this information into MIDI signals to control musical instrumentation more expressively [12]. The baton has been used in many systems, such as the Adaptive Conductor Follower [13] and Personal Orchestra [14].

Figure 2.3 Radio Baton (left) and Buchla Lightning II (right) [14]


Besides using the Lightning baton as an input, the Adaptive Conductor Follower [13] provided three methods of tracking and predicting tempo at the beat analysis stage. The most important achievement of the system was that it made the first attempt at using neural networks for recognition purposes. The Extraction of Conducting Gestures in 3D Space [16] used two Lightning batons to track the baton's movement in 3D space and extracted information including tempo, dynamics, beat pattern, and beat style.

Teresa Marrin Nakra et al. presented the Digital Baton in 1997 [17]. They developed an input device that included acceleration sensors to measure the baton's movement, pressure sensors to capture the pressure of each finger of the hand holding the baton, and an infrared LED at the tip of the baton. A position-sensitive photodiode was placed behind a camera lens to track the position of the infrared LED. However, no beat information was derived from the input to control the tempo of the piece. In 1998, they created the Conductor's Jacket system [18][19], which used a multitude of sensors built into a jacket to record physiological and motion data. The Conductor's Jacket consisted of four muscle tension (EMG) sensors, a heart rate and respiration monitor, a temperature sensor, and a skin response sensor (see Figure 2.4(a)).

Tommi Ilmonen et al. created a system to track conducting gestures with neural networks; it was part of the DIVA system (see Figure 2.4(b)) and extracted rhythm data from the conductor's movements [20][21]. It used a magnetic motion tracker as the input to collect positional information and used neural networks to classify and predict beats. However, the system has some limitations: it only understands standard conducting techniques, so users must have some prior knowledge of conducting.

Figure 2.4 (a) Conducting an orchestra with the Conductor's Jacket system [10]

(b) The magnetic sensors for motion tracking at DIVA [21]

In 2002, Jan Borchers et al. designed the Personal Orchestra system [14] using a Buchla Lightning baton [12]. Users interact not with a synthetic rendition but with an original audio/video recording of a real orchestra. A beat was detected each time the baton changed from going down to going up, and the vertical coordinate of the left hand served as the dynamics indicator. Its audio-stretching algorithm, which rendered audio and video at variable speed without time-stretching artifacts such as pitch changes, was a notable contribution adopted by later systems. Moreover, it did not contain complex rules to extract beats and tempo, so the system could serve as a more general system for users with little or no conducting experience. It was also the first system that allowed users to control an audio/video recording in real time using natural conducting gestures.



2.2.2 Vision-based Conductor Gesture Tracking Systems

In 1989, Morita et al. built A Computer Music System that Follows a Human Conductor. This system tracked either a white marker attached to the baton or a hand wearing a white glove, using a CCD camera and special feature-extraction hardware that passed two-dimensional position values to a PC. The computer derived tempo and volume information from the upper and lower turning points of the trajectory. It was the first project to use a CCD camera as an input device [22].

Light Baton, created in 1992 by Graziano Bertini et al., was another system using a CCD camera and a baton with an LED light on its tip [22]. The position of the baton was analyzed by a special image acquisition board, and the playback of a pre-recorded score was adjusted in terms of tempo and note intensity.

Michael T. Driscoll [3] proposed A Machine Vision System for Capture and Interpretation of an Orchestra Conductor's Gestures in his master thesis in 1999. This work involved the design and implementation of a real-time vision-based HCI that analyzes and interprets a music conductor's gestures to detect the beat. It used several basic image processing methods, such as rapid RGB color thresholding, multiple contour extraction, and center-of-mass calculations, to determine the time locations of beats. However, there were performance constraints due to the insufficient time resolution of the system, which could create a “choppy” response by the Virtual Orchestra.


In 2000, Jakub Segen et al. created their Virtual Dance and Music system [24] and Visual Interface for Conducting Virtual Orchestra [25]. The system used two synchronized cameras to acquire a 3D trajectory of the baton, and beats were placed at the locally lowest trajectory points. It also presented a simple scenario in beat following: if the music sequencer had already played all the notes corresponding to the current beat, the sequencer would wait until the user conducted the next beat; if the user had already conducted the next beat before the music sequencer finished playing the notes corresponding to the last beat, the sequencer would increase the tempo slightly to catch up with the conductor.
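As a rough illustration of this scheduling rule (a sketch in Python, not Segen et al.'s actual implementation; the sequencer methods are hypothetical names used only for exposition), the logic could be expressed as:

def follow_beat(sequencer, tempo_bpm, tempo_step=2.0):
    # `finished_current_beat()`, `user_conducted_next_beat()` and
    # `wait_for_next_beat()` are hypothetical predicates/actions that
    # stand in for the sequencer state described above.
    if sequencer.finished_current_beat() and not sequencer.user_conducted_next_beat():
        sequencer.wait_for_next_beat()          # pause until the next conducted beat
    elif sequencer.user_conducted_next_beat() and not sequencer.finished_current_beat():
        tempo_bpm += tempo_step                 # conductor is ahead: speed up slightly
    return tempo_bpm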

In 2003, Declan Murphy et al. presented a conducting gesture recognition system able to control the tempo of audio file playback through standard conducting movements [29]. It worked with one or two cameras as input sources providing front and side views of the conductor. Computer vision techniques were then used to extract the position and velocity of the tip of the baton or of the conductor's right hand.

In Section 2.2.1, we mentioned T. Marrin and J. Borchers, who have been involved in this field for several years. They worked with Eric Lee to design another museum exhibit, for the Children's Museum, in 2004. The final system was called You're the Conductor [26]. Instead of employing a Buchla Lightning baton for input, a rugged baton-like device was developed, which was essentially a light source and could withstand heavy use. The movement of the baton was translated into playback speed and volume, so that children of all ages could use the system. If the child moved the baton faster or slower, the orchestra sped up or slowed down, respectively, and if the child stopped moving the baton, the orchestra slowed to a halt. Compared with Personal Orchestra, it used a real-time, high-quality time-stretching algorithm for polyphonic audio, allowing the system to respond more instantaneously and accurately to user input.

Figure 2.5 (a) Declan Murphy using his conducting system [29]

(b) You’re the Conductor exhibit at Children’s Museum of Boston [26]

R. Behringer presented another system, Conducting Digitally Stored Music by Computer Vision Tracking, in 2005 [6]. Computer vision methods were used to track the motion of the baton and to deduce musical parameters (volume, pitch, expression) for the time-synchronized replay of previously recorded music notation sequences. Combined with acoustic signal processing, this method can support automatic playing in which the conductor conducts both this instrument and the human musicians.

In 2007, Terence Sim et al. presented a vision-based, interactive music playback system, called VIM, which allows anyone, even untrained musicians, to conduct music [31]. Because of this assumption, the system does not recognize any specific musical gestures, which means any kind of motion will suffice, and it uses a webcam to capture the movements. The system applied the Intel OpenCV Library as a programming tool to determine the speed and area of the moving object. It also provides a visualization that projects colorful patterns responding to the user.

2.2.3 Summary of Conductor Gesture Tracking Systems

This section presents a summarized description of all systems described in this chapter in table format [10] (see Table 2.1). To simplify the presentation, we use the name of the latest system to represent an entire series created by the same research group, unless there are major changes between two systems.


Table 2.1 The summary of the conductor gesture tracking systems

Year | Name | Authors | Input Device | Tracked Info. | Control Var. | Output | Features
1989 | A Computer Music System that Follows a Human Conductor | Morita et al. | CCD camera, white marker or white glove | Baton/hand 2D position | Tempo, Dynamics | — | First project to use a CCD camera as an input device
1991 | Radio Baton | Max Mathews | Two radio-wave batons, a plate with antennas | Two batons' positions | — | — | Limit: small work area
1992 | Light Baton | Graziano Bertini, Paolo Carosi | CCD camera, baton with LED | Baton 2D position | Tempo, Dynamics | Prerecorded MIDI | Uses an image acquisition board to get the baton position
— | Adaptive Conductor Follower [13] | — | Buchla Lightning baton | Baton 2D position | Tempo, Dynamics | Prerecorded MIDI | First system to use a neural network for beat analysis
— | Extraction of Conducting Gestures in 3D Space [16] | — | Two Buchla Lightning batons | Baton 3D position | Tempo, Dynamics, beat pattern, beat style | — | First system to acquire 3D position coordinates
1996 | Digital Baton | Teresa Marrin, Joseph Paradiso | Baton with pressure and acceleration sensors, infrared LED at the tip | Baton position, finger pressure | — | — | No beat information derived to control tempo
1998 | Conductor's Jacket | Teresa Marrin, Rosalind Picard | Jacket with EMG, heart rate, respiration, temperature, and skin response sensors | Physiological and motion data | — | — | —
1999 | Conductor Following with Artificial Neural Networks | Tommi Ilmonen, Tapio Takala | Data suit with 6-DOF magnetic sensors | — | — | — | Uses neural networks to classify and predict beats
1999 | A Machine Vision System for Capture and Interpretation of an Orchestra Conductor's Gestures | Michael T. Driscoll | 1 video camera | Right hand 2D position | Tempo | Prerecorded MIDI | Uses basic image processing methods to extract the hand's position
2000 | Visual Interface for Conducting Virtual Orchestra | Jakub Segen, Senthil Kumar, Joshua Gluckman | 2 video cameras | Right hand 3D position | Tempo | — | Beats placed at locally lowest trajectory points
2002 | Personal Orchestra | Jan O. Borchers, Wolfgang Samminger, et al. | Buchla Lightning baton | Baton and left-hand positions | Tempo, Dynamics | Audio/video recording of a real orchestra | No beat patterns required, just up or down movements
2003 | Conducting Audio Files via Computer Vision | Declan Murphy et al. | 1–2 video cameras | Baton tip / right hand position and velocity | Tempo | Audio file playback | —
2004 | You're the Conductor | Eric Lee, Teresa Marrin, et al. | Rugged baton-like light source | Baton movement | Tempo, Dynamics | Audio/video recording | Real-time, high-quality time stretching for polyphonic audio
2005 | Conducting Digitally Stored Music by Computer Vision Tracking | Reinhold Behringer | CMOS camera | Right hand 2D position | Volume, pitch, expression | Prerecorded music notation | —
2007 | VIM | Terence Sim et al. | A simple webcam | Hand 2D position | — | — | No specific gestures required; any kind of motion suffices

2.3 Background on Object Tracking

In general, many previous approaches treat motion detection and tracking as a single, conceptually ambiguous problem. Following the surveys by W. Hu et al. [33] and W. Yao [34], we separate these two issues and introduce them in Sections 2.3.1 and 2.3.2.

2.3.1 Motion Detection

Identifying moving objects from a video sequence is a fundamental and critical task in many computer-vision applications. Motion detection aims at separating the regions corresponding to moving objects from the rest of an image. Detecting moving regions provides a focus of attention for later processes such as the tracking procedure. Here we introduce three basic methods as examples: background subtraction, temporal differencing, and optical flow.

(1) Background Subtraction

It is a popular method for motion detection, especially in situations with a relatively static background. It detects moving regions by differencing the current image against a reference background image. However, it is highly sensitive to changes in dynamic scenes caused by lighting and extraneous events. Therefore, active construction and updating of the background model are indispensable to reduce the influence of these changes.
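As a minimal sketch of this idea (illustrative only, not part of any system reviewed here), the following Python/OpenCV snippet maintains a running-average background model and thresholds the difference against each incoming frame; the camera index, update rate, and threshold value are arbitrary choices:

import cv2
import numpy as np

cap = cv2.VideoCapture(0)           # any video source
background = None                   # running-average background model
alpha = 0.02                        # background update rate

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    if background is None:
        background = gray.copy()
    # Update the background model so it slowly absorbs lighting changes.
    cv2.accumulateWeighted(gray, background, alpha)
    # Pixels far from the background model are marked as moving regions.
    diff = cv2.absdiff(gray, background)
    _, mask = cv2.threshold(diff.astype(np.uint8), 25, 255, cv2.THRESH_BINARY)
    cv2.imshow("moving regions", mask)
    if cv2.waitKey(1) == 27:        # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()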


(2) Temporal Differencing

Temporal differencing makes use of the pixel-wise differences between two or three consecutive frames in an image sequence to extract moving regions. Temporal differencing is very adaptive to dynamic environments, but it may not extract all relevant pixels; that is, holes may be left inside moving entities.
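A corresponding sketch of three-frame temporal differencing, again purely illustrative, is shown below; the file name and threshold are placeholders:

import cv2

cap = cv2.VideoCapture("conductor.avi")   # hypothetical input file
prev2 = prev1 = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if prev1 is not None and prev2 is not None:
        # Difference the current frame against the two previous frames and
        # keep only pixels that changed in both, which suppresses some noise.
        d1 = cv2.absdiff(gray, prev1)
        d2 = cv2.absdiff(prev1, prev2)
        motion = cv2.bitwise_and(d1, d2)
        _, mask = cv2.threshold(motion, 20, 255, cv2.THRESH_BINARY)
        cv2.imshow("temporal differencing", mask)
    prev2, prev1 = prev1, gray
    if cv2.waitKey(1) == 27:
        break
cap.release()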

(3) Optical Flow

Optical-flow-based algorithms calculate the optical flow field of a video sequence by finding correlations between adjacent frames, generating a vector field showing where each pixel or region in one frame moved to in the next frame. Typically, the motion is represented as vectors originating and terminating at locations in consecutive frames. However, most optical-flow-based methods are computationally complex and sensitive to noise, and they cannot be applied to video streams in real time without specialized hardware.
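The following sketch uses OpenCV's Farneback dense optical flow as one possible implementation; the parameter values are typical defaults rather than tuned settings:

import cv2
import numpy as np

cap = cv2.VideoCapture(0)
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense optical flow: a 2-channel field giving the (dx, dy) motion of every pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving = (magnitude > 2.0).astype(np.uint8) * 255   # pixels with noticeable motion
    cv2.imshow("optical flow magnitude", moving)
    prev_gray = gray
    if cv2.waitKey(1) == 27:
        break
cap.release()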

2.3.2 Motion Tracking

Tracking algorithms usually overlap with motion detection during processing. Although much research has tried to deal with the motion tracking problem, existing techniques are still not robust enough for stable tracking. Tracking methods can be divided into four major categories: region-based tracking, active contour-based tracking, feature-based tracking, and model-based tracking.


(1) Region-Based Tracking

Region-based tracking algorithms track objects according to variations of the image regions corresponding to moving objects. In these algorithms, the background image is maintained dynamically, and motion regions are detected by subtracting the background from the current image. They work well in scenes containing only a few objects, but they may not handle occlusion between objects reliably, so they cannot satisfy the requirements of cluttered backgrounds or multiple moving objects.

(2) Active Contour-Based Tracking

Active contour-based tracking algorithms track objects by representing their outlines as bounding contours and updating these contours dynamically in successive frames. In contrast to region-based algorithms, these algorithms describe objects more simply and more effectively and reduce computational complexity. Even under partial occlusion, these algorithms may track objects continuously. However, a difficulty is that they are highly sensitive to the initialization of tracking, making it more difficult to start tracking automatically.

(3) Feature-Based Tracking

Feature-based tracking algorithms perform object tracking by extracting elements, clustering them into higher-level features, and then matching the features between images. Many features can help in tracking objects, such as edges, corners, color distributions, skin tone, and human eyes. However, the recognition rate of objects based on 2D image features is low, and the ability to deal effectively with occlusion is generally poor.
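As one illustrative example of feature-based tracking (corner features matched with pyramidal Lucas-Kanade, which is not the method used in this thesis), consider the following Python/OpenCV sketch:

import cv2
import numpy as np

cap = cv2.VideoCapture(0)
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
# Extract Shi-Tomasi corner features to track.
points = cv2.goodFeaturesToTrack(prev_gray, 100, 0.01, 7)

while True:
    ok, frame = cap.read()
    if not ok or points is None or len(points) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Match the features between consecutive frames with pyramidal Lucas-Kanade.
    new_points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    good = new_points[status.flatten() == 1]
    for x, y in good.reshape(-1, 2):
        cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
    cv2.imshow("feature-based tracking", frame)
    prev_gray, points = gray, good.reshape(-1, 1, 2)
    if cv2.waitKey(1) == 27:
        break
cap.release()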

(4) Model-Based Tracking

Model-based tracking algorithms track objects by matching projected object models, produced with prior knowledge, to image data. The models are constructed off-line with manual measurement or other computer vision techniques. Compared with other tracking algorithms, these algorithms can obtain better results even under occlusion (including self-occlusion for humans) or interference between nearby image motions. Inevitably, model-based tracking algorithms have some disadvantages, such as the necessity of constructing the models and high computational cost.


Chapter 3

Vision-based Conductor Gesture Tracking

3.1 Overview

Most of the conductor gesture tracking systems presented above focused on technical issues and did not describe how to organize a framework for building a complete HCI. The framework we present here is based on tracking a user-defined target, and its output is the timing of musical beats after analysis. Inspired by face tracking approaches, we track the center of the target, so the position data must be obtained with a real-time algorithm. After acquiring the successive position data, our algorithm detects the time points at which the target changes direction. The proposed framework can be divided into two independent modules: the CAMSHIFT tracking module and the beat detection and analysis module. The diagram of our system is shown in Figure 3.1, and the details of these two modules are discussed in Sections 3.2 and 3.3.

(1) CAMSHIFT Tracking Module

We build a 1D histogram of the Hue channel in the Hue-Saturation-Value (HSV) color space, which is obtained by converting the standard RGB frame. The histogram back-projection algorithm then uses this histogram to compute the ROI (Region of Interest) probability map. The major consideration is the correct detection rate of the moving target, which is a critical issue for the following module.

(2) Beat Detection and Analysis Module

Our system uses the movement of the target detected in the previous module to determine changes of direction. After the WAV file and other parameters are selected, the system also displays the beat detection results on a visualization of the WAV waveform and calculates the precision and recall rates.
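As an illustrative sketch only (the actual beat detection algorithm is described in Section 3.3), a minimal direction-change detector over the tracked vertical positions might look like this:

def detect_beats(y_positions, fps=30.0):
    """Return times (in seconds) at which the tracked target switches from
    downward to upward motion.  y_positions holds the per-frame vertical
    coordinate of the target centre (image coordinates, y grows downward).
    This is a simplified sketch, not the exact algorithm of Section 3.3."""
    beats = []
    prev_dy = 0.0
    for i in range(1, len(y_positions)):
        dy = y_positions[i] - y_positions[i - 1]
        # A beat is placed where downward motion (dy > 0) turns into upward motion (dy < 0).
        if prev_dy > 0 and dy < 0:
            beats.append(i / fps)
        if dy != 0:
            prev_dy = dy
    return beats

# Example: a target moving down and up twice at 30 frames per second yields two beats.
print(detect_beats([10, 20, 30, 25, 15, 20, 30, 40, 35, 25]))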

Figure 3.1 Diagram of the framework we proposed



This system was designed in this way so that all algorithms within one module can be changed at will, without affecting the functionality of the other module.

3.2 User-defined Target Tracking Using CAMSHIFT

The user-defined target tracking module is the first stage of the proposed system; it separates the target from the background and extracts position information. In our system, we apply the CAMSHIFT (Continuously Adaptive Mean Shift) algorithm [35][36], which is an efficient and simple colored-object tracker, as the kernel of this module.
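The following Python/OpenCV sketch illustrates the CAMSHIFT loop described in this section: a hue histogram of the selected region is back-projected onto each frame, and cv2.CamShift updates the search window toward the peak of the resulting probability map. It is only a simplified illustration under assumed settings, not the exact implementation of this thesis; the fixed ROI coordinates stand in for the mouse selection described below.

import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
# Placeholder for the mouse-selected target region.
x, y, w, h = 200, 150, 60, 60
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
# 1D hue histogram of the target, used as its colour model.
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
track_window = (x, y, w, h)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Histogram back-projection: each pixel gets the probability of belonging to the target.
    prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # CAMSHIFT shifts and resizes the search window toward the peak of the probability map.
    rot_box, track_window = cv2.CamShift(prob, track_window, criteria)
    cx, cy = rot_box[0]              # centre of the tracked target, passed to beat detection
    cv2.ellipse(frame, rot_box, (0, 255, 0), 2)
    cv2.imshow("CAMSHIFT", frame)
    if cv2.waitKey(1) == 27:
        break
cap.release()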

The target area is sampled by the user with the mouse, so the target can be the user's head, hand, baton, or any other object whose color differs from the background. Using the histogram back-projection method, we measure the color characteristics of the target of interest to build the ROI probability model, which is also the first step of the CAMSHIFT algorithm. At the same time, the color distribution derived from the video changes over time,

