

Chapter 2 Literature Review

2.2   Reviews of Conductor Gesture Tracking Systems

2.2.3   Summary of Conductor Gesture Tracking Systems

This section summarizes all of the systems described in this chapter in table format [10] (see Table 2.1). For brevity, we use the name of the latest system as the representative of an entire series created by the same research group, unless there are major changes between two of its systems.


Table 2.1 Summary of the conductor gesture tracking systems

| Year | Name | Authors | Input Device | Tracked Info. | Control Var. | Output | Features |
|------|------|---------|--------------|---------------|--------------|--------|----------|
| 1989 | Computer Music System that Follows a Human Conductor | | | | | | |
| 1991 | Radio Baton | Max Mathews | Two radio-wave batons, a plate with antennas | Positions of the two batons | | | Limit: small work area |
| 1992 | Light Baton | Graziano Bertini, Paolo Carosi | CCD camera, baton with LED | Baton 2D position | Tempo, Dynamics | Prerecorded MIDI | Uses an image acquisition board to get the baton position |
| | | | | Baton 2D position | Tempo, Dynamics | Prerecorded MIDI | First system to use a neural network for beat analysis |
| | | | | | | | First system to acquire 3D position coordinates |
| 1996 | Digital Baton | Teresa Marrin, Joseph Paradiso | Baton with pressure and acceleration sensors | | | | |
| 1998 | Conductor's Jacket | Teresa Marrin, Rosalind Picard | | Gesture | | | Jacket is weird |
| 1999 | Conductor Following with Artificial Neural Network | Tommi Ilmonen, Tapio Takala | Data dress suit with 6-DOF sensors | | | | |
| 1999 | A Machine Vision System of a Conductor's Gestures | Michael T. Driscoll | 1 video camera | Right hand 2D position | Tempo | Prerecorded MIDI | Uses basic image-processing methods to extract the hand's position |
| 2000 | Visual Interface for Conducting Virtual Orchestra | Jakub Segen, Senthil Kumar, Joshua Gluckman | 2 video cameras | Right hand 3D position | | | |
| 2002 | Personal Orchestra | Jan O. Borchers, Wolfgang Samminger | | Beat patterns, just up or down movements | | | |
| 2003 | Conducting Audio Files via Computer | | | | | | |
| 2004 | You're the Conductor | Eric Lee, Teresa Marrin | | | | | |
| 2005 | Conducting Digitally Stored Music by Computer Vision Tracking | Reinhold Behringer | CMOS camera | Right hand 2D position | | | |
| | | | | | | | Is more intuitive and provides more |
| | | | A simple webcam | Hand 2D position | | | |

2.3 Backgrounds on Object Tracking

In general, many previous approaches do not clearly distinguish between motion detection and motion tracking. Following the surveys by W. Hu et al. [33] and W. Yao [34], we separate these two issues and introduce them in Sections 2.3.1 and 2.3.2.

2.3.1 Motion Detection

Identifying moving objects in a video sequence is a fundamental and critical task in many computer-vision applications. Motion detection aims at separating the regions corresponding to moving objects from the rest of an image. Detecting moving regions provides a focus of attention for later processes such as the tracking procedure. Here we introduce three basic methods as examples: background subtraction, temporal differencing, and optical flow.

(1) Background Subtraction

Background subtraction is a popular method for motion detection, especially in situations with a relatively static background. It detects moving regions by taking the difference between the current image and a reference background image. However, it is highly sensitive to changes in dynamic scenes caused by lighting and extraneous events. Therefore, active construction and updating of the background model are indispensable to reduce the influence of these changes.
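As an illustration only, the following minimal Python/OpenCV sketch maintains a running-average background model and thresholds the difference against the current frame. The camera index, threshold value, and learning rate are assumptions chosen for demonstration, not values used by any system described above.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                       # assumed webcam index
ret, frame = cap.read()
background = np.float32(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))

while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Moving pixels: large difference from the reference background image
    diff = cv2.absdiff(gray, cv2.convertScaleAbs(background))
    _, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)

    # Slowly update the background model so lighting changes are absorbed
    cv2.accumulateWeighted(gray, background, 0.01)

    cv2.imshow("moving regions", mask)
    if cv2.waitKey(1) & 0xFF == 27:             # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```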


(2) Temporal Differencing

Temporal differencing makes use of the pixel-wise differences between two or three consecutive frames in an image sequence to extract moving regions. Temporal differencing is very adaptive to dynamic environments, but it may not work well for extracting all relevant pixels. That is, there may be holes left inside moving entities.
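A minimal sketch of three-frame differencing along the same lines, again assuming OpenCV and an illustrative threshold; the file name conductor.avi is only a hypothetical placeholder.

```python
import cv2

cap = cv2.VideoCapture("conductor.avi")         # hypothetical input file
_, f0 = cap.read()
_, f1 = cap.read()
prev2 = cv2.cvtColor(f0, cv2.COLOR_BGR2GRAY)
prev1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Pixel-wise differences between consecutive frames
    d1 = cv2.absdiff(curr, prev1)
    d2 = cv2.absdiff(prev1, prev2)

    # Keep only pixels that moved in both frame pairs (three-frame differencing);
    # note the holes that remain inside large uniform moving regions
    motion = cv2.bitwise_and(
        cv2.threshold(d1, 25, 255, cv2.THRESH_BINARY)[1],
        cv2.threshold(d2, 25, 255, cv2.THRESH_BINARY)[1],
    )

    prev2, prev1 = prev1, curr
    cv2.imshow("temporal differencing", motion)
    if cv2.waitKey(1) & 0xFF == 27:
        break
```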

(3) Optical Flow

Optical-flow-based algorithms estimate the optical flow field of a video sequence by finding correlations between adjacent frames, generating a vector field that shows where each pixel or region of one frame has moved to in the next frame. Typically, the motion is represented by vectors originating and terminating at pixel locations in consecutive frames. However, most optical-flow-based methods are computationally complex and sensitive to noise, and they cannot be applied to video streams in real time without specialized hardware.
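For comparison, a short sketch using OpenCV's dense Farnebäck optical flow; the algorithm parameters and the magnitude threshold below are illustrative assumptions.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("conductor.avi")         # hypothetical input file
_, first = cap.read()
prev = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Dense flow field: flow[y, x] = (dx, dy) of the pixel between the two frames
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving = (magnitude > 1.0).astype(np.uint8) * 255   # crude motion mask

    prev = curr
```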

2.3.2 Motion Tracking

Tracking algorithms usually overlap with motion detection during processing. Although much research has addressed the motion tracking problem, existing techniques are still not robust enough for stable tracking. Tracking methods can be divided into four major categories: region-based tracking, active contour-based tracking, feature-based tracking, and model-based tracking.


(1) Region-Based Tracking

Region-based tracking algorithms track objects according to variations of the image regions corresponding to moving objects. In these algorithms, the background image is maintained dynamically, and motion regions are detected by subtracting the background from the current image. They work well in scenes containing only a few objects, but they may not handle occlusion between objects reliably, so they cannot satisfy the requirements of cluttered backgrounds or multiple moving objects.

(2) Active Contour-Based Tracking

Active contour-based tracking algorithms track objects by representing their outlines as bounding contours and updating these contours dynamically in successive frames. In contrast to region-based algorithms, these algorithms describe objects more simply and more effectively and reduce computational complexity. Even under partial occlusion, these algorithms may track objects continuously. However, a difficulty is that they are highly sensitive to the initialization of tracking, making it more difficult to start tracking automatically.

(3) Feature-Based Tracking

Feature-based tracking algorithms track objects by extracting elements, clustering them into higher-level features, and then matching the features between images. There are many features that can help in tracking objects, such as edges, corners, color distribution, skin tone, and human eyes. However, the recognition rate of objects based on 2D image features is low, and the stability of dealing effectively with occlusion is generally poor.

(4) Model-Based Tracking

Model-based tracking algorithms track objects by matching projected object models, produced with prior knowledge, to image data. The models are constructed off-line with manual measurement or other computer vision techniques. Compared with other tracking algorithms, they can obtain better results even under occlusion (including self-occlusion for humans) or interference between nearby image motions. Inevitably, model-based tracking algorithms also have disadvantages, such as the need to construct the models and a high computational cost.


Chapter 3

Vision-based Conductor Gesture Tracking

3.1 Overview

Most of the conductor gesture tracking systems presented above focused on technical issues and did not describe how to organize a framework for a complete HCI. The framework we present here tracks a user-defined target and outputs the timing of the musical beats after analysis. Inspired by face-tracking approaches, we track the center of the target, so the position data must be obtained by a real-time algorithm. After acquiring the successive position data, our algorithm detects the time points at which the target changes its direction of motion. The proposed framework can be divided into two independent modules: the CAMSHIFT tracking module and the beat detection and analysis module. The diagram of our system is shown in Figure 3.1, and the details of these two modules are discussed in Sections 3.2 and 3.3.

(1) CAMSHIFT Tracking Module

We create the ROI (Region of Interest) probability map from a 1D histogram of the Hue channel in the Hue Saturation Value (HSV) color space, which is obtained by converting the standard RGB color space. The histogram back-projection algorithm is then used to compute the ROI probability map for each frame. The major consideration here is the correctness of detecting the moving target, which is a critical issue for the following module.

(2) Beat Detection and Analysis Module

Our system uses the movement of the target detected in the previous module to determine changes of direction. After the user selects the WAV file and other parameters, the system also displays the beat detection results on the visualized waveform of the WAV file and calculates the precision and recall rates.

Figure 3.1 Diagram of the proposed framework: the CAMSHIFT tracking module followed by the beat detection and analysis module


This system was designed in this way so that all algorithms within one module can be changed at will, without affecting the functionality of the other module.

3.2 User-defined Target Tracking Using CAMSHIFT

The user-defined target tracking module is the first stage of the proposed system; it separates the target from the background and extracts its position information. In our system, we apply the CAMSHIFT (Continuously Adaptive Mean Shift) algorithm [35][36], an efficient and simple colored-object tracker, as the kernel of this module.

The target area is sampled by the user with the mouse, so the target can be the user's head or hand, a baton, or any other object whose color differs from the background. Using the histogram back-projection method, we measure the color characteristics of the target of interest to build the ROI probability model, which is also the first step of the CAMSHIFT algorithm. Since the color distribution derived from the video changes over time, we use the Mean Shift algorithm to calculate the center of mass of the color probability within its 2D calculation window, re-center the window, and then compute the window size for the next iteration, until convergence (or until the mean location moves less than a threshold, which means there is no significant shift). The details are described as follows.


3.2.1 The CAMSHIFT Algorithm

In recent years, the Continuously Adaptive Mean Shift (CAMSHIFT) tracking algorithm has attracted attention because of its practicality and robustness. For object tracking, CAMSHIFT is an adaptation of the Mean Shift algorithm, a robust non-parametric technique that climbs the gradient of a probability distribution to find the mode (peak) of the distribution [37].

The primary difference between CAMSHIFT and the Mean Shift algorithm is that CAMSHIFT uses continuously adaptive probability distributions (That is, distributions may be recomputed at each frame) while Mean Shift is based on static distributions, which are not updated unless the target experiences significant changes in shape, size or color [36].

For each video frame, the raw image is converted to a probability distribution image via a color histogram model of the color being tracked. We use the zeroth- and first-order moments of the target color probability model to compute the centroid of an area of high probability, which is the main idea of the Mean Shift algorithm. The center and size of the colored object are thus found by running the Mean Shift procedure.

The current size and location of the tracked object are reported and used to set the size and location of the search window in the next frame. This process is repeated for continuous tracking [35].


The procedure can be summarized in the following steps [35]:

1. Set the region of interest (ROI) of the probability distribution image (PDF) to the entire image.
2. Select an initial location of the Mean Shift search window.
3. Calculate a color probability distribution of the region centered at the Mean Shift search window.
4. Iterate the Mean Shift algorithm to find the centroid of the probability image. Store the zeroth moment (distribution area) and centroid location.
5. For the following frame, center the search window at the mean location found in Step 4 and set the window size to a function of the zeroth moment. Go to Step 3.

Figure 3.2 The CAMSHIFT algorithm

Because CAMSHIFT tracks the moving object by its color feature, which changes very little while the object is moving, it is a robust method. Moreover, much computation time is saved because the object is searched for only around the position where it is likely to appear, so CAMSHIFT also works in real time.
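The loop above can be sketched with OpenCV's built-in CamShift implementation. This is not the thesis's own code: the webcam index, the saturation/brightness mask values, and the termination criteria are assumptions for illustration.

```python
import cv2

cap = cv2.VideoCapture(0)                             # assumed webcam index
ok, frame = cap.read()
x, y, w, h = cv2.selectROI("select target", frame)    # user samples the target region
cv2.destroyWindow("select target")
track_window = (x, y, w, h)

# Steps 1-3: build the hue histogram of the ROI, ignoring dark/desaturated pixels
hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))   # illustrative Smin/Vmin
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # Back-project the hue histogram to get the ROI probability image
    prob = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)

    # Steps 4-5: Mean Shift to the centroid, then adapt the window for the next frame
    rot_rect, track_window = cv2.CamShift(prob, track_window, term_crit)
    center = tuple(int(v) for v in rot_rect[0])       # (x, y) fed to beat detection
```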

3.2.2 Color Space Used for ROI Probability Model

The ROI probability model, also called probability distribution image (PDF), can be determined by methods which associate a pixel value with a probability that the given pixel belongs to the target. In order to create the probability distribution image of the desired color, we first create a model using a color histogram.

The color model that we capture from an image stream is in the standard Red, Green, and Blue (RGB) color model. However, the RGB color model is easily influenced by illumination. Hence, we convert from the RGB color model to the Hue Saturation Value (HSV) color model, because HSV separates hue (color) from saturation (how concentrated the color is) and from brightness. The RGB color model is additive, defining a color in terms of the combination of primaries, whereas the HSV color model encapsulates the information about a color in terms that are more familiar to humans. Figure 3.3 shows the RGB and HSV color models, respectively. Descending the V axis in Figure 3.3 (b) gives a smaller hexcone, corresponding to a smaller (darker) RGB subcube in Figure 3.3 (a).

Figure 3.3 RGB color cube (left) and HSV color hexcone (right)

The conversion formulas from the RGB color model to the HSV one are given below:

$$
H = \begin{cases}
\dfrac{\pi}{3}\,\dfrac{g-b}{\max(r,g,b)-\min(r,g,b)} & \text{if } r = \max(r,g,b)\\[2mm]
\dfrac{\pi}{3}\left(2+\dfrac{b-r}{\max(r,g,b)-\min(r,g,b)}\right) & \text{if } g = \max(r,g,b)\\[2mm]
\dfrac{\pi}{3}\left(4+\dfrac{r-g}{\max(r,g,b)-\min(r,g,b)}\right) & \text{if } b = \max(r,g,b)
\end{cases} \tag{3.1}
$$

$$
S = \frac{\max(r,g,b)-\min(r,g,b)}{\max(r,g,b)} \tag{3.2}
$$

$$
V = \max(r,g,b) \tag{3.3}
$$


where $H$, $S$, and $V$ represent the hue, saturation, and value (brightness) components of the HSV color model, and $r$, $g$, and $b$ represent the red, green, and blue components of the RGB color model.
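The following helper is a direct transcription of Eqs. (3.1)-(3.3), assuming r, g, b are normalized to [0, 1]; the wrap-around for a negative hue and the zero-division guard are added so the function is well defined for every input.

```python
from math import pi


def rgb_to_hsv(r, g, b):
    """Convert normalized r, g, b in [0, 1] to (H, S, V) following Eqs. (3.1)-(3.3).
    H is returned in radians in [0, 2*pi)."""
    mx, mn = max(r, g, b), min(r, g, b)
    v = mx                                    # Eq. (3.3)
    s = 0.0 if mx == 0 else (mx - mn) / mx    # Eq. (3.2)

    if mx == mn:                              # gray pixel: hue undefined, use 0
        h = 0.0
    elif mx == r:                             # Eq. (3.1), case max = r
        h = (pi / 3) * (((g - b) / (mx - mn)) % 6)
    elif mx == g:                             # case max = g
        h = (pi / 3) * (2 + (b - r) / (mx - mn))
    else:                                     # case max = b
        h = (pi / 3) * (4 + (r - g) / (mx - mn))
    return h, s, v
```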

To generate the PDF, an initial histogram is computed from the initial ROI of the filtered image. We create our color models by taking a 1D histogram of the H (Hue) channel in HSV space. When sampling is complete, the histogram is used as a lookup table to convert incoming video pixels to the corresponding ROI probability image.

When using real cameras with discrete pixel values, a problem can occur in HSV space: when brightness is low (V near 0), saturation is also low (S near 0), and the hue value becomes quite noisy, since in such a small hexcone the small number of discrete hue pixels cannot adequately represent slight changes in RGB. To overcome this problem, we simply ignore hue pixels that have very low brightness values. In our system, the minimum brightness value of the selected target is initially used as the lower threshold on brightness, and users can also adjust these two threshold values themselves.

Figure 3.4 (a) The Original image and ROI (b) The Probability Map with Vmin = 0 (c) The Probability Map with Vmin = 108


3.2.3 Histogram Back-projection

Histogram back-projection is used to answer the question "Where are the colors that belong to the object we are looking for (the target)?" Back-projecting the target histogram onto any subsequent video frame generates a probability image in which the value of each pixel characterizes the probability that the input pixel belongs to the histogram that was used.

Suppose we have a target image and try to locate it in a cluttered image. We use the 1D Hue histogram of the target image calculated in the previous section. Given that m-bin histograms are used, we define the $n$ image pixel locations $\{x_i^*\}_{i=1,\ldots,n}$ of the target, and $q = \{q_u\}_{u=1,\ldots,m}$ are the histogram values of its elements. We also define a function $c: \{x_i^*\} \rightarrow \{1,\ldots,m\}$ that associates with the pixel at location $x_i^*$ the histogram bin index $c(x_i^*)$. The unweighted histogram is computed as:

$$
q_u = \sum_{i=1}^{n} \delta\!\left[c(x_i^*) - u\right] \tag{3.4}
$$

where $\delta(x)$ is the unit impulse function.

Using the formula above, we can derive the unweighted histogram, which is also the probability distribution image (PDF) $q$ of the target.

The histogram is then normalized as:

$$
p_u = \min\!\left(\frac{255}{\max_t q_t}\, q_u,\; 255\right) \tag{3.5}
$$

we tracked, and Figure 3.6 (e)(f) are the probability images of (b)(c), respectively.

Figure 3.6 (a) Original image and ROI (b) Tracked frame #1 (c) Tracked frame #2 (d) Trajectory of the ROI (e) Probability image of #1 (f) Probability image of #2
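A small NumPy sketch of Eqs. (3.4) and (3.5): the hue value of a pixel is used directly as its bin index c(x*), the histogram is rescaled to [0, 255], and back-projection becomes a simple table lookup. This mirrors what cv2.calcBackProject does internally; the bin count and data types are assumptions.

```python
import numpy as np


def hue_histogram(roi_hue, m=180):
    """Unweighted m-bin hue histogram q_u of the selected target region, Eq. (3.4)."""
    # np.bincount counts how many ROI pixels fall into each hue bin
    return np.bincount(roi_hue.ravel(), minlength=m).astype(np.float64)


def backproject(frame_hue, q):
    """Rescale the histogram with Eq. (3.5) and look up every pixel's probability."""
    p = np.minimum(255.0 * q / q.max(), 255.0)              # Eq. (3.5)
    return p[frame_hue.astype(np.int32)].astype(np.uint8)   # ROI probability image
```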

3.2.4 Center of Mass Calculation

Image moments are global image descriptors and have previously been applied to the shape analysis of binary images. Because the moments of an image are sums over all of its pixels, they are robust against small pixel value changes. The centroid (mean location) within the search window of the discrete probability image is computed using moments [35].

Given I(x, y) is the intensity of the discrete probability image at (x, y), the Mean Shift location can be found as follows:

(1) Compute the zeroth-order moment

$$
M_{00} = \sum_{x}\sum_{y} I(x, y) \tag{3.6}
$$

(2) Find the first-order moments for x and y

$$
M_{10} = \sum_{x}\sum_{y} x\, I(x, y) \tag{3.7}
$$

$$
M_{01} = \sum_{x}\sum_{y} y\, I(x, y) \tag{3.8}
$$

(3) Compute the mean search window location (center of mass)

$$
(x_c,\; y_c) = \left(\frac{M_{10}}{M_{00}},\; \frac{M_{01}}{M_{00}}\right) \tag{3.9}
$$

where $M_{mn}$ denotes the moment of order $m$ in the x direction and order $n$ in the y direction. By definition, $M_{00}$ is the area of a two-dimensional object. Using these equations, we reduce the pixel data to a single object position.

The area of the target in the image is reflected by the moments. The color probability distribution is a discrete gray-level image whose maximum value is 255, so the relation between the search window size $S$ and 255 is

$$
S = 2\sqrt{\frac{M_{00}}{256}} \tag{3.10}
$$

We also need to set the ratio of window width to window length according to the probability distribution of interest in the search window. For our object tracking, the window width and window length are scaled from $S$ according to the width and height of the target region selected by the user.
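Eqs. (3.6)-(3.10) reduce to a few NumPy reductions, as in the sketch below. The aspect-ratio handling in the last two lines is only one plausible reading of the width/length scaling described above, not the thesis's exact formula.

```python
import numpy as np


def camshift_step(prob, target_height, target_width):
    """Centroid and next search-window size from a probability image, Eqs. (3.6)-(3.10).
    `prob` is the gray-level probability image inside the current search window;
    `target_height`/`target_width` are the dimensions of the user-selected target."""
    prob = prob.astype(np.float64)
    ys, xs = np.mgrid[0:prob.shape[0], 0:prob.shape[1]]

    m00 = prob.sum()                          # Eq. (3.6), zeroth-order moment
    if m00 == 0:
        return None                           # no probability mass: target lost
    m10 = (xs * prob).sum()                   # Eq. (3.7)
    m01 = (ys * prob).sum()                   # Eq. (3.8)
    xc, yc = m10 / m00, m01 / m00             # Eq. (3.9), center of mass

    s = 2.0 * np.sqrt(m00 / 256.0)            # Eq. (3.10), base window size
    # Keep the aspect ratio of the user-selected target (one plausible choice)
    longer = max(target_height, target_width)
    win_w = s * target_width / longer
    win_h = s * target_height / longer
    return (xc, yc), (win_w, win_h)
```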


3.3 Beat Detection and Analysis

After the user-defined target tracking module extracts the target location, the next step is the beat detection and analysis module. Compared with exact hand sign language recognition, the detection of beat events for analyzing musical time patterns and tempo does not have to be as accurate [39]. Whereas in sign language even a slight posture or hand movement carries an independent and important meaning, in conducting gestures only salient features, such as the beat transition points of the hand motion, carry the most important information.

Figure 3.7 The trajectories and the approximated directions of each conducting pattern (solid line: trajectory, dashed line: motion direction, red circle: direction change point)

Figure 3.7 illustrates the representative features, called direction change points (DCPs, red circles), of each musical time pattern. By detecting these points, we can detect beat events more easily. That is, we calculate the motion of the consecutively extracted target regions and generate the gesture feature, the direction change point, which is the point where the motion direction changes suddenly.


In order to detect every DCP exactly when it happens, we compute an approximation of the k-curvature, which is defined by the angle between the two vectors [P(i-k), P(i)] and [P(i), P(i+k)], where k is a constant and P(i) = (x(i), y(i)) is the list of trajectory points. When the angle between these two vectors is less than the threshold we define, we infer a direction change point, which is also a beat event in the conducting gesture.

Figure 3.8 Another representation of conducting patterns

Based on another conducting style [25], the beat always occurs at the locally lowest points of the trajectory, which we call local-minimum points. Typical 2-beat, 3-beat, and 4-beat conducting patterns are shown in Figure 3.8. A beat event is defined to be a local minimum of the vertical component of the target's trajectory. We evaluate the trajectory of the target and seek local minima of its vertical component y(t). Thus, we obtain the current beat timing and the tempo of the playing music via these two methods.
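A minimal sketch of the local-minimum rule, assuming y is a plain Python list of vertical coordinates produced by the tracking module and that a beat is declared when y(t) is strictly smaller than its neighbors in a small window. The window size is illustrative, and if the image coordinate convention has y growing downward, the comparison must be flipped to a local maximum.

```python
def local_minimum_beats(y, window=3):
    """Return frame indices t where the vertical component y(t) of the target
    trajectory is a local minimum (a beat event, as defined above).
    `window` is the number of frames checked on each side (illustrative value)."""
    beats = []
    for t in range(window, len(y) - window):
        neighborhood = y[t - window:t + window + 1]
        # Strict local minimum: y[t] is the unique smallest value in the neighborhood
        if y[t] == min(neighborhood) and neighborhood.count(y[t]) == 1:
            beats.append(t)
    return beats
```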

3.3.1 K-curvature Algorithm

The k-curvature algorithm was proposed in [40] and has been widely used in the field of chain-code corner detection. In this thesis, we use this method to measure the angle between two consecutive motion vectors. The details of this algorithm are described as follows [41]:

A point $p_i$ on the curve is defined in terms of the chain-code vectors $v_j$ as

$$
p_i = p_0 + \sum_{j=1}^{i} v_j
$$

where $p_0$ is the initial point on the trajectory and $v_j$ is a vector from the set {(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)}. The k-curvature $\theta_i$ at $p_i$ is given by

$$
\theta_i = \cos^{-1}\!\left(\frac{V_a \cdot V_b}{\lVert V_a \rVert\, \lVert V_b \rVert}\right)
$$

where $V_a = \sum_{j=i-k+1}^{i} w_j v_j$ and $V_b = \sum_{j=i+1}^{i+k} w_j v_j$. We choose the weights $w_j$ to be 1. For this case, the angle is illustrated in Figure 3.9.

Figure 3.9 Angle measured by the k-curvature algorithm

More specifically, we define the object trajectory as a time-sequential list $P = \{(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t)\}$, where $t$ is the number of frames we have finished tracking and $\theta$ is the angle we want to measure. We also define two vectors:

$$
V_1 = (x_{t-k} - x_t,\; y_{t-k} - y_t) \tag{3.11}
$$

$$
V_2 = (x_{t+k} - x_t,\; y_{t+k} - y_t) \tag{3.12}
$$

$$
\begin{cases}
V_1 \cdot V_2 > 0 & \Rightarrow\; 0^{\circ} \le \theta < 90^{\circ}\\
V_1 \cdot V_2 = 0 & \Rightarrow\; \theta = 90^{\circ}\\
V_1 \cdot V_2 < 0 & \Rightarrow\; 90^{\circ} < \theta \le 180^{\circ}
\end{cases} \tag{3.13}
$$

So we can compute the inner product of $V_1$ and $V_2$, $V_1 \cdot V_2$, to determine the range of the angle $\theta$.
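The whole test of Eqs. (3.11)-(3.13) can be sketched as follows; k and the angle threshold are illustrative values, and the indexing assumes the trajectory is processed as a complete list of points.

```python
import math


def k_curvature_dcp(points, k=5, angle_threshold_deg=90.0):
    """Detect direction change points on a trajectory with the k-curvature test,
    Eqs. (3.11)-(3.13). `points` is the trajectory list P = [(x1, y1), ...];
    k and the angle threshold are illustrative values."""
    dcp = []
    for t in range(k, len(points) - k):
        x, y = points[t]
        v1 = (points[t - k][0] - x, points[t - k][1] - y)   # Eq. (3.11)
        v2 = (points[t + k][0] - x, points[t + k][1] - y)   # Eq. (3.12)

        dot = v1[0] * v2[0] + v1[1] * v2[1]
        n1 = math.hypot(*v1)
        n2 = math.hypot(*v2)
        if n1 == 0 or n2 == 0:
            continue

        # Angle between the two motion vectors; a small angle means the
        # trajectory doubles back on itself, i.e. a sharp direction change
        theta = math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))
        if theta < angle_threshold_deg:                     # beat event candidate
            dcp.append(t)
    return dcp
```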
