
Chapter 2 Literature Review

2.3   Background of Object Tracking

2.3.2   Motion Tracking

Tracking algorithms usually overlap with motion detection during processing. Although many studies have tried to deal with the motion tracking problem, existing techniques are still not robust enough for stable tracking.

Tracking methods are divided into four major categories: region-based tracking, active contour-based tracking, feature-based tracking, and model-based tracking.


(1) Region-Based Tracking

Region-based tracking algorithms track objects according to variations of the image regions corresponding to moving objects. In these algorithms, the background image is maintained dynamically, and motion regions are detected by subtracting the background from the current image. They work well in scenes containing only a few objects, but they may not handle occlusion between objects reliably. So they cannot satisfy the requirements in a cluttered background or with multiple moving objects.

(2) Active Contour-Based Tracking

Active contour-based tracking algorithms track objects by representing their outlines as bounding contours and updating these contours dynamically in successive frames. In contrast to region-based algorithms, these algorithms describe objects more simply and more effectively and reduce computational complexity. Even under partial occlusion, these algorithms may track objects continuously. However, a difficulty is that they are highly sensitive to the initialization of tracking, making it more difficult to start tracking automatically.

(3) Feature-Based Tracking

Feature-based tracking algorithms track objects by extracting elements, clustering them into higher-level features, and then matching the features between images. There are many features that can help in tracking objects, such as edges, corners, color distribution, skin tone, and human eyes. However, the recognition rate of objects based on 2D image features is low, and the ability to deal effectively with occlusion is generally poor.

(4) Model-Based Tracking

Model-based tracking algorithms track objects by matching projected object models, produced with prior knowledge, to image data. The models are constructed off-line with manual measurement or other computer vision techniques. Compared with other tracking algorithms, these algorithms can obtain better results even under occlusion (including self-occlusion for humans) or interference between nearby image motions. Inevitably, model-based tracking algorithms also have disadvantages, such as the necessity of constructing the models and high computational cost.


Chapter 3 Vision-based Conductor Gesture Tracking

3.1 Overview

Most conductor gesture tracking systems presented previously focused on technical issues and did not describe how to organize a framework into a complete HCI. The framework we present here is based on tracking a target defined by the user, and its output is the timing of musical beats obtained after analysis. Inspired by face tracking approaches, we track the center of the target, so the position data must be obtained by a real-time algorithm. After acquiring the successive position data, our algorithm detects the time points at which the target changes its direction. The proposed framework can be divided into two independent modules, the CAMSHIFT tracking module and the beat detection and analysis module. The diagram of our system is shown in Figure 3.1, and the details of these two modules are discussed in Sections 3.2 and 3.3.

(1) CAMSHIFT Tracking Module

We create the ROI (Region of Interest) probability map using a 1D histogram of the Hue channel of the Hue Saturation Value (HSV) color system, which is obtained by converting the standard RGB color space. We then apply the Histogram Back-projection algorithm to compute the ROI probability map. The major consideration is the correct detection rate of the moving target, which is a critical issue for the following module.

(2) Beat Detection and Analysis Module

Our system uses the movement of the target detected by the previous module to determine changes of direction. After the user selects the WAV file and other parameters, the system also displays the beat detection results on the visualized waveform of the WAV file and calculates the precision and recall rates.

Figure 3.1 Diagram of the framework we proposed (CAMSHIFT tracking module and beat detection and analysis module)


The system is designed in this way so that all algorithms within one module can be changed at will without affecting the functionality of the other module.

3.2 User-defined Target Tracking Using CAMSHIFT

The user-defined target tracking module is the first stage of the proposed system; it separates the target from the background and extracts position information. In our system, we apply the CAMSHIFT (Continuously Adaptive Mean Shift) algorithm [35][36], an efficient and simple colored-object tracker, as the kernel of this module.

The target area is sampled by the user with the mouse, so the target can be the user's head, hand, baton, or any other object whose color differs from the background.

Using the Histogram Back-projection method, we measure the characteristics of the target we are interested in to build the ROI probability model, which is also the first step of this module. At the same time, the color distribution derived from the video changes over time, so we utilize the Mean Shift algorithm to calculate the center of mass of the color probability within its 2D search window, re-center the window, and then calculate the window size for the next iteration, until convergence (or until the mean location moves less than a threshold, which means there is no significant shift). The details are described as follows.


3.2.1 The CAMSHIFT Algorithm

In recent years, the Continuously Adaptive Mean Shift (CAMSHIFT) tracking algorithm has attracted attention because of its practicality and robustness. For object tracking, CAMSHIFT is an adaptation of the Mean Shift algorithm, a robust non-parametric technique that climbs the gradient of a probability distribution to find the mode (peak) of the distribution [37].

The primary difference between CAMSHIFT and the Mean Shift algorithm is that CAMSHIFT uses continuously adaptive probability distributions (that is, the distribution may be recomputed at each frame), while Mean Shift is based on static distributions, which are not updated unless the target experiences significant changes in shape, size, or color [36].

For each video frame, the raw image is converted to a probability distribution image via a color histogram model of the color being tracked. We use the zeroth- and first-order moments of the target color probability model to compute the centroid of an area of high probability, which is the main idea of the Mean Shift algorithm. The center and size of the colored object are thus found by the Mean Shift operation.

The current size and location of the tracked object are reported and used to set the size and location of the search window in the next frame. This process is repeated for continuous tracking [35].


The procedure can be summarized in the following steps [35]:

1. Set the region of interest (ROI) of the probability distribution image (PDF) to the entire image.

2. Select an initial location of the Mean Shift search window.

3. Calculate a color probability distribution of the region centered at the Mean Shift search window.

4. Iterate the Mean Shift algorithm to find the centroid of the probability image. Store the zeroth moment (distribution area) and centroid location.

5. For the following frame, center the search window at the mean location found in Step 4 and set the window size to a function of the zeroth moment. Go to Step 3.

Figure 3.2 The CAMSHIFT algorithm

Because CAMSHIFT tracks the moving object by its color feature, which changes very little while the object is moving, it is a robust method.

Moreover, much time is saved because the moving object is searched for only around the position where it is likely to appear, so CAMSHIFT is also suitable for real-time use.
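To make the five-step procedure above concrete, the following is a minimal sketch of a CAMSHIFT-based color tracker using OpenCV's cv2.calcHist, cv2.calcBackProject, and cv2.CamShift. It is an illustrative sketch only, not the thesis implementation (which was built in C++ Builder); the video file name, ROI coordinates, histogram bin count, and mask thresholds are placeholder assumptions.

```python
# Illustrative sketch (not the thesis implementation): CAMSHIFT tracking of a
# user-selected color target with OpenCV.  File name and ROI are placeholders.
import cv2
import numpy as np

cap = cv2.VideoCapture("conducting.avi")      # hypothetical input video
ok, frame = cap.read()

# Steps 1-2: user-defined ROI and initial search window (placeholder values).
x, y, w, h = 150, 100, 40, 40
roi = frame[y:y + h, x:x + w]

# Build the 1D hue histogram of the ROI, ignoring dark / unsaturated pixels.
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, np.array((0, 60, 32)), np.array((180, 255, 255)))
hist = cv2.calcHist([hsv_roi], [0], mask, [16], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

track_window = (x, y, w, h)
# Stop Mean Shift after 10 iterations or when the shift is below 1 pixel.
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Step 3: back-project the hue histogram onto the current frame.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # Steps 4-5: CamShift finds the new centroid and adapts the window size.
    rot_box, track_window = cv2.CamShift(back_proj, track_window, term_crit)
    cx, cy = rot_box[0]                       # centroid used later for beat analysis
    print(cx, cy)

cap.release()
```

The per-frame centroid printed here is the position data that the beat detection and analysis module consumes.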

3.2.2 Color Space Used for ROI Probability Model

The ROI probability model, also called probability distribution image (PDF), can be determined by methods which associate a pixel value with a probability that the given pixel belongs to the target. In order to create the probability distribution image of the desired color, we first create a model using a color histogram.

The color model that we capture from an image stream is the standard Red, Green, and Blue (RGB) color model. However, the RGB color model is easily influenced by illumination. Hence, we convert from the RGB color model to the Hue Saturation Value (HSV) color model, because its hue (color) is separated from saturation (how concentrated the color is) and from brightness. The RGB color model is additive, defining a color in terms of the combination of primaries, whereas the HSV color model encapsulates the information about a color in terms that are more familiar to humans. Figure 3.3 shows the RGB and HSV color models respectively. Descending the V axis in Figure 3.3 (b) gives a smaller hexcone, corresponding to a smaller (darker) RGB subcube in Figure 3.3 (a).

Figure 3.3 RGB color cube (left) and HSV color hexcone (right)

The conversion from the RGB color model to the HSV one is given by the formulas below:

\[
H =
\begin{cases}
\dfrac{\pi}{3}\left(0 + \dfrac{g-b}{\max(r,g,b)-\min(r,g,b)}\right) & \text{if } r = \max(r,g,b)\\[2mm]
\dfrac{\pi}{3}\left(2 + \dfrac{b-r}{\max(r,g,b)-\min(r,g,b)}\right) & \text{if } g = \max(r,g,b)\\[2mm]
\dfrac{\pi}{3}\left(4 + \dfrac{r-g}{\max(r,g,b)-\min(r,g,b)}\right) & \text{if } b = \max(r,g,b)
\end{cases} \tag{3.1}
\]

\[
S = \frac{\max(r,g,b)-\min(r,g,b)}{\max(r,g,b)} \tag{3.2}
\]

\[
V = \max(r,g,b) \tag{3.3}
\]


where H, S, and V represent the hue, saturation, and value (brightness) components of the HSV color model, and r, g, and b represent the red, green, and blue components of the RGB color model.
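As an illustration, the conversion in Equations (3.1)-(3.3) can be transcribed directly into code. The sketch below assumes r, g, and b are normalized to [0, 1] and returns H in radians; the function name and the handling of gray pixels (hue set to 0) are our own illustrative choices, not part of the thesis implementation.

```python
# Sketch of Equations (3.1)-(3.3): convert one RGB pixel (values in [0, 1])
# to HSV, with H in radians.  The function name is illustrative only.
import math

def rgb_to_hsv(r, g, b):
    mx, mn = max(r, g, b), min(r, g, b)
    v = mx                                    # Equation (3.3)
    s = 0.0 if mx == 0 else (mx - mn) / mx    # Equation (3.2)
    if mx == mn:                              # gray pixel: hue undefined, use 0
        h = 0.0
    elif mx == r:                             # Equation (3.1), first case
        h = (math.pi / 3) * (0 + (g - b) / (mx - mn))
    elif mx == g:                             # second case
        h = (math.pi / 3) * (2 + (b - r) / (mx - mn))
    else:                                     # third case, mx == b
        h = (math.pi / 3) * (4 + (r - g) / (mx - mn))
    return h % (2 * math.pi), s, v

print(rgb_to_hsv(1.0, 0.5, 0.0))   # an orange pixel
```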

To generate the PDF, an initial histogram is computed from the initial ROI of the filtered image. We create our color model by taking a 1D histogram of the H (Hue) channel in HSV space. When sampling is complete, the histogram is used as a lookup table to convert incoming video pixels into the corresponding ROI probability image.

When using real cameras with discrete pixel values, a problem can occur in HSV space. When brightness is low (V near 0), saturation is also low (S near 0), and the hue value becomes quite noisy, since in such a small hexcone the small number of discrete hue values cannot adequately represent slight changes in RGB. To overcome this problem, we simply ignore hue pixels that have very low brightness values. In our system, we initially set the minimum brightness value of the target as the lower threshold of the brightness value. Users can also adjust these two threshold values themselves.
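The following sketch illustrates one way to build the ROI hue histogram while discarding pixels whose brightness falls below Vmin (and whose saturation is very low), so that the histogram can serve as the lookup table described above. The threshold values, bin count, and dummy ROI are placeholders, not the settings used by our system.

```python
# Sketch: build the ROI hue histogram while ignoring pixels whose brightness V
# is below a user-adjustable threshold (Vmin).  Values below are placeholders.
import cv2
import numpy as np

def roi_hue_histogram(roi_bgr, v_min=108, s_min=30, bins=16):
    hsv = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2HSV)
    # Keep only pixels with enough brightness and saturation; dark pixels have
    # unreliable hue values and would add noise to the color model.
    mask = cv2.inRange(hsv, np.array((0, s_min, v_min)), np.array((180, 255, 255)))
    hist = cv2.calcHist([hsv], [0], mask, [bins], [0, 180])
    # Rescale to [0, 255] so the histogram can be used directly as a lookup table.
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

roi = np.zeros((40, 40, 3), dtype=np.uint8)   # dummy ROI: a pure red patch
roi[..., 2] = 255
print(roi_hue_histogram(roi).ravel())
```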

Figure 3.4 (a) The Original image and ROI (b) The Probability Map with Vmin = 0 (c) The Probability Map with Vmin = 108


3.2.3 Histogram Back-projection

Histogram Back-projection is used to answer the question "Where are the colors that belong to the object we are looking for (the target)?" Back-projecting the target histogram onto any subsequent video frame generates a probability image in which the value of each pixel characterizes the probability that the input pixel belongs to the histogram that was used.

Suppose we have a target image and try to locate it in a cluttered image. We use the 1D hue histogram of the target image calculated in the last section. Given that m-bin histograms are used, we define the n target pixel locations x_i*, i = 1, ..., n, and the histogram values q_u, u = 1, ..., m, of its elements. We also define a function c : {x_i*} → {1, ..., m} that associates to the pixel at location x_i* the histogram bin index c(x_i*).

The unweighted histogram is computed as:

\[
q_u = \sum_{i=1}^{n} \delta\!\left(c(x_i^*) - u\right) \tag{3.4}
\]

where \(\delta(x)\) is the unit impulse function.

Using the formula above, we can derive the unweighted histogram, which is also the probability distribution model q of the target.

The formula of histogram normalization is:

\[
p_u = \min\!\left(\frac{255}{\max_{t} q_t}\, q_u,\; 255\right) \tag{3.5}
\]
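A small NumPy sketch may clarify how Equations (3.4) and (3.5) produce the lookup table and how back-projection turns it into a probability image. The bin count, function names, and the randomly generated hue planes are illustrative assumptions, not data from our system.

```python
# Sketch of Equations (3.4)-(3.5): unweighted hue histogram of the target and
# its back-projection onto a frame.  Array names and bin count are illustrative.
import numpy as np

M_BINS = 16                       # m-bin histogram over hue values 0..179

def bin_index(hue):
    """The function c(.): map a hue value to its histogram bin index."""
    return (hue.astype(np.int32) * M_BINS) // 180

def target_histogram(target_hue):
    # Equation (3.4): q_u = sum_i delta(c(x_i*) - u), i.e. count pixels per bin.
    q = np.bincount(bin_index(target_hue).ravel(), minlength=M_BINS)
    # Equation (3.5): rescale so the most frequent bin maps to 255.
    return np.minimum(255.0 * q / max(q.max(), 1), 255).astype(np.uint8)

def back_project(frame_hue, p):
    # Each pixel receives the rescaled histogram value of its own hue bin.
    return p[bin_index(frame_hue)]

target = np.random.randint(0, 180, (40, 40))     # dummy target hue plane
frame = np.random.randint(0, 180, (240, 320))    # dummy frame hue plane
prob_image = back_project(frame, target_histogram(target))
print(prob_image.shape, prob_image.max())
```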


Figure 3.6 (d) shows the trajectory of the ROI we tracked, and Figure 3.6 (e)(f) are the probability images of (b)(c) respectively.

Figure 3.6 (a) Original image and ROI (b) Tracked frame #1 (c) Tracked frame #2 (d) Trajectory of the ROI (e) Probability image of #1 (f) Probability image of #2

3.2.4 Center of Mass Calculation

Image moments are global image descriptors and have previously been applied to the shape analysis of binary images. The moments of an image sum over all pixels, so they are robust against small changes in pixel values. The centroid (mean location) within the search window of the discrete probability image is computed using moments [35].

Given I(x, y) is the intensity of the discrete probability image at (x, y), the Mean Shift location can be found as follows:

(1) Compute the zeroth-order moment

\[
M_{00} = \sum_{x}\sum_{y} I(x, y) \tag{3.6}
\]


(2) Find the first-order moments for x and y

\[
M_{10} = \sum_{x}\sum_{y} x\, I(x, y) \tag{3.7}
\]

\[
M_{01} = \sum_{x}\sum_{y} y\, I(x, y) \tag{3.8}
\]

(3) Compute the mean search window location (center of mass)

\[
(x_c,\, y_c) = \left(\frac{M_{10}}{M_{00}},\; \frac{M_{01}}{M_{00}}\right) \tag{3.9}
\]

where M_mn denotes the moment of order m in the x direction and order n in the y direction. By definition, M_00 is the area of a two-dimensional object. Using these equations, we can reduce the pixel data to an object position.

The area of the target in the image is reflected by the moments. The color probability distribution is a discrete gray-level image whose maximum is 255; therefore, the relation between the search window size S and 255 is

\[
S = 2\sqrt{\frac{M_{00}}{256}} \tag{3.10}
\]

We also need to set the ratio of window width to window height according to the probability distribution of interest in the search window. For our object tracking, we scale the window width and window height from S in proportion to the width and height of the target selected by the user.
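The sketch below runs one iteration of the moment computation in Equations (3.6)-(3.10) on a back-projected probability image and resizes the window while keeping the target's aspect ratio. The array sizes, the dummy blob, and the exact rounding and aspect handling are illustrative assumptions rather than the thesis implementation.

```python
# Sketch of Equations (3.6)-(3.10): centroid and adaptive window size from the
# moments of the probability image inside the search window.  Names illustrative.
import numpy as np

def camshift_step(prob_image, window):
    x0, y0, w, h = window
    patch = prob_image[y0:y0 + h, x0:x0 + w].astype(np.float64)
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]

    m00 = patch.sum()                    # Equation (3.6), zeroth-order moment
    if m00 == 0:
        return window                    # nothing to track inside the window
    m10 = (xs * patch).sum()             # Equation (3.7)
    m01 = (ys * patch).sum()             # Equation (3.8)
    xc = x0 + m10 / m00                  # Equation (3.9), centroid in image coords
    yc = y0 + m01 / m00

    s = 2.0 * np.sqrt(m00 / 256.0)       # Equation (3.10), new window size
    new_w = max(int(round(s)), 2)
    new_h = max(int(round(s * h / w)), 2)      # keep the target's aspect ratio
    return (int(round(xc - new_w / 2)), int(round(yc - new_h / 2)), new_w, new_h)

prob = np.zeros((240, 320), dtype=np.uint8)
prob[100:140, 150:190] = 255             # a dummy blob of high probability
print(camshift_step(prob, (140, 90, 60, 60)))
```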


3.3 Beat Detection and Analysis

After extracting the target location with the user-defined target tracking module, the next step is the beat detection and analysis module. Compared with exact hand sign language recognition, the detection of beat events for analyzing musical time patterns and tempo does not have to be as accurate [39]. Whereas in hand sign language even a slight posture or hand movement carries an independent and important meaning, in conducting gestures only salient features, such as the beat transition points of the hand motion, carry the important information.

Figure 3.7 The trajectories and the approximated directions of each conducting pattern (solid line: trajectory, dashed line: motion direction, red circle: direction change point)

Figure 3.7 illustrates the representative features, called direction change points (red circles), of each musical time pattern. By detecting these direction change points (DCPs), we can detect beat events more easily. That is, we calculate the motion between consecutively extracted target regions and generate the gesture feature, the direction change point, which is the point where the motion direction changes suddenly.


In order to measure exactly when every DCP happens, we compute an approximation of the k-curvature, which is defined by the angle at P(i) between the two segments [P(i−k), P(i)] and [P(i), P(i+k)], where k is a constant and P(i) = (x(i), y(i)) is the list of trajectory points.

When the angle between these two vectors is less than the threshold we define, we infer a direction change point, which is also a beat event in the conducting gesture.

Figure 3.8 Another representative set of conducting patterns

Based on another conducting style [25], the beat always occurs at the locally lowest points of the trajectory, which we call local-minimum points. Typical 2-beat, 3-beat, and 4-beat conducting patterns are shown in Figure 3.8. A beat event is defined to be a local minimum of the vertical component of the target's trajectory. We evaluate the trajectory of the target and seek a local minimum of its vertical component, y(t). Thus, we obtain the current beat timing and the tempo of the playing music via these two methods.

3.3.1 K-curvature Algorithm

The k-curvature algorithm was proposed in [40] and has been widely used in the field of chain-code corner detection. In this thesis, we use this method to measure the angle between two consecutive motion vectors. The details of this algorithm are described as follows [41]:

A point p_i on the curve is defined in terms of the chain-code vectors v_j as

\[
p_i = p_0 + \sum_{j=1}^{i} v_j
\]

where p_0 is the initial point on the trajectory and v_j is a vector from the set {(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)}. The k-curvature θ_i at p_i is given by

\[
\theta_i = \cos^{-1}\!\left(\frac{V_{ik} \cdot V_{ki}}{\lVert V_{ik} \rVert\, \lVert V_{ki} \rVert}\right)
\]

where \(V_{ik} = \sum_{j=i-k+1}^{i} w_j v_j\) and \(V_{ki} = \sum_{j=i+1}^{i+k} w_j v_j\). We chose the weights w_j to be 1.

For this case, the angle is illustrated in Figure 3.9.

Figure 3.9 Angle measured by the k-curvature algorithm

More specifically, we define the object trajectory as a time-sequential list P = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_t, y_t)}, where t is the number of the frame at which we finished tracking and θ is the angle we want to measure. We also define two vectors:


\[
V_1 = \left(x_{t-2k} - x_{t-k},\; y_{t-2k} - y_{t-k}\right) \tag{3.11}
\]

\[
V_2 = \left(x_{t} - x_{t-k},\; y_{t} - y_{t-k}\right) \tag{3.12}
\]

\[
\begin{aligned}
V_1 \cdot V_2 > 0 &\;\Rightarrow\; 0^\circ \leq \theta < 90^\circ\\
V_1 \cdot V_2 = 0 &\;\Rightarrow\; \theta = 90^\circ\\
V_1 \cdot V_2 < 0 &\;\Rightarrow\; 90^\circ < \theta \leq 180^\circ
\end{aligned} \tag{3.13}
\]

So we can compute the inner product of V_1 and V_2, V_1 · V_2, to decide the angle between these two vectors. Because we cannot obtain trajectory information from future frames, what we can do at frame t is decide whether the point (x_{t-k}, y_{t-k}) is a direction change point or not. That means we need to allow a detection delay of k frames. To reduce the impact of the detection delay, we use 1 as the k factor in this thesis.
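A minimal sketch of this DCP test is given below, assuming the vector definitions of Equations (3.11)-(3.13): at frame t, the candidate point is the one at frame t−k, and a small angle between the two vectors emanating from it indicates a sharp turn. The 90° threshold and the dummy trajectory are illustrative choices, not values fixed by the thesis.

```python
# Sketch of the direction-change-point (DCP) test from Equations (3.11)-(3.13):
# with detection delay k, decide at frame t whether the point at frame t-k was
# a DCP.  Trajectory values and the angle threshold are illustrative.
import math

def is_direction_change(traj, t, k=1, angle_threshold_deg=90.0):
    """traj is a list of (x, y) target positions; t indexes the current frame."""
    if t - 2 * k < 0:
        return False
    px, py = traj[t - k]                                   # candidate point
    v1 = (traj[t - 2 * k][0] - px, traj[t - 2 * k][1] - py)  # Eq. (3.11): back toward t-2k
    v2 = (traj[t][0] - px, traj[t][1] - py)                  # Eq. (3.12): forward toward t
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return False                                       # no motion, no angle to measure
    dot = v1[0] * v2[0] + v1[1] * v2[1]                    # sign decides the case in Eq. (3.13)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))
    # A sharp turn makes the two emanating vectors point in similar directions,
    # so the angle between them is small: below the threshold means a DCP.
    return angle < angle_threshold_deg

# Dummy trajectory: the hand moves down, turns at frame 2, then moves back up.
trajectory = [(0, 0), (1, 5), (2, 10), (3, 6), (4, 1)]
print([is_direction_change(trajectory, t) for t in range(len(trajectory))])
```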

3.3.2 Local-Minimum Algorithm

Figure 3.8 also confirms the fact that a beat always directly corresponds with a downward-to-upward change in direction. While the y-axis data can be searched for a change in direction from downward-to-upward movement, the x-axis data we collected can be completely ignored.

We assume that the y-axis scale increases from top to bottom, as in image coordinates. Thus, if the y-axis waveform is rising, subtracting the previous y position value from the current y position value produces a negative value. If the waveform is unchanging, the result is zero. If the waveform is falling, this produces a positive value. Mathematically, this can be expressed as:


\[
\operatorname{sign}(t) =
\begin{cases}
-1 & \text{if } y(t) - y(t-1) < 0\\
+1 & \text{if } y(t) - y(t-1) > 0
\end{cases} \tag{3.14}
\]

According to the reference scale used, a maximum is characterized by a change of sign(t) from negative to positive. Using the same idea, a minimum is characterized by a change of sign(t) from positive to negative. Since the beat directly corresponds with the minima, the system only has to search for a change of sign(t) from positive to negative. A minimum cannot be detected until the first rising point after the minimum has been acquired, so a slight delay will always be present.
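A minimal sketch of this local-minimum detector, built on Equation (3.14), is shown below. The dummy y-trajectory is illustrative only; the function reports the beat one frame late, matching the unavoidable delay described above.

```python
# Sketch of the local-minimum beat detector built on Equation (3.14): a beat is
# reported when sign(t) changes from positive (falling) to negative (rising),
# i.e. at the first rising frame after the lowest point.  Data are illustrative.
def detect_beats(y_positions):
    """y_positions: vertical target positions per frame (y grows downward)."""
    beats = []
    prev_sign = 0
    for t in range(1, len(y_positions)):
        diff = y_positions[t] - y_positions[t - 1]
        sign = -1 if diff < 0 else (1 if diff > 0 else prev_sign)   # Equation (3.14)
        # Positive-to-negative transition: the hand has just started moving up
        # again, so the previous frame contained the local minimum (the beat).
        if prev_sign > 0 and sign < 0:
            beats.append(t)              # reported one frame late by construction
        prev_sign = sign
    return beats

# Dummy vertical trajectory of a 2-beat pattern (two down-up strokes).
y = [10, 20, 35, 50, 42, 30, 22, 30, 45, 60, 48, 33]
print(detect_beats(y))    # beat events detected at the first rising frames
```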


Chapter 4 Experimental Results and Discussions

4.1 Overview

We designed a conductor gesture tracking system to help users train their conducting gestures. The system, which realizes the algorithms described in the previous chapter, tracks the target and detects beat events in order to calculate the corresponding correctness rates.

Figure 4.1 A snapshot of our conductor gesture tracking system


Figure 4.1 shows a snapshot of our conductor gesture tracking system. In our experiments, the video inputs are captured from a Logitech QuickCam webcam at a resolution of 320×240 pixels, and the music files are sampled from compact discs: 10 constant-tempo songs with different beats per minute (BPM). The BPMs of the music range from 60 to 124, and the lengths of the songs range from 54 s to 85 s.

All the software of our real-time conductor gesture tracking system runs on a personal computer with 1 GB of RAM, using CodeGear C++ Builder 2007 on Windows XP Home SP2. The throughput obtained is 5 to 8 frames per second. In Sections 4.2 and 4.3, we explain our experimental procedure and then present the results and analysis.

Combining Figure 3.1 and Figure 4.1, we can see the relation between the processing procedure and the user interface, shown in Figure 4.2. Our system tracks a target from an image sequence and provides a robust tracking result. However, computer vision-based tracking is extremely difficult in an unconstrained environment, and many situations may affect the accuracy of tracking, such as interference from other objects of the same color in the background, varying illumination conditions, and tracking targets that are too small. In order to simplify tracking, we assume there is no other object of the same color to interfere with the system.

After choosing the tracking target, our system starts producing the probability map in the upper-right panel of the system UI. On top of the probability map image, we can also see the centroid and rectangular region of the target tracked via the CAMSHIFT algorithm. At the same time, we can see the trajectory of the target and the DCPs in the lower-right panel of the system UI.

Figure 4.2 Experimental Flowchart

To demonstrate the beat events in real time, we draw the time information of each detected beat event on the waveform of the WAV music file. We define the correct interval as Correct_Time_i − RT ≤ T ≤ Correct_Time_i + RT, where T is the time at which we detected a beat event, Correct_Time_i, which comes from the ground truth (via calculation at the given BPM or from a manually annotated file), is the correct time of beat event i, and RT is the tolerance that defines the width of the correct interval.
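One plausible way to turn this tolerance rule into precision and recall figures is sketched below; the greedy matching strategy, the tolerance RT = 0.1 s, and the beat times are our own illustrative assumptions, not the thesis implementation or its data.

```python
# Sketch (one plausible implementation, not the thesis code): count a detected
# beat as correct when it falls within +/- RT of an unmatched ground-truth beat,
# then compute precision and recall.  Times and RT below are placeholders.
def beat_precision_recall(detected, ground_truth, rt=0.1):
    matched_truth = set()
    correct = 0
    for t in detected:
        for i, ct in enumerate(ground_truth):
            if i not in matched_truth and abs(t - ct) <= rt:
                matched_truth.add(i)     # each true beat may be matched only once
                correct += 1
                break
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall

detected_beats = [0.52, 1.48, 2.55, 3.90]        # seconds, hypothetical output
true_beats = [0.50, 1.50, 2.50, 3.50]            # hypothetical ground truth at 120 BPM
print(beat_precision_recall(detected_beats, true_beats, rt=0.1))
```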
