
Chapter 2. Related Work

2.1 Object Tracking

2.1.1 Object Detection

Before tracking objects, we have to extract them either in every frame or when they first appear in the video. We therefore present object detection methods before discussing object tracking algorithms. Object detection methods can be classified into four categories: point detectors, segmentation, background subtraction, and supervised learning [13]. Table 2.1 shows the four categories and their representative work.


Table 2.1: Object detection categories and representative work [13].

Point detectors: Moravec's operator [14], Harris detector [15], SIFT [16]
Segmentation: Mean-shift [18], Normalized cuts [19]
Background modeling: Mixture of Gaussians [21], Eigenbackground [22], Wallflower [23], Dynamic texture background [24]
Supervised classifiers: Support Vector Machines [25], Neural networks [26], Adaptive boosting [27]

Point detectors are used to find points of interest in images which have an expressive texture in their respective regions. To find points of interest, Moravec's operator [14] computes the variation of the image intensities within a 4-by-4 window in the horizontal, vertical, diagonal, and anti-diagonal directions, and then chooses the minimum of the four variations as the representative value for the window. A point is declared interesting if its intensity variation is a local maximum in a 12-by-12 window. The Harris detector [15] computes the first-order image derivatives in the horizontal and vertical directions to emphasize the directional intensity variations, and then constructs a structure matrix 𝐒𝑚 over a small window around each pixel; a point is declared interesting when both eigenvalues of 𝐒𝑚 are large. To obtain interest points that are invariant to different transformations, Lowe introduced the SIFT (Scale Invariant Feature Transform) method [16], which is confirmed to outperform most point detectors and to be more tolerant to image deformations according to the survey by Mikolajczyk and Schmid [17].
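The structure-matrix idea behind the Harris detector can be sketched in plain NumPy. This is an illustrative simplification, not the exact published algorithm: the window size, the constant k, and the derivative filter are assumptions, and real detectors add Gaussian weighting and non-maximum suppression.

```python
import numpy as np

def harris_response(img, k=0.05, win=3):
    """Harris corner response R = det(S) - k * trace(S)^2 per pixel,
    where S is the structure matrix accumulated over a win-by-win window."""
    # First-order derivatives along rows (y) and columns (x).
    Iy, Ix = np.gradient(img.astype(float))
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    # Sum the derivative products over the local window (box filter).
    def box(a):
        out = np.zeros_like(a)
        h, w = a.shape
        r = win // 2
        for i in range(h):
            for j in range(w):
                out[i, j] = a[max(0, i - r):i + r + 1,
                              max(0, j - r):j + r + 1].sum()
        return out

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace

# A synthetic image with one bright square: its corners should score
# higher than points on the edges (one large eigenvalue) or in flat
# regions (both eigenvalues near zero).
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
```

On this toy image the response is positive at the square's corners, negative along its edges, and zero in flat regions, matching the eigenvalue interpretation above.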

The objects we are interested in are usually the moving objects in videos. Frame differencing is a typical method for extracting them and has been well studied since Jain and Nagel's work [28].
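The frame-differencing idea reduces to thresholding the absolute intensity change between two temporally adjacent frames; the threshold value below is an illustrative assumption:

```python
import numpy as np

def frame_difference(prev, curr, thresh=0.2):
    """Flag pixels whose intensity changes more than thresh between
    two temporally adjacent frames as moving (a minimal sketch)."""
    return np.abs(curr.astype(float) - prev.astype(float)) > thresh

# A static frame followed by one where a small patch lights up:
prev = np.zeros((10, 10))
curr = np.zeros((10, 10))
curr[3:6, 3:6] = 1.0
mask = frame_difference(prev, curr)
```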

However, differencing temporally adjacent frames cannot achieve robust results under some circumstances. Thus, background subtraction became popular: it builds a representation of the scene called the background model and regards any significant deviation of an image region from the background model as a moving object. Stauffer and Grimson [21] use a mixture of Gaussians to model the color of each pixel, and each pixel is classified based on whether its matched distribution represents the background process. Instead of modeling the variation of individual pixels, Oliver et al. introduce a holistic approach using eigenspace decomposition [22]. It first forms a background matrix 𝐁 of dimension 𝑘 × 𝑙 from 𝑘 input frames of dimension 𝑛 × 𝑚, where 𝑙 = 𝑛𝑚. The background is then represented by the most descriptive eigenvectors.
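The eigenbackground idea can be sketched as follows. This is a simplified reading of the approach: flatten the 𝑘 frames into a 𝑘 × 𝑛𝑚 matrix, take the leading eigenvectors of its centered version via SVD, and flag pixels that the eigenspace cannot reconstruct; the number of eigenvectors and the threshold are illustrative assumptions.

```python
import numpy as np

def eigenbackground(frames, num_eig=3):
    """Build an eigenspace background model from k frames of shape (n, m)."""
    B = np.stack([f.ravel() for f in frames])        # k x (n*m) matrix
    mean = B.mean(axis=0)
    # Principal eigenvectors of the centered background matrix via SVD.
    _, _, Vt = np.linalg.svd(B - mean, full_matrices=False)
    return mean, Vt[:num_eig]

def foreground_mask(frame, mean, basis, thresh=0.5):
    """Project a new frame onto the eigenspace, reconstruct the static
    background, and flag poorly reconstructed pixels as foreground."""
    x = frame.ravel() - mean
    recon = basis.T @ (basis @ x) + mean
    return (np.abs(frame.ravel() - recon) > thresh).reshape(frame.shape)

# Train on slightly noisy copies of a static background, then test on
# a frame containing a bright moving blob.
rng = np.random.default_rng(0)
background = np.linspace(0, 1, 64).reshape(8, 8)
frames = [background + 0.01 * rng.standard_normal((8, 8)) for _ in range(10)]
mean, basis = eigenbackground(frames)
test_frame = background.copy()
test_frame[2:5, 2:5] += 1.0
mask = foreground_mask(test_frame, mean, basis)
```

The eigenspace captures the static scene and sensor noise, so the blob survives as a large reconstruction residual while unchanged pixels do not.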

Segmentation algorithms partition an image into regions of reasonable homogeneity. The mean-shift method [18] finds clusters in the joint spatial-color space and extends to various other applications such as edge detection, image regularization [30], and tracking [31]. Shi and Malik [19] formulate image segmentation as a graph partitioning problem, where the vertices (pixels) are partitioned into disjoint subgraphs (regions), and overcome the tendency toward oversegmentation with their proposed normalized cut.
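The mode-seeking step of mean-shift can be sketched with a flat kernel. This is a toy version for intuition only: each point is repeatedly moved to the mean of all points within the bandwidth; real implementations use kernel profiles, convergence tests, and mode merging.

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, iters=50):
    """Move each point to the mean of its bandwidth-neighbourhood until
    it settles on a density mode (flat-kernel sketch)."""
    modes = points.astype(float).copy()
    for _ in range(iters):
        for i, p in enumerate(modes):
            near = points[np.linalg.norm(points - p, axis=1) <= bandwidth]
            modes[i] = near.mean(axis=0)
    return modes

# Two well-separated 2D clusters: points of a cluster should all
# converge to that cluster's mode.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
modes = mean_shift(points, bandwidth=1.0)
```

Points whose modes coincide form one cluster, which is how the segmentation groups pixels in the spatial-color space.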


Figure 2.1: Taxonomy of tracking methods [13].

2.1.2 Object Tracking

The goal of object tracking is to recover the trajectory of a specific object. Take our system for example: since we intend to identify which tactics are executed, we have to analyze how the players move. That is, we must track the players during the game in order to obtain their trajectories. Tracking algorithms can be classified into three main categories: point tracking, kernel tracking, and silhouette tracking. Figure 2.1 illustrates the taxonomy of tracking methods and Table 2.2 lists their most notable works.

Detected objects over a video clip can be represented by points, and point tracking finds the point correspondences across frames. Point tracking methods can be divided into two categories: deterministic and statistical methods. Deterministic methods define a cost of associating each object in the previous frame to a single object in the current frame and minimize the total cost subject to motion constraints such as those in Figure 2.2; the rigidity constraint, for example, assumes that objects in the 3D world are rigid, so the distance between any two points on the actual object will remain unchanged.
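A minimal deterministic association can be sketched with a greedy nearest-neighbour rule under the proximity constraint. This is a toy stand-in for the cost-minimization formulations surveyed above, not any particular published method.

```python
import numpy as np

def associate(prev_pts, curr_pts):
    """Greedily link the closest unmatched pair of points between two
    frames until no pair remains; cost = Euclidean distance (proximity)."""
    cost = np.linalg.norm(prev_pts[:, None, :] - curr_pts[None, :, :], axis=2)
    pairs = []
    while np.isfinite(cost).any():
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        pairs.append((int(i), int(j)))
        cost[i, :] = np.inf   # each previous point is used once
        cost[:, j] = np.inf   # each current point is used once
    return pairs

# Two points swap apparent order between frames; proximity still
# recovers the correct correspondence.
prev_pts = np.array([[0.0, 0.0], [10.0, 0.0]])
curr_pts = np.array([[9.0, 1.0], [1.0, 0.0]])
pairs = associate(prev_pts, curr_pts)
```

An optimal (rather than greedy) assignment would minimize the summed cost globally, e.g. with the Hungarian algorithm, but the greedy version already illustrates how a cost function encodes the constraints.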

Table 2.2: Tracking categories [13].

Contour evolution: State space models [41], Variational methods [42], Heuristic methods [43]
Matching shapes: Hausdorff [44], Hough transform [45], Histogram [46]

Statistical methods consider the measurement and the model uncertainties during object state estimation. A state space approach is used to model the object properties such as position, velocity, and acceleration. Measurements usually consist of the object position in the image, obtained by a detection algorithm. The Kalman filter [34] propagates the state estimate and its covariance, but it is limited to linear systems whose state variables are assumed to be normally distributed (Gaussian). The particle filter [47] instead uses the conditional state density to estimate the next state and handles non-Gaussian distributions, so it can be regarded as a generalization of the Kalman filter.
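The predict/update cycle of a linear Kalman filter can be illustrated with a constant-velocity point; the model matrices and noise levels below are illustrative assumptions, not values from any cited work.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle: x is the state estimate, P its
    covariance, z the new measurement."""
    # Predict: propagate state and covariance through the motion model.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: correct with the measurement via the Kalman gain.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

# Constant-velocity model: state = [position, velocity],
# measurement = the (here noise-free) image position.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # motion model
H = np.array([[1.0, 0.0]])               # measurement model
Q = 0.01 * np.eye(2)                     # process noise
R = np.array([[0.5]])                    # measurement noise
x, P = np.array([0.0, 0.0]), np.eye(2)
for t in range(1, 21):                   # object moves 1 unit per frame
    z = np.array([float(t)])
    x, P = kalman_step(x, P, z, F, H, Q, R)
```

Starting from a zero state, the estimate converges to the true position and velocity; with noisy detections the same cycle smooths the trajectory instead of following each measurement exactly.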

Figure 2.2: Motion constraints [13]. (a) Proximity. (b) Maximum velocity. (c) Small velocity-change. (d) Common motion. (e) Rigidity constraint.

Kernel refers to the object shape and appearance. Kernel tracking is typically performed by computing the motion of the object, which is represented by a primitive object region; the motion is generally in the form of a parametric motion model or a dense flow field computed over subsequent frames. The major differences among kernel tracking methods lie in the appearance representation used, the number of objects tracked, and the method used to estimate the object motion. For instance, the mean-shift tracking method [31] uses templates and density-based appearance models, while the SVM tracker [40] tracks objects with multiview appearance models.

Objects may have complex shapes. Humans, for example, have heads, arms, and legs, and cannot be well described by simple geometric shapes. The aim of silhouette-based methods is to provide an accurate shape description and to find the object region in each frame through an object model generated from the previous frames. One category of silhouette-based methods is shape matching [44-46], which can be performed similarly to template-matching-based tracking: an object silhouette and its corresponding model are searched for in the current frame by computing the similarity between candidate regions and the model generated from the hypothesized object silhouette in the previous frame. The other category is contour tracking [41-43], which iteratively evolves an initial contour from the previous frame to its new position in the current frame. Tracking by evolving a contour can be performed either with state space models that model the contour shape and motion, or by directly minimizing the contour energy with techniques such as gradient descent.
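The template-matching search described above can be sketched as a sum-of-squared-differences scan over a small window around the previous object position. This is an illustrative simplification of the matching idea; the search radius and the SSD similarity are assumptions rather than the measures used in the cited works.

```python
import numpy as np

def track_template(frame, template, center, search=5):
    """Find the patch near the previous center that minimizes the sum of
    squared differences with the template; return its top-left corner."""
    th, tw = template.shape
    cy, cx = center
    best, best_pos = np.inf, center
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = cy + dy, cx + dx
            if y < 0 or x < 0:
                continue
            patch = frame[y:y + th, x:x + tw]
            if patch.shape != template.shape:
                continue
            ssd = ((patch - template) ** 2).sum()
            if ssd < best:
                best, best_pos = ssd, (y, x)
    return best_pos

# The object's appearance from the previous frame is a 3x3 bright
# patch at (5, 5); in the new frame it has moved to (7, 6).
frame = np.zeros((20, 20))
frame[7:10, 6:9] = 1.0
template = np.ones((3, 3))
new_center = track_template(frame, template, (5, 5))
```

A silhouette tracker replaces the rectangular template with a silhouette-shaped mask and the SSD with a shape similarity such as the Hausdorff distance [44], but the search structure is the same.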
