
Chapter 1 Introduction

1.2 Organization

The rest of this thesis is organized as follows. In Chapter 2, we introduce background knowledge of video technology and survey previous work in sports video analysis, event detection, shot classification, image enhancement, object extraction, object tracking, and camera calibration. Chapter 3 introduces an algorithm to detect scene changes in videos and constructs a shot classification model to classify clips into close-up view, medium view, and full-court view shots. Chapter 3 also presents the tracking process for the ball and players, describes how points in the 2D video image correspond to the 3D court model, and infers the shooting position. Chapter 4 presents the experimental results and discussion. Finally, we conclude the thesis and describe future work in Chapter 5.

Chapter 2

Background and Related Work

In this chapter, we introduce background knowledge of video technology and survey previous work in sports video analysis. In Section 2.1, we present an overview of the MPEG standard. In the following sections, related work in shot classification, event detection, tracking, and camera calibration for sports video is described.

2.1 Overview of MPEG Standard

MPEG is the international standard [1, 2] for moving picture compression. In the compressed domain, we can obtain low-level features such as DC values and motion vectors to infer higher-level semantic information. The MPEG video syntax supports three types of coded frames or pictures: intra (I-) pictures, coded independently; predictive (P-) pictures, coded with respect to the immediately previous I- or P-picture; and bi-directionally predictive (B-) pictures, coded with respect to both the immediately previous and the immediately next I- or P-pictures. Fig. 2-1 shows an example picture structure in MPEG video coding that uses three B-pictures between two reference (I- or P-) pictures. In MPEG video coding, an input video sequence is divided into groups of pictures (GOPs). Each GOP typically starts with an I-picture, and the rest of the GOP is made up of P-pictures and B-pictures in a certain arrangement. A GOP serves as a basic access unit, and its start picture, an I-picture, is the entry point that facilitates random access.

Fig. 2-1 An example of GOP structure in MPEG coding.
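To make the reference structure concrete, the following is a minimal sketch (in Python; the display-order frame-type string and the helper name are illustrative, not part of the MPEG standard) that derives which reference frames each picture is predicted from:

```python
def gop_references(pattern):
    """For a display-order frame-type pattern (e.g. 'IBBPBBPBB'), list each
    frame's reference frames: I-frames none, P-frames the previous I/P,
    B-frames the surrounding I/P pair. Trailing B-frames would in practice
    also reference the next GOP's I-frame."""
    anchors = [i for i, t in enumerate(pattern) if t in 'IP']
    out = []
    for i, t in enumerate(pattern):
        if t == 'I':                       # intra: coded by itself
            refs = []
        elif t == 'P':                     # predicted from the previous I/P
            refs = [max(a for a in anchors if a < i)]
        else:                              # 'B': previous and next I/P
            prev = max(a for a in anchors if a < i)
            nxt = min((a for a in anchors if a > i), default=None)
            refs = [prev] if nxt is None else [prev, nxt]
        out.append((i, t, refs))
    return out

for frame in gop_references('IBBPBBPBB'):
    print(frame)
```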

MPEG coding uses techniques such as block-based transform coding, predictive coding, entropy coding, motion-compensated interpolation, etc. Among the above techniques, block-based transform coding and motion compensation are the most important ones.

Block-based transform coding reduces the spatial redundancy in digital video. Through the 8x8-block discrete cosine transform (DCT), pixels in the spatial domain are transformed into frequency coefficients, and the substantial correlation between neighboring pixels is greatly reduced. Coefficients in the frequency domain need not be coded with full accuracy and can be entropy-coded for compression. The first coefficient of each block is called the DC value, which carries most of the information of that block.
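As a minimal sketch of this relationship (assuming NumPy and SciPy are available; the block content and the orthonormal scaling convention are illustrative assumptions), the following computes the 2D DCT of an 8x8 block and shows that the DC coefficient is just a scaled block average:

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    """2-D type-II DCT with orthonormal scaling, applied row- then column-wise."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

# Synthetic 8x8 luminance block (illustrative values).
block = np.arange(64, dtype=float).reshape(8, 8)

coeffs = dct2(block)
dc = coeffs[0, 0]
# Under orthonormal scaling the DC coefficient equals 8 * (block mean),
# i.e. it carries the block's average intensity.
print(dc, 8 * block.mean())   # both print 252.0
```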

Motion compensation reduces the temporal redundancy in digital video. For each block in the current frame, a best match in the previous frame is found, and the difference between the block and its match is coded. MPEG-1 and MPEG-2 also apply backward and bi-directional motion compensation, which provides higher coding efficiency.
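The following is a minimal sketch of this block-matching search (in Python/NumPy; the block size and search range are illustrative assumptions, and real encoders use much faster strategies than this exhaustive scan):

```python
import numpy as np

def best_match(cur_block, ref, top, left, search=8):
    """Exhaustive block matching: find the displacement (dy, dx) into the
    reference frame that minimizes the sum of absolute differences (SAD)."""
    h, w = cur_block.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue   # candidate block falls outside the reference frame
            cand = ref[y:y + h, x:x + w]
            sad = np.abs(cur_block.astype(np.int32) - cand.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad   # motion vector and its residual energy
```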

In the proposed framework, we use the DC values, the Rb ratio (the number of backward motion vectors over the number of forward motion vectors), and the Rf ratio (the number of forward motion vectors over the number of backward motion vectors) to detect scene changes in basketball video.

2.2 Related Work in Sports Video Analysis

Due to its tremendous commercial potential, sports video has been widely studied.

Y.H. Gong et al. [3] proposed a system that can automatically parse soccer video programs using domain knowledge. The parsing process was mainly built upon line mark recognition and motion detection. They categorized the position of a play into several predefined classes by recognizing compound line patterns with a signature method. The motion vector field is used to infer the play positions for scenes without line marks. Despite the strong semantic indexes from the categorization of play positions, two problems remain unaddressed: 1) how to identify different camera angles and shooting scales, without which line mark recognition cannot be robust; 2) how to determine reasonable segments for processing.

Frame-by-frame processing is impractical for large amounts of video data; moreover, customized algorithms would have to cope with much noise from unrelated segments. As discussed above, video shots furnish natural segments.

Y.P. Tan et al. [4] introduced camera motion estimation into the analysis and annotation of MPEG basketball video. They estimated camera motion directly from MPEG motion vector fields. By measuring the variation of the estimated pan rate and the persistence of accumulated directional pan, semantic annotations were generated, such as fast breaks (FB), full court advances (FCA), close-up shots, etc. Camera motion is no doubt an important clue for annotating video and selecting interesting segments. However, if the projective transformation parameters are used to recover camera motion, the camera motion does not necessarily dominate the change in image intensity between frames. Although robust statistics can be exploited to deal with noisy motion vectors, the global motion estimate may be poor if the underlying motion vector field is totally unreliable, for instance with unstructured scenes or the loss of focus caused by fast camera movement. Hence the quality of the motion vector field must be evaluated before applying the regression procedure.

D. Zhong et al. [5] proposed a general framework to analyze the temporal structure of live broadcast sports video. They formulated structure analysis as the problem of detecting fundamental views using supervised learning and domain-specific rules. For instance, in tennis, the serve scene can be determined by detecting the court view. They utilized color filtering, object segmentation, and edge verification. This kind of view-based approach depends on two assumptions: 1) the fundamental views carry unique visual cues, such as color, motion, and object layout; 2) the basic units start with a special scene. Since sports videos usually feature a fixed number of camera views, frame-level view analysis is useful. However, pure view analysis does not make full use of motion vector information. Even when view analysis is combined with individual motion field analysis [6], it is difficult to capture the distinguishing dynamic characteristics from an individual motion vector field, which could be contaminated, rather than from the overall motion pattern within one shot.

We have discussed several representative works in sports video analysis with an emphasis on the use of motion information. Now let us briefly review some other related works. G. Sudhir et al. [7] developed an approach for the automatic classification of tennis video. They used automatically extracted tennis court lines and the players' positions for high-level reasoning, where the relative positions of the two players are mapped to high-level events such as baseline rallies, passing shots, etc. W. Hua et al. [8] introduced a maximum entropy scheme to integrate multimedia cues for baseball scene classification. J. Assfalg et al. [9] used HMMs to model the transitions between states of camera motion patterns or player locations for each type of soccer highlight. Once all the HMMs are trained, the maximum likelihood function is computed to recognize an unknown video shot.

Below, we review related work on shot classification. C.W. Ngo et al. [10] proposed a hierarchical clustering approach that aggregates shots with similar motion and color features. By coupling clustering with retrieval, the clustering structure inherently provides an indexing scheme for retrieval. Through manual investigation of the clustering results, they tried to explain the semantic meaning of each cluster. However, this kind of clustering procedure does not establish direct relationships between the resulting shot clusters and clear semantic meanings. Moreover, the clustering-based approach does not provide a feasible solution for classifying unknown video shots into known shot classes with strong semantic meanings.

J. Assfalg et al. [11] proposed an approach for semantic annotation of sports video according to elements of visual content at different layers of semantic significance. They used neural network classifiers to classify visual shot features (e.g., edge, segment, and color features) into playing field, player, and audience classes. This classification scheme is based on key frames; motion information is not used.

2.3 Related Work in Tracking

Most research tracks players using template matching [12-14]; however, users often have to specify player positions manually during occlusion. Moreover, most methods do not track the ball, or track it only in easy cases [15, 16]. The system proposed in [17] can automatically track the players and the ball in soccer games from images taken by fixed cameras. The method in [17] also copes with occlusion and posture changes, and can calculate the positions of the players on the field and the position of the ball in 3D space.

2.4 Related Work in Camera Calibration

The mapping between the observed image and real-world coordinates can be modeled as a projective transform. With a set of well-defined positions in an image, we can obtain the transformation parameters. Lines provide a good calibration feature when the sport has a specific line structure on the playfield. In early work [7], a method to detect four predefined points on a tennis court for calibration was proposed.

However, the algorithm has to be initialized manually, and it is not robust against occlusions of the court lines connecting these four points. In [18, 19], more general court detection (for soccer videos) is described, but it requires computationally complex initialization because it uses an exhaustive search through the parameter space. [20] applies a Hough transform to detect court lines for calibration, but the use of the Hough transform alone is not robust in the general case. [21] uses a combinatorial search to establish correspondences between the lines detected with a Hough transform and the court model. This provides high robustness even under bad lighting conditions or large occlusions.

2.4.1 Transformation from 3D to 2D

We typically use a pinhole camera model that maps points in a 3D camera frame to a 2D projected image frame. Using similar triangles, we can relate the 2D image plane and 3D real-world coordinates by a transformation matrix. As Fig. 2-2 shows, X_C, Y_C, and Z_C are the three axes of the 3D camera coordinate system, and x and y are the axes of the 2D image plane. The 3D points P = (0, Y_C, Z_C) and Q = (X_C, 0, Z_C) project onto the image plane at p = (0, y) and q = (x, 0). O_c is the origin of the camera coordinate system, known as the center of projection (COP) of the camera. The origin of the image plane is o, and the camera focal length is denoted by f_c.

Fig. 2-2 Image geometry showing relationship between 3D points and 2D image plane pixels.

From the similar triangles PP_1O_c and poO_c, and the similar triangles QQ_1O_c and qoO_c, we can write down the relationships:

\[ \frac{x}{f_c} = \frac{X_C}{Z_C}, \qquad \frac{y}{f_c} = \frac{Y_C}{Z_C} \tag{1} \]

If f_c = 1, note that perspective projection is just scaling a world coordinate by its Z value. All 3D points along a line from the COP through a position (x, y) will have the same image plane coordinates. We can also describe perspective projection by the matrix equation:

\[ \begin{bmatrix} x_h \\ y_h \\ z_h \end{bmatrix} = \begin{bmatrix} f_c & 0 & 0 & 0 \\ 0 & f_c & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix}, \qquad x = \frac{x_h}{z_h}, \quad y = \frac{y_h}{z_h} \tag{2} \]

We can generate image space coordinates from projected camera space coordinates. However, in image processing we work with actual pixel values, so we have to transform the 2D image coordinates (x, y) to pixel coordinates (u, v) by scaling the camera image plane coordinates in the x and y directions and adding a translation to the origin of the image plane. Calling the scale factors D_x and D_y, and the translation to the origin of the image plane (u_0, v_0), the pixel coordinates are

\[ u = \frac{x}{D_x} + u_0, \qquad v = \frac{y}{D_y} + v_0 \]

where D_x and D_y are the physical dimensions of a pixel and (u_0, v_0) is the origin of the pixel coordinate system. x/D_x and y/D_y are simply numbers of pixels, centered at the pixel coordinate origin. We can also put this into matrix form as:

\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/D_x & 0 & u_0 \\ 0 & 1/D_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{3} \]
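As a minimal sketch of Eqs. (1)-(3) (in Python/NumPy; the focal length, pixel dimensions, and image-plane origin are illustrative values, not calibrated ones), the following projects a 3D camera-space point to pixel coordinates:

```python
# Illustrative intrinsics (assumed values, not from a real calibration).
f_c = 1.0                  # focal length
Dx, Dy = 0.01, 0.01        # physical pixel dimensions
u0, v0 = 320.0, 240.0      # pixel-coordinate origin (image center)

def project(Xc, Yc, Zc):
    """Project a 3D camera-frame point to pixel coordinates (u, v)."""
    # Eqs. (1)/(2): perspective projection onto the image plane.
    x = f_c * Xc / Zc
    y = f_c * Yc / Zc
    # Eq. (3): image-plane coordinates to pixel coordinates.
    u = x / Dx + u0
    v = y / Dy + v0
    return u, v

print(project(1.0, 0.5, 10.0))   # -> (330.0, 245.0)
```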

Camera calibration is used to find the mapping from 3D world coordinates to 2D image space coordinates. There are two approaches:

- Method 1: Find both the extrinsic and intrinsic parameters of the camera system. However, this can be difficult to do.
- Method 2: An easier method is the "lumped" transform. Rather than finding individual parameters, we find a composite matrix that relates 3D to 2D. Combining Eqs. (2) and (3) with the camera's extrinsic transform, we can derive a 3x4 calibration matrix C:

\[ C = \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & 1 \end{bmatrix} \tag{4} \]

We apply Method 2, which finds the 11 parameters that transform an arbitrary 3D world point to a pixel in a computer image:

\[ \begin{bmatrix} u'w' \\ v'w' \\ w' \end{bmatrix} = C \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \tag{5} \]

C is a single 3x4 transform that we can calculate empirically.

Multiplying out the equations, we get:

\[ u' = \frac{c_{11}x + c_{12}y + c_{13}z + c_{14}}{c_{31}x + c_{32}y + c_{33}z + 1}, \qquad v' = \frac{c_{21}x + c_{22}y + c_{23}z + c_{24}}{c_{31}x + c_{32}y + c_{33}z + 1} \]

- If we know the calibration matrix C and a 3D point (x, y, z), we can predict its image space coordinates.
- If we know x, y, z, u', v', we can find the c_ij. Each such 5-tuple gives 2 equations in the c_ij. This is the basis for empirically finding the calibration matrix C (more on this later).
- If we know the c_ij and u', v', we have 2 equations in x, y, and z. The two equations represent two planes in 3D whose intersection is a line. These are the equations of the line emanating from the center of projection of the camera, passing through the image pixel location (u', v'), and containing the point (x, y, z).

Set up a linear system to solve for the c_ij: AC = B. Here N is the number of points whose 2D and 3D coordinates are known and used to solve for the c_ij; each set of values x, y, z, u', v' yields 2 equations in the 11 unknowns (the c_ij's). To solve for C, A needs to be invertible (square). We can overdetermine A and find a least-squares fit for C by using a pseudo-inverse solution. If A is 2N x 11, where 2N > 11:

\[ C = (A^{T}A)^{-1}A^{T}B \]
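As a minimal sketch of this empirical calibration (in Python/NumPy; the function names are illustrative, and the row layout of A follows the multiplied-out equations above), one might solve for the 11 parameters as follows:

```python
import numpy as np

def calibrate(points_3d, points_2d):
    """Solve for the 11 entries of the lumped 3x4 calibration matrix C
    (with c34 fixed to 1) from N >= 6 known 3D-2D correspondences."""
    A, B = [], []
    for (x, y, z), (u, v) in zip(points_3d, points_2d):
        # u*(c31 x + c32 y + c33 z + 1) = c11 x + c12 y + c13 z + c14
        A.append([x, y, z, 1, 0, 0, 0, 0, -u * x, -u * y, -u * z])
        B.append(u)
        # v*(c31 x + c32 y + c33 z + 1) = c21 x + c22 y + c23 z + c24
        A.append([0, 0, 0, 0, x, y, z, 1, -v * x, -v * y, -v * z])
        B.append(v)
    A, B = np.asarray(A, float), np.asarray(B, float)
    # Least-squares (pseudo-inverse) solution of the overdetermined system.
    c, *_ = np.linalg.lstsq(A, B, rcond=None)
    return np.append(c, 1.0).reshape(3, 4)   # re-insert c34 = 1

def project(C, p3d):
    """Project a 3D point with calibration matrix C; returns pixel (u, v)."""
    uh, vh, wh = C @ np.append(p3d, 1.0)
    return uh / wh, vh / wh
```

With six or more well-distributed, non-coplanar reference points (e.g., court landmarks with known 3D positions), calibrate returns C, and project then maps any 3D court point to its expected pixel location.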

For basketball video, most previous work emphasizes shot classification and event detection [22-25]. In this thesis, we instead stress the analysis of tactics.

Chapter 3

Scene Change Detection of Basketball Video and

Its Application in Tactic Analysis

In this chapter, we present the framework of our system as depicted in Fig. 3-1. The system architecture has three main parts: Full Court Shot Retrieval, 2D Ball Trajectory Extraction, and 3D Shooting Location Positioning. Full Court Shot Retrieval uses scene change detection to cut a video into clips and classifies each clip as a close-up view, medium view, or full court view shot. 2D Ball Trajectory Extraction searches the full court view shots for ball candidates and tracks the 2D ball trajectory. 3D Shooting Location Positioning applies camera calibration to find the relationship between 2D and 3D points, so that we can extract the 3D trajectory of the basketball. Finally, the shooting position can be determined.

Fig. 3-1 The framework of the system.

Section 3.1 introduces a GOP-based approach to detect scene changes in videos. Section 3.2 constructs a shot classification model to find full-court view shots. Section 3.3 shows how to find ball candidates. Section 3.4 presents the tracking process of the ball. In Section 3.5, we describe a camera calibration model to establish correspondences between points in the video image and the court model.

3.1 Scene Change Detection Using GOP-Based Method

In order to analyze tactics in basketball video, we have to detect scene changes and cut the video into clips. After that, we classify the clips into three kinds of shots and choose the full-court view shots for further processing.

Most existing approaches detect scene changes frame by frame. However, a scene change does not occur at every frame, so frame-wise detection is unnecessary. We use a GOP-based method to improve the efficiency of scene change detection. The MPEG-2 format includes a GOP layer. As Fig. 3-2 shows, a GOP contains a header and an intra-coded frame (I-frame), followed by a series of frames of two types: predictive coded frames (P-frames) and bi-directionally predictive coded frames (B-frames).

GOP Header I B B P B B P B B P B B GOP Header I B B

Fig. 3-2 Structure of GOP.

The GOP-based scene change detection approach has two steps [26]; its workflow is shown in Fig. 3-3. In the first step (inter-GOP scene change detection), the possible occurrence of a scene change is checked GOP by GOP instead of frame by frame. If a GOP possibly contains a scene change, we go to the second step. In the second step (intra-GOP scene change detection), we check whether the scene change really exists and find the actual frame at which it occurs within the GOP. The details of the two steps are described in Sections 3.1.1 and 3.1.2, respectively.

Fig. 3-3 The workflow of the scene change detection method.

3.1.1 Inter-GOP scene change detection

For each I-frame, divide it into k sub-regions and sum the DC values in each sub-region. The image feature of GOP g is

\[ G_g = \{\, \mathrm{SumDC}_{g,i} \mid i = 1, \ldots, k \,\}, \qquad \mathrm{SumDC}_{g,i} = \sum_{j=1}^{N_i} DC_{i,j} \]

where i is the index of the sub-region in the I-frame, N_i is the total number of DC values in the i-th sub-region, and DC_{i,j} is the j-th DC value of sub-region i.


The distance between two GOPs g and g+1 is represented as D(g, g+1), which is computed as follows:

\[ \mathrm{mark}_i = \begin{cases} 1 & \text{if } |\mathrm{SumDC}_{g,i} - \mathrm{SumDC}_{g+1,i}| > \text{threshold\_subregion} \\ 0 & \text{otherwise} \end{cases} \]

\[ D(g, g+1) = \sum_{i=1}^{k} \mathrm{mark}_i \]

When D(g, g+1) ≤ threshold_GOP, the successive GOPs are similar and we say no scene change occurs. When D(g, g+1) > threshold_GOP, GOP g and GOP g+1 are dissimilar, and we assume a possible scene change occurs in GOP g+1. However, a large difference may be caused by camera motion or object movement rather than a real scene change. To resolve this, intra-GOP scene change detection is applied.
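A minimal sketch of this inter-GOP test (in Python/NumPy; the DC-value map, sub-region grid, and threshold values are illustrative assumptions):

```python
import numpy as np

def sum_dc_features(dc_map, grid=4):
    """Split an I-frame's DC-value map into grid x grid sub-regions and
    return the k = grid*grid per-region DC sums (SumDC_{g,i})."""
    h, w = dc_map.shape
    rh, cw = h // grid, w // grid
    return np.array([dc_map[r * rh:(r + 1) * rh, c * cw:(c + 1) * cw].sum()
                     for r in range(grid) for c in range(grid)])

def inter_gop_distance(feat_g, feat_g1, threshold_subregion=500.0):
    """D(g, g+1): the number of sub-regions whose SumDC difference is large."""
    return int((np.abs(feat_g - feat_g1) > threshold_subregion).sum())

# Usage: flag GOP g+1 for intra-GOP checking when the distance is large.
# if inter_gop_distance(f_g, f_g1) > threshold_GOP: run step 2 on GOP g+1
```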

3.1.2 Intra-GOP scene change detection

The Fast Pure Motion Vector Approach [27] is used for efficient scene change detection within a GOP. This approach uses only the motion vectors of B-frames to detect scene changes, since B-frames are motion-compensated with respect to reference frames. If a B-frame is most similar to its previous reference frame, most of its motion vectors refer to the forward direction; if it is most similar to its backward reference frame, most of its motion vectors refer to the backward direction. Two notations are defined below.

Rb: the ratio of the number of backward motion vectors to the number of forward motion vectors.

Rf: the ratio of the number of forward motion vectors to the number of backward motion vectors.
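A minimal sketch of these ratios (in Python; the per-B-frame motion-vector counts and the peak thresholds are illustrative assumptions), as used when checking the three cases listed below:

```python
def rb_rf(n_backward, n_forward, eps=1e-6):
    """Rb and Rf for one B-frame, from its motion-vector direction counts."""
    rb = n_backward / (n_forward + eps)  # large when the frame matches the next reference
    rf = n_forward / (n_backward + eps)  # large when the frame matches the previous reference
    return rb, rf

def peaks(counts, threshold_rb=5.0, threshold_rf=5.0):
    """Flag B-frames whose Rb or Rf peaks, hinting where the scene change lies."""
    flags = []
    for i, (nb, nf) in enumerate(counts):   # counts: [(n_backward, n_forward), ...]
        rb, rf = rb_rf(nb, nf)
        flags.append((i, rb > threshold_rb, rf > threshold_rf))
    return flags
```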


Case 1: Scene change occurs on an I-frame or P-frame.

Case 2: Scene change occurs on the first B-frame between two successive reference frames.

Case 3: Scene change occurs on the second or later B-frame between two successive reference frames.

We discuss the three cases and infer the rule to find the actual scene change frame.

Fig. 3-4 Scene change occurs on I-frame or P-frame.

Case 1 is shown in Fig. 3-4. If the scene change occurs on an I-frame or P-frame, the B-frame immediately before it will be similar to its previous reference frame. Therefore, most of the motion vectors of that B-frame refer to the forward reference frame, and its Rf will be very large, exceeding threshold_Rf.

Fig. 3-5 Scene change occurs on the first B-frame.


Case 2 is shown in Fig. 3-5. If the scene change occurs on the first B-frame between two successive reference frames, that B-frame itself will be similar to its backward reference frame, so most of its motion vectors refer to the backward direction and its Rb will be very large.
