
Chapter 2 Background and Related Work

2.4 Related Work in Camera Calibration

The mapping between the observed image and the real-world coordinates can be taken to be a projective transform. With a set of positions well defined in an image, we can obtain the transformation parameters. Lines provide a good feature for calibration when the sport has a specific line structure on the playfield. In early work [7], a method to detect four predefined points on a tennis court for calibration was proposed.

However, the algorithm has to be initialized manually, and it is not robust against occlusions of the court lines connecting these four points. In [18, 19], court detection for soccer videos is described, but it requires computationally complex initialization because an exhaustive search through the parameter space is used. [20] applies a Hough transform to detect court lines for calibration, but the use of the Hough transform is restricted in this case. [21] uses a combinatorial search to establish correspondences between the lines detected with a Hough transform and the court model. This provides high robustness even under bad lighting conditions or large occlusions.

2.4.1 Transformation from 3D to 2D

We typically use a pinhole camera model that maps points in a 3D camera frame to a 2D projected image frame. Using similar triangles, we can relate 2D image-plane and 3D real-world coordinates by a transformation matrix. As Fig. 2-2 shows, $X_C$, $Y_C$, and $Z_C$ are the three axes of the 3D camera coordinate system, and $x$ and $y$ are the axes of the 2D image plane. The 3D points $P=(0, Y_C, Z_C)$ and $Q=(X_C, 0, Z_C)$ project onto the image plane at $p=(0, y)$ and $q=(x, 0)$. $O_c$ is the origin of the camera coordinate system, known as the center of projection (COP) of the camera. The origin of the image plane is $O$. The camera focal length is denoted by $f_c$.

Fig. 2-2 Image geometry showing relationship between 3D points and 2D image plane pixels.

From the similar triangles $PP_1O_c$ and $poO_c$, and the similar triangles $QQ_1O_c$ and $qoO_c$, we can write down the relationships:

$$ x = f_c \frac{X_C}{Z_C}, \qquad y = f_c \frac{Y_C}{Z_C} \qquad (1) $$

If $f_c = 1$, note that perspective projection is just scaling a world coordinate by its $Z$ value. All 3D points along a line from the COP through a position $(x, y)$ have the same image-plane coordinates. We can also describe perspective projection by the matrix equation:

$$ Z_C \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_c & 0 & 0 & 0 \\ 0 & f_c & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} \qquad (2) $$

We can generate image-space coordinates from projected camera-space coordinates. However, in image processing we work with actual pixel values, so we have to transform the 2D image coordinates $(x, y)$ to pixel values $(u, v)$ by scaling the camera image-plane coordinates in the $x$ and $y$ directions and adding a translation to the origin of the image plane. Calling the scale factors $D_x$ and $D_y$ and the translation to the origin of the pixel plane $(u_0, v_0)$, the pixel coordinates are

$$ u = \frac{x}{D_x} + u_0, \qquad v = \frac{y}{D_y} + v_0, $$

where $D_x$ and $D_y$ are the physical dimensions of a pixel and $(u_0, v_0)$ is the origin of the pixel coordinate system. The quantities $x/D_x$ and $y/D_y$ are simply numbers of pixels, and we center them at the pixel coordinate origin. We can also put this into matrix form as:

$$ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/D_x & 0 & u_0 \\ 0 & 1/D_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (3) $$
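To make this two-stage mapping concrete, the short sketch below (a minimal illustration, not code from this thesis; the focal length, pixel dimensions, and principal point are made-up values) projects a 3D point given in camera coordinates to the image plane with Eq. (2) and then converts it to pixel values with Eq. (3):

```python
import numpy as np

# Assumed intrinsic values for illustration only (not from the thesis).
f_c = 0.05              # focal length in metres
D_x, D_y = 1e-5, 1e-5   # physical pixel dimensions in metres
u0, v0 = 320.0, 240.0   # pixel-coordinate origin (principal point)

# Perspective projection, Eq. (2): 3D camera coordinates -> image plane.
P_proj = np.array([[f_c, 0, 0, 0],
                   [0, f_c, 0, 0],
                   [0, 0, 1, 0]])

# Pixel conversion, Eq. (3): image-plane coordinates -> pixel values.
K_pix = np.array([[1 / D_x, 0, u0],
                  [0, 1 / D_y, v0],
                  [0, 0, 1]])

def camera_point_to_pixel(X_C, Y_C, Z_C):
    """Project a 3D point in camera coordinates to pixel coordinates (u, v)."""
    xyw = P_proj @ np.array([X_C, Y_C, Z_C, 1.0])
    xy1 = xyw / xyw[2]          # divide by Z_C to get (x, y, 1)
    uvw = K_pix @ xy1
    return uvw[0], uvw[1]

print(camera_point_to_pixel(0.4, 0.1, 10.0))   # a point 10 m in front of the camera
```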

Camera calibration is used to find the mapping from 3D world coordinates to 2D image-space coordinates. There are two approaches:

• Method 1: Find both the extrinsic and intrinsic parameters of the camera system. However, this can be difficult to do.

• Method 2: An easier method is the "lumped" transform. Rather than finding individual parameters, we find a composite matrix that relates 3D to 2D. Given Eq. (3), we can derive a 3x4 calibration matrix $C$:

$$ C = \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & c_{34} \end{bmatrix} \qquad (4) $$

We apply Method 2, which finds the 11 parameters needed to transform an arbitrary 3D world point to a pixel in a computer image (the matrix is defined only up to scale, so $c_{34}$ can be set to 1, leaving 11 unknowns):

$$ \begin{bmatrix} u'w \\ v'w \\ w \end{bmatrix} = C \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \qquad (5) $$

$C$ is a single 3x4 transform that we can calculate empirically. Multiplying out the equations, we get:

$$ u' = \frac{c_{11}x + c_{12}y + c_{13}z + c_{14}}{c_{31}x + c_{32}y + c_{33}z + c_{34}}, \qquad v' = \frac{c_{21}x + c_{22}y + c_{23}z + c_{24}}{c_{31}x + c_{32}y + c_{33}z + c_{34}} $$

• If we know the calibration matrix $C$ and a 3D point, we can predict its image-space coordinates.

• If we know $x, y, z, u', v'$, we can find the $c_{ij}$. Each 5-tuple gives 2 equations in the $c_{ij}$. This is the basis for empirically finding the calibration matrix $C$ (more on this later).

• If we know the $c_{ij}$ and $u', v'$, we have 2 equations in $x$, $y$, and $z$. The two equations represent two planes in 3D whose intersection is a line. These are the equations of the line emanating from the center of projection of the camera, passing through the image pixel location $(u', v')$, and containing the point $(x, y, z)$.

We set up a linear system to solve for the $c_{ij}$: $AC = B$.

Here $N$ is the number of points whose 2D and 3D coordinates are known and used to solve for the $c_{ij}$. Each set of values $x, y, z, u', v'$ yields 2 equations in the 11 unknowns (the $c_{ij}$). To solve for $C$, $A$ needs to be invertible (square). We can overdetermine $A$ and find a least-squares fit for $C$ by using a pseudo-inverse solution. If $A$ is $2N \times 11$, where $2N > 11$:

$$ C = (A^T A)^{-1} A^T B $$
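As an illustration of this least-squares estimation, the sketch below (a minimal example, not the thesis code; the interface and point correspondences are assumptions) builds the $2N \times 11$ matrix $A$ and the vector $B$ from known 3D/2D point pairs and solves for the 11 entries of $C$ with a pseudo-inverse:

```python
import numpy as np

def estimate_calibration_matrix(points_3d, points_2d):
    """Estimate the 3x4 calibration matrix C (with c34 fixed to 1)
    from N >= 6 correspondences between 3D points and pixel positions."""
    A, B = [], []
    for (x, y, z), (u, v) in zip(points_3d, points_2d):
        # u = (c11 x + c12 y + c13 z + c14) / (c31 x + c32 y + c33 z + 1)
        A.append([x, y, z, 1, 0, 0, 0, 0, -u * x, -u * y, -u * z])
        B.append(u)
        # v = (c21 x + c22 y + c23 z + c24) / (c31 x + c32 y + c33 z + 1)
        A.append([0, 0, 0, 0, x, y, z, 1, -v * x, -v * y, -v * z])
        B.append(v)
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    # Least-squares / pseudo-inverse solution of the overdetermined system.
    c, *_ = np.linalg.lstsq(A, B, rcond=None)
    return np.append(c, 1.0).reshape(3, 4)   # re-attach c34 = 1
```

With at least six well-spread correspondences, the returned matrix maps a homogeneous 3D point to homogeneous pixel coordinates as in Eq. (5).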

For basketball video, most of the previous work emphasizes shot classification and event detection [22-25]. In this work, we stress the analysis of tactics.

Chapter 3

Scene Change Detection of Basketball Video and Its Application in Tactic Analysis

In this chapter, we present the framework of our system as depicted in Fig. 3-1. The system architecture has three main parts: Full Court Shot Retrieval, 2D Ball Trajectory Extraction, and 3D Shooting Location Positioning. Full Court Shot Retrieval utilizes scene change detection to cut a video into clips and classifies each clip as a close-up view, medium view, or full court view shot. 2D Ball Trajectory Extraction uses all the full court view shots to search for ball candidates and to track the 2D ball trajectory. 3D Shooting Location Positioning applies camera calibration to find the relationship between 2D and 3D points, so that we can extract the 3D trajectory of the basketball. Finally, the shooting position can be found.

Fig. 3-1 The framework of the system.

Section 3.1 introduces a GOP-based approach to detect scene changes in videos. Section 3.2 constructs a shot classification model to find "full-court view shots". Section 3.3 shows how to find ball candidates. Section 3.4 presents the tracking process of the ball. In Section 3.5, we describe a camera calibration model to establish correspondence between points in the video image and the determined court model.

3.1 Scene Change Detection Using GOP-Based Method

In order to analyze tactics in basketball video, we have to detect scene changes and cut the video into clips. After that, we classify the clips into three kinds of shots and choose the full-court view shots for further processing.

Most existing approaches detect scene changes frame by frame. However, a scene change does not occur at every frame, so frame-wise scene change detection is unnecessary. We use a GOP-based method to improve the efficiency of scene change detection. The MPEG-2 format includes a GOP layer. As Fig. 3-2 shows, a GOP contains a header and an intra-coded frame (I-frame) followed by a series of frames of two types: predictive-coded frames (P-frames) and bi-directionally predictive-coded frames (B-frames).

GOP Header I B B P B B P B B P B B GOP Header I B B

Fig. 3-2 Structure of GOP.

The GOP-based scene change detection approach has two steps [26]. The workflow of this approach is shown in Fig. 3-3. In the first step (Inter-GOP scene change detection), the possible occurrence of a scene change is checked GOP by GOP instead of frame by frame. If a GOP is detected as possibly containing a scene change, we go to the second step. In the second step (Intra-GOP scene change detection), we check whether the scene change exists and find the actual frame where the scene change occurs within the GOP. The detailed process of the two steps is described in Sections 3.1.1 and 3.1.2, respectively.

Fig. 3-3 The workflow of the scene change detection method.

3.1.1 Inter-GOP scene change detection

For each I-frame, divide it into $k$ sub-regions and sum the DC values in each sub-region. The image feature of a GOP $g$ is $\{SumDC_{g,i} \mid i = 1, \ldots, k\}$, where

$$ SumDC_{g,i} = \sum_{j=1}^{N_i} DC_{i,j}, $$

$i$ is the index of the sub-region in the I-frame, $N_i$ is the total number of DC values in the $i$-th sub-region, and $DC_{i,j}$ is the $j$-th DC value of sub-region $i$.


The distance between two GOPs $g$ and $g+1$ is represented as $D(g, g+1)$, and its value is computed as follows:

$$ mark_i = \begin{cases} 1 & \text{if } |SumDC_{g,i} - SumDC_{g+1,i}| > threshold\_subregion \\ 0 & \text{otherwise} \end{cases} $$

$$ D(g, g+1) = \sum_{i=1}^{k} mark_i $$

When $D(g, g+1) \le threshold\_GOP$, which means the successive GOPs are similar, we say no scene change occurs. When $D(g, g+1) > threshold\_GOP$, which means GOP $g$ and GOP $g+1$ are dissimilar, we assume that a possible scene change occurs in GOP $g+1$. However, a large difference may be caused by camera motion or object movement rather than a real scene change. To solve this problem, Intra-GOP scene change detection is applied.
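As a concrete sketch of this first step (a simplified illustration that assumes the DC coefficients of each I-frame have already been extracted; the sub-region grid and threshold values are assumptions, not values from the thesis), the code below computes the sub-region feature $SumDC_{g,i}$ and the GOP-pair distance $D(g, g+1)$:

```python
import numpy as np

THRESHOLD_SUBREGION = 500.0   # assumed value for illustration
THRESHOLD_GOP = 3             # assumed value for illustration

def gop_feature(dc_image, k_rows=4, k_cols=4):
    """Sum the I-frame DC values inside each of k = k_rows * k_cols sub-regions."""
    h, w = dc_image.shape
    sums = []
    for r in range(k_rows):
        for c in range(k_cols):
            block = dc_image[r * h // k_rows:(r + 1) * h // k_rows,
                             c * w // k_cols:(c + 1) * w // k_cols]
            sums.append(block.sum())
    return np.array(sums)        # SumDC_{g,i}, i = 1..k

def inter_gop_distance(feat_g, feat_g1):
    """D(g, g+1): number of sub-regions whose SumDC difference exceeds the threshold."""
    marks = np.abs(feat_g - feat_g1) > THRESHOLD_SUBREGION
    return int(marks.sum())

def possible_scene_change(feat_g, feat_g1):
    """Step 1: flag GOP g+1 for Intra-GOP checking when the two GOPs are dissimilar."""
    return inter_gop_distance(feat_g, feat_g1) > THRESHOLD_GOP
```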

3.1.2 Intra-GOP scene change detection

The Fast Pure Motion Vector Approach [27] is used for efficient scene change detection within a GOP. This approach only uses the motion vectors of B-frames to detect scene changes, since B-frames are motion-compensated with respect to their reference frames. If a B-frame is most similar to its previous reference frame, most of its motion vectors will refer to the forward direction. If a B-frame is most similar to its following reference frame, most of its motion vectors will refer to the backward direction. Two notations are defined below.

Rb: the ratio of the number of backward motion vectors to the number of forward motion vectors.

Rf: the ratio of the number of forward motion vectors to the number of backward motion vectors.


Case 1: Scene change occurs on an I-frame or P-frame.

Case 2: Scene change occurs on the first B-frame between two successive reference frames.

Case 3: Scene change occurs on the second or later B-frame between two successive reference frames.

We discuss the three cases and infer the rule to find the actual scene change frame.

Fig. 3-4 Scene change occurs on I-frame or P-frame.

Case 1 is shown in Fig. 3-4. If the scene change occurs on an I-frame or P-frame, the B-frame immediately before it will be similar to its previous reference frame. Therefore, most of the motion vectors of that B-frame refer to the forward reference frame, and its Rf will be very large and exceed threshold_Rf.

Fig. 3-5 Scene change occurs on the first B-frame.


Case 2 is shown in Fig. 3-5. If the scene change occurs on the first B-frame between two successive reference frames, that B-frame will be similar to its following reference frame. Therefore, most of its motion vectors refer to the backward reference frame, and the Rb of this first B-frame will be very large and exceed threshold_Rb.

Fig. 3-6 Scene change occurs on the second or later B-frame.

Case 3 is shown in Fig. 3-6. If the scene change occurs on the second or later B-frame between two successive reference frames, that B-frame will be similar to its following reference frame, while the first preceding B-frame will be similar to the previous reference frame. Therefore, most of the motion vectors of the first B-frame refer to the forward reference frame, and most of those of the second B-frame refer to the backward reference frame; i.e., the Rf of the first B-frame and the Rb of the second B-frame will both be very large.

After examining the values of Rb and Rf of the B-frames, a scene change is detected within a GOP when the Rb or Rf of a B-frame exceeds the predefined threshold.

Noise such as camera or object motion, which may lead to a possible scene change being flagged in the first step, can be removed, because such frames are usually not similar to either the previous or the following reference frame, so their values of Rb and Rf on B-frames will not exceed the thresholds.
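A simplified sketch of this second step is given below (an illustration only; it assumes the forward/backward motion-vector counts of each B-frame in the GOP have already been parsed from the bitstream, and the threshold values are made up):

```python
# Assumed thresholds for illustration only.
THRESHOLD_RF = 5.0
THRESHOLD_RB = 5.0

def intra_gop_scene_change(bframe_mv_counts):
    """bframe_mv_counts: list of (forward_count, backward_count) for each
    B-frame in the GOP, in display order. Returns the indices of B-frames
    whose Rf or Rb peaks above its threshold (candidate scene-change points)."""
    candidates = []
    for idx, (fwd, bwd) in enumerate(bframe_mv_counts):
        rf = fwd / max(bwd, 1)   # ratio of forward to backward motion vectors
        rb = bwd / max(fwd, 1)   # ratio of backward to forward motion vectors
        if rf > THRESHOLD_RF or rb > THRESHOLD_RB:
            candidates.append(idx)
    return candidates
```

This sketch only flags the B-frames whose ratios peak; mapping a flagged peak back to the exact scene-change frame then follows the three cases discussed above (e.g. a peak in Rf on the B-frame before an I- or P-frame indicates Case 1).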


3.2 Shot Classification

To analyze tactics in basketball video, we must have enough information to support the inference of possible shot positions. Three kinds of shots are predefined: close-up view, medium view, and full court view. We use the full court view shots, which contain more information about the game, for further analysis.

Some related works in shot classification are described in Chapter 2, and we apply the main idea of dominant color ratio [28].

Fig. 3-7 Flowchart of dominant color region detection algorithm.

The flowchart of dominant color region detection algorithm is shown in Fig. 3-7.

At start-up, the system computes initial statistics and the values of several parameters for each color space from the frames in the training set. After the initialization of the parameters, the dominant color region of each new frame is detected in both the control and primary color spaces. The segmentation results in these spaces are used by the fusion algorithm to obtain a more accurate final segmentation mask. The rest of the blocks in the flowchart are used to adapt the primary color space statistics with two feedback loops. The inner feedback loop, connected with dashed lines, computes local statistics in the primary color space and captures local variations, whereas the outer feedback loop, connected with dotted lines, becomes active when the segmentation results conflict with each other, which indicates that the local statistics have drifted from the true statistics of the primary color space. The activation of this outer feedback loop resets the primary color statistics to their initial values.

The RGB and HSI histograms of the dominant color (the color of the court) are illustrated in Fig. 3-8, where the x-axis represents the quantized bins of each color component and the y-axis is the number of pixels in the corresponding bin. The ratio of dominant color pixels can be exploited to identify which kind of shot the current frame belongs to.

Fig. 3-9 (a) shows an instance of a close-up view and its histograms in RGB and HSI color space. Since a close-up view contains only a small part of the court, its color distribution is very different from that of the dominant color. (b) shows an instance of a medium view and its histograms in RGB and HSI color space. A medium view has a moderate number of court pixels, so its distribution is somewhat similar to that of the dominant color. (c) shows an instance of a full court view and its histograms in RGB and HSI color space. A full court view usually contains a large number of court pixels, and consequently its distribution is very similar to that of the dominant color.

Fig. 3-9 (a) Close-up view, (b) medium view, and (c) full court view, with their RGB and HSI histograms.


After obtaining the scene change frames of the basketball video, we identify the full court view shots, since most information about tactics is contained in this kind of shot.

For full court view shots, the ratio of dominant color pixels should be large. Therefore, with a threshold $T_{ratio}$, we can filter out close-up view and medium view shots.

Since longer clips contain more tactical information, we select clips whose length is greater than $L_{min}$ and use them for further analysis.
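The classification rule can be summarized with the short sketch below (a simplified illustration; it assumes a binary court mask per frame produced by the dominant color detection above, the values of $T_{ratio}$ and $L_{min}$ are placeholders, and the majority rule for aggregating frames into a clip label is our own assumption):

```python
import numpy as np

T_RATIO = 0.5    # placeholder for T_ratio
L_MIN = 90       # placeholder for L_min, in frames

def is_full_court_frame(court_mask):
    """court_mask: boolean array marking dominant (court) color pixels."""
    ratio = court_mask.mean()          # ratio of dominant color pixels
    return ratio > T_RATIO             # close-up / medium views fall below

def select_full_court_clips(clips):
    """clips: list of clips, each a list of per-frame court masks.
    Keep clips that are long enough and mostly full court view frames."""
    selected = []
    for masks in clips:
        if len(masks) < L_MIN:
            continue
        full_court = sum(is_full_court_frame(m) for m in masks)
        if full_court / len(masks) > 0.5:   # assumed majority rule
            selected.append(masks)
    return selected
```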

3.3 Ball Candidate Search

Identifying the ball in an image is difficult because the ball is usually small and sometimes moves very fast. The process of ball candidate identification is described in Fig. 3-10. For each frame in a full court view clip, we use color filtering, background subtraction, morphological operations, and shape and size filtering to find possible ball candidates. A ball candidate reduction step is then applied to simplify the tracking process by avoiding too many ball candidates.

Fig. 3-10 The process of ball candidate identification.

In the color filtering step, a color feature is used for ball pixel identification. For each frame, the image is divided into overlapping blocks of size $M \times N$. The overlap is achieved by moving the center of the first block by $m \times \frac{M}{2}$ and $n \times \frac{N}{2}$ to span the whole image, where $m$ and $n$ are arbitrary integers. We calculate the average R and G values in each block and identify whether the block contains the ball color.

However, the color of the basketball is not stable, owing to lighting conditions and the viewing angle. After manually choosing ball blocks from different video sources and calculating their mean R, G, B, H, S, and I components, we observe that the R and G values of the basketball lie in the ranges $110 \le r \le 175$ and $70 \le g \le 135$. Therefore, we identify blocks whose average R and G values fall in the basketball color range as possible ball blocks. Fig. 3-11 shows some cases of ball block color.

In case (a), the ball is stationary and its color is similar to the real ball color. Cases (b), (c), and (d) show the color of a moving ball. Since the ball moves fast, its color is influenced by the background.


Fig. 3-11 Observation of the color of basketball.

Only using the R and G values is not enough to find the correct ball candidates, so background subtraction is also used to select the correct ball candidates. Each possible ball block is compared to the corresponding position in the previous frame. Since the basketball moves at high speed, the ball blocks must have a large luminance difference between the two frames.

As shown in Fig. 3-12, (a) is a source image containing a moving ball, and (b) shows the pixels having a large luminance difference between (a) and its previous frame. If the luminance difference is large enough, the pixel is drawn as white; otherwise, it is drawn as black. The red circles indicate the ball positions. Most of the possible ball blocks that are not the ball are filtered out by background subtraction.

(a) Source image. (b) Frame difference.

Fig. 3-12 Background subtraction of the image.
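A minimal sketch of the color filtering and background subtraction steps is shown below (an illustration only; the block size, step, and luminance-difference threshold are assumed values, while the R/G ranges come from the observation above):

```python
import numpy as np

R_RANGE = (110, 175)      # observed basketball R range
G_RANGE = (70, 135)       # observed basketball G range
LUMA_DIFF_THRESHOLD = 30  # assumed threshold on the frame difference

def ball_color_blocks(frame_rgb, block_m=16, block_n=16):
    """Return centers of overlapping MxN blocks whose average R and G
    values fall inside the basketball color range."""
    h, w, _ = frame_rgb.shape
    centers = []
    for cy in range(block_m // 2, h - block_m // 2, block_m // 2):
        for cx in range(block_n // 2, w - block_n // 2, block_n // 2):
            block = frame_rgb[cy - block_m // 2:cy + block_m // 2,
                              cx - block_n // 2:cx + block_n // 2]
            r_avg, g_avg = block[..., 0].mean(), block[..., 1].mean()
            if R_RANGE[0] <= r_avg <= R_RANGE[1] and G_RANGE[0] <= g_avg <= G_RANGE[1]:
                centers.append((cx, cy))
    return centers

def moving_pixels(frame_gray, prev_gray):
    """Background subtraction: mark pixels with a large luminance difference."""
    return np.abs(frame_gray.astype(int) - prev_gray.astype(int)) > LUMA_DIFF_THRESHOLD
```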

The region with the largest number of connected ball blocks is found by applying a region generation algorithm [31]. The minimum bounding rectangle (MBR) around the region is defined for two purposes: 1) to filter out noise having the same color feature, such as the audience; and 2) to obtain the center of the ball region.

Many noisy regions rather than the ball region might be detected. Therefore, the area and aspect ratio of the minimum bounding rectangle (MBR) are used as characteristics to identify the possible ball region. Moreover, we define the ball center coordinate as

$$ (centerX, centerY) = \left( \frac{1}{n} \sum_{i=1}^{n} Px_i, \ \frac{1}{n} \sum_{i=1}^{n} Py_i \right), $$

where $n$ is the total number of pixels in the minimum bounding rectangle and $(Px_i, Py_i)$ is the coordinate of pixel $i$.
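The MBR filtering and center computation can be sketched as follows (a simplified illustration; the area and aspect-ratio limits are assumed values, not thresholds from the thesis):

```python
import numpy as np

MIN_AREA, MAX_AREA = 20, 900      # assumed limits on the MBR area, in pixels
MAX_ASPECT_RATIO = 2.0            # assumed limit on width/height (and height/width)

def ball_region_center(region_pixels):
    """region_pixels: array of (x, y) coordinates of one connected ball-block region.
    Returns the region center if its MBR looks ball-like, otherwise None."""
    pts = np.asarray(region_pixels)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    w, h = x_max - x_min + 1, y_max - y_min + 1
    area = w * h
    aspect = max(w / h, h / w)
    if not (MIN_AREA <= area <= MAX_AREA) or aspect > MAX_ASPECT_RATIO:
        return None                              # filter out non-ball regions
    # Ball center: mean of the coordinates of all pixels inside the MBR,
    # which reduces to the rectangle's midpoint (cf. the equation above).
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
```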

Fig. 3-13 shows the result of the ball candidate search after color and shape filtering. (a) is a case without camera motion and (b) is a case with camera motion. When the camera is fixed, there are few ball candidates. However, when there is camera motion, there can be too many ball candidates in a frame. To reduce the number of ball candidates, we perform a Ball-Candidate-Reduction step.

(a) Without camera motion. (b) With camera motion.

Fig. 3-13 Result of ball candidate search after color and shape filtering.

Ball-Candidate-Reduction is implemented by examining each ball candidate to see whether the search range around it contains any other candidate. We take the average coordinate of all candidates in the search range as the new candidate position, which removes many noisy candidates. As shown in Fig. 3-14, (a) represents the ball candidates before reduction and (b) depicts the ball candidates after reduction. Fig. 3-15 shows the result of applying the Ball-Candidate-Reduction step to a real image, where (a) and (b) show the candidates before and after reduction, respectively.
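A minimal sketch of this reduction is given below (an illustration only; the search radius is an assumed value, and the greedy merging order is an implementation choice not specified in the text):

```python
import numpy as np

SEARCH_RADIUS = 20.0   # assumed search range around each candidate, in pixels

def reduce_candidates(candidates):
    """candidates: list of (x, y) ball-candidate positions in one frame.
    Merge candidates that fall within each other's search range by replacing
    each group with the average of its coordinates."""
    remaining = list(candidates)
    reduced = []
    while remaining:
        seed = remaining.pop(0)
        group, keep = [seed], []
        # Collect every other candidate inside the search range of the seed.
        for c in remaining:
            if np.hypot(c[0] - seed[0], c[1] - seed[1]) <= SEARCH_RADIUS:
                group.append(c)
            else:
                keep.append(c)
        remaining = keep
        reduced.append(tuple(np.mean(group, axis=0)))   # averaged position
    return reduced
```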

