國立交通大學
多媒體工程研究所
碩士論文

籃球影片中的球員追蹤與戰術分析
Player Tracking and Tactic Analysis in Basketball Video

研究生:伏宗勝
指導教授:李素瑛 教授
籃球影片中的球員追蹤與戰術分析
Player Tracking and Tactic Analysis in Basketball Video
研 究 生:伏宗勝 Student:Tsung-Sheng Fu
指導教授:李素瑛 Advisor:Suh-Yin Lee
國立交通大學
多媒體工程研究所
碩士論文
A Thesis Submitted to Institute of Multimedia Engineering
College of Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of Master
in
Computer Science
January 2011
Hsinchu, Taiwan, Republic of China
籃球影片中的球員追蹤與戰術分析
研究生:伏宗勝
指導老師:李素瑛 教授
國立交通大學多媒體工程研究所
摘 要
隨著電視轉播技術的發展,越來越多人觀賞籃球比賽,但是大多數的人對於籃球知識並不是非常了解。我們也許會因為球員投進壓哨三分球而尖叫,或為一個強力灌籃而興奮,但是不見得知道球員是如何擺脫防守者進行投籃。目前已經有一些籃球影片內容的研究,例如精采畫面擷取和記分板辨識,但是這些仍然無法幫助觀眾對於籃球有更深入的了解。所以我們希望能設計一個系統來提供觀眾一些比較深入的籃球知識,而不只是表面上的資訊。籃球比賽中,觀眾最有興趣的就是得分。但是得分背後的戰術是一門很深奧的學問,因為籃球是一項五個人的運動,不可能只靠一個球員去對抗另外一隊,也就是說單一球員很難靠他自己擊潰對方的防守並且進行得分。大部分的得分都是經由執行戰術而來的。所以我們的目標就是自動辨認出籃球比賽中執行的戰術,並且把這些收集來的資訊帶給觀眾,讓他們能更了解籃球這項運動;甚至可以提供教練和球員,作為他們訓練及了解敵隊的攻防策略之用。

籃球戰術種類眾多,很難用單一演算法一以概之,因此我們著重於大多數戰術中都會使用的「掩護」,加以偵測並且分類,藉此分析出戰術執行的模式。我們開發的系統執行步驟如下。在比賽一開始先收集整場比賽都不會變的資訊,包含球場地板顏色以及兩隊球衣顏色。首先我們計算攝影機的參數,並且產生一張表示球場範圍的遮罩。第二步我們在球場範圍內計算出現次數最多的顏色,也代表著地板顏色。接著利用背景相減法,我們可以從球場範圍減去地板得到前景物體。最後我們利用顏色資訊將前景分成兩群,分別代表著兩支球隊的球衣顏色。因為這些資訊在整場比賽中都不會改變,所以我們可以利用它們降低往後的計算量,並且提升系統效能。在比賽中,針對每一次球權先分辨哪一隊是進攻方,了解雙方球員的行為模式才能判斷執行的戰術。利用先前得到的資訊並追蹤雙方球員的軌跡。在一波進攻結束的時候,根據追蹤到的雙方球員軌跡來判斷執行的掩護。經由實驗結果,我們開發的系統對於掩護的偵測和分類準確度相當令人滿意,因此在戰術分析上也有著顯著的幫助。這些被辨識出來的戰術會存入資料庫,於是觀眾就可以查詢他們有興趣的戰術並且學習。

關鍵字:籃球影片、球員追蹤、戰術分析、運動影片分析、影像處理
Player Tracking and Tactic Analysis in Basketball Video
Student: Tsung-Sheng Fu Advisor: Prof. Suh-Yin Lee
Institute of Multimedia Engineering
National Chiao Tung University
ABSTRACT
Thanks to the development of TV broadcasting technology, there are more and
more people watching basketball games. Most of us, however, do not know the
basketball sport very well. We may scream for a buzzer beater three-point shot or
get excited about a slam dunk, but we do not exactly realize how a player gets rid of
defenders and makes shots. There has been some research on basketball video,
such as highlight extraction and scoreboard recognition, but it still cannot help
people further understand this sport. Therefore, we intend to design a system which
provides the audience with further knowledge of basketball instead of superficial
information. In basketball games, people are most interested in scoring events.
Nevertheless, scoring is not as simple as it looks. It can be an abstruse subject
since basketball is a five-person sport and one player is not able to fight against the
opponent team. That is, it is difficult for an individual player to break the defense
and score by himself. Most shots are made through execution of tactics.
Consequently, our goal is to automatically identify tactics executed in basketball
games and bring the audience the collected information so that they can learn more about
the basketball sport.
Basketball tactics are numerous and diverse, and it is hard to cover them all with a single
algorithm. Hence, we focus on “screen,” which is widely used in most basketball
tactics. We detect and classify screens, and regard their patterns as certain tactics.
Our proposed system performs with the following steps. First of all, we gather some
consistent information at the beginning of the game, including the floor color and the
jersey colors of the two teams. We first compute the camera calibration and generate
a court mask indicating the court region. Second, we calculate the dominant color
within the court region, which represents the floor color. Next, we obtain the
foreground objects by subtracting the floor from the court region. This procedure is
similar to a background subtraction mechanism. Finally, we divide the foreground
region into two clusters with color information. Thus, the two clusters denote the
jersey colors of the two teams respectively. Since this information is consistent
through the entire game, we can utilize it to reduce computational cost and accelerate
the computation in the following frames. During the game, we first distinguish
which team is on offense in each possession since we have to learn the behaviors of
offensive and defensive players respectively in order to identify tactics. Next, we
extract players of the two teams with the previously obtained information and track
them. At the end of a possession, we identify what screens are set by the trajectories
of the players. In our experiments, the accuracy of screen detection and
classification is satisfactory, which significantly helps analysis of basketball tactics.
The identified tactics are then inserted into a database from which audience can query
tactics they are interested in.
Keywords: basketball video, player tracking, tactic analysis, sports video analysis, image processing
Acknowledgement
First of all, I greatly appreciate my advisor, Prof. Suh-Yin Lee, not only for her
kind guidance but also for her sincere help whenever I was troubled or upset. Next, I
would like to thank my seniors Hua-Tsung Chen, Hui-Zhen Gu and Min-Chun Hu for
their graceful ideas, precious experience and technical assistance. I am also grateful
to my colleagues for their inspiration. Also, I have to thank my brother Kuang-Yu
Fu, who is an expert in basketball and taught me a lot. Last but not least, I
appreciate my parents Hai-Ju Fu and Hsiao-Li Tung. Without their support and
encouragement, I would not have been able to complete this work. I devoutly dedicate this
thesis to them.
Table of Contents
Chapter 1. Introduction ... 1
Chapter 2. Related Work ... 4
2.1 Object Tracking ... 4
2.1.1 Object Detection ... 4
2.1.2 Object Tracking ... 7
2.2 Applications in Basketball Video ... 10
Chapter 3. Proposed System Architecture ... 16
3.1 Overview ... 16
3.2 Pre-Processing... 19
3.2.1 Camera Calibration ... 19
3.2.1.1 White Pixel Detection ... 20
3.2.1.2 Hough Line Extraction ... 22
3.2.1.3 Court Model Fitting ... 24
3.2.2 Court Mask Generation ... 29
3.2.3 Dominant Color Map Generation ... 29
3.2.4 Player Extraction ... 31
3.2.5 Team Clustering ... 33
3.2.6 Player Classification ... 35
3.2.7 Possession Recognition ... 36
3.3 Content Analysis ... 37
3.3.1 Court Model Tracking ... 37
3.3.2 Player Tracking ... 41
3.4 Tactic Analysis Algorithm ... 43
3.4.1 Screen Detection ... 49
3.4.2 Screen Classification ... 50
Chapter 4. Experimental Results... 53
4.1 White Pixel Detection ... 53
4.2 Camera Calibration ... 57
4.3 Player Extraction ... 60
4.4 Player Classification ... 62
4.5 Possession Recognition ... 64
4.6 Player Tracking ... 64
4.7 Tactic Analysis ... 67
Chapter 5. Conclusions ... 73
Bibliography ... 75
List of Figures
Figure 2.1: Taxonomy of tracking methods [13]. ... 7
Figure 2.2: Motion constraints [13]. (a) Proximity. (b) Maximum velocity. (c) Small velocity-change. (d) Common motion. (e) Rigidity constraint. ... 9
Figure 2.3: Examples of shot types in a basketball game [7]. (a) Court shot. (b) Court shot. (c) Medium shot. (d) Medium shot. (e) Close-up shot. (f) Out-of-court shot. ... 11
Figure 2.4: Example of Golden Section spatial composition [7]. (a) Frame regions. (b) Court view. (c) Medium view. ... 12
Figure 2.5: Detection of backboard top-border [7]. (a) Detected court lines. (b) Computing vanishing point. (c) Searching backboard top-border. ... 12
Figure 2.6: Detection of court lines and corresponding points [7]. ... 13
Figure 2.7: Demonstration of shooting location estimation [7]. ... 13
Figure 2.8: Example of the procedure [1]. (a) Original Frame. (b) Dominant color map. (c) Court mask. (d) Removing foreground objects. (e) White pixel detection. (f) Camera calibration. ... 14
Figure 2.9: Sample results of wide-open warning [1]. ... 15
Figure 3.1: System overview. ... 16
Figure 3.2: Flowchart of pre-processing. ... 17
Figure 3.3: Flowchart of content analysis. The modules with shadows have the same functionality as those in the pre-processing phase. ... 18
Figure 3.4: Schematic, magnified view of part of an input image containing a court line [2]. ... 20
Figure 3.5: Hough transform diagram. ... 22
Figure 3.6: Basketball court model. ... 24
Figure 3.7: Sample results of line extraction. (a) Original frame. (b) Detected white pixels. (c) Result using our method. The right column shows results using the typical method with different thresholds σ of (d) 50, (e) 100, (f) 150. ... 26
Figure 3.8: Examples of basketball video frames. Solid red lines are baselines and solid yellow lines are free-throw lines, and dotted lines are their normals respectively. (a) Left court. (b) Right court. ... 27
Figure 3.9: Court mask. (a) Original frame. (b) Corresponding court mask. ... 29
Figure 3.10: Object extraction. (a) Original frame. (b) Dominant color map. (c) Foreground objects. ... 32
Figure 3.11: K-means clustering. (a) Original frame. (b) Foreground objects. (c)
Experimental data with different color spaces and number of clusters. The horizontal axis means the number of clusters and the vertical axis indicates the clustering error,
viii
and different lines represent different color spaces. ... 34
Figure 3.12: Player classification. (a) Original frame. (b) Foreground objects. (c)
Players of one team (red jerseys). (d) Players of the other team (white jerseys)... 36
Figure 3.15: Complete diagram of Kalman filter [12]. ... 42
Figure 3.16: A sample basketball tactic. ... 44
Figure 3.17: Example of front-screen. (a) Trajectories. (b) Before screen. (c) Setting screen. (d) After screen. ... 45
Figure 3.18: Example of back-screen. (a) Trajectories. (b) Before screen. (c) Setting screen. (d) After screen. ... 46
Figure 3.19: Example of down-screen. (a) Trajectories. (b) Before screen. (c) Setting screen. (d) After screen. ... 47
Figure 3.20: Diagram of screen classification. ... 52
Figure 4.1: Results of white pixel detection. (a) Original frame. (b) Without line structure constraint. (c) With line structure constraint. ... 55
Figure 4.2: Results of camera calibration. (a) White line pixels. (b) Extracted court lines and camera calibration. ... 57
Figure 4.3: Results of player extraction. (a) Original frame. (b) Dominant color map. (c) Foreground objects. ... 61
Figure 4.4: Results of player classification. (a) Original frame. (b) Player mask of team 1. (c) Player mask of team 2. ... 63
Figure 4.5: Results of tactic analysis. (a) Screen detection. (b) Screen classification. ... 69
Figure 5.1: Real game example. (a) Coach setting tactic. (b) Tactic execution. ... 73
List of Tables
Table 1.1: Tactic categories and number of tactics using screens. ... 3
Table 2.1: Object detection categories [13]. ... 5
Table 2.2: Tracking categories [13]. ... 8
Table 3.1: Corresponding accumulator matrix to Figure 3.5. ... 23
Table 3.2: Basketball court dimensions. ... 25
Table 4.1: Video sources. ... 53
Table 4.2: Configuration for white pixel detection. ... 54
Table 4.3: Statistics of white pixel detection. ... 56
Table 4.4: Average projection error of camera calibration. ... 59
Table 4.5: Statistical results of possession recognition. ... 64
Table 4.6: Configuration of player tracking. ... 65
Table 4.7: Performance of player tracking. ... 66
Table 4.8: Configuration of screen detection and classification. ... 69
Table 4.9: Corresponding results of screen classification to Figure 4.5. ... 71
Chapter 1. Introduction
There has been much research on sports video analysis in the past decade.
However, little of it focuses on broadcast basketball video.
Research on basketball video faces several difficulties and challenges.
Most of all, basketball players occlude each other very often. As a result, it is
difficult to segment and track players correctly. Unfortunately, segmentation and
tracking are the soul of video analysis. In other words, unless we overcome the
occlusion problem, we are not able to analyze much content in basketball videos.
Chang et al. proposed a method [1, 50] that can accurately separate players of
different teams. This tremendously improved the possibility of basketball video
analysis because in basketball games, in order to make wide-open shots, players of the
same team seldom stay together. On the other hand, a defensive player usually
stands next to his target to defend. That is, once we distinguish players of the two
teams, we can avoid most occlusions. Second, in order for the audience to see the
ball clearly, the camera usually follows the ball. This leads to rapid camera
motion since the ball moves fast. Consequently, camera calibration is another
challenge. Farin et al. introduced a robust and efficient court model tracking
algorithm [2], which helps us use the frame coherence to obtain the camera calibration
with slight computational cost.
Besides, there is another question: what can we analyze in basketball videos?
Some studies focus on event detection and highlight extraction [3-6]; others are
interested in trajectory reconstruction [7]; still others concentrate on frame
information, including shot classification [8] and scoreboard recognition [9].
Few of these works, however, concern the
basketball sport itself. Our goal is to bring the audience further knowledge about
basketball, or even to provide professional players and coaches with technical
information. To achieve this goal, we put most effort in verifying the tactics
executed in basketball games. Having surveyed hundreds of basketball tactics, we
discovered that there is one fundamental essence – screen. A screen is a blocking
move performed by an offensive player, by standing beside or behind a defender, in
order to free a teammate to shoot, to receive a pass, or to drive in to score.
Basketball tactics can be categorized by strategies which they are following and
players whom the tactics are set for. Strategies include isolation, low-post, high-post,
mid-range, three-point, pick-and-roll, and pick-and-fade. Isolation means that the
team on offense tries to isolate a player and make a one-on-one attack. Low-post and
high-post indicate the location where players start attacks. Mid-range and
three-point are similar to low post and high post, describing the attack locations, but
they focus on the finish of attacks instead of the beginning. Pick-and-roll and
pick-and-fade strategies are intended to create open shots through screens. A tactic
sometimes does not follow a specific strategy, and we categorize it as general.
Furthermore, once the strategy is decided, a player is expected to shoot the ball.
That is, tactic categories are then distinguished by the positions of players, namely,
point guard, shooting guard, small forward, power forward, and center. In general,
point guards (PG) organize the offense of a team; shooting guards (SG) are good
shooters from long range; small forwards (SF) have high speed so that they usually
drive in and break the defense of the opponent team; power forwards (PF) and centers
(C) are the tallest players of a team and play mostly near the basket. From
our observation, most tactics consist of screens. Table 1.1 shows the total number of
surveyed tactics and the number of tactics using screens. According to Table 1.1, over
80% of the surveyed tactics (425 of 522) contain screens; that is, a tactic can be regarded as a combination
of different types of screens. To study a basketball tactic, we have to
learn what types of screens are used in it first. In this thesis, therefore, we are
focused on detecting screens and classifying their types.
Table 1.1: Tactic categories and number of tactics using screens.
Strategy \ Position    PG      SG      SF      PF      C      Overall
General                12/16   13/16   8/10    7/10    9/10   49/62
Isolation              8/16    10/16   7/16    6/16    1/4    32/68
Low Post               5/7     11/16   12/16   10/16   14/16  52/71
High Post              4/5     5/8     4/6     7/8     7/11   27/38
Three Point            11/12   15/16   14/16   6/8     6/7    52/59
Mid Range              13/16   14/16   14/16   13/16   15/16  69/80
Pick and Roll          16/16   16/16   16/16   16/16   16/16  80/80
Pick and Fade          16/16   16/16   16/16   14/14   2/2    64/64
Overall                85/104  100/120 91/112  79/104  70/82  425/522
In Chapter 2, we review previous works on object tracking and some
applications in basketball video. In Chapter 3, we present our proposed system,
including player tracking and tactic analysis. Chapter 4 shows our experimental
results, and Chapter 5 concludes this thesis.
Chapter 2. Related Work
In this chapter, we will briefly introduce the methods for object tracking, and
then show some recent researches on basketball video analysis.
2.1 Object Tracking
Object tracking is an important field in computer vision. When watching
videos, we can easily distinguish objects and tell their behavior through our
background knowledge. In computer vision, people want computers to recognize
what objects are in videos and how the objects behave. This is simple for
people but difficult for computers. Thus, many methods
for object tracking have been proposed, and are introduced in the following sections.
2.1.1 Object Detection
Before tracking objects, we have to extract objects either in every frame or when
they first appear in the video. That is, we will present the object detection methods
before we start to discuss the object tracking algorithms. The object detection
methods can be classified into four categories: point detectors, segmentation,
background subtraction, and supervised learning [13]. Table 2.1 shows the four
categories and their representative work.
Table 2.1: Object detection categories [13].
Categories              Representative Work
Point detectors         Moravec's detector [14], Harris detector [15],
                        Scale Invariant Feature Transform [16]
Segmentation            Mean-shift [18], Graph-cut [19]
Background modeling     Mixture of Gaussians [21], Eigenbackground [22],
                        Wallflower [23], Dynamic texture background [24]
Supervised classifiers  Support Vector Machine [25], Neural Networks [26],
                        Adaptive boosting [27]
Point detectors are used to find points of interest in images which have an
expressive texture in their respective region. To find points of interest, Moravec’s
operator [14] computes the variation of the image intensities within a 4-by-4 window
in the horizontal, vertical, diagonal, and anti-diagonal directions, and then chooses the
minimum of the four variations as representative values for the window. A point is
declared interesting if the intensity variation is a local maximum in a 12-by-12
window. The Harris detector [15] computes the first order image derivatives in
horizontal and vertical directions to emphasize the directional intensity variations, and
then constructs a structure matrix S_m over a small window around each pixel. The points of interest are identified by thresholding R = det(S_m) − k · tr(S_m)², where det(S_m) represents the determinant of S_m and tr(S_m) denotes its trace,
after applying non-maxima suppression. Theoretically, the S_m matrix is invariant to both rotation and translation. However, it is not invariant to affine or projective
transformations. In order to provide robust detection of interest points under
such deformations, Lowe proposed the SIFT (Scale Invariant Feature
Transform) method [16], which is confirmed to outperform most point detectors and to be
more tolerant to image deformations according to the survey by Mikolajczyk and
Schmid [17].
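A minimal NumPy sketch of the Harris response described above; the window size and k = 0.04 are illustrative choices, not values prescribed by [15]:

```python
import numpy as np

def box_sum(a, win):
    """Sum of each win x win neighborhood, computed via an integral image."""
    p = win // 2
    a = np.pad(a, p, mode='edge')
    c = np.pad(a.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    return c[win:, win:] - c[:-win, win:] - c[win:, :-win] + c[:-win, :-win]

def harris_response(img, k=0.04, win=3):
    """Per-pixel Harris response R = det(S_m) - k * tr(S_m)^2, where the
    structure matrix S_m is accumulated over a win x win window."""
    img = np.asarray(img, dtype=np.float64)
    Iy, Ix = np.gradient(img)           # first-order derivatives
    Sxx = box_sum(Ix * Ix, win)         # entries of S_m, per pixel
    Syy = box_sum(Iy * Iy, win)
    Sxy = box_sum(Ix * Iy, win)
    det = Sxx * Syy - Sxy * Sxy
    tr = Sxx + Syy
    return det - k * tr * tr            # threshold + non-maxima suppression follow
```

Corners of a bright square produce large positive responses, while flat regions give a response near zero.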
The objects we are interested in are usually moving objects in videos. Frame
difference is a typical method and is well studied since Jain and Nagel’s work [28].
However, differencing temporally adjacent frames cannot achieve robust results under
some circumstances. Thus, background subtraction became popular which builds a
representation of the scene called the background model and regards any significant
change in an image region from the background model as a moving object. Stauffer
and Grimson [21] use a mixture of Gaussians to model the pixel color. Each pixel is
classified based on whether the matched distribution represents the background
process. Instead of modeling the variation of individual pixels, Oliver et al.
introduce an integral approach using eigenspace decomposition [22]. It first
forms a background matrix B of dimension k × l from k input frames of dimension n × m, where l = nm. The background is then determined by the most descriptive eigenvectors.
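The eigenbackground idea can be sketched as follows; the number of retained eigenvectors and the residual threshold are illustrative assumptions, not values from [22]:

```python
import numpy as np

def eigenbackground(frames, n_eig=3):
    """Build an eigenspace background model from k input frames.
    frames: array of shape (k, n, m); returns (mean, top eigenvectors)."""
    k, n, m = frames.shape
    B = frames.reshape(k, n * m).astype(np.float64)   # k x l, l = n*m
    mean = B.mean(axis=0)
    # Principal directions of the centered background matrix.
    _, _, Vt = np.linalg.svd(B - mean, full_matrices=False)
    return mean, Vt[:n_eig]

def foreground_mask(frame, mean, V, thresh=30.0):
    """Project a frame onto the eigenspace, reconstruct the static
    background, and mark pixels with a large residual as foreground."""
    x = frame.reshape(-1).astype(np.float64) - mean
    recon = V.T @ (V @ x) + mean
    resid = np.abs(frame.reshape(-1) - recon)
    return (resid > thresh).reshape(frame.shape)
```

A moving object does not lie in the background eigenspace, so its pixels reconstruct poorly and survive the residual threshold.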
Segmentation algorithms partition an image into regions of reasonable
homogeneity. The mean-shift [18] method is proposed to find clusters in the
spatial-color space, which is scalable to various other applications such as edge
detection, image regularization [30], and tracking [31]. Shi and Malik [19]
formulate image segmentation as a graph partitioning problem, where the vertices
(pixels) are partitioned into disjoint subgraphs (regions), and overcome the difficulty
of bias toward small partitions by using the normalized cut criterion.
Figure 2.1: Taxonomy of tracking methods [13].
2.1.2 Object Tracking
The goal of object tracking is to gather the trajectory of a specific object. Take
our system for example, since we intend to identify what tactics are executed, we have
to analyze how the players move. That is, we must track players during the game in
order to obtain their trajectories. Tracking algorithms can be classified into three
main categories: point tracking, kernel tracking, and silhouette tracking. Figure 2.1
illustrates the taxonomy of tracking methods and Table 2.2 demonstrates their most
notable works.
Detected objects over a video clip can be represented by points, and the point
tracking finds the point correspondence across frames. Point tracking methods can
be divided into two categories: deterministic and statistical methods. Deterministic
methods define a cost of associating each object to a single object in two adjacent
frames using a set of motion constraints, which is usually a combination of the
constraints illustrated in Figure 2.2. Proximity assumes the location of the object
would not change notably from one frame to the next. Maximum velocity defines an
upper bound on the object velocity and limits the possible correspondences to the
circular neighborhood around the object. Small velocity change assumes the
direction and speed of the object do not change drastically. Common motion
constrains the velocity of objects in a small neighborhood to be similar. Rigidity
assumes that objects in the 3D world are rigid, so the distance between any two points
on the actual object will remain unchanged.
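These constraints can be combined into an association cost. The sketch below makes the simplest assumptions: proximity as Euclidean-distance cost, maximum velocity as a hard gate, and greedy rather than globally optimal assignment:

```python
import numpy as np

def match_points(prev_pts, curr_pts, max_disp=20.0):
    """Greedy point correspondence between two adjacent frames.
    Proximity is encoded as Euclidean distance cost; the maximum-velocity
    constraint is a hard gate on displacement."""
    prev_pts = np.asarray(prev_pts, float)
    curr_pts = np.asarray(curr_pts, float)
    # Pairwise distance matrix: cost of each candidate association.
    d = np.linalg.norm(prev_pts[:, None, :] - curr_pts[None, :, :], axis=2)
    d[d > max_disp] = np.inf              # gate out impossible matches
    matches, used = {}, set()
    # Greedily accept the cheapest remaining association.
    for i in np.argsort(d, axis=None):
        a, b = np.unravel_index(i, d.shape)
        if a in matches or b in used or not np.isfinite(d[a, b]):
            continue
        matches[a] = b
        used.add(b)
    return matches
```

Deterministic trackers such as the GOA tracker [33] solve this assignment optimally rather than greedily; the cost structure is the same idea.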
Table 2.2: Tracking categories [13].
Categories                        Representative Work
Point Tracking
  Deterministic methods           MGE tracker [32], GOA tracker [33]
  Statistical methods             Kalman filter [34], JPDAF [35], PMHT [36]
Kernel Tracking
  Template and density based
  appearance models               Mean-shift [31], KLT [37], Layering [38]
  Multi-view appearance models    Eigentracking [39], SVM tracker [40]
Silhouette Tracking
  Contour evolution               State space models [41], Variational methods [42],
                                  Heuristic methods [43]
  Matching shapes                 Hausdorff [44], Hough transform [45], Histogram [46]
Statistical methods consider the measurement and the model uncertainties during
state estimation. The object state may include properties
such as position, velocity, and acceleration. Measurements usually consist of the
object position in the image, which is obtained by a detection algorithm. The
Kalman filter [34] computes the covariance for state estimation, while the particle
filter [47] uses the conditional state density to estimate the next state. The particle
filter can be regarded as a generalization of the Kalman filter: the Kalman filter
estimates the state of a linear system whose state variables are assumed to be
normally distributed (Gaussian), whereas the particle filter also handles non-Gaussian
states.
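A minimal constant-velocity Kalman filter of the kind described above, with state [x, y, vx, vy] and position measurements; the noise levels q and r are illustrative, not tuned values from [34]:

```python
import numpy as np

class KalmanCV:
    """Constant-velocity Kalman filter for tracking a 2D point."""
    def __init__(self, x0, q=1e-2, r=1.0):
        self.x = np.array([x0[0], x0[1], 0.0, 0.0])   # [x, y, vx, vy]
        self.P = np.eye(4)                            # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0   # motion model
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = q * np.eye(4)                        # process noise
        self.R = r * np.eye(2)                        # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = np.asarray(z, float) - self.H @ self.x        # innovation
        S = self.H @ self.P @ self.H.T + self.R           # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

Fed a sequence of position measurements, the filter converges to both the position and the (unobserved) velocity of the target.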
Figure 2.2: Motion constraints [13]. (a) Proximity. (b) Maximum velocity. (c) Small
velocity-change. (d) Common motion. (e) Rigidity constraint.
Kernel refers to the object shape and appearance, and kernel tracking is typically
performed by computing the motion of the object, which is represented by a primitive
object region and generally in the form of parametric motion or the dense flow field
computed in subsequent frames. The major differences among kernel tracking
methods are the appearance representation used, the number of objects tracked, and
the method used to estimate the object motion. For instance, the mean-shift tracking
method [31] uses templates and density-based appearance models, while the SVM
tracker [40] tracks objects with multiview appearance models.
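The window-update step of mean-shift tracking can be sketched as follows, assuming a precomputed per-pixel likelihood map; the full tracker in [31] additionally uses kernel-weighted color histograms to produce that map:

```python
import numpy as np

def mean_shift_window(weights, box, n_iter=20):
    """One-object mean-shift: repeatedly move a fixed-size window to the
    centroid of the likelihood weights inside it.
    box = (row, col, h, w); returns the converged box."""
    r, c, h, w = box
    for _ in range(n_iter):
        win = weights[r:r + h, c:c + w]
        total = win.sum()
        if total == 0:
            break                          # no support: stay put
        ys, xs = np.mgrid[0:h, 0:w]
        # Offset of the weighted centroid from the window center.
        dy = (ys * win).sum() / total - (h - 1) / 2
        dx = (xs * win).sum() / total - (w - 1) / 2
        nr = int(round(r + dy)); nc = int(round(c + dx))
        nr = min(max(nr, 0), weights.shape[0] - h)   # clamp to image
        nc = min(max(nc, 0), weights.shape[1] - w)
        if (nr, nc) == (r, c):
            break                          # converged
        r, c = nr, nc
    return (r, c, h, w)
```

Each iteration is a hill-climbing step toward the local mode of the likelihood map, which is why the method converges quickly when the initial window overlaps the target.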
Objects may have complex shapes. Humans, for example, have heads, arms, and
legs, which cannot be well described by simple geometric shapes. The goal of
silhouette-based methods is to provide an accurate shape description, and to find the
object region in each frame through an object model generated according to the
previous frames. One category of the silhouette-based methods is shape matching
[44-46], which can be performed similar to tracking based on template matching
where an object silhouette and its corresponding model is searched in the current
frame. The search is invoked by computing the similarity between the object and the
model generated from the hypothesized object silhouette according to the previous
frame. The other category of the silhouette-based methods is contour tracking
[41-43], which iteratively evolves an initial contour in the previous frame to its new
position in the current frame. Tracking by evolving a contour can be performed with
either state space models which model the contour shape and motion or direct
evolution through minimizing the contour energy using direct minimization
techniques such as gradient descent.
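As an example of the shape-matching flavor, the symmetric Hausdorff distance between two contour point sets [44] can be computed directly:

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets, e.g. sampled
    silhouette contours: the farthest any point of one set lies from its
    nearest neighbor in the other set."""
    A = np.asarray(A, float)
    B = np.asarray(B, float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise
    h_ab = d.min(axis=1).max()   # farthest A-point from its nearest B-point
    h_ba = d.min(axis=0).max()
    return max(h_ab, h_ba)
```

In shape matching, the hypothesized silhouette whose Hausdorff distance to the model is smallest is taken as the tracked object.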
2.2 Applications in Basketball Video
As discussed in Chapter 1, basketball video analysis is not a common field due to
several difficulties and limitations. Fortunately, there are more and more new
methods proposed that help us overcome those obstacles and make basketball video
analysis much more practicable. We introduce below some recent studies
on basketball video analysis related to our work.
At first, we would like to introduce the work of Chen et al. [7]. Their research
has several notable contributions. First of all, they adapt the shot classification
algorithm to basketball videos. Basketball shots can be classified into three types:
court shots, medium shots, and close-up or out-of-court shots. A court shot
displays a global view of the court, while a medium shot zooms in on a few players, one of
whom is usually the ball handler. A close-up shot shows the above-waist view of
players, and an out-of-court shot presents spectators, coaches, or other places out of
the court. Figure 2.3 shows examples of different shot types in a basketball game.
Obviously, the court shot is the type that contains the most information on the court and
should be retrieved.
Figure 2.3: Examples of shot types in a basketball game [7]. (a) Court shot. (b) Court shot. (c)
Medium shot. (d) Medium shot. (e) Close-up shot. (f) Out-of-court shot.
They divide frames into nine regions by employing Golden Section spatial
composition rule as Figure 2.4 shows, and count the number of pixels of the floor
color in each region to distinguish shot types. Second, they propose a new method
to obtain vertical information in order to form a nonsingular 3D-to-2D transformation.
In addition to the typical court lines (2D), they extract the top-border of the backboard
(3D) by scanning the baseline from the vanishing point. Figure 2.5 demonstrates the
method and Figure 2.6 illustrates the result. Last but not least, they reconstruct 3D
information from single-view 2D video sequences. With the reconstructed 3D
information, the ball trajectories can be estimated as
well. The 3D ball trajectories facilitate automatic collection of game statistics about
shooting locations, from which people can learn the shooting tendency of an
individual player, or even a whole team. Figure 2.7 shows some experimental
results. In each image in Figure 2.7, blue circles are the ball positions over frames,
green circle represents the estimated shooting location, and the red squares show the
movements of corresponding points due to the camera motion.
Figure 2.4: Example of Golden Section spatial composition [7]. (a) Frame regions. (b) Court
view. (c) Medium view.
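The nine-region split and per-region floor-pixel counting can be sketched as follows; the 0.382/0.618 border positions are our reading of the Golden Section rule, and the exact proportions used in [7] may differ:

```python
import numpy as np

def golden_section_regions(h, w):
    """Split a frame of size h x w into 3x3 regions whose borders follow
    the Golden Section rule (roughly 0.382 / 0.618 of each dimension)."""
    ys = [0, int(0.382 * h), int(0.618 * h), h]
    xs = [0, int(0.382 * w), int(0.618 * w), w]
    return [(ys[i], ys[i + 1], xs[j], xs[j + 1])
            for i in range(3) for j in range(3)]

def floor_ratio_per_region(floor_mask):
    """Fraction of dominant-(floor-)color pixels in each of the nine
    regions; these ratios are the features used to tell shot types apart."""
    h, w = floor_mask.shape
    out = []
    for y0, y1, x0, x1 in golden_section_regions(h, w):
        reg = floor_mask[y0:y1, x0:x1]
        out.append(reg.mean() if reg.size else 0.0)
    return out
```

A court shot has high floor ratios across most regions, while a medium or close-up shot concentrates non-floor pixels in the central regions.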
Figure 2.5: Detection of backboard top-border [7]. (a) Detected court lines. (b) Computing vanishing point. (c) Searching backboard top-border.
Figure 2.6: Detection of court lines and corresponding points [7].
Figure 2.7: Demonstration of shooting location estimation [7].
Besides, we highly praise the work of Chang et al. [1, 50] not only for their
contribution to basketball video analysis but also for their novel research on
basketball tactics. They propose a method that can gracefully extract players on the
court.
Figure 2.8: Example of the procedure [1]. (a) Original Frame. (b) Dominant color map. (c)
Court mask. (d) Removing foreground objects. (e) White pixel detection. (f) Camera calibration.
At first, dominant (floor) color is obtained and a dominant color map is
generated. The court region can then be obtained through the largest connected
component analysis of the dominant color map. By utilizing this, foreground objects
can be removed, which facilitates the detection of white
line pixels for the sake of camera calibration since court lines are only located within
the court region. Figure 2.8 illustrates the procedure and the result of camera
calibration. Next, using color information and any clustering algorithm, foreground
region is separated into two clusters representing the jersey colors of the two teams.
That is, players of the two teams are recognized. Most importantly, they step
into the further field of tactic analysis. Their system informs the user when the
distribution of players satisfies the preset rules of the wide-open event. Although
their system does not explicitly identify which tactic has been executed, the user can
infer the tactic from how the wide-open event occurs. This inspires us to design a
system that identifies tactics executed in basketball games and keeps the patterns in
order for users to learn basketball tactics. Figure 2.9 demonstrates results of the
wide-open warning system.
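The largest-connected-component step used to isolate the court region might look like the following plain BFS sketch (our illustration, not the authors' implementation):

```python
import numpy as np
from collections import deque

def largest_component(mask):
    """Keep only the largest 4-connected component of a binary
    dominant-color map -- taken as the court region."""
    h, w = mask.shape
    labels = -np.ones((h, w), int)
    best_label, best_size, cur = -1, 0, 0
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or labels[sy, sx] != -1:
                continue
            # Flood-fill one component and measure its size.
            q = deque([(sy, sx)])
            labels[sy, sx] = cur
            size = 0
            while q:
                y, x = q.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w \
                            and mask[ny, nx] and labels[ny, nx] == -1:
                        labels[ny, nx] = cur
                        q.append((ny, nx))
            if size > best_size:
                best_size, best_label = size, cur
            cur += 1
    return labels == best_label
```

Small floor-colored patches outside the court (advertising boards, spectators' clothing) form separate components and are discarded.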
Chapter 3. Proposed System Architecture
This chapter describes the details of our proposed system. First of all, we will
give an overview in Section 3.1. In Section 3.2, pre-processing is described. Next,
we explain our proposed scheme of player tracking during the game in Section 3.3.
At last, we will introduce our algorithm for tactic detection and classification in
Section 3.4. Note that the video clips we are using are manually segmented by
possessions instead of a whole game because our main purpose is to analyze the
tactics executed in possessions and automatic possession distinction is not our focus
here. Possession means control of the ball. When one team is on offense, we say
the team has the possession. One team loses possession if it makes a shot or the
opponent team gets the ball. That is, the period we are interested in spans from when one
team first gets the ball until that team shoots the ball.
Figure 3.1: System overview.
3.1 Overview
The goal we are going to achieve is to analyze the tactics executed in basketball
games. However, there are some obstacles blocking our way to this goal since
basketball tactics are complex. The position a player plays, for instance, is
usually considered when setting tactics but is difficult for a computer to distinguish.
Fortunately, we have figured out that the screen is a key to basketball tactics, as
mentioned in Chapter 1. Hence, screen verification is the core of our system.
Figure 3.2: Flowchart of pre-processing.
Our system can be divided into two parts: pre-processing and analysis as shown
in Figure 3.1. Pre-processing is performed at the beginning of a video clip in order
to gather consistent information in this possession, such as floor color and jersey
colors. Since they are invariant during a possession, or even the whole game, we
only have to compute them once and for all. With this information gathered in
pre-processing, we can avoid recomputing it in each frame and accelerate the
computation. As Figure 3.2 illustrates, we first compute the camera calibration and
generate a court mask which indicates the court region. Second, we can obtain the
floor color by calculating the color histogram and finding the dominant color within
the court region. With the floor color, we can perform a background subtraction and
extract the foreground objects, that is, the players. Next, we cluster the players into
two teams according to their jersey colors. At last, we can realize which team is on
offense through the distance between the players and the basket.
Figure 3.3: Flowchart of content analysis. The modules with shadows have the same
functionality as those in the pre-processing phase.
In the following frames, we track the players and also confirm if a screen is set.
We first calculate the camera calibration and then generate a new court mask. Unlike in the pre-processing phase, we can derive the current camera calibration from the previous frame. Next, we extract the players by the floor color and the jersey
colors obtained from the pre-processing. Now we can track the players and detect
screens with the positions of players. Once a screen is detected, we retain the state at
the moment for the sake of screen type classification. At the end of the possession,
we classify the type of the screen set in the possession according to the trajectories of
the players. Figure 3.3 shows the flowchart of the analysis phase.
3.2 Pre-Processing
The reason why we perform the pre-processing is that there is some information
which will not change during a game, including the floor color and the jersey colors
of the two teams. If we repeatedly calculate the information in each frame and just
acquire the same result, it is nothing more than an impediment to efficiency.
Therefore, in order to reduce the computational cost, we prefer to gather the information
once and for all. The pre-processing is summarized in Figure 3.2.
3.2.1 Camera Calibration
Camera calibration describes how objects in the world coordinates are projected
onto the image coordinates. Since sport courts can be assumed to be planar, camera
calibration defines a plane-to-plane mapping (a homography) 𝐇 from a position 𝐩 in the world coordinates to the image coordinates 𝐩′. Writing positions as homogeneous coordinates 𝐩 = (𝑥, 𝑦, 1)T and 𝐩′ = (𝑢, 𝑣, 1)T, the transformation 𝐇𝐩 = 𝐩′ is defined in equation (1).

$$\begin{pmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} u' \\ v' \\ w' \end{pmatrix} \sim \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad u = u'/w', \quad v = v'/w' \quad (1)$$
Camera calibration plays an important role in our system since we do most
works under the real-world coordinates. The way we obtain the camera parameters
is based on the court lines in the frame. Hence, we first have to detect all white pixels in the frame, and then extract line candidates passing through those white pixels using the Hough transform. Next, we
filter some unreasonable line candidates out and fit the remaining for the real court
lines. Finally, we can obtain the camera parameters through the mapping between
the intersection points of the line candidates and those of the court lines.
Figure 3.4: Schematic, magnified view of part of an input image containing a court line [2].
3.2.1.1 White Pixel Detection
The court lines are generally painted with white color. Accordingly, the first
filter is to confirm that the values of the R, G, and B channels of a pixel are all above a threshold 𝜎𝑙, since the (R, G, B) value of a pure white pixel is (255, 255, 255). Unfortunately, court lines are usually not the only white objects in a frame, and the other white objects seriously interfere with line extraction. Hence, other constraints
should be applied to the white pixels. Assuming that court lines are not wider than 𝜏 pixels in the frame, we verify whether the pixels at a distance of 𝜏 in each of the four axial directions are considerably darker than the candidate pixel, as shown in Figure 3.4. Only if they are is the candidate pixel classified as a white line pixel, as defined in equation (2).

$$l(x, y) = \begin{cases} 1, & g(x, y) - g(x - \tau, y) > \sigma_d \,\wedge\, g(x, y) - g(x + \tau, y) > \sigma_d \\ 1, & g(x, y) - g(x, y - \tau) > \sigma_d \,\wedge\, g(x, y) - g(x, y + \tau) > \sigma_d \\ 0, & \text{else} \end{cases} \quad (2)$$
where 𝑙(𝑥, 𝑦) indicates if a pixel at position (𝑥, 𝑦) is a white pixel (𝑙(𝑥, 𝑦) = 1) or not (𝑙(𝑥, 𝑦) = 0), 𝑔(𝑥, 𝑦) is the luminance of a pixel at position (𝑥, 𝑦), and 𝜎𝑑 is the luminance difference threshold. In equation (2), the first line corresponds to the
test if darker pixels can be found at some horizontal distance, assuming that the court
line is mostly vertical. The second line performs the analogous test in the vertical
direction, assuming that the court line is almost horizontal.
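The test of equation (2) can be transcribed as a short sketch. The image, function name, and border handling below are our own illustrations, not part of the thesis system; the grayscale image is stored as a row-major 2D list.

```python
def is_white_line_pixel(g, x, y, tau, sigma_d):
    """Equation (2): a candidate pixel is a white line pixel if the pixels
    at distance tau on both sides (horizontally or vertically) are darker
    by more than sigma_d. `g` is a grayscale image indexed as g[y][x]."""
    h, w = len(g), len(g[0])
    if not (tau <= x < w - tau and tau <= y < h - tau):
        return 0  # too close to the image border to apply the test
    c = g[y][x]
    horizontal = c - g[y][x - tau] > sigma_d and c - g[y][x + tau] > sigma_d
    vertical = c - g[y - tau][x] > sigma_d and c - g[y + tau][x] > sigma_d
    return 1 if (horizontal or vertical) else 0

# A bright one-pixel-wide vertical "court line" at x = 4 on a dark floor.
image = [[50] * 9 for _ in range(9)]
for row in image:
    row[4] = 255
```

Pixels on the bright line pass the test, while floor pixels fail both the horizontal and the vertical condition.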
Sometimes the white pixels in textured areas may pass the above white line test,
such as small white letters in advertisement logos, spectators dressed in white clothes,
or white areas in the stadium. Therefore, we apply an additional line-structure
constraint to eliminate those white pixels in the textured areas by observing the two
eigenvalues of the structure matrix 𝐒, which is computed over a small window of size (2𝑏 + 1) × (2𝑏 + 1) around each candidate pixel (𝑝𝑥, 𝑝𝑦) and defined by equation (3) [10].

$$\mathbf{S} = \sum_{x = p_x - b}^{p_x + b} \; \sum_{y = p_y - b}^{p_y + b} \nabla g(x, y) \cdot (\nabla g(x, y))^{\mathrm{T}} \quad (3)$$
Depending on the two eigenvalues of the matrix S, called 𝜆1 and 𝜆2 (𝜆1 ≥ 𝜆2), the
area can be classified into textured (both 𝜆1 and 𝜆2 are large), linear (𝜆1 ≫ 𝜆2), and
flat (both 𝜆1 and 𝜆2 are small). On straight court lines the linear case applies, so we retain a white pixel only if 𝜆1 > 𝛼𝜆2. We find that with 𝛼 = 4, most linear cases can be recognized.
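The structure-matrix test can be sketched as below. We assume central differences for the gradient and use the closed-form eigenvalues of a symmetric 2×2 matrix; the helper names are illustrative, not from the thesis.

```python
import math

def structure_eigenvalues(g, px, py, b):
    """Equation (3): accumulate outer products of the gradient over a
    (2b+1) x (2b+1) window and return the two eigenvalues of the
    resulting 2x2 structure matrix, largest first."""
    sxx = sxy = syy = 0.0
    for y in range(py - b, py + b + 1):
        for x in range(px - b, px + b + 1):
            gx = (g[y][x + 1] - g[y][x - 1]) / 2.0  # central difference in x
            gy = (g[y + 1][x] - g[y - 1][x]) / 2.0  # central difference in y
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    root = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return tr / 2.0 + root, tr / 2.0 - root

def is_linear_area(g, px, py, b, alpha=4.0):
    """Keep the pixel only in the linear case, i.e. lambda1 > alpha * lambda2."""
    lam1, lam2 = structure_eigenvalues(g, px, py, b)
    return lam1 > alpha * lam2

edge = [[0, 0, 0, 0, 255, 255, 255, 255, 255] for _ in range(9)]  # vertical edge
flat = [[0] * 9 for _ in range(9)]                                # flat area
```

On the vertical edge the gradient points only in x, so 𝜆2 vanishes and the area is classified as linear; on the flat area both eigenvalues are zero and the test fails.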
3.2.1.2 Hough Line Extraction
In order to extract the court lines, we perform the standard Hough transform on
the detected white pixels. The parameter space (𝜃, 𝑑) is used to represent a line, where 𝜃 is the angle between the line normal and the horizontal axis, and 𝑑 is the distance between the line and the origin.
Figure 3.5: Hough transform diagram.
Figure 3.5 demonstrates how the Hough transform searches for lines. Given three points, we want to find the line passing through them. For each point, a number of
lines at different angles are plotted through it. In this example, we plot lines at an
interval of 30 degrees. For each plotted line, we compute its distance to the origin
and obtain an angle-distance pair representing this line. The results are shown in the
tables in Figure 3.5, and the corresponding accumulator matrix is shown in Table 3.1.
We can figure out that the parameter set (Angle, Distance) = (60, 81) appears most
frequently (three times). Thus, it is the line that we are looking for. Now we come back to our problem of extracting court line candidates from the detected white pixels. Similarly, we construct an accumulator matrix for all (𝜃, 𝑑) and sample the accumulator matrix at a resolution of one degree for 𝜃 and one pixel for
𝑑. By extracting the local maxima in the accumulator matrix, we can determine the line candidates.
Table 3.1: Corresponding accumulator matrix to Figure 3.5.
Angle\Dist.  -40  -20   0   6  23  40  41  50  57  60  70  75  80  81  90
0              0    0   0   0   0   1   0   0   1   0   0   1   0   0   0
30             0    0   0   0   0   0   0   0   0   0   1   0   1   0   1
60             0    0   0   0   0   0   0   0   0   0   0   0   0   3   0
90             0    0   0   0   0   0   0   1   0   1   1   0   0   0   0
120            0    0   0   1   1   0   1   0   0   0   0   0   0   0   0
150            1    1   1   0   0   0   0   0   0   0   0   0   0   0   0
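The voting scheme illustrated above can be sketched in a few lines. The point coordinates and the one-degree resolution below are illustrative, not taken from actual frames.

```python
import math
from collections import Counter

def hough_accumulate(points, theta_step=1):
    """Each point votes for every sampled line (theta, d) passing through
    it, where d = x*cos(theta) + y*sin(theta) rounded to one pixel."""
    acc = Counter()
    for x, y in points:
        for theta in range(0, 180, theta_step):
            rad = math.radians(theta)
            d = round(x * math.cos(rad) + y * math.sin(rad))
            acc[(theta, d)] += 1
    return acc

# Three collinear points on the line y = x: its normal angle is 135
# degrees and its distance to the origin is 0, so the accumulator cell
# (135, 0) collects one vote from each point.
acc = hough_accumulate([(0, 0), (1, 1), (2, 2)])
```

The cell with the most votes corresponds to the extracted line, just as cell (60, 81) does in Table 3.1.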
In addition, to obtain more precise line parameters, we refine them by minimizing the distance between the candidate line pixels and their nearest Hough lines.
First, we re-parameterize a line obtained from Hough transform by its normal
𝐧 = (𝑛𝑥, 𝑛𝑦)T with ‖𝐧‖ = 1 and the distance to the origin 𝑑 . With the
parameters, the distance between a point with homogeneous coordinates in image
space 𝐩 = (𝑥, 𝑦, 1)T and a line can be calculated by the dot product (𝑛𝑥, 𝑛𝑦, −𝑑) ⋅ 𝐩. Next, we define a set 𝐿 of court line pixels that are close to the line as equation (4) [2].
$$L = \{\,\mathbf{p} = (x, y, 1)^{\mathrm{T}} \mid l(x, y) = 1 \,\wedge\, \|(n_x, n_y, -d) \cdot \mathbf{p}\| < \sigma_r\,\} \quad (4)$$
where 𝜎𝑟 is the largest allowed distance, used to discard line pixel candidates far away from any Hough line. Since the pixels in this set are supposed to be on the
same court line and we assume the refined line equation to be 𝑥 ∙ 𝑚𝑥+ 𝑦 ∙ 𝑚𝑦 = 1,
we form an equation system and then solve it in the least squares sense as shown in equation (5).

$$\begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_{|L|} & y_{|L|} \end{pmatrix} \begin{pmatrix} m_x \\ m_y \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \quad (5)$$
Finally, the refined parameters are computed by $d = 1 / \sqrt{m_x^2 + m_y^2}$, $n_x = m_x d$, $n_y = m_y d$, since the slope of the line is $-m_x / m_y$ and the slope of the line normal is $m_y / m_x$.
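The refinement step can be sketched as a direct solve of the 2×2 normal equations of equation (5); the sample pixels below are illustrative.

```python
import math

def refine_line(pixels):
    """Fit x*mx + y*my = 1 to the court line pixels in the least squares
    sense (equation (5)) via the normal equations, then recover the
    normal form (nx, ny, d) as described in the text."""
    sxx = sum(x * x for x, y in pixels)
    sxy = sum(x * y for x, y in pixels)
    syy = sum(y * y for x, y in pixels)
    sx = sum(x for x, y in pixels)
    sy = sum(y for x, y in pixels)
    det = sxx * syy - sxy * sxy
    mx = (syy * sx - sxy * sy) / det
    my = (sxx * sy - sxy * sx) / det
    d = 1.0 / math.hypot(mx, my)     # distance of the line to the origin
    return mx * d, my * d, d         # unit normal (nx, ny) and d

nx, ny, d = refine_line([(0, 2), (1, 1), (2, 0)])  # pixels on x + y = 2
```

For the line x + y = 2 the distance to the origin is √2 and the unit normal has equal components, which the recovered parameters reproduce.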
3.2.1.3 Court Model Fitting
A court model consists of the lines that are drawn on the ground to define the playfield geometry. The basketball court model is illustrated in Figure 3.6, and its dimensions are shown in Table 3.2.
Figure 3.6: Basketball court model.
Table 3.2: Basketball court dimensions.
Area                                          Dimension (m)
Court length (sideline length)                28
Court width (baseline length)                 15
3-point line distance from the basket         6.25
Free-throw line distance from the baseline    5.8
Basket distance from the baseline             1.2
Restricted area width (free-throw line side)  3.6
Restricted area width (baseline side)         5
The camera calibration describes how those lines are projected from the world
coordinates onto the image coordinates. Therefore, in order to define the mapping,
the correspondence between a previously extracted Hough line and its line in the court model must be found. An algorithm has been proposed to find the line
correspondence [2] and performs well in several kinds of sport videos such as tennis,
volleyball and soccer. They regard the lines determined by extracting the local
maxima in the accumulator matrix (mentioned in Section 3.2.1.2) that are above a
threshold 𝜎ℎ as court line candidates. The line candidates are then classified as two sets: one contains the horizontal lines and the other consists of the vertical lines.
Next, they sort the line candidates according to their distances to the image boundary,
and can search for the correspondence between the candidate lines and the model
lines. Nevertheless, when applying to basketball videos, we find that the
performance is not good as we expected. The major problem is: how to determine
the value of 𝜎ℎ? The right column of Figure 3.7 shows some results of the typical line extraction method with different 𝜎ℎ values. When the 𝜎ℎ value is small, many unreasonable lines pass the test and are viewed as court line candidates; when it is large, we are not able to obtain sufficient lines to solve the camera parameters. Most
importantly, whatever threshold we set, the free-throw line is always filtered out because it is short. However, the free-throw line is not negligible: without it, all the corresponding points may lie on the baseline, which leads to a singular solution to the camera calibration.
Figure 3.7: Sample results of line extraction. (a) Original frame. (b) Detected white pixels. (c) Result using our method. The right column shows results using the typical method with different thresholds 𝜎ℎ of (d) 50, (e) 100, and (f) 150.
To overcome such a difficulty, we propose a new method to find the line
correspondence in basketball video. We do not sample the entire accumulator matrix; instead, we search specific ranges of the Hough space for particular court lines. It is an empirical method, and the searching ranges are determined through
our observation and knowledge of basketball video.
Figure 3.8: Examples of basketball video frames. Solid red lines are baselines and solid
yellow lines are free-throw lines, and dotted lines are their normals respectively. (a) Left court. (b) Right court.
Our main purpose is to discard noisy white pixels outside the court region and extract the correct court lines. The court region is determined by the sideline and the baseline. From Figure 3.8, we can see that the sideline and the baseline are the longest horizontal and vertical lines in the frame, respectively. Hence, our first step is to find
the longest horizontal and vertical lines. For the longest vertical line, we extract the
local maximum in the accumulator matrix within the range of [0, 80] and [100, 180]
degrees. Remember that the parameters in Hough space are the distance between a
line and the origin, and the angle between the line normal and the horizontal axis.
That is, we ignore lines whose angle to the horizontal axis is within the range of [-10, 10] degrees, namely the almost horizontal lines. We obtain the longest vertical line by eliminating horizontal lines instead of directly searching for vertical lines, since the baseline may not look perpendicular on screen. Furthermore, the angle of the
baseline also helps us distinguish whether it is the left court or right (see solid red
lines Figure 3.8). On the other hand, when extracting the longest horizontal line, we
just set the searching range to [80, 100] degrees since horizontal lines do not change
significantly on screen. With the longest vertical and horizontal lines, that is, the baseline and the sideline respectively, we filter out the white pixels outside the region bounded by the two lines, and reconstruct the accumulator matrix from the remaining
white pixels. Next, we extract the longest two horizontal lines as edges of the
restricted area. The top and bottom edges are then distinguished by the angles of the two lines. From Figure 3.8 we can see that the bottom edge is always more horizontal than the top edge. At last, we have to find the free-throw line. Please view
Figure 3.8 again. We mark the baseline with the solid red line and the free-throw
line with the solid yellow line, and the dotted lines are their normals respectively.
We can clearly see that although they are both vertical lines in the court model, the free-throw line always looks more perpendicular than the baseline, whichever side of the court is on screen, because the camera is usually set at the center of the court. Thus,
we set the searching range to [0, 𝜃𝑏] degrees for the right court and [𝜃𝑏, 180] degrees for the left court in order to extract the free-throw line. Here, 𝜃𝑏 is the angle between the baseline normal and the horizontal axis. Since the remaining white pixels are guaranteed to
be within the court region, we can recognize those extracted lines as correct court
lines. In this way, we extract lines and find the correspondence at the same time
since we know exactly which line we are looking for. Finally, we compute the
intersection points and solve the equation system defined as equation (6) which is
rewritten from equation (1).
$$\begin{pmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x'_1 x_1 & -x'_1 y_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -y'_1 x_1 & -y'_1 y_1 \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x'_2 x_2 & -x'_2 y_2 \\
0 & 0 & 0 & x_2 & y_2 & 1 & -y'_2 x_2 & -y'_2 y_2 \\
 & & & & \vdots & & & \\
x_n & y_n & 1 & 0 & 0 & 0 & -x'_n x_n & -x'_n y_n \\
0 & 0 & 0 & x_n & y_n & 1 & -y'_n x_n & -y'_n y_n
\end{pmatrix}
\begin{pmatrix} h_{00} \\ h_{01} \\ h_{02} \\ h_{10} \\ h_{11} \\ h_{12} \\ h_{20} \\ h_{21} \end{pmatrix}
=
\begin{pmatrix} x'_1 \\ y'_1 \\ x'_2 \\ y'_2 \\ \vdots \\ x'_n \\ y'_n \end{pmatrix} \quad (6)$$

Note that this makes use of the normalization $h_{22} = 1$. There are eight variables $h_{00}, h_{01}, \ldots, h_{21}$, so we need at least four points ($n \geq 4$) in order to form at least eight equations. Here we use the baseline, the free-throw line and the two edges of the restricted
area to solve the equation system. Figure 3.7 (c) illustrates the result using our
method.
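Equation (6) can be solved with any linear solver; the sketch below uses plain Gaussian elimination with exactly four correspondences (so the system is square rather than over-determined), and the sample points are purely illustrative.

```python
def solve_homography(world_pts, image_pts):
    """Build the system of equation (6) from point correspondences
    (x, y) -> (u, v) and solve for h00..h21, with h22 fixed to 1."""
    a, b = [], []
    for (x, y), (u, v) in zip(world_pts, image_pts):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    n = len(b)
    for i in range(n):  # Gaussian elimination with partial pivoting
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
            b[r] -= f * b[i]
    h = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(a[i][c] * h[c] for c in range(i + 1, n))
        h[i] = (b[i] - s) / a[i][i]
    return h + [1.0]  # h00..h21 plus h22 = 1

def project(h, x, y):
    """Apply equation (1): map a world point to image coordinates."""
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)

world = [(0, 0), (1, 0), (0, 1), (1, 1)]
image = [(10, 20), (12, 20), (10, 23), (12, 23)]  # scale (2, 3), shift (10, 20)
h = solve_homography(world, image)
```

With more than four correspondences the same matrix can be handed to a least-squares routine instead.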
3.2.2 Court Mask Generation
In basketball video, most of the important information is inside the court region. In other words, the court is our region of interest. In order to filter out noise and keep
significant information, we need a mask to indicate the court region, that is, the court
mask. With the previously computed camera calibration, we can project pixels from
image coordinates back to world coordinates and confirm whether they are located in
the court. Figure 3.9 shows a sample result of the court mask.
Figure 3.9: Court mask. (a) Original frame. (b) Corresponding court mask.
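One way to realize the court mask is sketched below: invert the homography (the adjugate suffices, because homographies are defined only up to scale), project each image pixel back to world coordinates, and test it against the 28 m × 15 m court of Table 3.2. The numeric homography here is a toy example, not a calibrated one.

```python
def adjugate(h):
    """Adjugate of a 3x3 homography given as a flat list h[0..8]; it is
    the inverse up to a scale factor, which homogeneous coordinates
    absorb anyway."""
    return [h[4]*h[8] - h[5]*h[7], h[2]*h[7] - h[1]*h[8], h[1]*h[5] - h[2]*h[4],
            h[5]*h[6] - h[3]*h[8], h[0]*h[8] - h[2]*h[6], h[2]*h[3] - h[0]*h[5],
            h[3]*h[7] - h[4]*h[6], h[1]*h[6] - h[0]*h[7], h[0]*h[4] - h[1]*h[3]]

def to_world(h_inv, u, v):
    """Project an image pixel back to world coordinates."""
    w = h_inv[6]*u + h_inv[7]*v + h_inv[8]
    return ((h_inv[0]*u + h_inv[1]*v + h_inv[2]) / w,
            (h_inv[3]*u + h_inv[4]*v + h_inv[5]) / w)

def in_court(h_inv, u, v, length=28.0, width=15.0):
    """Court mask test: does pixel (u, v) land inside the court model?"""
    x, y = to_world(h_inv, u, v)
    return 0.0 <= x <= length and 0.0 <= y <= width

h = [2, 0, 10, 0, 3, 20, 0, 0, 1]  # illustrative homography
h_inv = adjugate(h)
```

Running `in_court` over every pixel yields a binary mask like the one in Figure 3.9 (b).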
3.2.3 Dominant Color Map Generation
In order to extract players, we have tried several methods [18, 21, 25]. The major obstacle is the camera motion. For example, redundant moving pixels resulting from the camera motion generate a huge amount of noise when performing the frame
difference. For another example, the camera motion prevents us from obtaining a
consistent background image and extracting real moving objects. Therefore, a new
method is proposed to extract the players on the court by detecting objects with
different colors from the floor [1, 50].
The way we obtain the floor color is to find the dominant color within the court
region using the previously generated court mask. First of all, we calculate the color
histogram. Since it has been proved in [11] that the performance in the YCbCr space
is better than that in the HSI space, we choose the YCbCr space and use the Cb and Cr
components to calculate the color histogram. With the color histogram, we next find
peaks by the following steps:
Step 1: Determine the main peak bin 𝑃𝑒𝑎𝑘1, that is, the bin with the largest value.
Step 2: Find the connected region around the main peak bin. Only bins with
value larger than 𝛼 ∗ 𝑣𝑎𝑙𝑢𝑒(𝑃𝑒𝑎𝑘1) are considered.
Step 3: Compute the sum of the connected bins 𝑆𝑢𝑚1 and subtract the connected region from the histogram. That is, we set the values of the bins of
the connected region to zero in order not to be considered again in the following
iterations.
Step 4: Repeat the above steps until there are no bins remaining.
After completing the procedure, we will have several peaks and their sums. Finally,
by sorting these peaks according to their sums, we can determine the dominant color. It should be noted, however, that the court contains the restricted area (see Figure 3.6), which is also called the painted area since it is usually painted with a
different color from other parts of the court. That is, if we just recognize the largest
peak as the floor color, we will miss the restricted area. We propose two ways to
solve this problem. One is to regard the largest two peaks as the floor color, and the
other is to run the procedure again with another mask indicating the restricted area.
Both methods have their pros and cons. The first one takes advantage of the previous result, but it fails when many players stay in the restricted area.
The second one can distinguish the players from the restricted area since it compares
the two series of sorted peaks and verifies which peak represents the restricted area.
Through our experiment, we prefer the first one because it has good performance and
does not require extra computation. Figure 3.10 (b) illustrates a sample result of the
dominant color map.
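Steps 1–4 above can be sketched on a one-dimensional histogram (the thesis uses a 2D Cb/Cr histogram; one dimension keeps the connected-region growth easy to read). The histogram values and 𝛼 below are illustrative.

```python
def find_peaks(hist, alpha=0.1):
    """Steps 1-4: repeatedly take the largest bin, grow its connected
    region of bins above alpha * peak value, record the region's sum,
    and zero the region out; return peaks sorted by sum, largest first."""
    hist = list(hist)  # work on a copy
    peaks = []         # list of (peak_bin, sum_of_connected_region)
    while any(v > 0 for v in hist):
        peak = max(range(len(hist)), key=lambda i: hist[i])  # Step 1
        cut = alpha * hist[peak]
        lo = hi = peak
        while lo > 0 and hist[lo - 1] > cut:                 # Step 2: grow left
            lo -= 1
        while hi < len(hist) - 1 and hist[hi + 1] > cut:     # Step 2: grow right
            hi += 1
        peaks.append((peak, sum(hist[lo:hi + 1])))           # Step 3
        for i in range(lo, hi + 1):
            hist[i] = 0                                      # Step 3: clear region
        # Step 4: the loop repeats until every bin is cleared
    return sorted(peaks, key=lambda p: -p[1])  # dominant color first

peaks = find_peaks([0, 1, 8, 10, 7, 1, 0, 2, 5, 2, 0], alpha=0.1)
```

The first returned peak corresponds to the floor color; on a real frame the second-largest peak would typically be the restricted area.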
3.2.4 Player Extraction
With the court mask and the dominant color map, we can perform a
background-subtraction-like method to extract the foreground objects in the court
region. If the color of a pixel can be found in the dominant color map, the pixel
should be labeled as background; otherwise, it is a foreground pixel. After all pixels
are confirmed, we apply morphological operators in order to remove small objects
and gaps. Figure 3.10 (c) demonstrates a sample result of the extracted foreground objects.
Figure 3.10: Object extraction. (a) Original frame. (b) Dominant color map. (c) Foreground objects.
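The background-subtraction-like step can be sketched as follows, assuming the frame is given as (Cb, Cr) pairs, the court mask as 0/1 values, and the dominant color map as a set of chroma pairs; morphological cleaning is omitted and all values are illustrative.

```python
def extract_foreground(frame, court_mask, dominant_colors):
    """A pixel is foreground iff it lies inside the court mask and its
    chroma is not one of the dominant (floor) colors."""
    h, w = len(frame), len(frame[0])
    return [[1 if court_mask[y][x] and frame[y][x] not in dominant_colors else 0
             for x in range(w)]
            for y in range(h)]

# Tiny 2x3 frame: floor chroma (100, 120), "player" chroma (90, 140);
# the bottom-right pixel is masked out as outside the court.
fg = extract_foreground(
    [[(100, 120), (100, 120), (90, 140)],
     [(100, 120), (90, 140), (90, 140)]],
    [[1, 1, 1], [1, 1, 0]],
    {(100, 120)})
```

On real frames the resulting binary map is then cleaned with the morphological operators mentioned above.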
3.2.5 Team Clustering
Despite the fact that we have the foreground objects within the court region
extracted, we need more information to analyze the content. First of all, we have to
distinguish the jersey colors in order to separate the players of the two teams. We
use color information and k-means clustering to divide the foreground region into two
clusters representing the jersey colors of the two teams. In fact, we cannot use just two clusters, since there is some noise in the foreground region, the referees for example, which strongly interferes with the cluster centroids and leads to a poor player classification result. Figure 3.11 shows experimental data about
the number of clusters and the performances. Generally, the more the clusters, the
smaller the total distance between all data points and their corresponding cluster
centroids, which can also be regarded as the clustering error. However, the
computing time of the k-means clustering is proportional to the number of clusters.
We discovered that the clustering error decreases most rapidly when there are six
clusters. The clustering errors almost converge when there are more than six clusters.
This observation also holds for other video clips in our experiments. Thus, we separate the foreground region into six clusters, view the largest two clusters as the jersey colors of the two teams, and choose the YCbCr space since it performs better than the other color spaces.
Figure 3.11: K-means clustering. (a) Original frame. (b) Foreground objects. (c) Experimental data with different color spaces and numbers of clusters. The horizontal axis shows the number of clusters and the vertical axis indicates the clustering error; different lines represent different color spaces.
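A minimal k-means sketch in the spirit of this step: real foreground pixels are replaced by a handful of synthetic chroma pairs, and the initial centroids are fixed for determinism (the thesis does not specify the initialization).

```python
import math

def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its group."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        centroids = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Two synthetic "jersey color" clouds in (Cb, Cr) space.
pixels = [(9, 10), (11, 10), (10, 9), (199, 200), (201, 200), (200, 201)]
centroids, groups = kmeans(pixels, [(0, 0), (255, 255)])
```

With k = 6, as in the thesis, the two largest of the returned groups would be taken as the jersey colors.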
3.2.6 Player Classification
Having gathered the jersey colors of the two teams, we are going to classify
players of the two teams in this step. At first, we determine by equation (7) which cluster each pixel in the foreground region belongs to according to its color, where 𝐶𝑒𝑛𝑡𝑟𝑜𝑖𝑑𝐴 and 𝐶𝑒𝑛𝑡𝑟𝑜𝑖𝑑𝐵 are the centroids of the largest two clusters from the Team Clustering step, that is, the jersey colors of the two teams.
$$cluster(x, y) = \begin{cases} Cluster_A, & \|color(x, y) - Centroid_A\| < \delta_c \\ Cluster_B, & \|color(x, y) - Centroid_B\| < \delta_c \\ None, & \text{else} \end{cases} \quad (7)$$
After clustering all foreground pixels, we can generate two maps indicating the
players of the two teams as Figure 3.12 illustrates. Since we constrain the distance between a pixel's color and the centroid of the cluster it belongs to, we can remove non-player objects such as the referees during clustering. Also, we perform morphological operators to remove noise and gaps. At last, we obtain the refined player maps of the two teams.
Figure 3.12: Player classification. (a) Original frame. (b) Foreground objects. (c) Players of
one team (red jerseys). (d) Players of the other team (white jerseys).
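Equation (7) amounts to a thresholded nearest-centroid rule; below is a sketch with illustrative centroid and 𝛿𝑐 values.

```python
import math

def classify_pixel(color, centroid_a, centroid_b, delta_c):
    """Equation (7): assign a foreground pixel to a team only when its
    color is within delta_c of that team's jersey centroid; otherwise
    discard it (referees, noise)."""
    if math.dist(color, centroid_a) < delta_c:
        return "A"
    if math.dist(color, centroid_b) < delta_c:
        return "B"
    return None

# Illustrative jersey centroids in (Cb, Cr) space and a threshold.
CENTROID_A, CENTROID_B, DELTA_C = (100, 100), (200, 50), 30
```

Pixels close to neither centroid, such as a referee's shirt color, fall into the None case and are excluded from both player maps.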
3.2.7 Possession Recognition
It is important to realize which team has the possession of the ball, or which team
is on offense, before tactic analysis. Typically, a defender is expected to stand closer to the basket than the offensive player he is guarding, since the purpose of the team on defense is to prevent the opponent team from putting
the ball into the basket. Hence, we can make use of this feature to judge which team
is on offense. We first project all players back to the real-world court model with the
camera calibration. For each team, we compute the average distance between its
players and the basket. The team with shorter distance to the basket is recognized as
on defense. On the other hand, the team on offense is on average farther away from the basket.
Algorithm 1: Possession Recognition
Input: positions of players of the two teams, represented by 𝑝𝑙𝑎𝑦𝑒𝑟𝑠𝑡𝑒𝑎𝑚1 and 𝑝𝑙𝑎𝑦𝑒𝑟𝑠𝑡𝑒𝑎𝑚2, and position of the basket
Output: the team on offense, 𝑡𝑒𝑎𝑚1 or 𝑡𝑒𝑎𝑚2
local dist[2]
for i := 1 to 2 do
    dist[i] := 0
    for each player in players_team_i do
        dist[i] := dist[i] + dist(player, basket)
    end for
    dist[i] := dist[i] / |players_team_i|
end for
if dist[1] > dist[2] then
    return team1
else
    return team2
end if
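Algorithm 1 transcribes directly into a few lines of code; the positions below are illustrative world coordinates in meters, with the basket placed at (0, 7.5).

```python
import math

def possession(players_team1, players_team2, basket):
    """Algorithm 1: the offensive team is the one whose players are, on
    average, farther from the basket."""
    def avg_dist(players):
        return sum(math.dist(p, basket) for p in players) / len(players)
    return "team1" if avg_dist(players_team1) > avg_dist(players_team2) else "team2"

team1 = [(6, 7), (7, 8), (5, 6)]   # spread out around the perimeter
team2 = [(2, 7), (3, 8), (1, 7)]   # packed near the basket
```

In the system, the player positions fed to this routine come from the court-model projection described above.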
3.3 Content Analysis
In this section, we are going to explain how we gather information from each
frame during the game. With the consistent data from the pre-processing, we can
obtain the information we want simply and quickly. Figure 3.3 gives a brief overview of our analysis mechanism. Note that the modules in Figure 3.3 with shadows have the same functionality as those in pre-processing. That is, we perform Court Mask Generation, Player Extraction, and Player Classification in the analysis phase as well.