國立交通大學
多媒體工程研究所
碩士論文

籃球影片中的球員追蹤與戰術分析
Player Tracking and Tactic Analysis in Basketball Video

研究生:伏宗勝
指導教授:李素瑛 教授
籃球影片中的球員追蹤與戰術分析
Player Tracking and Tactic Analysis in Basketball Video
研 究 生:伏宗勝 Student:Tsung-Sheng Fu
指導教授:李素瑛 Advisor:Suh-Yin Lee
國立交通大學
多媒體工程研究所
碩士論文
A Thesis Submitted to Institute of Multimedia Engineering
College of Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of Master
in
Computer Science
January 2011
Hsinchu, Taiwan, Republic of China
籃球影片中的球員追蹤與戰術分析
研究生:伏宗勝
指導老師:李素瑛 教授
國立交通大學多媒體工程研究所
摘 要
隨著電視轉播技術的發展,越來越多人觀賞籃球比賽,但是大多數的人對於籃球知識並不是非常了解。我們也許會因為球員投進壓哨三分球而尖叫,或為一個強力灌籃而興奮,但是不見得知道球員是如何擺脫防守者進行投籃。目前已經有一些籃球影片內容的研究,例如精采畫面擷取和記分板辨識,但是這些仍然無法幫助觀眾對於籃球有更深入的了解。所以我們希望能設計一個系統來提供觀眾一些比較深入的籃球知識,而不只是表面上的資訊。籃球比賽中,觀眾最有興趣的就是得分。但是得分背後的戰術是一門很深奧的學問,因為籃球是一項五個人的運動,不可能只靠一個球員去對抗另外一隊,也就是說單一球員很難靠他自己擊潰對方的防守並且進行得分。大部分的得分都是經由執行戰術而來的。所以我們的目標就是自動辨認出籃球比賽中執行的戰術,並且把這些收集來的資訊帶給觀眾,讓他們能更了解籃球這項運動;甚至可以提供教練和球員,作為他們訓練及了解敵隊的攻防策略之用。

籃球戰術種類眾多,很難用單一演算法一以概之,因此我們著重於大多數戰術中都會使用的「掩護」,加以偵測並且分類,藉此分析出戰術執行的模式。我們開發的系統執行步驟如下。在比賽一開始先收集整場比賽都不會變的資訊,包含球場地板顏色以及兩隊球衣顏色。首先我們計算攝影機的參數,並且產生一張表示球場範圍的遮罩。第二步我們在球場範圍內計算出現次數最多的顏色,也代表著地板顏色。接著利用背景相減法,我們可以從球場範圍減去地板得到前景物體。最後我們利用顏色資訊將前景分成兩群,分別代表著兩支球隊的球衣顏色。因為這些資訊在整場比賽中都不會改變,所以我們可以利用它們降低往後的計算量,並且提升系統效能。在比賽中,針對每一次球權先分辨哪一隊是進攻方,了解雙方球員的行為模式才能判斷執行的戰術。利用先前得到的資訊並追蹤雙方球員的軌跡。在一波進攻結束的時候,根據追蹤到的雙方球員軌跡來判斷執行的掩護。經由實驗結果,我們開發的系統對於掩護的偵測和分類準確度相當令人滿意,因此在戰術分析上也有著顯著的幫助。這些被辨識出來的戰術會存入資料庫,於是觀眾就可以查詢他們有興趣的戰術並且學習。

關鍵字:籃球影片、球員追蹤、戰術分析、運動影片分析、影像處理
Player Tracking and Tactic Analysis in Basketball Video
Student: Tsung-Sheng Fu Advisor: Prof. Suh-Yin Lee
Institute of Multimedia Engineering
National Chiao Tung University
ABSTRACT
Thanks to the development of TV broadcasting technology, there are more and
more people watching basketball games. Most of us, however, do not know the
basketball sport very well. We may scream for a buzzer beater three-point shot or
get excited about a slam dunk, but we do not exactly realize how a player gets rid of
defenders and makes shots. There has been some research on basketball video,
such as highlight extraction and scoreboard recognition, but it still cannot help
people further understand this sport. Therefore, we intend to design a system which
provides the audience with further knowledge of basketball instead of superficial
information. In basketball games, people are most interested in scoring events.
Nevertheless, scoring is not as simple as it looks. It can be an abstruse subject
since basketball is a five-person sport and one player is not able to fight against the
opponent team. That is, it is difficult for an individual player to break the defense
and score by himself. Most shots are made through execution of tactics.
Consequently, our goal is to automatically identify tactics executed in basketball
games and bring the audience the collected information so that they can learn more about
the basketball sport.
Basketball tactics are numerous and diverse, and it is hard to cover them all with a single
algorithm. Hence, we focus on “screen,” which is widely used in most basketball
tactics. We detect and classify screens, and regard their patterns as certain tactics.
Our proposed system performs with the following steps. First of all, we gather some
consistent information at the beginning of the game, including the floor color and the
jersey colors of the two teams. We first compute the camera calibration and generate
a court mask indicating the court region. Second, we calculate the dominant color
within the court region, which represents the floor color. Next, we obtain the
foreground objects by subtracting the floor from the court region. This procedure is
similar to a background subtraction mechanism. Finally, we divide the foreground
region into two clusters with color information. Thus, the two clusters denote the
jersey colors of the two teams respectively. Since this information is consistent
through the entire game, we can utilize it to reduce computational cost and accelerate
the computation in the following frames. During the game, we first distinguish
which team is on offense in each possession since we have to learn the behaviors of
offensive and defensive players respectively in order to identify tactics. Next, we
extract players of the two teams with the previously obtained information and track
them. At the end of a possession, we identify what screens are set by the trajectories
of the players. In our experiments, the accuracy of screen detection and
classification is satisfactory, which significantly helps analysis of basketball tactics.
The identified tactics are then inserted into a database from which audience can query
tactics they are interested in.
Keywords: basketball video, player tracking, tactic analysis, sports video analysis, image processing
Acknowledgement
First of all, I greatly appreciate my advisor, Prof. Suh-Yin Lee, not only for her
kind guidance but also for her sincere help whenever I was troubled or upset. Next, I
would like to thank my seniors Hua-Tsung Chen, Hui-Zhen Gu and Min-Chun Hu for
their graceful ideas, precious experience and technical assistance. I am also grateful
to my colleagues for their inspiration. Also, I have to thank my brother Kuang-Yu
Fu, who is an expert in basketball and taught me a lot. Last but not least, I
appreciate my parents Hai-Ju Fu and Hsiao-Li Tung. Without their support and
encouragement, I would not have been able to complete this work. I devoutly dedicate this
thesis to them.
Table of Contents
Chapter 1. Introduction ... 1
Chapter 2. Related Work ... 4
2.1 Object Tracking ... 4
2.1.1 Object Detection ... 4
2.1.2 Object Tracking ... 7
2.2 Applications in Basketball Video ... 10
Chapter 3. Proposed System Architecture ... 16
3.1 Overview ... 16
3.2 Pre-Processing... 19
3.2.1 Camera Calibration ... 19
3.2.1.1 White Pixel Detection ... 20
3.2.1.2 Hough Line Extraction ... 22
3.2.1.3 Court Model Fitting ... 24
3.2.2 Court Mask Generation ... 29
3.2.3 Dominant Color Map Generation ... 29
3.2.4 Player Extraction ... 31
3.2.5 Team Clustering ... 33
3.2.6 Player Classification ... 35
3.2.7 Possession Recognition ... 36
3.3 Content Analysis ... 37
3.3.1 Court Model Tracking ... 37
3.3.2 Player Tracking ... 41
3.4 Tactic Analysis Algorithm ... 43
3.4.1 Screen Detection ... 49
3.4.2 Screen Classification ... 50
Chapter 4. Experimental Results... 53
4.1 White Pixel Detection ... 53
4.2 Camera Calibration ... 57
4.3 Player Extraction ... 60
4.4 Player Classification ... 62
4.5 Possession Recognition ... 64
4.6 Player Tracking ... 64
4.7 Tactic Analysis ... 67
Chapter 5. Conclusions ... 73
Bibliography ... 75
List of Figures
Figure 2.1: Taxonomy of tracking methods [13]. ... 7
Figure 2.2: Motion constraints [13]. (a) Proximity. (b) Maximum velocity. (c) Small velocity-change. (d) Common motion. (e) Rigidity constraint. ... 9
Figure 2.3: Examples of shot types in a basketball game [7]. (a) Court shot. (b) Court shot. (c) Medium shot. (d) Medium shot. (e) Close-up shot. (f) Out-of-court shot. ... 11
Figure 2.4: Example of Golden Section spatial composition [7]. (a) Frame regions. (b) Court view. (c) Medium view. ... 12
Figure 2.5: Detection of backboard top-border [7]. (a) Detected court lines. (b) Computing vanishing point. (c) Searching backboard top-border. ... 12
Figure 2.6: Detection of court lines and corresponding points [7]. ... 13
Figure 2.7: Demonstration of shooting location estimation [7]. ... 13
Figure 2.8: Example of the procedure [1]. (a) Original Frame. (b) Dominant color map. (c) Court mask. (d) Removing foreground objects. (e) White pixel detection. (f) Camera calibration. ... 14
Figure 2.9: Sample results of wide-open warning [1]. ... 15
Figure 3.1: System overview. ... 16
Figure 3.2: Flowchart of pre-processing. ... 17
Figure 3.3: Flowchart of content analysis. The modules with shadows have the same functionality as those in the pre-processing phase. ... 18
Figure 3.4: Schematic, magnified view of part of an input image containing a court line [2]. ... 20
Figure 3.5: Hough transform diagram. ... 22
Figure 3.6: Basketball court model. ... 24
Figure 3.7: Sample results of line extraction. (a) Original frame. (b) Detected white pixels. (c) Result using our method. The right column shows results using the typical method with different thresholds σ of (d) 50, (e) 100, (f) 150. ... 26
Figure 3.8: Examples of basketball video frames. Solid red lines are baselines and solid yellow lines are free-throw lines, and dotted lines are their normals respectively. (a) Left court. (b) Right court. ... 27
Figure 3.9: Court mask. (a) Original frame. (b) Corresponding court mask. ... 29
Figure 3.10: Object extraction. (a) Original frame. (b) Dominant color map. (c) Foreground objects. ... 32
Figure 3.11: K-means clustering. (a) Original frame. (b) Foreground objects. (c)
Experimental data with different color spaces and number of clusters. The horizontal axis means the number of clusters and the vertical axis indicates the clustering error,
viii
and different lines represent different color spaces. ... 34
Figure 3.12: Player classification. (a) Original frame. (b) Foreground objects. (c)
Players of one team (red jerseys). (d) Players of the other team (white jerseys)... 36
Figure 3.15: Complete diagram of Kalman filter [12]. ... 42
Figure 3.16: A sample basketball tactic. ... 44
Figure 3.17: Example of front-screen. (a) Trajectories. (b) Before screen. (c) Setting screen. (d) After screen. ... 45
Figure 3.18: Example of back-screen. (a) Trajectories. (b) Before screen. (c) Setting screen. (d) After screen. ... 46
Figure 3.19: Example of down-screen. (a) Trajectories. (b) Before screen. (c) Setting screen. (d) After screen. ... 47
Figure 3.20: Diagram of screen classification. ... 52
Figure 4.1: Results of white pixel detection. (a) Original frame. (b) Without line structure constraint. (c) With line structure constraint. ... 55
Figure 4.2: Results of camera calibration. (a) White line pixels. (b) Extracted court lines and camera calibration. ... 57
Figure 4.3: Results of player extraction. (a) Original frame. (b) Dominant color map. (c) Foreground objects. ... 61
Figure 4.4: Results of player classification. (a) Original frame. (b) Player mask of team 1. (c) Player mask of team 2. ... 63
Figure 4.5: Results of tactic analysis. (a) Screen detection. (b) Screen classification. ... 69
Figure 5.1: Real game example. (a) Coach setting tactic. (b) Tactic execution. ... 73
List of Tables
Table 1.1: Tactic categories and number of tactics using screens. ... 3
Table 2.1: Object detection categories [13]. ... 5
Table 2.2: Tracking categories [13]. ... 8
Table 3.1: Corresponding accumulator matrix to Figure 3.5. ... 23
Table 3.2: Basketball court dimensions. ... 25
Table 4.1: Video sources. ... 53
Table 4.2: Configuration for white pixel detection. ... 54
Table 4.3: Statistics of white pixel detection. ... 56
Table 4.4: Average projection error of camera calibration. ... 59
Table 4.5: Statistical results of possession recognition. ... 64
Table 4.6: Configuration of player tracking. ... 65
Table 4.7: Performance of player tracking. ... 66
Table 4.8: Configuration of screen detection and classification. ... 69
Table 4.9: Corresponding results of screen classification to Figure 4.5. ... 71
Chapter 1. Introduction
There has been much research on sports video analysis in the past decade.
However, little of it focuses on broadcast basketball video.
Research on basketball video faces several difficulties and challenges.
Most of all, basketball players occlude each other very often. As a result, it is
difficult to segment and track players correctly. Unfortunately, segmentation and
tracking are the soul of video analysis. In other words, unless we overcome the
occlusion problem, we are not able to analyze much content in basketball videos.
Chang et al. proposed a method [1, 50] that can accurately separate players of
different teams. This tremendously improved the possibility of basketball video
analysis because in basketball games, in order to make wide-open shots, players of the
same team seldom stay together. On the other hand, a defensive player usually
stands next to his target to defend. That is, once we distinguish players of the two
teams, we can avoid most occlusions. Second, in order for the audience to see the
ball clearly, the camera usually follows the ball. This leads to rapid camera
motion since the ball moves fast. Consequently, camera calibration is another
challenge. Farin et al. introduced a robust and efficient court model tracking
algorithm [2], which helps us use the frame coherence to obtain the camera calibration
with slight computational cost.
Besides, there is another question: what can we analyze in basketball videos?
Some studies focus on event detection and highlight extraction [3-6]; others are
interested in trajectory reconstruction [7]; still others concentrate on frame
information, including shot classification [8] and scoreboard recognition [9].
Few of these works, however, concern the
basketball sport itself. Our goal is to bring the audience further knowledge about
basketball, or even to provide professional players and coaches with technical
information. To achieve this goal, we put most effort in verifying the tactics
executed in basketball games. Having surveyed hundreds of basketball tactics, we
discovered that there is one fundamental essence – screen. A screen is a blocking
move performed by an offensive player, by standing beside or behind a defender, in
order to free a teammate to shoot, to receive a pass, or to drive in to score.
Basketball tactics can be categorized by strategies which they are following and
players whom the tactics are set for. Strategies include isolation, low-post, high-post,
mid-range, three-point, pick-and-roll, and pick-and-fade. Isolation means that the
team on offense tries to isolate a player and make a one-on-one attack. Low-post and
high-post indicate the location where players start attacks. Mid-range and
three-point are similar to low post and high post, describing the attack locations, but
they focus on the finish of attacks instead of the beginning. Pick-and-roll and
pick-and-fade strategies are intended to create open shots through screens. A tactic
sometimes does not follow a specific strategy, and we categorize it as general.
Furthermore, once the strategy is decided, a player is expected to shoot the ball.
That is, tactic categories are then distinguished by the positions of players, namely,
point guard, shooting guard, small forward, power forward, and center. In general,
point guards (PG) organize the offense of a team; shooting guards (SG) are good
shooters from long range; small forwards (SF) have high speed so that they usually
drive in and break the defense of the opponent team; power forwards (PF) and centers
(C) are the tallest players of a team and play mostly near the basket. From
our observation, most tactics consist of screens. Table 1.1 shows the total number of
surveyed tactics and the number of tactics using screens. According to Table 1.1, over
80% of the surveyed tactics (425 of 522) contain screens; that is, a tactic can be regarded as a combination
of different types of screens. To study a basketball tactic, we have to
learn what types of screens are used in it first. In this thesis, therefore, we are
focused on detecting screens and classifying their types.
Table 1.1: Tactic categories and number of tactics using screens.
Strategy \ Position    PG      SG      SF      PF      C      Overall
General                12/16   13/16   8/10    7/10    9/10   49/62
Isolation              8/16    10/16   7/16    6/16    1/4    32/68
Low Post               5/7     11/16   12/16   10/16   14/16  52/71
High Post              4/5     5/8     4/6     7/8     7/11   27/38
Three Point            11/12   15/16   14/16   6/8     6/7    52/59
Mid Range              13/16   14/16   14/16   13/16   15/16  69/80
Pick and Roll          16/16   16/16   16/16   16/16   16/16  80/80
Pick and Fade          16/16   16/16   16/16   14/14   2/2    64/64
Overall                85/104  100/120 91/112  79/104  70/82  425/522
In Chapter 2, we review previous works on object tracking and some
applications in basketball video. In Chapter 3, we present our proposed system,
including player tracking and tactic analysis. Chapter 4 shows our experimental
results, and Chapter 5 concludes this thesis.
Chapter 2. Related Work
In this chapter, we will briefly introduce the methods for object tracking, and
then show some recent researches on basketball video analysis.
2.1 Object Tracking
Object tracking is an important field in computer vision. When watching
videos, we can easily distinguish objects and tell their behavior through our
background knowledge. In computer vision, people want computers to recognize
what objects are in videos and how the objects behave. This is simple for
people but difficult for computers. Thus, many methods
for object tracking have been proposed, and are introduced in the following sections.
2.1.1 Object Detection
Before tracking objects, we have to extract objects either in every frame or when
they first appear in the video. That is, we will present the object detection methods
before we start to discuss the object tracking algorithms. The object detection
methods can be classified into four categories: point detectors, segmentation,
background subtraction, and supervised learning [13]. Table 2.1 shows the four
categories and their representative work.
Table 2.1: Object detection categories [13].
Categories              Representative Work
Point detectors         Moravec's detector [14], Harris detector [15],
                        Scale Invariant Feature Transform [16]
Segmentation            Mean-shift [18], Graph-cut [19]
Background modeling     Mixture of Gaussians [21], Eigenbackground [22],
                        Wallflower [23], Dynamic texture background [24]
Supervised classifiers  Support Vector Machine [25], Neural Networks [26],
                        Adaptive boosting [27]
Point detectors are used to find points of interest in images which have an
expressive texture in their respective region. To find points of interest, Moravec’s
operator [14] computes the variation of the image intensities within a 4-by-4 window
in the horizontal, vertical, diagonal, and anti-diagonal directions, and then chooses the
minimum of the four variations as representative values for the window. A point is
declared interesting if the intensity variation is a local maximum in a 12-by-12
window. The Harris detector [15] computes the first order image derivatives in
horizontal and vertical directions to emphasize the directional intensity variations, and
then constructs a structure matrix S_m over a small window around each pixel. The points of interest are identified by thresholding R = det(S_m) − k · tr(S_m)², where det(S_m) represents the determinant of S_m and tr(S_m) denotes its trace,
after applying non-maxima suppression. Theoretically, the S_m matrix is invariant to both rotation and translation. However, it is not invariant to affine or projective
transformations. In order to provide robust detection of interest points under
such deformations, Lowe proposed the SIFT (Scale Invariant Feature
Transform) method [16], which is confirmed to outperform most point detectors and to be
more tolerant to image deformations according to the survey by Mikolajczyk and
Schmid [17].
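A minimal NumPy sketch of the Harris response described above; the window size and k = 0.04 are illustrative choices, not values prescribed by [15]:

```python
import numpy as np

def box_sum(a, win):
    """Sum of each win x win neighborhood, computed via an integral image."""
    p = win // 2
    a = np.pad(a, p, mode='edge')
    c = np.pad(a.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    return c[win:, win:] - c[:-win, win:] - c[win:, :-win] + c[:-win, :-win]

def harris_response(img, k=0.04, win=3):
    """Per-pixel Harris response R = det(S_m) - k * tr(S_m)^2, where the
    structure matrix S_m is accumulated over a win x win window."""
    img = np.asarray(img, dtype=np.float64)
    Iy, Ix = np.gradient(img)           # first-order derivatives
    Sxx = box_sum(Ix * Ix, win)         # entries of S_m, per pixel
    Syy = box_sum(Iy * Iy, win)
    Sxy = box_sum(Ix * Iy, win)
    det = Sxx * Syy - Sxy * Sxy
    tr = Sxx + Syy
    return det - k * tr * tr            # threshold + non-maxima suppression follow
```

Corners of a bright square produce large positive responses, while flat regions give a response near zero.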
The objects we are interested in are usually moving objects in videos. Frame
difference is a typical method and is well studied since Jain and Nagel’s work [28].
However, differencing temporally adjacent frames cannot achieve robust results under
some circumstances. Thus, background subtraction became popular which builds a
representation of the scene called the background model and regards any significant
change in an image region from the background model as a moving object. Stauffer
and Grimson [21] use a mixture of Gaussians to model the pixel color. Each pixel is
classified based on whether the matched distribution represents the background
process. Instead of modeling the variation of individual pixels, Oliver et al.
introduce an integral approach using eigenspace decomposition [22]. It first
forms a background matrix B of dimension k × l from k input frames of dimension n × m, where l = nm. The background is then determined by the most descriptive eigenvectors.
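The eigenbackground idea can be sketched as follows; the number of retained eigenvectors and the residual threshold are illustrative assumptions, not values from [22]:

```python
import numpy as np

def eigenbackground(frames, n_eig=3):
    """Build an eigenspace background model from k input frames.
    frames: array of shape (k, n, m); returns (mean, top eigenvectors)."""
    k, n, m = frames.shape
    B = frames.reshape(k, n * m).astype(np.float64)   # k x l, l = n*m
    mean = B.mean(axis=0)
    # Principal directions of the centered background matrix.
    _, _, Vt = np.linalg.svd(B - mean, full_matrices=False)
    return mean, Vt[:n_eig]

def foreground_mask(frame, mean, V, thresh=30.0):
    """Project a frame onto the eigenspace, reconstruct the static
    background, and mark pixels with a large residual as foreground."""
    x = frame.reshape(-1).astype(np.float64) - mean
    recon = V.T @ (V @ x) + mean
    resid = np.abs(frame.reshape(-1) - recon)
    return (resid > thresh).reshape(frame.shape)
```

A moving object does not lie in the background eigenspace, so its pixels reconstruct poorly and survive the residual threshold.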
Segmentation algorithms partition an image into regions of reasonable
homogeneity. The mean-shift [18] method is proposed to find clusters in the
spatial-color space, which is scalable to various other applications such as edge
detection, image regularization [30], and tracking [31]. Shi and Malik [19]
formulate image segmentation as a graph partitioning problem, where the vertices
(pixels) are partitioned into disjoint subgraphs (regions), and overcome the difficulty
of bias toward small partitions by using the normalized cut criterion.
Figure 2.1: Taxonomy of tracking methods [13].
2.1.2 Object Tracking
The goal of object tracking is to gather the trajectory of a specific object. Take
our system for example, since we intend to identify what tactics are executed, we have
to analyze how the players move. That is, we must track players during the game in
order to obtain their trajectories. Tracking algorithms can be classified into three
main categories: point tracking, kernel tracking, and silhouette tracking. Figure 2.1
illustrates the taxonomy of tracking methods and Table 2.2 demonstrates their most
notable works.
Detected objects over a video clip can be represented by points, and the point
tracking finds the point correspondence across frames. Point tracking methods can
be divided into two categories: deterministic and statistical methods. Deterministic
methods define a cost of associating each object to a single object in two adjacent
frames using a set of motion constraints, which is usually a combination of the
constraints illustrated in Figure 2.2. Proximity assumes the location of the object
would not change notably from one frame to the next. Maximum velocity defines an
upper bound on the object velocity and limits the possible correspondences to the
circular neighborhood around the object. Small velocity change assumes the
direction and speed of the object do not change drastically. Common motion
constrains the velocity of objects in a small neighborhood to be similar. Rigidity
assumes that objects in the 3D world are rigid, so the distance between any two points
on the actual object will remain unchanged.
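These constraints can be combined into an association cost. The sketch below makes the simplest assumptions: proximity as Euclidean-distance cost, maximum velocity as a hard gate, and greedy rather than globally optimal assignment:

```python
import numpy as np

def match_points(prev_pts, curr_pts, max_disp=20.0):
    """Greedy point correspondence between two adjacent frames.
    Proximity is encoded as Euclidean distance cost; the maximum-velocity
    constraint is a hard gate on displacement."""
    prev_pts = np.asarray(prev_pts, float)
    curr_pts = np.asarray(curr_pts, float)
    # Pairwise distance matrix: cost of each candidate association.
    d = np.linalg.norm(prev_pts[:, None, :] - curr_pts[None, :, :], axis=2)
    d[d > max_disp] = np.inf              # gate out impossible matches
    matches, used = {}, set()
    # Greedily accept the cheapest remaining association.
    for i in np.argsort(d, axis=None):
        a, b = np.unravel_index(i, d.shape)
        if a in matches or b in used or not np.isfinite(d[a, b]):
            continue
        matches[a] = b
        used.add(b)
    return matches
```

Deterministic trackers such as the GOA tracker [33] solve this assignment optimally rather than greedily; the cost structure is the same idea.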
Table 2.2: Tracking categories [13].
Categories                        Representative Work
Point Tracking
  Deterministic methods           MGE tracker [32], GOA tracker [33]
  Statistical methods             Kalman filter [34], JPDAF [35], PMHT [36]
Kernel Tracking
  Template and density based
  appearance models               Mean-shift [31], KLT [37], Layering [38]
  Multi-view appearance models    Eigentracking [39], SVM tracker [40]
Silhouette Tracking
  Contour evolution               State space models [41], Variational methods [42],
                                  Heuristic methods [43]
  Matching shapes                 Hausdorff [44], Hough transform [45], Histogram [46]
Statistical methods consider the measurement and the model uncertainties during
state estimation. The object state may include properties
such as position, velocity, and acceleration. Measurements usually consist of the
object position in the image, which is obtained by a detection algorithm. The
Kalman filter [34] computes the covariance for state estimation, while the particle
filter [47] uses the conditional state density to estimate the next state. The particle
filter can be regarded as a generalization of the Kalman filter: the Kalman filter
estimates the state of a linear system whose state variables are assumed to be
normally distributed (Gaussian), whereas the particle filter also handles non-Gaussian
states.
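A minimal constant-velocity Kalman filter of the kind described above, with state [x, y, vx, vy] and position measurements; the noise levels q and r are illustrative, not tuned values from [34]:

```python
import numpy as np

class KalmanCV:
    """Constant-velocity Kalman filter for tracking a 2D point."""
    def __init__(self, x0, q=1e-2, r=1.0):
        self.x = np.array([x0[0], x0[1], 0.0, 0.0])   # [x, y, vx, vy]
        self.P = np.eye(4)                            # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0   # motion model
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = q * np.eye(4)                        # process noise
        self.R = r * np.eye(2)                        # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = np.asarray(z, float) - self.H @ self.x        # innovation
        S = self.H @ self.P @ self.H.T + self.R           # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

Fed a sequence of position measurements, the filter converges to both the position and the (unobserved) velocity of the target.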
Figure 2.2: Motion constraints [13]. (a) Proximity. (b) Maximum velocity. (c) Small
velocity-change. (d) Common motion. (e) Rigidity constraint.
Kernel refers to the object shape and appearance, and kernel tracking is typically
performed by computing the motion of the object, which is represented by a primitive
object region and generally in the form of parametric motion or the dense flow field
computed in subsequent frames. The major differences among kernel tracking
methods are the appearance representation used, the number of objects tracked, and
the method used to estimate the object motion. For instance, the mean-shift tracking
method [31] uses templates and density-based appearance models, while the SVM
tracker [40] tracks objects with multiview appearance models.
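The window-update step of mean-shift tracking can be sketched as follows, assuming a precomputed per-pixel likelihood map; the full tracker in [31] additionally uses kernel-weighted color histograms to produce that map:

```python
import numpy as np

def mean_shift_window(weights, box, n_iter=20):
    """One-object mean-shift: repeatedly move a fixed-size window to the
    centroid of the likelihood weights inside it.
    box = (row, col, h, w); returns the converged box."""
    r, c, h, w = box
    for _ in range(n_iter):
        win = weights[r:r + h, c:c + w]
        total = win.sum()
        if total == 0:
            break                          # no support: stay put
        ys, xs = np.mgrid[0:h, 0:w]
        # Offset of the weighted centroid from the window center.
        dy = (ys * win).sum() / total - (h - 1) / 2
        dx = (xs * win).sum() / total - (w - 1) / 2
        nr = int(round(r + dy)); nc = int(round(c + dx))
        nr = min(max(nr, 0), weights.shape[0] - h)   # clamp to image
        nc = min(max(nc, 0), weights.shape[1] - w)
        if (nr, nc) == (r, c):
            break                          # converged
        r, c = nr, nc
    return (r, c, h, w)
```

Each iteration is a hill-climbing step toward the local mode of the likelihood map, which is why the method converges quickly when the initial window overlaps the target.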
Objects may have complex shapes. Humans, for example, have heads, arms, and
legs, which cannot be well described by simple geometric shapes. The goal of
silhouette-based methods is to provide an accurate shape description, and to find the
object region in each frame through an object model generated according to the
previous frames. One category of the silhouette-based methods is shape matching
[44-46], which can be performed similar to tracking based on template matching
where an object silhouette and its corresponding model is searched in the current
frame. The search is invoked by computing the similarity between the object and the
model generated from the hypothesized object silhouette according to the previous
frame. The other category of the silhouette-based methods is contour tracking
[41-43], which iteratively evolves an initial contour in the previous frame to its new
position in the current frame. Tracking by evolving a contour can be performed with
either state space models which model the contour shape and motion or direct
evolution through minimizing the contour energy using direct minimization
techniques such as gradient descent.
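As an example of the shape-matching flavor, the symmetric Hausdorff distance between two contour point sets [44] can be computed directly:

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets, e.g. sampled
    silhouette contours: the farthest any point of one set lies from its
    nearest neighbor in the other set."""
    A = np.asarray(A, float)
    B = np.asarray(B, float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise
    h_ab = d.min(axis=1).max()   # farthest A-point from its nearest B-point
    h_ba = d.min(axis=0).max()
    return max(h_ab, h_ba)
```

In shape matching, the hypothesized silhouette whose Hausdorff distance to the model is smallest is taken as the tracked object.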
2.2 Applications in Basketball Video
As discussed in Chapter 1, basketball video analysis is not a common field due to
several difficulties and limitations. Fortunately, there are more and more new
methods proposed that help us overcome those obstacles and make basketball video
analysis much more practicable. We introduce below some recent studies
on basketball video analysis related to our work.
At first, we would like to introduce the work of Chen et al. [7]. Their research
has several notable contributions. First of all, they adapt the shot classification
algorithm to basketball videos. Basketball shots can be classified into three types:
court shots, medium shots, and close-up or out-of-court shots. A court shot
displays a global view of the court, while a medium shot zooms in on a few players, one of
whom is usually the ball handler. A close-up shot shows the above-waist view of
players, and an out-of-court shot presents spectators, coaches, or other places out of
the court. Figure 2.3 shows examples of different shot types in a basketball game.
Obviously, the court shot is the type that contains the most information on the court and
should be retrieved.
Figure 2.3: Examples of shot types in a basketball game [7]. (a) Court shot. (b) Court shot. (c)
Medium shot. (d) Medium shot. (e) Close-up shot. (f) Out-of-court shot.
They divide frames into nine regions by employing Golden Section spatial
composition rule as Figure 2.4 shows, and count the number of pixels of the floor
color in each region to distinguish shot types. Second, they propose a new method
to obtain vertical information in order to form a nonsingular 3D-to-2D transformation.
In addition to the typical court lines (2D), they extract the top-border of the backboard
(3D) by scanning the baseline from the vanishing point. Figure 2.5 demonstrates the
method and Figure 2.6 illustrates the result. Last but not least, they reconstruct 3D
information from single-view 2D video sequences. With the reconstructed 3D
information, the ball trajectories can be estimated as
well. The 3D ball trajectories facilitate automatic collection of game statistics about
shooting locations, from which people can learn the shooting tendency of an
individual player, or even a whole team. Figure 2.7 shows some experimental
results. In each image in Figure 2.7, blue circles are the ball positions over frames,
green circle represents the estimated shooting location, and the red squares show the
movements of corresponding points due to the camera motion.
Figure 2.4: Example of Golden Section spatial composition [7]. (a) Frame regions. (b) Court
view. (c) Medium view.
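The nine-region split and per-region floor-pixel counting can be sketched as follows; the 0.382/0.618 border positions are our reading of the Golden Section rule, and the exact proportions used in [7] may differ:

```python
import numpy as np

def golden_section_regions(h, w):
    """Split a frame of size h x w into 3x3 regions whose borders follow
    the Golden Section rule (roughly 0.382 / 0.618 of each dimension)."""
    ys = [0, int(0.382 * h), int(0.618 * h), h]
    xs = [0, int(0.382 * w), int(0.618 * w), w]
    return [(ys[i], ys[i + 1], xs[j], xs[j + 1])
            for i in range(3) for j in range(3)]

def floor_ratio_per_region(floor_mask):
    """Fraction of dominant-(floor-)color pixels in each of the nine
    regions; these ratios are the features used to tell shot types apart."""
    h, w = floor_mask.shape
    out = []
    for y0, y1, x0, x1 in golden_section_regions(h, w):
        reg = floor_mask[y0:y1, x0:x1]
        out.append(reg.mean() if reg.size else 0.0)
    return out
```

A court shot has high floor ratios across most regions, while a medium or close-up shot concentrates non-floor pixels in the central regions.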
Figure 2.5: Detection of backboard top-border [7]. (a) Detected court lines. (b) Computing vanishing point. (c) Searching backboard top-border.
Figure 2.6: Detection of court lines and corresponding points [7].
Figure 2.7: Demonstration of shooting location estimation [7].
Besides, we highly praise the work of Chang et al. [1, 50] not only for their
contribution to basketball video analysis but also for their novel research on
basketball tactics. They propose a method that can gracefully extract players on the
court.
Figure 2.8: Example of the procedure [1]. (a) Original Frame. (b) Dominant color map. (c)
Court mask. (d) Removing foreground objects. (e) White pixel detection. (f) Camera calibration.
At first, dominant (floor) color is obtained and a dominant color map is
generated. The court region can then be obtained through the largest connected
component analysis of the dominant color map. By utilizing this, foreground objects
can be removed, which facilitates the detection of white
line pixels for the sake of camera calibration since court lines are only located within
the court region. Figure 2.8 illustrates the procedure and the result of camera
calibration. Next, using color information and any clustering algorithm, foreground
region is separated into two clusters representing the jersey colors of the two teams.
That is, players of the two teams are recognized. Most importantly, they step
into the further field of tactic analysis. Their system informs the user when the
distribution of players satisfies the preset rules of the wide-open event. Although
their system does not explicitly identify which tactic has been executed, the user can
infer the tactic from how the wide-open event occurs. This inspires us to design a
system that identifies tactics executed in basketball games and keeps the patterns in
order for users to learn basketball tactics. Figure 2.9 demonstrates results of the
wide-open warning system.
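The largest-connected-component step used to isolate the court region might look like the following plain BFS sketch (our illustration, not the authors' implementation):

```python
import numpy as np
from collections import deque

def largest_component(mask):
    """Keep only the largest 4-connected component of a binary
    dominant-color map -- taken as the court region."""
    h, w = mask.shape
    labels = -np.ones((h, w), int)
    best_label, best_size, cur = -1, 0, 0
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or labels[sy, sx] != -1:
                continue
            # Flood-fill one component and measure its size.
            q = deque([(sy, sx)])
            labels[sy, sx] = cur
            size = 0
            while q:
                y, x = q.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w \
                            and mask[ny, nx] and labels[ny, nx] == -1:
                        labels[ny, nx] = cur
                        q.append((ny, nx))
            if size > best_size:
                best_size, best_label = size, cur
            cur += 1
    return labels == best_label
```

Small floor-colored patches outside the court (advertising boards, spectators' clothing) form separate components and are discarded.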
Chapter 3. Proposed System Architecture
This chapter describes the details of our proposed system. First of all, we will
give an overview in Section 3.1. In Section 3.2, pre-processing is described. Next,
we explain our proposed scheme of player tracking during the game in Section 3.3.
At last, we will introduce our algorithm for tactic detection and classification in
Section 3.4. Note that the video clips we are using are manually segmented by
possessions instead of a whole game because our main purpose is to analyze the
tactics executed in possessions and automatic possession distinction is not our focus
here. Possession means control of the ball. When one team is on offense, we say
the team has the possession. One team loses possession if it makes a shot or the
opponent team gets the ball. That is, the period we are interested in spans from when one
team first gets the ball until that team shoots the ball.
Figure 3.1: System overview.
3.1 Overview
The goal we are going to achieve is to analyze the tactics executed in basketball
games. However, there are some obstacles blocking our way to this goal since
basketball tactics are complex. The position a player plays, for instance, is
usually considered when setting tactics but is difficult for a computer to distinguish.
Fortunately, we have figured out that the screen is a key to basketball tactics, as
mentioned in Chapter 1. Hence, screen verification is the core of our system.
Figure 3.2: Flowchart of pre-processing.
Our system can be divided into two parts: pre-processing and analysis as shown
in Figure 3.1. Pre-processing is performed at the beginning of a video clip in order
to gather consistent information in this possession, such as floor color and jersey
colors. Since they are invariant during a possession, or even the whole game, we
only have to compute them once and for all. With this information gathered in
pre-processing, we can avoid recomputing it in each frame and accelerate the
computation. As Figure 3.2 illustrates, we first compute the camera calibration and
generate a court mask which indicates the court region. Second, we can obtain the
floor color by calculating the color histogram and finding the dominant color within
the court region. With the floor color, we can perform a background subtraction and
extract the foreground objects, that is, the players. Next, we cluster the players into
two teams according to their jersey colors. At last, we can realize which team is on
offense through the distance between the players and the basket.
Figure 3.3: Flowchart of content analysis. The modules with shadows have the same
functionality as those in the pre-processing phase.
In the following frames, we track the players and also confirm if a screen is set.
We first calculate the camera calibration and then generate a new court mask. Unlike in the pre-processing phase, we can derive the current camera calibration from the previous frame. Next, we extract the players by the floor color and the jersey
colors obtained from the pre-processing. Now we can track the players and detect
screens with the positions of players. Once a screen is detected, we retain the state at
the moment for the sake of screen type classification. At the end of the possession,
we classify the type of the screen set in the possession according to the trajectories of
the players. Figure 3.3 shows the flowchart of the analysis phase.
3.2 Pre-Processing
The reason why we perform the pre-processing is that there is some information
which will not change during a game, including the floor color and the jersey colors
of the two teams. If we repeatedly calculate the information in each frame and just
acquire the same result, it is nothing more than an impediment to efficiency.
Therefore, in order to reduce the computational cost, we prefer to gather the information
once and for all. The pre-processing is summarized in Figure 3.2.
3.2.1 Camera Calibration
Camera calibration describes how objects in the world coordinates are projected
onto the image coordinates. Since sport courts can be assumed to be planar, camera
calibration defines a plane-to-plane mapping (a homography) 𝐇 from a position 𝐩 in the world coordinates to the image coordinates 𝐩′. Writing positions as homogeneous coordinates 𝐩 = (𝑥, 𝑦, 1)T and 𝐩′ = (𝑢, 𝑣, 1)T, the transformation 𝐇𝐩 = 𝐩′ is defined in equation (1).

$$\begin{pmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} u' \\ v' \\ w' \end{pmatrix} \sim \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad u = u'/w', \quad v = v'/w' \quad (1)$$
Camera calibration plays an important role in our system since we do most
works under the real-world coordinates. The way we obtain the camera parameters
is based on the court lines in the frame. Hence, we first have to detect all white pixels in the frame, and then extract line candidates passing through those white pixels using the Hough transform. Next, we
filter some unreasonable line candidates out and fit the remaining for the real court
lines. Finally, we can obtain the camera parameters through the mapping between
the intersection points of the line candidates and those of the court lines.
Figure 3.4: Schematic, magnified view of part of an input image containing a court line [2].
3.2.1.1 White Pixel Detection
The court lines are generally painted with white color. Accordingly, the first
filter is to confirm that the values of the R, G, and B channels of a pixel are all above a threshold 𝜎𝑙, since the (R, G, B) value of a pure white pixel is (255, 255, 255). Unfortunately, court lines are usually not the only white objects in a frame, and the other white objects seriously interfere with line extraction. Hence, other constraints
should be applied to the white pixels. Assuming that court lines are not wider than 𝜏 pixels in the frame, we verify whether the pixels at a distance of 𝜏 in each of the four axial directions are considerably darker than the candidate pixel, as shown in Figure 3.4. Only if they are is the candidate pixel classified as a white line pixel, as defined in equation (2).

$$l(x, y) = \begin{cases} 1, & g(x, y) - g(x - \tau, y) > \sigma_d \,\wedge\, g(x, y) - g(x + \tau, y) > \sigma_d \\ 1, & g(x, y) - g(x, y - \tau) > \sigma_d \,\wedge\, g(x, y) - g(x, y + \tau) > \sigma_d \\ 0, & \text{else} \end{cases} \quad (2)$$
where 𝑙(𝑥, 𝑦) indicates if a pixel at position (𝑥, 𝑦) is a white pixel (𝑙(𝑥, 𝑦) = 1) or not (𝑙(𝑥, 𝑦) = 0), 𝑔(𝑥, 𝑦) is the luminance of a pixel at position (𝑥, 𝑦), and 𝜎𝑑 is the luminance difference threshold. In equation (2), the first line corresponds to the
test if darker pixels can be found at some horizontal distance, assuming that the court
line is mostly vertical. The second line performs the analogous test in the vertical
direction, assuming that the court line is almost horizontal.
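The test of equation (2) can be transcribed as a short sketch. The image, function name, and border handling below are our own illustrations, not part of the thesis system; the grayscale image is stored as a row-major 2D list.

```python
def is_white_line_pixel(g, x, y, tau, sigma_d):
    """Equation (2): a candidate pixel is a white line pixel if the pixels
    at distance tau on both sides (horizontally or vertically) are darker
    by more than sigma_d. `g` is a grayscale image indexed as g[y][x]."""
    h, w = len(g), len(g[0])
    if not (tau <= x < w - tau and tau <= y < h - tau):
        return 0  # too close to the image border to apply the test
    c = g[y][x]
    horizontal = c - g[y][x - tau] > sigma_d and c - g[y][x + tau] > sigma_d
    vertical = c - g[y - tau][x] > sigma_d and c - g[y + tau][x] > sigma_d
    return 1 if (horizontal or vertical) else 0

# A bright one-pixel-wide vertical "court line" at x = 4 on a dark floor.
image = [[50] * 9 for _ in range(9)]
for row in image:
    row[4] = 255
```

Pixels on the bright line pass the test, while floor pixels fail both the horizontal and the vertical condition.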
Sometimes the white pixels in textured areas may pass the above white line test,
such as small white letters in advertisement logos, spectators dressed in white clothes,
or white areas in the stadium. Therefore, we apply an additional line-structure
constraint to eliminate those white pixels in the textured areas by observing the two
eigenvalues of the structure matrix 𝐒, which is computed over a small window of size (2𝑏 + 1) × (2𝑏 + 1) around each candidate pixel (𝑝𝑥, 𝑝𝑦) and defined by equation (3) [10].

$$\mathbf{S} = \sum_{x = p_x - b}^{p_x + b} \; \sum_{y = p_y - b}^{p_y + b} \nabla g(x, y) \cdot (\nabla g(x, y))^{\mathrm{T}} \quad (3)$$
Depending on the two eigenvalues of the matrix S, called 𝜆1 and 𝜆2 (𝜆1 ≥ 𝜆2), the
area can be classified into textured (both 𝜆1 and 𝜆2 are large), linear (𝜆1 ≫ 𝜆2), and
flat (both 𝜆1 and 𝜆2 are small). On straight court lines the linear case applies, so we retain a white pixel only if 𝜆1 > 𝛼𝜆2. We find that with 𝛼 = 4, most linear cases can be recognized.
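The structure-matrix test can be sketched as below. We assume central differences for the gradient and use the closed-form eigenvalues of a symmetric 2×2 matrix; the helper names are illustrative, not from the thesis.

```python
import math

def structure_eigenvalues(g, px, py, b):
    """Equation (3): accumulate outer products of the gradient over a
    (2b+1) x (2b+1) window and return the two eigenvalues of the
    resulting 2x2 structure matrix, largest first."""
    sxx = sxy = syy = 0.0
    for y in range(py - b, py + b + 1):
        for x in range(px - b, px + b + 1):
            gx = (g[y][x + 1] - g[y][x - 1]) / 2.0  # central difference in x
            gy = (g[y + 1][x] - g[y - 1][x]) / 2.0  # central difference in y
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    root = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return tr / 2.0 + root, tr / 2.0 - root

def is_linear_area(g, px, py, b, alpha=4.0):
    """Keep the pixel only in the linear case, i.e. lambda1 > alpha * lambda2."""
    lam1, lam2 = structure_eigenvalues(g, px, py, b)
    return lam1 > alpha * lam2

edge = [[0, 0, 0, 0, 255, 255, 255, 255, 255] for _ in range(9)]  # vertical edge
flat = [[0] * 9 for _ in range(9)]                                # flat area
```

On the vertical edge the gradient points only in x, so 𝜆2 vanishes and the area is classified as linear; on the flat area both eigenvalues are zero and the test fails.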
3.2.1.2 Hough Line Extraction
In order to extract the court lines, we perform the standard Hough transform on
the detected white pixels. The parameter space (𝜃, 𝑑) is used to represent a line, where 𝜃 is the angle between the line normal and the horizontal axis, and 𝑑 is the distance between the line and the origin.
Figure 3.5: Hough transform diagram.
Figure 3.5 demonstrates how the Hough transform searches for lines. Given three points, we want to find the line passing through them. For each point, a number of
lines at different angles are plotted through it. In this example, we plot lines at an
interval of 30 degrees. For each plotted line, we compute its distance to the origin
and obtain an angle-distance pair representing this line. The results are shown in the
tables in Figure 3.5, and the corresponding accumulator matrix is shown in Table 3.1.
We can figure out that the parameter set (Angle, Distance) = (60, 81) appears most
frequently (three times). Thus, it is the line that we are looking for. Now we come back to our problem of extracting court line candidates from the detected white pixels. Similarly, we construct an accumulator matrix for all (𝜃, 𝑑) and sample the accumulator matrix at a resolution of one degree for 𝜃 and one pixel for
𝑑. By extracting the local maxima in the accumulator matrix, we can determine the line candidates.
Table 3.1: Corresponding accumulator matrix to Figure 3.5.
Angle\Dist.  -40  -20   0   6  23  40  41  50  57  60  70  75  80  81  90
0              0    0   0   0   0   1   0   0   1   0   0   1   0   0   0
30             0    0   0   0   0   0   0   0   0   0   1   0   1   0   1
60             0    0   0   0   0   0   0   0   0   0   0   0   0   3   0
90             0    0   0   0   0   0   0   1   0   1   1   0   0   0   0
120            0    0   0   1   1   0   1   0   0   0   0   0   0   0   0
150            1    1   1   0   0   0   0   0   0   0   0   0   0   0   0
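The voting scheme illustrated above can be sketched in a few lines. The point coordinates and the one-degree resolution below are illustrative, not taken from actual frames.

```python
import math
from collections import Counter

def hough_accumulate(points, theta_step=1):
    """Each point votes for every sampled line (theta, d) passing through
    it, where d = x*cos(theta) + y*sin(theta) rounded to one pixel."""
    acc = Counter()
    for x, y in points:
        for theta in range(0, 180, theta_step):
            rad = math.radians(theta)
            d = round(x * math.cos(rad) + y * math.sin(rad))
            acc[(theta, d)] += 1
    return acc

# Three collinear points on the line y = x: its normal angle is 135
# degrees and its distance to the origin is 0, so the accumulator cell
# (135, 0) collects one vote from each point.
acc = hough_accumulate([(0, 0), (1, 1), (2, 2)])
```

The cell with the most votes corresponds to the extracted line, just as cell (60, 81) does in Table 3.1.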
In addition, to obtain more precise line parameters, we refine them by minimizing the distance between the candidate line pixels and their nearest Hough lines.
First, we re-parameterize a line obtained from Hough transform by its normal
𝐧 = (𝑛𝑥, 𝑛𝑦)T with ‖𝐧‖ = 1 and the distance to the origin 𝑑 . With the
parameters, the distance between a point with homogeneous coordinates in image
space 𝐩 = (𝑥, 𝑦, 1)T and a line can be calculated by the dot product (𝑛𝑥, 𝑛𝑦, −𝑑) ⋅ 𝐩. Next, we define a set 𝐿 of court line pixels that are close to the line as equation (4) [2].
$$L = \{\,\mathbf{p} = (x, y, 1)^{\mathrm{T}} \mid l(x, y) = 1 \,\wedge\, \|(n_x, n_y, -d) \cdot \mathbf{p}\| < \sigma_r\,\} \quad (4)$$
where 𝜎𝑟 is the largest allowed distance, used to discard line pixel candidates far away from any Hough line. Since the pixels in this set are supposed to be on the
same court line and we assume the refined line equation to be 𝑥 ∙ 𝑚𝑥+ 𝑦 ∙ 𝑚𝑦 = 1,
we form an equation system and then solve it in the least squares sense as shown in equation (5).

$$\begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_{|L|} & y_{|L|} \end{pmatrix} \begin{pmatrix} m_x \\ m_y \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \quad (5)$$
Finally, the refined parameters are computed by $d = 1 / \sqrt{m_x^2 + m_y^2}$, $n_x = m_x d$, $n_y = m_y d$, since the slope of the line is $-m_x / m_y$ and the slope of the line normal is $m_y / m_x$.
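The refinement step can be sketched as a direct solve of the 2×2 normal equations of equation (5); the sample pixels below are illustrative.

```python
import math

def refine_line(pixels):
    """Fit x*mx + y*my = 1 to the court line pixels in the least squares
    sense (equation (5)) via the normal equations, then recover the
    normal form (nx, ny, d) as described in the text."""
    sxx = sum(x * x for x, y in pixels)
    sxy = sum(x * y for x, y in pixels)
    syy = sum(y * y for x, y in pixels)
    sx = sum(x for x, y in pixels)
    sy = sum(y for x, y in pixels)
    det = sxx * syy - sxy * sxy
    mx = (syy * sx - sxy * sy) / det
    my = (sxx * sy - sxy * sx) / det
    d = 1.0 / math.hypot(mx, my)     # distance of the line to the origin
    return mx * d, my * d, d         # unit normal (nx, ny) and d

nx, ny, d = refine_line([(0, 2), (1, 1), (2, 0)])  # pixels on x + y = 2
```

For the line x + y = 2 the distance to the origin is √2 and the unit normal has equal components, which the recovered parameters reproduce.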
3.2.1.3 Court Model Fitting
A court model consists of the lines that are drawn on the ground to define the playfield geometry. The basketball court model is illustrated in Figure 3.6, and its dimensions are shown in Table 3.2.
Figure 3.6: Basketball court model.
Table 3.2: Basketball court dimensions.
Area                                          Dimension (m)
Court length (sideline length)                28
Court width (baseline length)                 15
3-point line distance from the basket         6.25
Free-throw line distance from the baseline    5.8
Basket distance from the baseline             1.2
Restricted area width (free-throw line side)  3.6
Restricted area width (baseline side)         5
The camera calibration describes how those lines are projected from the world
coordinates onto the image coordinates. Therefore, in order to define the mapping,
the correspondence between a previously extracted Hough line and its line in the court model must be found. An algorithm has been proposed to find the line
correspondence [2] and performs well in several kinds of sport videos such as tennis,
volleyball and soccer. They regard the lines determined by extracting the local
maxima in the accumulator matrix (mentioned in Section 3.2.1.2) that are above a
threshold 𝜎ℎ as court line candidates. The line candidates are then classified as two sets: one contains the horizontal lines and the other consists of the vertical lines.
Next, they sort the line candidates according to their distances to the image boundary,
and can search for the correspondence between the candidate lines and the model
lines. Nevertheless, when applying to basketball videos, we find that the
performance is not good as we expected. The major problem is: how to determine
the value of 𝜎ℎ? The right column of Figure 3.7 shows some results of the typical line extraction method with different 𝜎ℎ values. When the 𝜎ℎ value is small, many unreasonable lines pass the test and are viewed as court line candidates; when it is large, we are not able to obtain sufficient lines to solve the camera parameters. Most
importantly, whatever threshold we set, the free-throw line is always filtered out because it is short. However, the free-throw line is not negligible: without it, all the corresponding points may lie on the baseline, which leads to a singular solution to the camera calibration.
Figure 3.7: Sample results of line extraction. (a) Original frame. (b) Detected white pixels. (c) Result using our method. The right column shows results using the typical method with different thresholds 𝜎ℎ of (d) 50, (e) 100, and (f) 150.
To overcome such a difficulty, we propose a new method to find the line
correspondence in basketball video. We do not sample the entire accumulator matrix; instead, we search specific ranges of the Hough space for particular court lines. It is an empirical method, and the searching ranges are determined through
our observation and knowledge of basketball video.
Figure 3.8: Examples of basketball video frames. Solid red lines are baselines and solid
yellow lines are free-throw lines, and dotted lines are their normals respectively. (a) Left court. (b) Right court.
Our main purpose is to discard noisy white pixels outside the court region and extract the correct court lines. The court region is determined by the sideline and the baseline. From Figure 3.8, we can see that the sideline and the baseline are the longest horizontal and vertical lines in the frame, respectively. Hence, our first step is to find
the longest horizontal and vertical lines. For the longest vertical line, we extract the
local maximum in the accumulator matrix within the range of [0, 80] and [100, 180]
degrees. Remember that the parameters in Hough space are the distance between a
line and the origin, and the angle between the line normal and the horizontal axis.
That is, we ignore lines whose angle to the horizontal axis is within the range of [-10, 10] degrees, namely the almost horizontal lines. We obtain the longest vertical line by eliminating horizontal lines instead of directly searching for vertical lines, since the baseline may not look perpendicular on screen. Furthermore, the angle of the
baseline also helps us distinguish whether it is the left court or right (see solid red
lines Figure 3.8). On the other hand, when extracting the longest horizontal line, we
just set the searching range to [80, 100] degrees since horizontal lines do not change
significantly on screen. With the longest vertical and horizontal lines, that is, the baseline and the sideline respectively, we filter out the white pixels outside the region bounded by the two lines, and reconstruct the accumulator matrix from the remaining
white pixels. Next, we extract the longest two horizontal lines as edges of the
restricted area. The top and bottom edges are then distinguished by the angles of the two lines. From Figure 3.8 we can see that the bottom edge is always more horizontal than the top edge. At last, we have to find the free-throw line. Please view
Figure 3.8 again. We mark the baseline with the solid red line and the free-throw
line with the solid yellow line, and the dotted lines are their normals respectively.
We can clearly see that although they are both vertical lines in the court model, the free-throw line always looks more perpendicular than the baseline, whichever side of the court is on screen, because the camera is usually set at the center of the court. Thus,
we set the searching range to [0, 𝜃𝑏] degrees for the right court and [𝜃𝑏, 180] degrees for the left court in order to extract the free-throw line. Here, 𝜃𝑏 is the angle between the baseline normal and the horizontal axis. Since the remaining white pixels are guaranteed to
be within the court region, we can recognize those extracted lines as correct court
lines. In this way, we extract lines and find the correspondence at the same time
since we know exactly which line we are looking for. Finally, we compute the
intersection points and solve the equation system defined as equation (6) which is
rewritten from equation (1).
$$\begin{pmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x'_1 x_1 & -x'_1 y_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -y'_1 x_1 & -y'_1 y_1 \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x'_2 x_2 & -x'_2 y_2 \\
0 & 0 & 0 & x_2 & y_2 & 1 & -y'_2 x_2 & -y'_2 y_2 \\
 & & & & \vdots & & & \\
x_n & y_n & 1 & 0 & 0 & 0 & -x'_n x_n & -x'_n y_n \\
0 & 0 & 0 & x_n & y_n & 1 & -y'_n x_n & -y'_n y_n
\end{pmatrix}
\begin{pmatrix} h_{00} \\ h_{01} \\ h_{02} \\ h_{10} \\ h_{11} \\ h_{12} \\ h_{20} \\ h_{21} \end{pmatrix}
=
\begin{pmatrix} x'_1 \\ y'_1 \\ x'_2 \\ y'_2 \\ \vdots \\ x'_n \\ y'_n \end{pmatrix} \quad (6)$$

Note that this makes use of the normalization $h_{22} = 1$. There are eight variables $h_{00}, h_{01}, \ldots, h_{21}$, so we need at least four points ($n \geq 4$) in order to form at least eight equations. Here we use the baseline, the free-throw line and the two edges of the restricted
area to solve the equation system. Figure 3.7 (c) illustrates the result using our
method.
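Equation (6) can be solved with any linear solver; the sketch below uses plain Gaussian elimination with exactly four correspondences (so the system is square rather than over-determined), and the sample points are purely illustrative.

```python
def solve_homography(world_pts, image_pts):
    """Build the system of equation (6) from point correspondences
    (x, y) -> (u, v) and solve for h00..h21, with h22 fixed to 1."""
    a, b = [], []
    for (x, y), (u, v) in zip(world_pts, image_pts):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    n = len(b)
    for i in range(n):  # Gaussian elimination with partial pivoting
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
            b[r] -= f * b[i]
    h = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(a[i][c] * h[c] for c in range(i + 1, n))
        h[i] = (b[i] - s) / a[i][i]
    return h + [1.0]  # h00..h21 plus h22 = 1

def project(h, x, y):
    """Apply equation (1): map a world point to image coordinates."""
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)

world = [(0, 0), (1, 0), (0, 1), (1, 1)]
image = [(10, 20), (12, 20), (10, 23), (12, 23)]  # scale (2, 3), shift (10, 20)
h = solve_homography(world, image)
```

With more than four correspondences the same matrix can be handed to a least-squares routine instead.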
3.2.2 Court Mask Generation
In basketball video, most of the important information is inside the court region. In other words, the court is our region of interest. In order to filter out noise and keep
significant information, we need a mask to indicate the court region, that is, the court
mask. With the previously computed camera calibration, we can project pixels from
image coordinates back to world coordinates and confirm whether they are located in
the court. Figure 3.9 shows a sample result of the court mask.
Figure 3.9: Court mask. (a) Original frame. (b) Corresponding court mask.
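One way to realize the court mask is sketched below: invert the homography (the adjugate suffices, because homographies are defined only up to scale), project each image pixel back to world coordinates, and test it against the 28 m × 15 m court of Table 3.2. The numeric homography here is a toy example, not a calibrated one.

```python
def adjugate(h):
    """Adjugate of a 3x3 homography given as a flat list h[0..8]; it is
    the inverse up to a scale factor, which homogeneous coordinates
    absorb anyway."""
    return [h[4]*h[8] - h[5]*h[7], h[2]*h[7] - h[1]*h[8], h[1]*h[5] - h[2]*h[4],
            h[5]*h[6] - h[3]*h[8], h[0]*h[8] - h[2]*h[6], h[2]*h[3] - h[0]*h[5],
            h[3]*h[7] - h[4]*h[6], h[1]*h[6] - h[0]*h[7], h[0]*h[4] - h[1]*h[3]]

def to_world(h_inv, u, v):
    """Project an image pixel back to world coordinates."""
    w = h_inv[6]*u + h_inv[7]*v + h_inv[8]
    return ((h_inv[0]*u + h_inv[1]*v + h_inv[2]) / w,
            (h_inv[3]*u + h_inv[4]*v + h_inv[5]) / w)

def in_court(h_inv, u, v, length=28.0, width=15.0):
    """Court mask test: does pixel (u, v) land inside the court model?"""
    x, y = to_world(h_inv, u, v)
    return 0.0 <= x <= length and 0.0 <= y <= width

h = [2, 0, 10, 0, 3, 20, 0, 0, 1]  # illustrative homography
h_inv = adjugate(h)
```

Running `in_court` over every pixel yields a binary mask like the one in Figure 3.9 (b).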
3.2.3 Dominant Color Map Generation
In order to extract players, we have tried several methods [18, 21, 25]. The major obstacle is the camera motion. For example, redundant moving pixels resulting from the camera motion generate a huge amount of noise when performing the frame
difference. For another example, the camera motion prevents us from obtaining a
consistent background image and extracting real moving objects. Therefore, a new
method is proposed to extract the players on the court by detecting objects with
different colors from the floor [1, 50].
The way we obtain the floor color is to find the dominant color within the court
region using the previously generated court mask. First of all, we calculate the color
histogram. Since it has been proved in [11] that the performance in the YCbCr space
is better than that in the HSI space, we choose the YCbCr space and use the Cb and Cr
components to calculate the color histogram. With the color histogram, we next find
peaks by the following steps:
Step 1: Determine the main peak bin 𝑃𝑒𝑎𝑘1, that is, the bin with the largest value.
Step 2: Find the connected region around the main peak bin. Only bins with
value larger than 𝛼 ∗ 𝑣𝑎𝑙𝑢𝑒(𝑃𝑒𝑎𝑘1) are considered.
Step 3: Compute the sum of the connected bins 𝑆𝑢𝑚1 and subtract the connected region from the histogram. That is, we set the values of the bins of
the connected region to zero in order not to be considered again in the following
iterations.
Step 4: Repeat the above steps until there are no bins remaining.
After completing the procedure, we will have several peaks and their sums. Finally,
by sorting these peaks according to their sums, we can determine the dominant color. It should be noted, however, that the court contains the restricted area (see Figure 3.6), which is also called the painted area since it is usually painted with a
different color from other parts of the court. That is, if we just recognize the largest
peak as the floor color, we will miss the restricted area. We propose two ways to
solve this problem. One is to regard the largest two peaks as the floor color, and the
other is to run the procedure again with another mask indicating the restricted area.
Both methods have their pros and cons. The first one takes advantage of the previous result, but it fails when many players stay in the restricted area.
The second one can distinguish the players from the restricted area since it compares
the two series of sorted peaks and verifies which peak represents the restricted area.
Through our experiment, we prefer the first one because it has good performance and
does not require extra computation. Figure 3.10 (b) illustrates a sample result of the
dominant color map.
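Steps 1–4 above can be sketched on a one-dimensional histogram (the thesis uses a 2D Cb/Cr histogram; one dimension keeps the connected-region growth easy to read). The histogram values and 𝛼 below are illustrative.

```python
def find_peaks(hist, alpha=0.1):
    """Steps 1-4: repeatedly take the largest bin, grow its connected
    region of bins above alpha * peak value, record the region's sum,
    and zero the region out; return peaks sorted by sum, largest first."""
    hist = list(hist)  # work on a copy
    peaks = []         # list of (peak_bin, sum_of_connected_region)
    while any(v > 0 for v in hist):
        peak = max(range(len(hist)), key=lambda i: hist[i])  # Step 1
        cut = alpha * hist[peak]
        lo = hi = peak
        while lo > 0 and hist[lo - 1] > cut:                 # Step 2: grow left
            lo -= 1
        while hi < len(hist) - 1 and hist[hi + 1] > cut:     # Step 2: grow right
            hi += 1
        peaks.append((peak, sum(hist[lo:hi + 1])))           # Step 3
        for i in range(lo, hi + 1):
            hist[i] = 0                                      # Step 3: clear region
        # Step 4: the loop repeats until every bin is cleared
    return sorted(peaks, key=lambda p: -p[1])  # dominant color first

peaks = find_peaks([0, 1, 8, 10, 7, 1, 0, 2, 5, 2, 0], alpha=0.1)
```

The first returned peak corresponds to the floor color; on a real frame the second-largest peak would typically be the restricted area.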
3.2.4 Player Extraction
With the court mask and the dominant color map, we can perform a
background-subtraction-like method to extract the foreground objects in the court
region. If the color of a pixel can be found in the dominant color map, the pixel
should be labeled as background; otherwise, it is a foreground pixel. After all pixels
are confirmed, we apply morphological operators in order to remove small objects
and gaps. Figure 3.10 (c) demonstrates a sample result of the extracted foreground objects.
Figure 3.10: Object extraction. (a) Original frame. (b) Dominant color map. (c) Foreground objects.
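The background-subtraction-like step can be sketched as follows, assuming the frame is given as (Cb, Cr) pairs, the court mask as 0/1 values, and the dominant color map as a set of chroma pairs; morphological cleaning is omitted and all values are illustrative.

```python
def extract_foreground(frame, court_mask, dominant_colors):
    """A pixel is foreground iff it lies inside the court mask and its
    chroma is not one of the dominant (floor) colors."""
    h, w = len(frame), len(frame[0])
    return [[1 if court_mask[y][x] and frame[y][x] not in dominant_colors else 0
             for x in range(w)]
            for y in range(h)]

# Tiny 2x3 frame: floor chroma (100, 120), "player" chroma (90, 140);
# the bottom-right pixel is masked out as outside the court.
fg = extract_foreground(
    [[(100, 120), (100, 120), (90, 140)],
     [(100, 120), (90, 140), (90, 140)]],
    [[1, 1, 1], [1, 1, 0]],
    {(100, 120)})
```

On real frames the resulting binary map is then cleaned with the morphological operators mentioned above.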
3.2.5 Team Clustering
Despite the fact that we have the foreground objects within the court region
extracted, we need more information to analyze the content. First of all, we have to
distinguish the jersey colors in order to separate the players of the two teams. We
use color information and k-means clustering to divide the foreground region into two
clusters representing the jersey colors of the two teams. In fact, we cannot use just two clusters, since there is some noise in the foreground region, the referees for example, which strongly interferes with the cluster centroids and leads to a poor player classification result. Figure 3.11 shows experimental data about
the number of clusters and the performances. Generally, the more the clusters, the
smaller the total distance between all data points and their corresponding cluster
centroids, which can also be regarded as the clustering error. However, the
computing time of the k-means clustering is proportional to the number of clusters.
We discovered that the clustering error decreases most rapidly when there are six
clusters. The clustering errors almost converge when there are more than six clusters.
This observation also holds for other video clips in our experiments. Thus, we separate the foreground region into six clusters, view the largest two clusters as the jersey colors of the two teams, and choose the YCbCr space since it performs better than the other color spaces.
Figure 3.11: K-means clustering. (a) Original frame. (b) Foreground objects. (c) Experimental data with different color spaces and numbers of clusters. The horizontal axis shows the number of clusters and the vertical axis indicates the clustering error; different lines represent different color spaces.
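A minimal k-means sketch in the spirit of this step: real foreground pixels are replaced by a handful of synthetic chroma pairs, and the initial centroids are fixed for determinism (the thesis does not specify the initialization).

```python
import math

def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its group."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        centroids = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

# Two synthetic "jersey color" clouds in (Cb, Cr) space.
pixels = [(9, 10), (11, 10), (10, 9), (199, 200), (201, 200), (200, 201)]
centroids, groups = kmeans(pixels, [(0, 0), (255, 255)])
```

With k = 6, as in the thesis, the two largest of the returned groups would be taken as the jersey colors.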
3.2.6 Player Classification
Having gathered the jersey colors of the two teams, we are going to classify
players of the two teams in this step. At first, we determine by equation (7) which cluster each pixel in the foreground region belongs to according to its color, where 𝐶𝑒𝑛𝑡𝑟𝑜𝑖𝑑𝐴 and 𝐶𝑒𝑛𝑡𝑟𝑜𝑖𝑑𝐵 are the centroids of the largest two clusters from the Team Clustering step, that is, the jersey colors of the two teams.
$$cluster(x, y) = \begin{cases} Cluster_A, & \|color(x, y) - Centroid_A\| < \delta_c \\ Cluster_B, & \|color(x, y) - Centroid_B\| < \delta_c \\ None, & \text{else} \end{cases} \quad (7)$$
After clustering all foreground pixels, we can generate two maps indicating the
players of the two teams as Figure 3.12 illustrates. Since we constrain the distance between a pixel's color and the centroid of the cluster it belongs to, we can remove non-player objects such as the referees during clustering. Also, we perform morphological operators to remove noise and gaps. At last, we obtain the refined player maps of the two teams.
Figure 3.12: Player classification. (a) Original frame. (b) Foreground objects. (c) Players of
one team (red jerseys). (d) Players of the other team (white jerseys).
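Equation (7) amounts to a thresholded nearest-centroid rule; below is a sketch with illustrative centroid and 𝛿𝑐 values.

```python
import math

def classify_pixel(color, centroid_a, centroid_b, delta_c):
    """Equation (7): assign a foreground pixel to a team only when its
    color is within delta_c of that team's jersey centroid; otherwise
    discard it (referees, noise)."""
    if math.dist(color, centroid_a) < delta_c:
        return "A"
    if math.dist(color, centroid_b) < delta_c:
        return "B"
    return None

# Illustrative jersey centroids in (Cb, Cr) space and a threshold.
CENTROID_A, CENTROID_B, DELTA_C = (100, 100), (200, 50), 30
```

Pixels close to neither centroid, such as a referee's shirt color, fall into the None case and are excluded from both player maps.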
3.2.7 Possession Recognition
It is important to realize which team has the possession of the ball, or which team
is on offense, before tactic analysis. Typically, a defender is expected to stand closer to the basket than the offensive player he is guarding, since the purpose of the team on defense is to prevent the opponent team from putting
the ball into the basket. Hence, we can make use of this feature to judge which team
is on offense. We first project all players back to the real-world court model with the
camera calibration. For each team, we compute the average distance between its
players and the basket. The team with shorter distance to the basket is recognized as
on defense. On the other hand, the team on offense is on average farther away from the basket.
Algorithm 1: Possession Recognition
Input: positions of players of the two teams, represented by 𝑝𝑙𝑎𝑦𝑒𝑟𝑠𝑡𝑒𝑎𝑚1 and 𝑝𝑙𝑎𝑦𝑒𝑟𝑠𝑡𝑒𝑎𝑚2, and position of the basket
Output: the team on offense, 𝑡𝑒𝑎𝑚1 or 𝑡𝑒𝑎𝑚2
local dist[2]
for i := 1 to 2 do
    dist[i] := 0
    for each player in players_team_i do
        dist[i] := dist[i] + dist(player, basket)
    end for
    dist[i] := dist[i] / |players_team_i|
end for
if dist[1] > dist[2] then
    return team1
else
    return team2
end if
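Algorithm 1 transcribes directly into a few lines of code; the positions below are illustrative world coordinates in meters, with the basket placed at (0, 7.5).

```python
import math

def possession(players_team1, players_team2, basket):
    """Algorithm 1: the offensive team is the one whose players are, on
    average, farther from the basket."""
    def avg_dist(players):
        return sum(math.dist(p, basket) for p in players) / len(players)
    return "team1" if avg_dist(players_team1) > avg_dist(players_team2) else "team2"

team1 = [(6, 7), (7, 8), (5, 6)]   # spread out around the perimeter
team2 = [(2, 7), (3, 8), (1, 7)]   # packed near the basket
```

In the system, the player positions fed to this routine come from the court-model projection described above.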
3.3 Content Analysis
In this section, we are going to explain how we gather information from each
frame during the game. With the consistent data from the pre-processing, we can
obtain the information we want simply and quickly. Figure 3.3 gives a brief overview of our analysis mechanism. Note that the modules in Figure 3.3 with shadows have the same functionality as those in pre-processing. That is, we perform Court Mask Generation, Player Extraction, and Player Classification in the analysis phase as well.