National Chiao Tung University
Institute of Computer Science and Engineering
Doctoral Dissertation
A STUDY ON SEMANTIC ANNOTATION AND
SUMMARIZATION OF BASKETBALL VIDEO
Student: Chun-Min Chen
Advisor: Dr. Ling-Hwei Chen
A Dissertation Submitted to
Institute of Computer Science and Engineering
College of Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
in
Computer Science
July 2014
Hsinchu, Taiwan, Republic of China
A STUDY ON SEMANTIC ANNOTATION AND SUMMARIZATION OF BASKETBALL VIDEO
Student: Chun-Min Chen
Advisor: Dr. Ling-Hwei Chen
Institute of Computer Science and Engineering
College of Computer Science, National Chiao Tung University
ABSTRACT (IN CHINESE)
Sports videos play an important role in our leisure and entertainment. However, because sports videos carry a large amount of information, they not only require considerable bandwidth and transmission time but also cost viewers much time to watch. To save unnecessary time and energy costs, video highlight retrieval, video summarization, and slow motion replay detection have become popular research topics. Most existing methods analyze every frame of a video; however, semantic events occur only in frames with a scoreboard, whereas slow motion replays appear only in frames without a scoreboard. Extracting semantic events or slow motion replays from unrelated frames degrades both accuracy and efficiency. Moreover, most existing methods are designed for soccer videos, and basketball videos have received relatively little attention. To address these challenges, this dissertation takes basketball video as an example and proposes a novel sports video analysis framework, which allows general audiences to retrieve game highlights efficiently and lets professionals extend it to other related applications (automatic highlight generation, player movement analysis, team tactics analysis, etc.). In the framework, a video frame partition method is first provided to divide sports video frames into two classes, with and without a scoreboard. Then, a semantic event extraction method is proposed for frames with a scoreboard, and a slow motion replay detection method is proposed for frames without a scoreboard.

As to semantic event extraction, most existing methods use the visual or audio content of the video itself as features. However, using only video content as features often incurs a semantic gap, that is, the distance between low-level video features and high-level semantic events. Although some recent methods refer to webcast text as external knowledge to bridge the semantic gap, extracting semantic events from webcast text and annotating them on sports videos still involve many difficulties and challenges. In this dissertation, we will discuss these difficulties and propose two methods to solve them.

As to slow motion replay detection, existing methods can be roughly classified into two categories. A slow motion replay is often preceded and followed by special effect frames added by the broadcaster in post-production, and the first category of methods detects replays based on the positions of these effects; however, basketball videos are more complicated, and this assumption does not always hold for them. The second category analyzes features of slow motion segments and uses these features to distinguish replay segments from normal segments, but since some features designed for soccer are not applicable to basketball, such methods still leave room for improvement for basketball. Basketball is one of the most important sports in the world, yet many challenges remain in detecting slow motion replays in basketball videos. This dissertation will propose a new method to detect slow motion replays in basketball videos, providing an important resource for sports video analysis.

Experimental results show that the feasibility and effectiveness of the proposed framework and methods are well validated. Since the proposed framework and methods use no basketball-specific features, we expect that this dissertation can be extended to other types of sports videos.
A STUDY ON SEMANTIC ANNOTATION AND
SUMMARIZATION OF BASKETBALL VIDEO
Student: Chun-Min Chen Advisor: Dr. Ling-Hwei Chen
Institute of Computer Science and Engineering College of Computer Science
National Chiao Tung University
ABSTRACT
Semantic event and slow motion replay extraction for sports videos have become
hot research topics. Most researches analyze every video frame; however, semantic
events only appear in frames with scoreboard, whereas replays only appear in frames
without scoreboard. Extracting events and replays from unrelated frames degrades
both accuracy and efficiency. In this dissertation, a novel
framework will be proposed to tackle challenges of sports video analysis. In the
framework, a scoreboard detector is first provided to divide video frames into two
classes, with/without scoreboard. Then, a semantic event extractor is presented to
extract semantic events from frames with scoreboard and a slow motion replay
extractor is proposed to extract replays from frames without scoreboard.
As to semantic event extraction, most existing approaches use video content as
features and thus suffer from the semantic gap, i.e., the distance between low-level
video features and high-level semantic events.
Although the multimodal fusion scheme that uses webcast text as external
knowledge to bridge the semantic gap has been proposed recently, extracting semantic
events from sports webcast text and annotating semantic events in sports videos are
still challenging tasks. In this dissertation, we will address the challenges in the
multimodal fusion scheme. Then, we will propose two methods to overcome the
challenges.
As to slow motion replay detection, many methods have been proposed, and they
are classified into two categories. One assumes that a replay is sandwiched by a pair
of visually similar special digital video effects, but the assumption is not always true
in basketball videos. The other analyzes replay features to distinguish replay segments
from non-replay segments. The results are not satisfactory since some features (e.g.
dominant color of sports field) are not applicable for basketball. Most replay detectors
focus on soccer videos. In this dissertation, we will propose a novel idea to detect
slow motion replays in basketball videos.
The feasibility and effectiveness of all the above proposed methods have been
demonstrated in experiments. Since no basketball-specific features are used, it is
expected that the proposed sports video analysis framework can be extended to other
types of sports videos.
ACKNOWLEDGMENT (IN CHINESE)

First, I would like to offer my sincerest gratitude to my advisor, Professor Ling-Hwei Chen. Under her guidance, as both a mentor and a mother figure, I benefited greatly in research methodology, problem-solving ability, and the attitude of dealing with people and affairs. I am especially grateful for her support, which allowed me to balance my love of basketball with my enthusiasm for academic research; I was very fortunate to meet such a good teacher.

Next, I would like to thank the many members of the Automatic Information Processing Laboratory; thanks to the company of my senior and junior labmates, my research life was fulfilling and interesting, and never lonely. I thank 井民全 and 郭萓聖 for being excellent role models when I first joined the laboratory. I thank 李惠龍, 楊文超, and 歐占和 for fighting alongside me from the qualifying examination onward and for their advice and help throughout my academic career. I thank 林懷三 for his kind assistance during the oral defense.

I would also like to thank my teammates on the National Chiao Tung University men's basketball team for their support, which gave me a haven when research wore me out; sprinting on the court with them has been my honor and the most precious memory of my life. I thank my best friends 偉益, 金煌, 志瑋, 秉澄, and 信華 for accompanying me through my weakest moments. I thank every friend who has ever helped and encouraged me in my life; you have made me a better person.

Finally, and most importantly, I thank my family, who have always supported me unconditionally: my father 國淇, my mother 淑真, my brother 俊宇, and my sister 韻如. They have forever been my strongest backing, allowing me to pursue my goals, challenge life, and enjoy living without worry. With the deepest gratitude, I dedicate this dissertation to my dearest family.

TABLE OF CONTENTS
CHINESE ABSTRACT
ENGLISH ABSTRACT
ACKNOWLEDGMENT (IN CHINESE)
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Related Work
1.2.1 Semantic Event Extraction Challenges
1.2.2 Slow Motion Replay Detection Challenges
1.3 Synopsis of the Dissertation
CHAPTER 2 A NOVEL FRAMEWORK FOR SPORTS VIDEO ANALYSIS
2.1 Video Frames Partition
2.1.1 Context-Based Static Region Detection
2.1.2 Scoreboard Selection
2.1.3 Experimental Results
2.2 Overview of the Framework
2.3 Summary
CHAPTER 3 A NOVEL APPROACH FOR SEMANTIC EVENT EXTRACTION FROM SPORTS WEBCAST TEXT
3.1 Introduction
3.2 Proposed Method
3.2.1 Unrelated Words Filtering
3.2.1.1 Stop Words
3.2.1.2 The Proposed Interactive System for Establishing Sports Stop Word List and Event Keyword List
3.2.1.3 The Proposed Unrelated Words Filtering Procedure
3.2.2 Event Clustering
3.3 Experimental Results
3.4 Summary
CHAPTER 4 ANNOTATING WEBCAST TEXT IN BASKETBALL VIDEOS BY GAME CLOCK RECOGNITION AND TEXT/VIDEO ALIGNMENT
4.1 Introduction
4.2 Proposed Method
4.2.1 Video Frames Partition
4.2.2 Semantic Event Extraction from Scoreboard Frames
4.2.2.1 Clock Digit Locator
4.2.2.2 Clock Digit Template Collection
4.2.2.3 Clock Digit Recognition
4.2.2.4 Text/Video Alignment
4.3 Experimental Results
4.4 Summary
CHAPTER 5 A NOVEL METHOD FOR SLOW MOTION REPLAY DETECTION IN BROADCAST BASKETBALL VIDEO
5.1 Introduction
5.2 Proposed Method
5.2.1 Video Frames Partition
5.2.2 Feature Extraction and Replay Detection
5.3 Experimental Results
5.4 Summary
CHAPTER 6 CONCLUSIONS AND FUTURE WORKS
REFERENCES
PUBLICATION LIST
LIST OF TABLES
Table 3.1 Average number of sports event categories in 25 basketball training data and 20 soccer training data.
Table 3.2 Mappings of basketball event categories from pLSA to the proposed method.
Table 3.3 Mappings of soccer event categories from pLSA to the proposed method.
Table 3.4 Occurrences of exception basketball events from 41 testing games.
Table 3.5 Occurrences of exception soccer events from 48 testing games.
Table 4.1 Semantic events extraction results of the proposed method.
Table 5.1 Replay detection results for MNS.
Table 5.2 Replay detection results for MNS with self pruning.
Table 5.3 Replay detection results for MNS by methods in the first category.
Table 5.4 Total replay detection results with fixed TH_slv = 30.
Table 5.5 Total replay detection results with fixed TH_smoothness = 85%.
Table 5.6 Total replay detection results with TH_smoothness = 0.85 and TH_slv = 25.
Table 5.7 Total replay detection results with TH_smoothness = 0.85 and TH_slv = 30.
Table 5.8 Total replay detection results by methods in the first category.
LIST OF FIGURES
Fig. 2.1 Examples of scoreboard frames and non-scoreboard frames.
Fig. 2.2 Block diagram of scoreboard template extraction.
Fig. 2.3 Example of pixel-based frame difference accumulation.
Fig. 2.4 Scoreboard template extraction for 3 different broadcasters with extracted positions marked by white rectangle.
Fig. 2.5 The proposed framework.
Fig. 3.1 An example of basketball webcast text.
Fig. 3.2 Block diagram of the proposed method.
Fig. 3.3 An example to illustrate description and word.
Fig. 3.4 The block diagram of the interactive pre-training system.
Fig. 3.5 Block diagram of unrelated words filtering procedure.
Fig. 3.6 An example to illustrate the concept of the proposed hierarchical search system.
Fig. 3.7 An example to illustrate the data structure for hierarchical search.
Fig. 4.1 Two examples of overlaid scoreboard with game clock in basketball video.
Fig. 4.2 General definitions of game clock patterns.
Fig. 4.3 An example of locating game clock digits (10:30).
Fig. 4.4 An example of text/video alignment.
Fig. 4.5 Examples of basketball games playing without game clock.
Fig. 5.1 Examples of game-related segments.
Fig. 5.2 Block diagram of slow motion replay detection.
Fig. 5.3 An example of comparison between a game-related segment and a replay segment.
Fig. 5.4 The two global features of each MNS in a basketball video.
Fig. 5.5 An example of the DH_1 sequence of a game-related segment misclassified as replay.
Fig. 5.6 Histogram of σ′_DF1 from the preliminary replays in ten experimented basketball videos.
Fig. 5.8 Examples of still shots of the product and slogan in TV commercials.
Fig. 5.9 Examples of abrupt transition detection results and the corresponding cut scenes of non-replay and replay.
CHAPTER 1
INTRODUCTION
1.1 Motivation
Thanks to the rapid growth of computer science and network technology, people
now are capable of using mobile devices, e.g., notebooks, tablets, and smartphones, to
acquire sports videos anytime and anywhere. Since a substantial number of sports
videos are produced and broadcasted every day, it is nearly impossible to watch them
all. Most of the time, people prefer to watch highlights of sports videos or retrieve
only partial video segments that they are interested in. Many websites, such as ESPN,
NBA, and Yahoo Sports, already make this kind of online service available. These
online services are made by professional film editors and sports reporters who
exhaustively watch sports videos personally, so people or fans can see a unified
version. However, these services may not please all fans. For example, fans who
want to practice certain sports skills or imitate specific sports stars cannot take
advantage of the unified version highlight, and have to download the whole game and
search for certain moves made by certain players. It is quite inconvenient. Therefore,
sports video analysis, such as semantic event extraction [1]-[9] and slow motion
replay detection [10]-[18], has drawn much research attention.
1.2 Related Work
Many research efforts have been devoted to sports video analysis. However, some
challenges still remain to be solved and will be presented in the following.
1.2.1 Semantic Event Extraction Challenges
Some semantic event extraction researches [1]-[3] use video content as resource
knowledge. Chen and Deng [1] analyzed video features (e.g. color, motion, shot) to
extract and index events in a basketball video. Hassan et al. [2] extracted audio-visual
(AV) features and applied Conditional Random Fields (CRFs) based probabilistic
graphical model for sports event detection. Kim and Lee [3] built an indexing and
retrieving system for a golf video by analyzing its AV content. However, schemes
relying on video content encounter a challenge called semantic gap, which represents
the distance between video features and semantic events. Recently, some researches
[4]-[9] use a multimodal fusion of video content and external resource knowledge to
bridge the semantic gap. Webcast text, one of the most powerful kinds of external
resource knowledge, is an online commentary posted with a well-defined structure by
professional announcers. It focuses on sports games and contains detailed information
(e.g., event description, game clock, player involved, etc.). The multimodal fusion
scheme, which combines webcast text analysis with text/video alignment to complete
sports video annotation or summarization, has been used in American football [4],
soccer [6]-[8], and basketball [7]-[8].
For webcast text analysis, Xu et al. [8] apply probabilistic latent semantic
analysis (pLSA), a linear algebra–probability combined model, to analyze the webcast
text for text event clustering and detection. Based on their observation, the
descriptions of the same event in the webcast text have a similar sentence structure
and word usage. They use pLSA to first cluster the descriptions into several categories
and then extract keywords from each category for event detection. Although they
extend pLSA for both basketball and soccer, there are two problems in the approach: 1)
the optimal number of event categories is determined by minimizing the ratio of
within-class similarity and between-class similarity. In fact, there are more event
categories for a basketball or soccer game. For example, in a basketball game, many
events, such as timeout, assist, turnover, ejected, are mis-clustered into wrong
categories or discarded as noise. This degrades and limits
the results of sports video retrieval; 2) after keywords extraction, events can be
detected by keywords matching. In Xu et al.’s method, they use the top ranked word
in pLSA model as single-keyword of each event category. But in some event
categories, the single-keyword match will lead to poor results. For example, in their
method for a basketball game, the “jumper” event represents those jumpers that
players make. Without detecting “makes” as a previous word of “jumper” in
description sentences, the precision of “jumper” event detection is decreased from
89.3% to 51.7% in their testing dataset. However, the “jumper” event actually is an
event that consists of “makes jumper” event and “misses jumper” event. The former
can be used in highlights, and the latter can be used in sports behavior analysis and
injury prevention. Accordingly, using single-keyword match is insufficient and some
important events will be discarded.
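The contrast between single-keyword and multi-keyword matching described above can be sketched as follows. This is an illustrative sketch only, not the dissertation's actual matching procedure; the function name and the sample descriptions are hypothetical.

```python
def detect_event(description, keywords):
    """Return True if the keywords appear in order in the description.

    A single-keyword query is a one-element sequence; a multi-keyword
    query such as ("makes", "jumper") requires "makes" to occur before
    "jumper" in the sentence.
    """
    words = description.lower().split()
    pos = 0
    for kw in keywords:
        try:
            # search for the next keyword strictly after the previous match
            pos = words.index(kw, pos) + 1
        except ValueError:
            return False
    return True

# Single-keyword matching cannot separate made from missed jumpers:
# ("jumper",) matches both descriptions below, while ("makes", "jumper")
# matches only the first.
desc_make = "Kobe Bryant makes 20-foot jumper"
desc_miss = "Kobe Bryant misses 20-foot jumper"
```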
In the multimodal fusion scheme, text/video alignment has a great impact on
performance, and it can be achieved through scoreboard recognition. A scoreboard is
usually overlaid on sports videos to present the audience some game related
information (e.g., score, game status, game clock) that can be recognized and aligned
with text results. For sports with a game clock (e.g., basketball and soccer), event
moment detection can be performed through video game clock recognition. Xu et al.
[6]-[8] used Temporal Neighboring Pattern Similarity (TNPS) measure to locate game
clock and recognize each digit of the clock. A detection-verification-redetection
mechanism is proposed to solve the problem of temporal disappearing clock region in
basketball videos. However, recognizing the game clock in a frame which has no game
clock is definitely unnecessary; the cost of verification and redetection could have been
saved if such frames were filtered out in advance. Moreover, their method may also fail
on a semi-transparent scoreboard.
1.2.2 Slow Motion Replay Detection Challenges
As to slow motion replay detection, many methods have been proposed, and they
can be classified into two categories. The first category [10]-[15] is to locate positions
of specific production actions called special digital video effects (SDVEs) or logo
transitions, and uses these positions to detect replay segments. However, methods in
this category all make an imperfect assumption that a replay is sandwiched by either
two visually similar SDVEs or logo transitions; the assumption is not always true in
basketball videos. In fact, a basketball video segment bounded by paired SDVEs is
not always a replay. Moreover, the beginning and end of a basketball replay can have
one of three combinations: 1) paired visually similar SDVEs; 2) non-paired SDVEs; 3)
an SDVE at one end and an abrupt transition at the other. So, previous work in this
category cannot be applied to basketball videos with replays having combinations (2)
and (3).
The second category [16]-[18] analyzes features of replays to distinguish replay
segments from non-replay segments. Farn et al. [16] extracted slow motion replays by
referring to the dominant color of the soccer field; however, it is not applicable to
basketball videos, whose court colors and textures are more complicated. Wang et al.
[17] conducted motion-related features and
presented a support vector machine (SVM) to classify slow motion replays and
normal shots. The precision rates of two experimented basketball videos are 55.6%
and 53.3% with recall rates 62.5% and 66.7%, respectively. Han et al. [18] proposed a
general framework based on Bayesian network to make full use of multiple clues,
including shot structure, gradual transition pattern, slow motion, and sports scene. The
method suffers from the inaccuracy of the automatic gradual transition
detector. Their experiments show precision rate 82.9% and recall rate 83.2%.
Methods in both existing categories are generic but not satisfactory for basketball
videos. Moreover, most previous researches analyze every video frame to detect
replays, but detecting replays in video frames that are surely non-replay degrades both
efficiency and detection rate.
1.3 Synopsis of the Dissertation
Semantic event and slow motion replay extraction for sports videos have become
hot research topics. Most researches analyze every video frame; however, semantic
events only appear in frames with scoreboard, whereas replays only appear in frames
without scoreboard. Extracting events and replays from unrelated frames degrades
both accuracy and efficiency. To tackle these challenges, a novel framework
combining semantic event extraction and slow motion
replay detection is proposed in this dissertation. In the framework, a scoreboard
detector is first provided to divide video frames into two classes, with/without
scoreboard. Then, a semantic event extractor is presented to extract semantic events
from frames with scoreboard and a slow motion replay extractor is proposed to extract
replays from frames without scoreboard.
The rest of the dissertation is organized as follows. Chapter 2 presents an
overview of the proposed framework for sports video analysis. Under the framework,
some sports video analysis schemes are proposed and discussed in Chapter 3 to
Chapter 5. Chapter 3 describes an unsupervised approach to extract semantic events
from sports webcast text. The text/video alignment and event annotation method is
proposed in Chapter 4. Chapter 5 provides a slow motion replay detection method for
broadcast basketball video. Some conclusions and future research directions are given
in Chapter 6.
CHAPTER 2
A NOVEL FRAMEWORK FOR SPORTS VIDEO ANALYSIS
In this chapter, we will propose a novel framework to analyze sports videos. One
of the main novelties is to refer to scoreboard information. It is observed that sports
video frames can be partitioned into two categories according to the existence of
scoreboard. Frames with scoreboard existence are called scoreboard frames, and
others are called non-scoreboard frames. In general, semantic events appear during
playing of a sports game, which consists of scoreboard frames only. Slow motion
replays appear during temporary pauses of a sports game, which consist of
non-scoreboard frames only. This dominant phenomenon is exploited to skip a large
number of unnecessary frames before semantic resource extraction.
Accordingly, both efficiency and detection rate can be assured. The chapter is
organized as follows. In Section 2.1, a video frame partition method to divide frames
into scoreboard frames and non-scoreboard frames is introduced. An overview of the
proposed framework will be presented in Section 2.2. Note that extracting semantic
events from scoreboard frames and extracting slow motion replays from
non-scoreboard frames will be provided in the latter chapters.
2.1 Video Frames Partition
As mentioned above, video frames of a sports game can be classified into two
categories, scoreboard frames and non-scoreboard frames.
Scoreboard frames present basketball game with scoreboard overlaid on them, while
non-scoreboard frames present the rest, e.g., sideline interview, slow motion replay,
etc. Since semantic events appear only in scoreboard frames whereas replays appear
only in non-scoreboard frames, it is beneficial to filter out unnecessary processing
frames in each semantic resource extraction step. So, an automatic scoreboard
template extractor is first proposed to extract scoreboard template and scoreboard
position. Then, the video frame partitioning can be done by simple template matching.
As can be seen from Fig. 2.1(a), a scoreboard is a large, still, rectangular area
consisting of pixels that change very infrequently. Based on this fact, an automatic
scoreboard template extractor is proposed. First, a context-based static region detector
is provided to extract few static regions called scoreboard candidates. Then a
scoreboard selection method is used to get the right scoreboard. The block diagram of
scoreboard template extraction is shown in Fig. 2.2.
Fig. 2.1 Examples of scoreboard frames and non-scoreboard frames: (a) scoreboard
frame; (b) non-scoreboard frame (sideline interview); (c) non-scoreboard frame (TV
commercial); (d) non-scoreboard frame (slow motion replay).
Fig. 2.2 Block diagram of scoreboard template extraction (Video Input →
Context-based Static Region Detection → Scoreboard Selection → Extracted
Scoreboard Template and Position).
2.1.1 Context-Based Static Region Detection
As to context-based static region detection, a sports video is considered as an
input frame sequence. Let fi be the i-th input frame and K be the total frame number.
For each frame fi, the pixel-based frame difference between fi and its previous frame
fi-1 is first calculated as follows:
Df_i(x, y) = | f_i(x, y) − f_{i−1}(x, y) |,  2 ≤ i ≤ K,
where f_i(x, y) represents the color value of pixel (x, y) at frame f_i. Then, an
accumulated difference frame, ADfi, is created by
ADf_i(x, y) = Σ_{j=2}^{i} Df_j(x, y),  2 ≤ i ≤ K.
Fig. 2.3 shows an example. As time goes by, the accumulated difference at each pixel
can be considered as the change degree at that position.
After binarizing the accumulation result, each white point represents the position
that changes more frequently and each black point represents the opposite. Then, we
do region growing on black points of each binarized accumulated difference frame to
find the largest connected component, which satisfies two constraints, as a potential
scoreboard candidate. One constraint is about size. Since a scoreboard should be large
enough to present score information, the width of the bounding box of the connected
component should be at least 1/12 of the frame width and the height should be at least
1/18 of the frame height. The other constraint is about shape: the connected
component should be near rectangular, that is, the ratio of the connected component
area and its bounding box area should be at least 0.9.
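The detection steps above (frame difference accumulation, binarization, and region growing with the size and shape constraints) can be sketched as follows. This is a minimal NumPy sketch under assumed parameters: grayscale frames, a per-frame binarization threshold `diff_thresh` (the binarization threshold is not specified in the text), and 4-connected region growing.

```python
import numpy as np

def scoreboard_candidate(frames, diff_thresh=10, rect_ratio=0.9):
    """Sketch of the context-based static region detector.

    frames: list of grayscale frames (2-D uint8 arrays). Accumulates
    pixel-wise frame differences, binarizes the accumulation, and returns
    the bounding box (x, y, width, height) of the largest static (black)
    connected component satisfying the size and shape constraints.
    """
    h, w = frames[0].shape
    acc = np.zeros((h, w), dtype=np.float64)
    for prev, cur in zip(frames, frames[1:]):
        # Df_i(x, y) = |f_i(x, y) - f_{i-1}(x, y)|, accumulated over time
        acc += np.abs(cur.astype(np.int16) - prev.astype(np.int16))
    # Binarize: black (static) points change infrequently; the average
    # per-frame change below diff_thresh is an assumed criterion.
    static = acc < diff_thresh * (len(frames) - 1)

    labels = np.full((h, w), -1, dtype=np.int32)
    best, next_label = None, 0
    for sy, sx in zip(*np.nonzero(static)):
        if labels[sy, sx] != -1:
            continue
        # Region growing: 4-connected flood fill from an unlabeled point.
        stack, comp = [(sy, sx)], []
        labels[sy, sx] = next_label
        while stack:
            y, x = stack.pop()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and static[ny, nx] and labels[ny, nx] == -1:
                    labels[ny, nx] = next_label
                    stack.append((ny, nx))
        next_label += 1
        ys = [p[0] for p in comp]
        xs = [p[1] for p in comp]
        bw, bh = max(xs) - min(xs) + 1, max(ys) - min(ys) + 1
        # Size constraint: at least 1/12 frame width and 1/18 frame height.
        if bw < w / 12 or bh < h / 18:
            continue
        # Shape constraint: component area / bounding-box area >= 0.9.
        if len(comp) / (bw * bh) < rect_ratio:
            continue
        if best is None or len(comp) > best[0]:
            best = (len(comp), (min(xs), min(ys), bw, bh))
    return None if best is None else best[1]
```

On synthetic frames of random noise with one constant rectangular region, the sketch recovers the region's bounding box as the scoreboard candidate.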
Fig. 2.3 Example of pixel-based frame difference accumulation: (a) video frame
sequence (frame 1, frame 2, frame 3, ..., frame i); (b) pixel-based frame differences
Df_2, Df_3, ..., Df_i; (c) accumulation of neighboring frame pair differences ADf_2,
ADf_3, ..., ADf_i; (d) binarized results.
For each binarized accumulated difference frame, if a potential scoreboard candidate
is found, its position is then recorded. If the position is unchanged for consecutive
frames, e.g. 300 frames, this means a potential scoreboard candidate is stable enough,
and it can be considered as a scoreboard candidate. The context-based static region
detector is applied repeatedly to the video frame sequence until a few candidates are
detected.
2.1.2 Scoreboard Selection
Some sports videos have overlaid rectangular logos made by the TV stations.
The TV station logo is overlaid at the same position during the game while the
scoreboard may disappear from time to time (see Fig. 2.1). Thus the logo is possibly
detected as a scoreboard candidate. Fortunately, a TV station logo is never larger than
a scoreboard, thus the scoreboard selection will prune smaller size candidates. Note
that a scoreboard candidate consists of two parts, position and template. Now, we
have located the scoreboard position. For template, since the scoreboard may
disappear from time to time, extracting a template from a scoreboard candidate
position cannot guarantee a right one. To solve this problem, for each scoreboard
candidate sc extracted from fi, the temporal change of the candidate sc, TC(sc), is
TC(sc) = Σ_s Σ_{x=0}^{Mc−1} Σ_{y=0}^{Nc−1} | f_i(x, y) − f_{i−s}(x, y) |,
where Mc and Nc represent the width and height of sc, f_i(x, y) represents the color value
of pixel (x,y) at frame fi, and s represents temporal frame offset. Then, the scoreboard
selection will take the one with the least temporal change as the scoreboard template.
According to our experiments, four scoreboard candidates are enough to extract
the right scoreboard template. After scoreboard template extraction, the video frames
partition can be done by matching every frame with scoreboard template at the
scoreboard position.
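The final partitioning step can be sketched as a simple per-frame template match at the extracted scoreboard position. The normalized-correlation score and the 0.9 cutoff are assumptions for illustration; the text only states that simple template matching is used.

```python
import numpy as np

def is_scoreboard_frame(frame, template, position, thresh=0.9):
    """Classify a frame by matching the extracted scoreboard template.

    frame, template: 2-D grayscale arrays; position: (x, y) of the
    scoreboard's top-left corner. Uses normalized cross-correlation as
    an assumed similarity measure with an assumed cutoff `thresh`.
    """
    x, y = position
    h, w = template.shape
    region = frame[y:y + h, x:x + w].astype(np.float64)
    t = template.astype(np.float64)
    r, t0 = region - region.mean(), t - t.mean()
    den = np.sqrt((r * r).sum() * (t0 * t0).sum())
    score = (r * t0).sum() / den if den else 0.0
    return score >= thresh
```

A video is then partitioned by applying this test to every frame: matching frames are scoreboard frames, the rest are non-scoreboard frames.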
2.1.3 Experimental Results
Our experiments are conducted on 10 NBA basketball games from 3 different
broadcasters, i.e., ESPN, TNT, NBA TV. The data are recorded from TV in MPEG-2
format with resolution 480 × 352. All 10 scoreboard templates are extracted
successfully. As can be seen from Fig. 2.4, the proposed scoreboard template extractor
works well for the 3 different broadcasters. Due to the effective results for different
style scoreboards, it is believed that the proposed scoreboard template extractor can
be generalized to other sports. Note that a scoreboard contains rich information in a
sports video, e.g., score, game status, and game clock.
Fig. 2.4 Scoreboard template extraction for 3 different broadcasters, with extracted
positions marked by white rectangles: (a) game broadcasted by ESPN; (b) game
broadcasted by TNT; (c) game broadcasted by NBA TV.
2.2 Overview of the Framework
The proposed framework is shown in Fig. 2.5. As can be seen from Fig. 2.5,
existing methods for semantic event extraction and replay detection can be easily
applied to the framework. Contrary to previous works, in the framework, scoreboard frames
and non-scoreboard frames will be separately processed in semantic event extraction
and slow motion replay detection. Since the scoreboard only covers a small part of a
video frame, performing this low-cost partitioning task before semantic resource
extraction substantially improves both time complexity and detection accuracy.
In this dissertation, some sports video analysis schemes are proposed under the
framework. A novel approach for webcast text analysis is presented in Chapter 3.
Semantic event annotation through video clock recognition is provided in Chapter 4.
Accordingly, the framework of the dissertation is presented in Fig. 2.5 as well. Detailed
techniques will be discussed in the following chapters.
Fig. 2.5 The proposed framework.
2.3 Summary
In this chapter, a novel framework for sports video analysis, which provides
flexibility to combine different schemes of event extraction and those of replay
detection, is proposed. The novelty of video frame partition prevents semantic
resource extraction from processing a large number of unnecessary frames, so both
efficiency and detection rate can be increased. The framework is also capable of
acquiring the two valuable semantic resources at one time.
CHAPTER 3
A NOVEL APPROACH FOR SEMANTIC EVENT EXTRACTION FROM SPORTS WEBCAST TEXT
In this chapter, we will propose an unsupervised approach to extract semantic
events from sports webcast text. First, unrelated words in the descriptions of webcast
text are filtered out, and then the filtered descriptions are clustered into significant
event categories. Finally, the keywords for each event category are extracted. The
extracted significant text events can be used for further video indexing and
summarization. Furthermore, we also provide a hierarchical searching scheme for text
event retrieval.
3.1 Introduction
In video summarization and retrieval, a source video is first clipped into smaller
videos representing significant events through a preprocessing, called semantic event
detection, which detects events occurring in a video and annotates events with
appropriate tags. With finer results of the preprocessing, video summarization and
retrieval can be completed efficiently and correctly. Most of existing event detection
schemes use video content as their resource knowledge. However, the schemes
relying on video content encounter a challenge called semantic gap, which represents
the distance between low level video features and high level semantic events. To
bridge the gap, recent schemes introduce external resource knowledge.
One kind of external knowledge is Closed-Caption (CC) [19]. CC is the transcript
of speech and sound, and it is helpful for semantic analysis of sports videos. It is
mainly used in aid of listening and language learning, but only available in certain
videos and certain countries. Because CC completely records the sound in video, it
contains a lot of redundant information and usually lacks structure. The other
external knowledge is webcast text. Compared to CC, webcast text is the online
commentary posted by professional announcers and focuses more on sports games. It
contains more detailed information (e.g., event name, time, player involved, etc.), which
is difficult to extract from the video content itself automatically. Xu and Chua [5] first used
webcast text as external knowledge to assist event detection in soccer video. They
proposed a framework that combines internal AV features with external knowledge to
do event detection and event boundary identification. But the proposed model is
inapplicable to other team sports. Xu et al. [8] apply probabilistic latent semantic
analysis (pLSA), a linear algebra–probability combined model, to analyze the webcast
text for text event clustering and detection. Based on their observation, the
descriptions of the same event in the webcast text have a similar sentence structure
and word usage. They use pLSA to first cluster the descriptions into several categories
and then extract keywords from each category for event detection. Although they
extend pLSA to both basketball and soccer, there are two problems in the approach:
1) The optimal numbers of event categories are nine for basketball and eight for
soccer in the results, which is determined by minimizing the ratio of within-class
similarity and between-class similarity. In fact, there are more event categories for
a basketball or soccer game. For example, in a basketball game, many events,
such as timeout, assist, turnover, ejected, are mis-clustered into wrong categories
or discarded as noise. This degrades and limits the
results of video retrieval.
2) After keywords extraction, events can be detected by keywords matching. In Xu et
al.’s method, they use the top ranked word in pLSA model as single-keyword of
each event category. But in some event categories, the single-keyword match will
lead to poor results. For example, in their method for a basketball game,
“jumper” event represents those jumpers that players make. Without detecting
“makes” as a previous word of “jumper” in description sentences, the precision of
“jumper” event detection is decreased from 89.3% to 51.7% in their testing dataset.
However, the “jumper” event actually is an event that consists of “makes jumper”
event and “misses jumper” event. The former can be used in highlights, and the
latter can be used in sports behavior analysis and injury prevention. Accordingly,
using single-keyword match is insufficient and some important events will be discarded.
To treat the above-mentioned problems, we propose a method to analyze sports
webcast text and extract significant text events. An unsupervised scheme is used to
detect events from the webcast text and extract multiple keywords from each event. A
data structure is used to store these multiple keywords and to support a hierarchical
search system with auto-complete feature for event retrieval. The word “hierarchical”
means that a user can get more specific results by querying more keywords and the
word “auto-complete” means that the system can give suggested keywords during the
query step.
3.2 Proposed Method
Webcast text comprises knowledge which is closely related to the game and is
easily retrieved from websites. As can be seen in Fig. 3.1, it contains time tags, team
names, scores, and event descriptions. The format is well organized, so we can follow the time flow and understand how the game progresses. Among this well-organized text, event descriptions apparently relate to semantic events the most. Our goal is to extract semantic events from these event descriptions.
Fig. 3.1 An example of basketball webcast text.
Fig. 3.2 Block diagram of the proposed method.
The block diagram of the proposed method is presented in Fig. 3.2. It can be seen
that we first filter out unrelated words of webcast text and then cluster them into
significant events. We store the extracted semantic information with a pair of index
tables and build a hierarchical retrieval system by manipulating the two tables. The
detail of each block will be described in the following subsections.
3.2.1 Unrelated Words Filtering
In webcast text, each description can be considered as an event. It contains many
words and may include player name, team name, movement name, and whether the
player or the team makes the movement or not. An example is given in Fig. 3.3, where a
player named “Peja Stojakovic” failed to make a movement called “10-foot two point
shot.”
The number of descriptions in each basketball game is more than four hundred.
The descriptions are readable and can be easily categorized into several events by
human eyes. But the task is not effortless for computers. According to our
observations, words in each description consist of three mutually disjoint word sets: 1)
stop words, 2) event keywords, and 3) names. Stop words are unrelated to event and
should be discarded. Event keywords are closely related to event and should be kept
preserved for event annotation. Our objective is to extract event keywords and use
these keywords to do event clustering. To achieve the objective, based on a reference
stop word list and online name information, an interactive system is first provided
to establish a sports stop word list and an event keyword list. The system will be
explained in Sections 3.2.1.1 and 3.2.1.2. According to these two lists, for each
webcast text, an unrelated word filtering procedure described in Section 3.2.1.3 is
next provided to filter out stop words and to preserve name words. The remaining
keywords are then used for event clustering, which will be described in Section 3.2.2.
Fig. 3.3 An example to illustrate description and word.
3.2.1.1 Stop Words
In information retrieval, there are some words that occur very frequently (e.g.
some articles, prepositions, pronouns, be-verbs) and are useless in document matching.
These words are called stop words [20]. Because stop words are useless for matching, filtering them out during both the index step and the query step can reduce the index size and the query processing time. This technique has been used in search engines and can be implemented through predefining a stop word list. Because applications vary, there is no standard stop word list; many reference stop word lists [21]-[22] have been proposed using statistical and probabilistic techniques.
From Fig. 3.1, it can be seen that descriptions contain articles (e.g. “the”),
prepositions (e.g. “of”), range of shot (e.g. “10-foot”), and points of shot (e.g. “two
point”). Some words are details of events which decrease the connections between
similar events. With the aid of reference stop lists, articles and prepositions can be
easily filtered out from descriptions. However, the range of shot and points of shot are
exceptions in reference stop lists. Moreover, in soccer webcast text, due to the
relatively larger ground, there are more unrelated words to describe locations where
an event happens. For example, right wing, left wing, inside the box, outside the box,
left corner, right corner, etc. Accordingly, it is hard to automatically generate a sports
stop word list for all kinds of sports. So we will provide an interactive system to
establish a sports stop word list.
3.2.1.2 The Proposed Interactive System for Establishing Sports Stop Word List and Event Keyword List
First, the webcast text descriptions of several games are taken as training inputs; next, some unrelated
words are filtered out according to a reference stop word list [21] and a name word
list (e.g., online box score in basketball and online player statistics in soccer). And
then the system interacts with sports professionals, who will divide the remaining
words into a black list and a white list. The black list contains stop words for sports,
and the white list contains sports event keywords. Finally the black list is merged into
the reference stop word list to get the sports stop word list. The block diagram of the
interactive system is presented in Fig. 3.4.
Our training webcast text is collected from 41 basketball games and 48 soccer games. After the reference stop word filtering and the name word filtering, fewer than 100 words in basketball and fewer than 200 words in soccer remain to be labeled interactively by professionals, so their responses take only a little time.
Fig. 3.4 The block diagram of the interactive pre-training system.
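The pre-training flow of Fig. 3.4 can be sketched as follows. This is a minimal illustration under our own naming: the word lists and the `ask_professional` callback are hypothetical placeholders standing in for the actual reference stop word list, name word list, and human-in-the-loop interaction.

```python
def build_sports_lists(training_descriptions, reference_stop_words, name_words,
                       ask_professional):
    """Partition the remaining training vocabulary into a black list
    (sports stop words) and a white list (event keywords)."""
    vocabulary = set()
    for description in training_descriptions:
        for word in description.lower().split():
            # Reference stop words and name words are filtered out first.
            if word not in reference_stop_words and word not in name_words:
                vocabulary.add(word)

    black_list, white_list = set(), set()
    for word in sorted(vocabulary):
        # A sports professional decides whether the word carries event semantics.
        if ask_professional(word):
            white_list.add(word)
        else:
            black_list.add(word)

    # Merge the black list into the reference list to get the sports stop word list.
    sports_stop_words = reference_stop_words | black_list
    return sports_stop_words, white_list
```

For instance, with the description of Fig. 3.3 as input, a word like "10-foot" would be routed to the professional and end up in the black list, while "misses" would end up in the white list.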
3.2.1.3 The Proposed Unrelated Words Filtering Procedure
Fig. 3.5 shows the block diagram of the proposed unrelated words filtering
procedure. For a webcast text, the sports stop word list is first used to filter out
unrelated words. Next the event keyword list is used to extract event keywords. Then
the words with uppercase beginning in the remaining words are considered as
reserved names for further indexing. According to our experiment results, the
unrelated words filtering works well in both basketball and soccer.
Fig. 3.5 Block diagram of unrelated words filtering procedure.
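Assuming the two lists from the pre-training step are available, the filtering procedure of Fig. 3.5 can be sketched as below; the function and variable names are ours, for illustration only.

```python
def filter_description(description, sports_stop_words, event_keywords):
    """Split a webcast description into event keywords and reserved names.

    Words in the sports stop word list are discarded; words in the event
    keyword list are kept for clustering; remaining words beginning with
    an uppercase letter are treated as names reserved for indexing."""
    keywords, names = [], []
    for word in description.split():
        lower = word.lower()
        if lower in sports_stop_words:
            continue                      # unrelated word: discard
        if lower in event_keywords:
            keywords.append(lower)        # event keyword: keep for clustering
        elif word[0].isupper():
            names.append(word)            # uppercase beginning: reserved name
    return keywords, names
```

For the description of Fig. 3.3, with "10-foot", "two", and "point" in the sports stop word list, this yields the keywords ["misses", "shot"] and the names ["Peja", "Stojakovic"].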
3.2.2 Event Clustering
After filtering, each description is reduced and almost exactly describes an event;
for example, “misses shot” represents a missed shot. So a matching function is
provided to cluster these filtered descriptions into event categories.
Filtered descriptions can be represented as FD = { fd1, fd2,…, fdN }, and event
categories can be represented as C = { C1, C2, …, CK }, where N denotes the number
of descriptions in a game and K denotes the number of categories that the clustering
step produces. Since a filtered description consists of some words, it can be
considered as a set of words. Note that the number of keywords of an event category
is not restricted to be single in our method. The matching function is defined as
\[
Text\_Match(x, y) =
\begin{cases}
1, & \text{if } x = y, \\
0, & \text{otherwise,}
\end{cases}
\tag{3.1}
\]
where x and y are two sets of words. Each filtered description, fdi, can be clustered
into one category based on the following function
\[
Clustering(fd_i) = \operatorname*{arg\,max}_{m = 1, \dots, K} \{\, Text\_Match(fd_i, Keywords(C_m)) \,\}, \quad i = 1, \dots, N,
\tag{3.2}
\]
where Keywords(Cm) denotes the multiple-keywords set of category Cm. Clustering(fdi)
= j means that description fd_i is clustered into category C_j. In order to avoid zero matching in (3.2), a flag function to examine whether this situation happens is defined as
\[
Flag(fd_i) = \max_{m = 1, \dots, K} \{\, Text\_Match(fd_i, Keywords(C_m)) \,\}, \quad i = 1, \dots, N.
\tag{3.3}
\]
The detail of the proposed clustering algorithm is given below.
Clustering Algorithm
Step 0: Initialization: Given FD = { fd_1, fd_2, …, fd_N }.
        Set K = 1, Clustering(fd_1) = 1, Keywords(C_1) = fd_1, i = 2.
Step 1: // Cluster the description fd_i according to Functions (3.1), (3.2), and (3.3).
        For m = 1 to K, use Function (3.1) to calculate
            TM_im = Text_Match(fd_i, Keywords(C_m));
        Let Flag(fd_i) = max_{m = 1, …, K} { TM_im };
        if (Flag(fd_i) = 0) then begin
            // fd_i cannot be clustered into any existing class,
            // so create a new class for fd_i
            K = K + 1;
            Keywords(C_K) = fd_i;
            Clustering(fd_i) = K;
        end
        else
            // fd_i is clustered into one of the existing classes
            Use Function (3.2) to calculate
                Clustering(fd_i) = argmax_{m = 1, …, K} { TM_im };
Step 2: If any description in FD is not clustered yet, set i = i + 1 and go to Step 1 for the next iteration. Otherwise, end of iterations.
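The algorithm above amounts to exact-match clustering on keyword sets. A minimal Python sketch (our own illustrative code, with Text_Match realized as set equality) is:

```python
def cluster_descriptions(filtered_descriptions):
    """Cluster filtered descriptions by exact keyword-set matching (3.1)-(3.3).

    Each description is given as a list of its remaining keywords. Returns
    the category keyword sets and, for each description, the 1-based index
    of the category it is clustered into."""
    categories = []          # Keywords(C_1), ..., Keywords(C_K)
    assignment = []          # Clustering(fd_i) for i = 1, ..., N
    for fd in filtered_descriptions:
        keyword_set = frozenset(fd)
        # Text_Match(fd_i, Keywords(C_m)) is 1 iff the two sets are equal,
        # so Flag(fd_i) = 0 exactly when no existing category matches.
        try:
            m = categories.index(keyword_set)      # an existing class matches
        except ValueError:
            categories.append(keyword_set)         # create a new class
            m = len(categories) - 1
        assignment.append(m + 1)
    return categories, assignment
```

Note that using frozensets makes word order irrelevant, so "misses shot" and "shot misses" fall into the same category.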
Once the clustering algorithm is completed, the filtered descriptions are clustered into event categories, and the keyword set of each category serves as the multiple keywords of the event. In the meantime, semantic event detection is accomplished. Then two data structures are built to recommend further query keywords to users and to support the hierarchical search.
3.2.3 Hierarchical Search System
Fig. 3.6 gives an example to show the concept of the proposed hierarchical
search system. First, a user can query by one word to get rough results. Then he can
continually query by more words to get into deeper levels for finer results. Here we
implement the system by establishing a pair of index tables and manipulating them
back and forth.
Fig. 3.6 An example to illustrate the concept of the proposed hierarchical search system.
Here we build a forward index table and an inverted index table. The former
records mappings from descriptions to event keywords, and the latter stores mappings
from keywords to descriptions. Note that the forward index table is established
automatically after applying the unrelated words filtering procedure. Based on the
forward index table, the inverted index table can be established by sequentially
scanning the event keyword set of each description. An example is given in Fig. 3.7 for a clearer explanation. Suppose we have five descriptions as shown in Fig. 3.7(a). After
applying unrelated words filtering procedure to each description, we can obtain Fig.
3.7(b). By scanning each row in Fig. 3.7(b), for each row, we can obtain a description
index (DI) and the corresponding event keyword set (EKS). Then DI is linked to each
keyword in EKS. After scanning all rows sequentially in Fig. 3.7(b), Fig. 3.7(c) is
established. Both inverted index table and forward index table are referred to achieve
the hierarchical search system. The inverted index table is used for returning query
results by intersecting those description sets mapped by query keywords. The forward
index is originally just an intermediate, but it is reused in our method for providing suggested query keywords.

Index of Description | Description
D1 | Peja Stojakovic misses 10-foot two point shot
D2 | David West misses jumper
D3 | Peja Stojakovic makes 19-foot two point shot
D4 | Trevor Ariza makes 19-foot jumper
D5 | David West makes 17-foot jumper (Chris Paul assists)
(a) Descriptions and their indices.
Forward Index
Index of Description | Event Keyword Set
D1 | misses, shot
D2 | misses, jumper
D3 | makes, shot
D4 | makes, jumper
D5 | assists, makes, jumper
(b) Mappings from description indices to event keywords.
Inverted Index
Keyword | Indices of Description Set
assists | D5
jumper | D2, D4, D5
makes | D3, D4, D5
misses | D1, D2
shot | D1, D3
(c) Mappings from keywords to description indices.
Fig. 3.7 An example to illustrate the data structure for hierarchical search.
In our system, a query is considered as a set of multiple words. The hierarchical feature means that a user can get more general results by querying fewer words or more specific results by querying more words; for example, the results of querying "jumper" are those descriptions having the keyword "jumper", and the results of querying "jumper makes" are those descriptions having both keywords "jumper" and "makes". The query result is the intersection of the description sets obtained through the query keywords in the inverted index list.
of query in the inverted index list. For providing suggested query keywords, the
resulting intersection set is then used as another query for the forward index list. The
keyword set of each description in the resulting intersection set is extracted. Finally,
the union of all extracted keyword sets is considered as the suggested query keywords.
The detailed algorithm of the proposed search system is given below.
Hierarchical Search Algorithm
Step1: A user types several query words.
Step2: Look up the inverted index and get description sets mapped by the query
words. Intersect these description sets to obtain a query result.
Step3: Look up the forward index and get keyword sets mapped by the query
result.
Step4: Output the union set of these keyword sets. The user then selects some keywords from the output as new query words; perform Step2 again and output the query result.
Here, we use Fig. 3.7 as an example. Assume that a user types a query {jumper}; the system looks up the inverted index list and gets a temporary result set {D2, D4, D5}. Then, the system looks up the forward index list and recommends the user {assists, jumper, makes, misses}, i.e. the union of the keyword sets of D2, D4, and D5. When the user extends the query to {jumper, makes}, the system returns {D4, D5}, i.e. the intersection of {D3, D4, D5} and {D2, D4, D5}. Therefore, a powerful hierarchical search system with a query recommendation function is built.
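The two index tables and the search steps can be sketched in a few lines of Python. This is a simplified illustration of the data structures, not the full system:

```python
from collections import defaultdict

def build_inverted_index(forward_index):
    """Derive the inverted index by scanning each row of the forward index."""
    inverted = defaultdict(set)
    for desc_id, keyword_set in forward_index.items():
        for keyword in keyword_set:
            inverted[keyword].add(desc_id)
    return inverted

def hierarchical_search(query_words, forward_index, inverted_index):
    """Return (query result, suggested keywords) as in the search algorithm."""
    # Step 2: intersect the description sets mapped by the query words.
    sets = [inverted_index.get(w, set()) for w in query_words]
    result = set.intersection(*sets) if sets else set()
    # Steps 3-4: union of the keyword sets of the resulting descriptions.
    suggestions = set().union(*(forward_index[d] for d in result)) if result else set()
    return result, suggestions
```

With the forward index of Fig. 3.7(b), querying {jumper} returns {D2, D4, D5} and suggests {assists, jumper, makes, misses}; refining the query to {jumper, makes} returns {D4, D5}.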
3.3 Experimental Results
In most search systems, statistical analysis such as receiver operating
characteristic (ROC) analysis or recall-precision is used to evaluate the performance.
Through the analysis, the system degradation caused by misclassification can be
estimated. However, as mentioned in Section 3.2.2, we cluster descriptions by an
exactly matching function, so there is no misclassified event in our system. This
means that both precision and recall rates of the proposed method are 100%.
Few studies have aimed at detecting text events from webcast text. Xu and Chua [5] modeled webcast text as external knowledge in detecting events from
football and soccer. The evaluation of the fusion video event detection was presented,
but that of webcast text analysis alone was not. Xu et al. [8] proposed a framework to
analyze webcast text and videos independently and align them through game time.
According to the framework, the performance of video event detection mainly
depends on webcast text analysis. Here we compare our method with Xu et al.'s work.
For basketball, we collect 25 NBA 2009-2010 games and 41 NBA 2008-2009 postseason games. The former are used as the training database, and the latter are used as the testing database to examine the reliability of the proposed method. We
also collect 68 UEFA Champions League 2010-2011 soccer games, where 20 of them
are used as training database and the other 48 are used as testing database. The
webcast text of the 134 games is acquired from the ESPN website. As can be seen in Table 3.1, hundreds of descriptions in a game are clustered into, on average, 44 semantic event categories for basketball and 20 semantic event categories for soccer.
Table 3.1 Average number of sports event categories in 25 basketball training data and 20 soccer training data.
 | Mean | Variance | Standard deviation
Basketball | 44.08 | 9.08 | 3.01
Soccer | 19.85 | 5.40 | 2.32
In Xu et al.'s previous work using pLSA, the optimal number of event categories is nine for basketball and eight for soccer. The top three keywords of each category are selected by a conditional probability, and the top ranked keyword is used as the single keyword during event detection. We map the top three results of pLSA to our multiple-keyword categories in Table 3.2 and Table 3.3. In Table 3.3, because "attempt" is chosen as a member of the black list in the interactive system, we use "shot" as the single keyword for mappings from soccer events in pLSA to those in the proposed method. In addition, the words "missed" and "misses" are interchangeable and have the same meaning in descriptions. We consider these two words as the same and use "missed(misses)" as their common representative.
performance in detecting semantic events, Xu et al. not only use keywords detection
in description sentences, but also analyze context information in them. For example,
in basketball, the top ranked keyword “jumper” is detected as “Jumper” event only if
its previous word is “makes,” and other sentences containing word “jumper,” e.g.,
Kenyon Martin misses 22-foot jumper, are discarded. However, these discarded
events are actually semantic events and can be valuable for further research, e.g.,
sports posture analysis, injury prevention, special highlight, etc. It can be seen from
Table 3.2 and Table 3.3 that every category of pLSA is mapped to several different
semantic events of the proposed method. These several events are related but
somehow different. For example, in basketball, “jumper misses” describes that a
jumper is missed while “jumper makes” describes that a jumper is made successfully.
In soccer, “blocked shot” describes that a shot attempt is blocked by an opponent
while “missed(misses) shot” describes that a shot attempt is missed by the kicker
himself. Hence, misclassifying or discarding these events decreases the precision and
recall rates. However, in our method, the precision and recall rates are both 100%.
With the support of the hierarchical search system, we can query multiple keywords for finer results. Table 3.2 and Table 3.3 also show those semantic event categories which are unavailable in
Xu et al.’s method, but can be detected in our method, e.g., steal, timeout, turnover for
basketball and injury, blocked, penalty for soccer. These semantic events are
important for special highlights or injury prevention, and should not be ignored or
misclassified. So, the proposed method is superior to pLSA.
Table 3.2 Mappings of basketball event categories from pLSA to the proposed method.
Category | Ranked Keywords (Xu et al.'s Method, pLSA) | Proposed Method (Categories with Multiple Keywords)
Shot | shot, pass, bad | makes shot, misses shot
Jumper | jumper, foot, misses | jumper misses, jumper makes, assists jumper makes
Layup | layup, driving, blocks | layup makes, layup misses, driving layup makes, assists layup makes
Dunk | dunk, makes | dunk makes, assists dunk makes, dunk makes slam, driving dunk makes, dunk misses
Block | blocks, shot, assists | blocks layup, blocks jumper, blocks driving layup, blocks hook shot, blocks shot, blocks dunk
Rebound | rebound, defensive, offensive | defensive rebound, offensive rebound
Foul | foul, draw, personal | draws foul shooting, draws foul personal, draws foul offensive, ball draws foul loose, foul technical, defense foul illegal person, draws flagrant foul type
Free throw | throw, free, makes | free makes throw, free misses throw
Substitution | enters, timeout | enters game
N/A | — | bad pass, bad pass steals, bad lost steals, full timeout, official timeout, turnover, traveling, ejected, double dribble, defense illegal, clock
Table 3.3 Mappings of soccer event categories from pLSA to the proposed method.
Category | Ranked Keywords (Xu et al.'s Method, pLSA) | Proposed Method (Categories with Multiple Keywords)
Corner | corner, conceded, bottom | corner, assisted corner saved shot, corner goal penalty shot, corner saved shot, assisted corner goal, assisted corner goal shot, assisted corner missed(misses), corner goal shot, corner missed(misses) shot, assisted corner missed(misses) shot, corner free kick missed(misses) shot, assisted corner saved, corner free goal kick shot
Shot | attempt, right, footed | blocked shot, assisted missed(misses) shot, assisted blocked shot, assisted goal saved shot, missed(misses) shot, assisted corner saved shot, assisted shot, corner goal penalty shot, corner saved shot, assisted corner goal shot, corner goal shot, corner missed(misses) shot, goal saved shot, free kick shot, assisted goal shot, free kick missed(misses) shot, assisted corner missed(misses) shot, corner free kick missed(misses) shot, goal penalty saved shot, corner free goal kick shot, goal penalty shot
Foul | foul, for | foul, card foul yellow, foul penalty, card foul dangerous
Card | yellow, shown | card foul yellow, card yellow
Free kick | kick, free, wins | free kick, free kick shot, free kick missed(misses) shot, corner free kick missed(misses) shot, corner free goal kick shot
Offside | offside, ball, tries | offside
Substitution | substitution, replaces, lineups | replaces substitution, injury replaces substitution
Goal | goal, box | assisted goal saved shot, corner goal penalty shot, assisted corner goal, assisted corner goal shot, corner goal shot, goal saved shot, assisted goal shot, assisted goal saved, goal penalty saved shot, goal saved, goal, corner free goal kick shot, goal penalty shot
N/A | — | injury, assisted missed(misses), assisted blocked, penalty, assisted
Here we want to examine the reliability of the proposed method. For basketball,
25 NBA 2009-2010 games are taken as training data. After processing all the training
data and gathering the extracted semantic events, we collect the union of these
semantic events as a sample set with cardinality 82. Then we process the testing data,
which are collected from 41 NBA 2008-2009 postseason games, and examine whether the extracted semantic events belong to the sample set.
For soccer, we use 20 UEFA Champions League soccer games as training data and 48
UEFA Champions League soccer games as testing data. According to our examination,
with sparse exceptions, almost all the semantic events extracted from testing data can
be found in the sample set. Table 3.4 and Table 3.5 show all exception events which
are quite rare. These exceptions may be caused by different writing styles or rarely occurring events, and can still be collected in an interactive way if necessary.
Therefore, the proposed method is very stable.
Table 3.4 Occurrences of exception basketball events from 41 testing games (18679 basketball descriptions).
Exception event | Number (Percentage)
10 second | 3 (0.02%)
backcourt | 7 (0.04%)
called full timeout | 1 (0.01%)
driving dunk misses | 2 (0.01%)
dunk misses slam | 2 (0.01%)
away ball draws foul | 5 (0.03%)
misses pointer | 7 (0.04%)
flagrant free misses throw | 1 (0.01%)
blocks driving dunk | 1 (0.01%)
Table 3.5 Occurrences of exception soccer events from 48 testing games (5727 soccer descriptions).
Exception event | Number (Percentage)
card | 6 (0.10%)
corner penalty saved shot | 2 (0.03%)
missed(misses) | 3 (0.05%)
goal shot | 1 (0.02%)
assisted corner missed shot | 1 (0.02%)
missed shot | 1 (0.02%)
shot | 4 (0.07%)
corner missed(misses) | 3 (0.05%)
corner saved | 2 (0.03%)
assisted corner | 1 (0.02%)
blocked | 1 (0.02%)
3.4 Summary
In this chapter, we have proposed an unsupervised approach for semantic event
extraction from sports webcast text and made some contributions: 1) detecting
semantic events from webcast text in an unsupervised manner; 2) requiring no
additional context information analysis; 3) preserving more significant events in
sports games; 4) extracting multiple keywords from event categories to support
hierarchical searching; 5) providing auto-complete feature for finer retrieval.
According to experimental results, the proposed method extracts significant semantic
events from basketball and soccer games and preserves those events that are ignored
or misclassified by previous work. The extracted significant text events can be used
for further video indexing and summarization. Furthermore, since the proposed method uses no sport-specific assumptions, it is expected to be extensible to other kinds of sports.
CHAPTER 4
ANNOTATING WEBCAST TEXT IN BASKETBALL VIDEOS BY GAME CLOCK RECOGNITION AND TEXT/VIDEO ALIGNMENT
In this chapter, we will propose a text/video alignment and event annotation
method. As mentioned in Chapter 2, semantic events appear in scoreboard frames only.
Thus, the proposed semantic event extraction method focuses on analyzing
scoreboard frames. For each scoreboard frame, the location of each clock digit is first determined. A digit template collection scheme is provided to collect digit character
templates. With clock digit locations and digit templates, a two-step strategy is
proposed to recognize game clocks on the semi-transparent scoreboard in scoreboard
frames. With the game clock recognized from sports video, the alignment work is
done by finding every match for game clock extracted from webcast text and
annotating the corresponding event description on video frames.
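Assuming the game clock has been recognized on each scoreboard frame, the alignment step reduces to a lookup from webcast time tags to frame indices. A simplified sketch (our own illustrative code, ignoring quarters and clock pauses) is:

```python
def align_events(frame_clocks, text_events):
    """Annotate webcast text events on video frames by matching game clocks.

    frame_clocks: list of (frame_index, "MM:SS") pairs recognized from video.
    text_events:  list of ("MM:SS", description) pairs from webcast text.
    Returns a {frame_index: [descriptions]} annotation map."""
    # Build a map from each recognized clock value to the frames showing it;
    # the same clock value usually spans several consecutive frames.
    clock_to_frames = {}
    for frame_index, clock in frame_clocks:
        clock_to_frames.setdefault(clock, []).append(frame_index)

    annotations = {}
    for clock, description in text_events:
        for frame_index in clock_to_frames.get(clock, []):
            annotations.setdefault(frame_index, []).append(description)
    return annotations
```

In a real basketball game the same clock value recurs in every quarter, so the actual alignment must also disambiguate by quarter; this sketch only shows the core matching idea.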
4.1 Introduction
In the world, a substantial number of sports videos are produced and broadcast through television programs or Internet streaming. It is nearly impossible to watch all sports videos. Most of the time, fans prefer to watch highlights of sports videos or retrieve only the partial video segments that they are interested in. Therefore, among sports video topics, automatic semantic event detection and video annotation are essential works.
Most existing research [1]-[3] uses video content as resource knowledge. However, schemes relying on video content encounter a challenge called the semantic gap. Recently, some research [4]-[9] has used a multimodal fusion of video content and
external resource knowledge to bridge the semantic gap. The multimodal fusion
scheme, which analyzes webcast text and video content separately and then does
text/video alignment to complete sports video annotation or summarization, has been
used in American football [4], soccer [6]-[8], and basketball [7]-[8].
In the scheme, text/video alignment, which consists of event moment detection
and event boundary detection, has a great impact on performance. It can be achieved
through scoreboard recognition. As can be seen in Fig. 4.1, a scoreboard is usually
overlaid on sports videos to present the audience some game related information (e.g.,
score, game status, game clock) that can be recognized and aligned with text results.
For sports with game clock (e.g., basketball and soccer), event moment detection can
be performed through video game clock recognition. Xu et al. [6]-[8] used Temporal
Neighboring Pattern Similarity (TNPS) measure to locate game clock and recognize
each digit of the clock. A detection-verification-redetection mechanism is proposed to
solve the problem of the temporarily disappearing clock region in basketball videos. However, since the clock disappears only in frames without a scoreboard, which can be identified beforehand, the verification and redetection steps are unnecessary, and their cost could have been avoided.
Moreover, the clock digit characters cannot be located on a semi-transparent
scoreboard.
(a) Transparent scoreboard.
(b) Non-transparent scoreboard.
Fig. 4.1 Two examples of overlaid scoreboard with game clock in basketball video.
According to our observation, two main problems of detecting game clock in
basketball videos are the temporal disappearance and the temporal pause of game
clock. The temporal disappearance of game clock may be caused by slow motion