National Chiao Tung University
Institute of Computer Science and Engineering
Doctoral Dissertation
A STUDY ON SEMANTIC ANNOTATION AND
SUMMARIZATION OF BASKETBALL VIDEO
Student: Chun-Min Chen
Advisor: Dr. Ling-Hwei Chen
A Dissertation Submitted to
Institute of Computer Science and Engineering
College of Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
in
Computer Science
July 2014
Hsinchu, Taiwan, Republic of China
A STUDY ON SEMANTIC ANNOTATION AND SUMMARIZATION OF BASKETBALL VIDEO
Student: Chun-Min Chen
Advisor: Dr. Ling-Hwei Chen
Institute of Computer Science and Engineering
College of Computer Science, National Chiao Tung University
ABSTRACT (IN CHINESE)
Sports videos play an important role in our leisure and entertainment. However, because sports videos carry a large amount of information, they not only require considerable bandwidth and transmission time but also cost viewers much time to watch. To save unnecessary time and energy costs, video highlight retrieval, video summarization, and slow motion replay detection have become popular research topics. Most existing methods analyze every frame of a video; however, semantic events occur only in frames with a scoreboard, whereas slow motion replays appear only in frames without a scoreboard. Extracting semantic events or slow motion replays from unrelated frames degrades both accuracy and efficiency. Moreover, most existing methods are designed for soccer videos, and basketball videos have received relatively little attention. To address these challenges, this dissertation takes basketball video as an example and proposes a novel sports video analysis framework, which allows general audiences to retrieve game highlights efficiently and lets professionals extend it to other related applications (automatic highlight generation, player movement analysis, team tactics analysis, etc.). In the framework, a video frame partition method is first provided to divide sports video frames into two classes, with and without a scoreboard. Then, a semantic event extraction method is proposed for frames with a scoreboard, and a slow motion replay detection method is proposed for frames without a scoreboard.

As to semantic event extraction, most existing methods use the visual or audio content of the video itself as features. However, using only video content as features often incurs a semantic gap, that is, the distance between low-level video features and high-level semantic events. Although some recent methods refer to webcast text as external knowledge to bridge the semantic gap, extracting semantic events from webcast text and annotating them on sports videos still involve many difficulties and challenges. In this dissertation, we will discuss these difficulties and propose two methods to solve them.

As to slow motion replay detection, existing methods can be roughly classified into two categories. A slow motion replay is often preceded and followed by special effect frames added by the broadcaster in post-production, and the first category of methods detects replays based on the positions of these effects; however, basketball videos are more complicated, and this assumption does not always hold for them. The second category analyzes features of slow motion segments and uses these features to distinguish replay segments from normal segments, but since some features designed for soccer are not applicable to basketball, such methods still leave room for improvement for basketball. Basketball is one of the most important sports in the world, yet many challenges remain in detecting slow motion replays in basketball videos. This dissertation will propose a new method to detect slow motion replays in basketball videos, providing an important resource for sports video analysis.

Experimental results show that the feasibility and effectiveness of the proposed framework and methods are well validated. Since the proposed framework and methods use no basketball-specific features, we expect that this dissertation can be extended to other types of sports videos.
A STUDY ON SEMANTIC ANNOTATION AND
SUMMARIZATION OF BASKETBALL VIDEO
Student: Chun-Min Chen Advisor: Dr. Ling-Hwei Chen
Institute of Computer Science and Engineering College of Computer Science
National Chiao Tung University
ABSTRACT
Semantic event and slow motion replay extraction for sports videos have become
hot research topics. Most researches analyze every video frame; however, semantic
events only appear in frames with scoreboard, whereas replays only appear in frames
without scoreboard. Extracting events and replays from unrelated frames degrades
both accuracy and efficiency. In this dissertation, a novel
framework will be proposed to tackle challenges of sports video analysis. In the
framework, a scoreboard detector is first provided to divide video frames into two
classes, with/without scoreboard. Then, a semantic event extractor is presented to
extract semantic events from frames with scoreboard and a slow motion replay
extractor is proposed to extract replays from frames without scoreboard.
As to semantic event extraction, most existing approaches use video content as
features and thus suffer from the semantic gap, i.e., the distance between low-level
video features and high-level semantic events.
Although the multimodal fusion scheme that uses webcast text as external
knowledge to bridge the semantic gap has been proposed recently, extracting semantic
events from sports webcast text and annotating semantic events in sports videos are
still challenging tasks. In this dissertation, we will address the challenges in the
multimodal fusion scheme. Then, we will propose two methods to overcome the
challenges.
As to slow motion replay detection, many methods have been proposed, and they
are classified into two categories. One assumes that a replay is sandwiched by a pair
of visually similar special digital video effects, but the assumption is not always true
in basketball videos. The other analyzes replay features to distinguish replay segments
from non-replay segments. The results are not satisfactory since some features (e.g.
dominant color of sports field) are not applicable for basketball. Most replay detectors
focus on soccer videos. In this dissertation, we will propose a novel idea to detect
slow motion replays in basketball videos.
The feasibility and effectiveness of all the above proposed methods have been
demonstrated in experiments. Since no basketball-specific features are used, it is
expected that the proposed sports video analysis framework can be extended to other
types of sports videos.
ACKNOWLEDGMENT (IN CHINESE)

First, I would like to offer my sincerest gratitude to my advisor, Professor Ling-Hwei Chen. Under her guidance, as both a mentor and a mother figure, I benefited greatly in research methodology, problem-solving ability, and the attitude of dealing with people and affairs. I am especially grateful for her support, which allowed me to balance my love of basketball with my enthusiasm for academic research; I was very fortunate to meet such a good teacher.

Next, I would like to thank the many members of the Automatic Information Processing Laboratory; thanks to the company of my senior and junior labmates, my research life was fulfilling and interesting, and never lonely. I thank 井民全 and 郭萓聖 for being excellent role models when I first joined the laboratory. I thank 李惠龍, 楊文超, and 歐占和 for fighting alongside me from the qualifying examination onward and for their advice and help throughout my academic career. I thank 林懷三 for his kind assistance during the oral defense.

I would also like to thank my teammates on the National Chiao Tung University men's basketball team for their support, which gave me a haven when research wore me out; sprinting on the court with them has been my honor and the most precious memory of my life. I thank my best friends 偉益, 金煌, 志瑋, 秉澄, and 信華 for accompanying me through my weakest moments. I thank every friend who has ever helped and encouraged me in my life; you have made me a better person.

Finally, and most importantly, I thank my family, who have always supported me unconditionally: my father 國淇, my mother 淑真, my brother 俊宇, and my sister 韻如. They have forever been my strongest backing, allowing me to pursue my goals, challenge life, and enjoy living without worry. With the deepest gratitude, I dedicate this dissertation to my dearest family.

TABLE OF CONTENTS
CHINESE ABSTRACT
ENGLISH ABSTRACT
ACKNOWLEDGMENT (IN CHINESE)
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Related Work
1.2.1 Semantic Event Extraction Challenges
1.2.2 Slow Motion Replay Detection Challenges
1.3 Synopsis of the Dissertation
CHAPTER 2 A NOVEL FRAMEWORK FOR SPORTS VIDEO ANALYSIS
2.1 Video Frames Partition
2.1.1 Context-Based Static Region Detection
2.1.2 Scoreboard Selection
2.1.3 Experimental Results
2.2 Overview of the Framework
2.3 Summary
CHAPTER 3 A NOVEL APPROACH FOR SEMANTIC EVENT EXTRACTION FROM SPORTS WEBCAST TEXT
3.1 Introduction
3.2 Proposed Method
3.2.1 Unrelated Words Filtering
3.2.1.1 Stop Words
3.2.1.2 The Proposed Interactive System for Establishing Sports Stop Word List and Event Keyword List
3.2.1.3 The Proposed Unrelated Words Filtering Procedure
3.2.2 Event Clustering
3.3 Experimental Results
3.4 Summary
CHAPTER 4 ANNOTATING WEBCAST TEXT IN BASKETBALL VIDEOS BY GAME CLOCK RECOGNITION AND TEXT/VIDEO ALIGNMENT
4.1 Introduction
4.2 Proposed Method
4.2.1 Video Frames Partition
4.2.2 Semantic Event Extraction from Scoreboard Frames
4.2.2.1 Clock Digit Locator
4.2.2.2 Clock Digit Template Collection
4.2.2.3 Clock Digit Recognition
4.2.2.4 Text/Video Alignment
4.3 Experimental Results
4.4 Summary
CHAPTER 5 A NOVEL METHOD FOR SLOW MOTION REPLAY DETECTION IN BROADCAST BASKETBALL VIDEO
5.1 Introduction
5.2 Proposed Method
5.2.1 Video Frames Partition
5.2.2 Feature Extraction and Replay Detection
5.3 Experimental Results
5.4 Summary
CHAPTER 6 CONCLUSIONS AND FUTURE WORKS
REFERENCES
PUBLICATION LIST
LIST OF TABLES
Table 3.1 Average number of sports event categories in 25 basketball training data and 20 soccer training data.
Table 3.2 Mappings of basketball event categories from pLSA to the proposed method.
Table 3.3 Mappings of soccer event categories from pLSA to the proposed method.
Table 3.4 Occurrences of exception basketball events from 41 testing games.
Table 3.5 Occurrences of exception soccer events from 48 testing games.
Table 4.1 Semantic events extraction results of the proposed method.
Table 5.1 Replay detection results for MNS.
Table 5.2 Replay detection results for MNS with self pruning.
Table 5.3 Replay detection results for MNS by methods in the first category.
Table 5.4 Total replay detection results with fixed TH_slv = 30.
Table 5.5 Total replay detection results with fixed TH_smoothness = 85%.
Table 5.6 Total replay detection results with TH_smoothness = 0.85 and TH_slv = 25.
Table 5.7 Total replay detection results with TH_smoothness = 0.85 and TH_slv = 30.
Table 5.8 Total replay detection results by methods in the first category.
LIST OF FIGURES
Fig. 2.1 Examples of scoreboard frames and non-scoreboard frames.
Fig. 2.2 Block diagram of scoreboard template extraction.
Fig. 2.3 Example of pixel-based frame difference accumulation.
Fig. 2.4 Scoreboard template extraction for 3 different broadcasters with extracted positions marked by white rectangle.
Fig. 2.5 The proposed framework.
Fig. 3.1 An example of basketball webcast text.
Fig. 3.2 Block diagram of the proposed method.
Fig. 3.3 An example to illustrate description and word.
Fig. 3.4 The block diagram of the interactive pre-training system.
Fig. 3.5 Block diagram of unrelated words filtering procedure.
Fig. 3.6 An example to illustrate the concept of the proposed hierarchical search system.
Fig. 3.7 An example to illustrate the data structure for hierarchical search.
Fig. 4.1 Two examples of overlaid scoreboard with game clock in basketball video.
Fig. 4.2 General definitions of game clock patterns.
Fig. 4.3 An example of locating game clock digits (10:30).
Fig. 4.4 An example of text/video alignment.
Fig. 4.5 Examples of basketball games playing without game clock.
Fig. 5.1 Examples of game-related segments.
Fig. 5.2 Block diagram of slow motion replay detection.
Fig. 5.3 An example of comparison between a game-related segment and a replay segment.
Fig. 5.4 The two global features of each MNS in a basketball video.
Fig. 5.5 An example of the DH_1 sequence of a game-related segment misclassified as replay.
Fig. 5.6 Histogram of σ′_DF1 from the preliminary replays in ten experimented basketball videos.
Fig. 5.8 Examples of still shots of the product and slogan in TV commercials.
Fig. 5.9 Examples of abrupt transition detection results and the corresponding cut scenes of non-replay and replay.
CHAPTER 1
INTRODUCTION
1.1 Motivation
Thanks to the rapid growth of computer science and network technology, people
now are capable of using mobile devices, e.g., notebooks, tablets, and smartphones, to
acquire sports videos anytime and anywhere. Since a substantial number of sports
videos are produced and broadcasted every day, it is nearly impossible to watch them
all. Most of the time, people prefer to watch highlights of sports videos or retrieve
only partial video segments that they are interested in. Many websites, such as ESPN,
NBA, and Yahoo Sports, already make this kind of online service available. These
online services are made by professional film editors and sports reporters who
exhaustively watch sports videos personally, so people or fans can see a unified
version. However, these services may not please all fans. For example, fans who
want to practice certain sports skills or imitate specific sports stars cannot take
advantage of the unified version highlight, and have to download the whole game and
search for certain moves made by certain players. It is quite inconvenient. Therefore,
sports video analysis, such as semantic event extraction [1]-[9] and slow motion
replay detection [10]-[18], has drawn much research attention.
1.2 Related Work
Many research efforts have been devoted to sports video analysis. However, some
challenges still remain to be solved and will be presented in the following.
1.2.1 Semantic Event Extraction Challenges
Some semantic event extraction researches [1]-[3] use video content as resource
knowledge. Chen and Deng [1] analyzed video features (e.g. color, motion, shot) to
extract and index events in a basketball video. Hassan et al. [2] extracted audio-visual
(AV) features and applied Conditional Random Fields (CRFs) based probabilistic
graphical model for sports event detection. Kim and Lee [3] built an indexing and
retrieving system for a golf video by analyzing its AV content. However, schemes
relying on video content encounter a challenge called semantic gap, which represents
the distance between video features and semantic events. Recently, some researches
[4]-[9] use a multimodal fusion of video content and external resource knowledge to
bridge the semantic gap. Webcast text, one of the most powerful kinds of external
resource knowledge, is an online commentary posted with a well-defined structure by
professional announcers. It focuses on sports games and contains detailed information
(e.g., event description, game clock, player involved, etc.). The multimodal fusion
scheme, which combines webcast text analysis with text/video alignment to complete
sports video annotation or summarization, has been used in American football [4],
soccer [6]-[8], and basketball [7]-[8].
For webcast text analysis, Xu et al. [8] apply probabilistic latent semantic
analysis (pLSA), a linear algebra–probability combined model, to analyze the webcast
text for text event clustering and detection. Based on their observation, the
descriptions of the same event in the webcast text have a similar sentence structure
and word usage. They use pLSA to first cluster the descriptions into several categories
and then extract keywords from each category for event detection. Although they
extend pLSA for both basketball and soccer, there are two problems in the approach: 1)
the optimal number of event categories is determined by minimizing the ratio of
within-class similarity and between-class similarity. In fact, there are more event
categories for a basketball or soccer game. For example, in a basketball game, many
events, such as timeout, assist, turnover, ejected, are mis-clustered into wrong
categories or discarded as noise. This degrades and limits
the results of sports video retrieval; 2) after keywords extraction, events can be
detected by keywords matching. In Xu et al.’s method, they use the top ranked word
in pLSA model as single-keyword of each event category. But in some event
categories, the single-keyword match will lead to poor results. For example, in their
method for a basketball game, the “jumper” event represents those jumpers that
players make. Without detecting “makes” as a previous word of “jumper” in
description sentences, the precision of “jumper” event detection is decreased from
89.3% to 51.7% in their testing dataset. However, the “jumper” event actually is an
event that consists of “makes jumper” event and “misses jumper” event. The former
can be used in highlights, and the latter can be used in sports behavior analysis and
injury prevention. Accordingly, using single-keyword match is insufficient and some
important events will be discarded.
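The contrast between single-keyword and multi-keyword matching described above can be sketched as follows. This is an illustrative sketch only, not the dissertation's actual matching procedure; the function name and the sample descriptions are hypothetical.

```python
def detect_event(description, keywords):
    """Return True if the keywords appear in order in the description.

    A single-keyword query is a one-element sequence; a multi-keyword
    query such as ("makes", "jumper") requires "makes" to occur before
    "jumper" in the sentence.
    """
    words = description.lower().split()
    pos = 0
    for kw in keywords:
        try:
            # search for the next keyword strictly after the previous match
            pos = words.index(kw, pos) + 1
        except ValueError:
            return False
    return True

# Single-keyword matching cannot separate made from missed jumpers:
# ("jumper",) matches both descriptions below, while ("makes", "jumper")
# matches only the first.
desc_make = "Kobe Bryant makes 20-foot jumper"
desc_miss = "Kobe Bryant misses 20-foot jumper"
```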
In the multimodal fusion scheme, text/video alignment has a great impact on
performance, and it can be achieved through scoreboard recognition. A scoreboard is
usually overlaid on sports videos to present the audience some game related
information (e.g., score, game status, game clock) that can be recognized and aligned
with text results. For sports with a game clock (e.g., basketball and soccer), event
moment detection can be performed through video game clock recognition. Xu et al.
[6]-[8] used Temporal Neighboring Pattern Similarity (TNPS) measure to locate game
clock and recognize each digit of the clock. A detection-verification-redetection
mechanism is proposed to solve the problem of temporal disappearing clock region in
basketball videos. However, recognizing the game clock in a frame which has no game
clock is definitely unnecessary; the cost of verification and redetection could have been
saved if such frames were filtered out in advance. Moreover, their method may also fail
on a semi-transparent scoreboard.
1.2.2 Slow Motion Replay Detection Challenges
As to slow motion replay detection, many methods have been proposed, and they
can be classified into two categories. The first category [10]-[15] is to locate positions
of specific production actions called special digital video effects (SDVEs) or logo
transitions, and uses these positions to detect replay segments. However, methods in
this category all make an imperfect assumption that a replay is sandwiched by either
two visually similar SDVEs or logo transitions; the assumption is not always true in
basketball videos. In fact, a basketball video segment bounded by paired SDVEs is
not always a replay. Moreover, the beginning and end of a basketball replay can have
one of three combinations: 1) paired visually similar SDVEs; 2) non-paired SDVEs; 3)
an SDVE at one end and an abrupt transition at the other. So, previous work in this
category cannot be applied to basketball videos with replays having combinations (2)
and (3).
The second category [16]-[18] analyzes features of replays to distinguish replay
segments from non-replay segments. Farn et al. [16] extracted slow motion replays by
referring to the dominant color of the soccer field; however, it is not applicable to
basketball videos, whose court colors and textures are more complicated. Wang et al.
[17] conducted motion-related features and
presented a support vector machine (SVM) to classify slow motion replays and
normal shots. The precision rates of two experimented basketball videos are 55.6%
and 53.3% with recall rates 62.5% and 66.7%, respectively. Han et al. [18] proposed a
general framework based on Bayesian network to make full use of multiple clues,
including shot structure, gradual transition pattern, slow motion, and sports scene. The
method suffers from the inaccuracy of the automatic gradual transition
detector. Their experiments show precision rate 82.9% and recall rate 83.2%.
Methods in both existing categories are generic but not satisfactory for basketball
videos. Moreover, most previous researches analyze every video frame to detect
replays, but detecting replays in video frames that are surely non-replay degrades both
efficiency and detection rate.
1.3 Synopsis of the Dissertation
Semantic event and slow motion replay extraction for sports videos have become
hot research topics. Most researches analyze every video frame; however, semantic
events only appear in frames with scoreboard, whereas replays only appear in frames
without scoreboard. Extracting events and replays from unrelated frames degrades
both accuracy and efficiency. To tackle these challenges, a novel framework
combining semantic event extraction and slow motion
replay detection is proposed in this dissertation. In the framework, a scoreboard
detector is first provided to divide video frames into two classes, with/without
scoreboard. Then, a semantic event extractor is presented to extract semantic events
from frames with scoreboard and a slow motion replay extractor is proposed to extract
replays from frames without scoreboard.
The rest of the dissertation is organized as follows. Chapter 2 presents an
overview of the proposed framework for sports video analysis. Under the framework,
some sports video analysis schemes are proposed and discussed in Chapter 3 to
Chapter 5. Chapter 3 describes an unsupervised approach to extract semantic events
from sports webcast text. The text/video alignment and event annotation method is
proposed in Chapter 4. Chapter 5 provides a slow motion replay detection method for
broadcast basketball video. Some conclusions and future research directions are given
in Chapter 6.
CHAPTER 2
A NOVEL FRAMEWORK FOR SPORTS VIDEO ANALYSIS
In this chapter, we will propose a novel framework to analyze sports videos. One
of the main novelties is to refer to scoreboard information. It is observed that sports
video frames can be partitioned into two categories according to the existence of
scoreboard. Frames with scoreboard existence are called scoreboard frames, and
others are called non-scoreboard frames. In general, semantic events appear during
playing of a sports game, which consists of scoreboard frames only. Slow motion
replays appear during temporary pauses of a sports game, which consist of
non-scoreboard frames only. This dominant phenomenon is exploited to skip a large
number of unnecessary frames before semantic resource extraction.
Accordingly, both efficiency and detection rate can be assured. The chapter is
organized as follows. In Section 2.1, a video frame partition method to divide frames
into scoreboard frames and non-scoreboard frames is introduced. An overview of the
proposed framework will be presented in Section 2.2. Note that extracting semantic
events from scoreboard frames and extracting slow motion replays from
non-scoreboard frames will be provided in the latter chapters.
2.1 Video Frames Partition
As mentioned above, video frames of a sports game can be classified into two
categories, scoreboard frames and non-scoreboard frames.
Scoreboard frames present basketball game with scoreboard overlaid on them, while
non-scoreboard frames present the rest, e.g., sideline interview, slow motion replay,
etc. Since semantic events appear only in scoreboard frames whereas replays appear
only in non-scoreboard frames, it is beneficial to filter out unnecessary processing
frames in each semantic resource extraction step. So, an automatic scoreboard
template extractor is first proposed to extract scoreboard template and scoreboard
position. Then, the video frame partitioning can be done by simple template matching.
As can be seen from Fig. 2.1(a), a scoreboard is a large, still, rectangular area
consisting of pixels that change very infrequently. Based on this fact, an automatic
scoreboard template extractor is proposed. First, a context-based static region detector
is provided to extract few static regions called scoreboard candidates. Then a
scoreboard selection method is used to get the right scoreboard. The block diagram of
scoreboard template extraction is shown in Fig. 2.2.
Fig. 2.1 Examples of scoreboard frames and non-scoreboard frames: (a) scoreboard
frame; (b) non-scoreboard frame (sideline interview); (c) non-scoreboard frame (TV
commercial); (d) non-scoreboard frame (slow motion replay).
Fig. 2.2 Block diagram of scoreboard template extraction (Video Input →
Context-based Static Region Detection → Scoreboard Selection → Extracted
Scoreboard Template and Position).
2.1.1 Context-Based Static Region Detection
As to context-based static region detection, a sports video is considered as an
input frame sequence. Let fi be the i-th input frame and K be the total frame number.
For each frame fi, the pixel-based frame difference between fi and its previous frame
fi-1 is first calculated as follows:
Df_i(x, y) = | f_i(x, y) − f_{i−1}(x, y) |,  2 ≤ i ≤ K,
where f_i(x, y) represents the color value of pixel (x, y) at frame f_i. Then, an
accumulated difference frame, ADfi, is created by
ADf_i(x, y) = Σ_{j=2}^{i} Df_j(x, y),  2 ≤ i ≤ K.
Fig. 2.3 shows an example. As time goes by, the accumulated difference at each pixel
can be considered as the change degree at that position.
After binarizing the accumulation result, each white point represents the position
that changes more frequently and each black point represents the opposite. Then, we
do region growing on black points of each binarized accumulated difference frame to
find the largest connected component, which satisfies two constraints, as a potential
scoreboard candidate. One constraint is about size. Since a scoreboard should be large
enough to present score information, the width of the bounding box of the connected
component should be at least 1/12 of the frame width and the height should be at least
1/18 of the frame height. The other constraint is about shape: the connected
component should be near rectangular, that is, the ratio of the connected component
area and its bounding box area should be at least 0.9.
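The detection steps above (frame difference accumulation, binarization, and region growing with the size and shape constraints) can be sketched as follows. This is a minimal NumPy sketch under assumed parameters: grayscale frames, a per-frame binarization threshold `diff_thresh` (the binarization threshold is not specified in the text), and 4-connected region growing.

```python
import numpy as np

def scoreboard_candidate(frames, diff_thresh=10, rect_ratio=0.9):
    """Sketch of the context-based static region detector.

    frames: list of grayscale frames (2-D uint8 arrays). Accumulates
    pixel-wise frame differences, binarizes the accumulation, and returns
    the bounding box (x, y, width, height) of the largest static (black)
    connected component satisfying the size and shape constraints.
    """
    h, w = frames[0].shape
    acc = np.zeros((h, w), dtype=np.float64)
    for prev, cur in zip(frames, frames[1:]):
        # Df_i(x, y) = |f_i(x, y) - f_{i-1}(x, y)|, accumulated over time
        acc += np.abs(cur.astype(np.int16) - prev.astype(np.int16))
    # Binarize: black (static) points change infrequently; the average
    # per-frame change below diff_thresh is an assumed criterion.
    static = acc < diff_thresh * (len(frames) - 1)

    labels = np.full((h, w), -1, dtype=np.int32)
    best, next_label = None, 0
    for sy, sx in zip(*np.nonzero(static)):
        if labels[sy, sx] != -1:
            continue
        # Region growing: 4-connected flood fill from an unlabeled point.
        stack, comp = [(sy, sx)], []
        labels[sy, sx] = next_label
        while stack:
            y, x = stack.pop()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and static[ny, nx] and labels[ny, nx] == -1:
                    labels[ny, nx] = next_label
                    stack.append((ny, nx))
        next_label += 1
        ys = [p[0] for p in comp]
        xs = [p[1] for p in comp]
        bw, bh = max(xs) - min(xs) + 1, max(ys) - min(ys) + 1
        # Size constraint: at least 1/12 frame width and 1/18 frame height.
        if bw < w / 12 or bh < h / 18:
            continue
        # Shape constraint: component area / bounding-box area >= 0.9.
        if len(comp) / (bw * bh) < rect_ratio:
            continue
        if best is None or len(comp) > best[0]:
            best = (len(comp), (min(xs), min(ys), bw, bh))
    return None if best is None else best[1]
```

On synthetic frames of random noise with one constant rectangular region, the sketch recovers the region's bounding box as the scoreboard candidate.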
Fig. 2.3 Example of pixel-based frame difference accumulation: (a) video frame
sequence (frame 1, frame 2, frame 3, ..., frame i); (b) pixel-based frame differences
Df_2, Df_3, ..., Df_i; (c) accumulation of neighboring frame pair differences ADf_2,
ADf_3, ..., ADf_i; (d) binarized results.
For each binarized accumulated difference frame, if a potential scoreboard candidate
is found, its position is then recorded. If the position is unchanged for consecutive
frames, e.g. 300 frames, this means a potential scoreboard candidate is stable enough,
and it can be considered as a scoreboard candidate. The context-based static region
detector is applied repeatedly to the video frame sequence until a few candidates are
detected.
2.1.2 Scoreboard Selection
Some sports videos have overlaid rectangular logos made by the TV stations.
The TV station logo is overlaid at the same position during the game while the
scoreboard may disappear from time to time (see Fig. 2.1). Thus the logo is possibly
detected as a scoreboard candidate. Fortunately, a TV station logo is never larger than
a scoreboard, thus the scoreboard selection will prune smaller size candidates. Note
that a scoreboard candidate consists of two parts, position and template. Now, we
have located the scoreboard position. For template, since the scoreboard may
disappear from time to time, extracting a template from a scoreboard candidate
position cannot guarantee a right one. To solve this problem, for each scoreboard
candidate sc extracted from fi, the temporal change of the candidate sc, TC(sc), is
TC(sc) = Σ_s Σ_{x=0}^{Mc−1} Σ_{y=0}^{Nc−1} | f_i(x, y) − f_{i−s}(x, y) |,
where Mc and Nc represent the width and height of sc, f_i(x, y) represents the color value
of pixel (x,y) at frame fi, and s represents temporal frame offset. Then, the scoreboard
selection will take the one with the least temporal change as the scoreboard template.
According to our experiments, four scoreboard candidates are enough to extract
the right scoreboard template. After scoreboard template extraction, the video frames
partition can be done by matching every frame with scoreboard template at the
scoreboard position.
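The final partitioning step can be sketched as a simple per-frame template match at the extracted scoreboard position. The normalized-correlation score and the 0.9 cutoff are assumptions for illustration; the text only states that simple template matching is used.

```python
import numpy as np

def is_scoreboard_frame(frame, template, position, thresh=0.9):
    """Classify a frame by matching the extracted scoreboard template.

    frame, template: 2-D grayscale arrays; position: (x, y) of the
    scoreboard's top-left corner. Uses normalized cross-correlation as
    an assumed similarity measure with an assumed cutoff `thresh`.
    """
    x, y = position
    h, w = template.shape
    region = frame[y:y + h, x:x + w].astype(np.float64)
    t = template.astype(np.float64)
    r, t0 = region - region.mean(), t - t.mean()
    den = np.sqrt((r * r).sum() * (t0 * t0).sum())
    score = (r * t0).sum() / den if den else 0.0
    return score >= thresh
```

A video is then partitioned by applying this test to every frame: matching frames are scoreboard frames, the rest are non-scoreboard frames.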
2.1.3 Experimental Results
Our experiments are conducted on 10 NBA basketball games from 3 different
broadcasters, i.e., ESPN, TNT, NBA TV. The data are recorded from TV in MPEG-2
format with resolution 480 × 352. All 10 scoreboard templates are extracted
successfully. As can be seen from Fig. 2.4, the proposed scoreboard template extractor
works well for the 3 different broadcasters. Due to the effective results for different
style scoreboards, it is believed that the proposed scoreboard template extractor can
be generalized to other sports. Note that a scoreboard contains rich information in a
sports video, e.g., score, game status, and game clock.
Fig. 2.4 Scoreboard template extraction for 3 different broadcasters, with extracted
positions marked by white rectangles: (a) game broadcasted by ESPN; (b) game
broadcasted by TNT; (c) game broadcasted by NBA TV.
2.2 Overview of the Framework
The proposed framework is shown in Fig. 2.5. As can be seen from Fig. 2.5,
existing methods for semantic event extraction and replay detection can be easily
applied to the framework. Contrary to previous works, in the framework, scoreboard frames
and non-scoreboard frames will be separately processed in semantic event extraction
and slow motion replay detection. Since the scoreboard only covers a small part of a
video frame, performing this low-cost partitioning task before semantic resource
extraction substantially improves both time complexity and detection accuracy.
In this dissertation, some sports video analysis schemes are proposed under the
framework. A novel approach for webcast text analysis is presented in Chapter 3.
Semantic event annotation through video clock recognition is provided in Chapter 4.
Accordingly, the framework of the dissertation is presented in Fig. 2.5 as well. Detailed
techniques will be discussed in the following chapters.
Fig. 2.5 The proposed framework.
2.3 Summary
In this chapter, a novel framework for sports video analysis, which provides
flexibility to combine different schemes of event extraction and those of replay
detection, is proposed. The novelty of video frame partition prevents semantic
resource extraction from processing a large number of unnecessary frames, so both
efficiency and detection rate can be increased. The framework is also capable of
acquiring the two valuable semantic resources at one time.
CHAPTER 3
A NOVEL APPROACH FOR SEMANTIC EVENT EXTRACTION FROM SPORTS WEBCAST TEXT
In this chapter, we will propose an unsupervised approach to extract semantic
events from sports webcast text. First, unrelated words in the descriptions of webcast
text are filtered out, and then the filtered descriptions are clustered into significant
event categories. Finally, the keywords for each event category are extracted. The
extracted significant text events can be used for further video indexing and
summarization. Furthermore, we also provide a hierarchical searching scheme for text
event retrieval.
3.1 Introduction
In video summarization and retrieval, a source video is first clipped into smaller
videos representing significant events through a preprocessing, called semantic event
detection, which detects events occurring in a video and annotates events with
appropriate tags. With finer results of the preprocessing, video summarization and
retrieval can be completed efficiently and correctly. Most of existing event detection
schemes use video content as their resource knowledge. However, the schemes
relying on video content encounter a challenge called semantic gap, which represents
the distance between low level video features and high level semantic events. To
bridge the gap, recent schemes introduce external resource knowledge.
One kind of external knowledge is Closed-Caption (CC) [19]. CC is the transcript
of speech and sound, and it is helpful for semantic analysis of sports videos. It is
mainly used in aid of listening and language learning, but only available in certain
videos and certain countries. Because CC completely records the sound in video, it
contains a lot of redundant information and usually lacks structure. The other
external knowledge is webcast text. Compared to CC, webcast text is the online
commentary posted by professional announcers and focuses more on sports games. It
contains more detailed information (e.g., event name, time, player involved, etc.), which
is difficult to extract from the video content itself automatically. Xu and Chua [5] first used
webcast text as external knowledge to assist event detection in soccer video. They
proposed a framework that combines internal AV features with external knowledge to
do event detection and event boundary identification. But the proposed model is
inapplicable to other team sports. Xu et al. [8] apply probabilistic latent semantic
analysis (pLSA), a linear algebra–probability combined model, to analyze the webcast
text for text event clustering and detection. Based on their observation, the
descriptions of the same event in the webcast text have a similar sentence structure
and word usage. They use pLSA to first cluster the descriptions into several categories
and then extract keywords from each category for event detection. Although they
extend pLSA to both basketball and soccer, there are two problems in the approach:
1) The optimal numbers of event categories are nine for basketball and eight for
soccer in the results, which is determined by minimizing the ratio of within-class
similarity and between-class similarity. In fact, there are more event categories for
a basketball or soccer game. For example, in a basketball game, many events,
such as timeout, assist, turnover, ejected, are mis-clustered into wrong categories
or discarded as noise. This degrades and limits the
results of video retrieval.
2) After keywords extraction, events can be detected by keywords matching. In Xu et
al.’s method, they use the top ranked word in pLSA model as single-keyword of
each event category. But in some event categories, the single-keyword match will
lead to poor results. For example, in their method for a basketball game,
“jumper” event represents those jumpers that players make. Without detecting
“makes” as a previous word of “jumper” in description sentences, the precision of
“jumper” event detection is decreased from 89.3% to 51.7% in their testing dataset.
However, the “jumper” event actually is an event that consists of “makes jumper”
event and “misses jumper” event. The former can be used in highlights, and the
latter can be used in sports behavior analysis and injury prevention. Accordingly,
using single-keyword match is insufficient and some important events will be discarded.
To treat the above-mentioned problems, we propose a method to analyze sports
webcast text and extract significant text events. An unsupervised scheme is used to
detect events from the webcast text and extract multiple keywords from each event. A
data structure is used to store these multiple keywords and to support a hierarchical
search system with auto-complete feature for event retrieval. The word “hierarchical”
means that a user can get more specific results by querying more keywords and the
word “auto-complete” means that the system can give suggested keywords during the
query step.
3.2 Proposed Method
Webcast text comprises knowledge which is closely related to the game and is
easily retrieved from websites. As can be seen in Fig. 3.1, it contains time tags, team
names, scores, and event descriptions. The format is well organized, so we can follow the time flow and understand how the game progresses. Among this well-organized text, event descriptions apparently relate to semantic events the most. Our goal is to extract semantic events from these event descriptions.
Fig. 3.1 An example of basketball webcast text.
Fig. 3.2 Block diagram of the proposed method.
The block diagram of the proposed method is presented in Fig. 3.2. It can be seen
that we first filter out unrelated words of webcast text and then cluster them into
significant events. We store the extracted semantic information with a pair of index
tables and build a hierarchical retrieval system by manipulating the two tables. The
detail of each block will be described in the following subsections.
3.2.1 Unrelated Words Filtering
In webcast text, each description can be considered as an event. It contains many
words and may include player name, team name, movement name, and whether the
player or the team makes the movement or not. An example is given in Fig. 3.3, where a
player named “Peja Stojakovic” failed to make a movement called “10-foot two point
shot.”
The number of descriptions in each basketball game is more than four hundred.
The descriptions are readable and can be easily categorized into several events by
human eyes. But the task is not effortless for computers. According to our
observations, words in each description consist of three mutually disjoint word sets: 1)
stop words, 2) event keywords, and 3) names. Stop words are unrelated to event and
should be discarded. Event keywords are closely related to event and should be kept
preserved for event annotation. Our objective is to extract event keywords and use
these keywords to do event clustering. To achieve the objective, based on a reference
stop word list and online name information, an interactive system is first provided
to establish a sports stop word list and an event keyword list. The system will be
explained in Sections 3.2.1.1 and 3.2.1.2. According to these two lists, for each
webcast text, an unrelated word filtering procedure described in Section 3.2.1.3 is
next provided to filter out stop words and to preserve name words. The remaining
keywords are then used for event clustering, which will be described in Section 3.2.2.
Fig. 3.3 An example to illustrate description and word.
3.2.1.1 Stop Words
In information retrieval, there are some words that occur very frequently (e.g.
some articles, prepositions, pronouns, be-verbs) and are useless in document matching.
These words are called stop words [20]. Because stop words are useless for matching, filtering them out during both the index step and the query step can reduce the index size and the query processing time. This technique has been used in search engines and can be implemented through predefining a stop word list. Because applications vary, there is no standard stop word list; many reference stop word lists [21]-[22] have been proposed using statistical and probabilistic techniques.
From Fig. 3.1, it can be seen that descriptions contain articles (e.g. “the”),
prepositions (e.g. “of”), range of shot (e.g. “10-foot”), and points of shot (e.g. “two
point”). Some words are details of events which decrease the connections between
similar events. With the aid of reference stop lists, articles and prepositions can be
easily filtered out from descriptions. However, the range of shot and points of shot are
exceptions in reference stop lists. Moreover, in soccer webcast text, due to the
relatively larger ground, there are more unrelated words to describe locations where
an event happens. For example, right wing, left wing, inside the box, outside the box,
left corner, right corner, etc. Accordingly, it is hard to automatically generate a sports
stop word list for all kinds of sports. So we will provide an interactive system to
establish a sports stop word list.
3.2.1.2 The Proposed Interactive System for Establishing Sports Stop Word List and Event Keyword List
First, the webcast text descriptions of several games are taken as training inputs; next, some unrelated
words are filtered out according to a reference stop word list [21] and a name word
list (e.g., online box score in basketball and online player statistics in soccer). And
then the system interacts with sports professionals, who will divide the remaining
words into a black list and a white list. The black list contains stop words for sports,
and the white list contains sports event keywords. Finally the black list is merged into
the reference stop word list to get the sports stop word list. The block diagram of the
interactive system is presented in Fig. 3.4.
Our training webcast text is collected from 41 basketball games and 48 soccer games. After the reference stop word filtering and the name word filtering, fewer than 100 words in basketball and fewer than 200 words in soccer remain to be labeled interactively by professionals, so their responses take only a little time.
Fig. 3.4 The block diagram of the interactive pre-training system.
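The pre-training flow of Fig. 3.4 can be sketched as follows. This is a minimal illustration under our own naming: the word lists and the `ask_professional` callback are hypothetical placeholders standing in for the actual reference stop word list, name word list, and human-in-the-loop interaction.

```python
def build_sports_lists(training_descriptions, reference_stop_words, name_words,
                       ask_professional):
    """Partition the remaining training vocabulary into a black list
    (sports stop words) and a white list (event keywords)."""
    vocabulary = set()
    for description in training_descriptions:
        for word in description.lower().split():
            # Reference stop words and name words are filtered out first.
            if word not in reference_stop_words and word not in name_words:
                vocabulary.add(word)

    black_list, white_list = set(), set()
    for word in sorted(vocabulary):
        # A sports professional decides whether the word carries event semantics.
        if ask_professional(word):
            white_list.add(word)
        else:
            black_list.add(word)

    # Merge the black list into the reference list to get the sports stop word list.
    sports_stop_words = reference_stop_words | black_list
    return sports_stop_words, white_list
```

For instance, with the description of Fig. 3.3 as input, a word like "10-foot" would be routed to the professional and end up in the black list, while "misses" would end up in the white list.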
3.2.1.3 The Proposed Unrelated Words Filtering Procedure
Fig. 3.5 shows the block diagram of the proposed unrelated words filtering
procedure. For a webcast text, the sports stop word list is first used to filter out
unrelated words. Next the event keyword list is used to extract event keywords. Then
the words with uppercase beginning in the remaining words are considered as
reserved names for further indexing. According to our experiment results, the
unrelated words filtering works well in both basketball and soccer.
Fig. 3.5 Block diagram of unrelated words filtering procedure.
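Assuming the two lists from the pre-training step are available, the filtering procedure of Fig. 3.5 can be sketched as below; the function and variable names are ours, for illustration only.

```python
def filter_description(description, sports_stop_words, event_keywords):
    """Split a webcast description into event keywords and reserved names.

    Words in the sports stop word list are discarded; words in the event
    keyword list are kept for clustering; remaining words beginning with
    an uppercase letter are treated as names reserved for indexing."""
    keywords, names = [], []
    for word in description.split():
        lower = word.lower()
        if lower in sports_stop_words:
            continue                      # unrelated word: discard
        if lower in event_keywords:
            keywords.append(lower)        # event keyword: keep for clustering
        elif word[0].isupper():
            names.append(word)            # uppercase beginning: reserved name
    return keywords, names
```

For the description of Fig. 3.3, with "10-foot", "two", and "point" in the sports stop word list, this yields the keywords ["misses", "shot"] and the names ["Peja", "Stojakovic"].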
3.2.2 Event Clustering
After filtering, each description is reduced and almost exactly describes an event;
for example, “misses shot” represents a missed shot. So a matching function is
provided to cluster these filtered descriptions into event categories.
Filtered descriptions can be represented as FD = { fd1, fd2,…, fdN }, and event
categories can be represented as C = { C1, C2, …, CK }, where N denotes the number
of descriptions in a game and K denotes the number of categories that the clustering
step produces. Since a filtered description consists of some words, it can be
considered as a set of words. Note that the number of keywords of an event category
is not restricted to be single in our method. The matching function is defined as
\[
Text\_Match(x, y) =
\begin{cases}
1, & \text{if } x = y, \\
0, & \text{otherwise,}
\end{cases}
\tag{3.1}
\]
where x and y are two sets of words. Each filtered description, fdi, can be clustered
into one category based on the following function
\[
Clustering(fd_i) = \operatorname*{arg\,max}_{m = 1, \dots, K} \{\, Text\_Match(fd_i, Keywords(C_m)) \,\}, \quad i = 1, \dots, N,
\tag{3.2}
\]
where Keywords(Cm) denotes the multiple-keywords set of category Cm. Clustering(fdi)
= j means that description fd_i is clustered into category C_j. In order to avoid zero matching in (3.2), a flag function to examine whether this situation happens is defined as
\[
Flag(fd_i) = \max_{m = 1, \dots, K} \{\, Text\_Match(fd_i, Keywords(C_m)) \,\}, \quad i = 1, \dots, N.
\tag{3.3}
\]
The detail of the proposed clustering algorithm is given below.
Clustering Algorithm
Step 0: Initialization: Given FD = { fd_1, fd_2, …, fd_N }.
        Set K = 1, Clustering(fd_1) = 1, Keywords(C_1) = fd_1, i = 2.
Step 1: // Cluster the description fd_i according to Functions (3.1), (3.2), and (3.3).
        For m = 1 to K, use Function (3.1) to calculate
            TM_im = Text_Match(fd_i, Keywords(C_m));
        Let Flag(fd_i) = max_{m = 1, …, K} { TM_im };
        if (Flag(fd_i) = 0) then begin
            // fd_i cannot be clustered into any existing class,
            // so create a new class for fd_i
            K = K + 1;
            Keywords(C_K) = fd_i;
            Clustering(fd_i) = K;
        end
        else
            // fd_i is clustered into one of the existing classes
            Use Function (3.2) to calculate
                Clustering(fd_i) = argmax_{m = 1, …, K} { TM_im };
Step 2: If any description in FD is not clustered yet, set i = i + 1 and go to Step 1 for the next iteration. Otherwise, end of iterations.
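The algorithm above amounts to exact-match clustering on keyword sets. A minimal Python sketch (our own illustrative code, with Text_Match realized as set equality) is:

```python
def cluster_descriptions(filtered_descriptions):
    """Cluster filtered descriptions by exact keyword-set matching (3.1)-(3.3).

    Each description is given as a list of its remaining keywords. Returns
    the category keyword sets and, for each description, the 1-based index
    of the category it is clustered into."""
    categories = []          # Keywords(C_1), ..., Keywords(C_K)
    assignment = []          # Clustering(fd_i) for i = 1, ..., N
    for fd in filtered_descriptions:
        keyword_set = frozenset(fd)
        # Text_Match(fd_i, Keywords(C_m)) is 1 iff the two sets are equal,
        # so Flag(fd_i) = 0 exactly when no existing category matches.
        try:
            m = categories.index(keyword_set)      # an existing class matches
        except ValueError:
            categories.append(keyword_set)         # create a new class
            m = len(categories) - 1
        assignment.append(m + 1)
    return categories, assignment
```

Note that using frozensets makes word order irrelevant, so "misses shot" and "shot misses" fall into the same category.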
Once the clustering algorithm is completed, the filtered descriptions are clustered into event categories, and the keyword set of each category serves as the multiple keywords of the event. In the meantime, semantic event detection is accomplished. Then two data structures are built to recommend further query keywords to users and to support the hierarchical search.
3.2.3 Hierarchical Search System
Fig. 3.6 gives an example to show the concept of the proposed hierarchical
search system. First, a user can query by one word to get rough results. Then he can
continually query by more words to get into deeper levels for finer results. Here we
implement the system by establishing a pair of index tables and manipulating them
back and forth.
Fig. 3.6 An example to illustrate the concept of the proposed hierarchical search system.
Here we build a forward index table and an inverted index table. The former
records mappings from descriptions to event keywords, and the latter stores mappings
from keywords to descriptions. Note that the forward index table is established
automatically after applying the unrelated words filtering procedure. Based on the
forward index table, the inverted index table can be established by sequentially
scanning the event keyword set of each description. An example is given in Fig. 3.7 for a clearer explanation. Suppose we have five descriptions as shown in Fig. 3.7(a). After
applying unrelated words filtering procedure to each description, we can obtain Fig.
3.7(b). By scanning each row in Fig. 3.7(b), for each row, we can obtain a description
index (DI) and the corresponding event keyword set (EKS). Then DI is linked to each
keyword in EKS. After scanning all rows sequentially in Fig. 3.7(b), Fig. 3.7(c) is
established. Both inverted index table and forward index table are referred to achieve
the hierarchical search system. The inverted index table is used for returning query
results by intersecting those description sets mapped by query keywords. The forward
index is originally just an intermediate, but it is reused in our method for providing suggested query keywords.

Index of Description | Description
D1 | Peja Stojakovic misses 10-foot two point shot
D2 | David West misses jumper
D3 | Peja Stojakovic makes 19-foot two point shot
D4 | Trevor Ariza makes 19-foot jumper
D5 | David West makes 17-foot jumper (Chris Paul assists)
(a) Descriptions and their indices.
Forward Index
Index of Description | Event Keyword Set
D1 | misses, shot
D2 | misses, jumper
D3 | makes, shot
D4 | makes, jumper
D5 | assists, makes, jumper
(b) Mappings from description indices to event keywords.
Inverted Index
Keyword | Indices of Description Set
assists | D5
jumper | D2, D4, D5
makes | D3, D4, D5
misses | D1, D2
shot | D1, D3
(c) Mappings from keywords to description indices.
Fig. 3.7 An example to illustrate the data structure for hierarchical search.
In our system, a query is considered as a set of multiple words. The hierarchical feature means that a user can get more general results by querying fewer words or more specific results by querying more words; for example, the results of querying "jumper" are those descriptions having the keyword "jumper", and the results of querying "jumper makes" are those descriptions having both keywords "jumper" and "makes". The query result is the intersection of the description sets obtained through the query keywords in the inverted index list.
of query in the inverted index list. For providing suggested query keywords, the
resulting intersection set is then used as another query for the forward index list. The
keyword set of each description in the resulting intersection set is extracted. Finally,
the union of all extracted keyword sets is considered as the suggested query keywords.
The detailed algorithm of the proposed search system is given below.
Hierarchical Search Algorithm
Step1: A user types several query words.
Step2: Look up the inverted index and get description sets mapped by the query
words. Intersect these description sets to obtain a query result.
Step3: Look up the forward index and get keyword sets mapped by the query
result.
Step4: Output the union set of these keyword sets. The user then selects some keywords from the output as new query words; perform Step2 again and output the query result.
Here, we use Fig. 3.7 as an example. Assume that a user types a query {jumper}; the system looks up the inverted index list and gets a temporary result set {D2, D4, D5}. Then, the system looks up the forward index list and recommends the user {assists, jumper, makes, misses}, i.e. the union of the keyword sets of D2, D4, and D5. When the user extends the query to {jumper, makes}, the system returns {D4, D5}, i.e. the intersection of {D3, D4, D5} and {D2, D4, D5}. Therefore, a powerful hierarchical search system with a query recommendation function is built.
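The two index tables and the search steps can be sketched in a few lines of Python. This is a simplified illustration of the data structures, not the full system:

```python
from collections import defaultdict

def build_inverted_index(forward_index):
    """Derive the inverted index by scanning each row of the forward index."""
    inverted = defaultdict(set)
    for desc_id, keyword_set in forward_index.items():
        for keyword in keyword_set:
            inverted[keyword].add(desc_id)
    return inverted

def hierarchical_search(query_words, forward_index, inverted_index):
    """Return (query result, suggested keywords) as in the search algorithm."""
    # Step 2: intersect the description sets mapped by the query words.
    sets = [inverted_index.get(w, set()) for w in query_words]
    result = set.intersection(*sets) if sets else set()
    # Steps 3-4: union of the keyword sets of the resulting descriptions.
    suggestions = set().union(*(forward_index[d] for d in result)) if result else set()
    return result, suggestions
```

With the forward index of Fig. 3.7(b), querying {jumper} returns {D2, D4, D5} and suggests {assists, jumper, makes, misses}; refining the query to {jumper, makes} returns {D4, D5}.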
3.3 Experimental Results
In most search systems, statistical analysis such as receiver operating
characteristic (ROC) analysis or recall-precision is used to evaluate the performance.
Through the analysis, the system degradation caused by misclassification can be
estimated. However, as mentioned in Section 3.2.2, we cluster descriptions by an
exactly matching function, so there is no misclassified event in our system. This
means that both precision and recall rates of the proposed method are 100%.
Few studies have aimed at detecting text events from webcast text. Xu and Chua [5] modeled webcast text as external knowledge in detecting events from
football and soccer. The evaluation of the fusion video event detection was presented,
but that of webcast text analysis alone was not. Xu et al. [8] proposed a framework to
analyze webcast text and videos independently and align them through game time.
According to the framework, the performance of video event detection mainly
depends on webcast text analysis. Here we compare our method with Xu et al.'s work.
For basketball, we collect 25 NBA 2009-2010 games and 41 NBA 2008-2009 postseason games. The former are used as the training database, and the latter are used as the testing database to examine the reliability of the proposed method. We
also collect 68 UEFA Champions League 2010-2011 soccer games, where 20 of them
are used as training database and the other 48 are used as testing database. The
webcast text of the 134 games is acquired from the ESPN website. As can be seen in Table 3.1, hundreds of descriptions in a game are clustered into, on average, 44 semantic event categories for basketball and 20 semantic event categories for soccer.
Table 3.1 Average number of sports event categories in 25 basketball training data and 20 soccer training data.
 | Mean | Variance | Standard deviation
Basketball | 44.08 | 9.08 | 3.01
Soccer | 19.85 | 5.40 | 2.32
In Xu et al.'s previous work using pLSA, the optimal number of event categories is nine for basketball and eight for soccer. The top three keywords of each category are selected by a conditional probability, and the top ranked keyword is used as the single keyword during event detection. We map the top three results of pLSA to our multiple-keyword categories in Table 3.2 and Table 3.3. In Table 3.3, because "attempt" is chosen as a member of the black list in the interactive system, we use "shot" as the single keyword for mappings from soccer events in pLSA to those in the proposed method. In addition, the words "missed" and "misses" are interchangeable and have the same meaning in descriptions. We consider these two words as the same and use "missed(misses)" as their common representative.
performance in detecting semantic events, Xu et al. not only use keywords detection
in description sentences, but also analyze context information in them. For example,
in basketball, the top ranked keyword “jumper” is detected as “Jumper” event only if
its previous word is “makes,” and other sentences containing word “jumper,” e.g.,
Kenyon Martin misses 22-foot jumper, are discarded. However, these discarded
events are actually semantic events and can be valuable for further research, e.g.,
sports posture analysis, injury prevention, special highlight, etc. It can be seen from
Table 3.2 and Table 3.3 that every category of pLSA is mapped to several different
semantic events of the proposed method. These several events are related but
somehow different. For example, in basketball, “jumper misses” describes that a
jumper is missed while “jumper makes” describes that a jumper is made successfully.
In soccer, “blocked shot” describes that a shot attempt is blocked by an opponent
while “missed(misses) shot” describes that a shot attempt is missed by the kicker
himself. Hence, misclassifying or discarding these events decreases the precision and
recall rates. However, in our method, the precision and recall rates are both 100%.
With the support of the hierarchical search system, we can query multiple keywords for finer results. Table 3.2 and Table 3.3 also show those semantic event categories which are unavailable in
Xu et al.’s method, but can be detected in our method, e.g., steal, timeout, turnover for
basketball and injury, blocked, penalty for soccer. These semantic events are
important for special highlights or injury prevention, and should not be ignored or
misclassified. So, the proposed method is superior to pLSA.
Table 3.2 Mappings of basketball event categories from pLSA to the proposed method.
Category | Ranked Keywords (Xu et al.'s Method, pLSA) | Proposed Method (Categories with Multiple Keywords)
Shot | shot, pass, bad | makes shot, misses shot
Jumper | jumper, foot, misses | jumper misses, jumper makes, assists jumper makes
Layup | layup, driving, blocks | layup makes, layup misses, driving layup makes, assists layup makes
Dunk | dunk, makes | dunk makes, assists dunk makes, dunk makes slam, driving dunk makes, dunk misses
Block | blocks, shot, assists | blocks layup, blocks jumper, blocks driving layup, blocks hook shot, blocks shot, blocks dunk
Rebound | rebound, defensive, offensive | defensive rebound, offensive rebound
Foul | foul, draw, personal | draws foul shooting, draws foul personal, draws foul offensive, ball draws foul loose, foul technical, defense foul illegal person, draws flagrant foul type
Free throw | throw, free, makes | free makes throw, free misses throw
Substitution | enters, timeout | enters game
N/A | — | bad pass, bad pass steals, bad lost steals, full timeout, official timeout, turnover, traveling, ejected, double dribble, defense illegal, clock
Table 3.3 Mappings of soccer event categories from pLSA to the proposed method.
Category | Ranked Keywords (Xu et al.'s Method, pLSA) | Proposed Method (Categories with Multiple Keywords)
Corner | corner, conceded, bottom | corner, assisted corner saved shot, corner goal penalty shot, corner saved shot, assisted corner goal, assisted corner goal shot, assisted corner missed(misses), corner goal shot, corner missed(misses) shot, assisted corner missed(misses) shot, corner free kick missed(misses) shot, assisted corner saved, corner free goal kick shot
Shot | attempt, right, footed | blocked shot, assisted missed(misses) shot, assisted blocked shot, assisted goal saved shot, missed(misses) shot, assisted corner saved shot, assisted shot, corner goal penalty shot, corner saved shot, assisted corner goal shot, corner goal shot, corner missed(misses) shot, goal saved shot, free kick shot, assisted goal shot, free kick missed(misses) shot, assisted corner missed(misses) shot, corner free kick missed(misses) shot, goal penalty saved shot, corner free goal kick shot, goal penalty shot
Foul | foul, for | foul, card foul yellow, foul penalty, card foul dangerous
Card | yellow, shown | card foul yellow, card yellow
Free kick | kick, free, wins | free kick, free kick shot, free kick missed(misses) shot, corner free kick missed(misses) shot, corner free goal kick shot
Offside | offside, ball, tries | offside
Substitution | substitution, replaces, lineups | replaces substitution, injury replaces substitution
Goal | goal, box | assisted goal saved shot, corner goal penalty shot, assisted corner goal, assisted corner goal shot, corner goal shot, goal saved shot, assisted goal shot, assisted goal saved, goal penalty saved shot, goal saved, goal, corner free goal kick shot, goal penalty shot
N/A | — | injury, assisted missed(misses), assisted blocked, penalty, assisted
Here we want to examine the reliability of the proposed method. For basketball,
25 NBA 2009-2010 games are taken as training data. After processing all the training
data and gathering the extracted semantic events, we collect the union of these
semantic events as a sample set with cardinality 82. Then we process the testing data,
which are collected from 41 NBA 2008-2009 postseason games, and examine whether the extracted semantic events belong to the sample set.
For soccer, we use 20 UEFA Champions League soccer games as training data and 48
UEFA Champions League soccer games as testing data. According to our examination,
with sparse exceptions, almost all the semantic events extracted from testing data can
be found in the sample set. Table 3.4 and Table 3.5 show all exception events which
are quite rare. These exceptions may be caused by different writing styles or rarely occurring events, and can still be collected in an interactive way if necessary.
Therefore, the proposed method is very stable.
Table 3.4 Occurrences of exception basketball events from 41 testing games (18679 basketball descriptions).
Exception event | Number (Percentage)
10 second | 3 (0.02%)
backcourt | 7 (0.04%)
called full timeout | 1 (0.01%)
driving dunk misses | 2 (0.01%)
dunk misses slam | 2 (0.01%)
away ball draws foul | 5 (0.03%)
misses pointer | 7 (0.04%)
flagrant free misses throw | 1 (0.01%)
blocks driving dunk | 1 (0.01%)
Table 3.5 Occurrences of exception soccer events from 48 testing games (5727 soccer descriptions).
Exception event | Number (Percentage)
card | 6 (0.10%)
corner penalty saved shot | 2 (0.03%)
missed(misses) | 3 (0.05%)
goal shot | 1 (0.02%)
assisted corner missed shot | 1 (0.02%)
missed shot | 1 (0.02%)
shot | 4 (0.07%)
corner missed(misses) | 3 (0.05%)
corner saved | 2 (0.03%)
assisted corner | 1 (0.02%)
blocked | 1 (0.02%)
3.4 Summary
In this chapter, we have proposed an unsupervised approach for semantic event
extraction from sports webcast text and made some contributions: 1) detecting
semantic events from webcast text in an unsupervised manner; 2) requiring no
additional context information analysis; 3) preserving more significant events in
sports games; 4) extracting multiple keywords from event categories to support
hierarchical searching; 5) providing auto-complete feature for finer retrieval.
According to experimental results, the proposed method extracts significant semantic
events from basketball and soccer games and preserves those events that are ignored
or misclassified by previous work. The extracted significant text events can be used
for further video indexing and summarization. Furthermore, since the proposed method uses no sport-specific assumptions, it is expected to be extensible to other kinds of sports.
CHAPTER 4
ANNOTATING WEBCAST TEXT IN BASKETBALL VIDEOS BY GAME CLOCK RECOGNITION AND TEXT/VIDEO ALIGNMENT
In this chapter, we will propose a text/video alignment and event annotation
method. As mentioned in Chapter 2, semantic events appear in scoreboard frames only.
Thus, the proposed semantic event extraction method focuses on analyzing
scoreboard frames. For each scoreboard frame, the location of each clock digit is first determined. A digit template collection scheme is provided to collect digit character
templates. With clock digit locations and digit templates, a two-step strategy is
proposed to recognize game clocks on the semi-transparent scoreboard in scoreboard
frames. With the game clock recognized from sports video, the alignment work is
done by finding every match for game clock extracted from webcast text and
annotating the corresponding event description on video frames.
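Assuming the game clock has been recognized on each scoreboard frame, the alignment step reduces to a lookup from webcast time tags to frame indices. A simplified sketch (our own illustrative code, ignoring quarters and clock pauses) is:

```python
def align_events(frame_clocks, text_events):
    """Annotate webcast text events on video frames by matching game clocks.

    frame_clocks: list of (frame_index, "MM:SS") pairs recognized from video.
    text_events:  list of ("MM:SS", description) pairs from webcast text.
    Returns a {frame_index: [descriptions]} annotation map."""
    # Build a map from each recognized clock value to the frames showing it;
    # the same clock value usually spans several consecutive frames.
    clock_to_frames = {}
    for frame_index, clock in frame_clocks:
        clock_to_frames.setdefault(clock, []).append(frame_index)

    annotations = {}
    for clock, description in text_events:
        for frame_index in clock_to_frames.get(clock, []):
            annotations.setdefault(frame_index, []).append(description)
    return annotations
```

In a real basketball game the same clock value recurs in every quarter, so the actual alignment must also disambiguate by quarter; this sketch only shows the core matching idea.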
4.1 Introduction
In the world, a substantial number of sports videos are produced and broadcast through television programs or Internet streaming. It is nearly impossible to watch all sports videos. Most of the time, fans prefer to watch highlights of sports videos or retrieve only the partial video segments that they are interested in. Therefore, among sports video topics, automatic semantic event detection and video annotation are essential works.
Most existing research [1]-[3] uses video content as resource knowledge. However, schemes relying on video content encounter a challenge called the semantic gap. Recently, some research [4]-[9] has used a multimodal fusion of video content and
external resource knowledge to bridge the semantic gap. The multimodal fusion
scheme, which analyzes webcast text and video content separately and then does
text/video alignment to complete sports video annotation or summarization, has been
used in American football [4], soccer [6]-[8], and basketball [7]-[8].
In the scheme, text/video alignment, which consists of event moment detection
and event boundary detection, has a great impact on performance. It can be achieved
through scoreboard recognition. As can be seen in Fig. 4.1, a scoreboard is usually
overlaid on sports videos to present the audience some game related information (e.g.,
score, game status, game clock) that can be recognized and aligned with text results.
For sports with game clock (e.g., basketball and soccer), event moment detection can
be performed through video game clock recognition. Xu et al. [6]-[8] used Temporal
Neighboring Pattern Similarity (TNPS) measure to locate game clock and recognize
each digit of the clock. A detection-verification-redetection mechanism is proposed to
solve the problem of the temporarily disappearing clock region in basketball videos. However, since the clock disappears only in frames without a scoreboard, which can be identified beforehand, the verification and redetection steps are unnecessary, and their cost could have been avoided.
Moreover, the clock digit characters cannot be located on a semi-transparent
scoreboard.
(a) Transparent scoreboard.
(b) Non-transparent scoreboard.
Fig. 4.1 Two examples of overlaid scoreboard with game clock in basketball video.
According to our observation, two main problems of detecting game clock in
basketball videos are the temporal disappearance and the temporal pause of game
clock. The temporal disappearance of game clock may be caused by slow motion