
Many research efforts have been devoted to sports video analysis. However, several challenges remain to be solved; they are presented in the following.

1.2.1 Semantic Event Extraction Challenges

Some semantic event extraction studies [1]-[3] use video content as the knowledge resource. Chen and Deng [1] analyzed video features (e.g., color, motion, shot) to extract and index events in basketball videos. Hassan et al. [2] extracted audio-visual (AV) features and applied a Conditional Random Fields (CRF)-based probabilistic graphical model for sports event detection. Kim and Lee [3] built an indexing and retrieval system for golf videos by analyzing their AV content. However, schemes relying on video content alone encounter a challenge called the semantic gap, i.e., the distance between low-level video features and semantic events. Recently, several studies [4]-[9] have used a multimodal fusion of video content and external knowledge resources to bridge the semantic gap. Webcast text, one of the most powerful external knowledge resources, is an online commentary posted in a well-defined structure by professional announcers. It focuses on sports games and contains detailed information (e.g., event description, game clock, players involved). The multimodal fusion scheme, which analyzes webcast text and video content separately and then performs text/video alignment to complete sports video annotation or summarization, has been used for American football [4], soccer [6]-[8], and basketball [7]-[8].

For webcast text analysis, Xu et al. [8] apply probabilistic latent semantic analysis (pLSA), a model combining linear algebra and probability, to cluster and detect text events in the webcast text. Based on their observation that descriptions of the same event in webcast text share a similar sentence structure and word usage, they use pLSA to first cluster the descriptions into several categories and then extract keywords from each category for event detection. Although they extend pLSA to both basketball and soccer, the approach has two problems. 1) The optimal number of event categories is determined by minimizing the ratio of within-class similarity to between-class similarity, but an actual basketball or soccer game has more event categories than this optimum. For example, in a basketball game, many events, such as timeout, assist, turnover, and ejected, are mis-clustered into wrong categories or discarded as noise. This side effect degrades and limits the results of sports video retrieval. 2) After keyword extraction, events are detected by keyword matching. Xu et al. use the top-ranked word in the pLSA model as the single keyword of each event category, but in some event categories a single-keyword match leads to poor results. For example, in their method for basketball, the "jumper" event represents the jumpers that players make. Without detecting "makes" as the word preceding "jumper" in description sentences, the precision of "jumper" event detection drops from 89.3% to 51.7% on their testing dataset. The "jumper" event actually consists of a "makes jumper" event and a "misses jumper" event: the former can be used in highlights, and the latter in sports behavior analysis and injury prevention. Accordingly, single-keyword matching is insufficient, and some important events will be discarded.
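The distinction above can be illustrated with a small sketch (not the implementation of [8]): matching a short keyword sequence instead of a single keyword separates "makes jumper" from "misses jumper". The commentary lines and event templates below are hypothetical examples in the usual webcast-text style.

```python
# Sketch: event detection on webcast-text lines via keyword-sequence matching.
# A single keyword "jumper" cannot tell made from missed jumpers; a two-word
# template can. Commentary strings here are invented examples.

def detect_event(description, keyword_sequences):
    """Return the first event whose keywords appear contiguously, in order."""
    words = description.lower().split()
    for event, keywords in keyword_sequences.items():
        n = len(keywords)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == keywords:
                return event
    return None

EVENTS = {
    "makes_jumper":  ["makes", "jumper"],
    "misses_jumper": ["misses", "jumper"],
    "timeout":       ["timeout"],
}

print(detect_event("Smith makes jumper from 18 ft", EVENTS))  # makes_jumper
print(detect_event("Jones misses jumper", EVENTS))            # misses_jumper
```

A single-keyword scheme would map both example lines to one "jumper" event; the two-word templates keep the highlight-worthy made jumpers apart from the missed ones.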

In the multimodal fusion scheme, text/video alignment has a great impact on performance, and it can be achieved through scoreboard recognition. A scoreboard is usually overlaid on sports videos to present the audience game-related information (e.g., score, game status, game clock) that can be recognized and aligned with the text analysis results. For sports with a game clock (e.g., basketball and soccer), event moment detection can be performed through video game clock recognition. Xu et al. [6]-[8] used a Temporal Neighboring Pattern Similarity (TNPS) measure to locate the game clock and recognize each of its digits. A detection-verification-redetection mechanism was proposed to handle clock regions that temporarily disappear in basketball videos. However, attempting to recognize the game clock in frames that contain no game clock is clearly unnecessary, and the cost of verification and redetection could be avoided. Moreover, the clock digit characters cannot be located on a semi-transparent scoreboard.
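The temporal intuition behind clock localization can be sketched in a few lines: within the static scoreboard overlay, only the seconds digit changes at a regular rate, so accumulating frame-to-frame pixel differences highlights its region. This is a minimal illustration of that idea, not the TNPS measure of [6]-[8]; the "frames" are tiny synthetic grayscale arrays rather than decoded video.

```python
# Sketch: locate the region of a static overlay that changes over time by
# accumulating absolute frame differences. Assumes synthetic toy frames.
import numpy as np

def locate_changing_region(frames):
    """Return ((row_min, row_max), (col_min, col_max)) of changing pixels."""
    acc = np.zeros_like(frames[0], dtype=float)
    for prev, cur in zip(frames, frames[1:]):
        acc += np.abs(cur.astype(float) - prev.astype(float))
    ys, xs = np.nonzero(acc > 0)
    return (ys.min(), ys.max()), (xs.min(), xs.max())

# Synthetic frames: an 8x16 static overlay, except a 4x3 "seconds digit"
# patch at rows 2..5, cols 10..12 that changes every frame.
rng = np.random.default_rng(0)
base = np.full((8, 16), 128, dtype=np.uint8)
frames = []
for t in range(5):
    f = base.copy()
    f[2:6, 10:13] = rng.integers(0, 256, size=(4, 3))
    frames.append(f)

print(locate_changing_region(frames))
```

A real system would additionally verify that the located region changes at a one-second period before treating it as the game clock, which is exactly the verification cost the text argues should be avoided for frames with no clock at all.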

1.2.2 Slow Motion Replay Detection Challenges

As to slow motion replay detection, many methods have been proposed, and they can be classified into two categories. The first category [10]-[15] locates the positions of specific production actions, called special digital video effects (SDVEs) or logo transitions, and uses these positions to detect replay segments. However, methods in this category all rest on an imperfect assumption that a replay is sandwiched between two visually similar SDVEs or logo transitions; this assumption does not always hold in basketball videos. In fact, a basketball video segment bounded by paired SDVEs is not always a replay. Moreover, the beginning and end of a basketball replay can occur in several combinations: 1) paired, visually similar SDVEs; 2) non-paired SDVEs; 3) an SDVE at one end and an abrupt transition at the other. Hence, previous work in this category cannot be applied to basketball videos whose replays exhibit combinations (2) and (3).
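The paired-boundary assumption can be made concrete with a simplified sketch (an assumed reduction of the first-category approach, not any cited method): frames that strongly resemble a known logo template are detected, and successive detections are paired off as replay boundaries. The similarity scores below are hypothetical.

```python
# Sketch: pair successive logo/SDVE detections as replay segments.
# Hard-codes the "replay is sandwiched by paired logos" assumption.

def paired_logo_replays(logo_similarity, threshold=0.8):
    """Return (start, end) frame pairs for segments between paired logos."""
    hits = [i for i, s in enumerate(logo_similarity) if s >= threshold]
    # Collapse runs of adjacent hit frames into single detections.
    detections = [h for j, h in enumerate(hits) if j == 0 or h - hits[j - 1] > 1]
    # Pair the (2k)-th and (2k+1)-th detections as one replay segment.
    return [(detections[k], detections[k + 1])
            for k in range(0, len(detections) - 1, 2)]

# Hypothetical similarity curve with logo transitions near frames 10 and 40.
sim = [0.1] * 60
sim[10] = sim[11] = 0.9   # first logo transition
sim[40] = sim[41] = 0.95  # second logo transition
print(paired_logo_replays(sim))  # [(10, 40)]
```

The sketch makes the failure mode visible: a single unpaired SDVE (combination 2) or an SDVE followed by an abrupt transition (combination 3) shifts the pairing and corrupts every subsequent segment boundary.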

The second category [16]-[18] analyzes features of replays to distinguish replay segments from non-replay segments. Farn et al. [16] extracted slow motion replays by referring to the dominant color of the soccer field; however, this is not applicable to basketball videos, since a basketball court is relatively smaller and its textures are more complicated. Wang et al. [17] computed motion-related features and applied a support vector machine (SVM) to classify slow motion replays and normal shots. The precision rates on two tested basketball videos are 55.6% and 53.3%, with recall rates of 62.5% and 66.7%, respectively. Han et al. [18] proposed a general framework based on a Bayesian network to make full use of multiple cues, including shot structure, gradual transition pattern, slow motion, and sports scene. The method suffers from the inaccuracy of the automatic gradual transition detector it uses. Their experiments show a precision rate of 82.9% and a recall rate of 83.2%.
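The second-category idea, classifying shots from motion features, can be sketched as follows. This is a hedged simplification: a nearest-centroid rule stands in for the SVM of [17], and the two-dimensional feature values are synthetic rather than measured from video.

```python
# Sketch: replay vs. normal shot classification from motion features,
# using nearest-centroid as a stand-in for an SVM. Features are synthetic.
import numpy as np

def train_centroids(features, labels):
    """Return per-class mean feature vectors."""
    feats = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(centroids, x):
    """Assign x to the class with the nearest centroid."""
    x = np.asarray(x, dtype=float)
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Synthetic 2-D features: (mean motion magnitude, frame-difference energy).
# Slow motion replays tend to show low motion magnitude and smooth change.
X = [[0.2, 0.1], [0.3, 0.2], [0.9, 0.8], [1.0, 0.9]]
y = ["replay", "replay", "normal", "normal"]
cent = train_centroids(X, y)
print(classify(cent, [0.25, 0.15]))  # replay
```

The low precision rates reported for [17] suggest that, on real basketball footage, replay and normal shots overlap heavily in such motion-feature spaces, which is why a simple separating rule of this kind struggles.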

The methods in both existing categories are generic but not satisfactory for basketball videos. Moreover, most previous studies analyze every video frame to detect replays, yet examining frames that are certainly non-replay degrades both runtime performance and the detection rate.
