
2.1 Video Frame Partitioning

As can be seen from Fig. 2.1, all frames in basketball videos can be broadly classified into two categories: scoreboard frames and non-scoreboard frames. Scoreboard frames present the basketball game with a scoreboard overlaid on them, while non-scoreboard frames present the rest, e.g., sideline interviews, slow-motion replays, etc. Since semantic events appear only in scoreboard frames, whereas replays appear only in non-scoreboard frames, it is beneficial to filter out unnecessary frames in each semantic resource extraction step. Therefore, an automatic scoreboard template extractor is first proposed to extract the scoreboard template and the scoreboard position. Then, the video frame partitioning can be done by simple template matching.

As can be seen from Fig. 2.1(a), a scoreboard is a large, still, rectangular area consisting of pixels that change very infrequently. Based on this observation, an automatic scoreboard template extractor is proposed. First, a context-based static region detector extracts a few static regions called scoreboard candidates. Then, a scoreboard selection method is used to pick the correct scoreboard. The block diagram of the scoreboard template extraction is shown in Fig. 2.2.

Fig. 2.1 Examples of scoreboard frames and non-scoreboard frames: (a) scoreboard frame; (b) non-scoreboard frame (sideline interview); (c) non-scoreboard frame (TV commercial); (d) non-scoreboard frame (slow motion replay).

Fig. 2.2 Block diagram of scoreboard template extraction: video input → context-based static region detection → scoreboard selection → extracted scoreboard template and position.

2.1.1 Context-Based Static Region Detection

For context-based static region detection, a sports video is treated as an input frame sequence. Let fi be the i-th input frame and K be the total number of frames.

For each frame fi, the pixel-based frame difference Dfi between fi and its previous frame fi-1 is first calculated as

$$ Df_i(x, y) = \left| f_i(x, y) - f_{i-1}(x, y) \right|, \quad i = 2, \ldots, K. $$

Then, the accumulated difference frame ADfi is created by

$$ ADf_i(x, y) = \sum_{k=2}^{i} Df_k(x, y). $$

Fig. 2.3 shows an example. As time goes by, the accumulated difference at each pixel can be considered as the degree of change at that position.
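As an illustration only (not the thesis's code), a minimal sketch of this differencing and accumulation step is given below, assuming the frames are available as equally sized grayscale NumPy arrays; the function name accumulate_differences is hypothetical.

```python
# Minimal sketch, assuming grayscale frames of identical size as NumPy arrays.
import numpy as np

def accumulate_differences(frames):
    """Yield the accumulated difference frame ADf_i for i = 2, ..., K."""
    it = iter(frames)
    prev = next(it).astype(np.int16)
    acc = np.zeros(prev.shape, dtype=np.int64)   # running sum of the |Df_k|
    for cur in it:
        cur = cur.astype(np.int16)
        acc += np.abs(cur - prev)                # Df_i(x, y) = |f_i(x, y) - f_{i-1}(x, y)|
        prev = cur
        yield acc.copy()                         # ADf_i(x, y)
```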

After binarizing the accumulation result, each white point marks a position that changes frequently, and each black point marks the opposite. Then, region growing is applied to the black points of each binarized accumulated difference frame to find the largest connected component that satisfies two constraints as a potential scoreboard candidate. The first constraint concerns size: since a scoreboard should be large enough to present score information, the width of the bounding box of the connected component should be at least 1/12 of the frame width and its height at least 1/18 of the frame height. The second constraint concerns shape: the connected component should be nearly rectangular, that is, the ratio of the connected-component area to its bounding-box area should be at least 0.9.
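A minimal sketch of this candidate search follows, using OpenCV's connected-component analysis in place of explicit region growing; the binarization threshold value is an illustrative assumption, while the size and shape constraints are those stated above.

```python
# Minimal sketch; diff_threshold is an assumed value, the size and shape
# constraints follow the text (>= 1/12 width, >= 1/18 height, >= 0.9 fill).
import cv2
import numpy as np

def find_candidate(acc_diff, diff_threshold=30):
    h, w = acc_diff.shape
    # "Black" points: pixels whose accumulated change stays below the threshold.
    static = (acc_diff < diff_threshold).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(static, connectivity=4)
    best = None
    for lbl in range(1, n):                      # label 0 is the background
        x, y, bw, bh, area = stats[lbl]
        if bw < w / 12 or bh < h / 18:           # size constraint
            continue
        if area / float(bw * bh) < 0.9:          # near-rectangular shape constraint
            continue
        if best is None or area > best[4]:
            best = (x, y, bw, bh, area)
    return best                                  # (x, y, w, h, area) or None
```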

Fig. 2.3 Example of pixel-based frame difference accumulation: (a) video frame sequence (frame 1, frame 2, frame 3, …, frame i); (b) pixel-based frame differences (Df2, Df3, …, Dfi); (c) accumulation of neighboring frame pair differences (ADf2, ADf3, …, ADfi); (d) binarized results.

For each binarized accumulated difference frame, if a potential scoreboard candidate is found, its position is recorded. If the position remains unchanged for a sufficient number of consecutive frames, e.g., 300 frames, the potential scoreboard candidate is considered stable enough and is promoted to a scoreboard candidate. The context-based static region detector is applied repeatedly to the video frame sequence until a few candidates are detected.
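The sketch below illustrates this stability check under the assumptions of the previous sketches; the 300-frame window comes from the text, while reusing find_candidate() per frame and the stopping count max_candidates are simplifying assumptions.

```python
# Minimal sketch; stable_frames = 300 follows the text, max_candidates is an
# assumed stopping point for the repeated detection.

def collect_stable_candidates(acc_diff_frames, stable_frames=300, max_candidates=4):
    candidates = []
    last_pos, run = None, 0
    for acc in acc_diff_frames:
        cand = find_candidate(acc)
        pos = cand[:4] if cand is not None else None
        if pos is not None and pos == last_pos:
            run += 1                             # same position as in the previous frame
        else:
            run = 1 if pos is not None else 0
        last_pos = pos
        if run >= stable_frames and pos not in candidates:
            candidates.append(pos)               # stable position -> scoreboard candidate
        if len(candidates) >= max_candidates:
            break
    return candidates
```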

2.1.2 Scoreboard Selection

Some sports videos have rectangular logos overlaid by the TV stations. The TV station logo stays at the same position throughout the game, while the scoreboard may disappear from time to time (see Fig. 2.1). Thus, the logo may also be detected as a scoreboard candidate. Fortunately, a TV station logo is never larger than a scoreboard, so the scoreboard selection prunes the smaller candidates. Note that a scoreboard candidate consists of two parts, a position and a template. At this point, the scoreboard position has been located. As for the template, since the scoreboard may disappear from time to time, extracting a template at a scoreboard candidate position from a single frame cannot guarantee a correct one. To solve this problem, for each scoreboard candidate sc extracted from fi, the temporal change of the candidate, TC(sc), is evaluated by

$$ TC(sc) = \frac{1}{M_c N_c} \sum_{s} \sum_{x=1}^{M_c} \sum_{y=1}^{N_c} \left| f_{i+s}(x, y) - f_i(x, y) \right|, $$

where Mc and Nc represent the width and height of sc, fi(x, y) represents the color value of pixel (x, y) in frame fi, and s represents the temporal frame offset. The scoreboard selection then takes the candidate with the least temporal change as the scoreboard template.
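As a rough illustration of this criterion under the reconstruction above, the sketch below accumulates the mean absolute difference of the candidate region over a few assumed temporal offsets; the offsets and function names are hypothetical.

```python
# Minimal sketch; the offsets (in frames) and names are illustrative assumptions.
import numpy as np

def temporal_change(frames, candidate, i, offsets=(30, 60, 90)):
    """Temporal change TC(sc) of the candidate region (x, y, w, h) around frame i."""
    x, y, w, h = candidate
    ref = frames[i][y:y + h, x:x + w].astype(np.int32)
    tc = 0.0
    for s in offsets:
        cur = frames[i + s][y:y + h, x:x + w].astype(np.int32)
        tc += np.abs(cur - ref).mean()           # averaged |f_{i+s}(x, y) - f_i(x, y)|
    return tc

# The candidate with the least temporal change supplies the scoreboard template:
# best = min(candidates, key=lambda c: temporal_change(frames, c, i0))
```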

According to our experiments, four scoreboard candidates are enough to extract the correct scoreboard template. After scoreboard template extraction, the video frame partitioning can be done by matching every frame against the scoreboard template at the scoreboard position.
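A minimal sketch of this partitioning step is given below, using OpenCV's normalized cross-correlation as the matching measure; the measure and the similarity threshold are illustrative assumptions rather than choices taken from the thesis.

```python
# Minimal sketch; the matching measure and threshold are illustrative choices.
import cv2

def is_scoreboard_frame(frame, template, position, threshold=0.8):
    """Label a frame as a scoreboard frame if the region at the scoreboard
    position matches the extracted template closely enough."""
    x, y, w, h = position
    region = frame[y:y + h, x:x + w]
    score = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)[0, 0]
    return score >= threshold
```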

2.1.3 Experimental Results

Our experiments are conducted on 10 NBA basketball games from 3 different broadcasters, i.e., ESPN, TNT, and NBA TV. The data are recorded from TV in MPEG-2 format with a resolution of 480 × 352. All 10 scoreboard templates are extracted successfully. As can be seen from Fig. 2.4, the proposed scoreboard template extractor works well for all 3 broadcasters. Given the effective results for scoreboards of different styles, it is believed that the proposed scoreboard template extractor can be generalized to other sports. Note that a scoreboard contains rich information in a sports video, so the proposed scoreboard template extractor is also applicable as a useful pre-processing tool for other sports video analysis tasks.

Fig. 2.4 Scoreboard template extraction for 3 different broadcasters, with extracted positions marked by white rectangles: (a) game broadcast by ESPN; (b) game broadcast by TNT; (c) game broadcast by NBA TV.
