Chapter 4. Motion Activity Based Shot Identification and Closed Caption

4.5 Experimental Results and Analysis

Two video sequences were recorded from the VL Sports and ESPN TV channels, respectively, and encoded in MPEG-2 format with the GOP structure IBBPBBPBBPBBPBB and a frame rate of 30 fps. Because the test videos, Video I and Video II (shown in Fig. 4-9), were recorded from different games, they differ in background color, background texture, object color and lighting. Video I is about one hour long and yields 163 shots of services, full-court views and close-ups; Video II is about one and a half hours long and consists of 199 shots. To measure the performance of the proposed scheme, precision and recall are evaluated for both the shot identification approach and the closed caption detection algorithm. Table 4-1 and Table 4-2 show the shot identification results for Video I and Video II, respectively. The precision for all three kinds of shots is higher than 92% in both videos. The recall for close-up shots is as high as 98% in both videos. The recall for full-court view shots is only 87% in Video I and 89% in Video II, because the camera sometimes zooms in to capture players spiking near the net. In such a case, the shot includes a large portion of the net and a large object is detected; the shot is thus regarded as a close-up shot.

Additionally, when a team is defending, several players may run to save the ball. In this situation, the number of objects on the left may differ from the number of objects on the right, and the shot is thus classified as a service shot. Although the recall for full-court view shots does not exceed 90%, the overall accuracy of shot identification remains high.
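The two misclassification cases described above suggest a rule-based decision of roughly the following shape. This is only a minimal sketch under stated assumptions: the `MovingObject` fields, the two thresholds and the order in which the rules are tested are hypothetical and are not taken from this chapter.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, chosen only for illustration (not from the text).
LARGE_OBJECT_RATIO = 0.25  # object area / frame area at which an object counts as "large"
MAX_COUNT_DIFFERENCE = 1   # largest left/right count gap still treated as "similar"


@dataclass
class MovingObject:
    center_x: float    # horizontal center of the object, normalized to [0, 1]
    area_ratio: float  # object area divided by frame area


def classify_shot(objects: List[MovingObject]) -> str:
    """Assign one of the three shot types from the detected moving objects."""
    # A large detected object (a player, or the net when the camera zooms in)
    # is taken as evidence of a close-up shot.
    if any(obj.area_ratio >= LARGE_OBJECT_RATIO for obj in objects):
        return "close-up"
    # Count the objects on each half of the frame.
    left = sum(1 for obj in objects if obj.center_x < 0.5)
    right = len(objects) - left
    # Dissimilar counts on the two sides are treated as a service shot.
    if abs(left - right) > MAX_COUNT_DIFFERENCE:
        return "service"
    return "full-court"


# A defending team crowding one side: four objects on the left, one on the right.
defending = [MovingObject(x, 0.02) for x in (0.1, 0.2, 0.3, 0.4, 0.8)]
print(classify_shot(defending))
```

Under these assumed rules, a full-court view in which the net occupies a large region triggers the first test, and a defending team crowded on one side triggers the second, which matches the two sources of reduced recall noted above.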

Fig. 4-9. Demonstration of testing videos: (a) Video I; (b) Video II

Table 4-3 presents the results of closed caption localization. In Video I, 107 potential captions are detected, of which 98 localized regions are real closed captions, including the scoreboard and the trademark. Video II contains 125 closed captions, including the scoreboard and the trademark; 128 potential captions are detected, of which 118 localized regions are real closed captions. The recall reaches 100% and the precision is about 92% in Video I; the recall is about 94% and the precision is about 92% in Video II. There are nine false detections in Video I and ten in Video II, because the background may include an advertising billboard whose gradient energy is similar to that of the scoreboard and the channel trademark. In such a case, the high-textured region is falsely detected as a closed caption. Fig. 4-10 presents an example: in Fig. 4-10(d), the billboard is not filtered out because its gradient energy is stronger than that of the superimposed scoreboard and trademark.
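The reported caption localization figures can be checked directly from the counts quoted above. The short sketch below uses only those counts; the function names are illustrative. Recall is computed for Video II only, since the number of ground-truth captions in Video I is not stated explicitly in the text.

```python
def precision(true_positives: int, detected: int) -> float:
    """Fraction of detected regions that are real closed captions."""
    return true_positives / detected


def recall(true_positives: int, ground_truth: int) -> float:
    """Fraction of real closed captions that were detected."""
    return true_positives / ground_truth


# Video I: 107 potential captions detected, 98 of them real.
precision_video1 = precision(98, 107)
false_detections_video1 = 107 - 98

# Video II: 125 real captions, 128 potential captions detected, 118 of them real.
precision_video2 = precision(118, 128)
recall_video2 = recall(118, 125)
false_detections_video2 = 128 - 118

print(f"Video I:  precision {precision_video1:.1%}, false detections {false_detections_video1}")
print(f"Video II: precision {precision_video2:.1%}, recall {recall_video2:.1%}, "
      f"false detections {false_detections_video2}")
```

The computed values (precision of about 91.6% and 92.2%, recall of 94.4% for Video II, and nine and ten false detections) agree with the rounded figures given in the text.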

Table 4-1. Result of shot identification (Video I: 163 shots)

Table 4-2. Result of shot identification (Video II: 199 shots)

Table 4-3. Result of closed caption localization

Fig. 4-11 shows the initial graphical user interface of the video browsing system. A table of video content is provided, in which the scoreboard at each game point appears in the “Closed Caption” field and the representative frames of the three types of shot appear in the “Service Shot”, “Full-Court Shot” and “Close-Up Shot” fields, respectively. This high-level semantic structuring gives users an overall view of the competition through the textual information in the scoreboard, and allows them to select the point to watch and browse the video sequences at different levels of detail.
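One way to picture the table of video content is as a list of game points, each holding a caption frame and its shots grouped by semantic class. The sketch below is an illustrative data structure only: the class names, the fields, and the choice of the first shot as the representative one are assumptions, not the structure used by the browsing system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

SHOT_TYPES = ("service", "full-court", "close-up")


@dataclass
class Shot:
    shot_type: str    # one of SHOT_TYPES
    start_frame: int
    end_frame: int


@dataclass
class GamePoint:
    caption_frame: int  # frame whose localized scoreboard represents this point
    shots: Dict[str, List[Shot]] = field(
        default_factory=lambda: {t: [] for t in SHOT_TYPES}
    )

    def add_shot(self, shot: Shot) -> None:
        if shot.shot_type not in SHOT_TYPES:
            raise ValueError(f"unknown shot type: {shot.shot_type}")
        self.shots[shot.shot_type].append(shot)

    def representative(self, shot_type: str) -> Optional[Shot]:
        """Return the first shot of the given type (assumed representative), if any."""
        candidates = self.shots[shot_type]
        return candidates[0] if candidates else None


@dataclass
class TableOfVideoContent:
    points: List[GamePoint] = field(default_factory=list)

    def all_shots(self, shot_type: str) -> List[Shot]:
        """Collect every shot of one type across all game points."""
        return [s for p in self.points for s in p.shots[shot_type]]


# Usage: two game points with a few shots each (frame numbers are made up).
toc = TableOfVideoContent()
first = GamePoint(caption_frame=120)
first.add_shot(Shot("service", 100, 180))
first.add_shot(Shot("full-court", 181, 400))
second = GamePoint(caption_frame=900)
second.add_shot(Shot("full-court", 880, 1100))
second.add_shot(Shot("close-up", 1101, 1200))
toc.points.extend([first, second])
print(len(toc.all_shots("full-court")))
```

Grouping shots under game points supports browsing by scoreboard, while `all_shots` supports browsing by semantic class across the whole video.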

Additionally, when users want to see smashes, defense or offense, they can select full-court view shots. Fig. 4-12 depicts all full-court view shots displayed when users click the “show all shots” option in the “F shot” field. Moreover, when users want to see their favorite players, they can watch close-up view shots. Fig. 4-13 shows all “one-point” close-up shots obtained by selecting the “show other shots” option in the close-up shot field.

Fig. 4-10. Closed caption localization: (a) original I-frame; (b) result after filtering by horizontal gradient energy; (c) result after morphological operation; (d) result after filtering by SOM-based algorithm; (e) result after dilation

Fig. 4-11. Video structure of caption frames as well as service, full-court view, and close-up shots

Fig. 4-12. The bottom of the interface shows full-court shots

Fig. 4-13. The bottom of the interface presents close-up shots

4.6 Summary

In this chapter, we propose a novel mechanism for automatically structuring volleyball videos in the MPEG compressed domain and constructing a table of video content from both the localized scoreboard and the semantic classes of shots.

GOP-based video segmentation is used to efficiently segment videos into shots. The spatial distribution of moving objects is characterized using the object-based motion activity descriptor. Experimental results indicate that the proposed descriptor effectively identifies the three shot types (service, full-court view and close-up) in volleyball videos. Experimental results on localizing superimposed closed captions also show that the target captions are successfully localized and, in most cases, differentiated from high-textured background regions. These target captions and the shots in semantic classes are organized in a compact form, so users can browse videos nonlinearly and efficiently through the table of video content, following either the scoreboards or the semantic classes of shots. Although only volleyball games are used in the experiments, the proposed mechanism provides several reusable modules, such as the motion activity descriptor and the closed caption detection method.

Once the spatial distribution model of moving objects is obtained by employing specific domain knowledge, shots of interest, such as full or partial views of the athletic field with a particular player distribution, can be automatically identified using the proposed object-based motion activity descriptor.

In the future, building on the successful identification of shots in volleyball games in this chapter and the effective classification of video shots in the MPEG-7 testing dataset in our previous research, we would like to apply the proposed architecture for motion activity based shot identification and classification to other videos, including movies, documentaries and other sports. In addition, we will investigate video OCR to recognize the localized closed captions and thereby support the automatic generation of metadata, such as the names of teams in sports videos, the names of leading characters in movies, or important people in other kinds of videos.

Chapter 5. Robust Video Sequence Retrieval Using A Novel

Source document: Research on High-Level Video Processing, Retrieval, Feature Extraction and Video Structuring Computation (pp. 100-107)