Feature Extraction and Replay Detection


5.2.2 Feature Extraction and Replay Detection

After the video frames are partitioned, every sequence of consecutive non-scoreboard frames bounded by scoreboard frames is treated as a non-scoreboard segment.

Characteristics of replays and non-replays are observed to create features, which are used to detect replays and prune non-replays.

According to our observation, there are three possible components in a non-scoreboard segment: 1) slow motion replay; 2) TV commercial; 3) game-related segment which shows game-related information with background around the court but is not a replay. Some game-related segment examples are given in Fig. 5.1.

Non-scoreboard segments can be classified into three classes by their time duration: short, medium, and long. The composition of each class is different from the others.

A short non-scoreboard segment (SNS), which is less than 6 seconds long, is caused only by the temporary disappearance of the scoreboard while some game-related information is shown.

A medium non-scoreboard segment (MNS), which lasts between 6 seconds and 30 seconds, occurs during a temporary pause of the game, and it is either a game-related segment or a slow motion replay.

A long non-scoreboard segment (LNS), which is more than 30 seconds long, occurs during a long pause of the game, and it is a combination of TV commercials, half-time reports, and game-related segments, with or without replays in it. Because the classes differ in composition and characteristics, a different strategy is proposed for each class to detect slow motion replays.
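The three duration classes can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `classify_segment` is hypothetical, and because the text does not say which class a segment of exactly 6 or 30 seconds belongs to, both boundary values are assumed here to fall into the medium class.

```python
def classify_segment(duration_s: float) -> str:
    """Classify a non-scoreboard segment by its duration in seconds.

    Thresholds (6 s and 30 s) come from the text; the handling of the
    exact boundary values is an assumption made for this sketch.
    """
    if duration_s < 6:
        return "SNS"  # short: always treated as a non-replay
    if duration_s <= 30:
        return "MNS"  # medium: replay or game-related segment
    return "LNS"      # long: commercials, reports, possibly replays
```

For example, a 12-second segment would be handed to the MNS classifier, while a 45-second segment would first be cut into sub-segments.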

(a) Sideline interview. (b) Player’s comment. (c) Player’s statistic.

Fig. 5.1 Examples of game-related segments.

An SNS can always be detected as a non-replay, since a slow motion replay is never shorter than 6 seconds. An MNS is either a slow motion replay or a game-related segment, so a binary classifier is used to classify it as one or the other. Replay detection for LNS is more complicated than that for MNS because of the TV commercials. TV commercials vary widely, and they may contain features that are ambiguous with those of replays. Hence, each LNS is first cut into sub-segments, and replays are then detected from the sub-segments. Because the replay detection results for MNS are reliable, some of that information can be used to aid replay detection for LNS. More details of replay detection for MNS and LNS are presented in the following paragraphs. The block diagram of slow motion replay detection is shown in Fig. 5.2.

For MNS, most histogram differences of neighboring frames in a game-related segment are similar. In contrast, the histogram differences of neighboring frames in a replay segment vary widely (see Figs. 5.3(a)-5.3(c)). This variety has two causes. One is the camera flash, which appears frequently in replays (see Fig. 5.3(d)). The other is that, because the court in basketball videos is relatively small, the background sometimes changes between in-court view and out-court view during a replay, which makes the histogram difference of two neighboring frames larger (see Fig. 5.3(e)).

Fig. 5.2 Block diagram of slow motion replay detection.

Fig. 5.3 An example of comparison between a game-related segment and a replay segment. (a) An example of game-related segment (frames 1 to 15). (b) An example of replay segment (frames 1 to 15). (d) Frames 6~8 of the replay segment in (b) with a camera flash. (e) Frames 10~13 of the replay segment in (b) with background changing from in-court view to out-court view.

Based on these different characteristics, a binary classifier is used. Given a non-scoreboard segment (NS) with the frame sequence (nf_1, nf_2, …, nf_Kn), where Kn is the total frame number of NS, let (nh_1, nh_2, …, nh_Kn) be the corresponding color histograms of the Kn frames in NS. Two histogram-based frame differences, DH1(nh_i) and DH15(nh_i), are defined: DH1 measures the histogram difference of two neighboring frames, and DH15 measures that of two frames at distance 15. After calculating DH1(nh_i) and DH15(nh_i) for each frame in NS, let σ_DH1(NS) and σ_DH15(NS) represent the standard deviations of the sequences DH1(nh_i) and DH15(nh_i), respectively. These two standard deviations are considered as the global variation features of an NS.

The two global variation features of all MNSs are used in the binary classifier to classify each MNS as a preliminary replay or a game-related segment. In our preliminary experiments on ten basketball videos, the average precision rate of classifying segments as replays is 94%, with an average recall rate of 100%. To explain the performance of the binary classifier, the two global features of each MNS in one of the experimental basketball videos are shown in Fig. 5.4. From this figure, we can see that replay segments and game-related segments are well separated by these two global features.
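The two global variation features can be sketched as follows. This is a minimal illustration under stated assumptions: the exact form of the histogram difference is not recoverable from the text, so the sum of absolute bin differences is assumed; the population standard deviation is assumed as well; and all function names are hypothetical.

```python
from statistics import pstdev


def hist_diff(h1, h2):
    """Assumed histogram difference: sum of absolute bin differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def dh_sequence(hists, gap):
    """Histogram differences between frame i and frame i - gap."""
    return [hist_diff(hists[i], hists[i - gap]) for i in range(gap, len(hists))]


def global_variation_features(hists):
    """Return (sigma_DH1, sigma_DH15) for one non-scoreboard segment.

    `hists` is a list of per-frame color histograms (lists of bin counts).
    """
    if len(hists) < 16:
        raise ValueError("need at least 16 frames to form one DH15 value")
    sigma_dh1 = pstdev(dh_sequence(hists, 1))
    sigma_dh15 = pstdev(dh_sequence(hists, 15))
    return sigma_dh1, sigma_dh15
```

A segment whose histograms never change yields (0.0, 0.0), while a single sudden change such as a camera flash already makes both features positive.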

Note that the misclassifications occur because a few game-related segments consist of several still shots (see Fig. 5.5) with a few near-abrupt transitions and are therefore misclassified as replays (i.e., false alarms). Since the differences of neighboring frames in a replay segment are always diverse, another variation feature can be used to prune this kind of misclassification.

Fig. 5.4 The two global features of each MNS in a basketball video.

To reduce the effect of the few near-abrupt transitions in a game-related segment, a mean filter is used to skip the larger neighboring frame differences. The pixel-based difference of a neighboring frame pair, DF1(nf_i), is computed over the pixels of the frame, where M and N represent the width and height of a frame, and nf_i(x,y) represents the color value of pixel (x,y) in frame nf_i. Let µ_DF1(NS) be the mean value of all DF1(nf_i) in the NS, and let σ´_DF1(NS) be the standard deviation of those DF1(nf_i) that are less than µ_DF1(NS); it is considered as another variation feature of the NS. Hence, each MNS detected as a replay with a small σ´_DF1 should be pruned. To determine the threshold automatically, a self-training mechanism is provided. For each MNS detected as a replay in the ten experimental basketball games, σ´_DF1 is calculated, and the histogram of σ´_DF1 is established and shown in Fig. 5.6.
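The filtered variation feature σ´_DF1 can be sketched as below. This is a hedged illustration: the pixel-based difference is assumed to be the mean absolute difference over the M×N pixels of single-channel frames, the population standard deviation is assumed, and the function names are hypothetical. The value 4 is the decision value shown in the block diagram of Fig. 5.7.

```python
from statistics import mean, pstdev


def df1(prev_frame, cur_frame):
    """Assumed pixel-based difference: mean absolute difference.

    Frames are lists of rows of single-channel pixel values.
    """
    total = sum(
        abs(a - b)
        for row_prev, row_cur in zip(prev_frame, cur_frame)
        for a, b in zip(row_prev, row_cur)
    )
    return total / (len(cur_frame) * len(cur_frame[0]))


def sigma_prime_df1(df1_values):
    """Standard deviation of the DF1 values that lie below their mean."""
    mu = mean(df1_values)
    below = [d for d in df1_values if d < mu]
    return pstdev(below) if below else 0.0


def prune_preliminary_replay(df1_values, threshold=4):
    """True if a preliminary replay should be pruned as a non-replay."""
    return sigma_prime_df1(df1_values) < threshold
```

A segment made of still shots with one near-abrupt transition, such as DF1 values of 1, 1, 1, 1, 100, has σ´_DF1 equal to zero and is pruned, because the single large difference lies above the mean and is skipped.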

Fig. 5.5 An example of the DH1 sequence of a game-related segment misclassified as replay. (Labels: time axis; still shots 1, 2, and 3; camera flashes; transitions T1 and T2.)

Fig. 5.6 Histogram of σ´_DF1 from the preliminary replays in ten experimented basketball videos.

Fig. 5.7 Block diagram of replay detection for MNS. (Blocks: all MNSs in a basketball video; computation of global variation features (σ_DH1, σ_DH15) for each MNS; binary clustering based on global variation features (σ_DH1, σ_DH15); preliminary replays; computation of variation feature σ´_DF1 for each preliminary replay; decision σ´_DF1 < 4, with Yes leading to non-replays; replays.)

For LNS, the task is more complicated because the major portion of an LNS is TV commercials, whose features can be ambiguous with those of replays. Since an LNS consists of thousands of frames, our strategy is to cut each LNS into several sub-segments, called sub-LNSs. Then, instead of detecting replays directly from the sub-LNSs, the detection results for MNS and the characteristics of TV commercials are used to build pruning rules. After non-replays are pruned, the remaining sub-LNSs are considered as detected replays.

According to the structure of a typical commercial block [27], TV commercials are always grouped into blocks, and several monochrome black frames are inserted to separate them. In most cases, the last few seconds of each TV commercial consist of still shots of the product and slogan to impress viewers (see Fig. 5.8). Slow motion replays, however, contain neither monochrome black frames nor still shots. So, each LNS can be cut at consecutive runs of low neighboring-frame differences without affecting the completeness of replays.

(a) (b) (c)

Fig. 5.8 Examples of still shots of the product and slogan in TV commercials.

Given the frame sequence of an LNS, (lf_1, lf_2, …, lf_Kl), where Kl is the total frame number of the LNS, the pixel-based difference of each neighboring frame pair, DF1(lf_i), is calculated and recorded. Consecutive frames with DF1(lf_i) less than 20 are considered as a still run. All still runs with length more than 20 are used to cut the LNS into several sub-LNSs, each of which is bounded by a pair of still runs with length > 20.
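The cutting step can be sketched as follows. This is an illustration under assumptions that the text does not settle: run length is counted in frames, the frames inside a cutting still run are dropped rather than assigned to a sub-LNS, and the start and end of the LNS are treated as additional cut points. The function name `cut_into_sub_lns` is hypothetical.

```python
def cut_into_sub_lns(df1_values, still_threshold=20, min_run_length=20):
    """Cut one LNS into sub-LNSs at long still runs.

    `df1_values[i]` is the pixel-based difference recorded for frame i.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    n = len(df1_values)
    still_runs = []
    i = 0
    while i < n:
        if df1_values[i] < still_threshold:
            j = i
            while j < n and df1_values[j] < still_threshold:
                j += 1
            if j - i > min_run_length:  # only long still runs cut the LNS
                still_runs.append((i, j))
            i = j
        else:
            i += 1

    sub_segments = []
    start = 0
    for run_start, run_end in still_runs:
        if run_start > start:
            sub_segments.append((start, run_start))
        start = run_end
    if start < n:
        sub_segments.append((start, n))
    return sub_segments
```

With this sketch, a still run of 25 frames splits the sequence, whereas a still run of only 10 frames stays inside its sub-LNS.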

As mentioned earlier, a slow motion replay is never shorter than 6 seconds, so each sub-LNS shorter than 6 seconds can be detected as a non-replay. For sub-LNSs longer than 6 seconds, the binary classifier is not applicable because the variation features of TV commercials can be ambiguous with those of replays. Since the detection results for MNS are reliable, they can be used as a pre-trained model to build pruning rules. In the pre-trained model, let R-MNS and G-MNS represent the set of detected replay segments and the set of detected game-related segments, respectively. All frames of R-MNS are grouped as a replay frame sequence (RFS); likewise, all frames of G-MNS are grouped as a game-related frame sequence (GFS). The proposed pruning rules are given below.

Rule 1: Global Variation Pruning

As observed in replay detection for MNS, the differences of neighboring frames in a replay segment are diverse. This observation is supported by the high recall rate in the preliminary experiments. Based on the two global variation features used in the binary classifier for MNS, the center feature vector (Cσ_DH1(R-MNS), Cσ_DH15(R-MNS)) of R-MNS is given by formulas (5.1) and (5.2), and the radius of the global variation features of R-MNS, r_variation(R-MNS), is defined by formula (5.3).

For each sub-LNS, sLNS, the distance UD(sLNS) from its feature vector (σ_DH1(sLNS), σ_DH15(sLNS)) to the center feature vector of R-MNS can be calculated by formula (5.4).

If UD(sLNS) > r_variation(R-MNS), the sub-LNS is not similar to any segment of R-MNS. The dissimilarity is caused by global variations that are either too large or too small. It is more reasonable to prune the dissimilar sub-LNSs with small global variations, so a threshold TH1 is set by formula (5.5). Hence, each dissimilar sub-LNS with σ_DH1(sLNS) < TH1 should be pruned as a non-replay.
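Rule 1 can be sketched as below. Several details are assumptions rather than the authors' definitions: the center is taken as the component-wise mean of the R-MNS feature vectors, the radius as the largest Euclidean distance from that center to any R-MNS feature vector, and UD as the Euclidean distance. The definition of TH1 is not recoverable from the text, so it is passed in as a parameter. All function names are hypothetical.

```python
from math import hypot


def center_and_radius(rmns_features):
    """Center and radius of the (sigma_DH1, sigma_DH15) vectors of R-MNS.

    Assumed definitions: mean vector and maximum Euclidean distance.
    """
    n = len(rmns_features)
    cx = sum(f[0] for f in rmns_features) / n
    cy = sum(f[1] for f in rmns_features) / n
    radius = max(hypot(f[0] - cx, f[1] - cy) for f in rmns_features)
    return (cx, cy), radius


def rule1_prune(sub_feature, center, radius, th1):
    """True if the sub-LNS is dissimilar to R-MNS and has small variation."""
    ud = hypot(sub_feature[0] - center[0], sub_feature[1] - center[1])
    return ud > radius and sub_feature[0] < th1
```

Note that a sub-LNS lying far outside the radius but with a large σ_DH1 is kept as a potential replay; only the dissimilar sub-LNSs with small variation are pruned.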

Rule 2: Color Pruning

The color distribution of TV commercials varies widely, whereas the color distribution of RFS or GFS is closely related to the game itself. So a sub-LNS should be pruned if its color distribution is similar to neither that of RFS nor that of GFS.

Given RFS = (rf_1, rf_2, …, rf_Kr) and GFS = (gf_1, gf_2, …, gf_Kg), where Kr is the total frame number in RFS and Kg is the total frame number in GFS, let RHS = (rh_1, rh_2, …, rh_Kr) be the corresponding quantized histogram sequence calculated from RFS and GHS = (gh_1, gh_2, …, gh_Kg) be the corresponding quantized histogram sequence calculated from GFS. Then the mean color histograms of RHS and GHS, µ_RHS and µ_GHS, can be calculated by formulas (5.6) and (5.7). The maximum differences are considered as the radii of RHS and GHS, r_RHS and r_GHS, and are calculated by formulas (5.8) and (5.9). For each sub-LNS, sLNS, the mean color histogram µ_H(sLNS) of the quantized histogram sequence calculated from the sub-LNS is defined by formula (5.10).

Hence, for each sub-LNS, if the distance of its mean color histogram from µ_RHS is larger than r_RHS and its distance from µ_GHS is larger than r_GHS, the sub-LNS is pruned as a non-replay.
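Rule 2 can be sketched as follows. The assumptions are that the mean color histogram is the bin-wise average, that the radius is the largest distance between any histogram in a sequence and the mean of that sequence, and that distances are Euclidean; the text does not state the norm. Function names are hypothetical.

```python
from math import dist


def mean_hist(hists):
    """Bin-wise mean of a sequence of quantized color histograms."""
    n = len(hists)
    return [sum(bins) / n for bins in zip(*hists)]


def hist_radius(hists, mu):
    """Assumed radius: largest Euclidean distance from the mean histogram."""
    return max(dist(h, mu) for h in hists)


def rule2_prune(sub_hists, rhs, ghs):
    """True if the sub-LNS color is similar to neither RFS nor GFS."""
    mu_rhs, mu_ghs = mean_hist(rhs), mean_hist(ghs)
    r_rhs, r_ghs = hist_radius(rhs, mu_rhs), hist_radius(ghs, mu_ghs)
    mu_sub = mean_hist(sub_hists)
    return dist(mu_sub, mu_rhs) > r_rhs and dist(mu_sub, mu_ghs) > r_ghs
```

The rule prunes only when the sub-LNS falls outside both radii, so a sub-LNS that resembles either the replay frames or the game-related frames survives this step.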

Rule 3: Smoothness Pruning

In a slow motion replay, the pixel-based difference of each neighboring frame pair is usually large, so that the details of a sports event are shown. A non-replay, however, is normally much smoother, to meet the requirements of human visual perception. So, for each sub-LNS, if most differences of neighboring frame pairs are smoother than those of a replay, it should be pruned.

Let RFS = (rf_1, rf_2, …, rf_Kr), where Kr is the total frame number in RFS. The mean pixel-based neighboring frame difference of RFS, µ_DF1(RFS), is defined by formula (5.11), and it has already been calculated by formula (6) in an earlier process (i.e., replay detection for MNS). Given a sub-LNS, sLNS, with frame sequence (slf_1, slf_2, …, slf_Ks), where Ks is the total frame number of sLNS, the pixel-based frame difference DF1(slf_i) of two neighboring frames slf_i and slf_i-1 has already been calculated as well (i.e., during sub-LNS cutting for LNS). For each sLNS, the smoothness feature, smoothness(sLNS), is defined by formula (5.12).

If the smoothness feature is larger than a threshold, THsmoothness, the sub-LNS is detected as a non-replay.
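Rule 3 can be sketched as below. The exact smoothness formula is not recoverable from the text; since the rule prunes a sub-LNS when "most" of its frame differences are smoother than those of a replay, and THsmoothness is reported as a percentage in the experiments, the sketch assumes that the feature is the fraction of frames whose DF1 is below µ_DF1(RFS). The function names are hypothetical, and the default of 0.85 is one of the values tried in the experiments.

```python
def smoothness(sub_df1_values, mu_df1_rfs):
    """Assumed smoothness: fraction of DF1 values below the replay mean."""
    smoother = sum(1 for d in sub_df1_values if d < mu_df1_rfs)
    return smoother / len(sub_df1_values)


def rule3_prune(sub_df1_values, mu_df1_rfs, th_smoothness=0.85):
    """True if the sub-LNS is too smooth to be a replay."""
    return smoothness(sub_df1_values, mu_df1_rfs) > th_smoothness
```

Under this assumption, a sub-LNS in which 9 of 10 frame differences lie below the replay mean has a smoothness of 0.9 and is pruned at a threshold of 0.85.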

Rule 4: Scene Length Variation Pruning

From the structure of TV commercials [27], each individual TV commercial consists of many story scenes with abrupt scene changes. In contrast, a replay always happens in the same scene, and abrupt scene changes are rare in a replay.

Note that unexpected camera flashes may appear in replays and confuse abrupt transition detection, so the number of abrupt transitions is not a good pruning feature. Here, the length of each scene cut by abrupt transitions is observed instead. The lengths of the cut scenes in a replay vary because camera flashes appear unexpectedly, whereas those in a commercial are relatively stable in order to tell a story. Hence, for each sub-LNS, if the lengths of its cut scenes are relatively stable, it should be considered as a commercial, i.e., a non-replay.

Here, abrupt transitions in each sub-LNS are detected by finding every frame slf_i whose difference DF1(slf_i) is a local maximum and is more than two times that of one of its neighboring frames, i.e., DF1(slf_i) > 2DF1(slf_i-1) or DF1(slf_i) > 2DF1(slf_i+1). Some examples of abrupt transition detection results are given in Fig. 5.9. After abrupt transition detection, the scene length variation feature of each sub-LNS is defined as the standard deviation of the lengths of the cut scenes. If the scene length variation feature is less than a threshold, THslv, the sub-LNS is detected as a non-replay.
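Rule 4 can be sketched as follows. Some details are assumptions: a local maximum is taken to be strictly larger than both neighbors, the start and end of the sub-LNS count as scene boundaries, scene length is measured in frames, and the population standard deviation is used. Function names are hypothetical; the default of 30 is one of the THslv values tried in the experiments.

```python
from statistics import pstdev


def abrupt_transitions(df1_values):
    """Indices whose DF1 is a local maximum and more than twice a neighbor."""
    d = df1_values
    cuts = []
    for i in range(1, len(d) - 1):
        is_local_max = d[i] > d[i - 1] and d[i] > d[i + 1]
        if is_local_max and (d[i] > 2 * d[i - 1] or d[i] > 2 * d[i + 1]):
            cuts.append(i)
    return cuts


def scene_length_variation(df1_values):
    """Standard deviation of the lengths of the scenes between transitions."""
    bounds = [0] + abrupt_transitions(df1_values) + [len(df1_values)]
    lengths = [end - start for start, end in zip(bounds, bounds[1:])]
    return pstdev(lengths)


def rule4_prune(df1_values, th_slv=30):
    """True if the scene lengths are stable, as expected for a commercial."""
    return scene_length_variation(df1_values) < th_slv
```

Evenly spaced transitions give a variation of zero, which matches the stable scene lengths expected of a commercial, while irregular spacing such as that caused by camera flashes gives a large variation.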

Fig. 5.9 Examples of abrupt transition detection results and the corresponding cut scenes of non-replay and replay. (a) Non-replay. (b) Replay. (Labels in each panel: time axis; cut scenes 1, 2, and 3.)

After non-replays are pruned by the four proposed rules, the remaining sub-LNSs are considered as detected slow motion replays. The detailed implementation steps of replay detection for LNS are given below.

Step0: Initialization: Given R-MNS, RFS, GFS, and all LNSs in a basketball game.
    Cut each LNS into sub-LNSs. Consider sub-LNSs less than 6 seconds as non-replays and pass those more than 6 seconds to Step1.

Step1: // Global Variation Pruning Using Rule 1
    For R-MNS
        Calculate (Cσ_DH1(R-MNS), Cσ_DH15(R-MNS)), r_variation(R-MNS), and TH1 by formulas (5.1), (5.2), (5.3), and (5.5).
    In LNS, for each unprocessed sub-LNS, sLNS
        Calculate UD(sLNS) by formula (5.4).
        if (UD(sLNS) > r_variation(R-MNS) and σ_DH1(sLNS) < TH1)
            Consider sLNS as a non-replay
        else
            Consider sLNS as a potential replay
        end

Step2: // Color Pruning Using Rule 2
    For RFS and GFS
        Calculate µ_RHS, µ_GHS, r_RHS, and r_GHS by formulas (5.6), (5.7), (5.8), and (5.9).
    In LNS, for each potential replay sLNS
        Calculate µ_H(sLNS) by formula (5.10).
        if (||µ_H(sLNS) - µ_RHS|| > r_RHS and ||µ_H(sLNS) - µ_GHS|| > r_GHS)
            Reconsider sLNS as a non-replay
        end

Step3: // Smoothness Pruning Using Rule 3
    For RFS
        Calculate µ_DF1(RFS) by formula (5.11).
    In LNS, for each potential replay sLNS
        Calculate smoothness(sLNS) by formula (5.12).
        if (smoothness(sLNS) > THsmoothness)
            Reconsider sLNS as a non-replay
        end

Step4: // Scene Length Variation Pruning Using Rule 4
    In LNS, for each potential replay sLNS
        Calculate scene length variation feature of sLNS.
        if (scene length variation feature of sLNS < THslv)
            Reconsider sLNS as a non-replay
        end

Note that each rule corresponds to a specific feature for pruning non-replays. Since a sub-LNS is kept as a replay only if none of the four rules prunes it, changing the order of the four rules, i.e., Step1-Step4, yields the same results.
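The order independence can be checked with a small sketch. It assumes that each rule is a predicate that looks only at the sub-LNS itself and at fixed reference data, so the kept set is exactly the set of items that no rule prunes. The function names and the toy rules on integers are hypothetical stand-ins, not the four rules of this section.

```python
from itertools import permutations


def surviving(items, rules):
    """Apply pruning rules in the given order and return the kept items."""
    kept = list(items)
    for should_prune in rules:
        kept = [x for x in kept if not should_prune(x)]
    return kept


def order_independent(items, rules):
    """True if every ordering of the rules keeps exactly the same items."""
    results = {tuple(surviving(items, order)) for order in permutations(rules)}
    return len(results) == 1


# Toy stand-ins for four pruning rules (hypothetical, on integers).
toy_rules = [
    lambda x: x % 2 == 0,
    lambda x: x > 7,
    lambda x: x < 2,
    lambda x: x == 5,
]
```

The order does affect how many sub-LNSs each later step has to examine, so placing cheaper rules first reduces computation without changing the outcome.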

5.3 Experimental Results

Our experiments are conducted on 10 NBA basketball games from 3 different broadcasters, i.e., ESPN, TNT, and NBA TV. Each game is about 150 minutes long with TV commercials included. The data were recorded from TV in MPEG-2 format at a resolution of 480×352.

In slow motion replay detection, there are two kinds of source videos. One is the commercial-free sports video, which is available only to professional staff in the production room. The other is the broadcast version with many TV commercials, which is more readily available to general audiences. Our experimental results show that the proposed replay detection method can serve both kinds of potential users.

Table 5.1 and Table 5.2 show the replay detection results for MNS. In Table 5.1, some game-related segments (e.g., player’s statistics, game series information, player’s online comments) are misclassified as replays because they contain several still shots with a few near-abrupt transitions, which increase the global variation features. As can be seen from Table 5.2, the proposed automatic self pruning raises the precision rate from 94% to 97%. The rare false alarms are acceptable because they are all game-related video segments.

For the experimental comparison with replay detection methods in the first category, we assume that all methods in the first category can correctly extract all segments sandwiched by paired SDEVs, and we consider the extracted segments as slow motion replays. Since some extracted segments sandwiched by paired SDEVs are not slow motion replays, only the real slow motion replay segments are counted as correctly detected. The results are shown in Table 5.3.

Table 5.1 Replay detection results for MNS.
Columns: Match, Correctly Detected, Total Detected, Precision, Total Replays, Recall.

Table 5.2 Replay detection results for MNS with self pruning.
Columns: Match, Correctly Detected, Total Detected, Precision, Total Replays, Recall.

Table 5.3 Replay detection results for MNS by methods in the first category.
Columns: Match, Correctly Detected, Total Detected, Precision, Total Replays, Recall.
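The Precision and Recall columns of these tables follow the standard definitions, which the sketch below spells out in terms of the other columns. The function name is hypothetical, and the counts in the usage note are made-up values for illustration, not results from the experiments.

```python
def precision_recall(correctly_detected, total_detected, total_replays):
    """Precision and recall from the counts reported per match.

    precision = correctly detected / total detected
    recall    = correctly detected / total replays
    """
    precision = correctly_detected / total_detected if total_detected else 0.0
    recall = correctly_detected / total_replays if total_replays else 0.0
    return precision, recall
```

For instance, with hypothetical counts of 47 correctly detected segments out of 50 detected and 47 true replays, the precision is 0.94 and the recall is 1.0.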

As can be seen from Table 5.3, the precision rate is lower because many game-related segments also have paired SDEVs, e.g., player’s statistics, game series information, and sideline clips during timeouts. Accordingly, the precision of our method is better. Since MNS is the only possible non-scoreboard segment in commercial-free sports videos, the proposed replay detector for MNS is applicable to commercial-free sources.

In our approach, THsmoothness and THslv have to be preset. For THsmoothness, a larger threshold means that the condition for pruning a non-replay is stricter, so fewer sub-LNSs are pruned: the precision rate decreases while the recall rate increases. Conversely, a smaller threshold means that the condition is looser, so the precision rate increases while the recall rate decreases. This illustrates the tradeoff between precision and recall.

THslv exhibits the same kind of tradeoff, although in the opposite direction: a sub-LNS is pruned when its scene length variation feature is less than THslv, so a larger THslv prunes more sub-LNSs. To show the tradeoff, results for fixed THslv = 30 with various THsmoothness are shown in Table 5.4, and results for fixed THsmoothness = 85% with various THslv are shown in Table 5.5. By observing the trends in Table 5.4 and Table 5.5, two pairs of thresholds, (THsmoothness = 85%, THslv = 25) and (THsmoothness = 85%, THslv = 30), are chosen in our experiments. The results are presented in Table 5.6 and Table 5.7.

Table 5.4 Total replay detection results with fixed THslv = 30.

THslv    THsmoothness    Precision    Recall
30       75%             90%          90%
30       80%             89%          92%
30       85%             88%          94%
30       90%             86%          94%

Table 5.5 Total replay detection results with fixed THsmoothness = 85%.

THsmoothness    THslv    Precision    Recall
85%             15       78%          97%
85%             20       83%          96%
85%             25       87%          96%
85%             30       88%          94%
85%             35       89%          91%

Table 5.6 and Table 5.7 present the total replay detection results obtained by combining the results for MNS and LNS. As can be seen from Table 5.7, the stricter threshold raises the precision rate from 87% to 88% at the cost of a 2% drop in the recall rate.

Since one of the most important goals of replay detection is highlight generation, the high recall rates in both results indicate good performance. Compared with previous research on basketball videos [17]-[18], our method performs better.

We also compare our approach with methods in the first category. The results of methods in the first category are shown in Table 5.8. From this table, we can see that their recall rate (69%) is worse than ours (94%). The reason is that methods in the first category miss detections because replays are not always sandwiched by paired SDEVs.

Accordingly, compared with previous research [12][15], our method is superior.
