Chapter 3. Automatic Closed Caption Detection and Filtering in MPEG Videos
3.3 Closed Caption Localization
3.3.3 Font Size Differentiation
From Fig. 3-7(d), we can see that the scoreboard in the upper-left corner and the trademark in the upper-right corner are both successfully detected. Since scoreboards can be used for the content structuring of sports videos, separating out the captions in the scoreboard is one of our concerns. Hence, a font size detector is proposed to discriminate font sizes automatically, in support of scoreboard discrimination. To detect the font size, the gradient energy of each text block is exploited. Since a block containing characters has much larger gradient energy than a block containing blank space, the distance between two character blocks can be determined by evaluating the distance between peak gradient values among the blocks in a row or column. In other words, the font size can be estimated from the distance between blocks with peak gradient values (i.e., the periodicity of the peak values). The gradient energy in the vertical direction is used rather than that in the horizontal direction because the blank space between two text rows is generally larger than that between two letters, so the variation of gradient energy in the vertical direction exhibits a more regular pattern.
In addition, to obtain a robust periodicity, we compute the DCT coefficients of the 8x8 overlap-block between two neighboring blocks as defined in Eq. (3-8). An overlap-block, shown in Fig. 3-10, comprises the lower portion of the top neighboring 8x8 block Bt and the upper portion of the bottom neighboring block Bb, where Iw0 and Iw1 are the identity matrices of dimension w0 x w0 and w1 x w1, respectively. More robust results can be achieved if more overlap-blocks are computed and exploited. For example, w0 and w1 can be set to 1 and 7, 2 and 6, 3 and 5, etc., to obtain more overlap-blocks for a more accurate estimate of the font size.
Fig. 3-10. Overlap-block interpolated from its two neighboring blocks Bt and Bb
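The overlap-block construction can be illustrated with a minimal sketch. It assumes that w0 rows are taken from the bottom of the top block and w1 = 8 - w0 rows from the top of the bottom block, and that the selection is written with two shift matrices holding Iw0 and Iw1; the function names are illustrative and not taken from the text. Because an orthonormal 2-D DCT is linear, the same combination can be carried out directly on DCT coefficients, which the last function checks numerically. This is a sketch under those assumptions, not necessarily the exact form of Eq. (3-8).

```python
import math

SIZE = 8  # block size used by MPEG intra-coded blocks


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def shift_matrices(w0, size=SIZE):
    """Return (s_top, s_bottom), assumed forms of the selection matrices.

    s_top holds I_w0 in its upper-right corner: it moves the last w0 rows
    of the top block to the top of the overlap-block.
    s_bottom holds I_w1 in its lower-left corner: it moves the first w1 rows
    of the bottom block to the bottom of the overlap-block.
    """
    w1 = size - w0
    s_top = [[0.0] * size for _ in range(size)]
    s_bottom = [[0.0] * size for _ in range(size)]
    for i in range(w0):
        s_top[i][size - w0 + i] = 1.0
    for i in range(w1):
        s_bottom[w0 + i][i] = 1.0
    return s_top, s_bottom


def overlap_block(b_top, b_bottom, w0):
    """Pixel-domain overlap-block: s_top * b_top + s_bottom * b_bottom."""
    s_top, s_bottom = shift_matrices(w0, len(b_top))
    return add(matmul(s_top, b_top), matmul(s_bottom, b_bottom))


def dct_matrix(size=SIZE):
    """Orthonormal DCT-II basis matrix C, so that C * C^T = I."""
    rows = []
    for k in range(size):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * size))
                     for j in range(size)])
    return rows


def dct2(block):
    """2-D DCT of a square block: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def overlap_block_dct(dct_top, dct_bottom, w0):
    """Same combination applied to DCT coefficients (no inverse DCT needed)."""
    s_top, s_bottom = shift_matrices(w0, len(dct_top))
    return add(matmul(dct2(s_top), dct_top),
               matmul(dct2(s_bottom), dct_bottom))
```

Calling `overlap_block_dct(dct2(b_top), dct2(b_bottom), w0)` gives the same coefficients as `dct2(overlap_block(b_top, b_bottom, w0))` up to floating-point error. The DCTs of the shift matrices depend only on w0, so they could be precomputed once per (w0, w1) pair.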
Fig. 3-11. The proposed approach of font size differentiation in the compressed domain
Fig. 3-11 shows the proposed approach of font size differentiation, in which the periodicity and variance are estimated for each block column. However, a localized closed caption, like the example at the top of Fig. 3-11, may not be complete in shape because some pieces with low gradient energy are filtered out. Therefore, to achieve robust font size differentiation, a rectangular region within the localized caption is selected for font size computation. Font size differentiation is performed on each block column in the selected region of the closed caption, where a block column, as depicted in Fig. 3-11, is defined as a whole column of blocks. Once the AC energy of each block is extracted, the curve of AC energy variation for each block column is examined to locate every local maximum. A region containing the boundary of a closed caption has conspicuous texture variation in the vertical direction, so its gradient energy is relatively high. Therefore, the local maxima of the vertical AC gradient energy curve are regarded as the boundaries of closed captions. Once all local maxima are found, we filter out noise and select reliable curve peaks for further verification. Because the first and the last local maxima usually reflect the boundary of the closed caption, we take the average of these two peak values as an adaptive threshold for noise filtering. A peak whose value is smaller than the threshold is filtered out; otherwise, the peak is kept for font size computation. The periodicity of each block column is then computed by averaging the distance between adjacent remaining peaks of the AC energy curve. Finally, the average periodicity T and the periodicity variance V of the closed caption are obtained by
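$$T = \frac{1}{N}\sum_{i=1}^{N} T_i, \qquad V = \frac{1}{N}\sum_{i=1}^{N} \left(T_i - T\right)^2$$
where T_i is the periodicity of the i-th block column and N is the total number of block columns in the selected area of the closed caption.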
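The peak selection and periodicity estimation described above can be sketched as follows. The sketch assumes that a local maximum is a sample strictly greater than its neighbors (endpoints compared against their single neighbor), that "distance between two peaks" means the spacing between adjacent kept peaks, and that V is the population variance (division by N). All function names are hypothetical.

```python
def local_maxima(curve):
    """Indices whose value is strictly greater than both neighbors.

    Endpoints are compared against their single existing neighbor
    (an assumption; the text does not say how endpoints are handled).
    """
    peaks = []
    last = len(curve) - 1
    for i, value in enumerate(curve):
        left = curve[i - 1] if i > 0 else float("-inf")
        right = curve[i + 1] if i < last else float("-inf")
        if value > left and value > right:
            peaks.append(i)
    return peaks


def column_periodicity(curve):
    """Average spacing between reliable peaks of one block column.

    The adaptive threshold is the mean of the first and last peak values;
    peaks below the threshold are treated as noise. Returns None when
    fewer than two peaks survive.
    """
    peaks = local_maxima(curve)
    if len(peaks) < 2:
        return None
    threshold = (curve[peaks[0]] + curve[peaks[-1]]) / 2.0
    kept = [i for i in peaks if curve[i] >= threshold]
    if len(kept) < 2:
        return None
    gaps = [b - a for a, b in zip(kept, kept[1:])]
    return sum(gaps) / len(gaps)


def caption_periodicity(columns):
    """Return (T, V): mean and population variance of column periodicities."""
    periods = [p for p in (column_periodicity(c) for c in columns)
               if p is not None]
    if not periods:
        raise ValueError("no block column yielded a periodicity")
    n = len(periods)
    t = sum(periods) / n
    v = sum((p - t) ** 2 for p in periods) / n
    return t, v


# Two made-up block columns of 9 AC-energy samples each (not data from the text).
columns = [
    [1, 9, 2, 8, 1, 9, 2, 3, 1],    # kept peaks at 1, 3, 5: spacing 2
    [0, 10, 1, 1, 10, 0, 0, 10, 1],  # kept peaks at 1, 4, 7: spacing 3
]
t, v = caption_periodicity(columns)
```

With these two illustrative columns, the per-column periodicities are 2 and 3, so T is 2.5 and V is 0.25. Note that when the first and last peaks differ in height, the lower of the two falls below their average and is itself filtered out by the rule as stated.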
The results of font size analysis for the scoreboard and the trademark in Fig. 3-12 are shown in Fig. 3-13 and Fig. 3-14, respectively. In this example, each column of the scoreboard and of the trademark consists of 9 blocks, of which 5 are original blocks and 4 are interpolated overlap-blocks. For robust font size measurement, we select a portion of the localized closed caption in which the height of every block column is consistent. Therefore, for the scoreboard in Fig. 3-12(a), we compute T for the first five block columns only, because the first part “Doki” of the localized scoreboard consists of five block columns of consistent height and several non-text blocks separate it from the second part of the scoreboard. In contrast, for the trademark in Fig. 3-12(b), all block columns are selected for font size computation because the height of every block column is consistent.
Fig. 3-12. The localized closed captions: (a) scoreboard; (b) trademark
From Fig. 3-13 and Fig. 3-14, we can see that the average distance T of the scoreboard is about 2.2, which is smaller than the 2.9 of the trademark. In addition, the variance V of the row distance among the block columns of the scoreboard is 0.05, which is also smaller than the 0.8 of the trademark. Hence, we can correctly discriminate the scoreboard from the trademark: the font size of the scoreboard is smaller than that of the trademark, and the font size is more regular in the scoreboard than in the trademark.
[Plot: AC energy versus block number of a column]
Fig. 3-13. Variation of AC energy of the scoreboard in Fig. 3-12(a) (T=2.2, V=0.05)
[Plot: AC energy versus block number of a column]
Fig. 3-14. Variation of AC energy of the trademark in Fig. 3-12(b) (T=2.9, V=0.8)
Furthermore, to estimate the periodicity of the font size more efficiently, we exploit the concept of projection analysis of a print line [44-45]. Since projection analysis can detect the blank space between successive letters, we compute the horizontal projection profile of each block row by summing up the vertical AC coefficients of the blocks in that row. The horizontal projection profile P_H is defined as follows:
$$P_H = \{\, P_y \mid 1 \le y \le H_T \,\}, \qquad P_y = \sum_{x=1}^{W} E(B_{x,y})$$
where E(B_{x,y}) denotes the vertical AC energy of block B_{x,y}, H_T is the sum of the number of original blocks (H) and the number of overlap-blocks (H-1) of a block column in an H x W caption region, and B_{x,y} is the block at coordinate (x, y). With this method, the periodicity T of each localized closed caption is computed once, instead of inspecting the periodicity T and the variance V of every block column. The horizontal projection profiles of the scoreboard and the trademark are shown in Fig. 3-15, where the average periodicity T of the scoreboard and of the trademark is about 2 and 3, respectively. Using the horizontal projection profile, the font size can be detected more efficiently since only one curve of AC energy variation needs to be computed for each closed caption.
[Plot: Horizontal Projection Profile; total AC energy (0 to 5000) versus row number (0 to 10); curves: Trademark, Scoreboard]
Fig. 3-15. Horizontal projection profile of DCT AC energy of the scoreboard and the trademark in Fig. 3-12(a) and Fig. 3-12(b), respectively
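The horizontal projection profile can be sketched in a few lines. The sketch assumes the vertical AC energy of every block (original and overlap) is already available as a grid with one list per block row, and it estimates the periodicity as the mean spacing between interior local maxima of the profile, without noise thresholding; the text does not state how T is read off the profile, so that step and all names here are assumptions.

```python
def total_block_rows(h):
    """H original block rows plus H - 1 interpolated overlap-block rows."""
    return h + (h - 1)


def horizontal_projection_profile(energy):
    """Sum the vertical AC energy over all blocks of each block row.

    energy[y][x] is the (precomputed) vertical AC energy of the block in
    row y and column x, with original and overlap rows interleaved.
    """
    return [sum(row) for row in energy]


def profile_periodicity(profile):
    """Mean spacing between interior local maxima of the profile.

    Returns None if fewer than two peaks are found.
    """
    peaks = [i for i in range(1, len(profile) - 1)
             if profile[i] > profile[i - 1] and profile[i] > profile[i + 1]]
    if len(peaks) < 2:
        return None
    # The mean of adjacent gaps equals (last - first) / (number of gaps).
    return (peaks[-1] - peaks[0]) / (len(peaks) - 1)


# Made-up 9 x 3 energy grid (H = 5, W = 3), not data from the text.
energy = [
    [1, 1, 1],
    [5, 6, 4],
    [1, 0, 1],
    [5, 5, 5],
    [1, 1, 0],
    [5, 5, 5],
    [0, 1, 1],
    [5, 5, 5],
    [1, 1, 1],
]
profile = horizontal_projection_profile(energy)
period = profile_periodicity(profile)
```

For this illustrative grid the profile peaks on every second row, giving a periodicity of 2. Only one curve is analysed per caption, whereas the per-column approach analyses W curves and then averages them.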