Shot Similarity Measure - Scene Segmentation

Scene Segmentation

2.4 Shot Similarity Measure

This section describes the color and texture features used in our work and how they enter the formulation of the shot similarity measure. A major requirement for a shot similarity measure is a content representation that captures the common characteristics of the shot. One common method is to select one key-frame from the shot and use the image features of that key-frame as an abstract representation of the shot. For shots with fast-changing content, however, one key-frame per shot is not adequate, and the content description it provides varies significantly with the key-frame selection criterion. To avoid these problems, a more feasible approach is to consider the visual content of all the frames within a shot.

Color is one of the most widely used visual features in video content analysis. Most scene extraction algorithms compare color histograms between key-frames to determine shot similarity. The histogram-based approach is relatively simple to implement and provides reasonable results. However, due to its statistical nature, the color histogram cannot capture the spatial layout of each color. When the image collection is large, two images with different content are likely to have quite similar histograms. To remedy this deficiency, our approach also takes into account how each color is distributed in the spatial (image) domain. The color histogram of an image is constructed by counting the number of pixels of each color. The main issues in constructing color histograms are the choice of color space and the quantization of that color space.

Figure 2.3: The static mosaic of the "Terminator II" sequence.

The RGB color space is the most common color format for digital images, but it is not perceptually uniform. Uniform quantization of RGB space yields perceptually redundant bins and perceptual holes in the color space, so non-uniform quantization may be needed. Alternatively, the HSV (hue, saturation, value) color space is chosen, since it is nearly perceptually uniform; the value channel is referred to as intensity in what follows. The similarity between two colors is then determined by their proximity in the HSV color space. When a perceptually uniform color space is chosen, uniform quantization may be appropriate. Since the human visual system is more sensitive to hue than to saturation and intensity [33], H should be quantized more finely than S and V. In our implementation, the hue is quantized into 20 bins, and the saturation and intensity are each quantized into 10 bins. This quantization provides 2000 (= 20 × 10 × 10) distinct colors (bins), and each bin with a non-zero count corresponds to a color object. Since we are interested in the whole shot rather than a single image frame, only one histogram is used to count the color distribution of all background images within a shot.
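The quantization and the shot-level histogram, together with the per-frame averaging described in this section, might be sketched as follows. This is only an illustration under stated assumptions: hue is taken to lie in [0, 360) and saturation and intensity in [0, 1], and the function names `quantize_hsv` and `average_shot_histogram` are hypothetical, not taken from the original implementation.

```python
import numpy as np

H_BINS, S_BINS, V_BINS = 20, 10, 10  # 20 x 10 x 10 = 2000 bins, as in the text


def quantize_hsv(hsv):
    """Map an (..., 3) HSV array to flat bin indices in [0, 2000).

    Assumption: H is in [0, 360), S and V are in [0, 1].
    """
    hsv = np.asarray(hsv, dtype=float)
    # Clamp with np.minimum so that S = 1.0 or V = 1.0 falls in the last bin.
    h = np.minimum((hsv[..., 0] / 360.0 * H_BINS).astype(int), H_BINS - 1)
    s = np.minimum((hsv[..., 1] * S_BINS).astype(int), S_BINS - 1)
    v = np.minimum((hsv[..., 2] * V_BINS).astype(int), V_BINS - 1)
    return (h * S_BINS + s) * V_BINS + v


def average_shot_histogram(frames):
    """Accumulate one histogram over all frames, then divide by the frame count."""
    hist = np.zeros(H_BINS * S_BINS * V_BINS)
    for frame in frames:
        idx = quantize_hsv(frame).ravel()
        hist += np.bincount(idx, minlength=hist.size)
    return hist / len(frames)
```

Each entry of the returned array is the average pixel count of one color bin per frame, and bins with a non-zero entry play the role of color objects.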

Then, each bin of the resulting histogram is divided by the number of frames in the shot to obtain the average histogram. Next, several spatial features are calculated to characterize the distribution state of each color object in each image frame. Assume that a set of pixels S = {(x₁, y₁), ..., (x_n, y_n)} belongs to color object c_i, that k is the image size, and that m is the total number of 4-connected pixels in S. Then, we define

1. the density of distribution as

To define the fourth feature, the image is partitioned equally into p blocks of size 16 × 16. A block is active if it contains some subset of S. Letting q be the number of active blocks in the image frame, we define

4. $f_{i4} = \frac{q}{p}$.
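As a rough illustration of the fourth feature, the ratio of active blocks can be computed from a boolean mask that marks the pixels of one color object. The function name `active_block_ratio` is hypothetical, and the handling of image borders is an assumption: rows and columns that do not fill a whole 16 × 16 block are simply cropped.

```python
import numpy as np


def active_block_ratio(mask, block=16):
    """Return q / p: the fraction of block x block tiles containing any pixel of S.

    `mask` is a 2-D boolean array that is True where a pixel belongs to the
    color object. Assumption: partial blocks at the border are cropped away.
    """
    mask = np.asarray(mask, dtype=bool)
    rows, cols = mask.shape[0] // block, mask.shape[1] // block
    tiles = mask[:rows * block, :cols * block].reshape(rows, block, cols, block)
    active = tiles.any(axis=(1, 3))  # one flag per block
    return active.sum() / active.size
```

For a 32 × 48 mask there are p = 6 blocks, so a color object touching two of them gets a value of 2/6.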

After the spatial features of all images are computed, each feature is averaged over the shot. Let f_i1, f_i2, f_i3, and f_i4 be the average feature values of a color object c_i in a shot. For two color objects c_i and c_j, the difference in spatial distribution within a shot is defined as

$$D_s(c_i, c_j) = \frac{1}{4}\left(|f_{i1} - f_{j1}| + |f_{i2} - f_{j2}| + |f_{i3} - f_{j3}| + |f_{i4} - f_{j4}|\right). \tag{2.4}$$
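Eq. (2.4) is the mean absolute difference of the four averaged spatial features. A minimal sketch, with the hypothetical name `spatial_difference` and each color object represented by a sequence of its four feature values:

```python
def spatial_difference(f_i, f_j):
    """Mean absolute difference of two 4-element feature vectors, as in Eq. (2.4)."""
    if len(f_i) != 4 or len(f_j) != 4:
        raise ValueError("each color object needs exactly four spatial features")
    return sum(abs(a - b) for a, b in zip(f_i, f_j)) / 4.0
```

If the features lie in [0, 1], as the ratio q/p does, the result also lies in [0, 1]; whether the other three features are normalized this way is not stated here.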

Texture refers to visual patterns that have properties of homogeneity that do not result from the presence of only a single color or intensity. It contains important information about the structural arrangement of objects and their relationship to the surrounding environment. We define the coarseness of an image's texture in terms of the distribution density of its edges. The Canny edge detector is used to extract edges from an image; edge locations indicate sharp intensity variation. Psychophysical experiments have shown that the human visual system is sensitive to the high-frequency regions of an image, such as edges. The detected edge image is partitioned into a set of 16 × 22 blocks.
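The texture feature can be sketched as follows, using the rule given in this section that a block counts as textured when it holds more than 30 edge points. The sketch takes a binary edge map as input (for example, the output of a Canny detector computed elsewhere). Two points are assumptions: "16 × 22 blocks" is read as a grid of 16 rows by 22 columns of blocks, and pixels that do not divide evenly into the grid are cropped. All function names are hypothetical.

```python
import numpy as np


def textured_block_ratio(edges, grid=(16, 22), threshold=30):
    """Fraction of grid blocks whose edge-point count exceeds `threshold`.

    `edges` is a 2-D array that is non-zero at edge points.
    Assumption: `grid` is (block rows, block columns).
    """
    edges = np.asarray(edges, dtype=bool)
    rows, cols = grid
    bh, bw = edges.shape[0] // rows, edges.shape[1] // cols
    tiles = edges[:rows * bh, :cols * bw].reshape(rows, bh, cols, bw)
    counts = tiles.sum(axis=(1, 3))  # edge points per block
    return float((counts > threshold).mean())


def shot_texture(edge_maps, **kwargs):
    """Average textured-block ratio over all frames of a shot."""
    return float(np.mean([textured_block_ratio(e, **kwargs) for e in edge_maps]))


def texture_similarity(t1, t2):
    """Texture similarity of two shots: the smaller of their average ratios."""
    return min(t1, t2)
```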

A block is textured if the number of edge points in the block is greater than a threshold (= 30 in our setting). We then compute the ratio of textured blocks in each image and the average of this ratio over a shot. The texture similarity between two shots is the minimum of their two average values. Fig. 2.4 shows two images with different levels of texture coarseness, and Fig. 2.5 shows a detected edge image partitioned into a set of 16 × 22 blocks.

Histogram intersection is a popular similarity measure for color-based image matching [34]. It yields the number of pixels that have the same color in two images. In our work, we extend this idea to the shot similarity measure. Let A and B be the sets of all color objects in shots S₁ and S₂, respectively. For a given u ∈ A, a similar color object in B is some v ∈ B such that ‖u − v‖ < ε, where ‖u − v‖ denotes the Euclidean distance between u and v in the HSV color space and ε is a threshold (= 3 in our setting).

Then, (u, v) is called a similar color pair. Let Ω = {(u, v) | (u, v) ∈ A × B, (u, v) is a similar color pair}. The shot similarity measure between S₁ (with average histogram H₁) and S₂ (with average histogram H₂) is defined as

$$\mathrm{ShotSim}(S_1, S_2) = \frac{1}{k} \sum_{(u,v) \in \Omega} \left\{ W(D_s(u, v)) \min(H_1(u), H_2(v)) \right\} + w_t \times \min(t_1, t_2), \tag{2.5}$$

where k is the image size; t₁ and t₂ are the average ratios of textured blocks for shots S₁ and S₂, respectively; w_t is the weight of the texture feature; D_s is the difference in spatial features as defined in Eq. (2.4); and W is a weight function defined as

$$W(x) = \frac{1}{1 + e^{a \times x + b}}.$$

The weight function W is the general form of the sigmoid function, which is frequently used in neural network computation [35]; a and b are its parameters. In our work, it is used to fuse the spatial distribution information with a histogram. The construction of this weight function is motivated by the psychophysical observation that the effect of spatial distribution on human perception is progressive [36]: humans perceive significant visual variation only when the difference in spatial features is greater than a threshold. The sigmoid function fulfills this requirement. In our system, we set a = 10 and b = −5. As shown in Fig. 2.6, the function's value becomes very small for x > 0.75.
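With a = 10 and b = −5 the weight function can be evaluated directly. The small sketch below (the name `weight` is hypothetical) shows how the weight falls from nearly 1 at x = 0, through 0.5 at x = 0.5, to below 0.08 at x = 0.75.

```python
import math


def weight(x, a=10.0, b=-5.0):
    """Sigmoid weight W(x) = 1 / (1 + exp(a * x + b))."""
    return 1.0 / (1.0 + math.exp(a * x + b))


for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"W({x:.2f}) = {weight(x):.4f}")
```

Because a is positive, W decreases as the spatial difference grows, so color pairs whose spatial layouts differ strongly contribute little to the similarity.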

Note that a given color object in shot S₁ may have more than one similar color object in shot S₂, as illustrated in Fig. 2.7. To avoid overlapping contributions when calculating shot similarity, after each evaluation of min(H₁(u), H₂(v)), that minimum is subtracted from both H₁(u) and H₂(v).
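Putting Eq. (2.5) and the subtraction step together, a minimal sketch of the shot similarity might look as follows. Several choices here are assumptions that the text does not fix. Color objects are represented as quantized (h, s, v) bin-index triples. The Euclidean distance is taken between those triples, without hue wrap-around. Similar pairs are visited in ascending order of color distance. The texture weight `w_t` is left as a required argument because no value is given. All names are hypothetical.

```python
import math


def weight(x, a=10.0, b=-5.0):
    """Sigmoid weight W(x) = 1 / (1 + exp(a * x + b))."""
    return 1.0 / (1.0 + math.exp(a * x + b))


def spatial_difference(f_u, f_v):
    """Mean absolute difference of four spatial features, as in Eq. (2.4)."""
    return sum(abs(p - q) for p, q in zip(f_u, f_v)) / 4.0


def shot_similarity(hist1, hist2, feats1, feats2, t1, t2, k, w_t, eps=3.0):
    """Sketch of Eq. (2.5).

    hist1, hist2: dicts mapping a color triple to its average pixel count.
    feats1, feats2: dicts mapping a color triple to its four averaged features.
    t1, t2: average textured-block ratios; k: image size in pixels.
    """
    h1, h2 = dict(hist1), dict(hist2)  # work on copies; counts are consumed
    pairs = [(math.dist(u, v), u, v)
             for u in h1 for v in h2 if math.dist(u, v) < eps]
    pairs.sort()  # assumption: closest color pairs are matched first
    total = 0.0
    for _, u, v in pairs:
        common = min(h1[u], h2[v])
        total += weight(spatial_difference(feats1[u], feats2[v])) * common
        # Subtract the matched mass so it cannot be counted again.
        h1[u] -= common
        h2[v] -= common
    return total / k + w_t * min(t1, t2)
```

Without the subtraction, a color object with several similar partners would contribute its full count once per partner, which could push the color term above the number of pixels actually shared.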

Figure 2.4: Two images with different texture coarseness.

Figure 2.5: The detected edge image is partitioned into a set of 16 × 22 blocks.

Figure 2.6: Sigmoid function with parameters a = 10 and b = −5.
