Synopsis of the Dissertation - A Study on Video Segmentation for Generating Single/Multiple Scene Backgrounds

CHAPTER 1 INTRODUCTION

1.7 Synopsis of the Dissertation

The rest of the dissertation is organized as follows. Chapter 2 describes the proposed sprite generation method with automatically generated segmentation masks. Chapter 3 introduces the proposed sprite generation method that does not use segmentation masks. Chapter 4 presents the fast multiple sprite partition and reference frame selection methods. Conclusions and future research directions are given in Chapter 5.


CHAPTER 2 AUTOMATIC GENERATION OF SEGMENTATION MASKS FOR SPRITE GENERATION

In this chapter, we propose a sprite generator that generates segmentation masks automatically. The generation process consists of two passes. The first pass generates a coarse sprite with the conventional averaging blending method. Segmentation masks for every frame are then generated automatically from the coarse sprite. In the second pass, the final sprite is generated with these segmentation masks to reduce the effect of moving objects. The details of the proposed generator are described as follows.

2.1 Proposed Two-Pass Sprite Generation

As Fig. 1.1 shows, the MPEG-4 sprite generation framework requires auxiliary segmentation masks to reduce the effect of moving objects. The masks are used in global motion estimation to raise the estimation accuracy, and they also keep moving objects from taking part in the sprite blending. Building these masks manually is impractical, so automatic generation of segmentation masks is necessary. Since the sprite is the merged background of the video sequence, it can be used as a reference background for moving object detection.

2.1.1 Object blurring effect of averaging blending

The averaging blending used in sprite generation has a blurring effect on every pixel in the sprite. If no segmentation masks are available, the averaging blending blends pixels of moving objects and background together. If enough frames are blended, the moving objects are blurred and only shadows of the objects are left in the generated sprite. Fig. 2.1 demonstrates the blurring of moving objects. Figs. 2.1(a) and (b) are an original frame and its background reconstructed from the generated sprite, respectively. One can see that the moving player is blended into the sprite and leaves white shadows in the reconstructed background. Although the quality of the reconstructed background is degraded by these shadows, the background of the original frame is still visible in the reconstructed background.
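The blurring effect can be reproduced with a toy sequence. The sketch below is only an illustration under simplifying assumptions: the frames are assumed to be already aligned (global motion compensated), and the function name `average_blend` and the pixel values are hypothetical.

```python
def average_blend(frames):
    """Blend aligned frames by per-pixel averaging (no segmentation masks)."""
    height, width = len(frames[0]), len(frames[0][0])
    n = len(frames)
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]

# Toy sequence: a dark background (value 50) and a bright one-pixel object
# (value 250) that moves one column to the right in each of 5 frames.
BACKGROUND, OBJECT = 50, 250
frames = []
for t in range(5):
    frame = [[BACKGROUND] * 5 for _ in range(3)]
    frame[1][t] = OBJECT          # object position in frame t
    frames.append(frame)

sprite = average_blend(frames)
print(sprite[0])  # top row never sees the object: [50.0, 50.0, 50.0, 50.0, 50.0]
print(sprite[1])  # middle row: (4 * 50 + 250) / 5 = 90.0 in every column
```

Each pixel on the object's path is pulled away from the true background value by a fraction of the object contrast, which corresponds to the faint shadow described above; the more frames are blended, the weaker the shadow becomes.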

Fig. 2.1 Object blurring effect of averaging blending. (a) Original frame. (b) Reconstructed background of (a).

2.1.2 Frame segmentation in sprite generation

Since the backgrounds reconstructed from a sprite generated by averaging blending without segmentation masks are only blurred by moving objects, they can be used as reference backgrounds to detect moving objects. A two-pass sprite generation scheme with automatic segmentation mask generation is proposed, as shown in Fig. 2.2.

Fig. 2.2 The proposed two-pass sprite generator.

In the first pass of the proposed generator, a coarse sprite is generated by the MPEG-4 sprite generator without segmentation masks. The coarse sprite will inevitably contain shadows of moving objects. The backgrounds reconstructed from the coarse sprite are then employed as reference backgrounds to detect the moving objects in the video and to generate segmentation masks automatically. Finally, the sprite is regenerated in the second pass by the MPEG-4 sprite generator with the generated segmentation masks.

2.2 The First Pass of Sprite Generation

The first pass of sprite generation is almost identical to the MPEG-4 framework. In the global motion estimation, feature points are extracted by selecting the pixels with larger Hessian values. The global motion parameters are then estimated by a least-mean-square-error minimization method, the Levenberg-Marquardt algorithm. Averaging blending is employed as the blending method. No segmentation masks are applied in this pass, so moving objects are blended into the sprite and the generated coarse sprite is blurred. The coarse sprite is then used to extract the segmentation masks automatically.

2.3 Automatic Generation of Segmentation Masks

Despite the blurred areas, the reconstructed backgrounds still carry most of the background information. By subtracting the reconstructed background from the original frame, we obtain an image of the moving objects. To suppress the effect of peak noise, the block difference is used instead of the pixel difference.

The pixel difference D is defined as the magnitude of the difference between the original frame I and the reconstructed background R, i.e., D = |I − R|. A threshold t1 is set to find the candidate object pixels: pixels with a D value larger than t1 are considered as candidates.

For each candidate, a 5×5 block B centered on the candidate is taken, and the block difference DB is computed over B.

A candidate whose block difference is larger than a preset threshold t2 is considered an object pixel. This two-stage thresholding computes the block difference only for pixels that are more likely to belong to objects, which avoids the cost of computing the block difference for every pixel.
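The two-stage thresholding can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the exact definition of the block difference is not reproduced in the text, so it is ASSUMED here to be the mean of D over the 5×5 block (clipped at the frame border). The function name and the threshold values are hypothetical.

```python
def two_stage_threshold(frame, background, t1, t2, half=2):
    """Return a binary image O (1 = object pixel) via two-stage thresholding."""
    h, w = len(frame), len(frame[0])
    # Pixel difference D = |I - R| for every pixel.
    d = [[abs(frame[y][x] - background[y][x]) for x in range(w)]
         for y in range(h)]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if d[y][x] <= t1:
                continue                      # stage 1: not a candidate
            # Stage 2: block difference over the (2*half+1)^2 block,
            # ASSUMED to be the mean of D; the block is clipped at borders.
            ys = range(max(0, y - half), min(h, y + half + 1))
            xs = range(max(0, x - half), min(w, x + half + 1))
            block = [d[j][i] for j in ys for i in xs]
            if sum(block) / len(block) > t2:
                out[y][x] = 1
    return out

# Toy example: flat background, a 3x3 object and one isolated noise pixel.
background = [[100] * 9 for _ in range(9)]
frame = [row[:] for row in background]
for y in range(3, 6):
    for x in range(3, 6):
        frame[y][x] = 200      # moving object
frame[0][8] = 200              # single-pixel peak noise
objects = two_stage_threshold(frame, background, t1=30, t2=20)
print(sum(map(sum, objects)))  # 9: the object survives, the noise pixel does not
```

The isolated noise pixel passes the first threshold but fails the second, because its neighborhood contributes almost nothing to the block difference.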

Two problems arise when extracting the object pixels. First, the object regions are often ill-shaped and contain holes. Second, some small regions are misclassified as objects. These problems can be solved by morphological processing and region selection. Let O be a binary image representing the result of the thresholding in the previous step: pixels judged as objects are set to one and all others to zero. Two binary images, called the seed image and the base image, are computed. The seed image is produced by applying morphological erosion to O with a disk-shaped structuring element of radius 2, and the base image is produced by applying morphological dilation to O with a disk-shaped structuring element of radius 5. Region selection is applied to the base image: an object region is selected if any of its pixels has a value of one in the seed image. The segmentation mask is defined as the union of all selected regions.
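The seed/base construction can be sketched in plain Python as follows. This is an illustrative sketch under assumptions that the text does not state: regions of the base image are taken to be 8-connected components, pixels outside the image count as background during erosion, and all function names are hypothetical.

```python
def disk(radius):
    """Offsets (dy, dx) of a disk-shaped structuring element."""
    r = range(-radius, radius + 1)
    return [(dy, dx) for dy in r for dx in r
            if dy * dy + dx * dx <= radius * radius]

def erode(img, radius):
    h, w, se = len(img), len(img[0]), disk(radius)
    return [[int(all(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                     for dy, dx in se))
             for x in range(w)] for y in range(h)]

def dilate(img, radius):
    h, w, se = len(img), len(img[0]), disk(radius)
    return [[int(any(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                     for dy, dx in se))
             for x in range(w)] for y in range(h)]

def select_regions(seed, base):
    """Keep every connected region of `base` that contains a seed pixel."""
    h, w = len(base), len(base[0])
    mask = [[0] * w for _ in range(h)]
    seen = [[False] * w for _ in range(h)]
    for y0 in range(h):
        for x0 in range(w):
            if not base[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill one 8-connected region of the base image.
            region, stack = [], [(y0, x0)]
            seen[y0][x0] = True
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and base[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if any(seed[y][x] for y, x in region):
                for y, x in region:
                    mask[y][x] = 1
    return mask

def segmentation_mask(o, seed_radius=2, base_radius=5):
    return select_regions(erode(o, seed_radius), dilate(o, base_radius))

# Toy binary image: a 9x9 object with a one-pixel hole, plus an isolated pixel.
o = [[0] * 32 for _ in range(32)]
for y in range(5, 14):
    for x in range(5, 14):
        o[y][x] = 1
o[9][9] = 0        # hole inside the object
o[25][25] = 1      # small misclassified region
mask = segmentation_mask(o)
print(mask[9][9], mask[25][25])  # 1 0: the hole is filled, the small region is dropped
```

Erosion removes small regions from the seed image, while dilation closes holes in the base image, so selecting base regions by seed membership addresses both problems at once.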

Fig. 2.3 gives an example of generating a segmentation mask. The original frame and its reconstructed background are shown in Figs. 2.3(a) and (b), respectively. Subtracting the reconstructed background from the original frame and performing the two-stage thresholding yields the extracted object pixels shown in Fig. 2.3(c). The seed image shown in Fig. 2.3(d) and the base image shown in Fig. 2.3(e) are generated by applying the morphological processing to Fig. 2.3(c). Finally, the segmentation mask is produced by region selection and shown in Fig. 2.3(f). The object regions are colored black.

Fig. 2.3 The generation of a segmentation mask. (a) The original image. (b) The reconstructed background. (c) The object pixels extracted by using two-stage thresholding. (d) The seed image. (e) The base image. (f) The generated segmentation mask.

Most of the moving objects were extracted correctly, except for two parts that were not classified as objects: the upper part of the bat and the player’s legs. The upper part of the bat is nearly transparent, so the background is visible through it; the legs of the player have intensities similar to the background. Thus, these two misclassified parts do not affect the blending result. Moreover, the top and right borders are also classified as objects, which eliminates the black line shadows in the generated sprite. Note that the tennis ball is also classified as an object. These generated segmentation masks are employed in the second pass of sprite generation.

2.4 The Second Pass of Sprite Generation

The generated segmentation masks are employed in the second pass of sprite generation. The second pass is similar to the first pass with some modifications: the automatically generated segmentation masks are employed in the global motion estimation and in the blending process. These modifications remove the effect caused by treating object pixels as background and increase the fidelity of the generated sprite.

In the global motion estimation, the generated segmentation masks serve as a classification of object pixels. All feature points are checked against the masks, and feature points classified as moving objects are removed. The global motion parameters are then estimated by the Levenberg-Marquardt algorithm, as in the first pass. The accuracy of the estimated parameters should increase since the effect of object pixels is reduced.

The sprite is then blended using the newly estimated parameters. Since segmentation masks are available in the second pass, reliability-based blending is adopted instead of the averaging blending employed in the first pass. Reliability-based blending prevents moving objects that are not segmented correctly from being blended into the final sprite. The sprite generated in the second pass is output as the final sprite.

2.5 Experimental Results

Fig. 2.4 shows the sprites of the video sequence ‘stefan’ generated by different methods. Fig. 2.4(a) is generated by the MPEG-4 method without segmentation masks; it is also the coarse sprite generated in the first pass of sprite generation. The masks used to generate Fig. 2.4(b) are obtained automatically by the proposed segmentation scheme.

Fig. 2.5 shows one of the frames reconstructed by each method. As stated before, the sprite generated without masks contains shadows, circled in Fig. 2.4(a), caused by wrongly blending the player into the sprite. These shadows are successfully removed in the sprite generated using the masks produced automatically by our method. Manually segmented masks are employed in Fig. 2.4(c) and Fig. 2.5(c) for comparison. The sprites generated using automatically and manually segmented masks are perceptually indistinguishable.

Fig. 2.4 Generated sprites of different methods. (a) The first pass. (b) Two-pass generation with automatically generated segmentation masks.

Fig. 2.5 Reconstructed frames of different methods. (a) The first pass. (b) Two-pass generation with automatically generated segmentation masks.

CHAPTER 3 A NEW APPROACH FOR SPRITE GENERATION WITHOUT SEGMENTATION MASKS

In this chapter, we propose a sprite generator that does not use segmentation masks. It consists of a modified feature point selection method and a novel intelligent blending method.

3.1 Problems of Segmentation Masks

To avoid blurring the sprite, pixels of moving objects must be excluded from the blending. If the segmentation is perfect, the averaging blending provided in MPEG-4 can achieve excellent quality. Otherwise, the generated sprite is blurred around the boundaries of moving objects because some object pixels are treated as background. Since a perfect segmentation is impossible, conventional sprite generation methods often use the reliability-based blending concept of [18] to solve this problem.

In the reliability-based blending strategy, a frame is divided into reliable, unreliable, and undefined regions according to the segmentation masks. Pixels marked as objects in the masks are classified as undefined, and pixels near mask borders or frame borders (within a given distance) are classified as unreliable. The remaining pixels are classified as reliable. The reliable and unreliable pixels are average-blended separately, and the blended pixels with the highest reliability are chosen for the sprite. The undefined pixels do not contribute to the sprite blending. The given distance from the mask border must be large enough to cover all segmentation faults, or the generated sprite will have ghost-like shadows in some places. However, it is hard to decide this distance automatically. In this dissertation, we provide a sprite generator that avoids the above-mentioned problems.
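The reliability-based strategy described above can be sketched as follows. This is a simplified illustration, not the method of [18] itself: the distance is ASSUMED to be measured as a square (Chebyshev) neighborhood, and the names `classify_pixels` and `blend_pixel` are hypothetical.

```python
RELIABLE, UNRELIABLE, UNDEFINED = 2, 1, 0

def classify_pixels(mask, dist):
    """Label each pixel of a frame from its binary object mask (1 = object)."""
    h, w = len(mask), len(mask[0])
    labels = [[RELIABLE] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                labels[y][x] = UNDEFINED          # object pixel
            elif (min(y, x, h - 1 - y, w - 1 - x) < dist or
                  any(mask[j][i]
                      for j in range(max(0, y - dist), min(h, y + dist + 1))
                      for i in range(max(0, x - dist), min(w, x + dist + 1)))):
                labels[y][x] = UNRELIABLE         # near frame or mask border
    return labels

def blend_pixel(samples):
    """Blend the (value, label) observations of one sprite pixel.

    Reliable and unreliable observations are averaged separately and the
    average with the highest available reliability is returned."""
    for level in (RELIABLE, UNRELIABLE):
        values = [v for v, label in samples if label == level]
        if values:
            return sum(values) / len(values)
    return None   # only undefined observations: the pixel stays empty

# Toy frame: one object pixel in the middle of a 9x9 frame.
mask = [[0] * 9 for _ in range(9)]
mask[4][4] = 1
labels = classify_pixels(mask, dist=1)
print(labels[4][4], labels[4][5], labels[2][2])
print(blend_pixel([(100, RELIABLE), (200, UNRELIABLE),
                   (110, RELIABLE), (250, UNDEFINED)]))
```

The sketch makes the difficulty noted above visible: the quality of the result depends on `dist`, which must be chosen by hand to cover the segmentation faults.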

The proposed method is based on the MPEG-4 framework shown in Fig. 1.1, but the need for segmentation masks is removed. A balanced feature point extractor with object point removal is proposed. With the proposed feature points, the precision of the estimated global motion parameters is increased significantly. A new blending strategy that does not need segmentation masks is also proposed, in which moving objects are excluded from blending by a counting scheme. The proposed method provides higher visual quality of the generated sprite than existing methods, and the average PSNR of the reconstructed backgrounds is increased slightly.

3.2 The Proposed Sprite Generator

In the proposed sprite generator, a two-stage GME is provided with a novel feature point extraction method to get accurate global motion parameters. With the estimated parameters, each input frame is warped. An intelligent blending strategy is then presented to blend the warped frame to form a sprite. The details of the proposed generator are described as follows.

3.2.1 Global motion estimation

The aim of global motion estimation is to obtain an accurate estimate of the camera motion between the current frame and a reference image, e.g., the current sprite. In this dissertation, we use the perspective transformation to model the camera motion as follows:

$$
x' = \frac{m_1 x + m_2 y + m_3}{m_7 x + m_8 y + 1}, \qquad
y' = \frac{m_4 x + m_5 y + m_6}{m_7 x + m_8 y + 1}
$$

where (x,y) and (x',y') denote the coordinates of a pixel before and after the camera motion, respectively, and m1, m2, …, m8 are the transformation parameters, referred to as the global motion parameters.
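As a small illustration of the eight-parameter perspective model, the sketch below maps one pixel coordinate. The ordering of the parameters (m1 to m3 for x', m4 to m6 for y', m7 and m8 in the common denominator) is an assumption based on the usual form of this model, and `warp_point` is a hypothetical helper name.

```python
def warp_point(x, y, m):
    """Apply the eight-parameter perspective model to one pixel coordinate.

    m = (m1, ..., m8); the ordering of the parameters is an assumption."""
    m1, m2, m3, m4, m5, m6, m7, m8 = m
    denom = m7 * x + m8 * y + 1.0
    return (m1 * x + m2 * y + m3) / denom, (m4 * x + m5 * y + m6) / denom

identity = (1, 0, 0, 0, 1, 0, 0, 0)
translation = (1, 0, 5, 0, 1, -3, 0, 0)   # shift by (+5, -3)
print(warp_point(10, 20, identity))       # (10.0, 20.0)
print(warp_point(10, 20, translation))    # (15.0, 17.0)
```

With m7 = m8 = 0 the model reduces to an affine transformation, which is why pure translation and zoom are special cases of the same parameter set.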

The global motion parameters are estimated by the two-stage GME scheme described in Section 1.2.1. The minimization problem described in Section 1.2.1.2 is solved by the Levenberg-Marquardt algorithm [15]. To increase the estimation speed, only selected feature points take part in the minimization. Pixels of moving objects should not be selected as feature points. Since we want a sprite generator that does not need segmentation masks, a novel feature point selection method that excludes the pixels of moving objects must be developed.

3.2.2 Feature point extraction

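The iterative minimization of the gradient descent method is time consuming. To reduce the time complexity, only selected feature points in the current frame are employed when computing the registration error. Like other sprite generation methods, our method selects feature points according to their Hessian values, defined by

$$
H(x, y) = \frac{\partial^2 I}{\partial x^2}\,\frac{\partial^2 I}{\partial y^2} - \left( \frac{\partial^2 I}{\partial x\,\partial y} \right)^2
$$

where I(x,y) denotes the intensity of the current frame.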

Points whose Hessian values are local maxima or minima are considered as feature points. An example of using the Hessian value to extract feature points is shown in Fig. 3.1. The gray level of each pixel in Fig. 3.1(b) represents the absolute Hessian value of the corresponding pixel in Fig. 3.1(a).
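A minimal sketch of computing a Hessian value at one pixel is given below. It assumes that the Hessian value is the determinant of the 2×2 matrix of second derivatives of the image intensity, and it approximates the derivatives with central finite differences; both the discretization and the name `hessian_value` are illustrative assumptions.

```python
def hessian_value(img, x, y):
    """Determinant of the Hessian of the intensity at an interior pixel,
    using central finite differences (an assumed discretization)."""
    ixx = img[y][x + 1] - 2 * img[y][x] + img[y][x - 1]
    iyy = img[y + 1][x] - 2 * img[y][x] + img[y - 1][x]
    ixy = (img[y + 1][x + 1] - img[y + 1][x - 1]
           - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
    return ixx * iyy - ixy * ixy

# Two synthetic intensity surfaces: a bowl (x^2 + y^2) and a saddle (x * y).
bowl = [[x * x + y * y for x in range(5)] for y in range(5)]
saddle = [[x * y for x in range(5)] for y in range(5)]
print(hessian_value(bowl, 2, 2))    # 4.0
print(hessian_value(saddle, 2, 2))  # -1.0
```

Blob-like structures give large positive values and saddle-like structures (such as corners) give large negative values, which is consistent with using the absolute Hessian value when ranking candidate points.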

Conventional methods choose the pixels with the largest absolute Hessian values as feature points. However, pixels with large absolute Hessian values are not spread uniformly over the image, as shown in Fig. 3.1(b). Fig. 3.1(c) shows the feature points extracted by these methods; they are concentrated in the upper half of the image. This degrades the accuracy of the estimated parameters because the registration is focused only on the upper half of the image, and the degradation becomes more serious when the number of feature points is small. The sprite generated from these feature points is shown in Fig. 3.3(a). Although the upper half of the sprite looks good, the white lines in the lower half are not registered correctly and look blurred (see Fig. 3.3(c)); the reason is that no points on the white lines are selected as feature points. To overcome this problem, the feature points must be selected uniformly. A balanced feature point extraction method is proposed and described as follows.

Fig. 3.1 Feature point extraction based on Hessian value. (a) Original image. (b) Absolute Hessian values of (a). (c) Feature points extracted by conventional methods. (d) Feature points extracted by the proposed method.

For an image of width W and height H, the border area of width B is excluded first. The rest is divided into 256 non-overlapping blocks. For each block, the variance of the gray values is calculated to test its homogeneity: a block is classified as homogeneous if its variance is smaller than a preset threshold TV. Feature points are extracted uniformly from the non-homogeneous blocks to avoid the aperture problem. Suppose that we want to extract N feature points from K non-homogeneous blocks; then the N/K pixels with the largest absolute Hessian values are chosen in each non-homogeneous block. Feature points extracted from Fig. 3.1(a) using the proposed method are shown in Fig. 3.1(d). In contrast to the result of the conventional method (see Fig. 3.1(c)), the feature points of the proposed method are distributed more evenly. Several points on the white line in the lower half of the frame are extracted as feature points, so the white line is registered well, which significantly improves the visual quality of the generated sprite.
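The balanced extraction can be sketched as follows. The sketch assumes that the 256 blocks form a 16×16 grid (the `grid` parameter, reduced to 2 in the toy example), that N/K is rounded down with at least one point per block, and that the Hessian values are precomputed; the function name and all parameter names are hypothetical.

```python
def balanced_feature_points(gray, hess, n_points, border, t_v, grid=16):
    """Pick roughly n_points / K points from each of the K non-homogeneous
    blocks. Points are returned as (x, y) tuples."""
    h, w = len(gray), len(gray[0])
    bh, bw = (h - 2 * border) // grid, (w - 2 * border) // grid
    blocks = []
    for by in range(grid):
        for bx in range(grid):
            pixels = [(x, y)
                      for y in range(border + by * bh, border + (by + 1) * bh)
                      for x in range(border + bx * bw, border + (bx + 1) * bw)]
            values = [gray[y][x] for x, y in pixels]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            if variance >= t_v:               # keep non-homogeneous blocks only
                blocks.append(pixels)
    if not blocks:
        return []
    per_block = max(1, n_points // len(blocks))
    points = []
    for pixels in blocks:
        # Largest absolute Hessian values first.
        pixels.sort(key=lambda p: abs(hess[p[1]][p[0]]), reverse=True)
        points.extend(pixels[:per_block])
    return points

# Toy 8x8 image: the top-left quadrant is flat, the rest is a checkerboard.
gray = [[10 if (x < 4 and y < 4) else 100 * ((x + y) % 2) for x in range(8)]
        for y in range(8)]
hess = [[y * 8 + x for x in range(8)] for y in range(8)]   # stand-in values
points = balanced_feature_points(gray, hess, n_points=6, border=0,
                                 t_v=100.0, grid=2)
print(points)  # two points from each of the three textured quadrants
```

Because every textured block contributes the same number of points, a region with moderate Hessian values (such as the white lines in the lower half of the frame) is still represented.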

However, Fig. 3.1(d) also shows that some feature points are located on the player, which reduces the accuracy of the estimated GMPs. As mentioned previously, we should avoid taking points on moving objects as feature points. Since the motion of moving objects usually differs from the motion of the background, this provides a clue for removing these outliers.

Traditional translation-based motion estimation is applied to each feature point to find its motion vector relative to the previous frame. To reduce the search time, a global translation is first found from some selected feature points, and a full search around the global translation is then performed for each feature point. To find the global translation, a 17×17 block centered at each selected feature point is used, and full-search motion estimation with a large search window (64×64) is performed on the block. To speed up the search, only the 100 feature points with the largest absolute Hessian values among all feature points found previously are employed. The occurrences of the 100 estimated motion vectors are counted, and the motion vector with the highest occurrence is taken as the global translation. The motion vectors of all feature points are then found by searching around the global translation with a smaller search window (17×17).
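The voting step for the global translation can be sketched as below. This is a toy-sized illustration: the matching criterion is assumed to be the sum of squared differences, the block and search sizes are much smaller than the 17×17 and 64×64 values in the text, and the sign convention (a block at (x, y) in the current frame matches the block at (x + dx, y + dy) in the previous frame) as well as all function names are assumptions.

```python
import random
from collections import Counter

def block_error(cur, prev, cx, cy, dx, dy, half):
    """Sum of squared differences between the block centred at (cx, cy) in
    the current frame and the block displaced by (dx, dy) in the previous one."""
    total = 0
    for j in range(-half, half + 1):
        for i in range(-half, half + 1):
            diff = cur[cy + j][cx + i] - prev[cy + dy + j][cx + dx + i]
            total += diff * diff
    return total

def best_vector(cur, prev, cx, cy, half, search, center=(0, 0)):
    """Full search in a (2*search+1)^2 window around `center`."""
    h, w = len(prev), len(prev[0])
    best, best_err = None, None
    for dy in range(center[1] - search, center[1] + search + 1):
        for dx in range(center[0] - search, center[0] + search + 1):
            # Skip displacements that move the block outside the previous frame.
            if not (half <= cx + dx < w - half and half <= cy + dy < h - half):
                continue
            err = block_error(cur, prev, cx, cy, dx, dy, half)
            if best_err is None or err < best_err:
                best, best_err = (dx, dy), err
    return best

def global_translation(cur, prev, points, half, search):
    """Most frequent motion vector among the selected feature points."""
    votes = Counter(best_vector(cur, prev, x, y, half, search) for x, y in points)
    return votes.most_common(1)[0][0]

# Toy frames: random texture, current frame is the previous one shifted.
rng = random.Random(0)
prev = [[rng.randrange(256) for _ in range(24)] for _ in range(24)]
cur = [[prev[(y + 1) % 24][(x + 2) % 24] for x in range(24)] for y in range(24)]
points = [(8, 8), (12, 10), (15, 14)]
g = global_translation(cur, prev, points, half=2, search=3)
print(g)  # (2, 1)
# Refinement for one point with a smaller window around the global translation.
print(best_vector(cur, prev, 12, 10, half=2, search=1, center=g))
```

Searching the large window only for a few strong feature points and then refining every point in a small window around the vote winner is what keeps the total search cost low.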

Let (dx,dy) be the estimated motion vector. The feature point is considered as an outlier if the mean-squared error (MSE) between the original and the motion-compensated blocks is larger than a preset threshold TO, i.e.,

$$
\frac{1}{N_B} \sum_{(x,y) \in B} \left[ I(x, y) - I'(x + dx,\, y + dy) \right]^2 > T_O \tag{3.3}
$$

where B is the block centered at the feature point, NB is the number of pixels in the block, and I(x,y) and I'(x,y) are the current frame and the previous frame, respectively. Since objects are assumed to move differently from the background, their best motion vectors are usually not near the global translation (a rough approximation of the background motion), and their MSEs are likely to be higher with inaccurate motion vectors. They are therefore detected as outliers by Eq. (3.3).
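The outlier test can be sketched as follows. The sketch assumes that the MSE compares the block in the current frame with the block displaced by (dx, dy) in the previous frame; the block half-size `half`, the threshold value, and the name `is_outlier` are illustrative assumptions.

```python
def is_outlier(cur, prev, x, y, dx, dy, t_o, half=8):
    """Return True if the MSE of the motion-compensated block exceeds t_o.

    half=8 corresponds to a 17x17 block; the block size is an assumption."""
    total, count = 0.0, 0
    for j in range(-half, half + 1):
        for i in range(-half, half + 1):
            diff = cur[y + j][x + i] - prev[y + dy + j][x + dx + i]
            total += diff * diff
            count += 1
    return total / count > t_o

# Toy frames: a horizontal ramp that moves by one pixel between frames,
# plus a 3x3 patch in the current frame that does not follow this motion.
prev = [[10 * x for x in range(13)] for _ in range(12)]
cur = [[10 * (x + 1) for x in range(12)] for _ in range(12)]
for y in range(7, 10):
    for x in range(7, 10):
        cur[y][x] = 0          # "object" pixels
background_point = is_outlier(cur, prev, 3, 3, dx=1, dy=0, t_o=100.0, half=1)
object_point = is_outlier(cur, prev, 8, 8, dx=1, dy=0, t_o=100.0, half=1)
print(background_point, object_point)  # False True
```

A point that follows the background motion is matched almost exactly near the global translation, while a point on an independently moving object leaves a large residual and is rejected.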

Fig. 3.2(a) illustrates the object pixels found among the feature points shown in Fig. 3.1(d). The feature points on the player are detected successfully. These object pixels are removed from the original feature points, and the final feature points are shown in Fig. 3.2(b).

Fig. 3.2 An example of outlier removal. (a) Detected object pixels in Fig. 3.1(d). (b) The feature points after removing outliers from Fig. 3.1(d).

Figs. 3.3(a) and (b) show the sprites generated using the conventional method and the proposed balanced feature point extraction method, respectively. The same number of feature points is used in both methods. Close-up views of the white lines in Figs. 3.3(a) and (b) are shown in Figs. 3.3(c) and (d), respectively. The white lines in the sprite generated by the proposed method are registered very well, while the same area in the sprite generated by the conventional method looks blurred. The average PSNR of the reconstructed backgrounds is 26.25 dB for both the non-balanced and the balanced feature points. Although the PSNR is the same, the proposed method achieves much better visual quality.


Fig. 3.3 Two examples to show sprites generated using different feature points with the
