CHAPTER 2 AUTOMATIC GENERATION OF SEGMENTATION MASKS FOR
3.2 The Proposed Sprite Generator
3.2.3 Intelligent blending
The precision of the segmentation mask affects the quality of the generated sprite. Although the unreliable region around the segmentation mask boundary reduces the effect of segmentation errors in the reliability-based blending, some errors still cannot be covered. Fig. 3.4 shows two frames with segmentation errors. The left foot of the player is not completely segmented in either frame, which leaves ghostlike shadows after blending. A close view of the shadows in the blended sprite is shown in Fig. 3.5(a). Increasing the distance from the mask border used in the reliability-based blending scheme may solve this problem, but it introduces another one. A larger distance covers more segmentation errors, but it also increases the number of pixels classified as unreliable, so an unreliable pixel is less likely to be replaced by a reliable one. Reliable and unreliable pixels are blended by averaging separately. If an unreliable pixel is never replaced by a reliable one, the blending reduces to ordinary averaging. Fig. 3.5(b) shows a close view of the blended sprite using a larger distance. The right border, which is the boundary between the reliable and unreliable regions, is blurred. Choosing a suitable distance is therefore difficult. To avoid this problem, an intelligent blending strategy that does not require segmentation masks is proposed here.
(a) (b)
Fig. 3.4 Segmentation errors in player’s feet. (a) Frame 255 with left foot incompletely segmented. (b) Frame 258 with both feet incompletely segmented.
(a) (b)
(c) (d)
Fig. 3.5 Two examples to show the blended sprites using different methods.
(a) The first example of the generated sprite based on the reliability-based blending.
(b) The second example of the generated sprite based on the reliability-based blending.
(c) The first example of the generated sprite based on the intelligent blending.
(d) The second example of the generated sprite based on the intelligent blending.
The proposed intelligent blending strategy is based on the observation that, among the pixels in the video frames that map to the same sprite location, most are background pixels and only a few belong to moving objects. Because the objects are moving, the object pixels come from different parts of an object, or even from different objects, so their intensities vary widely. In contrast, the background pixels represent the same background point in the real world, so their intensities should be similar, and they can be identified by counting how often they occur.
Fig. 3.6 shows a flowchart of the proposed intelligent blending scheme. Let X be the incoming pixel and S be the current sprite pixel. A candidate pixel C stores a candidate for the background value. Two counters, CS and CC, store the number of pixels blended into S and C, respectively. Initially, S and C are undefined and both counters are set to zero. A similarity check is performed on the incoming pixel by calculating two absolute differences:
\[ D_S = |X - S|, \qquad D_C = |X - C|. \tag{3.4} \]
Fig. 3.6 Flowchart of the intelligent blending.
At the start of blending, S is undefined, so it is set to the gray value of the first incoming pixel and CS is set to one. Once S is filled, the similarity check is conducted for each subsequent pixel. If an incoming pixel is unlike S (i.e., DS is greater than a preset threshold T) and C is undefined, C is set to the gray value of the incoming pixel and CC is set to one. Otherwise, if DS is smaller than both T and DC, the incoming pixel is considered a background pixel and is blended into S, and CS is increased by 1. If DC is smaller than both T and DS, the incoming pixel is blended into the candidate C, and CC is increased by 1. The two counters are compared whenever a pixel is blended into the candidate C. If the candidate counter is larger than the sprite counter, the sprite and the sprite counter are replaced by the candidate and the candidate counter, and the candidate and its counter are then reset to undefined and zero, respectively. This replacement follows from the observation described above: since the candidate appears more frequently than the current sprite, the candidate is more likely to be the background.
If both DS and DC are larger than T, the candidate is replaced by the incoming pixel and the candidate counter is set to one. During a series of incoming object pixels, the candidate is therefore replaced repeatedly until a background pixel becomes the candidate. If background pixels then appear consecutively, the candidate counter begins to accumulate.
The boundaries of frames are often unreliable and should be excluded from blending. In the proposed method, the effects of boundaries are eliminated by counting boundary pixels only once. Pixels within a preset distance W of the frame border are defined as boundary pixels. Boundary pixels are checked and blended into the sprite or the candidate like normal pixels, but the corresponding counter is not increased. This ensures that boundary pixels are replaced quickly when normal pixels arrive.
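The per-pixel update rules described above can be sketched in code. The following is a minimal Python sketch under stated assumptions, not the author's implementation: the names `PixelState` and `blend_pixel` are hypothetical, "blending" is assumed to be a running mean weighted by the counter (the text does not specify the blending operation), ties between DS and DC are resolved in favor of the sprite, and a boundary pixel is assumed to update the stored value without increasing the counter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PixelState:
    """State kept for one sprite location (hypothetical container)."""
    s: Optional[float] = None  # current sprite value S (None = undefined)
    c: Optional[float] = None  # candidate value C (None = undefined)
    cs: int = 0                # counter CS: pixels blended into S
    cc: int = 0                # counter CC: pixels blended into C


def blend_pixel(state: PixelState, x: float, t: float,
                is_boundary: bool = False) -> PixelState:
    """Update one sprite location with the incoming gray value x."""
    inc = 0 if is_boundary else 1  # assumption: boundary pixels do not count

    # First pixel at this location: fill the sprite.
    if state.s is None:
        state.s, state.cs = float(x), 1
        return state

    ds = abs(x - state.s)
    dc = abs(x - state.c) if state.c is not None else float("inf")

    if ds < t and ds <= dc:
        # Similar to the sprite: blend into S (assumed running mean).
        state.s = (state.s * state.cs + x) / (state.cs + 1)
        state.cs += inc
    elif dc < t:
        # Similar to the candidate: blend into C, then compare counters.
        state.c = (state.c * state.cc + x) / (state.cc + 1)
        state.cc += inc
        if state.cc > state.cs:
            # The candidate occurs more often: promote it to the sprite.
            state.s, state.cs = state.c, state.cc
            state.c, state.cc = None, 0
    else:
        # Unlike both S and C (or C undefined): start a new candidate.
        state.c, state.cc = float(x), 1
    return state
```

For example, feeding the gray values 200, 100, 100, 100 with T = 10 first stores the object value 200 as the sprite, then promotes the candidate value 100 as soon as its counter exceeds the sprite counter.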
Two examples of the blending results using the proposed intelligent blender are shown in Figs. 3.5(c) and 3.5(d). In contrast to the results of the reliability-based blender shown in Figs. 3.5(a) and 3.5(b), the ghostlike shadows are eliminated and the sprite border is clear and sharp. To make the results easier to examine, a contrast-enhanced version of Fig. 3.5 is provided in Fig. 3.7.
(a) (b)
(c) (d)
Fig. 3.7 Contrast-enhanced version of Fig. 3.5. (a) Contrast-enhanced version of Fig. 3.5(a). (b) Contrast-enhanced version of Fig. 3.5(b). (c) Contrast-enhanced version of Fig. 3.5(c). (d) Contrast-enhanced version of Fig. 3.5(d).
3.3 Experimental Results
The aim of sprite generation is to reconstruct the background from the generated sprite as faithfully as possible. The quality of the reconstructed background for each frame is usually measured by PSNR. In most cases PSNR describes the quality well, but in rare cases it can be misleading. A comparison of the visual quality of the generated sprites is therefore also performed.
The GMPs are estimated using the two-stage GME with the proposed balanced feature points described in Sections 3.2.1 and 3.2.2. Sprites are then generated by different blending strategies using the same GMPs. Backgrounds are reconstructed from the generated sprites and their PSNRs are calculated. Pixels of moving objects are excluded from the PSNR calculation, since we are measuring the quality of the reconstructed backgrounds. The quality of a generated sprite is measured by the average PSNR of the reconstructed backgrounds. The generated sprites are shown in Fig. 3.8. Fig. 3.8(a) is generated by the averaging blending strategy employed in the MPEG-4 VM without segmentation masks. The result using the averaging blending strategy with manually segmented masks is shown in Fig. 3.8(b). Fig. 3.8(c) shows the sprite generated using the reliability-based blending strategy based on the rough segmentation masks extracted via the method developed by Lu et al. [19]. Finally, the sprite generated using the proposed intelligent blender is shown in Fig. 3.8(d). Fig. 3.9 shows one reconstructed background for each of these strategies.
Fig. 3.8 The generated sprites. (a) Sprite generated using averaging blending without segmentation masks. (b) Sprite generated using averaging blending based on manually segmented masks. (c) Sprite generated using reliability-based blending. (d) Sprite generated using intelligent blending.
Since all moving objects are blended, the sprite generated by averaging blending has shadows in several places, which are obvious in the reconstructed frame shown in Fig. 3.9(a). However, if perfect manual masks are provided, the shadows are eliminated completely and the averaging blending achieves excellent results, as shown in Fig. 3.9(b). Most shadows can be removed using the reliability-based blending, except for some poorly segmented parts visible in the lower half of Fig. 3.9(c). The reconstructed background using the proposed intelligent blending is shown in Fig. 3.9(d). The sprite generated by our method is perceptually the same as the result of averaging blending with perfect manual masks.
(a) (b)
(c) (d)
Fig. 3.9 The reconstructed backgrounds. (a) Averaging without segmentation masks. (b) Averaging with manually segmented masks.
(c) Reliability-based blending. (d) Intelligent blending.
A quantitative comparison in terms of PSNR is illustrated in Fig. 3.10. When calculating the PSNR of a frame, only the background parts of the frame take part in the calculation, because the sprite contains only the background information of a video sequence. Manually segmented masks are employed to exclude the object parts of the frame.
The results of the averaging blending with and without manually segmented masks are plotted as dash-dotted and dotted lines, respectively, in Fig. 3.10. The result without masks is degraded by the shadows of moving objects and by the frame borders, and has a low average PSNR of 26.23 dB. With manually segmented masks, the averaging blending shows superior results not only in visual quality but also in measured PSNR. Its average PSNR is 28.38 dB, the best result in our tests.
The reliability-based and intelligent blending strategies are plotted as dashed and solid lines, respectively. The reliability-based blending has an average PSNR of 28.20 dB. The proposed intelligent blending has an average PSNR of 28.29 dB, which is slightly higher than that of the reliability-based blending and close to that of the averaging blending with perfect masks.
These experiments show that the proposed method can generate a sprite of high visual quality without needing any segmentation mask.
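The background-only PSNR used in this comparison can be illustrated with a short Python sketch. It is an illustration under assumptions rather than the code used in the experiments: the function name `background_psnr` is hypothetical, frames are assumed to be 8-bit gray images stored as nested lists (peak value 255), and the object mask is assumed to be nonzero on moving-object pixels.

```python
import math


def background_psnr(frame, recon, object_mask, peak=255.0):
    """PSNR between a frame and its reconstructed background,
    computed over background pixels only (mask value 0)."""
    sq_err = 0.0
    count = 0
    for row_f, row_r, row_m in zip(frame, recon, object_mask):
        for f, r, m in zip(row_f, row_r, row_m):
            if m:
                # Object pixel: excluded from the measurement.
                continue
            sq_err += (f - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("the mask leaves no background pixels")
    mse = sq_err / count
    if mse == 0:
        return math.inf  # identical backgrounds
    return 10.0 * math.log10(peak ** 2 / mse)
```

Averaging this value over all frames of a sequence gives a per-sprite figure comparable in kind to the average PSNRs quoted above.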
Fig. 3.10 PSNR comparison of different blending strategies.
The result of Lu et al.’s sprite generator [21], shown in Fig. 3.11, is also quoted for comparison. Compared with it, the proposed generator, whose result is shown in Fig. 3.8(d), produces a better sprite. The average PSNR of Lu et al.’s generator is 23.1 dB, which is much lower than that of the proposed generator (28.29 dB). To make the comparison clearer, close views of three parts of the generated sprites are shown in Fig. 3.12. From part 1, shown in Figs. 3.12(a) and (b), we can see that the sprite generated using [21] is seriously skewed. Figs. 3.12(c) and (d) show part 2: the white lines in Fig. 3.12(d) are registered well, but those in Fig. 3.12(c) are not. These two faults are due to the inaccuracy of the estimated GMPs and do not occur in the result of our generator. Part 3, shown in Figs. 3.12(e) and (f), also demonstrates that our method is superior to Lu et al.’s.
Fig. 3.11 The generated sprite of Lu et al.’s work [21].
Fig. 3.12 Close views of the generated sprites.
(a) Part 1 of Lu et al.’s work [21]. (b) Part 1 of the proposed method.
(c) Part 2 of Lu et al.’s work [21]. (d) Part 2 of the proposed method.
(e) Part 3 of Lu et al.’s work [21]. (f) Part 3 of the proposed method.
CHAPTER 4
A NEW APPROACH FOR FAST MULTIPLE SPRITES GENERATION
Farin et al. have proposed a technique called multiple sprites, or multi-sprites, that divides a video sequence into several sub-sequences. The background of each sub-sequence is stored in its own sprite. Using multiple sprites reduces the geometric distortion and the storage required by a single large sprite. However, Farin et al.’s technique uses exhaustive searches to find the optimum sub-sequences and the optimum reference frame of each sub-sequence, which makes the algorithm very time-consuming.
In this dissertation, a fast multiple-sprite partition method is proposed. The proposed method reduces the time needed to find an applicable partition for multiple-sprite generation, and it also requires less memory during the search than the optimal partition method. Experimental results show that the coding cost of the sprites generated using the proposed method is near-optimum, i.e., only slightly higher than that of the optimal method.
The proposed method consists of two algorithms: a video partition algorithm and a reference frame selection algorithm. The video partition algorithm is developed based on the characteristics of frame translation and scaling. The frame translation, which is caused by camera motion and is also called the global translation, represents the movement of the background of a frame in the x and y directions relative to a reference frame. The global translations across frames are accumulated to estimate the position of a frame projected into a sprite. Since the geometric distortion depends on the accumulated global translation relative to the reference frame, the accumulated global translation provides a good measure of the distortion. The effect of frame scaling caused by camera zoom-in or zoom-out can be handled in a similar way.
A reference frame selection algorithm is developed based on the idea of Messey and Bender’s work [37]. In their work, the middle frame of a video sequence is suggested as the reference frame, since its background is more likely to be located at the center of the generated sprite. The proposed algorithm extends this idea by taking as the reference frame the frame whose background is most likely to lie at the center of the corresponding sprite.
4.1 Proposed Feasible Partition Points Selecting and Reference Frames Finding Methods
Farin et al.’s method achieves optimal partition results, but at high computational complexity even when their efficient algorithm is applied. If the number of possible combinations of sub-sequences and reference frames can be reduced, the computational complexity will be reduced as well. In the following sections, we first analyze the accumulated translations and scalings. Based on them, candidate partition points and reference frames are then located. Finally, a method is provided to obtain a near-optimal partition whose total sprite area is similar to that of Farin et al.’s method.
4.1.1 Analysis of Accumulated Translation
As mentioned previously, the geometric projection distortion comes from camera rotation. Farin et al.’s experiments [38] also show that the sprite area grows exponentially as the camera pan angle increases. The selection of partition frames must therefore be closely related to the effect of camera rotation. To capture this effect, the global translations between video frames are calculated and analyzed.
The global translation between two frames is a measure of background displacement. Let frame i and frame j be the two frames to be measured, and let $\vec{p} = (x, y)$ be a pixel in frame j. The displacement of $\vec{p}$ relative to frame i is defined as
\[ \vec{d}_{p} = T_{ji}(\vec{p}) - \vec{p}, \tag{4.1} \]
where $T_{ji}$ is the geometric transformation applied in frame warping, which converts pixel locations from the coordinates of frame j to those of frame i. Because of the geometric transformation, the displacements of the pixels in frame j are not uniform. To obtain a fast estimate of the frame displacement, the average of the four corner displacements is used, that is,
\[ \vec{t}_{ji} = \frac{1}{4}\left(\vec{d}_{LT} + \vec{d}_{RT} + \vec{d}_{LB} + \vec{d}_{RB}\right), \]
where LT, RT, LB, and RB are the left-top, right-top, left-bottom, and right-bottom pixels of frame j. The vector $\vec{t}_{ji}$ is considered the global translation between frame i and frame j.
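The corner-averaging estimate can be sketched in Python as follows. This is a sketch under assumptions: the text only speaks of a geometric transformation, so representing it as a 3x3 projective matrix (nested lists) is an assumption, and the names `warp_point` and `global_translation` are hypothetical.

```python
def warp_point(h, x, y):
    """Apply a 3x3 projective transform h (nested lists) to the point (x, y)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


def global_translation(h_ji, width, height):
    """Average displacement of the four frame corners under h_ji,
    which maps coordinates of frame j to coordinates of frame i."""
    corners = [
        (0, 0),                   # left-top
        (width - 1, 0),           # right-top
        (0, height - 1),          # left-bottom
        (width - 1, height - 1),  # right-bottom
    ]
    dx = dy = 0.0
    for x, y in corners:
        u, v = warp_point(h_ji, x, y)
        dx += u - x  # x component of the corner displacement
        dy += v - y  # y component of the corner displacement
    return dx / 4.0, dy / 4.0
```

For a pure translation the four corner displacements coincide, so the estimate equals the translation itself; under a strong projective warp the corner displacements can differ greatly, and the result becomes very large when the projective denominator `w` approaches zero.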
The global translations of the sequence ‘stefan’ are illustrated in Fig. 4.1(a). The first frame is set as the reference frame. The calculated translations of the frames show their background displacements relative to the reference frame. A positive translation along the x-axis represents a frame displacement to the right, i.e., the frame is warped to the right side of the reference frame. A negative translation along the x-axis denotes a displacement to the left, and the y-axis translations are interpreted similarly. In Fig. 4.1(a), one can see that the view of the background moves to the right in the first thirty frames. It then moves left and crosses the first frame in the next ninety frames, moves to the right again until frame 205, and finally moves to the left in the remaining frames.
Fig. 4.1 Background displacements of ‘stefan’.
(a) Global translations. (b) Accumulated translations.
The figure also shows that the view of the background moves quickly to the left after frame 260. The magnitudes of the global translations increase rapidly from the order of a hundred pixels to over ten thousand pixels. In fact, the magnitudes can reach millions of pixels within a few frames and tend to infinity when a frame cannot be projected onto the first frame. These huge differences in magnitude result from the large geometric distortions that arise when a frame is projected onto the first frame from a view angle far away from it. They make it difficult to analyze the effect of camera translation from the global translations.
In order to analyze the camera translations efficiently, the accumulated translation is proposed. The accumulated translations are calculated from the local translations, which represent the translation from one frame to its adjacent previous frame and can be denoted as $\vec{t}_{j(j-1)}$ in Eq. (3.6). Since local translations are less geometrically distorted than the global translations, they can be combined into an accumulated translation, $\vec{a}_{ji}$, to represent a less-distorted global translation between frame j and frame i, that is
\[ \vec{a}_{ji} = \sum_{k=i+1}^{j} \vec{t}_{k(k-1)}. \]
Note that $\vec{a}_{ji}$ can be computed by a recursive procedure:
\[ \vec{a}_{ji} = \vec{a}_{(j-1)i} + \vec{t}_{j(j-1)}. \]
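The accumulation amounts to a running sum of the local translations. A minimal Python sketch follows, assuming frame 0 is the reference frame and that the hypothetical input list `local_translations` holds the local translation of frame k relative to frame k-1 at index k-1.

```python
def accumulate_translations(local_translations):
    """Return the accumulated translation of every frame relative to frame 0.

    local_translations[k - 1] is the (x, y) translation of frame k
    relative to frame k - 1. The result has one more entry than the
    input, starting with (0.0, 0.0) for the reference frame itself.
    """
    acc = [(0.0, 0.0)]
    for tx, ty in local_translations:
        ax, ay = acc[-1]
        # Recursive step: add the local translation to the previous total.
        acc.append((ax + tx, ay + ty))
    return acc
```

Because each step only adds one frame-to-frame translation, the values stay bounded by the actual camera motion, even for frames that cannot be projected directly onto the reference frame.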
The accumulated translations for the sequence ‘stefan’ are illustrated in Fig. 4.1(b). In contrast to the global translations, the magnitudes of the accumulated translations stay within a reasonable range. The details of the camera movement are still preserved, and the translations of all frames can be calculated, even for frames that cannot be projected onto the first frame.
4.1.2 Accumulated Translation Based Feasible Partition Point Finding Method
One of the goals of a multi-sprite partition algorithm is to find partition points that split the video sequence into several sub-sequences. Since geometric distortion is the major issue addressed by using multiple sprites, camera motion must be considered first. The following paragraphs demonstrate how to find the feasible partition points, FPX, based on the accumulated translations in the x-axis direction. The feasible partition points, FPY, based on the accumulated translations in the y-axis direction can be found in a similar way.
Fig. 4.2 shows the x-axis accumulated translations that were shown in Fig. 4.1(b). The camera pans to the right from the first frame to frame 29, then pans to the left until frame 107. When the camera begins panning left, the backgrounds of frames 29 to 69 pass back through an area that has already been recorded in the current sprite. Since this background area already exists in the sprite, merging these frames into the sprite does not expand the sprite area. Thus, frames 29 to 69 are not selected as candidate partition points, and frames 70 to 107 are considered candidate partition points.
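One way to read this example is that a frame becomes a candidate when its accumulated x translation leaves the range already covered by earlier frames, and that the first frame of each new run of candidates is a feasible partition point. The Python sketch below follows that reading. It is an interpretation, not the author's algorithm: the function name `feasible_partition_points` is hypothetical, the frame width is ignored, and no smoothing is applied, so jitter in real data would need extra handling.

```python
def feasible_partition_points(acc_x):
    """Return frame indices that start a new run of candidate frames.

    acc_x[j] is the accumulated x translation of frame j relative to
    frame 0. A frame is a candidate when its translation lies outside
    the range [lo, hi] covered by all earlier frames.
    """
    lo = hi = acc_x[0]
    points = []
    prev_candidate = True  # frame 0 opens the first piece
    for j in range(1, len(acc_x)):
        x = acc_x[j]
        is_candidate = x < lo or x > hi
        if is_candidate and not prev_candidate:
            # First frame of a new candidate piece.
            points.append(j)
        lo, hi = min(lo, x), max(hi, x)
        prev_candidate = is_candidate
    return points
```

On a toy pan such as right for three frames, left for five frames, then right again, the function reports the frames at which the view first passes the previous left and right extremes.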
Fig. 4.2 Finding feasible partition points.
Next, the camera pans to the right from frame 107 to frame 204. The backgrounds of frames 107 to 183 have already been recorded, so these frames are not considered candidate partition points. For similar reasons, frames 204 to 244 are not considered candidate partition points, and the frames after 245 are considered candidates. The candidate partition points are illustrated by thick lines in Fig. 4.2. The candidates can be grouped into several pieces, each of which covers a small range of view angles. Since the view angle range covered by a piece is small, the frames in the same piece should be merged into one sprite. The first frame in each piece is considered a feasible partition point. If the candidates are grouped into K pieces, there will be (K-1) feasible partition points, which produces $2^{K-1}$ combinations of possible partitions. In Fig. 4.2, the feasible partition points are frames 70, 183, and 245. Compared with Farin et al.’s result, which has an optimum partition point at frame