Chapter 3 Statistical Approaches
3.4 Time Series Analysis
As mentioned before, time series analysis is applied to generate smooth and continuous limb movements. In our work, ARMA is used to analyze the limb movements of a character over several previous time slices in order to estimate the motion trajectories. We then synthesize the current movements by following these estimated trajectories. The general form of a time series model is
\[ D_t = f_{TS}(D_{t-i}, a_{t-j};\ i = 1, \ldots, p;\ j = 1, \ldots, q) + a_t, \]
where $D_t$ denotes a univariate time series and $f_{TS}(\cdot)$ indicates an unknown function. $p$ and $q$ are non-negative integers. $a_t$ is a sequence of random variables assumed to come from a Normal distribution with mean zero and variance one.
$C$ is assumed to be a constant. Based on this general form, ARMA is formulated as follows:
\[ D_t = C + \sum_{i=1}^{p} \varphi_i D_{t-i} + \sum_{i=1}^{q} \kappa_i a_{t-i} + a_t, \]
where $\varphi_i$ and $\kappa_i$ are the coefficients of the model. It is similar to the time series model proposed by Chen et al. [13], except that they assumed the functional form of $f_{TS}(\cdot)$ to be a known linear function, whereas we assume that $f_{TS}(\cdot)$ is estimated nonparametrically, along with the Bayesian estimation of ERBFs already described in Section 3.1 and Section 3.3, in order to add smooth variety to the time series data. That is, we introduce the ERBF kernel into the original time series model, with parameters inferred by using RJMCMC. We further use this nonparametric time series model to forecast the current limb movements of the character from several of its previous poses.
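To make the forecasting step concrete, the following is a minimal sketch of a one-step forecast with the linear ARMA form only. It does not include the nonparametric ERBF estimate of the function or the RJMCMC inference. The function name `arma_forecast`, the argument layout, and the idea of treating one limb parameter (for example a joint angle) as the series are assumptions made for illustration.

```python
def arma_forecast(history, shocks, phi, kappa, c=0.0):
    """One-step ARMA(p, q) forecast of D_t from past values and past shocks.

    history: past observations D_{t-1}, D_{t-2}, ... stored oldest first.
    shocks:  past innovations a_{t-1}, a_{t-2}, ... stored oldest first.
    phi, kappa: AR and MA coefficients (assumed to be estimated already).
    """
    p, q = len(phi), len(kappa)
    if len(history) < p or len(shocks) < q:
        raise ValueError("need at least p past values and q past shocks")
    # phi[0] multiplies the most recent observation, phi[1] the one before, ...
    ar_part = sum(phi[i] * history[-1 - i] for i in range(p))
    ma_part = sum(kappa[i] * shocks[-1 - i] for i in range(q))
    # The current shock a_t has mean zero, so its expected contribution is 0.
    return c + ar_part + ma_part
```

Calling `arma_forecast` repeatedly, appending each forecast to `history`, would give a multi-step trajectory under the same assumptions.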
Chapter 4 Two-scale Image Abstraction
Generating a natural-looking 2D character animation from still photographs or paintings can be regarded as analyzing and simulating the character’s motion in the image. Note that a photograph contains redundant information; its raw format may have 16 or even 24 bits per color channel. Using all contours of the character extracted from raw photographs for statistical analysis is neither practical nor useful. Hence, it is necessary to obtain only the contours of interest. We advocate a two-scale abstraction similar to the progressive image abstraction proposed by Farbman et al. [23]. The proposed abstraction method is based on a two-scale decomposition of the image into a base layer, which encodes large-scale variations of pixels, and a detail layer. The base and detail layers are obtained by using an edge-preserving filter called the bilateral filter [64]. Given photographs, the bilateral filter is applied to obtain regions of interest. The contours of a character selected from the detail layer, which represent important features, and the contours of that character in the base layer are used to estimate the character’s motion. The redundant information in a photograph is thus filtered out by the bilateral filter, so that 2D characters from arbitrary still pictures can be animated by the proposed statistical approaches.
4.1 Color Space Transformation
In order to keep the regions of interest, we propose a two-scale image abstraction based on the bilateral filter. It decomposes the image into a base layer and a detail layer.
Important features can be preserved by using the contours selected from the detail layer together with the contours of the base layer to train the statistical model. In contrast, unimportant features can be filtered out by applying only the contours of the base layer to the model.
Tomasi and Manduchi [64] suggested computing the bilateral filter on a perceptually uniform feature space, such as CIELab [72]. Perceptually uniform means that a change of the same amount in a color value should produce a change of about the same visual importance. When bilateral filtering is carried out in CIELab color space, only perceptually similar colors are averaged together and only perceptually important details are preserved. The values used by CIELab color space are called L*, a*, and b*.
The L* component closely matches human perception of lightness; it is the luminance signal that measures the difference between light and dark. a* represents the difference between red and green. b* represents the difference between yellow and blue.
Unlike RGB color space, CIELab is based on a large body of psychophysical data from color-matching experiments performed by human observers, and is designed as a practical approximation to color processing in a model of the human visual system. In contrast, RGB models the output of physical devices rather than human visual perception. CIELab can thus be used to adjust the lightness contrast through the L* component. Furthermore, CIELab can make accurate color balance corrections by modifying the a* and b* components.
The three coordinates of CIELab represent the lightness L*, the position a* on a scale from pure green to pure red, and the position b* on a scale from pure blue to pure yellow. Note that L* = 0 yields black and L* = 100 indicates diffuse white. a* = -127 indicates pure green and a* = 127 indicates pure red. b* = -127 indicates pure blue and b* = 127 indicates pure yellow. The red/green and yellow/blue opponent channels are computed as differences of lightness transformations of cone responses. Note that the nonlinear relations for L*, a*, and b* are intended to mimic the nonlinear response of the human eye. Furthermore, uniform changes of components in the CIELab color space aim to correspond to uniform changes in perceived color, so the relative perceptual difference between any two colors in CIELab color space can be approximated by the Euclidean distance between them. This Euclidean distance is approximately proportional to the difference between the two colors as perceived by the human eye.
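The Euclidean distance mentioned above can be written down directly. The sketch below is only an illustration; the function name `lab_distance` is hypothetical, and the inputs are assumed to be (L*, a*, b*) triples.

```python
import math


def lab_distance(lab1, lab2):
    """Euclidean distance between two (L*, a*, b*) triples."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(lab1, lab2)))
```

For two colors with the same lightness that differ by 3 in a* and 4 in b*, the function returns a distance of 5.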
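CIELab can be computed via simple formulas that nonlinearly compress the coordinates ($X_{CIE}$, $Y_{CIE}$, $Z_{CIE}$) of the CIEXYZ color space, which is itself not particularly perceptually uniform. We therefore transform the initial RGB color space of a raw image into CIELab through CIEXYZ:
\[
\begin{bmatrix} X_{CIE} \\ Y_{CIE} \\ Z_{CIE} \end{bmatrix}
=
\begin{bmatrix}
0.412453 & 0.357580 & 0.180423 \\
0.212671 & 0.715160 & 0.072169 \\
0.019334 & 0.119193 & 0.950227
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\]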
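Then the three components of CIELab are obtained from CIEXYZ:
\[
L^* =
\begin{cases}
116\,(Y_{CIE}/Y_n)^{1/3} - 16 & \text{for } Y_{CIE}/Y_n > 0.008856 \\
903.3\,(Y_{CIE}/Y_n) & \text{otherwise}
\end{cases}
\]
\[
a^* = 500\,\bigl[f(X_{CIE}/X_n) - f(Y_{CIE}/Y_n)\bigr], \qquad
b^* = 200\,\bigl[f(Y_{CIE}/Y_n) - f(Z_{CIE}/Z_n)\bigr]
\]
\[
f(t) =
\begin{cases}
t^{1/3} & \text{for } t > 0.008856 \\
7.787\,t + 16/116 & \text{otherwise}
\end{cases}
\]
Here $X_n$, $Y_n$, and $Z_n$ are the tristimulus values corresponding to the reference white point.
They are specified respectively as 0.950456, 1.000000, and 1.088754.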
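The forward transformation from RGB through CIEXYZ to CIELab can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: RGB values are assumed to be floats in [0, 1], the matrix and piecewise functions are the standard ones matching the white point quoted above, and the names `rgb_to_lab`, `RGB_TO_XYZ`, and `WHITE` are hypothetical.

```python
# Standard RGB -> CIEXYZ matrix whose rows sum to the white point below.
RGB_TO_XYZ = (
    (0.412453, 0.357580, 0.180423),
    (0.212671, 0.715160, 0.072169),
    (0.019334, 0.119193, 0.950227),
)
WHITE = (0.950456, 1.000000, 1.088754)  # (Xn, Yn, Zn)
THRESHOLD = 0.008856


def _f(t):
    """Cube root above the threshold, linear segment below it."""
    return t ** (1.0 / 3.0) if t > THRESHOLD else 7.787 * t + 16.0 / 116.0


def rgb_to_lab(r, g, b):
    """Convert one RGB pixel (floats in [0, 1], an assumption) to (L*, a*, b*)."""
    x, y, z = (row[0] * r + row[1] * g + row[2] * b for row in RGB_TO_XYZ)
    xr, yr, zr = x / WHITE[0], y / WHITE[1], z / WHITE[2]
    if yr > THRESHOLD:
        lightness = 116.0 * yr ** (1.0 / 3.0) - 16.0
    else:
        lightness = 903.3 * yr
    a_star = 500.0 * (_f(xr) - _f(yr))
    b_star = 200.0 * (_f(yr) - _f(zr))
    return lightness, a_star, b_star
```

With these constants, white (1, 1, 1) maps to approximately L* = 100 with a* and b* near zero, and black maps to the origin.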
4.2 Bilateral Filter
Next, we use the bilateral filter to decompose the original image I into a base layer, which encodes large-scale variations, and a detail layer. Note that the detail layer is used to select the regions of interest and the features that should be preserved. The bilateral filter is a non-linear filter in which each pixel of the filtered result is a weighted mean of its neighbors, with the weights decreasing both with spatial distance and with difference in value. The bilateral filter can be defined as
\[
B_p = \frac{1}{k_p} \sum_{q \in \Omega} G_{\sigma_S}(\lVert p - q \rVert)\, G_{\sigma_I}(\lvert I_p - I_q \rvert)\, I_q,
\qquad
k_p = \sum_{q \in \Omega} G_{\sigma_S}(\lVert p - q \rVert)\, G_{\sigma_I}(\lvert I_p - I_q \rvert),
\]
where $\Omega$ is the whole image range, and the subscripts $p$ and $q$ indicate spatial locations of pixels. $I_p$ and $I_q$ are the intensity values of pixels $p$ and $q$. The kernel functions $G_{\sigma_S}$ and $G_{\sigma_I}$ are typically Gaussians, where $\sigma_S$ determines the spatial support, while $\sigma_I$ controls the sensitivity to edges. $k_p$ is a normalization term. $B_p$ is the filtered result at pixel $p$ and represents the base layer of the original image. A bilateral filter allows combining the three color channels in CIELab and measuring photometric distances between pixels in the combined space. Furthermore, the detail layer is the original image divided by the base layer, $I_p / B_p$. The ratio is computed on each color channel separately and is independent of the signal magnitude. It captures the local detail variation in the original image and is commonly called a quotient image [59] or a ratio image [37] in the computer vision community.
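A brute-force version of the bilateral filter and of the ratio-based detail layer can be sketched as below. It is a simplified illustration: it works on a single channel stored as a list of rows, it truncates the sum to a square window of half-width `radius` instead of the whole image range, and it adds a small `eps` to the denominator of the ratio. These choices, and the function names, are assumptions.

```python
import math


def bilateral_filter(image, sigma_s, sigma_i, radius):
    """Base layer: Gaussian-weighted mean over space and intensity difference."""
    h, w = len(image), len(image[0])
    base = [[0.0] * w for _ in range(h)]
    for py in range(h):
        for px in range(w):
            total, k = 0.0, 0.0
            for qy in range(max(0, py - radius), min(h, py + radius + 1)):
                for qx in range(max(0, px - radius), min(w, px + radius + 1)):
                    d2 = (py - qy) ** 2 + (px - qx) ** 2
                    r2 = (image[py][px] - image[qy][qx]) ** 2
                    weight = (math.exp(-d2 / (2.0 * sigma_s ** 2))
                              * math.exp(-r2 / (2.0 * sigma_i ** 2)))
                    total += weight * image[qy][qx]
                    k += weight  # normalization term
            base[py][px] = total / k
    return base


def detail_layer(image, base, eps=1e-6):
    """Detail layer as the per-pixel ratio of the image to its base layer."""
    return [[i / (b + eps) for i, b in zip(row_i, row_b)]
            for row_i, row_b in zip(image, base)]
```

On an image with a sharp step edge and a small `sigma_i`, pixels on each side of the edge are averaged almost only with pixels on the same side, which is the edge-preserving behavior described above.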
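We transform the base layer and the detail layer back to RGB color space for further processing. As mentioned previously, $X_n$, $Y_n$, and $Z_n$ are specified respectively as 0.950456, 1.000000, and 1.088754. For values above the 0.008856 threshold, the reverse transformation to CIEXYZ is
\[
Y_{CIE} = Y_n \left( \frac{L^* + 16}{116} \right)^{3}, \qquad
X_{CIE} = X_n \left( \frac{L^* + 16}{116} + \frac{a^*}{500} \right)^{3}, \qquad
Z_{CIE} = Z_n \left( \frac{L^* + 16}{116} - \frac{b^*}{200} \right)^{3}
\]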
The RGB color space is then obtained by applying the inverse of the RGB-to-CIEXYZ transformation matrix.
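The reverse path can be sketched in the same way. The snippet below inverts the standard RGB-to-CIEXYZ matrix numerically instead of quoting inverse coefficients, and it handles values below the threshold with the inverse of the linear segment. The names `lab_to_rgb`, `inv3`, and `_f_inv` are hypothetical, and the linear branch is an assumption.

```python
RGB_TO_XYZ = (
    (0.412453, 0.357580, 0.180423),
    (0.212671, 0.715160, 0.072169),
    (0.019334, 0.119193, 0.950227),
)
WHITE = (0.950456, 1.000000, 1.088754)  # (Xn, Yn, Zn)


def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return (
        ((e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det),
        ((f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det),
        ((d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det),
    )


XYZ_TO_RGB = inv3(RGB_TO_XYZ)


def _f_inv(t):
    """Cube above the threshold, inverse of the linear segment below it."""
    return t ** 3 if t ** 3 > 0.008856 else (t - 16.0 / 116.0) / 7.787


def lab_to_rgb(lightness, a_star, b_star):
    """Convert one (L*, a*, b*) triple back to RGB (not clipped to [0, 1])."""
    fy = (lightness + 16.0) / 116.0
    fx = fy + a_star / 500.0
    fz = fy - b_star / 200.0
    x, y, z = (wn * _f_inv(t) for wn, t in zip(WHITE, (fx, fy, fz)))
    return tuple(row[0] * x + row[1] * y + row[2] * z for row in XYZ_TO_RGB)
```

Values outside [0, 1] can occur for out-of-gamut colors; clipping them is left to the caller in this sketch.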
Furthermore, the detail layer is attenuated to achieve a stylized abstract look. In this work, the short line segments in the detail layer are regarded as unimportant regions or noise. Hence, the median filter is applied to smooth and denoise the detail layer. The idea of median filtering is to replace each pixel’s value with the median of its neighboring pixels’ values. It is done by repeating the following steps for each pixel in the image:
1. Store the neighboring pixels in an array called the window. Note that the neighboring pixels are chosen by a box.
2. Sort the window in numerical order.
3. Pick the median from the window as the pixel’s value.
Finally, the base layer is overlaid with edges extracted from the filtered detail layer.
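The three median-filter steps listed above can be sketched as follows. The box half-width `radius`, the handling of image borders by shrinking the window, and the choice of the upper median for even-sized windows are assumptions made for this illustration.

```python
def median_filter(image, radius=1):
    """Median filter over a (2 * radius + 1) square box, shrunk at the borders."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Step 1: store the neighboring pixels chosen by the box.
            window = [image[qy][qx]
                      for qy in range(max(0, y - radius), min(h, y + radius + 1))
                      for qx in range(max(0, x - radius), min(w, x + radius + 1))]
            # Step 2: sort the window in numerical order.
            window.sort()
            # Step 3: pick the median (upper median if the window size is even).
            out[y][x] = window[len(window) // 2]
    return out
```

An isolated bright pixel surrounded by dark pixels is removed by this filter, which is the behavior relied on to suppress short line segments in the detail layer.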
4.4 Experimental Results
The proposed method yields several results. Figure 4.1 shows an image abstracted by using the bilateral filter. Figure 4.1 (a) is the original real image. Figure 4.1 (b) is the coarsened image of the base layer with $\sigma_S = 12$ and $\sigma_I = 0.15$. In our experiments, we found these parameters to be better suited for applications that discard or attenuate some of the details, such as image abstraction. Thus, we used these parameters throughout this dissertation for most real images. Figure 4.1 (c) shows the contours selected from the detail layer by using the median filter. Moreover, Figure 4.1 (d) shows the image abstraction result.
Figure 4.2 shows another example of the abstracted image. Figure 4.2 (a) is the original real image. Figure 4.2 (b) is the coarsened image of the base layer by using the bilateral filter. Figure 4.2 (c) is the selected contours from the detail layer by using the median filter. Finally, Figure 4.2 (d) shows the image abstraction result.
Figure 4.1: Image abstraction computed using our two-scale decomposition. (a) The original real image (the photo of Charlize Theron). (b) The base layer. (c) The selected contours from the detail layer. (d) The final abstracted image.
Figure 4.2: Another example of image abstraction. (a) The original real image (the photo of David Axelrod). (b) The base layer. (c) The selected contours from the detail layer. (d) The final result.
Chapter 5 Novel View Generation
For animating 2D characters in still pictures, we propose a statistical method based on nonparametric regression analysis to generate a novel view of a character. Kernel regression with ERBFs is introduced to fit the contours of a character and is applied to infer the corresponding displacements of the contours. LOESS is applied to fill in the color and texture information obtained from the original character in the given picture.
5.1 View Interpolation
In our work, we take the idea of creating deformations directly in image space one step further by making 2D characters move. Specifically, we propose a nonparametric regression model to animate characters from still images. For instance, a character in a comic can be animated by creating novel views, as shown in Figure 5.1. It shows two consecutive frames in the original comic, which can be regarded as two different scenes, and the frames synthesized from a single input frame.
Figure 5.1 (a) shows the original frames in the comic. Figure 5.1 (b) shows the synthesized novel views. Note that the model is trained to fit the shape and detail of the character between two key-poses obtained from a given frame and its reverse, while minimizing unnatural distortion. The trained model is then applied to synthesize a smooth transition between these key-poses.
Figure 5.1: Novel view generation in a comic. (a) Two continuous frames in a comic.
(b) The frames synthesized from a single frame. © Georges Remi (Hergé)
As mentioned previously, the proposed model is based on the predictive ability of nonparametric regression. Kernel regression approximates the shape of a deformed character between two key-poses, or moving templates indicating different poses, by means of a set of kernel functions. A circular Gaussian distribution function is not an appropriate choice for fitting contours with noncircular structures, such as those of characters or human-like subjects. Instead of an RBF kernel based on the spatially limited circular Gaussian, kernel regression using elliptic radial basis functions (ERBFs), specifically elliptic Gaussians, which provide less learning time, is applied for contour fitting during shape deformation (referred to as shape deforming). Although ERBFs require more computation during optimization, better quality is obtained with a smaller number of basis functions.
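The difference between a circular and an elliptic Gaussian basis function can be illustrated with a small sketch. The exact ERBF parameterization is the one defined in Section 3.1; the unnormalized form below, with a center and a full 2x2 covariance matrix, and the function name `elliptic_gaussian` are assumptions made for illustration.

```python
import math


def elliptic_gaussian(point, center, cov):
    """Unnormalized 2D Gaussian exp(-0.5 * d^T Sigma^{-1} d) with d = point - center."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    q = (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * q)
```

With a covariance of ((4, 0), (0, 1)), a point two units away along the x axis receives a larger response than a point two units away along the y axis, so a single elongated basis function can cover a stretch of contour that would need several circular ones.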
Furthermore, a local-fitting methodology is applied to preserve important features within the deformed shape, that is, to fill in the color and texture information obtained from the original character in the given image. Locally weighted regression, or LOESS, is used to preserve the features or details by fitting a function of the independent variables locally.
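The local fitting idea behind LOESS can be sketched in one dimension as follows. This is a simplified illustration, not the detail-preserving procedure of Section 5.5: it fits a weighted straight line around a single query point, using tricube weights and a bandwidth equal to the distance to the k-th nearest sample. The function name `loess_point` and the `span` parameter are assumptions.

```python
import math


def loess_point(x0, xs, ys, span=0.5):
    """Locally weighted linear estimate at x0 from samples (xs, ys)."""
    n = len(xs)
    k = min(n, max(2, math.ceil(span * n)))  # number of neighbors used
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12  # local bandwidth
    # Tricube weights: near samples count most, samples beyond h count zero.
    ws = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(ws)
    if sw == 0.0:
        raise ValueError("no sample has positive weight; increase span")
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:
        return sy / sw  # degenerate case: fall back to the weighted mean
    slope = (sw * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / sw
    return intercept + slope * x0
```

Because each estimate depends only on nearby samples, local features of the data are kept instead of being smoothed away by a single global fit.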
5.2 Algorithm Overview
Figure 5.2 outlines the structure of our proposed method for novel view generation. With reference to Figure 5.2, we briefly describe our method in the following paragraphs.
1. Character Extraction: As mentioned in Chapter 4, input images other than paintings or comics, such as real images, are first filtered by using the bilateral filter.
Then, in order to reduce the effects of the background upon deformations, we extract characters from the input image. We use level-set-based GrabCut to extract characters and features, as described in Section 5.3. Similar regions are extracted by the level set method. The bounding box of all regions is then used by GrabCut.
Figure 5.2: This example shows the picture of Mona Lisa. (a) The original input image.
(b) The extracted character. (c) The similar parts of the character found by level-set-based GrabCut. (d) The contours used to build the nonparametric regression model for shape deforming and detail preserving. (e) Several resulting frames of the synthesized views of Mona Lisa, obtained after deforming the shape and preserving details.
The boundaries of the regions corresponding to the automatically produced matte are further used to obtain the final character matte, so that the foreground and background are separated. In addition, the facial features are extracted simultaneously by the level set method. As shown in Figure 5.2 (b), Mona Lisa is extracted; she is described by the similar parts found by level-set-based GrabCut in Figure 5.2 (c), and the corresponding contours shown in Figure 5.2 (d) are used to build the nonparametric regression model for shape deforming and detail preserving.
2. Statistical Approaches: Before generating novel views of a character in a still image, we apply the nonparametric regression for novel view generation described in Section 3.1 and Section 3.2. Note that the proposed statistical approaches are used not only to generate novel views of a character, but also to create an expressive talking face and to synthesize smooth limb movements. In practice, our framework for 2D character animation consists of these statistical approaches.
3. Novel View Generation: Given one key-pose of a character and its reverse, we deform the shape of the character by applying a trained nonparametric regression model to generate novel views of the character. The process can be divided into two steps: shape deforming and detail preserving. In the shape deforming step, the correspondence in the training data set is constructed first. Kernel regression with ERBFs is then employed to train the model to represent and fit the contour of the deformed shape, as described in Section 5.4.
In the detail preserving step, as described in Section 5.5, LOESS is adopted to fit the details of the deformed shape. LOESS is suitable for detail preserving in accordance with the previously fitted contours. Finally, Figure 5.2 (e) shows the resulting images.
5.3 Character Extraction
Real images, unlike paintings or comics, are first filtered by using the bilateral filter. Then we adopt the level set method to extract regions with a similar color distribution in the image. The level set method, proposed by Osher and Sethian [38, 57, 58], is an approach for approximating the dynamics of moving curves and surfaces.
Note that we choose HSV color space [72] because it is not only close to the human understanding of colors, but is also regarded as well suited to judging color changes. It consists of three components: hue (H), saturation (S), and brightness, or value (V). In practice, HSV color space allows the three color channels to be combined appropriately. Moreover, the combined color difference can be made to correspond to the distance between two points in the color space, which makes it suitable for image segmentation and analysis. We introduce color gradient information of images, instead of the gray gradient, to update the curve evolution function of the level set method. Furthermore, the regions representing the facial features of a character are found simultaneously.
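One way to turn an HSV difference into a distance between two points is sketched below. The formula for the color difference is not given in this section, so the mapping used here is an assumption: each color is placed in the HSV cone, with hue as an angle, and the Euclidean distance between the two points is taken. The function names are hypothetical; `colorsys.rgb_to_hsv` is from the Python standard library and expects RGB floats in [0, 1].

```python
import colorsys
import math


def hsv_point(r, g, b):
    """Map an RGB color to Cartesian coordinates in the HSV cone (an assumption)."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    angle = 2.0 * math.pi * h
    return (s * v * math.cos(angle), s * v * math.sin(angle), v)


def hsv_distance(rgb1, rgb2):
    """Euclidean distance between two colors after mapping them into the HSV cone."""
    p1, p2 = hsv_point(*rgb1), hsv_point(*rgb2)
    return math.sqrt(sum((u - w) ** 2 for u, w in zip(p1, p2)))
```

A color gradient for the level set update could then be built from such distances between neighboring pixels, so that two regions with different hues remain separable even when their gray levels are close.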
After feature extraction, GrabCut [51] is applied to separate the foreground (characters) from the background. GrabCut is a powerful object matting tool. However, it requires an initial incomplete trimap, which represents the seeds of foreground and background for the underlying graph cut algorithm. That is, no hard foreground labeling is done at all; only the background region is determined by users, as a strip of pixels around the outside of the marked rectangle. The Gibbs energy is then minimized and object matting is applied for character extraction.
We construct a bounding box of all the regions extracted by the level set method and use it for GrabCut in place of the user-supplied initial incomplete trimap. Note that the extracted regions correspond to the regions of a character matte with a similar color distribution. The pixels inside the contours of these regions are taken as the foreground distribution, replacing users’ refinement during the iterative minimization in GrabCut. Subsequently, the entire energy minimization process is performed iteratively with the updated foreground distribution. The process is guaranteed to converge at least to a local minimum, since the energy decreases monotonically. After convergence, the character matte is extracted.
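The bounding box that replaces the user-drawn rectangle can be computed as in the sketch below. The representation of each extracted region as a list of (x, y) pixel coordinates and the function name `bounding_rect` are assumptions made for illustration; the level set extraction and the GrabCut minimization themselves are not shown.

```python
def bounding_rect(regions):
    """Smallest axis-aligned rectangle (x, y, width, height) covering all regions.

    regions: iterable of regions, each a list of (x, y) pixel coordinates.
    """
    xs = [x for region in regions for x, _ in region]
    ys = [y for region in regions for _, y in region]
    if not xs:
        raise ValueError("no pixels in any region")
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1)
```

The (x, y, width, height) layout matches the rectangle format accepted by rectangle-initialized GrabCut implementations such as OpenCV's `cv2.grabCut` with `cv2.GC_INIT_WITH_RECT`.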
Because the hue, saturation, and brightness components jointly determine changes in color, the level set method with the color gradient improves on the approach that uses only the gray gradient to judge whether a pixel lies on a border.
With the color factor included, character and feature extraction is robust for images in which the gray level of the background is close to that of the foreground. The final character and feature matte is shown in Figure 5.2 (b).
5.4 Shape Deformation Using Kernel Regression with ERBFs
As mentioned previously, we use level-set-based GrabCut to extract the character and the similar regions within the character. After extracting the characters in the input image and its reverse, we train a regression model with the ERBF kernel. First, an initial solution for the regression parameters is obtained, as described in Section 5.4.1. Then we discuss how to train the model and fit the character’s deformed shape with the trained model, as described in Section 5.4.2.
5.4.1 The Determination of Initial Values
The initial guesses are important for the convergence of the subsequent optimization in model learning.
Before setting the initial values of the centers and covariances, the correspondence with regard to feature alignment must be established. First, we create a window and use it to compute the curvature along each region boundary in the face. Note that these regions in the face are obtained by using the level set method. We choose the top five curvatures from the window interiors and sample points along these contours. The five bounding boxes of these sets of sample points are the feature blocks shown in Figure 5.3 (a) and Figure 5.3 (b). The structure of these feature blocks (that is, the order of the feature blocks) is constructed to maintain the spatial relationship among these features, as shown in Figure 5.3 (c). Note that the structure is similar to a tree structure, but without root or leaf nodes. We only use the link between two nodes (feature blocks) to record the spatial relationship, or the order, of the two nodes. Subsequently, Tchebichef moments (TMs) [43] of these blocks are used to determine the correspondence in the other key-pose, which is obtained by reversing the original input image, for spatial constraints, as shown in Figure 5.3 (d) and Figure 5.3 (e).
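The selection of the highest-curvature locations along a region boundary can be sketched as follows. The text does not state which curvature measure is used, so this illustration assumes the absolute turning angle between the incoming and outgoing directions at each point of a closed contour, measured over a window of `window` points on each side. The function names are hypothetical, and the subsequent sampling, feature blocks, and Tchebichef moments are not shown.

```python
import math


def turning_angles(contour, window=1):
    """Absolute turning angle at each point of a closed contour of (x, y) points."""
    n = len(contour)
    angles = []
    for i in range(n):
        x0, y0 = contour[(i - window) % n]
        x1, y1 = contour[i]
        x2, y2 = contour[(i + window) % n]
        incoming = math.atan2(y1 - y0, x1 - x0)
        outgoing = math.atan2(y2 - y1, x2 - x1)
        # Wrap the difference into [-pi, pi) before taking its magnitude.
        diff = (outgoing - incoming + math.pi) % (2.0 * math.pi) - math.pi
        angles.append(abs(diff))
    return angles


def top_k_curvature(contour, k=5, window=1):
    """Indices of the k contour points with the largest turning angle."""
    angles = turning_angles(contour, window)
    order = sorted(range(len(contour)), key=lambda i: angles[i], reverse=True)
    return order[:k]
```

On a square contour sampled at its corners and edge midpoints, the four corners are returned first, since the edge midpoints have zero turning angle.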
Figure 5.3: Correspondences and initial value determination. (a) Top five features are selected. (b) The structure is constructed from feature blocks. (c) The spatial relation is obtained from first key-pose. (d) (e) The correspondences in the other key-pose are extracted based on the structure of spatial relationship. (f) (g) The samples and