3 Human Object Inpainting Using Manifold Learning-Based
3.2 Human Object Inpainting Using Posture Sequence
3.2.3 Posture Sequence Estimation
Based on the graphical model of an object's motion shown in Figure 3.1, we obtain suitable postures to replace damaged or missing postures by finding an approximate path that links data points $x_i$ and $x_j$ in a low-dimensional manifold. Intuitively, a motion path can be reconstructed by taking the shortest path between two nodes or by an optimization process [39], but neither approach can guarantee the smoothness of the recovered motion. To resolve this problem, we propose two constraints that regulate the motion continuity property in the local region of a graphical model. Specifically, we need a strategy to select a certain number of data points that satisfy the continuous motion constraint. The first constraint limits the search range to a reasonable neighborhood, as shown in Figure 3.2. Therefore, we need to define the search range from the complete trajectory of an object's motion. In the manifold domain, such a trajectory comprises a number of linked data points (see Figure 3.1). To determine the distance between any two consecutive data points on a trajectory, we calculate the shape context difference between their corresponding postures. Then, the maximum of all the measured distances is taken as the search range to satisfy the first constraint. Since the search range is circular, we calculate the radius as follows:
$$r = \max_{\forall e_{ij} \text{ on a complete trajectory}} e_{ij}, \qquad (3.1)$$
where $e_{ij}$ represents the distance between $x_i$ and $x_j$ on an object's motion trajectory.
Figure 3.2 The neighborhood constraint.
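To make the radius in (3.1) concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes that a trajectory is given as a list of low-dimensional coordinates and uses the Euclidean distance in the manifold as a stand-in for the shape context difference between postures; the function name `search_radius` is hypothetical.

```python
import math

def search_radius(trajectory):
    """Return r as in (3.1): the largest distance between consecutive points.

    Assumption: Euclidean distance between manifold coordinates stands in
    for the shape context difference between the corresponding postures.
    """
    if len(trajectory) < 2:
        raise ValueError("a trajectory needs at least two points")
    return max(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))

# Toy trajectory with consecutive distances 1, 2 and 5.
trajectory = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (4.0, 6.0)]
r = search_radius(trajectory)
print(r)  # 5.0
```

Replacing `math.dist` with any posture distance function leaves the rest of the sketch unchanged.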
The second constraint is introduced to maintain the tendency of an object's motion in each local region. It can be realized by checking the tendency of an object's motion trajectory in a graphical model. In a low-dimensional manifold, a motion trajectory does not change direction significantly within a neighborhood region. Based on this observation, a variance constraint of motion tendency is designed to ensure that the variance of motion tendency stays within a reasonable range (see Figure 3.3). In the manifold domain, the complete trajectory of an object's motion comprises a number of linked segments, as shown by the red lines in Figure 3.1. For these segments, we compute the change in direction between any two consecutive segments based on the inner product of their corresponding vectors. The largest of all the computed direction changes is taken as the maximum allowable angle for a direction change. This angle, which is the basis for executing the second constraint, is calculated as follows:
$$\alpha = \max_{\forall \theta_{ijk} \text{ on a complete trajectory}} \theta_{ijk}, \qquad (3.2)$$
where $\theta_{ijk}$ represents the angle between the vectors $\overrightarrow{x_i x_j}$ and $\overrightarrow{x_j x_k}$ on an object's motion trajectory.
Figure 3.3 The motion tendency constraint.
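The maximum allowable angle in (3.2) can be illustrated in the same way. The sketch below assumes that the trajectory is a list of manifold coordinates and computes each direction change from the inner product of consecutive segment vectors; the function names `turning_angle` and `max_turning_angle` are hypothetical.

```python
import math

def turning_angle(xi, xj, xk):
    """Angle in radians between the vectors xi->xj and xj->xk."""
    u = [b - a for a, b in zip(xi, xj)]
    v = [b - a for a, b in zip(xj, xk)]
    norm_u, norm_v = math.hypot(*u), math.hypot(*v)
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("consecutive points must be distinct")
    cos_theta = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def max_turning_angle(trajectory):
    """Return alpha as in (3.2): the largest direction change on a trajectory."""
    if len(trajectory) < 3:
        raise ValueError("a trajectory needs at least three points")
    triples = zip(trajectory, trajectory[1:], trajectory[2:])
    return max(turning_angle(a, b, c) for a, b, c in triples)

# Straight step followed by a right-angle turn.
alpha = max_turning_angle([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)])
print(math.degrees(alpha))
```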
The above constraints are designed to maintain the local continuity of an object’s motion. To maintain the global motion continuity, we propose a two-way (forward-backward) prediction mechanism. We use three time instants, t−1, t, and t+1, to explain how the proposed mechanism operates.
In the forward operation, we make a forward prediction on each data point at time t−1. The motion tendency constraint and the search range constraint are applied to determine m probable data points at the next time instant t. The m selected data points are then used to predict the candidate data points at time t+1. We apply the same strategy in the reverse direction and collect related information from t+1 to t, and from t to t−1.
Then, we combine the results from the bi-directional processing to obtain the final results for time t. To illustrate the two-way prediction process further, we use a test sequence containing 245 frames. Some snapshots extracted from this sequence, test sequence #1, are shown in Figure 3.4. The candidate points chosen at time instant 19 (t−1) are indicated by the blue dots in Figure 3.5(a), and their corresponding postures are shown on the left-hand side of the figure. Those candidate points are used to perform forward prediction. The predicted candidate points at time instant 20 are shown in Figure 3.5(b). We apply the same procedure in the reverse direction and generate results from t = 21 to t = 20 (shown in Figure 3.5(c) and (d)). The two sets of results are then combined to form the final results, as shown in Figure 3.5(e). Table 3.1 provides detailed information about the above-mentioned processes, including the distance and angle information calculated during the forward and backward prediction steps.
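One prediction step under the two constraints can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: points are manifold coordinates, distances are Euclidean, and the m probable points are taken to be the m nearest points that pass both constraints (the selection rule is not specified above). The names `angle_between` and `select_candidates` are hypothetical.

```python
import math

def angle_between(u, v):
    """Angle in radians between two non-zero vectors, from their inner product."""
    cos_theta = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def select_candidates(prev_pt, cur_pt, pool, r, alpha, m):
    """Keep points within radius r of cur_pt whose direction change relative
    to the segment prev_pt->cur_pt is at most alpha, then return the m nearest
    (assumed ranking rule)."""
    heading = [b - a for a, b in zip(prev_pt, cur_pt)]
    if math.hypot(*heading) == 0.0:
        raise ValueError("prev_pt and cur_pt must be distinct")
    kept = []
    for p in pool:
        step = [b - a for a, b in zip(cur_pt, p)]
        d = math.hypot(*step)
        if d == 0.0 or d > r:          # neighborhood constraint
            continue
        if angle_between(heading, step) <= alpha:  # motion tendency constraint
            kept.append((d, p))
    kept.sort(key=lambda item: item[0])
    return [p for _, p in kept[:m]]

pool = [(2.0, 0.0), (1.0, 1.0), (0.0, 0.0), (5.0, 0.0), (1.5, 0.1)]
forward = select_candidates((0.0, 0.0), (1.0, 0.0), pool,
                            r=1.5, alpha=math.pi / 4, m=2)
print(forward)  # [(1.5, 0.1), (2.0, 0.0)]
```

Backward prediction can reuse the same function with the time order reversed, for example by passing the points at t+2 and t+1 as `prev_pt` and `cur_pt`. How the forward and backward candidate sets are merged is left out of the sketch.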
Figure 3.4 Some snapshots extracted from test sequence #1.
Figure 3.5 (a)−(b) some forward prediction steps, (c)−(d) some backward prediction steps, and (e) the combined results of two-way prediction at time t.
Table 3.1. Detailed information derived during the forward-backward prediction process
Forward prediction from time instant 19 to time instant 20 (D: distance; A: angle)
T:19
T:20
Backward prediction from time instant 21 to time instant 20 (D: distance; A: angle)
T:21
Since the motion continuity constraint is only effective on local regions, we use the Markov Random Field (MRF) model to derive global motion continuity. MRF provides a convenient and accurate way to model context-dependent entities, such as image pixels and correlated features.
The above modeling can be achieved by characterizing the mutual influences that relate such entities. To predict an object’s motion, instead of following the Markov assumption, we assign one node of the Markov network to each time state. Then, the constructed network can reflect
statistical dependencies. Given a set of data points located at the intervening nodes, every node of a Markov network is statistically independent of other nodes in the network. Since our Markov network does not contain loops, the above-mentioned Markov assumption results in simple “message-passing” rules for computing the probability during inference. The data point estimated at node j is
* 1 1 column vector with the initial probability of all the elements associated with node j. The function $\Psi(c_j, c_{j+1}, c_{j+2})$ is defined as follows: the mean and standard deviation of all angles in a complete trajectory of
an object’s motion.
Figure 3.6 An example of the MRF process.
To better explain how (3.3), (3.4) and (3.5) find an optimal $c_t^*$, we use the three nodes shown in Figure 3.6 as an example. Initially, node t receives two messages in the form of a column vector with the initial probabilities of the elements associated with nodes t−1 and t+1. It then sends the two messages, $M_t^{t-1}$ and $M_t^{t+1}$, to nodes t−1 and t+1, respectively. The messages contain the probability information of all the candidate data points associated with node t. Before the information is sent, it is reordered to form a column vector. On receipt of the information, nodes t−1 and t+1 respond by sending messages $M_{t-1}^{t}$ and $M_{t+1}^{t}$, respectively, to node t. When each candidate point of node t receives the message $M_{t-1}^{t}$, it finds a matching point in node t−1 as
the previous self probability of candidate point $c_t$, and $p(c_{t-1})$ and $p(c_{t+1})$ are the probabilities propagated by messages $M_{t-1}^{t}$ and $M_{t+1}^{t}$, respectively.
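The update at node t can be illustrated with a small numerical sketch. The exact form of (3.6) is not reproduced in this text, so the rule used here is an assumption: each candidate's new score is the product of its previous self probability and the probabilities of its matched points at nodes t−1 and t+1, followed by normalization. The function name `update_probabilities` and the index lists `match_prev` and `match_next` are hypothetical.

```python
def update_probabilities(p_self, p_prev, p_next, match_prev, match_next):
    """Recompute and normalize the candidate probabilities of node t.

    p_self      previous self probabilities of the candidates at node t
    p_prev      probabilities propagated from node t-1
    p_next      probabilities propagated from node t+1
    match_prev  index of the matched point at node t-1 for each candidate
    match_next  index of the matched point at node t+1 for each candidate

    Assumption: a product rule stands in for (3.6).
    """
    raw = [
        p_self[i] * p_prev[match_prev[i]] * p_next[match_next[i]]
        for i in range(len(p_self))
    ]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("all candidates have zero probability")
    return [value / total for value in raw]

# Two candidates at node t; the first matches a more probable point at t-1.
updated = update_probabilities(
    p_self=[0.5, 0.5],
    p_prev=[0.8, 0.2],
    p_next=[0.5, 0.5],
    match_prev=[0, 1],
    match_next=[0, 1],
)
print(updated)
```

In this toy case the first candidate ends up with about 0.8 of the probability mass, because its matched point at node t−1 is four times as probable as that of the second candidate.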
After normalizing the probability value of each candidate point calculated by (3.6), we obtain a new probability value for each candidate point. Then, node t sends the updated message $M_t^{t+1}$ with the new probability to node t+1. Similarly, if node t receives an updated message from node t+1, the probability values of all the candidate points of node t are recomputed and sent to node t−1. Freeman et al. [53] showed that after, at most, one global iteration of (3.4) on each node of the network, (3.3) can derive the desired optimal estimate of $c_j^*$ at node j.
3.3 Experimental Results
To test the effectiveness of the proposed posture sequence estimation method, we performed experiments on eight test sequences, some of which were captured with a camcorder, while the rest were taken from the Weizmann database [46] and the Internet. In addition to test sequence #1 shown in Figure 3.4, we used sequences #2 and #3 to evaluate the proposed method. In the experiments, we first removed several consecutive frames to simulate a real-world situation where objects in a number of consecutive frames are damaged due to packet loss. Then, we applied the proposed posture sequence estimation method to reconstruct the motion of each object. We also compared the performance of our approach with that of Ding et al.'s approach [10] and Xu et al.'s approach [39]. For all the test sequences, the proposed method maintained the motion continuity of the reconstructed motion and yielded better results than the compared approaches.
In the first experiment, we removed 10 of the 245 frames in test sequence #1. Part of the sequence (28 frames) is shown in Figure 3.7(a).
In Figure 3.7(a), the 10 frames that we removed are bounded by the red rectangle. Figure 3.7(b), (c), and (d) show the missing sequence that was reconstructed by applying Ding et al.'s approach [10], Xu et al.'s approach [39] and our approach, respectively; and Figure 3.7(e) shows the corresponding trajectories reconstructed by the three approaches in the manifold space. The red, blue, yellow and green trajectories represent the ground truth and the trajectories reconstructed by Ding et al.'s approach, Xu et al.'s approach and the proposed approach, respectively. We observe that the trajectory reconstructed by our approach maintains the best motion continuity; and
it is also the smoothest of the three trajectories. Because the proposed posture sequence estimation method is more effective in recovering an object’s motion and maintaining motion continuity simultaneously, we conclude that it is more suitable for object inpainting than the compared methods.
Table 3.2 details the results of the ground-truth and the three compared methods. The top row shows the sequence of missing ground truth postures; and the second, third, and fourth rows show the missing frames reconstructed by Ding et al.’s method, Xu et al.’s method, and our method, respectively. The black parts of the figures are the ground-truth postures; the gray parts are perfectly matched portions; and the red parts belong to reconstructed postures. We observe that the frames reconstructed by our method are consistently better than those derived by the compared methods.
Figure 3.7 The experiments on test sequence #1: (a) partial sequence of test sequence #1 in which the red rectangle indicates missing frames; (b) frames reconstructed by Ding et al.’s approach; (c) frames reconstructed by Xu et al.’s approach; (d) frames reconstructed by the proposed approach; and (e) the corresponding trajectory information of predicted object motion generated by the three approaches.
Table 3.2 Comparison of the ground-truth postures and the reconstructed missing postures (The parts in black, red and gray represent the ground-truth postures, reconstructed postures, and perfectly matched portions, respectively)
In the second experiment, we used test sequence #2, which contained 100 frames. In the sequence, two people are walking toward each other, and one person occludes the other in about 20 frames (some of the frames are shown in Figure 3.8(a)). Figure 3.8(b), (c), and (d) show, respectively, the snapshots of human objects reconstructed by Ding et al.'s approach [10], Xu et al.'s approach [39], and our approach. From the reconstructed frames, it is apparent that our approach was the most effective in recovering the occluded frames.
Using the recovered sequence generated by our approach yielded the best inpainting results among the three compared approaches, as shown in Figure 3.8(e).
Figure 3.8 The experiments on test sequence #2: (a) some snapshots of the occluded object in the test sequence; (b) frames reconstructed by Ding et al.’s approach; (c) frames reconstructed by Xu et al.’s approach; (d) frames reconstructed by the proposed approach; and (e) the inpainting result derived by our approach.
In the third experiment, we used a video sequence (test sequence #3) from the Weizmann database [46] to evaluate our method. We removed 7 of the 55 frames in the sequence. Figure 3.9(a) shows part of the sequence (21 frames). The 7 frames bounded by the red rectangle were the ones removed before the experiment. Figure 3.9(b), (c), and (d) show, respectively, the missing frames reconstructed by the three approaches;
and Figure 3.9(e) shows the trajectories reconstructed by the three approaches in the manifold space.
Table 3.3 details the results for the ground truth and the three compared methods. The top row shows the sequence of missing ground-truth postures. The second, third, and fourth rows show the missing frames reconstructed by Ding et al.'s method [10], Xu et al.'s method [39], and our method, respectively. The black parts of the figures are the ground-truth postures; the gray parts are perfectly matched portions; and the red portions belong to reconstructed postures. Note that the first frame reconstructed by Ding et al.'s method covers a broad area (the red area above the head). Only this method may generate such results. In terms of accuracy, our method reconstructed the most accurate postures overall. However, Xu et al.'s method reconstructed the most accurate posture in the last of the 7 missing frames, with a match rate of 94.3% relative to the ground truth. In contrast, the match rates of the postures reconstructed for that frame by Ding et al.'s method and our method are 67.7% and 77.2%, respectively.
Figure 3.9 The experiments on test sequence #3: (a) partial sequence of the test sequence in which the red rectangle indicates the 7 missing frames; (b) the frames reconstructed by Ding et al.’s approach; (c) the frames reconstructed by Xu et al.’s approach; (d) the frames reconstructed by the proposed approach; and (e) the corresponding trajectory information of predicted object motion generated by the compared approaches.
Table 3.3 Comparison of the ground-truth postures and the reconstructed missing postures (The parts in black, red and gray represent the ground-truth postures, reconstructed postures, and perfectly matched portions, respectively)
Method            Average  Frame 1  Frame 2  Frame 3  Frame 4  Frame 5  Frame 6  Frame 7
Ding et al. [10]  71.3%    72.7%    76.2%    71.1%    69.2%    72.0%    70.3%    67.7%
Xu et al. [39]    75.7%    60.6%    94.0%    68.7%    72.8%    73.3%    66.1%    94.3%
Ours              80.9%    83.0%    94.0%    81.3%    73.7%    79.7%    77.5%    77.2%
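As a consistency check on Table 3.3, the reported averages can be recomputed from the seven per-frame match rates. The short sketch below only repeats the numbers listed in the table; the variable and function names are illustrative.

```python
# Per-frame match rates (%) for the 7 missing frames, copied from Table 3.3.
rates = {
    "Ding et al. [10]": [72.7, 76.2, 71.1, 69.2, 72.0, 70.3, 67.7],
    "Xu et al. [39]": [60.6, 94.0, 68.7, 72.8, 73.3, 66.1, 94.3],
    "Ours": [83.0, 94.0, 81.3, 73.7, 79.7, 77.5, 77.2],
}

def average_rate(values):
    """Mean match rate, rounded to one decimal place as in the table."""
    return round(sum(values) / len(values), 1)

averages = {name: average_rate(values) for name, values in rates.items()}
print(averages)
# {'Ding et al. [10]': 71.3, 'Xu et al. [39]': 75.7, 'Ours': 80.9}
```

The recomputed values agree with the Average column, and they also show that Xu et al.'s higher rate in the last frame does not offset its lower rates in frames 1, 3, and 6.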
3.4 Summary
In this chapter, we proposed a human object inpainting scheme that divides the process into three steps: human posture synthesis, graphical model construction, and posture sequence estimation. We also defined two constraints on the motion continuity property. The first constraint sets a threshold that limits the maximum search distance, and the second restricts the range of the search direction. With the two constraints, the number of possible candidates between any two consecutive postures can be reduced to a satisfactory extent. We then apply the MRF model to perform global matching. The experimental results demonstrate that the proposed approach outperforms two existing state-of-the-art approaches.
Chapter 4
Object Posture Temporal Super-Resolution Using Tensor Decomposition-Based Manifold Learning
In this chapter, we describe the proposed framework for object posture super-resolution using tensor decomposition-based manifold learning. First, we introduce the research topic. We then describe the proposed approach and detail the experimental results. Finally, we present our conclusions.
4.1 Introduction
Super-resolution (SR) [48]−[54] is a class of techniques that enhance the resolution of existing images and videos. However, existing SR methods may fail to produce realistic and smooth results when dealing with sequences of human motion. Since human motion usually contains repeated postures, one may insert interpolated postures into the low-resolution (LR) input sequence to increase the temporal resolution. To generate postures and animate animal or human motion, Xu et al. [39] proposed an energy minimization approach. However, because the energy minimization process does not include a human motion model, its performance is unstable and very sensitive to the selected parameters. In [10], Ding et al. proposed a rank minimization approach to model and synthesize human motion for video inpainting. This approach usually produces good results as long as the object's motion is periodic. Makihara et al. [59] proposed a reconstruction-based method to synthesize periodic human motion with a high frame rate from a single periodic motion sequence. Under the constraint of periodic motion, their method also produces good experimental results.
Nevertheless, since human motion is not always periodic, a single motion sequence provides only limited information, which is insufficient to generate high-quality temporal SR sequences. Therefore, in this work, we propose using a learning-based approach to extract the motion tendency from a set of learning sequences and then synthesize interpolated human postures using the learned motion tendency as prior information. Note that the extracted motion tendency should preserve only the motion-related information, regardless of individual discrepancies in the learning sequences. In [60], Elgammal et al. introduced a framework to separate motion data into person and motion factors. However, when we used these decomposed motion factors to increase the temporal resolution of human motion, we found it difficult to obtain a stable result. The main reason is that the decomposed person and motion factors are not guaranteed to be orthogonal. Although a multilinear analysis tool such as tensor decomposition can discover orthogonal factors, its limitation is that the motion data need to be arranged according to the various orthogonal factors beforehand. This requirement is hard to meet because human motion sequences typically have different lengths or different sampling rates. In this work, we propose a motion data alignment scheme that automatically arranges motion data in a tensor. We can then apply tensor decomposition to decompose the motion data into orthogonal factors. Based on the decomposed result, we reconstruct the motion trajectory of the LR input sequence. Finally, the global feature, the reconstructed motion trajectory, and object inpainting, which can maintain local motion continuity, are combined to obtain the final result.
The proposed framework consists of three steps: graphical model construction, motion trajectory reconstruction, and posture selection. The first step, graphical model construction, projects each input motion sequence into a manifold space and then represents the projected sequence by a motion trajectory. This low-dimensional representation provides a simple and concise description of human motion. Second, we extract the motion and person factors via tensor decomposition, and then use the motion factor extracted from the learning sequences and the person factor extracted from the LR input sequence to reconstruct the motion trajectory for the input sequence. Finally, we adopt the human object inpainting technique [61] to select interpolated postures based on the reconstructed motion trajectory.
4.2 Object Posture Temporal Super-Resolution
4.2.1 Overview of the Proposed Method
We propose an object posture super-resolution scheme that can increase the temporal resolution of a human motion sequence. Initially, we assume that the objects have been extracted by an automatic object segmentation scheme [19], or by an interactive extraction scheme [20]-[22]. We also assume that the number of available postures is sufficient for posture selection, based on the observation that human motion usually contains repeated postures.
Our primary goal is to transfer the motion factor from the high-resolution (HR) learning sequences to the LR input sequence in order to increase the temporal resolution of the LR input sequence. Figure 4.1 shows the flowchart of the proposed posture super-resolution scheme, which comprises three steps: graphical representation of object postures, temporal super-resolution using tensor decomposition-based manifold learning, and posture selection. The first step involves calculating the similarity value between postures. Then, all postures are projected onto the manifold space, and we link the postures that appear in adjacent frames. After applying the above procedure, we obtain a graphical representation of the object's motion, which provides a simple description of that motion. Next, we extract some significant points along the motion trajectory of each learning sequence. These significant points are invariant to different persons and are used to align the motion data of different learning sequences. This process avoids the effects of different camera capture rates and different motion speeds. After data alignment, a fixed number, m, of sampled points is extracted and used to represent the motion trajectory of each training sequence. For the LR input sequence alignment, we find k postures (k represents the number of postures in the LR input sequence) among the m positions of the tensor and arrange the coordinate values of all postures in the tensor. After the above data alignment, we extract the motion factor from only the learning sequence, and extract the person factor from only the