The difference between Model-Free and Model-Based methods (shown in Figure 2.1), as the names suggest, is that the latter use an auxiliary human model with an anatomical structure. Model-Free methods usually estimate human motion with a bottom-up process. They apply part detectors for the head, torso, or limbs to find and score the possible candidates for each part, and then consolidate the best association among the candidates [21][43]. However, it is not easy to construct a robust detector for each part. Hua et al. [20] collect 2D shapes of human motions as prior knowledge and propose a data-driven belief propagation Monte Carlo algorithm to infer pose parameters from image cues. Ramanan et al. [40] automatically build an appearance detector for every body part of a person in a video. Mori et al. [37] propose an effective segmentation method and acquire appearance information of the parts to build an appearance model in advance. Ren et al. [42] simply employ various pairwise configuration constraints on edges, such as parallelism, to form the best body configurations. Human motion recovery with bottom-up estimation is flexible but relatively unstable. In the 3D case, Cheung et al. [8], after reconstructing a 3D human volume, calculate the principal axes of the volume and fit elliptical cylinders of adjustable size to the volume to recover human postures.
The human body has an anatomical structure in which body parts are correlated with each other. The advantage of using a 3D human model is that reasonable kinematic constraints can be easily enforced and high-level applications such as animation or action recognition can also be easily performed. Model-Based methods usually estimate human motion with a top-down process. They estimate high-dimensional configurations of human postures by measuring similarities between predicted and actual observations. The methods in [1][14][16] all employ a 2D model. Its advantage is that neglecting the depth of the view simplifies the problem; its disadvantage is that 3D information about human postures cannot be estimated. On the other hand, the results of human motion capture with a 3D model are very intuitive [10][12][48][25][6]. The main disadvantage of using a 3D model is that a suitable 3D human model is not always available, since body shape differs from person to person. To address this, Mündermann et al. [38] establish a database of human figures, and Cheung et al. [9] build 3D human shape and appearance models directly from multiple cameras in advance. Both approaches make a usable 3D human model available. We adopt a simple 3D human model and combine the information from multiple cameras to search for the model parameters that optimize the measurement functions.
2.2 Single View vs. Multiple View
In this section we discuss relevant research that uses information from a single view or from multiple views, as illustrated in Figure 2.2. Lee and Cohen [27] localize each body part to estimate the human posture in a single still image. Sidenbladh et al. [45] and Sminchisescu and Triggs [47] recover 3D postures from a single monocular image sequence.
The difficulty of recovering postures from a single view is that self-occlusion and depth ambiguity occur easily. Agarwal and Triggs [2] use a mixture-of-regressors framework to find multiple possible poses for monocular images. Many methods, such as [5][12],
CHAPTER 2. RELATED WORKS
Figure 2.2: Single View vs. Multiple View. The left is extracted from [27] and the right is extracted from [9].
are originally developed for the single-view scenario but can be extended to multiple views in a straightforward way. They usually sum the prediction errors calculated independently in each view and take the posture with the minimum total error as the estimation result. This is a simple but not necessarily the most effective way to integrate information from multiple views, because not every view contains equally discriminative cues for every human motion at all times. Delamarre and Faugeras [10] estimate 3D movement directions from the differences of silhouettes in each view, and then integrate the movement vectors as the model motion. Kakadiaris and Metaxas [24] use three orthogonal cameras and consider occluded regions and motion changes to select only the cameras with significant changes for posture estimation, but the potentially useful information in the discarded views is ignored altogether.
Another popular and effective way to integrate the information from multiple views is to construct a 3D shape volume of the human body. Instead of considering the 2D human silhouette from each view separately, the 3D shape volume is a visual hull that is consistent with the silhouettes of all views at the same time. The reconstructed shape volume can therefore be used when estimating human postures in the multi-view scenario [33][9][25][32].
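The idea of a visual hull can be illustrated with a minimal sketch. It assumes a tiny voxel grid and three orthographic cameras aligned with the coordinate axes, which is a simplification chosen for illustration and not the calibrated perspective setup used in the cited works; the function names `carve_visual_hull` and `project` are hypothetical.

```python
from itertools import product


def project(voxels, axis, size):
    """Orthographic silhouette of a voxel set when looking along `axis`."""
    mask = [[False] * size for _ in range(size)]
    for voxel in voxels:
        u, v = [c for i, c in enumerate(voxel) if i != axis]
        mask[u][v] = True
    return mask


def carve_visual_hull(silhouettes, size):
    """Keep every voxel whose projection lies inside all silhouettes.

    silhouettes maps a viewing axis (0=x, 1=y, 2=z) to a 2D boolean grid
    indexed by the two remaining coordinates (assumed toy camera model).
    """
    hull = set()
    for voxel in product(range(size), repeat=3):
        consistent = True
        for axis, mask in silhouettes.items():
            u, v = [c for i, c in enumerate(voxel) if i != axis]
            if not mask[u][v]:
                consistent = False  # carved away by this view
                break
        if consistent:
            hull.add(voxel)
    return hull


# Toy object: a 2x2x2 cube inside a 4x4x4 grid, seen from three views.
size = 4
obj = {(x, y, z) for x in (1, 2) for y in (1, 2) for z in (1, 2)}
silhouettes = {axis: project(obj, axis, size) for axis in range(3)}
hull = carve_visual_hull(silhouettes, size)
```

The hull always contains the true object; for non-convex shapes or few views it is generally larger than the object, which is one reason the reconstructed volume is only an approximate measurement.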
Figure 2.3: Image-Based Localization vs. Video-Based Tracking. The left is extracted from [59] and the right is extracted from [1].
2.3 Image-Based Localization vs. Video-Based Tracking
We have already mentioned several previous works that localize the human posture (as shown in Figure 2.3) in a single static image, such as bottom-up human posture recovery using part detectors. Lee and Cohen [27] perform 3D human motion capture from a single image. The authors of [59] assume that the human body is made up of several image cues, and then exploit Sequential Monte Carlo to estimate the position of each cue. Mori and Malik [36] propose an example-based method, in which some key skating poses are regarded as exemplars and the silhouettes of these poses are described by the Shape Context descriptor. The most suitable posture exemplars are then selected to interpolate the estimated human posture for a given input image. Although image-based localization methods have made impressive progress, they are often limited to trained human postures and their accuracy is not satisfactory.
Considering human motions as a continuous sequence of postures, the estimated result at the previous time step is an important source of information that can be utilized. The problem of human motion capture for continuous video sequences is regarded as video-based human motion tracking. We will further discuss relevant research about human motion tracking in the following subsections, including Kalman filtering, particle filtering,
advanced particle filtering, and hierarchical particle filtering.
2.3.1 Kalman Filtering vs. Particle Filtering
For 3D human motion tracking, one of the difficulties is the high dimensionality of the configuration space. Yamamoto et al. [56] and Bregler and Malik [5] recover high-DOF articulated human configurations by solving a linear estimation problem. Mikić et al. [33] propose a 3D voxel labeling method to label limbs and detect the positions of joints between different body parts, and then use extended Kalman filtering to estimate model configurations. However, the mapping from the parameter space to the feature space is non-linear and multi-modal. Linear estimation methods such as Kalman filtering, which maintain a single hypothesis, are not suitable for such non-linear problems, not to mention that we cannot expect to find a perfect measurement function between model parameters and real-world observations.
The particle filter [23] remedies this by maintaining multiple hypotheses of the state estimate. Deutscher et al. [11] and Sidenbladh et al. [45] use general particle filtering to perform human motion tracking. Sidenbladh et al. [45] assume orthographic projection and focus on walking motions only. The configuration of their 3D human model consists of 25 DOFs, so the particle filter must search a 25-dimensional parameter space, where the search is easily trapped in local maxima. To improve tracking accuracy, exponentially more particles can be sampled, at the cost of computational overhead.
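One iteration of a generic particle filter (resample, predict, reweight) can be sketched as follows. This is a minimal illustration on a one-dimensional toy state with Gaussian motion and observation models, all of which are assumptions made here; it is not the 25-DOF tracker of [45], and the function name `particle_filter_step` is hypothetical.

```python
import math
import random


def particle_filter_step(particles, weights, observation, rng,
                         motion_noise=0.5, obs_noise=1.0):
    """One resample / predict / reweight cycle on a 1D toy state."""
    n = len(particles)
    # 1. Resample hypotheses in proportion to their current weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # 2. Predict: diffuse each hypothesis with the (assumed) motion model.
    predicted = [p + rng.gauss(0.0, motion_noise) for p in resampled]
    # 3. Reweight each hypothesis by the (assumed) Gaussian likelihood.
    likelihoods = [math.exp(-0.5 * ((observation - p) / obs_noise) ** 2)
                   for p in predicted]
    total = sum(likelihoods)
    if total == 0.0:
        # All hypotheses are implausible: fall back to uniform weights.
        return predicted, [1.0 / n] * n
    return predicted, [l / total for l in likelihoods]


# Toy run: the hidden state is 3.0 and is observed without noise.
rng = random.Random(0)
particles = [rng.uniform(-10.0, 10.0) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)
for _ in range(10):
    particles, weights = particle_filter_step(particles, weights, 3.0, rng)
estimate = sum(p * w for p, w in zip(particles, weights))
```

In one dimension a few hundred particles cover the state space well. Covering each of 25 dimensions with comparable density would require a number of particles that grows exponentially with the dimension, which is the scalability problem described above.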
2.3.2 Advanced Particle Filtering
Because particle filtering scales poorly to high-DOF tracking, several advanced particle filtering techniques have been developed to sample particles and find the global maximum more effectively. Deutscher et al. [12] propose annealed particle filtering, which incorporates the concept of simulated annealing into particle filtering. With smoothed likelihood functions and layered sampling, annealed particle filtering conducts a coarse-to-fine search that can find the global maximum with fewer particles. Fontmarty et al. [15] propose a modified annealed particle filter that also incorporates the concept of importance sampling from ICONDENSATION [22]. Additional particles estimated by other methods, such as part detection, are added to the particle set and may effectively improve the tracking results. Sminchisescu and Triggs [47][48] propose a method called covariance scaled sampling, in which particles are sampled at the scale of the estimated covariance.
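The coarse-to-fine behaviour of annealed particle filtering can be sketched on a toy problem. The sketch below assumes a fixed schedule of exponents `betas` and a fixed shrink factor for the diffusion noise, whereas [12] chooses the annealing rate adaptively; the bimodal likelihood and all names are hypothetical and serve only to show how smoothed likelihoods let particles escape a local maximum.

```python
import math
import random


def bimodal_likelihood(x):
    """Toy likelihood: global peak at +4, weaker local peak at -4."""
    global_peak = math.exp(-0.5 * (x - 4.0) ** 2)
    local_peak = 0.3 * math.exp(-0.5 * (x + 4.0) ** 2)
    return global_peak + local_peak


def annealed_search(particles, likelihood, rng,
                    betas=(0.1, 0.3, 0.6, 1.0), noise=2.0, shrink=0.5):
    """Run the annealing layers for a single time step."""
    n = len(particles)
    for beta in betas:
        # A small beta flattens the likelihood, so early layers explore broadly.
        weights = [likelihood(p) ** beta for p in particles]
        if sum(weights) == 0.0:
            weights = [1.0] * n
        particles = rng.choices(particles, weights=weights, k=n)
        # Diffuse with noise that shrinks from layer to layer.
        particles = [p + rng.gauss(0.0, noise) for p in particles]
        noise *= shrink
    return particles


rng = random.Random(1)
start = [rng.uniform(-10.0, 10.0) for _ in range(400)]
final = annealed_search(start, bimodal_likelihood, rng)
near_global = sum(1 for p in final if abs(p - 4.0) < 2.0)
near_local = sum(1 for p in final if abs(p + 4.0) < 2.0)
```

After the last layer most particles should sit near the global peak rather than the local one, although with a poorly chosen schedule or too few particles the search can still settle on the wrong mode.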
There are also other advanced particle filtering techniques that use gradient descent search. Wang and Rehg [54] divide the steps of particle filtering into multiple modules and analyze the influence of different gradient descent search methods on particle sampling at different stages. In addition, some advanced particle filtering techniques have been applied to articulated hand tracking in a high-dimensional state space, such as appearance-guided particle filtering [7] and smart particle filtering [4].
Some previous works incorporate dimensionality reduction learning methods to curb the exponential increase in the number of sampled particles for high-DOF tracking. An influential dimensionality reduction method is Principal Component Analysis (PCA), but being linear it is inadequate for the non-linear configuration space of human motion. Manifold learning algorithms, such as Locally Linear Embedding (LLE), Isomap, and Laplacian Eigenmaps, are also inadequate because the inverse mapping from the low-dimensional space to the original state space is not always available, and the inverse mapping is usually indispensable for evaluating the likelihood function used to reweight sampled particles.
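To make the role of the inverse mapping concrete, the linear PCA case can be written out. The notation is assumed here for illustration and does not come from the cited works: x is a D-dimensional pose vector, \mu the mean of the training poses, W the D x d matrix whose columns are the leading principal directions, and z the d-dimensional code.

```latex
z = W^{\top}(x - \mu), \qquad
\hat{x} = \mu + W z, \qquad
W^{\top} W = I_d, \quad d \ll D .
```

Particles can then be sampled in the low-dimensional z-space and mapped back to \hat{x} to evaluate the likelihood. PCA provides this inverse mapping but only as a linear function, whereas the manifold learning algorithms named above embed the training points without directly providing a mapping back to the state space.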
Li et al. [28], Raskin et al. [41] and Hou et al. [18] use the Gaussian Process Model, which provides an inverse mapping, to reduce the dimensionality, improving tracking accuracy and decreasing computation time. One disadvantage of these methods is that they are only valid for tracking trained human movements. Moreover, Xu and Li [55] exploit the symmetry among human postures while walking and find the motion correlation by learning from training images. Particle filtering is then only required to estimate the parameters of one side of the body, and the other parameters are inferred from the learned symmetry correlation, so the number of DOFs to be estimated is effectively reduced.
2.3.3 Hierarchical Particle Filtering
Despite numerous creative ideas for reducing the exponential computational cost of high-DOF tracking, there is still no satisfactory solution to this problem. Researchers have therefore proposed hierarchical methods that decompose the search space.
MacCormick and Isard [29] propose hierarchical partitioned sampling for 2D hand shape tracking. The hand shape is modeled using a B-spline composed of 28 measurement lines, in which the 8 measurement lines of the fist are determined first, and the remaining ones are then determined with those 8 DOFs removed from the search. Deutscher et al. [13] argue that the parameters of the human posture should not be partitioned into disjoint sets subjectively by researchers. They therefore propose a method for automatic partitioning, which determines the order and range of sampling in annealed particle filtering from the covariance matrix. Hierarchical particle filtering methods for human motion tracking often predict the state of the torso first and then treat the four limbs as independent, which decomposes the search space effectively and reduces the computational cost. One major disadvantage of hierarchical tracking methods is that an inaccurate torso estimate may sharply degrade the quality of limb motion tracking.
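The torso-first decomposition can be sketched on a toy articulated state. The sketch assumes one scalar angle for the torso and one relative angle for each limb, Gaussian likelihoods, and absolute limb observations equal to the torso angle plus the relative angle; these modeling choices and the names `weighted_mean_estimate` and `track_hierarchically` are hypothetical.

```python
import math
import random


def gaussian(error, sigma=0.5):
    return math.exp(-0.5 * (error / sigma) ** 2)


def weighted_mean_estimate(likelihood, rng, n=500, low=-5.0, high=5.0):
    """Sample one parameter uniformly and return its likelihood-weighted mean."""
    samples = [rng.uniform(low, high) for _ in range(n)]
    weights = [likelihood(s) for s in samples]
    total = sum(weights)
    if total == 0.0:
        return sum(samples) / n
    return sum(s * w for s, w in zip(samples, weights)) / total


def track_hierarchically(torso_obs, limb_obs, rng):
    """Estimate the torso first, then each limb independently given the torso."""
    torso = weighted_mean_estimate(lambda t: gaussian(torso_obs - t), rng)
    limbs = []
    for obs in limb_obs:
        # The limb likelihood is conditioned on the torso estimate, so any
        # torso error is passed on to every limb.
        limbs.append(weighted_mean_estimate(
            lambda a: gaussian(obs - (torso + a)), rng))
    return torso, limbs


rng = random.Random(0)
true_torso = 1.0
true_limbs = [0.5, -0.5, 1.0, -1.0]
observed_limbs = [true_torso + a for a in true_limbs]
torso, limbs = track_hierarchically(true_torso, observed_limbs, rng)
```

Five one-dimensional searches replace a single five-dimensional one, which is where the saving comes from. The same structure also shows the weakness noted above: the limb estimates absorb whatever error the torso estimate contains.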
Mündermann et al. [38] use ICP (Iterative Closest Point) to estimate the state of each body part after reconstructing a 3D human volume. To keep the torso and limbs connected, the idea of a soft joint is proposed: the error metric of ICP considers the distances between the joints of connected body parts as well as the original corresponding points.
Chapter 3
Model-Based 3D Human Motion Tracking
In this chapter, we introduce the 3D human model, the 3D human volume and particle filtering, the elements on which Model-Based 3D human motion tracking is built. First, we introduce the parameters and characteristics of 3D human models and design a 3D human model applicable to our work. Then we reconstruct a 3D human volume from multiple cameras, which is the key measurement for estimating the human posture by matching against the 3D human model. Finally, we introduce the advantages and limitations of particle filtering, the method used for human motion tracking in our work.
To address these limitations, we propose our improved method in Chapter 4.
3.1 3D Human Model
The shape of the 3D human model is determined by a group of figure parameters, and its pose is described by the motion parameters of the joints, each with its degrees of freedom. When a 3D human model is used for human motion tracking, the human motion is expressed through the parameters of the model by mapping the feature space to the parameter space. Two major advantages of this representation are that reasonable kinematic constraints can be easily enforced and that high-level applications of the tracking results, such as animation or action analysis, can also be easily performed.
3.1.1 Figure Parameters
As stated above, the parameters that express the state of the 3D human model can be divided into the figure parameters and the motion parameters of the joints. The figure parameters determine the shape of the 3D human model. In theory, the more closely the shape of the 3D human model resembles the body of the tracked person, the more favorable it is for evaluating the likelihood or measurement function. Even if the 3D human model is very close to the actual human body, however, it is in practice difficult to obtain perfect observations for estimation, because the acquisition of observations depends on many factors, including the resolution of the captured images, the method of foreground detection and the accuracy of the 3D volume reconstruction.
It is therefore not necessary to build an overly detailed 3D human model. Kehl et al. [25] use a general and detailed 3D human model for 3D human motion tracking. They even consider the blending of the model surface when the joints of the body are extended or bent. But body shape differs from person to person, and because an overly detailed human model has too many figure parameters, its availability is reduced. Cheung et al. [9] build a detailed, person-specific 3D human model in an environment with many cameras and auxiliary apparatus before human motion tracking. Mündermann et al. [38] obtain 46 full-body laser scans and build a database of deformable human shape models learned using principal component analysis (PCA). Finding a group of figure parameters for a detailed 3D human model thus requires a complicated environment, apparatus and other prerequisites; otherwise, it is not easy to achieve.
For the reasons described above, many studies compose the 3D human model from simple geometric primitives, such as spheres, cylinders and cuboids. The figure parameters are then just the parameters that control the physical dimensions of these primitives. With a 3D human model of this kind, it is very convenient to initialize the figure
Figure 3.1: The 3D human model we design has 22 DOFs in total: 6 DOFs for the torso and 4 DOFs for each limb.
parameters manually or automatically to fit the human body. Mikić et al. [33] mark and divide the possible body parts using the result of the 3D human volume reconstruction. During tracking, the labels of the body parts are updated to estimate the figure parameters using Bayesian networks, aided by exaggerated motions such as stepping over a box, turning around or lifting a leg. Using particular motions to adjust figure parameters automatically is a common approach. Other studies mostly use generic figure parameters for the 3D human model and concentrate on the main issue of human motion tracking, or they initialize the figure parameters manually.
Michoud et al. [32] assume that the human figure follows fixed proportions, so that the height of the person alone determines the figure parameters needed to generate the 3D human model. Our work also uses a simple 3D human model and initializes the figure parameters manually. We divide the 3D human model into several parts: head, torso, upper arm, forearm, thigh and leg. Except for the head and torso, each part appears on both the left and the right side, so the human model is made up of ten parts. The head is represented by a sphere, the torso by an oriented cuboid, and the limbs by cylinders. The 3D human model is shown in Figure 3.1.
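A ten-part model of this kind can be sketched as a small data structure. The class names, the part names and in particular the body proportions relative to height are placeholders invented for illustration; they are not the values used in this work or in [32].

```python
from dataclasses import dataclass


@dataclass
class Sphere:
    radius: float


@dataclass
class Cuboid:
    width: float
    depth: float
    height: float


@dataclass
class Cylinder:
    radius: float
    length: float


# Hypothetical (radius, length) ratios relative to body height.
LIMB_SEGMENTS = {
    "upper_arm": (0.030, 0.17),
    "forearm": (0.025, 0.16),
    "thigh": (0.045, 0.24),
    "leg": (0.035, 0.24),
}


def build_model(height):
    """Derive all figure parameters from body height (assumed proportions)."""
    model = {
        "head": Sphere(radius=0.065 * height),
        "torso": Cuboid(width=0.20 * height, depth=0.10 * height,
                        height=0.30 * height),
    }
    # Each limb segment exists once on the left and once on the right side.
    for side in ("left", "right"):
        for name, (radius, length) in LIMB_SEGMENTS.items():
            model[f"{side}_{name}"] = Cylinder(radius=radius * height,
                                               length=length * height)
    return model


model = build_model(1.75)
```

With manual initialization, the same structure would simply be filled with measured dimensions instead of values derived from height.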
3.1.2 Motion Parameters
After the shape of the 3D human model has been determined, the motion parameters are what we estimate as the human posture while tracking human motion. The parameter space discussed hereafter always consists of this kind of parameter. In general human motion tracking, the DOFs of a 3D human model refer to the number of its motion parameters. A 3D human model with more DOFs can imitate more human motions, but the increased DOFs also make it more difficult to estimate the correct state during tracking. Besides the enlarged parameter space, a further reason, the same as for the figure parameters of a detailed 3D human model, is that the observations obtained are usually not accurate enough to measure the difference between slightly different movements. What we expect to estimate is the overall trend of the motion, not its fine details. We therefore reduce the complexity of human motion tracking by removing unnecessary DOFs as far as possible. For the 3D human model we use, we consider 6 DOFs for the torso motion: 3 for rotation and 3 for translation. We consider only the orientation and position of the torso and do not model the bending of the shoulders and pelvis separately. It is worth mentioning that we do not model the neck, so we do not consider the DOFs of the neck. We nevertheless model the head, because the head, although it has no degrees of freedom of its own, is consulted when estimating the 6 DOFs of the torso. More details are given in the discussion of the human motion tracking method.
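The resulting parameter space can be pictured as a flat state vector. The sketch below follows the counts given in the caption of Figure 3.1 (6 DOFs for the torso and 4 for each limb, 22 in total); the ordering of the entries and the names used as keys are assumptions made for illustration.

```python
TORSO_DOFS = 6  # 3 for translation and 3 for rotation
LIMB_DOFS = 4   # per limb, as stated in the caption of Figure 3.1
LIMBS = ("left_arm", "right_arm", "left_leg", "right_leg")


def state_layout():
    """Map each group of motion parameters to a slice of the state vector."""
    layout = {
        "torso_translation": slice(0, 3),
        "torso_rotation": slice(3, 6),
    }
    start = TORSO_DOFS
    for limb in LIMBS:
        layout[limb] = slice(start, start + LIMB_DOFS)
        start += LIMB_DOFS
    return layout, start


layout, total_dofs = state_layout()
state = [0.0] * total_dofs            # one posture hypothesis (one particle)
left_arm_params = state[layout["left_arm"]]
```

Each particle in the tracker is one such vector, so every DOF that is removed from the model shortens the vector that has to be searched.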
Because the 3D human model regards the torso as the root of its hierarchical structure, the estimation results for the torso and the limbs are not independent. Depending on the method used to estimate human postures, the joint constraint set up between the torso and the limbs differs, and so do the DOFs required for the limb motions. For our 3D human model, we analyze the DOFs required for the limb motions according to the joint constraint between the torso and the limbs. The corresponding estimation methods are discussed further when we introduce the human motion tracking method later. We divide the joint constraints between the torso and the limbs into hard-joint, free-joint and soft-joint constraints. We assume that the joints between the upper and lower segments of each limb always have a hard-joint constraint.
• Hard-Joint Constraint
There is a fixed joint that makes both sides link up together tightly between torso