Department of Computer Science and Information Engineering College of Electrical Engineering & Computer Science
National Taiwan University Master Thesis
Multiview 3D Human Motion Tracking with Soft-Joint Constrained ICP
Lu-Jong Chu
Advisor: Yi-Ping Hung, Ph.D.
Chu-Song Chen, Ph.D.
July, 2008
Abstract
In this thesis, we aim to track 3D human motions in image sequences captured from multiple cameras. The target motion is not limited to specific kinds of human motion, such as walking or jogging; that is, no restrictions are imposed on the possible human motions. Because self-occlusion and depth ambiguity occur easily when only a single camera is used, we obtain multiple videos captured by multiple cameras from different viewpoints to reconstruct the 3D shape volume of the target subject, which is an effective way to integrate information from multiple views.
We propose a hierarchical human motion tracking method that can effectively capture articulated human motions with high degrees of freedom (DOFs). At each time step, the torso motion is estimated first, and then the limb motions are estimated individually. Particle filtering, a popular method for high-dimensional tracking, is adopted to track the torso motion because it can deal with nonlinear and multimodal posterior probability distributions.
One disadvantage of hierarchical human motion tracking is that torso tracking errors may degrade limb motion estimation. To reduce the interference from inaccurate torso motions, we propose a soft-joint constrained ICP (Iterative Closest Point) method to estimate limb motions. In contrast to hard joints, limbs with soft joints are allowed to move freely within a small range, so it is still possible to track limb motions even with inaccurate torso motions. However, the DOFs of each limb increase from 4 to 7 when the soft-joint constraint is used. The proposed soft-joint constrained ICP can efficiently determine 6 DOFs, such that only 1 DOF (elbow/knee) is left for the particle filtering. By combining the advantages of particle filtering and soft-joint constrained ICP, our method can effectively track limb motions even when there is large motion in a short period of time.
Moreover, we find that the torso motion is strongly related to the limb motions. If the states of the four limbs are known, it is usually possible to predict the torso state without other information, especially when the limb states are reliable. To improve torso motion tracking, the limb motions estimated at the previous time step provide reliable hypotheses for the current torso state, which is implemented by sampling particles from the limb states for torso tracking. We have conducted experiments with multiple video sequences of different motions, and the results show that our method is effective and reliable for 3D human motion tracking.
Keywords: human motion tracking, particle filtering, pose estimation, ICP, multiview, 3D human model, volume reconstruction
Contents
Abstract
List of Figures
1 Introduction
1.1 Problem and Challenges
1.2 Proposed Method
1.3 Overview of Our Method
1.4 Outline of the Thesis
2 Related Works
2.1 Model-Free vs. Model-Based
2.2 Single View vs. Multiple View
2.3 Image-Based Localization vs. Video-Based Tracking
2.3.1 Kalman Filtering vs. Particle Filtering
2.3.2 Advanced Particle Filtering
2.3.3 Hierarchical Particle Filtering
3 Model-Based 3D Human Motion Tracking
3.1 3D Human Model
3.1.1 Figure Parameters
3.1.2 Motion Parameters
3.2 3D Volume Reconstruction
3.2.1 Introduction to Volume-Based Visual Hull Construction
3.2.2 Implementation to Voxel-based Approach
3.3 Particle Filter Tracking
3.3.1 General Particle Filtering
3.3.2 Hierarchical Particle Filtering
4 Soft-Joint Constrained ICP and Torso Prediction
4.1 Introduction to ICP
4.2 Soft-Joint Constrained ICP
4.3 Voxel Labeling
4.4 Torso Prediction with Soft Joint Locations
5 Experiments
6 Conclusions and Future Works
6.1 Conclusions
6.2 Future Works
Bibliography
List of Figures
1.1 Challenges for human motion capture
1.2 System flowchart
2.1 Model-Free vs. Model-Based
2.2 Single View vs. Multiple View
2.3 Image-Based Localization vs. Video-Based Tracking
3.1 3D human model
3.2 Voxel Reconstruction
4.1 Original ICP vs. Soft-joint constrained ICP
4.2 Voxel Labeling
4.3 Torso prediction with limbs states
5.1 Tracking results of pointing
5.2 Tracking results of checking watch
5.3 Tracking results of scratching head
5.4 Tracking results of waving
5.5 Tracking results of punching
5.6 Tracking results of kicking
5.7 Tracking results of picking up and throwing
5.8 Tracking results of turning around
5.9 Tracking results of walking around
5.10 Recovery from drift when tracking video of kicking under poor observations
Chapter 1
Introduction
In this chapter, we define the problem and illustrate challenges of 3D human motion tracking. Then the proposed hierarchical human motion tracking method is described briefly and the overview of our method is shown. Finally, the organization of this thesis is introduced.
1.1 Problem and Challenges
The purpose of human motion capture is to use different kinds of sensors to estimate the parameters that describe human posture, including the angles of connecting joints and the orientations and positions of body parts. This is an interesting problem with many applications. In medical science, it can be used to aid analysis for rehabilitation. In entertainment, human-computer interaction and computer animation are both common applications.
One common approach to human motion capture is a marker-based system. The user must wear sensors on the joints of the body, which detect the acceleration and the center of gravity of movements. In vision-based human motion capture with markers, many reflective markers are attached to the joints and then detected by multiple infrared cameras. The 3D positions of the markers are estimated by triangulation from multiple views, which may fail when a marker is not visible in at least two cameras. Though marker-based methods can capture human motions effectively, their expensive and intrusive equipment renders them inappropriate for many applications, such as surveillance for home care or public security, interactive games, and video annotation in multimedia. For these and other emerging home applications, the intrusive and expensive equipment hinders the adoption of marker-based methods.
Figure 1.1: Challenges for human motion capture because of shape variance, appearance variance, pose variance and ambiguity with view dependence. This figure is extracted from [57].
In recent years, vision-based marker-free human motion capture has become a popular research topic. This is an attractive but extremely challenging problem, as shown in Figure 1.1, because of the following difficulties:
• Shape variance
The shapes of different people vary with their skeletons and muscles. Moreover, the elasticity of clothes may also change the observed human shape. Shape variance makes the observations different even for the same posture.
• Appearance variance
Besides human skin colors and textures, the wide variety of human clothing leads to various kinds of appearance.
• Pose variance
There are high degrees of freedom (DOFs) in articulated human motions. The human body is made up of hundreds of bones and extensible muscles. Human bodies can exhibit an enormous number of different postures, which makes human motion tracking non-trivial.
• View dependence
The same posture produces different observations from different viewpoints, while different postures may produce similar observations from the same viewpoint because of depth ambiguity.
The challenges of vision-based marker-less human motion tracking include, but are not limited to, the above items. In general, this is still an open problem in computer vision and thus deserves further investigation.
1.2 Proposed Method
In this thesis, we aim to perform multiview model-based human motion tracking from image sequences observed from different viewpoints. The advantage of using a 3D human model is that reasonable kinematic constraints can be easily enforced, and high-level applications such as animation or action recognition can also be easily built on the results. When only a single camera is used, self-occlusion and depth ambiguity occur easily, so we obtain multiple videos captured by multiple cameras to reconstruct a voxel-based 3D human volume, which is an effective way to integrate the information from multiple views.
We propose a hierarchical human motion tracking method with soft-joint constrained ICP, which is effective for human motions with high DOFs. To avoid a computational cost that grows exponentially with the number of DOFs, the hierarchical method decomposes the search space. At each time step, the torso motion is estimated first, and then the limb motions are estimated individually.
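To make the benefit of this decomposition concrete, the following sketch counts hypotheses for a joint search versus a hierarchical one. It assumes, purely for illustration, a regular grid with `k` samples per DOF (the thesis uses particles, not a grid), together with the 6 torso DOFs and 4 DOFs per limb of the model described in Chapter 3. The function names are hypothetical.

```python
def joint_search_size(k, torso_dofs=6, limb_dofs=4, limbs=4):
    """Grid hypotheses if all DOFs are searched jointly (k samples per DOF)."""
    return k ** (torso_dofs + limbs * limb_dofs)


def hierarchical_search_size(k, torso_dofs=6, limb_dofs=4, limbs=4):
    """Grid hypotheses if the torso is searched first, then each limb separately."""
    return k ** torso_dofs + limbs * k ** limb_dofs


if __name__ == "__main__":
    k = 10  # hypothetical sampling density per DOF
    print(joint_search_size(k))         # 10**22
    print(hierarchical_search_size(k))  # 10**6 + 4 * 10**4 = 1040000
```

Under these assumptions the hierarchical search is smaller by many orders of magnitude, at the price of making the limb estimates depend on the quality of the torso estimate.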
The torso motion is difficult to estimate because of body shape variance and silhouette/voxel noise. To track the orientation and position of the torso, we adopt particle filtering, which is capable of modeling nonlinear and multimodal posterior distributions and can maintain multiple hypotheses.
One major disadvantage of hierarchical human motion tracking is that torso estimation errors may degrade limb motion estimation. To reduce the interference from torso motion errors, we propose a soft-joint constrained ICP to estimate limb motions. In contrast to hard joints, limbs with soft joints are allowed to move freely within a small range. The soft-joint constraint also allows the rigid 3D human model to accommodate the flexibility of the human body. However, the DOFs of each limb increase to 7 when the soft-joint constraint is used, instead of 4 for the hard-joint constraint. The proposed soft-joint constrained ICP can efficiently determine 6 of the 7 DOFs, such that only 1 DOF (elbow/knee) is left for the particle filtering. By combining the advantages of particle filtering and soft-joint constrained ICP, our method can effectively track limb motions even when there is large motion in a short period of time.
Moreover, we find that the torso motion is strongly related to the limb motions. If the states of the four limbs are known, it is usually possible to predict the torso state without other information, especially when the limb states are reliable. To improve torso motion tracking, the limb motions estimated at the previous time step provide reliable hypotheses for the current torso state, which is implemented by sampling particles from the limb states for torso tracking. We have conducted experiments with multiple video sequences of different motions, and the results show that our method is effective and reliable for 3D human motion tracking.
1.3 Overview of Our Method
We provide an overview of the proposed human motion tracking method in this section.
We assume that all cameras are calibrated, that is, the projection function from a given 3D point to each image plane is known. We also assume that the target subject can be segmented from the background with some background modeling method. The segmentation results are not expected to be perfect, since segmentation artifacts always exist in real-world cases. The pose of the target subject in the first frame is assumed to be given, either by manual alignment or by other automatic localization techniques for static images.
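The calibration assumption can be illustrated with a pinhole camera. The sketch below assumes that each camera is described by a 3x4 projection matrix `P` acting on homogeneous coordinates, which is a common convention but is not stated in the text. The matrix values in the example are hypothetical.

```python
def project(P, X):
    """Project a 3D point X = (x, y, z) to pixel coordinates with a 3x4 matrix P."""
    xh = (X[0], X[1], X[2], 1.0)  # homogeneous 3D point
    u, v, w = (sum(p * c for p, c in zip(row, xh)) for row in P)
    if w == 0:
        raise ValueError("point lies on the principal plane and has no projection")
    return (u / w, v / w)


if __name__ == "__main__":
    # Hypothetical camera at the origin: focal length 500 px, principal point (320, 240).
    P = [
        [500.0, 0.0, 320.0, 0.0],
        [0.0, 500.0, 240.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
    print(project(P, (0.1, -0.2, 2.0)))
```

With one such function per calibrated camera, any 3D point of the model or of the reconstructed volume can be compared against every silhouette image.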
At each time step in the tracking process, our method performs hierarchical human motion tracking starting from the previously estimated posture. Each iteration contains the following major steps:
1. capture images from multiple cameras at different viewpoints
2. obtain silhouette images using foreground detection based on some background modeling method
3. reconstruct the 3D shape volume of the target subject from silhouette images
4. track torso motion using particle filtering with torso prediction
5. label surface voxels to indicate which body part they belong to
6. track limb motions using particle filtering with soft-joint constrained ICP
The flowchart of the proposed method is shown in Figure 1.2.
Figure 1.2: System flowchart of the proposed human motion tracking method.
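The data flow between the steps listed above can be sketched as a single per-frame function. This is only a skeleton under stated assumptions: every stage is passed in as a callable, and the key names (`segment`, `reconstruct`, `track_torso`, `label_voxels`, `track_limbs`) are hypothetical placeholders rather than names used in the thesis.

```python
def track_frame(images, prev_pose, stages):
    """Run one tracking iteration; `stages` maps hypothetical stage names to callables."""
    # Step 1 is the capture itself: `images` holds one image per camera.
    silhouettes = [stages["segment"](img) for img in images]        # step 2
    volume = stages["reconstruct"](silhouettes)                     # step 3
    torso = stages["track_torso"](volume, prev_pose)                # step 4
    labels = stages["label_voxels"](volume, torso, prev_pose)       # step 5
    limbs = stages["track_limbs"](volume, labels, torso, prev_pose) # step 6
    return {"torso": torso, "limbs": limbs}
```

The returned pose becomes `prev_pose` for the next frame, which is how the estimate from the previous time step enters both the torso and the limb tracking.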
1.4 Outline of the Thesis
This thesis is organized as follows. Chapter 2 discusses related works on human motion capture. The details of our method are described in Chapter 3 and Chapter 4. Chapter 3 covers the prerequisites, such as the human model design and 3D volume reconstruction, as well as torso motion tracking with particle filtering. Chapter 4 describes limb motion tracking with soft-joint constrained ICP and how to predict the torso state using the soft joint locations of the four limbs. Experimental results and analysis are presented in Chapter 5, where multiple videos with different kinds of motions are used to validate our method. Finally, conclusions and future works are presented in Chapter 6.
Chapter 2
Related Works
Research on human motion capture has been under way for more than 20 years, and there is a plethora of relevant literature [34][53][19][35][39]. This is a fascinating yet challenging problem. Some previous works attempt to perform human motion capture under circumstances with few constraints, where unlimited free human movements are allowed [27][20][59]. These are the most difficult cases, for which there is still no satisfactory solution. Therefore, some works enforce useful constraints as needed, such as fixed background environments [9][54] or known clothing colors [33][54], to regularize the problem. There are also works that focus on only specific human movements, such as walking [5][45][46][58][6][55], jogging [10][1], golf swings [52][51], skating [36], or ballet [18].
In this thesis, we aim to deal with general movements and propose a multiview model-based method for human motion tracking. In this chapter, we discuss in turn the pros and cons of, and related works on, the following alternatives:
• Model-Free vs. Model-Based
• Single View vs. Multiple View
• Image-Based Localization vs. Video-Based Tracking
Figure 2.1: Model-Free vs. Model-Based. The left is extracted from [8] and the right is extracted from [47].
2.1 Model-Free vs. Model-Based
The difference between Model-Free and Model-Based methods (shown in Figure 2.1), as the names suggest, is that the latter use an auxiliary human model with anatomical structure. Model-Free methods usually estimate human motions with a bottom-up process. They use part detection techniques for the head, torso, or limbs to detect and score the possible candidates for each part, and finally consolidate the best association [21][43]. However, it is not easy to construct a robust detector for each part. Hua et al. [20] collect 2D shapes of human motions as prior knowledge and propose a data-driven belief propagation Monte Carlo algorithm to infer pose parameters from image cues. Ramanan et al. [40] automatically build an appearance detector for every body part of the person in the video. Mori et al. [37] propose an effective segmentation method and acquire appearance information of the parts to build an appearance model in advance. Ren et al. [42] simply employ various pairwise configuration constraints for edges, such as parallelism, to form the best body configurations. Human motion recovery with bottom-up estimation is flexible but also relatively unstable. In 3D cases, Cheung et al. [8], after reconstructing a 3D human volume, calculate the principal axes of the volume and use oval columns of adjustable size to fit the human volume and recover human postures.
The human body has an anatomical structure, so body parts are correlated with each other. The advantage of using a 3D human model is that reasonable kinematic constraints can be easily enforced, and high-level applications such as animation or action recognition can also be easily built on the results. Model-Based methods usually estimate human motions with a top-down process. They estimate high-dimensional configurations of human postures by measuring similarities between predicted and actual observations. The methods in [1][14][16] all employ a 2D model. Its advantage is that it neglects depth along the viewing direction, which simplifies the problem; its disadvantage is that it cannot estimate 3D information about human postures. On the other hand, the results of human motion capture with a 3D model are very intuitive [10][12][48][25][6]. The main disadvantage of using a 3D model is that a suitable 3D human model is not always available, since every person's body is different. To address this, Mündermann et al. [38] establish a database of human figures, and Cheung et al. [9] build 3D human shape and appearance models directly from multiple cameras in advance, which resolves the problem of model availability. We adopt a simple 3D human model and combine the information from multiple cameras to search for the model parameters that optimize the measurement functions.
2.2 Single View vs. Multiple View
In this section, we discuss relevant research that uses information from a single view or from multiple views, as illustrated in Figure 2.2. Lee and Cohen [27] localize each body part to estimate the human posture in a single still image. Sidenbladh et al. [45] and Sminchisescu and Triggs [47] recover 3D postures from a single monocular image sequence. The difficulty of recovering postures from a single view is that self-occlusion and depth ambiguity occur easily. Agarwal and Triggs [2] use a mixture-of-regressors framework to find multiple possible poses for monocular images. Many methods, such as [5][12], were originally developed for the single-view scenario but can be extended to multiple views in a straightforward way. They usually sum the prediction errors calculated independently in each view and take the posture with the minimum total error as their estimate. This is a simple but not necessarily the most effective way to integrate information from multiple views, because not every view contains equally discriminative cues for each human motion all the time. Delamarre and Faugeras [10] estimate 3D movement directions from the differences of silhouettes in each view, and then integrate the movement vectors as the model motion. Kakadiaris and Metaxas [24] use three orthogonal cameras and consider occluded regions and motion changes to choose only the cameras with significant changes for posture estimation, but potentially useful information in the discarded views is then ignored entirely.
Figure 2.2: Single View vs. Multiple View. The left is extracted from [27] and the right is extracted from [9].
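The straightforward multiview extension discussed in this section, summing independently computed per-view errors and keeping the best posture, can be sketched as follows. The representation of a posture and of the per-view error functions is left abstract; both are hypothetical stand-ins.

```python
def best_pose_by_summed_error(candidates, view_errors):
    """Return the candidate posture with the smallest total error over all views.

    view_errors holds one function per camera, each mapping a posture to an error.
    """
    return min(candidates, key=lambda pose: sum(err(pose) for err in view_errors))


if __name__ == "__main__":
    # Toy 1D "postures" and two hypothetical per-view error functions.
    views = [lambda p: (p - 1.0) ** 2, lambda p: abs(p - 1.5)]
    print(best_pose_by_summed_error([0.0, 1.0, 2.0], views))  # 1.0
```

Because every view contributes with equal weight, a view with weak or misleading cues influences the result as much as an informative one, which is the limitation noted in the text.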
There is also one popular and effective way to integrate the information from multiple views, that is, constructing a 3D shape volume for the human body from multiple views.
Instead of considering 2D human silhouette from each view, the 3D shape volume is a visual hull that is consistent with the silhouettes of multiple views at the same time.
Therefore, the reconstructed shape volume can be used when estimating human postures for the multiview scenario [33][9][25][32].
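A minimal sketch of silhouette-consistent voxel carving is given below. It keeps a voxel only if its center projects onto the foreground in every view. The camera representation (a projection callable paired with a binary silhouette stored as nested lists) is an assumption made for illustration and is not the voxel-based implementation described in Section 3.2.

```python
def _inside(silhouette, pixel):
    """True if pixel = (col, row) falls on a foreground (value 1) silhouette cell."""
    if pixel is None:
        return False
    col, row = int(round(pixel[0])), int(round(pixel[1]))
    in_bounds = 0 <= row < len(silhouette) and 0 <= col < len(silhouette[0])
    return in_bounds and silhouette[row][col] == 1


def carve_visual_hull(voxel_centers, cameras):
    """Keep the voxels whose projections are foreground in all views.

    cameras is a list of (project, silhouette) pairs, where project maps a
    3D point to a (col, row) pixel or None if the point is not visible.
    """
    return [X for X in voxel_centers
            if all(_inside(sil, proj(X)) for proj, sil in cameras)]
```

As a toy usage example, two orthographic views along the z and x axes can be modeled with `lambda X: (X[0], X[1])` and `lambda X: (X[2], X[1])`; the carved result is then the intersection of the two back-projected silhouettes.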
Figure 2.3: Image-Based Localization vs. Video-Based Tracking. The left is extracted from [59] and the right is extracted from [1].
2.3 Image-Based Localization vs. Video-Based Tracking
We have already mentioned several previous works that localize the human posture (as shown in Figure 2.3) in a single static image, such as bottom-up human posture recovery using part detectors. Lee and Cohen [27] perform 3D human motion capture from a single image. The authors of [59] assume that the human body is made up of several image cues, and then exploit Sequential Monte Carlo to estimate the position of each cue. Mori and Malik [36] propose an example-based method, in which some key poses of skating are regarded as exemplars and the silhouettes of these poses are described by the Shape Context descriptor. The most suitable posture exemplars are then selected to interpolate the estimated human posture for a given input image. Though image-based localization methods have made impressive achievements, they are often limited to trained human postures and their accuracy is not satisfactory.
Considering human motion as a continuous sequence of postures, the estimate at the previous time step is an important source of information that can be exploited. The problem of human motion capture for continuous video sequences is regarded as video-based human motion tracking. We further discuss relevant research on human motion tracking in the following subsections, covering Kalman filtering, particle filtering, and advanced and hierarchical particle filtering.
2.3.1 Kalman Filtering vs. Particle Filtering
For 3D human motion tracking, one of the difficulties is the high dimensionality of the configuration space. Yamamoto et al. [56] and Bregler and Malik [5] recover high-DOF articulated human configurations by solving a linear estimation problem. Mikić et al. [33] propose a 3D voxel labeling method to label limbs and detect the positions of joints between different body parts, and then use extended Kalman filtering to estimate model configurations. However, the mapping from the parameter space to the feature space is nonlinear and multimodal. Linear estimation methods, such as Kalman filtering, are poorly suited to such nonlinear problems, not to mention that we cannot expect to find a perfect measurement function between model parameters and real-world observations.
The particle filter [23] remedies this by maintaining multiple hypotheses of the state estimate. Deutscher et al. [11] and Sidenbladh et al. [45] use general particle filtering to perform human motion tracking. Sidenbladh et al. [45] assume orthographic projection and focus on walking motions only. The configuration of their 3D human model consists of 25 DOFs, so the particle filter must search a 25-dimensional parameter space, where the search may easily become trapped in local maxima. To improve tracking accuracy, an exponentially increasing number of particles can be sampled, at the cost of computational overhead.
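The idea of maintaining multiple weighted hypotheses can be sketched with one cycle of a generic particle filter: resample, predict, reweight. The random-walk motion model, the Gaussian toy likelihood and all parameter values below are assumptions chosen for illustration; they are not the models used in [23], [11] or [45].

```python
import math
import random


def particle_filter_step(particles, weights, likelihood, noise_std, rng):
    """One resample / predict / reweight cycle of a generic particle filter."""
    n = len(particles)
    # Resample: draw n particles with probability proportional to their weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Predict: diffuse each particle with a random-walk motion model (an assumption).
    predicted = [[x + rng.gauss(0.0, noise_std) for x in p] for p in resampled]
    # Reweight: evaluate the observation likelihood of each hypothesis.
    raw = [likelihood(p) for p in predicted]
    total = sum(raw)
    new_weights = [w / total for w in raw] if total > 0 else [1.0 / n] * n
    return predicted, new_weights


def weighted_mean(particles, weights):
    """State estimate as the weighted mean of the particle set."""
    dim = len(particles[0])
    return [sum(w * p[i] for p, w in zip(particles, weights)) for i in range(dim)]


def demo(seed=0, n=500, steps=20):
    """Track a hypothetical static 1D state located at 3.0."""
    rng = random.Random(seed)
    target = 3.0
    likelihood = lambda p: math.exp(-0.5 * ((p[0] - target) / 0.5) ** 2)
    particles = [[rng.uniform(-10.0, 10.0)] for _ in range(n)]
    weights = [1.0 / n] * n
    for _ in range(steps):
        particles, weights = particle_filter_step(
            particles, weights, likelihood, 0.3, rng)
    return weighted_mean(particles, weights)[0]
```

In one dimension a few hundred particles cover the state space densely. Keeping the same density per dimension in a 25-dimensional space would require a number of particles that grows exponentially, which is the scalability problem discussed above.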
2.3.2 Advanced Particle Filtering
Because particle filtering scales poorly to high-DOF tracking, several advanced particle filtering techniques have been developed to sample particles and find the global maximum effectively. Deutscher et al. [12] propose annealed particle filtering, which incorporates the concept of simulated annealing into particle filtering. With smoothed likelihood functions and layered sampling, annealed particle filtering conducts a coarse-to-fine search that can find the global maximum with fewer particles. Fontmarty et al. [15] propose a modified annealed particle filter that also incorporates the concept of importance sampling from ICONDENSATION [22]. Additional particles estimated by other methods, such as part detection, are added and may effectively improve the tracking results. Sminchisescu and Triggs [47][48] propose a method called covariance scaled sampling, in which particles are sampled at the scale of the estimated covariance.
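One common way to obtain the smoothed likelihood functions mentioned above is to raise the likelihood to a power beta between 0 and 1 and to increase beta from layer to layer. The sketch below shows only this weighting step, under that assumption; the annealing schedule and the other details of [12] are not reproduced here.

```python
def annealed_weights(likelihoods, beta):
    """Normalized particle weights from likelihoods raised to the power beta.

    A small beta flattens the weights (coarse search); beta = 1 recovers the
    ordinary particle filter weights (fine search).
    """
    raw = [l ** beta for l in likelihoods]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    likelihoods = [0.9, 0.1]
    for beta in (0.25, 0.5, 1.0):  # hypothetical annealing schedule
        print(beta, annealed_weights(likelihoods, beta))
```

With a small beta, weaker hypotheses survive resampling in the early layers, so the particle set can still reach the global maximum before the later layers sharpen the distribution around it.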
There are also advanced particle filtering techniques that use gradient descent search methods. Wang and Rehg [54] divide the steps of particle filtering into multiple modules and analyze the influence of particle sampling with different gradient descent search methods at different stages. In addition, some advanced particle filtering techniques have been applied to articulated hand tracking in a high-dimensional state space, such as appearance-guided particle filtering [7] and smart particle filtering [4].
Some previous works incorporate dimensionality reduction learned from data to curb the exponential increase in the number of sampled particles for high-DOF tracking. An influential dimensionality reduction method is Principal Component Analysis (PCA), but it is inadequate for the nonlinear human motion configuration space. Manifold learning algorithms, such as Locally Linear Embedding (LLE), Isomap, and Laplacian Eigenmaps, are also inadequate because the inverse mapping from the low-dimensional space to the original state space is not always available, and this inverse mapping is usually indispensable for evaluating the likelihood function to reweight sampled particles. Li et al. [28], Raskin et al. [41] and Hou et al. [18] use the Gaussian Process Model with an inverse mapping; the reduced dimensionality improves tracking accuracy and decreases computation time. One disadvantage of these methods is that they are only valid for tracking trained human movements. Moreover, Xu and Li [55] exploit the symmetry among human postures while walking and find the motion correlation by learning from training images. Particle filtering is then only required to estimate the parameters of one side of the body, and the other parameters are inferred from the learned symmetry correlation, so the number of DOFs to be estimated is effectively reduced.
2.3.3 Hierarchical Particle Filtering
Despite numerous creative ideas for reducing the exponential computational cost of high-DOF tracking, there is still no satisfactory solution to this problem. Researchers have therefore proposed hierarchical methods that decompose the search space. MacCormick and Isard [29] propose hierarchical partitioned sampling for 2D hand shape tracking. The hand shape is modeled using a B-spline composed of 28 measurement lines, in which the 8 measurement lines of the fist are determined first, and then the others are determined with those 8 DOFs removed. Deutscher et al. [13] argue that the parameters of human postures should not be subjectively partitioned into multiple disjoint sets by researchers. They therefore propose a method for automatic partitioning, which determines the order and range of sampling in annealed particle filtering from the covariance matrix. Hierarchical particle filtering methods for human motion tracking often predict the state of the torso first and then treat the four limbs as independent, which decomposes the search space effectively and reduces the computational cost. One major disadvantage of hierarchical tracking methods is that inaccurate torso motion may sharply degrade the quality of limb motion tracking.
Mündermann et al. [38] use ICP (Iterative Closest Point) to estimate the state of each body part after reconstructing a 3D human volume. To keep the torso and limbs connected, the idea of a soft joint is proposed: the error metric of ICP considers the distances between the joints of connected body parts, as well as the original corresponding points.
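An error metric of the kind described above can be sketched as the usual sum of squared correspondence distances plus a penalty on the gap between the joint positions of two connected parts. The exact form and weighting used in [38] are not given in this section, so the additive form and the `joint_weight` parameter below are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_joint_icp_error(model_pts, data_pts, limb_joint, torso_joint, joint_weight=1.0):
    """ICP error with a soft-joint term (hypothetical additive form).

    model_pts[i] and data_pts[i] are assumed to be already matched correspondences.
    limb_joint and torso_joint are the positions of the same anatomical joint
    as predicted by the limb and by the torso, respectively.
    """
    correspondence_term = sum(sq_dist(m, d) for m, d in zip(model_pts, data_pts))
    joint_term = joint_weight * sq_dist(limb_joint, torso_joint)
    return correspondence_term + joint_term
```

A very large `joint_weight` approaches a hard joint, while a weight of zero reduces the metric to the original ICP error, in which the limb is free to drift away from the torso.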
Chapter 3
Model-Based 3D Human Motion Tracking
In this chapter, we introduce the 3D human model, the 3D human volume and particle filtering, the elements that enable Model-Based 3D human motion tracking. First, we introduce the parameters and characteristics of the 3D human model and design a model suitable for our work. Then, we reconstruct the 3D human volume from multiple cameras; this volume is the key measurement for estimating the human posture by matching it against the 3D human model. Finally, we introduce the advantages and limitations of particle filtering, the method used for human motion tracking in our work. To address these limitations, we propose our improved method in Chapter 4.
3.1 3D Human Model
The shape of the 3D human model is determined by a group of figure parameters, and the pose of the model is described by the motion parameters of its joints, one for each degree of freedom. When a 3D human model is used for human motion tracking, the human motion is expressed through the parameters of the model by mapping the feature space to the parameter space. Two major advantages of this representation are that reasonable kinematic constraints can be easily enforced and that high-level applications of the tracking results, such as animation or action analysis, can also be easily built.
3.1.1 Figure Parameters
As mentioned above, the parameters that express the state of the 3D human model can be divided into the figure parameters and the motion parameters of the joints. The figure parameters determine the shape of the 3D human model. In theory, the more similar the shape of the 3D human model is to the body of the tracked person, the more favorable it is for evaluating the likelihood or measurement function. Even if the 3D human model is very close to the actual human body, however, it is in practice difficult to obtain perfect observations for the estimation, because the acquisition of observations depends on many aspects, including the resolution of the captured images, the method of foreground detection and the accuracy of the 3D volume reconstruction. It is therefore unnecessary to build an overly detailed 3D human model. Kehl et al. [25] use a general and detailed 3D human model for 3D human motion tracking. They even consider the blending of the model surface when the joints of the body are stretched or bent. But every person's figure is different, and because an overly detailed human model has too many figure parameters, the availability of such a model is reduced. Cheung et al. [9] build a detailed, person-specific 3D human model in an environment with many cameras and auxiliary apparatus before human motion tracking. Mündermann et al. [38] obtain 46 full bodies using laser scans and then build a database of deformable models of human shapes learned using principal component analysis (PCA). Finding a group of figure parameters for a detailed 3D human model thus requires a complicated environment, apparatus and other prerequisites; otherwise, it is not easy to achieve.
For the reasons described above, many studies build the 3D human model from simple geometric primitives such as spheres, cylinders, and cuboids. The figure parameters are then just the parameters that control the physical dimensions of these primitives. For a 3D human model of this kind, it is very convenient to initialize the figure
3.1. 3D HUMAN MODEL
Figure 3.1: The 3D human model we design has 22 DOFs in total: 6 DOFs for the torso and 4 DOFs for each limb.
parameters manually or automatically to fit the human body. Mikić et al. [33] label and segment the possible body parts using the result of the 3D human volume reconstruction.
During tracking, the body-part labels are updated to estimate the figure parameters with Bayesian networks, aided by exaggerated motions such as stepping over a box, turning around, or lifting a leg. Using particular motions to adjust the figure parameters automatically is a common approach. Other studies, which concentrate on the main issue of human motion tracking, mostly use generic figure parameters for the 3D human model or initialize the figure parameters manually.
Michoud et al. [32] assume that the human figure follows fixed proportions, so that once the height of the person is known, the figure parameters can be determined to generate the 3D human model. Our work also uses a simple 3D human model and initializes the figure parameters manually. We divide the 3D human model into several parts: head, torso, upper arm, forearm, thigh, and leg. All parts except the head and torso come in symmetrical left and right pairs, so the human model is made up of ten parts. The head is represented by a sphere, the torso by an oriented cuboid, and the limbs by cylinders. The 3D human model is shown in Figure 3.1.
3.1.2 Motion Parameters
After the shape of the 3D human model is determined, the motion parameters are what we estimate to recover the human posture while tracking human motion. The parameter space discussed from here on always consists of this kind of parameter. In human motion tracking in general, the DOFs of a 3D human model refer to the number of its motion parameters. A 3D human model with more DOFs can imitate more human motions, but increasing the DOFs also makes it more difficult to estimate the correct state. Besides the growth of the parameter space, another reason, the same as for the figure parameters of a detailed 3D human model, is that the observations obtained are usually not good enough to measure the difference between slightly different movements. What we expect to estimate is the overall trend of the motion, not its fine details. We therefore reduce the complexity of human motion tracking by removing unnecessary DOFs as far as possible. For the 3D human model we use, we consider 6 DOFs for the torso motion, 3 for rotation and 3 for translation. We consider only the orientation and position of the torso and do not subdivide it further to model the bending of the shoulders and pelvis. Note that we have not designed a model of the neck, so we do not consider the DOFs of the neck. We do model the head, however: although the head has no degrees of freedom of its own, it is used to help estimate the 6 DOFs of the torso. More details are given in the discussion of the human motion tracking method.
Because the 3D human model regards the torso as the root of its hierarchical structure, the estimation results for the torso and the limbs are not independent. Depending on the method used to estimate human postures, the joint constraint set up between the torso and the limbs differs, and so do the DOFs required for the limb motions. For our 3D human model, we analyze the DOFs required for the limb motions according to the joint constraint between the torso and the limbs. The corresponding estimation methods are discussed further when we introduce the human motion tracking method later. We divide the joint constraint between the torso and the limbs into hard-joint, free-joint, and soft-joint constraints. We assume that the joint between the upper limb and the forelimb always has a hard-joint constraint.
• Hard-Joint Constraint
A fixed joint links the torso and each limb tightly together. Once the orientation and position of the torso are determined, the positions of the fixed joints are determined as well. Each limb takes the joint connecting it to the torso as its origin and has 3 DOFs for rotation. In addition, the joint angle between the upper limb and the forelimb has 1 DOF. Sometimes it is not easy to determine the rotation of the limb about its own axis, which is one of the 3 rotational DOFs. The limb can instead be expressed with 2 DOFs each for the upper limb and the forelimb in a polar coordinate system. Each limb therefore has 4 DOFs in total, counted as 3+1 or 2+2. The motion parameters of the human model then comprise 22 DOFs in total for the torso and limbs.
• Free-Joint Constraint
There is no connectivity between the torso and the limbs. Going from the hard-joint constraint to the free-joint constraint, we can regard each limb as having its 4 original DOFs plus 3 DOFs for translation, or equivalently 6 DOFs (3 for rotation and 3 for translation) plus 1 DOF for the joint angle between the upper limb and the forelimb. Each limb therefore has 7 DOFs in total, counted as 4+3 or 6+1. The motion parameters of the human model then comprise 34 DOFs in total for the torso and limbs.
• Soft-Joint Constraint
The hard-joint constraint makes human motion capture prone to wrong estimates because of the difference in shape between the 3D human model and the true human body. The free-joint constraint seems to abandon the original idea of constraining the human posture through a 3D human model with basic human kinematic restrictions. In contrast to hard joints, limbs with soft joints are allowed to move freely within a small range. Like the free-joint constraint, the soft-joint constraint gives each limb 7 DOFs, but the degree of separation between the joint of the limb and the neighboring joint of the torso is taken into account. The two joints are expected to be as close to each other as possible but are allowed to separate. That is, we want to find an optimal solution that satisfies both the error function on the separation and the similarity function on the observations at the same time. Although the soft-joint constraint has more DOFs overall than the hard-joint constraint, it is in fact more efficient and effective when combined with the hierarchical idea and ICP. We discuss the advantages of soft-joint constrained ICP further when presenting the human motion tracking method.
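As a quick check, the DOF totals quoted for the three constraints follow from adding the 6 torso DOFs to the per-limb DOFs of the four limbs. The soft-joint total is not stated explicitly in the text; it is inferred here from the statement that each soft-joint limb has 7 DOFs, as in the free-joint case.

```latex
\begin{align*}
\text{hard-joint:} \quad & 6 + 4 \times (3 + 1) = 22 \\
\text{free-joint and soft-joint:} \quad & 6 + 4 \times (3 + 3 + 1) = 34
\end{align*}
```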
3.2 3D Volume Reconstruction
We track human motion using images captured from multiple cameras. The observations obtained from the different cameras are dependent on one another, so an effective way to integrate the information from multiple views is to build a 3D volume. The volumetric information computed from multiple views and matched against a generic 3D human model can be regarded as the basic measurement for human motion estimation. The 3D human volume is usually reconstructed from silhouette images obtained by removing the background from the images captured from each view. This is Shape-From-Silhouette, also called visual hull construction, a popular method of 3D shape estimation from silhouette images. There are two ways to construct the visual hull of an object: surface-based [30] and volume-based methods. Since our aim is to reconstruct a 3D human volume, the former is not suitable. We give a general introduction to volume-based visual hull construction and implement a simple and fast method to solve this problem.
3.2.1 Introduction to Volume-Based Visual Hull Construction
Visual hull construction is also called Shape-From-Silhouette, a name from which its concept can be understood more clearly. In volume-based visual hull construction, the visual hull is the maximal volume consistent with the silhouettes of the object.
Silhouette images of the object are usually binary images with 0 for the background and 1 for the object itself. The silhouette of an object in an image, produced by projecting the object onto one camera, provides some information about the 3D shape of the object. We can define the viewing cone of the camera by back-projecting the silhouette using the camera parameters, and we know that the 3D object lies inside this cone. With silhouette images of the same object from multiple views, we can intersect the generalized cones generated by the silhouettes in each image to bound a maximal volume that is guaranteed to contain the object. This maximal volume is known as the visual hull of the object. For human motion tracking, the object is simply the human body, and the maximal volume is the 3D human volume that we want to build. The more cameras are used, the finer the 3D human volume becomes and the closer it is to the actual human body, because the maximal volume is bounded more tightly.
For volume-based visual hull construction, the object space is split into a 3D grid in order to describe the object volume in space. Just as pixels are the basic units of a 2D image, the cells of the 3D grid are known as voxels. There are two main ways to determine the voxels that the object occupies. In the first, a voxel is part of the object when its projection onto every image falls inside the silhouette of that image; the volume of the object is obtained once all voxels have been projected and tested. In the second, voxels are part of the object when they are hit from all views by the image rays generated by back-projecting the silhouette pixels of each image using the camera parameters. The precision of the volume depends on the number of voxels sampled, but reconstruction becomes slower as the number of voxels increases. Because the projection of a voxel onto an image is a convex region rather than a single point, it often overlaps the silhouette only partially. Cheung et al. [8] propose an algorithm called Sparse Pixel Occupancy Test (SPOT), which controls the computation time and the precision of the reconstruction through the number of silhouette images that overlap the voxel and the number of pixels lying inside the voxel's projection. Hasenfratz et al. [17] combine several PCs and use hardware acceleration to speed up visual hull construction; they obtain a volumetric model of moving actors and build an interesting human-computer interaction system. Michoud et al. [31] consider the case in which a moving object leaves the field of view of some cameras, so that its visual hull cannot be reconstructed completely, since the visual hull is the maximal volume consistent with the silhouettes in all views. They propose a method that filters out the cameras providing unreliable information. We provide a simple and fast implementation of 3D human volume reconstruction that is sufficient for our purpose.

Figure 3.2: Voxel-Based Visual Hull Construction
3.2.2 Implementation to Voxel-based Approach
Capturing silhouette images requires foreground detection, which marks whether each pixel in an image belongs to the foreground. There is a large body of work on foreground detection, such as Mixture of Gaussians (MOG) [49], codebook [26], and Background Cut [50]. This is another important issue in computer vision, but it is not the focus of our research. The multiple-video data used here are from INRIA Rhône-Alpes (https://charibdis.inrialpes.fr), which provides silhouette images captured from five cameras.
3D volume reconstruction for human bodies differs from the general case for objects. Compared with human bodies, such objects are usually small and relatively close to the cameras, and the aim is to reconstruct a realistic volume of the object, even with fine surface detail. For human motion tracking, it is not easy to produce such a fine volume, because the cameras must cover a much wider field of view, which makes the silhouettes coarse. In addition, our purpose in reconstructing the 3D human volume is to obtain a usable measurement for estimating the possible motion. The 3D human volume therefore only needs to be good enough to distinguish the positions of the body parts. We propose a simple and fast implementation to reconstruct the 3D human volume. Figure 3.2 illustrates the voxel-based 3D volume reconstruction and an example of reconstructed voxels.
We capture images of the human using $n$ cameras. Let $SE_i$ be the silhouette in the image $Img_i$ captured by the camera $Cam_i$, which has the projection matrix $PM_i$. We reconstruct the 3D human volume with the following steps.

Step 1. We define a visual space $S$ split into $m$ voxels $\{V_j,\ j = 1, ..., m\}$. The point $v_j$ is the center of the voxel $V_j$, and we regard $v_j$ as the position of $V_j$ in the space $S$.

Step 2. Let $p_{ij}$ be the pixel obtained by projecting the point $v_j$ onto the image $Img_i$ with the projection matrix $PM_i$ of the camera $Cam_i$:

$$p_{ij} = PM_i \cdot v_j, \quad \text{where } i = 1, ..., n,\ j = 1, ..., m \tag{3.1}$$

Step 3. We check whether the voxel $V_j$ is part of the human body.

    for j = 1 to m do
        Let i = 1
        while i ≤ n and p_ij ∈ SE_i do
            check the next silhouette SE_{i+1} of the image Img_{i+1}
            Let i = i + 1
        end while
        if p_ij ∈ SE_i for all i (that is, i = n + 1) then
            Label the voxel V_j to be part of the 3D volume
        end if
    end for

Step 4. The 3D volume $V_{all}$ is made up of all labeled voxels.

This completes the 3D volume reconstruction. The implementation is simple and fast. When new images arrive at the next frame, we can reconstruct the volume by repeating only Step 3, because Step 2 is computed once at the beginning from the known voxel positions and the projection matrices of the cameras. For the 3D human volume $V_{all}$, we determine the voxels on the surface by checking whether the neighbors of a labeled voxel are not all labeled as well. Finally, we obtain the entire volume, $V_{all}$, and the set of surface voxels, $V_{surface}$.
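Steps 1 to 4 and the surface test can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work. It assumes that silhouettes are lists of rows of 0/1 values, that projection matrices are 3×4 nested lists, that voxels are indexed by integer grid coordinates, and that "neighboring voxels" means the 6-connected neighborhood. All function names are hypothetical.

```python
import math
from itertools import product

def project(pm, point):
    """Equation (3.1): project a 3D point with a 3x4 matrix; return pixel (u, v) or None."""
    x, y, z = point
    hu, hv, hw = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in pm)
    if hw <= 0:  # behind the camera or at infinity
        return None
    return (math.floor(hu / hw), math.floor(hv / hw))

def in_silhouette(sil, pixel):
    """True when the pixel is inside the image and the binary silhouette is 1 there."""
    if pixel is None:
        return False
    u, v = pixel
    return 0 <= v < len(sil) and 0 <= u < len(sil[0]) and sil[v][u] == 1

def build_pixel_table(grid, voxel_size, origin, cameras):
    """Steps 1 and 2: voxel centres and their projections, computed once."""
    table = {}
    for idx in product(range(grid[0]), range(grid[1]), range(grid[2])):
        centre = tuple(o + (k + 0.5) * voxel_size for o, k in zip(origin, idx))
        table[idx] = [project(pm, centre) for pm in cameras]
    return table

def carve(table, silhouettes):
    """Step 3: label voxels whose projection lies inside every silhouette.
    all() stops at the first failing view, like the while loop in the text."""
    return {idx for idx, pixels in table.items()
            if all(in_silhouette(s, p) for s, p in zip(silhouettes, pixels))}

def surface_voxels(v_all):
    """Labeled voxels with at least one unlabeled 6-neighbour (assumed neighbourhood)."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return {v for v in v_all
            if any((v[0] + dx, v[1] + dy, v[2] + dz) not in v_all
                   for dx, dy, dz in steps)}
```

Because the pixel table depends only on the voxel grid and the cameras, only `carve` and `surface_voxels` would need to run again for each new frame, which mirrors the remark that Step 3 alone is repeated.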
3.3 Particle Filter Tracking
After designing the 3D human model and reconstructing the 3D human volume, we turn to the algorithm for human motion tracking. As mentioned earlier, the motion parameters are what we estimate to recover the human posture with the 3D human model. The hard-joint constrained 3D human model, which is used for human motion tracking in most studies, has as many as 22 DOFs in our work. We choose the well-known particle filtering method [23] to track human motions, for two main reasons. First, the problem is high-dimensional, and the mapping from the parameter space to the feature space is nonlinear and multi-modal, so a linear estimation method such as Kalman filtering is not suitable. Second, we cannot expect perfect observations, so it is difficult to estimate the truly optimal parameters; particle filtering maintains multiple hypotheses about the posterior of the state, which makes it possible to recover from tracking errors. We now show how to track human motions using a particle filter, and then discuss the limitations of particle filtering and how to improve it.
3.3.1 General Particle Filtering
For model-based 3D human tracking, we refer to the estimation of the motion parameters using only the basic particle filter, known as the Condensation algorithm [23], as general particle filtering. A human model with the hard-joint constraint has fewer DOFs than one with the free-joint constraint, so general particle filtering is usually applied to human models with the hard-joint constraint. We take our human model as an example to describe how the method operates.
When the torso and limbs of our model are connected by the hard-joint constraint, the degrees of freedom at time $t$ are $d_t^i$, where $i = 1, 2, ..., 22$. The state, or configuration vector, at time $t$ is $x_t = \{d_t^1, d_t^2, ..., d_t^{22}\}$, and the history of states up to time $t$ is represented by $X_t = \{x_1, x_2, ..., x_t\}$. The observation at time $t$ is $z_t$, and the history of observations up to time $t$ is represented by $Z_t = \{z_1, z_2, ..., z_t\}$. Having defined these quantities, we want to derive the posterior distribution used to estimate the possible solution. The classical particle filter uses a dynamic Bayesian network structure in which only a first-order Markov chain is considered, so each state is influenced only by the state at the previous time step. Together with the other conditional independencies inherent in the Bayesian network, this gives the following properties:
(1) The state at time $t$, $x_t$, is conditionally independent of the previous states $X_{t-2}$, given $x_{t-1}$.
(2) The observation at time $t$, $z_t$, is conditionally independent of $Z_{t-1}$ and $X_{t-1}$, given the state $x_t$.
(3) The state at time $t$, $x_t$, is conditionally independent of $Z_{t-1}$, given the previous states $X_{t-1}$.
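From (1) to (3), we resolve the posterior density as

\begin{align}
p(x_t|Z_t) &= \int_{x_1 \ldots x_{t-1}} p(X_t|Z_t) = \int_{x_1 \ldots x_{t-1}} \frac{p(X_t, Z_t)}{p(Z_t)} \tag{3.2a} \\
&\propto \int_{x_1 \ldots x_{t-1}} p(X_t, Z_t) \tag{3.2b} \\
&= \int_{x_1 \ldots x_{t-1}} p(z_t|x_t) \cdot p(x_t|X_{t-1}, Z_{t-1}) \cdot p(X_{t-1}, Z_{t-1}) \tag{3.2c} \\
&= \int_{x_1 \ldots x_{t-1}} p(z_t|x_t) \cdot p(x_t|X_{t-1}) \cdot p(X_{t-1}, Z_{t-1}) \tag{3.2d} \\
&= p(z_t|x_t) \cdot \int_{x_{t-1}} p(x_t|x_{t-1}) \cdot p(x_{t-1}, Z_{t-1}) \tag{3.2e}
\end{align}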
When $N$ particles are sampled, the posterior probability distribution $p(x_t|Z_t)$ is represented by a set of weighted particles $\{(s_t^1, \pi_t^1), (s_t^2, \pi_t^2), ..., (s_t^N, \pi_t^N)\}$, where the weights $\pi_t^i$ satisfy $\sum_{i=1}^{N} \pi_t^i = 1$ and $\pi_t^i \propto \pi_{t-1}^i \cdot p(z_t|x = s_t^i)$. We can then estimate the state $x_t$ of the current human motion from the set of weighted particles and go on to measure the next motion with the observation at the next time step. The particle filtering framework can be divided into the following steps: sampling, weighting, and state estimation.
We want to construct a new set of weighted particles $\{(s_t^1, \pi_t^1), (s_t^2, \pi_t^2), ..., (s_t^N, \pi_t^N)\}$ at time $t$ from the old set $\{(s_{t-1}^1, \pi_{t-1}^1), (s_{t-1}^2, \pi_{t-1}^2), ..., (s_{t-1}^N, \pi_{t-1}^N)\}$ at time $t-1$ and estimate the state $x_t$ with the observation $z_t$.
Step 1. Particles Sampling
In equation (3.2e), the discrete-time propagation of the state density is given by $\int_{x_{t-1}} p(x_t|x_{t-1}) \cdot p(x_{t-1}, Z_{t-1})$. Here $p(x_{t-1}, Z_{t-1})$ is the recursive posterior distribution from the previous time step, and $p(x_t|x_{t-1})$ is the stochastic dynamics. We sample the new set of particles from these two densities.

From $\{(s_{t-1}^1, \pi_{t-1}^1), ..., (s_{t-1}^N, \pi_{t-1}^N)\}$, we first construct the cumulative probabilities $\{c_i, \text{ for } i = 1, 2, ..., N\}$:

$$c_0 = 0, \qquad c_i = c_{i-1} + \pi_{t-1}^i \quad \text{for } i = 1, 2, ..., N.$$

From $p(x_{t-1}, Z_{t-1})$, we select a sample $s_t'^{(i)}$ as follows:

(1) select a uniform random number $r \in [0, 1]$
(2) find the smallest $j$ which satisfies the condition $c_j \ge r$
(3) set $s_t'^{(i)} = s_{t-1}^j$

Then we sample $s_t^i$ from $p(x_t|x_{t-1} = s_t'^{(i)})$, which can be generated as

$$s_t^i = s_t'^{(i)} + B \tag{3.4}$$

where $B$ is a multivariate Gaussian random variable with mean 0 and covariance $P$.

Now, we obtain the new particles $\{s_t^i, \text{ for } i = 1, 2, ..., N\}$ at time $t$.
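The sampling step can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work. It assumes that the covariance $P$ is diagonal, so each dimension is perturbed independently with its own standard deviation, and it draws $r$ from $[0, c_N]$ rather than $[0, 1]$ so that weights that do not sum exactly to 1 are tolerated. The function names are hypothetical.

```python
import bisect
import random

def cumulative(weights):
    """c_0 = 0, c_i = c_{i-1} + pi_{t-1}^i."""
    c = [0.0]
    for w in weights:
        c.append(c[-1] + w)
    return c

def select_index(c, r):
    """Smallest j >= 1 with c_j >= r, returned as a 0-based particle index."""
    j = bisect.bisect_left(c, r, lo=1)
    return min(j, len(c) - 1) - 1  # guard against floating-point overshoot

def resample_and_diffuse(particles, weights, sigmas, rng):
    """Draw N base particles by cumulative weight, then add Gaussian noise (eq. 3.4).
    Assumption: diagonal covariance, one standard deviation per dimension."""
    c = cumulative(weights)
    new_particles = []
    for _ in particles:
        r = rng.uniform(0.0, c[-1])
        base = particles[select_index(c, r)]
        new_particles.append([x + rng.gauss(0.0, s) for x, s in zip(base, sigmas)])
    return new_particles
```

A call such as `resample_and_diffuse(particles, weights, sigmas, random.Random(0))` returns `N` new particles; particles with larger previous weights are selected as the base more often.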
Step 2. Measurement and Particles Weighting
Because the previous weights have already been taken into account when sampling the new particles through the cumulative probabilities $\{c_i, \text{ for } i = 1, 2, ..., N\}$, the weights $\pi_t^i \propto \pi_{t-1}^i \cdot p(z_t|x = s_t^i)$ can be written as

$$\pi_t^i = k \cdot p(z_t|x = s_t^i), \tag{3.5}$$

where $k$ is a normalization constant chosen so that $\sum_{i=1}^{N} \pi_t^i = 1$.

The term $p(z_t|x = s_t^i)$, called the likelihood, is measured in our work using the 3D human volume and the 3D human model posed according to $s_t^i$. The entire 3D human volume is $V_{all}$, and the volume generated by the human model is $M_{all}$. We define the likelihood by counting the voxels in the overlap between $V_{all}$ and $M_{all}$. The set of overlapping voxels $V_{overlap}$ is

$$V_{overlap} = \{V_j \mid v_j \in M_{all},\ V_j \in V_{all},\ \text{for } j = 1 ... m\} \tag{3.6}$$

where $v_j$ is the central position of the voxel $V_j$ defined in Section 3.2.2. The likelihood is defined as

$$p(z_t|x = s_t^i) \propto \exp\left(\#(V_{overlap}) / 2\delta^2\right), \tag{3.7}$$

where $\#(\cdot)$ denotes the number of elements of a set and $\delta$ is a variance constant.

Now, we obtain the new set of weighted particles $\{(s_t^i, \pi_t^i), \text{ for } i = 1, 2, ..., N\}$ at time $t$. Finally, we want to estimate the optimal state of the human motion.
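The weighting step of equations (3.5) to (3.7) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the posed model is represented by a hypothetical callable `model_contains(v)` that reports whether a voxel center lies inside $M_{all}$, and the largest overlap count is subtracted before exponentiation to avoid overflow. That shift is an addition of this sketch; it is absorbed by the normalization constant $k$ and does not change the weights.

```python
import math

def overlap_count(model_contains, v_all_centres):
    """#(V_overlap) of eq. (3.6): labeled voxels whose centre lies inside the model."""
    return sum(1 for v in v_all_centres if model_contains(v))

def normalised_weights(counts, delta):
    """Eqs. (3.5) and (3.7): pi^i = k * exp(count_i / (2 * delta^2)), summing to 1."""
    scale = 2.0 * delta * delta
    top = max(counts)
    # Shifting by the maximum keeps exp() in range; the constant factor cancels in k.
    raw = [math.exp((c - top) / scale) for c in counts]
    total = sum(raw)
    return [r / total for r in raw]
```

For each particle, `overlap_count` would be evaluated with the model posed by that particle, and the resulting counts passed to `normalised_weights`.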
Step 3. State Estimating
The state $x_t$ at each time step $t$ can be estimated by

$$x_t = \sum_{i=1}^{N} \pi_t^i \cdot s_t^i \tag{3.8}$$

or

$$x_t = s_t^{(*)}, \quad \text{when } \pi_t^{(*)} = \max_i (\pi_t^i) \tag{3.9}$$

We choose the latter form because it is better suited to 3D human motion tracking with high DOFs, where the particles are too few to represent the posterior density in the vast configuration space.
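The two estimators of equations (3.8) and (3.9) can be compared with a short Python sketch. The function names and the toy one-dimensional particle set in the usage note are hypothetical and serve only as an illustration.

```python
def weighted_mean(particles, weights):
    """Eq. (3.8): weighted average of the particles, dimension by dimension."""
    dims = len(particles[0])
    return [sum(w * p[d] for p, w in zip(particles, weights)) for d in range(dims)]

def best_particle(particles, weights):
    """Eq. (3.9): the particle with the largest weight."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return particles[best]
```

With particles `[[-1.0], [1.0], [1.2]]` and weights `[0.5, 0.3, 0.2]`, the weighted mean lies near 0, a configuration that none of the particles proposes, whereas `best_particle` returns `[-1.0]`. This is one way to see why the maximum-weight form can be preferable when a sparse particle set covers a multi-modal posterior.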
3.3.2 Hierarchical Particle Filtering
As discussed in Chapter 2, when a particle filter is applied to a problem with high DOFs, the search is easily misdirected by local maxima. The number of particles needed to maintain the correct rate makes the computational cost increase exponentially. MacCormick and Isard [29] define the survival diagnostic $D$ and the survival rate $\alpha$, which indicate whether the tracking performance is reliable, and use them to infer the number of particles required:

$$N \ge \frac{D_{min}}{\alpha^d}, \tag{3.10}$$

where $D_{min}$ is the minimum acceptable survival diagnostic for successful tracking, $N$ is the number of particles needed to maintain the tracking performance, and $d$ is the number of dimensions. When $\alpha \ll 1$ and $D_{min}$ and $\alpha$ are constant, $N$ increases exponentially with $d$.
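The growth implied by equation (3.10) can be made concrete with a small Python sketch. The values of $D_{min}$ and $\alpha$ used in the usage note are purely hypothetical and are not taken from [29] or from our experiments.

```python
import math

def particles_needed(d_min, alpha, d):
    """Smallest integer N satisfying N >= d_min / alpha**d (eq. 3.10)."""
    return math.ceil(d_min / alpha ** d)
```

With the illustrative values `d_min = 10` and `alpha = 0.5`, a 6-DOF search needs `particles_needed(10, 0.5, 6)` = 640 particles, while a 22-DOF search needs `particles_needed(10, 0.5, 22)` = 41943040, which shows how quickly the requirement grows with the number of dimensions.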
For this reason, some studies propose the concept of search space decomposition, in which particle filtering searches a hierarchy of subspaces instead of the global search space. Hierarchical particle filtering then replaces general particle filtering. In human motion tracking, hierarchical particle filtering usually predicts the position of the torso first. The four limbs are then estimated independently, which decomposes the search space effectively, so the exponential cost is reduced to a linear one. In our work, the state $x_t$ can be divided into $x_t^i$, where $i = 1, 2, ..., 9$, representing the substates of the body parts: torso, left upper arm, left forearm, right upper arm, right forearm, left thigh, left leg, right thigh, and right leg. We can simply regard human motion tracking using hierarchical particle filtering as tracking the body parts with several general particle filters.
This approach encounters the same difficulty as body part detection. For human motion capture from a still image or a single view, and even with multiview 2D images as in [5] and Deutscher et al. [12], body parts often occlude other parts in a single view, and in the general case there are no strong measurements that can distinguish the body parts. Because we reconstruct the 3D human volume, which directly integrates the information from multiple views, the depth problem is alleviated, and the measurement we use in equation (3.6) can plausibly distinguish the body parts without other special features. We combine the upper limb and the forelimb into a single limb, so that the state $x_t$ consists of $x_t^{torso}$, $x_t^{leftarm}$, $x_t^{rightarm}$, $x_t^{leftfoot}$ and $x_t^{rightfoot}$. The measurement with the 3D human volume then becomes more suitable for hierarchical particle filtering. The advantage can be seen from the following procedure using particle filtering.
(1) Use $V_{all}$ to estimate the state $x_t^{torso}$, similarly to equation (3.6):

$$V_{torso,head} = \{V_j \mid v_j \in M_{torso} \text{ or } M_{head},\ V_j \in V_{all},\ \text{for } j = 1 ... m\} \tag{3.11}$$

(2) Set $V_{act} = V_{all} - V_{torso,head}$, removing the voxels considered as torso.

(3) Use $V_{act}$ to estimate the state $x_t^{leftarm}$:

$$V_{leftarm} = \{V_j \mid v_j \in M_{leftarm},\ V_j \in V_{act},\ \text{for } j = 1 ... m\} \tag{3.12}$$

(4) Set $V_{act} = V_{act} - V_{leftarm}$, removing the voxels considered as left arm.

(5) Use $V_{act}$ to estimate the remaining states by repeating (3) and (4).
We can clearly see that the body parts are estimated hierarchically. The cost with respect to the DOFs is reduced to linear, and the gradually shrinking $V_{act}$ speeds up the computation of the measurement. In (1), we use the head model, which has no degrees of freedom of its own, to help determine the orientation and position of the torso.
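The voxel bookkeeping of the procedure above can be sketched in Python as follows. This is a simplified illustration: it assumes that each body part has already been posed by its own particle filter and is represented by a hypothetical callable that tests whether a voxel center lies inside the part, so only the successive reduction of $V_{act}$ is shown, not the estimation itself.

```python
def hierarchical_assign(v_all, part_models, order):
    """Assign voxels part by part, removing each part's voxels from V_act.

    v_all       : set of voxel centres (tuples) forming the reconstructed volume
    part_models : dict mapping a part name to a callable contains(v) -> bool
                  (assumed to be the already-estimated, posed part model)
    order       : part names in estimation order, torso first
    """
    v_act = set(v_all)
    assigned = {}
    for name in order:
        contains = part_models[name]
        assigned[name] = {v for v in v_act if contains(v)}
        v_act -= assigned[name]  # later parts are measured on fewer voxels
    return assigned, v_act
```

Because each part is measured only against the voxels that earlier parts have not claimed, a voxel near a joint is attributed to the part estimated first, and the set examined by later parts keeps shrinking.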
When the human model has the hard-joint constraint, one major disadvantage of hierarchical tracking methods is that inaccurate torso states may sharply degrade limb motion estimation. Moreover, the torso motion is difficult to estimate because of body shape variations and silhouette and voxel noise. To reduce the interference from torso motion errors, we propose a soft-joint constrained ICP method for limb tracking. In contrast to hard joints, limbs with soft joints are allowed to move freely within a small range, so it is still possible to track limb motions even with inaccurate torso motions.
In order to improve torso motion tracking, we also propose a method in which the limb states estimated at the previous time step are used to provide reliable hypotheses for the current torso state, since there is a strong correlation between the torso and limb states. Our method is presented in Chapter 4.