1.1 Motivation
In recent years, 3D characters and avatars have become increasingly common in our daily life. To animate these virtual characters, experienced artists usually have to adjust their key postures.
On the other hand, motion capture (abbreviated as mocap) is a feasible method for producing body animations. The most popular variant, optical mocap, is based on the theory of computer vision. First, several feature markers are placed on a subject’s body. The subject usually wears a black leotard to make the feature markers more distinct in the camera images. The 3D trajectories of the feature markers can then be estimated by triangulation. Mocap data contain the structure of the skeleton (usually represented as a hierarchy) and the values of each joint’s degrees of freedom over the entire animation sequence. To represent a variety of human motions, capturing and storing a considerable amount of motion data is usually unavoidable. However, due to hardware limitations (storage capacity and computational capability), only a small set of mocap data can be loaded or processed at the same time. An immediate way to overcome this problem is to represent the original motion data in a compressed form. Video and audio compression have been developed for decades, but the characteristics of human motion data are quite different from those of video or audio data. How to compress human motion data effectively is still an interesting and challenging research topic. Therefore, we want to find a suitable and practical approach that lets us store the motion data in a compact form while preserving data quality.
Principal component analysis (PCA) is widely used in existing motion compression methods. By projecting the original data set into a lower-dimensional space, we obtain a smaller data size but lose some low-variance features. This technique exploits spatial and temporal coherence implicitly and is easy to implement. However, the entire motion has to be loaded into memory in the compression phase. Besides, which features PCA discards cannot be controlled. Perceptible jerks may appear in some conspicuous motions, so users need to apply additional smoothing or reduce the compression ratio.
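The PCA projection discussed above can be sketched as follows. This is a minimal illustration on synthetic data, not the implementation of any particular published method: the function names `pca_compress` and `pca_decompress`, the synthetic `motion` matrix, and the choice of `k` are all assumptions made for the example.

```python
import numpy as np

def pca_compress(motion, k):
    """Project a (frames x channels) motion matrix onto its top-k principal axes."""
    mean = motion.mean(axis=0)
    centered = motion - mean
    # Rows of vt are the principal axes, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                   # (k x channels)
    coeffs = centered @ basis.T      # (frames x k)
    return mean, basis, coeffs

def pca_decompress(mean, basis, coeffs):
    """Map the k coefficients per frame back to the original channels."""
    return coeffs @ basis + mean

# Synthetic stand-in for mocap data (an assumption, not real capture data):
# 200 frames, 30 channels driven by 3 smooth latent signals plus small noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 200)
latent = np.stack([np.sin(t), np.cos(2.0 * t), t / t.max()], axis=1)
motion = latent @ rng.normal(size=(3, 30)) + 0.01 * rng.normal(size=(200, 30))

mean, basis, coeffs = pca_compress(motion, k=3)
restored = pca_decompress(mean, basis, coeffs)
rms_error = np.sqrt(np.mean((motion - restored) ** 2))
stored = mean.size + basis.size + coeffs.size   # numbers kept after compression
```

Note that the whole `motion` matrix is passed to the decomposition at once, which reflects the memory requirement mentioned above, and that the discarded directions are chosen purely by variance rather than by visual importance.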
A traditional technique in image compression uses perceptual models. Earlier work estimates the importance of the data so that less important or less perceptible features can be omitted.
Although various models have been proposed for video and audio compression, much work remains to be done in the area of body animation.
One notable characteristic of motion data is considerable coherence, more specifically both spatial and temporal coherence. Because human skeletons are articulated, there are spatial relationships between neighboring joints. Likewise, the gradual contraction and stretching of muscles lets us predict how the skeleton will move in the next few frames, which gives temporal coherence.
Therefore, we can approximate the body animation in the spatial domain, in the temporal domain, or in both.
The basic concept of our compression algorithm is an extension of key-frame animation.
In key-frame animation, users select several important key frames and interpolate the in-between animation. In this thesis, we generalize this concept. Instead of choosing key frames from animations, we select key features. Each key feature represents the position of a marker captured by the mocap devices at a particular time tick. If we assume that the structure (i.e. connectivity) of the markers is consistent, we can access any marker position by its spatial and temporal index. Furthermore, we interpolate the data not only in the temporal domain but also in the spatial domain. Once we have chosen several key features from the original data, we establish an appropriate interpolation method. Given the temporal and spatial indices, the other non-key features can be approximated by interpolation. There are many interpolation techniques to choose from. Cubic Bezier curves and B-splines are candidates, but such functions approximate animations in the temporal domain only, and applying Bezier surfaces requires the data to be sampled regularly. In this thesis, we use radial basis functions to approximate the motion sequence. Each key feature, together with its time position, can be thought of as a sample or center in the hyperspace of space and time. We select several key features and feed them into a radial basis function network to establish our approximation functions. In the decompression phase, the other non-key features can be reconstructed efficiently from their spatial and temporal indices.
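The idea of reconstructing non-key features from a space-time index can be sketched as below. This is only an illustration under several assumptions that the text does not fix: a Gaussian kernel, a kernel `width` of one key spacing, indices scaled so that neighboring key features are one unit apart, a small `ridge` term for numerical stability, and a synthetic 7-marker, 41-frame motion. All function and variable names are hypothetical.

```python
import numpy as np

def gaussian_rbf(a, b, width):
    """Pairwise Gaussian kernel values between index sets a (n x 2) and b (m x 2)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-(diff ** 2).sum(axis=2) / (2.0 * width ** 2))

def fit_rbf(centers, values, width, ridge=1e-8):
    """Solve for weights so that the RBF sum matches `values` at the centers."""
    phi = gaussian_rbf(centers, centers, width) + ridge * np.eye(len(centers))
    return np.linalg.solve(phi, values)

def eval_rbf(queries, centers, weights, width):
    """Evaluate the fitted function at arbitrary (space, time) indices."""
    return gaussian_rbf(queries, centers, width) @ weights

# Synthetic motion (assumption): 7 markers, 41 frames, one 3D position each.
n_markers, n_frames = 7, 41
s, t = np.meshgrid(np.arange(n_markers), np.arange(n_frames), indexing="ij")
# Scale the indices so that key features (every 2nd marker, every 5th frame)
# are one unit apart in both the spatial and the temporal direction.
index = np.stack([s.ravel() / 2.0, t.ravel() / 5.0], axis=1)
u, v = index[:, 0], index[:, 1]
positions = np.stack([u * np.cos(0.3 * v), u * np.sin(0.3 * v), 0.2 * v], axis=1)

# Compression: keep only the key features and fit the weights.
is_key = (s.ravel() % 2 == 0) & (t.ravel() % 5 == 0)
centers = index[is_key]
weights = fit_rbf(centers, positions[is_key], width=1.0)

# Decompression: every feature is recovered from its index alone.
restored = eval_rbf(index, centers, weights, width=1.0)
key_error = np.abs(restored[is_key] - positions[is_key]).max()
rms_error = np.sqrt(np.mean((positions - restored) ** 2))
```

In this sketch the key features happen to lie on a regular grid only to keep the example short. Unlike a Bezier surface, the fit itself places no such requirement on where the centers are.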
Our goal is to compress motion data while maintaining visual quality, and we believe that well-selected centers have a large effect on the decompressed results. Therefore, how to choose the centers of the radial basis functions and how many centers are sufficient are also important issues. In this thesis, we use a greedy algorithm to answer these two questions. This is an iterative procedure. In each iteration, a small set of centers is selected to train the radial basis functions. Our system evaluates the difference between the original data and the fitted function. Features with large residuals are chosen as additional centers. These steps are repeated until the stopping criteria are satisfied.
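A minimal sketch of such a greedy loop is given below. It is not the exact procedure of this thesis: the seed centers (first and last feature), the batch size, the Gaussian kernel, the `ridge` term, the stopping rule (a residual tolerance `tol` or a budget `max_centers`), and the synthetic data are all assumptions, and every name is hypothetical.

```python
import numpy as np

def gaussian_rbf(a, b, width):
    """Pairwise Gaussian kernel values between two sets of (space, time) indices."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-(diff ** 2).sum(axis=2) / (2.0 * width ** 2))

def fit_and_residual(index, values, chosen, width, ridge=1e-6):
    """Fit an RBF on the chosen centers and return weights and per-feature error."""
    centers = index[chosen]
    phi = gaussian_rbf(centers, centers, width) + ridge * np.eye(len(chosen))
    weights = np.linalg.solve(phi, values[chosen])
    fitted = gaussian_rbf(index, centers, width) @ weights
    return weights, np.linalg.norm(values - fitted, axis=1)

def greedy_centers(index, values, width, tol, max_centers, batch=4):
    """Add the worst-fitted features as centers until a stopping rule is met."""
    chosen = [0, len(index) - 1]           # assumed seed: first and last feature
    while True:
        weights, residual = fit_and_residual(index, values, chosen, width)
        if residual.max() <= tol or len(chosen) >= max_centers:
            return np.array(chosen), weights, float(residual.max())
        residual[chosen] = -1.0            # never re-select an existing center
        take = min(batch, max_centers - len(chosen))
        worst = np.argsort(residual)[::-1][:take]
        chosen.extend(int(i) for i in worst)

# Synthetic motion (assumption): 7 markers, 41 frames, one 3D position each.
s, t = np.meshgrid(np.arange(7), np.arange(41), indexing="ij")
index = np.stack([s.ravel() / 2.0, t.ravel() / 5.0], axis=1)
u, v = index[:, 0], index[:, 1]
positions = np.stack([u * np.cos(0.3 * v), u * np.sin(0.3 * v), 0.2 * v], axis=1)

tol, max_centers = 0.05, 60
centers, weights, max_residual = greedy_centers(
    index, positions, width=1.0, tol=tol, max_centers=max_centers)
```

The loop returns either when the largest residual falls below `tol` or when the center budget is exhausted, so the number of centers is decided by the data rather than fixed in advance.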
Although many studies show that radial basis functions have an upper bound on the number of centers, human motion data fortunately do not contain too many key features in the common case. People usually place 20~50 markers on the subject and each clip is rarely
1.3 Angular or Euclidean system?
Most motion data are represented in terms of quaternions. The commonly used “BVH” file format is composed of a hierarchical structure in the angular domain (see Appendix: “BVH file format”). Such a hierarchical structure is very sensitive to small reconstruction errors, because an error close to the root propagates and accumulates toward the joints farther down the hierarchy. In other words, representing postures in the Euclidean domain can tolerate more reconstruction error.
Therefore, we reinterpret motion data as the global position of each joint, frame by frame, and perform the approximation in a Euclidean coordinate system. After this conversion, we can access the position of each joint at a specific time stamp by its joint index and frame index.
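The error propagation in a hierarchical representation can be illustrated with a small sketch. It assumes a planar chain of unit-length bones with parent-relative angles, which is a deliberate simplification of a 3D BVH skeleton, and the function name and the angle values are hypothetical.

```python
import math

def forward_kinematics(angles, bone_length=1.0):
    """Global 2D joint positions of a planar chain from parent-relative angles."""
    x = y = heading = 0.0
    positions = [(x, y)]                  # the root sits at the origin
    for angle in angles:
        heading += angle                  # each joint inherits its parent's rotation
        x += bone_length * math.cos(heading)
        y += bone_length * math.sin(heading)
        positions.append((x, y))
    return positions

angles = [0.3, -0.2, 0.1, 0.25]           # one frame of an assumed 4-bone chain
exact = forward_kinematics(angles)

# Introduce a reconstruction error of 0.01 rad at the root joint only.
noisy_angles = [angles[0] + 0.01] + angles[1:]
noisy = forward_kinematics(noisy_angles)

# Positional drift of every joint caused by that single angular error.
drift = [math.dist(p, q) for p, q in zip(exact, noisy)]
```

In this chain a single angular error at the root displaces every descendant, and the displacement grows with the distance from the root. An error of similar size applied directly to one global position would affect that joint alone.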
1.5 Normalized Motion
The initial positions and orientations differ between motion clips. We call such unaligned motions “raw motion data”. If motion data are aligned with their local coordinate system, we call them “normalized motion”. From our observation, motion clips often have more spatial coherence in the normalized motion space. This is especially noticeable when the motion is symmetric. For example, if a person raises both arms simultaneously, the joints of the left and right arms have more coherence in the normalized motion space.
Therefore, we convert each motion clip into the normalized space to make more efficient use of spatial coherence.
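One possible way to align a clip with a local coordinate system is sketched below. The text does not specify the alignment, so everything here is an assumption: a Y-up coordinate system, a root joint at index 0, hypothetical left and right hip joints at indices 1 and 2, removal of the initial ground-plane offset, and a rotation about the vertical axis that maps the initial hip axis onto +X.

```python
import numpy as np

def normalize_motion(positions, root=0, left_hip=1, right_hip=2):
    """Align a (frames, joints, 3) clip, Y up, to a canonical start pose."""
    # Remove the initial ground-plane (X, Z) offset of the root; keep heights.
    out = positions - positions[0, root] * np.array([1.0, 0.0, 1.0])
    # Direction of the hip axis in the ground plane at the first frame.
    hip = out[0, right_hip] - out[0, left_hip]
    yaw = np.arctan2(hip[2], hip[0])
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation about Y that maps the initial hip axis onto the +X direction.
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return out @ rot.T

# The same synthetic clip, captured with a different start position and heading.
rng = np.random.default_rng(1)
clip = rng.normal(size=(5, 4, 3))
theta = 0.7
turn = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                 [0.0, 1.0, 0.0],
                 [-np.sin(theta), 0.0, np.cos(theta)]])
moved = clip @ turn.T + np.array([2.0, 0.0, -1.5])

normalized = normalize_motion(clip)
same = np.allclose(normalized, normalize_motion(moved))
```

Under these assumptions, two recordings of the same motion that differ only in start position and heading map to the same normalized clip, which is the property that makes coherence between clips and between symmetric joints easier to exploit.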
1.6 Flowchart
Figure 1 shows our system flowchart. Our system can be divided into three major parts: Segmentation, Clustering, and Approximation. After we load a BVH file and represent it as a normalized motion, we perform temporal segmentation on the motion according to its behaviors. Then we group joints with similar trajectories to form a smooth surface. Finally, we use space-time radial basis functions to approximate each surface or curve and calculate the difference.
[Figure 1 flowchart: Motion -> Segmentation (different segments represent different motion behaviors) -> Clustering (group joints with similar trajectories) -> Approximation -> Compress Difference]
Figure 1: Our system flowchart. The three major parts are “Segmentation”, “Clustering”, and “Approximation”.