Related Work - Real-Time Finger Motion Estimation Using Depth Images

Many marker-based optical motion capture systems have been applied to finger motion capture with satisfactory results. Park et al. used LEDs to design an interactive system [PY06]. Ludovic et al. solved the synchronization problem of body and finger motion from reduced marker sets [HRM12]. However, these systems still require obtrusive markers and a complicated camera setup.

Bare-hand tracking, on the other hand, remains a challenging topic. Edges and silhouettes are the most common features used to recognize hand pose [SMF04, STT06, DDH06], but the performance of these methods is still far from real-time.

Shakhnarovich et al. proposed an upper-body pose estimation system that searches a database of synthetic poses [SVD03]. Athitsos et al. developed fast, approximate nearest-neighbor techniques to estimate 3D hand pose [AS03, AAS04]. Ren et al. built a database of silhouette features for controlling animated human characters [RSH05]. Wang and Popović used a color glove to map the hand configuration to a database of hand poses [WP09]. Later, they introduced a data-driven bare-hand tracking system for efficient 3D mechanical assembly of computer-aided design (CAD) models [WPP11].
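The database-driven methods above share one core step: describe the current observation as a feature vector and return the stored pose whose features are closest. A minimal sketch of that lookup follows. It is not the implementation of any cited system; the brute-force linear scan, the function name `nearest_pose`, the three-dimensional features, and the pose labels in `POSE_DATABASE` are all assumptions made for illustration.

```python
import math


def nearest_pose(query, database):
    """Return (label, distance) of the database entry closest to `query`.

    `database` is a list of (feature_vector, pose_label) pairs. This is an
    exact linear scan; the cited systems use faster approximate searches.
    """
    best_label, best_dist = None, math.inf
    for features, label in database:
        dist = math.dist(query, features)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Hypothetical toy database: made-up features and made-up pose labels.
POSE_DATABASE = [
    ((0.0, 0.0, 0.0), "fist"),
    ((1.0, 0.0, 0.0), "point"),
    ((1.0, 1.0, 1.0), "open_hand"),
]

label, distance = nearest_pose((0.9, 0.1, 0.0), POSE_DATABASE)
```

A linear scan costs time proportional to the database size for every frame, which is why approximate nearest-neighbor techniques matter once the database grows large.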

After the launch of inexpensive depth cameras such as the Kinect, a full range of body shapes and sizes can be handled at interactive rates on consumer hardware. Ren et al. focused on a specific gesture set. They employed the Finger-Earth Mover’s Distance to measure hand shape dissimilarity so that the classifier can handle some of the challenging cases in hand gesture recognition [RYZ11]. However, their method still requires a black belt on the user’s wrist for hand segmentation and is easily affected by the user’s silhouette. Oikonomidis et al. used particle swarm optimization to track the full articulation of two strongly interacting hands observed by an RGB-D sensor [OKA12]. Because they relied on an expensive evolutionary optimization method, the system ran at only 4 fps, far from real-time, which precludes its use in interactive applications.
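The expense of a population-based search such as particle swarm optimization comes from evaluating an objective for every particle in every iteration. The following minimal sketch shows the basic update rule; it is not the tracker of [OKA12]. The toy cost (squared distance to a hypothetical vector of three joint angles) stands in for whatever image-based discrepancy a real tracker would evaluate, and the function names, parameter values, and `TARGET_ANGLES` are assumptions made for illustration.

```python
import random

# Hypothetical "true" joint angles (radians) that the search should recover.
TARGET_ANGLES = (0.3, -0.5, 1.2)


def toy_cost(angles):
    """Squared distance to TARGET_ANGLES; a stand-in for an image-based cost."""
    return sum((a - t) ** 2 for a, t in zip(angles, TARGET_ANGLES))


def particle_swarm_minimize(cost, dim, bounds, n_particles=30, n_iters=200, seed=0):
    """Minimize `cost` over a box with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive, and social weights (assumed)

    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # best position seen by each particle
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])  # one cost evaluation per particle per iteration
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost


best_angles, best_cost = particle_swarm_minimize(toy_cost, dim=3, bounds=(-3.14, 3.14))
```

With the default settings this sketch evaluates the cost 30 × 201 = 6030 times for a single solve. When each evaluation involves comparing a full hand hypothesis against the sensor data, that count has to be paid for every frame, which illustrates why such methods struggle to reach interactive rates.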

Human motion reconstruction from multiple cameras has produced a large body of literature. Bregler and Malik used twist motions and exponential maps to estimate motion even under complex self-occlusion [BM98]. Rosales and Sclaroff reconstructed human poses from low-level visual features [RS00]. Ioffe and Forsyth grouped parallel edges as candidate body segments and pruned the search over combinations of such segments [IF01]. Mori and Malik matched shape contexts against multiple 2D exemplars [MM03]. Ramanan and Forsyth clustered candidate body segments found from pairs of parallel lines, finding all individuals in each frame [RF03]. Agarwal and Triggs reconstructed poses by learning a regression against shape vectors extracted from image silhouettes [AT04]. In [SBR04], Sigal et al. implemented eigen-feature detectors for the head, upper arms, lower legs, and shoulders.

Felzenszwalb and Huttenlocher employed pictorial structures to efficiently find the best match to an image [FH05]. Navaratnam et al. showed that sampling the marginal distribution of unlabeled data improves pose fitting [NFC07]. Okada and Stenger built a search tree on a hierarchy of body shapes to capture human motion [OS08]. Based on a local mixture of Gaussian processes, Urtasun and Darrell proposed a regression scheme to infer human poses [UD08]. In [TU08], Tu used auto-context to label body parts, but the method did not define localized joints and took about 40 seconds per frame.

Randomized decision forests were built in [RRR08] on classes defined by human action patterns and camera viewpoints. Bourdev and Malik introduced ‘poselets’, body-part examples that are tightly clustered in both 3D pose and 2D image appearance and are detected with SVM classifiers [BM09].

While real-time estimation of full-body motion from monocular intensity image sequences is still an open problem, increasingly popular depth cameras open up further opportunities for human pose reconstruction, because they allow more reliable 3D pose estimation from a single viewpoint. Grest et al. estimated the pose of a body of known size and starting position using an Iterative Closest Point (ICP) algorithm [GWK05]. Based on MRFs, Anguelov et al. segmented puppets in 3D scan data into body parts and background [ATC05]. In [ZF07], Zhu and Fujimura used a linear programming relaxation to identify coarse upper-body parts, but they required a T-pose initialization to size the model. Bleiweiss et al. used 3D model fitting to track human skeletons [BEK09]. Siddiqui and Medioni used a data-driven Markov chain Monte Carlo (MCMC) model to find the optimal pose and showed significant improvement over ICP [SM10]. Kalogerakis et al. labeled and segmented 3D meshes into different parts [KHS10], but they did not deal with occlusions and the results were sensitive to the training set. Ganapathi et al. showed that data-driven evidence is crucial for tracking under self-occlusion [GPT10]. Plagemann et al. [PGK10] built an interest point detector for 3D meshes that finds geodesic extrema and localizes body parts. Their method generated both a location and an orientation estimate for each part, but the use of interest points limits the choice of parts, so that left and right parts cannot be distinguished. Shotton et al. segmented the different human body parts using a random forest classifier, which is implemented in the Kinect system [SFC11]. The segmentation was then used to generate the joint positions of a skeleton. Baak et al. combined generative and discriminative methods to estimate full-body pose at interactive frame rates [BMB11]. Girshick et al. extended Hough forests to directly predict the positions of body joints [GSK11].
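To make the step from a per-pixel body-part segmentation to joint positions concrete, the sketch below computes one proposal per part as the centroid of that part's pixels. This is a deliberate simplification and not the procedure of [SFC11]: a plain centroid is sensitive to mislabeled outlier pixels, and a practical system would need a more robust estimate. The function name `joint_proposals`, the label values, and the toy arrays are assumptions made for illustration.

```python
from collections import defaultdict


def joint_proposals(labels, depth, background=0):
    """Return {part_id: (mean_col, mean_row, mean_depth)} for each body part.

    `labels` is a 2-D list of integer part ids and `depth` a 2-D list of
    depth values with the same shape. Background pixels are skipped.
    """
    # Per part: running sums of column, row, depth, plus a pixel count.
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for row, (label_row, depth_row) in enumerate(zip(labels, depth)):
        for col, (part, z) in enumerate(zip(label_row, depth_row)):
            if part == background:
                continue
            acc = sums[part]
            acc[0] += col
            acc[1] += row
            acc[2] += z
            acc[3] += 1
    return {part: (sx / n, sy / n, sz / n)
            for part, (sx, sy, sz, n) in sums.items()}


# Hypothetical 3x3 segmentation: 0 = background, 1 and 2 = two body parts.
LABELS = [
    [0, 1, 1],
    [0, 1, 1],
    [2, 2, 0],
]
DEPTH = [
    [0.0, 1.0, 1.0],
    [0.0, 1.2, 1.2],
    [2.0, 2.0, 0.0],
]

proposals = joint_proposals(LABELS, DEPTH)
```

The proposals here are in image coordinates plus depth; converting them to 3D world coordinates would additionally require the camera intrinsics, which this sketch leaves out.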
