
Chapter 1 Introduction

1.2 System Overview

For the hardware architecture, our system uses the Xbox Kinect as the image sensor, providing both RGB images and depth information; Table 1.1 shows the specification of the Kinect. In the depth image, a pixel with lower intensity indicates a smaller distance between the object and the camera, and all points are set to 0 if the sensor is not able to measure their depth. The images captured by the Kinect are delivered to a personal computer (PC) and then processed to implement hand gesture recognition.

The PC specification is an Intel® Core™ i5-3210M CPU @ 2.50 GHz, 8 GB memory, and the Windows 7 operating system. The frame rate is about 30 frames per second, and the frames are processed using C/C++ and Matlab.

Table 1.1 Specification of Kinect

Effective range    Depth sensor range: 1 m ~ 4 m
Field of view      Horizontal: 57 degrees; Vertical: 43 degrees; Physical tilt range: ±27 degrees
Data stream        320×240 16-bit depth @ 30 frames/sec; 640×480 32-bit color @ 30 frames/sec

For the software architecture, Fig. 1.1 shows the flowchart of the proposed system.

First, the system receives the RGB and depth images from the Kinect and selects the region of interest (ROI). After ROI selection, feature extraction is performed. At the final stage, the extracted features are delivered to the hand gesture recognition module to distinguish different gestures. The experimental environment is our laboratory, with the Kinect camera mounted at a height of about 100 cm.
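As a minimal sketch of this per-frame flow (the stage functions below are hypothetical placeholders for the processing described in this thesis, not part of any Kinect SDK; the actual implementation is in C/C++ and Matlab):

```python
import numpy as np

def select_roi(rgb, depth):
    # Placeholder: keep pixels inside the working depth range (Sec. 1.2).
    return (depth > 0) & (depth < 2000)

def extract_features(roi, depth):
    # Placeholder: a real system extracts hand feature points (Ch. 3).
    return np.float64(roi.sum())

def recognize_gesture(features):
    # Placeholder: a real system feeds feature sequences to an HMM (Ch. 2).
    return "unknown"

def process_frame(rgb, depth):
    roi = select_roi(rgb, depth)              # ROI selection
    features = extract_features(roi, depth)   # feature extraction
    return recognize_gesture(features)        # gesture recognition

gesture = process_frame(np.zeros((480, 640, 3)), np.zeros((480, 640)))
```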


Moreover, there are two limitations when implementing the system. First, the detection distance is between 0.5 m and 2 m because of the hardware limitations of the Kinect. Second, the system focuses only on single-user operation.
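For instance, the 0.5 m to 2 m working range can be enforced with a simple depth mask (a NumPy sketch, assuming the raw Kinect depth unit is millimeters; not the thesis's C/C++ code):

```python
import numpy as np

def depth_range_mask(depth_mm, near=500, far=2000):
    """Keep pixels whose depth lies in [near, far] millimeters.

    Pixels reported as 0 (depth unmeasurable) fall below the near
    limit and are excluded automatically.
    """
    return (depth_mm >= near) & (depth_mm <= far)

# Example: mask a synthetic 480x640 depth frame.
depth = np.random.randint(0, 4000, size=(480, 640))
mask = depth_range_mask(depth)
```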

Chapter 2 describes work related to the system. Chapter 3 introduces the proposed system in detail. Chapter 4 shows the experimental results. Chapter 5 concludes the thesis and discusses future work.

Fig. 1.1 Software architecture


Chapter 2 Related Works

2.1 Hand Gesture Recognition Method

In recent years, many hand gesture recognition approaches have been proposed. In general, these methods can be roughly divided into three categories: color-based algorithms, appearance-based algorithms, and 3-D hand-model-based approaches. The 3-D hand-model-based approaches [3,16] rely on a 3-D kinematic hand model with a large number of degrees of freedom (DOF) and calculate the hand parameters by comparing the input images with the possible 2-D appearances projected by the 3-D hand model. This approach is suitable for realistic interactions in virtual environments, since it provides rich information for describing different types of gestures. However, its major drawback is that it requires a huge image database to cover all the characteristic shapes under different views.

Skin color is an image feature frequently used for hand detection [1]. However, color-based algorithms face the difficulty that the human arm and face have a color similar to that of the hands. These methods are also very sensitive to lighting conditions, so when the lighting does not satisfy the environment requirements, they usually perform poorly.

For the appearance-based algorithms, shape descriptors are used to represent the hand shape. In [4], Fourier descriptors and Zernike moments are used to recognize hand gestures, but the computational cost is too high since they are pixel-based. [5] obtained the integrated hand contour and then computed the curvature of each point on the contour, but noise and unstable illumination in the segmented background are the major problems of this method. The eigenspace is another technique, which provides an efficient representation using a set of basis vectors, but it is not invariant to translation, scaling, and rotation. For this reason, a number of studies on local invariant features, such as SIFT and Haar-like features, have been proposed. In [6], SIFT features are used to achieve rotation-invariant hand detection. The authors of [7] used Haar-like features, which describe the ratio between dark and bright regions rather than single pixel values, to achieve hand gesture recognition, but the computational cost makes it hard to implement in real-time systems.

Other appearance-based approaches focus on building a hand model containing the palm and finger structures. [8] determined the fingertip positions and the center of the palm using the first moment of the 2-D probability distribution based on the contour of the segmented hand region. [9] detected the fingertips using the momentum of the skin color region and used the Kalman filter to achieve robust finger tracking. Since there are many pixels on the contours or edges, the computation cost for both [8] and [9] is considerably high. Template matching can also be used to detect fingertips and palms [10], but the distance from the camera to the hand cannot change, and good hand segmentation results are also required. [11] proposed a method that combines particle filtering with random diffusion of the particles. However, the cost of this method is still too high to implement.


2.2 Skin Color Segmentation

Skin color segmentation is an important step that extracts all skin pixels from the Kinect image. Most skin color segmentation methods are based on statistical approaches, which can be divided into two categories: parametric and non-parametric. A parametric statistical approach, such as the Gaussian model or the Gaussian-mixture model (GMM), represents the probability density function (PDF) of the skin-color distribution in a parametric form. The main advantages of this approach are its lower computation time and that no training data need to be stored. However, a suitable parametric order, especially for the Gaussian-mixture model, is hard to determine and may not fit the real skin-color distribution.

A non-parametric approach uses a quantized histogram to represent the PDF, or uses training data to estimate the skin-color density function, as in the kernel method or the support vector machine (SVM). Although this approach adapts intuitively to different training data sets, its major drawback is that it requires keeping a large amount of training data and incurs relatively high computation time. Consequently, model selection is a trade-off between fitting accuracy and computational complexity.

The choice of color space is the primary step in skin-color segmentation. Statistical approaches usually use a suitable color space to reduce the effect of varying luminance and to decrease the overlap between skin and non-skin pixels. HSV and YCbCr are the most popular color spaces for skin detection due to their robustness to varying illumination and the minimal overlap between skin color and background color [12].
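As an illustration, the standard ITU-R BT.601 conversion from RGB to YCbCr, from which the chrominance pair (Cb, Cr) used below is taken, can be sketched as follows (NumPy, not the thesis code):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an 8-bit RGB image (H x W x 3) to YCbCr (ITU-R BT.601)."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =          0.299    * r + 0.587    * g + 0.114    * b
    cb = 128.0 -  0.168736 * r - 0.331264 * g + 0.5      * b
    cr = 128.0 +  0.5      * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

ycbcr = rgb_to_ycbcr(np.zeros((480, 640, 3), dtype=np.uint8))
```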

The parameters of the GMM can be obtained from the training data through the iterative expectation-maximization (EM) technique [13]. After proper parameter estimation, the conditional probability densities for skin and non-skin colors are obtained, denoted as p(X | skin) and p(X | nonskin), where X = [Cr, Cb]^T. Given these class-conditional probabilities of the skin and non-skin models, a skin classifier can be built using the Bayes classifier [14]. The classification boundary is determined where the likelihood ratio of p(X | skin) to p(X | nonskin) exceeds a threshold chosen from the ROC (receiver operating characteristic) curve. That is, a given image pixel x_n = [Cr(n), Cb(n)]^T is classified as skin when it satisfies:

\[
p(\mathrm{skin} \mid \mathbf{x}_n) \ \ge\ K\, p(\mathrm{nonskin} \mid \mathbf{x}_n)
\tag{2.1}
\]

where K is a constant and p(skin) = 1 − p(nonskin). Rearranging (2.1) with Bayes' rule, it becomes:

\[
\frac{p(\mathbf{x}_n \mid \mathrm{skin})}{p(\mathbf{x}_n \mid \mathrm{nonskin})} \ \ge\ K', \qquad K' = K\, \frac{p(\mathrm{nonskin})}{p(\mathrm{skin})}
\tag{2.2}
\]

The threshold K' is usually determined from the ROC curve, which shows the relationship between true positives and false positives. The Bayes classifier has been widely used for skin segmentation because of its simplicity and low computation time: we only need to compute the likelihood ratio in (2.2) and check whether it is larger than the threshold K'.
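A sketch of this likelihood-ratio test, assuming for illustration that each class-conditional density is a single Gaussian over [Cr, Cb] (the thesis fits GMMs by EM; the single-Gaussian stand-in keeps the example short):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Evaluate a 2-D Gaussian density at each row of x (N x 2)."""
    d = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.einsum('ni,ij,nj->n', d, inv, d))

def classify_skin(crcb, skin_mean, skin_cov, non_mean, non_cov, k_prime):
    """Label pixels as skin when p(x|skin)/p(x|nonskin) > K', as in (2.2).

    crcb: N x 2 array of [Cr, Cb] values; k_prime is the threshold
    chosen from the ROC curve.
    """
    ratio = gaussian_pdf(crcb, skin_mean, skin_cov) / \
            (gaussian_pdf(crcb, non_mean, non_cov) + 1e-12)
    return ratio > k_prime
```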


2.3 Classification: From Linear Models to ANNs

The goal of a classification problem is to assign an input datum x_n ∈ X, n = 1, 2, ..., N, in the database to one of the K classes C_k, k = 1, 2, ..., K. Usually, these classes are disjoint, so that x_n is classified into at most one of them (or none). To simplify the discussion, this section considers only the two-class classification problem.

2.3.1 Linear Model

To determine which class an input x_n belongs to, the simplest decision function for two-class classification is the linear model:

\[
y = \mathbf{w}^{T}\mathbf{x}_n + w_0 = \sum_{j} w_j x_{nj} + w_0
\tag{2.3}
\]

where w is called the weighting vector and w_0 is the bias. Based on the linear model (2.3), first choose y = 0 as the decision boundary, and then classify the inputs satisfying y ≥ 0 into class 1, denoted C_1, and the other inputs into C_2. An example of such a two-class classification is shown in Fig. 2.1(a), where the decision boundary is included. However, this linear model does not work well when there exists an overlap between the two classes, such as the XOR problem depicted in Fig. 2.1(b). To solve this problem, the generalized linear model is introduced.

Fig. 2.1 Examples of (a) a linearly separable set and (b) a linearly inseparable set
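A minimal sketch of this decision rule (illustrative weights, not values from the thesis), which also exhibits the XOR failure of Fig. 2.1(b):

```python
import numpy as np

def linear_classify(x, w, w0):
    """Assign x to C1 when y = w^T x + w0 >= 0, else to C2, as in (2.3)."""
    y = x @ w + w0
    return np.where(y >= 0, 1, 2)

# XOR-labeled points: classes should be [2, 1, 1, 2], but this (or any)
# single linear boundary yields [2, 1, 1, 1] -- the case of Fig. 2.1(b).
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(linear_classify(x, w=np.array([1.0, 1.0]), w0=-0.5))
```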

2.3.2 Generalized Linear Model

The idea of the generalized linear model is to transform the input data x_n in the database X into another space, and then use the transformed data in place of x_n in (2.3):

\[
y = \mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_n) + w_0
\tag{2.4}
\]

where φ(·) is a user-defined basis function. In other words, the basis function transforms the database X into a linearly separable feature space. Moreover, an activation function f(·) can be further introduced to modify (2.4) as below:

\[
y = f\Big(\mathbf{w}^{T}\boldsymbol{\varphi}(\mathbf{x}_n) + w_0\Big) = f\Big(\sum_{j} w_j \varphi_j(\mathbf{x}_n) + w_0\Big)
\tag{2.5}
\]

where y represents the predicted discrete class label, or a posterior probability lying in the range [0, 1]. However, an adequate basis function φ(·) is often difficult to determine: to apply the modified model (2.5) to a variety of problems, the basis functions must be adapted correspondingly. Recently, investigators have successfully adopted the feed-forward neural network to implement (2.5), since the neural network uses a fixed number of basis functions and allows them to be adjusted during training. The cost of this model is a nonconvex optimization problem during the training stage, which may get stuck in a local minimum.
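For example, the XOR problem of Fig. 2.1(b) becomes linearly separable after adding a hand-picked basis feature φ(x) = x1·x2 (the basis choice and weights below are ours, for illustration only):

```python
import numpy as np

# XOR labels: class 1 when exactly one coordinate is 1, else class 2.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Augment the inputs with the product feature x1 * x2; in this
# 3-D feature space a single linear boundary separates the classes.
phi = np.column_stack([x, x[:, 0] * x[:, 1]])
w, w0 = np.array([1.0, 1.0, -2.0]), -0.5
y = phi @ w + w0
print(np.where(y >= 0, 1, 2))  # prints [2 1 1 2], the correct XOR labels
```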

2.3.3 Artificial Neural Network

ANNs have been successfully applied to complicated problems such as image analysis, speech recognition, and adaptive control, and are often adopted to implement human detection via intelligent learning algorithms. The basic structural unit of an ANN is the neuron, whose input-output relationship is shown in Fig. 2.2 and described as:

\[
z = h\Big(\sum_{i} w_i x_i + w_0\Big)
\tag{2.6}
\]

Usually, the activation function h(·) is chosen to be the logistic sigmoid function, the tangent sigmoid function, or the linear function, described as below:

(1) Logistic sigmoid function:
\[
h(a) = \frac{1}{1 + e^{-a}}
\tag{2.7}
\]

(2) Tangent sigmoid function:
\[
h(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}
\tag{2.8}
\]

(3) Linear function:
\[
h(a) = a
\tag{2.9}
\]

where (2.7) has range [0, 1] and (2.8) has range [−1, 1]. In summary, a linear output is needed for regression problems, while a logistic sigmoid output is used for classification problems.

Fig. 2.2 Basic structure of a neuron


A general ANN is composed of one input layer, one output layer, and one hidden layer, as shown in Fig. 2.3. The input layer just receives the input signals, so each node in this layer acts as a buffer. For the other two layers, the nodes are realized by the neuron structure depicted in Fig. 2.2. Each node in the second layer, i.e., the hidden layer, has the same form as (2.6):

\[
z_j = h\Big(\sum_{i} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\Big)
\tag{2.10}
\]

where j = 1, ..., D and the superscript (1) indicates that the weights belong to the first layer of the ANN. Similarly, the output of each neuron in the third layer, called the output layer, takes the same form:

\[
y_k = h\Big(\sum_{j} w_{kj}^{(2)} z_j + w_{k0}^{(2)}\Big)
\tag{2.11}
\]

where k = 1, ..., K. Substituting (2.10) into (2.11), each output y_k becomes:

\[
y_k = h\Big(\sum_{j} w_{kj}^{(2)}\, h\Big(\sum_{i} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\Big) + w_{k0}^{(2)}\Big)
\tag{2.12}
\]

Compared with (2.5), an ANN uses this structure to find an adequate basis function φ(·) adapted to the training data. As expected, ANNs are able to deal with more complicated problems than the generalized linear model.
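A direct transcription of (2.12) as a forward pass (a sketch with logistic-sigmoid units throughout; the weights below are random placeholders, not trained values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ann_forward(x, w1, b1, w2, b2):
    """Two-layer ANN of Fig. 2.3: hidden layer (2.10), then output (2.11).

    x: input vector (I,); w1: D x I hidden weights, b1: D biases;
    w2: K x D output weights, b2: K biases.
    """
    z = sigmoid(w1 @ x + b1)   # hidden activations, eq. (2.10)
    y = sigmoid(w2 @ z + b2)   # outputs, eq. (2.11), i.e. (2.12) overall
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = ann_forward(x, rng.normal(size=(4, 3)), rng.normal(size=4),
                rng.normal(size=(2, 4)), rng.normal(size=2))
```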


2.4 Classification for Sequential Data

The learning algorithms mentioned before, such as the linear model and the ANN, primarily focus on independent and identically distributed (i.i.d.) observations. The i.i.d. assumption allows us to express the likelihood function of all observations as:

\[
p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N \mid \theta) = p(\mathbf{x}_1 \mid \theta)\, p(\mathbf{x}_2 \mid \theta) \cdots p(\mathbf{x}_N \mid \theta)
\tag{2.13}
\]

where x_n is the observation at time n, n = 1, ..., N, and θ denotes the model parameters. However, this assumption fails in many applications where the observations are sequential and depend on the previous ones, such as speech data or rainfall measurements. Therefore, it is necessary to consider the correlation among observations. In real applications, however, it is difficult to model a general dependence of the present observation on all previous observations, since the complexity grows as time increases. To solve this problem, the Markov model is proposed under the assumption that the present datum x_n depends only on the previous one, x_{n−1}.

2.4.1 Markov Model

Given N observations {x_1, ..., x_N}, we can always use the product rule to express the joint distribution of the sequence as:

\[
p(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1})
\tag{2.14}
\]

Fig. 2.4 shows the fully dependent sequence, in which the observation x_n at time n is correlated with all the previous n − 1 observations. Assume that each conditional distribution depends only on the previous observation; then the nth conditional distribution in (2.14) simplifies to:

\[
p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1}) = p(\mathbf{x}_n \mid \mathbf{x}_{n-1}), \qquad n = 2, \ldots, N
\tag{2.15}
\]

Therefore, the joint distribution of the N observations is rewritten as:

\[
p(\mathbf{x}_1, \ldots, \mathbf{x}_N) = p(\mathbf{x}_1) \prod_{n=2}^{N} p(\mathbf{x}_n \mid \mathbf{x}_{n-1})
\tag{2.16}
\]

which is known as the first-order Markov chain, shown in Fig. 2.5. In most practical applications, all N − 1 conditional distributions are assumed to be equal; that is, the relation between any adjacent observations is the same. A model with this property is called the homogeneous first-order Markov model. A higher-order Markov chain can provide more information from the previous samples, but the computation time grows rapidly as the order increases. Rather than increasing the order, we can enrich the Markov model by introducing additional latent variables, which forms the so-called hidden Markov model (HMM) and has been widely used to solve complicated classification problems.
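As a sketch of (2.16) for a homogeneous chain with M discrete states (the initial distribution and transition matrix below are illustrative values):

```python
import numpy as np

def markov_log_joint(seq, pi, A):
    """Log joint probability of a state sequence under eq. (2.16).

    seq: list of state indices; pi: initial distribution p(x1);
    A: M x M transition matrix, A[i, j] = p(x_n = j | x_{n-1} = i).
    """
    logp = np.log(pi[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        logp += np.log(A[prev, cur])
    return logp

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(markov_log_joint([0, 0, 1, 1], pi, A))
```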

Fig. 2.4 Sequence of fully dependent observations

Fig. 2.5 Sequence of a first-order Markov chain


2.4.2 Hidden Markov Model

The HMM depicted in Fig. 2.6 includes a first-order Markov chain Z = {z_1, ..., z_N}, where z_n is the latent variable, also called the hidden state, corresponding to the observation x_n, n = 1, ..., N. Each z_n is related only to the previous latent variable z_{n−1}, and the resulting joint distribution is given by:

\[
p(\mathbf{x}_1, \ldots, \mathbf{x}_N, \mathbf{z}_1, \ldots, \mathbf{z}_N) = p(\mathbf{z}_1) \left[\prod_{n=2}^{N} p(\mathbf{z}_n \mid \mathbf{z}_{n-1})\right] \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{z}_n)
\tag{2.17}
\]

where p(z_n | z_{n−1}) and p(x_n | z_n) are called the transition probability and the emission probability, respectively. Nowadays, the HMM is widely used in many domains, such as speech recognition, natural language modeling, on-line handwriting recognition, and HGR [15][16].

In real-time applications, two problems must be solved in order to use the HMM in (2.17):

Problem 1: Given N observations X = {x_1, ..., x_N} and all the model parameters p(z_1), p(z_n | z_{n−1}), and p(x_n | z_n), for n = 1, ..., N, how do we compute the posterior probability of each latent variable, p(z_n | X)?

Problem 2: Given N observations X = {x_1, ..., x_N}, how do we estimate the parameters p(z_1), p(z_n | z_{n−1}), and p(x_n | z_n), for n = 1, ..., N?

Fig. 2.6 Sequence of the hidden Markov model

For the first problem, given the observations and all model parameters, (2.17) can be evaluated, and we use the property:

\[
p(Z \mid X) = \frac{p(Z, X)}{p(X)}
\tag{2.18}
\]

where X = {x_1, ..., x_N} and Z = {z_1, ..., z_N}. Using the probability sum rule, the posterior can be calculated directly as:

\[
p(\mathbf{z}_n \mid X) = \sum_{\mathbf{z}_1} \cdots \sum_{\mathbf{z}_{n-1}} \sum_{\mathbf{z}_{n+1}} \cdots \sum_{\mathbf{z}_N} p(Z \mid X)
\tag{2.19}
\]

for n = 1, ..., N. This is an intuitive way to find the posterior probability, but it requires a very large amount of computation: if each latent variable z_n is an M-state discrete random variable, (2.19) needs N − 1 nested summations over M states, so the computational order is O(M^{N−1}). For this reason, a more efficient approach, the forward-backward algorithm [17], is used to reduce the computation time; the detailed derivation is given in Appendix A. To find the most probable sequence of latent variables for given observations X = {x_1, ..., x_N}, the Viterbi algorithm is used. These two methods assume the model parameters are known, but in real applications these parameters are unknown and must be properly estimated from the observations.

Problem 2 is more difficult than the previous one, since it requires a method to adjust the model parameters θ such that the likelihood function of the observations, p(X | θ), is maximized. However, given any observation sequence as training data, there is no closed-form solution for estimating the model parameters. Instead, an iterative procedure known as the Baum-Welch method [18], or expectation-maximization (EM) algorithm, is used to maximize the likelihood function and obtain optimal parameters. Different initial parameter values lead to different results, so a suitable initial guess is an important step when using the EM algorithm.


Chapter 3 Hand Gesture Recognition System

3.1 Hand Region Detection

It is important to detect the exact hand region in real-time applications. A number of approaches have been proposed to extract the hand region precisely, such as the Adaboost learning algorithm [16] and SIFT features [17], but their computation time makes them hard to implement in a real-time detection system. Since the extraction of skin color regions is fast, this method is commonly used in hand detection even though it is very sensitive to lighting conditions. Based on the skin region, the system implements ROI selection with connected component labeling (CCL) to increase the detection rate. To further distinguish the hand and face regions, the scheme proposed in [19] is adopted.

3.1.1 Skin Color Extraction

With the use of the Kinect, a suitable distance range from the user to the screen can be selected, where a lower depth value implies a smaller distance. Besides, all points are set to 0 if the sensor is not able to measure their depth. In our proposed remote hand-gesture control system, the hands are required to move within a range of about 0.5 m to 2.0 m. Many skin color extraction methods have been described in Chapter 2.2; to speed up the processing, we use a look-up-table method in the YCbCr color space, and the result is shown in Fig. 3.1. There are still many noise pixels in the skin color image, so a method to remove them is required. As observed, these noise pixels usually cover a small region compared with the hand and face regions, and thus can be removed by the connected component labeling (CCL) method.

Connected component labeling (CCL) [20] is a technique to identify distinct components and is often used to detect connected regions in binary images. This thesis applies 4-connected component labeling to label the regions of interest. In addition to identifying the connected regions, CCL also computes their areas: if the total pixel count of a connected region is less than a threshold, the region is treated as noise and removed, so only sufficiently large connected objects are retained. To further improve the selected CCL objects, the dilation operator [20] is employed to fill the holes of the connected components, as shown in Fig. 3.2. The next step is to distinguish the hand region based on the feature points searched for and extracted in the selected CCL objects.

Fig. 3.1 Skin color region
Fig. 3.2 The final CCL image
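A sketch of this noise-removal step using SciPy's connected-component labeling and binary dilation (the thesis implements 4-connected CCL in C/C++; the area threshold below is an illustrative value):

```python
import numpy as np
from scipy import ndimage

def clean_skin_mask(mask, min_area=500):
    """Keep 4-connected components larger than min_area, then dilate.

    mask: boolean skin-color image. Returns the cleaned boolean mask.
    """
    four_connected = np.array([[0, 1, 0],
                               [1, 1, 1],
                               [0, 1, 0]])
    labels, num = ndimage.label(mask, structure=four_connected)
    # Area of each labeled component (labels are 1..num).
    areas = ndimage.sum(mask, labels, index=range(1, num + 1))
    keep = np.isin(labels, 1 + np.flatnonzero(areas >= min_area))
    # Dilation fills small holes in the retained components (Fig. 3.2).
    return ndimage.binary_dilation(keep)
```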

3.1.2 Feature Points Extraction

There are many distance transformation (DT) algorithms for converting a binary image into a distance image, differing in the way distances are computed [21,22]. In this thesis, the two-pass scanning distance transform is applied to the binary image for its simplicity and efficiency of calculation; Fig. 3.3(a) gives an example. To extract feature points from the distance transform image, a useful characterization introduced in [19] relies on two important qualities. First, the skeleton of an object can be extracted by distance transformation, as shown in Fig. 3.3(b), where the skeleton pixels usually possess locally maximal distance transformation values. Second, the distance transformation values of skeleton pixels on the fingers are usually low compared with those in the palm region. Based on this information, the system extracts the feature points in two steps, introduced in the following paragraphs.

These distance-transformation-based feature points can be extracted in the following two steps. First, extract the local maximum pixels based on the distance transformation value. On the distance transformation image, a local maximum pixel satisfies the condition:

\[
\prod_{i,j \in \{-1,0,1\}} G\big(D(x,y),\ D(x+i,\, y+j)\big) = 1
\]

where the function G is defined as:

\[
G(a, b) =
\begin{cases}
1, & \text{if } a \ge b \\
0, & \text{otherwise}
\end{cases}
\]

Fig. 3.3 Examples of the distance transform: (a) distance values (b) skeleton


where D(x, y) is the distance transformation value at point (x, y). This condition implies that D(x, y) is greater than or equal to the distance transformation values of its 8 neighboring pixels.
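A sketch of this first step in Python/SciPy (the Euclidean distance transform stands in for the two-pass scan used in the thesis, and a 3×3 maximum filter implements the G condition; the second step, separating low-valued finger pixels from the palm, is not shown):

```python
import numpy as np
from scipy import ndimage

def dt_local_maxima(hand_mask):
    """Find pixels whose DT value is >= all 8 neighbors (the G condition).

    hand_mask: boolean hand-region image. Returns the DT image D and a
    boolean map of local-maximum (skeleton candidate) pixels.
    """
    D = ndimage.distance_transform_edt(hand_mask)
    # Maximum over each 3x3 neighborhood; a pixel is a local maximum
    # when it attains that maximum and lies inside the hand region.
    neighborhood_max = ndimage.maximum_filter(D, size=3)
    return D, (D >= neighborhood_max) & hand_mask
```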
