基於深度卷積神經網路之手勢辨識技術研究

(1)

國立交通大學

電子工程學系電子研究所

碩

士

論

文

基於深度卷積神經網路之

手勢辨識技術研究

Hand Gesture Recognition based on

Deep Convolutional Neural Network

研

究

生：謝汝欣

指導教授：王聖智教授

(2)

基於卷積深度神經網路之

手勢辨識技術研究

Hand Gesture Recognition based on

Convolutional Deep Neural Network

研究生：謝汝欣 Student：Ju-Hsin Hsieh

指導教授：王聖智教授 Advisor：Prof. Sheng-Jyh Wang

國立交通大學

電子工程學系電子研究所碩士班

碩士論文

A Thesis

Submitted to Department of Electronics Engineering & Institute of Electronics College of Electrical and Computer Engineering

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Master of Science in

Electronics Engineering

August 2014

Hsinchu, Taiwan, Republic of China

(3)

基於深度卷積神經網路之手勢辨識技術研究

研究生：謝汝欣指導教授：王聖智教授

國立交通大學

電子工程學系電子研究所碩士班

摘要

在這篇論文中，我們提出了一個使用任意單一攝影機在不同角度下仍能遠距離辨識多重手勢的技術。此技術不需固定攝影機角度，不需要由特定使用者操作，且能分辨多種手勢。此技術在影片中自動找出使用者的手的位置，並判斷使用者想傳達的訊息，希望能進行遠距離的操作以及在任何背景之下皆能達到手勢辨識的效果。在此設定議題下，為了能在複雜背景與不同視角拍攝的情況下有效的找到手部出現的區域以及辨識使用者所傳達的訊息，我們不採用易被複雜背景所誤導膚色資訊且不使用事先設定之特徵萃取技術，而是利用卷積神經網路有效且準確的學習不同手勢所擁有的特徵，並結合不同形狀與大小的特徵，以找到能分離不同手勢最佳的特徵空間，再利用深度神經網路找出特徵之間的關係以及不同手勢與各特徵的連接，藉此達到手部偵測與多重手勢辨識的成果。此外，我們也利用影片中已經得到的手部位置與移動資訊加上手勢辨識結果，推測之後較有可能出現手部的區域以及最佳的手勢辨識結果。

(4)

Hand Gesture Recognition based on

Deep Convolutional Neural Network

Student：Ju-Hsin Hsieh Advisor：Prof. Sheng-Jyh Wang

Department of Electronics Engineering, Institute of Electronics

National Chiao Tung University

Abstract

In this thesis, we propose an algorithm which recognize hand gestures with a single camera

under different view-points within a range remotely. The algorithm can recognize

multi-gestures without fixing view-point of the camera or a particular user controlling. In order to

find the hand position and recognize the gestures from the video automatically and efficiency

in the clutter background under different view-points within a range, we don’t take the skin

color as the information which is easily influence by the clutter background, nor do we use the

specific feature extraction processing. Instead, we use the convolutional neural network to learn

the features in the hand gesture image and combine the different kernel sizes to get the best

feature space for separating different gestures. Then we use the deep neural network to find the

relationship between the hand features and the gesture classes. Under this setting, we are able

to locate the hand and recognize multi hand gestures. Furthermore, with the help of the temporal

information for the hand position and motion getting from the video, we are able to infer the

(5)

Content

Chapter 1. Introduction ... 1

Chapter 2. Related Works ... 3

2.1 Hand Gesture Recognition by using Random Forest ... 3

2.2 Hand Gesture Recognition by Neural Network ... 6

Chapter 3. Proposed Method ... 9

3.1 Data Set ... 11

3.1.1 Data set for Convolutional Neural Network ... 11

3.1.2 Data set for Deep Neural Network ... 12

3.2 Basic concept of two learning algorithm ... 14

3.2.1 Convolutional Neural Network ... 14

3.2.2 Deep Neural Network ... 20

3.3 Applicable to Hand Gesture Recognition ... 24

3.3.1 Feature Extraction by Convolutional Neural Network ... 25

3.3.2 Deep Convolutional Neural Network Classification ... 34

3.3.3 Tracking Technique ... 36

3.4 Refinement ... 40

Chapter 4. Experimental Results ... 43

4.1 Detection accuracy ... 43

4.2 Combine Tracking and Detection ... 46

4.3 Video comparison ... 49

(6)

List of Tables

Table 3-1 Different kernel size correspond to different pooling scaling factor. ... 28

Table 4-1 Accuracy rate for training data set. ... 44

Table 4-2 Accuracy rate for testing data set. ... 44

Table 4-3 The computing time for detection only and detection combine tracking... 47

Table 4-4 The hand gesture recognition accuracy for detection only and detection combine tracking of video 1. ... 47

Table 4-5 The hand gesture recognition accuracy for detection only and detection combine tracking of video 2. ... 48

Table 4-6 The hand gesture recognition accuracy for our proposed deep convolutional neural network, traditional convolutional neural network and traditional deep neural network of video 1. ... 50

Table 4-7 The hand gesture recognition accuracy for our proposed deep convolutional neural network, traditional convolutional neural network and traditional deep neural network of video 2. ... 50

(7)

List of Figures

Figure 2-1 Eight edge filters and the corresponding edge maps for the image “Five”. ... 4 Figure 2-2 Compare traditional grid cells and randomized pooling cells for hand image. ... 5 Figure 2-3 (a) Training procedure of random forest with randomized pooling cells.

(b) The final trained random pooling cells for hand image. ... 6 Figure 2-4 Overview of hand gesture recognition system... 7 Figure 2-5 (a) Central coordinates normalization. (b) Example of general features matrix. ... 7 Figure 3-1 (a) Training procedure of our proposed algorithm.

(b) Testing procedure of our proposed algorithm. ... 11 Figure 3-2 Row one is the examples of stop gesture.

Row two is the examples of pointer gesture. Row three is the examples of ok gesture. Row four is the examples of wave gesture.

Row five is the examples of background image. ... 12 Figure 3-3 For each row, there are, 3*3, 6*6, 7*3 and 3*7, four different kinds of kernel size. For each column, there are stop, pointer, ok, wave and background five different kinds of input images. . 13 Figure 3-4 Example of convolutional neural network architecture. ... 15 Figure 3-5 (a) An example of input image.

(b) Neural network : A hidden node with its fully-connected weight which might learns from the input image data set.

(c) Convolutional neural network : A hidden node with its local-connected weight which might learns from the input image data set. ... 16 Figure 3-6 (a) An example of a specific kernel.

(b)&(c) Two different input image with same appearance but some local translation feature (as circled region).

(d)&(e) Two feature map corresponding to two input image and its pooling result. ... 17

(8)

Figure 3-8 (a) Architecture of MCMC by using Gibbs sampling.

(b) Architecture of Contrastive Divergence algorithm ... 24 Figure 3-9 (a) The kernel of convolutional neural network is a small square.

(b) The kernel of convolutional neural network is a large square. (c) The kernel of convolutional neural network is a long and narrow stipe. ... 26 Figure 3-10 (a) The convolutional neural network feature learned by 3*3 kernel size.

(b) The convolutional neural network feature learned by 6*6 kernel size. (c) The convolutional neural network feature learned by 7*3 kernel size.

(d) The convolutional neural network feature learned by 3*7 kernel size. ... 27 Figure 3-11 (a) The feature map of kernel 3*3, red circled region show the rough contour of hand

gestures.

(b) The feature map of kernel 6*6, red circled region show the rough contour of hand gestures.

(c) The feature map of kernel 7*3, red circled region show the finger feature.

(d) The feature map of kernel 3*7, red circled region show the finger feature. ... 31 Figure 3-12 The different kernel size we need for hand gesture images. ... 32 Figure 3-13 (a) Four of sixteen kernels we choose and corresponding feature maps for stop, pointer, ok

and wave gestures.

(b) Four specific kernels and corresponding feature maps for stop, pointer, ok and wave gestures. ... 32 Figure 3-14 (a) The schematic diagram of feature space which is easy to separate.

(b) The schematic diagram of feature space which is not easy to separate. ... 34 Figure 3-15 The feature vectors combining different kernel size for the hand gestures. ... 34 Figure 3-16 (a) Our deep neural network model architecture.

(b) The example of the activation function after the hidden node. ... 35 Figure 3-17 (a) The example of point tracking method.

(b) The example of kernel tracking method.

(c) The example of silhouette tracking method. ... 37 Figure 3-18 An example for the detection region adjustment and the motion estimation result... 38 Figure 3-19 The example of target patch and three testing patch with its two dimension correlation for the target patch. ... 39

(9)

Figure 3-20 The examples for the position map and the motion map. ... 40 Figure 3-21 (a) The flowchart for the weight score of detection result.

(b) The flowchart for the weight score of tracking result. ... 41 Figure 3-22 The example of the temporal information refinement of our proposed method. ... 42 Figure 4-1 (a) Example frames for our proposed method, convolutional neural network and deep

neural network in a row of video 1... 46 Figure 4-2 (a) Example frames for detection only algorithm and detection combine tracking algorithm

in a row of video 1.

(b) Example frames for detection only algorithm and detection combine tracking algorithm in a row of video 2... 49 Figure 4-3 (a) Example frames for our proposed deep convolutional neural network, traditional

convolutional neural network and traditional deep neural network combining our tracking technique in a row of video 1.

(b) Example frames for our proposed deep convolutional neural network, traditional convolutional neural network and traditional deep neural network combining our tracking technique in a row of video 2. ... 52

(10)

Chapter 1. Introduction

In the technological era, computers play an important role in our daily lives. Our most

common way to communicate with computer is through the mouse and keyboard. However, we

want to interact with computer directly instead of using mechanical devices. The progress of

computer vision and machine learning algorithms enable us to interact with computer in novel

ways, such as Kinect. We can use our hand gesture to convey our thought intuitively.

Hand gesture recognition system can be categorized into three stages: detection, tracking

and recognition. We want to know the position of our hand in an image, then track the

movement of hand and the moving trajectory. Finally, the computer must understand the

meaning of user expression. Traditionally, people usually find skin color in the image with

simple background and investigate the appearance of skin color part to determine where the

hand is. Based on the temporal information obtained from acquired trajectory and result, predict

the hand movement. At last, we use some classification methods to identify the gesture we

express. However, the variant of appearance of hand is complicated since our hand is composed

of five fingers and a palm. In addition, a finger has two knuckles to change the exterior of our

hand. Furthermore, the shape of hand will be significantly different as the view point is

changing. If the image has a complex background, it will influence the result of hand gesture

recognition. Therefore, finding a robust classifier against complex background is difficult when

using a simple model. Traditional machine learning method like support vector machine and

random forest may not apply to multiple hand gestures and out-of-plane rotation of hand. As

with the rapid development of deep learning algorithm, we want to use deep learning algorithm

to find a robust classifier of hand gesture.

In this thesis, we want to identify four different gestures, stop, pointer, ok and wave in the

(11)

which the detection and recognition stage is based on convolutional neural network and deep

neural network. These two methods are widely used in handwritten digit recognition recently.

We modify the convolutional neural network to apply it to the hand gesture data set. Combining

convolutional neural network and deep neural network to build our classification model for

multiple hand gestures and the hand detector. Due to the dramatic change of hand shape and

appearance, the points to track are not easy to choose. Therefore, nowadays tracking algorithms

are not robust of hand tracking. Therefore, we rely on the detection achievement and adding

some basic tracking technique to improve the detection accuracy and reduce the computation

time.

In our algorithm, we first use convolutional neural network [1]錯誤! 找不到參照來源。

to extract the feature of an image, and transform RGB images to the feature space. Then use

these features of hand as the input of deep neural network. At deep neural network stage, we

use unsupervised learning method : Restricted Boltzmann Machine (RBM) [3] to pre-train the

relationship within the low level local features. Combine the local features into global features

at RBM training stage. Then add labeled data to do the supervised fine-tuning stage to get the

classification model. After we get the hand position in the frame and the gesture it represents,

we can only search for the neighbor region to find the hand position in the next frame. In order

to improve the accuracy, we calculate the patch similarity of adjacent region to correct the

position of the hand. After we get the detection and tracking results, we will do some refinement

to get the most reliable position and gesture class in the video frame.

The thesis is organized as follows. We will introduce some other hand gesture recognition

(12)

Chapter 2. Related Works

The main purpose of hand gesture recognition is to convey user’s expression for the

computers. The rapid development of vision based hand gesture recognition increase the

possibility of remote control human computer interface. There are two different ways to do the

hand gesture recognition. One way is using some additional equipment to help doing the

recognition, such as Microsoft Kinect and the MIT colored glove. Another way is using the

appearance of hand to classify different gestures in which extracting some specific feature in

the image. The classification accuracy has been strongly associated with the feature extraction.

In this section, we introduce two different methods of using the appearance of hands to

recognize different gestures. The first method introduced in section 2.1 extracts different

orientation edges from hand images and trains the classification model by the random forest

algorithm. In section 2.2, the second method extracts the hand contour and complex moments

in hand images as the feature vector. They use the neural network algorithm to classify the

images into different hand gestures.

2.1 Hand Gesture Recognition by using Random Forest

In [10], they propose a single-camera hand gesture recognition algorithm for

human-computer interface. The hand gesture recognition algorithm is composed of three parts. First,

they us eight pre-defined edge filters to generate the corresponding edge maps for the hand

images. Second, they use randomized pooling cell method to learn more efficient cells for the

hand images. They get image feature vectors by concatenating feature vectors in some different

size and shape cells. Finally, they use these feature vectors to train a random forest model for

(13)

At first, they use 360∘orientation edge filters to generate the corresponding edge maps

for the hand images. They quantize 360 degrees into 8 bins, then get eight filtered images for

all edge filters. The eight edge filters and the examples of the corresponding edge maps for an

hand image are shown in Figure 2-1.

Figure 2-1 Eight edge filters and the corresponding edge maps for the image “Five”.

After generating the edge maps, they use the structure of randomized pooling cells to

calculate feature vectors [9]. Unlike tradition grid cells, they use more efficient cells to calculate

the edge information for the hand images. Figure 2-2(a) shows traditional grid cells for the hand

image. It partitions the image into equal size and extract the edge information in each cell.

(14)

the edge information more efficient for the hand image. After that, the feature vectors of hand

images are concatenated by the feature vectors in all cells.

Figure 2-2 Compare traditional grid cells and randomized pooling cells for hand image.

As mentioned in [9], their proposed method of random forest with randomized pooling

cell training procedure is shown in Figure 2-3(a). First, they choose some grid cells and some

randomized cells. Use these cells to calculate the feature vector of hand images and train the

hand gesture classifier by random forest. After that, they calculate the variable importance at

the training stage for the feature vectors and eliminate some weak variables. They eliminate

weak variables and resample different rectangular to get more informative cells iteratively.

Finally, they can get the most efficient cells to extract the edge information from hand images.

Figure 2-3(b) shows the final randomized pooling cells by random forest training stage.

This hand gesture recognition is good at two class recognition. It can classify the “five”

gesture with background and the “zero” gesture with background respectively. The error rate

for two class recognition is about 1 to 2 percent. However, it might get much higher error rate

when the hand gesture classes increased. The error rate for three class recognition may be 5

percent or higher. Therefore, we want to obtain a more robust multi-gesture classifier for hand

(15)

Figure 2-3 (a) Training procedure of random forest with randomized pooling cells. (b) The final trained random pooling cells for hand image.

2.2 Hand Gesture Recognition by Neural Network

In [8], they proposed a novel technique to do the hand gesture recognition. They infer that

the Neural Network has many benefits at pattern analysis. First, Neural Network learns to

recognize the pattern which exist in the data set instead of some image processing technique

which analyze specific behavior of the model. Second, Neural Network is more robust in

different environment than some specific image processing system which is limited to the

situation they were designed.

The hand gesture recognition process consist three stages: pre-processing, feature

extraction and classification. Furthermore, they consists of two phases: training (learning) and

classification (testing). Overview of gesture recognition system is seen as Figure 2-4. They

(16)

Figure 2-4 Overview of hand gesture recognition system.

At pre-processing stage, they segment the input image into background and objects at first.

They use a thresholding algorithm for segmentation of hand gesture images [13]. In order to

eliminate some errors produced by segmentation algorithm, they use a median filter to reduce

noise. Furthermore, for geometric feature (hand contour), they use edge detection to extract the

hand contour For invariant feature (complex moments), they use image trimming to eliminate

redundant margins and image scaling combine coordinate normalization to let the origin point

coordinates be at the center of the image as shown in Figure 2-5(a).

At feature extraction stage, for the first method, they extract hand contour by edge map.

Then do the feature image scaling to get an 32*32 image and shift the hand gesture section to

the origin point. Furthermore, they use general feature to be an additional information for height

and width offset of the hand gesture image. Figure 2-5(b) shows the example of height and

width offset matrix, which the hand gesture has the width near to 23 and the height near to 25.

For the second method, they calculate the complex moments from zero-order to ninth-order for

each hand gesture image.

(17)

At the last stage, that is classification stage, they use a simple one hidden layer neural

network with back-propagation learning algorithm. Input layer has 1060 nodes or 10 nodes for

two method. Hidden layer has 100 nodes and output layer, the recognition layer, has 6 nodes

for 6 different hand gestures. The recognition rate is 70.83% for the first method and 86.38%

for the second method.

This paper introduce a simple neural network training for hand gesture recognition. It also

pointed out the advantage of learning algorithm. However, this paper still use some image

processing like edge detection to get the feature of data set. In our method, we use convolutional

neural network to extract feature from training data set. It might get more robust recognition

(18)

Chapter 3. Proposed Method

The purpose of our system is to detect the hands position and recognize the meaning of

user represent in a video captured by arbitrary camera and viewpoints within a range. The most

difficult part is the clutter background and multi-class gestures classification. We want to figure

out four different kinds of hand gestures: stop, pointer, ok and wave. Because of the complex

background which might have some other skin color like objects with the remote distance of

hand and camera, we won’t use the skin color clue to do the image pre-processing. To build a more robust hand gesture recognition model, our training data is from arbitrary viewpoints

within a range, different users and clutter background. Furthermore, we extract the feature in

training data set by learning features from convolutional neural network, instead of specific

pre-defined image processing technique like edge, corner, contour and silhouette. In order to find

the relationship between local features extract from convolutional neural network, we use the

unsupervised learning algorithm, deep neural network to recognize four different hand gestures.

First find the internal structure in the local features and combine low level features into high

level features at pre-training stage. After that, find the relationship between input features and

output targets at fine-tuning stage. Furthermore, in order to solve the extremely variant shape

of hand gestures, we modify the convolutional neural network to applicable to the multi-class

hand gestures. At testing stage, we add some tracking technique to assist the detection result in

correcting the hand position in the video frame. We also do some refinement to improve our

hand gesture recognition accuracy.

The overall proposed algorithm procedure is shown in Figure 3-1, Figure 3-1(a) shows the

training stage of proposed algorithm and Figure 3-1(b) shows the testing stage of proposed

algorithm. In training stage, we have three steps, learning features, feature combination and

(19)

two model we get from training stage but also combining the detection and tracking result to

correct the recognition result.

We will describe the data set we use for convolutional neural network (CNN) and deep

neural network (DNN) in Section 3.1. In Section 3.2, we will describe the learning algorithm

of convolutional neural network and deep neural network, which we need to learn the features

in hand gesture data set and classify different gestures in feature space. We do some

modification for the hand gesture data set at the feature extraction stage by using convolutional

neural network. In Section 3.3, we will introduce our modification and how to combine the

convolutional neural network and deep neural network for doing the hand gesture recognition.

Also, we will add some tracking technique to improve the hand gesture recognition accuracy in

video testing. At last, in Section 3.4, we will use the temporal information and compare the

(20)

Figure 3-1 (a) Training procedure of our proposed algorithm. (b) Testing procedure of our proposed algorithm.

3.1 Data Set

We will provide two different data set for convolutional neural network and deep neural

network respectively. For convolutional neural network, we take amount of hand image for four

different gestures. After we extract features from the learning algorithm of convolutional neural

network, we might transform the hand gesture images from RGB color space to the feature

space. In order to detect the hand in the video frame and recognize the hand gestures at once,

we also transform the background from the RGB color space to hand gesture’s feature space.

The training data set for deep neural network will be the feature space of both hand gesture

image and background images.

3.1.1 Data set for Convolutional Neural Network

We take images from arbitrary camera, viewpoints within a range and different clutter

background. The image size is 50*50 pixel by pixel for all gestures. We use particle filter to

automatically extract the hand part in the image. This makes a lot of mistakes, so that we choose

(21)

one finger to represent pointer meaning, three fingers to represent ok meaning and all fingers

with the palm to represent wave meaning. In order to have the same number of right hand image

and left hand image, we flip all the hand image we have. Each gestures have 6400 hand images

for training data set, in other words, there are 25600 hand images. We have 3742 hand images

for testing data set. Figure 3-2 are ten examples of hand gesture image in the training data set

for each gestures and the last row are examples of background image for deep neural network.

Figure 3-2 Row one is the examples of stop gesture. Row two is the examples of pointer gesture.

Row three is the examples of ok gesture. Row four is the examples of wave gesture. Row five is the examples of background image.

3.1.2 Data set for Deep Neural Network

After we extract the feature from convolutional neural network, we add the background

(22)

features, each column illustrate the same gestures. The order for the gestures are stop, pointer,

ok and wave. The last column is the background image’s features. Each row represent the same kernel size for feature extraction in convolutional neural network. There are 3*3, 6*6, 7*3 and

3*7 four different kinds of kernel size represented in sequence.

Figure 3-3 For each row, there are, 3*3, 6*6, 7*3 and 3*7, four different kinds of kernel size. For each column, there are stop, pointer, ok, wave and background five different kinds of input images.

(23)

3.2 Basic concept of two learning algorithm

We will introduce two learning algorithm we need for feature extraction and hand gesture

classification. We will describe the architecture of the learning algorithm at first. After that, we

will explain the basic concept of the learning algorithm.

3.2.1 Convolutional Neural Network

Convolutional neural network is a learning procedure which can extract some regular

pattern or specific feature in a large amount of image which called training data set. The main

idea of convolutional neural network is shared weight and translation invariance. Traditional

neural network connect all input nodes with all hidden nodes. Therefore, when input image has

a corner at left top and a corner at right bottom will make large difference for the learning

weight. Convolutional neural network pose the shared weight concept to learn some regular

pattern in the image. When image has repeated feature, the model might learn only one feature

to represent it. Furthermore, convolutional neural network add a pooling layer after a

convolutional layer, which can let the model be confronted with translation invariance. The

pooling layer calculate the mean of no overlap patch in the feature map which is represent the

appearance probability of feature. In section 3.2.1.1, we will introduce the architecture of

convolutional neural network and describe each layer in detail. We will also brief introduce the

basic learning algorithm of convolutional neural network in section 3.2.1.2.

3.2.1.1 Architecture of Convolutional Neural Network

Convolutional neural network has four different type of layers: input layer, convolution

layer, pooling layer and fully-connected layer. Figure 3.4 is an example of convolutional neural

(24)

Figure 3-4 Example of convolutional neural network architecture.

Input layer

For 2D image processing, the input layer might be a gray-level image which normalized

from zero to one.

Convolution Layer

In convolution layer, we can assume the number of kernel we need to learn in the image

training data set. Traditional neural network will fully connect all input layer with a hidden

node. A hidden node which connect with all input nodes represent the image appearance. The

larger value the hidden node is, the more similar between input image and weight appearance.

Figure 3.5(a) shows an input image as an example. If we want the hidden node can represent

the input image appearance, the weight appearance might be complex, which shows in Figure

3.5(b). However, convolutional neural network use shared weight to represent regular pattern

arise from input image. A hidden node only connect a local patch in the image and let the node

connected weight, which called kernel, to shows the local feature learn from the input image.

(25)

the whole image. Do the convolution function between input image patch and kernel, the larger

the result value, the more similar between local patch and kernel appearance. Figure 3.2(c)

shows an example of a specific kernel and its correspondence feature map.

Figure 3-5 (a) An example of input image.

(b) Neural network : A hidden node with its fully-connected weight which might learns from the input image data set.

(c) Convolutional neural network : A hidden node with its local-connected weight which might learns from the input image data set.

Furthermore, if a hidden node is fully-connected with the input image, the weight number

which we need to learn might be larger and more complex than the local feature learned from

convolutional neural network. Hence, convolutional neural network might save some

computation because of learning local and regular features in input image data set.

For the convolution layer, the feature map size will be (N-M+1) by (N-M+1), where the

input image is N by N and the kernel size is M by M.

Pooling layer

After convolution layer, we will get a feature map corresponding to a specific kernel mask.

A node in the feature map represent the proportion of a local patch is similar as the specific

(26)

feature map to some non-overlap patch and calculate the mean of a local patch in the feature

map. Then a node in the pooling feature map will represent the proportion of the feature

appeared in a larger patch. There is a simple example at Figure 3.6. Even though the input image

has the same appearance with translation, we can consider them to be the same appearance.

This is an advantage that neural network might not have.

Figure 3-6 (a) An example of a specific kernel.

(b)&(c) Two different input image with same appearance but some local translation feature (as circled region).

(d)&(e) Two feature map corresponding to two input image and its pooling result.

For the pooling layer, the feature map size will be down-sample from M+1) by

(N-M+1) to ((N-(N-M+1)/scaling factor) by ((N-(N-M+1)/scaling factor). We usually choose the factor

of (N-M+1) to get the down-sampled feature map with integer size.

Fully-connected layer

In the last layer, we will arrange all feature map to an one column feature. In addition, we

might have a target layer to do the final classification. All nodes in the feature column will

(27)

3.2.1.2 Learning algorithm of Convolutional Neural Network

Convolutional neural network use the same learning algorithm as traditional neural

network, which is back-propagation algorithm. The main concept of back-propagation is

minimize the error function of the training model. We can modify the weight in each layer to

get lower and lower error score. Gradient descent is the optimization method we use in the

learning algorithm. [14] might discuss all the derivation and implementation of convolutional

neural network in detail.

The error function of convolutional neural network is as equation (3-1). N means the total

number of input image in the training data set. C means the output target has C classes. 𝑡_𝑘𝑛 is the target of kth image in nth class is 0 or 1. 𝑦_𝑘𝑛 is the calculation result of kth image in nth class. Consider with respect to a single training data, the error function can be seen as equation

(3-2). 𝐸𝑟𝑟𝑜𝑟𝑁 ₌1 2∑ ∑ (𝑡𝑘 𝑛_{− 𝑦} 𝑘𝑛)2 𝐶 𝑘=1 𝑁 𝑛=1 , (3-1) 𝐸𝑟𝑟𝑜𝑟𝑛 ₌ 1 2∑ (𝑡𝑘 𝑛_{− 𝑦} 𝑘𝑛)2 𝐶 𝑘=1 , (3-2)

We have different function operator in different layer. Therefore, the parameter update in

each layer might be difference. We will introduce them all in the following.

Back-propagation in fully-connected layer

We define the function operator in the last layer to be as equation (3-3). 𝑥𝑙 denotes the current layer l output, as last layer been layer L , input layer been layer 1. 𝑓(. ) is the activation function we choose. 𝑊𝑙 is the weight and 𝑏𝑙 is the bias in layer l.

(28)

𝜕𝐸 𝜕𝑊 = 𝜕𝐸 𝜕𝑎 𝜕𝑎 𝜕𝑊= 𝑥 𝑙−1_(𝛿𝑙₎𝑇_{, 𝑤𝑖𝑡ℎ 𝛿}𝑙_{= (𝑊}𝑙+1₎𝑇_𝛿𝑙+1_⨀𝑓′_(𝑎𝑙_{), (3-4)}

where ⨀ denotes element-wise multiplication operator.

𝜕𝐸 𝜕𝑏 = 𝜕𝐸 𝜕𝑎 𝜕𝑎 𝜕𝑏= 𝛿𝑙. (3-5)

Back-propagation in pooling layer

The pooling layer produce a down-sampled feature maps. The function operator in

convolution layer is defined in equation (3-6).

𝑥_𝑗𝑙 = 𝑓(𝑎𝑙_{) , 𝑤𝑖𝑡ℎ 𝑎}𝑙_{= 𝛽}

𝑗𝑙∙ 𝑑𝑜𝑤𝑛(𝑥𝑖𝑙−1) + 𝑏𝑗𝑙, (3-6)

where 𝑑𝑜𝑤𝑛(. ) represents a non-overlapped sub-sampling function. Each output map might have its own multiplicative biasβ and an additive bias b.

We need to update both multiplicative biasβ and additive bias b in this layer. The gradient

can be compute as fully-connected layer.

𝜕𝐸 𝜕𝑏𝑗 = ∑ (𝛿𝑗 𝑙₎ 𝑢𝑣 , 𝑤𝑖𝑡ℎ 𝛿𝑗𝑙 = 𝑓′(𝑎𝑙)⨀(𝑘𝑗𝑙+1∗ 𝛿𝑗𝑙+1) 𝑢,𝑣 (3-7) 𝜕𝐸 𝜕𝛽𝑗= 𝜕𝐸 𝜕𝑎 𝜕𝑎 𝜕𝛽𝑗= ∑ (𝛿𝑗 𝑙_⨀𝑑 𝑗𝑙)𝑢𝑣 , 𝑤𝑖𝑡ℎ 𝑑𝑗𝑙 = 𝑑𝑜𝑤𝑛(𝑥𝑖𝑙−1) 𝑢,𝑣 (3-8)

Back-propagation in convolution layer

The output feature map will combine several feature maps. The function operator in

convolution layer is defined as equation (3-9), where 𝑆𝑗 represents a subset of input maps, *

denotes convolution operator.

𝑥_𝑗𝑙 _{= 𝑓(𝑎}𝑙_{), 𝑤𝑖𝑡ℎ 𝑎}𝑙 _{= ∑} _𝑥 𝑖𝑙−1

𝑖∈𝑆_𝑗 ∗ 𝑘𝑖𝑗𝑙 + 𝑏𝑗𝑙 (3-9)

The gradient of current layer might be sum over the next layer’s gradient corresponding to

units that connected to the node of interest in the current layer. The update function of kernel

(29)

previous layer which multiplied element-wise by 𝑘_𝑖𝑗𝑙 during convolution. 𝛽_𝑗𝑙+1 denotes the next pooling layer parameter. 𝑢𝑝(. ) denotes an up-sampling operation.

𝜕𝐸 𝜕𝑘_𝑖𝑗𝑙 = ∑ (𝛿𝑗 𝑙₎ 𝑢𝑣(𝑝𝑖𝑙−1)𝑢𝑣 , 𝑤𝑖𝑡ℎ 𝛿𝑗𝑙 = 𝛽𝑗𝑙+1(𝑓′(𝑎𝑙) ⊙ 𝑢𝑝(𝛿𝑗𝑙+1)) 𝑢,𝑣 (3-10) 𝜕𝐸 𝜕𝑏𝑗= ∑ (𝛿𝑗 𝑙₎ 𝑢𝑣 𝑢,𝑣 (3-11)

3.2.2 Deep Neural Network

The reason we introduce “deep” neural network, not traditional neural network, is the high complexity of training data set. While the object we want to distinguish being more variability,

we need more hidden nodes and more layers to learn the model. However, traditional learning

algorithm is not efficiency for deep neural network architecture. For example, when we use

back-propagation algorithm to update our learning weight, we have the gradient from later layer

propagate to previous layer. The gradient might be tiny as propagating to the front layer. This

causes the weight update stagnant at local optima. Furthermore, the traditional learning

algorithm require labels to calculate error function and get gradient of weight to update the

model. The back-propagation algorithm find the relationship between input feature and output

target instead of finding the internal structure in training data set. Deep neural network is an

unsupervised learning method which learns the model without target. We can find the exist

structure behind the input data set.

Hinton introduced a new learning method to train the model layer-by-layer instead of

learning the whole deep architecture once [5]. This algorithm has two stage : unsupervised

pre-training and supervised fine-tuning. At first, they take the whole deep architecture apart to train

(30)

deep neural network. Figure 3-7(b) shows the layer-wise pre-training architecture. After that,

we can get the initial weight of every layer and do supervised fine-tuning to get the relationship

between input image and output target by updating the weight slightly, shows in Figure 3-7(c).

We will introduce the structure of Restricted Boltzmann Machine (RBM) which is suitable for

binary visible and Gaussian Restricted Boltzmann Machine which is suitable for real-valued

visible both in section 3.2.2.1. In section 3.2.2.2, we will describe the learning algorithm of

pre-training and fine-tuning in detail.

Figure 3-7 (a) The whole architecture of deep neural network. (b) Layer-wised pre-training architecture.

(c) Supervised fine-tuning architecture.

3.2.2.1 Restricted Boltzmann Machine

Restricted Boltzmann Machine is a two layer and undirected model without intra-layer

connections. The bottom layer called visible layer is a set of binary units v ∈ {0,1} and the top layer called hidden layer is a set of binary units h ∈ {0,1}. We will use Gaussian Restricted Boltzmann Machine if the bottom layer is a set of real-valued units v ∈ [0,1] which we will describe later. The structure of RBM is shown in Figure 3-6(b) pink circled.

RBM is an energy-based model, the energy of a pair of (v,h) is defined in equation (3-12),

(31)

weights connecting hidden and visible units. The probability distribution of the pair of (v,h) is

defined in equation (3-13), where the constant Z is the normalization term.

Energy(v, h) = −𝑏𝑇_{𝑣 − 𝑐}𝑇_{ℎ − 𝑣}𝑇_{𝑊ℎ (3-12)}

P(v, h) =_𝑍1(𝑒−𝐸𝑛𝑒𝑟𝑔𝑦(𝑣,ℎ)_{) (3-13)}

Since there is no connection within hidden layer and within visible layer, and the hidden

layer and visible layer are conditionally independent given one-another. Using these two

property, we can get the conditional distribution for each layer as equation (3-14) and (3-15),

where 𝑠𝑖𝑔𝑚(𝑥) =_(1+𝑒1_−𝑥₎ is a sigmoid function and M , N denotes the number units in visible layer and hidden layer Respectively.

P(h|v) = ∏ 𝑃(ℎ𝑁 _𝑗 = 1|𝑣)

𝑗 , 𝑤𝑖𝑡ℎ 𝑃(ℎ𝑗 = 1|𝑣) = 𝑠𝑖𝑔𝑚(𝑐𝑗+ 𝑊𝑗𝑣) (3-14)

P(v|h) = ∏ (𝑣𝑀 _𝑖 = 1|ℎ)

𝑖 , 𝑤𝑖𝑡ℎ 𝑃(𝑣𝑖 = 1|ℎ) = 𝑠𝑖𝑔𝑚(𝑏𝑖 + 𝑊𝑖ℎ) (3-15)

As mentioned previously, we use Gaussian Restricted Boltzmann Machine for real-valued

visible. The energy function and the conditional distribution is defined as equation (3-16) to

3.2.2.2 Learning algorithm of Deep Neural Network

(32)

Unsupervised pre-training

We want to minimize the energy function of the RBM, on the other hand, we want to

maximize the log-likelihood given a training data v with model parameters θ. By using gradient

ascent method, we can get the learning rule for the model parameters. The first term of equation

(3-20) denotes the data-term which takes expectation over the training data set empirical

distribution P(h|v). The second term denotes the model-term which takes expectation over the

model distribution P(v,h). lnP(v|θ) = ln1_𝑍∑ 𝑒−𝐸𝑛𝑒𝑟𝑔𝑦(𝑣,ℎ) ℎ = ln ∑ 𝑒ℎ −𝐸𝑛𝑒𝑟𝑔𝑦(𝑣,ℎ)− ln ∑𝑣,ℎ𝑒−𝐸𝑛𝑒𝑟𝑔𝑦(𝑣,ℎ) (3-19) 𝜕 𝜕𝜃lnP(v|θ) = − ∑ 𝑃(ℎ|𝑣) 𝜕 𝜕𝜃𝐸𝑛𝑒𝑟𝑔𝑦(𝑣, ℎ) ℎ + ∑𝑣,ℎ𝑃(ℎ, 𝑣)_𝜕𝜃𝜕 𝐸𝑛𝑒𝑟𝑔𝑦(𝑣, ℎ) (3-20) ∆W = ϵ(𝐸_{𝑑𝑎𝑡𝑎}[vℎ𝑇_{] − 𝐸} 𝑚𝑜𝑑𝑒𝑙[vℎ𝑇]) , 𝑤𝑖𝑡ℎ 𝜖 𝑖𝑠 𝑡ℎ𝑒 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 (3-21) ∆b = ϵ(𝐸_{𝑑𝑎𝑡𝑎}[v] − 𝐸𝑚𝑜𝑑𝑒𝑙[v]), 𝑤𝑖𝑡ℎ 𝜖 𝑖𝑠 𝑡ℎ𝑒 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 (3-22) ∆c = ϵ(𝐸_{𝑑𝑎𝑡𝑎}[h] − 𝐸_{𝑚𝑜𝑑𝑒𝑙}[h]), 𝑤𝑖𝑡ℎ 𝜖 𝑖𝑠 𝑡ℎ𝑒 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 (3-23) We use Gibbs sampling to approximate the model-term because of its difficulty. According

to the Markov chain Monte Carlo (MCMC) algorithm, we can sample the hidden nodes by the

visible nodes then sample the reconstruct visible nodes by the sampled hidden nodes. We will

do this procedure until the Markov chain converge in order to get the model-term. The

architecture of MCMC with Gibbs sampling is shown in Figure 3-8(a). However, it is not

efficient to run a MCMC to converge in an update step. In [5], Hinton propose a fast algorithm

which is called Contrastive Divergence. This method shows that the effect of sampling until the

Markov chain converge and sampling once might get close result. Therefore, the update

function can be approximated as equation (3-24) to equation (3-26). Figure 3-8(b) shows the

architecture of Contrastive Divergence algorithm.

∆W = ϵ(𝐸𝑑𝑎𝑡𝑎[vℎ𝑇] − 𝐸𝑟𝑒𝑐𝑜𝑛𝑠𝑡𝑟𝑢𝑐𝑡[vℎ𝑇]) , 𝑤𝑖𝑡ℎ 𝜖 𝑖𝑠 𝑡ℎ𝑒 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 (3-21)

(33)

∆c = ϵ(𝐸𝑑𝑎𝑡𝑎[h] − 𝐸𝑟𝑒𝑐𝑜𝑛𝑠𝑡𝑟𝑢𝑐𝑡[h]), 𝑤𝑖𝑡ℎ 𝜖 𝑖𝑠 𝑡ℎ𝑒 𝑙𝑒𝑎𝑟𝑛𝑖𝑛𝑔 𝑟𝑎𝑡𝑒 (3-23)

Figure 3-8 (a) Architecture of MCMC by using Gibbs sampling. (b) Architecture of Contrastive Divergence algorithm

We will train the RBM layer-wise staring from visible layer and hidden layer one. When

we get the weight W1 between visible layer v and hidden layer one h1, we treat the real-valued

probabilities of the conditional distribution E[P(ℎ1|v; 𝑊1)] as the visible units in the second RBM and so on. After we get all trained weight between layers, we can do the next stage,

supervised fine-tuning.

Supervised fine-tuning

We learned the internal structure of the training data set in the previous stage. Next, we

want to learn the relationship between input training data set and the output targets for

classifying the different class in the training data set. We use the back-propagation algorithm to

learn the overall optimization solution.

3.3 Applicable to Hand Gesture Recognition

As we mentioned before, the hand has extremely distinct shape and appearance from

different gestures. Using traditional convolutional neural network or traditional deep neural

(34)

In order to improve the accuracy in the video testing, we add some tracking technique described

in Section 3.3.3.

3.3.1 Feature Extraction by Convolutional Neural Network

We want to use convolutional neural network to learn the feature from training data set

directly. Traditional convolutional neural network have the features for single kernel size and it

is restricted to be square and the down-sampling scaling factor for feature map’s width and

height are limited to be the same. As we observe the training data set of the hand gestures, we

find that hands can be composed of fingers and a palm. Fingers are different orientation,

different length, long and narrow strip. Palm is a larger rectangle which width and height ratio

is extremely different from the fingers. Furthermore, the four hand gestures we need to classify

are composed of different number of fingers and the open or close palm. Therefore, the features

we want to learn in the hand gesture training data set might be modify to some long and narrow

strips and some rectangles close to square. In order to cope with the different kernel size for

width and height, we also modify the down-sampling scaling factor could be distinct for the

width and height. Also, we want to represent the hand gestures by more different kernel size to

get a more robust feature space of the hand gestures. Therefore, we need some algorithm to

choose better features of different kernel size and combine different kernel size feature to find

more robust feature space for hand gestures.

The convolutional neural network architecture we attempt is shown in Figure 3-4. We set

one convolution layer and one pooling layer to get the local features of the hand gesture images.

Also, we don’t use the last full-connected layer for the hand gesture classification. We only extract the feature vector from the output of the pooling layer. The pooling scaling factor might

be cope with the kernel size we choose at the convolution layer. We will have a summary of the

(35)

3.3.1.1 Non-square kernel size and pooling scaling factor

The important issue at convolutional neural network learning is the choice of kernel size.

If the kernel size might be square, a larger square or a smaller square is not the best choice. As

using the smaller one, it will turn on some position of the feature map because of the clutter

background, which is shown in Figure 3-9 (a). The blue rectangles are the clutter background

features we don’t need and red rectangles are the hand features we want. In Figure 3-9(b), we

can discover that a larger block might get a complex feature to cope with the multiple fingers.

However, if we have a complex feature, it might not represent the hand gestures good for the

different hand gestures. Different number of fingers in the patch and the close or open palm

might need different features to represent. These will reduce the effect we want to have in the

learning feature stage of convolutional neural network. We want to learn the local features of

the hand gestures not the global features in the whole image. For a specific long and narrow

block, we can represent the fingers more efficient, as shown in Figure 3-9(c).

Figure 3-9 (a) The kernel of convolutional neural network is a small square. (b) The kernel of convolutional neural network is a large square. (c) The kernel of convolutional neural network is a long and narrow stipe.

(36)

we have different viewpoints of the hand gestures, we not only need long and narrow rectangles

but also need short and wide rectangles for different orientation fingers. However, convolutional

neural network could not tolerate the different kernel size learning at once, we need to train the

different kernel size feature one by one. Figure 3-10 shows four different kernel size features

for the hand gesture training data set. The kernel size are 3*3, 6*6, 7*3, 3*7 pixel by pixel in

Figure 3-10(a) to (d). Each row has sixteen features learned by the convolutional neural network.

Figure 3-10 (a) The convolutional neural network feature learned by 3*3 kernel size. (b) The convolutional neural network feature learned by 6*6 kernel size. (c) The convolutional neural network feature learned by 7*3 kernel size. (d) The convolutional neural network feature learned by 3*7 kernel size.

After the convolution layer, we have the pooling layer to down-sample the feature map we

get from previous layer. The pooling concept might let the convolutional neural network to be

translation invariance as we mention in Section 3.2.1. We calculate the mean of a local patch of

the feature map to indicate the feature is existence or not at a small region in the original input

image as shown in Figure 3-4. The pooling scaling factor is limited as the same for width and

height in the conventional convolutional neural network. We modify it to coordinate the long

and narrow rectangles and the short and wide rectangles of kernel we learned for the hand

gesture. Therefore, we can down-sample for different scaling factor for the feature map width

(37)

invariance. The corresponding scaling factor for four different kernel size is shown in Table

3-1. The unit for the kernel size and scaling factor are pixels.

1 2 3 4

Width Height Width Height Width Height Width Height Kernel Size 3 3 6 6 7 3 3 7 Scaling Factor 6 6 5 5 4 6 6 4

Table 3-1 Different kernel size correspond to different pooling scaling factor.

The feature maps of different kernel size we get from convolutional neural network after

pooling layer are shown in Figure 3-14(a) to (d). We random choose some example from

training data set to get the feature maps for the four different gestures, each gestures has five

example images. The top to down blocks are the stop gesture, pointer gesture, ok gesture and

wave gesture. We can figure out that a long and narrow kernel size, such as 3*7 and 7*3 might

be good at fingers feature, as the red circled region in Figure 3-11 (c) and (d). The small square

kernel size, 3*3 and 6*6, might get the rough contour of the hand gesture, as the red circled

(38)

(39)

(40)

Figure 3-11 (a) The feature map of kernel 3*3, red circled region show the rough contour of hand gestures.

(b) The feature map of kernel 6*6, red circled region show the rough contour of hand gestures.

(c) The feature map of kernel 7*3, red circled region show the finger feature. (d) The feature map of kernel 3*7, red circled region show the finger feature.

3.3.1.2 Combination of different kernel size feature

Because of the complex shape and extremely variant of appearance for the hand gestures,

the features might be different size, such as small square, large square and rectangles. Figure

3-10 shows some example for different kernel size for the hand gestures. Therefore, we want to

use different kernel size to express the hand gesture features as shown in Figure 3-12. For

vertical fingers, horizontal fingers, fist and palm, we need different kernel size to represent its

local feature. We design four different kernel size, 3*3, 6*6, 3*7 and 7*3. We will train sixteen

kernels for a specific kernel size, then choose four of them. The reason why we need to train

sixteen kernels then choose four of them rather than only train four kernels for one kernel size

is the overall kernel number. If we only set four kernels for the convolutional neural network

model, the model might learn more complex feature than we set sixteen kernels for it. It expect

use four features to represent the whole training data set. However, we have sixteen feature to

represent the training data set not four. The features we want to learn is the simple, local feature,

not the complex and global feature. Figure 3-13 shows the different between four features and

sixteen features for a convolutional neural network model. Figure 3-13(a) shows the features

we learned for sixteen kernels and choose four of them and the corresponding feature maps.

Figure 3-13(b) shows the features we learned for only four kernels and the corresponding

feature maps. We can figure out that for four specific kernel, we get a more complicated feature

maps. We couldn’t easily determine the gestures for ok and wave as shown in red rectangles of Figure 3-13(b).

(41)

(42)

As record in Table 3-1, we have four different convolutional neural network for four

different kernel size. For kernel 3*3, the feature map at the convolution layer is 48*48. After

the pooling layer, the down-sampled feature map is 8*8 because of the scaling factor is 6*6.

For kernel 6*6, the feature map at the convolution layer is 45*45. After the pooling layer, the

down-sampled feature map is 9*9 because of the scaling factor is 5*5. For kernel 7*3, the

feature map at the convolution layer is 44*48. After the pooling layer, the down-sampled feature

map is 11*8 because of the scaling factor is 4*6. For kernel 3*7, the feature map at the

convolution layer is 48*44. After the pooling layer, the down-sampled feature map is 8*11

because of the scaling factor is 6*4. We transform the feature map from two dimension to one

dimension vector and concatenate four different kernel size to get our overall feature vector.

The feature vector dimension is 1284. (((8*8)*4) + ((9*9)*4) + ((11*8)*4) + ((8*11)*4) = 1284)

As we have sixteen features for a specific kernel size and four different kernel size, we

need some algorithm to choose the best combination for different kernel size. We take an

intuitive idea to solve it. The criteria we choose are the more far from different gestures and

more close for same gestures in the feature space. The concept is shown in Figure 3-14. Figure

3-14(a) shows the feature space we prefer to and Figure 3-14(b) shows the feature space which

the gestures is not easy to separate. We random sample four of sixteen features for a specific

kernel size and combine four different kernel size to get a feature vector with different kernel

size. Then calculate the difference map in different gestures of the specific kernel. Also, we

calculate the difference map in same gesture of the specific kernel. After we get the difference

map, we calculate the variance of the map and compute the ratio of the variance for same

gestures and the variance for different gestures as equation (3-24). We choose the smallest value

of the ratio to get a feature space with large distance from the different gestures and close in the

same gestures. There are some examples in Figure 3-15.

(43)

Figure 3-14 (a) The schematic diagram of feature space which is easy to separate. (b) The schematic diagram of feature space which is not easy to separate.

Figure 3-15 The feature vectors combining different kernel size for the hand gestures.

3.3.2 Deep Convolutional Neural Network Classification

(44)

as mentioned before. Our training data set for deep neural network is shown in Figure 3-3. The

architecture of our deep neural network model is shown in Figure 3-17(a). Our feature vector

combines four different kernel size and each kernel size has four features. The input feature

dimension is 1284 and output target is 5(four gestures and background). The model with layers

of size 1284-1024-1024-1600-5. Also, we apply a sigmoid function after the hidden units and

a softmax function after the target units as the activation function, as shown in Figure 3-17(b).

Equation (3-25) and (3-26) are the formula of sigmoid function and softmax function

respectively.

sigmoid(x) =_{1+𝑒𝑥𝑝}1 _−𝑥 (3-25)

softmax(𝒙)_𝑖 = _∑𝐾𝑒𝑥𝑝_𝑒𝑥𝑝𝑥𝑖_𝑥𝑘

𝑘=1 , for K is the dimension of 𝒙 (3-26)

Figure 3-16 (a) Our deep neural network model architecture.

(b) The example of the activation function after the hidden node.

The deep neural network training procedure is mentioned as Section 3.2.2. We have two

stages, that is, unsupervised pre-training stage and supervised fine-tuning. We trained the neural

(45)

divergence algorithm to update the weight between two layers. The basic concept for Restricted

Boltzmann Machine is to find the hidden nodes which can represent the visible nodes better. As

we use the hidden nodes to re-sample the visible nodes, we can figure out whether these hidden

nodes could represent these visible nodes or not by calculating the different between visible

nodes and the re-sampling visible nodes. After several times updating, we can get the hidden

nodes which can represent our visible nodes. We might get some combination for the input

features, on the other words, more global features be composed of the local features in the

visible layer. As we trained the first two layers of the model, we might use visible nodes and

the trained weight between visible nodes and hidden one nodes to get the output of hidden layer

one nodes. Then we do the second two layers training by setting the hidden one nodes as the

visible nodes, and so on. Finally, we have trained all the weights between any two layers. After

that, we stack the two layer-wise structure to be a deep neural network architecture. At the

supervised fine-tuning stage, we use the back-propagation algorithm to minimize the error

function by updating the weight we get from previous stage. We adjusted the weight slightly, to

get the best solution of this neural network model. At last, we could have a classifier for hand

gesture recognition by using the deep neural network model.

3.3.3 Tracking Technique

The purpose of an object tracker is to find the object position from previous frame to next

frame. The tracking method might get the trajectory of the object in every frame which we are

interested. In [15], according to different object representation, the tracking method can be

categorized into three types, point tracking, kernel tracking and silhouette tracking. Their

(46)

one might find the optimum solution of the silhouette or contour matching. Figure 3-17(a) to

(c) briefly illustrate the three tracking method. The dot arrow denote the previous motion

estimation and the dash arrows denote the predict motion for the next frame. The solid nodes

denote the previous hand gesture position and the light-colored nodes denote the predict hand

position. Because of the complex changes for the hand shape from different gestures, there isn’t

a robust tracking method for the hand gestures. Therefore, we only use the basic concept of the

tracking technique in our hand gesture recognition algorithm.

Figure 3-17 (a) The example of point tracking method. (b) The example of kernel tracking method. (c) The example of silhouette tracking method.

The idea of our tracking algorithm are the constraints of the searching region of our

detection algorithm and using a simple two dimension correlation of two patches to find the

similarity of them. Because of the video will have about thirty frames in a second, the hand

position won’t be distance from two successive frame. Also, the motion of previous two frame and next two frame won’t be a lot of difference. We restrict the searching region of our detection by using convolutional neural network and deep neural network. The searching region will

adjusted according to the motion estimation. For example, as we estimate the object move from

the bottom left corner to the top right corner, the range we search at the top right corner will

larger than the bottom left corner to cope with the object trajectory. We estimate the object

(47)

our hand gesture recognition result for the previous frames. Figure 3-18 shows an example of

searching region adjustment and the motion estimation for our proposed method. In Figure

3-18, the blue dash block is the adjustment of searching region for the detection algorithm and

the green arrow is our motion estimation result. For the similarity issue, we use the two

dimension correlation function for the target patch and a testing patch. The target patch is the

hand gesture we find in the previous frame whether getting by detection or by tracking. The

testing patches are the adjacent region of the target patch. The two dimension correlation

function is described in equation (3-27),

𝑐𝑜𝑟𝑟2 = ∑ ∑ (𝐴𝑖 𝑗 𝑖𝑗−𝐴̅)(𝐵𝑖𝑗−𝐵̅)

√∑ ∑ (𝐴𝑖 𝑗 𝑖𝑗−𝐴̅)2√∑ ∑ (𝐵𝑖 𝑗 𝑖𝑗−𝐵̅)2

, (3-27)

where A denotes the target patch and B denotes the testing patch. Aijis the pixel at the i row and

j column in the target patch and same as Bij. 𝐴̅ and 𝐵̅ are the mean of the target patch and

testing patch respectively. We have an example target patch and some testing patches to shown

the correlation between two patches in Figure 3-18.

(48)

Figure 3-19 The example of target patch and three testing patch with its two dimension correlation for the target patch.

In order to reinforce the hand position constraint, we use two maps, position map and

motion map to weighted the result of detection and tracking. The position map is obtained from

the previous hand gesture recognition result. We use a two dimension Gaussian distribution to

model the position map for the previous hand gesture recognition is one and drop to the

surrounding. Also, the motion map is obtained from the previous hand gesture recognition result

add the displacement we predict by motion estimation. The previous position add the

displacement is set to one and drop to the surrounding as a two dimension Gaussian distribution.

Figure 3-20 shows an example of the position map and motion map. The blue point and green

point in the position map are the previous hand gesture recognition result. The red arrows in the

(49)

Figure 3-20 The examples for the position map and the motion map.

3.4 Refinement

The last stage of our hand gesture recognition algorithm is refinement. We have to choose

the result from the detection algorithm and the tracking algorithm. The first refinement step is

the weighted score for both detection and tracking result. The weighted are the position map

and the motion map. The closer the local patch and the previous hand gesture recognition result

the higher weighted it has. After we get the weighted score for both detection and tracking result,

we will find the highest two scores to represent the right hand result and left hand result. Figure

3-21 is the flowchart of the weighted score for detection and tracking result. This step is

(50)

Figure 3-21 (a) The flowchart for the weight score of detection result. (b) The flowchart for the weight score of tracking result.

Secondly, we use the temporal information to improve the stability of our hand gesture

recognition results. The idea of our proposed refinement for temporal domain is the records of

(51)

seven frames to average the score for four different gestures. The closer to the frame we want

to figure out the hand gestures in the temporal domain, the higher weighted we multiply to the

score of detection or tracking result we choose. Figure 3-22 is an example for the temporal

information refinement in our proposed method. The rows under the arrow is the result of our

choice for the detection and tracking result in the temporal domain from ten frames before the

frame we are processing. The square above the arrow is the refinement result in the frame we

are processing. We can see that even the result in the processing from is distinct from previous

few frames, we can use the temporal information to correct the result. This information is under

the limitation of the hand shape would not dramatic change.

基於深度卷積神經網路之手勢辨識技術研究

國 立 交 通 大 學

電子工程學系 電子研究所

碩

士

論

文

基於深度卷積神經網路之

手勢辨識技術研究

Hand Gesture Recognition based on

Deep Convolutional Neural Network

研

究

生：謝汝欣

指 導 教 授：王聖智 教授

基於卷積深度神經網路之

手勢辨識技術研究

Hand Gesture Recognition based on

Convolutional Deep Neural Network

研 究 生：謝汝欣 Student：Ju-Hsin Hsieh

指 導 教 授：王聖智 教授 Advisor：Prof. Sheng-Jyh Wang

國 立 交 通 大 學

電子工程學系 電子研究所碩士班

碩 士 論 文

基於深度卷積神經網路之手勢辨識技術研究

研究生：謝汝欣 指導教授：王聖智 教授

國立交通大學

電子工程學系 電子研究所碩士班

摘要

Hand Gesture Recognition based on

Deep Convolutional Neural Network

Student：Ju-Hsin Hsieh Advisor：Prof. Sheng-Jyh Wang

Department of Electronics Engineering, Institute of Electronics

National Chiao Tung University

Abstract

Content

Chapter 1.

Introduction ... 1

Chapter 2.

Related Works ... 3

Chapter 3.

Proposed Method ... 9

Chapter 4.

Experimental Results ... 43

List of Tables

List of Figures

Chapter 1. Introduction

Chapter 2. Related Works

2.1 Hand Gesture Recognition by using Random Forest

2.2 Hand Gesture Recognition by Neural Network

Chapter 3. Proposed Method

3.1 Data Set

3.1.1 Data set for Convolutional Neural Network

3.1.2 Data set for Deep Neural Network

3.2 Basic concept of two learning algorithm

3.2.1 Convolutional Neural Network

3.2.1.1 Architecture of Convolutional Neural Network

3.2.1.2 Learning algorithm of Convolutional Neural Network

3.2.2 Deep Neural Network

3.2.2.1 Restricted Boltzmann Machine

3.2.2.2 Learning algorithm of Deep Neural Network

3.3 Applicable to Hand Gesture Recognition

3.3.1 Feature Extraction by Convolutional Neural Network

3.3.1.1 Non-square kernel size and pooling scaling factor

3.3.1.2 Combination of different kernel size feature

3.3.2 Deep Convolutional Neural Network Classification

3.3.3 Tracking Technique

3.4 Refinement

國立交通大學

電子工程學系電子研究所

指導教授：王聖智教授

研究生：謝汝欣 Student：Ju-Hsin Hsieh

指導教授：王聖智教授 Advisor：Prof. Sheng-Jyh Wang

國立交通大學

電子工程學系電子研究所碩士班

碩士論文

研究生：謝汝欣指導教授：王聖智教授

電子工程學系電子研究所碩士班