Chapter 1 Introduction
1.2 System Overview
This thesis proposes a head attitude estimation system (HAES). Figure 1.2.1 shows the flow chart of the software architecture of the proposed system.
The intelligent head attitude estimation is completed in three steps. First, the human face is detected based on skin color, and geometric facial features are used to detect the eyes and mouth with a high accuracy rate. Second, a stereo facial model is built to simulate the head attitude; the face orientation and angle of the model can be adjusted, and seven detecting points are marked on the face model. The seven detecting points are recorded on each image for a specific face orientation and angle and are later used in neural network learning. Third, the HAES is completed by intelligent neural networks under supervised learning.
The remainder of this thesis is organized as follows. Chapter 2 describes work related to the system. Chapter 3 describes the intelligent head attitude estimation system based on geometric facial features. Chapter 4 presents the experimental results. Chapter 5 gives the conclusions of the thesis and future work.
Figure 1.2.1 Architecture of head attitude estimation.
[Flow chart blocks: Input image; Facial feature detection; Geometric facial feature; Head attitude estimation system design. Stage labels: Step I, Step II, Step III]
Chapter 2 Related Work
2.1 Introduction to ANNs
The human nervous system consists of a large number of neurons, each including a soma, an axon, dendrites and synapses. Each neuron is capable of receiving, processing, and passing signals to other neurons. To mimic the characteristics of the human nervous system, investigators have developed an intelligent algorithm called artificial neural networks (ANNs). In the artificial intelligence field, ANNs have been applied successfully to speech recognition, image analysis and adaptive control. This thesis applies ANNs, through learning, to face detection in an eyeball system.
Figure 2.1 Basic structure of a neuron.
[Diagram: inputs x1, x2, x3, ..., xn with weights w1, w2, w3, ..., wn, bias b, activation function f(), output y]
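Figure 2.1 shows the basic structure of a neuron, whose input-output relationship is described as
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
Here, each input $x_i$ is multiplied by a corresponding weight $w_i$, analogous to a synaptic strength. The weighted inputs are summed, together with the bias $b$, to determine the activation level of the neuron. The function $f(\cdot)$ can be linear or nonlinear, such as the linear function, the log-sigmoid function and the tan-sigmoid function, which in their standard forms are respectively expressed as below:
(1) linear function: $f(x) = x$
(2) log-sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$
(3) tan-sigmoid function: $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$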
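The weighted-sum behaviour of a single neuron can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the thesis: the function names are hypothetical, and the three activations are assumed to take their standard textbook forms (identity, logistic and hyperbolic tangent).

```python
import math


def linear(x):
    """Linear activation, assumed here to be the identity."""
    return x


def logsig(x):
    """Log-sigmoid activation, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tansig(x):
    """Tan-sigmoid activation, output in (-1, 1)."""
    return math.tanh(x)


def neuron_output(inputs, weights, bias, activation):
    """Sum the weighted inputs, add the bias, and apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)


# Example values (arbitrary, for illustration only)
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2
for f in (linear, logsig, tansig):
    print(f.__name__, neuron_output(x, w, b, f))
```

The same weighted sum feeds all three activations, so the choice of f() only changes how the activation level is mapped to the output range.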
A general multilayer feed-forward network is composed of one input layer, one output layer, and some hidden layers. For example, Figure 2.2 shows a neural network with one input layer, one output layer, and two hidden layers. Each layer is formed by neurons with the basic structure depicted in Figure 2.1. The input layer receives signals from the outside world and passes them through the hidden layers to the output layer, where the response of the network can be read. Note that in some cases only the input layer and the output layer are required, and the hidden layers can be omitted.
Compared with networks using a single hidden layer, networks with multiple hidden layers can solve more complicated problems. However, their training process may be more difficult.
Figure. 2.2 Multilayer feed-forward network.
In addition to the architecture, the method of setting the weights is an important characteristic that distinguishes neural networks. The training of a neural network is mainly classified into supervised learning and unsupervised learning. Training via supervised learning maps a given set of inputs to a specified set of target outputs, and the weights are adjusted according to a pre-assigned learning algorithm. Unsupervised learning, in contrast, self-organizes a neural network without target data, i.e., only input vectors are provided and no target vectors are specified. Through unsupervised learning, the network modifies its weights so that the most similar input vectors are assigned to the same group. In this thesis, the neural network is designed for image feature extraction and recognition, which requires two images, the input image and the target image, as the training input-output pairs. Hence, the neural network is trained via supervised learning.
2.2 Back-Propagation Network
In supervised learning, the back-propagation (BP) learning algorithm is widely used in most applications. The BP algorithm was proposed in 1986 by Rumelhart, Hinton and Williams and is based on the gradient steepest descent method for updating the weights to minimize the total square error of the output. To explain the BP algorithm clearly, an example is given in Figure 2.3, which shows a neural network with one hidden layer. Let the inputs be xi, i=1,2,…,I, where I is the total number of input nodes, and let the outputs be yj, j=1,2,…,J, where J is the total number of output nodes. In the hidden layer, the k-th hidden node, k=1,2,…,K, with K being the total number of hidden nodes, receives information from the input layer and sends out hk to the output layer. These three layers are connected by two sets of weights, vik and wkj. The weight vik connects the i-th input node and the k-th hidden node, while the weight wkj connects the k-th hidden node and the j-th output node.
Figure. 2.3 Neural network with one hidden layer.
Based on the neural network in Figure 2.3, the BP algorithm for supervised learning generally proceeds in eight steps, as below:
Step 1: Set the maximum tolerable error Emax, and then set the learning rate between 0.1 and 1.0 to reduce the computing time or increase the precision.
Step 2: Set the initial weights and bias values of the network randomly.
Step 3: Input the training data $x = [x_1\ x_2\ \cdots\ x_I]^T$ and the desired output $d = [d_1\ d_2\ \cdots\ d_J]^T$.
Step 4: Calculate the output hk of each of the K neurons in the hidden layer, where fh() is the activation function of the hidden layer, and then the output yj of each of the J neurons in the output layer, where fy() is the activation function of the output layer.
Step 5: Calculate the error function E, i.e., the total square error between the desired output d and the network output.
Step 6: According to the gradient descent method, determine the corrections of the weights vik and wkj.
Step 8: Check for the next training data. If it exists, go to Step 3; otherwise, go to Step 9.
Step 9: Check whether the network has converged. If E < Emax, terminate the training process; otherwise, begin another learning cycle by going to Step 1.
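The steps above can be sketched in Python for the one-hidden-layer network of Figure 2.3. This is a hedged illustration under stated assumptions, not the thesis implementation: both fh() and fy() are assumed to be log-sigmoid functions, the error is assumed to be E = 1/2 Σ (dj - yj)^2, bias terms are omitted (a constant input of 1 plays that role in the example), and the names `train_bp`, `forward` and `total_error`, the initial weight range and the epoch limit are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, v, w):
    """Step 4: hidden outputs h_k, then network outputs y_j."""
    n_hidden, n_out = len(w), len(w[0])
    h = [sigmoid(sum(v[i][k] * x[i] for i in range(len(x))))
         for k in range(n_hidden)]
    y = [sigmoid(sum(w[k][j] * h[k] for k in range(n_hidden)))
         for j in range(n_out)]
    return h, y


def total_error(data, v, w):
    """Step 5, summed over all pairs: E = 1/2 * sum_j (d_j - y_j)^2."""
    error = 0.0
    for x, d in data:
        _, y = forward(x, v, w)
        error += 0.5 * sum((dj - yj) ** 2 for dj, yj in zip(d, y))
    return error


def train_bp(data, n_hidden, eta=0.5, e_max=0.01, max_epochs=2000, seed=0):
    """Train with per-pattern gradient descent; returns weights (v, w)."""
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    # Step 2: random initial weights (the range is an assumption)
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
    for _ in range(max_epochs):
        for x, d in data:  # Steps 3 and 8: loop over the training data
            h, y = forward(x, v, w)
            # Step 6: error terms from the gradient of E (sigmoid derivative)
            delta_y = [(d[j] - y[j]) * y[j] * (1.0 - y[j]) for j in range(n_out)]
            delta_h = [h[k] * (1.0 - h[k])
                       * sum(w[k][j] * delta_y[j] for j in range(n_out))
                       for k in range(n_hidden)]
            # Apply the corrections to w_kj and v_ik
            for k in range(n_hidden):
                for j in range(n_out):
                    w[k][j] += eta * delta_y[j] * h[k]
            for i in range(n_in):
                for k in range(n_hidden):
                    v[i][k] += eta * delta_h[k] * x[i]
        if total_error(data, v, w) < e_max:  # Step 9: convergence check
            break
    return v, w


# Toy data: logical OR, with a constant third input standing in for a bias
or_data = [([0, 0, 1], [0]), ([0, 1, 1], [1]),
           ([1, 0, 1], [1]), ([1, 1, 1], [1])]
v, w = train_bp(or_data, n_hidden=3)
print("total error after training:", total_error(or_data, v, w))
```

Updating the weights after every training pair follows the loop structure of Steps 3 to 8; a batch variant would accumulate the corrections over all pairs before applying them.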
The BP learning algorithm can be used to model various complicated nonlinear functions. In recent years, it has been successfully applied in many domains, such as pattern recognition, adaptive control, and clustering. In this thesis, the BP algorithm is used to learn the input-output relationship for a clustering problem.
2.3 Skin Color Detection
Color is an important source of information in human visual perception. Skin color in a color image is relatively concentrated and stable. In recent years, skin color detection has become a popular research topic and has produced a great number of achievements. Nowadays, skin color detection is applied to a variety of tasks, for example detecting and tracking human faces and gestures, filtering web image content, and diagnosing disease [6,7,8,9].
As the first task in face detection, skin color detection can greatly reduce the computational cost [10] by extracting the potential face regions. To obtain the face locations in the image, these potential face regions are analyzed based on a face model including face shape and physical geometric information [11]. Furthermore, color image segmentation is computationally fast while being relatively robust to changes in scale, viewpoint, and complex background.
According to the distribution of skin color in a color space, skin color pixels can be detected quickly by a skin color model. However, the detection accuracy often varies with the color space used, the race of the subject, and the illumination [12]. In this thesis, the experimental environment is our laboratory and the lighting condition is fixed.
Skin color characteristics are mainly described by a skin color model. Usually, skin color detection involves two aspects: selecting the color space and using the color distribution to establish a good skin color model. The main color spaces in use today include RGB, HSV, HSI, YCrCb and some of their variants, with RGB being the fundamental way to represent color.
2.4 Edge Detection
Edge detection is a fundamental tool in image processing and computer vision, particularly for feature detection and feature extraction. It aims at identifying the points in a digital image at which the brightness changes sharply or discontinuously.
In the ideal case, the result of applying an edge detector to an image may lead to a set of connected curves that indicate the boundaries of objects and surface markings.
Because the boundaries preserve the important structural properties of an image, the amount of data to be processed can be reduced, since irrelevant information is discarded. After edge detection, the task of extracting information from the original image therefore becomes much simpler.
The common edge detection methods are based on differential operators, such as the Laplacian [13], Roberts [14], Sobel [15], LOG [16], Prewitt [17], and Canny [18] operators. In these classic methods, a mask is first moved across the image, and the pixels covered by the mask are processed at each position. The resulting pixel values in the new image then provide the necessary information about the edges. These differential operators are all sensitive to abrupt changes of the pixel gray level, so they are also sensitive to noise. Most of the existing edge detection techniques are effective in certain cases but often require a large amount of computation time and careful threshold setting. With neural networks, not only can the existing approaches be improved, but new ones can also be developed.
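The mask-based procedure described above can be illustrated with the Sobel operator. The sketch below is only an example of one differential operator: the function name `sobel_edges`, the handling of border pixels (left at 0) and the threshold value are assumptions chosen for illustration.

```python
def sobel_edges(image, threshold):
    """Binary edge map from 3x3 Sobel masks; border pixels are left at 0."""
    gx_mask = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient
    gy_mask = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0
            # Move the masks over the 3x3 neighbourhood centred on (r, c)
            for dr in range(3):
                for dc in range(3):
                    p = image[r + dr - 1][c + dc - 1]
                    gx += gx_mask[dr][dc] * p
                    gy += gy_mask[dr][dc] * p
            magnitude = (gx * gx + gy * gy) ** 0.5
            edges[r][c] = 1 if magnitude >= threshold else 0
    return edges


# Gray-level image with a vertical step between columns 2 and 3
image = [[0, 0, 0, 100, 100, 100] for _ in range(5)]
edges = sobel_edges(image, threshold=200)
for row in edges:
    print(row)
```

The dependence on `threshold` in this sketch is the kind of threshold setting that the classic operators require.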
2.5 Morphology Operation
Morphology has two basic operations: dilation and erosion.
Dilation is defined as:
$$A \oplus B = \{\, x : (\hat{B})_x \cap A \neq \emptyset \,\} \qquad (2.11)$$
where A and B are sets in Z. This equation simply means that B is reflected and moved over A, and the positions at which the translated B intersects A are found. Usually A is the signal or image being operated on and B is the structuring element. Figure 2.4 shows how dilation works.
Figure. 2.4 Example of dilation.
The opposite of dilation is known as erosion. This is defined as:
$$A \ominus B = \{\, x : (B)_x \subseteq A \,\} \qquad (2.12)$$
which simply says that the erosion of A by B is the set of points x such that B, translated by x, is contained in A. Figure 2.5 shows how erosion works. Erosion is computed in the same way as dilation, by moving B over A. However, equation (2.12) essentially says that for the output to be a one, all of the inputs must match the structuring element. Thus, erosion removes runs of ones that are shorter than the structuring element. This thesis applies both of these operations to process the image.
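The set definitions of dilation and erosion can be checked on a small one-dimensional example. The following sketch represents A and B as Python sets of integer positions; the function names are hypothetical and the example values are chosen only to show a short run of ones being removed by erosion.

```python
def dilate(a, b):
    """All x where the reflected B, translated by x, overlaps A.

    The reflected and translated set is {x - y : y in B}; it meets A
    exactly when x = p + y for some p in A and y in B.
    """
    return {p + y for p in a for y in b}


def erode(a, b):
    """All x such that B translated by x is contained in A."""
    if not b:
        raise ValueError("structuring element must not be empty")
    candidates = {p - y for p in a for y in b}
    return {x for x in candidates if all(x + y in a for y in b)}


# A has a run of length 2 (positions 1-2) and a run of length 4 (positions 5-8)
A = {1, 2, 5, 6, 7, 8}
B = {0, 1, 2}            # structuring element of length 3

eroded = erode(A, B)     # the run shorter than B disappears
restored = dilate(eroded, B)
print(sorted(eroded), sorted(restored))
```

Eroding and then dilating with the same element recovers the long run while the short run stays removed, which is the behaviour exploited later for noise removal.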
Figure. 2.5 Example of erosion.
Chapter 3 Head Attitude Estimation System
Detecting human facial features such as the eyes, nose and mouth is crucial to applications like automatic face recognition [19] and head attitude estimation [20]. This thesis further derives geometric facial features from the detected facial features and proposes a head attitude detection system that uses an artificial neural network (ANN) to detect the orientation and angle of a human head.
3.1 Facial Features Detection
Automatic human face analysis and recognition has received significant attention during the past decades, due to the emergence of many potential applications such as person identification, video surveillance and human-computer interfaces. An automatic face recognition system usually begins with the detection of the face pattern, and then proceeds to normalize the face images using information about the location and appearance of facial features such as the eyes and mouth [21], [22]. Therefore, detecting faces and facial features is a crucial step. Sections 3.1.1 to 3.1.3 introduce how to detect the human face in successive frames. After the human face is confirmed, Section 3.1.4 introduces a method to detect the human eyes and mouth.
3.1.1 Human Face Detection
Human face detection and recognition have long been a popular research topic.
In the last decades, researchers have devoted much effort to these two problems and have obtained some satisfactory results. Some of these previous efforts were focused on face recognition. However, an accurate and efficient method for human face detection is still lacking.
Popular algorithms for face detection include template matching, geometric features and skin color detection. Skin color detection has been gaining popularity and importance in pattern recognition. Generally, it is the first step of computer vision tasks such as the detection, tracking, and recognition of faces. Many studies have indicated that skin color can be captured easily in a suitable color space, because human skin color is confined to a limited range in some specific color spaces even across different races. Hence, several color spaces have been used to represent the skin color distribution, including normalized RGB, HSV, YCbCr, and CIE-Lab.
Among these methods, this thesis uses the normalized RGB method, which is effective for skin color segmentation. This method accounts for white balance and illumination variation, both of which determine whether skin color detection performs well. In practice, the normalized RGB method detects skin color well, which is the most important reason for using it in this thesis.
Statistics of human skin color collected under different illumination conditions show that the RGB space is sensitive to the external environment. Therefore, this thesis converts the RGB space to NCC (Normalized Color Coordinates), defined as:
$$r = \frac{R}{R+G+B} \qquad (3.1)$$
$$g = \frac{G}{R+G+B} \qquad (3.2)$$
Equations (3.1) and (3.2) give the normalized red and green components, respectively, whose purpose is to reduce the dependence of the original color on brightness. Figure 3.1.1.1 shows the skin locus, where the X coordinate represents r and the Y coordinate represents g; the figure shows that the skin range is highly concentrated. The values range from 0.2 to 0.6 along the X coordinate and from 0.2 to 0.4 along the Y coordinate. From these statistics, boundary functions can be defined as follows:
Figure 3.1.1.1 statistic range of skin.
First, a simple membership function for the skin locus is a pair of quadratic functions defining the upper and lower bounds of the cluster. For each r, the maximum and minimum g were used to estimate the upper and lower quadratic functions. Using least squares estimation, the upper bound quadratic coefficients are found to be Au = -1.3767, bu = 1.0743, cu = 0.1452; the lower bound coefficients are Ad = -0.776, bd = 0.5601, cd = 0.1766. Therefore, Q+ and Q- are defined as in (3.3) and (3.4):
$$Q_{+} = A_u r^2 + b_u r + c_u \qquad (3.3)$$
$$Q_{-} = A_d r^2 + b_d r + c_d \qquad (3.4)$$
Because the white point is also enclosed between Q+ and Q-, it has to be eliminated. For this purpose, the quadratic function in (3.5) is used:
$$W = (r - 0.33)^2 + (g - 0.33)^2 \qquad (3.5)$$
Pixels with chromaticity (r, g) are then given skin locus membership values S(r, g), where S(r, g) = 1 when g lies between Q- and Q+ and W is large enough to exclude the white point, and S(r, g) = 0 otherwise. (3.6)
If S is assigned 1, the pixel belongs to a skin region; otherwise, S is 0 and the pixel belongs to a non-skin region. Using Eqs. (3.1)-(3.6), the thresholded binary image is obtained as shown in Figure 3.1.1.2(b), where black represents the non-skin region, in this case white objects, and white represents the skin region.
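The skin locus test of Eqs. (3.1)-(3.6) can be sketched as a per-pixel function. The quadratic coefficients and the white point (0.33, 0.33) are taken from the text; the function names, the treatment of a pure black pixel and, in particular, the white-point threshold `w_min` are assumptions made for illustration, since its value is not stated here.

```python
# Quadratic coefficients of the upper (Q+) and lower (Q-) bounds, from the text
A_U, B_U, C_U = -1.3767, 1.0743, 0.1452
A_D, B_D, C_D = -0.776, 0.5601, 0.1766


def normalized_rg(R, G, B):
    """Eqs. (3.1)-(3.2): normalized red and green components."""
    total = R + G + B
    if total == 0:
        return 0.0, 0.0   # black pixel: chromaticity undefined (assumption)
    return R / total, G / total


def skin_membership(R, G, B, w_min=0.0004):
    """Return 1 for skin, 0 for non-skin. w_min is a hypothetical threshold."""
    r, g = normalized_rg(R, G, B)
    q_plus = A_U * r * r + B_U * r + C_U       # Eq. (3.3)
    q_minus = A_D * r * r + B_D * r + C_D      # Eq. (3.4)
    w = (r - 0.33) ** 2 + (g - 0.33) ** 2      # Eq. (3.5)
    inside_locus = q_minus < g < q_plus
    away_from_white = w > w_min
    return 1 if (inside_locus and away_from_white) else 0


# Illustrative pixels: a skin-like tone, pure white, and pure blue
for rgb in [(200, 140, 110), (255, 255, 255), (0, 0, 255)]:
    print(rgb, skin_membership(*rgb))
```

Applying `skin_membership` to every pixel of a color image gives a binary image of the kind shown in Figure 3.1.1.2(b).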
(a) (b)
Figure 3.1.1.2 (a) Color image, (b) thresholded image using Eqs.(3.1)-(3.6)
3.1.2 Morphology Operation
After color extraction, the color regions are extracted from the original image, but some noise still remains. One of the conventional ways to eliminate noise regions is to use morphology operations. In this thesis, the noise is eliminated by the morphological erosion operation expressed as
$$A \ominus B = \{\, x : (B)_x \subseteq A \,\} \qquad (3.7)$$
where B is a disk-shaped structuring element with radius 4, as shown in Figure 3.1.2.1(a), and the noise regions in image A that are smaller than B are erased by the operation. However, some gaps may also be generated in isolated regions after erosion. To repair these gaps, the morphological dilation operation is further employed, expressed as
$$A \oplus C = \{\, x : (\hat{C})_x \cap A \neq \emptyset \,\} \qquad (3.8)$$
where C is a disk-shaped structuring element with radius 10, as shown in Figure 3.1.2.1(b), and the gaps in image A are repaired by the operation. Figure 3.1.2.2 shows an example of erosion and dilation using the structuring elements B and C.
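Erosion followed by dilation with disk-shaped structuring elements can be sketched on a binary image stored as a list of rows. This is a minimal illustration, not the thesis code: the function names are hypothetical, pixels outside the image are assumed to be 0, and a radius of 1 is used on a small image so that the result can be checked by hand (the thesis uses radii 4 and 10).

```python
def disk_offsets(radius):
    """Offsets (dy, dx) of a disk-shaped structuring element."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def _pixel(image, y, x):
    """Pixel value, with everything outside the image treated as 0."""
    if 0 <= y < len(image) and 0 <= x < len(image[0]):
        return image[y][x]
    return 0


def erode(image, offsets):
    """Keep a pixel only if the whole structuring element fits in the region."""
    return [[1 if all(_pixel(image, y + dy, x + dx) for dy, dx in offsets) else 0
             for x in range(len(image[0]))]
            for y in range(len(image))]


def dilate(image, offsets):
    """Set a pixel if the reflected structuring element touches the region."""
    return [[1 if any(_pixel(image, y - dy, x - dx) for dy, dx in offsets) else 0
             for x in range(len(image[0]))]
            for y in range(len(image))]


# 9x9 test image: a 5x5 block of ones plus one isolated noise pixel
image = [[0] * 9 for _ in range(9)]
for y in range(2, 7):
    for x in range(3, 8):
        image[y][x] = 1
image[0][0] = 1

small = disk_offsets(1)
eroded = erode(image, small)      # noise pixel is erased, block shrinks
opened = dilate(eroded, small)    # block is grown back
for row in opened:
    print(row)
```

Because a disk is symmetric, its reflection equals itself, so the reflection in Eq. (3.8) does not change the result for the structuring elements used here.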
(a) (b)
Figure. 3.1.2.1 (a) Structuring element B (b) Structuring element C
(a) (b) (c)
Figure. 3.1.2.2 Steps for morphology operations (a) Initial image (b) Result of erosion using structuring element B (c) Result of dilation using structuring element C
3.1.3 Connected Components Labeling
After the morphology operations, the different components are identified by Connected Components Labeling (CCL), which is often used in computer vision to detect connected regions in a binary digital image using 4- or 8-pixel connectivity [24]. In this thesis, 4-pixel connectivity is used to label the potential face regions.
Figure. 3.1.3.1 Scanning the image.
The 4-pixel connected CCL algorithm can be partitioned into two processes, labeling and componentizing. During labeling, the image is scanned pixel by pixel, from left to right and top to bottom, as shown in Figure 3.1.3.1, where p is the pixel being processed, and r and t are respectively the upper and left neighbors of p. Let v() and l() denote the binary value and the label of a pixel. If v(p)=0, move on to the next pixel; otherwise, i.e., v(p)=1, the label l(p) is determined by the following rules:
R1. For v(r)=0 and v(t)=0, assign a new label to l(p).
R2. For v(r)=1 and v(t)=0, assign l(r) to l(p), i.e., l(p)=l(r).
R3. For v(r)=0 and v(t)=1, assign l(t) to l(p), i.e., l(p)=l(t).
R4. For v(r)=1, v(t)=1 and l(t)=l(r), then assign l(r) to l(p), i.e., l(p)=l(r).
R5. For v(r)=1, v(t)=1 and l(t)≠l(r), then assign l(r) to both l(p) and l(t), i.e., l(p)=l(r) and l(t)=l(r).
For example, after the labeling process, Figure 3.1.3.2(a) is changed into Figure 3.1.3.2(b). It is clear that some connected components contain pixels with different labels. Hence, it is required to further execute the componentizing process, which groups all the pixels connected in one component and assigns them the same label, namely the smallest label in that component. Figure 3.1.3.2(c) is the result of Figure 3.1.3.2(b) after componentizing.
(a) Digital image (b) Labeling (c) Componentizing
Figure 3.1.3.2 Example of 4-pixel connected CCL.
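The labeling and componentizing processes can be sketched as a two-pass procedure. This is one possible realization under stated assumptions rather than the thesis implementation: instead of rewriting l(t) directly as in rule R5, the sketch records that the two labels are equivalent in a table (`parent`, a hypothetical name) and resolves every label to the smallest one in its component during the second pass.

```python
def label_components(image):
    """4-connected component labeling of a binary image (list of rows)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = {}  # equivalence table: label -> smaller equivalent label

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest label is the root

    next_label = 1
    # First pass (labeling): scan left to right, top to bottom
    for y in range(rows):
        for x in range(cols):
            if image[y][x] == 0:
                continue
            up = labels[y - 1][x] if y > 0 else 0      # pixel r
            left = labels[y][x - 1] if x > 0 else 0    # pixel t
            if up == 0 and left == 0:                  # R1: new label
                labels[y][x] = next_label
                parent[next_label] = next_label
                next_label += 1
            elif left == 0:                            # R2
                labels[y][x] = up
            elif up == 0:                              # R3
                labels[y][x] = left
            else:                                      # R4 and R5
                labels[y][x] = up
                if up != left:
                    union(up, left)
    # Second pass (componentizing): smallest label per component
    for y in range(rows):
        for x in range(cols):
            if labels[y][x]:
                labels[y][x] = find(labels[y][x])
    return labels


image = [[1, 0, 1],
         [1, 1, 1],
         [0, 0, 0],
         [0, 1, 0]]
for row in label_components(image):
    print(row)
```

In this example the pixels first labeled 1 and 2 turn out to be connected through the second row, so componentizing gives them the common label 1, while the isolated pixel keeps its own label.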
3.1.4 Face Classification System
After skin color extraction and connected components labeling (CCL), the face candidates are obtained. This thesis then defines two conditions to confirm the face location in each frame.
(I) The area of each region is used to judge which region is the face. After the CCL, the potential face regions are located. The face area has to satisfy (3.9): in a region that is too small, the eyes and mouth are difficult to find, while a region that is too large does not correspond to a real face size, so both cases are eliminated.
$$1000\ \text{pixels} \le \text{AREA} \le 8000\ \text{pixels} \qquad (3.9)$$
(II) The proportion of the height and width of the human face has to satisfy (3.10). In general, the height of a human face is larger than its width; Figure 3.1.4.1 shows the proportion of the face's width and height. In addition, the face candidates include some non-face skin regions (such as the neck and arms). Hence, the proportion of the width and height is constrained to confirm the human face region.
$$1 \le \frac{\text{Height}}{\text{Width}} \le 3 \qquad (3.10)$$
Figure 3.1.4.1 The proportion of the human face’s width and height.
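The two conditions can be combined into a single check on each candidate region. The sketch below is illustrative only: the function name and parameter names are hypothetical, the area limits follow (3.9), and the proportion is assumed to be taken as height divided by width, consistent with the statement that a face is generally taller than it is wide.

```python
def is_face_candidate(area, width, height,
                      min_area=1000, max_area=8000,
                      min_ratio=1.0, max_ratio=3.0):
    """Apply the area condition (I) and the proportion condition (II)."""
    if width <= 0 or height <= 0:
        return False
    ratio = height / width  # assumed orientation of the proportion
    area_ok = min_area <= area <= max_area
    ratio_ok = min_ratio <= ratio <= max_ratio
    return area_ok and ratio_ok


# Illustrative regions given as (area in pixels, width, height)
regions = [(4000, 50, 80), (500, 20, 25), (4000, 80, 50), (4000, 20, 80)]
for region in regions:
    print(region, is_face_candidate(*region))
```

In practice the area would be the pixel count of a labeled component and the width and height would come from its bounding box.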
Face detection has to be completed before the eyes and mouth are detected. The steps of the human face detection are summarized in the flow chart of Figure 3.1.4.2, and Figure 3.1.4.3 shows the result of each step of the flow chart on real face images.
Figure 3.1.4.2 The flow chart of the human face detection.
[Flow chart: Skin color extraction → Binary image → Erosion → Dilation → CCL → Area and proportion define]
Figure 3.1.4.3 Real human face detection results for each step of the flow chart.
[Panels: Skin color extraction and binary image; Erosion; Dilation; CCL; Area and proportion define]
3.1.4 Facial Feature Detection
The goal of this thesis is to determine where the human face is facing, that is, to obtain its orientation and angle. Therefore, both face detection and facial feature detection are required. The previous sections obtained the position of the human face in each of the successive images. This section uses that result to detect human facial features. Facial features include the eyes, mouth, nose and so on, but this thesis searches for only two of them, the mouth and the eyes. This section describes in more detail how the human eyes and mouth are detected.
3.1.4.1 Eye and mouth detection
The eye is the most significant and important feature in the human face, as extraction of the eye is often easier than that of other facial features. Eye detection