
Chapter 1 Introduction

1.2 System Overview

1.2.2 Software Architecture

For the software architecture, Fig-1.2 shows the flowchart of the proposed system. First, the system obtains depth images and color images from the Kinect, and then it selects the regions of interest (ROI) using the color and depth (RGB-D) information. ROI selection can be divided into three sub-steps: histogram projection, connected component labeling (CCL), and moving object segmentation. After ROI selection, the size of each ROI is normalized by bilinear interpolation regardless of its original size. Afterward, the histogram of oriented gradients (HOG) is computed as the human feature descriptor. Next, the features are passed to the human shape recognition module to judge whether each ROI contains a human. Finally, the system checks whether any motionless human exists in the current frame and regards it as a new ROI in the next frame.

Fig-1.2 Software architecture
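The ROI normalization step above relies on bilinear interpolation. The following plain-Python sketch shows how an arbitrarily sized ROI could be resampled to a fixed size; the target size and the list-of-lists image representation are assumptions for illustration, not the thesis implementation.

```python
# Sketch of ROI size normalization via bilinear interpolation.
# Target size and sample values are illustrative assumptions.

def bilinear_resize(img, out_h, out_w):
    """Resize a 2D grayscale image (list of lists) with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        # Map output row coordinate back into the input image.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Weighted average of the four neighbouring pixels.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            out[r][c] = top * (1 - dy) + bot * dy
    return out

roi = [[0, 10], [20, 30]]          # a tiny 2x2 ROI
norm = bilinear_resize(roi, 3, 3)  # normalized to a fixed 3x3 size
```

In the thesis pipeline, every detected ROI would be resized this way to a common size before HOG features are extracted, so the descriptor length is the same for each candidate region.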

The remainder of this thesis is organized as follows: Chapter 2 describes the related works of the algorithm in the system. Chapter 3 introduces the above-mentioned human detection system in detail. Chapter 4 shows the experimental results. Chapter 5 is the conclusions of the thesis and the future works.

Chapter 2

Related Works

2.1 Human Detection Methods

In recent years, many human detection approaches have been developed. In general, a human detection system consists of three parts: foreground segmentation, feature extraction, and human recognition.

2.1.1 Foreground Segmentation

To simplify image scanning, foreground segmentation is used to filter out the background of an image and separate the regions of interest (ROI) from the original image. There are several methods for segmenting the foreground, such as the optical flow method [3-5] and background subtraction [6-9]. In the optical flow method, the relative motion of the camera with respect to the scene produces optical flow, treated as the pattern of apparent motion of objects, surfaces, and edges in a visual scene. This method can detect interesting foreground regions accurately, but it is computationally expensive and difficult to realize in real time. Background subtraction has been widely used in computer vision systems to detect moving objects in videos. In this method, the difference between the current frame and a reference frame is calculated, and the objects of interest are then retrieved according to a threshold. However, the background model must be updated continuously due to illumination changes or a changing background.
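The frame-differencing form of background subtraction described above can be sketched in a few lines; the tiny 3x3 images and the threshold value of 25 are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch of background subtraction: pixels whose absolute
# difference from the reference frame exceeds a threshold are foreground.
# The threshold value (25) is an illustrative assumption.

def subtract_background(frame, background, threshold=25):
    """Return a binary foreground mask (1 = moving object, 0 = background)."""
    h, w = len(frame), len(frame[0])
    return [[1 if abs(frame[r][c] - background[r][c]) > threshold else 0
             for c in range(w)] for r in range(h)]

background = [[10, 10, 10],
              [10, 10, 10],
              [10, 10, 10]]
frame      = [[10, 10, 10],
              [10, 200, 10],   # a bright moving object enters
              [10, 200, 10]]

mask = subtract_background(frame, background)
```

A real system would additionally update `background` over time, as the text notes, to cope with illumination changes.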

Furthermore, some methods use special hardware such as infrared cameras [10] or depth sensors [10-15]. Based on depth information, it is feasible to generate a faithful background model of the scene, which can be used to detect foreground objects, because depth is not sensitive to illumination changes. Therefore, this thesis uses depth information to implement foreground segmentation of humans in realistic environments.

2.1.2 Feature Extraction

As soon as the foreground regions are segmented, different combinations of features and classifiers are applied to distinguish the human shape from other objects.

The purpose of feature extraction is to extract human-related features to increase the detection rate. Many kinds of features can be used to recognize the human shape. One class of features is based on the image gradient, such as the histogram of oriented gradients (HOG) [16, 17], edge information [10, 12, 13], and Haar-like features [18]; Fig-2.1 shows some Haar-like features. Since the image gradient reflects directional changes in the intensity or color of a digital image, the boundaries of objects, which carry the human shape information, can be found and extracted from it. Another class of features is based on motion [13, 18]: humans often perform periodic motion, such as walking, so they can be separated from other objects by their motion periodicity. Other features, such as texture [12], skeletons [19], and scale-invariant features [20], are also often applied in human detection. In practice, it is common to use more than one feature to implement human detection due to the high variation of human appearance.

Fig-2.1 Examples of Haar-like features
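The gradient-based features described above can be illustrated with a heavily simplified, single-cell variant of the HOG idea: per-pixel gradients from central differences, accumulated into a magnitude-weighted histogram of unsigned orientations. Real HOG additionally uses cells, overlapping blocks, and block normalization; the 9-bin count follows common convention, and the toy image is an assumption.

```python
import math

# Simplified sketch of the HOG idea: central-difference gradients binned
# into a magnitude-weighted histogram of unsigned orientations (0-180 deg).

def orientation_histogram(img, bins=9):
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = img[r][c + 1] - img[r][c - 1]   # horizontal gradient
            gy = img[r + 1][c] - img[r - 1][c]   # vertical gradient
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned angle
            hist[min(int(ang / (180.0 / bins)), bins - 1)] += mag
    return hist

# A vertical edge: gradients point horizontally, so all the gradient
# energy falls into the first (0-degree) orientation bin.
img = [[0, 0, 100, 100]] * 4
hist = orientation_histogram(img)
```

This captures why gradient histograms encode shape: edge directions of the human silhouette dominate particular bins regardless of absolute brightness.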

2.1.3 Human Recognition

After feature extraction, the system can separate humans from other objects through many kinds of features. In this way, the system can recognize humans using machine learning techniques such as the support vector machine (SVM) [7, 16], Adaptive Boosting (AdaBoost) [18, 21], and artificial neural networks (ANN) [10, 22, 23]. The main advantages of machine learning are its learning ability and its tolerance of variation. Nevertheless, such systems often require a large number of training samples in order to learn to judge human and non-human. Support vector machines determine the best discriminant support vectors according to the minimum empirical classification error and the maximum geometric margin. Adaptive Boosting constructs a classifier from a weighted linear combination of selected features so as to obtain the lowest error on a training set consisting of human and non-human samples. Artificial neural networks have been successfully applied to pattern recognition and data processing; for example, they are able to model complex relationships between inputs and outputs or find patterns for judging human and non-human.

Besides, the technique of template matching [11, 14, 24-27] is also popular in human detection. Compared with machine learning, template matching is easier to implement and requires less computation time, but it tolerates less variation. For example, the system in [28] uses a set of templates to form a hierarchical structure [27, 29] according to the similarity between each pair of models.

The root of the tree-like structure holds the model with the highest similarity, while the more specific models form the leaves of the tree. As a result, the hierarchical structure improves the variation tolerance without increasing the computation time. Nonetheless, it is difficult and inefficient to collect the templates and determine the similarity between each pair of models.

Apart from the above methods, which directly detect the whole human shape, other methods use part templates [21, 24, 26, 27, 30] to match the human shape in order to obtain a higher detection rate and handle occlusion problems. These methods decompose the structure of a human body into a number of body parts, where each part is described by a small set of part templates and contributes differently to the recognition of the whole human body. Therefore, these methods detect individual body parts rather than the whole human shape. Part-template matching can reduce the influence of the variation in human appearance and resolve partial occlusion problems; however, it results in more computational cost and decreases the detection speed.

2.2 Neural Network

Artificial neural networks (ANNs) are systems constructed to make use of organizational principles resembling those of the human brain, and they represent a promising generation of information processing systems. ANNs are good at tasks such as pattern matching, classification, function approximation, and optimization, at which traditional computers are inefficient, especially pattern matching. However, traditional computers are faster at algorithmic computation and precise arithmetic operations.

2.2.1 Introduction to ANNs

The human nervous system consists of a large number of neurons. Each neuron includes four parts: the soma, axon, dendrites, and synapses, and is capable of receiving, processing, and passing signals from one neuron to another. To mimic the characteristics of the human nervous system, investigators have developed an intelligent algorithm called artificial neural networks (ANNs). Through proper learning processes, ANNs have been successfully applied to complicated problems such as image analysis, speech recognition, and adaptive control. In this thesis, ANNs are adopted to implement human detection via intelligent learning algorithms.

The basic structure of a neuron is shown in Fig-2.2, whose input-output relationship is described as

y = f(Σ_i w_i x_i + b)

where x_i are the inputs, w_i the corresponding weights, b the bias, and f the activation function; the sum of the weighted inputs and the bias is propagated to the activation function to determine the activation level of the neuron. There are three common activation functions, including the linear function, the log-sigmoid function, and the tan-sigmoid function, which are described below:

i. Linear function: f(x) = x

ii. Log-sigmoid function: f(x) = 1 / (1 + e^(-x))

iii. Tan-sigmoid function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
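The three common activation functions can be written directly in Python; this is a generic sketch of the standard definitions, not code from the thesis.

```python
import math

# The three common activation functions used in neural networks.

def linear(x):
    """Linear function: f(x) = x."""
    return x

def log_sigmoid(x):
    """Log-sigmoid function: f(x) = 1 / (1 + e^-x), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tan_sigmoid(x):
    """Tan-sigmoid function: (e^x - e^-x) / (e^x + e^-x), output in (-1, 1)."""
    return math.tanh(x)
```

The log-sigmoid squashes the weighted sum into (0, 1) and the tan-sigmoid into (-1, 1), which is what allows multilayer networks to model nonlinear input-output relationships.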

Generally, a multilayer feed-forward network is divided into three kinds of layers: one input layer, one output layer, and at least one hidden layer.

Take Fig-2.3 as an example: it shows a neural network with one input layer, one output layer, and two hidden layers. Each layer is composed of neurons whose basic structure is depicted in Fig-2.2. The input layer receives signals from the outside world, and the responses are then delivered layer by layer until the overall response of the network is obtained at the output layer. As expected, a neural network with multiple hidden layers is able to deal with more complicated problems than one with a single hidden layer; nonetheless, the training process of multi-hidden-layer networks may be more tedious.

Fig-2.2 Basic structure of ANNs

Fig-2.3 Multilayer feed-forward network

In addition to the structure, the way a neural network is trained must also be determined. In general, training can be separated into two kinds of learning processes, supervised and unsupervised [31]; the main difference between them is whether a set of target outputs is given. Supervised learning maps a given set of inputs to a specified set of target outputs and adjusts the weights according to a pre-assigned learning algorithm. On the other hand, unsupervised learning self-organizes a neural network without any target outputs, modifying the weights so that the most similar inputs are assigned to the same group. In this thesis, the neural network is designed for image recognition based on supervised learning; as a result, both the input and target images are necessary.

2.2.2 Back-Propagation Network

In supervised learning, the back-propagation algorithm (BP algorithm) is a

familiar method for training artificial neural networks to execute a given task. The BP algorithm, proposed in 1986 by Rumelhart, Hinton, and Williams, is based on the gradient steepest descent method and updates the weights to minimize the total squared error of the output. To explain the BP algorithm clearly, take the neural network with one hidden layer shown in Fig-2.4 as an example. Let the inputs be x_i, i = 1, 2, …, I, and the outputs be y_o, o = 1, 2, …, O, where I and O are the total numbers of input and output neurons, respectively. The hidden layer contains J hidden neurons, which receive information from the input layer and send their responses to the output layer. Moreover, the three layers are connected by two sets of weights, v_ij and w_jo, where v_ij connects the i-th input node to the j-th hidden node, and w_jo connects the j-th hidden node to the o-th output node.

Fig-2.4 Neural network with one hidden layer

Considering the neural network example in Fig-2.4 again, the BP algorithm for supervised learning is generally processed step by step as below:

A. Set the initial weight and bias value of the network at random.

B. Set the maximum tolerable error Emax and the learning rate η between 0.1 and 1.0 to reduce the computing time or increase the precision.

C. Input the training data x = [x_1 x_2 … x_i … x_I]^T and the target data of desired outputs t = [t_1 t_2 … t_o … t_O]^T.

D. Calculate the output of each of the J neurons in the hidden layer,

h_j = f(Σ_{i=1}^{I} v_ij x_i + θ_j), j = 1, 2, …, J

and the output of each of the O neurons in the output layer,

y_o = f(Σ_{j=1}^{J} w_jo h_j + θ_o), o = 1, 2, …, O

where f(·) is the activation function and θ_j, θ_o are the bias terms.

E. Calculate the error function between the network outputs and the desired outputs by the following equation:

E = (1/2) Σ_{o=1}^{O} (t_o - y_o)^2

F. According to the gradient descent method, determine the corrections of the weights as below:

Δw_jo = -η ∂E/∂w_jo, Δv_ij = -η ∂E/∂v_ij

G. Propagate the corrections backward to update the weights as below:

w_jo(n+1) = w_jo(n) + Δw_jo, v_ij(n+1) = v_ij(n) + Δv_ij

H. Check whether the whole training data set has been presented. Presenting the whole training data set once is considered one learning circle. If the current learning circle is not finished, return to Step-C; otherwise, go to the next step, Step-I.

I. Check whether the network converges. If E < Emax, terminate the training process; otherwise, begin another learning circle by going to Step-C.
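Steps A-I can be sketched as a minimal pure-Python implementation. The 2-2-1 network size, the logical-AND training set, the learning rate, and the number of learning circles below are all illustrative assumptions rather than the thesis's configuration.

```python
import math
import random

# Minimal sketch of the BP algorithm (Steps A-G) for a 2-2-1 network
# with log-sigmoid units, trained on the logical AND function.

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

I, J, O = 2, 2, 1  # illustrative layer sizes
# Step A: random initial weights v[i][j] (input->hidden), w[j][o] (hidden->output).
v = [[random.uniform(-1, 1) for _ in range(J)] for _ in range(I)]
bh = [random.uniform(-1, 1) for _ in range(J)]
w = [[random.uniform(-1, 1) for _ in range(O)] for _ in range(J)]
bo = [random.uniform(-1, 1) for _ in range(O)]
eta = 0.5  # Step B: learning rate

data = [([0, 0], [0]), ([0, 1], [0]), ([1, 0], [0]), ([1, 1], [1])]  # AND

def learning_circle():
    """One pass over the training set; returns the total squared error."""
    total = 0.0
    for x, t in data:
        # Steps C-D: forward pass through hidden and output layers.
        h = [sigmoid(sum(v[i][j] * x[i] for i in range(I)) + bh[j]) for j in range(J)]
        y = [sigmoid(sum(w[j][o] * h[j] for j in range(J)) + bo[o]) for o in range(O)]
        # Step E: accumulate the squared error.
        total += 0.5 * sum((t[o] - y[o]) ** 2 for o in range(O))
        # Steps F-G: gradient descent corrections, propagated backward.
        do = [(t[o] - y[o]) * y[o] * (1 - y[o]) for o in range(O)]
        dh = [h[j] * (1 - h[j]) * sum(do[o] * w[j][o] for o in range(O)) for j in range(J)]
        for j in range(J):
            for o in range(O):
                w[j][o] += eta * do[o] * h[j]
        for o in range(O):
            bo[o] += eta * do[o]
        for i in range(I):
            for j in range(J):
                v[i][j] += eta * dh[j] * x[i]
        for j in range(J):
            bh[j] += eta * dh[j]
    return total

# Steps H-I: repeat learning circles until the error is small enough.
first = learning_circle()
for _ in range(2000):
    last = learning_circle()
```

After training, `last` should be far below `first`, illustrating how repeated learning circles drive the total squared error toward Emax.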

The BP learning algorithm can be used to model various complicated nonlinear functions. In recent years, it has been successfully applied in many domains, such as pattern recognition, adaptive control, and clustering. In this thesis, the BP algorithm is used to learn the input-output relationship for the clustering problem.

2.3 Support Vector Machine

In machine learning, support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts which of two possible classes each given input belongs to, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

Taking a graphical example for more detailed illustration, Fig-2.5 shows a linearly separable set of 2D points, each belonging to one of two classes; the task is to find a straight line separating them.

Fig-2.5 A set of 2D points belonging to one of two classes

As mentioned above, multiple separating straight lines offer a solution to the problem, as shown in Fig-2.6. However, a line is bad if it passes too close to the points, because it will be sensitive to noise and will not generalize correctly. Therefore, the goal of the SVM algorithm is to find the line passing as far as possible from all the points.

Fig-2.6 Multiple straight lines separating the set

In Fig-2.7, the operation of the SVM algorithm is based on finding the hyperplane that gives the largest minimum distance (maximum margin) to the training examples. Therefore, the optimal separating hyperplane maximizes the margin of the training data.

Fig-2.7 The optimal separating hyperplane with maximum margin

As for a hyperplane, the notation is formally defined as

f(x) = β_0 + β^T x (2.13)

where β is known as the weight vector and β_0 as the bias. The optimal hyperplane can be represented in an infinite number of different ways by scaling β and β_0. As a matter of convention, among all the possible representations of the hyperplane, the one chosen is

|β_0 + β^T x| = 1 (2.14)

where x symbolizes the training examples closest to the hyperplane. In general, the training examples that are closest to the hyperplane are called support vectors, and this representation is known as the canonical hyperplane. Based on geometry, the distance between a point x and a hyperplane (β, β_0) is defined as

distance = |β_0 + β^T x| / ||β|| (2.15)

According to (2.14), for the canonical hyperplane the numerator is equal to one, and the distance to the support vectors is

distance_support vectors = |β_0 + β^T x| / ||β|| = 1 / ||β|| (2.16)

Recall that the margin introduced in Fig-2.7, here denoted as M, is twice the distance to the closest examples:

M = 2 / ||β|| (2.17)

Finally, the problem of maximizing M is equivalent to the problem of minimizing a function L(β) subject to some constraints. The constraints model the requirement that the hyperplane classify all the training examples x_i correctly, and the problem can be formally described as

min_{β, β_0} L(β) = (1/2) ||β||^2 subject to y_i (β^T x_i + β_0) ≥ 1, for all i (2.18)

where y_i represents each of the labels of the training examples. Equation (2.18) is a Lagrangian optimization problem [32] and can be solved using Lagrange multipliers [32] to obtain the weight vector and the bias of the optimal hyperplane. This example deals only with lines and points in the Cartesian plane instead of hyperplanes and vectors in a high-dimensional space; however, the same concepts apply to tasks where the examples to classify lie in a space whose dimension is higher than two.
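Equations (2.13)-(2.17) can be checked numerically for a concrete 2D hyperplane. The particular β, β_0, and points below are illustrative assumptions, chosen so that ||β|| and the margin come out to round numbers.

```python
import math

# Worked example of (2.13)-(2.17) in 2D for a fixed hyperplane (beta, beta0).
# The numbers are illustrative assumptions, not values from the thesis.

beta = [3.0, 4.0]   # weight vector
beta0 = -5.0        # bias
norm_beta = math.hypot(*beta)   # ||beta|| = 5

def f(x):
    """(2.13): f(x) = beta0 + beta^T x."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

def distance(x):
    """(2.15): |f(x)| / ||beta||, the point-to-hyperplane distance."""
    return abs(f(x)) / norm_beta

# A support vector satisfies |f(x)| = 1 (the canonical form (2.14)),
# so by (2.16) its distance is 1/||beta||, and by (2.17) the margin is:
x_sv = [1.0, 0.75]              # f(x_sv) = 3*1.0 + 4*0.75 - 5 = 1
margin = 2.0 / norm_beta        # (2.17): M = 2 / ||beta||
```

Scaling β and β_0 by any positive constant leaves the hyperplane (and `distance`) unchanged, which is exactly why the canonical form (2.14) is needed to pin down one representation.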

2.4 Morphological Operations

The word morphology commonly denotes a branch of biology that deals with the form and structure of animals and plants. In image processing, mathematical morphology is a tool for extracting image components that are useful in the representation and description of region shape, such as boundaries, skeletons, and the convex hull.

2.4.1 Dilation and Erosion

This section introduces two fundamental morphological operations, dilation and erosion [33]; many morphological algorithms are based on these two primitive operations. The dilation operation is often used to repair gaps and is defined as

A ⊕ B = { z | (B̂)_z ∩ A ≠ ∅ } (2.19)

where A and B are sets in Z², B̂ denotes the reflection of the set B, and (B̂)_z is the translation of B̂ by z. The result of the dilation in (2.19) is the set of all displacements z such that B̂ and A overlap in at least one element. For example, Fig-2.8(a) shows a simple set A, and Fig-2.8(b) and Fig-2.8(c) are two structuring elements. Fig-2.8(d) is the dilation of A using the symmetric element B, shown shaded; similarly, Fig-2.8(e) is the dilation of A using the element C, shown shaded. In Fig-2.8(d) and Fig-2.8(e), the dashed line shows the original set A, and all points inside the boundary constitute the dilation of A. Hence, the dilated region extends further when the structuring element is larger.

Fig-2.8 Example of dilation operation

Conversely, the erosion operation is often used to remove noise and irregular regions, and is defined as

A ⊖ B = { z | (B)_z ⊆ A } (2.20)

where A and B are sets in Z² and (B)_z is the translation of the set B by z. Unlike dilation, which is an expanding operation, erosion shrinks objects in the image. Fig-2.9 shows a process similar to that in Fig-2.8: Fig-2.9(a) shows a simple set A, Fig-2.9(b) and Fig-2.9(c) are two structuring elements, and Fig-2.9(d) and Fig-2.9(e) are the erosions of A using the elements B and C, respectively. From Fig-2.9(e), we can see that if the structuring element is too large, the result of erosion will be almost empty.

Fig-2.9 Example of erosion operation
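Binary dilation and erosion as in (2.19) and (2.20) can be sketched on small 0/1 grids. The 3x3 square structuring element (which equals its own reflection) and the toy image are assumptions for illustration; pixels outside the grid are treated as background.

```python
# Binary dilation and erosion on 0/1 grids, following (2.19) and (2.20).

def dilate(A, B):
    """A ⊕ B: a pixel z is set if the reflected B translated to z overlaps A."""
    h, w = len(A), len(A[0])
    return [[int(any(0 <= r - dr < h and 0 <= c - dc < w and A[r - dr][c - dc]
                     for dr, dc in B))
             for c in range(w)] for r in range(h)]

def erode(A, B):
    """A ⊖ B: a pixel z is kept only if B translated to z fits inside A."""
    h, w = len(A), len(A[0])
    return [[int(all(0 <= r + dr < h and 0 <= c + dc < w and A[r + dr][c + dc]
                     for dr, dc in B))
             for c in range(w)] for r in range(h)]

# 3x3 square structuring element, as offsets from its centre.
B = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

A = [[0, 0, 0, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 1, 1, 1, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 0, 0, 0]]
```

Dilating the 3x3 block grows it to fill the whole 5x5 grid, while eroding it leaves only the centre pixel, matching the expand/shrink behaviour described above.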

2.4.2 Opening and Closing

As we have seen, dilation expands an image and erosion shrinks it. This section introduces two other important morphological operations built from dilation and erosion: opening and closing [33]. Generally, opening smooths the contour of an object, breaks narrow isthmuses, and eliminates thin protrusions.

Closing, as opposed to opening, fuses narrow breaks and long thin gulfs, eliminates small holes, and fills gaps in the contour, though it also tends to smooth sections of the contour.

The opening operation is the dilation of the erosion of a set A by a structuring element B, defined as

A ∘ B = (A ⊖ B) ⊕ B (2.21)

where ⊖ and ⊕ denote the erosion and dilation operations, respectively. For example, Fig-2.10 gives a simple geometric interpretation of the opening operation. The region of A ∘ B consists of the points of A that can be covered by the structuring element B as B is translated so that it fits entirely inside A. In fact, the opening operation can also be expressed as

A ∘ B = ∪{ (B)_z | (B)_z ⊆ A } (2.22)

where ∪{·} denotes the union of all the sets inside the braces.

Fig-2.10 Example of opening operation

Similarly, the closing operation is simply the erosion of the dilation of a set A by a structuring element B, defined as

A • B = (A ⊕ B) ⊖ B (2.23)

where ⊕ and ⊖ denote the dilation and erosion operations, respectively. Fig-2.11 is also a geometric example illustrating the closing operation; in Fig-2.11, B is moved along the outside of the boundary of A. Moreover, opening and closing are duals of each other, so it is not unexpected that the closing operation moves the structuring element along the outside of the boundary.

Fig-2.11 Example of closing operation
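Since opening and closing are compositions of erosion and dilation, (2.21) and (2.23) can be sketched directly. The set-of-pixels representation and the vertical 3x1 structuring element (symmetric, so its reflection is itself and can be omitted) are illustrative assumptions.

```python
# Opening and closing built from erosion and dilation, per (2.21) and (2.23),
# with images represented as sets of (row, col) pixels.

def dilate(A, B):
    """A ⊕ B for a symmetric structuring element B (reflection omitted)."""
    return {(r + dr, c + dc) for (r, c) in A for (dr, dc) in B}

def erode(A, B):
    """A ⊖ B: keep z only if B translated to z lies entirely within A."""
    return {(r, c) for (r, c) in A
            if all((r + dr, c + dc) in A for (dr, dc) in B)}

def opening(A, B):
    """(2.21): A ∘ B = (A ⊖ B) ⊕ B."""
    return dilate(erode(A, B), B)

def closing(A, B):
    """(2.23): A • B = (A ⊕ B) ⊖ B."""
    return erode(dilate(A, B), B)

B = {(-1, 0), (0, 0), (1, 0)}   # vertical 3x1 structuring element

# A solid 3x3 block plus a thin one-pixel protrusion at (0, 5).
A = {(r, c) for r in (1, 2, 3) for c in (1, 2, 3)} | {(0, 5)}
```

Opening A removes the thin protrusion while leaving the 3x3 block intact, exactly the "eliminates thin protrusions" behaviour described above, while closing never removes points of A.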

Chapter 3

Human Detection Algorithm

The human detection system is implemented in four main steps, as shown in Fig-3.1,

