
Chapter 1 Introduction

1.2 System Overview

1.2.2 Software Architecture

For the software architecture, Fig-1.2 shows the flowchart of the proposed system. At first, the system receives depth images from the Kinect and then selects the regions of interest (ROIs) based on the depth information. ROI selection can be separated into two steps: histogram projection and connected component labeling (CCL). After ROI selection, the size of each ROI is normalized based on the distance between the object and the camera. Then, edge detection and distance transformation are applied to extract human shape features. At the final stage, the overall features are delivered to the human recognition system to judge whether each ROI contains a human or not. The experimental environment is our laboratory, and the Kinect camera is mounted at a height of about 100 cm. Moreover, there are two limitations when implementing the human detection system: first, the detection distance is between 1 m and 4 m because of the hardware limitation of the Kinect; second, the human detection system focuses on detecting standing or walking people only.

Fig-1.1 Example of depth image

The remainder of this thesis is organized as follows. Chapter 2 describes the works related to the system. Chapter 3 introduces the proposed human detection system in detail. Chapter 4 shows the experimental results. Chapter 5 concludes the thesis and discusses future work.

(Flowchart: Input Depth Image → Histogram Projection and Connected Component Labeling (Step I: ROI selection) → Normalization, Edge Detection and Distance Transformation (Step II: Feature extraction) → Human Recognition System (Step III: Human recognition))

Fig-1.2 Software architecture

Chapter 2

Related Works

2.1 Human Detection Methods

In recent years, many human detection approaches have been proposed. In general, the overall process of human detection could be roughly separated into three main steps: foreground segmentation, feature extraction and human recognition.

2.1.1 Foreground Segmentation

In order to reduce the computational cost, foreground segmentation is required to filter out background regions and segment the region of interest (ROI). There are various methods for foreground segmentation. Some are based on 2-D information, such as the optical flow method, background subtraction, etc. Optical flow [12-14] reflects the image changes due to motion during a time interval, and the optical flow field is the velocity field that represents the three-dimensional motion of foreground points across a two-dimensional image. It is accurate at detecting the interesting foreground regions, but it is computationally complex and hard to realize in real time.

Background subtraction [15-18] is the most common method for segmenting foreground regions in sequences of images. This method has to build an initial background model in order to subtract the background image from the current image and obtain the foreground regions. With this method, the detected foreground regions are quite complete and the computational cost is low. However, this method cannot be used in the presence of camera motion, and the background model must be updated continuously because of illumination changes or a changeable background.

Other methods are based on some types of additional information, such as infrared images [9] or depth images [5-9, 19]. The use of depth images for human detection has some distinct advantages over conventional techniques. First, it is robust to illumination changes and the influence of distance. Second, it can deal with occlusion problems efficiently. Third, it is suitable for a moving camera because no background modeling is required. Based on the depth information, foreground segmentation can be implemented by finding the vertical distribution of objects in the 3-D space, because a human generally presents vertically. However, implementing stereo vision requires more than one camera and often has a distance limitation.

2.1.2 Feature Extraction

Once the foreground regions are detected, different combinations of features and classifiers can be applied to make the distinction between human and non-human. The objective of feature extraction is to extract human-related features to increase the detection rate, and there are many kinds of features which can be used to recognize human beings. The first kind of feature is based on gradient computation, like edges [7-9], the histogram of oriented gradients (HOG) [1, 20], Haar-like features [21], etc. The gradient computation aims at identifying points where the brightness changes sharply or discontinuously in a digital image. Therefore, the boundaries of objects and the shape information of humans can be found and extracted based on gradient computation.

Fig-2.1 shows examples of Haar-like features. The second kind of feature is motion-based features [8, 21]. Because a human, especially a walking human, exhibits periodic motion, the human can be distinguished from other objects based on this periodicity. Other features, like texture [7], skeleton [10], SIFT [22], etc., are also often used in human detection. However, because of the high variation of human appearance, it is common to use more than one kind of feature to implement human detection.

2.1.3 Human Recognition

After feature extraction, the system has to distinguish humans from other objects based on the set of features. Many approaches use machine learning techniques to recognize humans, including the support vector machine (SVM) [1, 16], the artificial neural network (ANN) [9, 23, 24], AdaBoost [2, 21], etc. The main advantages of machine learning are its tolerance of variation and its learning ability.

However, it needs many training samples to make the system learn how to judge human and non-human. The support vector machine is a powerful tool for solving pattern recognition problems; it can determine the best discriminant support vectors for human detection. Similarly, the artificial neural network has been applied successfully to pattern recognition and image analysis. An ANN uses a large number of training samples to make the network capable of judging human and non-human. AdaBoost is used to construct a classifier based on a weighted linear combination of selected features, which yields the lowest error on a training set consisting of human and non-human samples.

Besides machine learning, the technique of template matching [3-6, 25, 26] is also widely used in human detection. It is easy to implement and has a low computational cost, but its variation tolerance is less than that of machine learning. In [5, 6],

Fig-2.1 Examples of Haar-like features

the system first uses a head template to find possible human candidates, because the variation of the human head is much less than that of other body parts. Then, other features are used to further judge whether the candidates are human or not. In [4], the system combines a large number of human poses into a “template tree,” and similar poses are grouped together. Therefore, it has more variation tolerance and still has a low computational cost because of its tree structure. However, the process of collecting human poses and determining the similarities between different poses is time-consuming and difficult.

The methods introduced above directly detect the whole human shape. However, this kind of method has to deal with high variation and can hardly handle the occlusion problem. Therefore, the component-based concept [2, 3, 25-27] has been proposed to achieve a higher detection rate and resolve the occlusion problems. This kind of approach attempts to break down the whole human shape into manageable subparts. In other words, the whole human shape is represented as a combination of body parts. Therefore, the system does not have to directly detect the whole human shape, and it can use component-based detectors to detect different parts of the body. There are some advantages of component-based detection methods. First, the variation of human appearance can be highly reduced. Second, they can deal with partial occlusion. However, they might incur more computational cost and affect the detection speed.

2.2 Introduction to ANNs

The human nervous system consists of a large number of neurons. Each neuron is composed of four parts, namely the soma, axon, dendrites and synapses, and is capable of receiving, processing, and passing signals from one neuron to another. To mimic the characteristics of the human nervous system, investigators have developed an intelligent algorithm, called the artificial neural network, or ANN in brief.

Through proper learning processes, ANNs have been successfully applied to some complicated problems, such as image analysis, speech recognition, adaptive control, etc. In this thesis, the ANNs will be adopted to implement human detection via intelligent learning algorithms.

Fig-2.2 shows the basic structure of a neuron, whose input-output relationship is described by the function

y = f( Σ_i w_i x_i + b )

where x_i are the inputs, w_i the corresponding weights, b the bias and f(·) the activation function. There are three common activation functions, including the linear function, the log-sigmoid function and the tan-sigmoid function, which are described as below:

(1) Linear function: f(n) = n

(2) Log-sigmoid function: f(n) = 1 / (1 + e^{-n})

(3) Tan-sigmoid function: f(n) = (e^{n} - e^{-n}) / (e^{n} + e^{-n})

In detail, each input x_i is multiplied by a corresponding weight w_i, and the sum of weighted inputs is delivered to the activation function to determine the activation level of the neuron.

A general multilayer feed-forward network is composed of one input layer, one output layer, and one or more hidden layers. For example, Fig-2.3 shows a neural network with one input layer, one output layer and two hidden layers. Each layer is formed by neurons whose basic structure is depicted in Fig-2.2. The input layer receives signals from the outside world and then delivers its responses layer by layer. From the output layer, the overall response of the network can be obtained. As expected, a neural network with multiple hidden layers is able to deal with more complicated problems than one with a single hidden layer. Correspondingly, the training process of a network with multiple hidden layers may be more tedious.

Fig-2.2 Basic structure of a neuron
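To make the neuron model above concrete, the following is a minimal Python sketch (the thesis itself contains no code, and names such as neuron_output are illustrative only). It computes the weighted sum of the inputs plus a bias and passes it through one of the three activation functions listed above.

```python
import numpy as np

def linear(n):
    """Linear activation: f(n) = n."""
    return n

def log_sigmoid(n):
    """Log-sigmoid activation: f(n) = 1 / (1 + exp(-n))."""
    return 1.0 / (1.0 + np.exp(-n))

def tan_sigmoid(n):
    """Tan-sigmoid activation: f(n) = (e^n - e^-n) / (e^n + e^-n)."""
    return np.tanh(n)

def neuron_output(x, w, b, activation=log_sigmoid):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return activation(np.dot(w, x) + b)

# Example: a single neuron with three inputs
x = np.array([0.5, -1.0, 2.0])   # inputs x_i
w = np.array([0.8, 0.2, -0.5])   # weights w_i
b = 0.1                          # bias
print(neuron_output(x, w, b))
```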

In addition to the structure, it is necessary to determine the way a neural network is trained. Generally, training can be separated into two kinds of learning processes, supervised and unsupervised. The main difference between them is whether a set of target outputs is given or not. Training via supervised learning maps a given set of inputs to a specified set of target outputs; the weights are then adjusted according to a pre-assigned learning algorithm. On the other hand, unsupervised learning can self-organize a neural network without any target outputs and modify the weights so that the most similar inputs are assigned to the same group. In this thesis, the neural network is designed for image recognition based on supervised learning, and thus both the input and target images are required.

Fig-2.3 Multilayer feed-forward network

2.3 Back-Propagation Network

In supervised learning, the back-propagation algorithm, BP algorithm in brief, is a common method for training artificial neural networks to perform a given task. The BP algorithm was proposed in 1986 by Rumelhart, Hinton and Williams, and is based on the gradient steepest-descent method, which updates the weights to minimize the total squared error of the output. To explain the BP algorithm clearly, a neural network with one hidden layer is given and shown in Fig-2.4. Let the inputs be x_i, i = 1, 2, …, I, and the outputs be y_j, j = 1, 2, …, J, where I and J are respectively the total numbers of input and output neurons. The hidden layer with K hidden neurons receives information from the input layer and sends out its response to the output layer.

These three layers are connected by two sets of weights, v_{ik} and w_{kj}, where v_{ik} connects the i-th input node to the k-th hidden node, and w_{kj} further connects the k-th hidden node to the j-th output node.


Fig-2.4 Neural network with one hidden layer

Based on the neural network in Fig-2.4, the BP algorithm for supervised learning is generally processed step by step as below:

Step 1: Set the maximum tolerable error E_max and the learning rate η, between 0.1 and 1.0, to reduce the computing time or increase the precision.

Step 2: Set the initial weight and bias value of the network randomly.

Step 3: Input the training data, 𝑥 = [ 𝑥1 𝑥2 ⋯ 𝑥𝐼 ]𝑇 and the desired output data 𝑑 = [ 𝑑1 𝑑2 ⋯ 𝑑𝐽 ]𝑇.

Step 4: Calculate the output of each of the K neurons in the hidden layer and the corresponding network outputs

z_k = f( Σ_{i=1}^{I} v_{ik} x_i + θ_k ),  k = 1, 2, …, K

y_j = f( Σ_{k=1}^{K} w_{kj} z_k + θ_j ),  j = 1, 2, …, J

where f(·) is the activation function and θ denotes the bias of each neuron.

Step 5: Calculate the following error function

E = (1/2) Σ_{j=1}^{J} (d_j - y_j)²

Step 6: According to the gradient descent method, determine the corrections of the weights as below:

Δw_{kj} = -η ∂E/∂w_{kj},  Δv_{ik} = -η ∂E/∂v_{ik}

Step 7: Propagate the corrections backward to update the weights as below:

w_{kj}(n+1) = w_{kj}(n) + Δw_{kj},  v_{ik}(n+1) = v_{ik}(n) + Δv_{ik}

Step 8: Check the next training data. If it exists, then go to Step 3, otherwise, go to Step 9.

Step 9: Check whether the network converges or not. If E < E_max, terminate the training process; otherwise, begin another learning cycle by going to Step 1.

The BP learning algorithm can be used to model various complicated nonlinear functions. In recent years, the BP learning algorithm has been successfully applied in many domains, such as pattern recognition, adaptive control, clustering problems, etc. In this thesis, the BP algorithm is used to learn the input-output relationship for the clustering problem.
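As an illustration of Steps 1-9, the following is a minimal Python/NumPy sketch of BP training for a network with one hidden layer. It is only a sketch under assumptions not fixed by the thesis: the log-sigmoid activation is used in both layers, the learning rate is set to 0.5, and all function and array names are invented for this example.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def train_bp(X, D, K, eta=0.5, E_max=1e-3, max_epochs=10000):
    """Back-propagation for one hidden layer.
    X: inputs, shape (samples, I); D: desired outputs, shape (samples, J); K: hidden size."""
    I, J = X.shape[1], D.shape[1]
    rng = np.random.default_rng(0)
    V, b_v = rng.uniform(-1, 1, (I, K)), rng.uniform(-1, 1, K)     # input -> hidden (Step 2)
    W, b_w = rng.uniform(-1, 1, (K, J)), rng.uniform(-1, 1, J)     # hidden -> output (Step 2)
    for _ in range(max_epochs):
        E = 0.0
        for x, d in zip(X, D):                                     # Steps 3 and 8
            z = sigmoid(x @ V + b_v)                               # hidden outputs (Step 4)
            y = sigmoid(z @ W + b_w)                               # network outputs (Step 4)
            E += 0.5 * np.sum((d - y) ** 2)                        # squared error (Step 5)
            delta_y = (d - y) * y * (1 - y)                        # output error term (Step 6)
            delta_z = (W @ delta_y) * z * (1 - z)                  # back-propagated term (Step 6)
            W += eta * np.outer(z, delta_y); b_w += eta * delta_y  # weight update (Step 7)
            V += eta * np.outer(x, delta_z); b_v += eta * delta_z  # weight update (Step 7)
        if E < E_max:                                              # convergence check (Step 9)
            break
    return V, b_v, W, b_w

# Example: learn the XOR mapping with K = 4 hidden neurons
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
V, b_v, W, b_w = train_bp(X, D, K=4)
print(sigmoid(sigmoid(X @ V + b_v) @ W + b_w).round(2))
```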

2.4 Morphology Operations

There are two common morphology operations in image processing, called dilation and erosion [28-30], which are related to the reflection and translation of a set A in the 2-D integer space 𝑍2. The reflection of set A about its origin is defined as

Â = { â | â = -a, for a ∈ A }    (2.11)

and the translation of set A by z is defined as

(A)_z = { a_z | a_z = a + z, for a ∈ A }    (2.12)

where all the points in set A are moved by z = (z_1, z_2).

The dilation and erosion operations are often used to repair gaps and eliminate noise regions, respectively. The dilation of A by B is defined as

A ⊕ B = { z | (B̂)_z ∩ A ≠ ∅ }    (2.13)

where A and B are two sets in Z². The dilation operation (2.13) results in the set of all displacements z such that the reflected set B̂, translated by z, overlaps A in at least one element. Take Fig-2.5 as an example, where the elements of A and B are shown shaded and the background is white. The shaded area in Fig-2.5(c) is the result of the dilation of Fig-2.5(a) by Fig-2.5(b). Through the dilation operation, the objects in the image grow or thicken, so dilation can repair gaps. Similarly, the shaded area in Fig-2.5(e) is the result of dilating Fig-2.5(a) by Fig-2.5(d). Comparing Fig-2.5(c) and Fig-2.5(e), we find that when the mask becomes larger, the dilated area also extends further.

The opposite of dilation is known as the erosion. For sets A and B in 𝑍2, the erosion of A by B is defined as

𝐴 ⊖ 𝐵 = {𝑧|(𝐵)𝑧 ⊆ 𝐴} (2.14)

which results in the set of all points z such that B, after translated by z, is contained in A. Unlike dilation, which is a thickening operation, erosion shrinks objects in the image. Fig-2.6 shows how erosion works. The shaded area in Fig-2.6(c) is the result of the erosion between Fig-2.6(a) and Fig-2.6(b). Similarly, Fig-2.6(e) shows the erosion of Fig-2.6(a) by Fig-2.6(d).

Fig-2.6 Examples of erosion: (a) set A, (b) structuring element B, (c) A ⊖ B, (d) structuring element C, (e) A ⊖ C
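To illustrate definitions (2.13) and (2.14), the sketch below implements binary dilation and erosion directly on small NumPy arrays. This is only an illustrative implementation (the thesis provides no code); in practice, library routines such as scipy.ndimage.binary_dilation and binary_erosion perform the same operations far more efficiently.

```python
import numpy as np

def dilate(A, B):
    """A ⊕ B: z is in the result if the reflected B, translated to z, overlaps A, see (2.13)."""
    br, bc = B.shape
    B_hat = B[::-1, ::-1]                                    # reflection of B, see (2.11)
    pad = np.pad(A, ((br // 2, br // 2), (bc // 2, bc // 2)))
    out = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            out[i, j] = np.any(pad[i:i + br, j:j + bc] & B_hat)   # at least one overlapping element
    return out

def erode(A, B):
    """A ⊖ B: z is in the result if B, translated to z, is entirely contained in A, see (2.14)."""
    br, bc = B.shape
    pad = np.pad(A, ((br // 2, br // 2), (bc // 2, bc // 2)))
    out = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            out[i, j] = np.all(pad[i:i + br, j:j + bc][B == 1] == 1)  # B must fit inside A
    return out

A = np.zeros((7, 7), dtype=np.uint8)
A[2:5, 2:5] = 1                                              # a 3x3 square object
B = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=np.uint8)  # cross-shaped mask
print(dilate(A, B))   # the object grows, so gaps would be repaired
print(erode(A, B))    # the object shrinks, so small noise regions would disappear
```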

Chapter 3

Intelligent Human Detection

The intelligent human detection is implemented in three main steps as shown in Fig-3.1, including region-of-interest (ROI) selection, feature extraction and human recognition. The system uses depth images generated by Kinect as input and then selects the ROIs based on the histogram projection and connected component labeling.

Further, the ROI is normalized and then processed by edge detection and distance transformation to extract the necessary features. Finally, the overall feature set is delivered to the human recognition system to obtain the results.

(Flowchart: Input Depth Image → Histogram Projection and Connected Component Labeling (Step I: ROI selection) → Normalization, Edge Detection and Distance Transformation (Step II: Feature extraction) → Human Recognition System (Step III: Human recognition))

Fig-3.1 Flowchart of the intelligent human detection system
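The Python skeleton below sketches how the three steps of Fig-3.1 could be chained in code. Every function in it is a placeholder invented for this sketch (select_rois, extract_features, recognize_human, detect_humans); the actual processing inside each step is described in the following sections.

```python
import numpy as np

def select_rois(depth_image):
    """Step I placeholder: histogram projection + CCL would return candidate boxes
    (row, col, height, width); here the whole frame is returned as a stub."""
    return [(0, 0, depth_image.shape[0], depth_image.shape[1])]

def extract_features(depth_image, roi):
    """Step II placeholder: the real system normalizes the ROI, then applies edge
    detection and the distance transformation; here raw intensities are used as a stub."""
    r, c, h, w = roi
    return depth_image[r:r + h, c:c + w].astype(float).ravel() / 255.0

def recognize_human(features):
    """Step III placeholder: the real system feeds the feature set to a trained ANN."""
    return bool(features.mean() > 0.1)        # stub decision rule

def detect_humans(depth_image):
    """Chain the three steps of Fig-3.1 for one 320x240 depth image."""
    return [(roi, recognize_human(extract_features(depth_image, roi)))
            for roi in select_rois(depth_image)]

frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)   # synthetic depth frame
print(detect_humans(frame))
```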

Fig-3.2(a) shows an example of the depth image generated by Kinect, which contains 320×240 pixels with intensity values normalized to the range 0-255. The intensity value indicates the distance between the object and the camera; a lower intensity value implies a smaller distance. In addition, any point whose depth the sensor cannot measure is set to 0 and appears as a dark area. Some small dark areas result from noise, which is undesirable and can be repaired by the dilation operation. Fig-3.2(b) shows that the small dark areas can be filled through the dilation operation.

Fig-3.2 (a) Example of the depth image generated by Kinect (b) The image after dilation operation
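Since the thesis does not specify how the dilation of Fig-3.2(b) is implemented, the following sketch shows one possible way to repair the small zero-valued (unmeasured) areas with a grayscale dilation using scipy.ndimage; the 3×3 neighbourhood size is an assumption made only for this sketch.

```python
import numpy as np
from scipy import ndimage

def repair_depth_image(depth_image, size=3):
    """Grayscale dilation: every pixel takes the maximum of its neighbourhood,
    so small zero-valued holes are filled by the surrounding depth values."""
    return ndimage.grey_dilation(depth_image, size=(size, size))

# Example: a synthetic 240x320 depth frame with a small unmeasured hole
frame = np.full((240, 320), 120, dtype=np.uint8)
frame[100:102, 50:52] = 0
print(np.count_nonzero(repair_depth_image(frame) == 0))   # -> 0, the hole is filled
```

Note that a plain grayscale dilation also slightly thickens the brighter (farther) regions; the sketch is only meant to show how the small holes of Fig-3.2(a) can be removed.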

3.1 ROI Selection

In general, a standing or walking human presents vertically. In other words, the height of a human in the depth image must exceed a certain value, given as a threshold. Based on this threshold, the system implements ROI selection with histogram projection and connected component labeling (CCL) to increase the speed and the detection rate. Accordingly, the system generates the rough distribution in the 3-D space by histogram projection and locates potential human regions by CCL.


3.1.1 Histogram Projection

Based on the information of the depth image, the system implements the histogram projection in three steps, which are introduced as follows:

Step 1:

The system computes the histogram of every column in depth image with intensity levels in the range [0, 255]. Let the histogram of the i-th column be

h_i = [ h_{0,i}  h_{1,i}  ⋯  h_{255,i} ]^T,  i = 1, 2, …, 320    (3.1)

where h_{k,i} is the number of pixels with intensity k in the i-th column. Then, define the histogram image as

𝑯 = [𝒉1 𝒉2 𝒉3 … 𝒉320] (3.2)

with size 256×320, which can be expressed in detail as

H = [ h_{0,1}    h_{0,2}    ⋯  h_{0,320}
      h_{1,1}    h_{1,2}    ⋯  h_{1,320}
        ⋮          ⋮        ⋱     ⋮
      h_{255,1}  h_{255,2}  ⋯  h_{255,320} ]    (3.3)

Note that the value of h_{k,i} can be seen as the vertical distribution at a specific position in the real world. Take Fig-3.2(b) as an example; the result of the histogram computation is shown in Fig-3.3. Unfortunately, there is a large number of pixels with intensity k = 0, that is, the first row of H contains large values of h_{0,i}. As a result, an undesired “wall” formed by h_{0,i} blocks the other objects, as shown in Fig-3.3.
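A minimal NumPy sketch of this column-wise histogram computation, following (3.1)-(3.3), could look as follows (the array names are illustrative only):

```python
import numpy as np

def histogram_image(depth_image):
    """Build H with H[k, i] = number of pixels of intensity k in column i, see (3.1)-(3.3)."""
    cols = depth_image.shape[1]                       # 320 columns for the Kinect image
    H = np.zeros((256, cols), dtype=np.int32)
    for i in range(cols):
        H[:, i] = np.bincount(depth_image[:, i], minlength=256)
    return H

frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)   # synthetic depth image
H = histogram_image(frame)
print(H.shape, H[:, 0].sum())                         # (256, 320); each column sums to 240
```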

Step 2:

After the histogram computation, the result has to be further processed to filter out unnecessary information. Since the detection distance is from 1 m to 4 m, the corresponding intensity range is [40, 240], and the components h_{k,i} of H are rectified as

h_{k,i} = { h_{k,i},  k = 40, 41, …, 240
            0,        otherwise                        (3.4)

Clearly, the components of H in the first 40 rows and the last 15 rows are all set to 0, which implies that the unwanted background is also filtered out because the related intensity is presented in the first row of H. The rectified result of Fig-3.3 is shown in Fig-3.4, where the histogram value h_{k,i} can be treated as the vertical distribution of the objects at coordinate (i, k) in the real world. Comparing Fig-3.2(b) with Fig-3.4, it is obvious that there are four objects, which are a wall, a human, a chair and a shelf from left to right. Consequently, if the height of an object in the image is above a threshold, it has a clear shape in the histogram image.

Fig-3.3 Result of histogram computing of Fig-3.2(b)

Step 3:

The top-view image of the 3-D distribution in the (i, k) coordinates is shown in Fig-3.5. If an object has a higher vertical distribution, it has a larger intensity in the top-view image. Afterwards, a dilation operation is applied to enhance the interior connection of each object, as shown in Fig-3.6(a). Finally, define the ROI image R as

R(k+1, i) = { 1,  h_{k,i} > M
              0,  h_{k,i} ≤ M                          (3.5)

with size 256×320, where M is a given threshold value. Therefore, the component h_{k,i} of H belongs to the ROI when it exceeds M. The final result of the histogram projection is shown in Fig-3.6(b).
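Continuing the sketch from Step 1, the rectification (3.4), the dilation of the top-view image and the thresholding (3.5) could be written as below; the threshold M = 20 and the 3×3 dilation mask are arbitrary values chosen only for illustration, and the row offset k+1 of (3.5) is ignored.

```python
import numpy as np
from scipy import ndimage

def roi_image(H, M=20):
    """Keep the 1 m - 4 m intensity range, enhance connectivity, and threshold into R."""
    H_rect = np.zeros_like(H)
    H_rect[40:241, :] = H[40:241, :]                       # rectification (3.4)
    H_rect = ndimage.grey_dilation(H_rect, size=(3, 3))    # enhance interior connection
    return (H_rect > M).astype(np.uint8)                   # thresholding (3.5)

# Usage, together with histogram_image() from the sketch in Step 1:
# R = roi_image(histogram_image(depth_image))
```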

Fig-3.4 Filtered result of Fig-3.3

3.1.2 Connected Component Labeling

Connected component labeling (CCL) [31] is a technique for identifying different components and is often used in computer vision to detect 4-connected or 8-connected regions in binary digital images. This thesis applies 4-connected component labeling to label the interesting regions.


Fig-3.5 Example of top-view image


Fig-3.6 (a) Top-view image after dilation operation (b) The ROI image

The 4-pixel CCL algorithm can be partitioned into two processes, labeling and componentizing. The input is a binary image like Fig-3.8(a). During the labeling, the image is scanned pixel by pixel, from left to right and top to bottom as shown in Fig-3.7, where p is the pixel being processed, and r and t are respectively the upper and left pixels.

Fig-3.7 Scanning the image.

Defined v( ) and l( ) as the binary value and the label of a pixel. N is a counter and its initial value is set to 1. If v(p)=0, then move on to next pixel, otherwise, i.e., v(p)=1, the label l(p) is determined by following rules:

R1. For v(r)=0 and v(t)=0, assign N to l(p) and then N is increased by 1.

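In practice, the 4-connected labeling described in this section can be reproduced with scipy.ndimage.label by passing a cross-shaped structuring element; the library-based call below is shown only as a sketch and is not the two-pass procedure itself.

```python
import numpy as np
from scipy import ndimage

# 4-connectivity: only the upper, lower, left and right neighbours are connected
four_connectivity = np.array([[0, 1, 0],
                              [1, 1, 1],
                              [0, 1, 0]])

R = np.array([[1, 1, 0, 0, 1],
              [0, 1, 0, 1, 1],
              [0, 0, 0, 0, 1]], dtype=np.uint8)        # a small binary ROI image

labels, num_components = ndimage.label(R, structure=four_connectivity)
print(num_components)     # 2 connected components
print(labels)             # every component receives its own label; background stays 0
```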
