
Chapter 2 Related Works

2.3 Support Vector Machine

In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes the output belongs to, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

As a graphical illustration, Fig-2.5 shows a linearly separable set of 2D-points, each belonging to one of two classes; the task is then to find a separating straight line.

Fig-2.5 a set of 2D-points which belong to one of two classes

As mentioned above, multiple separating straight lines offer a solution to the problem, as shown in Fig-2.6. However, a line is a poor choice if it passes too close to the points, because it is sensitive to noise and will not generalize correctly. Therefore, the SVM algorithm should find the line passing as far as possible from all points.

Fig-2.6 multiple straight lines separate the set

As illustrated in Fig-2.7, the operation of the SVM algorithm is based on finding the hyperplane that gives the largest minimum distance (maximum margin) to the training examples. Therefore, the optimal separating hyperplane maximizes the margin of the training data.

Fig-2.7 the optimal separating hyperplane with maximum margin

As for a hyperplane, the notation is formally defined as

f(x) = β0 + βᵀx (2.13)

where β is known as the weight vector and β0 as the bias. The optimal hyperplane can be represented in an infinite number of different ways by scaling β and β0. By convention, among all the possible representations of the hyperplane, the one chosen is

|β0 + βᵀx| = 1 (2.14)

where x symbolizes the training examples closest to the hyperplane. In general, the training examples closest to the hyperplane are called support vectors, and this representation is known as the canonical hyperplane. From geometry, the distance between a point x and a hyperplane (β, β0) is given by

distance = |β0 + βᵀx| / ||β|| (2.15)

According to (2.14), for the canonical hyperplane the numerator is equal to one, and the distance to the support vectors is

distance = |β0 + βᵀx| / ||β|| = 1 / ||β|| (2.16)

Recall that the margin introduced in Fig-2.7, here denoted as M, is twice the distance to the closest examples:

M = 2 / ||β|| (2.17)

Finally, the problem of maximizing M is equivalent to minimizing a function L(β) subject to some constraints. The constraints model the requirement that the hyperplane classify all the training examples xi correctly, which can be formally described as

min L(β) = ½||β||²  subject to  yi(βᵀxi + β0) ≥ 1 for all i (2.18)

where yi represents the label of each training example. (2.18) is a Lagrangian optimization problem [32] and can be solved using Lagrange multipliers [32] to obtain the weight vector and the bias of the optimal hyperplane. This example deals only with lines and points in the Cartesian plane rather than hyperplanes and vectors in a high-dimensional space, but the same concepts apply to tasks where the examples to classify lie in a space of dimension higher than two.
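As a toy illustration of (2.13)-(2.18), the following sketch fits a linear SVM to six made-up 2D-points and reads back the margin. The use of scikit-learn and the data points are assumptions of this example, not part of the thesis.

    # A toy illustration of (2.13)-(2.18): fit a (near) hard-margin linear SVM.
    # scikit-learn is an assumed dependency; the data are invented for the example.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],   # class -1
                  [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])  # class +1
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)  # a very large C approximates the hard margin of (2.18)
    clf.fit(X, y)

    beta = clf.coef_[0]        # weight vector beta
    beta0 = clf.intercept_[0]  # bias beta0
    print("support vectors:", clf.support_vectors_)
    print("margin M =", 2.0 / np.linalg.norm(beta))  # M = 2 / ||beta||, as in (2.17)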

2.4 Morphological Operations

The word morphology commonly denotes a branch of biology that deals with the form and structure of animals and plants. In image processing, mathematical morphology is a tool for extracting image components that are useful in the representation and description of region shape, such as boundaries, skeletons, and the convex hull.

2.4.1 Dilation and Erosion

This section introduces two fundamental morphological operations, dilation and erosion [33]; many morphological algorithms are based on these two primitive operations. The dilation operation is often used to repair gaps and is defined as

A ⊕ B = { z | (B̂)z ∩ A ≠ φ } (2.19)

where A and B are sets in Z², B̂ denotes the reflection of the set B, and (B̂)z is the translation of B̂ by z. The result of the dilation in (2.19) is the set of all displacements z such that (B̂)z and A overlap by at least one element. For example, Fig-2.8(a) shows a simple set A, and Fig-2.8(b) and Fig-2.8(c) are two structuring elements. Fig-2.8(d) is the dilation of A using the symmetric element B, shown shaded. Similarly, Fig-2.8(e) is the dilation of A using the element C, shown shaded. In Fig-2.8(d) and Fig-2.8(e), the dashed line shows the original set A, and all points inside the boundary constitute the dilation of A. Thus, the dilated region extends further when the structuring element is larger.

Fig-2.8 Example of dilation operation

Conversely, the erosion operation is often used to remove noise and irregular regions, and is defined as

A Θ B = { z | (B)z ⊆ A } (2.20)

where A and B are sets in Z² and (B)z is the translation of the set B by z. Unlike dilation, which is an expanding operation, erosion shrinks objects in the image. Fig-2.9 shows a process similar to that in Fig-2.8: Fig-2.9(a) shows a simple set A, Fig-2.9(b) and Fig-2.9(c) are two structuring elements, and Fig-2.9(d) and Fig-2.9(e) are the erosions of A using the elements B and C, respectively. From Fig-2.9(e) we can see that if the structuring element is too large, the result of erosion will be almost empty.

Fig-2.9 Example of erosion operation
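As a concrete sketch of (2.19) and (2.20), the snippet below dilates and erodes a small binary set with a 3×3 structuring element. OpenCV and the toy set are assumptions of the example; the thesis does not prescribe an implementation.

    # A minimal sketch of dilation (2.19) and erosion (2.20) on a binary set.
    # OpenCV is an assumed library; A and B are toy stand-ins for Fig-2.8/2.9.
    import numpy as np
    import cv2

    A = np.zeros((9, 9), dtype=np.uint8)
    A[3:6, 2:7] = 1                      # a simple rectangular set A
    B = np.ones((3, 3), dtype=np.uint8)  # a 3x3 square structuring element B

    dilated = cv2.dilate(A, B)  # A dilated by B: expands A and repairs gaps
    eroded = cv2.erode(A, B)    # A eroded by B: shrinks A and removes small regions
    print(A.sum(), dilated.sum(), eroded.sum())  # dilation grows the set, erosion shrinks it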

2.4.2 Opening and Closing

As we have seen, dilation expands an image and erosion shrinks it. This section introduces two other important morphological operations based on dilation and erosion: opening and closing [33]. Generally, opening smooths the contour of an object, breaks narrow isthmuses, and eliminates thin protrusions.

Closing, as opposed to opening, fuses narrow breaks and long thin gulfs, eliminates small holes, and fills gaps in the contour, though it also tends to smooth sections of contours.

The opening operation is the dilation of the erosion of a set A by a structuring element B, defined as

A ∘ B = (A Θ B) ⊕ B (2.21)

where Θ and ⊕ denote the erosion and dilation operations, respectively. Fig-2.10 gives a simple geometric interpretation of the opening operation: the region of A ∘ B consists of the points of A that can be reached by any point of B when B moves vertically or horizontally inside A. In fact, the opening operation can also be expressed as

A ∘ B = ∪{ (B)z | (B)z ⊆ A } (2.22)

where ∪{·} denotes the union of all the sets inside the braces.

Fig-2.10 Example of opening operation

Similarly, the closing operation is simply the erosion of the dilation of a set A by a structuring element B, defined as

A • B = (A ⊕ B) Θ B (2.23)

where ⊕ and Θ denote the dilation and erosion operations, respectively. Fig-2.11 gives a geometric example of the closing operation, in which B is moved along the outside of the boundary of A. It will be shown that opening and closing are duals of each other, so having to move the structuring element on the outside is not unexpected.

Fig-2.11 Example of closing operation
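The following sketch applies opening (2.21) and closing (2.23) to a toy binary set; OpenCV is again an assumed library choice, and the isolated speck and the small hole are contrived to show each operation's effect.

    # A minimal sketch of opening (2.21) and closing (2.23); OpenCV is an assumed
    # library choice, and the toy set A is invented for the example.
    import numpy as np
    import cv2

    A = np.zeros((12, 12), dtype=np.uint8)
    A[2:10, 2:10] = 1
    A[5, 5] = 0                          # a small hole that closing should fill
    A[0, 0] = 1                          # an isolated speck that opening should remove
    B = np.ones((3, 3), dtype=np.uint8)

    opened = cv2.morphologyEx(A, cv2.MORPH_OPEN, B)   # (A eroded by B) dilated by B
    closed = cv2.morphologyEx(A, cv2.MORPH_CLOSE, B)  # (A dilated by B) eroded by B
    print(opened[0, 0], closed[5, 5])  # 0 and 1: speck removed, hole filled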

Chapter 3 Human Detection Algorithm

The human detection system is implemented in four main steps, as shown in Fig-3.1: region-of-interest (ROI) selection, feature extraction, human shape recognition, and motionless human checking. The system obtains depth images and color images from the Kinect, and then selects the region of interest (ROI) through histogram projection, connected component labeling (CCL), and moving objects segmentation. After ROI selection, the ROI is normalized with bilinear interpolation no matter how large it is. Afterwards, the histogram of oriented gradients (HOG) is computed as the human feature descriptor. In the next step, the overall features are passed to the human shape recognition system to judge whether the ROIs contain a human. Finally, the system checks whether any motionless human exists in the current frame and regards it as a new ROI in the next frame.

Fig-3.1 flowchart of the human detection system

Fig-3.2(a) shows an example of the depth image generated by Kinect, which contains 320×240 pixels with intensity values normalized into the range 0-255. The intensity value indicates the distance between the object and the camera; a lower intensity value implies a smaller distance. Fig-3.2(b) shows an example of the color image, which also contains 320×240 pixels.


Fig-3.2 Example of the depth image and color image generated by Kinect

3.1 ROI Selection

In order to reduce the computational cost, foreground segmentation is required to filter out the background and obtain the region of interest (ROI), that is, a human moving in the image. The system detects moving objects from the temporal difference between consecutive frames, and based on this difference it applies histogram projection, connected component labeling (CCL), and moving object segmentation to select the ROI, thereby increasing the speed and detection rate.

3.1.1 Histogram Projection

Based on the 320×240 depth image, the system implements histogram projection in the following steps. First, the system computes the histogram of every column of the depth image, with intensity levels in the range [0, 255].

Let the histogram of the i-th column of the depth image be

hi = [c0,i c1,i c2,i … c255,i]ᵀ, i = 1, 2, …, 320 (3.1)

where ck,i is the number of pixels with intensity k in the i-th column. Then, define the histogram image as

H = [h1 h2 h3 … h320]

with size 320×256. Note that the histogram image H can be considered as the vertical distribution of objects or humans in the real world. However, there are a large number of pixels with intensity k = 0 due to the camera limitation: the camera cannot detect the depth of objects too near to it or of background too far from it.

Thus, the first row of H, c0 = [c0,1 c0,2 c0,3 … c0,320], contains large values of c0,i.

After obtaining the histogram image, it has to be further processed to filter out unnecessary information such as the first row c0 of H. Because the effective range of the camera is from 1.2 m to 3.6 m, the corresponding intensity range is [40, 240] and the value ck,i in H is adjusted as

ck,i = ck,i if 40 ≤ k ≤ 240, and ck,i = 0 otherwise (3.2)

Obviously, the values of H in the first 40 rows and the last 15 rows are all set to zero, which implies that objects near to the camera and background far from the camera are filtered out. Through this filtering, the histogram value ck,i can be treated as the vertical distribution of the objects at coordinate (i, k) in the real world.

Consequently, the histogram of depth intensities within a specified range, which may correspond to an object or a human, is connected and forms a clear shape in the histogram image.

Afterwards, the histogram image can be treated as a top-view image of the 3-D distribution in (i, k) coordinates. If an object has a higher vertical distribution, it has a larger intensity in the top-view image. In order to eliminate non-human objects with lower intensity, the histogram image H is further processed to obtain the binary ROI image R as

R(i, k) = 1 if ck,i ≥ M, and R(i, k) = 0 otherwise (3.3)

where the threshold M corresponds to a suitable human height. Then, the closing operation of morphology is applied to eliminate overly elongated objects and enhance the interior connection of each object. Hence, only the objects or humans with a certain height remain in the ROI image R.
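The projection and thresholding above can be sketched in a few lines of NumPy. The code below is an illustrative reconstruction: the random depth image stands in for Kinect data, and the threshold value M is assumed, since the thesis does not state a number.

    # An illustrative sketch of Section 3.1.1; NumPy is an assumed dependency,
    # the random depth image is a stand-in, and M is an assumed threshold value.
    import numpy as np

    depth = np.random.randint(0, 256, size=(240, 320)).astype(np.uint8)  # 320x240 depth image

    # (3.1): per-column intensity histograms h_i, stacked into the histogram
    # image H (256 intensity rows x 320 columns).
    H = np.stack([np.bincount(depth[:, i], minlength=256) for i in range(320)], axis=1)

    # (3.2): keep only the effective intensity range [40, 240].
    H[:40, :] = 0
    H[241:, :] = 0

    # (3.3): threshold by a suitable human height M to get the binary ROI image R.
    M = 30  # assumed value
    R = (H >= M).astype(np.uint8)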

3.1.2 Connected Component Labeling

Connected component labeling (CCL) [34] is a technique for identifying distinct components and is often used in computer vision to detect 4- or 8-connected regions in binary digital images. This thesis applies 4-pixel connectivity to label the ROIs in the image R.

The 4-pixel CCL algorithm can be partitioned into two processes, labeling and componentizing. The input is a binary image like Fig-3.4(a). During labeling, the image is scanned pixel by pixel, from left to right and top to bottom, as shown in Fig-3.3, where p is the pixel being processed, and r and t are the upper and left pixels, respectively.

Fig-3.3 The process of scanning image

Define v(·) and l(·) as the binary value and the label of a pixel, respectively, and let N be a counter with initial value 1. If v(p) = 0, the scan moves on to the next pixel; otherwise, i.e., v(p) = 1, the label l(p) is determined by the following rules:

Rule 1: If v(r) = 0 and v(t) = 0, assign N to l(p) and then increase N by 1.

Rule 2: If v(r) = 1 and v(t) = 0, assign l(r) to l(p).

Rule 3: If v(r) = 0 and v(t) = 1, assign l(t) to l(p).

Rule 4: If v(r) = 1 and v(t) = 1, assign the smaller of l(r) and l(t) to l(p).

For example, Fig-3.4(a) is changed into Fig-3.4(b) after the labeling process. Clearly, some connected components in Fig-3.4(b) contain pixels with different labels. Hence, the componentizing process must be further executed: it gathers all the pixels connected in one component and assigns them the same label, namely the smallest number among the labels in that component. Fig-3.4(c) is the result of applying componentizing to Fig-3.4(b).

Fig-3.4 Example of 4-pixel CCL: (a) binary image, (b) labeling, (c) componentizing.
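For reference, the same 4-connected labeling can be obtained with scipy.ndimage.label, used here as an assumed stand-in for the two-pass labeling and componentizing procedure described above; the input array is a toy example.

    # A minimal sketch of 4-connected component labeling; scipy.ndimage is an
    # assumed substitute for the two-pass procedure described in the text.
    import numpy as np
    from scipy import ndimage

    R = np.array([[1, 1, 0, 0, 1],
                  [0, 1, 0, 1, 1],
                  [0, 0, 0, 0, 0],
                  [1, 0, 1, 1, 0]], dtype=np.uint8)  # toy binary ROI image

    four_conn = np.array([[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]])  # 4-connectivity, matching Fig-3.3
    labels, num = ndimage.label(R, structure=four_conn)
    print(num)     # number of connected components
    print(labels)  # one label per component, as in Fig-3.4(c)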


3.1.3 Moving Objects Segmentation

Moving objects segmentation is based on the temporal difference between consecutive frames in the image sequence. The system subtracts the previous frame Graypre from the current frame Graycur to acquire their difference, and then applies a threshold T to create the binary image Graysub, defined as

Graysub(x, y) = 1 if |Graycur(x, y) − Graypre(x, y)| > T, and Graysub(x, y) = 0 otherwise (3.4)

where x and y denote the x-coordinate and y-coordinate, respectively.

Further, the binary image is processed by morphological operations in order to suppress the influence of illumination change, so that the moving objects are highlighted in the binary image. Taking Fig-3.5 as an example, Fig-3.5(a) is the previous frame and Fig-3.5(b) is the current frame; Fig-3.5(c) shows the difference between the previous frame and the current frame.


Fig-3.5 (a) Previous frame (b) Current frame (c) Difference of the frames

In this thesis, moving objects segmentation is implemented to narrow down the number of ROIs for less computational cost. More specifically, an ROI that includes moving pixels from the difference between the previous and current frames is considered a new ROI; otherwise, the region is not of interest. Thus, moving objects segmentation accelerates the system further.
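A minimal sketch of this segmentation step is given below. OpenCV is an assumed library choice, and both the threshold value and the use of closing as the cleanup operation are assumptions, since the thesis does not specify them.

    # A minimal sketch of Section 3.1.3; OpenCV is an assumed library, and the
    # threshold value and the choice of closing as cleanup are assumptions.
    import numpy as np
    import cv2

    def segment_moving(gray_pre, gray_cur, T=25):
        """Binary mask of moving pixels, as in (3.4); T is an assumed threshold."""
        diff = cv2.absdiff(gray_cur, gray_pre)  # |Gray_cur - Gray_pre|
        _, mask = cv2.threshold(diff, T, 1, cv2.THRESH_BINARY)
        kernel = np.ones((3, 3), np.uint8)
        # Morphological cleanup against illumination noise (closing is one plausible choice).
        return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)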

3.2 Feature Extraction

After ROI selection, the system must extract the necessary features from these ROIs in order to increase the detection rate and decrease the computational cost. The overall feature extraction is composed of two parts. First, the selected ROI is normalized to 64×128 no matter how large it is. Second, the histogram of oriented gradients is computed to extract the human information, which plays an important role in human detection.

3.2.1 Normalization

Because the sizes of the selected ROIs usually differ, they are difficult to process consistently. Therefore, the selected ROIs have to be resized to an identical size to avoid degrading the detection rate and execution speed. The principle of image resizing (shrinking and zooming) is image interpolation, which is basically an image resampling method. Fundamentally, interpolation is the process of using known data to estimate values at unknown locations. As a simple example, suppose that an image of size 500×500 pixels has to be enlarged 1.5 times to 750×750 pixels. A simple way to visualize zooming is to create an imaginary 750×750 grid with the same pixel spacing as the original, and then shrink it so that it fits exactly over the original image. Obviously, the pixel spacing in the shrunken 750×750 grid will be less than the pixel spacing in the original image. To assign an intensity level to any point in the overlay, interpolation methods look for its closest pixel in the original image and assign that pixel's intensity to the new pixel in the 750×750 grid. When all points in the overlay grid have been assigned intensities, the grid is expanded to the specified size to obtain the zoomed image.

In fact, there are two common image interpolation approaches for resizing an image: nearest neighbor interpolation and bilinear interpolation. The nearest neighbor approach assigns to each new location the intensity of its nearest neighbor in the original image. This approach is simple, but it tends to produce undesirable artifacts, such as severe distortion of straight edges. For this reason, it is used infrequently in practice. The bilinear interpolation approach is more suitable; it uses the four nearest neighbors to estimate the intensity at a given location. Let (x, y) denote the coordinates of the location to which an intensity value is to be assigned, and let υ(x, y) denote that intensity value. For the bilinear interpolation approach, the assigned value is obtained using the equation

υ(x, y) = ax + by + cxy + d (3.5)

where the four coefficients are determined from the four equations in four unknowns that can be written using the four nearest neighbors of the point (x, y). As Fig-3.6 shows, the bilinear interpolation approach gives much better results than the nearest neighbor approach, with only a modest increase in computational burden.


Fig-3.6 (a) Result of the nearest neighbor interpolation approach (b) Result of the bilinear interpolation approach

In this thesis, the normalization method adopts the bilinear interpolation approach to resize the selected ROIs to the same size. Hence, the normalized ROIs can be processed more simply, and their consistency increases the detection rate and execution speed.
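The normalization step can be sketched with OpenCV's resize, an assumed implementation choice; "roi.png" is a hypothetical input file used only for illustration.

    # A minimal sketch of the 64x128 normalization of Section 3.2.1; OpenCV is
    # an assumed library choice, and "roi.png" is a hypothetical input file.
    import cv2

    roi = cv2.imread("roi.png", cv2.IMREAD_GRAYSCALE)  # a selected ROI of arbitrary size
    # cv2.resize takes (width, height); INTER_LINEAR is bilinear interpolation (3.5).
    nearest = cv2.resize(roi, (64, 128), interpolation=cv2.INTER_NEAREST)
    bilinear = cv2.resize(roi, (64, 128), interpolation=cv2.INTER_LINEAR)  # the approach adopted here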

3.2.2 Histogram of Oriented Gradient

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientations in localized portions of an image. The concept of HOG is similar to edge orientation histograms [35], scale-invariant feature transform descriptors, and shape contexts, but differs in that HOG is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

Fig-3.7 the parameters of HOG for computation

In general, HOG is used to detect objects of interest by scanning the whole image at multiple scales to find objects of different sizes. In this thesis, however, HOG processes only the normalized ROIs, so the system reduces the computational time and achieves a higher detection speed. The HOG method presented in this thesis was proposed by Dalal and Triggs [16], and Fig-3.7 shows the parameters used for computation. The detection window of HOG has size 64×128, the same as the normalized ROIs. The 16×16 block moves 8 pixels horizontally and vertically, and the orientations in each 8×8 cell are divided into 9 bins. Because a HOG descriptor represents a detection window, a detection window contains 105 blocks by equation (3.6), and the vector length of the HOG descriptor is 3780 dimensions by equation (3.7).

((128 − 16)/8 + 1) × ((64 − 16)/8 + 1) = 15 × 7 = 105 (3.6)

∵ 1 block = 4 cells = 4 × 9 bins
∴ the vector length of the HOG descriptor = 105 × 4 × 9 = 3780 (3.7)
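As a sanity check of (3.6) and (3.7), the sketch below builds a HOG descriptor with the parameters of Fig-3.7 using OpenCV's HOGDescriptor, an assumed implementation choice, and confirms the 3780-dimensional length.

    # A quick check of (3.6)-(3.7) using OpenCV's HOGDescriptor with the
    # Dalal-Triggs parameters of Fig-3.7; OpenCV is an assumed choice.
    import numpy as np
    import cv2

    hog = cv2.HOGDescriptor((64, 128),  # detection window, same as the normalized ROI
                            (16, 16),   # block size
                            (8, 8),     # block stride
                            (8, 8),     # cell size
                            9)          # orientation bins
    window = np.zeros((128, 64), dtype=np.uint8)  # a stand-in for a normalized ROI
    descriptor = hog.compute(window)
    print(descriptor.size)  # 3780 = 105 blocks x 4 cells x 9 bins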

Then, the HOG is computed by the following steps:

Step-1: Compute the magnitude m(x, y) and orientation θ(x, y) of the gradient, defined as

m(x, y) = √(Gx(x, y)² + Gy(x, y)²),  θ(x, y) = tan⁻¹(Gy(x, y)/Gx(x, y)) (3.8)

where Gx and Gy are the gradients in the horizontal and vertical directions, obtained by convolving the image with the operations shown in Fig-3.8.

Fig-3.8 (a) the horizontal operation (b) the vertical operation

Step-2: The gradient orientations of each cell are evenly divided into 9 bins over 0° to 180°. The sign of the orientation is ignored, so orientations between 180° and 360° are considered the same as those between 0° and 180°. Then, the gradient orientation histogram Ek(i, j) in each orientation bin k of block B(i, j) is obtained by summing all the gradient magnitudes whose orientations belong to bin k in B(i, j):

Ek(i, j) = Σ m(x, y) over all (x, y) ∈ B(i, j) with θ(x, y) in bin k (3.9)

An example of HOG is shown in Fig-3.9.

Fig-3.9 the example of HOG

After the HOG descriptors are extracted and computed, they are used as training patterns of human recognition system to determine whether any human is in the image or not.

3.3 Human Shape Recognition

In this section, the system judges whether the ROI contains a human based on the extracted features. For a higher detection rate and tolerance to variation, this thesis adopts the Leeds Sports Pose (LSP) dataset [36], which contains 2000 pose images of mostly sports people gathered from Flickr. Some examples of original

