
Chapter 3 Detection Algorithm

3.2 Face Detection

3.2.2 Connected Component

Connectivity between pixels is a fundamental concept that simplifies the definition of numerous digital image concepts, such as regions and boundaries. To establish whether two pixels are connected, it must be determined whether they are neighbors and whether their gray levels satisfy a specified criterion of similarity [48]. For instance, in a binary image with values 0 and 1, two pixels may be 4-neighbors, but they are said to be connected only if they have the same value.

Let V be the set of gray-level values used to define adjacency. In a binary image, V = {1} if we are referring to adjacency of pixels with value 1. We consider three types of adjacency/connectivity [48]:

1. 4-connectivity

Two pixels p and q with values from V are 4-connected if q is in the set N₄(p).

2. 8-connectivity

Two pixels p and q with values from V are 8-connected if q is in the set N₈(p).

Figure 3.13 (a) Arrangement of pixels, (b) pixels that are 4-connected, (c) pixels that are 8-connected, (d) pixels that are m-connected [48]

3. m-connectivity

Two pixels p and q with values from V are m-connected if (i) q is in N₄(p), or (ii) q is in N_D(p) and the set N₄(p) ∩ N₄(q) has no pixels whose values are from V.

Figure 3.13(a) shows a binary image used to illustrate the connectivity between pixels. Figure 3.13(b) shows 4-connectivity: pixel p is 4-connected to a horizontal or vertical neighbor whose value is in V = {1}. If pixel p is connected to a neighbor in a horizontal, vertical, or diagonal position, the pixels are 8-connected.

The last panel shows m-connectivity, a modification of 8-connectivity introduced to eliminate the ambiguities that often arise when 8-connectivity is used. The three pixels at the top of Fig. 3.13(c) show this ambiguity, as indicated by the dashed lines. The ambiguity is removed by using m-connectivity, as shown in Fig. 3.13(d).
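The three adjacency definitions can be written as small predicates. The following is a minimal Python sketch, not code from the thesis: the function names (`n4`, `nd`, `in_v`, `adjacent`) and the 3x3 test image are assumptions chosen for illustration, with pixels addressed as (row, column).

```python
def n4(p):
    """4-neighbors of pixel p = (row, col)."""
    r, c = p
    return {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)}


def nd(p):
    """Diagonal neighbors of pixel p = (row, col)."""
    r, c = p
    return {(r - 1, c - 1), (r - 1, c + 1), (r + 1, c - 1), (r + 1, c + 1)}


def in_v(img, p, v=(1,)):
    """True if p lies inside the image and its value belongs to the set V."""
    r, c = p
    return 0 <= r < len(img) and 0 <= c < len(img[0]) and img[r][c] in v


def adjacent(img, p, q, kind="m", v=(1,)):
    """Test 4-, 8-, or m-adjacency of pixels p and q for the value set V."""
    if not (in_v(img, p, v) and in_v(img, q, v)):
        return False
    if kind == "4":
        return q in n4(p)
    if kind == "8":
        return q in (n4(p) | nd(p))
    # m-adjacency: (i) q in N4(p), or
    # (ii) q in ND(p) and N4(p) ∩ N4(q) has no pixel with a value from V
    if q in n4(p):
        return True
    shared = n4(p) & n4(q)
    return q in nd(p) and not any(in_v(img, s, v) for s in shared)


# Hypothetical 3x3 binary image used only to exercise the predicates.
img = [[0, 1, 1],
       [0, 1, 0],
       [0, 0, 1]]
```

In this illustrative image the top-right pixel (0, 2) and the center pixel (1, 1) are 8-adjacent but not m-adjacent, because they share the 4-neighbor (0, 1) whose value is 1. The diagonal link between (1, 1) and (2, 2) is kept under m-adjacency since their shared 4-neighbors are both 0.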

Connected component labeling works by scanning an image pixel by pixel in order to identify connected pixel regions [50]. It works on binary or gray-level images, and different measures of connectivity are possible. The connectivity can be chosen among 4, 8, 6, 10, 18, and 26: 4- and 8-connectivity are used for 2D connected component extraction and the others for 3D connected component extraction. In this thesis, the input is a binary image and the connectivity is 8-connectivity. The connected component labeling operator scans the image by moving along a row until it comes to a point p, where p denotes the pixel to be labeled at any stage in the scanning process, whose value is in V = {1}. When this is true, it examines the four neighbors of p that have already been encountered in the scan. Based on this information, p is labeled as follows [50]:

(i) If all four neighbors are 0, assign a new label to p, else

(ii) If only one neighbor has a value in V = {1}, assign its label to p, else

(iii) If more than one of the neighbors have a value in V = {1}, assign one of their labels to p and make a note of the equivalence. In this case, we label p with the minimum label value.
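Rules (i) to (iii) can be sketched as a two-pass procedure. The Python below is an illustrative sketch rather than the thesis implementation: the function name `label_components`, the union-find equivalence table, and the second pass that resolves the noted equivalences are assumptions about how the "note of the equivalence" is used.

```python
def label_components(img):
    """Two-pass labeling of a binary image (list of lists), 8-connectivity."""
    rows, cols = len(img), len(img[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = {}  # equivalence table: label -> parent label

    def find(x):
        # Follow parent links to the representative (smallest) label.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    next_label = 1
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1:
                continue
            # Neighbors already visited in a left-to-right, top-to-bottom
            # scan: left, upper-left, upper, upper-right.
            seen = []
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                    seen.append(labels[rr][cc])
            if not seen:
                # Rule (i): no labeled neighbor, start a new label.
                parent[next_label] = next_label
                labels[r][c] = next_label
                next_label += 1
            else:
                # Rules (ii) and (iii): take the minimum neighbor label
                # and record that all neighbor labels are equivalent.
                smallest = min(seen)
                labels[r][c] = smallest
                for lab in seen:
                    ra, rb = find(smallest), find(lab)
                    if ra != rb:
                        parent[max(ra, rb)] = min(ra, rb)

    # Second pass: replace every label by its representative.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels
```

After the second pass the surviving labels are not necessarily consecutive; a component keeps the smallest label it received during the first pass.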

With 8-connectivity, a pixel has the neighbors shown in Fig. 3.14. The center of the 3x3 mask is the pixel to be labeled, and its eight neighbors are indexed ‘1’ to ‘8’. When the image is scanned from left to right and top to bottom, the neighbors that have already been labeled are those with indexes 1, 2, 3, and 4, as shown in Fig. 3.14(b); the other neighbors have not been labeled yet. The labeling rules are therefore applied to these four neighbors. With 4-connectivity, shown as another example in Fig. 3.14(c) and (d), the labeling rules are applied to the neighbors with indexes 1 and 2.

Figure 3.14 (a) 8-connectivity, (b) Neighbors of 8-connectivity, (c) 4-connectivity, (d) Neighbors of 4-connectivity

Figure 3.15 Labeling process

Figure 3.15 shows the labeling process using rules (i), (ii), and (iii). For rule (i), all the examined neighbors of pixel p (the center of the 3x3 mask) are ‘0’, so we assign p a new label; in the figure the new label is ‘4’. Rule (ii) applies to the next pixel: pixel p in this position has only one neighbor with a value in V = {1}, so we give p the same label as this neighbor, which is ‘4’. Scanning continues pixel by pixel, and the last image shows the rule (iii) condition, where pixel p has more than one neighbor with a value in V = {1}, so we assign it the minimum label of its neighbors. In our example all of these neighbors have the same label ‘4’, so pixel p is labeled ‘4’. If one of the neighbors had a smaller label such as ‘2’, p would take that label.

Figure 3.16 Connected component (a) skin color region, (b) connected component result

Figure 3.17 Connected component labeling [50]

Figure 3.16 and Fig. 3.17 show the result of connected component labeling. Figure 3.16 shows the skin regions obtained from HSI skin color and their connected component labeling result. The result has four different colors, which indicate different labels. The labels are ordered from the largest region to the smallest: label ‘1’ has the largest size and label ‘4’ the smallest.
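Ordering the labels by region size can be done with a pixel count per label. The sketch below is an assumption about one way to do this, not the thesis code; `relabel_by_size` is a hypothetical helper that takes a label image (0 for background) and renumbers the components so that label 1 is the largest.

```python
from collections import Counter


def relabel_by_size(labels):
    """Renumber labels so that 1 is the largest component, 2 the next, ..."""
    counts = Counter(v for row in labels for v in row if v != 0)
    # Largest component first; ties are broken by the original label value.
    order = sorted(counts, key=lambda lab: (-counts[lab], lab))
    new_id = {old: i + 1 for i, old in enumerate(order)}
    return [[new_id.get(v, 0) for v in row] for row in labels]
```

With this ordering, a search that starts at label 1 visits the largest skin regions first.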

Every labeled component is fitted to an ellipse model, starting with the components that have a small label, that is, a large skin region, because a face region is larger than a human hand, as shown in Fig. 3.16. The ellipse function is fitted to each label in turn to find the face region. The fitting process stops as soon as a matching skin region is found, even if the system still has other labels whose skin regions are smaller than the matched region.

Chapter 4 Tracking System

The tracking system uses a PTZ (pan-tilt-zoom) camera to track the position of an object in an image by driving the pan-tilt of the camera to keep the object in the FOV (field of view) of the camera. The system is divided into two types: tracking using human information and tracking using face information.

The first type works automatically if the face of the moving human is in back-view position, and the second works if the face is in front-view or side-view position. Figure 4.1 shows our tracking system module, which includes pan, tilt, and zoom. The zooming system works by using the recognizable face index information.

Figure 4.1 Tracking system (flowchart blocks: moving human position, human face position, local motion vector, recognizable face index, zooming camera control, tracking (PT) camera control, face image; decision: face? y/n)

4.1 Active Camera

Figure 4.2 FOV camera

The FOV of the camera is shown in Fig. 4.2. The image in this figure is captured after the human detection stage, and the human region is enclosed by the bounding box. In this condition, the human is not in the center of the FOV (the image), so the camera drives the pan-tilt by an angle θ so that the human position moves to the FOV center. In the camera FOV, tilt upward is negative and pan right is positive.

The camera drives the pan-tilt using angle values, but the human and face positions in an image are given in pixels. We convert a pixel value to an angle by multiplying it by a scale value called the ‘step size’. The step size is defined separately for the x-axis and y-axis directions. If the center of the moving object is close to the center of the image, the step size is decreased; if it is far from the center of the image, the step size is increased.

Figure 4.3 Step-size regions

Based on these rules, we divide the image into regions along the x-axis and the y-axis, as shown in Fig. 4.3. Every region is represented by a step-size value, and the region inside the bounding box has a smaller step size than the regions outside the bounding box.
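The pixel-to-angle conversion with a region-dependent step size can be sketched as follows. This is a minimal illustration, not the thesis code: the function name `pan_tilt_command`, the size of the inner region (`inner_half`), and the step-size values are all hypothetical, since the source does not state them. Only the 320x240 image size and the sign convention (pan right positive, tilt upward negative) come from the text.

```python
def pan_tilt_command(obj_center, image_size=(320, 240),
                     inner_half=(80, 60),
                     step_near=(0.05, 0.05), step_far=(0.10, 0.10)):
    """Convert the pixel offset of an object from the image center
    into (pan, tilt) angles using a region-dependent step size.

    inner_half, step_near, and step_far are assumed values.
    """
    cx, cy = image_size[0] / 2, image_size[1] / 2
    dx, dy = obj_center[0] - cx, obj_center[1] - cy

    # Smaller step size inside the inner region, larger outside it,
    # chosen independently for the x-axis and the y-axis.
    step_x = step_near[0] if abs(dx) <= inner_half[0] else step_far[0]
    step_y = step_near[1] if abs(dy) <= inner_half[1] else step_far[1]

    pan = dx * step_x    # object right of center -> positive pan
    tilt = dy * step_y   # image y grows downward, so an object above
                         # the center gives a negative (upward) tilt
    return pan, tilt
```

An object at the image center yields a zero command, and an object above the center yields a negative tilt, matching the convention stated for the camera FOV.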

4.2 Tracking and Zooming Process

Tracking and zooming work together to follow a moving human. Figure 4.4 shows the tracking and zooming flowchart. The process starts by detecting the location of a moving object. If the object is a human, we apply face detection to the image; otherwise the system returns and captures new images until a moving human is obtained. If the face obtained from the face detection system has an index smaller than a threshold value and its position is in the zooming region shown in Fig. 4.5, the system automatically zooms in on the face region, so that the result contains a larger and clearer face. If the face index is larger than a second threshold value, the system zooms out.

Note: ths1 = threshold value 1, ths2 = threshold value 2

Figure 4.4 Tracking and zooming process

We also apply a local motion vector to find the direction of the moving object, and then use the direction information to drive the pan (left-right direction) of the camera.

Figure 4.5 Zooming region (regions labeled Step 1 & 2 and Step 3 & 4)

Figure 4.6 Training of recognizable face index (flowchart: training set of face images → low pass filter (LPF) → feature extraction (image binarization and partition into 25 blocks) → back-propagation neural network training → weights and biases of the back-propagation neural network)

Figure 4.7 Testing of recognizable face index (flowchart: face image → low pass filter (LPF) → feature extraction (image binarization and partition into 25 blocks) → back-propagation neural network testing → percentage face index)

4.2.1 Recognizable Face Index

The recognizable face index indicates how well a face image can be recognized. The index is a percentage value: a high percentage indicates that the face is very clear and can be recognized. In real-time face detection, our system can still detect a face even when its size is small, e.g. 15x15 pixels, but a face image of this size is very difficult to classify as face or non-face because some details or facial features are unavailable or too small to extract. In this case the face index is needed as an indicator to the system so that a clear, high-resolution face image can be obtained.

Figure 4.8 Recognizable face index training

Figure 4.9 (a) Original image, (b) After filtering, (c) After filtering and binarization

The recognizable face index is computed by a multilayer feedforward network trained with the back-propagation learning algorithm. Training works offline, but testing works online. Figure 4.6 and Fig. 4.7 show the training and testing processes. The network takes a 25-dimensional input and consists of three layers: two hidden layers and an output layer. Face images are captured by real-time face detection in several positions and classified into three classes of percentage values. Front-view faces are assigned an index of 90-100%, side-view faces 50-60%, and the remaining class 0-20%. Figure 4.9(a) shows a front-view face with an index of 90%. The pre-processing steps are low pass filtering, image binarization, and feature extraction. The low pass filter reduces small details of the face and produces a smooth image, as shown in Fig. 4.9(b). By thresholding this image at its median value we obtain the binary image shown in Fig. 4.9(c). This binary image is divided into 25 non-overlapping blocks. The feature of each block is calculated by averaging the face-region pixels in the block. These features are the input used to train the neural network.
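The pre-processing chain (low pass filter, median threshold, 25 block averages) can be sketched in plain Python. The details below are assumptions made for illustration: the source does not specify the filter, so a 3x3 mean filter with edge replication is used, and the 25 blocks are assumed to form a 5x5 grid. The names `box_filter` and `face_features` are hypothetical.

```python
from statistics import median


def box_filter(img):
    """3x3 mean filter with edge replication (one simple low pass filter)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    total += img[rr][cc]
            out[r][c] = total / 9.0
    return out


def face_features(img, grid=5):
    """Return grid*grid block features in [0, 1] for a gray-level image.

    Rows and columns left over when the image size is not a multiple
    of `grid` are ignored in this sketch.
    """
    smooth = box_filter(img)
    thr = median(v for row in smooth for v in row)
    # Binarize: pixels above the median become 1, the rest 0.
    binary = [[1 if v > thr else 0 for v in row] for row in smooth]

    h, w = len(binary), len(binary[0])
    bh, bw = h // grid, w // grid
    feats = []
    for br in range(grid):
        for bc in range(grid):
            block = [binary[r][c]
                     for r in range(br * bh, (br + 1) * bh)
                     for c in range(bc * bw, (bc + 1) * bw)]
            # Feature = fraction of 1-pixels in the block.
            feats.append(sum(block) / len(block))
    return feats
```

For a 15x15 face image this yields 25 values, one per 3x3-pixel block, which matches the 25-dimensional network input described in the text.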

The first hidden layer of the neural network has forty neurons, the second hidden layer has eighty neurons, and the output layer gives the face index result. The activation function is the sigmoid function shown in Eq. (4.1):

$$\varphi(x) = \frac{2}{1 + e^{-x}} - 1 \qquad (4.1)$$

Figure 4.10 shows some of the face images used as training data for the recognizable face index. These face images are already classified into three percentage values: the first row contains face images with a percentage value of 20%, the second row face images with 50-60%, and the last row face images with 90%.

Figure 4.10 Training face images, (a) view faces with percentage index 20%, (b) side-view faces with percentage index 50-60%, (c) front-side-view faces with percentage index 90%

The testing process works online. It is similar to the training part and consists of filtering and feature extraction. The weights and biases obtained from training are applied to the test image, and the output is the face index percentage.
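The online test step amounts to one forward pass through the trained network. The sketch below assumes the layer sizes given in the text (25 inputs, 40 and 80 hidden neurons, 1 output) and assumes the activation of Eq. (4.1) is the bipolar sigmoid 2/(1 + e^(-x)) - 1. The random weights stand in for trained ones, and the mapping of the output from (-1, 1) to a percentage is a hypothetical choice that the source does not specify.

```python
import math
import random


def phi(x):
    """Bipolar sigmoid 2 / (1 + exp(-x)) - 1, written as tanh(x / 2),
    which is algebraically identical and avoids overflow in exp()."""
    return math.tanh(x / 2.0)


def dense(inputs, weights, biases):
    """One fully connected layer followed by the activation function."""
    return [phi(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def face_index(features, params):
    """Forward pass; params is a list of (weights, biases) per layer."""
    a = features
    for weights, biases in params:
        a = dense(a, weights, biases)
    # Assumed mapping from the output range (-1, 1) to 0-100 percent.
    return 50.0 * (a[0] + 1.0)


def random_params(sizes=(25, 40, 80, 1), seed=0):
    """Placeholder weights; a real system would load trained values."""
    rng = random.Random(seed)
    return [([[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
              for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]
```

Calling `face_index(features, params)` with the 25 block features of a face image returns a value between 0 and 100.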

Some face index results are shown in Fig. 4.11. The index of the first image is 60%, of the second image 20%, and of the last 91%. A small index indicates that a face is difficult to recognize because there is not enough information. In contrast, face indexes of 60% and 91% show that a face is clear and can be recognized.

Figure 4.11 Recognizable face index

The recognizable face index is applied in the zooming process, as shown in Fig. 4.12. If the face index is smaller than the threshold value ‘ths1’, the camera zooms in on the face region; if the face index is larger than the threshold value ‘ths2’, the camera zooms out of the face region. These threshold values can be changed by the user. The zooming process has four steps, step 1 to step 4, and the face region is zoomed according to the face index until we get a clear, high-resolution image.

Note: ths1 = threshold value 1, ths2 = threshold value 2
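The zoom decision can be sketched as a small state update over the four zoom steps. This is an illustrative sketch, not the thesis code: `next_zoom_step` is a hypothetical name, and the default thresholds of 50% and 70% are taken from the values mentioned in the Experiment II discussion, since the text says the thresholds are user-defined. The focal lengths are those listed in Section 5.2.

```python
# Reference focal length followed by zoom steps 1 to 4 (Section 5.2).
FOCAL_LENGTHS_MM = [3.1, 4.65, 6.2, 9.3, 12.4]


def next_zoom_step(step, face_index, in_zoom_region, ths1=50.0, ths2=70.0):
    """Return the zoom step (0 = reference, 4 = maximum) for the next frame.

    ths1 and ths2 are user-definable thresholds in percent.
    """
    if not in_zoom_region:
        return step                      # zooming only acts in the zooming region
    if face_index < ths1 and step < 4:
        return step + 1                  # face not clear enough: zoom in
    if face_index > ths2 and step > 0:
        return step - 1                  # face clear: zoom out
    return step                          # between thresholds or at a limit
```

The focal length for a given step is then `FOCAL_LENGTHS_MM[step]`. At step 4 a low face index leaves the step unchanged, because the focal length cannot be increased further.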

Figure 4.12 Zooming procedure

4.2.2 Local Motion Vector

Motion vectors are typically used to compress video by storing the changes in an image from one frame to the next. They can be used to find the movement of a pixel or block of pixels from a frame f(n) to the next frame f(n+1) in order to extract the direction of a moving object. We use this direction to predict the position of the moving object in the next frame. Computing motion vectors over the entire image has a high computational cost and takes a long time. To achieve a real-time system, we combine the motion vector with segmentation to reduce the computational cost: the binary difference image is used as the reference position for finding the motion vector.

We apply the local motion vector to predict the direction of the moving human. It works automatically when the object is in the area shown in Fig. 4.13. In this region the moving human can easily walk outside the camera FOV after changing direction, so we predict the direction of the moving human by using the local motion vector to make sure the camera movement can follow the moving object. We only work in the x-axis direction, because in our experiment only a human who walks from left to right or right to left can walk outside the FOV, whereas a human who walks away from or approaches the camera stays in the FOV.

Figure 4.13 Region of motion vector

The local motion vector works by scanning the binary difference image pixel by pixel. The difference image in Fig. 4.14 contains the values ‘1’ and ‘0’, where ‘1’ is an edge of the moving object and ‘0’ is background. If pixel p has a value in V = {1}, for example the shaded pixel at position (x, y) = (6, 4) in Fig. 4.14, the motion vector is computed by using its position as the center of a 3x3 image block of the previous frame, as shown in Fig. 4.15(a). Using the SAD (sum of absolute differences) method in Eq. (4.2), we find its correlation within a 7x7 image block of the current frame. The center position of the 7x7 image block is the same as the center position of the 3x3 image block, as shown in Fig. 4.15(b). Based on this rule, the 3x3 image block can move by at most three pixels in any direction.
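The block search can be sketched as follows. This Python is an illustration under stated assumptions, not the thesis code: SAD is taken in its usual form (the sum of absolute differences between the 3x3 block of the previous frame and a candidate 3x3 block of the current frame), and the 7x7 block is interpreted as the set of candidate center positions, which gives displacements of up to three pixels. The names `sad` and `local_motion_vector` are hypothetical.

```python
def sad(prev, cur, pr, pc, cr, cc, half=1):
    """Sum of absolute differences between the (2*half+1)^2 block of
    `prev` centered at (pr, pc) and the block of `cur` at (cr, cc)."""
    total = 0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            total += abs(prev[pr + dr][pc + dc] - cur[cr + dr][cc + dc])
    return total


def local_motion_vector(prev, cur, r, c, radius=3):
    """Return the displacement (dy, dx) with minimum SAD for the 3x3
    block centered at (r, c), or None if the block touches the border.

    Candidate centers cover a (2*radius+1)^2 = 7x7 area by default.
    Ties keep the zero displacement or the first candidate found.
    """
    h, w = len(prev), len(prev[0])
    if not (1 <= r < h - 1 and 1 <= c < w - 1):
        return None
    best, best_cost = (0, 0), sad(prev, cur, r, c, r, c)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rr, cc = r + dy, c + dx
            if not (1 <= rr < h - 1 and 1 <= cc < w - 1):
                continue  # candidate block would leave the image
            cost = sad(prev, cur, r, c, rr, cc)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best
```

In the scheme described above, this function would be called only at pixels where the binary difference image is 1, which is what keeps the cost low compared with a search over the entire image.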

Figure 4.14 Difference between two frames

Figure 4.15 (a) Previous frame, (b) Current frame

The local motion vector is simulated in Matlab. Figure 4.16 shows two image frames, the current and the previous frame, and the difference between these two frames. The resulting local motion vectors are shown in Fig. 4.17. The possible directions of a motion vector are left, right, and fixed. Most of the directions in this image are from right to left, hence we decide that the object is moving from right to left. The tracking system receives the information about the next position and the left direction, so the camera can move its pan-tilt to the new position, and if the index of the human face is larger than 50% then we capture that face.
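Deciding the overall direction from many local vectors is a majority vote over their x-components. The sketch below is an assumed formulation, not the thesis code: `dominant_direction` is a hypothetical helper, vectors are (dy, dx) pairs with negative dx meaning leftward motion, and ties are resolved in favor of "fixed", then "left".

```python
from collections import Counter


def dominant_direction(vectors):
    """Return 'left', 'right', or 'fixed' by majority vote on dx."""
    votes = Counter()
    for _dy, dx in vectors:
        if dx < 0:
            votes["left"] += 1
        elif dx > 0:
            votes["right"] += 1
        else:
            votes["fixed"] += 1
    # max() returns the first maximal entry, so ties prefer 'fixed',
    # then 'left'; an empty input also yields 'fixed'.
    return max(("fixed", "left", "right"), key=lambda d: votes[d])
```

Only the x-component is used, in line with the restriction to the x-axis direction described earlier.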

Figure 4.16 (a) Current frame, (b) Previous frame, (c) Difference between two frames

Figure 4.17 Local motion vector result

In summary, the tracking and zooming conditions are:

• Tracking works automatically when the human is in back-view condition and outside the zooming region shown in Fig. 4.5.

• Zoom-in works automatically when the face index is smaller than 50% and the moving object is in the zooming region.

• Zoom-out works automatically when the face index is larger than 50% and the moving object is in the zooming region.

Chapter 5 Experimental Results

This chapter shows detection and tracking results without zooming camera control and with zooming camera control.

5.1 The Experimental System

The experimental system uses the following components:

• The system uses a SONY EVI-D100 active camera to capture the image sequence. The sequence is composed of 320x240 color images acquired at a frame rate of 20 frames per second.

• The system is developed in Borland C++ Builder 6 and has been tested on color image sequences acquired in indoor environments.

• The system has been implemented on a computer with an AMD Athlon XP 2000+ 1.6 GHz CPU and 512 MB RAM running Microsoft Windows XP.

• The active camera has two interfaces, RS-232 and video-in. The RS-232 interface is used to drive the pan-tilt-zoom of the camera, and the video-in interface is an analog input that needs a video grabber card so that the computer can read out the image data.

5.2 Environment Setup

The experimental environment is located in our laboratory. The environment is complex enough to verify our system while tracking and detecting a moving human. Figure 5.1 shows several images of the environment at a focal length of 3.1 mm, which is the normal condition without zoom-in/out operation. For the zoom-in and zoom-out conditions, Fig. 5.2 shows several images from zoom step one (focal length 4.65 mm) to zoom step four (focal length 12.4 mm).

Figure 5.1 Indoor environments at reference focal length 3.1 mm

Figure 5.2 Frames at several focal lengths (reference focal length = 3.1 mm; zoom-in step 1 = 4.65 mm; zoom-in step 2 = 6.2 mm; zoom-in step 3 = 9.3 mm; zoom-in step 4 = 12.4 mm)

5.3 Experiment I

This experiment detects and tracks without the recognizable face index and zooming camera control. It works by driving the pan-tilt of the camera to follow the movement of the moving human. The results are shown in Fig. 5.3.

Figure 5.3 Tracking without zooming operation

This figure shows that the system successfully tracks the human by driving the pan-tilt, but the face is too small to recognize and has low resolution. Some of the human faces extracted from this image sequence are shown in Fig. 5.4.

Figure 5.4 Human and Face detection sequence

The human is detected within a few frames after entering the field of view of the camera. The bounding box shows the moving human region, and the detected face is shown in the small upper-left window. The camera tracks the human while the face is in back view, or tracks the human face while it is in front view.

The detected face images are difficult to recognize because the face image is too small and has low resolution, so the face features are difficult to extract.

5.4 Experiment II

This experiment detects and tracks a moving human with the recognizable face index and zooming camera control. The zooming process includes zoom-in and zoom-out, each of which has four steps. The steps depend on the camera focal length. Zoom-in and zoom-out use the same focal lengths, with the focal length changing in opposite directions. The tracking and zooming results of Experiment II are shown in Figs. 5.5 to 5.7.

Figure 5.5 Tracking and zooming image sequence (zoom-in); panels: reference position (3.1 mm), zoom-in step 1, zoom-in step 2, zoom-in step 3

Figure 5.6 Tracking and zooming image sequence (zoom-in and zoom-out); panels: zoom-in step 3, zoom-out step 3, zoom-out step 3, zoom-in step 3, zoom-in step 4, zoom-in step 4

Figure 5.7 Tracking and zooming step-4

Figure 5.5 shows the zoom-in image sequence from the reference focal length of 3.1 mm to zoom-in step three. In this condition the camera does not increase the focal length to zoom-in step four because the human face is already clear and has a face index larger than 50%. Figure 5.6 continues from Fig. 5.5: its first image has the same step as the last image in Fig. 5.5 (zoom-in step three). In this condition we obtain a face index larger than 70%, so the system automatically performs zoom-out step three. After that, we obtain a face index smaller than 50%, so the system performs zoom-in step three and zoom-in step four. In this condition the system cannot increase the camera focal length because we
