
Chapter 2 Related Works

2.4 Morphology Operations

Two common morphological operations in image processing are dilation and erosion [28-30], which are defined in terms of the reflection and translation of a set A in the 2-D integer space Z². The reflection of set A about its origin is defined as

Â = {â | â = −a, for a ∈ A} (2.11)

and the translation of set A by z is defined as

(A)_z = {a_z | a_z = a + z, for a ∈ A} (2.12)

where all the points in set A are moved by z = (z_1, z_2).

The dilation and erosion operations are often used to repair gaps and eliminate noise regions, respectively. The dilation of A by B is defined as

A ⊕ B = {z | (B̂)_z ∩ A ≠ ∅} (2.13)

where A and B are two sets in Z². The dilation operation (2.13) yields the set of all displacements z such that B̂, translated by z, overlaps A in at least one element. Take Fig-2.5 as an example, where the elements of A and B are shown shaded and the background is white. The shaded area in Fig-2.5(c) is the result of the dilation of Fig-2.5(a) by Fig-2.5(b). Through the dilation operation, the objects in the image grow or thicken, so dilation can repair gaps. Similarly, the shaded area in Fig-2.5(e) is the dilation of Fig-2.5(a) by Fig-2.5(d). Comparing Fig-2.5(c) and Fig-2.5(e), we find that when the mask becomes larger, the dilated area also extends further.

The opposite of dilation is known as the erosion. For sets A and B in 𝑍2, the erosion of A by B is defined as

A ⊖ B = {z | (B)_z ⊆ A} (2.14)

which results in the set of all points z such that B, after being translated by z, is contained in A. Unlike dilation, which is a thickening operation, erosion shrinks objects in the image. Fig-2.6 shows how erosion works. The shaded area in Fig-2.6(c) is the result of eroding Fig-2.6(a) by Fig-2.6(b). Similarly, Fig-2.6(e) shows the erosion of Fig-2.6(a) by Fig-2.6(d).
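To make (2.13) and (2.14) concrete, the following sketch applies binary dilation and erosion to a small test pattern with a 3×3 square structuring element. It is a minimal illustration written in Python with SciPy, not code from this thesis; the test array A and the choice of structuring element are arbitrary.

import numpy as np
from scipy import ndimage

# Arbitrary binary set A (1 = object, 0 = background) with a one-pixel hole.
A = np.zeros((9, 9), dtype=bool)
A[3:6, 2:7] = True
A[4, 4] = False

# Structuring element B: a 3x3 square, so its reflection B-hat equals B.
B = np.ones((3, 3), dtype=bool)

dilation = ndimage.binary_dilation(A, structure=B)  # A (+) B, Eq. (2.13): thickens A and fills the hole
erosion = ndimage.binary_erosion(A, structure=B)    # A (-) B, Eq. (2.14): shrinks A

print(dilation.astype(int))
print(erosion.astype(int))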

Fig-2.6 Examples of erosion: (a) set A (b) structuring element B (c) A ⊖ B (d) structuring element C (e) A ⊖ C

Chapter 3

Intelligent Human Detection

The intelligent human detection is implemented in three main steps as shown in Fig-3.1, including region-of-interest (ROI) selection, feature extraction and human recognition. The system uses depth images generated by Kinect as input and then selects the ROIs based on the histogram projection and connected component labeling.

Further, each ROI is normalized and then processed by edge detection and distance transformation to extract the necessary features. Finally, the overall feature set is delivered to the human recognition system to obtain the results.

Fig-3.1 Flowchart of the intelligent human detection system: the input depth image passes through Step I (ROI selection: histogram projection and connected component labeling), Step II (feature extraction: normalization, edge detection and distance transformation) and Step III (human recognition)

Fig-3.2(a) shows an example of the depth image generated by Kinect, which contains 320×240 pixels with intensity values normalized to 0-255. The intensity value indicates the distance between the object and the camera, and a lower intensity value implies a smaller distance. In addition, any point whose depth the sensor cannot measure is set to 0 and appears as a dark area. Some small dark areas result from noise; they are undesirable and can be repaired by a dilation operation. Fig-3.2(b) shows that the small dark areas can be filled through the dilation operation.
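As a sketch of this repair step, grayscale dilation can be used to fill the small zero-valued (unmeasured) holes, since it replaces each pixel by the maximum of its neighborhood. The function below is illustrative only; the 5×5 neighborhood size is an assumed value, not the one used in this thesis, and depth is assumed to be the 320×240 Kinect depth image with intensities 0-255.

import numpy as np
from scipy import ndimage

def repair_depth_holes(depth, size=5):
    # Grayscale dilation: every pixel becomes the maximum of its size x size
    # neighborhood, so small dark (zero) holes take the value of nearby pixels.
    dilated = ndimage.grey_dilation(depth, size=(size, size))
    # Overwrite only the unmeasured pixels so valid depth values stay untouched.
    return np.where(depth == 0, dilated, depth)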

Fig-3.2 (a) Example of the depth image generated by Kinect (b) The image after dilation operation

3.1 ROI Selection

In general, a standing or walking human presents a vertical profile in the depth image. In other words, the height of a human in the depth image must exceed a certain value, given as a threshold. Based on this threshold, the system implements ROI selection with histogram projection and connected component labeling (CCL) to increase the speed and detection rate. Accordingly, the system generates the rough distribution in the 3-D space by histogram projection and locates potential human regions by CCL.


3.1.1 Histogram Projection

Based on the information in the depth image, the system implements histogram projection in three steps, which are introduced as follows:

Step 1:

The system computes the histogram of every column of the depth image with intensity levels in the range [0, 255]. Let the histogram of the i-th column be

h_i = [h_{0,i} h_{1,i} ⋯ h_{255,i}]^T, i = 1, 2, … , 320 (3.1)

where h_{k,i} is the number of pixels with intensity k in the i-th column. Then, define the histogram image as

𝑯 = [𝒉1 𝒉2 𝒉3 … 𝒉320] (3.2)

with size 256×320, which can be expressed in detail as

H = [h_{k,i}], k = 0, 1, … , 255, i = 1, 2, … , 320 (3.3)

Note that the value of h_{k,i} can be seen as the vertical distribution at a specific position in the real world. Take Fig-3.2(b) as an example; the result of the histogram computation is shown in Fig-3.3. Unfortunately, there is a large number of pixels with intensity k = 0, that is, the first row of H contains large values of h_{0,i}. As a result, an undesired "wall" formed by h_{0,i} blocks the other objects, as shown in Fig-3.3.
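Step 1 can be sketched in a few lines of NumPy: each column of the depth image contributes one 256-bin intensity histogram, which becomes one column of H as in (3.1)-(3.3). The code below is an illustrative sketch, assuming depth is the dilated 240×320 (rows × columns) depth image with 8-bit intensities.

import numpy as np

def histogram_image(depth):
    # H[k, i] counts the pixels with intensity k in the i-th column (Eq. 3.1-3.3).
    rows, cols = depth.shape                # expected 240 x 320
    H = np.zeros((256, cols), dtype=np.int32)
    for i in range(cols):
        H[:, i] = np.bincount(depth[:, i].astype(np.uint8), minlength=256)
    return H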

Step 2:

After the histogram computation, the result has to be further processed to filter out unnecessary information. Since the detection distance ranges from 1 m to 4 m, the corresponding intensity range is [40, 240] and the components h_{k,i} of H are rectified as

h_{k,i} = { h_{k,i},  k = 40, 41, … , 240
          {    0,     otherwise            (3.4)

Clearly, the components of H in the first 40 rows and the last 15 rows are all set to 0, which implies that the unwanted background is also filtered out, because the related intensity lies in the first row of H. The rectified result of Fig-3.3 is shown in Fig-3.4, where the histogram value h_{k,i} can be treated as the vertical distribution of the objects at coordinate (i, k) in the real world. Comparing Fig-3.2(b) with Fig-3.4, it is obvious that there are four objects, which are a wall, a human, a chair and a shelf from left to right. Consequently, if the height of an object in the image is above the threshold, it has a clear shape in the histogram image.

Fig-3.3 Result of histogram computing of Fig-3.2(b)

Step 3:

The top-view image of the 3-D distribution in the (i, k) coordinates is shown in Fig-3.5. If an object has a higher vertical distribution, it has a larger intensity in the top-view image. Afterwards, a dilation operation is implemented to enhance the interior connectivity of each object, as shown in Fig-3.6(a). Finally, define the ROI image R as

R(k + 1, i) = { 1,  h_{k,i} > M
              { 0,  h_{k,i} ≤ M            (3.5)

with size 256×320, where M is a given threshold value. Therefore, the component h_{k,i} of H belongs to the ROI when it exceeds M. The final result of the histogram projection is shown in Fig-3.6(b).
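Steps 2 and 3 then amount to zeroing the rows of H outside the intensity band [40, 240] and thresholding the remaining values into a binary ROI image, as in (3.4) and (3.5). This is a simplified sketch: the intermediate dilation of the top-view image is omitted, and the threshold M is left as a free parameter.

import numpy as np

def roi_image(H, M, k_min=40, k_max=240):
    # Rectification (3.4): keep only intensities 40-240, i.e. the 1 m - 4 m range,
    # which also removes the k = 0 "wall" of unmeasured pixels.
    H = H.copy()
    H[:k_min, :] = 0
    H[k_max + 1:, :] = 0
    # Thresholding (3.5): a histogram cell belongs to the ROI when its vertical
    # distribution exceeds M (0-based indexing is used here instead of R(k+1, i)).
    R = (H > M).astype(np.uint8)
    return R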

Fig-3.4 Filtered result of Fig-3.3

3.1.2 Connected Component Labeling

Connected component labeling (CCL) [31] is a technique to identify distinct components and is often used in computer vision to detect connected regions, based on 4- or 8-pixel connectivity, in binary digital images. This thesis applies 4-pixel connectivity to label the regions of interest.

Fig-3.5 Example of top-view image


Fig-3.6 (a) Top-view image after dilation operation (b) The ROI image

The 4-pixel CCL algorithm can be partitioned into two processes, labeling and componentizing. The input is a binary image like Fig-3.8(a). During the labeling, the image is scanned pixel by pixel, from left to right and top to bottom as shown in Fig-3.7, where p is the pixel being processed, and r and t are respectively the upper and left pixels.

Fig-3.7 Scanning the image.

Define v(·) and l(·) as the binary value and the label of a pixel, respectively. N is a counter whose initial value is set to 1. If v(p) = 0, then move on to the next pixel; otherwise, i.e., v(p) = 1, the label l(p) is determined by the following rules:

R1. For v(r)=0 and v(t)=0, assign N to l(p) and then N is increased by 1.

R2. For exactly one of v(r) and v(t) equal to 1, assign the label of that pixel to l(p).

R3. For v(r)=1 and v(t)=1 with l(r)=l(t), assign l(r) to l(p).

R4. For v(r)=1 and v(t)=1 with l(r)≠l(t), assign the smaller of the two labels to l(p) and record that l(r) and l(t) are equivalent.

For example, after the labeling process, Fig-3.8(a) is changed into Fig-3.8(b). It is clear that some connected components contain pixels with different labels. Hence, it is necessary to further execute the componentizing process, which sorts all the pixels connected in one component and assigns them the same label, namely the smallest number among the labels in that component. Fig-3.8(c) is the result of Fig-3.8(b) after componentizing.

Fig-3.8 Example of 4-pixel CCL: (a) Binary image (b) Labeling (c) Componentizing

In this thesis, CCL is used to detect whether the ROI in Fig-3.6(b) contains human information or not. CCL can not only recognize the connected regions but also compute their areas. If the area of a connected region is too small, i.e., less than a human-related threshold, the region is filtered out and treated as a non-human object. The result of CCL is shown in Fig-3.9(a), where four potential objects are marked by red rectangles and a small dot-like region is filtered out. Then, the marked objects are mapped into the depth image, as shown in Fig-3.9(b).
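The labeling and area filtering described above can be reproduced compactly with SciPy's connected-component labeling under 4-connectivity; the sketch below returns bounding boxes of the regions large enough to be potential humans. The area threshold min_area is an illustrative placeholder, not the value used in this thesis.

import numpy as np
from scipy import ndimage

def candidate_regions(R, min_area=50):
    # 4-connectivity structuring element (up/down/left/right neighbors only).
    four_conn = ndimage.generate_binary_structure(2, 1)
    labels, n = ndimage.label(R, structure=four_conn)
    if n == 0:
        return []
    # Area (pixel count) of each labeled component.
    areas = ndimage.sum(R, labels, index=np.arange(1, n + 1))
    boxes = []
    for idx, box in enumerate(ndimage.find_objects(labels)):
        if box is not None and areas[idx] >= min_area:
            boxes.append(box)            # (row_slice, col_slice) of a potential object
    return boxes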

Note that both Fig-3.9(a) and Fig-3.9(b) have the same horizontal coordinate. As for the vertical coordinate of Fig-3.9(a), it represents the intensity value of Fig-3.9(b).

Based on this mapping, the corresponding regions can be found in the depth image, also marked by red rectangles in Fig-3.9(b).

Fig-3.9 (a) Result of CCL. (b) The corresponding regions in the depth image


Besides, the result of CCL could also be used to judge the cases of occlusion. In general, the occlusion could be roughly separated into four cases: non-occlusion, frontal-occlusion, left-occlusion and right-occlusion. After CCL, the selected ROI would be marked by a red rectangle and the system has to check whether the area below the red rectangle contains other objects or not. If an object appears in this area, it is required to determine the case of occlusion from the overlapping region which blocks the object in ROI. If it is left/right-occlusion, the overlapping region would be small and shown on the left/right side of the object in ROI. If it is frontal-occlusion, the overlapping region would be larger to block more than half of the object in ROI.

Fig-3.10 shows different cases of occlusion, which are non-occlusion, frontal-occlusion, left-occlusion and right-occlusion from left to right. The filled rectangles are the areas that should be checked, and the overlapping regions are encircled by green circles. The occlusion information is also a kind of feature and is sent to the recognition system as a reference.


Fig-3.10 Results of CCL and examples of occlusion judgment. (a) Non-occlusion (b) Frontal-occlusion (c) Left-occlusion (d) Right-occlusion

3.2 Feature Extraction

After ROI selection, the system has to extract necessary features to increase the detection rate and decrease the computational cost. The overall feature extraction could be separated into three parts. First, the size of the selected ROI would be normalized based on the distance between object and camera. Second, edge detection is executed to extract the shape information which is an important cue for human detection. Finally, distance transformation is implemented to convert the binary edge image into distance image.

3.2.1 Normalization

Obviously, if the object is farther from the camera, the object would have smaller size in the image. Therefore, the detection process would be influenced by different distances. In order to reduce the influence, the system has to normalize the size of object. According to the property of perspective projection, the relation between the height of the object in the image and the distance from object to camera could be expressed as

ℓ′ = fℓ / d (3.6)

where f is the focal length, d is the distance between the object and the camera, and ℓ′ and ℓ are the heights of the object in the image and in the real world, respectively. The concept of normalization is that, no matter where the object is, it is transformed to a standard distance. For example, set d_0 as the standard distance and place the object at distance d_1 as shown in Fig-3.11. According to (3.6), the height of the object in the image is

ℓ′_1 = fℓ / d_1 (3.7)

If the object were moved to the standard distance d_0, its height in the image would become

ℓ′_0 = fℓ / d_0 (3.8)

Combining (3.7) and (3.8) yields

ℓ′_0 = (d_1 / d_0) ℓ′_1 (3.9)

which can be used for normalization. For any object at any distance, once its height ℓ′_1 in the image and the distance d_1 between the object and the camera are measured, its height ℓ′_0 at the standard distance can be obtained from (3.9).

After ROI selection, the result is shown in Fig-3.12(b) and then the selected ROIs are separated in Fig-3.12(c). The height of object in the image could be obtained by computing the number of rows of ROI and the distance between object and camera could be directly acquired by the intensity of depth image. Therefore, the normalization could be implemented based on (3.9) and the results are attained in Fig-3.12(d). Note that the standard distance is set to be 2.4m in this thesis.
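By (3.9), normalization reduces to rescaling the extracted ROI by the factor d_1/d_0. The sketch below assumes the object distance d_1 has already been converted from the depth intensity to meters (that conversion depends on the Kinect calibration and is not shown); the bilinear interpolation order is an arbitrary choice.

from scipy import ndimage

STANDARD_DISTANCE = 2.4   # d0 in meters, as used in this thesis

def normalize_roi(roi, d1, d0=STANDARD_DISTANCE):
    # Eq. (3.9): a height observed at distance d1 maps to (d1 / d0) times that
    # height at the standard distance, so the whole ROI is rescaled by d1 / d0.
    scale = d1 / d0
    return ndimage.zoom(roi, scale, order=1)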

Fig-3.11 Example of perspective projection

Fig-3.12 (a) The original image (b) The result of ROI selection (c) Extracted regions from ROI selection (d) The results of normalization

3.2.2 Edge Detection

Edge detection is a fundamental tool in image processing and computer vision, particularly suitable for feature detection and feature extraction, which aim at identifying points where the brightness changes sharply or discontinuously in a digital image. In the ideal case, applying an edge detector to an image leads to a set of connected curves that indicate the boundaries of objects. Based on these boundaries, which preserve the important structural properties of the image, the amount of data to be processed can be reduced since irrelevant information can be discarded.


Edge detection methods are commonly based on the gradient, which measures edge strength and direction using first derivatives. The gradient of an image f at point (x, y) is defined as

∇f = [f_x f_y]^T = [∂f/∂x ∂f/∂y]^T (3.10)

and its magnitude is

M(x, y) = √(f_x² + f_y²) (3.11)

where M(x, y) is an image of the same size as the original image, generally referred to as the gradient image.

In digital image processing, gradients can be approximated by mask operations, such as Laplacian [32], Sobel [33], Prewitt [34], Canny [35], etc. Taking the Sobel operators as an example, the gradient is implemented by the two masks shown in Fig-3.13(b) and Fig-3.13(c), which are the Sobel operators for the x-direction and y-direction, respectively. Assume Fig-3.13(a) contains the intensities z_i, i = 1 to 9, of the pixels in a 3×3 region. Using Fig-3.13(b), the gradient f_x of the 5th pixel along the x-direction is obtained as

f_x = ∂f/∂x = (z_7 + 2z_8 + z_9) − (z_1 + 2z_2 + z_3) (3.12)

Similarly, the gradient f_y of the 5th pixel along the y-direction can be obtained from Fig-3.13(c) and expressed as

f_y = ∂f/∂y = (z_3 + 2z_6 + z_9) − (z_1 + 2z_4 + z_7) (3.13)

After computing the partial derivatives with these masks, the gradient image M(x, y) can be obtained using (3.11). In this thesis, the system implements edge detection based on the Sobel operators. Take Fig-3.14 as an example, where the selected ROIs are

separated and normalized as shown in Fig-3.14(b) and Fig-3.14(c), respectively. Then, the normalized ROIs are scanned by Fig-3.13(b) and Fig-3.13(c) separately, and the magnitude of gradient is computed based on (3.11). If the magnitude of gradient of a pixel is larger than a threshold, it is a pixel on the edge. Fig-3.14(d) shows the result of edge detection. With the above edge detection process, the edge information could be extracted from the depth image.
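The edge detector of (3.11)-(3.13) can be sketched as two mask operations followed by a magnitude threshold. The masks below are those of Fig-3.13(b) and Fig-3.13(c); ndimage.correlate is used so that the masks are applied exactly as written, and the threshold value is an illustrative placeholder.

import numpy as np
from scipy import ndimage

SOBEL_X = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)   # Fig-3.13(b), x-direction
SOBEL_Y = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)     # Fig-3.13(c), y-direction

def sobel_edges(image, threshold=80.0):
    f = image.astype(float)
    fx = ndimage.correlate(f, SOBEL_X)    # Eq. (3.12)
    fy = ndimage.correlate(f, SOBEL_Y)    # Eq. (3.13)
    magnitude = np.hypot(fx, fy)          # M(x, y) in Eq. (3.11)
    # A pixel is an edge pixel when its gradient magnitude exceeds the threshold.
    return (magnitude > threshold).astype(np.uint8)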

Fig-3.13 Example of Sobel operators: (a) the intensities of a 3×3 image region, (b) the x-direction mask, (c) the y-direction mask

(a)  z1 z2 z3     (b)  -1 -2 -1     (c)  -1  0  1
     z4 z5 z6           0  0  0          -2  0  2
     z7 z8 z9           1  2  1          -1  0  1


Fig-3.14 Result of edge detection

3.2.3 Distance Transformation

Distance transformation (DT) [36, 37] is a technique to transform a binary edge image into a distance image. DT is often applied to approximate differential estimators, find the skeletons of objects and match templates. There are many DT algorithms, differing in the way distances are computed [36-38]. In general, the distance image has the same size as the edge image, and the edge pixels in the distance image are all set to 0. Each of the remaining pixels in the distance image stores the distance to the closest edge pixel. In this thesis, the 4-neighbor distance is used to compute the distance between a pixel and the closest edge pixel. The value at point (x, y) of the distance image is defined as

𝐷(𝑥, 𝑦) = |𝑥 − 𝑥0| + |𝑦 − 𝑦0| (3.14)

where (𝑥0, 𝑦0) represents the coordinate of the closest edge pixel in the edge image.

Fig-3.15 is an example of distance transformation. Fig-3.15(a) is a binary edge image, where the 0 value represents the edge pixel. After distance transformation, the edge image is transformed to distance image as shown in Fig-3.15(b).


Fig-3.15 (a) Example of edge image (b) Result of distance transformation


After edge detection, the system has binary edge images, as shown in Fig-3.16(d), which contain important shape information. Because human appearance varies widely, the system implements DT to enhance its tolerance to such variation. Fig-3.16(e) shows the result of the distance transformation, which is used for template matching in the following steps. Matching templates against the distance image gives a much smoother result than matching against the edge image directly, so the system tolerates more variation and achieves a higher detection rate.
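The 4-neighbor distance image of (3.14) is exactly a city-block (taxicab) chamfer distance transform, which SciPy provides directly. The sketch below assumes the binary edge map marks edge pixels with 1 (as produced by an edge detector), so the map is inverted before the transform because SciPy measures distances to zero-valued pixels.

import numpy as np
from scipy import ndimage

def distance_image(edges):
    # edges: binary array, 1 = edge pixel. distance_transform_cdt computes the
    # distance of every nonzero pixel to the nearest zero pixel, so the edge
    # pixels must be the zeros; the result is the city-block distance of (3.14),
    # with 0 at the edge pixels themselves.
    return ndimage.distance_transform_cdt(1 - edges, metric='taxicab')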


Fig-3.16 Result of distance transformation

3.3 Human Recognition

In this section, the system judges whether the ROI contains a human or not based on the extracted features. In order to achieve a higher detection rate and resolve occlusion problems, this thesis adopts a component-based concept, which considers a whole human shape to be formed by different body parts. There are two steps in this section, chamfer matching and shape recognition. Chamfer matching is a technique to evaluate the similarity between two objects and can be used to detect possible locations of different body parts. The results of chamfer matching are then combined into shapes to decide whether the ROI contains a human or not.

3.3.1 Chamfer Matching

Chamfer matching [38] is a matching algorithm to evaluate the similarity between test image and template image. First, the shape of the target object, such as head, leg, etc., is captured by a binary template. The test image is pre-processed by edge detection and distance transformation. After implementing the DT, the distance image would be scanned by the template image at all the locations. Note that the size of template image must be smaller than the size of test image. Assume T is a binary template image with size m×n and I is a distance image of the test image. Define the similarity measure as:

C(x, y) = ( Σ_{u=1}^{m} Σ_{v=1}^{n} T(u, v) I(x + u, y + v) ) / ( Σ_{u=1}^{m} Σ_{v=1}^{n} T(u, v) ) (3.15)

where C(x, y) is the matching score at coordinate (x, y) of I. The numerator of (3.15) is equivalent to the cross-correlation between the template image and the test image at coordinate (x, y). The result of the cross-correlation is then normalized by the number of edge pixels in T to obtain the matching score. A lower score means a better match between the test image and the template image at that location. If the matching score lies below a certain threshold, the target object is considered detected at this location. Fig-3.17 is an example of chamfer matching. Fig-3.17(a) and Fig-3.17(c) are the test image and the template image, respectively, and Fig-3.17(b) is the distance image of Fig-3.17(a). The template scans the distance image at all locations and the similarity is evaluated based on (3.15). When the matching score is lower than a given threshold, the location is marked by a yellow dot as shown in Fig-3.17(d).
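Evaluating (3.15) at every placement of the template is a cross-correlation of the distance image with the binary template, normalized by the number of template edge pixels. The sketch below follows that reading of (3.15); the detection threshold is left as a parameter and the function names are illustrative.

import numpy as np
from scipy import signal

def chamfer_scores(dist_img, template):
    # Numerator of (3.15): cross-correlation between the distance image and the
    # binary template (1 = edge pixel), evaluated at every valid placement.
    corr = signal.correlate2d(dist_img.astype(float), template.astype(float), mode='valid')
    # Normalize by the number of edge pixels in the template.
    return corr / template.sum()

def detect(dist_img, template, threshold):
    # A body part is considered detected where the score falls below the threshold.
    scores = chamfer_scores(dist_img, template)
    return np.argwhere(scores < threshold)   # (row, col) coordinates of matches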

In this thesis, chamfer matching is implemented to detect different body parts, including the head, torso and legs. Fig-3.18(a) shows a set of template images, called full-head (FH), full-torso (FT) and full-legs (FL) from left to right. These three template images scan the ROIs respectively, and the coordinates of the matched regions are recorded and sent to the next step. However, when a human is occluded by objects or other humans, only the left side or the right side of the body may appear in the image. In order to deal with the occlusion problem, separating each template image into a left-side one and a right-side one is an option, but it may

Fig-3.17 Example of chamfer matching


also cost more computation time. Therefore, another template set is proposed in Fig-3.18(b) which contains six template images, the left-side and right-side of Fig-3.18(a), named as left-head (LH), right-head (RH), left-torso (LT), right-torso (RT), left-leg (LL) and right-leg (RL) from left to right. These two template sets,

“Set-I” and “Set-II,” would be tested and their detection rate and speed would be compared in the next chapter.

3.3.2 Shape Recognition

After chamfer matching, the system has to judge whether the ROIs contain a human or not based on the coordinates of the matched regions of different body parts.

Because of its tolerance to variation, chamfer matching has a high true positive rate and correctly detects most real body parts, but it also has an unwanted high false

