
Chapter 3 Design of Proposed Octagonal 9-KINECT Imaging

3.4 Analysis of Device Performance

3.4.2 Imaging Sequence and Speed

As mentioned in the previous sections, we acquire the data from the nine KINECT devices sequentially. A frame from a single KINECT device consists of a color image and a depth image, and the frame rate of the device is 30 fps. In other words, it takes about 33 milliseconds to acquire a frame from a single KINECT device. When we use the nine KINECT devices to take nine frames sequentially, a complete imaging cycle therefore takes 33 × 9 = 297 milliseconds, so the overall frame rate is 1000/297 ≈ 3.37 fps. We assume that the monitored object or human does not move too fast, so this rate is not a problem for our processing work.


Chapter 4 Construction of 3D Images from KINECT Images

4.1 Introduction

The data acquired from a KINECT device each time consist of a color image and a depth image, which we call KINECT images. The KINECT images are not 3D in nature and are therefore inconvenient to process in 3D video surveillance applications. We thus construct a corresponding 3D image from each pair of KINECT images.

The 3D image contains three kinds of data. The first is the color data, which come directly from the color image. The second is the 3D data, which are obtained by converting the depth image into a 3D version. The third is a mapping array, which is obtained by using the Kinect-for-Windows SDK provided by Microsoft and is used to combine the 3D data with the color data. With the 3D image, we can not only conduct the processing required by a 3D video surveillance system more conveniently, but also display results in 3D more easily.

4.2 Review of KINECT Image Structures

In this section, we will introduce the structure of the KINECT images in detail.

As mentioned in the last section, the KINECT images include a color image and a depth image. We use the KINECT device, which yields images with a resolution of 640×480 pixels, together with the Kinect-for-Windows SDK to get KINECT images.

Each pixel in the color image has four bytes. The first three bytes are used to show the color and the last one is used to show the skeleton information. We can directly display the color image as a picture. An example of such color images is shown in Figure 4.1(a).

Each pixel in the depth image has a value of an unsigned short integer. In other words, each pixel in the depth image has sixteen bits. The first thirteen bits are used to represent depth information and the last three bits are used to specify the skeleton information. We can display a depth image as a gray-level image. An example of such depth images is shown in Figure 4.1(b).
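The sixteen-bit layout described above can be illustrated with a minimal Python sketch. It assumes that the "first thirteen bits" are the high-order bits and the "last three bits" are the low-order bits of each unsigned short; the function name `unpack_depth` and the constant `SKELETON_BITS` are hypothetical names chosen for illustration and are not part of the Kinect-for-Windows SDK.

```python
import numpy as np

SKELETON_BITS = 3  # assumed: the low-order three bits carry the skeleton information


def unpack_depth(raw):
    """Split raw 16-bit depth pixels into (depth, skeleton) parts.

    Assumption: the depth occupies the upper 13 bits and the skeleton
    information the lower 3 bits of each unsigned short.
    """
    raw = np.asarray(raw, dtype=np.uint16)
    depth = raw >> SKELETON_BITS                    # upper 13 bits
    skeleton = raw & ((1 << SKELETON_BITS) - 1)     # lower 3 bits
    return depth, skeleton
```

For example, a raw value of `(1000 << 3) | 2` would be split into a depth of 1000 and a skeleton value of 2 under this assumed layout.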

Figure 4.1 An example of a KINECT image pair. (a) The color image. (b) The depth image.

4.3 Construction of 3D Images

4.3.1 Review of Pinhole Camera Model

The pinhole camera [10] is a simple camera model. Its structure is an opaque box with a pinhole-sized aperture on one side. The light reflected from the object and passing through the pinhole produces a projection of the scene in front of the pinhole. In the projection, both left and right and up and down are reversed. An illustration is shown in Figure 4.2.

Figure 4.2 An illustration of the pinhole camera model.

The pinhole camera model describes the mathematical relationship between the coordinates of a 3D point and its projection on the image plane of the pinhole camera.

An example of the geometry of the pinhole camera model is shown in Figure 4.3.

More specifically, in Figure 4.3(a), there is a 3D orthogonal coordinate system with its origin at O. The origin O is also the location of the camera aperture. The three axes of the 3D orthogonal coordinate system are referred to as X1, X2 and X3. The X3-axis points in the viewing direction of the camera and is referred to as the optical axis. There is also a 2D coordinate system on the image plane with its origin at R. The origin R is at the intersection of the optical axis and the image plane, and is referred to as the image center. The two axes of the 2D coordinate system in the image plane are referred to as Y1 and Y2, which are parallel to the X1-axis and the X2-axis, respectively. The distance from point O to R is f, which is referred to as the focal length of the pinhole camera.

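With the basic definitions given above, we can find the relation between the point P with the 3D coordinates $(x_1, x_2, x_3)$ and the projection point Q with the 2D coordinates $(y_1, y_2)$. When we look in the negative direction of the X2-axis in Figure 4.3(a), we get Figure 4.3(b). From the two similar triangles appearing in Figure 4.3(b), we can derive the following equation according to the similar-triangle principle:

$$ \frac{-y_1}{f} = \frac{x_1}{x_3}, \quad \text{or} \quad y_1 = -\frac{f\,x_1}{x_3}. \tag{4.1} $$

When we look in the negative direction of the X1-axis, the following equation can be derived similarly:

$$ \frac{-y_2}{f} = \frac{x_2}{x_3}, \quad \text{or} \quad y_2 = -\frac{f\,x_2}{x_3}. \tag{4.2} $$

Summarizing these two equations, we get the following vector equation:

$$ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = -\frac{f}{x_3} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}. \tag{4.3} $$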
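As a small numeric illustration of the pinhole projection, the following Python sketch maps a 3D point to its image-plane coordinates. It assumes the standard pinhole relation in which the image is reversed, so each image coordinate is the corresponding 3D coordinate scaled by $-f/x_3$; the function name `project` is hypothetical.

```python
def project(point, f):
    """Project a 3D point (x1, x2, x3) onto the image plane of a pinhole camera.

    Assumes the reversed-image pinhole relation y = -(f / x3) * (x1, x2).
    """
    x1, x2, x3 = point
    if x3 == 0:
        raise ValueError("the point lies in the aperture plane (x3 == 0)")
    scale = -f / x3
    return (scale * x1, scale * x2)
```

With f = 2, the point (1, 2, 4) would project to (-0.5, -1.0): the image is scaled by f/x3 = 0.5 and reversed in both directions.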

4.3.2 Idea of 3D Image Construction and Coordinate Conversion

With the concept of the pinhole camera model, we can construct the 3D image from the depth image of the KINECT image by coordinate conversion. We will use Figure 4.3 to help us to explain the conversion process. From Equation 4.3, we get:

... device and is denoted as $d$ in the sequel. Let R represent the center of the depth image, which is located at coordinates (320, 240) in a depth image of resolution 640×480 acquired by the KINECT device. Let Q be a pixel located at image coordinates $(x_p, y_p)$, and let $y_1$ and $y_2$ represent the distances to the center R in the vertical and horizontal directions, respectively. The focal length f of the KINECT device is 600.

The equations (4.7), (4.4), (4.5) and (4.6) can be rewritten according to the mentioned parameter values to be:


Input: a depth image Id and a color image Ic acquired from the KINECT device.

Output: a 3D image I3D formed from converted Id and original Ic combined with a mapping array A.

Steps:

Step 1. Convert the coordinates of the depth image Id into 3D data by the coordinate conversion process described in Section 4.3.2.


Step 2. Use the Kinect-for-Windows SDK provided by Microsoft with the converted depth image Id as input to get a mapping array A.

Step 3. Display the 3D image I3D using OpenGL by drawing the 3D data in the 3D space with the corresponding colors, which are obtained from the color image Ic through the mapping array A.
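Step 1 of the algorithm above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this study: it takes the image center (320, 240) and the focal length 600 from Section 4.3.2, but the sign convention (no image reversal), the axis assignment, and the function name `depth_to_points` are assumptions. The mapping array of Step 2 comes from the Kinect-for-Windows SDK and is not reproduced here.

```python
import numpy as np

F = 600.0           # focal length of the KINECT device, as stated in Section 4.3.2
CX, CY = 320, 240   # image center R of a 640x480 depth image


def depth_to_points(depth):
    """Back-project an (H, W) array of depth values d into (H, W, 3) points.

    Assumed convention: x grows with the pixel column, y with the pixel row,
    and z equals the depth d; the signs may differ from the original equations.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    d = depth.astype(np.float64)
    x = (u - CX) * d / F   # horizontal offset from R scaled by d / f
    y = (v - CY) * d / F   # vertical offset from R scaled by d / f
    return np.dstack((x, y, d))
```

Under these assumptions, a pixel at the image center maps to (0, 0, d), and a pixel 300 columns to the right of the center at depth 1200 maps to x = 300 × 1200 / 600 = 600.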

Figure 4.4 The flowchart of the 3D image construction algorithm. (Flowchart blocks: KINECT image; depth image; color image; coordinate conversion; 3D data; produce mapping array; mapping array; 3D image rendering.)


4.4 Geometric Correction of 3D Images

4.4.1 Need of Correction

In our experiments, when we displayed a plane in the 3D image, we discovered that the plane appears as a curved surface rather than a flat one. An example of this phenomenon is shown in Figure 4.5. The problem arises because the infrared light rays emitted by the KINECT device for depth sensing are not parallel. This affects the accuracy of the depth because the measured depth is no longer a perpendicular distance. To solve this problem, we propose a method for the geometric correction of 3D images.

Figure 4.5 The 3D image of a plane (a wall).

4.4.2 Proposed Correction Technique by Interpolation

The idea of the proposed correction technique is to use a paraboloid to approximate the curved surface formed by the 3D data of the 3D image. An illustration is shown in Figure 4.6. Once the paraboloid equation is found, we can substitute the values of the coordinates x and y of each pixel of the 3D image into the paraboloid equation to compute the correction distance. Then, we subtract the correction distance from the value of the coordinate z of the 3D data of the 3D image to get the corrected result.

However, we discovered that when we use the KINECT device to sense planes at different distances from the device, the degrees of curvature of the curved surfaces formed by the 3D data of the 3D image are also different. We therefore find several paraboloid equations, one for each of several sensing distances from the KINECT device. Then, according to the value of the coordinate z of the 3D data of the 3D image, we use these paraboloid equations to find a suitable correction distance by interpolation.

Figure 4.6 The paraboloid equation.

4.4.3 Correction Algorithm

In this section, we describe the method for getting the paraboloid equation and the process of the interpolation mentioned previously. The criterion of minimum sum of squared errors (MSSE) is used to decide the parameters of the approximating shape. That is, we use the MSSE criterion to fit the paraboloid to the 3D data. The detail of the process is as follows.

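First, let the paraboloid equation be described by:

$$ z = A x^2 + B y^2 + C \tag{4.12} $$

where C is the distance between the KINECT device and the apex of the paraboloid, as shown in Figure 4.6. For n data points $(x_i, y_i, z_i)$, the equation for computing the sum of squared errors SSE is:

$$ \mathrm{SSE} = \sum_{i=1}^{n} \left( A x_i^2 + B y_i^2 + C - z_i \right)^2 . \tag{4.13} $$

To minimize the SSE, we compute the partial derivatives of Equation (4.13) with respect to the variables A, B, and C, respectively, and set them to zero to produce the following equations:

$$ \frac{\partial\,\mathrm{SSE}}{\partial A} = 2 \sum_{i=1}^{n} x_i^2 \left( A x_i^2 + B y_i^2 + C - z_i \right) = 0 ; \tag{4.14} $$

$$ \frac{\partial\,\mathrm{SSE}}{\partial B} = 2 \sum_{i=1}^{n} y_i^2 \left( A x_i^2 + B y_i^2 + C - z_i \right) = 0 ; \tag{4.15} $$

$$ \frac{\partial\,\mathrm{SSE}}{\partial C} = 2 \sum_{i=1}^{n} \left( A x_i^2 + B y_i^2 + C - z_i \right) = 0 . \tag{4.16} $$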

The values of A, B, and C are computed by solving the simultaneous equations (4.14), (4.15) and (4.16). For this, we substitute all the known values of $x_i$, $y_i$ and $z_i$ into the simultaneous equations (4.14), (4.15) and (4.16) to get three linear equations in three variables. We use the substitution and elimination method to solve these three independent equations and so obtain the values of A, B, and C.
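The fitting step can be sketched in Python as follows. The sketch assumes the paraboloid has the form z = A x² + B y² + C, builds the three linear equations obtained by setting the partial derivatives of the SSE to zero, and solves them with `numpy.linalg.solve` in place of manual substitution and elimination. The function name `fit_paraboloid` is hypothetical.

```python
import numpy as np


def fit_paraboloid(x, y, z):
    """Least-squares fit of z = A*x^2 + B*y^2 + C; returns (A, B, C).

    The 3x3 system solved here is the set of normal equations obtained by
    setting the partial derivatives of the SSE to zero (assumed form).
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    z = np.asarray(z, dtype=np.float64)
    # One row [x_i^2, y_i^2, 1] per data point.
    m = np.column_stack((x ** 2, y ** 2, np.ones_like(x)))
    normal_matrix = m.T @ m   # coefficients of A, B, C in the three equations
    rhs = m.T @ z             # right-hand sides of the three equations
    a, b, c = np.linalg.solve(normal_matrix, rhs)
    return a, b, c
```

Fitting one such paraboloid to the 3D data of a plane sensed at a known distance would yield one set of coefficients A, B, and C for that distance.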

As mentioned, however, we need several solutions of the values A, B, and C, obtained by using different sets of 3D data of the 3D image with different distances between the KINECT device and the planes. The results are shown in Table 4.1.

Table 4.1 Results of paraboloid coefficient estimations using different sets of 3D data of the 3D image with the different distances between the KINECT device and planes.

Distance (mm)    A              B              C
1003.62          0.000495949    0.000499793    -1003.62
1535.92          0.000323624    0.000380348    -1535.92
2120.88          0.000232773    0.000281706    -2120.88
2560.30          0.000188799    0.000248159    -2560.30
3111.78          0.000155055    0.000205877    -3111.78

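With Table 4.1, we use the value of the coordinate z of the 3D data of the 3D image to decide which paraboloid equations to use for interpolation. When the equations are found, we subtract the value of C of each equation from the equation itself to get the correction equations. Then, we substitute the values of the coordinates x and y into the correction equations to get correction distances, interpolate between them according to the value of z, and subtract the interpolated result from z. A flowchart of the process is shown in Figure 4.7 and the detail of the correction algorithm is as follows.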

Algorithm 4.2: correction algorithm.

Input: the values of the coordinates x, y, and z of the 3D data of the 3D image.

Output: the correction value zCorrected.

Steps:

Step 1. Use the value of the coordinate z to find the paraboloid equations PEs.

Step 2. Subtract the values of C from the paraboloid equations PEs themselves and get correction equations CEs.

Step 3. Substitute the values of the coordinates x and y into the correction equations CEs to get the solutions zCorrections.

Step 4. Use the solutions zCorrections and the value of the coordinate z to do interpolation and get the result zInterpolation.

Step 5. Subtract the result zInterpolation from the value of the coordinate z and get the correction value zCorrected.

Figure 4.7 A flowchart of the correction algorithm.
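The correction algorithm can be sketched in Python with the coefficients of Table 4.1. Several details are assumptions, since the text does not fix them: the correction equation is taken to be A x² + B y² (the paraboloid without its constant term C), the interpolation is linear in z between the tabulated values of C, and values of z outside the tabulated range use the nearest table entry. The name `correct_z` is hypothetical.

```python
import numpy as np

# Rows of Table 4.1 as (C, A, B); C is the constant term of each paraboloid.
TABLE_4_1 = np.array([
    (-1003.62, 0.000495949, 0.000499793),
    (-1535.92, 0.000323624, 0.000380348),
    (-2120.88, 0.000232773, 0.000281706),
    (-2560.30, 0.000188799, 0.000248159),
    (-3111.78, 0.000155055, 0.000205877),
])


def correct_z(x, y, z):
    """Return zCorrected for one 3D data point (x, y, z), all in millimetres."""
    c, a, b = TABLE_4_1[:, 0], TABLE_4_1[:, 1], TABLE_4_1[:, 2]
    # Steps 2-3: evaluate every correction equation (paraboloid minus C) at (x, y).
    z_corrections = a * x ** 2 + b * y ** 2
    # Steps 1 and 4: interpolate linearly in z between the tabulated paraboloids
    # (assumed scheme); np.interp requires increasing sample positions.
    order = np.argsort(c)
    z_interpolation = np.interp(z, c[order], z_corrections[order])
    # Step 5: subtract the interpolated correction distance from z.
    return z - z_interpolation
```

Under these assumptions, a point lying exactly on one of the tabulated paraboloids is mapped back to that paraboloid's constant term C, that is, onto a flat plane.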


4.5 Experimental Results

4.5.1 Results of 3D Image Construction

We use the KINECT images to construct 3D images and display them using OpenGL. An example of the results of 3D image construction is shown in Figure 4.8.

Figure 4.8 An example of construction of 3D images. (a) The color image of the KINECT image. (b) The depth image of the KINECT image. (c) The 3D image seen from the top. (d) The 3D image seen from a side.


4.5.2 Results of 3D Image Correction

We use the correction algorithm to correct the 3D data of the constructed 3D image and display the result using OpenGL. However, one problem remains: the corners of the 3D image are still curved irregularly. One solution is to avoid using the 3D data at the corners of the 3D image. An example of the results of such geometric corrections for planes is shown in Figure 4.9. Another example, for an indoor environment, is shown in Figure 4.10.

Figure 4.9 Results of geometric correction. (a) Original data seen from the top before correction. (b) Data seen from the top after correction. (c) Original data seen from the side before correction. (d) Data seen from the side after correction.


Figure 4.10 Results of geometric correction. (a) Original data seen from the top before correction. (b) Data seen from the top after correction. (c) Original data seen from the front before correction. (d) Data seen from the front after correction.


Chapter 5 Construction of 3D Indoor Environment Model from Multiple KINECT Images

5.1 Introduction

In this chapter, we describe how we construct the indoor environment model for 3D video surveillance using images acquired by the octagonal 9-KINECT imaging device. More specifically, we use the nine KINECT devices to get nine sets of KINECT images and convert them into nine 3D images individually. Then, we merge the nine 3D images to build up an indoor environment model. Before doing so, however, we must calibrate the spatial relations between the nine KINECT devices.

The detail of the proposed calibration technique will be described in Section 5.2. After the calibration work, we use the results to merge the nine 3D images by shifting, rotating, and merging them to build up the indoor environment model. Finally, we display the model in 3D manners. The details of data merging and model displaying will be shown in Sections 5.3 and 5.4, respectively.

5.2 Calibration of KINECT Devices


5.2.1 Review of Iterative Closest Point (ICP) Algorithm

The iterative closest point (ICP) algorithm [11] can be employed to minimize the difference between two groups of points. It is often used to match objects, each of which is represented by many points, and to compute their similarity. It is useful for constructing 2D or 3D images from different views, because object registration or stitching requires shape matching.

The concept of the algorithm is simple. It iteratively revises the transformation, including translation and rotation, from an object into another in order to minimize the total distance between the points of the two objects. The algorithm is as follows.

Algorithm 5.1: ICP algorithm

Input: a group of points GA, another group of points GB, a set of transformations Tis, an initialized minimum value M, and an initialized transformation T0.

Output: a transformation T which is the relation between group GA and group GB.

Steps:

Step 1. Apply a transformation Ti, which is not used yet, to all points in group GB.

Step 2. Find points PMDs with the minimum distance in group GA for each point in group GB.

Step 3. Compute the values VMDs of the minimum distance between the found points PMDs in group GA and the corresponding points in group GB.

Step 4. Sum up the values VMDs to get a total sum TS.

Step 5. If the total sum TS is smaller than the minimum value M, update the minimum value M with the total sum TS and the desired transformation T with the transformation Ti.

Step 6. Repeat Step 1 through Step 5 if the transformations Tis are not exhausted yet.

Step 7. Take the last updated transformation T as the output.
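The steps of Algorithm 5.1 can be sketched in Python as follows. The sketch follows the steps as listed, that is, an exhaustive search over a given set of candidate transformations; it is not the classical ICP variant that re-estimates the transformation in closed form at each iteration. Representing each transformation as a pair (rotation matrix, translation vector) and the function names are assumptions made for illustration.

```python
import numpy as np


def total_min_distance(ga, gb):
    """Steps 2-4: sum over the points of gb of the distance to the closest point of ga."""
    diff = gb[:, None, :] - ga[None, :, :]      # all pairwise differences
    dists = np.linalg.norm(diff, axis=2)        # shape (len(gb), len(ga))
    return dists.min(axis=1).sum()


def search_best_transform(ga, gb, candidates, m=np.inf, t0=None):
    """Return (T, M): the candidate with the smallest total distance and that distance.

    candidates is an iterable of (rotation, translation) pairs (assumed form);
    m and t0 play the roles of the initial minimum value M and transformation T0.
    """
    best, best_sum = t0, m
    for rotation, translation in candidates:
        moved = gb @ rotation.T + translation   # Step 1: apply Ti to all points of GB
        ts = total_min_distance(ga, moved)      # Steps 2-4: total sum TS
        if ts < best_sum:                       # Step 5: keep the better candidate
            best_sum, best = ts, (rotation, translation)
    return best, best_sum                       # Step 7
```

The pairwise distance matrix makes the cost of each candidate proportional to the product of the two group sizes, so for full-resolution 3D images a spatial index or a subsampled point set would likely be needed.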

5.2.2 Calibration of Spatial Relation between KINECT Devices

In this section, we use the ICP algorithm to calibrate the spatial relations of the nine KINECT devices in the octagonal 9-KINECT imaging device. Two 3D images of the same object acquired from two different KINECT devices can be merged by the ICP algorithm, and the resulting transformation between them is exactly the spatial relation of the two KINECT devices, because the transformation between the 3D images is equivalent to the transformation between the KINECT devices. With this concept, we should prepare three things before starting the calibration.

First, we should decide the range of the transformation parameters. For this, we divide the transformation into two parts: a rotation and a translation. For the rotation, because the sensing directions of the nine KINECT devices of the octagonal 9-KINECT imaging device are fixed, the angles between the nine KINECT devices are also fixed, and we can use the values of these angles for the rotation. For the translation, we divide it into two directions to facilitate running the ICP algorithm. The position of each of the nine KINECT devices is fixed, so the distance between every two of them is also fixed. We enlarge the values of these distances and decompose them into the two directions to obtain the translation ranges in the two directions.

Second, we should find the overlap region of the 3D images acquired from every two KINECT devices. Using the overlap region, we can merge the 3D images of an identical object "seen" from different KINECT devices by the ICP algorithm in order to get the result of the transformation. The overlap regions may be identified manually.

Third, we should choose objects whose 3D images from different KINECT devices can be merged in the overlap regions; we call them calibration targets. Basically, a calibration target should be big enough and clearly visible in the overlap region. For this, we use common objects which appear in the indoor environment as calibration targets, such as a couch, a table, a chair, a clapboard, etc. If there is no clearly visible object in the overlap region, we use a box placed at a suitable height as the calibration target. Some calibration targets are shown in Figure 5.1.

Figure 5.1 Some calibration targets used in this study. (a) A couch. (b) A clapboard. (c) A chair and a table. (d) A box with a suitable height.
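The first preparation step described above, a fixed rotation combined with a range of translations in two directions, can be sketched in Python as a generator of candidate transformations. The choice of the vertical axis as the rotation axis (taken here as the y-axis), the two translation directions (x and z), the grid step, and the function names are all assumptions made for illustration; the text does not specify them.

```python
import itertools
import math

import numpy as np


def rotation_about_vertical(angle_deg):
    """Rotation matrix about the y-axis (assumed to be the vertical axis)."""
    a = math.radians(angle_deg)
    return np.array([
        [math.cos(a), 0.0, math.sin(a)],
        [0.0, 1.0, 0.0],
        [-math.sin(a), 0.0, math.cos(a)],
    ])


def candidate_transforms(angle_deg, range_x, range_z, step):
    """Build (rotation, translation) candidates on a regular grid.

    angle_deg is the fixed angle between two neighboring devices; range_x and
    range_z are (low, high) translation ranges in the two assumed directions.
    """
    rotation = rotation_about_vertical(angle_deg)
    xs = np.arange(range_x[0], range_x[1] + step / 2.0, step)
    zs = np.arange(range_z[0], range_z[1] + step / 2.0, step)
    return [(rotation, np.array([tx, 0.0, tz]))
            for tx, tz in itertools.product(xs, zs)]
```

A finer grid step gives a more precise translation estimate at the cost of more candidates for the search to evaluate.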


5.2.3 Algorithm for KINECT Device Calibration

With the preparation done, we start to calibrate the spatial relations between the nine KINECT devices in the octagonal 9-KINECT imaging device. First, we label the nine KINECT devices by numbers such that two consecutively numbered KINECT devices are neighbors. Then, we use the 3D images, which include the pre-selected calibration target in their overlap region, to calibrate the inter-KINECT relation parameters by the ICP algorithm. In total, we conduct such a calibration eight times.

Before we conduct such calibrations each time, we reset the range of the possible transformations between the two devices for the ICP algorithm, set the two 3D images including the calibration target from two neighboring KINECT devices as inputs to the ICP algorithm, and use the overlap region in the images to assist the calibration work. The proposed algorithm for KINECT-device calibration is as follows.

Algorithm 5.2: KINECT device calibration.

Input: the 3D images CT0, CT1, …, CT8 which are constructed from KINECT images acquired by the nine KINECT devices D0, D1, …, D8 and include the calibration target; the transformation NTj and the overlap region ORj between every two neighboring KINECT devices Dj and Dj+1 where j = 0, 1, …, 7; a counter with its value C set to be 0 initially.

Output: the transformation Rk between every two KINECT devices Dk and Dk+1, where k = 0, 1, …, 7, which can be used to “register” the 3D images CTk and CTk+1.

Steps:

Step 1. Take two 3D images CTc and CTc+1, which include the calibration target in their overlapping region, as input data for the ICP algorithm.


Step 2. Set the transformation NTc to be the transformation sets for the ICP algorithm.

Step 3. Start the ICP algorithm described in Section 5.2.1 while using the overlap region ORc to assist finding the calibration target for the ICP algorithm.

Step 4. Store the result of transformation of the ICP algorithm as the result of the transformation Rc.

Step 5. Increment the value C of the counter by 1.

Step 6. If the value C is smaller than eight, then repeat Steps 1 through 5; else, exit.
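The loop of Algorithm 5.2 can be sketched in Python as follows. The registration itself is passed in as a callable `register`, a hypothetical stand-in for the ICP algorithm of Section 5.2.1 restricted to the overlap region; its signature and the function name `calibrate_devices` are assumptions made for illustration.

```python
def calibrate_devices(images, transform_sets, overlap_regions, register):
    """Return the transformations R0, ..., R(n-2) between neighboring devices.

    images:          3D images CT0, ..., CT(n-1), one per KINECT device
    transform_sets:  candidate transformations NTj for each neighboring pair
    overlap_regions: overlap regions ORj for each neighboring pair
    register:        hypothetical callable (image_a, image_b, candidates, overlap)
                     that returns the transformation found by the ICP algorithm
    """
    results = []
    c = 0                                   # counter C, set to 0 initially
    while c < len(images) - 1:              # eight pairs for nine devices
        # Steps 1-4: register CTc and CTc+1 using NTc and ORc, then store Rc.
        r_c = register(images[c], images[c + 1],
                       transform_sets[c], overlap_regions[c])
        results.append(r_c)
        c += 1                              # Step 5: increment the counter
    return results
```

With nine 3D images as input, the loop calls `register` eight times, once for each pair of neighboring devices.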

5.3 Environment Model Construction

5.3.1 Idea of Construction

After calibrating the spatial relations between the nine KINECT devices in the octagonal 9-KINECT imaging device, we can get eight transformations between the nine KINECT devices. As we mentioned previously, a transformation between two 3D images is equivalent to the transformation between the two corresponding KINECT devices, and vice versa. So we will use the results of the transformation to "register" the nine 3D images, which are constructed from KINECT images acquired by the nine KINECT devices. By doing so, we can merge the nine 3D images into one to