
Chapter 1 Introduction

1.2 Organization of the thesis

The rest of the thesis is organized as follows. Chapter 2 introduces the intelligent learning algorithm. Chapter 3 presents the traditional depth detection algorithm and the ANN depth detection algorithm, respectively. Chapter 4 provides experimental results and discussion. Finally, Chapter 5 presents the conclusions and future work.

Chapter 2

Intelligent learning algorithm

2.1 Introduction to ANNs

The human nervous system consists of a large number of neurons, comprising somas, axons, dendrites, and synapses. Each neuron is capable of receiving, processing, and passing electrochemical signals from one to another. To mimic the characteristics of the human nervous system, investigators have recently developed an intelligent algorithm, called artificial neural networks (ANNs), to construct intelligent machines capable of parallel computation. This thesis applies ANNs to depth detection in an eyeball system through learning.


Fig. 2.1 Basic element of ANNs

ANNs can be divided into three layers: the input layer, the hidden layer, and the output layer. The input layer receives signals from the outside world; it just holds the input values and contains no neurons. The number of neurons in the output layer depends on the number of outputs, and from the output layer the response of the net can be read. The neurons between the input layer and the output layer belong to the hidden layer, which does not necessarily exist. Here, each input is multiplied by a corresponding weight, analogous to synaptic strengths. The weighted inputs are summed to determine the activation level of the neuron. The connection strengths, or weights, represent the knowledge in the system, and information processing takes place through the interaction among these units. The basic element of ANNs, a single-layer net, is shown in Fig. 2.1 and obeys the input-output relation

$$ y = f\Big(\sum_{i=1}^{n} w_i x_i + b\Big) \tag{2.1-1} $$

where the activation function $f(\cdot)$ has many types, covering both linear and nonlinear forms. Note that the commonly used activation function is

$$ f(x) = \frac{1}{1+e^{-x}} \tag{2.1-2} $$

which is a sigmoid function. Based on the basic element, the most common multilayer feed-forward net, shown in Fig. 2.2, contains an input layer, an output layer, and two hidden layers. Multilayer nets can solve more complicated problems than single-layer nets; that is, a multilayer net may solve cases that a single-layer net cannot be trained to perform correctly at all.

However, the training process of multilayer nets may be more difficult. The number of hidden layers and their neurons in a multilayer net is decided by the degree of complication of the problem to be solved.


Fig. 2.2 Multilayer feed-forward network
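To make the feed-forward computation above concrete, the following is a minimal NumPy sketch of a pass through a multilayer net using the sigmoid of (2.1-2). The layer sizes and random weights are illustrative placeholders, not the networks actually used in this thesis.

```python
import numpy as np

def sigmoid(x):
    # Activation function of (2.1-2): f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, weights, biases):
    """Propagate input x through each layer: a = f(W a + b)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # weighted sum, then sigmoid
    return a

# Toy example: 3 inputs -> 4 hidden neurons -> 2 outputs (placeholder sizes)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(feed_forward(np.array([0.5, -1.0, 2.0]), weights, biases))
```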

In addition to the architecture, the method of setting the values of the weights is an important distinction between different neural nets. For convenience, the training of a neural network is mainly classified into supervised learning and unsupervised learning.

Supervised learning trains the network to map a given set of inputs to a specified set of target outputs; the weights are then adjusted according to various learning algorithms.

The other type, unsupervised learning, lets self-organizing neural nets group similar input vectors together without the use of training data to specify what a typical member of each group looks like or to which group each vector belongs. For unsupervised learning, a sequence of input vectors is provided, but no target vectors are specified.

The net modifies the weights so that the most similar input vectors are assigned to the same output unit. In addition, there are nets whose weights are fixed without an iterative training process, called structure learning, which change the network structure to achieve reasonable responses. In this thesis, the neural network learns the behavior from many input-output pairs; hence, it belongs to supervised learning.

2.2 Back-Propagation Network

In supervised learning, the back-propagation learning algorithm is the most widely used in applications. The back-propagation (BP) algorithm was proposed in 1986 by Rumelhart, Hinton, and Williams; it is based on the gradient steepest descent method, updating the weights to minimize the total squared error of the output. BP training is mainly applied to multilayer feed-forward networks and involves three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the associated error, and the adjustment of the weights. Fig. 2.3 shows a back-propagation network containing an input layer with N_inp neurons, one hidden layer with N_hid neurons, and an output layer with N_out neurons. In Fig. 2.3, the nodes denote the input, hidden, and output neurons of the network. In addition, v_ij is the weight from the i-th neuron in the input layer to the j-th neuron in the hidden layer, and w_gh is the weight from the g-th neuron in the hidden layer to the h-th neuron in the output layer.


Fig. 2.3 Back-propagation network

The learning algorithm of BP is elaborated below:

Step 1: Input the training data of input vectors and their desired outputs, and set the learning rate η between 0.1 and 1.0 to reduce the computing time or increase the precision.

Step 2: Set the initial weight and bias values of the network at random.

Step 3: Calculate the output of the m-th neuron in the hidden layer

$$ z_m = f_h\Big(\sum_{i=1}^{N_{inp}} v_{im}\,x_i + \theta_m\Big) \tag{2.1-3} $$

where $f_h(\cdot)$ is the activation function of the hidden neuron, and the output of the i-th neuron in the output layer

$$ y_i = f_y\Big(\sum_{g=1}^{N_{hid}} w_{gi}\,z_g + \theta_i\Big) \tag{2.1-4} $$

where $f_y(\cdot)$ is the activation function of the output neuron.

Step 4: Calculate the error function between the network output and the desired output,

$$ E = \frac{1}{2}\sum_{i=1}^{N_{out}} (d_i - y_i)^2 \tag{2.1-5} $$

Step 5: Based on the gradient steepest descent method, determine the correction of the weights,

$$ \Delta w_{gi} = -\eta\,\frac{\partial E}{\partial w_{gi}}, \qquad \Delta v_{im} = -\eta\,\frac{\partial E}{\partial v_{im}} \tag{2.1-6} $$

Step 6: Propagate the correction backward to update the weights,

$$ w_{gi}(t+1) = w_{gi}(t) + \Delta w_{gi}, \qquad v_{im}(t+1) = v_{im}(t) + \Delta v_{im} \tag{2.1-8} $$

Step 7: Check whether the whole training data set has been learned. A network learning the whole training data set once is called a learning circle. If the network has not yet gone through a learning circle, return to Step 1; otherwise, go to Step 8.

Step 8: Check whether the network converges. If E < E_max, terminate the training process; otherwise, begin another learning circle by going to Step 1.

The BP learning algorithm can be used to model various complicated nonlinear functions. In recent years, the BP learning algorithm has been successfully applied in many domains, such as pattern recognition, adaptive control, and clustering problems. In this thesis, the BP algorithm is used to learn the input-output relationship of the depth function.
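The eight steps can be condensed into a short program. The following is a minimal sketch of BP for one hidden layer with sigmoid activations and the squared error of Step 4, assuming the standard correction Δw = −η ∂E/∂w; biases are omitted for brevity, and the XOR data at the end are placeholders rather than the thesis's depth data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bp(X, D, n_hid=5, eta=0.5, max_epochs=10000, E_max=1e-3, seed=0):
    """Steps 1-8: train a one-hidden-layer net on inputs X and targets D."""
    rng = np.random.default_rng(seed)
    V = rng.uniform(-0.5, 0.5, (X.shape[1], n_hid))  # Step 2: input->hidden weights
    W = rng.uniform(-0.5, 0.5, (n_hid, D.shape[1]))  # Step 2: hidden->output weights
    for epoch in range(max_epochs):                  # each pass = one learning circle
        Z = sigmoid(X @ V)                           # Step 3: hidden outputs (2.1-3)
        Y = sigmoid(Z @ W)                           # Step 3: network outputs (2.1-4)
        E = 0.5 * np.sum((D - Y) ** 2)               # Step 4: total squared error
        if E < E_max:                                # Step 8: convergence check
            break
        delta_y = (D - Y) * Y * (1 - Y)              # Step 5: output-layer error term
        delta_z = (delta_y @ W.T) * Z * (1 - Z)      # Step 5: back-propagated hidden term
        W += eta * Z.T @ delta_y                     # Step 6: apply weight corrections
        V += eta * X.T @ delta_z
    return V, W

# Toy usage: learn XOR (placeholder data, not the thesis's depth data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
V, W = train_bp(X, D)
```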

Chapter 3

Intelligent depth detection for a humanoid vision system

3.1 Humanoid Vision System Description

The HVS is built with two cameras and five motors to emulate human eyeballs, as shown in Fig. 3.1. These five motors, FAULHABER DC servomotors, drive the two cameras to implement the eye movement: one for the conjugate tilt of the two eyes, two for the pan of each eye, and two for the pan and tilt of the neck. The control of the DC servomotors is executed by the motion control card, MCDC 3006 S, with a positioning resolution of 0.18°. With these 5 degrees of freedom, the HVS can track a target whose position is determined from the image processing of the two cameras. In addition, the two cameras, QuickCam™ Communicate Deluxe, have the specifications listed below [19]:

• 1.3-megapixel sensor with RightLight™2 Technology

• Built-in microphone with RightSound™ Technology

• Video capture: up to 1280 x 1024 pixels (HD quality) (HD video 960 x 720 pixels)

• Frame rate: up to 30 frames per second

• Still image capture: 5 megapixels (with software enhancement)

• USB 2.0 certified

• Optics: manual focus

In the proposed system structure, the baseline 2d is set to a constant 10.5 cm. The control and image processing are both implemented on a personal computer with a 3.62 GHz CPU.

Fig. 3.1 Humanoid vision system

3.2 Depth Detection

3.2.1 Traditional depth detection algorithm

Before introducing the depth computation, the triangulation for one camera is introduced first [20]. Fig. 3.2 illustrates the relation of a world point with world coordinates (X, Y, Z), which is projected onto camera coordinates (x, y) and onto image coordinates (u, v) in the image plane. The mapping between the camera coordinates (x, y) and the world coordinates (X, Y, Z) is formed by means of similar triangles as

$$ x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z} \tag{3.1-1} $$

where f is the focal length of the camera. The origin of the camera frame is located at the intersection of the optical axis with the image plane, while the origin of the image frame is located at the top left corner of the image. The transformation between the image frame and the camera frame is given by

$$ u = \mathrm{round}(u_0 + k_u x), \qquad v = \mathrm{round}(v_0 + k_v y) \tag{3.1-2} $$

where k_u and k_v are the scale factors of the u and v axes, respectively. Besides, (u_0, v_0) are the image coordinates of the origin of the camera frame, and the function round(·) rounds the element to its nearest integer.

It is known that mapping a 3-D scene onto an image plane is a multiple-to-one transformation; that is, an image point may represent different locations in world coordinates. To derive the world coordinate information uniquely, two cameras should be used. With the triangulation theory and the disparity of a pair of corresponding object points in the two cameras' frames, the real location of an object can be reconstructed.

Fig. 3.2 Relation of a world point projected onto an image plane
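For illustration, the sketch below chains (3.1-1) and (3.1-2) to project a world point to pixel coordinates; the focal length, scale factors, and principal point are assumed placeholder values, not the HVS's calibrated parameters. It also demonstrates the multiple-to-one property: two different world points can land on the same pixel.

```python
def project(X, Y, Z, f, ku, kv, u0, v0):
    """Pinhole projection (3.1-1) followed by the image-frame mapping (3.1-2)."""
    x, y = f * X / Z, f * Y / Z    # camera coordinates via similar triangles
    u = int(round(u0 + ku * x))    # round(.) maps to the nearest pixel
    v = int(round(v0 + kv * y))
    return u, v

# Two world points at different depths map to the same pixel
# (the multiple-to-one property that motivates using two cameras).
print(project(10.0, 5.0, 100.0, f=0.5, ku=800, kv=800, u0=320, v0=240))
print(project(20.0, 10.0, 200.0, f=0.5, ku=800, kv=800, u0=320, v0=240))
```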

Fig. 3.3 Configuration of a HVS with binocular cameras (front view)

A HVS is implemented by two cameras with fixed focal length as shown in Fig. 3.3. The baseline of the two cameras is 2d, where d is the length from the center of the baseline to either camera, left or right. Let the pan angles of the left and right cameras be denoted by α_L and α_R, respectively. Since the cameras may not be placed precisely, there often exists a small tilt angle 2β and a small roll angle 2φ between the two cameras. Choose the stereo reference frame at O_s, which divides the tilt angle, roll angle, and baseline equally, with the Z_s axis pointing towards the fixation point F. The left camera coordinate frame L is related to the stereo reference frame by a translation vector d_L = (−d, 0, 0)^T and a rotation matrix Ω_L(α_L, β_L0, φ_L0), where α_L, β_L0, and φ_L0 are respectively the Euler angles of pan, tilt, and roll of frame L, and Ω(α, β, φ) is the general rotation matrix, expressed as the product of the elementary pan, tilt, and roll rotations

$$ \Omega(\alpha, \beta, \varphi) = \Omega_{pan}(\alpha)\,\Omega_{tilt}(\beta)\,\Omega_{roll}(\varphi) \tag{3.1-3} $$

whose entries are denoted by ω_ij(α, β, φ); for example, ω_11(α, β, φ) = cos α cos φ.

The transformation equation of the object point B from frame L to the reference frame is described as

$$ \mathbf{p}_L = \Omega_L(\alpha_L, \beta_{L0}, \varphi_{L0})\,(\mathbf{p}_s - \mathbf{d}_L) \tag{3.1-4} $$

where p_s = (X_s, Y_s, Z_s)^T is the position vector of B in the reference frame. From (3.1-1) and (3.1-4), we have, for example,

$$ Z_L = (X_s + d)\,\omega_{31}(\alpha_L, \beta_{L0}, \varphi_{L0}) + Y_s\,\omega_{32}(\alpha_L, \beta_{L0}, \varphi_{L0}) + Z_s\,\omega_{33}(\alpha_L, \beta_{L0}, \varphi_{L0}) \tag{3.1-5} $$

Therefore, the relation between the image coordinates of frame L and the world coordinates is

$$ u_L = \mathrm{round}\Big(u_0 + k_u f\,\frac{X_L}{Z_L}\Big), \qquad v_L = \mathrm{round}\Big(v_0 + k_v f\,\frac{Y_L}{Z_L}\Big) \tag{3.1-6} $$

For the right camera coordinate frame R, the transformation equation of B is described as

$$ \mathbf{p}_R = \Omega_R(\alpha_R, \beta_{R0}, \varphi_{R0})\,(\mathbf{p}_s - \mathbf{d}_R) \tag{3.1-7} $$

where p_R is the position vector of B in frame R, d_R = (d, 0, 0)^T is the translation vector, and Ω_R(α_R, β_R0, φ_R0) is the rotation matrix of frame R. The relation between the right camera coordinates and the world coordinates is given in the same manner as (3.1-5). Accordingly, the relation between the image coordinates of frame R and the world coordinates can be expressed as

$$ u_R = \mathrm{round}\Big(u_0 + k_u f\,\frac{X_R}{Z_R}\Big), \qquad v_R = \mathrm{round}\Big(v_0 + k_v f\,\frac{Y_R}{Z_R}\Big) \tag{3.1-9} $$

The rounding errors for both the left and the right frames are usually much smaller compared to the other terms. Further define the stereo disparity u_sd, the disparity of B between the left image frame and the right image frame, expressed as

$$ u_{sd} = u_L - u_R \tag{3.1-10} $$

which will be applied to the determination of the depth of B.

In the HVS, if the two cameras are placed precisely in parallel, then the pan, tilt, and roll angles between these two cameras are zero, i.e., α_L = α_R = 0, β_L0 = β_R0 = 0, and φ_L0 = φ_R0 = 0. From (3.1-4) and (3.1-7), the position vectors of B in the left and right camera frames are respectively obtained as

$$ \mathbf{p}_L = (X_s + d,\; Y_s,\; Z_s)^T \tag{3.1-11} $$

$$ \mathbf{p}_R = (X_s - d,\; Y_s,\; Z_s)^T \tag{3.1-12} $$

Since the rounding terms are negligible, the disparity (3.1-10) can be approximated as

$$ u_{sd} \approx k_u f\Big(\frac{X_L}{Z_L} - \frac{X_R}{Z_R}\Big) = \frac{k_u f\,(X_L - X_R)}{Z} \tag{3.1-16} $$

since Z_L = Z_R = Z in the parallel case. Rearranging it leads to

$$ Z = \frac{2d\,k_u f}{u_{sd}} \tag{3.1-17} $$

where X_L − X_R = 2d is the separation of the two cameras. For an arbitrary object point, it is hard to obtain the X_L and X_R values individually; therefore, the traditional depth computation is usually derived under the assumption that both cameras have the same parameters, i.e., k_uL f_L = k_uR f_R = k_u f. Clearly, with (3.1-17), only k_u f and u_sd are required to find the depth Z.
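As a sketch of how (3.1-17) is applied, the function below recovers depth from a matched pair of horizontal image coordinates. The disparity and k_u f values in the usage example are assumptions for illustration (k_u f is taken near the Φ value identified in Chapter 4), not measured data.

```python
def depth_from_disparity(u_L, u_R, kuf, d):
    """Traditional depth (3.1-17): Z = 2*d*ku*f / u_sd, with u_sd = u_L - u_R."""
    u_sd = u_L - u_R                 # stereo disparity (3.1-10)
    if u_sd == 0:
        raise ValueError("zero disparity: point at infinity or mismatched pair")
    return 2.0 * d * kuf / u_sd

# Example with the thesis's baseline 2d = 10.5 cm and an assumed kuf value
print(depth_from_disparity(u_L=350, u_R=342, kuf=142.918, d=5.25))  # ~187.6 cm
```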

3.2.2 ANN depth detection algorithm

The stereo pair obtained from the two cameras can be utilized to compute the depth of a point using the traditional depth computation introduced in the previous section.

However, to apply the computation, the parameters of each camera need to be experimentally obtained in advance. Therefore, an ANN can be used to train the system, eliminating the complicated computation process.

It is well known that multilayer neural networks can approximate any arbitrary continuous function to any desired degree of accuracy [14]. A number of investigators have used them for different purposes in stereo vision. For example, they have been used for camera calibration [15], establishing correspondence between stereo pairs based on features [16], and generating depth maps in stereo vision [12][15]. In this thesis, a feed-forward neural network is used for depth estimation. The ANN computation system does not need to calibrate the cameras in the HVS, which can be very helpful in rapid prototyping applications. The proposed thesis employs a Multi-Layer Perceptron (MLP) network trained by the BP training algorithm, which is the most commonly adopted for MLP networks.

In this problem, the thesis proposes a multilayer ANN model because camera calibration is a nonlinear problem and cannot be solved with a single-layer network [17]. Further, according to the neural network literature [18], more than one hidden layer is rarely needed. The more layers a neural network has, the more parameter values need to be set, because the number of neurons in each layer must be determined. Therefore, to reduce the number of permutations, a network with one hidden layer was selected.

The network model used for simulation, shown in Fig. 3.4, consists of four input neurons, five hidden neurons, and three output neurons. The input neurons correspond to the image coordinates of matched points found in the stereo images, (ul, vl) and (ur, vr). These points are generated by the same world point on both images and form the input data for the neural network. The output neurons correspond to the world coordinates of the point that is mapped as (ul, vl) and (ur, vr) on the two images. The network is trained over an interesting range of actual depth; after training, it gives the world coordinates for any matched pair of points.

Fig. 3.4 ANN model used for this thesis

The algorithm requires training with a set of matched image points whose corresponding world point is known. The set of matched points and the world coordinates thus obtained forms the training data set for the ANN. Once the network is trained, we present it with arbitrary matched points and it directly gives us the depth corresponding to the matched pair.

The main problem in using the MLP network is how to choose optimum parameters. Presently, there is no standard technique for automatically setting the parameters of an MLP network. That is to say, the best architecture and algorithm for the problem can only be evaluated by experimentation, and there are no fixed rules to determine the ideal network model for a problem. Therefore, experiments were performed on the neural network to determine the parameters according to its performance. The parameters, including the number of neurons in the hidden layer (one hidden layer is employed in this thesis), will be discussed in the next chapter for obtaining more precise detection results.
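Below is a minimal sketch of the 4-5-3 configuration of Fig. 3.4, trained with the BP rule of Section 2.2. The matched-point rows and all numeric values are invented placeholders (not the thesis's measured data), and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 4 inputs (ul, vl, ur, vr) -> 5 hidden -> 3 outputs (X, Y, Z), as in Fig. 3.4.
# The matched-point rows below are illustrative placeholders only.
inputs = np.array([[310, 245, 298, 246],
                   [352, 201, 338, 203]], dtype=float)
targets = np.array([[2.0, 1.5, 95.0],
                    [8.0, 4.0, 125.0]], dtype=float)

# Scale both sides into [0, 1] so the sigmoid outputs can represent them.
x = inputs / inputs.max(axis=0)
t = targets / targets.max(axis=0)

rng = np.random.default_rng(1)
V = rng.uniform(-0.5, 0.5, (4, 5))   # input -> hidden weights
W = rng.uniform(-0.5, 0.5, (5, 3))   # hidden -> output weights
for _ in range(20000):               # BP training loop (see Section 2.2)
    Z = sigmoid(x @ V)
    Y = sigmoid(Z @ W)
    dy = (t - Y) * Y * (1 - Y)       # output-layer error term
    dz = (dy @ W.T) * Z * (1 - Z)    # back-propagated hidden term
    W += 0.5 * Z.T @ dy
    V += 0.5 * x.T @ dz

# Undo the output scaling to recover world coordinates (X, Y, Z)
print(sigmoid(sigmoid(x @ V) @ W) * targets.max(axis=0))
```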

Chapter 4

Experimental Results and Discussion

4.1 Experimental Settings

The board consisting of a set of grid points is placed in front of the HVS for depth detection, as shown in Fig. 4.1 and Fig. 4.2, one for the left camera and the other for the right. With these two cameras, the HVS captures images of a specified cross at various distances, ranging from 65 to 165 cm. To verify the usefulness of the proposed ANN depth detection algorithm, the experiment is implemented by changing the distance Z between the HVS and the board, where Z goes from 65 cm to 165 cm in increments of 10 cm.

In the next section, different cases are used to show the restrictions of the traditional depth detection algorithm as well as the flexibility and features of the ANN depth detection algorithm proposed in the thesis.

Fig. 4.1 The board at Z=105 cm captured from the left camera

Fig. 4.2 The board at Z=105 cm captured from the right camera

4.2 Experimental Results

4.2.1 Restrictions on the traditional depth detection algorithm

It is known that the traditional depth detection algorithm (3.1-17) is only suitable for the case where both cameras have the same focal length and their optic axes are parallel. However, in practical situations, the two cameras of an HVS generally have a small difference in their focal lengths or a small deflection between their optic axes. As a result, (3.1-17) is inappropriate for detecting the depth of a scene. To show these restrictions on an HVS, experiments are set up for demonstration.

The traditional depth detection algorithm (3.1-17) is derived from (3.1-16) under the assumptions that k_uL f_L = k_uR f_R and 2d = X_L − X_R. Here, let us find the real values of Φ_L = k_uL f_L and Φ_R = k_uR f_R for the two cameras placed in parallel. Then, (3.1-16) becomes

$$ Z\,u_{sd} = \Phi_L X_L - \Phi_R X_R \tag{4.1-1} $$

which can be further rearranged as

$$ \Phi_L = \frac{Z\,u_{sd} - X_R(\Phi_L - \Phi_R)}{2d} \tag{4.1-2} $$

where the baseline 2d is fixed. Next, let us show the way to calculate Φ_L for the left camera in the HVS. By setting the shortest focal length for each camera, which is fixed but not exactly known, two images are obtained in Fig. 4.5 for Z=95 cm and 2d=10.5 cm. To calculate Φ_L, the term X_R(Φ_L − Φ_R) in (4.1-2) has to be eliminated.

By choosing the same 12 points in both images of Fig. 4.5 (a) and (b), enclosed in the dashed squares, the sum of the X_R(Φ_L − Φ_R) terms of these twelve points will vanish when they are vertically symmetric to the center line l_R in the right image. Hence,

$$ \Phi_L = \frac{Z}{2d}\cdot\frac{1}{12}\sum_{i=1}^{12} u_{sd,i} \tag{4.1-3} $$

Fig. 4.3 shows the result of Φ_L for different distances Z from 45 cm to 105 cm, verifying that Φ_L is around the average value Φ̄_L = 142.918 with 0.24116% variation. In a similar way, the value of Φ_R corresponding to the case of the shortest focal length, Z=85 cm and 2d=10.5 cm, can be obtained as in (4.1-3), where the 12 points are chosen from the images shown in Fig. 4.6, vertically symmetric to the center line l_L in the left image. Fig. 4.4 shows the result of Φ_R for different distances Z. It is obvious that Φ_R is approximate to Φ̄_L with 0.27821% variation. Since Φ_L is indeed near to Φ_R, the traditional depth detection algorithm (3.1-17) is applicable, adopting the average of Φ_L and Φ_R as k_u f.
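The identification of Φ_L by (4.1-3) reduces to averaging the measured disparities of the symmetric points at a known depth. A minimal sketch follows, where the disparity values are placeholders chosen so that the estimate lands near the reported 142.918:

```python
import numpy as np

def estimate_phi(u_sd_points, Z, d):
    """Estimate Phi = ku*f from n symmetric points via (4.1-3).
    Summing Z*u_sd = 2*d*Phi + X_R*(Phi_L - Phi_R) over points whose X_R
    values cancel in pairs leaves Phi = Z * sum(u_sd) / (n * 2 * d)."""
    n = len(u_sd_points)
    return Z * np.sum(u_sd_points) / (n * 2.0 * d)

# Placeholder disparities for 12 vertically symmetric points at Z = 95 cm
u_sd = np.array([15.7, 15.9, 15.8, 15.8, 15.6, 16.0,
                 15.9, 15.7, 15.8, 15.8, 15.9, 15.7])
print(estimate_phi(u_sd, Z=95.0, d=5.25))   # -> close to 142.918
```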

Fig. 4.5 The same 12 points used to calculate Φ_L for Z=95 cm

Fig. 4.6 The same 12 points used to calculate Φ_R for Z=95 cm

Fig. 4.7 Two adjacent points P1 and P2 in the center line of the training area for Z=135 cm

Human eyes always focus on the center of the entire eyeshot. Based on this characteristic, two adjacent points P1 and P2 in the center line of the training area on the experimental board, within the eyeshot center of the HVS, as shown in Fig. 4.7, are chosen as the testing points. P1 and P2 are used to test the error between the depth detection result and the actual depth.

The average error is computed by the mean absolute error, which can be written as

$$ e_{ma} = \mathrm{mean}\big[\,\lvert \hat{Z} - Z \rvert\,\big] \tag{4.1-5} $$

where e_ma is the mean absolute error of the network, and Z and Ẑ denote the depth that is actually measured and the corresponding depth given by the network, respectively.
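Computing (4.1-5) is a one-liner; the following minimal sketch uses placeholder depth values rather than the thesis's measurements:

```python
import numpy as np

def mean_absolute_error(Z_true, Z_pred):
    """e_ma of (4.1-5): mean of |Z_hat - Z| over the test points."""
    return np.mean(np.abs(np.asarray(Z_pred) - np.asarray(Z_true)))

# Placeholder depths (cm): actual vs. network output
print(mean_absolute_error([95.0, 105.0], [95.8, 104.1]))  # -> 0.85
```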

Therefore, to represent distinct situations of e_ma, four different conditions are arranged, as shown in Fig. 4.8 to Fig. 4.11. The first case represents the two cameras of the HVS placed in parallel with the same focal length. The second case corresponds to distinct focal lengths. The third case corresponds to an insignificant deflection between the optic axes of the two cameras. The last case includes both distinct focal lengths and insignificant deflection.

Fig. 4.8 e_ave of two cameras placed in parallel with the same focal length (Case I; error range: 0.017387%~1.7834%)

Fig. 4.9 e_ave of two cameras with distinct focal lengths (Case II; error range: 15.1465%~18.9796%)

Fig. 4.10 e_ave of two cameras with insignificant deflection between optic axes (Case III; error range: 58.3189%~1145.9954%)

Fig. 4.11 e_ave of two cameras with distinct focal length and insignificant deflection (Case IV; error range: 155.1946%~2653.9037%)

These figures show that, over the actual depth range of 65 to 165 cm, the e_ma of Case I is between 0.017387% and 1.7834%, that of Case II is between 15.1465% and 18.9796%, that of Case III is between 58.3189% and 1145.9954%, and that of Case IV is between 155.1946% and 2653.9037%. The maximum percentage errors of these cases are greater than 15% except for Case I. This means that the algorithm can only be used in situations similar to Case I; otherwise, the formula needs to be rewritten with more parameters to fit all cases.
