
Chapter 2 Intelligent learning algorithm

2.2 Back-Propagation Network

In supervised learning, the back-propagation learning algorithm is widely used in most applications. The back-propagation (BP) algorithm was proposed in 1986 by Rumelhart, Hinton and Williams; it is based on the gradient steepest descent method for updating the weights to minimize the total squared error of the output. BP training is mainly applied to multilayer feed-forward networks and involves three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the associated error, and the adjustment of the weights. Fig. 2.3 shows a back-propagation network containing an input layer with Ninp neurons, one hidden layer with Nhid neurons, and an output layer with Nout neurons.

Fig. 2.3 shows the nodes of the input, hidden, and output layers of the network. In addition, $v_{ij}$ is the weight from the i-th neuron in the input layer to the j-th neuron in the hidden layer, and $w_{gh}$ is the weight from the g-th neuron in the hidden layer to the h-th neuron in the output layer.


Fig. 2.3 Back-propagation network

The learning algorithm of BP is elaborated on below:

Step 1: Input the training data (input patterns and desired outputs) and the learning rate η, which is set between 0.1 and 1.0 to reduce the computing time or increase the precision.

Step 2: Set the initial weights and bias values of the network at random.

Step 3: Calculate the output of the m-th neuron in the hidden layer

$h_m = f_h\Big(\sum_{i=1}^{N_{inp}} v_{im}\,x_i + \theta_m\Big)$   (2.1-3)

where $f_h(\cdot)$ is the activation function of the neuron, and the output of the i-th neuron in the output layer

$y_i = f_y\Big(\sum_{g=1}^{N_{hid}} w_{gi}\,h_g + \theta_i\Big)$   (2.1-4)

where $f_y(\cdot)$ is the activation function of the neuron.

Step 4: Calculate the error function between the network output and the desired output

$E = \frac{1}{2}\sum_{i=1}^{N_{out}} \big(d_i - y_i\big)^2$

Step 5: Based on the gradient steepest descent method, determine the correction of the weights.

Step 6: Propagate the correction backward to update the weights, i.e.,

$w(t+1) = w(t) - \eta\,\frac{\partial E}{\partial w(t)}$   (2.1-8)

Step 7: Check whether the whole training data set has been learned. A network learning the whole training data set once is called a learning circle. If the network has not yet gone through a learning circle, return to Step 1; otherwise, go to Step 8.

Step 8: Check whether the network converges. If E < E_max, terminate the training process; otherwise, begin another learning circle by going to Step 1.

The BP learning algorithm can be used to model various complicated nonlinear functions. In recent years, the BP learning algorithm has been successfully applied in many domains, such as pattern recognition, adaptive control, and clustering problems. In this thesis, the BP algorithm is used to learn the input-output relationship of the depth function.
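As a concrete illustration of Steps 1 to 8, the following is a minimal NumPy sketch of one-hidden-layer BP training. It is a sketch under stated assumptions, not the thesis's implementation: sigmoid activations, batch updates, and illustrative layer sizes, learning rate, and stopping values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bp(X, D, n_hid=5, eta=0.5, max_circles=10000, E_max=1e-3, seed=0):
    """One-hidden-layer BP sketch following Steps 1-8.
    X: (N, n_inp) training inputs; D: (N, n_out) desired outputs."""
    rng = np.random.default_rng(seed)
    n_inp, n_out = X.shape[1], D.shape[1]
    # Step 2: random initial weights and biases
    V = rng.uniform(-0.5, 0.5, (n_inp, n_hid))   # input -> hidden (v_ij)
    W = rng.uniform(-0.5, 0.5, (n_hid, n_out))   # hidden -> output (w_gh)
    bh, bo = np.zeros(n_hid), np.zeros(n_out)
    for circle in range(max_circles):            # one pass = one learning circle
        # Step 3: feed-forward, (2.1-3) then (2.1-4)
        H = sigmoid(X @ V + bh)
        Y = sigmoid(H @ W + bo)
        # Step 4: total squared error
        E = 0.5 * np.sum((D - Y) ** 2)
        # Step 8: convergence check
        if E < E_max:
            break
        # Steps 5-6: steepest-descent corrections, propagated backward
        delta_o = (D - Y) * Y * (1 - Y)          # output-layer error term
        delta_h = (delta_o @ W.T) * H * (1 - H)  # back-propagated hidden term
        W += eta * H.T @ delta_o
        bo += eta * delta_o.sum(axis=0)
        V += eta * X.T @ delta_h
        bh += eta * delta_h.sum(axis=0)
    return V, bh, W, bo
```

The batch update is chosen here for brevity; per-pattern (online) updating, as Step 7's learning-circle bookkeeping suggests, works with the same error terms.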

Chapter 3

Intelligent depth detection for a humanoid vision system

3.1 Humanoid Vision System Description

The HVS is built with two cameras and five motors to emulate human eyeballs, as shown in Fig. 3.1. The five motors, FAULHABER DC-servomotors, drive the two cameras to implement the eye movements: one for the conjugate tilt of the two eyes, two for the pan of each eye, and two for the pan and tilt of the neck. The DC-servomotors are controlled by the motion control card, MCDC 3006 S, with a positioning resolution of 0.18°. With these 5 degrees of freedom, the HVS can track a target whose position is determined from the image processing of the two cameras. In addition, the two cameras, QuickCam™ Communicate Deluxe, have the specifications listed below [19]:

• 1.3-megapixel sensor with RightLight™ 2 Technology

• Built-in microphone with RightSound™ Technology

• Video capture: up to 1280 x 1024 pixels (HD quality) (HD video 960 x 720 pixels)

• Frame rate: up to 30 frames per second

• Still image capture: 5 megapixels (with software enhancement)

• USB 2.0 certified

• Optics: manual focus

In the proposed system structure, the baseline 2d is set constant at 10.5 cm. The control and image processing are both implemented on a personal computer with a 3.62 GHz CPU.

Fig. 3.1 Humanoid vision system

3.2 Depth Detection

3.2.1 Traditional depth detection algorithm

Before introducing depth computation, the triangulation for one camera is introduced first [20]. Fig. 3.2 illustrates the relation of a world point with world coordinates (X, Y, Z), which is projected onto camera coordinates (x, y) and onto image coordinates (u, v) in the image plane. The mapping between the camera coordinates (x, y) and the world coordinates (X, Y, Z) is formed by means of similar triangles as

$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z}$

where f is the focal length of the camera. The origin of the camera frame is located at the intersection of the optical axis with the image plane, while the origin of the image frame is located at the top left corner of the image. The transformation between the image frame and the camera frame is given by

$u = \mathrm{round}(k_u x) + u_0, \qquad v = \mathrm{round}(k_v y) + v_0$

where $k_u$ and $k_v$ are the scale factors of the image coordinates along the u and v axes, respectively. Besides, $(u_0, v_0)$ are the image coordinates of the origin of the camera frame, and the function round(·) rounds the element to its nearest integer.
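In code, the two mappings chain together as below. This is a minimal sketch: the parameter values in the usage comment are placeholders, and the rounding convention follows the reconstruction above.

```python
def world_to_image(X, Y, Z, f, k_u, k_v, u0, v0):
    """Map a world point to camera coordinates by similar triangles,
    then to image coordinates in the image frame."""
    x = f * X / Z                # camera coordinates: x = fX/Z
    y = f * Y / Z                # camera coordinates: y = fY/Z
    u = round(k_u * x) + u0      # image u, shifted to the image-frame origin
    v = round(k_v * y) + v0      # image v
    return u, v

# e.g., a point 100 cm away and 10 cm off-axis, with hypothetical f, k_u, k_v:
# world_to_image(10.0, 0.0, 100.0, f=0.4, k_u=357.0, k_v=357.0, u0=320, v0=240)
```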

It is known that mapping a 3-D scene onto an image plane is a multiple-to-one transformation, i.e., an image point may represent different locations in world coordinates. To derive the world coordinate information uniquely, two cameras should be used. With the triangulation theory and the disparity of a pair of corresponding object points in the two cameras' frames, the real location of an object can be reconstructed.

Fig. 3.2 Relation of a world point projected onto an image plane


Fig. 3.3 Configuration of a HVS with binocular cameras

An HVS is implemented by two cameras with fixed focal lengths, as shown in Fig. 3.3. The baseline of the two cameras is 2d, where d is the distance from the center of the baseline to either camera, left or right. Let the pan angles of the left and right cameras be denoted by $\alpha_L$ and $\alpha_R$, respectively. Since the cameras may not be placed precisely, there often exists a small tilt angle 2β and a small roll angle 2φ between the two cameras. Choose the stereo reference frame at $O_s$, which divides the tilt angle, roll angle, and baseline equally, with the Z-axis pointing towards the fixating point F. The left camera coordinate frame L is related to the stereo reference frame by a translation vector $d_L = (-d, 0, 0)^T$ and a rotation matrix $\Omega_L(\alpha_L, \beta_{L0}, \varphi_{L0})$, where $\alpha_L$, $\beta_{L0}$, and $\varphi_{L0}$ are respectively the Euler angles of pan, tilt, and roll of frame L, and $\Omega(\alpha, \beta, \varphi)$ is the general rotation matrix (3.1-1), expressed as the composition of the elementary pan, tilt, and roll rotations, whose (i, j)-th entry is denoted by $\omega_{ij}(\alpha, \beta, \varphi)$; for example, $\omega_{11}(\alpha, \beta, \varphi) = \cos\alpha\cos\varphi$.

The transformation equation of the object point B between frame L and the reference frame is described as

$p_L = \Omega_L(\alpha_L, \beta_{L0}, \varphi_{L0})\,(p_s - d_L)$   (3.1-4)

where $p_s = (X_s, Y_s, Z_s)^T$ is the position vector of B in the reference frame. From (3.1-1) and (3.1-4) we have, in particular for the Z component,

$Z_L = (X_s + d)\,\omega_{31}(\alpha_L, \beta_{L0}, \varphi_{L0}) + Y_s\,\omega_{32}(\alpha_L, \beta_{L0}, \varphi_{L0}) + Z_s\,\omega_{33}(\alpha_L, \beta_{L0}, \varphi_{L0})$

Therefore, the relations between the image coordinates of frame L and the world coordinates are

$u_L = \mathrm{round}\!\Big(k_{uL} f_L \frac{X_L}{Z_L}\Big) + u_0, \qquad v_L = \mathrm{round}\!\Big(k_{vL} f_L \frac{Y_L}{Z_L}\Big) + v_0$
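A short sketch of the frame transformation just derived follows; here Omega_L is assumed to be the 3×3 rotation matrix $\Omega_L(\alpha_L, \beta_{L0}, \varphi_{L0})$ built elsewhere, and the function name is illustrative.

```python
import numpy as np

def left_frame_coords(p_s, d, Omega_L):
    """Transform B from the stereo reference frame into frame L:
    p_L = Omega_L @ (p_s - d_L), with d_L = (-d, 0, 0)."""
    d_L = np.array([-d, 0.0, 0.0])
    p_L = Omega_L @ (np.asarray(p_s, dtype=float) - d_L)
    # The third component reproduces
    # Z_L = (X_s + d)*w31 + Y_s*w32 + Z_s*w33 from the text.
    return p_L
```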

For the right camera coordinate frame R, the transformation equation of B is described as

$p_R = \Omega_R(\alpha_R, \beta_{R0}, \varphi_{R0})\,(p_s - d_R)$   (3.1-7)

where $p_R$ is the position vector of B in frame R, $d_R = (d, 0, 0)^T$ is the translation vector, and $\Omega_R(\alpha_R, \beta_{R0}, \varphi_{R0})$ is the rotation matrix of frame R. The relation between the right camera coordinates and the world coordinates is given, for the Z component, by

$Z_R = (X_s - d)\,\omega_{31}(\alpha_R, \beta_{R0}, \varphi_{R0}) + Y_s\,\omega_{32}(\alpha_R, \beta_{R0}, \varphi_{R0}) + Z_s\,\omega_{33}(\alpha_R, \beta_{R0}, \varphi_{R0})$

Accordingly, the relations between the image coordinates of frame R and the world coordinates can be expressed as

$u_R = \mathrm{round}\!\Big(k_{uR} f_R \frac{X_R}{Z_R}\Big) + u_0, \qquad v_R = \mathrm{round}\!\Big(k_{vR} f_R \frac{Y_R}{Z_R}\Big) + v_0$

Since the tilt and roll angles are small, the terms involving them for both the left and the right frames are usually much smaller compared to the other terms. Further define the stereo disparity $u_{sd}$ of B between the left image frame and the right image frame, expressed as

$u_{sd} = u_L - u_R$   (3.1-10)

which will be applied to the determination of the depth of B.

In the HVS, if the two cameras are placed precisely in parallel, then the pan, tilt, and roll angles between these two cameras are zero, i.e., $\alpha_L = \alpha_R = 0$, $\beta_{L0} = \beta_{R0} = 0$, and $\varphi_{L0} = \varphi_{R0} = 0$. From (3.1-4) and (3.1-7), the position vectors of B in the left and right camera frames are respectively obtained as

$p_L = p_s - d_L = (X_s + d,\, Y_s,\, Z_s)^T$   (3.1-11)

$p_R = p_s - d_R = (X_s - d,\, Y_s,\, Z_s)^T$   (3.1-12)

Since $Z_L = Z_R = Z_s = Z$ in this case, substituting (3.1-11) and (3.1-12) into the image-coordinate relations and neglecting the rounding, the stereo disparity can be approximated as

$u_{sd} = u_L - u_R \approx \frac{k_{uL} f_L X_L - k_{uR} f_R X_R}{Z}$   (3.1-16)

For an arbitrary object point, however, it is hard to obtain the $X_L$ and $X_R$ values. Therefore, the traditional depth computation is usually based on the further assumption that both cameras have the same parameters, $k_{uL} f_L = k_{uR} f_R = k_u f$, where $X_L - X_R = 2d$ is the separation of the two cameras. Rearranging (3.1-16) then leads to

$Z = \frac{2d\,k_u f}{u_{sd}}$   (3.1-17)

Clearly, with (3.1-17), only $k_u f$ and $u_{sd}$ are required to find the depth Z.
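A direct implementation of (3.1-10) and (3.1-17) might look as follows. The disparity in the usage comment is hypothetical; 2d = 10.5 cm comes from the HVS settings, and the $k_u f$ value borrows the calibrated figure of about 142.918 found later in Chapter 4 purely for illustration.

```python
def traditional_depth(u_L, u_R, kuf, baseline):
    """Depth from stereo disparity, Z = 2d * k_u*f / u_sd (3.1-17).
    Valid only for parallel optic axes and equal k_u*f in both cameras."""
    u_sd = u_L - u_R                      # stereo disparity (3.1-10)
    if u_sd == 0:
        raise ValueError("zero disparity: point effectively at infinity")
    return baseline * kuf / u_sd          # baseline is the full 2d

# Example: with 2d = 10.5 cm and kuf = 142.918, a hypothetical disparity of
# 15 pixels gives Z = 10.5 * 142.918 / 15, i.e. about 100 cm.
```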

3.2.2 ANN depth detection algorithm

A stereo pair obtained from the two cameras can be utilized to compute the depth of a point by using the traditional depth computation introduced in the previous section. However, to apply that computation, the parameters of each camera need to be experimentally obtained in advance. Therefore, an ANN can be used to train the system, eliminating the complicated computation process.

It is well known that multilayer neural networks can approximate any arbitrary continuous function to any desired degree of accuracy [14]. A number of investigators have used them for different purposes in stereo vision. For example, they have been used for camera calibration [15], establishing correspondence between stereo pairs based on features [16], and generating depth maps in stereo vision [12][15]. In this thesis, a feed-forward neural network is used for depth estimation. The ANN computation system does not need to calibrate the cameras in the HVS, which can be very helpful in rapid prototyping applications. This thesis employs a Multi-Layer Perceptron (MLP) network trained by the BP training algorithm, which is the most commonly adopted algorithm for MLP networks.

For this problem, the thesis proposes a multilayer ANN model because the camera calibration problem is nonlinear and cannot be solved with a single-layer network [17]. Further, according to the neural network literature [18], more than one hidden layer is rarely needed. The more layers a neural network has, the more parameter values need to be set, because the number of neurons in each layer must be determined. Therefore, to reduce the number of permutations, a network with one hidden layer was selected.

The network model used for simulation, shown in Fig. 3.4, consists of four input neurons, five hidden neurons, and three output neurons. The input neurons correspond to the image coordinates of matched points found in the stereo images, (ul, vl) and (ur, vr). These points are generated by the same world point on both images and form the input data for the neural network. The output neurons correspond to the world coordinates of the point which is mapped as (ul, vl) and (ur, vr) on the two images. The network is trained over an interesting range of actual depth; after training, it gives the world coordinates for any matched pair of points.

Fig. 3.4 ANN model used for this thesis

The algorithm requires training with a set of matched image points whose corresponding world points are known. The matched points and the world coordinates thus obtained form the training data set for the ANN. Once the network is trained, we present it with arbitrary matched points and it directly gives us the depth corresponding to the matched pair.

The main problem in using the MLP network is how to choose optimum parameters. Presently, there is no standard technique for automatically setting the parameters of an MLP network. That is to say, the best architecture and algorithm for the problem can only be evaluated by experimentation, and there are no fixed rules to determine the ideal network model for a problem. Therefore, experiments were performed on the neural network to determine the parameters according to its performance. Parameters such as the number of neurons in the hidden layer (one hidden layer is employed in this thesis) will be determined in the next chapter for getting more precise detection results.
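For reference, the 4-5-3 architecture of Fig. 3.4 can be sketched with scikit-learn's MLPRegressor as a stand-in for the thesis's own BP-trained MLP; the training arrays and the sample matched pair below are hypothetical placeholders, not thesis data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training set: rows of (u_l, v_l, u_r, v_r) matched points and
# their known world coordinates (X, Y, Z); random stand-in values here.
rng = np.random.default_rng(0)
inputs = rng.uniform(0, 640, size=(60, 4))     # e.g. 12 points at 5 depths
targets = rng.uniform(-20, 165, size=(60, 3))  # placeholder world coords

# The 4-5-3 model of Fig. 3.4: one hidden layer with five neurons.
net = MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                   solver="sgd", learning_rate_init=0.5, max_iter=10000)
net.fit(inputs, targets)

# Once trained, any matched pair directly yields world coordinates.
X, Y, Z = net.predict([[380.0, 240.0, 310.0, 236.0]])[0]
```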

Chapter 4

Experimental Results and Discussion

4.1 Experimental Settings

The board consisting of a set of grid points is placed in front of the HVS for depth detection, as shown in Fig. 4.1 and Fig. 4.2, one for the left camera and the other for the right. With these two cameras, the HVS captures images of a specified cross at various distances, ranging from 65 to 165 cm. To verify the usefulness of the proposed ANN depth detection algorithm, the experiment is implemented by changing the distance Z between the HVS and the board, where Z increases from 65 cm to 165 cm at an increment of 10 cm.

In the next section, different cases are used to show the restrictions of the traditional depth detection algorithm and the flexibility and features of the ANN depth detection algorithm proposed in this thesis.

Fig. 4.1 The board at Z=105 cm captured from the left camera

Fig. 4.2 The board at Z=105 cm captured from the right camera

4.2 Experimental Results

4.2.1 Restrictions on the traditional depth detection algorithm

It is known that the traditional depth detection algorithm (3.1-17) is only suitable for the case where both cameras have the same focal length and their optic axes are parallel. However, in practical situations, the two cameras of an HVS generally have a slight difference in focal length or a slight deflection between their optic axes. As a result, (3.1-17) is inappropriate for detecting the depth of a scene. To show these restrictions on an HVS, experiments are set up for demonstration.

The traditional depth detection algorithm (3.1-17) is derived from (3.1-16) under the assumption that $k_{uL} f_L = k_{uR} f_R$ and $2d = X_L - X_R$. Here, let us find the real values of $\Phi_L = k_{uL} f_L$ and $\Phi_R = k_{uR} f_R$ when the optic axes are parallel. Then, (3.1-16) becomes

$u_{sd}\,Z = \Phi_L X_L - \Phi_R X_R$   (4.1-1)

which can be further rearranged as

$u_{sd}\,Z = 2d\,\Phi_L + X_R(\Phi_L - \Phi_R)$   (4.1-2)

where the baseline 2d is fixed. Next, let us show the way to calculate $\Phi_L$ for the left camera in the HVS. By setting the shortest focal length for each camera, which is fixed but not exactly known, two images are obtained in Fig. 4.5 for Z=95 cm and 2d=10.5 cm. To calculate $\Phi_L$, the term $X_R(\Phi_L - \Phi_R)$ in (4.1-2) has to be eliminated. By choosing the same 12 points in both images of Fig. 4.5 (a) and (b), enclosed in the dashed square, the sum of $X_R(\Phi_L - \Phi_R)$ over these twelve points vanishes when they are vertically symmetric to the center line $l_R$ in the right image. Hence,

$\Phi_L = \frac{1}{12 \cdot 2d}\sum_{i=1}^{12} u_{sd,i}\,Z$   (4.1-3)

The calculation is repeated for different distances Z from 45 cm to 105 cm to verify that $\Phi_L$ stays around the average value $\bar{\Phi}_L = 142.918$ with 0.24116% variation. In a similar way, the value of $\Phi_R$ corresponding to the case of the shortest focal length, Z=85 cm and 2d=10.5 cm, can be obtained as

$\Phi_R = \frac{1}{12 \cdot 2d}\sum_{i=1}^{12} u_{sd,i}\,Z$   (4.1-4)

where the 12 points are chosen from the images shown in Fig. 4.6, vertically symmetric to the center line $l_L$ in the left image. Fig. 4.4 shows the result of $\Phi_R$ for different distances Z. It is obvious that $\bar{\Phi}_R$ is approximate to $\bar{\Phi}_L$ with 0.27821% variation. Since $\Phi_L$ is indeed near to $\Phi_R$, the traditional depth detection algorithm (3.1-17) is applicable, adopting the average of $\bar{\Phi}_L$ and $\bar{\Phi}_R$ as $k_u f$.
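Under the reconstruction of (4.1-2) and (4.1-3) above, the calibration of $\Phi_L$ reduces to averaging over the symmetric points. A sketch follows; the function name and inputs are hypothetical.

```python
import numpy as np

def estimate_phi(u_left, u_right, Z, baseline):
    """Estimate Phi = k_u*f from n matched points chosen vertically symmetric
    to the center line, so the X_R*(Phi_L - Phi_R) terms cancel in the sum.
    baseline is the full 2d; Z is the (fixed) board distance."""
    u_sd = np.asarray(u_left, float) - np.asarray(u_right, float)
    # Summing u_sd*Z = 2d*Phi + X_R*(Phi_L - Phi_R) over symmetric points
    # cancels the X_R terms, leaving n * 2d * Phi.
    return np.sum(u_sd * Z) / (len(u_sd) * baseline)
```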

Fig. 4.5 The same 12 points used to calculate $\Phi_L$ for Z=95 cm

Fig. 4.6 The same 12 points used to calculate $\Phi_R$ for Z=95 cm

Fig. 4.7 Two adjacent points P1 and P2 in the center line of the training area for Z=135 cm

Human eyes always focus on the center of the entire eyeshot. Based on this characteristic, two adjacent points P1 and P2 in the center line of the training area, within the eyeshot center of the HVS, on the experimental board, as shown in Fig. 4.7, are chosen as the testing points. P1 and P2 are used to test the error between the depth detection result and the actual depth.

The average error is computed by the mean absolute error, which can be written as

$e_{ma} = \mathrm{mean}\big[\,|\hat{Z} - Z|\,\big]$   (4.1-5)

where $e_{ma}$ is the mean absolute error of the network, and Z and $\hat{Z}$ denote the depth that is actually measured and the corresponding depth given by the network, respectively.
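Expressed in code, with a percentage normalization added as an assumption to match how the error plots below are labeled, since (4.1-5) itself is an absolute error:

```python
import numpy as np

def e_ma(Z_true, Z_net, as_percent=True):
    """Mean absolute error of (4.1-5) between the measured depths Z and the
    network outputs Z_hat; optionally normalized to a percentage of Z."""
    Z_true = np.asarray(Z_true, dtype=float)
    err = np.abs(np.asarray(Z_net, dtype=float) - Z_true)
    if as_percent:
        return float(np.mean(err / Z_true) * 100.0)
    return float(np.mean(err))
```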

Therefore, to represent distinct situations of e_ma, four different conditions are arranged, as shown in Fig. 4.8 to Fig. 4.11. In the first case, the two cameras of the HVS are placed in parallel with the same focal length. The second case involves distinct focal lengths. The third case involves an insignificant deflection between the optic axes of the two cameras. The last case includes both distinct focal lengths and an insignificant deflection.

Case I

Fig. 4.8 e_ave of two cameras placed in parallel with the same focal length (error range: 0.017387%~1.7834%)

Case II

Fig. 4.9 e_ave of two cameras with distinct focal length (error range: 15.1465%~18.9796%)

Case III

Fig. 4.10 e_ave of two cameras with insignificant deflection between optic axes (error range: 58.3189%~1145.9954%)

Case IV

Fig. 4.11 e_ave of two cameras with distinct focal length and insignificant deflection (error range: 155.1946%~2653.9037%)

These figures show that, over the actual depth range of 65 to 165 cm, the e_ma of case I is between 0.017387% and 1.7834%, that of case II is between 15.1465% and 18.9796%, that of case III is between 58.3189% and 1145.9954%, and that of case IV is between 155.1946% and 2653.9037%. The maximum percentage errors of these cases are greater than 15%, except for case I. This means that the algorithm can only be used in situations similar to case I; otherwise, the formula needs to be rewritten with more parameters to fit all situations that may appear.

4.2.2 Flexibility of ANN Depth Detection Algorithm

The existence of restrictions on the traditional depth detection algorithm was confirmed in the previous subsection. To remove these restrictions, the ANN depth detection algorithm is proposed in this thesis.

The ANN architecture for depth detection is evaluated by experimentation. Here, the number of neurons in the output layer needs to be decided first. One case uses three output neurons corresponding respectively to the world coordinates (X, Y, Z) of the world object point, and the other uses just one neuron for the depth Z.

Case IV in the previous subsection is treated as the general case of the problem. The training data of the neural network are the same 12 points in the left and right images, with distance Z varying from 65 cm to 165 cm at an increment of 20 cm.

To check the accuracy of the trained network, we presented the network with stereo-pair points that were not completely included in the training set but were within our range of interest of distance. The testing data are the two adjacent points introduced in the previous subsection, in the left and right images, with distance Z varying from 65 cm to 165 cm at an increment of 10 cm. After the training process finished, each neural network was tested with the training and testing data sets.

Fig. 4.12 shows the e_ma at each depth simulated from the net that consists of four input neurons, five hidden neurons, and three output neurons. As the diagram indicates, the e_ma ranges from 0.48455% to 2.4771%. The maximum e_ma, 2.4771%, represents the error of its corresponding net. For each candidate number of hidden neurons, ten distinct nets are created; Fig. 4.13 shows the e_ma of the ten nets with five hidden neurons. The average of e_ma over the ten nets with the same number of hidden neurons Hn, denoted e_ma_Hn, represents the error of Hn neurons, where Hn runs from 1 to 10. Fig. 4.14 and Fig. 4.15 show each e_ma_Hn with one and three output neurons, respectively. It is clear that the error range with one output neuron is always greater than that with three output neurons. For accuracy, three output neurons are chosen in the proposed architecture.

Fig. 4.12 e_ma at each depth simulated from the net (error range: 0.48455%~2.4771%)

Fig. 4.13 e_ma of ten nets with five hidden neurons (minimum error 2.2748%; average 2.9542%, variance 0.23892%)

Fig. 4.14 Each e_ma_Hn with one output neuron (minimum error 20.3715% at three hidden neurons)

Fig. 4.15 Each e_ma_Hn with three output neurons (minimum error 2.9542% at five hidden neurons)

After deciding the number of output neurons, the number of neurons in the hidden layer is resolved next. Fig. 4.15 also tells us that the best choice for the number of hidden neurons in this problem is five. Therefore, we may reasonably conclude that the better MLP network architecture for detecting depth consists of four input neurons, five hidden neurons, and three output neurons.

In order to eliminate nets that fail to learn the training data successfully, a threshold value T on the e_ma of the training data needs to be set. If the e_ma of a net on the training data is larger than T, the net is not enrolled. The value of T must be larger than the e_ma attainable on the training data so that successfully trained networks can be enrolled.

The error results from the proposed net for the different cases introduced in the previous subsection are shown in Fig. 4.16 to Fig. 4.19. It can be noted that even in the
