Vision-Based Autonomous Vehicle Guidance for Indoor Security Patrolling by a SIFT-Based Vehicle-Localization Technique


Kuan-Chieh Chen and Wen-Hsiang Tsai, Senior Member, IEEE

Abstract—A novel method for guidance of vision-based autonomous vehicles for indoor security patrolling using scale-invariant feature transformation (SIFT) and vehicle localization techniques is proposed. Along-path objects to be monitored are used as landmarks for vehicle localization. The localization work is accomplished in three steps: SIFT-based object image feature matching, 2-D affine transformation using the Hough transform, and analytic 3-D space transformation. Object monitoring can be simultaneously achieved during the vehicle-localization process, and most planar-surfaced objects can be utilized in the process, greatly enhancing the applicability of the proposed method. Vehicle trajectory deviations from the planned path due to mechanical error accumulation are also estimated by setting up a calibration line on the monitored object image and applying the 3-D space transformation. Moreover, a path-correction technique is proposed to conduct a path adjustment and guide the vehicle to navigate to the next path node. An analysis of the accuracy of the vehicle-localization and path-correction results is also included. The experimental results show that the proposed method, utilizing only a single view of each object, can guide the vehicle to navigate accurately and monitor objects successfully.

Index Terms—Autonomous vehicle, computer vision, guidance, landmarks, planar-surfaced objects, scale-invariant feature transformation (SIFT), security patrolling, vehicle localization, 3-D space transformation.

I. INTRODUCTION

IN RECENT years, due to the fast development of computer vision techniques, studies on vision-based autonomous vehicle navigation have gained prominence because of their great potential in various applications [1]–[8]. Autonomous vehicles are becoming increasingly capable of performing a great variety of dangerous or dreary work in place of human beings in applications such as interoffice document delivery, unmanned transportation, house cleaning, and security patrolling. The goal of this study is to develop an autonomous vehicle system for indoor security patrolling applications. The major issue in this study is to make the vehicle navigate precisely along its patrolling path. This may be accomplished by accurate vehicle localization in each navigation cycle.

Manuscript received August 17, 2009; revised December 7, 2009 and February 18, 2010; accepted April 27, 2010. Date of publication June 7, 2010; date of current version September 17, 2010. This work was supported by the Ministry of Economic Affairs under Project MOEA 98-EC-17-A-02-S2-0047 in the Technology Development Program for Academia. The review of this paper was coordinated by Prof. S. Ci.

K.-C. Chen was with the Institute of Multimedia Engineering, College of Computer Science, National Chiao Tung University, Hsinchu 30010, Taiwan. He is now with the Research Center for Information Technology Innovation, Academia Sinica, Taipei 115, Taiwan (e-mail: [email protected]).

W.-H. Tsai is with the Department of Computer Science, College of Computer Science, National Chiao Tung University, Hsinchu 30010, Taiwan, and also with the Department of Information Communication, Asia University, Taichung 41354, Taiwan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVT.2010.2052079

Traditionally, an autonomous vehicle is equipped with an odometer to measure the current location of the vehicle with respect to a starting point. However, this scheme usually suffers from incremental mechanical errors caused by the vehicle wheel system. On the other hand, it is common to equip vehicles with visual cameras that provide more flexible visual information for navigation control. It is thus desirable to develop an automatic vision-based vehicle-localization technique to overcome the mechanical error-accumulation problem. One way to achieve this goal is to utilize the features of artificial landmarks or natural scenes in the environment to locate the vehicle by feature matching. Chou and Tsai [9] proposed a method to utilize house corners to estimate vehicle locations. Okuma et al. [10] used a colored marker for camera pose estimation. Huang et al. [11] used a colored rectangular signboard to obtain the relative position of the vehicle by calculating the vanishing points in the signboard image. Wu and Tsai [12] used a circular shape attached on the ceiling for vehicle-location estimation using an upward-looking omni-camera mounted on the vehicle. Xu et al. [13] utilized parallel lines and corner points on ceilings as features for vehicle localization.

Most of the aforementioned methods can only deal with landmarks with specific shapes or in ideal backgrounds, resulting in unreasonable restrictions on the environment in which the vehicle can navigate. Another approach to vehicle localization [14]–[17], [26] is to represent vehicle locations as a visual path that consists of a set of reference node images and a topological graph. Each node in the graph corresponds to a vehicle location, and each between-node link represents a path segment through which the vehicle can navigate. The vehicle-localization process is performed by finding the reference node image that is most similar to the current location view and then estimating the relative vehicle position accordingly. By repeating this process, the vehicle may be guided to follow the visual path. However, such a continuous image-matching process acquires the along-path environmental images very frequently, even at nonnode spots, and is therefore very time consuming, which generally results in slow navigation speeds.


In this paper, a vision-based method for indoor security patrolling is proposed, which conducts vehicle localization using the image features of along-path fixed objects to be monitored. Such objects are abundant in indoor environments and, thus, can be used as landmarks for vehicle localization. An advantage of this method of localization by along-path objects is that vehicle-location estimation and object monitoring can be conducted simultaneously.

More specifically, the proposed security patrolling process is divided into two stages: learning and navigation. In the learning stage, the vehicle is instructed to learn relevant information about a preselected path and the along-path objects to be monitored. In the navigation stage, the vehicle moves along the learned path and visits each node where an object of concern is located. Since an object appears in a navigation-stage image at a position that is only close to, rather than exactly at, its position in the reference image acquired in the learning stage, we adopt the scale-invariant feature transformation (SIFT) technique proposed by Lowe [18]–[20] for image feature extraction and matching. Accordingly, any planar-surfaced object can be utilized as a landmark, which greatly improves the flexibility of landmark selection.

After the image-matching step is conducted, an affine transformation between the image taken in the learning stage and that taken in the navigation stage is estimated from the matching result using the Hough transform. Then, a line called the calibration line is set up in the object image to compute the deviation of the vehicle trajectory from its correct path by utilizing the found affine transformation and an analytic 3-D transformation technique proposed in this paper. Finally, a path-correction technique is proposed to guide the vehicle to navigate more accurately to the next object to be monitored. Only single views of the monitored objects are needed for vehicle localization and object monitoring. With reasonable between-node distances, the proposed method can guide the vehicle to smoothly navigate specified paths without hitting the walls in the environment, as demonstrated by the experiments conducted in this study.

In the remainder of this paper, the problem dealt with and the advantages of the proposed method are described in Section II. The details of the proposed analytic 3-D transformation technique are described in Section III. The proposed path-correction technique utilizing the 3-D transformation result is presented in Section IV. The proposed object image-matching scheme based on the SIFT to find the transformation between two images is given in Section V. Some experimental results are described in Section VI, followed by conclusions in Section VII.

II. IDEA OF THE PROPOSED METHOD

The vehicle used as a test bed in this study is shown in Fig. 1(a). An odometer in the vehicle provides three parameters of the vehicle location in each navigation cycle: one parameter for the vehicle orientation and the other two for the x- and y-coordinates of the vehicle position in the real-world space. A pan-tilt-zoom (PTZ) camera is mounted on the vehicle for image acquisition, as shown in Fig. 1(b).

Fig. 1. Autonomous vehicle system used in this study. (a) Vehicle. (b) PTZ camera equipped on the vehicle.

A. Problem Definition and Advantages of the Proposed Method

The problem investigated in this paper is defined here. A vehicle navigates in an indoor environment along a preselected path on which a number of objects are to be monitored. These objects are utilized for vehicle localization, each with at least one planar surface facing the vehicle. Such objects may be posters, paintings, boxes, etc. They need not be manmade; even wall planes with granite surfaces, for example, are appropriate. The vehicle visits each object in a navigation session to monitor the existence of the object, uses the features existing in the object's surface image to find the deviation of the vehicle from the planned trajectory, and corrects its pose (including its position and orientation) before moving to the next node, where another object is to be monitored. Several advantages of this approach can be identified, as described in the following.

1) The proposed method is more flexible than traditional ones, which require landmarks with special shapes (circles, rectangles, etc.).

2) The proposed method utilizes single views of fixed object scenes for vehicle localization, which is simpler than many visual path-following techniques [14]–[17], [26] utilizing contents of the entire environmental views.

3) Object monitoring and vehicle localization may be simultaneously conducted to save time and speed up navigation.

4) Objects with planar surfaces are abundant in indoor environments so that the localization work can be frequently carried out along the navigation path to guide the vehicle more effectively.

5) Frequent vehicle localization is advantageous as an aid for path correction during navigation, overcoming the mechanical error-accumulation problem that is often encountered in autonomous vehicle guidance.

B. Main Idea of Vehicle Localization by Planar-Surfaced Objects

The main idea of using a planar-surfaced object for vehicle localization, as proposed in this paper, is to set up a reference coordinate system (RCS) on a line, which is called the calibration line, on the object surface with one end of the line as the origin of the system and compute the vehicle location (including its position and orientation) with respect to this coordinate system. Also, the calibration line is selected to be parallel to the environment floor and is used as an axis of the RCS.


Fig. 2. Two methods for creating a calibration line. (a) Using a line stand affixed with a line structure. (b) Manually drawing a virtual line in the image (the green one on the top of the poster).

To set up the calibration line, it is noted that a manmade object often has rectangular surface shapes, on which a real line can be detected for this purpose. For an object that does not have line features on its planar surface, for example, a granite-surfaced wall, it is proposed in this paper to create for it a virtual (physically nonexistent) calibration line in one of two ways: 1) putting right under the object a "line stand" affixed with a line structure (e.g., a straight steel wire or a line on the surface of the stand) and then automatically detecting the line in the image or 2) drawing in the object image a line that is parallel to the environment floor using any clue from the image features. See Fig. 2 for an illustration.

Allowing a virtual line to serve as an axis of the RCS greatly increases the flexibility of choosing objects for use as vehicle-localization landmarks. This is another advantage of the proposed method.

C. Sketch of the Proposed Navigation Process

In the navigation stage, the vehicle moves from one node to another according to the path that is learned in the learning stage. Vehicle localization at a node along the path is implemented in three stages: SIFT-based feature matching, 2-D affine transformation, and analytic 3-D space transformation. The first stage is to match the SIFT feature vectors of two monitored object images, i.e., one taken in the navigation stage and the other taken in the learning stage. The second stage is to find an affine transformation from the matching results using the Hough transform [21]. The third stage is to find the vehicle location based on a new technique of analytic 3-D space transformation proposed in this paper. More details are described in the following, where the two images mentioned above taken in the learning and navigation stages are denoted as $I_L$ and $I_N$, respectively.

Fig. 3. Object images $I_L$ and $I_N$ taken in the learning and navigation stages. (a) Red calibration line $l_L$ detected or selected manually. (b) Cyan calibration line $l_N$ (on the top of the poster) generated by applying the transform $T$.

Algorithm 1: autonomous vehicle navigation process.

Stage 1: SIFT-based feature matching for finding similar feature point pairs.

1) Apply the SIFT to $I_N$ to obtain a set $F_N$ of SIFT feature vectors, and retrieve the corresponding feature vector set $F_L$ of $I_L$, which was obtained in the learning stage.

2) Take every pair of similar feature vectors, i.e., one from $F_L$ and the other from $F_N$, to define a group of four parameters of an affine transformation from $I_L$ to $I_N$, where the feature vector similarity is as defined in [18].

Stage 2: 2-D affine transformation for finding the relative mapping between images taken in the learning and navigation stages.

3) Put all found parameter groups into a Hough space.

4) Detect the peak in the space.

5) Find the affine transform $T$ corresponding to the peak, and take it to represent the best relative mapping from $I_L$ to $I_N$.

Stage 3: analytic 3-D space transformation for computing the deviation of the vehicle from the planned trajectory and conducting path correction.

6) Use $T$ to transform the calibration line $l_L$, which was automatically detected or manually selected in $I_L$ in the learning stage, into the image space of $I_N$, resulting in a new image line in $I_N$, which is denoted as $l_N$ (see Fig. 3 for an example, and note that if the navigation incurs no deviation from the planned path, then $l_L$ and $l_N$ will perfectly match in position).

7) Unambiguously transform $l_L$ and $l_N$ into the RCS according to an analytic 3-D transformation process (proposed in this paper and described in the next section) from the image space to the RCS to obtain two sets $S_L$ and $S_N$ of vehicle poses with respect to the RCS, i.e., one for the learning stage and the other for the navigation stage, respectively.

8) Use $S_L$ and $S_N$ to derive the translation $V_t$ and the orientation $\theta_t$ of the vehicle's current location with respect to its location planned in the learning stage, and call the parameter set $(V_t, \theta_t)$ the deviation of the vehicle trajectory.

9) Use the deviation $(V_t, \theta_t)$ to derive a sequence of vehicle commands to guide the vehicle to the correct path learned in advance, and continue the navigation work for the next session.
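Step 6) of Algorithm 1 can be illustrated with a short sketch. It assumes, for illustration only, that the transform found at the Hough peak is given by a scale, a rotation angle, and a translation (the four parameters described in Section V), and that the calibration line is stored as two endpoints. The function names are hypothetical and are not taken from the paper.

```python
import math

def apply_transform(point, s, theta, tx, ty):
    """Map a point of the learning image into the navigation image
    using a scale-rotation-translation transform (assumed form of T)."""
    x, y = point
    m, n = s * math.cos(theta), s * math.sin(theta)
    return (m * x - n * y + tx, n * x + m * y + ty)

def line_coefficients(p, q):
    """Return (b, c) such that u + b*v + c = 0 passes through p and q.

    This normalized form cannot represent a line with constant v,
    so that case is rejected explicitly.
    """
    (u1, v1), (u2, v2) = p, q
    if v1 == v2:
        raise ValueError("line with constant v has no u + b*v + c = 0 form")
    b = -(u2 - u1) / (v2 - v1)
    c = -u1 - b * v1
    return b, c

# Transform the two endpoints of the learned line, then refit the line.
p_n = apply_transform((1.0, 1.0), 2.0, 0.0, 3.0, 4.0)
q_n = apply_transform((2.0, 3.0), 2.0, 0.0, 3.0, 4.0)
b_n, c_n = line_coefficients(p_n, q_n)
```

Transforming the endpoints and refitting the line keeps the sketch independent of how the line was originally detected or drawn.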

D. Advantages of Using SIFT for Object Image Matching

The use of the SIFT as described above is advantageous in several aspects. First, in traditional vehicle-localization techniques, obvious features like lines and curves on landmarks are detected. However, this limits the variety of objects for use as landmarks because it is not always true that there exist obvious line or curve features on an object's surface. Use of the SIFT followed by the Hough transform removes this inconvenience; any point pattern on the object surface may be utilized, leading to great improvement in the flexibility to select objects as landmarks. Nearly all point-patterned object surfaces may be used as landmarks, as found in this paper. This also leads to the previously mentioned convenience of selecting a virtual calibration line on the object surface to construct an RCS for vehicle localization and path correction.

E. Sketch of the Proposed Learning Process

In the proposed learning stage, when the user drives the vehicle to visit each object to be monitored, relevant data are recorded, including the navigation path, the SIFT feature vectors of the monitored object, and the calibration line parameters. All data are saved in a learning database such that the learning process need only be conducted once, and the database can be repeatedly used in every navigation session. The proposed learning process is sketched in the following.

Algorithm 2: navigation path-learning process.

1) Direct the PTZ camera on the vehicle toward the object when the vehicle arrives at a spot in front of an object $B$.

2) Take an image of $B$, and draw a rectangle to enclose $B$; the enclosed image region is taken as $I_L$.

3) Apply the SIFT to $I_L$ to obtain a set $F_L$ of SIFT feature vectors.

4) Automatically detect or manually select a calibration line $l_L$ in $I_L$ by either of the two ways of calibration line creation mentioned previously.

5) Save the set $F_L$ and the parameters of $l_L$ in the path node for $B$.

6) Repeat the above steps for the next object to be monitored until all concerned along-path objects are “learned.”
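The record saved per path node in Algorithm 2 might be organized as in the following minimal sketch. The class name, field names, and the JSON file format are assumptions made for illustration; the paper only states that the path, the SIFT feature vectors, and the calibration line parameters are stored in a learning database.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class LearnedNode:
    """Data recorded for one monitored object (hypothetical layout)."""
    odometer_pose: List[float]          # [x, y, orientation] from the odometer
    feature_vectors: List[List[float]]  # F_L: one row per SIFT feature vector
    calibration_line: List[float]       # [b, c] of u + b*v + c = 0 for l_L

def save_database(nodes, path):
    """Write all learned nodes to a JSON file so they can be reused."""
    with open(path, "w") as fh:
        json.dump([asdict(node) for node in nodes], fh)

def load_database(path):
    """Read the learned nodes back at the start of a navigation session."""
    with open(path) as fh:
        return [LearnedNode(**record) for record in json.load(fh)]
```

Because the database is written once and read in every navigation session, a plain serialized list of node records is sufficient for this sketch.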

III. ANALYTIC 3-D SPACE TRANSFORMATION FROM IMAGE SPACE TO REAL-WORLD SPACE FOR VEHICLE LOCALIZATION

Vehicle localization and computation of the deviation of the vehicle trajectory are made possible by the use of the calibration line and unambiguous mapping of the image space to the real-world space specified by the previously mentioned RCS, as described in the following.

Let $(X, Y, Z)$ denote the RCS that is set up on the calibration line $L$ and one of its two endpoints, i.e., $R_0$, with $L$ taken to be the $X$-axis of the RCS and $R_0$ to be the origin. An illustration of the system is shown in Fig. 4. Recall that the calibration line $L$, possibly virtual, is assumed to be parallel to the floor so that the $X$–$Y$ plane of the RCS, as illustrated, is parallel to the floor as well. Also, the images of $L$ in $I_L$ and $I_N$, respectively, are $l_L$ and $l_N$, as mentioned previously. The configuration of the three axes of the RCS may be regarded as being similar to a virtual Y-shaped corner attached on the ceiling of a house, and therefore, the vehicle location-estimation method using house corners, which was proposed by Chou and Tsai [9], can be employed and simplified to estimate the location (including the position and the direction) of the vehicle with respect to the RCS (i.e., with respect to the object to be monitored). This means that while the object is being monitored, localization of the vehicle with respect to the object is achieved simultaneously, as mentioned previously.

Fig. 4. Virtual house corner specified by a given calibration line (the cyan line on the top of the poster) and one of its endpoints (the red point on the top-left corner of the poster).

More specifically, the $X$- and $Y$-axes may be regarded as specifying the two perpendicular lines on the ceiling plane of the virtual house corner, and the $Z$-axis specifies the virtual line perpendicular to the $X$- and $Y$-axes, as shown at the top left of Fig. 4. Also shown in the figure are 1) a camera coordinate system (CCS), which is set up on the camera mounted on the vehicle with the lens center as the origin and the optical axis as one of its three axes (the $W$-axis) and 2) an image coordinate system (ICS), which is set up in the image taken by the camera.

Before deriving the location of the vehicle, we derive that of the camera as an intermediate result in the following.

Let the projections of $L$ and $R_0$ in an image be denoted as $l$ and $r_0$, respectively. Also, let the image coordinates of $r_0$ be $(u_0, v_0)$. Assume that the equation of $l$ in the ICS in terms of image coordinates $(u, v)$ is described by $u + bv + c = 0$. The pose of the camera may be represented by three position parameters $X_c$, $Y_c$, and $Z_c$, as well as three orientation parameters $\theta$, $\psi$, and $\omega$, as described in the following.

1) $Z_c$ is the distance from the camera to the ceiling of the virtual house corner, which is assumed to be known (manually measured in advance).

2) $\theta$ is the angle between the optical axis of the camera and the $Y$-axis of the RCS (the pan angle).

3) $\psi$ is the tilt angle of the optical axis of the camera with respect to the RCS, which is also assumed to be known (provided by the PTZ camera system).

4) $\omega$ is the swing angle of the camera, which is set to zero in this study.

Accordingly, the camera pose may be described by just three parameters $X_c$, $Y_c$, and $\theta$, which may be derived in terms of the known data as follows. First, we transform the RCS coordinates into the CCS coordinates using four steps: 1) translate the origin $R_0$ of the RCS to the origin of the CCS, which is located at $(X_c, Y_c, Z_c)$ in the RCS; 2) rotate the $X$–$Y$ plane about the $Z$-axis through the pan angle $\theta$ such that the $Y$–$Z$ plane is parallel to the $V$–$W$ plane of the CCS (see Fig. 4); 3) rotate the $Y$–$Z$ plane about the $X$-axis through the tilt angle $\psi$ such that the $X$–$Y$ plane is parallel to the $U$–$V$ plane; and 4) reverse the $Z$-axis such that the positive direction of the $Z$-axis is identical to the negative direction of the $W$-axis. As a result, the transformation of the RCS coordinates $(x, y, z)$ of a point in the 3-D real-world space into the CCS coordinates $(u, v, w)$ may be described by

$$(u, v, w, 1) = (x, y, z, 1) \cdot T_r \tag{1}$$

where

$$T_r = \begin{bmatrix} \cos\theta & -\sin\theta\cos\psi & -\sin\theta\sin\psi & 0 \\ \sin\theta & \cos\theta\cos\psi & \cos\theta\sin\psi & 0 \\ 0 & \sin\psi & -\cos\psi & 0 \\ x_0 & y_0 & z_0 & 1 \end{bmatrix} \tag{2}$$

with

$$\begin{aligned} x_0 &= -X_c\cos\theta - Y_c\sin\theta \\ y_0 &= (X_c\sin\theta - Y_c\cos\theta)\cos\psi - Z_c\sin\psi \\ z_0 &= (X_c\sin\theta - Y_c\cos\theta)\sin\psi + Z_c\cos\psi. \end{aligned} \tag{3}$$
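As a quick numerical check of the transformation in (1)–(3), the following sketch builds the matrix and applies it as a row-vector product. The function names are hypothetical, angles are assumed to be in radians, and the sample pose is arbitrary. Two properties that follow from (2) and (3) are that the camera position maps to the CCS origin and that distances to the camera are preserved.

```python
import math

def rcs_to_ccs_matrix(xc, yc, zc, theta, psi):
    """Build the 4x4 matrix T_r of (2) with the offsets of (3)."""
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(psi), math.sin(psi)
    x0 = -xc * ct - yc * st
    y0 = (xc * st - yc * ct) * cp - zc * sp
    z0 = (xc * st - yc * ct) * sp + zc * cp
    return [
        [ct, -st * cp, -st * sp, 0.0],
        [st, ct * cp, ct * sp, 0.0],
        [0.0, sp, -cp, 0.0],
        [x0, y0, z0, 1.0],
    ]

def rcs_to_ccs(point, tr):
    """Apply (1): row vector (x, y, z, 1) times T_r, returning (u, v, w)."""
    row = (point[0], point[1], point[2], 1.0)
    return tuple(sum(row[i] * tr[i][j] for i in range(4)) for j in range(3))

# Arbitrary sample pose used only to exercise the formulas.
tr = rcs_to_ccs_matrix(0.5, -2.0, 1.0, 1.2, 1.3)
camera_in_ccs = rcs_to_ccs((0.5, -2.0, 1.0), tr)
point_in_ccs = rcs_to_ccs((3.0, 0.0, 0.0), tr)
```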

Next, let $P$ be any point on the $X$-axis (the calibration line $L$) with RCS coordinates $(x, 0, 0)$. Then, its CCS coordinates $(u_x, v_x, w_x)$ can be derived, using the matrix $T_r$ described above, to be

$$(u_x, v_x, w_x, 1) = (x, 0, 0, 1) \cdot T_r = (x\cos\theta + x_0,\; -x\sin\theta\cos\psi + y_0,\; -x\sin\theta\sin\psi + z_0,\; 1). \tag{4}$$

Therefore, we have

$$\begin{aligned} u_x &= x\cos\theta + x_0 \\ v_x &= -x\sin\theta\cos\psi + y_0 \\ w_x &= -x\sin\theta\sin\psi + z_0. \end{aligned} \tag{5}$$

Also, let $(u_p, v_p)$ be the image coordinates of the projection of $P$ in the ICS. Then, according to the camera's optical geometry, we have the following:

$$u_x = \frac{w_x \cdot u_p}{f}, \qquad v_x = \frac{w_x \cdot v_p}{f} \tag{6}$$

where $f$ is the camera focal length, which is assumed to be known. Substituting the values of $u_x$, $v_x$, and $w_x$ of the three equations of (5) into the above two equations and eliminating the variable $x$, we get the equation for the projection $l$ of the calibration line (the $X$-axis) in the image plane as follows:

$$u_p + \frac{z_0\cos\theta + x_0\sin\theta\sin\psi}{-y_0\sin\theta\sin\psi + z_0\sin\theta\cos\psi}\, v_p - \frac{f \cdot (y_0\cos\theta + x_0\sin\theta\cos\psi)}{-y_0\sin\theta\sin\psi + z_0\sin\theta\cos\psi} = 0. \tag{7}$$

Furthermore, assume that the equation of the calibration line ($l_N$ or $l_L$ mentioned previously) in the ICS is $u_p + b v_p + c = 0$, which can be obtained from the image, i.e., $b$ and $c$ are known. Then, by comparing the coefficients of this equation with those of (7) and substituting the values of $x_0$, $y_0$, and $z_0$ previously described in (3) into the result, we obtain the following equalities:

$$b = \frac{Y_c\sin\psi - Z_c\cos\theta\cos\psi}{-Z_c\sin\theta} \tag{8}$$

$$c = \frac{f \cdot (-Y_c\cos\psi - Z_c\cos\theta\sin\psi)}{-Z_c\sin\theta}. \tag{9}$$

Since $b$, $c$, and $Z_c$ in the above equations are known values, we can now derive the values of $Y_c$ and $\theta$. This is done first by eliminating $Z_c$ and $Y_c$ from the above equalities to get

$$\tan\theta = \frac{f}{f b\cos\psi + c\sin\psi} \tag{10}$$

from which the value of $\theta$ can be obtained. Also, from (8), we can get the value of $Y_c$ as

$$Y_c = \frac{Z_c\cos\theta\cos\psi - b Z_c\sin\theta}{\sin\psi}. \tag{11}$$
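The elimination that leads from (8) and (9) to (10) is stated without intermediate steps. One way to carry it out, offered here as a reader's reconstruction rather than the authors' own working, is to clear the denominators, solve both equalities for $Y_c$, and equate the results:

```latex
\begin{aligned}
\text{from (8):}\quad & Y_c\sin\psi = Z_c\,(\cos\theta\cos\psi - b\sin\theta) \\
\text{from (9):}\quad & f\,Y_c\cos\psi = Z_c\,(c\sin\theta - f\cos\theta\sin\psi) \\
\text{equating } f\,Y_c\sin\psi\cos\psi:\quad
  & f\,(\cos\theta\cos\psi - b\sin\theta)\cos\psi
    = (c\sin\theta - f\cos\theta\sin\psi)\sin\psi \\
\Rightarrow\quad & f\cos\theta\,(\cos^2\psi + \sin^2\psi)
    = \sin\theta\,(f b\cos\psi + c\sin\psi) \\
\Rightarrow\quad & \tan\theta = \frac{f}{f b\cos\psi + c\sin\psi}.
\end{aligned}
```

The first line, rearranged for $Y_c$, is exactly (11), so both results come from the same pair of equalities.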

By substituting the values of $\theta$ and $Y_c$ obtained above, together with the coordinates $(u_0, v_0)$ of $r_0$ (the projection of the origin $R_0$ of the RCS), into the projection relations (6) applied to $R_0$, whose CCS coordinates are $(x_0, y_0, z_0)$ by (5) with $x = 0$, and using the values of $x_0$ and $y_0$ described by (3), we get, finally, after some equation simplifications, the following for computing $X_c$:

$$X_c = \frac{u_0\,(Y_c\cos\theta\cos\psi + Z_c\sin\psi) - v_0\, Y_c\sin\theta}{u_0\sin\theta\cos\psi + v_0\cos\theta}. \tag{12}$$

In summary, we have completed the derivations of (10)–(12) for computing the parameters $X_c$, $Y_c$, and $\theta$ of the camera pose using the following known data:

1) the coefficients $b$ and $c$ of the equation $u + bv + c = 0$ of the calibration line in the image;

2) the height $Z_c$ of the virtual ceiling above the camera (i.e., the height of the calibration line above the camera in the real-world space);

3) the tilt angle $\psi$ of the camera provided by the PTZ camera system;

4) the image coordinates $(u_0, v_0)$ of the projection of the origin $R_0$ of the RCS and the focal length $f$ of the camera.
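The recovery of the camera pose from the known data listed above can be checked with a round-trip sketch. The forward function evaluates (8), (9), and the projection of $R_0$ obtained from (3) and (6); the inverse function evaluates (10) and (11) and then solves for $X_c$ from the relation $u_0 y_0 = v_0 x_0$, which follows from (6) applied to $R_0$. The function names and sample values are hypothetical, angles are in radians, and the use of `atan2` assumes the branch with $\sin\theta > 0$.

```python
import math

def project_pose(xc, yc, zc, theta, psi, f):
    """Forward model: line coefficients (b, c) and the image (u0, v0) of R0."""
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(psi), math.sin(psi)
    b = (yc * sp - zc * ct * cp) / (-zc * st)          # (8)
    c = f * (-yc * cp - zc * ct * sp) / (-zc * st)     # (9)
    x0 = -xc * ct - yc * st                            # (3)
    y0 = (xc * st - yc * ct) * cp - zc * sp
    z0 = (xc * st - yc * ct) * sp + zc * cp
    return b, c, f * x0 / z0, f * y0 / z0              # (6) applied to R0

def recover_pose(b, c, zc, psi, f, u0, v0):
    """Inverse model: return (Xc, Yc, theta) from the known data."""
    cp, sp = math.cos(psi), math.sin(psi)
    theta = math.atan2(f, f * b * cp + c * sp)         # (10), sin(theta) > 0 assumed
    ct, st = math.cos(theta), math.sin(theta)
    yc = zc * (ct * cp - b * st) / sp                  # (11)
    # Solve u0*y0 = v0*x0 for Xc, with x0 and y0 taken from (3).
    xc = (u0 * (yc * ct * cp + zc * sp) - v0 * yc * st) / (u0 * st * cp + v0 * ct)
    return xc, yc, theta

# Arbitrary sample pose; the check is purely algebraic.
true_pose = (0.5, -2.0, 1.2)
zc, psi, f = 1.0, 1.3, 800.0
b, c, u0, v0 = project_pose(true_pose[0], true_pose[1], zc, true_pose[2], psi, f)
recovered = recover_pose(b, c, zc, psi, f, u0, v0)
```

The sketch does not handle the degenerate configurations in which $\sin\theta$, $\sin\psi$, or the denominator of the $X_c$ expression vanishes.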

IV. PATH CORRECTION

Fig. 5. Top view of the relations among the vehicle, camera, and reference coordinate systems.

We now describe the proposed path-correction process for use in the navigation stage. With the camera pose described by $(X_c, Y_c)$ and $\theta$ with respect to the RCS on the planar surface of an object as derived in the previous section, we want to derive the location of the vehicle with respect to the RCS accordingly. For this, we define an additional coordinate system, which is called the vehicle coordinate system (VCS), as illustrated in Fig. 5. The VCS is 2-D with its origin located at the middle point of the line segment joining the centers of the two vehicle wheels, which is called the center of the vehicle in the sequel. The $y$-axis of the VCS is defined to align with this line segment, and the $x$-axis is defined to be perpendicular to the $y$-axis and parallel to the ground. The location of the vehicle with respect to the RCS is described by the orientation $\theta_v$ and the translation $(X_v, Y_v)$ of the vehicle in the RCS, which are easy to derive by the geometry shown in Fig. 5 to be as follows:

$$\begin{aligned} \theta_v &= \theta_c + \theta + 90^\circ \\ X_v &= X_c - D_c\cos\theta_v \\ Y_v &= Y_c + D_c\sin\theta_v \end{aligned} \tag{13}$$

where $\theta_c$ is the pan angle of the camera provided by the PTZ camera system, and $D_c$ is the distance between the camera and the center of the vehicle (assumed to be known in advance by manual measurement). For convenience, the vehicle location will be described by the 3-tuple $(X_v, Y_v, \theta_v)$ in the sequel.
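Equation (13) translates directly into code. In this sketch, the function name is hypothetical and all angles are assumed to be in radians, so the 90° offset becomes π/2.

```python
import math

def vehicle_pose_from_camera(xc, yc, theta, pan, dc):
    """Evaluate (13): vehicle pose (Xv, Yv, theta_v) in the RCS.

    xc, yc, theta: camera pose in the RCS
    pan: camera pan angle theta_c reported by the PTZ system
    dc: distance D_c between the camera and the vehicle center
    """
    theta_v = pan + theta + math.pi / 2
    xv = xc - dc * math.cos(theta_v)
    yv = yc + dc * math.sin(theta_v)
    return xv, yv, theta_v

# With zero pan and zero theta, theta_v is 90 degrees, so only Yv shifts.
pose = vehicle_pose_from_camera(1.0, -2.0, 0.0, 0.0, 0.2)
```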

We now describe how to correct the path of the vehicle at a certain node $N$ in front of an object using the information of the current vehicle location $S_N = (X_N, Y_N, \theta_N)$, as well as that of the learned vehicle location $S_L = (X_L, Y_L, \theta_L)$, which is obtained through a process similar to that for obtaining the current vehicle location, as described previously. On the other hand, let the current and learned vehicle locations in the global coordinate system (GCS) be described by $S'_N = (X'_N, Y'_N, \theta'_N)$ and $S'_L = (X'_L, Y'_L, \theta'_L)$, respectively, where the latter was provided by the odometer during the learning stage, whereas the former is to be derived now using the relation among the RCS, VCS, and GCS illustrated in Fig. 5. The basic idea is to find the relative angle and translation between the current vehicle location $S_N$ and the learned one $S_L$ in the RCS and then add the angle and translation to the learned vehicle location $S'_L$ in the GCS. The result is finally used to guide the vehicle to navigate to the location of the next learned node $N'$ in the navigation path, whose location is denoted as $S''_L = (X''_L, Y''_L, \theta''_L)$ in the GCS. The details are as follows, with Fig. 6 as an illustration for notation references.

Fig. 6. Top view of the relations among the RCS, the VCS, and the GCS, as well as corresponding angles of the vehicle, where the learned vehicle location is at position $(X_L, Y_L)$ in the RCS, the current vehicle location is at position $(X_N, Y_N)$ in the RCS, and the location of the next node $N'$ is at position $(X''_L, Y''_L)$ in the GCS.

Algorithm 3: path-correction process.

1) Compute the relative angle $\theta_r$ between $\theta_N$ and $\theta_L$ by

$$\theta_r = \theta_N - \theta_L. \tag{14}$$

2) Compute the relative translation in the following way.

a) Compute the angle

$$\varphi = \tan^{-1}\!\left(\frac{Y_N - Y_L}{X_N - X_L}\right) - \theta_N. \tag{15}$$

b) Compute the angle $\varphi_G$ of the line segment from $S'_N$ to $S'_L$ in the GCS by

$$\varphi_G = \theta'_L - \varphi - \theta_r. \tag{16}$$

c) Compute the translation $(X'_r, Y'_r)$ in the GCS with respect to the learned vehicle location $S'_L$ by

$$X'_r = D_r\cos\varphi_G, \qquad Y'_r = D_r\sin\varphi_G \tag{17}$$

where

$$D_r = \sqrt{(X_N - X_L)^2 + (Y_N - Y_L)^2}. \tag{18}$$

3) Compute the desired current vehicle location $S'_N = (X'_N, Y'_N, \theta'_N)$ using the following:

$$\begin{aligned} X'_N &= X'_L - X'_r \\ Y'_N &= Y'_L - Y'_r \\ \theta'_N &= \theta'_L - \theta_r. \end{aligned} \tag{19}$$

4) Compute the vector $V_t$ from $S'_N$ to $S''_L$ of the next node $N'$ and the direction $\theta_t$ of $V_t$ by

$$V_t = (X''_L - X'_N,\; Y''_L - Y'_N), \qquad \theta_t = \tan^{-1}\!\left(\frac{Y''_L - Y'_N}{X''_L - X'_N}\right). \tag{20}$$

5) Use $V_t$ and $\theta_t$ to derive the following vehicle commands for path correction.

a) Turn the angle $\theta'_N - \theta_t$ to guide the vehicle to head to $N'$.

b) Move ahead for the following distance to reach $N'$:

$$|V_t| = \sqrt{(X''_L - X'_N)^2 + (Y''_L - Y'_N)^2}.$$

c) Turn the angle $\theta'_t = \theta''_L - \theta_t$ after arriving at $N'$ to resume the original vehicle moving direction at $N'$.
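The steps of Algorithm 3 can be sketched as one function. This is an illustrative reading of (14)–(20), not the authors' implementation: the argument names are hypothetical, angles are in radians, the signs follow the equations as given, and `math.atan2` is used in place of the plain inverse tangent so that all four quadrants and a zero denominator are handled.

```python
import math

def correct_path(rcs_now, rcs_learned, gcs_learned, gcs_next):
    """Return the estimated current GCS pose and three motion commands.

    Every pose is an (x, y, angle) tuple: rcs_now and rcs_learned are the
    current and learned vehicle locations in the RCS, gcs_learned is the
    learned location in the GCS, and gcs_next is the next node in the GCS.
    """
    xn, yn, tn = rcs_now
    xl, yl, tl = rcs_learned
    gxl, gyl, gtl = gcs_learned
    gxn, gyn, gtn = gcs_next

    theta_r = tn - tl                                  # (14)
    phi = math.atan2(yn - yl, xn - xl) - tn            # (15)
    phi_g = gtl - phi - theta_r                        # (16)
    d_r = math.hypot(xn - xl, yn - yl)                 # (18)
    xr, yr = d_r * math.cos(phi_g), d_r * math.sin(phi_g)  # (17)
    current = (gxl - xr, gyl - yr, gtl - theta_r)      # (19)

    vt = (gxn - current[0], gyn - current[1])          # (20)
    theta_t = math.atan2(vt[1], vt[0])
    commands = [
        ("turn", current[2] - theta_t),  # step 5a: head toward the next node
        ("move", math.hypot(*vt)),       # step 5b: travel the length of V_t
        ("turn", gtn - theta_t),         # step 5c: resume the learned heading
    ]
    return current, commands

# Zero-deviation case: the vehicle is exactly where it was during learning.
current, commands = correct_path((1.0, 2.0, 0.3), (1.0, 2.0, 0.3),
                                 (0.0, 0.0, 0.0), (3.0, 4.0, 0.5))
```

In the zero-deviation case the estimated pose equals the learned GCS pose, and the commands reduce to driving straight to the next node.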

V. OBJECT IMAGE MATCHING

In the navigation stage, the vehicle stops in front of each object to be monitored by the use of the learned path data. However, the stop position in front of an object may not be accurate in every navigation session but is just close to the one recorded in the learning stage. This results in a slight change in the viewing angle of the object from the camera or, equivalently, in the camera pose with respect to the object. An image acquired of the same object will therefore differ in translation, scaling, and orientation from the one taken in the learning stage. Thus, a method with the ability to match corresponding objects in images that are taken with different camera poses is needed.

SIFT has been shown to be a robust image-matching technique that uses local feature descriptors invariant to various geometric changes [19]. To allow efficient matching between images, each image is processed to extract feature points, each of which is then represented as a SIFT feature vector, as mentioned previously. Each SIFT feature vector consists of local image measures that are invariant to image translation, scaling, and rotation and partially invariant to 3-D viewpoint changes. In this study, we utilize such invariance properties of SIFT feature vectors to match object images. Specifically, the SIFT process [18] includes four major stages to generate a set of SIFT feature vectors: 1) selection of a set of feature points in a scale space of the input image; 2) localization of them to determine their locations $(x, y)$ and scales $s$; 3) assignment of orientations $\theta$ to them; and 4) generation of a descriptor $\Phi$ for each of the feature points as a vector of all the gradient-orientation histogram entries in a region around the feature point.

By using the aforementioned SIFT process for our case here, the object images $I_N$ and $I_L$ taken in the navigation and learning stages are first transformed to obtain two sets of SIFT feature vectors: one set $F_N$ from $I_N$ and the other $F_L$ from $I_L$, as mentioned previously. According to the matching algorithm proposed by Lowe [18], [20], the best candidate match for each feature vector $f_N$ in $F_N$ can be found by identifying its nearest neighbor in $F_L$. Each nearest neighbor is defined as the feature vector with the smallest Euclidean distance to the feature vector $f_N$. Then, an affine transform between the match pairs is estimated. For this purpose, many well-known fitting methods, such as the random sample consensus algorithm [22], can be used. However, according to [18], a better performance can be obtained using the Hough transform [21], which is adopted in this paper to identify the best affine transform between the match pairs.
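The nearest-neighbor rule described above can be written as a brute-force search. This is a minimal sketch with a hypothetical function name; descriptors are plain tuples of floats, and no indexing structure or match-rejection test is included.

```python
import math

def nearest_neighbor_matches(descriptors_n, descriptors_l):
    """For each descriptor of F_N, find the index of the descriptor of F_L
    with the smallest Euclidean distance. Returns (index_n, index_l) pairs."""
    matches = []
    for i, d_n in enumerate(descriptors_n):
        j = min(range(len(descriptors_l)),
                key=lambda k: math.dist(d_n, descriptors_l[k]))
        matches.append((i, j))
    return matches

# Toy 2-D descriptors; real SIFT descriptors are much longer vectors.
f_l = [(0.0, 0.0), (10.0, 10.0)]
f_n = [(9.0, 9.0), (1.0, 0.0)]
pairs = nearest_neighbor_matches(f_n, f_l)
```

The search cost grows with the product of the two set sizes, which is acceptable for a sketch but is one reason practical systems use approximate nearest-neighbor indexing.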

In more detail, using the aforementioned notations, for each feature vector denoted as $f_N(x_N, y_N, s_N, \theta_N, \Phi_N)$, a point $(x'_N, y'_N)$ in the neighborhood of $(x_N, y_N)$ can be generated using $s_N$ and $\theta_N$ to be

$$
\begin{aligned}
x'_N &= x_N + k \cdot s_N \cdot \cos\theta_N \\
y'_N &= y_N + k \cdot s_N \cdot \sin\theta_N
\end{aligned}
\tag{21}
$$

where $k$ is a constant value. In addition, the corresponding point $(x'_L, y'_L)$ in $F_L$ can be computed in the same manner. Then, an affine transform defined in the following way can be used to obtain the mapping from the pair $(x_L, y_L)$ and $(x'_L, y'_L)$ in $F_L$ to the pair $(x_N, y_N)$ and $(x'_N, y'_N)$ in $F_N$ in terms of a scaling factor $s$, an orientation $\theta$, and a translation $(t_x, t_y)$:

$$
\begin{bmatrix}
x_L & -y_L & 1 & 0 \\
y_L & x_L & 0 & 1 \\
x'_L & -y'_L & 1 & 0 \\
y'_L & x'_L & 0 & 1
\end{bmatrix}
\begin{bmatrix} m \\ n \\ t_x \\ t_y \end{bmatrix}
=
\begin{bmatrix} x_N \\ y_N \\ x'_N \\ y'_N \end{bmatrix}
\tag{22}
$$

where $m = s \cdot \cos\theta$ and $n = s \cdot \sin\theta$. Equation (22) can be solved to obtain the parameters $m$, $n$, and $(t_x, t_y)$. Accordingly, the values $s$ and $\theta$ can be computed by the following:

$$
\theta = \tan^{-1}\frac{n}{m} \quad \text{and} \quad s = \frac{m}{\cos\theta}.
\tag{23}
$$

The above process creates a set of pose parameters $(t_x, t_y, \theta, s)$ for each match pair of the SIFT feature vectors.
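As an illustration of (21)–(23), the following sketch computes the pose parameters for one match pair. It is a minimal example under stated assumptions rather than the authors' implementation: the function names are hypothetical, each feature is taken to be a tuple (x, y, s, θ), and the 4 × 4 system in (22) is solved in closed form by writing each point as a complex number z = x + iy, so that (22) becomes z_N = a·z_L + t with a = m + in and t = tx + ity. The quadrant-aware atan2 is used in place of the plain arctangent in (23).

```python
import cmath
import math

def aux_point(x, y, s, theta, k=1.0):
    """Second point generated from a feature's scale and orientation, as in (21)."""
    return x + k * s * math.cos(theta), y + k * s * math.sin(theta)

def pose_from_match(feat_l, feat_n, k=1.0):
    """Solve (22) for one match pair and return (tx, ty, theta, s)."""
    z_l = complex(feat_l[0], feat_l[1])
    z_n = complex(feat_n[0], feat_n[1])
    zp_l = complex(*aux_point(*feat_l, k=k))
    zp_n = complex(*aux_point(*feat_n, k=k))
    a = (zp_n - z_n) / (zp_l - z_l)      # a = m + i*n
    t = z_n - a * z_l                     # t = tx + i*ty
    theta = math.atan2(a.imag, a.real)    # quadrant-aware form of tan^-1(n/m)
    s = abs(a)                            # equals m / cos(theta) when cos(theta) != 0
    return t.real, t.imag, theta, s

# Synthetic check: build a navigation-stage feature from a known transform
# (hypothetical values) and confirm that the pose is recovered.
feat_l = (10.0, 20.0, 2.0, 0.3)
true_s, true_theta, true_t = 1.5, math.radians(30), complex(5.0, -3.0)
a_true = cmath.rect(true_s, true_theta)
z_n = a_true * complex(feat_l[0], feat_l[1]) + true_t
feat_n = (z_n.real, z_n.imag, true_s * feat_l[2], feat_l[3] + true_theta)
tx, ty, theta, s = pose_from_match(feat_l, feat_n)
```

Note that the constant k cancels in the ratio that defines a, so its value does not affect the result as long as it is nonzero and the same for both features.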

Fig. 7. Learned navigation map.

Next, a Hough space is created to find the best affine transform for the computed pose parameters (tx, ty, θ, s) of all the match pairs. Each set of pose parameters casts a vote in this space, and a peak that is found in the space with the largest number of votes is finally taken to specify the best affine transform from IL to IN.
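The voting step can be sketched as follows. This is a simplified illustration, not the implementation used in this study: the function name, the bin widths (20 pixels in translation, 30° in orientation, and a factor of 2 in scale), and the sample poses are all assumptions, and each pose votes for a single bin only.

```python
import math
from collections import Counter

def hough_best_pose(poses, t_bin=20.0, theta_bin=math.radians(30), s_base=2.0):
    """Vote each (tx, ty, theta, s) into a coarse 4-D bin.

    Returns the winning bin, its vote count, and the poses that voted for it.
    """
    def bin_of(pose):
        tx, ty, theta, s = pose
        return (math.floor(tx / t_bin),
                math.floor(ty / t_bin),
                math.floor((theta % (2 * math.pi)) / theta_bin),
                math.floor(math.log(s, s_base)))  # logarithmic scale bins

    votes = Counter(bin_of(p) for p in poses)
    best_bin, count = votes.most_common(1)[0]
    inliers = [p for p in poses if bin_of(p) == best_bin]
    return best_bin, count, inliers

# Three mutually consistent poses and two outliers (hypothetical values).
poses = [
    (25.0, 45.0, 0.25, 1.50),
    (27.0, 43.0, 0.27, 1.45),
    (26.0, 44.0, 0.23, 1.55),
    (-80.0, 10.0, 2.00, 0.60),
    (150.0, -60.0, 4.00, 3.00),
]
best_bin, count, inliers = hough_best_pose(poses)
```

A pose close to a bin boundary may be separated from its neighbors by this hard binning; voting for adjacent bins as well is a common remedy.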

VI. EXPERIMENTAL RESULTS

To test the feasibility of the proposed method, we conducted a series of experiments with a commercially available vehicle called Pioneer 3-DX manufactured by MobileRobots, Inc., and a PTZ camera produced by Axis Communications, Ltd. The testing environment, as shown in Fig. 7, is a large building with an M-shaped corridor whose length is 75.83 m. In addition, nine objects labeled as N1 through N9 were monitored, including one painting, seven posters, and one granite-surfaced wall plane, which are roughly evenly distributed in the entire range of the corridor with an average between-node distance of 7.58 m.

At first, in the learning stage, the vehicle was controlled to learn a path in the corridor and the nine objects along the path. The objects are all on walls at the two sides of the path. Whenever the vehicle arrived at a spot in front of one object, the object features were extracted from the image acquired at the spot and, together with the vehicle location, were recorded into the vehicle system. As a result, a navigation map that contains nine path nodes was created, as illustrated by Fig. 7.

A. Vehicle-Localization Accuracy

To evaluate the accuracy of vehicle localization performed by the proposed method, we compared the estimated vehicle positions and orientations at the nine nodes N1 through N9 with the corresponding real values measured manually. In more detail, when the vehicle was guided to navigate to each node in the learning stage, the real vehicle location described by (Xr, Yr, θr) at the node was measured as the reference data, where 1) Xr and Yr denote, respectively, the distances in the X- and Y-directions from the vehicle center to the origin of the RCS, which is set up on the planar-surfaced object on the wall; and 2) θr denotes the angle of the vehicle’s moving direction with respect to the X−Z plane, which coincides with the wall plane. After the proposed vehicle-localization process was performed, the estimated vehicle location parameters (Xe, Ye, θe) were computed and compared with the corresponding reference data. Such comparisons were conducted three times at each node. Table I includes a summary of the estimated and reference data, as well as the average and the standard deviation of the error ratios yielded by the comparison. Here, the errors (Xerr, Yerr, θerr) of a vehicle-localization result are defined as the absolute differences between the estimated and real vehicle location data, respectively, and the error ratio of an estimated distance in the X- or Y-direction is defined as the ratio of the error Xerr or Yerr, respectively, to the corresponding real distance.

TABLE I

RESULTS OF ACCURACY ANALYSIS OF CONDUCTED VEHICLE LOCALIZATION

From Table I, we can see that all the estimated distances have error ratios that are smaller than 5% and that all the estimated orientations have errors that are smaller than 2°. The averages of the X- and Y-distance error ratios, as well as that of the orientation errors, are 2.33%, 1.93%, and 1.02°, respectively. The standard deviations of these error parameters are 1.32%, 1.5%, and 0.565°, respectively, which are also small. Such vehicle-localization results may be considered to be accurate enough for smooth vehicle navigation applications, as shown by our experiments.
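The error measures defined above can be computed as in the following sketch. The function name and the measurement values are hypothetical and are not the data of Table I; in addition, the sample standard deviation is assumed here because the type of standard deviation is not stated.

```python
import statistics

def localization_errors(estimated, reference):
    """Error ratios in X and Y and absolute orientation error.

    Both arguments are (X, Y, theta_deg) tuples; the reference X and Y
    must be nonzero for the ratios to be defined.
    """
    xe, ye, te = estimated
    xr, yr, tr = reference
    return {
        "x_ratio": abs(xe - xr) / abs(xr),
        "y_ratio": abs(ye - yr) / abs(yr),
        "theta_err": abs(te - tr),
    }

# Hypothetical (estimated, reference) pairs in centimeters and degrees.
trials = [
    ((102.0, 198.0, 1.0), (100.0, 200.0, 0.0)),
    ((49.0, 303.0, 89.5), (50.0, 300.0, 90.0)),
    ((201.0, 99.0, 44.0), (200.0, 100.0, 45.0)),
]
errors = [localization_errors(est, ref) for est, ref in trials]
mean_x_ratio = statistics.mean(e["x_ratio"] for e in errors)
mean_theta_err = statistics.mean(e["theta_err"] for e in errors)
std_x_ratio = statistics.stdev(e["x_ratio"] for e in errors)  # sample std (assumed)
```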

TABLE II

RESULTS OF PATH-CORRECTION ACCURACY ANALYSIS

B. Path-Correction Accuracy

To show the effectiveness of the proposed path-correction process, we compared the corrected vehicle location yielded by the process with the one learned in the learning stage at each of the three selected path nodes N1 through N3. More specifically, at each node, we put the vehicle at three different testing locations described by Sn = (Xn, Yn, θn) in the RCS. Next, we performed the vehicle-localization and path-correction processes to guide the vehicle to move toward the learned location Sr = (Xr, Yr, θr), yielding a corrected location S′n = (X′n, Y′n, θ′n). Then, we computed the path-correction accuracy by finding the location errors (X′err, Y′err, θ′err), which are the absolute differences between the parameters of S′n and Sr, respectively. Here, the values of the location parameters in Sr, Sn, and S′n were all measured manually. The results are summarized in Table II, in which the testing locations are denoted as P11 through P33. From the table, we can see that the largest distance error ratio in the X- and Y-directions occurred at P11, which is about 6%, and the largest orientation error occurred at P21, which is about 2.4°. The averages of the distance error ratios in the X- and Y-directions as well as that of the orientation error are 2.37%, 0.92%, and 0.8°, respectively. The standard deviations of these error parameters are 1.74%, 0.68%, and 0.7°, respectively. Again, these data are all small in value. Accordingly, the maximum allowable between-node vehicle traveling distance may be computed as 40 cm/sin(0.8°) = 28.6 m, under the condition that the largest deviation from the trajectory is set to be the reasonable value of 40 cm in the corridor of our experimental environment, whose width is 2.5 m. Even in the worst case with an orientation error of 2.4°, the distance is 40 cm/sin(2.4°) = 9.55 m, which is still larger than the average between-node distance of 7.58 m in our experimental environment mentioned previously. In summary, as these experimental results illustrate, the proposed method is feasible for guiding the vehicle to accurately navigate to the next path node in a normal corridor.

Fig. 8. Experimental results of object monitoring and path correction at four selected path nodes. (a) Labels of monitored objects. (b) Vehicle monitoring the objects at the nodes. (c) Matching results and the calibration lines used for path correction. (d) Images of the learned objects.
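The bound on the between-node distance used in this subsection can be checked numerically with the short sketch below. It assumes that the vehicle travels along a straight line with a constant orientation error, so that the lateral deviation after a distance d is d·sin(θerr); the function name is hypothetical.

```python
import math

def max_between_node_distance(max_deviation_m, orientation_error_deg):
    """Longest straight run (m) before the lateral drift exceeds max_deviation_m."""
    return max_deviation_m / math.sin(math.radians(orientation_error_deg))

avg_case = max_between_node_distance(0.40, 0.8)    # average orientation error
worst_case = max_between_node_distance(0.40, 2.4)  # largest orientation error
print(f"{avg_case:.2f} m, {worst_case:.2f} m")
```

Both values agree with the figures quoted above to within rounding, and the worst-case value remains above the average between-node distance of 7.58 m.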

C. Indoor Security-Patrolling Navigation

We have also conducted many experiments of complete security-patrolling navigation in our M-shaped corridor environment, all according to the learned navigation path, as shown in Fig. 7. Whenever the vehicle was guided to arrive at a learned node in front of an object, object image matching was performed to check the existence of the object. If the matching was successful, the deviation of the trajectory of the vehicle was computed according to the matching result, and the proposed path correction was performed so that the vehicle could continue its navigation on the path; otherwise, a warning message was issued. Some results, together with the acquired monitored object images, are shown in Fig. 8, where Fig. 8(b) shows a view of the vehicle arriving at a node with a monitored object, Fig. 8(c) shows the matching result and a computed calibration line in the image, and Fig. 8(d) shows the object image taken in the learning stage. The average navigation speed of the vehicle is 0.3 m/s or, equivalently, 1.08 km/h. The average navigation cycle, which includes the five processes of image taking, object image matching, vehicle localization, object monitoring, and path correction, is about 4 s, in which the image-matching process takes about 2.3 s. Since there is no real-time requirement in the navigation process from the viewpoint of security patrolling, this amount of time required to monitor each node is considered to be tolerable. In addition, if necessary, the implementation of the proposed algorithms or the speed of the personal computer used in the vehicle system can be improved to raise the overall computation speed of the system.

TABLE III

RESULTS OF ACCURACY ANALYSIS OF THE NAVIGATION TRAJECTORIES TRAVERSED BY THE VEHICLE WITH RESPECT TO THE REFERENCE NAVIGATION PATH
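The per-node navigation cycle described in this subsection may be outlined as in the following sketch. All names are hypothetical, the camera, matcher, and motion controller are replaced by stub callables, and the choice to proceed to the next node after a warning is an assumption, since the behavior after a failed match is not specified here.

```python
def patrol(nodes, capture_image, match_object, localize, correct_path, warn):
    """Visit each learned node, monitor its object, and correct the path."""
    report = []
    for node in nodes:
        image = capture_image(node)            # image taking
        match = match_object(image, node)      # object image matching
        if match is None:                      # object not found
            warn(node)
            report.append((node, "warning"))
            continue                           # assumed: proceed to next node
        deviation = localize(match, node)      # vehicle localization
        correct_path(deviation, node)          # path correction
        report.append((node, "ok"))
    return report

# Stub callables for illustration; the object at node N2 is "missing".
missing = {"N2"}
warnings = []
report = patrol(
    nodes=["N1", "N2", "N3"],
    capture_image=lambda node: f"image@{node}",
    match_object=lambda image, node: None if node in missing else {"node": node},
    localize=lambda match, node: (0.0, 0.0, 0.0),
    correct_path=lambda deviation, node: None,
    warn=warnings.append,
)
```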

To show the overall accuracy of the navigation trajectory traversed by the vehicle with respect to the reference navigation path, at each path node in a navigation session conducted in our experiments, we further measured both the reference location described by Sr = (Xr, Yr, θr) and the corrected location described by Sc = (Xc, Yc, θc) after the path-correction process was performed, followed by computations of the location errors (Derr, θerr), which are defined as the differences between the parameters of the two locations. The computation results of these errors and their error ratios with respect to the between-node distances Dr are summarized in Table III. In addition, the measured reference and corrected positions of the vehicle at the nine path nodes are illustrated in Fig. 9. The average errors in position and orientation at each node are 2.6 cm and 1.2° with standard deviations of 1.0 cm and 1.0°, respectively, and the average error ratio is 0.37% with a standard deviation of 0.18%. These values are again small. Together with Fig. 9, they show that the deviations resulting from mechanical and vehicle-localization errors can be reset at each path node by the proposed vehicle-localization and path-correction processes. This fact shows that, with a sufficient number of nodes distributed along the navigation path, the vehicle can accurately and smoothly navigate according to the learned path over the entire navigation path range.
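The location errors and the error ratio at one node can be computed as in the sketch below. It assumes that Derr is the Euclidean distance between the reference and corrected positions, which is an interpretation rather than a stated definition; the function name and the sample values are hypothetical and are not taken from Table III.

```python
import math

def trajectory_error(reference, corrected, between_node_distance):
    """Return (D_err, theta_err, D_err / D_r) for (X, Y, theta_deg) tuples.

    Positions and between_node_distance must share the same length unit.
    """
    xr, yr, tr = reference
    xc, yc, tc = corrected
    d_err = math.hypot(xc - xr, yc - yr)  # assumed Euclidean position error
    theta_err = abs(tc - tr)
    return d_err, theta_err, d_err / between_node_distance

# Hypothetical node: positions in cm, angles in degrees, D_r = 1000 cm.
d_err, theta_err, ratio = trajectory_error(
    reference=(0.0, 100.0, 0.0),
    corrected=(3.0, 104.0, 1.5),
    between_node_distance=1000.0,
)
```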

We also compared our results both with those yielded by the ceiling-based method proposed by Xu et al. [13] for robot positioning, where the average position error ratio with respect to a traversed distance of 8.742 m was 0.34%, and with those yielded by the visual path-following method proposed by Achar and Jawahar [26], where the average position error ratio with respect to a traversed distance of 80 m was 0.48%. Our average position error ratio of 0.37% for a traversed distance of 75.83 m is thus comparable to the former result, which was obtained over a much shorter distance, and lower than the latter.

Fig. 9. Results of accuracy analysis of visited spots in front of the objects yielded by the proposed method. The blue line segments were the planned reference path, and the red ones were the corrected navigation path.

VII. CONCLUSION

A novel method for vehicle localization utilizing images of objects to be monitored has been proposed. The method is based on the use of a SIFT-based object image-matching process, a 2-D affine transformation scheme, and an analytic 3-D space transformation technique for vehicle-location estimation. The method can simultaneously perform vehicle-location estimation and object monitoring. At first, a relative transformation between each image taken in the learning stage and a corresponding one taken in the navigation stage has been derived by matching the SIFT feature vector pairs extracted from the images and finding the best affine transform yielded by the Hough transform. Next, it has been proposed to set up a calibration line in the image of each object to be monitored to compute the deviation of the vehicle trajectory utilizing the found relative transformation and the aforementioned analytic 3-D space transformation newly proposed in this paper. Finally, a path-correction scheme has been proposed to compute the necessary path adjustment according to the found vehicle trajectory deviation and to accordingly guide the vehicle to move toward the next path node.

An advantage of the proposed method over others is that nearly any planar-surfaced object can be used as a landmark for vehicle-location estimation, resulting in improved flexibility in choosing landmarks. By adopting the SIFT and the Hough transform, as long as a sufficient number of feature pairs are available, the vehicle can precisely estimate its location with respect to every monitored object. The experimental results and the accuracy analysis of the yielded data revealed the feasibility and practicality of the proposed system for smooth and accurate navigation and object-security monitoring in indoor environments.

Since the SIFT process in the proposed system can be substituted separately, more accurate and efficient feature-matching techniques, such as some recently proposed SIFT accelerations and feature-transformation techniques [23]–[25], may be fitted into the proposed method to yield better results. Future work may also be directed to extending the proposed method to deal with more types of objects in more complicated environments.

REFERENCES

[1] G. N. DeSouza and A. C. Kak, “Vision for mobile robot navigation: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 237–267, Feb. 2002.

[2] G. J. Jang, S. H. Lee, and I. S. Kweon, “Color landmark based self-localization for indoor mobile robots,” in Proc. IEEE Int. Conf. Robot. Autom., Washington, DC, May 2002, vol. 1, pp. 1037–1042.

[3] S. Segvic and S. Ribaric, “Determining the absolute orientation in a corridor using projective geometry and active vision,” IEEE Trans. Ind. Electron., vol. 48, no. 3, pp. 696–710, Jun. 2001.

[4] C. H. Ku and W. H. Tsai, “Obstacle avoidance in person following for vision-based autonomous land vehicle guidance using vehicle location estimation and quadratic pattern classifier,” IEEE Trans. Ind. Electron., vol. 48, no. 1, pp. 205–215, Feb. 2001.

[5] N. X. Dao, B.-J. You, S.-R. Oh, and M. Hwangbo, “Visual self-localization for indoor mobile robots using natural lines,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Las Vegas, NV, Oct. 2003, vol. 2, pp. 1252–1257.

[6] Z. Jia, A. Balasuriya, and S. Challa, “Recent developments in vision based target tracking for autonomous vehicles navigation,” in Proc. IEEE Conf. Intell. Transp. Syst., Toronto, ON, Canada, Sep. 2006, pp. 765–770.

[7] K. L. Chiang and W. H. Tsai, “Vision-based autonomous vehicle guidance in indoor environments using odometer and house corner location information,” in Proc. IEEE Int. Conf. Intell. Inf. Hiding Multimed. Signal Process., Pasadena, CA, Dec. 2006, pp. 415–418.

[8] C. Micheloni, G. L. Foresti, C. Piciarelli, and L. Cinque, “An autonomous vehicle for video surveillance of indoor environments,” IEEE Trans. Veh. Technol., vol. 56, no. 2, pp. 487–498, Mar. 2007.

[9] H. L. Chou and W. H. Tsai, “A new approach to robot location by house corners,” Pattern Recognit., vol. 19, no. 6, pp. 439–451, 1986.

[10] T. Okuma, K. Sakaue, H. Takemura, and N. Yokoya, “Real-time camera parameter estimation from images for a mixed reality system,” in Proc. 15th Int. Conf. Pattern Recognit., Barcelona, Spain, Sep. 2000, vol. 4, pp. 482–486.

[11] J. Huang, C. Zhao, Y. Ohtake, H. Li, and Q. Zhao, “Robot position identification using specially designed landmarks,” in Proc. IEEE Conf. Instrum. Meas. Technol., Sorrento, Italy, Apr. 2006, pp. 2091–2094.

[12] C. J. Wu and W. H. Tsai, “Location estimation for indoor autonomous vehicle navigation by omni-directional vision using circular landmarks on ceilings,” Robot. Auton. Syst., vol. 57, no. 5, pp. 546–555, May 2009.

[13] D. Xu, L. Han, M. Tan, and Y. F. Li, “Ceiling-based visual positioning for an indoor mobile robot with monocular vision,” IEEE Trans. Ind. Electron., vol. 56, no. 5, pp. 1617–1628, May 2009.

[14] Y. Matsumoto, K. Sakai, M. Inaba, and H. Inoue, “View-based approach to robot navigation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Takamatsu, Japan, Nov. 2000, vol. 3, pp. 1702–1708.

[15] G. Blanc, Y. Mezouar, and P. Martinet, “Indoor navigation of a wheeled mobile robot along visual routes,” in Proc. IEEE Int. Conf. Robot. Autom., Barcelona, Spain, Apr. 2005, pp. 3365–3370.

[16] A. Remazeilles and F. Chaumette, “Image-based robot navigation from an image memory,” Robot. Auton. Syst., vol. 55, no. 4, pp. 345–356, Apr. 2007.

[17] F. Fraundorfer, C. Engels, and D. Nister, “Topological mapping, localization and navigation using image collections,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., San Diego, CA, Oct. 2007, pp. 3872–3877.

[18] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.

[19] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” in Proc. Int. Conf. Comput. Vis. Pattern Recog., Madison, WI, Jun. 2003, vol. 2, pp. 257–263.

[20] D. G. Lowe, “Local feature view clustering for 3D object recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Kauai, HI, Dec. 2001, vol. 1, pp. 682–688.

[21] D. H. Ballard, “Generalizing the Hough transform to detect arbitrary patterns,” Pattern Recognit., vol. 13, no. 2, pp. 111–122, 1981.

[22] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, vol. 24, no. 6, pp. 381–395, Jun. 1981.

[23] Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Washington, DC, Jun. 2004, vol. 2, pp. 506–513.

[24] H. Lejsek, F. H. Asmundsson, B. T. Jonsson, and L. Amsaleg, “Scalability of local image descriptors: A comparative study,” in Proc. 14th Annu. ACM Int. Conf. Multimed., Santa Barbara, CA, Oct. 2006, pp. 589–598.

[25] J. M. Morel and G. Yu, “ASIFT: A new framework for fully affine invariant image comparison,” SIAM J. Imag. Sci., vol. 2, no. 2, pp. 438–469, Apr. 2009.

[26] S. Achar and C. V. Jawahar, “Adaptation and learning for image based navigation,” in Proc. 6th Indian Conf. Comput. Vis. Graph. Image Process., Bhubaneswar, India, Dec. 2008, pp. 103–110.

Kuan-Chieh Chen received the B.S. and M.S. degrees in computer science from National Chiao Tung University, Hsinchu, Taiwan, in 2005 and 2008, respectively.

He is currently a Research Assistant with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. His current research interests include computer vision, robotics, pattern recognition, and image processing, as well as all of their applications.

Wen-Hsiang Tsai (SM’91) received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1973, the M.S. degree in electrical engineering from Brown University, Providence, RI, in 1977, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, in 1979.

Since 1979, he has been with National Chiao Tung University (NCTU), Hsinchu, Taiwan, where he is currently a Chair Professor of Computer Science. At NCTU, he has served as the Head of the Department of Computer Science, the Dean of General Affairs, the Dean of Academic Affairs, and a Vice President. From 1999 to 2000, he was the Chair of the Chinese Image Processing and Pattern Recognition Society of Taiwan and, from 2004 to 2008, the Chair of the Computer Society of the IEEE Taipei Section in Taiwan. From 2004 to 2007, he was the President of Asia University, Taichung, Taiwan. He has been an Editor or the Editor-in-Chief of several international journals, including Pattern Recognition, the International Journal of Pattern Recognition and Artificial Intelligence, and the Journal of Information Science and Engineering. He has been a consultant to many major research organizations in Taiwan. He has published 141 journal papers and 223 conference papers. His current research interests include computer vision, information security, video surveillance, and autonomous vehicle applications. Dr. Tsai has received many awards, including the Annual Paper Award from the Pattern Recognition Society of the USA; the Academic Award of the Ministry of Education, Taiwan; the Outstanding Research Award of the National Science Council, Taiwan; the ISI Citation Classic Award from Thomson Scientific, and more than 40 other academic paper awards from various academic societies. He is a Life Member of the Chinese Pattern Recognition and Image Processing Society in Taiwan.