
Int. J. Patt. Recogn. Artif. Intell. Vol. 22, No. 6 (2008) 1141–1169
© World Scientific Publishing Company

FLEXIBLE 3D OBJECT RECOGNITION FRAMEWORK USING 2D VIEWS VIA A SIMILARITY-BASED

ASPECT-GRAPH APPROACH

JWU-SHENG HU and TZUNG-MIN SU
Department of Electrical and Control Engineering
National Chiao-Tung University, Hsinchu, Taiwan, R.O.C.
jshu@cn.nctu.edu.tw
linux.ece89g@nctu.edu.tw

This work presents a flexible framework for recognizing 3D objects from 2D views. A similarity-based aspect-graph, which contains a set of aspects and prototypes of those aspects, is employed to represent the database of 3D objects. An incremental database construction method that maximizes the similarity of views within the same aspect and minimizes the similarity between prototypes is proposed as the core of the framework to build and update the aspect-graph using 2D views randomly sampled from a viewing sphere. The proposed framework is evaluated on various object recognition problems, including 3D object recognition, human posture recognition and scene recognition. Shape and color features are employed in the different applications, and the top-three matching rates demonstrate the efficiency of the proposed method.

Keywords: Aspect-graph; object representation; object recognition; human posture recognition; scene recognition.

1. Introduction

Object recognition is an important topic in computer vision, and various approaches have been developed.7,27,32,44 However, numerous technical issues require further investigation, especially for 3D object recognition. Variations in viewing direction and angle,7,34 illumination changes,13,19 and scene clutter and occlusion28,36 are the main challenges for object recognition. In recent years, many studies have been presented to address these issues. For example, a generic object class detection system38 that combines the Implicit Shape Model and multiview specific object recognition was presented to detect object instances from arbitrary viewpoints. A new framework35 that combines a visual-cortex-like hierarchical structure with increasingly complex and invariant features was proposed for robust object recognition. Furthermore, a new object representation, the Multicolored Region Descriptor (M-CORN),29 was proposed to describe the color and local shape information of objects. Moreover, some low-level visual features, such as object shading, surface texture and an object's contour or binocular disparity, have recently been proposed to describe 3D object representations.8,12 However, 3D object recognition is primarily influenced by position variations, the illumination source type, and the relative positions of the observer and the object.

Some advanced theories of 3D object perception have been investigated to address these issues and enhance the 3D object recognition task.3 Existing theories of high-level 3D object perception can be categorized as object- and viewer-centered representations based on the coordinate system,32 and as volume-based (or model-based) and view-based representations based on the constituent elements.40 A viewer-centered representation describes portions of an object relative to a coordinate system based on an observer. A view-based representation characterizes a 3D object using a set of object views. Both viewer-centered and view-based frameworks conform to the intuition of human perception, in which a person memorizes an object using several primary views without requiring an exhaustive 3D object model. Moreover, Kim et al.18 proposed a combined model-based method to recognize 3D objects using a combination of a bottom-up process (model parameter initialization) and a top-down process (model parameter optimization).

1.1. Human posture recognition

Human posture recognition is an important example of 3D object recognition. A considerable number of studies have addressed this field over the past ten years.1,17 Existing approaches6 for human posture recognition are classified as direct and indirect approaches based on the human body model. The model has either a 2D or 3D representation based on the dimensionality of the features. The direct approach typically relies on a detailed human body model. For example, Ghost14 developed a silhouette-based body model, incorporating hierarchical body pose estimation, a convex hull analysis of the silhouette and a partial mapping from body parts to silhouette segments. Furthermore, Pfinder42 utilized color information to develop a multiclass statistical model and identified human body parts using shape detection. However, occlusions and perspective distortion lead to unreliable results. The indirect approach extracts features of the human body instead of a detailed human body model, and combines classifiers to estimate human posture. For example, Ozer et al.31 utilized AC coefficients as the features and adopted principal component analysis as the classifier. A recent work30 used color, edge and shape as the features and the hidden Markov model as the classifier. Furthermore, complex 3D models utilize different equipment to solve problems associated with the angle from which human postures are observed. For instance, Delamarre et al.9 proposed a method for building a 3D human body model from three or more cameras, and then calculated the projection of the silhouette for comparison with 2D projections in a database. Additionally, 3D laser scanners41 or thermal cameras16 have also been adopted to build a 3D human body model. However, these 3D solutions require enormous computing time and high device costs.


1.2. Scene recognition

Scene recognition can be addressed as a 3D object recognition problem, where the scene appearance varies with the viewer location or camera pose. Scene recognition is a fundamental element in the topological representation of an environment,23,39 where the nodes of the adjacency graph describe the robot's location. Moreover, scene recognition can also be employed to memorize and detect visual landmarks in geometrical representations of environments.11,21 In Ref. 2, a series of experiments was presented to show that only the overall geometry and a few key features are required to perform scene recognition. To capture the key features, Kröse et al.22 proposed a method for appearance-based modeling of an environment by extracting scene features using principal component analysis (PCA). Moreover, a framework combining a supervised method for recognizing a door with an unsupervised method for learning door-reaching behavior was proposed in Ref. 5.

1.3. Aspect-graph representation

The common challenge in 3D object recognition, human posture recognition and scene recognition is the variation in orientation. The simplest method for solving this problem is to characterize an object with a densely sampled collection of independent views. The object can be described in detail by constructing an object model with numerous 2D views; however, this approach significantly increases computing time due to the enlarged search space. Thus, several approaches have been developed to extract a minimal set of object views. Appearance-based methods focus on changes in the intensity of each view. However, changes in object lighting, rotation, deformation and occlusion affect the object recognition results when using an appearance-based method. Aspect-graph representations focus on shape changes of an object's projection. Koenderink et al. developed the underlying theory that describes 3D objects using an aspect-graph representation.20 Moreover, the traditional aspect-graph method37 assumes that an object belongs to a limited class of shapes, and that characteristic views can be extracted using prior knowledge of the object. Aspect-graph vertices represent the characteristic views extracted from points on a transparent viewing sphere with the object at its center. These characteristic views are extracted as prototypes of an object from a densely sampled collection of object views.

1.4. Motivation for the proposed method

Cyr and Kimia7 presented a similarity-based aspect-graph method to extract the characteristic views using the shape similarity between views. The viewing sphere is sampled at regular (five-degree) intervals and two similarity metrics, one based on curve matching and the other based on shock matching, are applied to combine views into aspects. Let there be N objects {O_1, O_2, ..., O_n, ..., O_{N−1}, O_N}, which comprise an object database. Each object is composed of M views sampling the viewing sphere, giving rise to a set of views {V_1^1, ..., V_m^n, ..., V_M^N}, where V_m^n denotes the mth view of object O_n. The aspect p of object n is defined as A_p^n, which is a collection of views ranging from V_{m−k−}^n to V_{m+k+}^n and is represented by the characteristic view V_m^n. Moreover, the dissimilarity of two views is represented as d(V_m^n, V_i^j), the distance between the mth view of object n and the ith view of object j. The goal is to minimize the set of views required to represent each object O_n. Two criteria are imposed to maintain successful object recognition while forming the aspect representation by characteristic views. The first criterion (local monotonicity) supposes that the dissimilarity of two views increases as the relative viewing angle between them increases. The second criterion requires that the distance between each view V_i^n in an aspect A_m^n and the characteristic view of that aspect, V_m^n, is smaller than the distance between any non-aspect view V_j^n and the characteristic view V_m^n.

The training views of an object in Ref. 7, which are sampled at five-degree increments and sorted by viewing angle, are collected in advance. When additional views of an object are collected to improve the object representation in the work of Ref. 7, all views of the object must be re-sorted by viewing angle. The first criterion, local monotonicity, is not suitable when the object is symmetric in the feature space, and it is inconvenient to update the aspect-graph representation as new 2D views are collected. To improve the flexibility of the update mechanism, this work presents an incremental database construction method for building and updating the aspect-graph with object views sampled at random intervals. The object representation becomes increasingly detailed as additional views are captured and characteristic views are added, without recalculating similarity measures or re-sorting all views. Moreover, the first criterion of Ref. 7, local monotonicity, is not used, which improves the flexibility of extracting aspects of symmetric objects. Although the proposed approach cannot associate a test view with a specific viewing angle, it improves flexibility when building an aspect-graph representation and reduces computing time when updating object aspects. Additionally, the accuracy of the object representation increases with minimal growth of the search space as additional new object views are collected.

The remainder of this paper is organized as follows. Section 2 presents an overview of the proposed method. Section 3 describes the procedure for extracting features and the similarity measures used for building the database and for object matching. Section 4 describes the incremental database construction method for extracting the aspects and characteristic views of objects; furthermore, the object matching procedure is described with a weighted combination of different similarity measures. Section 5 presents experimental results that demonstrate the performance of the proposed method for 3D rigid objects, human postures and scene recognition. Conclusions are discussed in Sec. 6.


Fig. 1. The system architecture of the proposed framework. A main 3D object database (MOD) comprises all the assistant 3D object databases (AODs).

2. The Overview of the Proposed Method

The proposed framework (Fig. 1) contains two parts, called the database building procedure and the matching procedure. Suppose an object database contains T_0 objects, and T_1 2D views of each object are randomly sampled from a viewing sphere. In the database building procedure, the proposed incremental database construction method (Fig. 2) is applied to extract the aspects of each object using the T_1 2D views. The main 3D database contains a set of assistant 3D object databases (AODs). Furthermore, an AOD comprises the aspects of each object, where the aspects are represented by their characteristic views. Figure 3 illustrates the inner structure of an AOD. In Fig. 3, a set of aspects is employed to represent the database of a 3D object at the aspect level. The prototypes of these aspects, called the characteristic views, are utilized to represent an object for object matching. The passage from one characteristic view to another is defined only by the similarity measure. The proposed similarity-based aspect-graph focuses on an efficient learning method with associated features and similarity functions. As long as the object features are sufficient to discriminate the similarity between any two 2D training views, the aspects and characteristic views can be extracted using the associated similarity measures. Even if the objects are complex, the characteristic views can be extracted in the feature space.

Fig. 2. The database building procedure, where T_0 is the number of objects in the database and T_1 is the number of sampled views required to build the aspect-graph representation of an object.

Fig. 3. The inner structure of an AOD.

In the matching procedure, a similarity measure is applied between a 2D view sampled from an unknown object and all the characteristic views of the 3D object database. After the weighted combination of all similarity measures, the first three characteristic views that have the highest similarity with the testing 2D view are regarded as the recognition results (the top three matches).

3. Object Representation

In this work, shape and color features are utilized to measure the similarity between two object views. To extract shape information, a robust background subtraction framework from previous works15,24 is utilized to extract foreground regions while accounting for shadows and highlights. Foreground detection provides flexibility when constructing the object database, even in an uncontrolled environment. Canny edge detection4 is then applied to extract the shape edges, and the Gradient Vector Flow (GVF) snake43 is applied to extract the contour information. Assume that the contour information is included in a set Z, which is composed of N points z_i, where z_i is the complex number given by Eq. (1). Two kinds of shape features, the Fourier descriptor (FD) and the point-to-point length (PPL), are extracted from Z:

Z = {z(i)} = {x_i + j·y_i}, 0 ≤ i < N. (1)


3.1. Shape features

The points inside the set Z are resampled using Eq. (2) to eliminate variations in shift and scale:

Z̃ = {z̃(i)} = {L_c·[(x_i − x_c) + j·(y_i − y_c)]/L}, 0 ≤ i < N (2)

where L denotes the contour length of Z, L_c is the expected contour length, and (x_c, y_c) is the location of the contour center of Z. Then, the Fourier transform is applied to Z̃ to compute the FD using Eq. (3):

FD(k) = Σ_{n=0}^{N−1} z̃(n)·exp(−j·2πkn/N), 0 ≤ k < N. (3)

The low-frequency parts of the FD are extracted, which reduces the influence of high-frequency noise, and are defined as MAG. Notably, MAG is composed of 2T_2 magnitude values selected from the frequency information. The method for extracting MAG is given by Eq. (4):

MAG = { |Z̃(k)|, |Z̃(N − k)|, 1 ≤ k ≤ T_2 }. (4)

Intuitively speaking, MAG only characterizes the shape and not the orientation of a human posture. Therefore, MAG cannot discriminate between similar shapes oriented differently. To solve this problem, phase information of the FD is used for memorizing an object. The work in Ref. 25 proposes that memorizing the phase value at low frequency is sufficient. Suppose the phase information is θ_z; then θ_z can be calculated from FD(1) and FD(N − 1), as described in Eqs. (5) and (6):

FD(1) = |FD(1)|·exp(j·θ_1) = R_1 + j·I_1 (5)
FD(N − 1) = |FD(N − 1)|·exp(j·θ_{N−1}) = R_{N−1} + j·I_{N−1}. (6)

Furthermore, θ_z can be calculated using Eq. (7):

θ_z = (θ_1 + θ_{N−1})/2 = (arctan(I_1/R_1) + arctan(I_{N−1}/R_{N−1}))/2 (7)

where R_1 and R_{N−1} denote the real parts of FD(1) and FD(N − 1), I_1 and I_{N−1} denote the imaginary parts of FD(1) and FD(N − 1), and θ_1 and θ_{N−1} are the phases of FD(1) and FD(N − 1).

Moreover, the lengths between each pair of neighboring points in Z are defined as the PPL, which is suitable for describing shape details. Calculating the PPL is time consuming since each point is considered as a start point. Equations (8) and (9) describe the calculation of the PPL:

PPL(k) = {l_i} = {|z̃(i) − z̃(i − 1)|} = {√((x_i − x_{i−1})² + (y_i − y_{i−1})²)}, 1 ≤ k ≤ N, k ≤ i ≤ N + k (8)
z̃(k) = z̃(N + k). (9)
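As a worked illustration of Eqs. (1)–(9), the following Python sketch computes the three shape features from a sampled contour. It is not the authors' implementation: the function name, the use of NumPy, the approximation of the contour center by the point centroid, and the default values of L_c and T_2 are assumptions made for readability.

```python
import numpy as np

def shape_features(contour_xy, Lc=256.0, T2=40):
    """Compute MAG, theta_z and PPL for a closed contour of N points.

    contour_xy : (N, 2) array of (x, y) contour points, Eq. (1).
    Lc         : expected contour length used for scale normalization, Eq. (2).
    T2         : number of low-frequency magnitude pairs kept in MAG, Eq. (4).
    """
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]            # Eq. (1): z(i) = x_i + j*y_i
    N = len(z)

    # Eq. (2): remove shift (contour center approximated by the point centroid)
    # and scale (normalize by the total contour length).
    L = np.abs(np.diff(np.concatenate([z, z[:1]]))).sum()
    z_tilde = Lc * (z - z.mean()) / L

    # Eq. (3): Fourier descriptor of the normalized contour.
    FD = np.fft.fft(z_tilde)

    # Eq. (4): 2*T2 low-frequency magnitudes |FD(k)| and |FD(N - k)|, 1 <= k <= T2.
    MAG = np.concatenate([np.abs(FD[1:T2 + 1]), np.abs(FD[N - T2:N])])

    # Eqs. (5)-(7): orientation phase theta_z from FD(1) and FD(N - 1).
    theta_z = 0.5 * (np.angle(FD[1]) + np.angle(FD[N - 1]))

    # Eqs. (8)-(9): point-to-point lengths between consecutive resampled points.
    PPL = np.abs(z_tilde - np.roll(z_tilde, 1))

    return MAG, theta_z, PPL
```

The MAG and PPL vectors feed the 1-norm measures of Sec. 3.3, and θ_z serves as the assistant feature.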


3.2. Color features

Numerous features, such as edges, corners, texture, color and shape, have been utilized to extract useful information from an image. Among these features, color carries intuitive information for representing the conceptual content of an image. Therefore, pixel color and pixel position are utilized in this work to extract the conceptual content of an image. The color space used is the RGB color space, a format common to most video devices. To capture the regional information of an image, the position (x, y) is combined with the RGB color information to form the feature vector. That is, each pixel contributes a 5D feature vector (R, G, B, x, y), as shown in Fig. 4. This work applies a Gaussian mixture model (GMM) to model the region information of a scene image as a blob model, defined as BM, using the 5D feature vectors (R, G, B, x, y). We assume that the density function of the color and position features is a mixture of Gaussian distributions. Each pixel x is defined as a five-dimensional vector, and N Gaussian distributions are used to construct the GMM, which is described in Eq. (10).

f(x | λ) = Σ_{i=1}^{N} [ w_i / √((2π)^d |Σ_i|) ] · exp( −(1/2)·(x − µ_i)^T Σ_i^{−1} (x − µ_i) ) (10)

where λ represents the parameters of the GMM, λ = {w_i, µ_i, Σ_i}, i = 1, 2, ..., N, and Σ_{i=1}^{N} w_i = 1.

Next, the parameters λ of the GMM are calculated so that the GMM matches the feature vector distribution with the least error. The most common method for calculating the parameters λ is maximum likelihood (ML) estimation. The objective of ML estimation is to identify the model parameters by maximizing the likelihood function of the GMM over the training feature vectors X. The ML parameters are derived iteratively using the expectation maximization (EM) algorithm.10

Fig. 4. 5D feature vector construction.


Supposing there are s feature vectors x_1, x_2, ..., x_s (in this work s is the image size, 320 × 240 = 76,800), the ML estimate of λ can be calculated using Eq. (11):

λ_ML = arg max_λ Σ_{j=1}^{s} log f(x_j | λ). (11)

Furthermore, unsupervised data clustering is used before the EM algorithm iterations to accelerate convergence. This study uses the K-means algorithm26 for clustering. The number of clusters is defined, and then the initial center of each cluster is obtained randomly. The appropriate center and variance of each cluster can be estimated iteratively using the K-means algorithm and applied as the initial mean and variance of each Gaussian component of the GMM.
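The blob-model construction above could look like the following sketch. It assumes scikit-learn is available; GaussianMixture with k-means initialization stands in for the K-means-seeded EM procedure described in the text, and the image size and function name are only illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_blob_model(image_rgb, n_components=12):
    """Fit a GMM (Eq. (10)) to the 5D (R, G, B, x, y) pixel features of an image.

    image_rgb : (H, W, 3) uint8 array, e.g. a 240 x 320 scene image.
    Returns the fitted mixture; its weights, means and covariances form the BM feature.
    """
    H, W, _ = image_rgb.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Stack color and position into s = H * W five-dimensional feature vectors.
    features = np.column_stack([
        image_rgb.reshape(-1, 3).astype(float),  # R, G, B
        xs.reshape(-1).astype(float),            # x position
        ys.reshape(-1).astype(float),            # y position
    ])
    # EM (Eq. (11)) with k-means initialization, as described in Sec. 3.2.
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='full',
                          init_params='kmeans',
                          max_iter=100)
    gmm.fit(features)
    return gmm
```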

3.3. Similarity functions

To determine the similarity between two objects when building databases and recognizing objects, a similarity measurement D(U, V) is applied to the extracted features. We assume that the features extracted from two contours are U = {u_0, ..., u_i, ..., u_{I−1}} and V = {v_0, ..., v_i, ..., v_{I−1}}, respectively, where I denotes the feature size. Two similarity measures are applied: the 1-norm distance [Eq. (12)] and the K-L distance33 [Eq. (13)], where c denotes the number of points on an extracted contour and s denotes the image size. In this work, c is defined as 256 and s as 76,800.

D_{1-norm}(U, V) = Σ_{i=0}^{I−1} |u_i − v_i|, I = c (12)

D_{KL}(U, V) ≈ Σ_{t=0}^{I−1} [ p_1(t)·log(p_1(t)/p(t)) + p_0(t)·log(p_0(t)/p(t)) ], I = s (13)

where p_0(t) = u_t/u_sum, p_1(t) = v_t/v_sum, u_sum = Σ_{i=0}^{s−1} u_i, v_sum = Σ_{i=0}^{s−1} v_i, and p(t) = (p_0(t) + p_1(t))/2.

3.4. Similarity measures

Suppose V_new^n represents a newly sampled view of the nth object and C_m^n represents the mth characteristic view of the nth object. Moreover, A_mmin denotes the aspect that has the minimum distance from V_new^n, and C_mmin^n represents its characteristic view, where mmin is the index of A_mmin. C_{mmin−1}^n and C_{mmin+1}^n denote the neighboring characteristic views of C_mmin^n. Let d_M^1(V_new^n, C_m^n) [Eq. (14)] denote the similarity measure using MAG and the 1-norm distance, d_M^2(V_new^n, C_m^n) [Eq. (15)] the similarity measure using PPL and the 1-norm distance, d_M^3(V_new^n, C_m^n) [Eq. (16)] the similarity measure using BM and the K-L distance, and d_a^1(V_new^n, C_m^n) [Eq. (17)] the similarity measure using θ_z and the 1-norm distance:

d_M^1(V_new^n, C_m^n) = Σ_{k=1}^{T_2} [ |MAG_{V_new^n}(k) − MAG_{C_m^n}(k)| + |MAG_{V_new^n}(N − k) − MAG_{C_m^n}(N − k)| ] (14)

d_M^2(V_new^n, C_m^n) = Σ_{k=1}^{N} |PPL_{V_new^n}(k) − PPL_{C_m^n}(k)| (15)

d_M^3(V_new^n, C_m^n) = Σ_{t=0}^{s−1} [ p_1(t)·log(p_1(t)/p(t)) + p_0(t)·log(p_0(t)/p(t)) ] (16)

where p_0(t) = u_t/u_sum, p_1(t) = v_t/v_sum, u_sum = Σ_{i=0}^{s−1} u_i, v_sum = Σ_{i=0}^{s−1} v_i, and p(t) = (p_0(t) + p_1(t))/2, and

d_a^1(V_new^n, C_m^n) = |θ_z,T − θ_z,D| (17)

where θ_z,T and θ_z,D are the phases [Eq. (7)] of the new view and the characteristic view, respectively.

4. A Flexible 3D Object Recognition Framework

A flexible framework using the proposed incremental database construction method is described in this section. In the framework, a MOD is composed of one or more AODs. Each AOD is built using one main feature, or one main feature together with one assistant feature. Moreover, each feature has its own similarity function, as given in Eqs. (14)–(17).

4.1. Generation of aspects and characteristic views

The proposed incremental database construction method is a four-step procedure, illustrated in Fig. 5. Steps A-1 to A-4 are applied to extract aspects and characteristic views. The aspects constitute the object database, and the characteristic views are used for object matching with a new view V_new^n.

Step A-1. Initialize the number of aspects to zero. 2D views of the nth object are randomly sampled from a viewing sphere and each 2D view is regarded in turn as V_new^n.

Step A-2. When the number of existing aspects of the nth object equals zero, V_new^n is regarded as the characteristic view of a new aspect.


Fig. 5. The procedure of the proposed incremental database construction method.

Step A-3. When the number of existing aspects of the nth object equals one or two:

(A-3.1) When Eqs. (18) and (19) are both satisfied, V_new^n is combined into the mmin aspect, and the characteristic view of the mmin aspect remains the same,

min_{all C_m^n ∈ A_mmin} d_M(V_new^n, C_m^n) < T_3 (18)
min_{all C_m^n ∈ A_mmin} d_a(V_new^n, C_m^n) < T_5 (19)

where T_3 and T_5 are both predefined threshold values.

(A-3.2) Otherwise, if Eq. (18) is satisfied and Eq. (19) is not, V_new^n is combined into the mmin aspect and is regarded as a new characteristic view of the mmin aspect.


(A-3.3) Otherwise, if Eqs. (18) and (19) are both unsatisfied, a new aspect of the nth object is established, and V_new^n is regarded as the characteristic view of the new aspect.

Step A-4. When the number of existing aspects of the nth object is ≥ 3:

(A-4.1) If either Eq. (20) or Eq. (21) is true, a new aspect is constructed and V_new^n is considered the characteristic view of the new aspect. When a new aspect is established, the aspect order is determined so that similar aspects are close to each other, using Eq. (22): if the similarity distance between V_new^n and C_{mmin+1}^n exceeds that between V_new^n and C_{mmin−1}^n, then the new aspect is inserted between aspect mmin and aspect mmin − 1; otherwise, the new aspect is inserted between aspects mmin and mmin + 1.

min_{all C_m^n ∈ A_mmin} d_M(V_new^n, C_m^n) > T_4 (20)

T_3 < min_{all C_m^n ∈ A_mmin} d_M(V_new^n, C_m^n) < T_4 and d_M(V_new^n, C_{mmin±1}^n) > T_4 (21)

d_M(V_new^n, C_{mmin+1}^n) > d_M(V_new^n, C_{mmin−1}^n). (22)

(A-4.2) Otherwise, if Eqs. (20) and (21) are both unsatisfied and Eq. (19) is true, V_new^n is combined into the mmin aspect and the characteristic view of the mmin aspect remains the same.

(A-4.3) Otherwise, if Eqs. (20) and (21) are both unsatisfied and Eq. (19) is not true, V_new^n is combined into the mmin aspect and is regarded as a new characteristic view of the mmin aspect.
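A compact sketch of Steps A-1 to A-4 for a single object, assuming the feature of each view has already been extracted. The Aspect container, the circular treatment of the aspect order, and the simplified placement of a new aspect next to its closest neighbor (Eq. (22)) are assumptions made for readability, not the authors' data structures.

```python
import numpy as np

class Aspect:
    """One aspect: its characteristic view(s) and all member views."""
    def __init__(self, char_view):
        self.char_views = [char_view]   # characteristic (and sub-characteristic) views
        self.members = [char_view]

def update_aspects(aspects, v_new, d_main, d_assist, T3, T4, T5):
    """Insert one randomly sampled view v_new into the aspect list (Steps A-1 to A-4).

    d_main, d_assist : similarity functions, e.g. Eqs. (14) and (17).
    T3, T4, T5       : thresholds of Sec. 4.1, with T4 >= T3.
    """
    if not aspects:                                        # Step A-2: first view of the object
        return [Aspect(v_new)]

    # Distance from v_new to every aspect (minimum over its characteristic views).
    dists = [min(d_main(v_new, c) for c in a.char_views) for a in aspects]
    m_min = int(np.argmin(dists))
    d_min = dists[m_min]
    d_ang = min(d_assist(v_new, c) for c in aspects[m_min].char_views)

    if len(aspects) < 3:                                   # Step A-3
        if d_min < T3 and d_ang < T5:                      # (A-3.1) absorb, keep characteristic view
            aspects[m_min].members.append(v_new)
        elif d_min < T3:                                   # (A-3.2) absorb as a new characteristic view
            aspects[m_min].char_views.append(v_new)
            aspects[m_min].members.append(v_new)
        else:                                              # (A-3.3) open a new aspect
            aspects.append(Aspect(v_new))
        return aspects

    # Step A-4 (three or more aspects). Eq. (21) is read here as "both neighboring
    # aspects are farther than T4"; the aspect ordering is treated as circular.
    neighbours_far = all(
        min(d_main(v_new, c) for c in aspects[(m_min + s) % len(aspects)].char_views) > T4
        for s in (-1, 1))
    if d_min > T4 or (T3 < d_min < T4 and neighbours_far):  # (A-4.1) new aspect, Eqs. (20)/(21)
        aspects.insert(m_min, Aspect(v_new))                # next to the closest aspect (Eq. (22), simplified)
    elif d_ang < T5:                                        # (A-4.2) absorb, keep characteristic view
        aspects[m_min].members.append(v_new)
    else:                                                   # (A-4.3) absorb as a new characteristic view
        aspects[m_min].char_views.append(v_new)
        aspects[m_min].members.append(v_new)
    return aspects
```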

Terms T_3 and T_4 are two predefined threshold values, where T_4 ≥ T_3. The criterion for selecting T_3 and T_4 depends on the level of precision required when describing the object. If T_3 and T_4 are both small, then the criterion for combining 2D views becomes strict and, thus, the number of aspects increases. Furthermore, if the difference between T_3 and T_4 decreases, the tolerance for differences between the 2D views inside an aspect decreases, thereby increasing the number of aspects. Additionally, T_3 and T_4 should be initialized manually and modified iteratively until the final number of aspects reaches an acceptable number, which is determined based on the degree of object symmetry. In this work, T_3^{d1M} and T_4^{d1M} denote T_3 and T_4 when adopting MAG as the feature; T_3^{d2M} and T_4^{d2M} denote T_3 and T_4 when adopting PPL; T_3^{d3M} and T_4^{d3M} denote T_3 and T_4 when adopting BM; and T_5^{d1a} denotes T_5 when adopting θ_z. Section 5 presents the values of T_3^{d1M}, T_3^{d2M}, T_3^{d3M}, T_4^{d1M}, T_4^{d2M}, T_4^{d3M} and T_5^{d1a}.


4.2. Object recognition using 2D characteristic views

After constructing the aspect-graph representation of each object, a test view of an unknown object is recognized by matching it against all the characteristic views of each AOD. If multiple AODs are utilized in the framework, a hierarchical matching process is applied to calculate the final recognition result with a weighted combination of all similarity measures. Suppose the candidate objects in the kth AOD are included in a set N_k, and n(N_k) denotes the number of candidate objects in N_k. The number of candidate objects is reduced after each object matching stage, as described in Eq. (23):

n(N_{k+1}) ≤ n(N_k), k ≥ 1. (23)

In the kth AOD, the main feature and the assistant feature of the test view are extracted and matched against all the characteristic views. Suppose V_j^i denotes a test view of an unknown object; object matching in the proposed framework is then described by Eqs. (24a) and (24b):

d^k(V_j^i, C_m^n) = d_m^k(V_j^i, C_m^n) + ω_1^k · d_a^k(V_j^i, C_m^n), n ∈ N_k, k = 1 (24a)

d^k(V_j^i, C_m^n) = d_min^{k−1}(n) + ω_2^k · (d_m^k(V_j^i, C_m^n) + ω_1^k · d_a^k(V_j^i, C_m^n)), n ∈ N_k, k ≥ 2. (24b)

Let d_m^k(V_j^i, C_m^n) and d_a^k(V_j^i, C_m^n) denote the main and assistant similarity distances between the unknown object and C_m^n, where n denotes the nth object in the set of candidate objects N_k. If the framework comprises only one AOD, the characteristic views with the three smallest similarity distances d^1(V_j^i, C_m^n) [Eq. (24a)] are regarded as the top-three matches. In Eq. (24a), ω_1^k is a weighting parameter for combining different similarity measures. When no assistant feature is utilized, ω_1^k is set to zero. Otherwise, the objects whose similarity distances d^k(V_j^i, C_m^n) fall in the smaller half are collected into a set N_{k+1}. The objects in N_{k+1} are preserved for further recognition in the (k + 1)th AOD.

If the framework comprises two or more AODs, the characteristic views with the three smallest similarity distances d^k(V_j^i, C_m^n) [Eq. (24b)] are regarded as the top-three matches. In Eq. (24b), d_min^{k−1}(n) denotes the minimum similarity distance between the unknown object and the nth candidate object in the (k − 1)th database. Moreover, ω_2^k is a weighting parameter for combining the similarity measures of the kth and (k − 1)th AODs.
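A sketch of the hierarchical matching of Eqs. (23)–(24b). The data layout (each AOD as a dictionary holding its characteristic views, distance functions and weights) is an assumption; the pruning keeps the better half of the candidate objects between AODs, as described in the text.

```python
import numpy as np

def match_hierarchical(test_view, aods):
    """Match a test view against a MOD made of several AODs (Eqs. (24a)/(24b)).

    aods : list of dicts, one per AOD, each with keys
           'views'  -> list of (object_id, characteristic_view),
           'd_main' -> main similarity function, 'd_assist' -> assistant one (or None),
           'w1'     -> weight of the assistant measure, 'w2' -> inter-AOD weight.
    Returns the three (object_id, distance) pairs with the smallest combined distance.
    """
    candidates = None            # N_k: objects still in the running
    d_prev_min = {}              # d^{k-1}_min(n) from the previous AOD
    scores = []

    for k, aod in enumerate(aods):
        scores = []
        for obj_id, char_view in aod['views']:
            if candidates is not None and obj_id not in candidates:
                continue
            d = aod['d_main'](test_view, char_view)
            if aod.get('d_assist'):
                d += aod['w1'] * aod['d_assist'](test_view, char_view)   # Eq. (24a)
            if k > 0:                                                    # Eq. (24b)
                d = d_prev_min[obj_id] + aod['w2'] * d
            scores.append((obj_id, d))

        # d^k_min(n): best distance per object, reused by the next AOD and for pruning.
        best = {}
        for obj_id, d in scores:
            best[obj_id] = min(d, best.get(obj_id, np.inf))
        d_prev_min = best
        # Keep the better half of the candidate objects for the next AOD (Eq. (23)).
        ranked = sorted(best, key=best.get)
        candidates = set(ranked[:max(1, len(ranked) // 2)])

    return sorted(scores, key=lambda t: t[1])[:3]            # top-three matches
```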

4.3. Applications

The proposed framework is evaluated on various object recognition problems, including 3D object recognition, human posture recognition and scene recognition. The features and similarity measures described in Sec. 3 are employed in the three applications.

In 3D rigid object recognition, two AODs are utilized with two main features, MAG and PPL. The weighted combination of the similarity measures is described in Eqs. (25) and (26):

d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n), n ∈ N_1 (25)

d^2(V_j^i, C_m^n) = d_min^1(n) + ω_2^2 · d_m^2(V_j^i, C_m^n), n ∈ N_2. (26)

Moreover, d_m^1(V_j^i, C_m^n) is calculated with MAG using Eq. (14), and d_m^2(V_j^i, C_m^n) is calculated with PPL using Eq. (15). Furthermore, the weighting parameters ω_1^1 and ω_1^2 are both set to zero, and the weighting parameter ω_2^2 is defined by Eq. (27), where T_4^{d1M} and T_4^{d2M} are the threshold values applied in the incremental database construction method, as defined in Sec. 4.1:

ω_2^2 = T_4^{d2M} / T_4^{d1M}. (27)

In human posture recognition, only one AOD is utilized, with one main feature, MAG, and one assistant feature, θ_z. The weighted combination of the similarity measures is described in Eq. (28):

d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n) + ω_1^1 · d_a^1(V_j^i, C_m^n), n ∈ N_1. (28)

In Eq. (28), d_m^1(V_j^i, C_m^n) is calculated using Eq. (14) and d_a^1(V_j^i, C_m^n) is calculated using Eq. (17). Furthermore, the weighting parameter ω_1^1 is defined by Eq. (29), where T_5^{d1a} is the threshold value applied in the incremental database construction method, as defined in Sec. 4.1:

ω_1^1 = 1 / T_5^{d1a}. (29)

In scene recognition, only one AOD is utilized, with one main feature, BM. The weighted combination of the similarity measures is described in Eq. (30):

d^1(V_j^i, C_m^n) = d_m^1(V_j^i, C_m^n), n ∈ N_1. (30)

In Eq. (30), d_m^1(V_j^i, C_m^n) is calculated using Eq. (16). Furthermore, the weighting parameter ω_1^1 is set to zero.

5. Experimental Results

This section describes several experiments demonstrating the effectiveness of the proposed method. A SONY EVI-D30 PTZ camera was employed to capture object views. The following three databases were built to test the proposed method: the first (Fig. 6) contains 12 3D rigid objects, the second (Fig. 7) contains eight 3D human postures, and the third (Fig. 8) contains 11 scenes. The notations V_{1,j}^d and V_{2,j}^d denote the sets of training views captured at five-degree intervals, where V_{1,j}^d is employed for rigid object recognition and V_{2,j}^d for human posture recognition. The notation V_{3,j}^d, which denotes the set of training views captured at each location at one-degree increments, is utilized for scene recognition. Moreover, V_{1,j}^t and V_{2,j}^t denote the sets of testing views captured at the trisection points between each pair of points separated by five degrees, where V_{1,j}^t is utilized for rigid object recognition and V_{2,j}^t for human posture recognition.


Fig. 6. The first database containing 12 3D rigid objects.

Fig. 7. The second image database containing eight 3D human postures.

Moreover, V_{3,j}^t denotes the set of testing views captured at locations shifted away from the original locations in four directions (forward, backward, left and right), at five distances (5 cm, 10 cm, 15 cm, 20 cm and 50 cm) and with five covering rates (5%, 10%, 15%, 20% and 50%); V_{3,j}^t is utilized for scene recognition. The descriptions of the captured views are given in Eqs. (31)–(36):

V_{1,j}^d = {V_{1,j}^d(i)}, where 1 ≤ j ≤ 12, 1 ≤ i ≤ 72 (31)

V_{2,j}^d = {V_{2,j}^d(i)}, where 1 ≤ j ≤ 8, 1 ≤ i ≤ 72 (32)


Fig. 8. The third image database containing 11 scenes.

V_{3,j}^d = {V_{3,j}^d(i)}, where 1 ≤ j ≤ 11, 1 ≤ i ≤ 61 (33)

V_{1,j}^t = {V_{1,j}^t(i)}, where 1 ≤ j ≤ 12, 1 ≤ i ≤ 216 (34)

V_{2,j}^t = {V_{2,j}^t(i)}, where 1 ≤ j ≤ 8, 1 ≤ i ≤ 216 (35)

V_{3,j}^t = {V_{3,j}^t(i)}, where 1 ≤ j ≤ 11, 1 ≤ i ≤ 6100. (36)

In the following experiments, T_0 denotes the number of objects: 12 for rigid object recognition, 8 for human posture recognition and 11 for scene recognition. T_1 denotes the number of training views: 72 for rigid object recognition and human posture recognition, and 61 for scene recognition. T_2 denotes the number of low-frequency components of the FD that are kept, and is 40 in the following experiments. Moreover, the threshold values used in the proposed incremental database construction method are listed in Table 1.

Table 1. The threshold values for the proposed incremental database construction method (T_4 = 1.25 · T_3 in every case).

3D object recognition — first AOD (main feature MAG): T_3^{d1M} = 640, T_4^{d1M} = 1.25 · T_3^{d1M}, no assistant feature; second AOD (main feature PPL): T_3^{d2M} = 336, T_4^{d2M} = 1.25 · T_3^{d2M}, no assistant feature.
Human posture recognition — first AOD (main feature MAG): T_3^{d1M} = 1450, T_4^{d1M} = 1.25 · T_3^{d1M}, assistant feature θ_z: T_5^{d1a} = 10; no second AOD.
Scene recognition — first AOD (main feature BM): T_3^{d3M} = 1100, T_4^{d3M} = 1.25 · T_3^{d3M}, no assistant feature; no second AOD.
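For reference, the settings of Table 1 can be written down as a small configuration structure; the dictionary layout below is purely illustrative, while the numerical values are taken from Table 1.

```python
# Threshold configuration corresponding to Table 1 (T4 = 1.25 * T3 in every case).
THRESHOLDS = {
    "3d_object_recognition": [
        {"main_feature": "MAG", "T3": 640,  "T4": 1.25 * 640,  "T5": None},  # first AOD
        {"main_feature": "PPL", "T3": 336,  "T4": 1.25 * 336,  "T5": None},  # second AOD
    ],
    "human_posture_recognition": [
        {"main_feature": "MAG", "T3": 1450, "T4": 1.25 * 1450, "T5": 10},    # assistant: theta_z
    ],
    "scene_recognition": [
        {"main_feature": "BM",  "T3": 1100, "T4": 1.25 * 1100, "T5": None},
    ],
}
```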


The computing time for calculating the similarity between a test view and a view in the database was approximately 0.006 s for rigid object recognition, 0.004 s for human posture recognition and 0.01 s for scene recognition on a P4 3.2 GHz CPU with 1 GB RAM.

5.1. 3-D rigid object recognition

In the first experiment, the efficiency of the proposed framework was assessed using 2D views captured at random intervals from the first database (Fig. 6). To determine the average performance of the proposed method, training views were generated by sampling the views in V_{1,j}^d in 200 different random orders. Background subtraction was first performed on the training 2D views to extract the foreground objects. After that, Canny edge detection and the GVF snake were applied to the extracted foreground objects to extract the object contour. Two features, MAG and PPL, were then extracted from the object contour and used for building the AODs with the proposed incremental database construction method (Fig. 5). The characteristic views of the aspects in each AOD were utilized for object matching. A recognition result is calculated with a weighted combination of the similarity measures from both AODs. Figure 9 illustrates the system architecture of the proposed framework for 3D rigid object recognition.

Table 2 presents statistical information on the aspect numbers obtained using MAG and PPL. Symmetric objects, such as objects 2, 5, 6 and 7, had few aspects, thereby reducing the computing time for recognizing them. The views in V_{1,j}^t were adopted as unknowns and tested each of the 200 times the aspect-graph representations were built. The proposed aspect-graph generation is effective, as shown by the high recognition rates of the Top 1 to Top 3 matches in Table 2.

The proposed method, which constructs an aspect-graph representation using views sampled at random intervals, provides a practical updating mechanism that integrates newly collected views into the database. In this experiment, 18 random views sampled from V_{1,j}^d are first utilized to construct a coarse aspect-graph representation of each object, called D18.

Fig. 9. The system architecture of the proposed framework applied during the first experiment (3D rigid object recognition).


Table 2. The result of rigid object recognition using 2D views via MAG and PPL. Columns correspond to objects 1–12 of the first database (Fig. 6) and their average.

Object                      1      2      3      4      5      6      7      8      9      10     11     12     Avg.
Number of aspects (MAG)     34.66  3.84   27.83  24.75  6.87   9.47   2.04   25.62  17.14  16.16  16.62  28.75  17.81
Number of aspects (PPL)     38.72  14.08  14.32  22.84  10.98  20.12  8.41   31.07  25.79  17.68  23.61  19.88  20.63
Top 1 match (%)             98.25  99.97  97.71  97.39  100    99.81  99.79  99.35  99.90  97.97  98.44  96.83  98.78
Top 2 match (%)             99.21  100    98.96  98.73  100    99.96  99.86  99.67  99.97  98.68  99.47  98.17  99.39
Top 3 match (%)             99.61  100    99.39  99.34  100    99.98  99.89  99.78  99.99  98.98  99.77  98.64  99.62

Table 3. Results for the aspect numbers using MAG and PPL after updating by additional training views. Columns correspond to objects 1–12 of the first database (Fig. 6).

Database   1      2     3      4      5     6      7     8      9      10     11     12
D18        14.11  3.40  11.98  10.13  5.32  6.36   1.58  12.64  8.80   8.43   8.73   11.10
D36        22.86  3.60  18.83  16.15  6.23  8.01   1.80  19.16  12.53  12.17  12.48  18.29
D54        29.52  3.74  23.95  20.96  6.65  8.94   1.92  23.24  15.20  14.53  15.06  24.05
D72        34.66  3.84  27.83  24.75  6.87  9.47   2.04  25.62  17.14  16.16  16.62  28.75
D90        39.28  3.99  28.68  25.99  7.14  9.83   2.14  27.32  18.04  17.37  17.86  30.99
D108       43.28  4.07  29.50  27.14  7.36  10.12  2.26  28.67  18.90  18.50  19.06  33.12

Eighteen additional random views are then adopted from the remaining views in V_{1,j}^d to increase the accuracy of the database D18, forming D36. Similarly, D54 and D72 are constructed using the views remaining in V_{1,j}^d. Additionally, D90 and D108 are further constructed with extra random views sampled from V_{1,j}^t. Table 3 presents the average aspect numbers for each rigid object over the 200 iterations. Although the aspect numbers increase when new views are employed to update the coarse database, the number of stored views remains significantly smaller than the number of original views. Figure 10 presents the recognition rates obtained when using the coarse to fine databases, and Fig. 11 presents the standard deviations of the recognition rates. The recognition rate increases when the aspect-graph representations are trained with additional object views. Moreover, stability increases, as indicated by the decreasing standard deviations. Therefore, the proposed method is shown to be effective for updating aspect-graph representations without re-sorting all the collected views or recalculating all similarity measures.

5.2. Human posture recognition

The efficiency of the proposed method is demonstrated using the second image database (Fig. 7). Using the same preprocessing as in the first experiment, the object contour (the contour of the human posture) was extracted for further use.


Fig. 10. Recognition rates of the coarse and fine databases (D18, D36, D54, D72, D90 and D108), calculated from 200 results.

In the second experiment, two features, MAG and θ_z, are extracted from the object contour and used for building the AOD (Fig. 5). The characteristic views of the aspects in each AOD are utilized for human posture recognition. Figure 12 illustrates the system architecture of the proposed framework for human posture recognition. Table 4 shows the efficiency of the proposed method, with a high recognition rate.

The proposed method decreases the number of aspects for each human posture, thus reducing the computing time for recognition. Furthermore, adopting θ_z instead of PPL reduces the computing time: the similarity measure based on a posture contour with N points requires computing N similarity distances between an unknown posture and a posture in the database when PPL is adopted as the feature, but the similarity is computed only once when θ_z is adopted.


Fig. 11. Standard deviations of the recognition rates using the coarse to fine databases (D18, D36, D54, D72, D90 and D108), calculated from 200 results.

Fig. 12. The system architecture of the proposed framework applied during the second experiment (human posture recognition).


Table 4. Results of human posture recognition using 2D views via MAG and θ_z. Columns correspond to postures 1–8 of the second database (Fig. 7) and their average.

Posture              1      2      3      4      5      6      7      8      Avg.
Number of aspects    8      25     37     41     42     38     8      38     29.63
Top 1 match (%)      94.91  99.07  98.15  100    99.07  96.30  99.54  100    98.38
Top 2 match (%)      99.07  99.54  100    100    100    99.54  100    100    99.77
Top 3 match (%)      100    99.54  100    100    100    99.54  100    100    99.88

5.3. Scene recognition

In the third experiment, training images of 11 locations (Fig. 8) in an environment (Fig. 13) are obtained by rotating the PTZ camera from −30° to 30° in one-degree increments at each location, thereby generating 61 images per position. Furthermore, 12 Gaussian distributions are adopted in this work to build the blob model (BM feature). The number of aspects of each scene is below 13 after the combination process. Figure 14 presents sample training images, blob models and conceptual descriptions for each scene. Figure 15 illustrates the system architecture of the proposed framework for scene recognition. Additionally, for the sake of illustration, the set of characteristic views at the sixth position in the indoor environment is shown as an example (Fig. 16).

To test the efficiency of the proposed method, test images are captured by a mobile robot shifted in four directions (forward, backward, left and right) at five different distances (5 cm, 10 cm, 15 cm, 20 cm and 50 cm) and with five different levels of occlusion (5%, 10%, 15%, 20% and 50%). Sixty-one images are captured at each position by rotating the camera from −30° to 30° in one-degree increments with no occlusion. Figure 17 presents sample test images captured at the sixth position: Fig. 17(a) shows the test images captured in the forward and backward directions, and Fig. 17(b) shows the test images captured in the left and right directions.

Fig. 13. The indoor environment from which scenes in the third database are obtained.


Fig. 14. The sample training image, blob model and conceptual description of each scene captured in the indoor environment (Fig. 13). (a) The sample image captured at each location in the indoor environment (from left to right: positions 1, 2, ..., 11). (b) The blob model of each sample image in (a), using 12 Gaussian distributions. (c) The conceptual description of each sample image in (a), calculated by comparing the original pixel values of each captured image with its blob model.

Fig. 15. The system architecture of the proposed framework applied during the third experiment (scene recognition).

Fig. 16. The 11 characteristic views at the sixth position in the indoor environment.

To increase the robustness of scene recognition, multiple-view recognition is appropriate for testing. In this experiment, three arbitrary images I_i (1 ≤ i ≤ 3), obtained with different rotation angles of the PTZ camera, are utilized for scene recognition. For each image, the three recognized results with the smallest similarity measures are adopted as candidates for further processing. Suppose O is the set of recognized results, defined as follows:

O = {o_ij}, 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, 1 ≤ o_ij ≤ 11

where i is the index of the test image and j is the rank of the recognition result.

Fig. 17. The test images captured from the sixth position in the indoor environment. (a) The test images captured in the forward and backward directions; the shifted distances are backward 50 cm, backward 20 cm, backward 15 cm, backward 10 cm, backward 5 cm, 0 cm, forward 5 cm, forward 10 cm, forward 15 cm, forward 20 cm and forward 50 cm. (b) The test images captured in the left and right directions; the shifted distances are left 50 cm, left 20 cm, left 15 cm, left 10 cm, left 5 cm, 0 cm, right 5 cm, right 10 cm, right 15 cm, right 20 cm and right 50 cm.

Three methods are proposed for estimating the final scene recognition result. The first result, R1, is estimated using only one recognition result from one captured image. The second result, R2, uses the first three recognition results of one captured image. The third result, R3, uses all combinations of the first three recognition results of the three captured images. R1, R2 and R3 are given by Eq. (37):

R_k =
  o_11, k = 1
  o_11 · D̄_1 + r_1 · D_1, k = 2
  o_11 · (D̄_1 D̄_2 D̄_3) + r_1 · D_1 + D̄_1 · [r_2 · D_2 + D̄_2 · (r_3 · D_3)], k = 3 (37)

where r_p = arg max(F_p), 1 ≤ p ≤ 3; D_p = 1 if arg max(F_p) exists and D_p = 0 otherwise, 1 ≤ p ≤ 3; D̄_p = 1 − D_p; and F_p = {f_pq, 1 ≤ q ≤ 11} with f_pq = Σ_{j=1}^{3} δ(q − v_pj).
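The decision rule of Eq. (37) can be read as a rank-by-rank vote over the three captured images, falling back to the single-image result o_11 when no agreement is found. The sketch below follows that reading; the exact tie-breaking of the indicator terms D_p is simplified, and all names are illustrative.

```python
from collections import Counter

def fuse_scene_results(top3_per_image):
    """Three-view decision in the spirit of Eq. (37), method R3.

    top3_per_image : three lists, each holding the top-three scene indices
                     (rank 1, 2, 3) recognized from one captured image.
    The scene is taken from the first rank level at which more than one image
    reports the same scene; if no rank level yields such an agreement, the
    top-1 result of the first image (o_11, i.e. method R1) is returned.
    """
    for rank in range(3):                       # rank levels p = 1, 2, 3
        votes = Counter(results[rank] for results in top3_per_image
                        if len(results) > rank)
        if votes:
            scene, count = votes.most_common(1)[0]
            if count > 1:                       # an agreed-upon scene exists at this rank
                return scene
    return top3_per_image[0][0]                 # fall back to o_11

# Example: the three images report scenes 6, 6 and 5 at rank 1, so scene 6 is returned.
print(fuse_scene_results([[6, 5, 7], [6, 7, 5], [5, 6, 4]]))
```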

Based on the recognition results (Table 5), the recognition rates of the three methods are all above 95% when the level of occlusion is at most 20% and the position variation is at most 20 cm. Even when the level of occlusion is 50% and the position variation is 50 cm, the recognition rates remain above 50%. Moreover, the third method, R3, performs best, which is consistent with the way human vision is used for localization: when a person enters an unknown place, views in multiple directions are captured by the eyes to assist recall of past experiences of that place. In this work, the same strategy is adopted to increase the robustness of scene recognition.


Table 5. Scene recognition results using 2D views via BM with position variations and different levels of occlusion. For each covering rate, the three columns give the rates of methods R1, R2 and R3.

Covering rate            5%                  10%                 15%                 20%                 50%
Shift (cm), direction  R1    R2    R3      R1    R2    R3      R1    R2    R3      R1    R2    R3      R1    R2    R3
0                     1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   0.800 0.769 0.817
5, forward            1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   0.797 0.775 0.809
5, backward           1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   0.999 1.000 1.000   0.794 0.763 0.809
5, left               1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   0.794 0.779 0.806
5, right              1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   0.791 0.754 0.818
10, forward           1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   0.796 0.784 0.802
10, backward          0.997 0.979 1.000   1.000 0.997 1.000   1.000 0.997 1.000   0.997 0.996 1.000   0.785 0.738 0.802
10, left              1.000 0.993 1.000   1.000 0.996 1.000   1.000 0.996 1.000   1.000 0.993 1.000   0.796 0.772 0.805
10, right             1.000 1.000 1.000   1.000 1.000 1.000   1.000 0.997 1.000   1.000 0.999 1.000   0.770 0.747 0.817
15, forward           1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   0.784 0.769 0.797
15, backward          1.000 0.996 1.000   1.000 0.993 1.000   0.997 0.988 1.000   0.994 0.987 1.000   0.763 0.726 0.781
15, left              0.997 0.979 1.000   0.997 0.979 1.000   0.997 0.979 1.000   0.994 0.975 1.000   0.779 0.736 0.794
15, right             0.992 0.982 0.997   0.992 0.977 0.997   0.992 0.979 0.997   0.992 0.977 0.997   0.726 0.705 0.757
20, forward           1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   1.000 1.000 1.000   0.765 0.751 0.782
20, backward          0.999 0.987 1.000   0.994 0.982 1.000   0.991 0.979 1.000   0.988 0.976 1.000   0.733 0.711 0.748
20, left              0.976 0.960 0.987   0.979 0.970 0.993   0.975 0.961 0.979   0.975 0.963 0.979   0.748 0.699 0.776
20, right             0.975 0.955 0.984   0.979 0.970 0.993   0.976 0.951 0.984   0.975 0.951 0.984   0.714 0.694 0.744
50, forward           0.845 0.838 0.854   0.845 0.844 0.849   0.842 0.829 0.845   0.832 0.815 0.839   0.508 0.503 0.523
50, backward          0.881 0.839 0.925   0.848 0.821 0.896   0.830 0.809 0.872   0.796 0.785 0.833   0.525 0.508 0.553
50, left              0.750 0.741 0.775   0.748 0.742 0.768   0.733 0.729 0.754   0.723 0.711 0.735   0.502 0.501 0.531
50, right             0.811 0.794 0.841   0.799 0.784 0.827   0.781 0.770 0.817   0.753 0.763 0.778   0.532 0.502 0.531


6. Conclusions

This study presents a flexible framework for recognizing 3D objects by building aspect-graph representations from 2D views sampled at random intervals. The proposed framework comprises an incremental database construction method and a hierarchical weighted combination structure. A robust database, called a MOD, is composed of AODs. Each AOD is built using the incremental database construction method with one main feature, or with one main feature and one assistant feature. The final recognition result is estimated by combining the results calculated from each AOD. To demonstrate the efficiency of the proposed framework, three object recognition problems, namely 3D object recognition, human posture recognition and scene recognition, are addressed in the experiments.

Although the threshold values (T_3 and T_4) applied in the proposed incremental database construction method are determined manually case by case, the criteria for selecting T_3 and T_4 are described in Sec. 4.1. The selection of T_3 and T_4, which involves a trade-off in this work, affects the number of aspects and thus the computing time and the error performance. Moreover, feature selection plays an important role when applying the proposed method in different applications. Although the recognition rate decreases as the number of objects in the database increases in most applications, the proposed framework provides a hierarchical structure that can combine more features to maintain the robustness of the recognition system.

Moreover, the proposed incremental database construction method is practical for extracting aspects when the features of an object conflict with the first criterion, local monotonicity, as indicated by Cyr and Kimia.7 For instance, the combinational algorithm developed by Cyr and Kimia7 cannot efficiently combine 2D views of a human posture using MAG. The proposed incremental database construction method overcomes this problem and efficiently decreases the aspect number. In Fig. 18(a), the blue circles represent 2D views of a human posture, and the black human postures connected to the blue circles by the red and green lines represent the 2D views belonging to the same aspect. In Fig. 18(b), the black human postures are the characteristic views of the aspects of the human posture. The two aspects in Fig. 18(b) clearly contain two clusters of 2D views that are opposites.

The proposed method also decreases the computing time required to update the aspects with new 2D views. Using the method proposed by Cyr and Kimia,7 adding a view to an object with N collected views requires N(N + 1)/2 similarity computations to calculate the mutual similarity distances between the (N + 1) 2D views and to extract the aspects and characteristic views. In contrast, the proposed method requires only N similarity computations between the new incoming view and the N existing views. Nevertheless, the overall computation requirement remains high, and improving the efficiency of the method is a topic for future work.


Fig. 18. The aspect-graph representation of the first human posture listed in Fig. 7, using MAG only. (a) The similar 2D views of two aspects. (b) The characteristic views of the two aspects.

Acknowledgments

This work was supported by National Science Council of the R.O.C. under grant no. NSC94-2218-E009064 and DOIT TDPA Program under the project number 95-EC-17-A-04-S1-054.

References

1. K. Akita, Image sequence analysis of real world human motion, Patt. Recogn. 17(4) (1984) 73–83.

2. M. Bessa, A. Coelho, J. B. Cruz and A. Chalmers, Selective presentation of perceptu-ally important information to aid orientation and navigation in an urban environment,

Int. J. Patt. Recogn. Artif. Intell.20(4) (2006) 467–482.

3. V. Blanz, M. J. Tarr and H. H. Bülthoff, What object attributes determine canonical views? Perception 28 (1999) 575–599.

4. J. Canny, A computational approach to edge detection, IEEE Trans. Patt. Anal. Mach. Intell. 8(6) (1986) 679–698.

5. G. Cicirelli, T. D'Orazio and A. Distante, Different learning methodologies for vision-based navigation behaviors, Int. J. Patt. Recogn. Artif. Intell. 19(8) (2005) 949–975.

6. R. Cucchiara, C. Grana, A. Prati and R. Vezzani, Probabilistic posture classification for human-behavior analysis, IEEE Trans. Syst. Man Cybern. 35(1) (2005) 42–54.

7. C. M. Cyr and B. Kimia, A similarity-based aspect-graph approach to 3D object recognition, Int. J. Comput. Vis. 57(1) (2004) 5–22.

8. C. de Trazegnies, C. Urdiales, A. Bandera and F. Sandoval, 3D object recognition based on curvature information of planar views, Patt. Recogn. 36(11) (2003) 2571–2584.

9. Q. Delamarre and O. Faugeras, 3-D articulated models and multi-view tracking with silhouettes, IEEE Conf. Comput. Vis. (September 1999), pp. 716–721.

10. A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc. 39(1) (1977) 1–38.


11. G. N. Desouza and A. C. Kak, Vision for mobile robot navigation: a survey, IEEE Trans. Patt. Anal. Mach. Intell. 24 (2002) 237–267.

12. A. Diplaros, T. Gevers and I. Patras, Color-shape context for object recognition, IEEE Workshop on Color and Photometric Methods in Computer Vision (in conjunction with ICCV 2003), Nice, France (2003).

13. A. Diplaros, T. Gevers and I. Patras, Combining color and shape information for illumination-viewpoint invariant object recognition, IEEE Trans. Imag. Process. 15(1) (2006) 1–11.

14. I. Haritaoglu, D. Harwood and L. S. Davis, Ghost: a human body part labeling system using silhouettes, Proc. Int. Conf. Patt. Recogn. 1 (1998) 77–82.

15. J. S. Hu, T. M. Su and S. C. Jen, Robust background subtraction with shadow removal for indoor environment surveillance, Proc. IEEE IROS, China (October 2006).

16. S. Iwasawa, K. Ebihara, J. Ohya and S. Morishima, Real-time human posture estimation using monocular thermal images, IEEE Int. Conf. Automatic Face and Gesture Recognition (April 1998), pp. 492–497.

17. H. Jiang, Z. N. Li and M. S. Drew, Recognizing posture in pictures with successive convexification and linear programming, IEEE Trans. Multimed. 14(6) (2007) 26–37.

18. S. Kim, G. J. Jang, W. H. Lee and I. S. Kweon, Combined model-based 3D object recognition, Int. J. Patt. Recogn. Artif. Intell. 19(7) (2005) 839–852.

19. T. K. Kim, J. Kittler and R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Trans. Patt. Anal. Mach. Intell. 29(6) (2007) 1005–1018.

20. J. J. Koenderink and A. J. van Doorn, The singularities of the visual mapping, Biol. Cybern. 24 (1976) 51–59.

21. A. Kosaka and A. C. Kak, Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties, Comput. Vis. Graph. Imag. Process.: Imag. Underst. 56(3) (1992) 271–329.

22. B. J. A. Kröse, N. Vlassis, R. Bunschoten and Y. Motomura, A probabilistic model for appearance-based robot localization, Imag. Vis. Comput. 19 (2001) 381–391.

23. P. Lamon, A. Tapus, E. Glauser, N. Tomatis and R. Siegwart, Environmental modeling with fingerprint sequences for topological global localization, IEEE Int. Conf. Intell. Robots and Systems (October 2003), pp. 3781–3786.

24. C. C. Lin, Shape memorization and recognition of 3-D objects using a similarity-based aspect-graph approach, Department of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., Master thesis (June 2005).

25. P. C. Lin, Human posture recognition system using 2-D shape features, Department of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., Master thesis (June 2006).

26. J. B. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, Berkeley (University of California Press, 1967), Vol. 1, pp. 281–297.

27. G. Mamic and M. Bennamoun, Representation and recognition of 3D free-form objects, Dig. Sign. Process. 12 (2002) 47–76.

28. A. S. Mian, M. Bennamoun and R. A. Owens, Three-dimensional model-based object recognition and segmentation in cluttered scenes, IEEE Trans. Patt. Anal. Mach. Intell. 28(10) (2006) 1584–1601.

29. S. K. Naik and C. A. Murthy, Distinct multi-colored region descriptor for object recognition, IEEE Trans. Patt. Anal. Mach. Intell. 29(7) (2007) 1291–1296.


30. L. B. Ozer, T. Lu and W. Wolf, Design of a real-time gesture recognition system: high performance through algorithms and software, IEEE Sign. Process. Mag. 22 (2005) 57–64.

31. L. B. Ozer and W. Wolf, Real-time posture and activity recognition, Proc. IEEE Workshop on Motion and Video Computing (2002), pp. 133–138.

32. G. Peters, Theories of three-dimensional object perception – a survey, Recent Research Developments in Pattern Recognition, Transworld Research Network (2000).

33. Y. Rubner, C. Tomasi and L. J. Guibas, The earth mover's distance as a metric for image retrieval, Int. J. Comput. Vis. 40(2) (2000) 99–121.

34. H. Schneiderman and T. Kanade, Object detection using the statistics of parts, Int. J. Comput. Vis. 56(3) (2004) 151–177.

35. T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber and T. Poggio, Robust object recognition with cortex-like mechanisms, IEEE Trans. Patt. Anal. Mach. Intell. 29(3) (2007) 411–426.

36. Y. Shan, H. S. Sawhney, B. Matei and R. Kumar, Shapeme histogram projection and matching for partial object recognition, IEEE Trans. Patt. Anal. Mach. Intell. 28(4) (2006) 568–577.

37. I. Shimshoni and J. Ponce, Finite-resolution aspect graphs of polyhedral objects, IEEE Trans. Patt. Anal. Mach. Intell. 19(4) (1997) 315–327.

38. A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele and L. Van Gool, Towards multi-view object class detection, IEEE Conf. Computer Vision and Pattern Recognition (CVPR), New York, USA (June 2006).

39. I. Ulrich and I. Nourbakhsh, Appearance based place recognition for topological localization, IEEE Conf. Robotics and Automation (November 2000), pp. 1023–1029.

40. I. Weiss and M. Ray, Model-based recognition of 3D objects from single images, IEEE Trans. Patt. Anal. Mach. Intell. 23(2) (2001) 116–128.

41. N. Werghi and Y. Xiao, Recognition of human body posture from a cloud of 3-D data points using wavelet transform coefficients, Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition (2002), pp. 70–75.

42. C. Wren, A. Azarbayejani, T. Darrell and A. Pentland, Pfinder: real-time tracking of the human body, IEEE Trans. Patt. Anal. Mach. Intell. 19(7) (1997) 780–785.

43. C. Xu and J. L. Prince, Snakes, shapes, and gradient vector flow, IEEE Trans. Imag. Process. 7(3) (1998) 359–369.

44. P. Yan, S. M. Khan and M. Shah, 3D model based object class detection in an arbitrary view, IEEE Int. Conf. Computer Vision (ICCV), Rio de Janeiro, Brazil (October 2007).


Jwu-Sheng Hu

received the B.S. degree from the Department of Mechanical Engineering, National Taiwan University, Taiwan, in 1984, and the M.S. and Ph.D. degrees from the Department of Mechanical Engineering, University of California at Berkeley, in 1988 and 1990, respectively. He is currently a Professor in the Department of Electrical and Control Engineering, National Chiao-Tung University, Taiwan, R.O.C.

His current research interests include microphone array signal processing, active noise control, intelligent mobile robots, embedded systems and applications.

Tzung-Min Su

received the B.S. degree in electrical and control engineering from National Chiao Tung University, Taiwan, R.O.C. in 2000. He is currently a Ph.D. candidate in the Department of Electrical and Control Engineering at National Chiao Tung University, Taiwan, R.O.C. He was awarded the championship at the national competition held by the Ministry of Education Advisor Office in 2001.

His research interests include background subtraction, 3D object recognition, and home-care surveillance.


