
Chapter 2 Background Subtraction

2.4 Background Subtraction with Shadow Removal

2.4.1 Shadow and Highlight Removal

Besides foreground and background, shadows and highlights are two important phenomena that should be considered in most cases. Both result from changes in illumination. Compared with the original pixel value, a shadow has similar chromaticity but lower brightness, and a highlight has similar chromaticity but higher brightness. If shadow and highlight removal is not performed after background subtraction, the regions influenced by illumination changes are classified as foreground.

Horprasert et al. [60] proposed a method for detecting highlights and shadows by gathering statistics from N color background images. Brightness distortion and chromaticity distortion are used with four threshold values to classify pixels into four classes. Because the method in [60] uses the mean value as the reference image, it is not suitable for dynamic backgrounds. Furthermore, the threshold values are estimated from the histograms of brightness distortion and chromaticity distortion for a given detection rate, and are applied to all pixels regardless of their values. A darker pixel value may therefore be misclassified as shadow. Finally, the method cannot record the history of background information.

This work proposes a 3D cone model that is similar to the pillar model proposed by Horprasert [60], and combines the LTCBM and STCBM to solve the above problems. The parameters of the 3D cone model can be determined efficiently from the proposed LTCBM and STCBM. In the RGB space, a Gaussian distribution of the LTCBM becomes an ellipsoid whose center is the mean of the Gaussian component, and the length of each principal axis equals 2.5 standard deviations of the Gaussian component. A new pixel I(R,G,B) is considered to belong to the background if it is located inside the ellipsoid. Pixels located outside the ellipsoid but inside the cone (formed by the ellipsoid and the origin) have chromaticities that resemble the chromaticity of the background. The brightness difference is then applied to classify such a pixel as either highlight or shadow. Figure 2-4 illustrates the 3D cone model in the RGB color space.

Figure 2-4 The proposed 3D cone model in the RGB color space.
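As a minimal sketch of the background test described above, the following Python function checks whether a pixel lies inside the ellipsoid of one Gaussian component. The function name, the argument layout, and the assumption that the ellipsoid is axis-aligned in RGB space (with semi-axes of 2.5 standard deviations per channel) are illustrative choices rather than details confirmed by the text.

```python
def inside_background_ellipsoid(pixel, mean, std, k=2.5):
    """Return True if `pixel` lies inside the ellipsoid centered at `mean`.

    Assumption: the ellipsoid is axis-aligned in RGB space, with semi-axis
    k * std[c] along each channel c (k = 2.5 as stated in the text).
    """
    total = 0.0
    for value, mu, sigma in zip(pixel, mean, std):
        total += ((value - mu) / (k * sigma)) ** 2
    return total <= 1.0
```

A pixel that fails this test is not necessarily foreground; under the cone model it is still examined for shadow or highlight.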

The threshold values τ_low and τ_high are applied to avoid classifying the darker pixel value as shadow or the brighter value as highlight, and can be selected based on the standard deviation of the corresponding Gaussian distribution in the CBM.

Because the standard deviations of the R, G and B color axes are different, the angles between the curved surface and the ellipsoid center are also different. It is difficult to classify the pixel using the angles in the 3D space. The 3D cone is projected onto the 2D space to classify a pixel using the slope and the point of tangency. Figure 2-5 illustrates the projection of the 3D cone model onto the RG 2D space.

Figure 2-5 2D projection of the 3D cone model from RGB space onto the RG space.

Let a and b denote the lengths of the major and minor axes of the ellipse, where a = 2.5σ_R and b = 2.5σ_G. The center of the ellipse is (μ_R, μ_G), and the ellipse is described by Eq. (2-14).

$$\frac{(R-\mu_R)^2}{a^2}+\frac{(G-\mu_G)^2}{b^2}=1 \tag{2-14}$$

A matching result set is given by F_b = {f_bi, i = 1, 2, 3}, where f_bi is the matching result of a specific 2D space. A pixel vector I = [I_R, I_G, I_B] is then projected onto the 2D spaces of R-G, G-B, and B-R. The pixel matching result is set to 1 when the slope of the projected pixel vector is between m₁ and m₂. Meanwhile, if the background mean vector is E = [μ_R, μ_G, μ_B], the brightness distortion α_b can be calculated from I and E.

The image pixel is classified as highlight, shadow or foreground using the matching result set F_b, the brightness distortion α_b, and Eq. (2-17).
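The slope test on one 2D projection could be sketched as follows. This sketch assumes that m₁ and m₂ are the slopes of the two lines through the origin that are tangent to the projected ellipse; that reading follows from the cone being formed by the ellipsoid and the origin, but the exact definition is not reproduced here. The function names are hypothetical. A line y = m·x is tangent to the ellipse centered at (μ_x, μ_y) with semi-axes a and b when (μ_y − m·μ_x)² = a²m² + b², which gives a quadratic in m.

```python
import math

def tangent_slopes(mu_x, mu_y, a, b):
    """Slopes of the two lines through the origin tangent to the ellipse
    (x - mu_x)^2 / a^2 + (y - mu_y)^2 / b^2 = 1.

    Tangency of y = m*x requires (mu_y - m*mu_x)^2 = a^2*m^2 + b^2.
    Assumes mu_x > a, so the origin is outside the ellipse and neither
    tangent line is vertical.
    """
    qa = mu_x ** 2 - a ** 2
    qb = -2.0 * mu_x * mu_y
    qc = mu_y ** 2 - b ** 2
    disc = qb ** 2 - 4.0 * qa * qc
    if qa <= 0 or disc < 0:
        raise ValueError("origin must lie outside the ellipse with mu_x > a")
    root = math.sqrt(disc)
    return (-qb - root) / (2.0 * qa), (-qb + root) / (2.0 * qa)

def matches_projection(px, py, mu_x, mu_y, sigma_x, sigma_y, k=2.5):
    """Return 1 if the projected pixel (px, py) has a slope between the
    two tangent slopes, else 0 (one entry f_bi of the matching set F_b)."""
    m_low, m_high = tangent_slopes(mu_x, mu_y, k * sigma_x, k * sigma_y)
    if px <= 0:
        return 0  # vertical slope; treated as no match in this sketch
    return 1 if m_low <= py / px <= m_high else 0
```

Applying `matches_projection` to the R-G, G-B and B-R projections would yield the three entries of F_b.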

When a pixel lies many standard deviations away from a Gaussian distribution, the probability of the pixel under that distribution is approximately zero, which means that the pixel does not belong to the distribution. Based on this simple concept, τ_high and τ_low can be chosen using N_G standard deviations of the corresponding Gaussian distribution in the CBM, as described in Eq. (2-18).

2.4.2 Background Subtraction

A hierarchical approach combining color-based background subtraction and gradient-based background subtraction has been proposed by Javed et al. [57]. This work proposes a similar method for extracting the foreground pixels. Given a new image frame I, the color-based background model is set to the LTCBM and STCBM, and the gradient-based model is F^k(Δ_m, Δ_d). C(I) is defined as the result of color-based background subtraction using the CBM, and G(I) is defined as the result of gradient-based background subtraction. C(I) and G(I) can be extracted by testing every pixel of frame I using the LTCBM and F^k(Δ_m, Δ_d). Moreover, C(I) and G(I) are both defined as binary images, where 1 represents a foreground pixel and 0 represents a background pixel. The foreground pixels labeled in C(I) are further classified as shadow, highlight and foreground by using the proposed 3D cone model. C'(I) can then be obtained from C(I) by relabeling as background the foreground pixels of C(I) that have been classified as shadow or highlight. The difference between Javed et al. [57] and the proposed method is that a pixel classification procedure using the CSIM is applied before the connected component algorithm groups all the foreground pixels in C(I). The robustness of background subtraction is enhanced by the better accuracy in |∂R_a|. Moreover, the foreground pixels can be extracted using Eq. (2-19).

$$\frac{\sum_{(i,j)\in\partial R_a}\bigl(\nabla I(i,j)\,G(i,j)\bigr)}{\left|\partial R_a\right|}\ge P_B \tag{2-19}$$

where ∇I denotes the edges of image I and |∂R_a| represents the number of boundary pixels of region R_a.
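One way to read Eq. (2-19) is as a per-region test: the fraction of boundary pixels of a region that are both edges of I and foreground in G(I) is compared against a threshold, written P_B here. The following sketch assumes binary edge and gradient maps stored as nested lists; the function name, argument layout and any threshold value passed in are hypothetical.

```python
def region_is_foreground(boundary, edge_map, gradient_fg, p_b):
    """Evaluate the boundary-support test of Eq. (2-19) for one region R_a.

    boundary    : list of (i, j) boundary pixel coordinates of R_a
    edge_map    : 2D list, edge_map[i][j] = 1 where frame I has an edge
    gradient_fg : 2D list, G(I) from gradient-based subtraction (0 or 1)
    p_b         : threshold P_B (the value is an assumption of the caller)
    """
    if not boundary:
        return False  # no boundary pixels, so no gradient support
    support = sum(edge_map[i][j] * gradient_fg[i][j] for i, j in boundary)
    return support / len(boundary) >= p_b
```

Regions that pass would be kept as foreground; the boundary list itself would come from the connected component step applied to C'(I).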

Chapter 3 Incremental Similarity-Based Aspect-Graph 3D Object Recognition

3.1 Introduction

The common challenge in 3D object recognition, human posture recognition, and scene recognition is the variation in orientations. The simplest method for solving this problem is to characterize an object with a densely sampled collection of independent views. The object can be described in detail by constructing an object model with numerous 2D views; however, this approach significantly increases computing time due to the expansive search space. Thus, several approaches have been developed to extract a minimal set of object views. Appearance-based methods focus on changes in intensity of each view. However, changes to object lighting, rotation, deformation and occlusion affect object recognition results when using the appearance-based method.

Aspect-graph representations focus on shape changes to an object’s projection [61-62]. Koenderink et al. [63] developed the underlying theory that describes 3D objects using aspect-graph representation. Moreover, the traditional aspect-graph method [64] assumes that an object belongs to a limited class of shapes, and that characteristic views can be extracted using prior knowledge of the object. Aspect-graph vertices represent the characteristic views extracted from points on a transparent viewing sphere with the object at its center. These characteristic views are extracted as prototypes of an object from a densely sampled collection of object views.

Cyr and Kimia [1] presented a similarity-based aspect-graph method to extract the characteristic views using shape similarity between views. The viewing sphere is sampled at regular (5-degree) intervals, and two similarity metrics, one based on curve matching and the other based on shock matching, are applied to combine views into aspects. Let there be N objects {O_1, O_2, ..., O_n, ..., O_{N−1}, O_N}, which comprise an object database. Each object is composed of M views sampling the viewing sphere, giving rise to a set of views {V_1^1, ..., V_m^n, ..., V_M^N}, where V_m^n denotes the m-th view of object O_n. The aspect p of object n is defined as A_p^n, which is a collection of views ranging from V_{m−k⁻}^n to V_{m+k⁺}^n and represented by the characteristic view V_m^n. Moreover, the dissimilarity of two views is represented as d(V_m^n, V_j^i), which is the distance between the m-th view of object n and the j-th view of object i. The goal is to minimize the set of views required to represent each object O_n. Two criteria are imposed to maintain successful object recognition while representing aspects by characteristic views. The first criterion (local monotonicity) supposes that the dissimilarity of two views increases as the relative viewing angle between them increases. The second criterion requires that the distance between each view V_i^n in an aspect A_m^n and the characteristic view V_m^n of that aspect is smaller than the distance between any non-aspect view V_j^n and the characteristic view V_m^n.
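The second criterion can be checked mechanically once the distances to a characteristic view are known. The sketch below assumes a precomputed list of distances from every view of one object to the characteristic view; the function name and data layout are hypothetical.

```python
def satisfies_second_criterion(d_to_char, aspect_indices):
    """Check that every view inside the aspect is closer to the
    characteristic view than every view outside the aspect.

    d_to_char      : d_to_char[i] is the distance from view i to the
                     characteristic view of the aspect
    aspect_indices : set of view indices that belong to the aspect
    """
    inside = [d for i, d in enumerate(d_to_char) if i in aspect_indices]
    outside = [d for i, d in enumerate(d_to_char) if i not in aspect_indices]
    if not inside or not outside:
        return True  # nothing to compare against
    return max(inside) < min(outside)
```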

The training views of an object in [1], which are sampled at 5-degree increments and sorted in order, are collected in advance. When additional views of an object are collected to improve object representation in the work of [1], all views of the object must be re-sorted in order of view angle. Moreover, the first criterion, local monotonicity, is not suitable when an object is symmetrical in the feature space. Updating the aspect-graph representation while collecting new 2D views is therefore inconvenient. To improve the flexibility of the update mechanism, this work presents an incremental database construction method for building and updating the aspect-graph with object views sampled at random intervals. Object representation becomes increasingly detailed as additional views are captured and characteristic views are added, without re-sorting all views and re-calculating their similarity measures. Moreover, the first criterion in the work of [1], local monotonicity, is not utilized, thereby improving the flexibility of extracting aspects of symmetrical objects. Although the proposed approach cannot confirm the view angle of a test view using a specific object view, it improves the flexibility of building an aspect-graph representation and reduces computing time when updating object aspects. Additionally, the accuracy of the object representation increases with minimal growth of the search space as additional object views are collected.

The remainder of this chapter is organized as follows. Section 3.2 presents the system architecture and the corresponding dataflow. Section 3.3 describes the procedure for extracting features and the similarity measures for building the database and matching objects. Section 3.4 describes the ISAG for extracting the aspects and characteristic views of objects. Furthermore, the object matching procedure is described with a weighted combination of different similarity measures. Conclusions are discussed in Section 3.5.

3.2 System Architecture

The proposed framework (Fig. 3-1) contains two parts, which are called the database building procedure and the matching procedure. Suppose an object database contains T₀ objects, and T₁ 2D views of each object are randomly sampled from a viewing sphere. In the database building procedure, the ISAG (Fig. 3-2) is applied to extract the aspects of each object using T₁ 2D views. The main 3D database contains a set of assistant 3D object databases (AOD). Furthermore, an AOD comprises the aspects of each object, where the aspects are represented by their characteristic views. Figure 3-3 illustrates the inner structure of an AOD. In Fig. 3-3, a set of aspects is employed to represent the database of a 3D object in the aspect level.

The prototypes for these aspects, called the characteristic views, are utilized to represent an object for object matching. The passage from one characteristic view to another is defined with only the similarity measure. The proposed similarity-based aspect-graph focuses on an efficient learning method with associated features and similarity functions. Provided that the object features are sufficient to discriminate between any two 2D training views, the aspects and characteristic views can be extracted using the associated similarity measures. Even if the objects are complex, the characteristic views can be extracted in the feature space.

In the matching procedure, a similarity measure is applied between a 2D view sampled from an unknown object and all the characteristic views of the 3D object database. After the weighted combination of all similarity measures, the first three characteristic views that have the highest similarity with the testing 2D view are regarded as the recognition results (the top three matches).
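The ranking step of the matching procedure could be sketched as below. The sketch assumes that each similarity measure yields a distance (smaller means more similar) and that the combination is a plain weighted sum; the weights, labels and function name are hypothetical and are not taken from the text.

```python
def top_matches(distances, weights, top_k=3):
    """Rank characteristic views by a weighted sum of distance measures.

    distances : dict mapping a characteristic-view label to a list of
                distance values, one per similarity measure
    weights   : one weight per similarity measure (hypothetical values)
    Returns the `top_k` labels with the smallest combined distance.
    """
    combined = {
        label: sum(w * d for w, d in zip(weights, values))
        for label, values in distances.items()
    }
    return sorted(combined, key=combined.get)[:top_k]
```

With `top_k=3`, the returned labels correspond to the top three matches reported as the recognition result.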

Figure 3-1 The system architecture of the proposed framework. A MOD comprises a set of AODs.

Figure 3-2 The database building procedure, where T₀ is the number of objects in the database and T₁ is the number of sampled views required to build the aspect-graph representation of an object.


Figure 3-3 The inner structure of an AOD.

3.3 Object Representation

In this work, shape and color features are utilized to measure the similarity between two object views. To extract shape information, a robust background subtraction framework from previous works [65-66] is utilized to extract foreground regions while considering shadows and highlights. Foreground detection provides flexibility when constructing the object database, even in an uncontrolled environment. Canny edge detection [67] is then applied to extract shape edges, and the Gradient Vector Flow (GVF) snake [68] is applied to extract the contour information. Assume that the contour information is included in a set Z, which is composed of N points z_i, where z_i is a complex number given by Eq. (3-1). Two kinds of shape features, called the Fourier descriptor (FD) [69] and the point-to-point length (PPL), are extracted from Z.

$$Z=\{z(i)\}=\{x_i+jy_i\},\quad 0\le i<N \tag{3-1}$$

3.3.1 Shape Features

The points inside the set Z are re-sampled using Eq. (3-2) to eliminate variations in shift and scale.

$$Z=\{z(i)\}=\left\{\frac{L_c\left[(x_i-x_c)+j(y_i-y_c)\right]}{L}\right\} \tag{3-2}$$

where 0 ≤ i < N; L denotes the contour length of Z, L_c is the expected contour length, and (x_c, y_c) is the location of the contour center of Z. Then, the Fourier transform is applied to Z to compute the FD using Eq. (3-3).

$$FD(k)=\sum_{n=0}^{N-1}z(n)\exp(-j2\pi kn/N),\quad 0\le k<N \tag{3-3}$$
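A minimal sketch of Eqs. (3-2) and (3-3) in plain Python follows. It assumes that the contour is closed and given as (x, y) tuples, that the contour center is the mean of the points, and that the contour length is the sum of distances between consecutive points; these choices, like the function names, are assumptions made for illustration. The transform is a direct O(N²) sum rather than an FFT.

```python
import cmath

def normalize_contour(points, expected_length=1.0):
    """Apply Eq. (3-2): shift the contour to its center and rescale it so
    that its length equals `expected_length` (L_c).

    Assumptions: closed contour, center = mean of the points, length L =
    sum of distances between consecutive points.
    """
    n = len(points)
    x_c = sum(x for x, _ in points) / n
    y_c = sum(y for _, y in points) / n
    z = [complex(x, y) for x, y in points]
    length = sum(abs(z[(i + 1) % n] - z[i]) for i in range(n))
    center = complex(x_c, y_c)
    return [expected_length * (p - center) / length for p in z]

def fourier_descriptor(z):
    """Apply Eq. (3-3): direct discrete Fourier transform of z."""
    n = len(z)
    return [
        sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
```

Because the contour is centered before the transform, FD(0) is zero under these assumptions, and a shifted or uniformly scaled copy of the same contour yields the same normalized points.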

The low-frequency parts of the FD are extracted to reduce the influence of high-frequency noise, and are defined as MAG. Notably, MAG is composed of 2T₂ magnitude values selected among the N frequencies. The method for extracting MAG is given by Eq. (3-4).

$$MAG=\{|FD(k)|,\ |FD(N-k)|,\ 1\le k\le T_2\} \tag{3-4}$$

Intuitively speaking, MAG only characterizes the shape and not the orientation of human posture. Therefore, MAG cannot discriminate between similar shapes oriented differently. To solve this problem, the phase information of the FD must be used for memorizing an object. The work in [70] proposes that memorizing the phase value at low frequency is sufficient. Suppose the phase information is θ_z; then θ_z can be calculated using FD(1) and FD(N−1), as described in Eqs. (3-5) and (3-6).

$$FD(1)=|FD(1)|\exp(j\theta_1)=R_1+jI_1 \tag{3-5}$$

$$FD(N-1)=|FD(N-1)|\exp(j\theta_{N-1})=R_{N-1}+jI_{N-1} \tag{3-6}$$

Furthermore, θ_z can be calculated using Eq. (3-7).

$$\theta_z=\frac{\theta_1+\theta_{N-1}}{2}=\frac{\arctan(I_1/R_1)+\arctan(I_{N-1}/R_{N-1})}{2} \tag{3-7}$$

where R_1 and R_{N−1} denote the real parts of FD(1) and FD(N−1), I_1 and I_{N−1} denote the imaginary parts of FD(1) and FD(N−1), and θ_1 and θ_{N−1} are the phases of FD(1) and FD(N−1).
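The magnitude and phase features could be computed from a list of Fourier descriptor values as sketched below. The function names are hypothetical. The phase helper uses `cmath.phase`, which is based on atan2 and resolves the quadrant, so it can differ from a literal arctan(I/R) when the real part is negative; this substitution is a choice of the sketch, not of the text.

```python
import cmath

def magnitude_features(fd, t2):
    """Eq. (3-4): magnitudes |FD(k)| and |FD(N-k)| for 1 <= k <= t2."""
    n = len(fd)
    mag = []
    for k in range(1, t2 + 1):
        mag.append(abs(fd[k]))
        mag.append(abs(fd[n - k]))
    return mag

def phase_feature(fd):
    """Eq. (3-7): average of the phases of FD(1) and FD(N-1).

    Uses atan2 via cmath.phase instead of arctan(I/R); see the note above.
    """
    return (cmath.phase(fd[1]) + cmath.phase(fd[-1])) / 2.0
```

Multiplying every FD value by the same unit-magnitude complex factor (a rotation of the contour) leaves the output of `magnitude_features` unchanged, which is why the phase feature is needed to separate orientations.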

Moreover, the lengths between each pair of points in Z are defined as PPL. PPL is suitable for describing shape details. Calculating PPL is time-consuming because each point is considered as a start point. Equations (3-8) and (3-9) describe the

Numerous features, such as edge, corner, texture, color and shape, have been utilized to extract useful information from an image. Among these features, color provides intuitive information for representing the conceptual idea of an image. Therefore, pixel color and pixel position are utilized in this work to extract the conceptual idea of an image. The color space used in this work is the RGB color space, a format common to most video devices. To enhance the regional information of an image, the position (x, y) feature is combined with the RGB color information to form the feature vector. That is, each pixel contains a 5D feature vector (R, G, B, x, y), which is shown in Fig. 3-4.

Figure 3-4 5D feature vector construction.
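Constructing the 5D feature vectors is straightforward; a small sketch follows. It assumes the image is stored as a list of rows of (R, G, B) tuples, with x as the column index and y as the row index. This layout and the function name are illustrative assumptions.

```python
def build_feature_vectors(image):
    """Return one (R, G, B, x, y) tuple per pixel.

    Assumption: `image` is a list of rows, each row a list of (R, G, B)
    tuples; x is the column index and y is the row index.
    """
    features = []
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            features.append((r, g, b, x, y))
    return features
```

For a 320×240 frame this produces 76,800 feature vectors, one per pixel.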

This work applies a Gaussian mixture model (GMM) to model region information in a scene image as a blob model, which is defined as BM, using 5D feature vectors (R, G, B, x, y). We assume that the density functions of the color and position features are Gaussian. First, each pixel x is defined as a 5-dimensional vector at time t. Moreover, N Gaussian distributions are used to construct the GMM, which is described in Eq. (3-10),

where λ represents the parameters of the GMM.

Next, the parameters λ of the GMM are calculated so that the GMM matches the feature vector distribution with the least error. The most common method for calculating λ is maximum likelihood (ML) estimation. The objective of ML estimation is to identify the model parameters that maximize the likelihood function of the GMM obtained from the training feature vectors X. The ML parameters are derived iteratively using the expectation-maximization (EM) algorithm. Supposing there are s feature vectors x_1, x_2, ..., x_s (in this work, s is defined as the image size, 320×240 = 76,800), the ML estimate of λ can be calculated using Eq. (3-11).

Furthermore, unsupervised data clustering is used before the EM algorithm iterations to accelerate convergence. This study uses the K-means algorithm [59] for clustering. The number of clusters is defined, and then the initial center of each cluster is obtained randomly. The appropriate center and variance of each cluster can be estimated iteratively using the K-means algorithm and applied as the initial mean and variance of each Gaussian component of the GMM.
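The K-means initialization described above could look roughly like the following plain-Python sketch. It returns per-cluster means and per-dimension variances, which corresponds to assuming a diagonal covariance for each Gaussian component; that assumption, the optional `init` argument, the fixed seed and the function name are all choices of this sketch.

```python
import random

def kmeans_init(data, k, iterations=20, init=None, seed=0):
    """Estimate per-cluster means and per-dimension variances with K-means.

    data : list of equal-length numeric tuples (e.g. 5D feature vectors)
    init : optional list of k starting centers; if omitted, k data points
           are drawn at random (seeded for repeatability)
    Returns (means, variances), usable as initial GMM parameters.
    """
    dim = len(data[0])
    centers = [tuple(c) for c in (init or random.Random(seed).sample(data, k))]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in data:
            dists = [sum((p - c) ** 2 for p, c in zip(point, center))
                     for center in centers]
            clusters[dists.index(min(dists))].append(point)
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                new_centers.append(center)  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    variances = []
    for center, members in zip(centers, clusters):
        if members:
            variances.append(tuple(
                sum((m[d] - center[d]) ** 2 for m in members) / len(members)
                for d in range(dim)))
        else:
            variances.append(tuple(0.0 for _ in range(dim)))
    return centers, variances
```

The returned means and variances would then seed the EM iterations; mixture weights could be initialized from the relative cluster sizes, although the text does not specify this.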

3.3.3 Similarity Functions

To determine the similarity between two objects when building databases and recognizing objects, a similarity measurement D(U, V) is applied to the extracted features. We assume that the features extracted from two contours are U = {u_0, ..., u_i, ..., u_{I−1}} and V = {v_0, ..., v_i, ..., v_{I−1}}, respectively, where I denotes the feature size. Two similarity measures are applied using the 1-norm distance (Eq. (3-12)) and the K-L distance [71] (Eq. (3-13)), where c denotes the number of points on an extracted contour and s denotes the image size. In this work, c is defined as 256 and s is defined as 76,800.
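For orientation, the two distance types can be sketched in their textbook forms. Any normalization by c or s that Eqs. (3-12) and (3-13) may apply is not reflected here, and the K-L version below is the one-directional divergence D(U‖V) computed after normalizing both vectors to sum to one. These are assumptions of the sketch, as are the function names and the `eps` guard.

```python
import math

def l1_distance(u, v):
    """Textbook 1-norm distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def kl_distance(u, v, eps=1e-12):
    """Textbook Kullback-Leibler divergence D(U || V) after normalizing
    both feature vectors to sum to one. `eps` avoids division by zero and
    is an implementation convenience, not something from the text.
    """
    su, sv = sum(u), sum(v)
    total = 0.0
    for a, b in zip(u, v):
        p, q = a / su, b / sv
        if p > 0:
            total += p * math.log(p / max(q, eps))
    return total
```

Both functions return zero for identical inputs and grow as the feature vectors diverge, so smaller values indicate more similar views.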

C_m^n represents the m-th characteristic view of the n-th object. Moreover, A_{m^min} denotes the aspect that has the minimum distance from V_new^n, and C^n_{m^min} represents the characteristic view with the minimal distance, where m^min is the index of A_{m^min}. C^n_{m^min−1} and C^n_{m^min+1} denote the neighboring views of C^n_{m^min}.

Eq. (3-17) denotes the similarity measure using θ_z and the 1-norm distance.

3.4 Flexible 3D Object Recognition Framework

A flexible framework using the ISAG is described in this section. In the framework, a MOD is composed of one or more AODs. Each AOD is built using one main feature or using one main feature with one assistant feature. Moreover, each feature has its similarity function, such as Eqs. (3-14)-(3-17).

3.4.1 Generation of Aspects and Characteristic Views

The ISAG is a four-step procedure, illustrated in Fig. 3-5. Steps A-1 to A-4 are applied to extract aspects and characteristic views. Those aspects comprise an object database, and the characteristic views are used for object matching with a new view V_new.

Step A-1:

Initialize the number of aspects to zero. 2D views of the n-th object are randomly sampled from a viewing sphere, and each 2D view is regarded as V_new^n.

Step A-2:

When the number of existing aspects of the n-th object equals zero, V_new^n is regarded as a characteristic view of a new aspect.

Step A-3:

When the number of existing aspects of the n-th object equals one or two,

(A-3.1) When Eqs. (3-18) and (3-19) are both satisfied, V_new^n is combined into the m^min aspect, and the characteristic view of the m^min aspect remains the same. Here, T₃ and T₅ are both predefined threshold values.

(A-3.2) Otherwise, if Eq. (3-18) is satisfied and Eq. (3-19) is not, V_new^n is combined into the m^min aspect, and is regarded as the new characteristic view of the m^min aspect.

(A-3.3) Otherwise, if Eqs. (3-18) and (3-19) are both unsatisfied, a new aspect of the n-th object is established, and V_new^n is regarded as the characteristic view of the new aspect.

Step A-4:

When the number of existing aspects of the n-th object is ≥ 3,

(A-4.1) If either Eq. (3-20) or Eq. (3-21) is true, a new aspect is constructed and V_new^n is considered the characteristic view of the new aspect. When a new aspect is established, the aspect order is determined using Eq. (3-22) so that similar aspects are close to each other. If Eq. (3-22) is true, that is, the similarity distance between V_new^n and C^n_{m^min+1} exceeds that between V_new^n and C^n_{m^min−1}, the new aspect is inserted between aspect m^min and aspect m^min−1; otherwise, the new aspect is inserted between aspects m^min and m^min+1.

(A-4.2) Otherwise, if Eqs. (3-20) and (3-21) are both unsatisfied and Eq. (3-19) is true, V_new^n is combined into the m^min aspect and the characteristic view
