
Region-Level Motion-Based Background

Modeling and Subtraction Using MRFs

Shih-Shinh Huang, Student Member, IEEE, Li-Chen Fu, Fellow, IEEE, and Pei-Yung Hsiao, Member, IEEE

Abstract—This paper presents a new approach to automatic segmentation of foreground objects from an image sequence by integrating techniques of background subtraction and motion-based foreground segmentation. First, a region-based motion segmentation algorithm is proposed to obtain a set of motion-coherence regions and the correspondence among regions at different time instants. Next, we formulate the classification problem as a graph labeling over a region adjacency graph based on the Markov random fields (MRFs) statistical framework. A background model representing the background scene is built and then used to model a likelihood energy. Besides the background model, temporal coherence is also maintained by modeling it as the prior energy. On the other hand, color distributions of two neighboring regions are taken into consideration to impose spatial coherence. Then, the a priori energy of the MRFs takes both spatial and temporal coherence into account to maintain the continuity of our segmentation. Finally, a labeling is obtained by maximizing the a posteriori probability of the MRFs. Under such a formulation, we integrate two different kinds of techniques in an elegant way to make the foreground detection more accurate. Experimental results for several video sequences are provided to demonstrate the effectiveness of the proposed approach.

Index Terms—Background subtraction, Markov random fields (MRFs), motion-based segmentation.

I. INTRODUCTION

PROLIFERATION of cheap camera sensors and increased processing power have made acquisition and processing of video information feasible. Many analysis tasks such as object detection and tracking can be performed efficiently on standard PCs. In many applications, successful detection of foreground regions from a static background scene is an important step before high-level processing, such as object identification and event understanding. However, in real-world situations, there exist several kinds of environment variations that make foreground detection more difficult. In order to cope with that, the approach should be immune to these variations, i.e., either invariant to them or able to adapt to them. The aforementioned variations that may cause the observed pixel intensity to change and, hence, lead to misdetection include the following.

Manuscript received March 27, 2005; revised December 10, 2006. This work was supported by the National Science Council under the project NSC93-2752-E-002-007-PAE. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Anil Kokaram.

S.-S. Huang and L.-C. Fu are with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.

P.-Y. Hsiao is with the Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, R.O.C.

Digital Object Identifier 10.1109/TIP.2007.894246

• Illumination Variation

— Gradual illumination variation is a small change of light intensity due to a change in the location or a fluctuation of the light source.

— Sudden illumination variation means that the light intensity changes in an abrupt manner; for example, a traffic light in an outdoor environment or lights being switched on and off are cases of sudden illumination variation.

— Shadow is the area where direct light is entirely blocked (umbra) or partially blocked (penumbra).

• Motion Variation

— Global motion means that a small camera displacement will result in a global displacement of the captured scene. The causes of global motion include camera movement or a poorly fixed camera.

— Local motion refers to the intrinsic motion in the scene due to movements of foreground objects or of nonstatic background objects, such as tree branches or clouds.

The aim of the proposed method in this paper is to handle all those variations except for global motion. In other words, our focus is on the stationary camera; that is, we assume that our input sequence has been properly compensated and the background scene is stationary.

A. Related Works

Generally speaking, techniques for foreground detection can be grouped into two categories, background subtraction and motion-based foreground segmentation. For background subtraction, the foreground represents regions that have different appearance from those in the reference image, which is normally referred to as the background. For motion-based segmentation, regions that are subjected to a coherent and significant motion are considered meaningful foreground. We will give a brief review of these two kinds of techniques.

1) Background Subtraction: In order to adapt to changes, the background is usually represented by a background model and updated over time. This kind of technique is based on the assumption that the background scene is available and that the camera is, without loss of generality, only subject to minimal vibration. Many approaches for background subtraction have been proposed over the past decades, but they usually differ in the way of modeling the background. A simple method is to represent the gray level or color intensity of each pixel in the image as an independent and unimodal distribution [1]–[4].

If the intensity of each pixel is due to the light reflected from a particular surface under a particular lighting, a unimodal distribution will be sufficient to model the pixel value. However, in the real world, the appearance of a pixel in most video sequences follows a multimodal distribution. The use of a mixture of Gaussian distributions is common for modeling multimodal distributions. For example, Friedman [5] modeled the pixel intensity as a weighted mixture of three Gaussian distributions corresponding to road, vehicle, and shadow, respectively. An incremental version of the expectation maximization (EM) algorithm was then used to learn and update the parameters of the Gaussian mixture. Stauffer [6] modeled each pixel as a mixture of Gaussian distributions whose number depends on the available memory. Incoming pixels are compared against the corresponding Gaussian mixture model, and if a match is found, the parameters of the model are adjusted. An improved method [7] was proposed to learn faster and more accurately by introducing an online EM algorithm. However, not all distributions are in Gaussian form [8]. In [9], a nonparametric background model based on nonparametric density estimation was proposed to handle situations where the background scene is nonstatic but contains minimal motion. Another approach [10] that represents the color of each pixel by a group of clusters was proposed to adapt to noise and background variation. These approaches represent the background scene by a set of independent pixel models without taking any semantic information into consideration. This makes false detection likely when changes or noise occur; it is here that some sophisticated modeling or updating strategies are applied.

2) Motion-Based Foreground Segmentation: The technique of motion-based foreground segmentation is based on the idea that the appearance of foreground objects is always accompanied by motion. In general, such a technique consists of two steps, i.e., motion segmentation and region classification. The aim of motion segmentation is to divide an image into a set of regions with motion coherence, whereas that of region classification is to assign a label, foreground or background, to each segmented region.

Various approaches for motion segmentation have been proposed in the literature. For providing a meaningful semantic description of video, Wang and Adelson [11] employed a k-means clustering algorithm in the affine parameter space to find a small number of motion classes. Each flow vector is then assigned to one of the resulting motion classes. Borshukov [12] later improved Wang and Adelson's algorithm through a merging and multistage approach to perform motion segmentation in a more robust way.

The aforementioned approaches incur inaccurate segmentation due to inexact motion estimation near object boundaries. In order to overcome this problem, color information is introduced to obtain more accurate segmentation. In [13]–[16], an initial segmentation is first obtained by color segmentation. Then, regions are merged on the basis of temporal or spatial similarity. Approaches of this kind are based on the assumption that motion boundaries are generally subsets of color boundaries. Without prior knowledge of the foreground, a straightforward way is to consider the foreground as a segmented region with large motion velocity. In addition to motion, Tsaig [13] also adopted spatial and temporal continuity to perform region classification by maximizing the a posteriori probability under the MRFs framework.

Fig. 1. Block diagram of the proposed algorithm.

B. System Overview

In this section, we present the main features of our approach for segmenting foreground regions from a sequence of images. It extends the work in [17] and [18], and Fig. 1 gives a block diagram of our proposed algorithm.

The main idea is to regard the background model as a portion of the knowledge used for classification, while motion-based segmentation generates a set of regions to be classified at the semantic level. At first, a region-based motion segmentation algorithm based on both motion and color information is applied to segment captured images into a set of regions. All pixels belonging to the same region have coherent motion. In order to save time, the segmentation result at the preceding time instant is used to facilitate the segmentation process and to build the correspondence mappings for regions at the next time instant.

After segmentation, the MRFs statistical framework is introduced to formulate the foreground detection problem as a labeling problem. By comparing each segmented region with the one built in the background model, a likelihood energy can be evaluated for classification. For the sake of maintaining spatial and temporal coherence, the similarity at the boundaries of all neighboring regions and the relation among all possibly corresponding regions at different time instants are taken into account to model the a priori energy.

The optimization over the MRFs model is then performed; specifically, the a posteriori probability is maximized to obtain a classification result. Regions which have the same classification label and similar colors are then merged to derive a more meaningful segmentation. Finally, the background model and the resulting region map are updated accordingly.

C. Organization

The remainder of this paper is organized as follows. In Section II, we introduce the region-based motion segmentation algorithm to obtain a set of motion-coherence regions. Section III addresses the problems of background modeling and updating. The classification process based on the MRFs statistical framework is described in Section IV. In Section V, we demonstrate the effectiveness of the developed approach by providing some appealing experimental results. Finally, we conclude the paper in Section VI with some relevant discussion.


Fig. 2. Block diagram of the region-based motion segmentation algorithm.

II. REGION-BASED MOTION SEGMENTATION

The aim of the motion segmentation in this paper is to divide the entire image into a set of regions, each of which is associated with an object or object part that has coherent motion. In general, the aforementioned approaches to motion segmentation involve color segmentation followed by a region fusion algorithm based on motion information. Approaches based on a region-merging strategy usually result in over-segmentation, which makes region classification more difficult and computationally expensive. In this paper, we propose a region-based motion segmentation algorithm that uses motion information followed by color information to overcome this shortcoming.

As shown in Fig. 2, the proposed algorithm mainly consists of three steps, namely, region projection, motion marker extraction, and boundary determination. First of all, Horn and Schunck's method [19] is used to estimate a dense optical flow describing the motion vector of every pixel between two consecutive image frames. Segmented regions of the previous image frame are then projected onto the current image frame. Regions with coherent motion are extracted as initial motion markers. Pixels not ascribed to any region are labeled as uncertain ones. Finally, a watershed algorithm [20] based on motion and color is utilized to join uncertain pixels to the nearest similar marker.

A. Region Projection

The purpose of region projection, namely, projecting regions in the previous frame onto the current one, is to facilitate the segmentation. Because of the inaccuracy of estimating motion at a region's boundary using Horn and Schunck's method, a parametric motion model is adopted to represent the motion of a region. For saving computation time, the affine motion model is chosen in this paper. Let the affine motion model $\Theta$ represent the motion of a region $R$; it is a six-parameter model that assigns a parametric motion vector $\mathbf{v}_{\Theta}(x, y)$ to any pixel $(x, y)$ of $R$. Particularly, $\mathbf{v}_{\Theta}$ can be expressed as

$$\mathbf{v}_{\Theta}(x, y) = \begin{pmatrix} a_{1} + a_{2}x + a_{3}y \\ a_{4} + a_{5}x + a_{6}y \end{pmatrix} \qquad (1)$$

where $\mathbf{v}_{\Theta}(x, y)$ is referred to as the parametric motion vector of the pixel $(x, y)$, with $(a_{1}, \ldots, a_{6})$ being the six parameters of $\Theta$.

The six parameters of $\Theta$ can be estimated by the least-squares method [21] as shown in (2)

$$\Theta = \arg\min_{\Theta'} \sum_{(x, y) \in R} \bigl\| \mathbf{d}(x, y) - \mathbf{v}_{\Theta'}(x, y) \bigr\|^{2} \qquad (2)$$

where $\mathbf{d}(x, y)$ is the estimated optical flow at pixel $(x, y)$ and $R$ denotes the region over which the parametric motion of every pixel is described by $\Theta$.
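As a concrete illustration of (2), the sketch below fits the six affine parameters to the dense optical flow of a region by ordinary least squares. It is only a minimal sketch: the function names and the arrays `xs`, `ys`, `u`, `v` (pixel coordinates and flow components of the region) are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def fit_affine_motion(xs, ys, u, v):
    """Least-squares fit of a six-parameter affine motion model to the
    optical-flow vectors (u, v) observed at pixel coordinates (xs, ys).
    Returns (a1..a6) such that u ~ a1 + a2*x + a3*y and v ~ a4 + a5*x + a6*y."""
    A = np.stack([np.ones_like(xs, dtype=float), xs, ys], axis=1)  # design matrix
    ax, _, _, _ = np.linalg.lstsq(A, u, rcond=None)                # a1, a2, a3
    ay, _, _, _ = np.linalg.lstsq(A, v, rcond=None)                # a4, a5, a6
    return np.concatenate([ax, ay])

def affine_flow(theta, xs, ys):
    """Evaluate the fitted affine model at the given pixel coordinates."""
    a1, a2, a3, a4, a5, a6 = theta
    return a1 + a2 * xs + a3 * ys, a4 + a5 * xs + a6 * ys
```

In practice, `xs`, `ys` would hold the coordinates of the region's pixels and `u`, `v` the corresponding Horn-Schunck flow components.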

Given the affine motion model, any pixel $p = (x, y)$ in the previous frame should be projected to the location $p' = p + \mathbf{v}_{\Theta}(p)$ in the current frame. The motion estimate of each pixel already has subpixel accuracy. However, due to occlusion and uncovering effects, the displaced frame difference is less likely to be useful, especially at the boundaries of objects. To reduce this effect, the minimum displaced frame difference over the four nearest pixels is taken as the error measure, called the projection error, which is defined as

$$e_{\mathrm{proj}}(p) = \min_{q \in N_{4}(p')} \bigl\| I_{t}(q) - I_{t-1}(p) \bigr\| \qquad (3)$$

where $N_{4}(p')$ denotes the set of four-connected pixels surrounding the projected location $p'$, and $\|\cdot\|$ is the Euclidean distance between the RGB color vectors of two pixels at different time instants. If $e_{\mathrm{proj}}(p)$ is less than a given threshold, then the projected pixel is assigned the same region label as $p$. Otherwise, a pixel with large projection error is labeled as an uncertain one to indicate that the projection from the previous frame to the current one has failed.
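A possible sketch of the projection test of (3) is given below; it projects a pixel with the fitted affine model and takes the minimum RGB distance over the four pixels surrounding the (subpixel) projected location. The frame layout (H×W×3 RGB arrays), the helper names, and the thresholding comment are assumptions of this sketch.

```python
import numpy as np

def project_pixel(p, theta):
    """Project pixel p = (x, y) of the previous frame into the current frame
    using the affine motion parameters theta = (a1..a6)."""
    x, y = p
    a1, a2, a3, a4, a5, a6 = theta
    return x + (a1 + a2 * x + a3 * y), y + (a4 + a5 * x + a6 * y)

def projection_error(p, p_proj, frame_prev, frame_cur):
    """Minimum RGB Euclidean distance between the previous-frame pixel p and
    the four pixels surrounding its (subpixel) projected location p_proj."""
    h, w, _ = frame_cur.shape
    x, y = p_proj
    candidates = {(int(np.floor(x)) + dx, int(np.floor(y)) + dy)
                  for dx in (0, 1) for dy in (0, 1)}
    c_prev = frame_prev[p[1], p[0]].astype(float)
    errs = [np.linalg.norm(frame_cur[qy, qx].astype(float) - c_prev)
            for qx, qy in candidates if 0 <= qx < w and 0 <= qy < h]
    return min(errs) if errs else np.inf

# A pixel keeps its region label only if the projection error is below a
# threshold; otherwise it is marked "uncertain" for the later watershed step.
```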

B. Motion Marker Extraction

The output of this step is a set of motion-coherent regions; that is, all pixels within a region comply with a motion model. Here, each such region is referred to as a motion marker. By growing from these markers, we can eventually obtain a segmentation. In Section II-C, we will describe how every uncertain pixel is assigned to exactly one of the motion markers by a region growing scheme.

Motion markers here are derived in two ways. First, the regions projected from the previous frame are one kind of motion marker because each of them arises from an affine motion model. In addition, the regions resulting from newly introduced objects may form another kind of motion marker, for example, the trunk of a person who appears in the background scene. To handle this situation, a method similar to [15] is used to extract this kind of motion marker. That is, a k-means clustering algorithm [22] is applied to perform color quantization in the RGB color space, followed by a connected-component finding algorithm [23], so as to extract a set of homogeneous color regions from the uncertain pixels. The number of quantized colors used in this paper is 12.

Next, an affine motion model is evaluated according to (2) to describe the motion of each such region. We then exclude a pixel from the region if its motion error with respect to the region's affine model is larger than a predefined threshold, where the motion error (4) measures the deviation of the pixel's observed motion from that predicted by the affine model. After exclusion, the regions whose size is above a threshold are considered motion markers. The set of these motion markers is denoted as $\{M_{1}, \ldots, M_{n}\}$, where $n$ is the number of motion markers; each motion marker $M_{i}$ stands for a segmented region.
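The color-based marker extraction for uncertain pixels could be sketched as follows, using k-means quantization (12 colors, as in the text) followed by connected-component labeling and a size filter; the subsequent motion-error exclusion of (4) would reuse the affine fit from the earlier sketch. The library choices (scikit-learn, SciPy), the `min_size` threshold, and all names are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def extract_color_markers(frame, uncertain_mask, n_colors=12, min_size=50):
    """Quantize the colors of uncertain pixels into n_colors clusters, then
    keep connected components larger than min_size as candidate markers."""
    pixels = frame[uncertain_mask].reshape(-1, 3).astype(float)
    if len(pixels) == 0:
        return []
    kmeans = KMeans(n_clusters=min(n_colors, len(pixels)), n_init=4,
                    random_state=0)
    labels = kmeans.fit_predict(pixels)
    quantized = np.full(frame.shape[:2], -1, dtype=int)
    quantized[uncertain_mask] = labels
    markers = []
    for c in range(n_colors):
        comp, n_comp = ndimage.label(quantized == c)   # 4-connected components
        for i in range(1, n_comp + 1):
            region = comp == i
            if region.sum() >= min_size:               # size threshold
                markers.append(region)                 # boolean mask per marker
    return markers
```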

C. Boundary Determination

After motion marker extraction, the number of regions to be segmented is known. However, a large number of pixels have not yet been assigned to any region. These uncertain pixels are mainly around the contours of the regions. Through the use of the watershed algorithm [20], uncertain pixels will be merged into one of the markers. The watershed algorithm is basically a region growing algorithm that merges each uncertain pixel into the nearest similar marker. The remaining problem is how to define the similarity measure between an uncertain pixel and a motion marker.

In general, the weighted sum of the intensity difference and the motion compensation difference is a popular measure for estimating the distance between a pixel and a region in the literature of spatio-temporal segmentation [24], [14], [15]. Here, this measure is also adopted for the watershed algorithm. Suppose that $\Theta_{i}$ is the affine motion model of the motion marker $M_{i}$ and $p$ is an uncertain pixel neighboring $M_{i}$. The distance between $p$ and $M_{i}$ is defined as

$$d(p, M_{i}) = w\, d_{I}(p, M_{i}) + (1 - w)\, d_{\mathrm{DFD}}(p, M_{i}) \qquad (5)$$

where $w$ is a weighting factor, and $d_{I}$ and $d_{\mathrm{DFD}}$ are the intensity difference and the displaced frame difference, respectively. Because a motion marker is a motion-coherent region rather than an intensity-coherent one, the intensity distance used in [15] may not be a suitable one; in this paper, $d_{I}$ is defined differently, as given in (6).

Fig. 3. Frame (a) 38 and (b) 39, respectively; (c) segmentation result stored in the Region Map. The regions given with different colors except the black one in (d) are motion markers projected from the region in (c) according to the motion vector represented by the affine motion model. Then, the area inside the blue rectangle in (d) is zoomed in and shown in (e) for better exhibition. Blue regions in (e) are motion markers obtained by using color information. After boundary determination process, a final segmentation result of region-based motion segmentation is shown in (f).

As for the displaced frame difference, it is the intensity difference between two corresponding pixels at different frames. Thus, the displaced frame difference can be defined as

$$d_{\mathrm{DFD}}(p, M_{i}) = \bigl\| I_{t}(p) - I_{t-1}\bigl(p - \mathbf{v}_{\Theta_{i}}(p)\bigr) \bigr\|. \qquad (7)$$

Fig. 3 shows the result of our proposed region-based motion segmentation algorithm applied to the Hall Monitoring image sequence. As shown in Fig. 3(f), the segmentation result of frame 39 exhibits that the background behind the walking person is segmented as a single region by our approach. This makes the subsequent region classification more efficient and effective.
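A rough sketch of the pixel-to-marker distance of (5)–(7) follows. Since the paper's exact intensity distance in (6) is not reproduced here, the intensity term is a placeholder based on the marker's mean color; the displaced frame difference compares the pixel with its motion-compensated counterpart in the previous frame. All names and the default weight are assumptions of this sketch.

```python
import numpy as np

def pixel_marker_distance(p, marker_mean_color, theta,
                          frame_prev, frame_cur, w=0.5):
    """Weighted sum of an intensity term and the displaced frame difference
    between pixel p = (x, y) and a motion marker with affine model theta.
    marker_mean_color is a stand-in for the marker's intensity statistics."""
    x, y = p
    cur = frame_cur[y, x].astype(float)
    # intensity term: distance to the marker's (placeholder) color statistics
    d_int = np.linalg.norm(cur - np.asarray(marker_mean_color, float))
    # displaced frame difference: compare with the motion-compensated pixel
    a1, a2, a3, a4, a5, a6 = theta
    px = int(round(x - (a1 + a2 * x + a3 * y)))
    py = int(round(y - (a4 + a5 * x + a6 * y)))
    h, wd, _ = frame_prev.shape
    px, py = int(np.clip(px, 0, wd - 1)), int(np.clip(py, 0, h - 1))
    d_dfd = np.linalg.norm(cur - frame_prev[py, px].astype(float))
    return w * d_int + (1.0 - w) * d_dfd
```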

III. BACKGROUND MODELING

In the last decade, many sophisticated methods have been proposed to model the background scene. To deal with multiple appearances of the background, Stauffer and Grimson [6] modeled each background pixel by a mixture of Gaussian distributions, and the details of its robustness were explained in their paper. In this paper, we model and update the background scene in the same way. A brief description of Stauffer and Grimson's work is first given, and then we introduce the Bhattacharyya distance as the difference measure between a region from the region-based motion segmentation and the one represented by the background model.

A. Adaptive Gaussian Mixture Models

The probability of observing the value $\mathbf{x}_{t}$ of a specific pixel at time instant $t$ can be expressed as

$$P(\mathbf{x}_{t}) = \sum_{i=1}^{K} \omega_{i,t}\, \eta\bigl(\mathbf{x}_{t}; \boldsymbol{\mu}_{i,t}, \boldsymbol{\Sigma}_{i,t}\bigr) \qquad (8)$$

where $\omega_{i,t}$ is the weight of the $i$th Gaussian distribution at time $t$, $\boldsymbol{\mu}_{i,t}$ and $\boldsymbol{\Sigma}_{i,t}$ are the mean vector and covariance matrix of the $i$th Gaussian distribution at time $t$, and $\eta$ is the normal Gaussian distribution expressed by

$$\eta\bigl(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}\bigr) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\Bigl(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\Bigr). \qquad (9)$$

Note that the observation variable $\mathbf{x}_{t}$ we use in this paper is an RGB color vector.

To adapt to illumination changes, the weight of the $i$th Gaussian distribution is updated by the following equation:

$$\omega_{i,t} = (1 - \alpha)\,\omega_{i,t-1} + \alpha\, M_{i,t} \qquad (10)$$

where $\alpha$ is the learning rate and $M_{i,t}$ is the matching factor. In the matching process, the Gaussian distributions are first sorted by the value $\omega/\sigma$. $M_{i,t}$ equals 1 if the $i$th Gaussian model is the first one that matches the observed color value, and is set to 0 for the remaining models. A match is defined as the observed color falling within 2.5 times the standard deviation of the Gaussian model. The parameters of the distribution that matches the observation are updated by the following equations:

$$\boldsymbol{\mu}_{t} = (1 - \rho)\,\boldsymbol{\mu}_{t-1} + \rho\, \mathbf{x}_{t}, \qquad \boldsymbol{\Sigma}_{t} = (1 - \rho)\,\boldsymbol{\Sigma}_{t-1} + \rho\, (\mathbf{x}_{t} - \boldsymbol{\mu}_{t})(\mathbf{x}_{t} - \boldsymbol{\mu}_{t})^{\top}. \qquad (11)$$

As discussed in [7], the likelihood term in [6] is ignored (i.e., $\rho$ is simply the learning rate) to make the adaptation of the means and covariance matrices faster. If none of the distributions matches the new observation, the least probable distribution is replaced by a newly created distribution with the current value as its mean, an initially high variance, and a low weight. Initially, the background model is constructed by taking the first image as the reference through the update process. In other words, the Gaussian distribution of each pixel is initialized by setting its color mean to the color of the corresponding pixel in the first frame, its color variance to an initially high value, and its weight to 1.0.
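A compact per-pixel sketch of the adaptive mixture update described above is shown below, with the likelihood term dropped as in [7] so that the adaptation rate equals the learning rate. The isotropic (scalar) variance, the default values, and the class interface are simplifying assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

class PixelGMM:
    """Adaptive Gaussian mixture for a single pixel (isotropic variances)."""

    def __init__(self, first_rgb, k=3, alpha=0.01, init_var=900.0):
        self.k, self.alpha, self.init_var = k, alpha, init_var
        self.w = np.array([1.0] + [0.0] * (k - 1))          # first frame as reference
        self.mu = np.tile(np.asarray(first_rgb, float), (k, 1))
        self.var = np.full(k, init_var)

    def update(self, x):
        x = np.asarray(x, float)
        # rank distributions by w / sigma so the most reliable are matched first
        order = np.argsort(-self.w / np.sqrt(self.var))
        matched = None
        for i in order:
            if np.linalg.norm(x - self.mu[i]) < 2.5 * np.sqrt(self.var[i]):
                matched = i
                break
        m = np.zeros(self.k)                                 # matching factors
        if matched is None:
            # replace the least probable distribution with a new one
            j = int(np.argmin(self.w))
            self.mu[j], self.var[j], self.w[j] = x, self.init_var, 0.05
        else:
            m[matched] = 1.0
            rho = self.alpha                                 # likelihood term ignored, cf. [7]
            self.mu[matched] = (1 - rho) * self.mu[matched] + rho * x
            diff = x - self.mu[matched]
            self.var[matched] = (1 - rho) * self.var[matched] + rho * float(diff @ diff)
        self.w = (1 - self.alpha) * self.w + self.alpha * m
        self.w /= self.w.sum()
```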

B. Bhattacharyya Distance

Having described the way to model and update the background scene, we now introduce how to measure the similarity between the currently observed image and the constructed background model. Inspired by [25], the idea is to generate an image representing the background scene according to the constructed background model; the measure evaluating the similarity between the current observation and the background model can then be defined as a function of these two images at the region level.

For each pixel observation in the current image, the corresponding pixel in the generated background image is defined as the mean vector of the Gaussian distribution, at that pixel, having the minimum Mahalanobis distance [22] from the observation. Let $R$ be a region of the current image obtained from the region-based motion segmentation process, and let $R_{B}$ be the same region taken in the generated background image.

We assume that the colors of the regions $R$ and $R_{B}$ both follow Gaussian distributions. Suppose that $\boldsymbol{\mu}_{1}$ and $\boldsymbol{\Sigma}_{1}$ are the mean vector and covariance matrix of $R$, respectively, and similarly $\boldsymbol{\mu}_{2}$ and $\boldsymbol{\Sigma}_{2}$ are those of $R_{B}$. The distance measure between $R$ and $R_{B}$ can be related to the probability of classification error in statistical hypothesis testing, which naturally leads to the Bhattacharyya distance [22], [26]. The Bhattacharyya distance $D_{B}$ is formally defined as follows:

$$D_{B} = \frac{1}{8} (\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2})^{\top} \left[\frac{\boldsymbol{\Sigma}_{1} + \boldsymbol{\Sigma}_{2}}{2}\right]^{-1} (\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2}) + \frac{1}{2} \ln \frac{\bigl|\tfrac{1}{2}(\boldsymbol{\Sigma}_{1} + \boldsymbol{\Sigma}_{2})\bigr|}{\sqrt{|\boldsymbol{\Sigma}_{1}|\,|\boldsymbol{\Sigma}_{2}|}}. \qquad (12)$$

The first term of (12) gives the class separability due to the difference between class means, whereas the second term gives the class separability due to the difference between class covariance matrices.
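The Bhattacharyya distance of (12) between the two region Gaussians can be evaluated directly; a small sketch with illustrative names is given below.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians.
    The first term reflects the difference between the means, the second the
    difference between the covariance matrices, as in (12)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    term_cov = 0.5 * np.log(np.linalg.det(cov) /
                            np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term_mean + term_cov
```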

However, the region similarity defined in this way will lead to misclassification of background regions where direct light is blocked by a foreground object. A region of this kind is referred to as shadow. According to [8], the intensity of a pixel in shadow is scaled down by a factor $\alpha$ that is bounded below by a constant.

If the actual color vector of a pixel is $\mathbf{x}$, it will become $\mathbf{x}_{s}$ after being covered by shadow. In the ideal case, $\mathbf{x}_{s} = \alpha\, \mathbf{x}$. Due to light fluctuation and noise, the ideal situation hardly takes place. Thus, the scaling factor is defined to be the value $\alpha^{*}$ that minimizes $\|\mathbf{x}_{s} - \alpha\, \mathbf{x}\|^{2}$. By differentiating with respect to $\alpha$, we obtain

$$\alpha^{*} = \arg\min_{\alpha} \|\mathbf{x}_{s} - \alpha\, \mathbf{x}\|^{2} = \frac{\mathbf{x}^{\top} \mathbf{x}_{s}}{\mathbf{x}^{\top} \mathbf{x}}. \qquad (13)$$

In order to be invariant to the shadow effect, every pixel in the current image could be preprocessed first, but this is impractical due to the expensive computation. Instead, we just use the mean vectors of $R$ and $R_{B}$ to evaluate $\alpha^{*}$ and scale down the distribution of $R_{B}$ accordingly.
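The closed-form scaling factor of (13) follows from setting the derivative of the squared error to zero. The sketch below computes it from the two region means; applying it only when the factor lies between the empirical constant 0.7 (see Section V) and 1 is this sketch's assumption about how the constant is used.

```python
import numpy as np

def shadow_scale(bg_mean_rgb, cur_mean_rgb, lower=0.7):
    """Least-squares scaling factor alpha minimizing ||c_cur - alpha * c_bg||^2,
    i.e. alpha = (c_bg . c_cur) / (c_bg . c_bg).  The background region
    distribution is scaled by alpha before evaluating the Bhattacharyya
    distance, but only when alpha lies in [lower, 1] (treated as shadow)."""
    c_bg = np.asarray(bg_mean_rgb, float)
    c_cur = np.asarray(cur_mean_rgb, float)
    alpha = float(c_bg @ c_cur) / float(c_bg @ c_bg)
    return alpha if lower <= alpha <= 1.0 else 1.0
```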

Fig. 4 shows the results of our proposed shadow elimination algorithm.


Fig. 4. Shadow effect elimination: (a) the original image and (b) the segmentation result of our proposed region-based motion segmentation algorithm. The detected foreground without taking the shadow effect into consideration (that is, with the scaling factor fixed to 1) is shown in (c); (d) the result after applying shadow effect elimination.

IV. MRFS-BASED CLASSIFICATION

In this section, we describe how to incorporate the background model to classify every region in the segmentation map SM as either a foreground region or a background one by MRFs. The regions obtained from the aforementioned motion segmentation algorithm are semantic ones in which the pixel movements are consistent with a motion model arising from the object's motion. Therefore, all pixels within each region should be classified with the same label, and thus we perform the label assignment at the region level by MRFs. Before that, a graph called the region adjacency graph (RAG) is used to represent the set of segmented regions. Let $G = (V, E)$ be an RAG, where $V$ is the set of nodes of the graph, each node corresponding to a region, and $E$ is the set of edges, with an edge between two nodes if and only if the corresponding regions are neighbors.

A. MRFs Framework

We want to find an assignment of every site (node) to either the background label $B$ or the foreground label $F$. An assignment is called a configuration, denoted as $\omega$, and the set of all possible configurations is denoted as $\Omega$.

Suppose that we obtain an observation $o$. The optimal configuration in which we are interested is the one with maximum a posteriori probability (MAP) under the observation $o$. That is,

$$\omega^{*} = \arg\max_{\omega \in \Omega} P(\omega \mid o). \qquad (14)$$

From Bayes' rule, we have

$$P(\omega \mid o) \propto P(o \mid \omega)\, P(\omega). \qquad (15)$$

According to the Hammersley–Clifford theorem [27], the a posteriori probability in (14), which follows a Gibbs distribution, can be expressed as

$$P(\omega \mid o) = \frac{1}{Z} \exp\bigl(-U(\omega \mid o)\bigr) \qquad (16)$$

where $Z$ is a normalizing constant called the partition function. Maximizing the a posteriori probability amounts to minimizing the posterior energy function $U(\omega \mid o) = U(o \mid \omega) + U(\omega)$, where the terms $U(o \mid \omega)$ and $U(\omega)$ are the likelihood and the a priori energy over all sites, respectively.

For practical reasons, only singleton and pairwise cliques are considered. Therefore, the posterior energy function $U(\omega \mid o)$ can be decomposed into

$$U(\omega \mid o) = \sum_{i \in V} U(o_{i} \mid \omega_{i}) + \sum_{i \in V} U_{1}(\omega_{i}) + \sum_{(i, j) \in E} U_{2}(\omega_{i}, \omega_{j}) \qquad (17)$$

where $o_{i}$ is the observation of site $i$, and $U_{1}$ and $U_{2}$ are the singleton and pairwise clique energies, respectively. In the subsequent sections, we give the details of the definitions of the likelihood and the prior energies.
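Under this decomposition, the posterior energy of a labeling is accumulated over the nodes and edges of the RAG. The skeleton below illustrates only the bookkeeping; the concrete likelihood, singleton, and pairwise energy functions are passed in as callables and are placeholders for the definitions given in the next subsection.

```python
def posterior_energy(labels, nodes, edges,
                     likelihood_energy, singleton_energy, pairwise_energy):
    """Posterior energy of a labeling over the region adjacency graph.

    labels            : dict mapping node id -> 'B' or 'F'
    nodes             : iterable of node ids (regions)
    edges             : iterable of (i, j) pairs of adjacent regions
    likelihood_energy : f(i, label)        -> float  (data term)
    singleton_energy  : f(i, label)        -> float  (temporal prior)
    pairwise_energy   : f(i, j, li, lj)    -> float  (spatial prior)
    """
    u = 0.0
    for i in nodes:
        u += likelihood_energy(i, labels[i]) + singleton_energy(i, labels[i])
    for i, j in edges:
        u += pairwise_energy(i, j, labels[i], labels[j])
    return u
```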

B. Region Classification

In this section, we describe how to define the likelihood energy and the prior energy so as to incorporate the background model as well as temporal and spatial coherence under the MRFs framework.

1) Likelihood Energy: The term $U(o_{i} \mid \omega_{i})$ represents the likelihood energy of site $i$ being classified with the label $\omega_{i}$. As a result, two energy functions should be defined to evaluate the background and foreground likelihood energies, respectively. The idea behind our design of the likelihood energy is as follows: if the color distribution of the currently observed image at site $i$ is similar to that of the background scene represented by the background model, the site will have a high probability of being assigned the label $B$; otherwise, it will more likely be classified as foreground.

Let $R_{i}$ and $R_{i,B}$ be the regions of site $i$ in the current image and in the generated background image, respectively. The similarity between the color distributions of $R_{i}$ and $R_{i,B}$ is measured by the Bhattacharyya distance $D_{B}$. Thus, the two functions depicted in Fig. 5(a) are defined to evaluate the background and foreground likelihood energies, respectively; the two constants in Fig. 5(a) are thresholds introduced in the definition of the energy functions to avoid the outlier effect [28].

2) Prior Energy: The prior energy is composed of the singleton, $U_{1}$, and pairwise, $U_{2}$, energies. We relate $U_{1}$ to temporal coherence; that is, the region obtained at the current time instant tends to be classified with the same label as the corresponding region at the previous time instant. Suppose that $R_{i}(t)$ stands for the region of site $i$ at time $t$. Then $U_{1}$ can be defined as

$$U_{1}(\omega_{i}) = \begin{cases} r & \text{if } \omega_{i} = F \\ 1 - r & \text{if } \omega_{i} = B \end{cases} \qquad (18)$$

where $R_{i}(t-1)$ is the corresponding region of $R_{i}(t)$ at frame $t-1$, which can be obtained by using the affine motion model (see Fig. 6), $r$ is the ratio of the pixels in $R_{i}(t-1)$ that have been classified as background at time instant $t-1$, and $0 \le r \le 1$.

Fig. 5. Energy functions: (a) the two functions used to evaluate the likelihood energy $U(o_{i} \mid \omega_{i})$; (b) the two functions used to evaluate the spatial (pairwise) energy $U_{2}(\cdot, \cdot)$.

Fig. 6. Temporal projection of the region $R_{i}(t)$ at time $t$ to the region $R_{i}(t-1)$ at time $t-1$. The shaded area is the region that has been classified as foreground; $r$ is the ratio of the pixels within $R_{i}(t-1)$ that have been assigned the background label $B$.

As for the pairwise term $U_{2}$, we relate it to spatial coherence. Spatial coherence means that two neighboring regions with similar colors should be assigned the same label. Assume that two sites $i$ and $j$ are adjacent; then $i$ and $j$ tend to be assigned the same label if they have similar color distributions. Accordingly, the two functions depicted in Fig. 5(b) are defined for the cases $\omega_{i} = \omega_{j}$ and $\omega_{i} \neq \omega_{j}$, respectively; as in Fig. 5(a), the two constants in Fig. 5(b) are predefined thresholds. By such a formulation, the foreground detection problem is mapped to an optimization problem. The optimization is carried out using the iterated conditional modes (ICM) algorithm to find the most proper label assignment of every region. After classification, neighboring regions with the same classified label and similar motion are merged and then used to update the background model and the region map.
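The ICM optimization greedily flips one region label at a time, keeping a change only when it lowers the local energy, until no further change occurs. The sketch below complements the posterior-energy skeleton above; the sweep order, the stopping rule, and the label set {'B', 'F'} are assumptions of this sketch.

```python
def icm(labels, nodes, edges, likelihood_energy, singleton_energy,
        pairwise_energy, max_iters=20):
    """Iterated conditional modes over the region adjacency graph."""
    labels = dict(labels)
    # adjacency list so each site only re-evaluates its local energy
    nbrs = {i: [] for i in nodes}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def local_energy(i, li):
        e = likelihood_energy(i, li) + singleton_energy(i, li)
        return e + sum(pairwise_energy(i, j, li, labels[j]) for j in nbrs[i])

    for _ in range(max_iters):
        changed = False
        for i in nodes:
            best = min(('B', 'F'), key=lambda lab: local_energy(i, lab))
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:          # converged: no single-site change helps
            break
    return labels
```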

V. EXPERIMENT

In this section, one standard MPEG-4 test sequence as well as three image sequences captured in the intelligent home (e-home) demo room of the Intelligent Robotics Laboratory at National Taiwan University are considered to validate our proposed method. The MPEG-4 test sequence used here is the Hall Monitoring image sequence in CIF format at 10 fps, whereas the three e-home sequences, with image size 352 × 240, are all acquired via an AXIS 2310 PTZ Network Camera mounted on the ceiling for monitoring and perceiving human activity.

TABLE I PARAMETERS TABLE

Additionally, we compare our algorithm with the one proposed in [29], [30], which is used to extract foreground objects for further human identification. The threshold value of Wang's algorithm is set to 0.8. The parameters used for foreground detection in our system are listed in Table I. In summary, they are the thresholds for the projection error and the motion error, the weighting factor $w$ in (5), the learning rate $\alpha$ in (10) used to update the weights of the Gaussian distributions, the constant for shadow effect elimination, and the four constants for region classification illustrated in Fig. 5.

A. Experimental Result

The Hall Monitoring benchmark is a commonly used MPEG-4 test sequence, especially for evaluating the effectiveness of background subtraction techniques. To effectively subtract foreground objects from the Hall Monitoring image sequence, the problems of noise and of the shadow effect caused by indoor illumination should be handled well. Fig. 7(a) shows the original frames 15, 25, 50, and 75 of the Hall Monitoring image sequence. The images in Fig. 7(b) and (c) show the detection results of Wang's approach and of ours, respectively. Frame 15 is included to show that our algorithm can automatically detect newly introduced objects. The red circles in Fig. 7(b) indicate that noise due to light fluctuation is still considered foreground even when the sophisticated measure in Wang's approach is adopted to perform background subtraction. However, by performing foreground detection at the more semantic region level, we can eliminate the noise effect, as shown in Fig. 7(c).

In addition to the Hall Monitoring sequence, three video sequences from our e-home demo room are used to demonstrate that our proposed approach can handle illumination variation and local motion. The first case is an image sequence exhibiting gradual illumination variation and local motion.


Fig. 7. Hall Monitoring sequence: (a) frames 15, 25, 50, and 75 of the Hall Monitoring sequence; (b), (c) detection results of Wang's approach and of our proposed approach, respectively.

When a person enters, the background gradually brightens. This is due to radiance from the fluorescent lamp being reflected back into the background scene. After leaving the scene, he waves the curtains to make them flutter. Some false positives produced by Wang's algorithm under this condition of gradual illumination variation and local motion are shown in Fig. 8(b). As shown in Fig. 8(c), the detection results of our approach are more robust in such situations.

In the second case, a situation with sudden illumination variation is used to test the effectiveness of our approach. The light in the living room of our e-home demo room is switched off automatically to save energy when the last occupant leaves that room. The original images of this case are shown in Fig. 9(a), and the two bottom-most pictures are of two scenes where the light in the living room is completely turned off. The two bottom-most pictures of the detection results in Fig. 9(b) show that Wang's approach may lead to some misclassification in the presence of a sudden illumination change. This is expected because image intensity is the only cue used in Wang's approach for pixel classification. Although our work cannot deal perfectly with this situation, as shown in Fig. 9(c), at least most pixels of the background scene are correctly classified. The reason that our approach deals with sudden illumination variation reasonably well is that both spatial and temporal coherence are imposed. When an illumination variation occurs, our motion segmentation algorithm will produce several newly segmented regions. If these regions were considered independently and only the background model were used for region classification, all of them would be classified as foreground ones.

Fig. 8. Gradual illumination variation in e-home demo room. (a) Original images. (b) Detection result of Wang’s approach. (c) Detection results of our approach.

Yet, classifying them as background yields a lower posterior energy because the prior energies imposing spatial and temporal coherence are integrated.


Fig. 9. Sudden illumination variation in the e-home demo room. (a) Original images. (b) Detection results of Wang's approach. (c) Detection results of our approach.

Fig. 10. Shadow effect elimination under the condition of gradual illumination variation.

The final case features two persons entering the scene in turn and crossing each other near the top of the image, which produces a severe shadow effect in our e-home demo room. Our proposed method eliminates most of the shadow effect by applying the aforementioned scaling factor before evaluating the Bhattacharyya distance. Empirically, the constant for shadow elimination is set to 0.7 in this paper. An example of shadow effect elimination is shown in Fig. 10. Instead of eliminating the shadow effect during post-processing (after the classification step) with some sophisticated algorithm, our method naturally excludes shadows from the detected foreground. Fig. 11 shows some detection results of our approach. Notwithstanding, in some pathological cases, areas that are severely covered by shadows will still be misclassified as foreground. This is because the scale-down ratio of intensity is less than 0.7. To overcome this problem, it is necessary to exploit more contextual information or structural knowledge of the background scene.

B. System Performance

Our system is implemented on a personal computer with a 1.8-GHz Pentium IV processor. Table II shows a run-time analysis of our system. The listed numbers are the processing times of each operation for the Hall Monitoring sequence and the three cases in our e-home room. The average time for processing each frame is 370 ms, which corresponds to 2.7 fps.

On the other hand, the processing time of the similar MRFs-based approach proposed in [13], which was measured on a 500-MHz Pentium III, is 2000 ms per frame for CIF image format. According to the simulation results obtained using the Sandra software of SiSoftware (http://www.sisoftware.co.uk/) and the technical report on their website, a 1.8-GHz Pentium 4 is roughly 2.4 times faster than a 500-MHz Pentium III. As a consequence, for the approach in [13], the time to process one frame on the personal computer with the 1.8-GHz Pentium 4 would be about 2000/2.4 ≈ 833 ms. This signifies that our proposed approach should be more than twice as fast as the one proposed in [13].

The major reason why our approach can be more efficient is that the background scene is taken as a single region rather than being over-segmented. This keeps the number of regions extracted by our proposed algorithm small and, hence, significantly reduces the complexity of the MRFs classification operation. In Fig. 12, we show that the number of segmented regions for the first 100 frames of the four image sequences is at most 90, even for sequences with a complex background scene. In this manner, the computing time of the MRFs classification is reduced to about 120 ms, as shown in the fourth row of Table II. Although our proposed algorithm is not yet ready for real-time applications, it has achieved a significant improvement in performance. In our ongoing research, a tracking algorithm will be introduced to track the detected foreground objects in order to further speed up the system.

VI. CONCLUSION

In this paper, we performed foreground detection at the region level, which means that contextual information is taken into consideration. To achieve this, a segmentation approach consisting of region projection, motion marker extraction, and boundary determination is proposed to automatically obtain a set of motion-coherent regions. The main advantage of this method is that it avoids the over-segmentation problem and thus makes the subsequent region classification more efficient. The Bhattacharyya distance is used to measure the distance between two region distributions, and the shadow effect is eliminated by applying a scale factor to the region distribution. A statistical framework, MRFs, fuses the cues from the background model and the prior knowledge, including temporal and spatial coherence, to detect the foreground objects in a more accurate and elegant way. Experimental results demonstrate that our proposed method can successfully extract foreground objects even under situations with illumination variation, shadow, and local motion.

Nevertheless, our proposed method depends heavily on the motion estimation algorithm. The segmentation may become unsatisfactory when an object has a large displacement. Therefore, our ongoing research is to develop a tracking algorithm that can be used to track the detected objects. Moreover, the trajectory of an object should also be taken into account for region classification. Last but not least, the result after high-level recognition will be fed back to update the background and region map to handle situations with structural variation and severe shadow effects.


Fig. 11. Figures here show that our approach can successfully eliminate shadow.

TABLE II RUN-TIME ANALYSIS

Fig. 12. Number of segmented regions after applying our proposed motion segmentation algorithm to the four test video sequences. The average numbers of regions for these four sequences are 32.24, 39.49, 18.32, and 32.35, respectively. Furthermore, the average adjacency of each region in these four sequences is 5.7, 5.6, 3.8, and 5.2, respectively.


REFERENCES

[1] S. Gupte, O. Masoud, R. F. K. Martin, and N. P. Papanikolopoulos, “Detection and classification of vehicles,” IEEE Trans. Intell. Transport. Syst., vol. 3, no. 1, pp. 37–47, Mar. 2002.

[2] S. Y. Chien, S. Y. Ma, and L. G. Chen, “Efficient moving object segmentation algorithm using background registration technique,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 7, pp. 577–586, Jul. 2002.

[3] I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: Real-time surveillance of people and their activities,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 809–830, Aug. 2000.

[4] E. Stringa and C. S. Regazzoni, “Real-time video-shot detection for scene surveillance applications,” IEEE Trans. Image Process., vol. 9, no. 1, pp. 69–79, Jan. 2000.

[5] N. Friedman and S. Russell, “Image segmentation in video sequence: A probabilistic approach,” presented at the Int. Conf. Uncertainty in Artificial Intelligence, Aug. 1997.

[6] C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time tracking,” presented at the IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Jun. 1999.

[7] P. KaewTraKulPong and R. Bowden, “An improved adaptive background mixture model for real-time tracking with shadow detection,” presented at the 2nd Eur. Workshop Advanced Video-Based Surveillance Systems, 2001.

[8] A. Elgammal, D. Harwood, and L. S. Davis, “Non-parametric model for background subtraction,” presented at the IEEE Int. Conf. Computer Vision Frame-Rate Workshop, 1999.

[9] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis, “Background and foreground modeling using nonparametric kernel density estimation for visual surveillance,” Proc. IEEE, vol. 90, no. 7, pp. 1151–1162, Jul. 2002.

[10] D. Butler and S. Sridharan, “Real-time adaptive background segmentation,” presented at the IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2003.

[11] J. Y. A. Wang and E. H. Adelson, “Spatio-temporal segmentation of video data,” Proc. SPIE, 1994.

[12] G. D. Borshukov and G. Bozdagi, “Motion segmentation by multistage affine classification,” IEEE Trans. Image Process., vol. 6, no. 11, pp. 1591–1594, Nov. 1997.

[13] Y. Tsaig and A. Averbuch, “Automatic segmentation of moving objects in video sequences: A region labeling approach,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 7, pp. 597–612, Jul. 2002.


[14] Y. Altunbasak, P. E. Eren, and A. M. Tekalp, “Region-based parametric motion segmentation using color information,” Graph. Models Image Process., vol. 60, no. 1, pp. 13–23, Jan. 1998.

[15] J. G. Choi, S. W. Lee, and S. D. Kim, “Spatio-temporal video segmentation using a joint similarity measure,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 2, pp. 279–286, Apr. 1997.

[16] F. Dufaux, F. Moscheni, and A. Lippman, “Spatio-temporal segmentation based on motion and static segmentation,” in Proc. IEEE Conf. Image Processing, 1995, pp. 306–309.

[17] S. S. Huang, L. C. Fu, and P. Y. Hsiao, “A region-based background modeling and subtraction using partial directed Hausdorff distance,” presented at the IEEE Int. Conf. Robotics and Automation, 2004.

[18] S. S. Huang, L. C. Fu, and P. Y. Hsiao, “A region-level motion-based background modeling and subtraction using MRFs,” presented at the IEEE Int. Conf. Robotics and Automation, 2005.

[19] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” AI Memo 572, Massachusetts Inst. Technol., Cambridge, 1980.

[20] P. Salembier and M. Pardas, “Hierarchical morphological segmentation for image sequence coding,” IEEE Trans. Image Process., vol. 3, no. 9, pp. 639–651, Sep. 1994.

[21] J. Y. A. Wang and E. H. Adelson, “Representing moving images with layers,” IEEE Trans. Image Process., vol. 3, no. 5, pp. 625–638, Sep. 1994.

[22] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2000.

[23] A. Rosenfeld and J. L. Pfaltz, “Sequential operations in digital picture processing,” J. Assoc. Comput. Mach., vol. 13, pp. 471–494, 1966.

[24] J. Lim and J. B. Ra, “A semantic video object tracking algorithm using three-step boundary refinement,” in Proc. IEEE Int. Conf. Image Processing, 1999, pp. 159–163.

[25] A. Monnet, A. Mittal, N. Paragios, and V. Ramesh, “Background modeling and subtraction of dynamic scenes,” in Proc. IEEE Int. Conf. Computer Vision, 2003, pp. 1305–1312.

[26] B. Mak and E. Barnard, “Phone clustering using the Bhattacharyya distance,” in Proc. 4th Int. Conf. Spoken Language Processing, Oct. 1996, vol. 4, pp. 2005–2008.

[27] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-6, no. 11, pp. 721–741, Nov. 1984.

[28] M. J. Black, “The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields,” Comput. Vis. Image Understand., vol. 63, no. 1, pp. 75–104, Jan. 1996.

[29] L. Wang, T. Tan, H. Ning, and W. Hu, “Silhouette analysis-based gait recognition for human identification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 12, pp. 1505–1518, Dec. 2003.

[30] Y. Kuno, T. Watanabe, Y. Shimosakoda, and S. Nakagawa, “Automated detection of human for visual surveillance system,” in Proc. Int. Conf. Pattern Recognition, 1996, pp. 865–869.

Shih-Shinh Huang (S’03) was born on November 8, 1974, in Taiwan, R.O.C. He received the B.S. degree from the National Taiwan Normal University in 1996 and the M.S. and Ph.D. degrees from the National Taiwan University, Taipei, Taiwan, R.O.C., in 1998 and 2007, respectively.

His research interests include computer vision, pattern recognition, and intelligent transportation systems.

Li-Chen Fu (M’84–SM’94–F’04) received the B.S. degree from the National Taiwan University, Taipei, Taiwan, R.O.C., in 1981, and the M.S. and Ph.D. degrees from the University of California, Berkeley, in 1985 and 1987, respectively.

Since 1987, he has been on the faculty of, and currently is a Professor in, both the Department of Electrical Engineering and the Department of Computer Science and Information Engineering of the National Taiwan University. His areas of research interest include robotics, FMS scheduling, shop floor control, home automation, visual detection and tracking, E-commerce, and control theory and applications.

Pei-Yung Hsiao (M’90) received the B.S. degree in chemical engineering from Tung Hai University in 1980 and the M.S. and Ph.D. degrees in electrical engineering from the National Taiwan University, Taiwan, R.O.C., in 1987 and 1990, respectively.

He currently is an Associate Professor with the Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung, Taiwan. His main research interests and industrial experience are focused on VLSI/CAD, VLSI design, DIP/SOC, image processing, fingerprint recognition and technology, FPGA rapid prototyping, embedded systems, neural networks, and expert systems.

Dr. Hsiao was granted a scholarship in the 1985 Electronics Engineering Award Examination conducted by the Ministry of Foreign Affairs from the Taiwan, R.O.C., Government for studying microelectronics in Belgium, and he was awarded the 1990 Acer Long Term Ph.D. Dissertation Award from Acer Group.
