2.1 Background Subtraction
A background model in a surveillance system represents background objects.
The method that compares the current processed image with the background representation to determine foreground regions is called background subtraction. If the background is unchanged but affected by Gaussian noise, the colors of the background pixels can be modeled as a Gaussian distribution with mean vector (µ) and covariance matrix (Σ) [1-8]. Background subtraction is then performed by calculating the probability of each pixel in the current image belonging to the Gaussian model.
Since background appearances may be affected by external forces, modeling a pixel with a Gaussian distribution may misclassify some background pixels as foreground ones. In many cases, the background may change repeatedly. A background pixel with repeated changes can be divided into several background constituents and modeled as an MoG distribution [9-13]. For each background constituent in a pixel, the means ( ), covariances (Σ ), and weights ( ) of the i-th constituent ( ) have to be estimated. If there are K background constituents, the parameters of the background model can be represented as , Σ , |1 In order to decide whether a sample point X belongs to the background B, the conditional probability | is calculated as follows:
| | ; , Σ . (2.1)
9
where η represents a Gaussian probability density function,
; , Σ 1
2 / |Σ | / . (2.2)
The motion of some background objects may not be repeated. After the motion, the objects remain in the same position for a period. To model the background changes, researchers have proposed methods for online updating of the parameters of background models [1-17]. The mean vector and covariance matrix in time t are represented as and Σ , respectively. The updating rules are formulated as follows:
1 , (2.3)
Σ 1 Σ , (2.4)
where ρ is used to control the updating rate. To integrate the updating method into an MoG model, Stauffer and Grimson [9] proposed a method to update the mean vector and covariance matrix of a background constituent to match those of in Eqs.(2.3) and (2.4). The weight , of the i-th background constituent is updated as follows:
, 1 , , , (2.5)
where , is an indicator function, whose value is one if the i-th background constituent matches and zero otherwise, and α is a constant used to control the updating rate of the weights. In Stauffer and Grimson's method [9], the updating rate ρ for the parameters of the i-th constituent (Gaussian distribution) is calculated according to α and ; , Σ .
To make background models more robust, researchers tried to modify updating rules or adopt different features [10-13]. In adaptive background models, background objects are assumed to appear more frequently than foreground ones. However, the
10
assumption is not always satisfied. If the appearances of a background pixel appear less frequently than those of foreground objects, the background pixel is probably misclassified as a foreground object. Taking the following case as an example, assume a room is monitored by a fixed camera and the background objects in the room include a door and a wall as shown in Fig. 2.1(a). People may enter the room, and close or open the door. If a person wears a suit of clothes of single color and walks
person
(a)
(b)
(t + 1)-th frame (t + 2)-th frame (t + 30)-th frame
Possible colors Weights after 30 frames and α=0.01
Fig. 2.1 An example of a slowly moving person. (a) Sketches of two possible background scenes. Left: door closed; Right: door opened. (b) Consecutive frames of a person moving from left to right. (c) Possible colors and their weights of the rectangle region shown in the left image.
11
slowly across the room as shown in Fig. 2.1(b), the major color of the clothes may be captured repeatedly in a certain position among several consecutive images. Assume that the person moves from left to right in 30 frames. If the updating rate α in Eq.(2.5) is set as 0.01, the weight of the repeatedly captured color of the clothes in the images will increase from 0 to 0.26, as shown in Fig. 2.1(c). This large weight may cause the clothes to be labeled as the background, when the MoG model in Eq.(6.2) is used.
Using a small updating rate can overcome this problem; however, the background model will be updated very slowly and may fail to learn background changes.
In another situation, the color of the person's clothes is assumed to be the same as that of the door, as illustrated in Fig. 2.2(a). If the person enters the room and passes through the door, the region of clothes may be labeled as background due to the similarity of colors. However, after the door is opened, the current background color is not similar to the clothes color as shown in Fig. 2.2(b). The clothes may still be labeled as the background, since they are very similar to possible background colors been estimated.
In these two situations, we observe that modeling each pixel independently
(a) (b)
Fig. 2.2 Sketches of a person whose clothes colors are similar to the door color in front of different background scenes: (a) scene when the door is closed and (b) scene when the door is opened.
12
cannot sufficiently represent the similarity among different object appearances caused by either object motion or illumination changes. Most researchers regarded the variations caused by object motion as foreground changes and attempted to eliminate the effect caused by illumination. In consecutive images, when the illumination changes, the pixels of an object are usually changed simultaneously. In order to model background objects, the pixels at different positions should be considered together. To represent the relation among the pixels, Durucan and Ebrahimi [14] proposed to model the colors of a region as vectors. They segmented the foreground regions by calculating the linear dependence between the vectors of the current image and those of the background model. However, it is expensive to represent the dependence between vectors in terms of storage and speed. To reduce the cost, the vector of a region should be reduced to a lower dimensional feature. Li et al. [15,16] used two-dimensional gradient vectors as the features of local spatial relations among neighboring pixels. In their proposed method, the appearance variations caused by illumination changes can be distinguished from object motion. However, the gradient features cannot be used to extract the foreground region that has a uniform color. To model the relation among pixels, we need to use the relations among near pixels to reduce time and storage consumption, and then extend the relations into a more global form.
The methods based on the Markov random field (MRF) are well known for extending the neighboring relations among pixels into a more global form. Image segmentation methods based on MRF [17,23] assume that most pixels belonging to the same object have the same label and these pixels form a group in an image. The MRF combines colors among a clique of pixels in a neighboring system and uses an energy function to measure the color consistency. Then, the maximum a posterior
13
estimation method is used to minimize the energy for all the cliques to find the optimal labels. In the MRF-based methods, the final segmentation results are strongly dependent on the energy functions of the labels in different cliques. If a high energy is assigned to the clique with unique labels, the extracted foreground regions will become more complete than those of the pixel-wise background models. An additional noise removal process is not required. However, when several pixels are mis-labeled, these errors will propagate into neighboring pixels. The error propagation will cause more pixels to be mis-labeled. In this research, we directly estimate the relations among pixels instead of the labels, and therefore the errors will not easily propagate.
2.2 Human Tracking
In the last few decades, tracking objects or humans in video sequences has received much attention. Much research about the topic has been proposed and been reviewed in several survey papers [24-28]. Moeslund et al. [24] divided a general human tracking algorithm into two main phases: figure-ground segmentation and temporal correspondences. The former finds the target human in an image, and the latter associates the detected humans in consecutive frames to create temporal trajectories. In the following, related work about these two phases will first be addressed. The methods for segmenting human bodies and correcting tracking failures will then be described.
The methods of figure-ground segmentation can be classified into five categories according to the used features. These categories include background subtraction [6,9], motion-based segmentation [29], depth-based segmentation [30], appearance-based segmentation [18,19,21,31,32], and shape-based segmentation [2,33]. Background
14
subtraction and motion-based segmentation methods find the differences between images to extract the target. The two approaches assume that only one target object moves in a specific region and the appearances of background objects in consecutive images do not change. However, the assumptions usually cannot be met in general environments. To achieve better segmentation results for tracking, further checks are needed. The depth-based segmentation approach uses the positions of the target in three-dimensional space or in the ground plane to segment the target. However, to locate such kind of positions, specific hardware (such as multiple cameras) or additional calculations (such as inverse perspective transform) are needed. The appearance-based segmentation approach became popular recently, since the approach is usually simple and fast. The approaches of shape-based and appearance-based segmentation are similar except that the former does not use the color content inside the object. Since the appearances of a tracking target may change with time, several researchers proposed methods to model and update the appearance model of the target person dynamically in consecutive images [19,32]. Since the target is a moving object, some researchers tried to segment the target by a background subtraction method [6,9,34]. Shan et al. [34] modeled the colors of the target object as the appearance feature and then use the color cue to calculate a color probability distribution map from the current image. The color probability distribution map was combined with the background subtracted image using a logical AND operation to detect the target position. Other researchers used classifiers such as SVM [21] and Adaboost [31] to model the appearance of target objects.
In the tracking phase, temporal correspondence aims to predict and update the states of the target person from the measurement and predicted state, where the measurement is detected by figure-ground segmentation and the predicted state is
15
calculated using the system dynamic model. To find the temporal correspondences, Polat et al. [35] used MHT (Multiple Hypothesis Tracker) to construct hypotheses representing all the predictions and measurements. The most likely hypothesis is chosen as the target. To combine the predictions and measurements, Kalman filtering is another well-known method and has already been applied in many studies [3,9,36].
The Kalman-filter-based approaches are commonly used for tracking a target whose system dynamic model can be represented as a linear function and the noise as a Gaussian. In non-linear systems, extended Kalman filters that approximate the non-linear dynamic model by Taylor series have been applied [37]. Recently, particle filters have been proposed to construct a robust tracking framework that are neither limited to linear dynamic model nor Gaussian distributed noise [19,38,39]. The method represents the state of a target object by a set of samples (particles) with weights. The weight of a sample is calculated by the figure-ground segmentation and the samples are generated by the importance sampling method so that the samples can represent the probability distributions of the target object's appearances. We adopt the particle filter in our system, since they can be applied in an appearance-based tracking system very effectively.
A human is not a rigid object and his appearance changes irregularly.
Segmentation of human body parts in an image has already been proposed in several papers [40-43]. Forsyth and Fleck [40] introduced the notion of 'body plans' to represent a human or an animal as a structured assembly of body parts learnt from images. Shashua et al. [41] divided a human body into nine regions, for each of which a classifier was learnt based on features of orientation histograms. Mikolajczyk et al.
[42] divided a human body into seven parts. For each part, a detector was learnt by following the Viola-Jones approach applied to scale invariant orientation-based
16
features. Ioffe and Forsyth [43] decomposed the human body into nine distinctive segments. The method finds a person by constructing assemblies of body segments.
The segments were consistent with the constraints on the appearance of a person that result from kinematic properties. These body-parts-based human segmentation methods usually focused on detecting humans in a static image. Recently, body-parts-based human tracking in consecutive images has been proposed [21,31,44-47]. Parts of these studies focused on precise decomposition of body parts for motion type or pose analysis. However, in general environments, it is difficult to decompose precisely body parts due to self occlusions and complex background scenes. The studies in [21,47] proposed a detection-based tracking model to solve the occlusion problem. They detected body parts by a pretrained model, and then tried to associate the detected body parts to a target person by smoothing his trajectory.
However, when multiple humans appeared in a frame, the detection model could not differentiate the body parts of the different persons. The spatial positions and velocities were the only cues, which can be used to find the temporal correspondences of different persons.
When a person is tracked in consecutive frames, the figure-ground segmentation may fail, since the person may be occluded or other objects may have similar appearances with the target person. The first problem can be classified into occluded by other persons and occluded by background objects. To cope with the problem of inter-person occlusion, several researchers proposed to detect occlusion events and then used the system dynamics to estimate the position of the occluded person [48,49].
Using similar methods to predict the occlusion of background objects, one needs to create background object models. However, it is difficult to model all background objects in a complex environment. Several researchers tried to cope with the
17
occlusion problems by tracking different body parts of a person simultaneously [21,47,48]. When a body part is occluded, the position of the person can still be tracked based on the other parts. Mohan et al. [21] extracted a human body by detecting four parts: the head, legs, left arm, and right arm, by four distinct quadratic support vector machines. After geometric constraints among these parts are confirmed, another support vector machine is used to classify the combination of the four parts as either a human or a non-human. Wu and Nevatia [47] used four detectors to detect head-shoulder, torso, legs, and full-body. The detectors were learnt by a boosting approach using edgelet features. They used a strong classifier to classify the body parts in images. When we track multiple humans, the classifier cannot be used to distinguish different persons, and their trajectories will easily be confused if no other approaches are adopted. Lerdsudwichai et al. [48] proposed a method to model the face and clothes in the initialization phase. To identify different persons, they used clothes colors. However, the appearance of a person may change when the person presents different poses or is affected by different illuminations. The appearances captured in the initialization phase cannot be applied to track the person in other frames. In our research, we use an adaptive appearance model to track the body parts, even when multiple persons are tracked.
Apart from the occlusion events, a tracker may lose the tracking target when other objects have similar appearances. In general, a robust appearance model can be used to reduce the tracking failures, or the system dynamic model of the target person can be used to predict his position. However, the robust appearance model may be too complex to maintain efficiently. We will use the system dynamic model of the target person to track him, when a tracking failure is detected. To detect the tracking failure, Dockstader and Imennov [36] proposed a method that uses a structural model to
18
represent a person and a hidden Markov model (HMM) to describe the temporal characteristics of the tracking failure. In the tracking phase, the HMM was used to predict the tracking failures. Since a person may change his velocity, the predicted position of the person using the system dynamic model is too rough. A fine tune for the target person is needed.
19