CHAPTER 1 INTRODUCTION
1.5 Thesis Outline
The thesis is organized as follows. Chapter 2 introduces the theory of eigenspace and canonical space transformation and describes how a high-dimensional image is transformed into eigenspace and then into canonical space. Chapter 3 shows how we use timing information to build a fuzzy rule database for activity recognition. We collect three consecutive images as a feature vector and, by training on known data, extract the transitional rules of templates for activity recognition. The fuzzy rules play an important role in our activity recognition system.
Chapter 4 describes the processing steps of our recognition system.
Chapter 5 presents the experimental results of our recognition system. Finally, Chapter 6 concludes the thesis with a discussion.
Fig. 1.1 The block diagram of human activity recognition system
Chapter 2 Fundamentals of Eigenspace and Canonical Space Transformation
In video and image processing, the dimensionality of image data is often extremely large. Because images contain a great deal of redundancy, it is common to transform an image from one space to another to reduce that redundancy. Many methods, such as the Fourier transform, wavelet transforms, Principal Component Analysis (PCA), and eigenspace transformation (EST), have been demonstrated to be effective for this purpose. However, PCA based on the global covariance matrix of the full set of image data is not sensitive to class structure in the data. To increase the discriminatory power of various activity features, Etemad and Chellappa [23] use Linear Discriminant Analysis (LDA), also called Canonical Analysis (CA), which optimizes the class separability of different activity classes and improves classification performance. The features are obtained by maximizing between-class variation and minimizing within-class variation. Unfortunately, this approach has a high computational cost when applied to large images, and it was tested only with small images.
Here we call this approach canonical space transformation (CST). By combining EST based on PCA with CST based on CA, our approach reduces the data dimensionality and optimizes the class separability of different action sequences simultaneously.
Images in the high-dimensional image space are converted to a low-dimensional eigenspace using PCA. The resulting vector is then projected to a smaller canonical space using CST. Recognition is accomplished in the canonical space.
Assume that there are $c$ classes to be learned. Each class represents a specific posture of various forms existing in the training image data. $x'_{i,j}$ is the $j$-th image in class $i$, and $N_i$ is the number of images in the $i$-th class. The total number of images in the training set is $N_T = N_1 + N_2 + \cdots + N_c$. This training set can be written as
$$[x'_{1,1}, \ldots, x'_{1,N_1}, x'_{2,1}, \ldots, x'_{c,N_c}] \tag{1}$$
where each $x'_{i,j}$ is an image with $n$ pixels.
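First, the intensity of each sample image is normalized by
$$x_{i,j} = \frac{x'_{i,j}}{\|x'_{i,j}\|}. \tag{2}$$
Then the mean pixel value of the training images is obtained as
$$m_x = \frac{1}{N_T}\sum_{i=1}^{c}\sum_{j=1}^{N_i} x_{i,j}. \tag{3}$$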
2.1 Eigenspace Transformation (EST)
EST is widely used to reduce the dimensionality of an input space by mapping the data from a correlated high-dimensional space to an uncorrelated low-dimensional space while minimizing the mean-square error of the information loss. EST uses the eigenvalues and eigenvectors of the data covariance matrix to rotate the original data coordinates along the directions of maximum variance.
If the rank of the matrix $XX^T$ is $K$, then $XX^T$ has $K$ nonzero eigenvalues, $\lambda_1, \lambda_2, \ldots, \lambda_K$. Because the dimensionality of $XX^T$ is the typical image size, it is too large to be computed easily. Based on singular value decomposition theory, we can get the eigenvalues and eigenvectors by computing the matrix $\tilde{R}$
instead, that is
These K eigenvectors are used as an orthogonal basis to span a new vector space.
Each image can be projected to a point in this $K$-dimensional space. Based on the theory of PCA, each image can be approximated by taking only the $k \le K$ largest eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$ and their associated eigenvectors $e_1, e_2, \ldots, e_k$. This partial set of $k$ eigenvectors spans an eigenspace in which $y_{i,j}$ are the points that are the projections of the original images $x_{i,j}$ by the equation
$$y_{i,j} = [e_1, e_2, \ldots, e_k]^T x_{i,j}, \tag{8}$$
where $[e_1, e_2, \ldots, e_k]^T$ is called the eigenspace transformation matrix. After this transformation, each original image $x_{i,j}$ can be approximated by a linear combination of these $k$ eigenvectors, and $y_{i,j}$ is a one-dimensional vector with $k$ elements, which are the associated coefficients.
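The reason the smaller matrix suffices can be sketched in a short derivation. It assumes, as is usual for this technique, that $X$ is the $n \times N_T$ matrix whose columns are the mean-subtracted training images and that $\tilde{R} = X^T X$ (a constant scale factor in the definition would only rescale the eigenvalues). The symbols $\tilde{e}_i$ and $\tilde{\lambda}_i$ for the unit-norm eigenvectors and eigenvalues of $\tilde{R}$ are introduced here for illustration only.

```latex
\begin{aligned}
\tilde{R}\,\tilde{e}_i &= \tilde{\lambda}_i\,\tilde{e}_i,
  \qquad \tilde{R} = X^{T}X,\quad \|\tilde{e}_i\| = 1 \\
\Rightarrow\quad XX^{T}\,(X\tilde{e}_i) &= X\,(X^{T}X\,\tilde{e}_i)
  = \tilde{\lambda}_i\,(X\tilde{e}_i), \\
\|X\tilde{e}_i\|^{2} &= \tilde{e}_i^{T}X^{T}X\,\tilde{e}_i = \tilde{\lambda}_i
  \quad\Rightarrow\quad
  e_i = \frac{1}{\sqrt{\tilde{\lambda}_i}}\,X\tilde{e}_i .
\end{aligned}
```

Under these assumptions, every nonzero eigenvalue of the $N_T \times N_T$ matrix is also an eigenvalue of $XX^T$, and the corresponding unit eigenvector of $XX^T$ is obtained by one matrix-vector product, so only an eigenproblem of size $N_T$ rather than $n$ has to be solved.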
2.2 Canonical Space Transformation (CST)
Based on canonical analysis in [22], we suppose that $\{\varphi_1, \varphi_2, \ldots, \varphi_c\}$ represents the classes of vectors transformed by eigenspace transformation and that $y_{i,j}$ is the $j$-th vector in class $i$. The mean vector of the entire set can be written as
$$m_y = \frac{1}{N_T}\sum_{i=1}^{c}\sum_{j=1}^{N_i} y_{i,j}.$$
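Let $S_t$ denote the total scatter matrix, $S_w$ the within-class scatter matrix, and $S_b$ the between-class scatter matrix. Then, with $m_y$ the mean vector of the entire set and $m_i$ the mean vector of class $i$,
$$\begin{aligned}
S_t &= \frac{1}{N_T}\sum_{i=1}^{c}\sum_{j=1}^{N_i}(y_{i,j}-m_y)(y_{i,j}-m_y)^T,\\
S_w &= \frac{1}{N_T}\sum_{i=1}^{c}\sum_{j=1}^{N_i}(y_{i,j}-m_i)(y_{i,j}-m_i)^T,\\
S_b &= \frac{1}{N_T}\sum_{i=1}^{c}N_i\,(m_i-m_y)(m_i-m_y)^T,
\end{aligned}$$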
where $S_w$ represents the mean within-class distance between vectors and $S_b$ represents the mean between-class distance. The objective is to minimize $S_w$ and maximize $S_b$ simultaneously. This is known as the generalized Fisher linear discriminant function and is given by
$$J(W) = \frac{|W^T S_b W|}{|W^T S_w W|}.$$
The ratio of variances in the new space is maximized by the selection of feature $W$ if
. generated eigenvector corresponding to the i-th largest eigenvalues λi. According to the theory presented in [22], we can solve (12) as follows
*
After solving (11), we obtain $c-1$ nonzero eigenvalues and their corresponding eigenvectors $[v_1, v_2, \ldots, v_{c-1}]$, which form another orthogonal basis and span a $(c-1)$-dimensional canonical space. Using this basis, each point in eigenspace can be projected to a point in canonical space by
$$z_{i,j} = [v_1, v_2, \ldots, v_{c-1}]^T y_{i,j}, \tag{14}$$
where $z_{i,j}$ represents the new point and the orthogonal basis $[v_1, v_2, \ldots, v_{c-1}]^T$ is called the canonical space transformation matrix. By merging equations (8) and (14), each image can be projected to a point in the new $(c-1)$-dimensional space by
$$z_{i,j} = H x_{i,j}, \tag{15}$$
where $H = [v_1, v_2, \ldots, v_{c-1}]^T [e_1, e_2, \ldots, e_k]^T$.
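The scatter matrices used by CST can be illustrated with a small numerical sketch. It assumes the common convention in which all three matrices are normalized by the total number of vectors $N_T$; the thesis's own normalization may differ. The function names and the toy two-class data are hypothetical and serve only to show that, under this convention, the total scatter equals the sum of the within-class and between-class scatter.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]


def add_scaled(acc, mat, scale):
    """In-place update acc += scale * mat."""
    for r in range(len(acc)):
        for c in range(len(acc[r])):
            acc[r][c] += scale * mat[r][c]


def scatter_matrices(classes):
    """classes: list of classes, each a list of eigenspace vectors y_{i,j}.

    Returns (S_t, S_w, S_b), each normalized by N_T (an assumed convention).
    """
    all_vectors = [y for cls in classes for y in cls]
    n_total = len(all_vectors)
    dim = len(all_vectors[0])
    m_y = mean_vector(all_vectors)          # mean of the entire set
    s_t = [[0.0] * dim for _ in range(dim)]
    s_w = [[0.0] * dim for _ in range(dim)]
    s_b = [[0.0] * dim for _ in range(dim)]
    for cls in classes:
        m_i = mean_vector(cls)              # mean of class i
        for y in cls:
            d_t = [a - b for a, b in zip(y, m_y)]
            d_w = [a - b for a, b in zip(y, m_i)]
            add_scaled(s_t, outer(d_t, d_t), 1.0 / n_total)
            add_scaled(s_w, outer(d_w, d_w), 1.0 / n_total)
        d_b = [a - b for a, b in zip(m_i, m_y)]
        add_scaled(s_b, outer(d_b, d_b), len(cls) / n_total)
    return s_t, s_w, s_b


# Toy data: two classes of 2-D eigenspace vectors (made up for illustration).
classes = [
    [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],
    [[6.0, 5.0], [7.0, 8.0]],
]
s_t, s_w, s_b = scatter_matrices(classes)
```

Because the decomposition holds for any data, a CST implementation only needs two of the three matrices; the third can be used as a consistency check.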
Chapter 3 Human Activity Recognition System
The first step of a human activity recognition system is object extraction, for which we have to construct a background model. There are many well-known background models. The most common approach applies frame differencing with a threshold. W4 [7] is a typical example, with some modifications. It records the maximum and minimum grayscale values and the maximum inter-frame difference of each pixel in a background video. Each image frame is then compared with the maximum and minimum grayscale values of each pixel. If the absolute value of the difference is larger than the maximum inter-frame difference, the pixel is classified as foreground. W4 also includes some rules that make the background model adaptive to a varying environment. In our approach, we describe the background scene as a statistical model. We obtain the background model from a pure background video by calculating the maximum and minimum gray levels and the frame ratio of each pixel in the images.
3.1 Background Modeling by Frame Ratio
Although foreground extraction based on frame differencing is the best-known method in image processing, its drawback is a lack of robustness to illumination changes. Even if we film a completely static environment, background modeling based on frame difference may still produce errors because of illumination changes. As a result, noise is detected and the quality of object extraction suffers.
We propose a method that uses the frame ratio instead of the frame difference and is robust to illumination changes. Smooth changes in illumination are not obvious, but illumination changes of longer duration still affect object extraction.
We assume the scene captured by a camera can be described as
$$I_i(x,y) = S_i(x,y)\, r_i(x,y), \tag{16}$$
where $I_i$ is the intensity of the scene, $S_i$ is the spatial distribution of source illumination, $r_i$ is the distribution of scene reflectance, $(x,y)$ is the location of a pixel in the image, and $i$ is the image sequence index. If the camera is stationary and no moving objects appear in the scene, the reflectance of the background remains the same at all times. That is,
$$r_i(x,y) = r(x,y). \tag{17}$$
Although the reflectance is unchanged, the illumination continues to vary.
If we film only the background and keep the camera stationary, the reflectance of the scene stays the same and only the illumination changes. With the frame difference approach, the result still depends on the reflectance. In the frame ratio approach, however, the influence of reflectance is eliminated. The frame ratio between two consecutive frames is written as
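$$\frac{I_i(x,y)}{I_{i-1}(x,y)} = \frac{S_i(x,y)\, r(x,y)}{S_{i-1}(x,y)\, r(x,y)} = \frac{S_i(x,y)}{S_{i-1}(x,y)}, \tag{18}$$
where $I$ is the intensity of the captured images and $S$ is the spatial distribution of source illumination.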
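The cancellation of reflectance in the frame ratio can be checked numerically. The sketch below assumes the multiplicative model of Eq. (16) with a spatially uniform illumination level and strictly positive pixel intensities (a zero pixel would need special handling before dividing); the function names and the numbers are made up for illustration.

```python
def render(illumination, reflectance):
    """I(x, y) = S * r(x, y), with a spatially uniform illumination S."""
    return [[illumination * r for r in row] for row in reflectance]


def frame_difference(curr, prev):
    """Pixel-wise difference between two frames of equal size."""
    return [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(curr, prev)]


def frame_ratio(curr, prev):
    """Pixel-wise ratio between two frames; assumes no zero pixels in prev."""
    return [[c / p for c, p in zip(rc, rp)] for rc, rp in zip(curr, prev)]


# A static background with different reflectance at each pixel.
reflectance = [[0.2, 0.5], [0.8, 0.4]]
frame_prev = render(100.0, reflectance)   # illumination level 100
frame_curr = render(110.0, reflectance)   # illumination rises by 10 %

diff = frame_difference(frame_curr, frame_prev)
ratio = frame_ratio(frame_curr, frame_prev)
```

In this toy case the difference image varies from pixel to pixel in proportion to the reflectance, whereas the ratio image is the same at every pixel (about 1.1), which is the behavior the frame ratio model relies on.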
We propose to utilize the frame ratio to build the background model. Each pixel of the background scene is characterized by three statistics of a background video: the minimum intensity value $n(x,y)$, the maximum intensity value $m(x,y)$, and the maximum inter-frame ratio $d(x,y)$. Because these three values are statistical, we need a background video without any moving objects for background model training. Let $I$ be an image frame sequence that contains $N$ consecutive images. Let $I_i(x,y)$ be the
3.2 Extraction of Foreground Object
Foreground objects can be segmented from every frame of the video stream.
Each pixel of the video frame is classified as either a background or a foreground pixel by the difference between the background model and the captured image frame.
We utilize the maximum intensity $m(x,y)$, minimum intensity $n(x,y)$, and maximum inter-frame ratio $d(x,y)$ of the trained background model to segment the foreground into a binary image $B$, where $B(x,y)$ is the gray level of a pixel in the binary image and $k$ is a threshold. $k$ is determined experimentally for different environments. The value of $k$ affects the amount of information retained in the binary image $B$.
According to the binary image $B$, we extract the region of the foreground object to minimize the image size. Foreground region extraction can be accomplished by simply applying a threshold to the histograms in the X and Y directions. Fig. 3.1 shows an example of foreground region extraction. We project the binary image onto the X and Y directions. The region of interest has higher counts in the histogram. We obtain the boundary coordinates $x_1$, $x_2$ on the X axis and $y_1$, $y_2$ on the Y axis from the projection histograms. We can use these boundary coordinates as the corners of a rectangle to extract the foreground region. Fig. 3.2 shows the extracted foreground region.
Fig. 3.1 Histogram of binary image projection in X and Y direction. (Panels: counts along the X axis with boundaries x1 and x2; counts along the Y axis with boundaries y1 and y2.)
Fig. 3.2 The binary image of extracted foreground region.
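The projection-based region extraction described above can be sketched as follows. This is a minimal illustration, assuming the binary image is stored as a list of rows with 1 for foreground and 0 for background; the function names and the `min_count` histogram threshold are hypothetical, since the text does not state how the threshold is chosen.

```python
def foreground_bounding_box(binary, min_count=1):
    """Return (x1, x2, y1, y2) of the foreground region, or None if empty.

    binary: list of rows, 1 = foreground pixel, 0 = background pixel.
    min_count: histogram threshold (hypothetical parameter); columns and
    rows with fewer foreground pixels than this are ignored.
    """
    col_hist = [sum(col) for col in zip(*binary)]  # projection onto X axis
    row_hist = [sum(row) for row in binary]        # projection onto Y axis
    xs = [x for x, n in enumerate(col_hist) if n >= min_count]
    ys = [y for y, n in enumerate(row_hist) if n >= min_count]
    if not xs or not ys:
        return None
    return xs[0], xs[-1], ys[0], ys[-1]


def crop(binary, box):
    """Cut the rectangle given by (x1, x2, y1, y2), bounds inclusive."""
    x1, x2, y1, y2 = box
    return [row[x1:x2 + 1] for row in binary[y1:y2 + 1]]


binary = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],  # the 1 in the last column is an isolated noise pixel
    [0, 0, 0, 0, 0, 0],
]
box = foreground_bounding_box(binary, min_count=2)
region = crop(binary, box)
```

With `min_count=2` the isolated noise pixel in the last column is ignored, at the cost of also trimming the thin part of the object in column 1; a lower threshold keeps more of the object but also more noise.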
3.3 Activity Template Selection
Two posture image frames differ little if they are captured within a short interval. Moreover, the human body, regarded as a rigid body, has a natural frequency; that is, the speed at which a specific action can be performed is limited.
Therefore, we select some key frames from a sequence to represent an activity.
Cameras usually capture image frames at a high frame rate. In our approach, we select one image frame, called the essential template image, at a fixed interval instead of using every image. An example is shown in Fig. 3.3. After the templates are determined, each activity is represented by several essential templates.
Fig. 3.3 One image frame is selected as template with an interval.
These essential templates are transformed to a new space by eigenspace transformation (EST) and canonical space transformation (CST). As described in Chapter 2, we use only the $k$ largest eigenvalues and their associated eigenvectors to approximate the image. The approximation reduces the data dimension, but it also loses some of the information that distinguishes images with only small differences. As a result, two similar image frames converge to two nearby points after eigenspace and canonical space transformation, and images of similar postures performed by different people also nearly converge to one point. Consequently, we select only essential templates rather than using all frames of a sequence for human activity recognition.
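Selecting essential templates at a fixed interval amounts to subsampling the frame sequence. The sketch below assumes the frames are held in a list; the function name, the `offset` parameter, and the interval of 5 are illustrative choices, not values taken from the thesis.

```python
def select_templates(frames, interval, offset=0):
    """Keep one frame every `interval` frames, starting at index `offset`."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return frames[offset::interval]


# Twelve placeholder frames standing in for captured images.
frames = [f"frame_{i:02d}" for i in range(12)]
templates = select_templates(frames, interval=5)
```

Here `templates` holds frames 0, 5, and 10, so an activity filmed over twelve frames would be represented by three essential templates.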
3.4 Construction of Fuzzy Rules from Video Streams
The transitional relationships of postures in a temporal sequence are important information for human activity classification. If we use only one image frame to classify an action, the classification may easily fail because two different activity sequences may contain similar postures. For example, the actions “jumping” and “crouching” share the same postures, called common states, as shown in Fig. 3.4.
Fig. 3.4 Common states of two different activities
Human activities are highly ambiguous, so we propose a method that not only combines temporal sequence information for recognition but also tolerates variations in actions performed by different people. Fuzzy rule-based classification can absorb differences in data through learning and has been used successfully in many applications for fusing the results of two classifiers. In our system, we view the transformation vector of each temporal image as a separate feature. A fuzzy rule-based approach has also been proposed for gesture recognition in [20].
In our approach, EST and CST are used to extract features. As described in Chapter 2, each image frame is transformed to a $(c-1)$-dimensional vector by EST and CST. Assume that there are $n$ training models and $c$ clusters in the system. We therefore have $N_t$ templates, where $N_t = n \times c$. Let $g_{i,j}$ be the vector of the template image of the $j$-th training model and the $i$-th category, and let $t_{i,j}$ be the transformed vector of $g_{i,j}$. $t_{i,j}$ is computed by
$$t_{i,j} = H \times g_{i,j}, \quad i = 1, 2, \ldots, c;\; j = 1, 2, \ldots, n, \tag{21}$$
where $H$ denotes the transformation matrix combining EST and CST and $n$ is the total number of posture images in the $i$-th cluster. $t_{i,j}$ is a $(c-1)$-dimensional vector, and its dimensions are assumed to be independent. Hence, $t_{i,j}$ is rewritten as
$$t_{i,j} = [t_{i,j}^{1}, t_{i,j}^{2}, \ldots, t_{i,j}^{c-1}]^T. \tag{22}$$
The transformation of each training model’s templates is treated as a mean vector. That is,
$$\mu_{i,j} = t_{i,j}, \tag{23}$$
where i is the number of template categories. The standard deviation vector of the m-th dimension is computed by
( )
We make use of the membership functions to represent the features’ possibility to each cluster. Many types of membership functions, e.g., bell-shaped, triangular, and trapezoid ones, are frequently used in a fuzzy system. We choose the Gaussian type membership function to represent the features because the Gaussian type membership function can reflect the similarity via the first order and second order statistics of clusters and is differentiable.
First, when the $k$-th training image frame $x_k$ is input, the feature vector $a_k$ is extracted by
$$a_k = H x_k.$$
If we assume the dimensions of the feature vectors are independent, a local measure of similarity between the training vector and each template vectors can be computed.
Let Σ denote the covariance matrix of all essential template vectors and Ci denote the
i-th class of essential templates. The membership function is given by
where m is the number of dimension and j is the training model number. ri,k denotes the grade of membership function in category i of the k-th image frame. Besides, we can obtain which category each image belongs to by
k
The membership function describes how strongly an image resembles each category, but it carries the information of only a single image in time. To include temporal information, we collect three images as the basis for classification. If we use too many images, the data may contain too many images of other activities.
If we use too few images, there may not be enough timing information to represent an activity.
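The membership computation and the assignment of an image to its most similar category can be sketched as below. The sketch assumes the standard product-of-Gaussians form that follows from treating the dimensions as independent, with a per-category mean vector and standard deviation vector; the thesis's exact membership formula is not reproduced here, and the function names, category labels, and numbers are hypothetical.

```python
import math


def gaussian_membership(a, mu, sigma):
    """Membership grade of feature vector a for one category.

    Assumes independent dimensions, so the grade is a product of
    one-dimensional Gaussians (written as a single exponential).
    """
    exponent = sum(((x - m) / s) ** 2 for x, m, s in zip(a, mu, sigma))
    return math.exp(-0.5 * exponent)


def classify(a, templates):
    """Return the best category label and all grades.

    templates: dict mapping category label -> (mean vector, std vector).
    """
    grades = {label: gaussian_membership(a, mu, sigma)
              for label, (mu, sigma) in templates.items()}
    best = max(grades, key=grades.get)
    return best, grades


# Two made-up posture categories in a 2-D canonical space.
templates = {
    "P1": ([0.0, 0.0], [1.0, 1.0]),
    "P2": ([3.0, 1.0], [0.5, 2.0]),
}
label, grades = classify([0.2, -0.1], templates)
```

A grade of 1 is reached only when the feature vector coincides with the category mean, and the grade decays smoothly with distance, which is what allows the later fuzzy rules to tolerate variation between performers.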
As developed by Wang and Mendel [19], fuzzy rules can be generated by learning from examples. Three contiguous images are combined as a group (I1, I2, I3) in our approach. We view the transformation of the three images as three features, and form a feature vector [a1, a2, a3]. An image sequence with feature vector [a1, a2, a3] is associated with its output of corresponding activity. Such image sequence constitutes an input-output pair to be learned in the fuzzy rule base. In this setting, the generated rules are a series of associations of the form
“IF antecedent conditions hold, THEN consequent conditions hold.” (27)
The number of antecedent conditions equals the number of features. Note that antecedent conditions are connected by “AND.” For illustrative purposes, assume that we have $c$ linguistic labels, each representing a category of essential templates. Each image in the video stream can be represented by these $c$ linguistic labels. Therefore, the feature vector $[a_1, a_2, a_3]$ can be described by $c^3$ combinations of linguistic labels. Each combination represents a possible transition of states across the three images. For example, an image sequence, with the transformations of Image 1, Image 2, and Image 3 and the category it belongs to concatenated in vector format, is given by
$$[a^1_1, a^1_2, a^1_3; D_1] \tag{29}$$
(Illustration: Image 1 + Image 2 + Image 3, labeled D1.)
where $a^1_1$, $a^1_2$, and $a^1_3$ denote the transformation vectors of Image 1, Image 2, and Image 3 by EST and CST, respectively, and $D_1$ is the corresponding activity category. Image 3 is the latest of the three images captured by the camera. The class of each image is obtained by Eqs. (27) and (28). Suppose that Image 1, Image 2, and Image 3 belong to category 1, category 2, and category 3, respectively. We therefore assign the image sequence whose feature vector is $[a^1_1, a^1_2, a^1_3]$ to the linguistic labels Posture 1, Posture 2, and Posture 3, respectively.
Finally, a rule is produced from the feature-target vector. Hence this image sequence supports the rule of
Rule 1. IF the activity’s I1 is P1 AND its I2 is P2 AND its I3 is P3, THEN the
activity is D1. (30)
where Ii is Image i and Pj is Posture j. Similarly, suppose we are given two more feature-target vectors of input activities
$$[a^2_1, a^2_2, a^2_3; D_2] \tag{31}$$
$$[a^3_1, a^3_2, a^3_3; D_3] \tag{32}$$
where $a^i_1$, $a^i_2$, and $a^i_3$ denote the transformation vectors of Image 1, Image 2, and Image 3 of the activity, respectively, and $D_i$ is the corresponding activity category.
$[a^2_1, a^2_2, a^2_3; D_2]$ can imply the rule of
Rule 2. IF the activity’s I1 is P18 AND its I2 is P20 AND its I3 is P21, THEN
the activity is D2. (33)
Similarly, $[a^3_1, a^3_2, a^3_3; D_3]$ can imply the rule of
Rule 3. IF the activity’s I1 is P2 AND its I2 is P20 AND its I3 is P21, THEN the
activity is D3. (34)
where Ii is Image i and Pj is Posture j.
In Eq. (33), the posture sequence does not follow the order we would observe, because it moves from Posture 18 directly to Posture 20 without passing through Posture 19. Nevertheless, our system is able to learn such hidden transitional modes of activities from the training data. This is an advantage of our system, and it also improves the classification accuracy. In Eq. (34), although Posture 2 is a posture of activity D1 rather than D3, the system still learns this sequence as activity D3. We regard Posture 2 as a common state of the two activities D3 and D1. The fuzzy rules therefore tolerate some ambiguous postures shared by different activities and classify the image sequence to an activity more accurately.
With a large number of training activities, some conflicting rules may be generated. Conflicting rules have the same antecedent conditions but lead to different consequent conditions. A set of antecedent conditions can be reflected by only one rule, so we have to choose one of the two or more conflicting rules from each qualified cluster. To this end, we choose the rule that is supported by the largest number of examples. Furthermore, to prune redundant or inefficient fuzzy rules, a rule whose number of supporting actions is less than a threshold is excluded from the rule base.
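The rule generation, conflict resolution, and pruning steps can be sketched as follows. The sketch assumes that every training image has already been assigned a crisp posture label (its category of highest membership) and that each training sequence comes with its activity label; the function name, the `min_support` parameter, and the toy data are hypothetical. Ties between conflicting rules are broken by first occurrence, which is an arbitrary choice made for this example.

```python
from collections import Counter, defaultdict


def generate_rules(sequences, min_support=1):
    """Learn IF-THEN rules from labeled posture sequences.

    sequences: iterable of (posture_labels, activity) pairs.
    Every window of three consecutive labels votes for the rule
    "IF I1, I2, I3 are these postures THEN the activity is `activity`".
    """
    votes = defaultdict(Counter)
    for labels, activity in sequences:
        for k in range(len(labels) - 2):
            antecedent = tuple(labels[k:k + 3])
            votes[antecedent][activity] += 1

    rules = {}
    for antecedent, counter in votes.items():
        # Conflict resolution: keep the consequent with the most examples.
        activity, support = counter.most_common(1)[0]
        # Pruning: drop rules supported by too few examples.
        if support >= min_support:
            rules[antecedent] = activity
    return rules


training = [
    (["P1", "P2", "P3", "P4"], "D1"),
    (["P1", "P2", "P3", "P5"], "D1"),
    (["P1", "P2", "P3", "P6"], "D2"),   # conflicts with the two rows above
    (["P18", "P20", "P21"], "D2"),
]
rules = generate_rules(training, min_support=2)
```

With `min_support=2`, only the antecedent (P1, P2, P3) survives and is mapped to D1, because D1 has two supporting examples against one for D2; lowering the threshold to 1 also keeps the rule learned from the (P18, P20, P21) sequence.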
3.5 Classification Algorithm
When a video stream is input for recognition, we first extract image frames from the video. The background model of Section 3.1 is then used to extract the foreground subject from the scene. The foreground object is a binary image. Suppose that we have a binary image $x'_k$, where $k$ is the index in the image sequence. The binary image $x'_k$ of the object needs to be normalized, and the mean vector in Eq. (3) needs to be subtracted from it, to obtain the standardized vector $x_k$. After the binary image of the object has been normalized, it is transformed by
$$s_k = H x_k. \tag{36}$$
If we assume each dimension of $s_k$ is independent, $s_k$ can be rewritten as
$$s_k = [s_k^{1}, s_k^{2}, \ldots, s_k^{c-1}]^T. \tag{37}$$
Let Σ denote the covariance matrix of all essential template vectors and Ci
denote the i-th class of essential templates. The membership function is given by
( )
where m is the number of dimension and j is the training model number. ri,k denotes the grade of membership function in category i of the k-th image frame. σ is the