Human action recognition based on graph-embedded spatio-temporal subspace


Chien-Chung Tseng a, Ju-Chin Chen b, Ching-Hsien Fang a, Jenn-Jier James Lien a,*

a Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 70101, Taiwan, ROC

b Department of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80778, Taiwan, ROC

Article info

Article history: Received 8 January 2011; received in revised form 17 February 2012; accepted 2 April 2012; available online 14 April 2012.

Keywords: Human action recognition; Adaptive locality preserving projection; Large margin nearest neighbor

Abstract

Human action recognition is an important issue in the pattern recognition field, with applications ranging from remote surveillance to the indexing of commercial video content. However, human actions are characterized by non-linear dynamics and are therefore not easily learned and recognized. Accordingly, this study proposes a silhouette-based human action recognition system in which a three-step procedure is used to construct an efficient discriminant spatio-temporal subspace for k-NN classification purposes. In the first step, an Adaptive Locality Preserving Projection (ALPP) method is proposed to obtain a low-dimensional spatial subspace in which the linearity in the local data structure is preserved. To resolve the problem of overlaps in the spatial subspace resulting from the ambiguity of the human body shape among different action classes, temporal data are extracted using a Non-base Central-Difference Action Vector (NCDAV) method. Finally, the Large Margin Nearest Neighbor (LMNN) metric learning method is applied to construct an efficient spatio-temporal subspace for classification purposes. The experimental results show that the proposed system accurately recognizes a variety of human actions in real time and outperforms most existing methods. In addition, a robustness test with noisy data indicates that our system is remarkably robust toward noise in the input images.

1. Introduction

Human action recognition has attracted significant interest in the computer vision community in recent decades and has spurred the development of a wide variety of applications, including video surveillance, human–computer interaction, and the analysis of sporting events. However, automatic human action recognition is highly challenging due to the non-stationary background of most video content, the ambiguity of the human body shape among different actions, and the existence of intra-class variations in the appearance, physical characteristics, and motion style of different human subjects.

One of the most important issues in realizing human action recognition systems is how to extract discriminant features from a video sequence. A suitable feature extraction and selection method can help a system improve its recognition performance [1]. Moreover, feature extraction is important not only for human action recognition but also for face recognition and for age and gender classification [1–3]. In recent years, sparse representation features selected via an optimization process have attracted considerable attention because they can provide promising recognition performance for faces under occlusion or other variations [2]. In addition, joint boosting [1] and the difference-of-Gaussian filter followed by the Radon transform [3] are powerful approaches for feature selection. According to [4,5], most existing feature-based human action recognition systems are based on optical flow [6,7], space–time gradients [8,9], feature tracking models [10–15], or sparse spatio-temporal interest points [16–21]. However, space–time gradient methods and feature tracking models are highly sensitive to the quality of the input video and to variations in the articulation of the human body or lighting conditions, respectively. Furthermore, the performance of recognition systems based on sparse interest points is inevitably limited due to the loss of global structure information [22]. Accordingly, the feasibility of performing human action recognition based on the human silhouette has attracted increasing interest in recent years [4,5,23–28]. Compared to the feature extraction methods proposed in [6–21], silhouette-based methods enable the construction of a sequence of space–time patterns that encode not only the spatial information of the body shape but also the temporal information of the global and local body parts [5].

In a video sequence containing human actions, the human silhouette in each frame can be represented by a vector in a high-dimensional space and is expected intrinsically to lie in a low-dimensional space embedded within this high-dimensional space [4]. Manifold learning methods, such as Isometric Feature Mapping (Isomap) [29], Locally Linear Embedding (LLE) [30], or Laplacian Eigenmaps (LE) [31], provide the means to identify the intrinsic geometrical structure of a database and thus facilitate the analysis of human action motions in a compact low-dimensional space. For example, Elgammal and Lee [32] utilized the LLE method to infer 3D body poses from human silhouettes. Similarly, Wang and Suter [5] used a linear approximation to the LE method, referred to as the Locality Preserving Projection (LPP) method [33], to establish a low-dimensional feature representation for human silhouettes. LPP can extract the low-dimensional features of human silhouettes as a manifold by preserving both the intrinsic geometry and the local structure of the data via an adjacency undirected graph that incorporates the neighborhood information of the database [33]. However, LPP lacks clear rules for preserving linearity, and the rules for building the adjacency graph are not strict enough. As a result, the subspace obtained by LPP might not be compact; i.e., a data point in the subspace may be neighbored with other data points that are unrelated or dissimilar to it.

Pattern Recognition. Journal homepage: www.elsevier.com/locate/pr

* Corresponding author. Tel.: +886 6 2757575; fax: +886 6 2747076. E-mail address: [email protected] (J.-J. James Lien).

Several supervised manifold learning methods based on class label information have been proposed in recent years, including Marginal Fisher Analysis (MFA) [9], supervised LPP [5], and Locality Sensitive Discriminant Analysis (LSDA) [34]. The class label information makes it possible to discover the local spatial discriminant structure and therefore enables the separation of images belonging to different action classes. However, in addition to spatial information (e.g., silhouette shape), temporal information, such as the dynamic variation of the silhouette shape over a sequence of video frames, is also helpful in building reliable human action recognition systems. Accordingly, various researchers have incorporated temporal information into the action recognition process. For example, Jia and Yeung [4] proposed a local spatio-temporal discriminant embedding (LSTDE) method in which temporal subspaces associated with the data points in consecutive frames were constructed in such a way as to maximize both the discriminant structure in accordance with the class labels and the principal angles among the temporal subspaces of the different classes. Meanwhile, Wang and Suter [5] modeled the temporal evolution of an action motion as a sequence of projection points with associated temporal orders and used a hidden Markov model (HMM) to capture the structural and dynamic nature of the corresponding motion.

Broadly speaking, the methods for human action recognition can be divided into two approaches: silhouette-based and feature-based. The former utilizes human silhouettes as features, which are commonly extracted by background subtraction [4,5,25]. The latter adopts motion or interest information from human movement, such as optical flow or interest points [6,12,21]. Generally, feature-based methods do not use background subtraction during preprocessing.

The objective of this study is to design a silhouette-based human action recognition system that can be integrated into an ordinary visual surveillance system with real-time moving object detection, classification, and activity analysis capabilities. This system applies a three-step procedure to learn a discriminant spatio-temporal subspace for classification purposes. In the first step, a dimensionality reduction method designated as the Adaptive Locality Preserving Projection (ALPP) method is used to construct a compact spatial subspace. ALPP applies a modified graph construction process and a linearity measurement mechanism, and it preserves the linearity in the local structure information while simultaneously reducing the dimensionality of the original database. Although ALPP preserves the linearity and local structure information well, the ambiguity of the human body shape among different action classes still results in an overlap of the silhouette information within the spatial subspace. Accordingly, in the second step of the proposed system, a Non-base Central-Difference Action Vector (NCDAV) method is used to extract the temporal data from the reduced spatial subspace and to characterize the motion information in a temporal vector. It should be noted that NCDAV encodes the difference information between each consecutive frame with the base data, but the base data are discarded from the temporal vector; otherwise, a temporal vector containing base data would result in overlapping distributions between different action types in the subspace. Finally, the Large Margin Nearest Neighbor (LMNN) metric learning method [22] is applied in the third step to construct a discriminant spatio-temporal subspace in which the temporal vectors belonging to the same action class are clustered together while those associated with different action classes are separated by a margin. Having established the spatio-temporal subspace, human action recognition is achieved by utilizing a k-NN classifier to determine the action class of each input frame, and a majority voting mechanism is then applied to identify the action class of the entire input sequence.
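The final recognition stage described above (frame-wise k-NN followed by majority voting) can be illustrated with a minimal sketch. It assumes that the vectors have already been mapped into the learned subspace, so that plain Euclidean distance is appropriate; the function names, the value k = 3, and the toy 2-D data are hypothetical and are not taken from the paper.

```python
from collections import Counter
import math


def knn_label(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors nearest to query."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: math.dist(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def classify_sequence(frames, train_vectors, train_labels, k=3):
    """Label every frame with k-NN, then take a majority vote over the sequence."""
    frame_labels = [knn_label(f, train_vectors, train_labels, k) for f in frames]
    sequence_label = Counter(frame_labels).most_common(1)[0][0]
    return sequence_label, frame_labels


# Made-up 2-D vectors standing in for points in the learned subspace.
train_vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                 (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
train_labels = ["bend", "bend", "bend", "walk", "walk", "walk"]
frames = [(0.05, 0.1), (0.15, 0.15), (0.95, 1.0)]

sequence_label, frame_labels = classify_sequence(frames, train_vectors, train_labels)
print(frame_labels, "->", sequence_label)
```

In this toy run, one frame is closer to the "walk" samples, but the vote over the whole sequence still yields "bend", which is the effect the majority voting mechanism is intended to provide.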

In summary, this study makes three contributions. The first is the proposed feature extraction method, Adaptive Locality Preserving Projection (ALPP), which modifies the construction of the adjacency graph to solve the problems of LPP. The second is the temporal information extraction method, Non-base Central-Difference Action Vector (NCDAV). The designed temporal vector reduces the corrupting effect of the base data and thus reduces the system degradation caused by ambiguity and noise. The third is the proposed framework itself, which extracts discriminant features for action recognition. Owing to the properties of the three approaches, the proposed system is able not only to recognize human actions in real time but also to tolerate considerable noise.

2. Learning process in a spatio-temporal subspace

Fig. 1 shows the overall framework of the proposed human action recognition system. As shown, the system comprises a learning process (see Fig. 1(a)) and a recognition process (see Fig. 1(b)). Assume that there are $M$ training sequences $X = [X_1, X_2, \ldots, X_M]$ and that each sequence $X_i$ comprises $n_i$ frames. The learning process commences by extracting the human silhouettes from the training sequence frames using a background subtraction method. To reduce intra-class variations in the subject size, the silhouettes are centralized and normalized to a consistent size of $w \times h$ pixels. Therefore, each silhouette $x_i$ can be represented by a $D$-dimensional vector ($D = w \times h$), and the training set has the form $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$, where $N = \sum_{i=1}^{M} n_i$.
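As a minimal sketch of how such a training set could be assembled, the following code flattens already normalized binary silhouettes and stacks them as the columns of a D x N matrix. The helper names and the tiny 2 x 3 silhouettes are hypothetical; background subtraction, centering, and size normalization are assumed to have been done beforehand.

```python
def flatten(silhouette):
    """Flatten a normalized binary silhouette (list of pixel rows) into a D-vector."""
    return [pixel for row in silhouette for pixel in row]


def build_training_set(sequences):
    """Stack all frames of all sequences as the columns of a D x N matrix."""
    vectors = [flatten(frame) for seq in sequences for frame in seq]
    D = len(vectors[0])
    if any(len(v) != D for v in vectors):
        raise ValueError("all silhouettes must be normalized to the same size")
    # X[d][n] is pixel d of silhouette n, so X has D rows and N columns.
    return [[v[d] for v in vectors] for d in range(D)]


# Two made-up sequences of 2 x 3 silhouettes: n_1 = 2 frames and n_2 = 1 frame.
seq1 = [[[0, 1, 0], [1, 1, 1]],
        [[0, 1, 0], [0, 1, 0]]]
seq2 = [[[1, 1, 0], [0, 1, 1]]]

X = build_training_set([seq1, seq2])
print(len(X), "x", len(X[0]))  # D x N, with N = n_1 + n_2
```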

Having constructed the training set, a three-step procedure is applied to analyze the spatial and temporal information of the silhouettes in the training sequences and to construct a spatio-temporal subspace for classification purposes. Section 2.1 describes the ALPP algorithm, which is used to reduce the dimensionality of the original spatial subspace while preserving the linearity in the local structure information. Section 2.2 describes the use of the NCDAV method for extracting temporal information from the training sequences and constructing the corresponding temporal vector. Finally, Section 2.3 describes the LMNN method, which is used to construct a discriminant spatio-temporal subspace for classification purposes.

2.1. Adaptive Locality Preserving Projection (ALPP) for dimensionality reduction

Human action sequences, as represented by a contiguous series of human silhouettes, can be viewed as a set of data points on non-linear manifolds in high-dimensional space. To reduce the computational cost of the learning process, it is desirable to eliminate the redundant information within the original sequences in order to obtain a low-dimensional spatial subspace. However, in constructing this subspace, the local spatial structure and relations among the data points must be preserved. That is, data points (images) that are close (similar) in the original high-dimensional space should also be close in the low-dimensional subspace. In the present study, this is achieved by using a new Adaptive Locality Preserving Projection (ALPP) algorithm. Importantly, ALPP retains the well-known advantages of the Locality Preserving Projection (LPP) method [33]. In other words, it requires less computation than methods such as LE or LLE [30,31], which utilize a non-linear spectral embedding technique. In addition, a linear transformation enables LPP to provide a low-dimensional embedding for new data points without recomputing the entire matrix from scratch. Moreover, through its modified graph construction process and linearity measurement, ALPP solves a problem of LPP that is discussed in the following paragraph.

In LPP, the k-NN method is used to define the neighborhood information in high-dimensional space before constructing the corresponding graph. When applying the k-NN method, four possible relationships exist between any pair of points (see Fig. 2(a), in which the blue edge indicates that the red point is a neighbor of the blue point, while the red edge indicates that the blue point is a neighbor of the red point). For example, consider the case shown in Fig. 3(a), in which a 5-NN clustering scheme is used to construct the graph. Because k is specified as 5, the clustering scheme attempts to group each data point with five neighbors. Consider the red data point shown on the left of Fig. 3(a). This data point has only four nearby neighbors, and thus, the clustering scheme is forced to add a remote data point (i.e., the red data point shown on the right of Fig. 3(a) is added to the group of neighbors, as indicated by the red edge between the two points). As shown in Fig. 3(a), the graph comprises two distinct groups of images; i.e., they are unrelated or dissimilar to one another. As a result, the edge constructed between them may cause an inappropriate congregation of the two different groups following the dimensionality reduction process (see the lower graph in Fig. 3(a)).

Fig. 1. Framework of the proposed human action recognition system: (a) the learning process, which generates the spatio-temporal subspace for classification; (b) the recognition process, which uses the k-NN method in the LMNN spatio-temporal subspace to recognize the human action.

Fig. 2. Four situations of graph construction in (a) LPP (traditional rules for graph construction) and (b) ALPP (new bi-relation rules for graph construction). (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

The modiﬁed process of building the adjacency undirected graph in ALPP comprises two major components, namely, graph construction and linearity measurement. The goal of the graph construction process is to enforce stricter connection rules in order to discard the neighbors with low connection relations. Meanwhile, the goal of the linearity measurement mechanism is to evaluate the linearity between every pair of points in order to connect data points that are not neighbors but have linearity strong enough to have a corresponding edge in the adjacency undirected graph. In this way, similar data points are grouped together in the low-dimensional subspace, thereby improving its discriminatory power and reducing its dimensionality.

In the process of graph construction, the discriminatory power of the graph is improved by imposing new bi-relation connection rules between each pair of nodes, as shown in Fig. 2(b). As in the LPP method, ALPP recognizes four possible relationships between each pair of nodes when applying the k-NN clustering method. However, in contrast to LPP, ALPP constructs an edge in the adjacency undirected graph only when both nodes in the pair are neighbors of one another. As a result, ALPP avoids the problem inherent in LPP of retaining redundant data points simply to satisfy the requirement for a given number of neighbors, and it makes the distribution of the same group more discriminant after dimensionality reduction (see the lower graph in Fig. 3(b)).
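The difference between the two connection rules can be illustrated with a small sketch. It assumes that the traditional rule connects a pair when either point lists the other among its k nearest neighbors, whereas the bi-relation rule requires both; the function names and the toy 2-D points are hypothetical and only mimic the situation in Fig. 3, where a small group is forced to reach for a remote neighbor.

```python
import math


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def build_edges(points, k, mutual):
    """Undirected edges (i, j), i < j, under the either-way or both-ways rule."""
    nn = knn_indices(points, k)
    edges = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            i_to_j, j_to_i = j in nn[i], i in nn[j]
            keep = (i_to_j and j_to_i) if mutual else (i_to_j or j_to_i)
            if keep:
                edges.add((i, j))
    return edges


# Made-up data: a group of three points and a distant group of four points.
points = [(0, 0), (1, 0), (0, 1),
          (10, 0), (11, 0), (10, 1), (11, 1)]

traditional = build_edges(points, k=3, mutual=False)
bi_relation = build_edges(points, k=3, mutual=True)
print(sorted(traditional - bi_relation))  # edges discarded by the stricter rule
```

With k = 3, each point of the three-point group has only two nearby neighbors, so the either-way rule links it to the remote group; the both-ways rule drops exactly those links and leaves the two groups disconnected.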

Following the graph construction process, the linearity measurement mechanism [35] is applied to calculate the linearity between each pair of points with a connected path. Thus, data with high linearity are grouped together, and the low-dimensional spatial subspace thereby becomes more compact. The linearity measurement mechanism involves two terms: the Euclidean distance and the geodesic distance. Consider the graph shown in Fig. 4(a), which includes four points and has a path between points A and B. Fig. 4(b) illustrates the Euclidean distance and the geodesic distance between points A and B. As shown, the Euclidean distance is the straight-line distance between the two points, while the geodesic distance is the length of the shortest path between the two points along the graph. In accordance with [35], the linearity $l_{ij}$ between two data points, $x_i$ and $x_j$, can be evaluated as

$$l_{ij} = \frac{\text{Geodesic distance}(x_i, x_j)}{\text{Euclidean distance}(x_i, x_j)}. \qquad (1)$$

Because the geodesic distance is always equal to or greater than the Euclidean distance, $l_{ij}$ is always equal to or greater than 1. Clearly, the closer the geodesic distance and Euclidean distance are to one another, the greater the linearity is between the two data points. Thus, in the limiting case of $l_{ij} = 1$, the two data points are totally linear. In ALPP, $l_{ij}$ is computed for every possible pair of data points, and an edge is added to the adjacency graph constructed in the previous step for any pair whose linearity value is less than a linearity measurement threshold $l_t$ ($l_t = 1.1$, which was determined experimentally). For example, assume that the original ALPP graph contains four data points, A–D, and that each edge in the graph indicates that the corresponding data points are both neighbors of one another (see Fig. 5). Furthermore, assume that the linearity $l_{AC}$ between points A and C is found to have a value of less than $l_t$. Consequently, an edge is constructed between them (see the right graph in Fig. 5). In contrast, data points A and D have only a weak linearity (i.e., $l_{AD} > l_t$), and thus, no edge is added between them. Fig. 6 shows the adjacency undirected graphs of two action sequences, bend (orange edge) and walk (green edge), using the traditional rules applied in the LPP method, the new bi-relation rules, and the new bi-relation rules together with the linearity measurement, respectively. The red arrows in Fig. 6(a) indicate that a data point in the graph may have wrong connections with other data points that are far from it. In contrast, the proposed method can reduce the wrong connections to preserve the local structure of the data set (see Fig. 6(c)).

Fig. 3. Graph construction with 5-NN, before and after dimensionality reduction. (a) The problem of LPP: an improper edge connecting unrelated points may cause the congregation of two different groups and a classification error after the dimensionality reduction. (b) An example of ALPP, which uses a stricter connection rule to avoid the mistake of keeping unimportant data nearby and makes the distribution of the same group more discriminant after the dimensionality reduction. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

Fig. 4. (a) An example graph constructed by ALPP. (b) The geodesic distance and the Euclidean distance between points A and B.

Fig. 5. An example of linearity measurement. Data point C is not a neighbor of A, but its linearity is strong enough ($l_{AC} = 1.05$), so an edge is added between them. Data point D is not a neighbor of A and its linearity is not strong enough ($l_{AD} = 1.3$), so no edge is added between them.
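The linearity measurement of Eq. (1) and the thresholding step can be sketched as follows. The sketch assumes that the geodesic distance is the shortest-path length over the existing graph edges, each weighted by its Euclidean length, and that all points are distinct; the function names and the four toy points (chosen to resemble the A–D example) are hypothetical.

```python
import heapq
import math


def geodesic_distances(points, edges, source):
    """Dijkstra shortest-path lengths from source along Euclidean-weighted edges."""
    adjacency = {i: [] for i in range(len(points))}
    for i, j in edges:
        w = math.dist(points[i], points[j])
        adjacency[i].append((j, w))
        adjacency[j].append((i, w))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adjacency[u]:
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist  # points with no connected path are absent


def add_linear_edges(points, edges, l_t=1.1):
    """Add an edge for each connected, non-adjacent pair with linearity below l_t."""
    result = set(edges)
    for i in range(len(points)):
        geo = geodesic_distances(points, edges, i)
        for j, g in geo.items():
            if j > i and (i, j) not in edges:
                l_ij = g / math.dist(points[i], points[j])  # Eq. (1)
                if l_ij < l_t:
                    result.add((i, j))
    return result


# Made-up points A, B, C, D (indices 0..3) joined in a chain A-B-C-D.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.2), (2.0, 1.5)]
edges = {(0, 1), (1, 2), (2, 3)}

print(sorted(add_linear_edges(points, edges)))
```

Here A, B, and C are nearly collinear, so the pair (A, C) falls below the threshold and gains an edge, whereas the path to D bends sharply and the pairs (A, D) and (B, D) remain unconnected.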

Fig. 6. The adjacency undirected graphs of the two action sequences bend (orange edges) and walk (green edges): (a) using the traditional rules, (b) using only the new bi-relation rules, and (c) using both the new bi-relation rules and the linearity measurement. The red arrows indicate the difference between the adjacency undirected graphs obtained using the traditional and the proposed methods. The panels are annotated with the number of connected paths (848, 716, and 1227 in one row; 958, 714, and 1392 in the other) and with the parameters k = 5 and, where the linearity measurement is used, $l_{ij} = 1.3$. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

After the graph construction and linearity measurement processes, a weight matrix $W$ is obtained whose elements $W_{ij}$ are either 1 or 0, depending on the connection between $x_i$ and $x_j$. Note that $W$ describes both the linearity and the relationship between every pair of points. For example, $W_{ij} = 1$ implies a strong linearity and a relationship between points $x_i$ and $x_j$, whereas $W_{ij} = 0$ indicates that the two data points are unrelated to one another. In other words, the value of $W_{ij}$ is assigned as follows:

$$W_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ are connected,} \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

According to the locality-preserving criterion, which ensures that those points that have sufficient linearity (i.e., $l_{ij} < l_t$) in the original high-dimensional space are grouped together in the low-dimensional subspace [33,35], the weight matrix $W$, which records the local structure or linearity between each pair of data points, is added to the objective function. There exists a transformation matrix $A$ that minimizes the following objective function:

$$\arg\min \sum_{ij} (y_i - y_j)^2 W_{ij} = \arg\min_{A} \sum_{ij} (A^T x_i - A^T x_j)^2 W_{ij}, \qquad (3)$$

where $x_i$ is the original data point and $y_i$ is the corresponding data point in the low-dimensional subspace. Because each transformation vector in $A$ can work independently, Eq. (3) can be rewritten as follows:

$$\arg\min_{a} \sum_{ij} (a^T x_i - a^T x_j)^2 W_{ij}, \qquad (4)$$

where $a$ is the transformation vector. After a process of algebraic manipulation [15], Eq. (4) can be reformulated as follows:

$$\arg\min_{a} a^T X L X^T a, \qquad (5)$$

where $L$ is the Laplacian matrix; $D$ is the diagonal matrix, in which $D_{ii} = \sum_j W_{ij}$; and $L = D - W$. The value of $D_{ii}$ in $D$ indicates the number of neighbors of data point $x_i$. In other words, the more neighbors $x_i$ has, the larger the value of $D_{ii}$ and the greater the importance of $x_i$. As described in [33], the following constraint is imposed on the objective function given in Eq. (5):

$$a^T X D X^T a = 1. \qquad (6)$$

This constraint not only causes the data point with the largest value of $D_{ii}$ to be located close to the origin of the low-dimensional subspace but also restrains the distribution of all of the remaining data points. Combining Eqs. (5) and (6), the optimization problem becomes

$$\arg\min_{a} a^T X L X^T a \quad \text{s.t.} \quad a^T X D X^T a = 1. \qquad (7)$$

Then, Eq. (7) can be solved via the Lagrangian formulation as follows:

$$\text{Lagrangian} = a^T X L X^T a - \lambda\, a^T X D X^T a \;\Rightarrow\; \frac{\partial}{\partial a}\left(a^T X L X^T a - \lambda\, a^T X D X^T a\right) = 0 \;\Rightarrow\; X L X^T a = \lambda X D X^T a, \qquad (8)$$

where the two matrices $X L X^T$ and $X D X^T$ are both symmetric and positive semi-definite. Eq. (8) is a generalized eigen-decomposition problem [4,5,30–33,35], and the transformation matrix $A = [a_1, a_2, \ldots, a_d] \in \mathbb{R}^{D \times d}$ is given by the eigenvectors corresponding to the $d$ smallest eigenvalues. Thus, the data in the low-dimensional subspace can be obtained as $Y = A^T X$, where $Y = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{d \times N}$. Fig. 7(a) and (b) show the distribution of $Y$ in the LPP and ALPP subspaces, respectively. As the diagrams indicate, the distribution of $Y$ in the ALPP subspace is more compact than its distribution in the LPP subspace. In addition, as shown in Fig. 7, the continuity of action in the ALPP subspace is smoother than that in the LPP subspace (i.e., the images that are close (similar) in the original high-dimensional space are also close in the low-dimensional subspace).
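The step from Eq. (4) to Eq. (5) rests on the identity $\sum_{ij} (a^T x_i - a^T x_j)^2 W_{ij} = 2\, a^T X L X^T a$ for a symmetric $W$; the constant factor 2 does not change the minimizer. The following sketch checks this numerically on a tiny example. The matrices, the vector a, and the helper names are made up for illustration, and the generalized eigenproblem of Eq. (8) itself is not solved here, since that would normally be delegated to a linear algebra library.

```python
def degree_and_laplacian(W):
    """Return the diagonal degree matrix D (D_ii = sum_j W_ij) and L = D - W."""
    n = len(W)
    D = [[sum(W[i]) if i == j else 0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]
    return D, L


def project(a, X):
    """Return y_i = a^T x_i for every column x_i of the D x N matrix X."""
    return [sum(a[d] * X[d][i] for d in range(len(a))) for i in range(len(X[0]))]


def pairwise_objective(a, X, W):
    """Left-hand side: sum over i, j of (a^T x_i - a^T x_j)^2 W_ij, as in Eq. (4)."""
    y = project(a, X)
    n = len(y)
    return sum((y[i] - y[j]) ** 2 * W[i][j] for i in range(n) for j in range(n))


def quadratic_form(a, X, M):
    """a^T X M X^T a for an N x N matrix M, as in Eqs. (5) and (6)."""
    y = project(a, X)
    n = len(y)
    return sum(y[i] * M[i][j] * y[j] for i in range(n) for j in range(n))


# Made-up example: four 2-D data points (columns of X) and a symmetric 0/1 graph.
X = [[0, 1, 2, 2],
     [0, 0, 1, 3]]
W = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
a = [1, -2]

D, L = degree_and_laplacian(W)
print(pairwise_objective(a, X, W), 2 * quadratic_form(a, X, L))
```

Integer data are used so that the two sides can be compared exactly; with real-valued silhouettes the comparison would need a small tolerance.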

2.2. Temporal vector creation

After obtaining the spatial subspace by ALPP, all of the training silhouettes are projected in this subspace, and thus, the

Fig. 7. 2D distribution of 10 action sequences (bend, jack, jump, pjump, run, side, skip, walk, wave1, and wave2) in (a) the LPP subspace (k = 5) and (b) the ALPP subspace (k = 5, $l_{ij} = 1.1$); (c) the 2D distribution of the wave1 action sequence in LPP and ALPP, respectively. In addition, the smooth change of the silhouette images and the compactness of the distribution indicate that the ALPP subspace better preserves the local structure and linearity of the data points. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

References

[1] R. Xiao, W. Li, Y. Tian, X. Tang, Joint boosting feature selection for robust face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 1415–1422.

[2] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Yi Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 210–227.

[3] H. Zhou, P. Miller, J. Zhang, Age classification using radon transform and entropy based scaling SVM, in: Proceedings of the British Machine Vision Conference, 2011, pp. 28.1–28.12.

[4] L.K. Jia, D.Y. Yeung, Human action recognition using local spatio-temporal discriminant embedding, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

[5] L. Wang, D. Suter, Visual learning and recognition of sequential data manifolds with applications to human movement analysis, Computer Vision and Image Understanding 110 (2) (2008) 152–172.

[6] R. Chaudhry, A. Ravichandran, G. Hager, R. Vidal, Histograms of oriented optical flow and Binet–Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1932–1939.

[7] A. Efros, A. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: IEEE International Conference on Computer Vision, vol. 2, 2003, pp. 726–733.

[8] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, in: IEEE International Conference on Pattern Recognition, vol. 3, 2004, pp. 32–36.

[9] S. Yan, D. Xu, B. Zhang, H.J. Zhang, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1) (2007) 40–51.

[10] A. Bissacco, A. Chiuso, Y. Ma, S. Soatto, Recognition of human gaits, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2001, pp. 52–57.

[11] C. Bregler, Learning and recognizing human dynamics in video sequences, in: IEEE Conference on Computer Vision and Pattern Recognition, 1997, pp. 568–574.

[12] J.C. Niebles, F.F. Li, A hierarchical model of shape and appearance for human action classification, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 99, 2007, pp. 1–8.

[13] L. Wang, H.Z. Ning, T.N. Tan, W.M. Hu, Fusion of static and dynamic body biometrics for gait recognition, in: IEEE International Conference on Computer Vision, 2003, pp. 1449–1454.

[14] Y. Wang, G. Mori, Max-margin hidden conditional random fields for human action recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 872–879.

[15] Y. Yacoob, M.J. Black, Parameterized modeling and recognition of activities, Computer Vision and Image Understanding 73 (2) (1999) 232–247.

[16] M. Bregonzio, S. Gong, T. Xiang, Recognising action as clouds of space–time interest points, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1948–1955.

[17] P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: IEEE International Conference on Computer Vision, 2005, pp. 65–72.

[18] Y. Ke, R. Sukthankar, M. Hebert, Efficient visual event detection using volumetric features, in: IEEE International Conference on Computer Vision, 2005, pp. 166–173.

[19] I. Laptev, On space–time interest points, International Journal of Computer Vision 64 (2–3) (2005) 107–123.

[20] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

[21] W.T. Lee, H.T. Chen, Histogram-based interest point detectors, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1590–1596.

[22] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (2009) 209–244.

[23] A. Bobick, J. Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3) (2001) 257–267.

[24] L. Gorelick, M. Blank, E. Shechtman, M. Irani, R. Basri, Action as space–time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (12) (2007) 2247–2253.

[25] R. Poppe, M. Poel, Discriminative human action recognition using pairwise CSP classifiers, in: IEEE International Conference on Automatic Face and Gesture Recognition, 2008, pp. 1–6.

[26] L. Wang, D. Suter, Recognizing human activities from silhouettes: motion subspace and factorial discriminative graphical model, in: IEEE International Conference on Pattern Recognition, 2007, pp. 1–8.

[27] L. Wang, D. Suter, Learning and matching of dynamic shape manifolds for human action recognition, IEEE Transactions on Image Processing 16 (6) (2007) 1646–1661.

[28] D. Weinland, E. Boyer, Action recognition using exemplar-based embedding, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.

[29] J.B. Tenenbaum, V.D. Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.

[30] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.

[31] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Advances in Neural Information Processing Systems 14 (2002) 585–591.

[32] A. Elgammal, C.S. Lee, Inferring 3D body pose from silhouettes using activity manifold learning, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 681–688.

[33] X. He, P. Niyogi, Locality preserving projections, Advances in Neural Information Processing Systems 16 (2003) 152–160.

[34] D. Cai, X. He, K. Zhou, J. Han, H. Bao, Locality sensitive discriminant analysis, in: International Joint Conferences on Artificial Intelligence, 2007, pp. 708–713.

[35] R. Wang, X. Chen, Manifold discriminant analysis, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 429–436.

[36] http://www.cam-orl.co.uk/facedatabase.html.

[37] http://yann.lecun.com/exdb/mnist/index.html.

[38] M. Turk, A. Pentland, Face recognition using eigenfaces, in: IEEE International Conference on Pattern Recognition, 1991, pp. 586–591.

[39] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 711–720.

[40] X. Wu, Y. Jia, W. Liang, Incremental discriminant-analysis of canonical correlations for action recognition, Pattern Recognition 43 (12) (2010) 4190–4197.

[41] J. Zhang, S. Gong, Action categorization with modified hidden conditional random field, Pattern Recognition 43 (1) (2010) 197–203.

[42] R. Filipovych, E. Ribeiro, Learning human motion models from unsegmented videos, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.

[43] S. Ali, A. Basharat, M. Shah, Chaotic invariants for human action recognition, in: IEEE International Conference on Computer Vision, 2007, pp. 1–8.

Chien-Chung Tseng received his B.S. degree in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, in 2004. He is now a Ph.D. candidate in computer science and information engineering at the National Cheng Kung University, Tainan, Taiwan. In addition to his current research into license plate recognition and human action recognition, his interests lie in automatic caricature generation, event classification, computer vision, and pattern recognition.

Ju-Chin Chen received her B.S., M.S. and Ph.D. degrees in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, in 2002, 2004 and 2010, respectively. She is now an assistant professor in the Department of Computer Science and Information Engineering at the National Kaohsiung University of Applied Sciences, Taiwan. Her research interests lie in the fields of machine learning, computer vision and pattern recognition.

Ching-Hsien Fang received his B.S. degree in computer science and information engineering from the National Tsing Hua University, Hsinchu, Taiwan, in 2008. He then received his M.S. degree in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, in 2010. He is now an engineer at Cyberlink, which advances and innovates video and audio technology for people's enjoyment. In addition to his current research into automatic face detection, tracking and recognition, his interests lie in the fields of digital image processing, computer vision and pattern recognition.

Jenn-Jier James Lien (M’00) received his M.S. and Ph.D. degrees in electrical engineering from Washington University, St. Louis, MO, and the University of Pittsburgh, Pittsburgh, PA, in 1993 and 1998, respectively. From 1995 to 1998, he was a research assistant at the Vision Autonomous Systems Center in the Robotics Institute at Carnegie Mellon University, Pittsburgh, PA. From 1999 to 2002, he was a senior research scientist at L1-Identity (formerly Visionics) and a project lead for the DARPA surveillance project. He is now an associate professor in the Department of Computer Science and Information Engineering at the National Cheng Kung University, Taiwan.