A Language Modeling Approach to Atomic Human Action Recognition

Yu-Ming Liang(a), Sheng-Wen Shih(b), Arthur Chun-Chieh Shih(c), Hong-Yuan Mark Liao(a,c), and Cheng-Chung Lin(a)

(a) Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
(b) Department of Computer Science and Information Engineering, National Chi Nan University, Nantou, Taiwan
(c) Institute of Information Science, Academia Sinica, Taipei, Taiwan
*liao@iis.sinica.edu.tw
Abstract—Visual analysis of human behavior has generated considerable interest in the field of computer vision because it has a wide spectrum of potential applications. Atomic human action recognition is an important part of a human behavior analysis system. In this paper, we propose a language modeling framework for this task. The framework is comprised of two modules: a posture labeling module, and an atomic action learning and recognition module. A posture template selection algorithm is developed based on a modified shape context matching technique. The posture templates form a codebook that is used to convert input posture sequences into training symbol sequences or recognition symbol sequences. Finally, a variable-length Markov model technique is applied to learn and recognize the input symbol sequences of atomic actions. Experiments on real data demonstrate the efficacy of the proposed system.

Keywords—human behavior analysis; language modeling; posture template selection; variable-length Markov model

I. INTRODUCTION

In recent years, visual analysis of human behavior has generated considerable interest in the field of computer vision because it has a wide spectrum of potential applications, such as smart surveillance, human computer interfaces, and content-based retrieval. Atomic human action recognition is an important part of a human behavior analysis system. Since the human body is an articulated object with many degrees of freedom, inferring a body posture from a single 2-D image is usually an ill-posed problem. Providing a sequence of images might help solve the ambiguity of behavior recognition. However, to integrate the information extracted from the images, it is essential to find a model that can effectively formulate the spatial-temporal characteristics of human actions. Note that if a continuous human posture can be quantized into a sequence of discrete postures, each one can be regarded as a letter of a specific language. Consequently, an atomic action composed of a short sequence of discrete postures, which indicates a unitary and complete human movement, can be regarded as a verb of that language. Sentences and paragraphs that describe human behavior can then be constructed, and the semantic description of a human action can be determined by a language modeling approach.

Language modeling [4], a powerful tool for dealing with temporal ordering problems, can also be applied to the analysis of human behavior. A number of approaches have been proposed thus far. For example, Ogale et al. [5] used context-free grammars to model human actions, while Park et al. employed hierarchical finite state automata to recognize human behavior [6]. In [9], hidden Markov models (HMM) were applied to human action recognition. This particular language modeling technique is useful for both human action recognition and human action sequence synthesis. Galata et al. utilized variable-length Markov models (VLMM) to characterize human actions [2], and showed that VLMMs trained with motion-capture data or silhouette images can be used to synthesize human action animations. Currently, the HMM is the most popular stochastic algorithm for language modeling because of its versatility and mathematical simplicity. However, since the states of a HMM are not observable, encoding high-order temporal dependencies with this model is a challenging task. There is no systematic way to determine the topology of a HMM or even the number of its states. Moreover, the training process only guarantees a local optimal solution; thus, the training result is very sensitive to the initial values of the parameters. On the other hand, since the states of a VLMM are observable, its parameters can be estimated easily given sufficient training data. Consequently, a VLMM can capture both long-term and short-term dependencies efficiently because the amount of memory required for prediction is optimized during the training process. However, thus far, the VLMM technique has not been applied to human behavior recognition directly because of two limitations: 1) it cannot handle the dynamic time warping problem, and 2) it lacks a model for observing noise.

In this research, we propose a hybrid framework of VLMM and HMM that retains the models' advantages while avoiding their drawbacks. The framework is comprised of three modules: a posture labeling module, a VLMM atomic action learning module, and a recognition module. First, a posture template selection algorithm is developed based on a modified shape context technique. The selected posture templates constitute a codebook, which is used to convert input posture sequences into discrete symbol sequences for subsequent processing. Then, the VLMM technique is applied to learn the symbol sequences that correspond to atomic actions. This avoids the problem of learning the parameters of a HMM. Finally, the learned VLMMs are transformed into HMMs for atomic action recognition. Thus, an input posture sequence can be classified with the fault tolerance property of a HMM.
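The letters-and-verbs analogy can be made concrete with a toy sketch. The posture labels ("A", "B", "C") and action names ("walk", "wave") below are hypothetical, and a fixed-order bigram model with add-one smoothing stands in for the variable-length models used in this paper; an action string is assigned to the model under which it is most probable:

```python
import math
from collections import defaultdict

def train_bigram(strings, alphabet, alpha=1.0):
    # Count symbol-to-symbol transitions over the training strings
    # of one atomic action, with add-alpha smoothing.
    counts = defaultdict(lambda: defaultdict(float))
    for s in strings:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1.0
    model = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        model[a] = {b: (counts[a][b] + alpha) / total for b in alphabet}
    return model

def log_likelihood(model, s):
    # Log-probability of a posture string under a trained model.
    return sum(math.log(model[a][b]) for a, b in zip(s, s[1:]))

# Hypothetical training strings of discrete posture labels.
alphabet = "ABC"
models = {"walk": train_bigram(["ABAB", "ABABA"], alphabet),
          "wave": train_bigram(["ACAC", "ACACA"], alphabet)}
best = max(models, key=lambda name: log_likelihood(models[name], "ABABAB"))
```

The query string "ABABAB" is classified as "walk" because its transitions are far more probable under that model.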
II. VARIABLE LENGTH MARKOV MODEL

A variable-length Markov model technique [2, 8] is frequently applied to language modeling problems because of its powerful ability to encode temporal dependencies. As shown in Fig. 1, a VLMM can be regarded as a probabilistic finite state automaton (PFSA). The topology and the parameters of a PFSA can be learned from training sequences by optimizing the amount of memory required to predict the next symbol. Usually, a PFSA is constructed from a prediction suffix tree (PST), as shown in Fig. 2. The details of VLMM training are given in [8].

Fig. 1. An example of a VLMM. Fig. 2. The PST for constructing the PFSA shown in Fig. 1.

After a VLMM has been trained, it is used to predict the next input symbol according to a variable number of previously input symbols. In general, a VLMM decomposes the probability of a string of symbols, O = o_1 o_2 ... o_T, into the product of conditional probabilities as follows:

  P(O | λ) = ∏_{j=1}^{T} P(o_j | o_{j-d_j} ... o_{j-1}, λ),   (1)

where o_j is the j-th symbol in the string and d_j is the amount of memory required to predict the symbol o_j. The goal of VLMM recognition is to find the VLMM that best interprets the observed string of symbols in terms of the highest probability. Therefore, the recognition result can be determined as model i* as follows:

  i* = arg max_i P(O | λ_i).   (2)

This method works well for natural language processing. However, since natural language processing and human behavior analysis are inherently different, two problems must be solved before the VLMM technique can be applied to atomic action recognition. First, as noted in Section I, the VLMM technique cannot handle the dynamic time warping problem; hence VLMMs cannot recognize atomic actions when they are performed at different speeds. Second, the VLMM technique does not include a model for observing noise, so the system is less tolerant of image preprocessing errors. We describe our solutions to these two problems in the next section.

III. THE PROPOSED METHOD FOR ATOMIC ACTION RECOGNITION

The proposed method comprises two phases: 1) posture labeling, which converts a continuous human action into a discrete symbol sequence; and 2) application of the VLMM technique to learn and recognize the constructed symbol sequences. The two phases are described below.

A. Posture labeling

To convert a human action into a sequence of discrete symbols, a codebook of posture templates must be created as an alphabet to describe each posture. Although the codebook should be as complete as possible, it is important to minimize redundancy. Therefore, a posture is only included in the codebook if it cannot be approximated by existing codewords, each of which represents a human posture. In this work, a human posture is represented by a silhouette image, and a shape matching process is used to assess the difference between two shapes. First, a low-level image processing technique is applied to extract the silhouette of a human body from each input image. Then, the codebook of posture templates computed from the training images is used to convert the extracted silhouettes into symbol sequences. Shape matching and posture template selection are the most important procedures in the posture labeling process. These are discussed in the following subsections.

1) Shape matching with a modified shape context technique: We modified the shape context technique proposed by Belongie et al. [1] to deal with the shape matching problem. In the original shape context approach, a shape is represented by a discrete set of sampled points, P = {p_1, p_2, ..., p_n}. For each point p_i ∈ P, a coarse histogram h_i is computed to define the local shape context of p_i. To ensure that the local descriptor is sensitive to nearby points, the local histogram is computed in a log-polar space. An example of shape context computation and matching is shown in Fig. 3.

Fig. 3. Shape context computation and matching: (a) and (b) show the sampled points of two corresponding shapes; (c)-(e) are the local shape contexts corresponding to different reference points. A diagram of the log-polar space is shown in (f), while (g) shows the correspondence between points computed using a bipartite graph matching method.

Assume that p_i and q_j are points of the first and second shapes, respectively. The shape context approach defines the cost of matching the two points as follows:

  C(p_i, q_j) = (1/2) ∑_{k=1}^{K} [h_i(k) − h_j(k)]^2 / (h_i(k) + h_j(k)),   (3)

where h_i(k) and h_j(k) denote the K-bin normalized histograms of p_i and q_j, respectively. Shape matching is accomplished by minimizing the following total matching cost:

  H(π) = ∑_i C(p_i, q_{π(i)}),   (4)

where π is a permutation of 1, 2, ..., n. Due to the constraint of one-to-one matching, shape matching can be considered as an assignment problem that can be solved by a bipartite graph matching method.
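As a rough sketch of the matching step in (3) and (4), the chi-square cost can be computed for every point pair and the one-to-one correspondence recovered with a standard bipartite (Hungarian) assignment solver; the random normalized histograms below are stand-ins for real log-polar shape contexts:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chi2_cost(h_i, h_j, eps=1e-12):
    # Chi-square matching cost between two K-bin normalized
    # histograms, as in Eq. (3); eps guards against empty bins.
    return 0.5 * np.sum((h_i - h_j) ** 2 / (h_i + h_j + eps))

def match_shapes(H1, H2):
    # H1, H2: (n, K) arrays of per-point histograms. Find the
    # permutation minimizing the total cost of Eq. (4).
    n = len(H1)
    cost = np.array([[chi2_cost(H1[i], H2[j]) for j in range(n)]
                     for i in range(n)])
    rows, cols = linear_sum_assignment(cost)  # bipartite matching
    return cols, cost[rows, cols].sum()

# Stand-in histograms: the second shape is a shuffled copy of the
# first, so a perfect zero-cost matching exists.
rng = np.random.default_rng(0)
H1 = rng.random((5, 8))
H1 /= H1.sum(axis=1, keepdims=True)
perm = rng.permutation(5)
matching, total = match_shapes(H1, H1[perm])
```

In the paper, the cost is only accumulated over selected reference points of the silhouette rather than all sampled points; the sketch above uses all points for brevity.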
Although the shape context matching algorithm usually provides satisfactory results, the computational cost of applying it to a large database of posture templates is so high that it is not feasible. To reduce the computation time, we only compute the local shape contexts at certain critical reference points, which should be easily and efficiently computable, robust against noise, and critical to defining the shape of the silhouette. Note that the last requirement is very important because it helps preserve the informative local shape context. In this work, the critical reference points are selected as the vertices of the convex hull of a human silhouette. Shape matching based on this modified shape context technique is accomplished by minimizing the total matching cost, modified from (4) as follows:

  H*(π) = ∑_{i∈A} C(p_i, q_{π(i)}),   (5)

where A is the set of convex hull vertices. An example of convex hull-shape contexts matching is shown in Fig. 4. There are three important reasons why convex hull-shape contexts (CSC) can deal with the posture shape matching problem effectively. First, since the number of convex hull vertices is significantly smaller than the number of whole shape points, the computation cost can be reduced substantially. Second, convex hull vertices usually include the tips of human body parts; hence they can preserve more salient information about the human shape, as shown in Fig. 4(a). Third, even if some body parts are missed by the detection method, the remaining convex hull vertices can still be applied to shape matching due to the robustness of computing the convex hull vertices, as shown in Fig. 4.

Fig. 4. Convex hull-shape contexts matching: (a) and (b) show the convex hull vertices of two shapes; (c) shows the correspondence between the convex hull vertices determined using shape matching.

2) Posture template selection: Posture template selection is used to construct a codebook of posture templates from training silhouette sequences. Here, we propose an automatic posture template selection algorithm (see Algorithm 1), based on the CSC matching discussed above. In the method, the cost of matching two shapes, see (5), is denoted by C*(b_i, a_j). We only need to empirically determine one threshold parameter τ_c in our posture template selection method. This parameter determines whether a new training sample should be incorporated into the codebook.

Algorithm 1: Posture Template Selection
  Codebook of key postures: A = {a_1, a_2, ..., a_M}
  Training sequence: T = {t_1, t_2, ..., t_N}
  for each t ∈ T do {
    if (A = ∅ or min_{a∈A} C*(t, a) > τ_c) {
      A ← A ∪ {t}
      M ← M + 1 } }

B. Human action sequence learning and recognition

Using the posture template codebook, an input sequence of postures {b_1, b_2, ..., b_n} can be converted into a symbol sequence {a_q(1), a_q(2), ..., a_q(n)}, where q(i) ∈ {1, 2, ..., M}. Thus, atomic action VLMMs can be trained by the method outlined in Section II. These VLMMs are actually Markov chains of different orders. For simplicity, we transform all the high-order Markov chains into first-order Markov chains by augmenting the state space. For example, the probability of a d_i-th order Markov chain with state space S is given by

  P(X_i = r_i | X_{i−d_i} = r_{i−d_i}, X_{i−d_i+1} = r_{i−d_i+1}, ..., X_{i−1} = r_{i−1}),   (6)

where X_i is a state in S. To transform the d_i-th order Markov chain into a first-order Markov chain, a new state space is constructed such that both Y_{i−1} = (X_{i−d_i}, ..., X_{i−1}) = (r_{i−d_i}, ..., r_{i−1}) and Y_i = (X_{i−d_i+1}, ..., X_i) = (r_{i−d_i+1}, ..., r_i) are included in the new state space. As a result, the high-order Markov chain can be formulated as the following first-order Markov chain [3]:

  P(X_i = r_i | X_{i−d_i} = r_{i−d_i}, X_{i−d_i+1} = r_{i−d_i+1}, ..., X_{i−1} = r_{i−1})
    = P(Y_i = (r_{i−d_i+1}, ..., r_i) | Y_{i−1} = (r_{i−d_i}, ..., r_{i−1})).   (7)

Hereafter, we assume that every VLMM has been transformed into a first-order Markov model.

As mentioned in Section II, two problems must be solved before the VLMM technique can be applied to the action recognition task, namely, the dynamic time warping problem and the lack of a model for observing noise. Note that the speed of an action affects the number of repeated symbols in the constructed symbol sequence: a slower action produces more repeated symbols. To eliminate this speed-dependent factor, the input symbol sequence is preprocessed to merge repeated symbols. VLMMs corresponding to different atomic actions are trained with the preprocessed symbol sequences, similar to the method proposed by Galata et al. [2]. However, this approach is only valid when the observed noise is negligible, which is an impractical assumption. The recognition rate of the constructed VLMMs is low because image preprocessing errors may identify repeated postures as different symbols. To incorporate a noise observation model, the VLMMs must be modified to recognize input sequences with repeated symbols. Let a_ij denote the state transition probability from state i to state j. Initially, a_ii = 0 because repeated symbols are merged into one symbol. Then, the probability of self-transition is updated as a_ii(new) = (N(v_i) − P(v_i)) / N(v_i), where N(v_i) is the number of occurrences of symbol v_i in the original sequence and P(v_i) is the number of occurrences in the preprocessed sequence, and each of the other transition probabilities is updated as a_ij(new) = a_ij(old) (1 − a_ii(new)). For example, if the input training symbol sequence is "AAABBAAACCAAABB," the preprocessed training symbol sequence becomes "ABACAB." The VLMM constructed with the original input training sequence is shown in Fig. 5(a), while the original VLMM and the modified VLMM constructed with the preprocessed training sequence are shown in Figures 5(b) and 5(c), respectively.

Fig. 5. (a) The VLMM constructed with the original input training sequence. (b) The original VLMM constructed with the preprocessed training sequence. (c) The modified VLMM, which includes the possibility of self-transition.
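The speed-normalization and self-transition updates described above can be sketched as follows; for the example sequence, the (N(v) − P(v))/N(v) rule reproduces the self-transition probability of 0.67 for symbol A shown in Fig. 5(c):

```python
from itertools import groupby
from collections import Counter

def merge_repeats(seq):
    # Collapse runs of repeated symbols to remove the effect of
    # action speed, e.g. "AAABBAAACCAAABB" -> "ABACAB".
    return "".join(symbol for symbol, _ in groupby(seq))

def self_transition_probs(seq):
    # a_ii(new) = (N(v) - P(v)) / N(v), where N(v) counts the
    # occurrences of v in the original sequence and P(v) counts
    # the occurrences in the merged (preprocessed) sequence.
    n = Counter(seq)
    p = Counter(merge_repeats(seq))
    return {v: (n[v] - p[v]) / n[v] for v in n}

seq = "AAABBAAACCAAABB"
merged = merge_repeats(seq)            # "ABACAB"
a_self = self_transition_probs(seq)    # A: 6/9, B: 2/4, C: 1/2
```

The remaining outgoing probabilities of each state would then be rescaled by (1 − a_ii(new)) so that each row of the transition matrix still sums to one.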
Next, a noise observation model is introduced to convert a VLMM into a HMM. Note that the output of a VLMM determines its state transition and vice versa, because the states of a VLMM are observable. However, due to image preprocessing noise, the symbol sequence corresponding to an atomic action includes some randomness. If, according to the VLMM, the output symbol is q_t at time t, then its posture template a_t can be retrieved from the codebook. The extracted silhouette image o_t will not deviate too much from its corresponding posture template a_t if the segmentation result does not contain any major errors. Therefore, the CSC distance C*(o_t, a_t) between the image and the template will be close to zero. In this work, we assume that the CSC distance has a Gaussian distribution, i.e.,

  P(o_t | q_t, λ) = (1 / (√(2π) σ)) exp(−C*(o_t, a_t)^2 / (2σ^2)).

Note that once the observation is allowed to deviate from the state in this way, the VLMM becomes a standard HMM. The probability of the observed string of symbols, O, for a given model λ can be evaluated by the HMM forward/backward procedure with proper scaling [7]. Finally, the category i* that maximizes the following equation is deemed to be the recognition result:

  i* = arg max_i log[P(O | λ_i)].   (8)

IV. EXPERIMENTS

We conducted a series of experiments to evaluate the effectiveness of the proposed method. The training data used in the experiments was a real video sequence comprised of approximately 900 frames with ten categories of action sequences. Using the posture template selection algorithm, a codebook of 75 posture templates (see Fig. 6) was constructed from the training data. The data was then used to build ten VLMMs, each of which was associated with one of the atomic actions.

Fig. 6. Posture templates extracted from the training data.

A test video was used to assess the effectiveness of the proposed method. The test data was obtained from the same subject. Each atomic action was repeated four times, yielding a total of 40 action sequences. The proposed method achieved a 100% recognition rate for all the test sequences.

In the second experiment, test videos of nine subjects (see Fig. 7) were used to evaluate the performance of the proposed method. Each person repeated each action five times, so we had five sequences for each action and each subject, which yielded a total of 450 action sequences. For comparison, we also tested the performance of the HMM method in this experiment. The HMMs we used were fully connected models. The number of states for each HMM was assigned as the number of states of the corresponding learned VLMM. Table 1 compares our method's recognition rate with that of the HMM method computed with the test data from the nine subjects. Our method clearly outperforms the HMM method.

Fig. 7. Nine test subjects.

Table 1. Comparison of our method's recognition rate with that of the HMM method computed with the test data from the nine subjects.

V. CONCLUSION

We have proposed a framework for understanding human atomic actions using a language modeling approach. The framework comprises two modules: a posture labeling module, and a VLMM atomic action learning and recognition module. We have developed a simple and efficient posture template selection algorithm based on a modified shape context matching method. A codebook of posture templates is created to convert the input posture sequences into discrete symbols so that the language modeling approach can be applied. The VLMM technique is then used to learn and recognize human action sequences. Our experiment results demonstrate the efficacy of the proposed system.

ACKNOWLEDGMENT

The authors would like to thank the Department of Industrial Technology, Ministry of Economic Affairs, Taiwan for supporting this research under Contract No. 96-EC-17-A-02-S1-032, and the National Science Council, Taiwan under Contract No. NSC 95-2221-E-260-028-MY3.

REFERENCES

[1] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 4, pp. 509-522, 2002.
[2] A. Galata, N. Johnson, and D. Hogg, "Learning variable-length Markov models of behavior," Computer Vision and Image Understanding, Vol. 81, No. 3, pp. 398-413, 2001.
[3] P. Guttorp, Stochastic Modeling of Scientific Data, London: Chapman and Hall/CRC, 1995.
[4] F. Jelinek, Statistical Methods for Speech Recognition, Cambridge, Mass.: MIT Press, 1998.
[5] A. S. Ogale, A. Karapurkar, and Y. Aloimonos, "View-invariant modeling and recognition of human actions using grammars," Workshop on Dynamical Vision at ICCV, Beijing, China, 2005.
[6] J. Park, S. Park, and J. K. Aggarwal, "Model-based human motion tracking and behavior recognition using hierarchical finite state automata," Proceedings of the International Conference on Computational Science and Its Applications, Assisi, Italy, pp. 311-320, 2004.
[7] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
[8] D. Ron, Y. Singer, and N. Tishby, "The power of amnesia," Advances in Neural Information Processing Systems, Morgan Kaufmann, pp. 176-183, 1994.
[9] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 379-385, 1992.