A Language Modeling Approach to Atomic Human Action Recognition

Yu-Ming Liang(a), Sheng-Wen Shih(b), Arthur Chun-Chieh Shih(c), Hong-Yuan Mark Liao(a,c), and Cheng-Chung Lin(a)

(a) Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
(b) Department of Computer Science and Information Engineering, National Chi Nan University, Nantou, Taiwan
(c) Institute of Information Science, Academia Sinica, Taipei, Taiwan
*liao@iis.sinica.edu.tw
Abstract—Visual analysis of human behavior has generated considerable interest in the field of computer vision because it has a wide spectrum of potential applications. Atomic human action recognition is an important part of a human behavior analysis system. In this paper, we propose a language modeling framework for this task. The framework is comprised of two modules: a posture labeling module, and an atomic action learning and recognition module. A posture template selection algorithm is developed based on a modified shape context matching technique. The posture templates form a codebook that is used to convert input posture sequences into training symbol sequences or recognition symbol sequences. Finally, a variable-length Markov model technique is applied to learn and recognize the input symbol sequences of atomic actions. Experiments on real data demonstrate the efficacy of the proposed system.

Keywords—human behavior analysis; language modeling; posture template selection; variable-length Markov model

I. INTRODUCTION

In recent years, visual analysis of human behavior has generated considerable interest in the field of computer vision because it has a wide spectrum of potential applications, such as smart surveillance, human computer interfaces, and content-based retrieval. Atomic human action recognition is an important part of a human behavior analysis system. Since the human body is an articulated object with many degrees of freedom, inferring a body posture from a single 2-D image is usually an ill-posed problem. Providing a sequence of images might help solve the ambiguity of behavior recognition. However, to integrate the information extracted from the images, it is essential to find a model that can effectively formulate the spatial-temporal characteristics of human actions. Note that if a continuous human posture can be quantized into a sequence of discrete postures, each one can be regarded as a letter of a specific language. Consequently, an atomic action composed of a short sequence of discrete postures, which indicates a unitary and complete human movement, can be regarded as a verb of that language. Sentences and paragraphs that describe human behavior can then be constructed, and the semantic description of a human action can be determined by a language modeling approach.

Language modeling [4], a powerful tool for dealing with temporal ordering problems, can also be applied to the analysis of human behavior. A number of approaches have been proposed thus far. For example, Ogale et al. [5] used context-free grammars to model human actions, while Park et al. employed hierarchical finite state automata to recognize human behavior [6]. In [9], hidden Markov models (HMM) were applied to human action recognition. This particular language modeling technique is useful for both human action recognition and human action sequence synthesis. Galata et al. utilized variable-length Markov models (VLMM) to characterize human actions [2], and showed that VLMMs trained with motion-capture data or silhouette images can be used to synthesize human action animations. Currently, the HMM is the most popular stochastic algorithm for language modeling because of its versatility and mathematical simplicity. However, since the states of a HMM are not observable, encoding high-order temporal dependencies with this model is a challenging task. There is no systematic way to determine the topology of a HMM or even the number of its states. Moreover, the training process only guarantees a local optimal solution; thus, the training result is very sensitive to the initial values of the parameters. On the other hand, since the states of a VLMM are observable, its parameters can be estimated easily given sufficient training data. Consequently, a VLMM can capture both long-term and short-term dependencies efficiently because the amount of memory required for prediction is optimized during the training process. However, thus far, the VLMM technique has not been applied to human behavior recognition directly because of two limitations: 1) it cannot handle the dynamic time warping problem, and 2) it lacks a model for observing noise.

In this research, we propose a hybrid framework of VLMM and HMM that retains the models' advantages while avoiding their drawbacks. The framework is comprised of three modules: a posture labeling module, a VLMM atomic action learning module, and a recognition module. First, a posture template selection algorithm is developed based on a modified shape context technique. The selected posture templates constitute a codebook, which is used to convert input posture sequences into discrete symbol sequences for subsequent processing. Then, the VLMM technique is applied to learn the symbol sequences that correspond to atomic actions. This avoids the problem of learning the parameters of a HMM. Finally, the learned VLMMs are transformed into HMMs for atomic action recognition. Thus, an input posture sequence can be classified with the fault tolerance property of a HMM.
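The letters-and-verbs analogy can be made concrete with a toy sketch. The posture labels ("A", "B", "C") and action names ("walk", "wave") below are hypothetical, and a fixed-order bigram model with add-one smoothing stands in for the variable-length models used in this paper; an action string is assigned to the model under which it is most probable:

```python
import math
from collections import defaultdict

def train_bigram(strings, alphabet, alpha=1.0):
    # Count symbol-to-symbol transitions over the training strings
    # of one atomic action, with add-alpha smoothing.
    counts = defaultdict(lambda: defaultdict(float))
    for s in strings:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1.0
    model = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        model[a] = {b: (counts[a][b] + alpha) / total for b in alphabet}
    return model

def log_likelihood(model, s):
    # Log-probability of a posture string under a trained model.
    return sum(math.log(model[a][b]) for a, b in zip(s, s[1:]))

# Hypothetical training strings of discrete posture labels.
alphabet = "ABC"
models = {"walk": train_bigram(["ABAB", "ABABA"], alphabet),
          "wave": train_bigram(["ACAC", "ACACA"], alphabet)}
best = max(models, key=lambda name: log_likelihood(models[name], "ABABAB"))
```

The query string "ABABAB" is classified as "walk" because its transitions are far more probable under that model.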
II. VARIABLE LENGTH MARKOV MODEL

A variable-length Markov model technique [2, 8] is frequently applied to language modeling problems because of its powerful ability to encode temporal dependencies. As shown in Fig. 1, a VLMM can be regarded as a probabilistic finite state automaton (PFSA). The topology and the parameters of a PFSA can be learned from training sequences by optimizing the amount of memory required to predict the next symbol. Usually, a PFSA is constructed from a prediction suffix tree (PST), as shown in Fig. 2. The details of VLMM training are given in [8].

Fig. 1. An example of a VLMM. Fig. 2. The PST for constructing the PFSA shown in Fig. 1.

After a VLMM has been trained, it is used to predict the next input symbol according to a variable number of previously input symbols. In general, a VLMM decomposes the probability of a string of symbols, O = o_1 o_2 ... o_T, into the product of conditional probabilities as follows:

  P(O | λ) = ∏_{j=1}^{T} P(o_j | o_{j-d_j} ... o_{j-1}, λ),   (1)

where o_j is the j-th symbol in the string and d_j is the amount of memory required to predict the symbol o_j. The goal of VLMM recognition is to find the VLMM that best interprets the observed string of symbols in terms of the highest probability. Therefore, the recognition result can be determined as model i* as follows:

  i* = arg max_i P(O | λ_i).   (2)

This method works well for natural language processing. However, since natural language processing and human behavior analysis are inherently different, two problems must be solved before the VLMM technique can be applied to atomic action recognition. First, as noted in Section I, the VLMM technique cannot handle the dynamic time warping problem; hence VLMMs cannot recognize atomic actions when they are performed at different speeds. Second, the VLMM technique does not include a model for observing noise, so the system is less tolerant of image preprocessing errors. We describe our solutions to these two problems in the next section.

III. THE PROPOSED METHOD FOR ATOMIC ACTION RECOGNITION

The proposed method comprises two phases: 1) posture labeling, which converts a continuous human action into a discrete symbol sequence; and 2) application of the VLMM technique to learn and recognize the constructed symbol sequences. The two phases are described below.

A. Posture labeling

To convert a human action into a sequence of discrete symbols, a codebook of posture templates must be created as an alphabet to describe each posture. Although the codebook should be as complete as possible, it is important to minimize redundancy. Therefore, a posture is only included in the codebook if it cannot be approximated by existing codewords, each of which represents a human posture. In this work, a human posture is represented by a silhouette image, and a shape matching process is used to assess the difference between two shapes. First, a low-level image processing technique is applied to extract the silhouette of a human body from each input image. Then, the codebook of posture templates computed from the training images is used to convert the extracted silhouettes into symbol sequences. Shape matching and posture template selection are the most important procedures in the posture labeling process. These are discussed in the following subsections.

1) Shape matching with a modified shape context technique: We modified the shape context technique proposed by Belongie et al. [1] to deal with the shape matching problem. In the original shape context approach, a shape is represented by a discrete set of sampled points, P = {p_1, p_2, ..., p_n}. For each point p_i ∈ P, a coarse histogram h_i is computed to define the local shape context of p_i. To ensure that the local descriptor is sensitive to nearby points, the local histogram is computed in a log-polar space. An example of shape context computation and matching is shown in Fig. 3.

Fig. 3. Shape context computation and matching: (a) and (b) show the sampled points of two corresponding shapes; (c)-(e) are the local shape contexts corresponding to different reference points. A diagram of the log-polar space is shown in (f), while (g) shows the correspondence between points computed using a bipartite graph matching method.

Assume that p_i and q_j are points of the first and second shapes, respectively. The shape context approach defines the cost of matching the two points as follows:

  C(p_i, q_j) = (1/2) ∑_{k=1}^{K} [h_i(k) − h_j(k)]^2 / (h_i(k) + h_j(k)),   (3)

where h_i(k) and h_j(k) denote the K-bin normalized histograms of p_i and q_j, respectively. Shape matching is accomplished by minimizing the following total matching cost:

  H(π) = ∑_i C(p_i, q_{π(i)}),   (4)

where π is a permutation of 1, 2, ..., n. Due to the constraint of one-to-one matching, shape matching can be considered as an assignment problem that can be solved by a bipartite graph matching method.
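As a rough sketch of the matching step in (3) and (4), the chi-square cost can be computed for every point pair and the one-to-one correspondence recovered with a standard bipartite (Hungarian) assignment solver; the random normalized histograms below are stand-ins for real log-polar shape contexts:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chi2_cost(h_i, h_j, eps=1e-12):
    # Chi-square matching cost between two K-bin normalized
    # histograms, as in Eq. (3); eps guards against empty bins.
    return 0.5 * np.sum((h_i - h_j) ** 2 / (h_i + h_j + eps))

def match_shapes(H1, H2):
    # H1, H2: (n, K) arrays of per-point histograms. Find the
    # permutation minimizing the total cost of Eq. (4).
    n = len(H1)
    cost = np.array([[chi2_cost(H1[i], H2[j]) for j in range(n)]
                     for i in range(n)])
    rows, cols = linear_sum_assignment(cost)  # bipartite matching
    return cols, cost[rows, cols].sum()

# Stand-in histograms: the second shape is a shuffled copy of the
# first, so a perfect zero-cost matching exists.
rng = np.random.default_rng(0)
H1 = rng.random((5, 8))
H1 /= H1.sum(axis=1, keepdims=True)
perm = rng.permutation(5)
matching, total = match_shapes(H1, H1[perm])
```

In the paper, the cost is only accumulated over selected reference points of the silhouette rather than all sampled points; the sketch above uses all points for brevity.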
Although the shape context matching algorithm usually provides satisfactory results, the computational cost of applying it to a large database of posture templates is so high that it is not feasible. To reduce the computation time, we only compute the local shape contexts at certain critical reference points, which should be easily and efficiently computable, robust against noise, and critical to defining the shape of the silhouette. Note that the last requirement is very important because it helps preserve the informative local shape context. In this work, the critical reference points are selected as the vertices of the convex hull of a human silhouette. Shape matching based on this modified shape context technique is accomplished by minimizing the total matching cost, modified from (4) as follows:

  H*(π) = ∑_{i∈A} C(p_i, q_{π(i)}),   (5)

where A is the set of convex hull vertices. An example of convex hull-shape contexts matching is shown in Fig. 4. There are three important reasons why convex hull-shape contexts (CSC) can deal with the posture shape matching problem effectively. First, since the number of convex hull vertices is significantly smaller than the number of whole shape points, the computation cost can be reduced substantially. Second, convex hull vertices usually include the tips of human body parts; hence they can preserve more salient information about the human shape, as shown in Fig. 4(a). Third, even if some body parts are missed by the detection method, the remaining convex hull vertices can still be applied to shape matching due to the robustness of computing the convex hull vertices, as shown in Fig. 4.

Fig. 4. Convex hull-shape contexts matching: (a) and (b) show the convex hull vertices of two shapes; (c) shows the correspondence between the convex hull vertices determined using shape matching.

2) Posture template selection: Posture template selection is used to construct a codebook of posture templates from training silhouette sequences. Here, we propose an automatic posture template selection algorithm (see Algorithm 1), based on the CSC matching discussed above. In the method, the cost of matching two shapes, see (5), is denoted by C*(b_i, a_j). We only need to empirically determine one threshold parameter τ_c in our posture template selection method. This parameter determines whether a new training sample should be incorporated into the codebook.

Algorithm 1: Posture Template Selection
  Codebook of key postures: A = {a_1, a_2, ..., a_M}
  Training sequence: T = {t_1, t_2, ..., t_N}
  for each t ∈ T do {
    if (A = ∅ or min_{a∈A} C*(t, a) > τ_c) {
      A ← A ∪ {t}
      M ← M + 1 } }

B. Human action sequence learning and recognition

Using the posture template codebook, an input sequence of postures {b_1, b_2, ..., b_n} can be converted into a symbol sequence {a_q(1), a_q(2), ..., a_q(n)}, where q(i) ∈ {1, 2, ..., M}. Thus, atomic action VLMMs can be trained by the method outlined in Section II. These VLMMs are actually Markov chains of different orders. For simplicity, we transform all the high-order Markov chains into first-order Markov chains by augmenting the state space. For example, the probability of a d_i-th order Markov chain with state space S is given by

  P(X_i = r_i | X_{i−d_i} = r_{i−d_i}, X_{i−d_i+1} = r_{i−d_i+1}, ..., X_{i−1} = r_{i−1}),   (6)

where X_i is a state in S. To transform the d_i-th order Markov chain into a first-order Markov chain, a new state space is constructed such that both Y_{i−1} = (X_{i−d_i}, ..., X_{i−1}) = (r_{i−d_i}, ..., r_{i−1}) and Y_i = (X_{i−d_i+1}, ..., X_i) = (r_{i−d_i+1}, ..., r_i) are included in the new state space. As a result, the high-order Markov chain can be formulated as the following first-order Markov chain [3]:

  P(X_i = r_i | X_{i−d_i} = r_{i−d_i}, X_{i−d_i+1} = r_{i−d_i+1}, ..., X_{i−1} = r_{i−1})
    = P(Y_i = (r_{i−d_i+1}, ..., r_i) | Y_{i−1} = (r_{i−d_i}, ..., r_{i−1})).   (7)

Hereafter, we assume that every VLMM has been transformed into a first-order Markov model.

As mentioned in Section II, two problems must be solved before the VLMM technique can be applied to the action recognition task, namely, the dynamic time warping problem and the lack of a model for observing noise. Note that the speed of an action affects the number of repeated symbols in the constructed symbol sequence: a slower action produces more repeated symbols. To eliminate this speed-dependent factor, the input symbol sequence is preprocessed to merge repeated symbols. VLMMs corresponding to different atomic actions are trained with the preprocessed symbol sequences, similar to the method proposed by Galata et al. [2]. However, this approach is only valid when the observed noise is negligible, which is an impractical assumption. The recognition rate of the constructed VLMMs is low because image preprocessing errors may identify repeated postures as different symbols. To incorporate a noise observation model, the VLMMs must be modified to recognize input sequences with repeated symbols. Let a_ij denote the state transition probability from state i to state j. Initially, a_ii = 0 because repeated symbols are merged into one symbol. Then, the probability of self-transition is updated as a_ii(new) = (N(v_i) − P(v_i)) / N(v_i), where N(v_i) is the number of occurrences of symbol v_i in the original sequence and P(v_i) is the number of occurrences in the preprocessed sequence, and each of the other transition probabilities is updated as a_ij(new) = a_ij(old) (1 − a_ii(new)). For example, if the input training symbol sequence is "AAABBAAACCAAABB," the preprocessed training symbol sequence becomes "ABACAB." The VLMM constructed with the original input training sequence is shown in Fig. 5(a), while the original VLMM and the modified VLMM constructed with the preprocessed training sequence are shown in Figures 5(b) and 5(c), respectively.

Fig. 5. (a) The VLMM constructed with the original input training sequence. (b) The original VLMM constructed with the preprocessed training sequence. (c) The modified VLMM, which includes the possibility of self-transition.
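The speed-normalization and self-transition updates described above can be sketched as follows; for the example sequence, the (N(v) − P(v))/N(v) rule reproduces the self-transition probability of 0.67 for symbol A shown in Fig. 5(c):

```python
from itertools import groupby
from collections import Counter

def merge_repeats(seq):
    # Collapse runs of repeated symbols to remove the effect of
    # action speed, e.g. "AAABBAAACCAAABB" -> "ABACAB".
    return "".join(symbol for symbol, _ in groupby(seq))

def self_transition_probs(seq):
    # a_ii(new) = (N(v) - P(v)) / N(v), where N(v) counts the
    # occurrences of v in the original sequence and P(v) counts
    # the occurrences in the merged (preprocessed) sequence.
    n = Counter(seq)
    p = Counter(merge_repeats(seq))
    return {v: (n[v] - p[v]) / n[v] for v in n}

seq = "AAABBAAACCAAABB"
merged = merge_repeats(seq)            # "ABACAB"
a_self = self_transition_probs(seq)    # A: 6/9, B: 2/4, C: 1/2
```

The remaining outgoing probabilities of each state would then be rescaled by (1 − a_ii(new)) so that each row of the transition matrix still sums to one.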
Next, a noise observation model is introduced to convert a VLMM into a HMM. Note that the output of a VLMM determines its state transition and vice versa, because the states of a VLMM are observable. However, due to image preprocessing noise, the symbol sequence corresponding to an atomic action includes some randomness. If, according to the VLMM, the output symbol is q_t at time t, then its posture template a_t can be retrieved from the codebook. The extracted silhouette image o_t will not deviate too much from its corresponding posture template a_t if the segmentation result does not contain any major errors. Therefore, the CSC distance C*(o_t, a_t) between the image and the template will be close to zero. In this work, we assume that the CSC distance has a Gaussian distribution, i.e.,

  P(o_t | q_t, λ) = (1 / (√(2π) σ)) exp(−C*(o_t, a_t)^2 / (2σ^2)).

Note that once the observation is allowed to deviate from the state in this way, the VLMM becomes a standard HMM. The probability of the observed string of symbols, O, for a given model λ can be evaluated by the HMM forward/backward procedure with proper scaling [7]. Finally, the category i* that maximizes the following equation is deemed to be the recognition result:

  i* = arg max_i log[P(O | λ_i)].   (8)

IV. EXPERIMENTS

We conducted a series of experiments to evaluate the effectiveness of the proposed method. The training data used in the experiments was a real video sequence comprised of approximately 900 frames with ten categories of action sequences. Using the posture template selection algorithm, a codebook of 75 posture templates (see Fig. 6) was constructed from the training data. The data was then used to build ten VLMMs, each of which was associated with one of the atomic actions.

Fig. 6. Posture templates extracted from the training data.

A test video was used to assess the effectiveness of the proposed method. The test data was obtained from the same subject. Each atomic action was repeated four times, yielding a total of 40 action sequences. The proposed method achieved a 100% recognition rate for all the test sequences.

In the second experiment, test videos of nine subjects (see Fig. 7) were used to evaluate the performance of the proposed method. Each person repeated each action five times, so we had five sequences for each action and each subject, which yielded a total of 450 action sequences. For comparison, we also tested the performance of the HMM method in this experiment. The HMMs we used were fully connected models. The number of states for each HMM was assigned as the number of states of the corresponding learned VLMM. Table 1 compares our method's recognition rate with that of the HMM method computed with the test data from the nine subjects. Our method clearly outperforms the HMM method.

Fig. 7. Nine test subjects.

Table 1. Comparison of our method's recognition rate with that of the HMM method computed with the test data from the nine subjects.

V. CONCLUSION

We have proposed a framework for understanding human atomic actions using a language modeling approach. The framework comprises two modules: a posture labeling module, and a VLMM atomic action learning and recognition module. We have developed a simple and efficient posture template selection algorithm based on a modified shape context matching method. A codebook of posture templates is created to convert the input posture sequences into discrete symbols so that the language modeling approach can be applied. The VLMM technique is then used to learn and recognize human action sequences. Our experiment results demonstrate the efficacy of the proposed system.

ACKNOWLEDGMENT

The authors would like to thank the Department of Industrial Technology, Ministry of Economic Affairs, Taiwan for supporting this research under Contract No. 96-EC-17-A-02-S1-032, and the National Science Council, Taiwan under Contract No. NSC 95-2221-E-260-028-MY3.

REFERENCES

[1] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 4, pp. 509-522, 2002.
[2] A. Galata, N. Johnson, and D. Hogg, "Learning variable-length Markov models of behavior," Computer Vision and Image Understanding, Vol. 81, No. 3, pp. 398-413, 2001.
[3] P. Guttorp, Stochastic Modeling of Scientific Data, London: Chapman and Hall/CRC, 1995.
[4] F. Jelinek, Statistical Methods for Speech Recognition, Cambridge, Mass.: MIT Press, 1998.
[5] A. S. Ogale, A. Karapurkar, and Y. Aloimonos, "View-invariant modeling and recognition of human actions using grammars," Workshop on Dynamical Vision at ICCV, Beijing, China, 2005.
[6] J. Park, S. Park, and J. K. Aggarwal, "Model-based human motion tracking and behavior recognition using hierarchical finite state automata," Proceedings of the International Conference on Computational Science and Its Applications, Assisi, Italy, pp. 311-320, 2004.
[7] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
[8] D. Ron, Y. Singer, and N. Tishby, "The power of amnesia," Advances in Neural Information Processing Systems, Morgan Kaufmann, pp. 176-183, 1994.
[9] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 379-385, 1992.