Recognizing Hand Gesture Using Motion Trajectories


Ming-Hsuan Yang and Narendra Ahuja

Department of Computer Science and Beckman Institute

University of Illinois at Urbana-Champaign, Urbana, IL 61801

fmhyang,[email protected]

http://vision.ai.uiuc.edu

Abstract

We present an algorithm for extracting and classifying two-dimensional motion in an image sequence based on motion trajectories. First, a multiscale segmentation is performed to generate homogeneous regions in each frame. Regions between consecutive frames are then matched to obtain 2-view correspondences. Affine transformations are computed from each pair of corresponding regions to define pixel matches. Pixel matches over consecutive image pairs are concatenated to obtain pixel-level motion trajectories across the image sequence. Motion patterns are learned from the extracted trajectories using a time-delay neural network. We apply the proposed method to recognize 40 hand gestures of American Sign Language. Experimental results show that motion patterns in hand gestures can be extracted and recognized with a high recognition rate using motion trajectories.

1 Introduction

In this paper, we present an algorithm for extracting two-dimensional motion fields of objects across a video sequence and classifying each as one of a set of a priori known classes. The algorithm is used to recognize dynamic visual processes based on spatial, photometric and temporal characteristics. One application of the algorithm is sign language recognition, where an utterance is interpreted based on, for example, hand location, shape, and motion. The performance of the algorithm is evaluated on the task of recognizing 40 complex hand gestures of American Sign Language (ASL).

The algorithm consists of two major steps. First, each image is partitioned into regions using a multiscale segmentation method. Regions between consecutive frames are then matched to obtain 2-view correspondences. Affine transformations are computed from each pair of corresponding regions to define pixel matches. Pixel matches over consecutive image pairs are concatenated to obtain pixel-level motion trajectories across the video sequence. Pixels are also grouped based on their 2-view motion similarity to obtain a motion-based segmentation of the video sequence. Only some of the moving regions correspond to visual phenomena of interest. Both the intrinsic properties of the objects represented by image regions and their dynamics represented by the motion trajectories determine whether they comprise an event of interest. For example, it is sufficient to recognize most gestures in ASL in terms of shape and location changes of palm regions. Therefore, palm and head regions are extracted in each frame and the palm locations are specified with reference to the usually still head regions.

To recognize motion patterns from trajectories, we use a time-delay neural network (TDNN) [11]. A TDNN is a multilayer feedforward network that uses time delays between all layers to represent temporal relationships between events in time. An input vector is organized as a temporal sequence, where only the portion of the input sequence within a time window is fed to the network at one time. The time window is shifted and another portion of the input sequence is given to the network until the whole sequence has been scanned. The TDNN is trained using the standard error backpropagation learning algorithm. The output of the network is computed by adding the scores from all window positions over time and then applying a nonlinear function, such as the sigmoid function, to the sum. TDNNs with two hidden layers using sliding input windows over time lead to a relatively small number of trainable parameters. We adopt a TDNN to recognize motion patterns because gestures are spatio-temporal sequences of feature vectors defined along motion trajectories. Our experimental results show that motion patterns can be learned by a time-delay neural network with a high recognition rate.
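The sliding-window computation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the network used in the experiments: the weights are random, the numbers of hidden units (8 and 6) are arbitrary, biases are omitted, and the helper names `random_layer`, `time_delay_layer` and `integrate_over_time` are hypothetical. The 50 input slots and the window lengths 5, 10 and 20 are taken from the architecture shown in Figure 6.

```python
import math
import random


def sigmoid(z):
    """Logistic nonlinearity applied to a weighted sum."""
    return 1.0 / (1.0 + math.exp(-z))


def random_layer(n_units, window, n_inputs, rng):
    """Hypothetical initializer: one (window x n_inputs) weight block per unit."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
             for _ in range(window)]
            for _ in range(n_units)]


def time_delay_layer(frames, weights, window):
    """Slide a window over the time axis; each unit sees `window` frames.

    The same weights are reused at every window position, which keeps
    the number of trainable parameters small.
    """
    slots = []
    for start in range(len(frames) - window + 1):
        slot = []
        for unit in weights:
            z = sum(unit[d][f] * frames[start + d][f]
                    for d in range(window)
                    for f in range(len(frames[0])))
            slot.append(sigmoid(z))
        slots.append(slot)
    return slots


def integrate_over_time(slots):
    """Add each output unit's scores over all time slots, then squash."""
    return [sigmoid(sum(slot[u] for slot in slots))
            for u in range(len(slots[0]))]


rng = random.Random(0)
# 50 time slots of 4 features each (placeholder values, not real trajectories).
inputs = [[rng.random() for _ in range(4)] for _ in range(50)]

hidden1 = time_delay_layer(inputs, random_layer(8, 5, 4, rng), window=5)
hidden2 = time_delay_layer(hidden1, random_layer(6, 10, 8, rng), window=10)
scores = time_delay_layer(hidden2, random_layer(40, 20, 6, rng), window=20)
class_outputs = integrate_over_time(scores)

print(len(hidden1), len(hidden2), len(scores), len(class_outputs))
```

Each layer shortens the time axis by `window - 1` slots, so 50 input slots become 46, then 37, then 18 before the final integration into 40 class outputs.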

2 Related Work

Since Johansson's seminal work [7], which suggests that human movements can be recognized solely by motion [...] been investigated to recognize human motion by several researchers. In [8], Siskind and Morris conjecture that human event perception does not presuppose object recognition. In other words, they think visual event recognition is performed by a visual pathway that is separate from object recognition. To verify the conjecture, they analyze motion profiles of objects that participate in different simple spatial-motion events. Their tracker uses a mixture of color-based and motion-based techniques. Color-based techniques are used to track objects defined by sets of colored pixels whose saturation and value are above certain thresholds in each frame. These pixels are then clustered into regions using a histogram based on hue. Moving pixels are extracted from frame differences and divided into clusters based on proximity. Next, each region (generated by color or motion) in each frame is abstracted by an ellipse. Finally, a feature vector for each frame is generated by computing the absolute and relative ellipse positions, orientations, velocities and accelerations. To classify visual events, they use a set of Hidden Markov Models (HMMs), which are used as generative models and trained on movies of each visual event represented by a set of feature vectors. After training, a new observation is classified as being generated by the model that assigns the highest likelihood. Experiments on a set of 6 simple gestures, "pick up," "put down," "push," "pull," "drop," and "throw," demonstrate that gestures can be classified based on motion profiles.

Bobick and Wilson [3] adopt a state-based approach to represent and recognize gestures. First, many samples of a gesture are used to compute its principal curve [5], which is parameterized by arc length. A by-product of calculating the curve is the mapping of each sample point of a gesture example to an arc length along the curve. Next, they use line segments of uniform length to approximate the discretized curve. Each line segment is represented by a vector and all the line segments are grouped into a number of clusters. A state is defined to indicate the cluster to which a line segment belongs. A gesture is then defined by an ordered sequence of states. The recognition procedure evaluates whether an input trajectory successfully passes through the states in the prescribed order. In contrast to their work, where each example of a gesture is a single trajectory in space, each gesture in our work is represented by a set of motion trajectories corresponding to the motions of different parts of, say, the palm, instead of a single representative point. Thus, each example of a gesture in our work is [...] Experimental results show that an ensemble of trajectories yields better generalization.

Recently, Isard and Blake have proposed the CONDENSATION algorithm [6] as a probabilistic method to track curves in visual scenes. This method is a fusion of the statistical factored sampling algorithm with a stochastic model to search a multivariate parameter space that is changing over time. Objects are modeled as a set of parameterized curves and the stochastic model is estimated based on the training sequence. Experiments on the proposed algorithm have been carried out to track objects based on their hand-drawn templates. Black and Jepson [2] extend this algorithm to recognize gestures and facial expressions in which human motions are modeled as temporal trajectories of some estimated parameters (which describe the states of a gesture) over time. The major difference between our approach and these methods is that we propose a method to extract motion trajectories from an image sequence without hand-drawn templates [6] or distinct trackable icons [2]. Motion patterns are then learned from the extracted motion trajectories. No prior knowledge is assumed or required for the extraction of motion trajectories, although domain-specific knowledge can be applied for efficiency reasons.

3 Motion Segmentation

To capture the dynamic characteristics of objects, we segment an image frame into regions with uniform motion. Our motion segmentation algorithm processes an image sequence two successive frames at a time. For a pair of frames, $(I_t, I_{t+1})$, the algorithm identifies regions in each frame comprising the multiscale intraframe structure. Regions at all scales are then matched across frames. Affine transforms are computed for each matched region pair. The affine transform parameters for regions at all scales are then used to derive a single motion field, which is then segmented to identify the differently moving regions between the two frames. The following sections describe the major steps in the motion segmentation algorithm.

3.1 Multiscale Image Segmentation

Multiscale segmentation is performed using a transform described in [1], which extracts a hierarchy of regions in each image. The general form of the transform, which maps an image to a family of attraction force fields, is defined by

$$\mathbf{F}(x, y, \sigma_g(x,y), \sigma_s(x,y)) = \iint_R d_g\big(\Delta I, \sigma_g(x,y)\big)\, d_s\big(\vec{r}, \sigma_s(x,y)\big)\, \frac{\vec{r}}{\|\vec{r}\|}\, dw\, dv$$

where $R = \mathrm{domain}(I(u,v)) \setminus \{(x,y)\}$ and $\vec{r} = (v - x)\,\vec{i} + (w - y)\,\vec{j}$. The parameter $\sigma_g$ denotes a homogeneity [...] region to which a pixel belongs, and $\sigma_s$ is a spatial scale that controls the neighborhood from which the force on the pixel is computed. The homogeneity of two pixels is given by the Euclidean distance between the associated m-dimensional vectors of pixel values (e.g., m = 3 for a color image):

$$\Delta I = |I(x,y) - I(v,w)|$$

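The spatial scale parameter, $\sigma_s$, controls the spatial distance function, $d_s(\cdot)$, and the homogeneity scale parameter, $\sigma_g$, controls the homogeneity distance function, $d_g(\cdot)$. One possible form for these functions satisfying the criteria discussed in [1] is the unnormalized Gaussian:

$$d_g(\Delta I, \sigma_g) = \sqrt{2\pi\sigma_g^2}\; N_{\Delta I}(0, \sigma_g^2)$$

$$d_s(\vec{r}, \sigma_s) = \begin{cases} \sqrt{2\pi\sigma_s^2}\; N_{\|\vec{r}\|}(0, \sigma_s^2), & \|\vec{r}\| \le 2\sigma_s \\ 0, & \|\vec{r}\| > 2\sigma_s \end{cases}$$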

The force field encodes the region structure in a manner which allows easy extraction. Region boundaries correspond to diverging force vectors in $\mathbf{F}$ and region skeletons correspond to converging force vectors in $\mathbf{F}$. An increase in $\sigma_g$ causes less homogeneous structures to be encoded and an increase in $\sigma_s$ causes large structures to be encoded.
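A discrete version of the force computation may make the roles of the two scale parameters more concrete. The sketch below is an illustration only: it assumes a grayscale image stored as a list of rows, replaces the integral by a sum over neighboring pixels, and uses the unnormalized Gaussian exp(-d^2 / (2 sigma^2)) for both distance functions, with the spatial one cut off at twice the spatial scale. The function names are hypothetical.

```python
import math


def unnormalized_gaussian(value, sigma):
    """Gaussian without its normalizing constant: exp(-value^2 / (2 sigma^2))."""
    return math.exp(-(value * value) / (2.0 * sigma * sigma))


def force_at(image, x, y, sigma_g, sigma_s):
    """Attraction force on pixel (x, y); image is indexed as image[row][col]."""
    fx = fy = 0.0
    radius = int(math.floor(2 * sigma_s))
    for w in range(max(0, y - radius), min(len(image), y + radius + 1)):
        for v in range(max(0, x - radius), min(len(image[0]), x + radius + 1)):
            if v == x and w == y:
                continue  # the pixel itself is excluded from the domain
            rx, ry = v - x, w - y
            dist = math.hypot(rx, ry)
            if dist > 2 * sigma_s:
                continue  # spatial weight is zero beyond 2 * sigma_s
            weight = (unnormalized_gaussian(image[y][x] - image[w][v], sigma_g)
                      * unnormalized_gaussian(dist, sigma_s))
            # Each neighbor pulls along the unit vector pointing toward it.
            fx += weight * rx / dist
            fy += weight * ry / dist
    return fx, fy


# Two homogeneous halves: dark on the left, bright on the right.
image = [[0, 0, 0, 100, 100, 100] for _ in range(5)]
left_force = force_at(image, 2, 2, sigma_g=10.0, sigma_s=1.5)
right_force = force_at(image, 3, 2, sigma_g=10.0, sigma_s=1.5)
print(left_force, right_force)
```

In this toy image the pixel just left of the edge is pulled further left and the pixel just right of it is pulled further right, which is the diverging pattern that marks a region boundary.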

3.2 Region Matching

The matching of motion regions across frames is formulated as a graph matching problem at four different scales, where scale refers to the level of detail captured by the image segmentation process. Three partitions of each image are created by slicing through the multiscale pyramid at three preselected values of $\sigma_g$. Region partitions from adjacent frames are matched from coarse to fine scales, with coarser scale matches guiding the finer scale matching. Each partition is represented as a region adjacency graph, within which each region is represented as a node and region adjacencies are represented as edges. Region matching at each scale consists of finding the set of graph transformation operations (edge deletion, edge and node matching, and node merging) of least cost that create an isomorphism between the current graph pair. The cost of matching a pair of regions takes into account their similarity with regard to area, average intensity, expected position as estimated from each region's motion in previous frames, and the spatial relationship of each region with its neighboring regions.
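To illustrate how such a matching cost might combine its terms, the sketch below scores candidate regions by area, average intensity and distance from the expected position. It is a simplified, hypothetical stand-in: the term for the spatial relationship with neighboring regions is omitted, and the `Region` fields, the equal weights and the `position_scale` normalizer are assumptions rather than values from the paper or from [9].

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    area: float            # number of pixels in the region
    mean_intensity: float  # average gray level, 0..255
    cx: float              # centroid x
    cy: float              # centroid y


def match_cost(region, candidate, expected_xy, weights=(1.0, 1.0, 1.0),
               position_scale=20.0):
    """Weighted dissimilarity between a region and a candidate match."""
    area_term = (abs(region.area - candidate.area)
                 / max(region.area, candidate.area))
    intensity_term = abs(region.mean_intensity - candidate.mean_intensity) / 255.0
    # Distance of the candidate from where the region is expected to be,
    # e.g. extrapolated from its motion in previous frames.
    position_term = math.hypot(candidate.cx - expected_xy[0],
                               candidate.cy - expected_xy[1]) / position_scale
    w_area, w_intensity, w_position = weights
    return (w_area * area_term + w_intensity * intensity_term
            + w_position * position_term)


def best_match(region, candidates, expected_xy):
    """Index of the lowest-cost candidate."""
    costs = [match_cost(region, c, expected_xy) for c in candidates]
    return costs.index(min(costs))


region = Region(area=100, mean_intensity=120, cx=10, cy=10)
candidates = [Region(400, 30, 50, 50), Region(95, 118, 12, 11)]
chosen = best_match(region, candidates, expected_xy=(12, 10))
print(chosen)
```

A full graph matcher would also apply the edge deletion and node merging operations described above; this sketch covers only the pairwise cost.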

Once the image partitions at the three different homogeneity scales have been matched, matchings are then obtained for the regions in the first frame of the [...] segmentation module using the previous frame pair. The match in the second frame for each of these motion regions is given as the union of the set of finest scale regions that comprise the motion region. This gives a fourth matched pair of image partitions, which is considered to be the coarsest scale set of matches utilized in affine estimation. The details of the algorithm can be found in [9].

3.3 Affine Transformation Estimation

For each pair of matched regions, the best affine transformation between them is estimated iteratively. Let $R_i^t$ be the $i$th region in frame $t$ and let its matched region be $R_i^{t+1}$. Also let the coordinates of the pixels within $R_i^t$ be $(x_{ij}^t, y_{ij}^t)$, with $j = 1 \ldots |R_i^t|$, where $|R_i^t|$ is the cardinality of $R_i^t$, and let the pixel nearest the centroid of $R_i^t$ be $(x_i^t, y_i^t)$. Each $(x_{ij}^t, y_{ij}^t)$ is mapped by an affine transformation to the point $(\hat{x}_{ij}^t, \hat{y}_{ij}^t)$ according to

$$\begin{pmatrix} x_{ij}^t \\ y_{ij}^t \end{pmatrix} \rightarrow \mathcal{R}\left[ A_k \begin{pmatrix} x_{ij}^t - x_i^t \\ y_{ij}^t - y_i^t \end{pmatrix} + \vec{T}_k + \begin{pmatrix} x_i^{t+1} \\ y_i^{t+1} \end{pmatrix} \right] = \begin{pmatrix} \hat{x}_{ij}^t \\ \hat{y}_{ij}^t \end{pmatrix}_k$$

where the subscript $k$ denotes the iteration number, and $\mathcal{R}[\cdot]$ denotes a vector operator that rounds each vector component to the nearest integer. The affine transformation comprises a 2 × 2 deformation matrix, $A_k$, and a translation vector, $\vec{T}_k$. By defining the indicator function

$$\chi_i^t(x,y) = \begin{cases} 1, & (x,y) \in R_i^t \\ 0, & \text{else} \end{cases}$$

the amount of mismatch is measured as

$$M_i^t = \sum_{x,y} \left| I_t(x,y) - I_{t+1}(\hat{x},\hat{y}) \right| \left[ \chi_i^t(x,y) + \chi_i^{t+1}(\hat{x},\hat{y}) - \chi_i^t(x,y)\,\chi_i^{t+1}(\hat{x},\hat{y}) \right]$$

The affine transformation parameters that minimize $M_i^t$ are estimated iteratively using a local descent criterion.
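The pixel mapping and a simplified mismatch score can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the score sums absolute intensity differences over the pixels of the source region only (it does not reproduce the full indicator-function weighting), and pixels mapped outside the frame receive an arbitrary fixed penalty. A local descent would evaluate this score for small perturbations of the deformation matrix and translation and keep the best.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, with ties going up."""
    return int(math.floor(value + 0.5))


def map_pixel(pixel, A, T, centroid_t, centroid_t1):
    """Apply A to the offset from the region centroid, add T, then move
    to the matched region's centroid and round to integer coordinates."""
    dx = pixel[0] - centroid_t[0]
    dy = pixel[1] - centroid_t[1]
    x_hat = A[0][0] * dx + A[0][1] * dy + T[0] + centroid_t1[0]
    y_hat = A[1][0] * dx + A[1][1] * dy + T[1] + centroid_t1[1]
    return round_half_up(x_hat), round_half_up(y_hat)


def intensity_mismatch(frame_t, frame_t1, pixels, A, T, centroid_t,
                       centroid_t1, outside_penalty=255):
    """Sum of absolute intensity differences over the region's pixels."""
    total = 0
    for x, y in pixels:
        xh, yh = map_pixel((x, y), A, T, centroid_t, centroid_t1)
        if 0 <= yh < len(frame_t1) and 0 <= xh < len(frame_t1[0]):
            total += abs(frame_t[y][x] - frame_t1[yh][xh])
        else:
            total += outside_penalty  # assumption: fixed cost for leaving the frame
    return total


identity = [[1.0, 0.0], [0.0, 1.0]]
frame_t = [[0, 10, 20, 0],
           [0, 30, 40, 0],
           [0, 0, 0, 0]]
# The same patch shifted one pixel to the right.
frame_t1 = [[0, 0, 10, 20],
            [0, 0, 30, 40],
            [0, 0, 0, 0]]
region_pixels = [(1, 0), (2, 0), (1, 1), (2, 1)]

good = intensity_mismatch(frame_t, frame_t1, region_pixels, identity,
                          (0.0, 0.0), (1, 0), (2, 0))
bad = intensity_mismatch(frame_t, frame_t1, region_pixels, identity,
                         (-1.0, 0.0), (1, 0), (2, 0))
print(good, bad)
```

With the correct centroids and zero extra translation the patch lines up exactly; adding a one-pixel translation error raises the score, which is the signal a descent step would follow.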

3.4 Motion Field Integration

The computed affine parameters give a motion field at each of the four scales. These motion fields are then combined into a single motion field by taking the coarsest motion field and then performing the following computation recursively at the four scales. At each matched region, the image prediction errors generated by the current motion field and by the motion field at the next finer scale are compared. At any region where the prediction error using the finer scale motion improves by a significant amount, the current motion is replaced by the finer scale motion. The result is a set of "best matched" regions at the coarsest acceptable scales.

The resulting motion field $\vec{M}_{t,t+1}$ is segmented into areas of uniform motion. We use a heuristic that considers each pair of best matched regions, $R_i^t$ and $R_j^t$, which share a common border, and merges them if the following relation is satisfied for all $(x_{ik}^t, y_{ik}^t)$ and $(x_{jl}^t, y_{jl}^t)$ that are spatially adjacent to one another:

$$\frac{\left\| \vec{M}_{t,t+1}(x_{ik}^t, y_{ik}^t) - \vec{M}_{t,t+1}(x_{jl}^t, y_{jl}^t) \right\|}{\max\left( \left\| \vec{M}_{t,t+1}(x_{ik}^t, y_{ik}^t) \right\|, \left\| \vec{M}_{t,t+1}(x_{jl}^t, y_{jl}^t) \right\| \right)} < m_g$$

where $m_g$ is a constant less than 1 that determines the degree of motion similarity necessary for the regions to merge.
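The merge test is a relative difference between two motion vectors, which can be written directly. In this sketch the function names and the threshold value 0.3 are assumptions chosen for illustration, and treating two static pixels as similar is a convention added to avoid division by zero.

```python
import math


def similar_motion(m1, m2, m_g):
    """True if the difference of two motion vectors, relative to the
    larger of their magnitudes, is below the threshold m_g."""
    difference = math.hypot(m1[0] - m2[0], m1[1] - m2[1])
    largest = max(math.hypot(*m1), math.hypot(*m2))
    if largest == 0.0:
        return True  # assumption: two static pixels count as similar
    return difference / largest < m_g


def should_merge(border_pairs, m_g=0.3):
    """Merge two regions only if every adjacent pixel pair along their
    common border has similar motion."""
    return all(similar_motion(a, b, m_g) for a, b in border_pairs)


same_direction = [((3.0, 4.0), (3.0, 4.5)), ((3.0, 4.0), (2.8, 4.1))]
opposite = [((3.0, 4.0), (3.0, 4.5)), ((3.0, 4.0), (-3.0, -4.0))]
print(should_merge(same_direction), should_merge(opposite))
```

Because the ratio is normalized by the larger magnitude, the same threshold applies to slow and fast regions alike.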

The segmented motion regions are each represented in $MS_{t,t+1}$ by a different value. Because each of the best matched regions has a match, the matches in frame $t+1$ of the regions in $MS_{t,t+1}$ are known and comprise the coarsest scale regions that are used in the affine estimation module for the next frame pair.

It should be noted that the motion segmentation does not necessarily correspond to the moving objects in the scene because the motion segmentation is done over a single motion field. Nonrigid objects, such as humans, are segmented into multiple, piecewise rigid regions. In addition, objects moving at rates of less than one pixel per frame cannot be identified. Handling both of these situations requires examining the motion field over multiple frames.

Figure 1 shows frames from an image sequence of a complex ASL sign called "cheerleader" and Figure 2 shows the results of motion segmentation. Different motion regions are displayed with different gray levels. Notice that there are several motion regions within the head and palm regions because each of these piecewise rigid regions has its own uniform motion.

4 Color and Geometric Analysis

Motion segmentation generates regions that have uniform motion. However, only some of these motion regions carry important information for motion pattern recognition. To recognize the hand gestures considered here, it is sufficient to extract the motion regions of the head and palms. Toward this end, we use color and geometric information of palm and head regions.

Human skin color has been used and proved to be an effective feature in many applications. We use a Gaussian mixture to model the distribution of skin color pixels from a Michigan database of 2,447 images, which consists of human faces from different ethnic groups. We use the CIE LUV color space and discard the luminance value of each pixel to minimize the [...] Gaussian mixture are estimated using an EM algorithm. A motion region is classified as having skin color if most of its pixels have probabilities of being skin color above a threshold. Coupled with motion segmentation, motion regions of skin color can be extracted efficiently from image sequences.
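The region-level decision can be sketched as a majority test over per-pixel mixture scores. Everything numeric here is a placeholder: the two mixture components, the threshold and the `min_fraction` of one half are invented for illustration and are not the parameters estimated by EM in this work. The sketch also assumes diagonal covariances and thresholds the mixture density directly.

```python
import math


def gaussian_2d(u, v, mean, var):
    """Density of a 2-D Gaussian with diagonal covariance (var_u, var_v)."""
    du, dv = u - mean[0], v - mean[1]
    norm = 2.0 * math.pi * math.sqrt(var[0] * var[1])
    return math.exp(-0.5 * (du * du / var[0] + dv * dv / var[1])) / norm


def skin_likelihood(u, v, mixture):
    """Mixture density at chromaticity (u, v); luminance is not used."""
    return sum(weight * gaussian_2d(u, v, mean, var)
               for weight, mean, var in mixture)


def region_is_skin(pixels_uv, mixture, threshold, min_fraction=0.5):
    """A region is skin-colored if most of its pixels score above threshold."""
    hits = sum(1 for u, v in pixels_uv
               if skin_likelihood(u, v, mixture) > threshold)
    return hits > min_fraction * len(pixels_uv)


# Placeholder mixture: (weight, (mean_u, mean_v), (var_u, var_v)).
mixture = [(0.6, (100.0, 150.0), (25.0, 25.0)),
           (0.4, (110.0, 160.0), (36.0, 36.0))]

mostly_skin = [(100.0, 150.0), (102.0, 151.0), (0.0, 0.0)]
mostly_background = [(0.0, 0.0), (1.0, 1.0), (100.0, 150.0)]
print(region_is_skin(mostly_skin, mixture, threshold=1e-3),
      region_is_skin(mostly_background, mixture, threshold=1e-3))
```

Applying the test per motion region rather than per pixel means a few outlier pixels do not change the label of a region.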

Since the shape of the human head and palm can be approximated by ellipses, and the human hand is a thin rectangular region, motion regions that have skin color are merged until the shape of the merged region is approximately elliptic or rectangular. The parameters of a rectangular shape can be obtained easily from the bounding box of each region. The orientation of an ellipse is calculated from the axis of the least moment of inertia. The extents of the major and minor axes of the ellipse are approximated by the extents of the region along the axis directions, which yields the parameters of the ellipse. The largest elliptic region extracted from an image is identified as the human head and the next two smaller elliptic regions are the palm regions. Figure 1 shows the image sequence of a complex ASL sign called "cheerleader" and Figure 3 shows the results of color and geometric analysis on the motion regions.
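One common way to obtain the axis of least moment of inertia is from the second-order central moments of the region, and the sketch below follows that route. It is an assumption that this matches the computation used here; the function name `fit_ellipse` and the toy region are hypothetical.

```python
import math


def fit_ellipse(pixels):
    """Centroid, orientation and axis extents of a pixel region.

    The orientation is the axis of least second moment, obtained from
    the central moments mu20, mu02 and mu11.
    """
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    mu20 = sum((x - cx) ** 2 for x, _ in pixels)
    mu02 = sum((y - cy) ** 2 for _, y in pixels)
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels)
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)

    # Project the pixels onto the major axis and the axis perpendicular
    # to it; the spread of the projections approximates the extents.
    ux, uy = math.cos(theta), math.sin(theta)
    along = [(x - cx) * ux + (y - cy) * uy for x, y in pixels]
    across = [-(x - cx) * uy + (y - cy) * ux for x, y in pixels]
    major = max(along) - min(along)
    minor = max(across) - min(across)
    return (cx, cy), theta, major, minor


# A horizontal bar, 10 pixels wide and 2 pixels tall.
bar = [(x, y) for x in range(10) for y in range(2)]
center, theta, major, minor = fit_ellipse(bar)
print(center, theta, major, minor)
```

Comparing the major extents of the fitted regions then gives a simple way to pick the largest region as the head and the next two as palms.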

5 Motion Trajectories

Although motion segmentation generates affine transformations that capture motion details by matching regions at fine scales, it is sufficient to use coarser motion trajectories of identified palm regions for the gesture recognition considered in this paper.

The affine transformation of the palm region in each frame pair is computed based on the equations in Section 3.3. The affine transformations of successive pairs are then concatenated to construct the motion trajectories of the palm region. Figure 4 shows such trajectories for a number of frames in the image sequence "cheerleader." Since all pixel trajectories are shown together, they form a thick blob. Figure 5 shows a 10 to 1 subsampling of the motion trajectories.
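Concatenation amounts to feeding the output of one frame pair's mapping into the next. A minimal sketch, assuming each frame pair is described by a hypothetical tuple of deformation matrix, translation, and the two region centroids, and using made-up transforms:

```python
import math


def apply_affine(point, A, T, centroid_t, centroid_t1):
    """Map a pixel of the palm region from frame t to frame t+1."""
    dx = point[0] - centroid_t[0]
    dy = point[1] - centroid_t[1]
    x_hat = A[0][0] * dx + A[0][1] * dy + T[0] + centroid_t1[0]
    y_hat = A[1][0] * dx + A[1][1] * dy + T[1] + centroid_t1[1]
    # Round each component to the nearest integer pixel.
    return (int(math.floor(x_hat + 0.5)), int(math.floor(y_hat + 0.5)))


def build_trajectory(start, frame_pair_transforms):
    """Chain the per-pair mappings to follow one pixel through the sequence."""
    trajectory = [start]
    for A, T, centroid_t, centroid_t1 in frame_pair_transforms:
        trajectory.append(
            apply_affine(trajectory[-1], A, T, centroid_t, centroid_t1))
    return trajectory


identity = [[1.0, 0.0], [0.0, 1.0]]
# Two made-up frame pairs: the palm centroid moves right, then down.
transforms = [
    (identity, (0.0, 0.0), (5, 5), (7, 5)),
    (identity, (0.0, 0.0), (7, 5), (7, 8)),
]
trajectory = build_trajectory((6, 6), transforms)
print(trajectory)
```

Running `build_trajectory` for every pixel of the palm region produces the dense set of trajectories; keeping every tenth one corresponds to the subsampling shown in Figure 5.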

6 Motion Pattern Classification

We employ a TDNN to classify gestural motion patterns of palm regions since TDNNs have been demonstrated to be very successful in learning spatio-temporal patterns. The TDNN is a dynamic classification approach in that the network sees only a small window of the motion pattern, and this window slides over the input data while the network makes a series of local decisions. These local decisions have to be integrated into a global decision at a later time. In their seminal work, Waibel et al. [11] demonstrated excellent results for phoneme classification using a TDNN and showed that it achieves lower error rates than those achieved by a simple HMM recognizer.

Figure 1: Image sequence of ASL sign "cheerleader". Panels (i)-(p): frames 35, 37, 40, 44, 46, 49, 52, 55.

Figure 2: Motion segmentation of the image sequence "cheerleader" (pixels of the same motion region are displayed with the same gray level and different regions are displayed with different gray levels). Panels (a)-(h): frames 14, 16, 19, 22, 25, 29, 31, 34.

Figure 3: Extracted head and palm regions from image sequence "cheerleader". Panels (a)-(h): frames 14, 16, 19, 22, 25, 29, 31, 34.

Figure 4: Extracted gestural motion trajectories from segments of ASL sign "cheerleader" (since all pixel trajectories are shown, they form a thick blob). Panels (a)-(n) plot the trajectories of palm1 and palm2 (X-axis versus Y-axis) between frames #14-#16, #16-#19, #19-#22, #22-#25, #25-#29, #29-#31, #31-#34, #35-#37, #37-#40, #40-#44, #44-#46, #46-#49, #49-#52, and #52-#55.

Figure 5: Extracted gestural motion trajectories (subsampled by a factor of 10) of ASL sign "cheerleader". (a) Motion trajectories of a sample set of palm points (palm1). (b) Motion trajectory of one palm point (palm1), labeled with frames #14 to #55. (c) Motion trajectories of a sample set of palm points (palm2). (d) Motion trajectory of one palm point (palm2), labeled with frames #14 to #55.

The design of the TDNN is attractive because its compact structure economizes on weights and makes it possible for the network to develop general feature detectors. Most importantly, its temporal integration at the output layer makes the network shift invariant (i.e., insensitive to the exact positioning of the gesture). Figure 6 shows our TDNN architecture for the experiments, where positive values are shown as gray squares and negative values as black squares. The inputs to our TDNN are vectors of $(x, y, v, \theta)$ for motion trajectories extracted from a gesture image sequence, where $x$ and $y$ are positions with respect to the center of the head, and $v$ and $\theta$ are the magnitude and angle of velocity, respectively; the outputs are the gesture classes; and the learning mechanism is error backpropagation.
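The input vectors can be derived from a trajectory and the head center as in the sketch below. How the velocity was computed is not stated here, so the frame-to-frame backward difference used in this sketch is an assumption, as are the function name and the sample coordinates.

```python
import math


def trajectory_features(trajectory, head_center):
    """Turn a list of (x, y) positions into (x, y, v, theta) vectors.

    x and y are taken relative to the head center; v and theta are the
    magnitude and angle of the displacement from the previous position.
    """
    features = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        features.append((x1 - head_center[0],
                         y1 - head_center[1],
                         math.hypot(dx, dy),
                         math.atan2(dy, dx)))
    return features


# Made-up palm positions; the last two frames show no movement.
positions = [(10, 10), (13, 14), (13, 14)]
features = trajectory_features(positions, head_center=(10, 0))
print(features)
```

Expressing positions relative to the head makes the features independent of where the signer stands in the frame, while the velocity terms carry the direction and speed of the movement.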

7 Experiments

We use a video database of 40 ASL signs for experiments. Each video consists of an ASL sign which lasts about 3 to 5 seconds at 30 frames per second with an image size of 160 × 120 in QuickTime format. Figure 1 shows one complex ASL gesture from the sequence "cheerleader." Note that the hand movement consists of rotation and repetitions. Each image sequence of the 40 gestures in the experiment has 80 to 120 frames.

Figure 6: Architecture of the TDNN. The input layer takes $(x, y, v, \theta)$ over 50 time slots; with window lengths of 5, 10, and 20, hidden layer 1, hidden layer 2, and the output layer have 46, 37, and 18 slots, respectively; the output layer is integrated over time into 40 gesture classes.

Discarding the frames in which palms do not appear in the images (i.e., frames in the starting and ending phases), each image sequence has about 50 frames. Motion regions with skin color are identified by their chromatic characteristics. These regions are then merged into the palm and head regions shown in Figure 3 based on the geometric analysis discussed in Section 4. Affine parameters of matched palm regions are computed, which give pixel motion trajectories for each image pair. By concatenating the trajectories for consecutive image pairs, continuous motion trajectories are generated. Figure 4 shows the extracted motion trajectories from a number of frames and Figure 5 shows the trajectories from the whole image sequence. Note that the motion trajectories of the palm region match the movement in the real scene well.

Training of the TDNN is performed on a corpus of 80% of the extracted dense (38 on average) trajectories from each gesture, using an error backpropagation algorithm. The remaining 20% of the trajectories are then used for testing. Based on the experiments with 40 ASL gestures, the average recognition rate on the training trajectories is 98.14% and the average recognition rate on the unseen test trajectories is 93.42%. Since dense motion trajectories are extracted from each image sequence, the recognition rate for each gesture can be improved by a "voting" scheme (i.e., majority rule) on the classification results of the individual trajectories. The resulting average recognition rates on the training and testing sets for gesture recognition are 99.02% and 96.21%, respectively.
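The voting step can be written in a few lines. This is a sketch of plain majority voting; the function name `vote` and the class labels other than "cheerleader" are placeholders, and how ties were resolved in the experiments is not stated, so the tie behavior here (the label seen first wins) is an assumption.

```python
from collections import Counter


def vote(trajectory_labels):
    """Return the gesture label predicted for the most trajectories."""
    label, _count = Counter(trajectory_labels).most_common(1)[0]
    return label


# Per-trajectory predictions for one image sequence (placeholder labels).
predictions = ["cheerleader", "hello", "cheerleader", "cheerleader", "name"]
gesture = vote(predictions)
print(gesture)
```

Because a sequence yields many trajectories, a minority of misclassified trajectories is outvoted, which is consistent with the gesture-level rates being higher than the trajectory-level rates.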

8 Discussion and Conclusion

We have described an algorithm to extract and recognize motion patterns using trajectories. For concreteness, the experiments have been carried out to recognize hand gestures in ASL. Motion segmentation [...]. Moving regions with salient features are then extracted using color and geometric information. The affine transformations associated with these regions are then concatenated to generate continuous trajectories. These motion trajectories encode the dynamic characteristics of hand gestures and are classified by a time-delay neural network. Our experiments demonstrate that hand gestures can be recognized, with high accuracy, using motion trajectories.

The contributions of this work can be summarized as follows. First, a general method that extracts motion trajectories is developed. This is in contrast to much work on gesture recognition that uses color histogram trackers [8] [4] [2], magnetic sensors [3], hand-drawn templates [6], and stereo [10] to obtain a representation of the gesture. Second, we use a TDNN to recognize gestures based on the extracted trajectories. Using an ensemble of trajectories helps achieve high recognition rates. It would be interesting to compare these recognition rates with those obtained using other recognition methods such as HMMs, the CONDENSATION algorithm [6] [2], and principal curves [3].

Acknowledgements

The support of Advanced Telecommunication Research International is gratefully acknowledged.

References

[1] N. Ahuja. A transform for multiscale image segmentation by integrated edge and region detection. IEEE Trans. Pattern Anal. Mach. Intell., 18(12):1211-1235, 1996.

[2] M. J. Black and A. D. Jepson. A probabilistic framework for matching temporal trajectories: Condensation-based recognition of gesture and expressions. In Proceedings of European Conference on Computer Vision, pages 909-924, 1998.

[3] A. F. Bobick and A. D. Wilson. A state-based approach to the representation and recognition of gesture. IEEE Trans. Pattern Anal. Mach. Intell., 19(12):1325-1337, 1997.

[4] J. L. Crowley and F. Berard. Multi-modal tracking of faces for video communications. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 640-645, 1997.

[5] T. Hastie and W. Stuetzle. Principal curves. Journal of American Statistical Association, 84(406):502-516, 1989.

[6] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5-28, 1998.

[7] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14(2):201-211, 1973.

[8] J. M. Siskind and Q. Morris. A maximum-likelihood approach to visual event classification. In Proceedings of the Fourth European Conference on Computer Vision, pages 347-360, 1996.

[9] M. Tabb and N. Ahuja. 2-d motion estimation by matching a multiscale set of region primitives. IEEE Trans. Pattern Anal. Mach. Intell., 1997. Submitted.

[10] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In Proceedings of the Sixth International Conference on Computer Vision, pages 363-369, 1998.

[11] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Trans. on Acoustics, Speech, and Signal Processing, 37(3):328-339, 1989.