Ming-Hsuan Yang and Narendra Ahuja
Department of Computer Science and Beckman Institute
University of Illinois at Urbana-Champaign, Urbana, IL 61801
fmhyang,[email protected]
http://vision.ai.uiuc.edu
Abstract
We present an algorithm for extracting and classifying two-dimensional motion in an image sequence based on motion trajectories. First, a multiscale segmentation is performed to generate homogeneous regions in each frame. Regions between consecutive frames are then matched to obtain 2-view correspondences. Affine transformations are computed from each pair of corresponding regions to define pixel matches. Pixel matches over consecutive image pairs are concatenated to obtain pixel-level motion trajectories across the image sequence. Motion patterns are learned from the extracted trajectories using a time-delay neural network. We apply the proposed method to recognize 40 hand gestures of American Sign Language. Experimental results show that motion patterns in hand gestures can be extracted and recognized with a high recognition rate using motion trajectories.
1 Introduction
In this paper, we present an algorithm for extracting two-dimensional motion fields of objects across a video sequence and classifying each as one of a set of a priori known classes. The algorithm is used to recognize dynamic visual processes based on spatial, photometric and temporal characteristics. An application of the algorithm is in sign language recognition, where an utterance is interpreted based on, for example, hand location, shape, and motion. The performance of the algorithm is evaluated on the task of recognizing 40 complex hand gestures of American Sign Language (ASL).
The algorithm consists of two major steps. First, each image is partitioned into regions using a multiscale segmentation method. Regions between consecutive frames are then matched to obtain 2-view correspondences. Affine transformations are computed from each pair of corresponding regions to define pixel matches. Pixel matches over consecutive image pairs are concatenated to obtain pixel-level motion trajectories across the video sequence. Pixels are also grouped based on their 2-view motion similarity to obtain a motion based segmentation of the video sequence. Only some of the moving regions correspond to visual phenomena of interest. Both the intrinsic properties of the objects represented by image regions and their dynamics represented by the motion trajectories determine whether they comprise an event of interest. For example, it is sufficient to recognize most gestures in ASL in terms of shape and location changes of palm regions. Therefore, palm and head regions are extracted in each frame and the palm locations are specified with reference to the usually still head regions.
To recognize motion patterns from trajectories, we use a time-delay neural network (TDNN) [11]. A TDNN is a multilayer feedforward network that uses time-delays between all layers to represent temporal relationships between events in time. An input vector is organized as a temporal sequence, where only the portion of the input sequence within a time window is fed to the network at one time. The time window is shifted and another portion of the input sequence is given to the network until the whole sequence has been scanned through. The TDNN is trained using the standard error backpropagation learning algorithm. The output of the network is computed by adding the scores obtained at all window positions over time, followed by applying a nonlinear function such as the sigmoid function to the sum. TDNNs with two hidden layers using sliding input windows over time lead to a relatively small number of trainable parameters. We adopt a TDNN to recognize motion patterns because gestures are spatio-temporal sequences of feature vectors defined along motion trajectories. Our experimental results show that motion patterns can be learned by a time-delay neural network with a high recognition rate.
2 Related Work
Since Johansson's seminal work [7] that suggests human movements can be recognized solely by motion [...] been investigated to recognize human motion by several researchers. In [8] Siskind and Morris conjecture that human event perception does not presuppose object recognition. In other words, they think visual event recognition is performed by a visual pathway which is separated from object recognition. To verify the conjecture, they analyze motion profiles of objects that participate in different simple spatial-motion events. Their tracker uses a mixture of color based and motion based techniques. Color based techniques are used to track objects defined by sets of colored pixels whose saturation and value are above certain thresholds in each frame. These pixels are then clustered into regions using a histogram based on hue. Moving pixels are extracted from frame differences and divided into clusters based on proximity. Next, each region (generated by color or motion) in each frame is abstracted by an ellipse. Finally, a feature vector for each frame is generated by computing the absolute and relative ellipse positions, orientations, velocities and accelerations. To classify visual events, they use a set of Hidden Markov Models (HMMs) which are used as generative models and trained on movies of each visual event represented by a set of feature vectors. After training, a new observation is classified as being generated by the model that assigns the highest likelihood. Experiments on a set of 6 simple gestures, "pick up," "put down," "push," "pull," "drop," and "throw," demonstrate that gestures can be classified based on motion profiles.
Bobick and Wilson [3] adopt a state based approach to represent and recognize gestures. First, many samples of a gesture are used to compute its principal curve [5], which is parameterized by arc length. A by-product of calculating the curve is the mapping of each sample point of a gesture example to an arc length along the curve. Next, they use line segments of uniform length to approximate the discretized curve. Each line segment is represented by a vector and all the line segments are grouped into a number of clusters. A state is defined to indicate the cluster to which a line segment belongs. A gesture is then defined by an ordered sequence of states. The recognition procedure is to evaluate whether an input trajectory successfully passes through the states in the prescribed order. In contrast to their work, where each example of a gesture is a single trajectory in space, each gesture in our work is represented by a set of motion trajectories corresponding to the motions of different parts of, say, the palm, instead of a single representative point. Thus, each example of a gesture in our work is [...] experimental results show that an ensemble of trajectories yields better generalization.
Recently, Isard and Blake have proposed the CONDENSATION algorithm [6] as a probabilistic method to track curves in visual scenes. This method is a fusion of the statistical factored sampling algorithm with a stochastic model to search a multivariate parameter space that is changing over time. Objects are modeled as a set of parameterized curves and the stochastic model is estimated based on the training sequence. Experiments on the proposed algorithm have been carried out to track objects based on their hand drawn templates. Black and Jepson [2] extend this algorithm to recognize gestures and facial expressions, in which human motions are modeled as temporal trajectories of some estimated parameters (which describe the states of a gesture) over time. The major difference between our approach and these methods is that we propose a method to extract motion trajectories from an image sequence without hand drawn templates [6] or distinct trackable icons [2]. Motion patterns are then learned from the extracted motion trajectories. No prior knowledge is assumed or required for the extraction of motion trajectories, although domain specific knowledge can be applied for efficiency reasons.
3 Motion Segmentation
To capture the dynamic characteristics of objects, we segment an image frame into regions with uniform motion. Our motion segmentation algorithm processes an image sequence two successive frames at a time. For a pair of frames, $(I_t, I_{t+1})$, the algorithm identifies regions in each frame comprising the multiscale intraframe structure. Regions at all scales are then matched across frames. Affine transforms are computed for each matched region pair. The affine transform parameters for regions at all scales are then used to derive a single motion field, which is then segmented to identify the differently moving regions between the two frames. The following sections describe the major steps in the motion segmentation algorithm.
3.1 Multiscale Image Segmentation
Multiscale segmentation is performed using a transform described in [1] which extracts a hierarchy of regions in each image. The general form of the transform, which maps an image to a family of attraction force fields, is defined by

$$F(x, y; \sigma_g(x,y), \sigma_s(x,y)) = \iint_R d_g(\Delta I, \sigma_g(x,y)) \, d_s(\vec{r}, \sigma_s(x,y)) \, \frac{\vec{r}}{\|\vec{r}\|} \, dw \, dv$$

where $R = \mathrm{domain}(I(u,v)) \setminus \{(x,y)\}$ and $\vec{r} = (v - x)\vec{i} + (w - y)\vec{j}$. The parameter $\sigma_g$ denotes a homogeneity [...] region to which a pixel belongs, and $\sigma_s$ is a spatial scale that controls the neighborhood from which the force on the pixel is computed. The homogeneity of two pixels is given by the Euclidean distance between the associated $m$-dimensional vectors of pixel values (e.g., $m = 3$ for a color image):

$$\Delta I = |I(x,y) - I(v,w)|$$

The spatial scale parameter, $\sigma_s$, controls the spatial distance function, $d_s(\cdot)$, and the homogeneity scale parameter, $\sigma_g$, controls the homogeneity distance function, $d_g(\cdot)$. One possible form for these functions satisfying the criteria discussed in [1] is the unnormalized Gaussian:

$$d_g(\Delta I, \sigma_g) = \sqrt{2\pi\sigma_g^2} \; N_{\Delta I}(0, \sigma_g^2)$$

$$d_s(\vec{r}, \sigma_s) = \begin{cases} \sqrt{2\pi\sigma_s^2} \; N_{\|\vec{r}\|}(0, \sigma_s^2), & \|\vec{r}\| \le 2\sigma_s \\ 0, & \|\vec{r}\| > 2\sigma_s \end{cases}$$

The force field encodes the region structure in a manner which allows easy extraction. Region boundaries correspond to diverging force vectors in $F$ and region skeletons correspond to converging force vectors in $F$. An increase in $\sigma_g$ causes less homogeneous structures to be encoded and an increase in $\sigma_s$ causes large structures to be encoded.
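As a rough illustration of the transform, the following sketch evaluates a discrete approximation of the force at a single pixel of a grayscale image (m = 1). It assumes the unnormalized Gaussian forms exp(-d^2 / (2 sigma^2)) for both distance functions and replaces the integral by a sum over pixels; the function name `force_at` and the array layout (`image[y, x]`) are assumptions made for this example and do not come from [1].

```python
import numpy as np

def force_at(image, x, y, sigma_g, sigma_s):
    """Discrete approximation of the attraction force at pixel (x, y).

    image is a 2-D grayscale array indexed as image[y, x] (assumed layout).
    Returns the force vector as (F_x, F_y).
    """
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    rx = (cols - x).astype(float)          # x-component of r
    ry = (rows - y).astype(float)          # y-component of r
    dist = np.hypot(rx, ry)                # ||r||
    # Exclude the pixel itself and everything farther than 2 * sigma_s.
    mask = (dist > 0) & (dist <= 2.0 * sigma_s)
    delta = np.abs(image.astype(float) - float(image[y, x]))
    d_g = np.exp(-delta ** 2 / (2.0 * sigma_g ** 2))   # homogeneity weight
    d_s = np.exp(-dist ** 2 / (2.0 * sigma_s ** 2))    # spatial weight
    weight = np.where(mask, d_g * d_s, 0.0)
    safe = np.where(mask, dist, 1.0)       # avoid division by zero at r = 0
    return np.array([np.sum(weight * rx / safe), np.sum(weight * ry / safe)])
```

In a uniform image the contributions cancel by symmetry, while a pixel close to an intensity step is pulled toward the interior of its own region, which is the behavior that makes force vectors diverge at region boundaries.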
3.2 Region Matching
The matching of motion regions across frames is formulated as a graph matching problem at four different scales, where scale refers to the level of detail captured by the image segmentation process. Three partitions of each image are created by slicing through the multiscale pyramid at three preselected values of $\sigma_g$. Region partitions from adjacent frames are matched from coarse to fine scales, with coarser scale matches guiding the finer scale matching. Each partition is represented as a region adjacency graph, within which each region is represented as a node and region adjacencies are represented as edges. Region matching at each scale consists of finding the set of graph transformation operations (edge deletion, edge and node matching, and node merging) of least cost that create an isomorphism between the current graph pair. The cost of matching a pair of regions takes into account their similarity with regard to area, average intensity, expected position as estimated from each region's motion in previous frames, and the spatial relationship of each region with its neighboring regions.
Once the image partitions at the three different homogeneity scales have been matched, matchings are then obtained for the regions in the first frame of the [...] segmentation module using the previous frame pair. The match in the second frame for each of these motion regions is given as the union of the set of finest scale regions that comprise the motion region. This gives a fourth matched pair of image partitions, and is considered to be the coarsest scale set of matches that is utilized in affine estimation. The details of the algorithm can be found in [9].
3.3 Affine Transformation Estimation
For each pair of matched regions, the best affine transformation between them is estimated iteratively. Let $R_i^t$ be the $i$th region in frame $t$ and its matched region be $R_i^{t+1}$. Also let the coordinates of the pixels within $R_i^t$ be $(x_{ij}^t, y_{ij}^t)$, with $j = 1 \ldots |R_i^t|$, where $|R_i^t|$ is the cardinality of $R_i^t$, and let the pixel nearest the centroid of $R_i^t$ be $(x_i^t, y_i^t)$. Each $(x_{ij}^t, y_{ij}^t)$ is mapped by an affine transformation to the point $(\hat{x}_{ij}^t, \hat{y}_{ij}^t)$ according to

$$\begin{pmatrix} x_{ij}^t \\ y_{ij}^t \end{pmatrix} \rightarrow \mathcal{R}\left[ A_k \begin{pmatrix} x_{ij}^t - x_i^t \\ y_{ij}^t - y_i^t \end{pmatrix} + \vec{T}_k + \begin{pmatrix} x_i^{t+1} \\ y_i^{t+1} \end{pmatrix} \right] = \begin{pmatrix} \hat{x}_{ij}^t \\ \hat{y}_{ij}^t \end{pmatrix}_k$$

where the subscript $k$ denotes the iteration number, and $\mathcal{R}[\,]$ denotes a vector operator that rounds each vector component to the nearest integer. The affine transformation comprises a $2 \times 2$ deformation matrix, $A_k$, and a translation vector, $\vec{T}_k$. By defining the indicator function,

$$\chi_i^t(x,y) = \begin{cases} 1, & (x,y) \in R_i^t \\ 0, & \text{else} \end{cases}$$

the amount of mismatch is measured as

$$M_i^t = \sum_{x,y} |I_t(x,y) - I_{t+1}(\hat{x},\hat{y})| \left[ \chi_i^t(x,y) + \chi_i^{t+1}(\hat{x},\hat{y}) - \chi_i^t(x,y)\,\chi_i^{t+1}(\hat{x},\hat{y}) \right]$$

The affine transformation parameters that minimize $M_i^t$ are estimated iteratively using a local descent criterion.
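The pixel mapping used in each iteration can be sketched as follows. This is a minimal illustration, assuming pixel coordinates are stored as (x, y) rows of an integer array and frames are indexed as `frame[y, x]`. The helper `simple_mismatch` is a deliberately simplified stand-in (a sum of absolute intensity differences over the region's pixels) rather than the mismatch measure of this section, and both function names are hypothetical. Note that `np.rint` rounds exact halves to the nearest even integer.

```python
import numpy as np

def map_region(pixels, centroid_t, centroid_t1, A, T):
    """Map region pixels from frame t to frame t+1.

    pixels: (n, 2) array of (x, y) coordinates in frame t.
    centroid_t, centroid_t1: centroid pixels of the matched regions.
    A: 2x2 deformation matrix, T: translation vector.
    """
    mapped = (pixels - centroid_t) @ A.T + T + centroid_t1
    return np.rint(mapped).astype(int)     # round each component

def simple_mismatch(frame_t, frame_t1, pixels, mapped):
    """Simplified mismatch: sum of |I_t - I_{t+1}| over mapped region pixels.

    Pixels mapped outside frame t+1 are ignored (an assumption of this sketch).
    """
    h, w = frame_t1.shape
    inside = ((mapped[:, 0] >= 0) & (mapped[:, 0] < w) &
              (mapped[:, 1] >= 0) & (mapped[:, 1] < h))
    src = frame_t[pixels[inside, 1], pixels[inside, 0]].astype(float)
    dst = frame_t1[mapped[inside, 1], mapped[inside, 0]].astype(float)
    return float(np.abs(src - dst).sum())
```

A local descent search would repeatedly perturb the entries of `A` and `T`, keep a perturbation whenever the mismatch decreases, and stop when no perturbation helps.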
3.4 Motion Field Integration
The computed affine parameters give a motion field at each of the four scales. These motion fields are then combined into a single motion field by taking the coarsest motion field and then performing the following computation recursively at four scales. At each matched region, the image prediction errors generated by the current motion field and by the motion field at the next finer scale are compared. At any region where the prediction error using the finer scale motion improves by a significant amount, the current motion is replaced by the finer scale motion. The result is a set of "best matched" regions at the coarsest acceptable scales.
The resulting motion field $\vec{M}_{t,t+1}$ is segmented into areas of uniform motion. We use a heuristic that considers each pair of best matched regions, $R_i^t$ and $R_j^t$, which share a common border, and merges them if the following relation is satisfied for all $(x_{ik}^t, y_{ik}^t)$ and $(x_{jl}^t, y_{jl}^t)$ that are spatially adjacent to one another:

$$\frac{\left\| \vec{M}_{t,t+1}(x_{ik}^t, y_{ik}^t) - \vec{M}_{t,t+1}(x_{jl}^t, y_{jl}^t) \right\|}{\max\left( \left\| \vec{M}_{t,t+1}(x_{ik}^t, y_{ik}^t) \right\|, \left\| \vec{M}_{t,t+1}(x_{jl}^t, y_{jl}^t) \right\| \right)} < m_g$$

where $m_g$ is a constant less than 1 that determines the degree of motion similarity necessary for the regions to merge.
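The merging test can be written compactly. The sketch below assumes motion vectors are given as 2-element arrays and that the pairs of spatially adjacent border pixels have already been collected; treating two stationary pixels as similar is an assumption added here to avoid a division by zero, and the function names are hypothetical.

```python
import numpy as np

def motions_similar(m_a, m_b, m_g):
    """Relative motion difference test between two motion vectors."""
    scale = max(np.linalg.norm(m_a), np.linalg.norm(m_b))
    if scale == 0.0:
        return True                        # both stationary (assumed similar)
    return bool(np.linalg.norm(m_a - m_b) / scale < m_g)

def should_merge(adjacent_pairs, m_g):
    """Merge two bordering regions only if every adjacent pixel pair passes.

    adjacent_pairs: iterable of (motion_a, motion_b) for adjacent pixels
    lying on either side of the common border.
    """
    return all(motions_similar(a, b, m_g) for a, b in adjacent_pairs)
```

For example, with `m_g = 0.2`, motions (1, 0) and (1.1, 0) differ by about 9% of the larger magnitude and pass the test, while opposite motions (1, 0) and (-1, 0) do not.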
The segmented motion regions are each represented in $MS_{t,t+1}$ by a different value. Because each of the best matched regions has a match, the matches in frame $t+1$ of the regions in $MS_{t,t+1}$ are known and comprise the coarsest scale regions that are used in the affine estimation module for the next frame pair.
It should be noted that the motion segmentation does not necessarily correspond to the moving objects in the scene because the motion segmentation is done over a single motion field. Nonrigid objects, such as humans, are segmented into multiple, piecewise rigid regions. In addition, objects moving at rates less than one pixel per frame cannot be identified. Handling both these situations requires examining the motion field over multiple frames.
Figure 1 shows frames from an image sequence of a complex ASL sign called "cheerleader" and Figure 2 shows the results of motion segmentation. Different motion regions are displayed with different gray levels. Notice that there are several motion regions within the head and palm regions because these piecewise rigid regions have uniform motion.
4 Color and Geometric Analysis
Motion segmentation generates regions that have uniform motion. However, only some of these motion regions carry important information for motion pattern recognition. To recognize the hand gestures considered here, it is sufficient to extract the motion regions of head and palm regions. Towards this end, we use color and geometric information of palm and head regions.
Human skin color has been used and proved to be an effective feature in many applications. We use a Gaussian mixture to model the distribution of skin color pixels from a Michigan database of 2,447 images which consists of human faces from different ethnic groups. We use the CIE LUV color space and discard the luminance value of each pixel to minimize the [...] Gaussian mixture are estimated using an EM algorithm. A motion region is classified as having skin color if most of its pixels have probabilities of being skin color above a threshold. Coupled with motion segmentation, motion regions of skin color can be efficiently extracted from image sequences.
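The region-level skin test can be sketched as below. The mixture parameters, the density threshold, and the reading of "most of the pixels" as "more than half" are all assumptions made for illustration; in practice the weights, means and covariances would come from EM fitting on chromaticity values, which is not shown. The function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(points, mean, cov):
    """Density of a multivariate Gaussian at each row of points."""
    d = points.shape[1]
    diff = points - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    return np.exp(expo) / norm

def mixture_pdf(points, weights, means, covs):
    """Weighted sum of Gaussian component densities."""
    return sum(w * gaussian_pdf(points, m, c)
               for w, m, c in zip(weights, means, covs))

def is_skin_region(chroma, weights, means, covs, threshold, min_fraction=0.5):
    """Classify a region from the (n, 2) chromaticity values of its pixels.

    The region is labeled skin if more than min_fraction of its pixels have
    mixture density above threshold (both values are assumed, not from the paper).
    """
    above = mixture_pdf(chroma, weights, means, covs) > threshold
    return bool(above.mean() > min_fraction)
```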
Since the shape of the human head and palm can be approximated by ellipses, and the human hand is a thin rectangular region, motion regions that have skin color are merged until the shape of the merged region is approximately elliptic or rectangular. The parameters of a rectangular shape can be obtained easily from the bounding box of each region. The orientation of an ellipse is calculated from the axes of the least moment of inertia. The extents of the major and minor axes of the ellipse are approximated by the extents of the region along the axis directions, and thus generate the parameters for the ellipse. The largest elliptic region extracted from an image is identified as the human head and the next two smaller elliptic regions are palm regions. Figure 1 shows the image sequence of a complex ASL sign called "cheerleader" and Figure 3 shows the results of color and geometric analysis on the motion regions.
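One common way to obtain the axis of least moment of inertia is from the second-order central moments of the region's pixel coordinates. The sketch below uses that standard formula, which is an assumption about how the orientation could be computed rather than a description of the authors' implementation; the function name is hypothetical.

```python
import numpy as np

def ellipse_parameters(pixels):
    """Estimate ellipse center, orientation and axis extents of a region.

    pixels: (n, 2) array of (x, y) coordinates belonging to the region.
    Returns (center, theta, major_extent, minor_extent), with theta in radians
    measured from the x-axis.
    """
    center = pixels.mean(axis=0)
    dx = pixels[:, 0] - center[0]
    dy = pixels[:, 1] - center[1]
    mu20 = np.mean(dx * dx)
    mu02 = np.mean(dy * dy)
    mu11 = np.mean(dx * dy)
    # Axis of least moment of inertia (principal axis of the pixel set).
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    major = np.array([np.cos(theta), np.sin(theta)])
    minor = np.array([-np.sin(theta), np.cos(theta)])
    # Extents of the region measured along the two axis directions.
    proj_major = (pixels - center) @ major
    proj_minor = (pixels - center) @ minor
    return (center, theta,
            proj_major.max() - proj_major.min(),
            proj_minor.max() - proj_minor.min())
```

For an axis-aligned block of pixels 11 wide and 3 high this yields an orientation of 0 with extents 10 and 2.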
5 Motion Trajectories
Although motion segmentation generates affine transformations that capture motion details by matching regions at fine scales, it is sufficient to use coarser motion trajectories of identified palm regions for the gesture recognition considered in this paper.
The affine transformation of the palm region in each frame pair is computed based on the equations in Section 3.3. The affine transformations of successive pairs are then concatenated to construct the motion trajectories of the palm region. Figure 4 shows such trajectories for a number of frames in the image sequence "cheerleader." Since all pixel trajectories are shown together, they form a thick blob. Figure 5 shows a 10 to 1 subsampling of the motion trajectories.
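Concatenation amounts to pushing each starting pixel through the per-pair affine maps in order and recording its position after every step. The following sketch assumes each frame pair is described by a tuple `(A, T, centroid_t, centroid_t1)`; this tuple layout and the function name are illustrative choices, not taken from the paper.

```python
import numpy as np

def concatenate_trajectories(start_pixels, transforms):
    """Chain per-frame-pair affine maps into pixel trajectories.

    start_pixels: (n, 2) array of (x, y) positions in the first frame.
    transforms: list of (A, T, centroid_t, centroid_t1), one per frame pair.
    Returns an array of shape (n, len(transforms) + 1, 2).
    """
    pts = np.asarray(start_pixels, dtype=float)
    trajectory = [pts]
    for A, T, c_t, c_t1 in transforms:
        # Same mapping as in the affine estimation step, with rounding.
        pts = np.rint((pts - c_t) @ A.T + T + c_t1)
        trajectory.append(pts)
    return np.stack(trajectory, axis=1)

def subsample(trajectories, factor=10):
    """Keep every factor-th trajectory, e.g. for a 10 to 1 display."""
    return trajectories[::factor]
```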
6 Motion Pattern Classification
We employ a TDNN to classify gestural motion patterns of palm regions since TDNNs have been demonstrated to be very successful in learning spatio-temporal patterns. The TDNN is a dynamic classification approach in that the network sees only a small window of the motion pattern and this window slides over the input data while the network makes a series of local decisions. These local decisions have to be integrated into a global decision at a later time. In their seminal work, Waibel et al. [11] demonstrated excellent results
Figure 1: Image sequence of ASL sign "cheerleader". Panels: (i) frame 35, (j) frame 37, (k) frame 40, (l) frame 44, (m) frame 46, (n) frame 49, (o) frame 52, (p) frame 55.
Figure 2: Motion segmentation of the image sequence "cheerleader" (pixels of the same motion region are displayed with the same gray level and different regions are displayed with different gray levels). Panels: (a) frame 14, (b) frame 16, (c) frame 19, (d) frame 22, (e) frame 25, (f) frame 29, (g) frame 31, (h) frame 34, (i) frame 35, (j) frame 37, (k) frame 40, (l) frame 44, (m) frame 46, (n) frame 49, (o) frame 52, (p) frame 55.
Figure 3: Extracted head and palm regions from image sequence "cheerleader". Panels: (a) frame 14, (b) frame 16, (c) frame 19, (d) frame 22, (e) frame 25, (f) frame 29, (g) frame 31, (h) frame 34, (i) frame 35, (j) frame 37, (k) frame 40, (l) frame 44, (m) frame 46, (n) frame 49, (o) frame 52, (p) frame 55.
Figure 4: Extracted gestural motion trajectories from segments of ASL sign "cheerleader" (since all pixel trajectories are shown, they form a thick blob). Each panel, titled "Gestural Motion Trajectories", plots the palm1 and palm2 trajectories (X-axis versus Y-axis) for one frame interval: (a) #14-#16, (b) #16-#19, (c) #19-#22, (d) #22-#25, (e) #25-#29, (f) #29-#31, (g) #31-#34, (h) #35-#37, (i) #37-#40, (j) #40-#44, (k) #44-#46, (l) #46-#49, (m) #49-#52, (n) #52-#55.
Figure 5: Extracted gestural motion trajectories (subsampled by a factor of 10) of ASL sign "cheerleader". Each panel, titled "Gestural Motion Trajectories", plots X-axis versus Y-axis: (a) motion trajectories of a sample set of palm points (palm1); (b) motion trajectory of one palm point (palm1), labeled with frame numbers #14 through #55; (c) motion trajectories of a sample set of palm points (palm2); (d) motion trajectory of one palm point (palm2), labeled with frame numbers #14 through #55.
for phoneme classification using TDNN and showed that it achieves lower error rates than those achieved by a simple HMM recognizer.
The design of the TDNN is attractive because its compact structure economizes on weights and makes it possible for the network to develop general feature detectors. Most importantly, its temporal integration at the output layer makes the network shift invariant (i.e., insensitive to the exact positioning of the gesture). Figure 6 shows our TDNN architecture for the experiments, where positive values are shown as gray squares and negative values as black squares. The inputs to our TDNN are vectors of $(x, y, v, \theta)$ for motion trajectories extracted from a gesture image sequence, where $x$, $y$ are positions with respect to the center of the head, and $v$, $\theta$ are the magnitude and angle of velocity, respectively; the outputs are the gesture classes; and the learning mechanism is error backpropagation.
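The forward pass of such a network can be sketched in a few lines. The slot counts that appear in Figure 6 (50, 46, 37 and 18) are consistent with window lengths of 5, 10 and 20 applied in that order, since each windowed layer shortens the sequence by its window length minus one. The sketch below assumes that reading; the hidden layer widths used in the usage example (8 and 6 units) and the random weights are placeholders, and training by backpropagation is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def windowed_scores(x, weights, bias):
    """Slide a time window over x and apply the same weights at each position.

    x: (T, d_in) sequence; weights: (window, d_in, d_out); bias: (d_out,).
    Returns linear scores of shape (T - window + 1, d_out).
    """
    window = weights.shape[0]
    steps = x.shape[0] - window + 1
    return np.stack([
        np.tensordot(x[t:t + window], weights, axes=([0, 1], [0, 1])) + bias
        for t in range(steps)
    ])

def tdnn_forward(x, params):
    """Forward pass: sigmoid hidden layers, then temporal integration.

    params: list of (weights, bias); the last entry is the output layer.
    The output scores are summed over time before the final sigmoid.
    """
    h = x
    for weights, bias in params[:-1]:
        h = sigmoid(windowed_scores(h, weights, bias))
    scores = windowed_scores(h, *params[-1])
    return sigmoid(scores.sum(axis=0))
```

With a 50-slot input of 4 features, window lengths 5, 10 and 20, and 40 output units, `tdnn_forward` returns a vector of 40 class activations; the predicted gesture would be the index of the largest one.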
7 Experiments
We use a video database of 40 ASL signs for experiments. Each video consists of an ASL sign which lasts about 3 to 5 seconds at 30 frames per second with an image size of 160 × 120 in Quicktime format. Figure 1 shows one complex ASL gesture from the sequence "cheerleader." Note that the hand movement consists of rotation and repetitions. Each image sequence of the 40 gestures in the experiment has 80 to 120 frames.
Figure 6: Architecture of TDNN. Labels recoverable from the figure: inputs x, y, v, θ; Input Layer (50 slots), Hidden Layer 1 (46 slots), Hidden Layer 2 (37 slots), 18 slots before integration; window lengths 5, 10 and 20; Output Layer with gesture units 1, 2, ..., 40.
Discarding the frames in which palms do not appear in the images (i.e., frames in the starting and ending phases), each image sequence has about 50 frames. Motion regions with skin color are identified by their chromatic characteristics. These regions are then merged into the palm and head regions shown in Figure 3, based on the geometric analysis discussed in Section 4. Affine parameters of matched palm regions are computed, which give pixel motion trajectories for each image pair. By concatenating the trajectories for consecutive image pairs, continuous motion trajectories are generated. Figure 4 shows the extracted motion trajectories from a number of frames and Figure 5 shows the trajectories from the whole image sequence. Note that the motion trajectories of the palm region match the movement in the real scene well.
Training of the TDNN is performed on 80% of the extracted dense trajectories (38 on average) from each gesture, using an error backpropagation algorithm. The remaining 20% of the trajectories are then used for testing. Based on the experiments with 40 ASL gestures, the average recognition rate on the training trajectories is 98.14% and the average recognition rate on the unseen test trajectories is 93.42%. Since dense motion trajectories are extracted from each image sequence, the recognition rate for each gesture can be improved by a "voting" scheme (i.e., the majority rules) on the classification result of each individual trajectory. The resulting average recognition rates on the training and testing sets for gesture recognition are 99.02% and 96.21%, respectively.
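The voting step is a plain majority vote over the per-trajectory labels of one image sequence. A minimal sketch follows; the function name is hypothetical, and ties are resolved in favor of the label encountered first, which is a property of `Counter.most_common` rather than something the paper specifies.

```python
from collections import Counter

def vote(trajectory_labels):
    """Return the gesture label predicted for most trajectories."""
    if not trajectory_labels:
        raise ValueError("at least one trajectory label is required")
    return Counter(trajectory_labels).most_common(1)[0][0]
```

If, say, 30 of the roughly 38 trajectories of a sequence are classified as one gesture and the rest are scattered over other classes, the sequence is assigned that gesture even though individual trajectories were misclassified.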
8 Discussion and Conclusion
We have described an algorithm to extract and recognize motion patterns using trajectories. For concreteness, the experiments have been carried out to recognize hand gestures in ASL. Motion segmentation [...]. Moving regions with salient features are then extracted using color and geometric information. The affine transformations associated with these regions are then concatenated to generate continuous trajectories. These motion trajectories encode the dynamic characteristics of hand gestures and are classified by a time-delay neural network. Our experiments demonstrate that hand gestures can be recognized, with high accuracy, using motion trajectories.
The contributions of this work can be summarized as follows. First, a general method that extracts motion trajectories is developed. This is in contrast to much work on gesture recognition that uses color histogram trackers [8] [4] [2], magnetic sensors [3], hand drawn templates [6], and stereo [10] to obtain a representation of the gesture. Second, we use a TDNN to recognize gestures based on the extracted trajectories. Using an ensemble of trajectories helps achieve high recognition rates. It would be interesting to compare these recognition rates with those obtained using other recognition methods such as HMMs, the CONDENSATION algorithm [6] [2] and principal curves [3].
Acknowledgements
The support of Advanced Telecommunication Research International is gratefully acknowledged.
References
[1] N. Ahuja. A transform for multiscale image segmentation by integrated edge and region detection. IEEE Trans. Pattern Anal. Mach. Intell., 18(12):1211-1235, 1996.
[2] M. J. Black and A. D. Jepson. A probabilistic framework for matching temporal trajectories: Condensation-based recognition of gesture and expressions. In Proceedings of European Conference on Computer Vision, pages 909-924, 1998.
[3] A. F. Bobick and A. D. Wilson. A state-based approach to the representation and recognition of gesture. IEEE Trans. Pattern Anal. Mach. Intell., 19(12):1325-1337, 1997.
[4] J. L. Crowley and F. Berard. Multi-modal tracking of faces for video communications. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 640-645, 1997.
[5] T. Hastie and W. Stuetzle. Principal curves. Journal of American Statistical Association, 84(406):502-516, 1989.
[6] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5-28, 1998.
[7] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14(2):201-211, 1973.
[8] [...] approach to visual event classification. In Proceedings of the Fourth European Conference on Computer Vision, pages 347-360, 1996.
[9] M. Tabb and N. Ahuja. 2-D motion estimation by matching a multiscale set of region primitives. IEEE Trans. Pattern Anal. Mach. Intell., 1997. Submitted.
[10] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In Proceedings of the Sixth International Conference on Computer Vision, pages 363-369, 1998.
[11] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Trans. on Acoustics, Speech, and Signal Processing, 37(3):328-339, 1989.