
Ming-Hsuan Yang and Narendra Ahuja

Department of Computer Science and Beckman Institute

University of Illinois at Urbana-Champaign, Urbana, IL 61801

fmhyang,[email protected]

http://vision.ai.uiuc.edu

Abstract

We present an algorithm for extracting and classifying two-dimensional motion in an image sequence based on motion trajectories. First, a multiscale segmentation is performed to generate homogeneous regions in each frame. Regions between consecutive frames are then matched to obtain 2-view correspondences. Affine transformations are computed from each pair of corresponding regions to define pixel matches. Pixel matches over consecutive image pairs are concatenated to obtain pixel-level motion trajectories across the image sequence. Motion patterns are learned from the extracted trajectories using a time-delay neural network. We apply the proposed method to recognize 40 hand gestures of American Sign Language. Experimental results show that motion patterns in hand gestures can be extracted and recognized with a high recognition rate using motion trajectories.

1 Introduction

In this paper, we present an algorithm for extracting two-dimensional motion fields of objects across a video sequence and classifying each as one of a set of a priori known classes. The algorithm is used to recognize dynamic visual processes based on spatial, photometric and temporal characteristics. An application of the algorithm is in sign language recognition, where an utterance is interpreted based on, for example, hand location, shape, and motion. The performance of the algorithm is evaluated on the task of recognizing 40 complex hand gestures of American Sign Language (ASL).

The algorithm consists of two major steps. First, each image is partitioned into regions using a multiscale segmentation method. Regions between consecutive frames are then matched to obtain 2-view correspondences. Affine transformations are computed from each pair of corresponding regions to define pixel matches. Pixel matches over consecutive image pairs are concatenated to obtain pixel-level motion trajectories across the video sequence. Pixels are also grouped based on their 2-view motion similarity to obtain a motion-based segmentation of the video sequence. Only some of the moving regions correspond to visual phenomena of interest. Both the intrinsic properties of the objects represented by image regions and their dynamics represented by the motion trajectories determine whether they comprise an event of interest. For example, it is sufficient to recognize most gestures in ASL in terms of shape and location changes of palm regions. Therefore, palm and head regions are extracted in each frame and the palm locations are specified with reference to the usually still head regions.

To recognize motion patterns from trajectories, we use a time-delay neural network (TDNN) [11]. A TDNN is a multilayer feedforward network that uses time delays between all layers to represent temporal relationships between events in time. An input vector is organized as a temporal sequence, where only the portion of the input sequence within a time window is fed to the network at one time. The time window is shifted and another portion of the input sequence is given to the network until the whole sequence has been scanned through. The TDNN is trained using the standard error backpropagation learning algorithm. The output of the network is computed by adding all of these scores over time, followed by applying a nonlinear function such as the sigmoid function to the sum. TDNNs with two hidden layers using sliding input windows over time lead to a relatively small number of trainable parameters. We adopt the TDNN to recognize motion patterns because gestures are spatio-temporal sequences of feature vectors defined along motion trajectories. Our experimental results show that motion patterns can be learned by a time-delay neural network with a high recognition rate.

2 Related Work

Since Johansson's seminal work [7], which suggests that human movements can be recognized solely by motion [...] been investigated to recognize human motion by several researchers. In [8], Siskind and Morris conjecture that human event perception does not presuppose object recognition. In other words, they think visual event recognition is performed by a visual pathway which is separate from object recognition. To verify the conjecture, they analyze motion profiles of objects that participate in different simple spatial-motion events. Their tracker uses a mixture of color-based and motion-based techniques. Color-based techniques are used to track objects defined by a set of colored pixels whose saturation and value are above certain thresholds in each frame. These pixels are then clustered into regions using a histogram based on hue. Moving pixels are extracted from frame differences and divided into clusters based on proximity. Next, each region (generated by color or motion) in each frame is abstracted by an ellipse. Finally, a feature vector for each frame is generated by computing the absolute and relative ellipse positions, orientations, velocities and accelerations. To classify visual events, they use a set of Hidden Markov Models (HMMs), which are used as generative models and trained on movies of each visual event represented by a set of feature vectors. After training, a new observation is classified as being generated by the model that assigns the highest likelihood. Experiments on a set of 6 simple gestures, "pick up," "put down," "push," "pull," "drop," and "throw," demonstrate that gestures can be classified based on motion profiles.

Bobick and Wilson [3] adopt a state-based approach to represent and recognize gestures. First, many samples of a gesture are used to compute its principal curve [5], which is parameterized by arc length. A by-product of calculating the curve is the mapping of each sample point of a gesture example to an arc length along the curve. Next, they use line segments of uniform length to approximate the discretized curve. Each line segment is represented by a vector and all the line segments are grouped into a number of clusters. A state is defined to indicate the cluster to which a line segment belongs. A gesture is then defined by an ordered sequence of states. The recognition procedure is to evaluate whether an input trajectory successfully passes through the states in the prescribed order. In contrast to their work, where each example of a gesture is a single trajectory in space, each gesture in our work is represented by a set of motion trajectories corresponding to the motions of different parts of, say, the palm, instead of a single representative point. Thus, each example of a gesture in our work is [...] experimental results show that an ensemble of trajectories yields better generalization.

Recently, Isard and Blake have proposed the CONDENSATION algorithm [6] as a probabilistic method to track curves in visual scenes. This method is a fusion of the statistical factored sampling algorithm with a stochastic model to search a multivariate parameter space that is changing over time. Objects are modeled as a set of parameterized curves and the stochastic model is estimated based on the training sequence. Experiments on the proposed algorithm have been carried out to track objects based on their hand-drawn templates. Black and Jepson [2] extend this algorithm to recognize gestures and facial expressions in which human motions are modeled as temporal trajectories of some estimated parameters (which describe the states of a gesture) over time. The major difference between our approach and these methods is that we propose a method to extract motion trajectories from an image sequence without hand-drawn templates [6] or distinct trackable icons [2]. Motion patterns are then learned from the extracted motion trajectories. No prior knowledge is assumed or required for the extraction of motion trajectories, although domain-specific knowledge can be applied for efficiency reasons.

3 Motion Segmentation

To capture the dynamic characteristics of objects, we segment an image frame into regions with uniform motion. Our motion segmentation algorithm processes an image sequence two successive frames at a time. For a pair of frames, $(I_t, I_{t+1})$, the algorithm identifies regions in each frame comprising the multiscale intraframe structure. Regions at all scales are then matched across frames. Affine transforms are computed for each matched region pair. The affine transform parameters for regions at all scales are then used to derive a single motion field, which is then segmented to identify the differently moving regions between the two frames. The following sections describe the major steps in the motion segmentation algorithm.

3.1 Multiscale Image Segmentation

Multiscale segmentation is performed using a transform described in [1], which extracts a hierarchy of regions in each image. The general form of the transform, which maps an image to a family of attraction force fields, is defined by

\[ \mathbf{F}(x, y; \sigma_g(x,y), \sigma_s(x,y)) = \iint_R d_g(\Delta I, \sigma_g(x,y)) \, d_s(\vec{r}, \sigma_s(x,y)) \, \frac{\vec{r}}{\|\vec{r}\|} \, dw \, dv \]

where $R = \mathrm{domain}(I(u,v)) \setminus \{(x,y)\}$ and $\vec{r} = (v - x)\vec{i} + (w - y)\vec{j}$. The parameter $\sigma_g$ denotes a homogeneity scale [...] region to which a pixel belongs and $\sigma_s$ is a spatial scale that controls the neighborhood from which the force on the pixel is computed. The homogeneity of two pixels is given by the Euclidean distance between the associated m-dimensional vectors of pixel values (e.g., m = 3 for a color image):

\[ \Delta I = | I(x,y) - I(v,w) | \]

The spatial scale parameter, $\sigma_s$, controls the spatial distance function, $d_s(\cdot)$, and the homogeneity scale parameter, $\sigma_g$, controls the homogeneity distance function, $d_g(\cdot)$. One possible form for these functions satisfying the criteria discussed in [1] is the unnormalized Gaussian:

\[ d_g(\Delta I, \sigma_g) = \sqrt{2\pi\sigma_g^2} \, N_{\Delta I}(0, \sigma_g^2) \]

\[ d_s(\vec{r}, \sigma_s) = \begin{cases} \sqrt{2\pi\sigma_s^2} \, N_{\|\vec{r}\|}(0, \sigma_s^2), & \|\vec{r}\| \le 2\sigma_s \\ 0, & \|\vec{r}\| > 2\sigma_s \end{cases} \]

The force field encodes the region structure in a manner which allows easy extraction. Region boundaries correspond to diverging force vectors in $\mathbf{F}$ and region skeletons correspond to converging force vectors in $\mathbf{F}$. An increase in $\sigma_g$ causes less homogeneous structures to be encoded and an increase in $\sigma_s$ causes large structures to be encoded.
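As an illustration of how the force at a single pixel could be evaluated with the unnormalized Gaussian weights, the following is a minimal NumPy sketch rather than the implementation of [1]. It assumes constant scales over the image, replaces the integral by a sum over pixels, and uses hypothetical names (`d_g`, `d_s`, `force_at`). It relies on the identity that sqrt(2πσ²) times the zero-mean normal density equals exp(-x²/(2σ²)).

```python
import numpy as np

def d_g(delta_i, sigma_g):
    """Unnormalized Gaussian homogeneity weight: exp(-dI^2 / (2 sigma_g^2))."""
    return np.exp(-(delta_i ** 2) / (2.0 * sigma_g ** 2))

def d_s(r_norm, sigma_s):
    """Unnormalized Gaussian spatial weight, set to zero beyond 2 * sigma_s."""
    weight = np.exp(-(r_norm ** 2) / (2.0 * sigma_s ** 2))
    return np.where(r_norm <= 2.0 * sigma_s, weight, 0.0)

def force_at(image, x, y, sigma_g, sigma_s):
    """Discrete approximation of the force vector at pixel (x, y).

    image has shape (H, W, m); x indexes columns and y indexes rows.
    Constant scales are an assumption made for this sketch.
    """
    height, width = image.shape[:2]
    vs, ws = np.meshgrid(np.arange(width), np.arange(height))  # v: column, w: row
    rx = (vs - x).astype(float)
    ry = (ws - y).astype(float)
    r_norm = np.hypot(rx, ry)
    delta_i = np.linalg.norm(
        image.astype(float) - image[y, x].astype(float), axis=-1
    )
    weight = d_g(delta_i, sigma_g) * d_s(r_norm, sigma_s)
    weight[y, x] = 0.0   # the integration domain excludes (x, y) itself
    r_norm[y, x] = 1.0   # avoid dividing by zero at the excluded pixel
    fx = np.sum(weight * rx / r_norm)
    fy = np.sum(weight * ry / r_norm)
    return np.array([fx, fy])
```

For a pixel just inside a homogeneous region, the summed force points away from the nearby boundary and into the region, which is the behavior the boundary and skeleton description relies on.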

3.2 Region Matching

The matching of motion regions across frames is formulated as a graph matching problem at four different scales, where scale refers to the level of detail captured by the image segmentation process. Three partitions of each image are created by slicing through the multiscale pyramid at three preselected values of $\sigma_g$. Region partitions from adjacent frames are matched from coarse to fine scales, with coarser scale matches guiding the finer scale matching. Each partition is represented as a region adjacency graph, within which each region is represented as a node and region adjacencies are represented as edges. Region matching at each scale consists of finding the set of graph transformation operations (edge deletion, edge and node matching, and node merging) of least cost that create an isomorphism between the current graph pair. The cost of matching a pair of regions takes into account their similarity with regard to area, average intensity, expected position as estimated from each region's motion in previous frames, and the spatial relationship of each region with its neighboring regions.

Once the image partitions at the three different homogeneity scales have been matched, matchings are then obtained for the regions in the first frame of the [...] segmentation module using the previous frame pair. The match in the second frame for each of these motion regions is given as the union of the set of finest scale regions that comprise the motion region. This gives a fourth matched pair of image partitions, and is considered to be the coarsest scale set of matches that is utilized in affine estimation. The details of the algorithm can be found in [9].

3.3 Affine Transformation Estimation

For each pair of matched regions, the best affine transformation between them is estimated iteratively. Let $R_i^t$ be the ith region in frame t and its matched region be $R_i^{t+1}$. Also let the coordinates of the pixels within $R_i^t$ be $(x_{ij}^t, y_{ij}^t)$, with $j = 1 \ldots |R_i^t|$, where $|R_i^t|$ is the cardinality of $R_i^t$, and the pixel nearest the centroid of $R_i^t$ be $(\bar{x}_i^t, \bar{y}_i^t)$. Each $(x_{ij}^t, y_{ij}^t)$ is mapped by an affine transformation to the point $(\hat{x}_{ij}^t, \hat{y}_{ij}^t)$ according to

\[ \begin{pmatrix} x_{ij}^t \\ y_{ij}^t \end{pmatrix} \rightarrow \mathcal{R}\left[ A_k \begin{pmatrix} x_{ij}^t - \bar{x}_i^t \\ y_{ij}^t - \bar{y}_i^t \end{pmatrix} + \vec{T}_k + \begin{pmatrix} \bar{x}_i^{t+1} \\ \bar{y}_i^{t+1} \end{pmatrix} \right] = \begin{pmatrix} \hat{x}_{ij}^t \\ \hat{y}_{ij}^t \end{pmatrix}_k \]

where the subscript k denotes the iteration number, and $\mathcal{R}[\cdot]$ denotes a vector operator that rounds each vector component to the nearest integer. The affine transformation comprises a 2 × 2 deformation matrix, $A_k$, and a translation vector, $\vec{T}_k$. By defining the indicator function,

\[ \chi_i^t(x,y) = \begin{cases} 1, & (x,y) \in R_i^t \\ 0, & \text{else} \end{cases} \]

the amount of mismatch is measured as

\[ M_i^t = \sum_{x,y} \left| I_t(x,y) - I_{t+1}(\hat{x},\hat{y}) \right| \left[ \chi_i^t(x,y) + \chi_i^{t+1}(\hat{x},\hat{y}) - \chi_i^t(x,y)\,\chi_i^{t+1}(\hat{x},\hat{y}) \right] \]

The affine transformation parameters that minimize $M_i^t$ are estimated iteratively using a local descent criterion.
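The mapping and the mismatch measure of this section can be sketched in NumPy as follows. This is an illustrative sketch only: the function names are hypothetical, pixels that map outside the frame are skipped (an assumption, since the text does not say how they are handled), and the bracketed indicator weight is assumed to select the union of the region and the pre-image of its match. `np.rint` rounds halves to the nearest even integer, which may differ from the rounding operator intended in the text.

```python
import numpy as np

def map_region(pixels, centroid_t, centroid_t1, A, T):
    """Map (x, y) pixels from frame t into frame t+1.

    Computes round(A (p - c_t) + T + c_t1) for every pixel p.
    """
    p = np.asarray(pixels, dtype=float)
    mapped = (p - centroid_t) @ np.asarray(A, dtype=float).T + T + centroid_t1
    return np.rint(mapped).astype(int)

def mismatch(frame_t, frame_t1, mask_t, mask_t1, centroid_t, centroid_t1, A, T):
    """Sum of absolute intensity differences, weighted by the region indicators."""
    height, width = frame_t.shape
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
    mapped = map_region(pts, centroid_t, centroid_t1, A, T)
    xh, yh = mapped[:, 0], mapped[:, 1]
    # Assumption: ignore pixels whose mapped position falls outside the frame.
    inside = (xh >= 0) & (xh < width) & (yh >= 0) & (yh < height)
    x, y, xh, yh = pts[inside, 0], pts[inside, 1], xh[inside], yh[inside]
    chi_t = mask_t[y, x].astype(float)
    chi_t1 = mask_t1[yh, xh].astype(float)
    weight = chi_t + chi_t1 - chi_t * chi_t1
    diff = np.abs(frame_t[y, x].astype(float) - frame_t1[yh, xh].astype(float))
    return float(np.sum(diff * weight))
```

A local descent search would perturb the entries of `A` and `T` and keep a perturbation whenever `mismatch` decreases; the search itself is not shown here.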

3.4 Motion Field Integration

The computed affine parameters give a motion field at each of the four scales. These motion fields are then combined into a single motion field by taking the coarsest motion field and then performing the following computation recursively at four scales. At each matched region, the image prediction errors generated by the current motion field and by the motion field at the next finer scale are compared. At any region where the prediction error using the finer scale motion improves by a significant amount, the current motion is replaced by the finer scale motion. The result is a set of "best matched" regions at the coarsest acceptable scales.

The resulting motion field $\vec{M}_{t,t+1}$ is segmented into areas of uniform motion. We use a heuristic that considers each pair of best matched regions, $R_i^t$ and $R_j^t$, which share a common border, and merges them if the following relation is satisfied for all $(x_{ik}^t, y_{ik}^t)$ and $(x_{jl}^t, y_{jl}^t)$ that are spatially adjacent to one another:

\[ \frac{\left\| \vec{M}_{t,t+1}(x_{ik}^t, y_{ik}^t) - \vec{M}_{t,t+1}(x_{jl}^t, y_{jl}^t) \right\|}{\max\left( \left\| \vec{M}_{t,t+1}(x_{ik}^t, y_{ik}^t) \right\|, \left\| \vec{M}_{t,t+1}(x_{jl}^t, y_{jl}^t) \right\| \right)} < m_{\sigma_g} \]

where $m_{\sigma_g}$ is a constant less than 1 that determines the degree of motion similarity necessary for the regions to merge.
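The merge test can be written compactly. The sketch below assumes the motion field is stored as an (H, W, 2) array and that the list of adjacent pixel pairs along the shared border has already been collected; both function names are hypothetical, and treating two stationary pixels as similar is an assumption made to avoid a zero denominator.

```python
import numpy as np

def motions_similar(m_a, m_b, threshold):
    """Relative difference test between two motion vectors."""
    m_a = np.asarray(m_a, dtype=float)
    m_b = np.asarray(m_b, dtype=float)
    denom = max(np.linalg.norm(m_a), np.linalg.norm(m_b))
    if denom == 0.0:
        return True  # assumption: two stationary pixels count as similar
    return bool(np.linalg.norm(m_a - m_b) / denom < threshold)

def should_merge(field, border_pairs, threshold):
    """Merge only if every adjacent pixel pair across the border is similar.

    field: (H, W, 2) motion field; border_pairs: ((x1, y1), (x2, y2)) tuples.
    """
    return all(
        motions_similar(field[y1, x1], field[y2, x2], threshold)
        for (x1, y1), (x2, y2) in border_pairs
    )
```

Because the ratio is normalized by the larger of the two magnitudes, the same threshold applies to slow and fast regions alike.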

The segmented motion regions are each represented in $MS_{t,t+1}$ by a different value. Because each of the best matched regions has a match, the matches in frame t+1 of the regions in $MS_{t,t+1}$ are known and comprise the coarsest scale regions that are used in the affine estimation module for the next frame pair.

It should be noted that the motion segmentation does not necessarily correspond to the moving objects in the scene because the motion segmentation is done over a single motion field. Nonrigid objects, such as humans, are segmented into multiple, piecewise rigid regions. In addition, objects moving at rates of less than one pixel per frame cannot be identified. Handling both these situations requires examining the motion field over multiple frames.

Figure 1 shows frames from an image sequence of a complex ASL sign called "cheerleader" and Figure 2 shows the results of motion segmentation. Different motion regions are displayed with different gray levels. Notice that there are several motion regions within the head and palm regions because these piecewise rigid regions have uniform motion.

4 Color and Geometric Analysis

Motion segmentation generates regions that have uniform motion. However, only some of these motion regions carry important information for motion pattern recognition. To recognize the hand gestures considered here, it is sufficient to extract the motion regions of head and palm regions. Towards this end, we use color and geometric information of palm and head regions.

Human skin color has been used and proved to be an effective feature in many applications. We use a Gaussian mixture to model the distribution of skin color pixels from a Michigan database of 2,447 images which consists of human faces from different ethnic groups. We use the CIE LUV color space and discard the luminance value of each pixel to minimize the [...] Gaussian mixture are estimated using an EM algorithm. A motion region is classified as having skin color if most of its pixels have probabilities of being skin color above a threshold. Coupled with motion segmentation, motion regions of skin color can be efficiently extracted from image sequences.

Since the shape of the human head and palm can be approximated by ellipses, and the human hand is a thin rectangular region, motion regions that have skin color are merged until the shape of the merged region is approximately elliptic or rectangular. The parameters of a rectangular shape can be obtained easily from the bounding box of each region. The orientation of an ellipse is calculated from the axis of the least moment of inertia. The extents of the major and minor axes of the ellipse are approximated by the extents of the region along the axis directions, which gives the parameters of the ellipse. The largest elliptic region extracted from an image is identified as the human head and the next two smaller elliptic regions are palm regions. Figure 1 shows the image sequence of a complex ASL sign called "cheerleader" and Figure 3 shows the results of color and geometric analysis on the motion regions.
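One common way to obtain the axis of least moment of inertia of a binary region is from its second central moments. The sketch below uses that standard formula; whether the authors computed the orientation this way is an assumption, and `ellipse_parameters` is a hypothetical name.

```python
import numpy as np

def ellipse_parameters(mask):
    """Return (center, orientation, major_extent, minor_extent) of a binary region.

    orientation is the angle in radians, measured from the x-axis, of the
    axis of least moment of inertia; the extents are the spans of the region
    along that axis and along the perpendicular axis.
    """
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    dx, dy = xs - cx, ys - cy
    mu20 = np.mean(dx * dx)
    mu02 = np.mean(dy * dy)
    mu11 = np.mean(dx * dy)
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    # Project the pixels onto the two principal axes and measure the spans.
    along = dx * np.cos(theta) + dy * np.sin(theta)
    across = -dx * np.sin(theta) + dy * np.cos(theta)
    major = along.max() - along.min()
    minor = across.max() - across.min()
    return (cx, cy), theta, major, minor
```

With these parameters, the candidate regions can be ranked by size so that the largest elliptic region is taken as the head and the next two as palms, as described above.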

5 Motion Trajectories

Although motion segmentation generates affine transformations that capture motion details by matching regions at fine scales, it is sufficient to use coarser motion trajectories of identified palm regions for the gesture recognition considered in this paper.

The affine transformation of the palm region in each frame pair is computed based on the equations in Section 3.3. The affine transformations of successive pairs are then concatenated to construct the motion trajectories of the palm region. Figure 4 shows such trajectories for a number of frames in the image sequence "cheerleader." Since all pixel trajectories are shown together, they form a thick blob. Figure 5 shows a 10-to-1 subsampling of the motion trajectories.
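Concatenating the per-pair affine maps amounts to pushing each palm pixel through one transformation after another. A minimal sketch, assuming each frame pair is described by a hypothetical tuple `(A, T, c_t, c_t1)` holding the deformation matrix, the translation and the two region centroids, and omitting the integer rounding of Section 3.3 for simplicity:

```python
import numpy as np

def build_trajectories(start_points, transforms):
    """Chain per-frame-pair affine maps into pixel trajectories.

    start_points: (N, 2) array of (x, y) palm pixels in the first frame.
    transforms: list of (A, T, c_t, c_t1) tuples, one per frame pair.
    Returns an array of shape (len(transforms) + 1, N, 2).
    """
    pts = np.asarray(start_points, dtype=float)
    path = [pts]
    for A, T, c_t, c_t1 in transforms:
        # Same mapping as in Section 3.3, without rounding.
        pts = (pts - c_t) @ np.asarray(A, dtype=float).T + T + c_t1
        path.append(pts)
    return np.stack(path)

def subsample(trajectories, step=10):
    """Keep every step-th pixel trajectory (e.g. 10-to-1 subsampling)."""
    return trajectories[:, ::step]
```

The first axis of the result indexes time, so `trajectories[:, n]` is the path of the nth palm pixel across the sequence.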

6 Motion Pattern Classification

We employ a TDNN to classify gestural motion patterns of palm regions since TDNNs have been demonstrated to be very successful in learning spatio-temporal patterns. A TDNN is a dynamic classification approach in that the network sees only a small window of the motion pattern and this window slides over the input data while the network makes a series of local decisions. These local decisions have to be integrated into a global decision at a later time. In their seminal work, Waibel et al. [11] demonstrated excellent results for phoneme classification using TDNN and showed that it achieves lower error rates than those achieved by a simple HMM recognizer.

Figure 1: Image sequence of ASL sign "cheerleader". Panels (i)-(p): frames 35, 37, 40, 44, 46, 49, 52, 55.

Figure 2: Motion segmentation of the image sequence "cheerleader" (pixels of the same motion region are displayed with the same gray level and different regions are displayed with different gray levels). Panels (a)-(p): frames 14, 16, 19, 22, 25, 29, 31, 34, 35, 37, 40, 44, 46, 49, 52, 55.

Figure 3: Extracted head and palm regions from image sequence "cheerleader". Panels (a)-(p): frames 14, 16, 19, 22, 25, 29, 31, 34, 35, 37, 40, 44, 46, 49, 52, 55.

Figure 4: Extracted gestural motion trajectories from segments of ASL sign "cheerleader" (since all pixel trajectories are shown, they form a thick blob). Each panel plots the palm1 and palm2 trajectories (X-axis versus Y-axis) for one frame pair: (a) #14-#16, (b) #16-#19, (c) #19-#22, (d) #22-#25, (e) #25-#29, (f) #29-#31, (g) #31-#34, (h) #35-#37, (i) #37-#40, (j) #40-#44, (k) #44-#46, (l) #46-#49, (m) #49-#52, (n) #52-#55.

Figure 5: Extracted gestural motion trajectories (subsampled by a factor of 10) of ASL sign "cheerleader". (a) Motion trajectories of a sample set of palm points for palm1. (b) Motion trajectory of one palm point of palm1, labeled with frame numbers #14 to #55. (c) Motion trajectories of a sample set of palm points for palm2. (d) Motion trajectory of one palm point of palm2, labeled with frame numbers #14 to #55.

The design of the TDNN is attractive because its compact structure economizes on weights and makes it possible for the network to develop general feature detectors. Most importantly, its temporal integration at the output layer makes the network shift invariant (i.e., insensitive to the exact positioning of the gesture). Figure 6 shows our TDNN architecture for the experiments, where positive values are shown as gray squares and negative values as black squares. The inputs to our TDNN are vectors of (x, y, v, θ) for motion trajectories extracted from a gesture image sequence, where x, y are positions with respect to the center of the head, and v, θ are the magnitude and angle of velocity, respectively; the outputs are the gesture classes; and the learning mechanism is error backpropagation.
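To make the sliding-window computation concrete, here is a small NumPy sketch of a TDNN forward pass together with the (x, y, v, θ) feature construction. It is a sketch under stated assumptions, not the authors' network: all function names are hypothetical, the number of hidden units per layer is not given in the text and is chosen freely by the caller, the position of each feature row is taken at the end of its velocity step, and training by backpropagation is not shown.

```python
import numpy as np

def trajectory_features(points, head_center):
    """Turn a (T, 2) trajectory into (T - 1, 4) rows of (x, y, v, theta)."""
    pts = np.asarray(points, dtype=float)
    rel = pts[1:] - np.asarray(head_center, dtype=float)  # position w.r.t. head
    vel = np.diff(pts, axis=0)
    speed = np.hypot(vel[:, 0], vel[:, 1])
    angle = np.arctan2(vel[:, 1], vel[:, 0])
    return np.column_stack([rel, speed, angle])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_delay_layer(x, weights, bias):
    """Apply one time-delay layer by sliding a window over the time axis.

    x: (slots, features); weights: (window, features, units); bias: (units,).
    Returns (slots - window + 1, units) pre-activations.
    """
    window = weights.shape[0]
    steps = x.shape[0] - window + 1
    out = np.empty((steps, weights.shape[2]))
    for t in range(steps):
        out[t] = np.tensordot(x[t:t + window], weights, axes=([0, 1], [0, 1])) + bias
    return out

def tdnn_forward(x, params):
    """params: (weights, bias) pairs for hidden layer 1, hidden layer 2, output."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = sigmoid(time_delay_layer(x, w1, b1))
    h2 = sigmoid(time_delay_layer(h1, w2, b2))
    scores = time_delay_layer(h2, w3, b3)
    # Temporal integration: add the local scores over time, then squash.
    return sigmoid(scores.sum(axis=0))
```

The same weights are reused at every window position, which keeps the parameter count small, and summing the local scores over time is what makes the output insensitive to where in the input the pattern occurs.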

7 Experiments

We use a video database of 40 ASL signs for experiments. Each video consists of an ASL sign which lasts about 3 to 5 seconds at 30 frames per second with an image size of 160 × 120 in QuickTime format. Figure 1 shows one complex ASL gesture from the sequence "cheerleader." Note that the hand movement consists of rotation and repetitions. Each image sequence of the 40 gestures in the experiment has 80 to 120 frames. Discarding the frames in which palms do not appear in the images (i.e., frames in the starting and ending phases), each image sequence has about 50 frames. Motion regions with skin color are identified by their chromatic characteristics. These regions are then merged into the palm and head regions shown in Figure 3 based on the geometric analysis discussed in Section 4. Affine parameters of matched palm regions are computed, which give pixel motion trajectories for each image pair. By concatenating the trajectories for consecutive image pairs, continuous motion trajectories are generated. Figure 4 shows the extracted motion trajectories from a number of frames and Figure 5 shows the trajectories from the whole image sequence. Note that the motion trajectories of the palm region match the movement in the real scene well.

Figure 6: Architecture of TDNN. The input layer takes (x, y, v, θ) over 50 time slots; with window lengths of 5, 10 and 20, hidden layer 1 has 46 slots, hidden layer 2 has 37 slots and the output layer has 18 slots, which are integrated over time into the 40 gesture outputs.

Training of the TDNN is performed on a corpus of 80% of the extracted dense trajectories (38 on average) from each gesture, using an error backpropagation algorithm. The remaining 20% of the trajectories are then used for testing. Based on the experiments with 40 ASL gestures, the average recognition rate on the training trajectories is 98.14% and the average recognition rate on the unseen test trajectories is 93.42%. Since dense motion trajectories are extracted from each image sequence, the recognition rate for each gesture can be improved by a "voting" scheme (i.e., majority rule) on the classification result of each individual trajectory. The resulting average recognition rates on the training and testing sets for gesture recognition are 99.02% and 96.21%, respectively.
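The voting scheme can be illustrated in a few lines. The sketch assumes each trajectory of a sequence has already been assigned a class label by the network; `vote` is a hypothetical name, and ties are broken in favor of the label that appears first, which the text does not specify.

```python
from collections import Counter

def vote(trajectory_labels):
    """Return the gesture label predicted for the most trajectories."""
    counts = Counter(trajectory_labels)
    return counts.most_common(1)[0][0]
```

For example, if three trajectories of one sequence are labeled "cheerleader", "cheerleader" and some other sign, the sequence as a whole is labeled "cheerleader".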

8 Discussion and Conclusion

We have described an algorithm to extract and recognize motion patterns using trajectories. For concreteness, the experiments have been carried out to recognize hand gestures in ASL. Motion segmentation [...] Moving regions with salient features are then extracted using color and geometric information. The affine transformations associated with these regions are then concatenated to generate continuous trajectories. These motion trajectories encode the dynamic characteristics of hand gestures and are classified by a time-delay neural network. Our experiments demonstrate that hand gestures can be recognized, with high accuracy, using motion trajectories.

The contributions of this work can be summarized as follows. First, a general method that extracts motion trajectories is developed. This is in contrast to much work on gesture recognition that uses color histogram trackers [8] [4] [2], magnetic sensors [3], hand-drawn templates [6], and stereo [10] to obtain a representation of the gesture. Second, we use a TDNN to recognize gestures based on the extracted trajectories. Using an ensemble of trajectories helps achieve high recognition rates. It would be interesting to compare these recognition rates with those obtained using other recognition methods such as HMMs, the CONDENSATION algorithm [6] [2] and principal curves [3].

Acknowledgements

The support of Advanced Telecommunication Research International is gratefully acknowledged.

References

[1] N. Ahuja. A transform for multiscale image segmentation by integrated edge and region detection. IEEE Trans. Pattern Anal. Mach. Intell., 18(12):1211-1235, 1996.

[2] M. J. Black and A. D. Jepson. A probabilistic framework for matching temporal trajectories: Condensation-based recognition of gesture and expressions. In Proceedings of European Conference on Computer Vision, pages 909-924, 1998.

[3] A. F. Bobick and A. D. Wilson. A state-based approach to the representation and recognition of gesture. IEEE Trans. Pattern Anal. Mach. Intell., 19(12):1325-1337, 1997.

[4] J. L. Crowley and F. Berard. Multi-modal tracking of faces for video communications. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 640-645, 1997.

[5] T. Hastie and W. Stuetzle. Principal curves. Journal of American Statistical Association, 84(406):502-516, 1989.

[6] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5-28, 1998.

[7] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14(2):201-211, 1973.

[8] J. M. Siskind and Q. Morris. A maximum-likelihood approach to visual event classification. In Proceedings of the Fourth European Conference on Computer Vision, pages 347-360, 1996.

[9] M. Tabb and N. Ahuja. 2-D motion estimation by matching a multiscale set of region primitives. IEEE Trans. Pattern Anal. and Mach. Intell., 1997. Submitted.

[10] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In Proceedings of the Sixth International Conference on Computer Vision, pages 363-369, 1998.

[11] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Trans. on Acoustics, Speech, and Signal Processing, 37(3):328-339, 1989.
