Portable Vision-Based HCI - A Real-time Hand Mouse System on Handheld Devices

Chu-Feng Lien (p93007@csie.ntu.edu.tw)

Department of Computer Science and Information Engineering

National Taiwan University

Abstract

This paper aims to create a lightweight version of vision-based HCI (Human Computer Interaction) for emerging portable/handheld devices. We provide a steerable interface so that a user can manipulate his/her handheld device on a screen projected onto a flat surface (Figure 1) without the intervention of a "mouse" or a "keyboard". With this idea, we hope to realize a part of the pervasive computing world [1] through an intuitive user input interface. Unlike previous works [4,5,8], which use hand gesture recognition for the same purpose, we use an intuitive method that detects hand motion from its MHI (motion history images) [2]. In this paper, we introduce an efficient method for hand motion recognition so that it can fit on a handheld device. With the system, all users need to do is install this software program on their handheld device, and they then have a vision-based steerable interface.

1. Introduction

Since Mark Weiser described "ubiquitous computing" more than 10 years ago [1], people have been striving to hide computers in various forms. The boom of handheld devices can be seen as part of this trend of ubiquitous computing. The ways people interact with computers are the interest of this paper.

Vision-based HCI, or hand mouse systems, are popular in the virtual reality and computer vision domains [8]. In this work, we try to run this intuitive interface on handheld devices so that people can not only replace the heavyweight desktop with a handheld device but also drop the touch pad and small keyboard and enjoy the content itself. We assume that such handheld devices are commonly equipped with a low-resolution camera. Using the embedded camera, the system detects a user's hand motion in real-time, resulting in the autonomous manipulation of the corresponding programs on the device.

To run the vision-based HCI system on handheld devices, the computing power required for frame processing is a critical concern. We have to find an efficient method that balances efficiency and accuracy. Though the Adaboost method proposed by Viola and Jones [6] seemed to be a choice for the project, we did not get good results with a low-resolution camera. Instead of using gesture recognition methods, we therefore use Motion History Images [2]. By grabbing and processing the image pixels directly, we gain an efficient way of computing as a result.

1.1. Contribution

The novelty of this paper is an efficient method for running motion detection on a resource-limited device by using motion history images.

2. Related Works

Since we target the system to run on a handheld device, computing efficiency is a critical consideration in this work. "A portable system for anywhere interaction" from IBM runs the anywhere display system on a laptop and provides a good demonstration of the idea [4,5]. They adopt gesture recognition to track fingertips, and their work can calibrate the projected surface autonomously. Our work is mostly inspired by MHI, introduced by Davis et al., where motions are extracted from the differences between successive frames [2,3]. By this subtraction, the static parts of the images are eliminated first, so we can focus on the dynamic parts. This method reduces the computing power required significantly in our work, since we assume the projected screen is static almost all the time. Our work is highly dependent on the MHI method for motion information retrieval. Viola and Jones provide a novel approach to detecting an object efficiently using Adaboost [6].

This method was considered in the primary stage of this work; however, there is a tradeoff between accuracy and efficiency, and we dropped it after a period of trials.


2.1. Hand recognition using Adaboost

In the preliminary stage of implementing the system, we use the Viola-Jones method to train a detector for the frontal image of a right palm. The original idea is to interpret hand motion by palm recognition. Therefore, we collect 1397 positive images and 3000 negative images (backgrounds) and start the training process.

First we fit the images to a fixed 30x50 size, define the false-alarm rate to be 50%, and start the Adaboost training process. It takes 2 days to acquire an 11-stage classifier. The result is not good, and the false-alarm rate seems too high (see Figure 2 for one of the failure examples). This may be caused by the low resolution of the camera or by an improper choice of samples. In any case, we gave up on the method and looked for another way to interpret hand motions.
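Although we eventually abandoned this approach, the detection side of a trained cascade is straightforward. Below is a minimal sketch in Python with OpenCV for illustration; the cascade file name and detection parameters are our assumptions, not values from the original implementation.

```python
import cv2

# Hypothetical cascade trained offline with OpenCV's cascade-training tools;
# the file name and parameters below are illustrative.
cascade = cv2.CascadeClassifier("palm_cascade.xml")

def detect_palm(frame_bgr):
    """Return bounding boxes (x, y, w, h) of palm candidates in one frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)
    # minSize roughly matches the 30x50 training window mentioned above
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3,
                                    minSize=(30, 50))
```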

2.2. Motion History Images

After the unsuccessful trials of the Adaboost method for recognizing a palm, we adopt the MHI method to see whether we can avoid the problem of inaccurate object detection.

The MHI is constructed by successively layering selected image regions over time (capturing the motion recency of pixels) using a simple update rule:

$$MHI(x,y) = \begin{cases} \tau & \text{if } \Psi(I(x,y)) \\ 0 & \text{else if } MHI(x,y) < \tau - \delta \end{cases} \qquad ....[2]$$

where each pixel (x, y) in the MHI is tagged with the current timestamp τ if the function Ψ signals object presence (or motion) in the current video image I(x, y); the remaining timestamps in the MHI are removed (set to 0) if they are older than the decay value (τ − δ). The MHI is updated for every new frame captured from the camera, so user applications can analyze the differences between sequences and retrieve the motion information.
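A minimal sketch of this update rule, written in Python/NumPy for illustration (the names silhouette, timestamp, and duration are ours; the binary silhouette plays the role of Ψ):

```python
import numpy as np

def update_mhi(mhi, silhouette, timestamp, duration):
    """Apply the MHI update rule in place and return the updated image.

    mhi        -- float32 array holding per-pixel timestamps
    silhouette -- binary output of Psi (non-zero where motion is present)
    timestamp  -- current time tau
    duration   -- decay window delta
    """
    mhi[silhouette > 0] = timestamp           # tag moving pixels with the current time
    mhi[mhi < (timestamp - duration)] = 0.0   # drop pixels older than the decay value
    return mhi

# Usage: mhi = np.zeros(frame_shape, np.float32), then call update_mhi() once per frame.
```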

3. System Implementation

3.1. System Flow

The system flow chart is shown in Figure 3. The system starts by retrieving images from the camera driver. Upon a valid image, it first finds the location of the possible projected screen (we assume the biggest rectangle in the frame to be the target) by Canny edge detection [7]. After finding the target screen, it then calculates the position mapping between the camera resolution and the device resolution. The projected-screen detection is executed every 10 seconds in case the camera or the target screen is moved accidentally.

After the basic environment setup process, the system reads images into a ring buffer and computes the differences between frames based on the MHI. The front edge of each motion is determined as the mouse position on the target device. We record the latest 50 front edges and draw a trajectory to filter out false positives. Event detection is defined by the energy of movements: if the image density drops by 70% between consecutive frames, we define the action as a mouse "click" event and execute the corresponding commands on the target device.

<Figure 2: Hand recognition using the Adaboost method>

<Figure 1: System configuration>

<Figure 3: System flow chart (start, image capture from camera, find the screen by edge detection, MHI, find front edge, motion interpretation, noise filter, mouse/keyboard event)>
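For orientation, a minimal Python sketch of this per-frame loop is given below. It chains together the illustrative helpers sketched in the following subsections (find_projected_screen, SilhouetteBuffer, find_front_edge, NoiseFilter, and so on); these names are ours, not those of the original implementation, which is built on OpenCV.

```python
import time
import cv2

def run(camera_index=0, redetect_interval=10.0):
    """Per-frame loop of Figure 3 (illustrative helper names, see Sections 3.2-3.7)."""
    cap = cv2.VideoCapture(camera_index)
    buf, noise = SilhouetteBuffer(size=3), NoiseFilter()          # Sections 3.3 and 3.5
    screen, last_detect, prev_sil = None, 0.0, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if screen is None or time.time() - last_detect > redetect_interval:
            screen = find_projected_screen(frame)                 # Section 3.2
            last_detect = time.time()
        buf.push(frame)
        sil = buf.silhouette()                                    # Section 3.3
        if sil is None:
            continue
        edge = find_front_edge(sil)                               # Section 3.4
        if edge is not None and noise.accept_point(edge) and prev_sil is not None:
            if is_click_event(sil, prev_sil) and noise.accept_event():   # Sections 3.5, 3.7
                dispatch_event(motion_direction(sil), map_to_screen(*edge))  # Sections 3.6, 3.7
        prev_sil = sil
```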

3.2. Find the Projected Screen

At the beginning, the system has to define the projected screen so that everything that follows makes sense. During system initialization, we first apply Canny edge detection to the grabbed frames, find every possible rectangle, and then choose the biggest one in the frame.

To cope with the dynamic environments we might encounter, and considering the computing cost of adding edge detection to every frame, the projected screen is re-computed every 10 seconds.
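A rough Python/OpenCV sketch of this screen-detection step is given below; it assumes the projected screen is the largest convex four-sided contour found after Canny edge detection, and the thresholds (and the OpenCV 4 return signature of findContours) are our assumptions rather than the original values.

```python
import cv2

def find_projected_screen(frame_bgr):
    """Return the 4 corner points of the biggest rectangle-like contour, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                    # thresholds are illustrative
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    best, best_area = None, 0.0
    for c in contours:
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) == 4 and cv2.isContourConvex(approx):
            area = cv2.contourArea(approx)
            if area > best_area:
                best, best_area = approx, area
    return best
```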

<Figure 4: Screen Detection>

3.3. Motion Silhouette

In the system, we design an image ring buffer with a buffer size of 3 to store the history images, calculate the differences, and represent the result as a silhouette.

The silhouette is generated between successive images: it is obtained by subtracting the images and is represented in grey levels, so we can reduce a lot of computational effort by processing only the silhouette pixels.

<Figure 5: Silhouette image>
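A small sketch of the 3-frame ring buffer and silhouette computation, again in Python/OpenCV for illustration; the binarization threshold is an assumption.

```python
from collections import deque
import cv2

class SilhouetteBuffer:
    """Ring buffer of the most recent grey frames; silhouette = thresholded difference."""
    def __init__(self, size=3):
        self.frames = deque(maxlen=size)

    def push(self, frame_bgr):
        self.frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY))

    def silhouette(self, thresh=30):          # threshold value is an assumption
        if len(self.frames) < 2:
            return None
        diff = cv2.absdiff(self.frames[-1], self.frames[-2])
        _, sil = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        return sil
```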

3.4. Front edge detection

Once silhouette images are available, we have to decide where the user's attention is. In a real environment, if a user wants to touch or open (click) an icon on the projected surface, his/her hand will appear from the right or left side of the projected screen (we assume the top and bottom sides are not used in this scenario). To detect the front edge in the image, we first have to decide the direction of the hand motion (A). In our system, we divide the image frame into two parts, right and left. By simply calculating the image density (B) of each part, we can tell the direction of the user's hand motion.

$$Direct(x,y) = \begin{cases} \text{right} & \text{if } den(x) > den(y) \\ \text{left} & \text{if } den(x) \le den(y) \end{cases} \qquad \text{(A)}$$

$$den(Z) = \sum_{z \in Z} z \qquad \text{(B)}$$

where x represents the right half of the image, y represents the left half of the image, and den(Z) is the summation of the image density over the designated part.

After deciding the direction of the motion, we need to find the front edge (the user's attention point) of each movement.

The front edge is determined by scanning the whole image space and finding the leftmost valid point if the direction is "right", or the rightmost valid point if it is "left". This method takes only a few computations, and the accuracy is good under stable conditions, where the background is static and the lighting is stable.
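The following sketch expresses equations (A)/(B) and the front-edge scan in code form; the function names and the use of NumPy are ours.

```python
import numpy as np

def motion_direction(silhouette):
    """Equation (A): compare the density of the right and left halves."""
    h, w = silhouette.shape
    den_left = np.count_nonzero(silhouette[:, : w // 2])     # equation (B), left half
    den_right = np.count_nonzero(silhouette[:, w // 2 :])    # equation (B), right half
    return "right" if den_right > den_left else "left"

def find_front_edge(silhouette):
    """Leftmost foreground point if the hand comes from the right, rightmost otherwise."""
    ys, xs = np.nonzero(silhouette)
    if xs.size == 0:
        return None
    i = np.argmin(xs) if motion_direction(silhouette) == "right" else np.argmax(xs)
    return int(xs[i]), int(ys[i])
```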

3.5. Noise Filtering

In order to filter out false-positive detections, we record the last 50 valid front edges and predict the path using a simple heuristic model. We calculate the Euclidean distance between the current and the latest front edge; if the new point is far from the last point and from the trajectory, we take it as a false positive and disregard the observation. Graded red colors are used to represent the trajectory of motion: the most recent front edge is marked with the brightest red, and the earliest front edge is marked with a light red.

<Figure 6: Motion trajectory>

There are also signal-bouncing problems that we need to deal with in the detection. For example, if we touch an icon on the projected screen, the system detects the motion and the corresponding program is executed. If we keep tracking the motion of subsequent frames, there will be dramatic changes of motion during the execution of each program. Hence a 5-second signal-bouncing interval is introduced after a valid mouse/keyboard event is detected. We can adjust the sensitivity by changing the signal-bouncing time.
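A sketch of the trajectory filter and the signal-bouncing interval follows. The 50-point history and the 5-second bounce time follow the text above, while the distance threshold is an assumption.

```python
import math
import time
from collections import deque

class NoiseFilter:
    def __init__(self, history=50, max_jump=80.0, bounce_seconds=5.0):
        self.points = deque(maxlen=history)   # last 50 valid front edges
        self.max_jump = max_jump              # assumed distance threshold (pixels)
        self.bounce_seconds = bounce_seconds  # signal-bouncing interval
        self.last_event = 0.0

    def accept_point(self, pt):
        """Reject a front edge that jumps too far from the recent trajectory."""
        if self.points and math.dist(pt, self.points[-1]) > self.max_jump:
            return False
        self.points.append(pt)
        return True

    def accept_event(self):
        """Suppress events that arrive during the bounce interval."""
        now = time.time()
        if now - self.last_event < self.bounce_seconds:
            return False
        self.last_event = now
        return True
```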

3.6. Coordinate mapping

Another task is to map the detected position (front edge) in the camera image to a corresponding coordinate on the target device. First of all, after the projected screen is detected during system initialization, the system redefines the origin of the coordinate system from the image pixels' point of view. It decides the position of the new origin and the width and height of the new screen, and generates a transition function for coordinate mapping. The system then projects each detected point to a real coordinate on the target device. For example, as demonstrated in Figure 7, suppose our camera resolution is 800x600, the computer screen resolution is 1280x800, and the newly detected screen is 600x400 with its origin at (100, 100). We apply the following formulas to get the calibrated relative coordinate RP(x', y'):

$$x' = (x - 100) \times \frac{1280}{600}; \qquad y' = (y - 100) \times \frac{800}{400}$$

where
1280: display width of the computer;
800: display height of the computer;
600: width of the detected screen in the camera image;
400: height of the detected screen in the camera image;
100: x-coordinate of the new origin;
100: y-coordinate of the new origin.

<Figure 7: screen mapping>
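The mapping in this example can be written as a small function; the default values below are those of the example (origin (100, 100), detected screen 600x400, display 1280x800).

```python
def map_to_screen(x, y, origin=(100, 100), detected_size=(600, 400),
                  display_size=(1280, 800)):
    """Map a camera-space front edge into display coordinates (example values as defaults)."""
    ox, oy = origin
    dw, dh = detected_size
    sw, sh = display_size
    return (x - ox) * sw / dw, (y - oy) * sh / dh

# e.g. map_to_screen(400, 300) == (640.0, 400.0), the centre of the 1280x800 display
```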

3.7. Event detection and definition

By observing the behavior of data manipulation, we find a pattern: when a user wants to click an icon on the surface, he/she tends to put a hand on top of the object and stay there for a short while. While the motion is continuous, the image density increases or decreases gradually. We therefore take this pattern as the trigger signal of a click event. When the user's hand pauses for a while, the image density drops dramatically, because the image difference calculation results in a very small value.

We then define an event as follows:

$$Event(X, X') = \begin{cases} \text{TRUE} & \text{if } den(X) < 30\% \cdot den(X') \\ \text{FALSE} & \text{if } den(X) \ge 30\% \cdot den(X') \end{cases}$$

where Event() is the determination function, den() is the image density function, X is the current frame, and X' is the previous frame.

<Figure 8: front edge detection>
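In code, the Event() rule amounts to comparing silhouette densities between the current and previous frames; a minimal sketch:

```python
import numpy as np

def den(silhouette):
    """Image density: number of foreground pixels in the silhouette."""
    return np.count_nonzero(silhouette)

def is_click_event(curr_sil, prev_sil):
    """Event(X, X') == TRUE when the density drops below 30% of the previous frame's."""
    return den(curr_sil) < 0.3 * den(prev_sil)
```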

The first prototype is designed as a presentation support system. We therefore define events that can help a speaker go through all the slides that he/she prepares. We define the following events:

"Mouse double-click": when the motion direction is from the right side and a valid activation (Event(X,X') == TRUE) is detected, the system sends a double-click mouse event to the Windows operating system.

"Page up (PgUp)": when the motion direction is from the left side and a valid activation is detected, the system sends a "PgUp" keyboard event to the Windows operating system.

"Close current window": when a valid activation is detected outside the projected screen 3 consecutive times within 10 seconds, the system closes the current working window.

With these 3 events, a speaker can open his/her presentation materials with the "double-click" command. During the presentation, he/she can control the slides (page down and page up) with the "double-click" and "PgUp" commands. Finally, after finishing the presentation, he/she can close the current window and open other supplementary files. The system can thus facilitate a regular presentation without the speaker physically manipulating the device.
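To make the mapping concrete, here is an illustrative sketch of how such gestures could be forwarded to the operating system. The original system injects native Windows mouse/keyboard events; the cross-platform pyautogui calls used here are our stand-in, and the 3-times-within-10-seconds rule for closing a window is simplified to a single trigger.

```python
# Illustrative gesture-to-OS-event mapping (pyautogui is an assumption, standing in
# for the native Windows event injection described in the text).
import pyautogui

def dispatch_event(direction, point, outside_screen=False):
    if outside_screen:
        # Simplified: the paper requires 3 outside activations within 10 seconds.
        pyautogui.hotkey("alt", "f4")      # "close current window"
    elif direction == "right":
        x, y = point
        pyautogui.doubleClick(x, y)        # "mouse double-click" at the mapped point
    elif direction == "left":
        pyautogui.press("pageup")          # "page up (PgUp)"
```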

3.8. Performance Evaluation

The experiments are run on a laptop whose main components are a Pentium M 730 CPU (1.6 GHz) and 512 MB DDR2 RAM. Results are shown below; the values indicate CPU usage in a typical run.

Frame rate (resolution)    Adaboost    Motion History Images
5 FPS (800x600)            90%         30%
5 FPS (640x480)            80%         20%
5 FPS (174x144)            40%         10%
3 FPS (800x600)            35%         10%
3 FPS (640x480)            30%         7%
3 FPS (174x144)            15%         3%

<Figure 9: performance evaluation>

The results show a very good performance of this method compared with Adaboost. It requires much less computation and is suitable for a handheld device, which is usually equipped with limited computation capability.

3.9. System Limitation

From the implementation and experiments, we find that although the system accuracy is good in a well-controlled environment, there are several drawbacks and limitations when user behaviors and environments are unpredictable.

-- Fast hand motion leads to a high error rate and more false-positive detections (this can be improved by increasing the frame rate).

-- If the speaker walks around the projected area continuously, the system can adapt to this behavior and performs well, but if the speaker suddenly stops in the middle of the screen, the system will raise a false alarm.

-- If the environmental lighting changes or a shadow is cast within the scope of the camera, the system may be misled and produce an incorrect result.

-- If the edge color of the projected screen is similar to that of neighboring objects, the screen will not be well detected; and if the projected screen is highly distorted, the position accuracy will be very low.

4. Summary and Future Works

With the help of Intel's OpenCV library [9], many well-implemented functions are readily available, so our work could be prototyped quickly.

The experimental results suggest that, given its computational efficiency, the system should be suitable for a portable device.

Nonetheless, since the prototype is implemented on a laptop, we still need to port it to a handheld device to see its real performance.

Noise affects motion recognition significantly in an open space. As mentioned in the previous section, a dynamic background and changing lighting conditions may worsen the detection results. In future work, we will try supplementary methods to filter out such noise, for example clustering, or increasing the depth of the motion history to filter out spurious movements. Object recognition methods could also greatly help accuracy, but, for the same reason we chose motion history in the first place, they would reintroduce the computation problem.

Last, our current system supports only a few actions on the target device, e.g., single/double click and closing windows. To fully realize a "steerable" interface, more complicated actions should be defined to control the device.

We hope to add more sophisticated feature detection so that users can not only click but also drag objects on the projected interface.

References

[1] Mark Weiser, "The Computer for the Twenty-First Century," Scientific American, September 1991, pp. 94-104.

[2] James W. Davis, "Hierarchical Motion History Images for Recognizing Human Motion," IEEE Workshop on Detection and Recognition of Events in Video (EVENT'01), 2001.

[3] Tim Weingaertner, Stefan Hassfeld, and Ruediger Dillmann, "Human Motion Analysis: A Review," IEEE Workshop on Motion of Non-Rigid and Articulated Objects (NAM '97), 1997.

[4] N. Sukaviriya, R. Kjeldsen, C. Pinhanez, L. Tang, A. Levas, G. Pingali, and M. Podlaseck, "A Portable System for Anywhere Interactions," in Extended Abstracts on Human Factors in Computing Systems (CHI), 2004.

[5] R. Kjeldsen, A. Levas, and C. Pinhanez, "Dynamically Reconfigurable Vision-Based Interfaces," in ICVS 2003, International Conference on Computer Vision Systems, 2003.

[6] Paul Viola and Michael J. Jones, "Robust Real-Time Object Detection," in Proc. IEEE Workshop on Statistical and Computational Theories of Vision, 2001.

[7] J. F. Canny, "A Computational Approach to Edge Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, 8, 679-698, 1986.

[8] L. Bretzner, I. Laptev, and T. Lindeberg, "Hand Gesture Recognition Using Multi-Scale Colour Features, Hierarchical Models and Particle Filtering," in Proc. 5th IEEE International Conference on Automatic Face and Gesture Recognition, Washington, D.C., May 2002.

[9] Intel Open Source Computer Vision Library (OpenCV): http://www.intel.com/technology/computing/opencv/index.htm
