
Chapter 1. Introduction

1.3 Thesis Organization

This thesis is organized into five chapters as follows. In Chapter 1, we introduce the background knowledge related to our research on hand gesture recognition. Based on the nature of the input device, we classify existing methods into wearable-sensor-based interfaces and computer-vision-based interfaces. In Chapter 2, we describe in detail the input device of our system, including hardware and software, as well as the process in the proposed framework. In Chapter 3, we define the hand gestures adopted in this thesis and then explain in detail the hand gesture recognition method, which includes system initialization, data acquisition, feature extraction, and hand gesture recognition. In Chapter 4, we detail the structure of our hand gesture control system and show some experimental results, with comparisons to other systems. In Chapter 5, conclusions are drawn.

Chapter 2.

System Description

The proposed system includes three parts (see Figure 2.1): the hardware (Kinect), the framework (OpenNI, Open Natural Interaction) together with the middleware (NITE), and the application, a program that analyzes the data obtained from OpenNI with NITE and performs the corresponding user operation (see Chapter 3). OpenNI provides a way to access the sensor data of Kinect.

In Section 2.1, the input device is described. In Section 2.2, the framework and middleware are described. In addition, the structure of hand gesture control system is proposed in Section 2.3. Finally the software stack for our proposed system is given in Section 2.4.

Figure 2.1 The system overview.


2.1 Input Device

Microsoft announced a new game device, Kinect, which enables users to play games without a controller. Kinect includes a pair consisting of a near-infrared radiation projector and sensor, a CMOS camera, an array microphone, and a motor (see Figure 2.2). Through these sensors, Kinect can capture user actions more accurately than a remote controller. These sensors are discussed in more detail below.

Figure 2.2 The Functions of Kinect.

The near-infrared radiation (near-IR) projector, the near-IR sensor, and PrimeSense's chip are the key to computing the depth of the scene. The near-IR projector emits invisible light, which passes through a filter and is scattered into a special pattern. This near-IR pattern is projected onto the objects in front of the projector (see Figure 2.3(a)), and the near-IR sensor receives the reflected pattern, from which the depth values are computed (see Figure 2.3(b)). Because the depth value is estimated from the near-IR pattern, the edges of objects cause random errors in the depth values.

The random errors in the depth measurements of Kinect increase with increasing distance to the sensor [21]. When Kinect senses objects within a range of 50~250 cm, the depth error can be kept below 1 cm.


Figure 2.3 (a) the near-infrared radiation projector. (b) the near-infrared radiation sensor [21].

The CMOS camera in Kinect has a fixed lens and captures video at VGA resolution at 30 frames per second (fps) or at HD 720p resolution at 14 fps. The CMOS camera senses color and can be used for facial recognition. Figure 2.4(a) shows a color image captured by Kinect's CMOS camera, and Figure 2.4(b) shows the corresponding depth image captured by Kinect.


Figure 2.4 (a) the color image captured by CMOS sensor. (b) the depth image captured by Kinect.

In Kinect, the multi-array microphone consists of four microphones placed on the same surface to receive voice signals. Because the sound arrives at each microphone at a slightly different time, the voice source can be localized, and noise and echoes can be removed by signal filtering based on the arrival times. This enables Kinect to receive a cleaner voice signal.

2.2 Framework and Middleware

OpenNI is a framework developed by PrimeSense for Kinect and all devices with PrimeSense technology to access sensor data. A three-layered view of OpenNI is shown in Figure 2.5. OpenNI provides a basic communication interface between the application and the hardware and allows functions to be added by middleware components such as NITE, which provides natural-interaction UI controls and APIs on top of the OpenNI framework.

Figure 2.5 OpenNI structure.

In OpenNI, there are two types of production nodes: sensor-related nodes and NITE-related nodes. The sensor-related production nodes include the Depth Generator, Image Generator, IR Generator, and Audio Generator. The Depth Generator generates a depth map, accurate to a millimeter, that represents the distance from the 3D sensor camera to an object; the Image Generator generates colored image maps; the IR Generator generates IR image maps; and the Audio Generator generates an audio stream from the multi-array microphone.


The NITE-related production nodes include the User Generator (see Figure 2.7), the Scene Analyzer, and the Hand Point Generator. The Scene Analyzer's main output is a labeled depth map, in which each pixel holds a user ID or is marked as part of the background. The Hand Point Generator supports hand detection and tracking (see Figure 2.6).
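
To make the node concept concrete, the following is a minimal sketch (not the thesis code) of how the production nodes used in our system could be created with the OpenNI 1.x C++ wrapper; error checking and the NITE session setup are omitted, and xn::HandsGenerator plays the role of the Hand Point Generator here.

```cpp
#include <XnCppWrapper.h>

int main()
{
    xn::Context context;
    context.Init();                                      // initialize OpenNI

    // Sensor-related production nodes
    xn::DepthGenerator depth;  depth.Create(context);    // depth map (millimeters)
    xn::ImageGenerator image;  image.Create(context);    // colored image map

    // NITE-related production nodes
    xn::UserGenerator  user;   user.Create(context);     // labeled users in the scene
    xn::HandsGenerator hands;  hands.Create(context);    // hand detection and tracking

    context.StartGeneratingAll();                        // start streaming from all nodes
    // ... the generators are polled in the application's main loop ...
    return 0;
}
```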

Although OpenNI with NITE provides some pre-defined hand gestures, these gestures are not sufficient for our requirements for surfing the Internet in an ICU. Thus, we define our own hand gestures in Chapter 3.

Figure 2.6 Hand detection and tracking using Hand Point Generator [20].

Figure 2.7 User body detection and tracking using User Generator [20].

2.3 Hand Gesture Control System

Model–view–controller (MVC), described by Krasner [22], is a software design pattern for designing interactive computer user interfaces. MVC divides an application into three components, the model, the view, and the controller, as shown in Figure 2.8, each with its own responsibility. The View component handles all user-interface events, the Controller component is a middleman that communicates between the View and the Model, and the Model component stores all data used by the Controller and the View.

MVC is of great help for the long-term maintenance of a changing program, so we design our system with the MVC design pattern.

Figure 2.8 MVC design pattern.


2.4 Software Stack

Figure 2.9 shows the software stack of the hand gesture control system, in which NITE and OpenNI, introduced in the previous section, are used for updating the depth map and obtaining the hand point. In addition, the Qt framework provides a cross-platform GUI environment for creating the user interface and timer events. The Qt framework for Windows is built on the Windows API, which we also use for accessing system-level I/O.
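
As an illustration of how the Qt framework can provide the timer events mentioned above, the sketch below uses a QTimer to trigger a periodic refresh of the sensor data. It uses Qt 5 connection syntax for brevity and is an assumed wiring, not the actual thesis code.

```cpp
#include <QApplication>
#include <QObject>
#include <QTimer>

int main(int argc, char* argv[])
{
    QApplication app(argc, argv);

    QTimer timer;
    QObject::connect(&timer, &QTimer::timeout, []() {
        // Ask OpenNI for the latest depth frame and run the
        // hand-gesture pipeline here (see Chapter 3).
    });
    timer.start(10);   // fire roughly every 10 milliseconds

    return app.exec();
}
```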

Figure 2.9 System software stack of hand gesture control.


Chapter 3.

Hand Gesture-Based HCI

The proposed hand gesture recognition method is described in this chapter. In Section 3.1, the defined hand gestures and a system flowchart for their recognition will be presented.

Then, we explain each part of the flowchart in Sections 3.2 to 3.5.

3.1 Hand Gestures Definition

In order to operate a web browser on a PC via hand gestures, we define some basic hand gestures for different operating commands. The defined commands include moving the cursor, clicking, switching pages, zooming in, zooming out, scrolling up, and scrolling down, which are commonly used for browsing websites or documents. Some of these hand gestures are designed to be operated with two hands since this is more intuitive from the user's perspective. A detailed description of these hand gestures is given in Section 3.4.

Figure 3.1 shows the proposed hand gesture control system which has five main parts.

The first part initializes hardware, framework, middleware, and our hand gesture control system. The second part acquires depth data from Kinect. The third part extracts hand features, including position, state, moving direction and moving magnitude, for hand gesture recognition. The fourth and fifth parts are hand gesture recognition and HCI operation, respectively.

Figure 3.1 Flowchart of the proposed HCI system.


3.2 System Initialization and Data Acquisition

Firstly, OpenNI is driven to acquire depth data continuously from Kinect. Next, NITE is used for hand point detection and tracking. After that, the depth data and the hand points are used to initialize a Moving Region for subsequent hand gesture recognition.

3.2.1 Acquisition of Depth Data

The function "WaitAndUpdateAll" waits for all nodes to have new data available, and then updates them about 33 milliseconds. Our hand gesture system refreshes the depth data by using the "UpdateData" request function every 10 millisecond that forces depth generator to expose the most updated data to application without waiting other generator for smoother movement in operation time. Our hand gesture system updates depth data continuously by OpenNI. Figure 3.2 shows the obtained depth data.
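
The following sketch illustrates this per-node refresh with the OpenNI 1.x C++ wrapper; it assumes a depth generator created as in Section 2.2 and omits error checking, so it is an approximation of the behaviour described above rather than the thesis code.

```cpp
#include <XnCppWrapper.h>

void RefreshDepth(xn::DepthGenerator& depth)
{
    // Update only the depth node so the application does not block while
    // waiting for the other generators (color, IR, audio).
    depth.WaitAndUpdateData();

    xn::DepthMetaData dmd;
    depth.GetMetaData(dmd);                  // resolution, frame ID, timestamp
    const XnDepthPixel* map = dmd.Data();    // per-pixel depth in millimeters

    // ... feature extraction and gesture recognition read 'map' here ...
    (void)map;
}
```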

Figure 3.2 An example of obtained depth data.

3.2.2 Hand Point Detection

NITE extracts the hand information, which includes a frame number, an ID, and a hand point (the hand's center of mass). Figure 3.3 shows a hand point detected via NITE.

Figure 3.3 The hand point detection via NITE.

3.3 Feature Extraction

In order to recognize the previously defined hand gestures, we use three kinds of features: hand position, hand state, and hand moving direction and magnitude. While the hand position is obtained from NITE directly, the change of hand state is detected from the difference between hand images. Figures 3.4(a) and (b) show the hand image at frame t and frame t+5, respectively, and Figure 3.4(c) shows the difference between Figures 3.4(a) and (b) after applying a morphology filter. To decide the hand state, we measure the minimum distance between the hand and the top of the image. If the distance changes from short to long (see Figures 3.5(a)-(b)), the hand is considered to be fisting; otherwise, it is considered open.
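
A simplified sketch of this hand-state test is given below; it assumes a binary hand mask has already been segmented from the depth image, and the mask layout, helper names, and threshold are illustrative rather than taken from the thesis.

```cpp
#include <cstdint>
#include <vector>

// Row index of the topmost hand pixel, i.e. the distance between the hand
// and the top of the image (in pixels).
int DistanceToTop(const std::vector<uint8_t>& mask, int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (mask[y * width + x] != 0)
                return y;
    return height;   // no hand pixel found
}

// If the distance grows between two frames (short -> long), the fingers have
// folded down, so the hand is treated as fisting; otherwise it stays open.
bool IsFisting(int prevDistance, int currDistance, int threshold = 10)
{
    return (currDistance - prevDistance) > threshold;
}
```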

Figure 3.4 (a) the hand image of frame t; (b) the hand image of frame t+5; (c) the difference between (a) and (b).

Figure 3.5 (a) a short distance between the hand and the top of the image; (b) a longer distance between the hand and the top of the image.

The hand moving direction and magnitude are calculated by subtracting the previous hand point coordinates from the current hand point coordinates, and the result is stored as a motion vector. Figure 3.6 shows the x and y components of the vector for a swipe gesture.
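
The computation amounts to a simple vector difference, as the illustrative sketch below shows (the point type and function names are ours, not the thesis code).

```cpp
#include <cmath>

struct Point2D { float x, y; };

// Motion vector of the hand between two consecutive frames.
Point2D MotionVector(const Point2D& previous, const Point2D& current)
{
    return { current.x - previous.x,    // horizontal component (e.g., swipes)
             current.y - previous.y };  // vertical component (e.g., scrolling)
}

// Moving magnitude of the hand, used together with the direction.
float Magnitude(const Point2D& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y);
}
```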

Figure 3.6 The x and y components of the motion vector for a swipe gesture (magnitude in pixels versus frame number).

3.4 Hand Gestures Recognition

Table 3.1 shows the hand gestures and actions considered in our hand gesture recognition system. When a hand is detected, it is recognized as the dominant hand and the system enters one-hand mode. There are three gestures in one-hand mode: cursor moving, mouse clicking, and next/previous page. The Move gesture is recognized when the hand point moves within the Moving Region (see Appendix A). The Click gesture is an open-close-open sequence of the dominant hand. The Swipe gesture switches pages by swiping the hand. If another hand is detected, the system enters two-hand mode, which includes the Zooming and Scrolling gestures. Zooming and scrolling actions are initiated by fisting the dominant hand. The Zooming gesture moves the two hands closer together or apart. The Scrolling gesture moves the non-dominant hand up or down.

Table 3.1 Hand gestures used in the HCI operation


Chapter 4.

System Implementation and Experimental Results

The system implementation and experimental results are described in this chapter. In Section 4.1, the system implementation and the function of each module are presented. Then, we test all kinds of hand gestures that we selected and compare the results with some similar methods in Section 4.2.

4.1 System Implementation

We implement our system with the Model-View-Controller (MVC) design pattern, as shown in Figure 4.1, dividing it into three parts to increase the flexibility of the hand gesture-based human-computer interaction system and to decrease the dependencies among the parts. The View part contains the QtMotorController, Windows GUI, and QtImgReader modules. QtMotorController is a UI for the user to adjust the Kinect angle. Windows GUI is the part of the Windows OS that provides an application programming interface (API) for system-level I/O control. For efficient operation, QtImgReader accesses the depth data directly, emits an update event to DataGenerator, and shows the depth data on the screen.

The Controller part consists of MotorController and GestureController. MotorController controls the Kinect motor and changes the horizontal angle of Kinect from -37 to +37 degrees. GestureController analyzes the user's hand gestures and dispatches the corresponding events, such as mouse or keyboard events, to the computer.
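
For example, a recognized Click could be dispatched as a system-level mouse event through the standard Win32 SendInput call, as in the sketch below; this illustrates GestureController's role but is not necessarily the exact mechanism used in the thesis.

```cpp
#include <windows.h>

// Emit a left mouse click at the current cursor position.
void SendLeftClick()
{
    INPUT inputs[2] = {};

    inputs[0].type       = INPUT_MOUSE;
    inputs[0].mi.dwFlags = MOUSEEVENTF_LEFTDOWN;   // press the left button

    inputs[1].type       = INPUT_MOUSE;
    inputs[1].mi.dwFlags = MOUSEEVENTF_LEFTUP;     // release the left button

    SendInput(2, inputs, sizeof(INPUT));
}
```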

The Model part consists of MotorData and DataGenerator. MotorData records the value of the Kinect angle, which is stored in the Kinect hardware. DataGenerator acquires the depth data and image data from Kinect and stores them in an array map.

Figure 4.1 The Model-View-Controller for our system.

4.2 Experimental Results

The hand gesture-based human-computer interaction system is implemented using Microsoft Visual C++ 2010 with Kinect as the input sensor. We test our system on both a PC and a notebook to show that it can also run on a platform with fewer system resources than a PC. The detailed specifications are shown in Table 4.1.

Table 4.1 Hardware and software specifications

Item    PC                            Notebook
Model   PC                            ThinkPad X201i
CPU     Intel i5-2500K @ 3.9 GHz      Intel i3 M330 @ 2.13 GHz
RAM     16 GB DDR3-1333               4 GB DDR3-1066
OS      Windows 7 Enterprise 64-bit

All the experiments discussed in this section are conducted indoors. The user stands in front of the Kinect sensor at a distance of 1.8 meters. We test all the hand gestures defined earlier and calculate the average processing time during operation. In our experiments, we simulate a sequence of operations for surfing the web and analyze the performance of the recognition algorithm. We then compare these results with some similar methods.

The system initialization requires the user to wave a hand. Figure 4.2 shows the hand detection process. Figures 4.2(a) and (b) show two images of a waving hand. After the wave, the hand point is detected by NITE (see the blue dot in Figure 4.2(c)). To define the Moving Region, the dominant hand (the first detected hand) needs to stay in the air for about one second, as shown in Figure 4.3(a). Figure 4.3(b) shows that the Moving Region is defined and the cursor (the yellow circle) is shown in the browser.

The one-hand gestures include the Click, Move, and Swipe gestures. Figure 4.4 shows the clicking test. Figures 4.4(a)-(c) show the open-fist-open sequence, and the red circle in Figure 4.4(c) indicates the position of the clicking action. Figure 4.5 shows the moving test, where the cursor moves from right to left as the hand moves from right (see Figure 4.5(a)) to left (see Figure 4.5(b)). The previous-page and next-page commands correspond to swiping the hand left (see Figures 4.6(a)-(b)) and right (see Figures 4.6(b)-(c)), respectively.

The two-hand gestures include the Zooming and Scrolling gestures. Figures 4.7 and 4.8 show the zooming test. Fisting the dominant hand indicates the start of a two-hand action, and the two hands are then moved apart (see Figure 4.7) or closer together (see Figure 4.8) to zoom in or out of the page, respectively. The Scrolling gesture also requires the user to fist the dominant hand (see Figures 4.9(a)-(b)) before scrolling pages. Scrolling up and down is performed by moving the non-dominant hand up (see Figures 4.10(a)-(b)) and down (see Figures 4.9(b)-(c)), respectively.

Figure 4.2 (a) and (b) waving the hand. (c) The hand being detected (the blue dot).


Figure 4.3 (a) the hand staying in the air. (b) the defined Moving Region (right) and the cursor position (left).


Figure 4.4 The clicking gesture.


Figure 4.5 The moving test.


Figure 4.6 The Swipe gesture for previous page and next page.


Figure 4.7 The zoom in test (moving two hands apart).


Figure 4.8 The zoom out test (moving two hands closer together).


Figure 4.9 Scrolling down the page.


Figure 4.10 Scrolling up the page.


Table 4.2 shows the performance of the HCI system on the PC and notebook platforms. The processing time of each frame is the average time measured during the test of HCI operations on each platform, which corresponds to a frame rate of more than 45 frames per second even on the slower platform.

Table 4.2 Comparison of frame rate and process time

Item                 PC        Notebook X201i
Frame rate (fps)     49.0103   45.0015
Process time (sec)   0.02      0.022

To operate a web browser in an ICU, we compare our system with the methods presented in [23]-[25], which are all designed for manipulating medical data. Table 4.3 shows the comparison of the different methods. In an ICU, medical staff prefer a system that is ready quickly for immediate use, so that a patient's health state and data can be browsed right at the bedside. Our hand gesture-based human-computer interaction system requires only about 2 seconds to initialize and does not require a special field of view (see Figure 4.11 for Kipshagen's method), which makes it more convenient for browsing a patient's data. Therefore, our method is more suitable for the ICU than the other methods.

Table 4.3 Comparison of different methods

                  Our method               Gallo                  Kipshagen        Bellmore
Main technology   Hand point from Kinect   Skeleton from Kinect   Stereo cameras   Skeleton from Kinect

Figure 4.11 Schematic drawing of contact-free software control application [25].

Chapter 5.

Conclusions

In this study, we proposed a real-time hand gesture-based HCI using Kinect and developed a set of basic hand gestures that users can use to browse web pages, photos, and other information. To perform the hand gesture recognition, we extract features from the depth data, namely hand position, hand state, and hand moving direction and distance, which are effective in describing the different hand gestures. The hand gesture recognition approach uses two modes according to the number of hands used. The one-hand mode contains three kinds of gestures: Move, Click, and Swipe. The two-hand mode contains two kinds of gestures: Zooming and Scrolling. Besides, a Moving Region is automatically determined, which allows the hand to move within a small range to control the cursor over the whole monitor screen more easily and reduces the computation of fist detection.

We implemented the hand gesture recognition system based on the MVC design pattern to increase its flexibility, and we simulated the browsing actions for ordinary web pages. Our system compares favorably with similar systems in terms of average processing time.

Appendix A

The Moving Region

A gesture-based user interface often requires the user to move the hands to operate the system, which easily makes the user tired. To alleviate this problem, the system determines the Moving Region (shown in Figure A.1(a)), a small region that is mapped to the whole screen (shown in Figure A.1(b)). This region also reduces the computation time of fist detection.

Figure A.1 (a) the Moving Region. (b) the corresponding monitor screen.

The Moving Region detection operation includes two steps: scene analysis and Moving Region adjustment. In the scene analysis, the depth image is segmented into three layers according to the depth value of the hand point: the foreground layer (depth < T_low), the hand layer (T_low <= depth <= T_high), and the background layer (depth > T_high), where T_low = (depth value of the hand point) - 190 mm and T_high = (depth value of the hand point) + 190 mm. Figure A.2 shows an example of this segmentation: Figure A.2(a) shows the depth image and Figures A.2(b)-(d) show the three segmented layers. In order to define a Moving Region that is not occluded by other objects, the union of Figures A.2(b) and (c) is used.

Figure A.2 An example of the layer segmentation. (a) the depth image, (b) the foreground layer, (c) the hand layer, (d) the background layer.

The Moving Region is determined by growing an initial window from the hand point until the window contains an additional non-hand object. This process ensures that the hand is clearly visible in the Moving Region. Figure A.1(a) shows the Moving Region represented by a yellow rectangle, while Figure A.1(b) shows the corresponding region of the monitor screen (the full screen in this case).
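
The sketch below illustrates this growing step together with the layer thresholds from the scene analysis; the depth-map layout, the stopping test, and all names are a simplified illustration under those assumptions, not the thesis implementation.

```cpp
#include <cstdint>
#include <vector>

struct Rect { int left, top, right, bottom; };

// Grow a window around the hand point until its border would touch a pixel
// of the foreground layer (a possible occluder closer than the hand layer).
Rect GrowMovingRegion(const std::vector<uint16_t>& depth, int width, int height,
                      int handX, int handY, uint16_t handDepth)
{
    const int tLow  = static_cast<int>(handDepth) - 190;   // near bound of the hand layer (mm)
    const int tHigh = static_cast<int>(handDepth) + 190;   // far bound of the hand layer (mm)

    // Layer label of a pixel: 0 = foreground, 1 = hand layer, 2 = background.
    auto layer = [&](int x, int y) -> int {
        const int d = depth[y * width + x];
        if (d == 0)      return 2;            // no reading: treat as background
        if (d < tLow)    return 0;
        if (d <= tHigh)  return 1;
        return 2;
    };
    auto blocked = [&](int x, int y) { return layer(x, y) == 0; };

    Rect r = { handX, handY, handX, handY };
    while (true) {
        Rect next = { r.left - 1, r.top - 1, r.right + 1, r.bottom + 1 };
        if (next.left < 0 || next.top < 0 ||
            next.right >= width || next.bottom >= height)
            break;                            // reached the image border
        bool clear = true;
        for (int x = next.left; x <= next.right && clear; ++x)
            if (blocked(x, next.top) || blocked(x, next.bottom)) clear = false;
        for (int y = next.top; y <= next.bottom && clear; ++y)
            if (blocked(next.left, y) || blocked(next.right, y)) clear = false;
        if (!clear)
            break;                            // a non-hand object entered the window
        r = next;                             // accept the enlarged window
    }
    return r;
}
```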


References

[1] S. Mitra and T. Acharya, “Gesture recognition: a survey,” IEEE Transactions on Systems, Man, and Cybernetics (SMC) – Part C, vol. 37, no. 3, pp. 311-324, 2007.

[2] R. Dongwan and P. Junseok, “Design of an armband type contact-free space input device for human-machine interface,” IEEE International Conference on Communications, pp. 841-842.

[3] P. Kumar, S. S. Rautaray, and A. Agrawal, “Hand data glove: a new generation real-time mouse for human-computer interaction,” IEEE International Conference on Recent Advances in Information Technology, pp. 750-755, 2012.

[4] A. Ibarguren, I. Maurtua, and B. Sierra, “Layered architecture for real time sign recognition: hand gesture and movement,” Engineering Applications of Artificial Intelligence, vol. 23, no. 7, pp. 1216-1228, 2010.

[5] Fifth Dimension Technologies, 2012. (http://www.5dt.com/products/pdataglove5u.html)

[6] Y. Fang, K. Wang, J. Cheng, and H. Lu, “A real-time hand gesture recognition method,” IEEE International Conference on Multimedia and Expo, pp. 995-998, 2007.

