Using the Leap Motion Controller to Translate Sign Language to Speech
Name: Tam Chi Yan, Leung Ka Chun, Cheung Yat Laam, To Wun Yin
School: Engineering
Department: Computer Science
Year of study: 4
Abstract
This project aims to develop a sign language translator to improve communication between deaf people and the general public. It outputs speech and text when the user performs sign language in front of a Leap Motion Controller (LMC). The LMC captures images of hand gestures and converts them into positional and directional information. These data are compared with the data in the database to determine the most similar sign using the Dynamic Time Warping (DTW) algorithm. DTW measures the similarity between sequences or time series that may vary in time [1] by finding an optimal alignment between two time series. Once the gesture has been recognized, its corresponding speech is played and its meaning is displayed as text.
Our product recognizes gestures with more than 90% accuracy on a set of 50 gestures, each with 10 recorded samples, in less than 2 seconds after the action is performed. This shows the possibility and effectiveness of recognizing sign language using the LMC. It might help eliminate the language barrier between deaf people and hearing people in the future.
Table of Contents
Abstract
Table of Contents
1. Detail Description
1.1 Data Structure
1.1.1 Data Structure of Leap Motion
1.1.2 Data Structure of Our System
1.2 Gesture Matching Algorithm
1.3 GUI Design
1.3.1 Overall Design
1.3.2 Graphic Visualizer Design
1.3.3 Multilingual GUI Output Design
2. Discussion
3. References
1. Detail Description
The goal of this project is to develop a Human-Computer Interaction (HCI) application to improve communication between deaf people and the general public in Hong Kong. The application uses the cameras of the LMC to capture hand gestures and looks up the corresponding sign, so that the recognized gestures can be translated into text and speech in real time.
The system receives data from the LMC, which captures the hand movement, and compares them with the gestures stored in the database. If a sufficiently similar gesture is found, its meaning is displayed as text on the screen and the corresponding speech is played.
1.1 Data Structure
1.1.1 Data Structure of Leap Motion
The data sent from the LMC is a series of instances of class Frame. Each Frame object describes the hands recognized in that frame, including their directions and coordinates. Only part of the data in each Frame instance is stored, in order to reduce the size of the database.
1.1.2 Data Structure of Our System
To handle the data from the controller, we introduce class Coordinate for managing three-dimensional coordinates and enumeration HandType for representing the side (left or right) of each recognized hand. They are the major components of classes FingerData and PalmData, which organize the information about the captured fingers and palms respectively.
FingerData, PalmData and HandType form a customized frame, class OneFrame, which replaces the bulky class Frame from Leap Motion. An array of OneFrame objects is used as a simplified record of an input gesture from the controller. It is the fundamental part of class Sample.
A set of Sample objects is taken as the basis for our program to identify a particular gesture. Each gesture has a unique name and also records the number of fingers, the number of palms and the type of hands, which allows faster comparison among signs. These elements define class Sign, which represents a sign in our system.
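The relationships among these classes can be illustrated with a minimal Java sketch. Only the class names come from the description above; every field name, the use of lists instead of arrays, and the helper method mayMatch are assumptions made for illustration and may differ from the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Side of a recognized hand.
enum HandType { LEFT, RIGHT }

// A three-dimensional point.
final class Coordinate {
    final double x, y, z;
    Coordinate(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
}

// One captured finger (field names are assumptions).
final class FingerData {
    final HandType hand;
    final Coordinate tip;
    FingerData(HandType hand, Coordinate tip) { this.hand = hand; this.tip = tip; }
}

// One captured palm (field names are assumptions).
final class PalmData {
    final HandType hand;
    final Coordinate position;
    PalmData(HandType hand, Coordinate position) { this.hand = hand; this.position = position; }
}

// Lightweight replacement for the SDK's Frame class.
final class OneFrame {
    final List<FingerData> fingers = new ArrayList<>();
    final List<PalmData> palms = new ArrayList<>();
}

// One recorded performance of a gesture: a sequence of frames.
final class Sample {
    final List<OneFrame> frames = new ArrayList<>();
}

// A sign: unique name, summary information and its recorded samples.
final class Sign {
    final String name;
    final int fingerCount;
    final int palmCount;
    final Set<HandType> handTypes;
    final List<Sample> samples = new ArrayList<>();

    Sign(String name, int fingerCount, int palmCount, Set<HandType> handTypes) {
        this.name = name;
        this.fingerCount = fingerCount;
        this.palmCount = palmCount;
        this.handTypes = handTypes;
    }

    // Hypothetical pre-filter: skip signs whose summary cannot fit the frame.
    boolean mayMatch(OneFrame frame) {
        if (frame.fingers.size() != fingerCount || frame.palms.size() != palmCount) {
            return false;
        }
        for (PalmData palm : frame.palms) {
            if (!handTypes.contains(palm.hand)) {
                return false;
            }
        }
        return true;
    }
}
```

Storing the finger count, palm count and hand types on Sign lets a cheap check such as mayMatch rule out candidates before any frame-by-frame comparison is attempted.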
1.2 Gesture Matching Algorithm
Whenever a gesture is captured by the LMC, it is sent to our system and compared with the gestures stored in the database using the Dynamic Time Warping (DTW) algorithm. DTW measures the similarity between sequences or time series that may vary in time. It finds an optimal alignment between two time series, in which one of the series may be “warped” nonlinearly by stretching or shrinking its time axis. This optimal alignment is then used to determine the similarity between the two series. The recognition algorithm compares the given data (the normalized coordinates of fingertips and of palms) with the data in the database, gesture by gesture. DTW has been studied extensively; the paper by Ralph Niels explains its basic principle [1].
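For readers unfamiliar with DTW, the textbook dynamic-programming form is sketched below in Java. It is a generic illustration rather than the code used in this project: the class name Dtw, the method name distance and the pluggable cost function are assumptions, and the sketch returns the cumulative cost of the best alignment instead of an average.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class Dtw {
    // Minimal cumulative cost of aligning sequence a with sequence b,
    // where cost(x, y) is the distance between two single elements.
    static <T> double distance(List<T> a, List<T> b, ToDoubleBiFunction<T, T> cost) {
        int n = a.size();
        int m = b.size();
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        d[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double c = cost.applyAsDouble(a.get(i - 1), b.get(j - 1));
                // Extend the cheapest of: diagonal match, or a stretch of either series.
                double best = Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
                d[i][j] = c + best;
            }
        }
        return d[n][m];
    }
}
```

With the absolute difference as cost, the sequences 1, 2, 3 and 1, 2, 2, 3 have a distance of 0, because the repeated 2 is absorbed by stretching the time axis of the shorter sequence.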
The distance calculation used for aligning two sequences in DTW is the major concern of this project. We introduce an algorithm to calculate the difference between the gesture captured by the LMC and those stored in the database. Because gestures that begin at different coordinates would otherwise introduce errors, normalized coordinates of the fingers and palms are calculated for each frame. This reduces inconsistency before the distance between two frames is calculated. Our approach is as follows.
Consider a frame, called “frame n”, as shown in Figure 2. For each finger, the normalized finger coordinate is the coordinate of the finger relative to the palm in the same frame. This preserves the movement of the fingers while removing the error mentioned above.
Figure 2. Calculation of normalized finger coordinate
The normalized palm coordinate is the coordinate of the palm in “frame n” relative to the palm in the first frame of the Sample (i.e. frame 0). This preserves the movement of the palm.
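One plausible reading of the two normalizations is plain vector subtraction, sketched below in Java. The class and method names are hypothetical, coordinates are represented as arrays of three values, and the exact formula shown in the figures may differ.

```java
final class Normalizer {
    // Fingertip position relative to the palm of the same frame.
    static double[] normalizeFinger(double[] fingertip, double[] palm) {
        return subtract(fingertip, palm);
    }

    // Palm position in frame n relative to the palm in frame 0 of the same Sample.
    static double[] normalizePalm(double[] palmInFrameN, double[] palmInFrame0) {
        return subtract(palmInFrameN, palmInFrame0);
    }

    // Component-wise difference of two three-dimensional points.
    private static double[] subtract(double[] p, double[] q) {
        return new double[] { p[0] - q[0], p[1] - q[1], p[2] - q[2] };
    }
}
```

Under this reading, a gesture performed 50 units further to the right yields the same normalized values, because the offset appears in both operands and cancels out.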
The normalized coordinates of fingertips and palms are used to calculate the distance between two frames. For example, given a frame from Sample A (Frame A) and a frame from Sample B (Frame B), the distance between them is calculated with the equation shown in Figure 4.
Figure 4. Equation of calculating distance between two frames
Because the normalized coordinates preserve the movement of fingers and palms, this distance compares the properties of two frames. Once the distances between all pairs of frames have been calculated, DTW is applied to find the optimal alignment and the average distance between two gestures.
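The equation itself appears only in the figure, so the following Java sketch should be read as an assumption rather than a reproduction: it sums the Euclidean distances between corresponding normalized fingertips and adds the Euclidean distance between the normalized palms. The class name FrameDistance and its method names are hypothetical.

```java
final class FrameDistance {
    // Euclidean distance between two three-dimensional points.
    static double euclidean(double[] p, double[] q) {
        double dx = p[0] - q[0];
        double dy = p[1] - q[1];
        double dz = p[2] - q[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // Assumed frame distance: palm distance plus the sum of fingertip distances.
    // Each row of fingersA / fingersB is one normalized fingertip, in the same order.
    static double between(double[][] fingersA, double[] palmA,
                          double[][] fingersB, double[] palmB) {
        if (fingersA.length != fingersB.length) {
            throw new IllegalArgumentException("frames must contain the same number of fingers");
        }
        double sum = euclidean(palmA, palmB);
        for (int i = 0; i < fingersA.length; i++) {
            sum += euclidean(fingersA[i], fingersB[i]);
        }
        return sum;
    }
}
```

A function of this shape can serve as the element cost inside DTW, so that two gestures are compared frame by frame.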
Due to the limitations of the LMC, our product can capture only those hand signs that involve finger movements. Signs that involve other limbs and joints are not considered.
The DTW algorithm is used for matching gestures. It is easy to implement because numerous source code implementations of DTW are available. Nevertheless, some
gestures. A gesture sample can be described as a series of frames, so matching two gestures amounts to comparing two series of frames. Therefore, the distances between the frames of the two gestures are the major concern during implementation. The equation introduced above has been implemented in DTW to calculate the distance between two frames, and is shown again below.
Figure 5. Equation of calculating distance between two frames
When the system tries to recognize a gesture sample (Sample A), it compares it with the gestures in the database using DTW to find the most similar one. The gesture with the minimum distance to Sample A (Sample B) is said to be “matched”. However, if the user performs a gesture that does not exist in the database, the system still returns the most similar one. Such an inappropriate recognition might lead to an incorrect translation that confuses both user and listener.
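The minimum-distance lookup described above can be sketched in Java as follows. The class name GestureMatcher, the map-based database layout and the pluggable distance function are assumptions for illustration; in the real system the distance would be the DTW distance between two Samples.

```java
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

final class GestureMatcher {
    // Returns the name of the sign owning the stored sample closest to the query,
    // or null when the database holds no samples.
    static <S> String bestMatch(S query, Map<String, List<S>> database,
                                ToDoubleBiFunction<S, S> distance) {
        String bestName = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, List<S>> sign : database.entrySet()) {
            for (S sample : sign.getValue()) {
                double d = distance.applyAsDouble(query, sample);
                if (d < bestDistance) {
                    bestDistance = d;
                    bestName = sign.getKey();
                }
            }
        }
        return bestName;
    }
}
```

Because the method always returns the nearest stored sign, it reproduces the weakness noted above: a gesture that is not in the database is still mapped to some sign.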
Theoretically, the above equation can evaluate the distance between two frames with the same number of hands. If the user performs a gesture with two hands, a series of frames with two hands will be generated. However, the LMC occasionally fails to capture some data, so a few frames might record one hand only. The equation cannot be applied directly to such frames because the number of hands differs. We take the following approach to tolerate this condition.
Figure 6. Frames with different numbers of hands
Consider the example in Figure 6. If the two frames contain different numbers of hands (e.g. two hands in frame A and one hand in frame B), the frame with fewer hands (frame B) is considered first. We check whether the hand in frame B is the left or the right hand. Assuming it is the left hand, we ignore the right hand in frame A and compare only the left hands of frames A and B. Since half of the data in frame A is discarded, the distance calculated in this way must be adjusted: half of the boundary value is added to the distance. As this condition occurs only occasionally, only a few frames are affected, and the adjusted distances do not dominate the average distance calculated in DTW.
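The adjustment can be sketched in Java as below. This is an illustration under several assumptions: each hand is reduced to one feature vector of normalized coordinates, the per-hand distance is Euclidean, BOUNDARY is a made-up constant (the report does not state its value), and the fallback for frames with no hand in common is not described in the text. HandType is redeclared so that the block stands alone.

```java
import java.util.Map;

enum HandType { LEFT, RIGHT }

final class MismatchTolerantDistance {
    // Hypothetical boundary value; the actual number is not given in the report.
    static final double BOUNDARY = 100.0;

    // Euclidean distance between two feature vectors of equal length.
    static double euclidean(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Compares only the hands present in both frames, then adds half of
    // BOUNDARY when the two frames contain different numbers of hands.
    static double between(Map<HandType, double[]> frameA, Map<HandType, double[]> frameB) {
        double distance = 0.0;
        int shared = 0;
        for (HandType side : HandType.values()) {
            double[] a = frameA.get(side);
            double[] b = frameB.get(side);
            if (a != null && b != null) {
                distance += euclidean(a, b);
                shared++;
            }
        }
        if (shared == 0) {
            return BOUNDARY; // assumption: nothing comparable, use the full boundary
        }
        if (frameA.size() != frameB.size()) {
            distance += BOUNDARY / 2.0; // penalty for the ignored hand
        }
        return distance;
    }
}
```

With BOUNDARY set to 100, a two-hand frame compared with a one-hand frame whose shared left hands are 5 units apart gives 5 + 50 = 55.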
1.3 GUI Design
1.3.1 Overall Design
Figure 7. Graphic User Interface (GUI) prototype
The GUI is developed with JavaFX and JavaFX Scene Builder. It contains multiple tabs, each providing a different function of the product.
The “Record” tab allows the user to set up a new gesture and store it in the database.
The “Recognize” tab lets the user perform a gesture to output the preset
1.3.2 Graphic Visualizer Design
In this program, we use JavaFX as the graphics library of the visualizer. The visualizer is a class with a SubScene, which can be added to another Group.
Figure 8. LMC official visualizer
Figure 9. Visualizer in this project
Inspired by the official Leap Motion visualizer (Figure 8), we decided to render the hand as a skeleton only in our built-in visualizer (Figure 9). Unlike the official visualizer, our skeleton hand omits unnecessary details but still keeps a recognizable appearance. In addition, the visualizer needs a replay function to show a stored gesture to the user in an understandable way. Hence, the visualizer has to update the screen both with the changing input from the LMC and with gestures retrieved from the database.
1.3.3 Multilingual GUI Output Design
The user can turn on the recognition mode through the GUI. Our system supports Cantonese and English translation. In recognition mode, the system continuously compares the performed gesture with the gestures stored in the database. If the gesture is recognized, the corresponding text is displayed in the text area of the GUI. Also, the word stored in the database will be sent to Google text-to-speech
2. Discussion
Using the LMC to translate sign language to speech is the main purpose of this project. A database with 50 signs and a total of more than 500 samples has been recorded. According to the test results, our product recognizes a gesture with more than 90 percent accuracy. This shows that sign language can be translated accurately using the LMC. Nevertheless, our product is still restricted by the limitations of the LMC.
These limitations affect how effectively the LMC captures gestures and restrict the choice of gestures available to users. As the LMC captures hand movement with infrared cameras from one direction only, the view is blocked when the user’s fingers overlap. Although the LMC tries to predict the positions of the fingers whenever the view is blocked, the prediction is not accurate enough, and the inaccurate raw data might cause recognition errors.
Besides the effectiveness of capturing gestures, the detection range is also a problem. The region in which the LMC can capture data is bounded by the range of its infrared cameras: a field of view of about 150 degrees, from approximately 3 to 60 centimeters above the device. Some signs cannot be performed within this narrow detection range. This distorts the representation of gestures involving movements around the chest and head, inducing unpredictable recognition flaws.
To solve these problems, we have to define new gestures for those signs that are unsuitable for detection by the LMC. First, gestures in which overlapping fingers block the view must be avoided. Second, gestures must be performed close to the LMC device.
users to record the standard signs performed by themselves and self-defined gestures. This customization feature turns our product into a personal product. It works well for the users who recorded the gestures but not for others, because hand sizes and self-defined gestures vary. The hand size and the way of performing self-defined gestures of the person who recorded them fit the data in his or her own database best. Users can therefore train our product to further improve accuracy.
Our first intention was to implement the sign language translator on a mobile platform. It would be a portable translator that users could use conveniently rather than sitting in front of a computer, which would definitely lower the language barrier between deaf people and hearing people. Nonetheless, the computation power of smartphones is far from enough for the LMC. It requires a powerful CPU such as an Intel Core i3 / i5 / i7 or AMD Phenom™ II [3], which are currently available only in laptop or desktop computers. It is difficult to implement our sign language translator on a smartphone right now, but it may become possible in the future. Take the iPhone 6s Plus [4] as an example: it has a 1.85 GHz dual-core 64-bit ARM CPU. The hardware specification of the iPhone 6s Plus is close to the requirement of the LMC but still not enough. The computation power of smartphones is expected to keep increasing, and may meet the requirement of the LMC within several years or even several months.
This product has great potential for further development in sign language recognition. Our group has demonstrated the capability and efficiency of the DTW algorithm applied to gesture recognition. DTW is a relatively simple and easily understood algorithm for undergraduate students compared with many sophisticated algorithms and models. If developers with advanced skills and knowledge carry out further development, more advanced features such as natural language processing and artificial intelligence could be added. Translating sign language into full sentences with correct grammar could be one result of such work. Deaf people might then communicate with hearing people using the LMC without any language barrier in the future.