Virtual Arm Environment - A Vision and Motion Coordination System Based on Deep Learning and Transfer Learning for a Robot Operating a Tablet Virtual Keyboard

First of all, our robot is not designed for industrial use, and its motors overheat frequently. Therefore, we want the agent to learn the task as quickly as possible.

As mentioned before, training agents in a virtual environment is a safer and faster way to evaluate an RL algorithm. Thus, we developed a simple virtual arm environment to evaluate our assumption about the vision feature. The virtual arm environment is shown in Figure 3.6, where the stylus pen is colored yellow and the target area is marked with a blue box. The joint angles are constrained to the valid range in which NAO can move its arm.

Attention DDPG

The deep deterministic policy gradient (DDPG) algorithm [30] that we applied is well suited to our problem. Mapping the high-dimensional arm joint configuration to the position of the limb end of the physical agent is a continuous-control task, which DDPG is capable of handling.

Additionally, the DDPG algorithm is model-free. Our work therefore adapts DDPG to the specific problems we face.

Under our assumption, the state 𝑠_𝑡 is formed from the relative positions and the current joint angles, and we obtain an action from the deterministic policy 𝜇 by 𝑎_𝑡 = 𝜇(𝑠_𝑡). Because the algorithm is model-free, it needs no adjustment to evaluate this assumption.

We only have to change the action and state dimensions that define the shapes of the deep neural networks for 𝜇 and 𝑄^𝜇. The vision state that the agent can observe is the pixel-wise distance between the center of the key button and the end of the stylus pen. The x-axis and y-axis distances are normalized to the interval [−1, 1] and serve as the relative position of the two objects.

Figure 3.6 Virtual arm environment

Secondly, in a continuous-control task, directly applying the angles generated by the policy 𝜋 usually makes the training phase unstable. A common remedy is to apply the 𝑡𝑎𝑛ℎ function at the last layer of the deep model and scale the angle change down to a small range. We therefore add an attention layer before the output layer and use the magnitude of the attention map as the scale-down ratio of each joint. The idea is to let the more critical joints receive more momentum when executing the action generated by the policy. We apply the self-attention mechanism [34] to improve training performance. The implementation details are described in the next chapter.

Experiment Design and Implementation

Experiment Platform

NAO¹⁷

NAO is a robot that runs a Gentoo-based Linux system, and its applications can be developed in Python, C++, Java, and JavaScript. Besides supporting many programming languages, Aldebaran Robotics also provides a relatively simple GUI development tool, Choregraphe¹⁸. Figure 4.1 shows the appearance of NAO. We chose NAO because it looks like a child, which makes it suitable for applications such as accompanying the elderly in long-term care. Its human-like body also makes it possible to help people with limited abilities carry out simple tasks on 3C devices.

Figure 4.1 NAO V5 [2]

17 https://www.softbankrobotics.com/emea/en/nao

18 http://doc.aldebaran.com/1-14/software/choregraphe/

Physical Setup

It is essential to create a safe and proper experimental setting for robots, especially because the NAO robot is not as robust as robots built for industrial applications. Some constraints have to be taken into consideration. First of all, in all experiments we set NAO in a sitting position to avoid overloading the knee motors. However, the knee motors may still overheat if NAO stays seated for a long period. Therefore, we fix the stiffness of each leg alternately to keep the body balanced. Moreover, if the arm motor temperature is too high, NAO is put into rest mode until the motor cools down. Additionally, a tablet computer that displays the virtual keyboard is placed in front of NAO. In this way, the NAO robot can see the state of the keyboard and its own hand movements, and touch the target position according to the visual information. The physical setup of the robot is shown in Figure 4.2.

Another problem is how the NAO robot can hold the stylus pen in its hand. In the beginning, we made NAO extend its index finger and tried to use it to press a button on a regular physical keyboard, as shown in Figure 4.3.

Figure 4.2 Physical setup of the Nao robot

Figure 4.3 Press keyboard button by finger

Figure 4.4 Filled with foam tape between fingers and pen

Unfortunately, due to the fragility of NAO’s finger structure, this is neither the best nor the safest choice. At the same time, how NAO holds the pen also has to be taken into account.

We found it difficult for NAO to hold the pen properly, since the robot’s hand has only three fingers. To keep the pen from sliding away, we filled the gaps with foam tape to bind NAO’s fingers and the pen together; this simply tightens the hand’s grip on the pen. The resulting hand and pen are shown in Figure 4.4.

On the other hand, with a physical keyboard, the distance information that we use to overcome the sparse-reward problem would be less accurate. Therefore, we replaced the physical keyboard with a tablet, which reports the position where the stylus pen touches the on-screen keyboard.

Remote Computation System Setup

A scheme of our remote computation setting is shown in Figure 4.5. The system consists of one remote computer (a desktop), one NAO robot, and one tablet.

NAO’s embedded computer has a single-core 1.60 GHz CPU (Atom Z530) and only 1 GB of RAM, so its computation resources are very restricted. A remote computer is therefore introduced as the centralized computation unit of the whole system.

It is designed to infer the distance between the end of the pen that NAO holds and the target key button, and to evaluate how the joint angles of NAO’s hand should change to bring the two to the same place. We also use the GPU on the remote computer to handle the relatively heavy training phase and to run the inference work, which consumes considerable RAM and requires parallel computation.

Figure 4.5 Remote Computation System (physical agent and cyber agent)

The NAO robot is the physical agent of our task. It provides action and perception capability to interact with the real world. We process the images from NAO’s camera via the local wireless network and send a command to NAO’s hand to execute the action we want it to do.

The tablet will record the positions pressed by NAO in the training phase. The distance between the point pressed and the target key position will be used for evaluation of the reward.

We use a Microsoft Surface Pro 4¹⁹ to display the virtual keyboard. To avoid extra transmission latency and unnecessary remote-computation effort, we also implemented a virtual keyboard on the remote computer and use remote desktop software to display it on the tablet with the same layout as the Surface Pro 4 keyboard.


Virtual Keyboard on the Tablet

To train the NAO agent to type on a keyboard, we need feedback from the environment. We used the Python package pyglet to build a virtual keyboard similar to the keyboard layout of the Surface Pro 4 tablet. When the tablet screen is touched, the virtual keyboard program reports the relative distance calculated from the xy-coordinates on the screen. Moreover, since our system’s computations are executed remotely from the agent, large latency and extra code for network communication should be avoided. We keep the virtual keyboard and the reinforcement learning agent separate and use remote desktop software (e.g., TeamViewer²⁰, AnyDesk²¹, VNCviewer²²) to simulate the real keyboard environment. Figure 4.6 shows the virtual keyboard and the original tablet keyboard.
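The distance computation performed on a touch event can be sketched as follows. This is a minimal illustration, not the actual keyboard program: the function name `touch_distance` and the choice to normalize by the screen diagonal are assumptions made for the example, since the text only states that the distance is calculated from the xy-coordinates on the screen.

```python
import math


def touch_distance(touch_xy, key_center_xy, screen_size):
    """Normalized distance between a touch point and the center of a key.

    Dividing by the screen diagonal is an assumption of this sketch; it keeps
    the result in the interval [0, 1] for any two points on the screen.
    """
    dx = touch_xy[0] - key_center_xy[0]
    dy = touch_xy[1] - key_center_xy[1]
    diagonal = math.hypot(screen_size[0], screen_size[1])
    return math.hypot(dx, dy) / diagonal
```

A touch exactly on the key center yields 0, and the value grows as the touch lands farther from the target.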

NAO gym package

Gym²³ is a toolkit released by OpenAI for developing and improving reinforcement learning algorithms. It supports teaching agents to act in a virtual environment built for them, which is especially useful when the hardware is fragile or expensive, or when the algorithm is in an early development phase. We leverage this toolkit to build a virtual environment that integrates the NAO Python SDK and the virtual keyboard that we developed previously.

Once we have constructed the environment, new algorithms or vision models can be used to improve the system, and we can directly replace any component of a cyber-agent in the environment.

20 https://www.teamviewer.com/tw/download/windows/

21 https://anydesk.com/zhs

22 https://www.realvnc.com/en/connect/download/viewer/

23 https://gym.openai.com/

Figure 4.6 Virtual Keyboard layout

Moreover, others who want to conduct similar research or build similar applications can directly use our NAO gym implementation to train their own NAO robot.

The package mainly includes the following APIs:

reset() function: Initializes the sitting pose, then safely moves the arm that grasps the pen to a position above the tablet panel. It can also set the arm to an initial pose that makes the pen visible to the agent. This function returns the state, which concatenates the arm angles and the distances between the pen and the target position.

step(action) function: Converts an input action vector to an angle-setting command and sends it to the physical agent. It returns the state, consisting of the arm angles and the distance between the pen and the target position after the action has been executed.

onTouch() function: Monitors the tactile sensors, or the deviation between the signal read by the joint sensors and the command sent to the robot body. We can use this event-detection capability to judge whether the stylus touches the tablet screen.

getTemperature() function: Monitors the temperature of each joint motor. If a motor gets too hot, NAO is set to the rest pose until the motor cools down.
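The shape of this interface can be sketched in plain Python. The sketch below is not the actual package: `FakeArm` is a hypothetical stand-in for the robot and the vision system so that the example runs without hardware, the two thresholds are assumed values, and treating the action as a vector of angle increments is also an assumption.

```python
class FakeArm:
    """Hypothetical stand-in for the robot, used only to make the sketch runnable."""

    def __init__(self, surface_limit=0.5):
        self.commanded = [0.0] * 5
        self.temperatures = [40.0] * 5
        self.surface_limit = surface_limit  # the first joint is blocked at this angle
        self.resting = False

    def set_angles(self, angles):
        self.commanded = list(angles)

    def sensed_angles(self):
        # The simulated tablet stops the first joint at surface_limit.
        sensed = list(self.commanded)
        sensed[0] = min(sensed[0], self.surface_limit)
        return sensed

    def pen_to_target(self):
        # Fake "vision": the distance shrinks as the first joint nears the surface.
        return [self.surface_limit - self.sensed_angles()[0], 0.0]


class NaoKeyboardEnv:
    TOUCH_DEVIATION = 0.05  # assumed threshold in radians
    MAX_TEMPERATURE = 70.0  # assumed threshold in degrees Celsius

    def __init__(self, arm):
        self.arm = arm

    def _state(self):
        # State = arm angles concatenated with the pen-to-target distances.
        return self.arm.sensed_angles() + self.arm.pen_to_target()

    def reset(self):
        self.arm.resting = False
        self.arm.set_angles([0.0] * 5)
        return self._state()

    def step(self, action):
        # Assumption: the action is a vector of angle increments.
        target = [a + d for a, d in zip(self.arm.commanded, action)]
        self.arm.set_angles(target)
        return self._state()

    def onTouch(self):
        # A large gap between commanded and sensed angles suggests contact.
        sensed = self.arm.sensed_angles()
        deviations = [abs(c - s) for c, s in zip(self.arm.commanded, sensed)]
        return max(deviations) > self.TOUCH_DEVIATION

    def getTemperature(self):
        temperatures = list(self.arm.temperatures)
        if max(temperatures) > self.MAX_TEMPERATURE:
            self.arm.resting = True  # rest until the motor cools down
        return temperatures
```

Following the description above, `step` returns only the state; a full Gym environment would additionally return the reward, a termination flag, and an info dictionary.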

Vision

Transfer Learning

Our framework is composed of a vision part and a motion part, so the vision model can be designed and trained independently. We applied the YOLOv3 model [20] and retrained it with transfer learning [35]. First, we froze the front layers of the model and trained on our data for five epochs. After that, we unfroze all layers of the YOLO model and continued training, as shown in Figure 4.7.
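The two-stage schedule can be expressed independently of any deep learning framework. In the sketch below, the function name `trainable_flags` and the parameter `num_frozen` (how many front layers are frozen) are hypothetical; the text specifies only the five frozen epochs, not the number of frozen layers.

```python
def trainable_flags(epoch, num_layers, num_frozen, freeze_epochs=5):
    """Return one boolean per layer: True if the layer is updated in this epoch.

    epoch is zero-based. During the first freeze_epochs epochs the first
    num_frozen layers stay fixed; afterwards every layer is trained.
    """
    if epoch < freeze_epochs:
        return [index >= num_frozen for index in range(num_layers)]
    return [True] * num_layers
```

In a concrete framework, these flags would be applied to each layer's trainable attribute before the corresponding epoch starts.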

Data Augmentation

To fit the model to the distribution of the panel layout without overfitting, we use data augmentation to increase the number of training samples. We augment our dataset in two ways. First, we use a computer vision annotation tool called CVAT [36], a free, online, interactive video and image annotation tool for computer vision. We use CVAT to label the keyboard layout with the location and bounding box of each letter button on two pictures of the tablet, as shown in Figure 4.8. Then we apply random OpenCV-powered augmentation to the input images according to the specifications in Table 2.

Figure 4.7 Freeze front layers in prior epoch for transfer learning

The bounding boxes are tracked automatically and updated together with the images.

Second, we use cropped letter button images to enhance the training process by randomly applying affine transforms to them and compositing them onto images from the COCO dataset [33]. The random affine parameters of the synthesized characters are listed in Table 3. The cropped keyboard buttons are shown in Figure 4.9.

Finally, in order to detect the stylus pen correctly, we also have to produce labeled data for it. We first tried CVAT again, recording a short video that captures the pen from different view angles. However, the annotation job is tedious, and the captured images tend to contain afterimages caused by the angle changes.

Figure 4.8 Label data with CVAT

Table 2 Augmentation parameters of keyboard

Augmentation Description

Translation +/- 10% (vertical and horizontal)

Rotation +/- 5 degrees

Shear +/- 2 degrees

Scale +400%/-10%

HSV Saturation +/- 50%

HSV Intensity +/- 50%
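The ranges in Table 2 can be turned into a random parameter sampler, sketched below with the standard library only. Several points are assumptions of this sketch rather than details from the text: parameters are drawn uniformly, "+400%/-10%" is read as a scale factor between 0.9 and 5.0, the HSV entries are read as gains between 0.5 and 1.5, and shear is sampled but left out of the matrix for brevity. The matrix has the same form as the one produced by OpenCV's `getRotationMatrix2D`, with a translation added, and applying it to box corners is how labels can follow the image.

```python
import math
import random


def sample_keyboard_augmentation(rng):
    """Draw one parameter set within the ranges of Table 2 (uniform sampling assumed)."""
    return {
        "translate_x": rng.uniform(-0.10, 0.10),  # fraction of image width
        "translate_y": rng.uniform(-0.10, 0.10),  # fraction of image height
        "rotation_deg": rng.uniform(-5.0, 5.0),
        "shear_deg": rng.uniform(-2.0, 2.0),
        "scale": rng.uniform(0.9, 5.0),           # assumed reading of +400%/-10%
        "saturation_gain": rng.uniform(0.5, 1.5),
        "intensity_gain": rng.uniform(0.5, 1.5),
    }


def affine_matrix(params, width, height):
    """2x3 matrix: rotate and scale about the image center, then translate."""
    theta = math.radians(params["rotation_deg"])
    a = params["scale"] * math.cos(theta)
    b = params["scale"] * math.sin(theta)
    cx, cy = width / 2.0, height / 2.0
    tx = params["translate_x"] * width
    ty = params["translate_y"] * height
    return [
        [a, b, (1 - a) * cx - b * cy + tx],
        [-b, a, b * cx + (1 - a) * cy + ty],
    ]


def transform_point(matrix, x, y):
    """Apply the 2x3 affine matrix to one point, e.g. a bounding-box corner."""
    new_x = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    new_y = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return new_x, new_y
```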

Instead, we use an image processing program to remove the background of the stylus pen images and save them as PNG files with a transparency channel. Then we process them in the same way as the letter buttons, compositing them onto COCO dataset images.

However, because the NAO camera has a fixed focal distance, the stylus pen can be difficult to detect when it is too close to the camera if we simply composite the sharp pen images. Therefore, during synthesis we blur the image according to the scale of the stylus pen placed on the background. Some images of stylus pens are shown in Figure 4.10. Figure 4.11 shows the augmented training images with their labels, and the inference result of the model.

Figure 4.9 Cropped character button

Table 3 Parameter of random affine of character button on COCO images

Augmentation Description

Rotation +/- 5 degrees

Shear +/- 2 degrees

Scale +400%/-10%

Figure 4.11 The images on the left are augmented and synthesized with OpenCV. The image on the right is the inference result of the vision model.

Figure 4.10 Stylus pen image without a background.

Motion

Constraints on Joint Angles

As mentioned before, the motors of the robot overheat frequently.

Moreover, collision of the robot body is the major problem that we try to avoid. To this end, we found the joint angles that NAO should not be set to by exploring all the angles it can reach. It is reasonable to assume that, in a simple, repetitive task, the trajectory in the action space occupies only a small sub-space. Thus, we restrict the angles of the arm to validated ranges within which NAO can still touch the keyboard. Figure 4.12 shows the angle ranges of each joint. Table 4 shows the restricted ranges.

Figure 4.12 The angle ranges of each joint

Reward Engineering

The reward is the critical factor that directly affects the training process and the policy the agent learns. In our implementation, the relative distance is the primary factor of concern. There are two relative distances that we can use: one is the distance between the pen and the target letter as seen by the agent; the other is the distance between the position it touched and the target letter location on the screen. Moreover, we want the agent to avoid being trapped in the sparse-reward condition, so we encourage it to keep both the pen and the letter in its sight. The agent cannot perceive depth information because its two cameras are arranged in a vertical line. However, Sergey Levine et al. [8] [9] showed that even when the agent receives only a two-dimensional image, it can still learn a policy to complete the task. To offer more pose information, we also provide the ratio of the bounding box areas of the pen and the target key button. Finally, we want the agent to type the character button, not just press on it. Thus, when the agent stays on the touched position too long or drags the pen too far away, we give it a negative reward. Table 5 shows the reward setting.

Table 4 Restricted range

Joint name Range (radians) Restricted range (radians)

RShoulderPitch -2.0857 to 2.0857 0.3958 to 0.9097

RShoulderRoll -1.3265 to 0.3142 -1.3265 to 0.3142

RElbowYaw -2.0857 to 2.0857 -0.2016 to 2.0857

RElbowRoll 0.0349 to 1.5446 0.3509 to 1.5446

RWristYaw -1.8238 to 1.8238 0.2536 to 1.8238
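One simple way to enforce the restricted ranges of Table 4 is to clip every commanded angle before it is sent to the robot. The sketch below assumes clipping is an acceptable enforcement strategy; the text does not state how the restriction is implemented, and the function name `clip_to_restricted` is hypothetical.

```python
# Restricted ranges in radians, copied from Table 4.
RESTRICTED_RANGES = {
    "RShoulderPitch": (0.3958, 0.9097),
    "RShoulderRoll": (-1.3265, 0.3142),
    "RElbowYaw": (-0.2016, 2.0857),
    "RElbowRoll": (0.3509, 1.5446),
    "RWristYaw": (0.2536, 1.8238),
}


def clip_to_restricted(angles):
    """Clamp each commanded joint angle (radians) into its restricted range."""
    clipped = {}
    for name, value in angles.items():
        low, high = RESTRICTED_RANGES[name]
        clipped[name] = min(max(value, low), high)
    return clipped
```

For example, a command of 0.0 rad for RShoulderPitch is raised to 0.3958 rad, while a command of 1.0 rad for RWristYaw is already valid and passes through unchanged.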

Table 5 The reward setting

State type Score rate

Normalized relative distance by the vision −𝑥

Normalized relative distance by the touched position 1 − 𝑥

Pen appearance −𝑥

Target appearance −0.5𝑥

Pixel relative distance by the vision smaller than 40 −𝑥

Cosine similarity of the set angles and the measured angles −1 + 𝑥

Pressed the target letter +20

Pressed a button +1

Time that the pen stays in contact with the screen −𝑥

Normalized relative distance dragged from the touched position −𝑥

State Observation

The state we observe from the environment is the input to the policy. In general, the more information, the better. However, we cannot provide all the information to the agent, such as the relative position between the point the pen touched and the target on the screen. That information is not available to the agent in the testing phase, so it would be unreasonable to feed it to the policy as part of the state. Besides, the agent cannot get any touch-based reward from the environment before it touches the screen, and such a sparse reward might not be helpful. Thus, the observations that we provide to the agent are:

1. Relative position (𝑥, 𝑦) = (𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛_{𝑝𝑒𝑛} − 𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛_{𝑡𝑎𝑟𝑔𝑒𝑡})/𝑣𝑖𝑠𝑖𝑜𝑛_𝑤𝑖𝑑𝑡ℎ.

2. Pen appearance in sight. Set to 1 if it appears, otherwise 0.

3. Target appearance in sight. Set to 1 if it appears, otherwise 0.

4. Area ratio 𝑎𝑟𝑒𝑎_{𝑡𝑎𝑟𝑔𝑒𝑡}⁄𝑎𝑟𝑒𝑎_{𝑝𝑒𝑛} if both objects appear, otherwise 0.

5. The joint angles measured by the joint sensors. A five-dimensional tuple of angle values.

6. The joint angles set by command. A five-dimensional tuple of angle values.

7. The absolute position of the target in sight (𝑥, 𝑦) = 𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛_{𝑡𝑎𝑟𝑔𝑒𝑡}/𝑣𝑖𝑠𝑖𝑜𝑛_𝑤𝑖𝑑𝑡ℎ.

We combine the above information into a 17-dimensional vector (2 + 1 + 1 + 1 + 5 + 5 + 2) that represents the observed state and send it to the policy 𝜋 to make a decision. The agent executes the action, obtains the new state and a reward from the environment, and then continues the training process.
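The assembly of the observation vector can be sketched as follows. The function name `build_observation`, the box format `(center_x, center_y, width, height)`, and the use of zeros for the position entries when an object is not detected are assumptions made for this example; the list above specifies the fallback value only for the flags and the area ratio.

```python
def build_observation(pen_box, target_box, sensed_angles, commanded_angles, vision_width):
    """Build the 17-dimensional state from detections and joint angles.

    pen_box and target_box are (center_x, center_y, width, height) in pixels,
    or None when the object is not detected (hypothetical format).
    """
    if len(sensed_angles) != 5 or len(commanded_angles) != 5:
        raise ValueError("expected five joint angles")
    pen_seen = pen_box is not None
    target_seen = target_box is not None
    if pen_seen and target_seen:
        relative = [
            (pen_box[0] - target_box[0]) / vision_width,
            (pen_box[1] - target_box[1]) / vision_width,
        ]
        area_ratio = (target_box[2] * target_box[3]) / (pen_box[2] * pen_box[3])
    else:
        relative = [0.0, 0.0]  # assumed fallback when either object is missing
        area_ratio = 0.0
    if target_seen:
        absolute = [target_box[0] / vision_width, target_box[1] / vision_width]
    else:
        absolute = [0.0, 0.0]  # assumed fallback
    return (
        relative
        + [float(pen_seen), float(target_seen), area_ratio]
        + list(sensed_angles)
        + list(commanded_angles)
        + absolute
    )
```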

Attention Mechanism

Our attention mechanism is a slight modification of the one proposed in [34]. Given a task-related query vector 𝑞, we compute the attention distribution over the 𝑘𝑒𝑦 and apply it to the 𝑣𝑎𝑙𝑢𝑒 to obtain the attention value. We define 𝑋 = (𝑥₁, … , 𝑥_𝑛) as the input data. With 𝑘𝑒𝑦 = 𝑣𝑎𝑙𝑢𝑒 = 𝑋, the attention distribution is 𝛼_𝑖 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑠(𝑘𝑒𝑦_𝑖, 𝑞)) = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑠(𝑋_𝑖, 𝑞)), where 𝑠(𝑋_𝑖, 𝑞) is an attention estimator. In general, 𝑠(𝑋_𝑖, 𝑞) = 𝑋_𝑖^T𝑞. The attention distribution 𝛼_𝑖 expresses how much attention the 𝑖th element receives when we query with 𝑞. A soft information selection mechanism then encodes the input as 𝑎𝑡𝑡(𝑞, 𝑋) = ∑_{𝑖=1}^{𝑁} 𝛼_𝑖𝑋_𝑖.

The significant difference between self-attention and the traditional attention mechanism is that in conventional attention the query 𝑞 usually comes from the external environment, whereas in self-attention 𝑞 comes from the input itself. Therefore, we add the following attention layer to the output layer.

The value comes from the last output layer:

𝑋 = (𝑥₁, … , 𝑥_𝑛)

The attention estimator:

𝑠(𝑘𝑒𝑦_𝑖, 𝑞) = 𝑋_𝑖^T𝑋_𝑖

The attention distribution is obtained by softmax:

𝛼_𝑖 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑠(𝑘𝑒𝑦_𝑖, 𝑞))

Finally, we apply the attention distribution to the output data:

𝑎𝑡𝑡(𝑞, 𝑋) = ∑_{𝑖=1}^{𝑁} 𝛼_𝑖𝑋_𝑖
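The four steps above can be traced numerically with a small sketch. It assumes that each 𝑋_𝑖 is the scalar output for one joint, so that 𝑋_𝑖^T𝑋_𝑖 reduces to a square, and it returns the per-joint terms 𝛼_𝑖𝑋_𝑖 separately so that 𝛼_𝑖 can be read as the scale ratio of joint 𝑖. Both points are one reading of the text, not a confirmed description of the implementation, and the function name `self_attention_scale` is hypothetical.

```python
import math


def self_attention_scale(x):
    """Return (alpha, weighted) for a list of per-joint outputs x.

    alpha is the softmax of the scores x_i * x_i, and weighted holds the terms
    alpha_i * x_i whose sum is att(q, X).
    """
    scores = [value * value for value in x]
    highest = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    alpha = [e / total for e in exponentials]
    weighted = [a * value for a, value in zip(alpha, x)]
    return alpha, weighted
```

Joints with a larger output magnitude receive a larger share of the attention distribution, which matches the intent of giving the more critical joints more momentum.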

Experiment Results and Discussion

We conduct the experiments according to the proposed features.
