
For the computer vision part, we compare the results and performance of Attention-OCR, introduced briefly in Chapter 2 as a straightforward approach, with those of the YOLO system.

Figure 5.1 shows a detection result of Attention-OCR. Many letters are not detected, especially when the shape of a key is severely distorted by perspective. We therefore first remove the perspective deformation of the keyboard area, then use Canny edge detection [37] to extract the edges of each button and detect tetragons. Finally, Attention-OCR is applied to recognize all the cropped candidates. This pipeline is shown in Figure 5.2.
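The perspective-removal step can be illustrated with a four-point homography. The sketch below is only an illustration under stated assumptions: the four keyboard corners are taken as already known, the corner coordinates and the 400 × 200 target rectangle are hypothetical values, and the helper names are ours rather than part of the original system.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 9 entries (row-major, h[8] = 1) mapping 4 src points to dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b) + [1.0]


def warp_point(h, x, y):
    """Apply the homography to one pixel coordinate."""
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical keyboard corners in the camera image (clockwise from top-left).
KEYBOARD_CORNERS = [(120, 80), (520, 110), (560, 300), (90, 330)]
# Hypothetical fronto-parallel target rectangle of 400 x 200 pixels.
RECTIFIED_CORNERS = [(0, 0), (400, 0), (400, 200), (0, 200)]

H = homography(KEYBOARD_CORNERS, RECTIFIED_CORNERS)
```

After rectification, every key appears as an axis-aligned rectangle, which is what makes the subsequent edge extraction and tetragon detection feasible.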

Figure 5.1 Detection result by Attention-OCR

However, this detection pipeline suffers from two major issues. First, as in R-CNN [24], candidate-region selection is the bottleneck of the whole process, and the pipeline reaches only 0.1538 FPS. Second, the camera resolution is too low to extract the edges of the target key reliably. Moreover, the process is sensitive to changes in the environment and requires further adjustment, so in many cases the button cannot be cropped correctly. Figure 5.3 shows the edges detected by this pipeline. Such a sequential, non-parallelized workflow is the main reason why traditional computer vision is slow compared with an end-to-end CNN.

Figure 5.3 The edge detection result enhanced by bilateral filter [38]

Figure 5.2 The pipeline of Attention-OCR detection

The YOLO system we applied solves the efficiency problem. It can exploit data augmentation to generalize the model across environments, and it is more tolerant of low-resolution input. Moreover, by parallelizing the computation of the convolution filters on GPUs, it makes the detection process more efficient and achieves 5.5556 FPS. The images we receive from NAO over the network arrive at fewer than five frames per second, and each letter can still be detected even at the lowest resolution of 320 × 240. Therefore, the performance is acceptable for our work. Figure 5.4 shows the detection result of the YOLO model.
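To put the two reported frame rates in perspective, they can be converted to per-frame latency. The short calculation below uses only the two FPS figures stated in this section; the variable and function names are illustrative.

```python
# Frame rates reported in this section.
ATTENTION_OCR_FPS = 0.1538
YOLO_FPS = 5.5556


def seconds_per_frame(fps: float) -> float:
    """Convert a frame rate to the average time spent on one frame."""
    return 1.0 / fps


ocr_latency = seconds_per_frame(ATTENTION_OCR_FPS)  # about 6.5 s
yolo_latency = seconds_per_frame(YOLO_FPS)          # about 0.18 s
speedup = YOLO_FPS / ATTENTION_OCR_FPS              # about 36x

print(f"Attention-OCR pipeline: {ocr_latency:.2f} s per frame")
print(f"YOLO: {yolo_latency:.2f} s per frame")
print(f"Speed-up: {speedup:.1f}x")
```

In other words, the Attention-OCR pipeline needs roughly 6.5 seconds per image, whereas YOLO needs about 0.18 seconds, which is enough to keep up with a stream of fewer than five frames per second.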

As assumed in our system design, the vision model we use can be replaced by any other computer vision algorithm or detection model. The more accurate this component is, the more robust the system becomes.

Attention-Based Actor

To verify the method proposed in Chapter 3, we test the algorithm in virtual environments with and without the attention component in the deep policy actor.

Second, we discuss how the observed state information affects the learning results. Additionally, to verify that the calibration-free claim of Sergey Levine [9] still holds for our system, we randomly change the arm position at each episode.

Figure 5.4 Detection result of the YOLO model

Moreover, to test the improvement brought by the attention mechanism, we train three models. The green line denotes the training reward of the original DDPG deep model; the blue line denotes the model with the attention mechanism applied at the last layer of the deep policy model; and the red line denotes the model that applies the attention mechanism and also uses the attention distribution as the speed of each joint.

To verify the listed assumptions, we adjusted the virtual arm environment for each purpose. Version v0 is the original environment, which returns only the relative positions and the arm angles as the state. Version v1 tests whether the absolute position of the target affects the performance of the trained policy; its state additionally returns the position information that v0 does not provide. Version v2 verifies the calibration-free capability; the arm position is set randomly within a 200-pixel range centered on the arm position of v0. Figure 5.5 shows the layout of these virtual environments. The reward is calculated in the same way in all versions: the reward is $-\,\mathit{distance} / \max(\mathit{width}, \mathit{height})$, and 1 point is given if the blue box is touched. The maximum number of steps in each episode is 200, so the maximum score is 200; if the target button is touched, the full score is given. The target buttons are A to Z, 0 to 9, and Enter.

Figure 5.5 Virtual environment v0 layout
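The per-step reward described above can be sketched as follows. This is a minimal illustration, assuming that the distance is the Euclidean distance in pixels between the fingertip and the target, and that a touch replaces the distance penalty with the 1-point reward; the function and parameter names are hypothetical.

```python
import math

MAX_STEPS = 200  # maximum number of steps per episode, as stated in the text


def step_reward(finger, target, width, height, touched):
    """Return +1 on touch, otherwise -distance / max(width, height).

    `finger` and `target` are (x, y) pixel coordinates. Using the Euclidean
    distance between them is an assumption of this sketch.
    """
    if touched:
        return 1.0
    distance = math.dist(finger, target)
    return -distance / max(width, height)


# A fingertip 50 pixels away from the target in a 400 x 200 environment.
penalty = step_reward((0, 0), (30, 40), width=400, height=200, touched=False)

# Touching at every step yields the maximum score of the episode.
max_score = MAX_STEPS * step_reward((0, 0), (0, 0), 400, 200, touched=True)
```

Dividing by the larger image dimension keeps the penalty roughly within [-1, 0] whenever the fingertip stays inside the image, so a single touch reward is comparable in scale to one step of penalty.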

For the DDPG algorithm, at the beginning of the training phase we let the agent execute ten thousand random actions and place the resulting transitions in the replay buffer as warm-up data to smooth the off-policy sampling. However, the batch normalization that the original DDPG algorithm uses to improve performance consistently damaged the learned policy in our experiments. A possible reason is that the samples from the warmed-up replay buffer are still not informative enough, while our goal is to keep the training phase as short as possible. Thus, we do not adopt batch normalization in our work.
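The warm-up phase can be sketched as below. Only the number of random actions (ten thousand) and the episode length (200 steps) come from the text; the toy environment, the buffer capacity, the action range, and the batch size are hypothetical stand-ins chosen to keep the example self-contained.

```python
import random
from collections import deque

WARMUP_STEPS = 10_000      # number of random actions stated in the text
BUFFER_CAPACITY = 100_000  # assumed capacity, not given in the text


class ToyEnv:
    """Hypothetical stand-in for the virtual arm: two joint speeds as action."""

    def reset(self):
        self.t = 0
        return (0.0, 0.0)

    def step(self, action):
        self.t += 1
        next_state = (action[0], action[1])
        reward = -abs(action[0]) - abs(action[1])
        done = self.t >= 200  # episode length stated in the text
        return next_state, reward, done


def warm_up(env, buffer, steps, rng):
    """Fill the replay buffer with transitions from uniformly random actions."""
    state = env.reset()
    for _ in range(steps):
        action = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        next_state, reward, done = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state


replay_buffer = deque(maxlen=BUFFER_CAPACITY)
warm_up(ToyEnv(), replay_buffer, WARMUP_STEPS, random.Random(0))

# Off-policy minibatch drawn uniformly from the warmed-up buffer.
batch = random.Random(1).sample(list(replay_buffer), 64)
```

Because the buffer already holds ten thousand transitions when learning starts, the first minibatches are not dominated by a handful of highly correlated early steps.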

Performance in Virtual Arm v0

Figure 5.6 shows the raw reward of each episode. We can see that the agent with the attention mechanism, whether or not the attention distribution is applied as the joint speed, learns a policy that is more stable in the convergence phase.

Figure 5.6 The raw reward of each episode. The green line (Without Attention) is the original DDPG, the blue line (Mode1 Attention) applies only the attention mechanism, and the red line (Mode2 Attention) applies the attention mechanism and uses the attention distribution as the joint speed.

To compare the curves more clearly, we remove, for every 150 data points, the values that lie more than two standard deviations from the mean; the result is shown in Figure 5.7. The green curve is the result of the original DDPG algorithm. It eventually converges and learns the policy, but its performance is frequently unstable during training, even after convergence.
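The outlier filtering used for Figure 5.7 can be sketched as follows. The sketch assumes non-overlapping windows of 150 consecutive rewards and the population standard deviation of each window; neither detail is specified in the text, and the function name is hypothetical.

```python
import statistics


def remove_outliers(rewards, window=150, n_sigma=2.0):
    """Drop values more than n_sigma standard deviations from their window mean."""
    kept = []
    for start in range(0, len(rewards), window):
        chunk = rewards[start:start + window]
        if len(chunk) < 2:
            # Too few points to estimate a spread; keep them unchanged.
            kept.extend(chunk)
            continue
        mean = statistics.fmean(chunk)
        std = statistics.pstdev(chunk)
        kept.extend(r for r in chunk if abs(r - mean) <= n_sigma * std)
    return kept


# One window of 149 ordinary episodes and a single extreme reward.
rewards = [0.0] * 149 + [100.0]
filtered = remove_outliers(rewards)
```

In this toy window the single extreme value is discarded while the 149 ordinary values are kept, which is the behavior that makes the three training curves easier to compare.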

The blue curve, “mode1 Attention”, shows the result when the attention mechanism is applied only in the deep policy model. The final performance is not very different from that of the original DDPG, but the training phase is more stable.

Finally, the red curve, “mode2 Attention”, shows the result when the attention distribution $\alpha_i$ described in Section 4.3.4 is also applied to the moving speed of each joint. This mechanism significantly improves the performance of the training process in our task.
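The idea behind mode 2 can be illustrated with a small sketch. The exact formulation belongs to Section 4.3.4 and is not reproduced here; the sketch merely assumes that the attention distribution is a softmax over per-joint scores and that each weight scales the commanded speed of its joint. All names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the returned weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_speeds(directions, attention_scores, max_speed=1.0):
    """Scale each joint's movement direction by its attention weight."""
    alphas = softmax(attention_scores)
    return [max_speed * a * d for a, d in zip(alphas, directions)]


# Three joints: the first one receives the highest attention score.
speeds = joint_speeds(directions=[1.0, -1.0, 1.0],
                      attention_scores=[2.0, 0.5, 0.5])
```

Under these assumptions the joint with the largest attention weight moves fastest, while the remaining joints make only small corrections.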

Figure 5.7 The performance curves of DDPG. The green curve is the original DDPG, the blue curve applies the mechanism described in Section 4.3.4, and the red curve applies the attention distribution as the speed of each joint.

Performance with Absolute Position

To evaluate whether the absolute position feature in the image helps to learn a better policy, we concatenate this feature to the state vector used as the policy input. Figure 5.8 shows the result of adding the absolute position to the observed state.
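The difference between the v0 and v1 observations can be sketched as a simple concatenation. The field names, their ordering, and the example values below are assumptions made for illustration only.

```python
def build_state(relative_pos, joint_angles, absolute_pos=None):
    """Assemble the policy input as a flat list of floats.

    v0: relative target position followed by the joint angles.
    v1: the same vector with the absolute target position appended.
    """
    state = list(relative_pos) + list(joint_angles)
    if absolute_pos is not None:
        state += list(absolute_pos)
    return state


# Hypothetical observation of a two-joint arm.
state_v0 = build_state(relative_pos=(0.10, -0.25), joint_angles=(0.5, 1.2))
state_v1 = build_state(relative_pos=(0.10, -0.25), joint_angles=(0.5, 1.2),
                       absolute_pos=(0.40, 0.30))
```

The policy network therefore only needs a wider input layer to consume the v1 state; the rest of the architecture can stay unchanged.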

From Figure 5.8, we can see that the absolute position information stabilizes the performance in all cases. Therefore, it is reasonable to add it as a feature of the observed state.

Performance with Randomized Arm

Finally, we want to verify that the system retains its calibration-free capability under the state information we provide. Therefore, in every episode we randomly set the arm position within ±100 pixels of its position in the v0 environment.
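The per-episode randomization can be sketched as follows. Only the ±100-pixel range comes from the text; the v0 arm position used here is a hypothetical value, and sampling each axis independently from a uniform distribution is an assumption of this sketch.

```python
import random

V0_ARM_POSITION = (200.0, 200.0)  # hypothetical v0 arm position in pixels
MAX_OFFSET = 100.0                # +/-100 pixels, as stated in the text


def sample_arm_position(rng, center=V0_ARM_POSITION, max_offset=MAX_OFFSET):
    """Draw a new arm position for the start of an episode."""
    dx = rng.uniform(-max_offset, max_offset)
    dy = rng.uniform(-max_offset, max_offset)
    return (center[0] + dx, center[1] + dy)


rng = random.Random(0)
positions = [sample_arm_position(rng) for _ in range(1000)]
```

Since the arm starts from a different place in every episode, the policy cannot rely on a fixed arm position and has to depend on the observed state instead.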

Figure 5.8 The result of adding the absolute position to the observed state

Figure 5.10 Randomly set arm position at the initialization phase of each episode

Figure 5.10 The performance comparison of v0 and v2 virtual arm environment

Figure 5.10 shows the randomly set arm positions at the initialization phase of each episode. As the performance curves in Figure 5.8 show, the randomized positions can accelerate the training process.

Moreover, the reward of every episode in Figure 5.9 shows that the convergence phase of the policy is also more stable than in the v0 environment. The light red line is the performance in the fixed arm position environment, while the red line is the performance in the v2 randomized arm position environment.

The success rate in each virtual arm environment is listed in Table 6.

Table 6 Success rate of each virtual arm environment

Model               v0        v1        v2
Original DDPG       30.22%    52.45%    31.43%
Mode 1 Attention    33.02%    56.15%    54.95%
Mode 2 Attention    59.00%    53.35%    69.66%
