
Department of Applied Mathematics, National University of Kaohsiung

Master's Thesis

機器棋士的製作與挑戰

Computer Player for Gomoku and Pick-and-Place Task Using Robotic Arm

Student: Su, Jing-Yi

Advisor: Professor Chang, Chih-Hung


Computer Player for Gomoku and Pick-and-Place Task Using Robotic Arm

Advisor: Professor Chang, Chih-Hung

Department of Applied Mathematics

National University of Kaohsiung

Student: Su, Jing-Yi

Department of Applied Mathematics

National University of Kaohsiung

ABSTRACT

Deep learning and intelligent manufacturing are widely applied nowadays. In this paper, we implement one application from each of these two fields. The application of deep learning is to train a computer player for Gomoku. Gomoku is a board game with a simple rule: whoever gets five of their stones in a row wins. We first train the Gomoku player using supervised learning, and then enhance it with reinforcement learning. The application of intelligent manufacturing is to accomplish a pick-and-place task using a robotic arm. Instead of hard-coding specific object and target positions, we want the robotic arm to detect the positions and accomplish the task automatically. So far, some results have been achieved.


Contents

1. Introduction 2

2. Preliminary of Gomoku Player 2

2.1. Convolutional Neural Network 2

2.1.1. Convolution Layer 3

2.1.2. Activation Layer 4

2.1.3. Pooling Layer 5

2.2. Residual Block 6

2.3. Batch Normalization 6

2.4. Asynchronous Policy and Value Monte Carlo Tree Search Algorithm 7

2.4.1. Algorithm 8

3. Training a Gomoku Player 9

3.1. Data Preprocessing 9

3.2. Model Construction 10

3.3. Training Result 11

3.4. The Problem of the Gomoku Player 12

4. Reinforcing the Gomoku player 13

4.1. Retrained Model 13

4.2. Reinforced Model 13

4.3. Result 14

5. Control of Robotic Arm 14

5.1. Homogeneous Transformation Matrix 14

5.2. Denavit-Hartenberg Parameters 15

5.3. Pieper’s Solution 19

6. Hardware Introduction and Wiring Diagram 22

6.1. Hardware Introduction 22

6.2. Wiring Diagram 25

7. Object Detection and Pick and Place 27

7.1. Functions 27

7.2. Process 29

8. Conclusion and Discussion 30


1. Introduction

AlphaGo[8] is a computer player for the game of Go. It plays Go based on two neural networks, a policy network and a value network, trained on human game records using some artificial features. In May 2017, AlphaGo beat Ke Jie, the world champion of Go, in a three-game match. It was a huge step in the history of computer players. However, a stronger computer player, AlphaGo Zero[9], came out a few months later. The most surprising thing is that AlphaGo Zero is constructed without any human knowledge and has a simpler model. Unlike AlphaGo, the network used in AlphaGo Zero only takes the positions of stones as features, and the network reinforces itself with self-played game records. After 40 days of training, AlphaGo Zero was able to beat AlphaGo with a score of 89:11.

In this paper, we want to train a computer player for Gomoku and apply it to a robotic arm. The robotic arm is a 3D-printed arm that we built. Our final goal is to make the robotic arm play Gomoku against a human automatically. So far, there are some difficulties we still need to overcome. In the following, we show our current work on the computer Gomoku player and the robotic arm.

For the computer Gomoku player, we first train the player with supervised learning based on the model used in AlphaGo Zero. After training, the Gomoku player seems able to play against humans. However, after a few games against it, we found that it does not have the ability to finish games. The reason is that the game records we train on are usually not finished: when two strong human players play against each other, they can predict the trend of the game, and one of them will admit defeat before the winning move is played. To solve this problem, we try two approaches: one is to finish the game records and retrain the Gomoku player, and the other is to reinforce the Gomoku player using the method used in AlphaGo Zero.

For the robotic arm, we achieve an automatic pick-and-place task. We introduce the control of the robotic arm and show our hardware configuration and a video of the whole task.

The rest of this paper is organized as follows. Section 2 introduces some preliminaries about the methods used to build the model and the reinforcement learning used in AlphaGo Zero. Section 3 shows the process and the result of training a Gomoku player. Section 4 shows how we finish the game records and retrain the Gomoku player, the reinforcement learning process, and their results. Section 5 introduces how to control a robotic arm and gives examples. Section 6 introduces the hardware we use and its wiring diagrams. Section 7 introduces some functions of the robotic arm and the whole process of the pick-and-place task. Some discussion and conclusions are given in Section 8.

2. Preliminary of Gomoku Player

2.1. Convolutional Neural Network.


Artificial neural network (ANN) is a model that intends to replicate how the human brain works. The brain delivers information through its basic cells, the neurons. When a neuron receives signals, it passes the information to the next one if those signals are higher than a threshold. In 1957, a model based on a single neuron, the perceptron learning algorithm (PLA)[1], was introduced by Frank Rosenblatt. However, PLA can only solve classification problems that are linearly separable. At that time, although people knew that a model with multiple layers and neurons could solve more complicated problems, there was no good way to adapt the weights in the model. In 1974, Paul J. Werbos proposed the backpropagation algorithm[2], which solves this problem and makes the training of multi-layer neural networks feasible and efficient.

Convolutional neural network (CNN) has had huge success in the field of image recognition. The main difference between CNN and ANN is that CNN processes the data before it enters the fully connected network. For ANN, the inputs of the fully connected network are the pixels of the image. For CNN, the inputs of the fully connected network are features of the image, such as whether there is a leaf in the upper right. In order to extract features from images, CNN has two additional layer types, the convolution layer and the pooling layer. More details about these two layers are given in the rest of this section. See figure 1[15] for a simple example of a CNN model.

Figure 1. A simple example of a CNN model.

2.1.1. Convolution Layer.

Take an image classification task for instance. The input of the convolution layer is a 3-dimensional tensor; the first index is the depth, which indicates the different channels such as RGB, and the others are the spatial coordinates of each channel. Compared with the parameters of an ANN, the parameters here consist of a set of learnable filters. Every filter has smaller height and width than the input, but the same depth. See figure 2.

Figure 2. An input of size 3 × m × n and i filters of size 3 × p × q.

During the forward propagation, we convolve the input with each filter and stack up the outputs. The convolution operation is basically sliding the filter over the whole input and computing the dot product of the filter and the scanned area at every position. When sliding the filter, we can also control the stride, which decides how many entries the filter moves each time. Suppose we have an input $A_{l \times m \times n}$ and a filter $B_{l \times p \times q}$. Convolving $A$ with $B$ using stride $s$ outputs $C_{(\frac{m-p}{s}+1) \times (\frac{n-q}{s}+1)}$ where
$$C_{i,j} = \sum_{l,p,q} A_{l,(i-1)s+p,(j-1)s+q} \, B_{l,p,q}$$
for all $i, j$. In some sense, the value of the dot product represents the extent to which the feature is detected. That is to say, the goal of convolution is detecting the pattern represented by the filter over the input. See figure 3 for an example of the convolution operation using stride 1.

Figure 3. Convolving a 3 × 3 × 3 input with a 3 × 2 × 2 filter using stride 1 gives a 2 × 2 output; each output entry is the dot product of the filter with the scanned area.

As we can see, the height and the width get smaller after the convolution. If we build a deep network with a lot of convolution layers, the data will shrink to a very small volume, which restricts the functionality of the network. Therefore, when adding a convolution layer, we sometimes choose to pad zeros around the input.
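To make the convolution formula above concrete, the following is a minimal NumPy sketch (an illustration, not the thesis code); the shapes and the stride handling follow the definition of $C_{i,j}$ above.

```python
import numpy as np

def convolve(A, B, stride=1):
    """Convolve one filter B (l x p x q) over an input A (l x m x n).

    Returns C of shape ((m - p) // stride + 1, (n - q) // stride + 1),
    where C[i, j] is the dot product of B with the scanned area of A.
    """
    l, m, n = A.shape
    _, p, q = B.shape
    out_h = (m - p) // stride + 1
    out_w = (n - q) // stride + 1
    C = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = A[:, i * stride:i * stride + p, j * stride:j * stride + q]
            C[i, j] = np.sum(patch * B)
    return C

# Shapes matching figure 3: a 3 x 3 x 3 input and a 3 x 2 x 2 filter with stride 1.
A = np.arange(27).reshape(3, 3, 3)
B = np.ones((3, 2, 2))
print(convolve(A, B).shape)  # (2, 2)
```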

2.1.2. Activation Layer.

The activation layer is applied after the convolution layer. A neural network without activation functions is just a linear function, no matter how deep it is. Even a simple task like the XOR problem is not linearly separable, not to mention the complicated tasks neural networks are used for. Applying non-linear activation functions lets the neural network represent non-linear functions. What we do here is choose a non-linear activation function, such as the sigmoid function, the hyperbolic tangent (tanh) function, or the rectified linear unit (ReLU) function, and apply it to every entry of the input tensor. The input and the output of this layer have the same size. See figures 4, 5 and 6 for some activation functions.

Figure 4. Sigmoid function: $f(x) = \frac{1}{1+e^{-x}}$

Figure 5. Tanh function: $f(x) = \frac{1-e^{-2x}}{1+e^{-2x}}$

Figure 6. ReLU function: $f(x) = \max\{0, x\}$

2.1.3. Pooling Layer.

Once a feature has been detected by the convolution layer, its precise location is less important than its location relative to the other features. For example, in face recognition, once the eyes are detected, we do not need to know their precise locations; we only need to know there is an eye on the left side of the face and an eye on the right side. Therefore, we cut the input into small regions and extract a value to represent each region. There are several pooling methods for extracting values, and the most frequently used one is max pooling.

Figure 7. Max pooling of a 1 × 4 × 4 input with 2 × 2 regions and stride 2.

When doing max pooling, there are two parameters, the size of the small regions and the stride. Suppose we have an input $A_{l \times m \times n}$, regions of size $p \times q$ and stride $s$. Max pooling outputs $C_{l \times (\frac{m-p}{s}+1) \times (\frac{n-q}{s}+1)}$ where
$$C_{l,i,j} = \max\{A_{l,(i-1)s+k,(j-1)s+t} : k \in [1, p],\ t \in [1, q]\}$$
for all $i, j$. Note that $[a, b] = \{i \in \mathbb{Z} : a \le i \le b\}$. See figure 7 for an example. After max pooling, the output becomes a smaller tensor. With fewer parameters, it also helps reduce the computation cost and control the overfitting problem.
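As a companion to the definition of $C_{l,i,j}$ above, here is a minimal NumPy sketch of max pooling (illustrative only, not the thesis code).

```python
import numpy as np

def max_pool(A, p, q, stride):
    """Max pooling over an input A of shape (l, m, n) with p x q regions."""
    l, m, n = A.shape
    out_h = (m - p) // stride + 1
    out_w = (n - q) // stride + 1
    C = np.zeros((l, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = A[:, i * stride:i * stride + p, j * stride:j * stride + q]
            C[:, i, j] = region.max(axis=(1, 2))
    return C

# A 1 x 4 x 4 input pooled with 2 x 2 regions and stride 2, as in figure 7.
A = np.array([[[3, 1, 0, 9], [6, 1, 3, 5], [0, 1, 2, 1], [7, 1, 3, 5]]])
print(max_pool(A, 2, 2, 2))  # a 1 x 2 x 2 tensor of region maxima
```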

2.2. Residual Block.

When dealing with neural networks, a deeper network is considered to have more capability than a shallower one. Although it may cause overfitting, it should at least work better on the training set. In fact, researchers found that the performance of a network gets worse after a certain depth, even on the training set[5]. The problem is that the deeper the network is, the more difficult it is to find the global minimum of its loss function.

Suppose we have a neural network with great performance and we want to train a new neural network with extra layers while keeping the performance. One way to do this is to keep all the parameters of the original neural network and let the remaining layers be identity mappings. If a network can learn this way, no matter how deep the neural network is, the performance will at least be kept. However, this is hard in practice. The idea of the residual block is that, for some consecutive layers in a network, we denote their desired underlying mapping by H(x). Instead of directly fitting H(x), we let the layers fit F(x) = H(x) − x, the residual of the identity mapping. To build a residual block in practice, we just construct a shortcut which adds the input to the output. See figure 8.

Figure 8. A residual block: an identity shortcut adds the input x to the layer output F(x), giving F(x) + x.
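A minimal sketch of a residual block, written with the Keras API (the framework is an assumption; the thesis does not state its tooling). The shortcut simply adds the block input x to the block output F(x), as in figure 8.

```python
from tensorflow.keras import layers

def residual_block(x, filters=256, kernel_size=3):
    """F(x) + x: two conv/batch-norm layers with an identity shortcut.

    Assumes x already has `filters` channels so the addition shapes match.
    """
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])   # the shortcut: add the input to the output
    return layers.ReLU()(y)
```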

2.3. Batch Normalization.

Before applying a machine learning algorithm, we normalize the input to speed up the training process. In a neural network, the output of a hidden layer is the input of the next one. Batch normalization applies normalization to the hidden layers. During training, the distribution of each layer's inputs changes every time and the next layer is forced to adapt to it, which slows down the training. Keeping the mean and variance fixed reduces the impact of the previous layers. The interesting thing here is that batch normalization is applied before the activation function because, in some sense, it can deal with the gradient vanishing problem[6].

Suppose we have a set $Z = \{z_1, z_2, \ldots, z_m\}$, the inputs of one unit in a hidden layer over a mini-batch of size m. Simply normalizing $Z$ would reduce the expressive power of the network; for instance, normalizing the inputs of a sigmoid function constrains them to a certain range. To address this, there are two learnable parameters $\gamma$ and $\beta$ which are used to scale and shift the normalized values.

Step 1: Compute the mean and variance
$$z_\mu = \frac{1}{m}\sum_{i=1}^{m} z_i, \qquad z_{\sigma^2} = \frac{1}{m}\sum_{i=1}^{m} (z_i - z_\mu)^2$$

Step 2: Normalize
$$Z_{norm} = \left\{\hat{z}_i : \hat{z}_i = \frac{z_i - z_\mu}{\sqrt{z_{\sigma^2} + \epsilon}}\right\}, \quad \epsilon: \text{a small number to avoid dividing by zero}$$

Step 3: Scale and shift
$$Z_{norm} := \{\hat{z}_i : \hat{z}_i := \gamma \hat{z}_i + \beta\}$$

In a CNN, the unit is the output of the convolution layer. Suppose we have a 4-dimensional input tensor whose first index is the batch size and whose others are channels, height and width. A unit here is a slice along the channel index; we just flatten it and apply the same procedure as above.
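The three steps above can be sketched in a few lines of NumPy (training-time statistics only; the running averages used at inference are omitted). γ, β and ε follow the notation in the text.

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Normalize a mini-batch Z over its first axis, then scale and shift."""
    z_mu = Z.mean(axis=0)                       # Step 1: mean
    z_var = ((Z - z_mu) ** 2).mean(axis=0)      # Step 1: variance
    Z_norm = (Z - z_mu) / np.sqrt(z_var + eps)  # Step 2: normalize
    return gamma * Z_norm + beta                # Step 3: scale and shift

# For a CNN activation of shape (batch, height, width, channels), each channel
# slice is one "unit": flatten it and normalize over batch, height and width.
x = np.random.randn(100, 15, 15, 256)
y = batch_norm(x.reshape(100 * 15 * 15, 256), gamma=1.0, beta=0.0)
```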

2.4. Asynchronous Policy and Value Monte Carlo Tree Search Algorithm.

Asynchronous policy and value Monte Carlo tree search (APV-MCTS) is a variant of Monte Carlo tree search (MCTS)[7]. It was first introduced in AlphaGo[8]. In the following, we introduce the simpler APV-MCTS variant used in AlphaGo Zero[9]. The main difference between APV-MCTS and MCTS is that APV-MCTS uses two neural networks: a policy neural network that gives expanded nodes prior probabilities, and a value neural network that replaces the game simulation, which is a random playout in MCTS. The policy neural network f and the value neural network g both take the raw board representation s of the position and its history as input; one outputs move probabilities $p_a$ for all legal moves a and the other outputs a value v. The value v is a scalar evaluation of the chance that the current player wins from s.

2.4.1. Algorithm.

Each node s in the search tree has branches (s, a) for all legal moves a. Each branch stores a set of four values $\{N(s, a), W(s, a), Q(s, a), P(s, a)\}$, where

$N(s, a)$: visit count
$W(s, a)$: total action value
$Q(s, a)$: mean action value
$P(s, a)$: prior probability of selecting a

Each search in the APV-MCTS algorithm has four phases. See figure 9. The search is terminated by a time limit or after a fixed number of searches.

Figure 9. The four phases of a search: Select, Expand, Evaluate, Backup.

Select: It starts by selecting one of the children of the root node based on a selection formula and keeps doing this until the selected node is a leaf node. We let $s_t$ denote the node selected during the selection, where t runs from 0, the root node, to L, the leaf node, and $a_t$ denotes the move selected at $s_t$. $a_t$ is selected using a variant of the PUCT algorithm[10]. Note that UCT (Upper Confidence Bound 1 applied to trees) is an algorithm that balances the exploitation of depth and the exploration of branches of the tree, and PUCT is an algorithm based on UCT with a prior probability for each node.
$$a_t = \arg\max_a \bigl( Q(s_t, a) + U(s_t, a) \bigr), \qquad U(s, a) = c_{puct}\, P(s, a)\, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$
where b ranges over the legal moves at s and $c_{puct}$ is a constant determining the level of exploration.

Expand and evaluate: When the leaf node $s_L$ is selected, it is expanded and each branch $(s_L, a)$ is initialized to
$$\{N(s_L, a) = 0,\ W(s_L, a) = 0,\ Q(s_L, a) = 0,\ P(s_L, a) = p_a\}$$
where $p_a$ is the output of $f(s_L)$. The value $v = g(s_L)$, the chance that the current player wins from $s_L$, is used as the simulation result, and v will be used to update the information of the branches.

Backup: The values stored in the branches are updated backward from $(s_{L-1}, a_{L-1})$ to $(s_0, a_0)$:
$$N(s_t, a_t) := N(s_t, a_t) + 1, \qquad W(s_t, a_t) := W(s_t, a_t) + v, \qquad Q(s_t, a_t) := \frac{W(s_t, a_t)}{N(s_t, a_t)}$$

Play: After the repeated four-phase search terminates, we obtain the policy $\pi(a \mid s_0)$ that gives the probability of selecting move a at $s_0$:
$$\pi(a \mid s_0) = \frac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$$
where $\tau$ controls the level of exploration; as $\tau \to 0$ the move with the maximal visit count is selected.

APV-MCTS can be seen as MCTS guided by the neural networks. The move selected by APV-MCTS is usually better than the move simply output by the policy neural network. Therefore, we can play game matches using APV-MCTS and use them to improve our neural networks. By looping this process, we can reinforce our networks.
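A compact sketch of one APV-MCTS simulation following the four phases above. The helpers `play`, the policy network `f` and the value network `g` are placeholders a real Gomoku implementation would supply; terminal positions are not handled, and only the bookkeeping of N, W, Q, P and the PUCT selection rule are shown.

```python
import math

C_PUCT = 5.0

class Node:
    def __init__(self, priors):
        # One branch per legal move a: {N, W, Q, P}; priors is the output of f.
        self.edges = {a: {"N": 0, "W": 0.0, "Q": 0.0, "P": p} for a, p in priors.items()}
        self.children = {}

    def select(self):
        total = sum(e["N"] for e in self.edges.values())
        def puct(a):
            e = self.edges[a]
            return e["Q"] + C_PUCT * e["P"] * math.sqrt(total) / (1 + e["N"])
        return max(self.edges, key=puct)

def simulate(root, state, f, g, play):
    """Select down to a leaf, expand it with priors from f, back up the value from g."""
    node, path = root, []
    while True:
        a = node.select()
        path.append((node, a))
        state = play(state, a)
        if a not in node.children:            # leaf reached: expand and evaluate
            node.children[a] = Node(f(state))
            v = g(state)
            break
        node = node.children[a]
    for parent, a in reversed(path):          # backup
        e = parent.edges[a]
        e["N"] += 1
        e["W"] += v
        e["Q"] = e["W"] / e["N"]
        v = -v  # assumes v is from the current player's view, so flip each ply (a common convention)

def policy(root, tau=1.0):
    """pi(a|s0) proportional to N(s0, a)^(1/tau)."""
    counts = {a: root.edges[a]["N"] ** (1.0 / tau) for a in root.edges}
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}
```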

3. Training a Gomoku Player

The Gomoku player is constructed based on APV-MCTS. The policy neural network and the value neural network of APV-MCTS are first trained using supervised learning. In the rest of this section, the data preprocessing, model construction, training result and a problem of the Gomoku player are shown.

3.1. Data Preprocessing.

The Gomoku database is downloaded online[12]. After deleting the missing data, we transform the data into the input, the policy output and the value output. The input is a stack of 11 feature planes of size 15 × 15; note that 15 × 15 is the size of the Gomoku board. Feature plane $X_t$ represents the current player's stone distribution at time-step t and feature plane $Y_t$ represents the opponent's stone distribution at time-step t, where
$$X_t(i, j) = \begin{cases} 1, & (i, j) \text{ contains a current player's stone} \\ 0, & (i, j) \text{ contains an opponent's stone, is empty, or } t < 0 \end{cases}$$
$$Y_t(i, j) = \begin{cases} 1, & (i, j) \text{ contains an opponent's stone} \\ 0, & (i, j) \text{ contains a current player's stone, is empty, or } t < 0 \end{cases}$$
for all $i, j$, and feature plane C represents the next stone color to play; its values are all 1 if the color is black and all 0 if the color is white. These planes, $\{X_t, Y_t, X_{t-1}, Y_{t-1}, X_{t-2}, Y_{t-2}, X_{t-3}, Y_{t-3}, X_{t-4}, Y_{t-4}, C\}$, are concatenated together as the input. The policy output Z is a 15 × 15 matrix where
$$Z(i, j) = \begin{cases} 1, & (i, j) \text{ is the next move} \\ 0, & \text{otherwise} \end{cases}$$
for all $i, j$, and we flatten it to one dimension. The value output v is the result of the game:
$$v = \begin{cases} 1, & \text{the current player wins} \\ -1, & \text{the current player loses} \end{cases}$$
There are about 34,302 game records downloaded online; we transform them into 1,162,665 data points and divide them into 1,000,000 training data and 16,225 testing data.
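A sketch of how the 11 input planes described above can be assembled from a move history. The helper and its indexing convention are assumptions for illustration, not the thesis' exact pipeline; `moves` is a list of (row, col) moves in play order, black first, on a 15 × 15 board.

```python
import numpy as np

BOARD = 15

def feature_planes(moves):
    """Build the 11 planes {X_t, Y_t, ..., X_{t-4}, Y_{t-4}, C} for the position
    reached after `moves` have been played; the current player is about to move.
    Planes for time-steps earlier than the start of the game stay all zero."""
    to_play_black = (len(moves) % 2 == 0)
    planes = []
    for k in range(5):                              # time-steps t, t-1, ..., t-4
        cutoff = len(moves) - k
        X = np.zeros((BOARD, BOARD))
        Y = np.zeros((BOARD, BOARD))
        if cutoff >= 0:
            for idx, (r, c) in enumerate(moves[:cutoff]):
                stone_is_black = (idx % 2 == 0)
                if stone_is_black == to_play_black:
                    X[r, c] = 1.0                   # current player's stones
                else:
                    Y[r, c] = 1.0                   # opponent's stones
        planes += [X, Y]
    planes.append(np.full((BOARD, BOARD), 1.0 if to_play_black else 0.0))  # plane C
    return np.stack(planes, axis=-1)                # shape (15, 15, 11)
```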

3.2. Model Construction.

We use a multiple-output model here to combine the policy neural network and the value neural network. See figure 10. The model is based on the model used in AlphaGo Zero. Since Gomoku is less complicated than Go, we slightly change the number and size of the filters. More details are shown below.

Figure 10. Model architecture: the input passes through a stack of convolution blocks (Conv. Layer1, B.N. Layer, ReLU Layer, repeated 19 times), then splits into a policy head (Conv. Layer2, B.N. Layer, ReLU Layer, Flatten Layer, Dense Layer1, Softmax Layer, Policy Output) and a value head (Conv. Layer3, B.N. Layer, ReLU Layer, Flatten Layer, Dense Layer2, ReLU Layer, Dense Layer3, tanh Layer, Value Output).

where

Conv. Layer1: 256 filters of size 3 × 3 with stride 1
Conv. Layer2: 2 filters of size 1 × 1 with stride 1
Conv. Layer3: 1 filter of size 1 × 1 with stride 1
B.N. Layer: batch normalization
ReLU Layer: activation layer with the ReLU function
tanh Layer: activation layer with the tanh function
Flatten Layer: flattens the input to one dimension
Dense Layer1: fully connected layer with 225 output units
Dense Layer2: fully connected layer with 256 output units
Dense Layer3: fully connected layer with 1 output unit
Softmax Layer: transforms values into probabilities
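The sketch below is one possible Keras rendering of the two-headed network as I read figure 10; the framework, the use of "same" padding, and the exact way the 19 repetitions are counted are assumptions, while the filter counts, dense sizes, losses and SGD settings come from the text.

```python
from tensorflow.keras import layers, Model, optimizers

def conv_block(x, filters, size):
    x = layers.Conv2D(filters, size, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = layers.Input(shape=(15, 15, 11))           # the 11 feature planes

x = inputs
for _ in range(19):                                  # body: Conv. Layer1 + B.N. + ReLU, repeated
    x = conv_block(x, 256, 3)

# Policy head: Conv. Layer2 (2 filters, 1x1) -> Flatten -> Dense(225) -> softmax.
p = conv_block(x, 2, 1)
p = layers.Flatten()(p)
policy = layers.Dense(225, activation="softmax", name="policy")(p)

# Value head: Conv. Layer3 (1 filter, 1x1) -> Flatten -> Dense(256) -> ReLU -> Dense(1) -> tanh.
v = conv_block(x, 1, 1)
v = layers.Flatten()(v)
v = layers.Dense(256, activation="relu")(v)
value = layers.Dense(1, activation="tanh", name="value")(v)

model = Model(inputs, [policy, value])
model.compile(
    optimizer=optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss={"policy": "categorical_crossentropy", "value": "mean_squared_error"},
)
```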

3.3. Training Result.

We divide the 1,000,000 training data into 10,000 batches with batch size 100. The loss function $J(\theta)$ is the cross entropy of the policy output plus the mean squared error of the value output. Suppose that

$p_{pred}$: policy output of the model, $\quad p_{true}$: true policy output,
$v_{pred}$: value output of the model, $\quad v_{true}$: true value output;

then
$$J(\theta) = -\sum_i p_{true}(i)\log p_{pred}(i) + (v_{true} - v_{pred})^2$$
where the sum runs over all entries i. We train the model for 5 epochs using stochastic gradient descent[3] (SGD) with learning rate 0.001 and momentum[4] 0.9. SGD updates the parameters based on one sample at a time, which saves computation cost and convergence time. Momentum makes the convergence smoother and faster by adding the previous gradient with a weight. Suppose that $k_t$ is the gradient at time-step t; then the update rule of SGD with learning rate $\eta$ and momentum $\gamma$ is
$$\theta_j := \theta_j - \left(\gamma k_t + \eta \frac{\partial}{\partial \theta_j} J(\theta)\right)$$

Figure 11. (a) Accuracy on the training and testing sets. (b) Mean squared error on the training and testing sets.

3.4. The Problem of the Gomoku Player.

In the following, we show a game record of a human playing against the Gomoku player and point out the problem of the Gomoku player. The black stones are the human player and the white stones are the Gomoku player with 1600 searches of APV-MCTS. See figure 12 for the game record.

Figure 12.

Now the current player is White. The winning move for White is (3,9), but the Gomoku player chooses to play (3,7). From this case, we found that the Gomoku player does not have the ability to finish games.

4. Reinforcing the Gomoku player

In this section, we want to solve the finishing problem mentioned in the last section and enhance the Gomoku player. We use two methods to approach this goal: one is retraining the model after finishing the game records, and the other is reinforcing the model through APV-MCTS.

4.1. Retrained Model.

We finish the 34,302 game records downloaded online using the Gomoku player trained in the last section and an algorithm. The algorithm plays the winning move and blocks the opponent's winning move for the Gomoku player. There are three situations in which the algorithm plays the move: to achieve five in a row, to block the opponent's four in a row, or to achieve a live four. After finishing the game records, we transform them into 1,700,000 data points and divide them into 1,500,000 training data and 200,000 testing data.

We use the same model, loss function and optimizer as in the last section and train the model for 5 epochs. See figure 13 for the result.

Figure 13. (a) Accuracy on the training and testing sets. (b) Mean squared error on the training and testing sets.

4.2. Reinforced Model.

We take the supervised model and improve it using APV-MCTS. Since the move output by the policy neural network is usually weaker than the move chosen through APV-MCTS, we let the Gomoku player play against itself using APV-MCTS and train the model on the self-play game records. By looping this process, we get a stronger model, which means a better Gomoku player. The details of the process are shown below.

Step 1: Self-play 10 game matches using APV-MCTS with 1600 searches and $c_{puct}$ set to 5. For the first 20 moves of each game, τ is set to 1 to ensure diversity at the beginning of the game. For the rest of the game, we set τ → 0 to play the strongest move. Note that the algorithm used to finish game records is also applied here, which is good for the process since the algorithm only makes necessary moves.

Step 2: Transform the 10 game matches into data and train the model for one epoch with the same parameters introduced in Section 3.3.

After training for three weeks, we have a reinforced model, which looped this process 67 times.

4.3. Result.

To check the ability of the retrained model and the reinforced model to finish games, we use these two models to predict the move for the same game record shown in the last section (see figure 12). Using 1600 searches of APV-MCTS, both models successfully choose to play the winning move (3,9).

In order to compare the strength of the supervised model, the retrained model and the reinforced model, we let them play 20 game matches against each other, using 1600 searches of APV-MCTS to select each move. The result is shown below.

As we can see, the two enhanced models beat the original supervised model with scores of 15:5 and 12:8. Comparing the two enhanced models, the retrained model beats the reinforced model with a score of 14:6. However, due to time and hardware limitations, the reinforced model only looped the process 67 times; if we kept reinforcing it, we believe the reinforced model could beat the retrained model after a certain time.

5. Control of Robotic Arm

In the following, we introduce how to control a robotic arm. For more details, refer to the Robotics and Automation Handbook[16].

5.1. Homogeneous Transformation Matrix.

A robotic arm can be expressed as links connected by joints. See figure 14. The joints make movements, such as rotation or translation, between two links, and each joint has its own reference coordinate frame. In order to control a robotic arm, given a certain gesture of each joint, we want to know where the gripper will be. This process, called forward kinematics, can be seen as a sequence of coordinate frame transformations from the base frame, through the joint reference frames, to the gripper frame. See figure 14.

Figure 14.

For any two frames A and B, the change in displacement of the origins can be described by the vector ${}^A P_{Borg}$. See figure 15.

Figure 15. The origins of frames A and B and the vector ${}^A P_{Borg}$ between them.

The change in orientation can be described by the rotation matrix ${}^A_B R$:
$${}^A_B R = \begin{pmatrix} \mathrm{proj}_{\vec{x}_A}\vec{x}_B & \mathrm{proj}_{\vec{x}_A}\vec{y}_B & \mathrm{proj}_{\vec{x}_A}\vec{z}_B \\ \mathrm{proj}_{\vec{y}_A}\vec{x}_B & \mathrm{proj}_{\vec{y}_A}\vec{y}_B & \mathrm{proj}_{\vec{y}_A}\vec{z}_B \\ \mathrm{proj}_{\vec{z}_A}\vec{x}_B & \mathrm{proj}_{\vec{z}_A}\vec{y}_B & \mathrm{proj}_{\vec{z}_A}\vec{z}_B \end{pmatrix}$$
where $\mathrm{proj}_{\vec{v}}\vec{u}$ is the projection of $\vec{u}$ onto $\vec{v}$ and all axes are unit vectors.

Furthermore, these two changes can be described simultaneously by a single matrix, the homogeneous transformation matrix ${}^A_B T$:
$${}^A_B T = \begin{pmatrix} {}^A_B R & {}^A P_{Borg} \\ 0 & 1 \end{pmatrix}$$

Once every transformation matrix in a robotic arm is defined, the sequence of frame transformations from the base frame to the gripper frame is just the product of the sequence of transformation matrices. A transformation matrix is sometimes hard to construct directly. Decomposing the transformation into a combination of rotations and shifts about the axes helps us solve the problem. However, there are many combinations that achieve the same transformation. In the following section, we introduce an algorithm to standardize the process.
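A small NumPy sketch of building ${}^A_B T$ from a rotation and a displacement and of chaining transforms; the rotation and displacement values in the example are made up for illustration.

```python
import numpy as np

def homogeneous(R, P):
    """Build the 4x4 matrix [[R, P], [0, 1]] from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = P
    return T

# Example: frame B is frame A rotated 90 degrees about z and shifted 5 along x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T_AB = homogeneous(Rz, np.array([5.0, 0.0, 0.0]))

# A point expressed in frame B is mapped into frame A by T_AB; a chain of
# frames is composed by matrix multiplication of the individual transforms.
p_B = np.array([1.0, 0.0, 0.0, 1.0])     # homogeneous coordinates
p_A = T_AB @ p_B                          # -> [5, 1, 0, 1]
```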

5.2. Denavit-Hartenberg Parameters.

The Denavit-Hartenberg parameters were introduced by Jacques Denavit and Richard S. Hartenberg. Each parameter describes an angle or a displacement about or along the X or Z axes between two frames. Using these parameters we can transform one frame into the other by intrinsic rotations. Note that an intrinsic rotation is a transformation based on the current frame, and an extrinsic rotation is a transformation based on a fixed frame. The four parameters are defined below, corresponding to figure 16.

(1) $\alpha_{i-1}$: the angle from $Z_{i-1}$ to $Z_i$ about the axis $X_{i-1}$
(2) $a_{i-1}$: the displacement from $Z_{i-1}$ to $Z_i$ along the axis $X_{i-1}$
(3) $\theta_i$: the angle from $X_{i-1}$ to $X_i$ about the axis $Z_i$
(4) $d_i$: the displacement from $X_{i-1}$ to $X_i$ along the axis $Z_i$

Figure 16. The Denavit-Hartenberg parameters $\alpha_{i-1}$, $a_{i-1}$, $\theta_i$, $d_i$ between frames $i-1$ and $i$.

Therefore, there are four movements to transform one frame into the other.

Movement 1: Rotate the frame by $\alpha_{i-1}$ about the $X_{i-1}$ axis; the transformation matrix is
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\alpha_{i-1} & -\sin\alpha_{i-1} & 0 \\ 0 & \sin\alpha_{i-1} & \cos\alpha_{i-1} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Movement 2: Shift the frame by $a_{i-1}$ along the $X_{i-1}$ axis; the transformation matrix is
$$\begin{pmatrix} 1 & 0 & 0 & a_{i-1} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Movement 3: Rotate the frame by $\theta_i$ about the $Z_i$ axis; the transformation matrix is
$$\begin{pmatrix} \cos\theta_i & -\sin\theta_i & 0 & 0 \\ \sin\theta_i & \cos\theta_i & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Movement 4: Shift the frame by $d_i$ along the $Z_i$ axis; the transformation matrix is
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & d_i \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Multiplying the transformation matrices of these four movements constructs the transformation matrix ${}^{i-1}_i T$:
$${}^{i-1}_i T = \begin{pmatrix} \cos\theta_i & -\sin\theta_i & 0 & a_{i-1} \\ \sin\theta_i\cos\alpha_{i-1} & \cos\theta_i\cos\alpha_{i-1} & -\sin\alpha_{i-1} & -\sin\alpha_{i-1}\, d_i \\ \sin\theta_i\sin\alpha_{i-1} & \cos\theta_i\sin\alpha_{i-1} & \cos\alpha_{i-1} & \cos\alpha_{i-1}\, d_i \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

At this point, we have everything we need to compute the forward kinematics. Now take the robotic arm BCN3D MOVEO[13] for instance. See figure 17.

Figure 17. The BCN3D MOVEO and its five joints.

We compute where the gripper frame will be under a specific gesture. See figure 18(b) for a schematic diagram. Note that joint 1 and joint 2 are set to intersect for convenience of computation, which does not affect the real robotic arm; the same holds for joint 4 and joint 5.

First, we construct each frame of the robotic arm. The Z axes are assigned to the joint axes and the X axes are assigned to point at the next Z axis. Note that frame 0 and frame 6 are set for convenience of connecting to the base frame and the gripper frame. See figure 18(a).

With the frames, we can write down the Denavit-Hartenberg table from frame 0 to frame 6 based on the Denavit-Hartenberg parameters. See table 1. The transformation matrices ${}^{base}_0 T$ and ${}^6_{gripper} T$ will be defined later.

After constructing the Denavit-Hartenberg table, we can find the gripper frame by computing the product of the sequence of transformation matrices. First, ${}^{base}_0 T$ and ${}^6_{gripper} T$ can be read off from figure 18(a).

Figure 18. (a) Frame assignment of the robotic arm. (b) Schematic diagram of the specific gesture.

i   α_{i-1}   a_{i-1}   d_i   θ_i
1     0          0       0      0
2   -90°         0       0    -90°
3     0         22.4     0     90°
4    90°         0      23      0
5   -90°         0       0      0
6    90°         0       0      0

Table 1. Denavit-Hartenberg table

$${}^{base}_0 T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 23 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad {}^6_{gripper} T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 15 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
and ${}^0_1 T, {}^1_2 T, {}^2_3 T, {}^3_4 T, {}^4_5 T, {}^5_6 T$ can be defined based on the DH table. Hence,
$${}^{base}_{gripper} T = {}^{base}_0 T\; {}^0_1 T\; {}^1_2 T\; {}^2_3 T\; {}^3_4 T\; {}^4_5 T\; {}^5_6 T\; {}^6_{gripper} T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 83.4 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

which matches the observation that the gripper frame is displaced 83.4 centimeters along $z_{base}$ with no rotation from the base frame.
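The forward-kinematics example above can be checked numerically. The sketch below builds ${}^{i-1}_i T$ from the DH parameters, composes the chain together with the base and gripper offsets, and recovers the 83.4 cm displacement along $z_{base}$ (a verification sketch, not the thesis code).

```python
import numpy as np

def dh_transform(alpha, a, theta, d):
    """The matrix derived in section 5.2 from the four DH movements (angles in degrees)."""
    al, th = np.radians(alpha), np.radians(theta)
    ca, sa, ct, st = np.cos(al), np.sin(al), np.cos(th), np.sin(th)
    return np.array([
        [ct,      -st,       0.0,   a],
        [st * ca,  ct * ca, -sa,   -sa * d],
        [st * sa,  ct * sa,  ca,    ca * d],
        [0.0,      0.0,      0.0,   1.0],
    ])

def translation_z(dz):
    T = np.eye(4)
    T[2, 3] = dz
    return T

# Rows of table 1: (alpha_{i-1}, a_{i-1}, d_i, theta_i).
dh_table = [
    (0.0,    0.0,  0.0,   0.0),
    (-90.0,  0.0,  0.0, -90.0),
    (0.0,   22.4,  0.0,  90.0),
    (90.0,   0.0, 23.0,   0.0),
    (-90.0,  0.0,  0.0,   0.0),
    (90.0,   0.0,  0.0,   0.0),
]

T = translation_z(23.0)                      # base -> frame 0
for alpha, a, d, theta in dh_table:
    T = T @ dh_transform(alpha, a, theta, d)
T = T @ translation_z(15.0)                  # frame 6 -> gripper

print(np.round(T[:3, 3], 1))                 # approximately [0, 0, 83.4]
```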

5.3. Pieper’s Solution.

We have seen that the gripper frame position can be written as the product of a sequence of transformation matrices. Conversely, given a target frame, we want to know how to move the joints so that the gripper frame matches the target frame. This process is called inverse kinematics. Suppose we are given the transformation matrix ${}^{base}_{target} T$. What inverse kinematics does is find the θ's such that the product of the transformation matrices from the base frame, through the joint reference frames, to the gripper frame matches ${}^{base}_{target} T$. The problem is rather difficult, since the more transformation matrices we multiply, the more difficult the equations become. Pieper's solution is a way to solve the problem under some restrictions: it applies only to robotic arms whose last three axes are rotational and intersect at a common point. Instead of introducing the details, we work through a concrete example.

We use the same robotic arm as in the last section, which satisfies the conditions of Pieper's solution. Suppose there is a ball in front of the base frame and we want the gripper to catch it from the top. See figure 19. We solve for the angles the robotic arm has to move.

Figure 19. The ball is placed 20 units in front of the base frame along $x_{base}$.

First, ${}^{base}_{ball} T$, ${}^{base}_0 T$, ${}^6_{gripper} T$ and ${}^{gripper}_{ball} T$ can be defined based on figure 19:
$${}^{base}_{ball} T = \begin{pmatrix} 1 & 0 & 0 & 20 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad {}^{base}_0 T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 23 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad {}^6_{gripper} T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 15 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad {}^{gripper}_{ball} T = \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
and since ${}^{base}_{ball} T = {}^{base}_0 T\; {}^0_6 T\; {}^6_{gripper} T\; {}^{gripper}_{ball} T$, ${}^0_6 T$ is
$${}^0_6 T = ({}^{base}_0 T)^{-1}\; {}^{base}_{ball} T\; ({}^{gripper}_{ball} T)^{-1}\; ({}^6_{gripper} T)^{-1} = \begin{pmatrix} -1 & 0 & 0 & 20 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & -8 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
and ${}^0_1 T, \ldots, {}^5_6 T$ can be written down based on the DH table of the last section:
$${}^0_1 T = \begin{pmatrix} \cos\theta_1 & -\sin\theta_1 & 0 & 0 \\ \sin\theta_1 & \cos\theta_1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad {}^1_2 T = \begin{pmatrix} \sin\theta_2 & \cos\theta_2 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \cos\theta_2 & -\sin\theta_2 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad {}^2_3 T = \begin{pmatrix} -\sin\theta_3 & -\cos\theta_3 & 0 & 22.4 \\ \cos\theta_3 & -\sin\theta_3 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
$${}^3_4 T = \begin{pmatrix} \cos\theta_4 & -\sin\theta_4 & 0 & 0 \\ 0 & 0 & -1 & -23 \\ \sin\theta_4 & \cos\theta_4 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad {}^4_5 T = \begin{pmatrix} \cos\theta_5 & -\sin\theta_5 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\sin\theta_5 & -\cos\theta_5 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad {}^5_6 T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Second, find $\theta_1, \theta_2, \theta_3, \theta_4, \theta_5$.
$${}^0 P_{4org} = {}^0_1 T\; {}^1_2 T\; {}^2_3 T\; {}^3 P_{4org} = {}^0_1 T\; {}^1_2 T\; {}^2_3 T \begin{pmatrix} 0 \\ -23 \\ 0 \\ 1 \end{pmatrix} = {}^0_1 T\; {}^1_2 T \begin{pmatrix} 23\cos\theta_3 + 22.4 \\ 23\sin\theta_3 \\ 0 \\ 1 \end{pmatrix}$$
Let $f_1 = 23\cos\theta_3 + 22.4$ and $f_2 = 23\sin\theta_3$.

$${}^0 P_{4org} = {}^0_1 T\; {}^1_2 T \begin{pmatrix} f_1 \\ f_2 \\ 0 \\ 1 \end{pmatrix} = {}^0_1 T \begin{pmatrix} \sin\theta_2 f_1 + \cos\theta_2 f_2 \\ 0 \\ \cos\theta_2 f_1 - \sin\theta_2 f_2 \\ 1 \end{pmatrix}$$
Let $g_1 = \sin\theta_2 f_1 + \cos\theta_2 f_2$ and $g_3 = \cos\theta_2 f_1 - \sin\theta_2 f_2$.
$${}^0 P_{4org} = {}^0_1 T \begin{pmatrix} g_1 \\ 0 \\ g_3 \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta_1 g_1 \\ \sin\theta_1 g_1 \\ g_3 \\ 1 \end{pmatrix}$$
Since the last three axes intersect, we know that ${}^0 P_{4org} = {}^0 P_{5org} = {}^0 P_{6org}$. Therefore
$$\begin{pmatrix} 20 \\ 0 \\ -8 \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta_1 g_1 \\ \sin\theta_1 g_1 \\ g_3 \\ 1 \end{pmatrix}$$
implies
$$20^2 + 0^2 + (-8)^2 = \cos^2\theta_1\, g_1^2 + \sin^2\theta_1\, g_1^2 + g_3^2 = g_1^2 + g_3^2 = (\sin\theta_2 f_1 + \cos\theta_2 f_2)^2 + (\cos\theta_2 f_1 - \sin\theta_2 f_2)^2 = f_1^2 + f_2^2 = (23\cos\theta_3 + 22.4)^2 + (23\sin\theta_3)^2 = 1030.4\cos\theta_3 + 1030.76$$

Solving the equation above gives $\theta_3 = 123.36°$ or $236.63°$. We use the angle $123.36°$ to find the other angles, since we want the robotic arm to take a specific gesture. Substituting $\theta_3$ into $g_3$ gives
$$-8 = -19.21\sin\theta_2 + 9.75\cos\theta_2$$
Solving this equation gives $\theta_2 = -174.89°$ or $48.71°$; we use $48.71°$, again to obtain the desired gesture. Substituting $\theta_3$ and $\theta_2$ into $g_1$ gives
$$20 = 20\cos\theta_1, \qquad 0 = 20\sin\theta_1$$
Solving these equations gives $\theta_1 = 0°$. Now that $\theta_1, \theta_2, \theta_3$ are known, ${}^3_6 T$ can be computed from $({}^0_3 T)^{-1}\, {}^0_6 T$ and from ${}^3_4 T\, {}^4_5 T\, {}^5_6 T$, which provides the equations to solve $\theta_4 = 0°$ and $\theta_5 = 7.92°$.
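The position sub-problem above can be reproduced numerically. The sketch below solves for $\theta_3$, $\theta_2$ and $\theta_1$ using the $f$ and $g$ quantities defined in the text; the wrist angles $\theta_4$ and $\theta_5$ are omitted, and the branch choices follow the thesis.

```python
import numpy as np

# Target position of the wrist centre 0P_4org, taken from 0_6 T above.
px, py, pz = 20.0, 0.0, -8.0

# g1^2 + g3^2 = f1^2 + f2^2 = 1030.4 cos(theta3) + 1030.76
r2 = px**2 + py**2 + pz**2
theta3 = np.degrees(np.arccos((r2 - 1030.76) / 1030.4))    # ~123.4 (other branch ~236.6)

f1 = 23.0 * np.cos(np.radians(theta3)) + 22.4
f2 = 23.0 * np.sin(np.radians(theta3))

# g1 = f1 sin(theta2) + f2 cos(theta2),  g3 = -f2 sin(theta2) + f1 cos(theta2)
g1 = np.hypot(px, py)          # since cos(theta1) g1 = px and sin(theta1) g1 = py
g3 = pz
s2 = (f1 * g1 - f2 * g3) / (f1**2 + f2**2)
c2 = (f2 * g1 + f1 * g3) / (f1**2 + f2**2)
theta2 = np.degrees(np.arctan2(s2, c2))                    # ~48.7

theta1 = np.degrees(np.arctan2(py, px))                    # 0.0

print(round(theta3, 1), round(theta2, 1), round(theta1, 1))
# approximately 123.4 48.7 0.0, matching 123.36, 48.71 and 0 in the text
```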

6. Hardware Introduction and Wiring Diagram

6.1. Hardware Introduction.

(1) Robotic Arm BCN3D MOVEO: an open-source 3D-printed robotic arm with 5 joints driven by six stepper motors and a gripper driven by a servo motor. See figure 20. The STL files of the parts and the assembly manual can be downloaded online[14].

Figure 20. The BCN3D MOVEO, with the six stepper motors, the servo motor and the gripper labeled.

In order to control hardware error during the movement of the robotic arm, we add a gyroscope and a camera to the gripper. See figure 21. The gyroscope helps us make sure the camera is facing directly downward, and the camera is used to locate the object.

Figure 21. The gyroscope and the camera mounted on the gripper.

(2) Arduino Board: an open-source programmable electronics platform that lets us control electronic components more easily. We use two Arduino boards here. See figures 22 and 23. The Arduino Mega controls the motors and the Arduino Uno reads the gyroscope on the gripper.

Figure 22. Arduino Mega

Figure 23. Arduino Uno

(3) Raspberry Pi: a single-board computer running Linux. We use it to read video frames from the camera on the gripper and compute the object position.

Figure 24. Raspberry Pi 3

(4) Laptop: the communication center of the whole process. It receives the object position from the Raspberry Pi and the gripper angle from the Arduino Uno, does the inverse kinematics computation, and tells the Arduino Mega the angles to move.

6.2. Wiring Diagram.

(1) Arduino Mega: the Arduino Mega connects to the 6 stepper motors which control the joints of the robotic arm and to the servo motor which controls the gripper. See figure 25.

Figure 25. Wiring of the Arduino Mega: the six stepper motors, the servo motor, a 24 V power supply and a 24 V to 12 V power converter.

Note that the power supply also powers all six stepper motor drivers, which is not shown in figure 25. Decoupling capacitors are added across the motor supplies to reduce the electronic interference with the gyroscope on the gripper.

(2) Arduino Uno: the Arduino Uno connects to the gyroscope on the gripper. See figure 26.

Figure 26.

(3) Laptop: the laptop communicates with the two Arduino boards through USB ports and downloads the object position from the Raspberry Pi through secure shell (SSH).

7. Object Detection and Pick and Place

With the preliminary knowledge introduced earlier, we now describe the whole process of object detection and pick-and-place. The task is to let the robot arm automatically detect the object and the target location and move the object to the target location.

7.1. Functions.

(1) Robot Arm Movement: all movements are based on Pieper's solution. Given a coordinate in the base frame, the laptop computes the angles of the movement using Pieper's solution and sends them to the Arduino Mega through serial. Serial is an Arduino facility used for communication between devices. Note that the angle the servo motor has to move is sent at the same time. Once the Arduino Mega receives the angles, it starts to move the joints. After all the joints finish moving, it sends the message "The robot arm is ready for the next movement." back to the laptop.

(2) Object Detection: through the camera, we filter a certain range of the HSV color space. The color red is for the object and the color blue is for the target position. See figure 27. Note that the green point, the gripper point, is the position that the robot arm can reach vertically downward.

Figure 27.
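A minimal OpenCV sketch of the color filtering described above. The HSV thresholds and the file name are assumptions for illustration (red in particular wraps around the hue axis and often needs two ranges); the thesis does not list its actual values.

```python
import cv2
import numpy as np

def find_center(frame_bgr, lower_hsv, upper_hsv):
    """Return the pixel centroid of the largest blob inside the given HSV range."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lower_hsv, upper_hsv)
    # OpenCV 4 signature: returns (contours, hierarchy).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    blob = max(contours, key=cv2.contourArea)
    m = cv2.moments(blob)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

# Illustrative thresholds: a red object and a blue target marker.
red_lo, red_hi = np.array([0, 120, 70]), np.array([10, 255, 255])
blue_lo, blue_hi = np.array([100, 120, 70]), np.array([130, 255, 255])

frame = cv2.imread("workspace.jpg")      # hypothetical snapshot from the gripper camera
object_px = find_center(frame, red_lo, red_hi)
target_px = find_center(frame, blue_lo, blue_hi)
```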

(3) Movement Adjustment: the movement of the robot arm generates hardware error. When the green gripper point does not match the center of the object, we need to adjust for the error. However, simply adjusting offsets along the x-axis and the y-axis does not work, since the camera turns to the orientation the robot arm faces after moving. The problem is caused by the fact that we want the camera facing directly downward to read the object's coordinate, and there is no extra joint to control the orientation of the camera. The solution is to do some mathematical calculation before adjusting. See the example below.

The origin of the gripper frame is a point that is not affected by the orientation, and the gripper point can be calculated from it. We turn this problem into a plane coordinate system, and the goal is to find the offsets the origin of the gripper frame has to move. See figure 28.

Figure 28. Plane-coordinate view of the origin of the base frame, the origin of the gripper frame, the gripper point and the center of the object.

First, we calculate the coordinates of the origin of the gripper frame and the center of the object, which are (0, 22) and (3, 25). Using the coordinate (3, 25), we can calculate cos θ and sin θ, where θ is the angle between the x-axis and the vector from the origin of the base frame to the center of the object:
$$\cos\theta = \frac{3}{\sqrt{3^2 + 25^2}}, \qquad \sin\theta = \frac{25}{\sqrt{3^2 + 25^2}}$$
With cos θ and sin θ, we can compute the coordinate of the new origin of the gripper frame, which is $(3 - 3\cos\theta,\ 25 - 3\sin\theta)$. See figure 29.

Figure 29. The new origin of the gripper frame lies 3 units before the object center (3, 25) along the direction θ.

With the coordinates of the origin of the gripper frame and the new origin of the gripper frame, we now know the offsets to move so that the gripper point matches the center of the object.
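A small sketch of the offset calculation in the example above; the 3-unit distance between the gripper frame origin and the gripper point is taken from the worked example and passed in as a parameter.

```python
import math

def adjustment_offsets(gripper_origin, object_center, reach=3.0):
    """Offsets that move the gripper frame origin so the gripper point hits the object.

    The new gripper-frame origin lies `reach` units before the object center
    along the ray from the base-frame origin to the object center.
    """
    ox, oy = object_center
    norm = math.hypot(ox, oy)
    cos_t, sin_t = ox / norm, oy / norm
    new_origin = (ox - reach * cos_t, oy - reach * sin_t)
    return (new_origin[0] - gripper_origin[0], new_origin[1] - gripper_origin[1])

# The worked example: gripper frame origin at (0, 22), object center at (3, 25).
print(adjustment_offsets((0.0, 22.0), (3.0, 25.0)))
```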

(4) Camera Adjustment: we adjust the movement through the camera, so it is important that the camera consistently detects the object from directly above. This consistency is also affected by the error the movement generates. We maintain it using the gyroscope on the camera. The Arduino Uno reads the angle from the gyroscope and sends it to the laptop in real time. First, the laptop tells the Arduino Mega to move the camera to a certain angle and then slowly tilt it downward. See figure 30.

Figure 30.

When the laptop receives an angle indicating that the camera is facing directly downward, it stops the movement.

7.2. Process.

With all the functions above, we now have the ability to accomplish the task. In the following, we write down the process step by step.

Step 1: Move the robotic arm to a certain gesture so that the camera captures the whole workspace, and locate the object and the target location.

Step 2: Move the robotic arm to the top of the object and adjust the camera to face directly downward.

Step 3: Let the gripper point match the center of the object through movement adjustment.

Step 4: Move the robotic arm to catch the object and return.

Step 5: Move the robotic arm to the top of the target and adjust the camera to face directly downward.

Step 6: Let the gripper point match the center of the target location through movement adjustment.

Step 7: Move the robotic arm to place the object at the target location.

A video of the whole process is available on YouTube[17].

8. Conclusion and Discussion

As we can imagine, a computer game player needs a strong understanding of the game behind it. However, AlphaGo Zero breaks this rule by applying neural networks to traditional game tree search: without any human knowledge, the computer game player now has the ability to reinforce itself. In the beginning, we also tried to train our Gomoku player using APV-MCTS from a randomly initialized model. Because of limited computation power, we instead started reinforcing our model from a supervised model and obtained some results.

Now we have a decent Gomoku player and a robotic arm with some abilities. In future work, we want to achieve our final goal: letting the robotic arm play Gomoku against a human automatically. There are still some difficulties to overcome. Playing board games requires highly precise operations from the robotic arm; the stones are small and are arranged closely on the board, and the gripper of the robotic arm is too big to place stones. To achieve this, one idea is to redesign the robotic arm and change the gripper to a vacuum cup.

References

[1] Rosenblatt, Frank, The Perceptron: a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory (1957).
[2] Paul J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University (1974).
[3] Kiefer, J.; Wolfowitz, J., Stochastic Estimation of the Maximum of a Regression Function. Ann. Math. Statist. 23, no. 3, 462-466 (1952).
[4] Ning Qian, On the momentum term in gradient descent learning algorithms. Neural Networks 12(1):145-151 (1999).
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition (2015).
[6] Sergey Ioffe, Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015).
[7] Rémi Coulom, Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search (2006).
[8] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis, Mastering the game of Go with deep neural networks and tree search (2016).
[9] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis, Mastering the game of Go without human knowledge (2017).
[10] Rosin, C. D., Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence 61, 203-230 (2011).
[11] Kocsis, L. and Szepesvári, C., Bandit-based Monte-Carlo planning. ECML'06 (2006).
[12] http://www.renju.net/index.php
[13] https://www.bcn3dtechnologies.com/en/bcn3d-moveo-the-future-of-learning/
[14] https://github.com/BCN3D/BCN3D-Moveo
[15] https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks–1489512765771.html
[16] Thomas R. Kurfess, Robotics and Automation Handbook. CRC Press LLC (2005).
