Organization of the Dissertation - New Heuristic Algorithms for Monte Carlo Tree Search Applied to Computer Go

Chapter 1 Introduction

1.5 Organization of the Dissertation

The organization of this dissertation is as follows. Chapter 1 gives an introduction to computer games, the game of Go and computer Go, followed by a summary of the contributions of this research and the organization of the dissertation. Chapter 2 presents the background and related work of this research. It introduces Monte Carlo Go, explains Monte Carlo Tree Search (MCTS) and Upper Confidence bounds applied to Trees (UCT), and surveys some of the state-of-the-art Go-playing programs as well as their contributions. Chapter 3 introduces our Go-playing program ERICA. We narrate its development history and its standings in the tournaments in which we have participated, and introduce the framework of the program. Chapter 4 presents our first contribution: applying SB to 9×9 Go. Chapter 5 shows the second contribution: time management schemes utilized in 19×19 Go. Finally, conclusions and proposals for future work are given in Chapter 6.

Chapter 2 Background and Related Work

In this chapter, we introduce the background and related work of this research. Section 2.1 introduces the progress of Monte Carlo Go until the development of Monte Carlo Tree Search (MCTS) and Upper Confidence bounds applied to Trees (UCT). Section 2.2 explains MCTS and its four stages along with the related work. Section 2.3 explains UCT, which was mainly proposed for the first stage (selection) of MCTS. Finally, Section 2.4 surveys a number of state-of-the-art Go-playing programs as well as their contributions.

2.1 Monte Carlo Go

The idea of Monte Carlo Go was first introduced by Brügmann (Brügmann, 1993). In his paper “Monte Carlo Go”, Brügmann proposed an algorithm which attempts to find the best move by simulated annealing, without including any Go knowledge in the simulation except the rule “do-not-fill-eye”. Based on Abramson’s expected-outcome model (Abramson, 1990), a position is evaluated by the average score of a certain number of simulations (random games) played from that position on. Remarkably, with this approach, Brügmann’s program GOBBLE achieved a playing strength of about 25-kyu on a 9×9 board. In 2003, on the basis of Brügmann’s work, Bouzy started to experiment with Monte Carlo Go (Bouzy, 2003; Bouzy and Helmstetter, 2003) and accordingly built a new version of his program INDIGO. In the next few years, Bouzy and Chaslot brought forward a number of groundbreaking ideas, such as Bayesian generation of patterns for 19×19 Go (Bouzy and Chaslot, 2005), Progressive Pruning and its variants Miai Pruning (MP) and Set Pruning (SP) (Bouzy, 2005a), History Heuristic and Territory Heuristic (Bouzy, 2005b) and Enhanced 3×3 patterns by reinforcement learning (Bouzy and Chaslot, 2006).

It was on the basis of these preliminary works on Monte Carlo Go that the significant breakthroughs of Monte Carlo Tree Search (MCTS) (Coulom, 2006) and Upper Confidence bounds applied to Trees (UCT) (Kocsis and Szepesvári, 2006) were independently achieved in 2006.

2.2 Monte Carlo Tree Search (MCTS)

Monte Carlo Tree Search (MCTS) (Coulom, 2006) is a kind of best-first search that tries to find the best move while keeping the balance between exploration and exploitation of all moves. MCTS was first implemented in CRAZY STONE, the winner of the 9×9 Go tournament at the 2006 Computer Olympiad. Together with the emergence of UCT (Kocsis and Szepesvári, 2006), the huge success of MCTS stimulated profound interest among Go programmers. So far, many enhancements of MCTS have been proposed and developed to strengthen its effect, such as Rapid Action Value Estimation (RAVE) (Gelly and Silver, 2007; Gelly and Silver, 2011) and progressive bias (Chaslot et al., 2007). Many studies have also focused on the playout policy and on improving the quality of the playout (Coulom, 2007; Chaslot et al., 2009; Hendrik, 2010).

MCTS is commonly divided into four stages (Chaslot et al., 2007): selection, expansion, simulation and backpropagation, as shown in Figure 2.1. MCTS operates by performing these four stages repeatedly as long as there is time left. The four stages of MCTS and the related work are described in the following subsections.

Figure 2.1: The scheme of MCTS.
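The repeated four-stage loop can be illustrated with a minimal sketch. Everything game-specific here is an assumption made for the sake of a self-contained example: the toy game is a small Nim variant (take 1 to 3 stones, whoever takes the last stone wins) rather than Go, the selection function is a plain UCB1-style formula with a hypothetical constant of 1.4, and a fixed iteration count stands in for "as long as there is time left".

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones            # stones left; the player to move takes 1-3
        self.parent = parent
        self.move = move                # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0                 # wins for the player who moved INTO this node

def select(node, c=1.4):
    # Stage 1: descend while the node is fully expanded and not terminal.
    while not node.untried and node.children:
        log_n = math.log(node.visits)
        node = max(node.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(log_n / ch.visits))
    return node

def expand(node):
    # Stage 2: create one child for an untried move (no-op at terminal nodes).
    if node.untried:
        move = node.untried.pop()
        child = Node(node.stones - move, parent=node, move=move)
        node.children.append(child)
        return child
    return node

def simulate(stones, rng):
    # Stage 3: random playout; returns 1 if the player to move at the start wins.
    player = 0                          # 0 = the player to move at the start
    while stones > 0:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return 1 if player == 0 else 0
        player = 1 - player
    return 0                            # no stones left: the player to move already lost

def backpropagate(node, reward):
    # Stage 4: reward is from the viewpoint of the player who moved into `node`;
    # it flips at every level on the way back to the root.
    while node is not None:
        node.visits += 1
        node.wins += reward
        reward = 1 - reward
        node = node.parent

def mcts(stones, iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        leaf = expand(select(root))
        outcome = simulate(leaf.stones, rng)   # for the player to move at the leaf
        backpropagate(leaf, 1 - outcome)
    # Play the most visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move
```

In this toy game a position with a multiple of four stones is lost for the player to move, so from five stones the search is expected to prefer taking one stone.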

2.2.1 Selection

The first stage, selection, selects one of the children of a given node according to a selection function (or selection formula), and repeats this from the root node until the end of the tree is reached. Figure 2.2 gives an example. The selection strategy UCT and the various selection functions adopted by different Go-playing programs are examined separately in Section 2.3.

Figure 2.2: The first stage of MCTS: selection. Node 1 (Root) selects Node 2 then Node 2 selects Node 6, which reaches the end of the tree.

2.2.2 Expansion

The second stage, expansion, creates a new child node, corresponding to one of the legal moves of the parent node, and stores this new node in memory to “expand” the tree. Figure 2.3 gives an example.

The simplest scheme of expansion is to create a new node in the first visit of a leaf node (Coulom, 2006). However, for RAVE, it is necessary to create all the child nodes in preparation for updating the RAVE statistics in the fourth stage backpropagation. To reduce this memory overhead, a popular solution is delayed node creation, namely to expand a node in the nth (n>1) visit. The NOMITAN team has reported some effective variants of delayed node creation (Yajima et al., 2010).

To raise the performance of RAVE, it is suggested to assign a prior value to each created node (Gelly and Silver, 2007). If many features are taken into account in computing a prior value, node creation can be costly and slow. To speed up node creation in a multithreaded environment, FUEGO uses an independent, thread-specific memory array for node creation (Enzenberger and Müller, 2009).

Figure 2.3: The second stage of MCTS: expansion. Node 10, the child node of the leaf node 6, is created and stored to the memory to expand the tree.
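Delayed node creation, as described above, can be sketched as follows. The class and function names (`Node`, `maybe_expand`, `threshold`) are hypothetical and chosen only for illustration; the sketch assumes that all child nodes are created at once, as RAVE requires, but only on the nth visit of the leaf.

```python
class Node:
    def __init__(self, move=None):
        self.move = move
        self.visits = 0
        self.children = []

def maybe_expand(leaf, legal_moves, threshold):
    """Count one visit of `leaf`; create all its children once it has been
    visited `threshold` times. Returns True if the expansion happened."""
    leaf.visits += 1
    if not leaf.children and leaf.visits >= threshold:
        leaf.children = [Node(m) for m in legal_moves]
        return True
    return False
```

With `threshold=1` this reduces to the simplest scheme of expanding on the first visit; larger values trade memory for a later start of the statistics in the child nodes.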

2.2.3 Simulation

The third stage, simulation, performs a simulation (also called a playout) from the position represented by the newly created node. With delayed node creation, a simulation is simply performed from the leaf node. In MCTS, a simulation is carried out as a Monte Carlo simulation composed of random or pseudo-random moves. This is the reason for the names “Monte Carlo Go” and “Monte Carlo Tree Search”. When the random game is completed, the final position is scored⁴ to decide the winner. Then the associated outcome 0/1 is passed to the tree to indicate a loss/win of this simulation.

Figure 2.4 gives an example.

Figure 2.4: The third stage of MCTS: simulation. After Node 10 was created, a Monte Carlo simulation is performed from the position represented by this node. Finally, the outcome 0/1 is returned to indicate loss/win of this simulation.

Simulation is the most crucial step of MCTS. In general, there are mainly two types of Monte Carlo simulation among the current strong Go-playing programs.

4 For the game of Go, a simulation is usually scored by the Chinese rules.

The first type, called Mogo-type, sequence-like or fixed-sequence simulation (Gelly et al., 2006), also known as Mogo’s magic formula, is used by MOGO, PACHI, FUEGO, and many other strong Go-playing programs. Mogo-type, sequence-like simulation is further investigated in Section 2.4.2.

The second type, called CRAZY STONE-like or probabilistic simulation, also known as CRAZY STONE’s update formula (Teytaud, 2011), allows more flexibility. It was proposed by Rémi Coulom (Coulom, 2007) and is used by CRAZY STONE, AYA and our Go-playing program ERICA. ZEN was reported to use a mixed type of simulation between Mogo-type and CRAZY STONE-like (Yamato, 2011). CRAZY STONE-like simulation is further investigated in the next chapter.

Recent research on simulation centers on two directions. The first direction is to balance the simulations in the framework of the Boltzmann softmax playout policy with trained feature weights, which will be discussed in Chapter 4. The second direction is to improve the playout policy by letting the simulations learn from themselves, according to the results of the previous simulations (Drake, 2009; Hendrik, 2010; Baier and Drake, 2010) or the statistical data accumulated in the tree (Rimmel et al., 2010). Such a dynamic or adaptive scheme for the simulation is called adaptive playout.

2.2.4 Backpropagation

The fourth stage, backpropagation, propagates the simulation outcome 0/1 from the newly created node, along the path decided in the selection stage, to the root node. Each node on this path updates its own statistical data with the simulation outcome.

Figure 2.5 gives an example.

In backpropagation, it is possible to update other statistical data with the information collected from the simulation, to obtain a faster estimation of the child nodes. For instance, with RAVE (Gelly and Silver, 2007) or other kinds of AMAF (All-Moves-As-First) (Brügmann, 1993; Helmbold and Wood, 2009), a node updates all the moves that were played in the tree and in the simulation after the position represented by this node.

Some researchers also tried to assign heavier weights to the later simulation outcomes when the tree grows larger (Xie and Liu, 2009), under the assumption that the larger the sub-tree the more promising the simulation outcome.

Figure 2.5: The fourth stage of MCTS: backpropagation. The simulation outcome 0/1 is propagated from Node 10 along the path decided in the selection stage (Node 6 and Node 2) to the root node (Node 1). Each node updates its own statistical data with the simulation outcome.
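The AMAF-style update mentioned in this subsection can be sketched as below. This is only an illustration under stated assumptions: the data layout (`Node`, `amaf_visits`, `amaf_wins`) is hypothetical, only moves by the player to move at a node are credited to that node, and each point is counted at most once per simulation. Actual programs differ in these details.

```python
from collections import defaultdict

class Node:
    def __init__(self, to_move):
        self.to_move = to_move                # "B" or "W": the player to move here
        self.visits = 0
        self.black_wins = 0
        self.amaf_visits = defaultdict(int)   # per-move AMAF counts
        self.amaf_wins = defaultdict(int)     # per-move AMAF wins for self.to_move

def backpropagate(path, moves, outcome):
    """path[d] is the node reached after d moves from the root.
    moves is the full move list (tree part, then simulation part) as
    (color, point) pairs. outcome is 1 if Black won the simulation, else 0."""
    for depth, node in enumerate(path):
        # Ordinary update with the simulation outcome.
        node.visits += 1
        node.black_wins += outcome
        # AMAF update: credit every later move of the player to move here.
        reward = outcome if node.to_move == "B" else 1 - outcome
        seen = set()
        for color, point in moves[depth:]:
            if color == node.to_move and point not in seen:
                seen.add(point)
                node.amaf_visits[point] += 1
                node.amaf_wins[point] += reward
```

For example, after a Black win with the move list B C3, W D4, B E5, W C3, the root (Black to move) records AMAF wins for C3 and E5, while its child (White to move) records AMAF losses for D4 and C3.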

A recent topic in backpropagation that has attracted much attention is dynamic komi. Dynamic komi was proposed to remedy the poor performance of MCTS in handicap games on the 19×19 board. The objective under the current structure of MCTS is to maximize the winning rate rather than the score. MCTS therefore works best when the winning rate of the root node is close to 50%, because this is when the simulation outcomes reflect good and bad moves to the maximum degree. When the winning rate is close to 100% (the case of 0% can be deduced in the same way), MCTS becomes reluctant to explore (since a 0.5-point win and a 20.5-point win yield the same outcome) and incapable of discriminating between good and bad moves. After all, Monte Carlo simulation is more or less biased and far from perfect. This problem becomes particularly apparent in handicap games against strong human players, as a result of the huge and early advantage offered by the handicap stones.

Figure 2.6 gives a practical example. This position is taken from the exhibition game at the 2010 Computer Olympiad, Rina Fujisawa (White) vs. ERICA (Black), with 6 handicap stones. The stone marked by ∆ is the last move. Point A (extending) is a mandatory move for Black in this case, but ERICA played at B, a clearly bad move, while showing a winning rate of over 80%.

Figure 2.6: The exhibition game at the 2010 Computer Olympiad: Rina Fujisawa (White) vs. ERICA (Black), with 6 handicap stones. White won by resignation.

The main idea of dynamic komi is to adjust the komi value, by the averaged score derived from the last search, in order to shift the winning rate of the root node closer to 50%. ZEN, THE MANY FACES OF GO and PACHI⁵ have been reported to benefit from dynamic komi, although each has a different approach.
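The main idea can be illustrated with a small numerical sketch. It is not the method of any of the programs named above, each of which is reported to use its own approach. The functions `win_rate` and `adjust_komi`, the parameter `step_fraction` and the sample scores are all hypothetical; scores are Black's raw margins on the board before komi is deducted.

```python
def win_rate(final_scores, komi):
    """Fraction of simulations that Black wins after `komi` points are deducted."""
    wins = sum(1 for score in final_scores if score - komi > 0)
    return wins / len(final_scores)

def adjust_komi(final_scores, komi, step_fraction=1.0):
    """Shift komi toward the averaged score margin of the last search."""
    average_margin = sum(final_scores) / len(final_scores) - komi
    return komi + step_fraction * average_margin

# Black's raw margins from a hypothetical search in a handicap game.
scores = [18, 22, 25, 15, 30, 12, 20, 28]
komi = 0.5
rate_before = win_rate(scores, komi)     # every simulation is a win for Black
komi = adjust_komi(scores, komi)         # komi moves to the average margin
rate_after = win_rate(scores, komi)      # outcomes now separate better from worse lines
```

With the sample scores, every simulation counts as a win at komi 0.5, whereas after the adjustment only the simulations with an above-average margin count as wins, which brings the winning rate of the root back toward 50%.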

2.3 Upper Confidence Bounds Applied to Trees (UCT)

Upper Confidence bounds applied to Trees (UCT) (Kocsis and Szepesvári, 2006) is the extension of the UCB1 strategy (Auer et al., 2002) to minimax tree search. The deterministic UCB1 policy was designed to solve the Multi-Armed Bandit problem (Auer et al., 1995) and ensures that the optimal machine is played exponentially more often than any other machine, uniformly, when the rewards are in [0,1].

In MCTS, UCT mainly serves as a selection function in the first stage and, in general, can be viewed as a special case of MCTS. Under the formulation of UCT, the selection in each node is similar to the Multi-Armed Bandit problem (Coquelin and Munos, 2007). It aims to find the best move while keeping the balance between the exploration and exploitation of all moves. MOGO was the first Go-playing program that successfully applied UCT (Gelly et al., 2006).

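The strategy of UCT is to choose a child node which maximizes the selection formula (2.1):

$$ v_{uct} + C_{uct} \sqrt{\frac{\ln N}{n}} \qquad (2.1) $$

where $v_{uct}$ is the mean simulation outcome of the child node, $n$ is the visit count of the child node, $N$ is the visit count of its parent node, and $C_{uct}$ is a constant to be tuned. The latter part of the formula (2.1) is usually called the “exploration term”, as it serves the purpose of balancing exploration and exploitation.

5 The author of PACHI, Petr Baudiš, described his successful implementation of dynamic komi in the draft of his paper “Balancing MCTS by Dynamically Adjusting Komi Value”.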

The strategy of UCT was very quickly found to be impractical for the game of Go, because it requires that each child node be visited at least once. Even on a 9×9 board, the branching factor, 81 for the node representing the empty position, is still too large for such a complete search. To remedy this flaw, Rapid Action Value Estimation (RAVE) was proposed (Gelly and Silver, 2007; Gelly and Silver, 2011).
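The cost of this requirement can be made concrete with a short sketch. It assumes the common UCB1 form of the selection value (mean outcome plus an exploration term) and the usual convention that an unvisited child has an infinite value; the function names and the constant `c_uct=1.0` are hypothetical. Even if the first child wins every simulation, the first 81 selections at the empty 9×9 position are spent on 81 different children.

```python
import math

def ucb1(wins, visits, parent_visits, c_uct=1.0):
    # An unvisited child must be tried before any visited one.
    if visits == 0:
        return math.inf
    return wins / visits + c_uct * math.sqrt(math.log(parent_visits) / visits)

def select_child(stats, parent_visits, c_uct=1.0):
    """stats is a list of [wins, visits] pairs; returns the index of the
    child with the largest selection value (first one on ties)."""
    return max(range(len(stats)),
               key=lambda i: ucb1(stats[i][0], stats[i][1], parent_visits, c_uct))

# 81 children of the empty 9x9 position; pretend every simulation is a win.
stats = [[0.0, 0] for _ in range(81)]
picked = []
for t in range(81):
    i = select_child(stats, parent_visits=max(t, 1))
    picked.append(i)
    stats[i][0] += 1.0
    stats[i][1] += 1
```

After this loop, `picked` contains each of the 81 children exactly once, so no child has yet received a second simulation.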

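RAVE is a variant of the AMAF (All-Moves-As-First) heuristic (Brügmann, 1993; Helmbold and Wood, 2009) that updates, at a node, all the moves which were played in the tree and in the simulation after the position represented by that node. The strategy of RAVE is to choose a child node which maximizes the selection formula (2.2):

$$ v_{rave} + C_{rave} \sqrt{\frac{\ln N_{rave}}{n_{rave}}} \qquad (2.2) $$

where $v_{rave}$ and $n_{rave}$ are the RAVE value and the RAVE count of the child node, $N_{rave}$ is the total RAVE count at the parent node, and $C_{rave}$ is a constant to be tuned.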

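Blending UCT with RAVE, the strategy UCT-RAVE is to choose a node which maximizes the selection formula (2.3):

$$ \mathit{Coefficient} \times \left( v_{rave} + C_{rave} \sqrt{\frac{\ln N_{rave}}{n_{rave}}} \right) + (1 - \mathit{Coefficient}) \times \left( v_{uct} + C_{uct} \sqrt{\frac{\ln N}{n}} \right) \qquad (2.3) $$

where the first bracket is the RAVE value with its exploration term, the second bracket is the UCT value with its exploration term, and Coefficient is the weight of RAVE (Gelly and Silver, 2007; Silver, 2009).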
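The blending can be sketched as a weighted average of the two values. The sketch below makes several assumptions that are not stated in the text: the exploration terms are omitted, the weight follows a schedule of the form sqrt(k / (3n + k)), where n is the visit count and k is a tunable constant, and the names `rave_weight`, `uct_rave_value` and the default `k=1000` are hypothetical.

```python
import math

def rave_weight(n, k=1000):
    """Assumed schedule for Coefficient: 1 at n = 0, decaying toward 0 as n grows."""
    return math.sqrt(k / (3 * n + k))

def uct_rave_value(v_uct, v_rave, n, k=1000):
    """Weighted average of the RAVE value and the UCT value (no exploration terms)."""
    coefficient = rave_weight(n, k)
    return coefficient * v_rave + (1 - coefficient) * v_uct
```

Under this schedule a node with no visits relies entirely on its RAVE value, and at n = k visits the two values are weighted equally; as n grows further, the UCT value dominates.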

In the past few years, much effort has been devoted to improving the selection function based on the strategy UCT-RAVE. Some new ideas and the various selection functions adopted by different Go-playing programs are listed as follows.

1. Chaslot et al. proposed two progressive strategies for the selection stage and measured a significant improvement from 25% to 58% (200 games) with their program MANGO against GNU GO 3.7.10 on the 13×13 board (Chaslot et al., 2007). The first strategy is progressive unpruning, also called progressive widening (Coulom, 2007), which gradually unprunes the child nodes according to their scores computed by the selection function. The other strategy is progressive bias, realized as an independent term appended to the selection formula, which aims to direct the search according to time-expensive heuristic knowledge.

2. Chaslot et al. presented a selection formula combining online learning (bandit module), transient learning (RAVE values), expert knowledge and offline pattern information (Chaslot et al., 2009), which is used in their program MOGO.

3. Silver, in his Ph.D. dissertation, based on experiments on MOGO (Silver, 2009), suggested removing the exploration terms of both UCT and RAVE, namely setting C_uct and C_rave to 0.

4. Rosin proposed a new algorithm PUCB under the assumption that contextual side information is available at the start of the episode (Rosin, 2010).

5. Tesauro et al. proposed a Bayesian framework for MCTS that allows potentially much more accurate (Bayes-optimal) estimation of node values and node uncertainties from a limited number of simulation trials (Tesauro et al., 2010).

6. THE MANY FACES OF GO is using the formula (2.4) in collaboration with

8. PEBBLES is using the formula (2.6) (Sheppard, 2011):

where beta is set according to Silver’s dissertation (Silver, 2009). Both qUCT and qRAVE incorporate exploration terms from the Beta distribution (Stogin et al., 2010).

9. PACHI is using a formula similar to that of AYA, except that C_uct and C_rave are set to 0 (Baudis, 2011). The “even game prior” is used to set v_uct to 0.5 at n playouts, where n can be between 7 and 40. Another important prior is the “playout policy hinter”, which uses the same heuristics (and code) as the playout policy to pick good tree moves.
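One way to read the “even game prior” of item 9 is as n virtual playouts with an average outcome of 0.5 that are credited to a node when it is created. The sketch below follows this reading, which is an assumption and not taken from PACHI's code; the class `NodeStats` and its parameters are hypothetical.

```python
class NodeStats:
    def __init__(self, prior_playouts=0, prior_value=0.5):
        # Start as if `prior_playouts` simulations with mean `prior_value` were seen.
        self.visits = prior_playouts
        self.wins = prior_value * prior_playouts

    def update(self, outcome):
        # outcome is 1 for a win and 0 for a loss of the simulation.
        self.visits += 1
        self.wins += outcome

    def value(self):
        return self.wins / self.visits if self.visits else 0.5

with_prior = NodeStats(prior_playouts=10)
without_prior = NodeStats()
with_prior.update(1)
without_prior.update(1)
```

After a single winning simulation, the node without a prior jumps to a value of 1.0, whereas the node with ten virtual playouts moves only to 6/11, so a few lucky or unlucky simulations have less influence on the selection.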

2.4 State-of-the-Art Go-Playing Programs

In this section, we survey some state-of-the-art Go-playing programs as well as their contributions.

2.4.1 Crazy Stone

CRAZY STONE was created by Rémi Coulom, the inventor of MCTS (Coulom, 2006), which has been regarded as the most significant contribution to computer Go in recent years. At the 2006 Computer Olympiad, CRAZY STONE demonstrated the usefulness and effectiveness of MCTS with an overwhelming victory in the 9×9 Go tournament. In this tournament, CRAZY STONE defeated many senior Go-playing programs such as GNU GO, GOKING and JIMMY, and tied with AYA and GO INTELLECT. In the first UEC Cup in 2007, CRAZY STONE won the exhibition match against Kaori Aoba 4p with 7 handicap stones. This game was described as “very beautiful”. Figure 2.7 shows the final position of this game. CRAZY STONE finally killed White’s entire big group⁶ in the center (marked by ×) and secured a solid win.

The second great contribution of Coulom is the supervised learning algorithm named Minorization-Maximization (MM) for computing the Elo ratings of move patterns (Coulom, 2007), which will be further investigated in Chapter 4. This learning algorithm is still used by some of the top-level Go-playing programs, such as ZEN and AYA.

Figure 2.7: The final position of the exhibition match: Kaori Aoba 4p (White) vs. CRAZY STONE (Black), with 7 handicap stones, in the first UEC Cup, 2007. Black won by resignation.

6 A group consists of one or more loosely connected strings.

Currently, Coulom is again working on CRAZY STONE after a suspension of about 2 years. At present, CRAZY STONE is rated 4-dan on the KGS Go Server (KGS) on the 19×19 board, running on a 24-core machine (account CrazyStone, retrieved on 2011-07-14 T12:12:42+08:00⁷), and has reached a Bayes-Elo rating (Coulom, 2010) of 2914 on the Computer Go Server (CGOS) on the 9×9 board (account bonobot, retrieved on 2011-07-14 T12:15:34+08:00).

2.4.2 MOGO

MOGO was originally created by Yizao Wang and Sylvain Gelly, supervised by Rémi Munos. Olivier Teytaud took the lead of the “MOGO team” after Yizao Wang and Sylvain Gelly left. There are several important contributions from the MOGO team.

The first and greatest contribution is applying UCT (Kocsis and Szepesvári, 2006), which was invented by Kocsis and Szepesvári independently and at the same time as Coulom’s MCTS, to computer Go (Gelly et al., 2006). It is widely held that the contributions of CRAZY STONE and MOGO together enabled Monte Carlo Go programs to become competitive with, and then stronger than, the strongest traditional Go-playing programs, such as HANDTALK, THE MANY FACES OF GO and GO INTELLECT.

The second contribution of the MOGO team lies in the Monte Carlo part. The earliest creators of MOGO, mainly Sylvain Gelly and Yizao Wang, designed a sequence-like simulation (Gelly et al., 2006) that still has a dominant influence on almost all the current strong Go-playing programs. This sequence-like simulation was further improved by expert knowledge, such as nakade, and by heuristics such as “Fill the board” (Chaslot et al., 2009). Figure 2.8 gives an example of sequence-like simulation.

7 Presented in ISO 8601 date format.

Figure 2.8: An example of sequence-like simulation proposed by the MOGO team, cited from the paper “Modification of UCT with Patterns in Monte Carlo Go”.

The main principle of such sequence-like simulation is to choose the present move as a response to the previous move played by the opponent. Two important responses to the previous move are “save a string by capturing” and “save a string by extending”. “Save a string by capturing” means to save the string put in atari by the previous move by capturing a directly neighboring opponent string. “Save a string by extending” means to save the string put in atari by the previous move by extending its liberties. Figure 2.9 gives an example of these responses.

The most powerful part of the sequence-like simulation is the consideration of the 3×3 patterns around the previous move. It is generally stated that the 3×3 patterns designed by Yizao Wang, together with RAVE, were the major factors that made MOGO clearly the strongest Go-playing program in the first half of 2007.

This handcrafted, sequence-like simulation policy was improved by offline reinforcement learning from games of self-play (Gelly and Silver, 2007). Gelly and Silver reported that this generated policy outperformed both the random policy and the handcrafted policy by a margin of over 90%.

The third big idea of the MOGO team is RAVE (Gelly and Silver, 2007), which is a variant of the AMAF (All-Moves-As-First) heuristic (Brügmann, 1993; Helmbold and Wood, 2009). Presently, RAVE is reported to be utilized in almost every strong Go-playing program. Some authors even reported that RAVE boosts the playing
