KGS Games of ERICA - New Heuristics for Monte Carlo Tree Search Applied to the Game of Go

Chapter 3 ERICA

3.3 KGS Games of ERICA

Starting December 13, 2010, ERICA played on the KGS Go Server (KGS) using the account EricaBot, running on a 4-core 3.07 GHz CPU. With a short time setting of 10×00:15 (10 byo-yomi periods of 15 seconds each), it was rated 3-dan at the beginning and about 3.75-dan in June 2010, as shown in Figure 3.11.

Search path: Node 1 → move A → Node 2 → move B → Node 3 → move C → Node 4 → move D → Node 5 → move E → Node 6.

Suppose:

A: Black captures a ko.

B: White plays a ko threat.

C: Black responds to the ko threat made by B.

D: White re-captures the ko.

E: Black plays a ko threat.

Then do not update the RAVE value of move E in Node 1.

Figure 3.11: The KGS Rank Graph for EricaBot.

Figure 3.12 shows a 19×19 game between EricaBot (White) and a 2-dan human player BOThater36. In this game, ERICA captured the center group by ladder-atari (move 120) and won. This game shows that ERICA is a solid 3-dan player and features moderate opening play on the 19×19 board.

Figure 3.12: A 19×19 ranked game on KGS: EricaBot 3-dan (White) vs. BOThater36 2-dan (Black). White won by resignation.

Figure 3.13 shows a 9×9 game between Erica9 (White) and a 5-dan human player guxxan. In this game, ERICA used a classical killing technique (moves 32 and 34) to kill the Black group in the top-left corner. This game shows that ERICA is already a solid high-dan player on the 9×9 board.

Figure 3.13: A 9×9 game on KGS: Erica9 (White) vs. guxxan 5-dan (Black). White won by resignation.

Chapter 4 Monte Carlo Simulation Balancing Applied to 9×9 Go

4.1 Introduction

Monte Carlo evaluation of a position depends on the choice of a probability distribution over legal moves. A uniform distribution is the simplest choice, but produces poor evaluations. It is often better to play good moves with a higher probability, and bad moves with a lower probability. Playout policy has a large influence on the playing strength. Several methods have been proposed to optimize it.

The simplest approach to policy optimization is trial and error. Some knowledge is implemented in playouts, and its effect on the playing strength is estimated by measuring the winning rate against other programs (Bouzy, 2005; Gelly et al., 2006; Chen and Chang, 2008; Chaslot et al., 2009). This approach is often slow and costly, because measuring the winning rate by playing games takes a large amount of time, and many trials fail. It is difficult to guess what change in the playout policy will make the program stronger, because making playouts play better often causes the Monte Carlo program to become weaker (Bouzy and Chaslot, 2006; Gelly and Silver, 2007).

In order to avoid the difficulties of crafting a playout policy manually, some authors tried to establish principles for automatic optimization. We mention two of them. First, it is possible to directly optimize numerical parameters with generic stochastic optimization algorithms such as the cross-entropy method (Chaslot et al., 2008). Such a method may work for a few parameters, but it still suffers from the rather high cost of measuring strength by playing games against some opponents. This cost may be overcome by methods such as reinforcement learning (Bouzy and Chaslot, 2006; Gelly and Silver, 2007; Silver and Tesauro, 2009), or supervised learning from good moves collected from game records (Coulom, 2007). Supervised learning from game records has been quite successful, and is used in some top-level Go programs such as ZEN and CRAZY STONE.

Second, among the reinforcement-learning approaches to playout optimization, a recent method is simulation balancing (SB) (Silver and Tesauro, 2009). It consists in tuning continuous parameters of the playout policy in order to match some target evaluation over a set of positions. This target evaluation is determined by an expert. For instance, it may be obtained by letting a strong program analyze positions quite deeply. Experiments reported by Silver and Tesauro indicate that this method is promising: they measured a 200-point Elo improvement over previous approaches.

The SB experiments were promising, but not completely convincing, because they were not run in a realistic setting. They were limited to 2×2 patterns of stone configurations, on the 5×5 and 6×6 Go boards. Moreover, they relied on a much stronger program, FUEGO (Enzenberger and Müller, 2009), that was used to evaluate positions of the training database. Anderson (2009) failed to replicate the success of SB for 9×9 Go, but may have had bugs, because he did not improve much over uniform-random playouts. So, it was not clear whether this idea could be applied successfully to a state-of-the-art program.

This chapter presents the successful application of SB to ERICA, a state-of-the-art Monte Carlo program. Experiments were run on the 9×9 board. The training set was made of positions evaluated by ERICA herself, so this learning method does not require any external expert supervisor. Experimental results demonstrate that SB made the program stronger than its previous version, where patterns were trained by minorization-maximization (MM) (Coulom, 2007). Besides a rise in playing strength, a second interesting result is that pattern weights computed by MM and SB are quite different from each other. For instance, SB may favor some rather bad shapes, which MM evaluates quite badly, but which help playouts arrive at a correct outcome.

4.2 Description of Algorithms

This section is a brief reminder of the MM (Coulom, 2007) and SB (Silver and Tesauro, 2009) algorithms. More details about these algorithms can be found in the references.

4.2.1 Softmax Policy

Both MM and SB optimize linear parameters of a Boltzmann softmax policy, which was introduced in Section 3.2.3.1. The objective of the learning algorithms is to find a good value for the parameter vector θ.
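As a reminder of what such a policy computes, the following is a minimal sketch, not ERICA's implementation. It assumes the usual Boltzmann form, in which the probability of a move is proportional to the exponential of the sum of the weights of its active binary features. The function names, the dictionary layout, and the feature names are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax_policy(theta, features_per_move):
    """Return a dict move -> probability.

    theta: dict feature id -> weight (assumed linear parameters).
    features_per_move: dict move -> list of active (binary) feature ids.
    """
    scores = {m: sum(theta[f] for f in feats)
              for m, feats in features_per_move.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

def sample_move(probs, rng=random):
    """Draw one move according to the policy probabilities."""
    moves = list(probs)
    return rng.choices(moves, weights=[probs[m] for m in moves], k=1)[0]

# Hypothetical two-move position: one move has a feature with weight log(3),
# the other a neutral feature with weight 0.
example_theta = {"capture": math.log(3.0), "neutral": 0.0}
example_probs = softmax_policy(example_theta,
                               {"a": ["capture"], "b": ["neutral"]})
```

In this toy position the move with the heavier feature is three times as likely as the other one, which is what a weight of log(3) means under this form.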

4.2.2 Supervised Learning with MM

MM learns feature weights by supervised learning over a database of sample moves (Coulom, 2007). MM is a maximization algorithm for computing maximum-a-posteriori values of θ, given a prior distribution and sample moves. The principle of this algorithm dates back to at least Zermelo (1929). Its formulation and convergence properties were studied recently in a more general case by Hunter (2004).

When learning with MM, the training set is typically made of moves extracted from game records of strong players. It may also be made of self-play games if no expert game records are available.
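The kind of update MM performs can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Coulom (2007) in full: it omits the prior, treats each candidate move as a "team" whose strength is the product of the strengths of its features, and assumes each feature appears at most once per team. The function name `mm_sweep` and the data layout are hypothetical.

```python
from math import prod

def mm_sweep(gammas, competitions):
    """One sequential minorization-maximization sweep (no prior).

    gammas: dict feature id -> positive strength, updated in place.
    competitions: list of (teams, winner_index); each team is a tuple of
    feature ids describing one candidate move, and winner_index points
    at the move that was actually played.
    """
    for i in list(gammas):
        wins = 0
        denom = 0.0
        for teams, winner in competitions:
            if i in teams[winner]:
                wins += 1
            strengths = [prod(gammas[f] for f in team) for team in teams]
            total = sum(strengths)  # strength of all candidates together
            # Strength of the team-mates of i, summed over teams containing i.
            mates = sum(s / gammas[i]
                        for s, team in zip(strengths, teams) if i in team)
            denom += mates / total
        if wins > 0 and denom > 0.0:
            gammas[i] = wins / denom
    return gammas

# Toy data: feature "A" is chosen over feature "B" in 3 of 4 positions.
toy = [((("A",), ("B",)), 0)] * 3 + [((("A",), ("B",)), 1)]
strengths = {"A": 1.0, "B": 1.0}
for _ in range(50):
    mm_sweep(strengths, toy)
```

On this toy data the ratio of the two strengths approaches 3, the maximum-likelihood value for a move chosen three times out of four.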

4.2.3 Policy-Gradient Simulation Balancing (SB)

SB does not learn from examples of good moves, but from a set of evaluated positions. This training set may be made of random positions evaluated by a strong program, or a human expert. Feature weights are trained so that the average of playout outcomes matches the target evaluation given in the training set. Silver and Tesauro (2009) proposed two such algorithms: policy-gradient simulation balancing and two-step simulation balancing. We chose to implement policy-gradient simulation balancing only, because it is simpler and produced better results in the experiments by Silver and Tesauro.

The principle of policy-gradient simulation balancing consists in minimizing the quadratic evaluation error by steepest gradient descent. Estimates of the gradient are obtained from sampled playouts. In the algorithm, z is the outcome of one playout, from the point of view of the player who made action a_1 (+1 for a win, -1 for a loss, for instance). s_i and a_i are successive states and actions in a playout of T moves. M and N are integer parameters of the algorithm. V and g are multiplied in the update of θ, so they must be evaluated in two separate loops, in order to obtain two independent estimates.
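The two-loop structure described above can be sketched on a toy problem. This is a hedged illustration, not ERICA's code: it follows the update published by Silver and Tesauro (2009), θ ← θ + α (V* − V) g, with ψ(s, a) = φ(s, a) − Σ_b π_θ(s, b) φ(s, b), and these symbols (α, V*, ψ, φ) are assumptions taken from that reference rather than from this section. The "game" is hypothetical: a single position, one-move playouts (T = 1), two actions with one-hot features, where action 0 always wins and action 1 always loses.

```python
import math
import random

def policy(theta):
    """Softmax over two actions with one-hot features."""
    weights = [math.exp(t) for t in theta]
    total = sum(weights)
    return [w / total for w in weights]

def playout(theta, rng):
    """One-move toy playout. Returns (z, psi) where z is +1 or -1 and
    psi = phi(a) - sum_b pi(b) * phi(b) for the sampled action a."""
    pi = policy(theta)
    a = 0 if rng.random() < pi[0] else 1
    z = 1.0 if a == 0 else -1.0
    psi = [(1.0 if k == a else 0.0) - pi[k] for k in range(2)]
    return z, psi

def sb_step(theta, target, M, N, alpha, rng):
    """One policy-gradient simulation-balancing update for one position."""
    # First loop: M playouts estimate the current evaluation V.
    v = sum(playout(theta, rng)[0] for _ in range(M)) / M
    # Second loop: N independent playouts estimate the gradient g.
    g = [0.0, 0.0]
    for _ in range(N):
        z, psi = playout(theta, rng)
        for k in range(2):
            g[k] += z * psi[k] / N  # playout length T is 1 in this toy
    return [theta[k] + alpha * (target - v) * g[k] for k in range(2)]

def expected_value(theta):
    """Exact playout value of the toy position under the current policy."""
    pi = policy(theta)
    return pi[0] - pi[1]

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(300):
    theta = sb_step(theta, target=0.5, M=100, N=100, alpha=0.5, rng=rng)
```

After training, the exact value of the toy position should sit near the target of 0.5. In the real algorithm the starting position of each step is drawn from the training set, and playouts last many moves, so ψ is summed over the moves of each playout.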

4.3 Experiments

Experiments were run with the Go-playing program ERICA. The SB algorithm was applied repeatedly with different parameter values, in order to measure their effects. Playing strength was estimated with matches against FUEGO. The result of applying SB is compared to MM, both in terms of playing strength and feature weights.

4.3.1 ERICA

ERICA is developed by the author in the framework of his Ph.D. research. More details of ERICA can be found in Chapter 3.

4.3.2 Playout Features

This subsection and the remainder of this chapter use Go jargon that may not be familiar to some readers. Explanations for all items of the Go-related vocabulary can be found in the Sensei’s Library web site (http://senseis.xmp.net/). Still, it should be possible to understand the main ideas of this chapter without understanding that vocabulary.

The playouts of ERICA are based on 3×3 stone patterns, augmented by the atari status of the four directly connected points. These patterns are centred on the move to be played. By taking rotations, symmetries, and move legality into consideration, there is a total of 2,051 such patterns. In addition to stone patterns, ERICA uses 7 features related to the previous move (examples are given in Figure 4.1).

1. Contiguous to the previous move. Active if the candidate move is among the 8 neighbouring points of the previous move. Also active for all Features 2–7.

2. Save the string in new atari, by capturing. The candidate move that is able to save the string in new atari by capturing has this feature.

3. Same as Feature 2, which is also self-atari. If the candidate move has Feature 2 but is also a self-atari, then instead it has Feature 3.

4. Save the string in new atari, by extending. The candidate move that is able to save the string in new atari by extending has this feature.

5. Same as Feature 4, which is also self-atari.

6. Solve a new ko by capturing. If there is a new ko, then the candidate move that is able to solve the ko by capturing any one of the neighbouring strings has this feature.

7. 2-point semeai. If the previous move reduces the liberties of a string to only two, then the candidate move that gives atari to its neighbouring string which has no way to escape has this feature. This feature deals with the most basic type of semeai.

Figure 4.1: Examples of Features 2,3,4,5,6 and 7. Previous move is marked with a dot.
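The geometric part of Feature 1 is simple enough to sketch. The snippet below is an illustration only: the coordinate convention (row, column tuples) and the function name are hypothetical, and it covers only the 8-neighbourhood test, not the rule that Feature 1 is also active whenever one of Features 2–7 is.

```python
def is_contiguous(candidate, previous):
    """True if candidate is one of the 8 points surrounding the previous move.

    candidate, previous: (row, column) tuples; previous may be None when
    there is no previous move on the board (assumed convention).
    """
    if previous is None:
        return False
    row_gap = abs(candidate[0] - previous[0])
    col_gap = abs(candidate[1] - previous[1])
    # Chebyshev distance of exactly 1 covers the 4 adjacent and 4 diagonal points.
    return max(row_gap, col_gap) == 1
```

A point two columns away, or the previous move's own point, does not have the feature under this test.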

4.3.3 Experimental Setting

The performances of MM and SB were measured by the winning rate of ERICA against FUEGO 0.4 with 3,000 playouts per move for both programs. In the empty position, ERICA ran 6,200 playouts per second, whereas FUEGO ran 7,200 playouts per second. For reference, the performances of the uniform random playout policy and the MM policy are shown in Table 4.1.

Table 4.1: Reference results against FUEGO 0.4, 1,000 games, 9×9, 3k playouts/move

For fairness, MM and SB were both trained with the same features described above. The training of MM was accomplished within a day, performed on 1,400,000 positions chosen from 150,000 19×19 game records by strong players. The games were KGS games collected from the web site of Kombilo (Goertz and Shubert, 2007), combined with professional games collected from the web2go web site (Lin, 2009).

The production of the training data and the training process of SB were accomplished through ERICA without any external program. The training positions were randomly selected from games self-played by ERICA with 3,000 playouts per move. Then ERICA, with playout parameters determined by MM, was used directly to evaluate these positions. It took over three days to complete merely the production and evaluation of the training positions. From this viewpoint, SB training costs much more time than MM.

The 9×9 positions were also used to measure the performance of MM in a situation equivalent to that of SB. The same 5k positions that served as the training set of SB were used to train MM and compute the patterns. The strength of these patterns was measured and is shown in Table 4.1 as 9×9 MM.

4.3.4 Results and Influence of Meta-Parameters

SB has a few meta-parameters that need tuning. For the gradient-descent part, it is necessary to choose M, N, and α. Two other parameters define how the training set was built: the number of positions, and the number of playouts for each position evaluation.

Table 4.2 summarizes the experimental results with these parameters.

Table 4.2: Experimental results. The winning rate was measured over 1,000 games against FUEGO 0.4, with 3,000 playouts per move. The 95% confidence interval is ±3.1 when the winning rate is close to 50%, and ±2.5 when it is close to 80%.
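The interval widths quoted in the caption can be checked with a short calculation. The sketch below assumes the usual normal approximation to the binomial distribution; the source does not state which method was used, but the numbers agree. The function name is hypothetical.

```python
import math

def half_width_95(win_rate, games):
    """Half-width of a 95% confidence interval for a winning rate,
    using the normal approximation 1.96 * sqrt(p * (1 - p) / n)."""
    return 1.96 * math.sqrt(win_rate * (1.0 - win_rate) / games)

# Values in percentage points for 1,000 games.
near_50 = round(100.0 * half_width_95(0.5, 1000), 1)
near_80 = round(100.0 * half_width_95(0.8, 1000), 1)
```

With 1,000 games this gives 3.1 points at a 50% winning rate and 2.5 points at 80%, matching the caption.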

Since the algorithm is random, it would have been better to replicate each experiment more than once, in order to measure the effect of randomness. Unlike MM, SB has no guarantee of finding the global optimum, and may get stuck at a bad local optimum. Because of limited computer resources, we preferred trying many parameter values rather than replicating experiments with the same parameters.

In the original algorithm, simulations with outcome 0 are ignored when the N simulations are performed to accumulate the gradient. The algorithm can be safely modified to use outcomes -1/1 and replace z by (z - b), where b is the average reward, to make the 0/1 and -1/1 cases equivalent (Silver, 2009). The results in the 1st and 4th columns of Table 4.2 show that learning with outcomes -1/1 was much faster than with 0/1: the winning rate with outcomes -1/1 at Iteration 20 (69.2%) was even higher than that with outcomes 0/1 at Iteration 100 (63.9%). This is an indication that -1/1 might be better than 0/1, but more replications would be necessary to draw a general conclusion.
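One way to see why subtracting the average reward makes the two outcome conventions equivalent is the following short derivation. It is an interpretation, not taken from the source: primes denote the -1/1 convention, V* denotes the target evaluation, and ψ denotes the per-move gradient term of the policy, all of which are notational assumptions.

```latex
% Map 0/1 outcomes to -1/1 outcomes: z' = 2z - 1, hence b' = 2b - 1.
\begin{align*}
  z' - b' &= (2z - 1) - (2b - 1) = 2\,(z - b), \\
  V^{*\prime} - V' &= (2V^{*} - 1) - (2V - 1) = 2\,(V^{*} - V), \\
  (V^{*\prime} - V')\,(z' - b')\,\psi &= 4\,(V^{*} - V)\,(z - b)\,\psi .
\end{align*}
```

Under this reading, the two conventions produce the same update direction and differ only by a constant factor of 4, which can be absorbed into the step size. Without the baseline b, a playout with z = 0 contributes nothing to the gradient, which is why such playouts are ignored in the original 0/1 form.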

The SB algorithm was designed to reduce the mean squared error (MSE) over the whole training set by stochastic gradient descent. As a result, the MSE should gradually decrease if the training is performed on the same training set over and over again. Running the SB algorithm through the whole training set once is defined as an Iteration. Figure 4.2 shows that the measured MSE actually decreases.

Figure 4.2: Mean square error as a function of iteration number. M=N=500, α=10, training set has 5k positions evaluated by 100 playouts. The error was measured by 1,000 playouts for every position of the training set.

4.4 Comparison between MM and SB Feature Weights

For all comparisons, SB values that scored 77.9% against FUEGO 0.4 were used (60 iterations, fourth column of Table 4.2). Table 4.3 shows the γ-values of local features (γ_i = e^{θ_i} is a factor proportional to the probability that Feature i is played). Table 4.4 shows some interesting 3×3 patterns (top 10, bottom 10, top 10 without atari, and the 10 most different patterns). Local features (Table 4.3) show that SB plays tactical moves such as captures and extensions in a way that is much more deterministic than MM. A possible interpretation is that strong players may sometimes find subtle alternatives to those tactical moves, such as playing a move in sente elsewhere. But those considerations are far beyond what playouts can understand, so more deterministic captures and extensions may produce better Monte Carlo evaluations.

Table 4.3: Comparison of local features, between MM and SB

Pattern weights obtained by SB are quite different from those obtained by MM. Figure 4.3 shows that SB has a rather high density of neutral patterns. Observing individual patterns in Table 4.4 shows that patterns are sometimes ranked in a completely different order. Top patterns (first two lines) are all captures and extensions. Many of the top MM patterns are ko-fight patterns. Again, this is because those occur quite often in games by strong human experts. Resolving a ko fight is beyond the scope of this playout policy, so it is not likely that ko-fight patterns help the quality of playouts. Remarkably, all the best SB patterns, as well as all the worst SB patterns (line 3), are border patterns. That may be because the border is where most crucial life-and-death problems occur.

The bottom part of Table 4.4 shows the strangest differences between MM and SB. Lines 5 and 6 are top patterns without atari, and lines 7 and 8 are patterns with the highest difference in pattern rank. It is quite difficult to find convincing interpretations for most of them. Maybe the first pattern of line 7 (with SB rank 34) allows playouts to evaluate a dead 2×2 eye. After this move, White will probably reply with a nakade, thus evaluating this eye correctly. Patterns with SB ranks 40, 119, and 15 offer White a deserved eye. These are speculative interpretations, but they show the general idea: playing such ugly shapes may help playouts to evaluate life-and-death correctly.

4.5 Against GNU Go on the 9×9 Board

The SB patterns of Section 4.4 were also tested against GNU GO. To obtain statistically clearer observations, ERICA was set to play with 300 playouts per move to keep the winning rate as close to 50% as possible. The results presented in Table 4.5 indicate that SB performs almost identically to MM. A possible reason is that progressive bias still has a dominant influence in guiding the UCT search within 300 playouts. Also, it is a usual observation that improvement against GNU GO is often much less than improvement against other Monte Carlo programs.

Table 4.4: 3×3 patterns. A triangle indicates a stone in atari. Black to move.

Table 4.5: Results against GNU GO 3.8 Level 10, 1,000 games, 9×9, 300 playouts/move

4.6 Playing Strength on the 19×19 Board

The comparison between MM and SB was also carried out on the 19×19 board by playing against GNU GO 3.8 Level 0 with 1,000 playouts per move. Although the foregoing experiments confirm that SB surpasses MM on the 9×9 board under almost every setting of M, N, and α, MM is still more effective on the 19×19 board. In Table 4.6, the original SB scored only 33.2% with the patterns whose winning rate was 77.9% on the 9×9 board. Even if the γ-values of all local features of SB are replaced by those of MM (MM and SB Hybrid), the playing strength still does not improve at all (33.4%). Nonetheless, the winning rate of SB rises to 41.2% if the γ-value of Feature 1 is manually multiplied by 4.46 (= (19×19)/(9×9)), which was empirically obtained from the experimental results. This clearly points out that patterns computed by SB on the 9×9 board are far from optimal on the 19×19 board.

For now, SB training is unlikely to be usable directly on the 19×19 board, because even the top Go-playing programs are still too weak to offer good evaluations of a 19×19 training set. However, good performance of SB on the 13×13 board can be expected.

Table 4.6: Results against GNU GO 3.8 Level 0, 500 games, 19×19, 1,000 playouts/move

4.7 Conclusions

Below we provide three conclusions.

(1) The experiments presented in this chapter demonstrate the good performance of SB on the 9×9 board. This is an important result for practitioners of Monte Carlo tree search, because previous results with this algorithm were limited to more artificial conditions.

(2) The results also demonstrate that SB gives high weights to some patterns with rather bad shape. This remains to be tested, but it indicates that SB pattern weights may not be appropriate for progressive bias. Also, learning opening patterns on the 19×19 board seems to be out of the reach of SB, so MM is likely to remain the learning algorithm of choice for progressive bias.

(3) The results of the experiments also indicate that SB has the potential to perform even better. Many improvements seem possible. We mention two of them.

(3a) Steepest descent is an extremely inefficient algorithm for stochastic function optimization. Cleverer algorithms may provide convergence that is an order of magnitude faster (Schraudolph, 1999), without having to choose meta-parameters.

(3b) It would be possible to improve the training set. Using many more positions would probably reduce risks of overfitting, and may produce better pattern weights. It may also be a good idea to try to improve the quality of evaluations by cross-checking values with a variety of different programs, or by incorporating positions evaluated by a human expert.

Chapter 5 Time Management for Monte Carlo Tree Search Applied to the Game of Go

5.1 Introduction

One of the interesting aspects of MCTS that remains to be investigated is time management. In tournament play, the amount of thinking time for each player is limited. The simplest form of time control, called sudden death, consists in limiting the total amount of thinking time for the whole game. A player who uses more time than the allocated budget loses the game. More complicated time-control methods exist, like byo-yomi¹², but they will not be investigated in this chapter.

Sudden death is the simplest system, and the most often used in computer
