MCTS in ERICA - New Heuristics for Monte Carlo Tree Search Applied to the Game of Go

Chapter 3 ERICA

3.2 MCTS in ERICA

This section describes the implementation of MCTS in ERICA, along with some of our own ideas. Note that these ideas might be re-inventions: there are many open-source Go-playing programs whose code we might have overlooked, not to mention the programs whose source code is unavailable.

3.2.1 Selection

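The selection formula of ERICA combines the strategies of UCT, RAVE and progressive bias. The move that maximizes formula (3.1) is selected:

$$(1 - \mathit{Coefficient}) \left( v_{uct} + C_{uct} \sqrt{\frac{\ln t_{uct}}{n_{uct}}} \right) + \mathit{Coefficient} \cdot v_{rave} + \mathit{progressive\_bias} \tag{3.1}$$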

where Coefficient is the weight of RAVE computed by Silver’s formula (Silver, 2009), and the exploration term of RAVE is removed, as Silver suggested. For the exploration term of UCT, C_uct is set to 0.6 for all board sizes.

The term progressive_bias is computed by formula (3.2):

$$\mathit{progressive\_bias} = \frac{C_{PB} \cdot v_{prior}}{n_{uct}} \tag{3.2}$$

where C_PB is a constant that has to be tuned empirically, n_uct, initialized to 1, is the visit count of this node, and v_prior is the prior value in [0,1]. When the search ends, the most visited candidate move in the root node is selected to play. For ERICA, a good value of C_PB on the 19×19 board is around 50. Note that good values of C_PB can vary across board sizes.
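The selection rule can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not confirm: that formula (3.1) weights the UCT value plus its exploration term by (1 − Coefficient) and the RAVE value by Coefficient, that progressive_bias equals C_PB · v_prior / n_uct, and that Coefficient follows the commonly cited schedule ñ / (n + ñ + 4·n·ñ·b²). The function names and the bias parameter `b` are hypothetical.

```python
import math

C_UCT = 0.6   # exploration constant quoted in the text (all board sizes)
C_PB = 50.0   # progressive-bias constant quoted in the text for 19x19


def rave_coefficient(n_uct, n_rave, b=0.1):
    """Weight of the RAVE value. Assumed schedule; b is a hypothetical bias."""
    if n_rave == 0:
        return 0.0
    return n_rave / (n_uct + n_rave + 4.0 * n_uct * n_rave * b * b)


def selection_score(v_uct, n_uct, t_uct, v_rave, n_rave, v_prior):
    """Score of one candidate move under the assumed form of (3.1) and (3.2).

    n_uct starts at 1 (as stated in the text), so no division by zero occurs;
    t_uct is the parent's visit count and is assumed to be at least 1.
    """
    coefficient = rave_coefficient(n_uct, n_rave)
    exploration = C_UCT * math.sqrt(math.log(t_uct) / n_uct)
    progressive_bias = C_PB * v_prior / n_uct
    return ((1.0 - coefficient) * (v_uct + exploration)
            + coefficient * v_rave
            + progressive_bias)


def select_child(children, t_uct):
    """Pick the child (a dict of statistics) with the highest score."""
    return max(children, key=lambda c: selection_score(
        c["v_uct"], c["n_uct"], t_uct, c["v_rave"], c["n_rave"], c["v_prior"]))


def final_move(children):
    """After the search, the most visited root child is played."""
    return max(children, key=lambda c: c["n_uct"])
```

With these assumed forms, the prior dominates while a node has few visits and fades as n_uct grows, which is the intended effect of progressive bias.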

3.2.2 Expansion

ERICA uses delayed node creation (a node is expanded on its nth visit, n > 1) to reduce the memory overhead caused by RAVE, as explained in Section 2.2.2. For ERICA, a good value of n on the 19×19 board is around 5. At node creation, the prior is computed from various features, partly listed in (Coulom, 2007), according to the pattern weights given by MM.
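Delayed node creation can be sketched as follows. This is only an illustration under simple assumptions: the `Node` class, its fields and the `visit` method are hypothetical, and the prior computation performed at creation time is left out.

```python
EXPANSION_THRESHOLD = 5  # the value of n quoted in the text for 19x19


class Node:
    """Hypothetical tree node that creates its children lazily."""

    def __init__(self):
        self.visits = 0
        self.children = None  # no memory is spent on children yet

    def visit(self, legal_moves):
        """Count one visit; create the children on the nth visit.

        Returns True if the node is expanded after this visit.
        """
        self.visits += 1
        if self.children is None and self.visits >= EXPANSION_THRESHOLD:
            # Priors for the new children would be computed here.
            self.children = {move: Node() for move in legal_moves}
        return self.children is not None
```

Until the threshold is reached, a simulation passing through the node continues directly with the playout policy, so rarely visited leaves never allocate child statistics.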

3.2.2.1 Larger Patterns

For ERICA, the first and foremost feature in prior computation on the 19×19 board is larger diamond-shaped patterns (Stern et al., 2006). First, larger patterns of up to size 9 (as defined in (Stern et al., 2006)) are harvested from the game records according to their frequency of appearance. Then these patterns are trained by MM together with the other features that participate in prior computation. In ERICA, larger patterns are used only in progressive bias, not in the playout. The improvement from larger patterns is measured to be over 100 Elo.

3.2.2.2 Other Features

Other useful features for the 19×19 board include, for instance, ladders, distance features (distance to the previous move, distance to the move before the previous move, Common Fate Graph (CFG) distance (Graepel et al., 2001), etc.) and various tactical features of semeai and life-and-death, such as “save a string by capturing”.

3.2.3 Simulation

3.2.3.1 Boltzmann Softmax Playout Policy

In the simulation stage, ERICA uses a Boltzmann softmax playout policy (usually called softmax policy or Gibbs sampling (Geman and Geman, 1984)). The softmax policy was first applied to a Monte Carlo Go program in (Bouzy and Chaslot, 2006), where the moves were called pseudo-random moves: they are generated by a domain-dependent approach that uses a non-uniform probability. In the experiments of Bouzy and Chaslot, only 3×3 patterns and one-liberty urgency served as features. This scheme of pseudo-random moves drawn from a non-uniform probability distribution was further improved and extended to multiple features by Coulom (Coulom, 2007).

The softmax policy _ is defined by the probability of choosing action a in

where Φ(s, a) is a vector of binary features, and θ is a vector of feature weights.

To explain the softmax policy, Figure 3.5 gives an example of a position in the playout, Black to move. The previous move is marked by ∆. For Black, now the only legal moves are A, B, C and D¹⁰.

Figure 3.5: An example of a position in the playout. The previous move is marked by ∆. Black to move.

Suppose there are two binary features (for simplicity, $e^{\theta_i}$ is denoted by $\gamma_i$):

1. Contiguous to the previous move. A candidate move that directly neighbors the previous move has this feature. The weight of this feature is $\gamma_1$. Points A, B and C have this feature.

2. Save the string put in atari by the previous move by extending. The weight of this feature is $\gamma_2$. Point A has this feature.

Then the weight of each move is:

A: weight = $\gamma_1 \gamma_2$
B: weight = $\gamma_1$
C: weight = $\gamma_1$
D: weight = 1, with no corresponding feature.

10 In ERICA, an empty point that fills a real eye, such as E, is also regarded as an illegal move, though it is legal according to the Go rules. Forbidding “filling a real eye” in the playout is common in current Monte Carlo Go-playing programs.

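Consequently, the probability of choosing each move is the weight of that move divided by the sum of the weights of all four moves:

$$P(A) = \frac{\gamma_1 \gamma_2}{\gamma_1 \gamma_2 + 2\gamma_1 + 1}, \quad P(B) = P(C) = \frac{\gamma_1}{\gamma_1 \gamma_2 + 2\gamma_1 + 1}, \quad P(D) = \frac{1}{\gamma_1 \gamma_2 + 2\gamma_1 + 1}$$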
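The computation in this example can be checked with a short Python sketch of the softmax policy over binary features. The numeric weights are hypothetical values chosen only to make the arithmetic easy to follow (γ₁ = 2, γ₂ = 5); they are not ERICA's trained weights, and the function name is an assumption.

```python
import math


def softmax_policy(active_features, theta):
    """Return the move probabilities of the softmax policy.

    active_features maps each move to the indices of its active binary
    features; theta holds one weight per feature. The gamma of a move is
    exp(sum of the weights of its active features), i.e. the product of
    the gammas of those features.
    """
    gammas = {move: math.exp(sum(theta[i] for i in indices))
              for move, indices in active_features.items()}
    total_gamma = sum(gammas.values())
    return {move: gamma / total_gamma for move, gamma in gammas.items()}


# Hypothetical weights: gamma_1 = 2 (contiguity), gamma_2 = 5 (save by extending).
theta = [math.log(2.0), math.log(5.0)]

# Feature assignment of the example: A has both features, B and C have
# the contiguity feature only, and D has none.
active_features = {"A": [0, 1], "B": [0], "C": [0], "D": []}

probabilities = softmax_policy(active_features, theta)
for move in sorted(probabilities):
    print(move, round(probabilities[move], 4))
```

With these values the move weights are 10, 2, 2 and 1, so A is chosen with probability 10/15 and D with probability 1/15.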

The move generator in the playout of ERICA is depicted by the pseudocode shown in Table 3.5. The details are explained as follows.

Table 3.5: Pseudocode of the move generator, MoveGenerator(), in the playout of ERICA.

ComputeLocalFeatures deals with the local features (the features related to the previous move, the move before the previous move, etc.) and updates the gammas of the local moves that have these features. 3×3 patterns and some of the local features of ERICA will be introduced in Section 4.3.2.

The move to be generated is decided in the “for loop”. First, if TotalGamma, the sum of the gammas of all the moves in this position, is equal to 0, Move is set to pass and returned immediately, since no move has a nonzero probability. Otherwise, ChooseMoveByProbability chooses a move by the softmax policy described in the previous section and assigns it to Move.

After Move is chosen, ForbiddenMove checks whether Move is forbidden, which means that it has a feature of zero weight. If Move is found to be forbidden, SetZeroGamma subtracts its gamma from TotalGamma and resets the gamma to zero.

The mechanism of ForbiddenMove is a compromise for the features which are too costly to incrementally update. Note that it is also possible to check the legality of Move in ForbiddenMove. The next section will give an example of ForbiddenMove.

After the check by ForbiddenMove, Move is passed (by reference) to ReplaceMove for further inspection. ReplaceMove is an extended version of ForbiddenMove in the sense that it not only checks whether Move is forbidden, but also replaces it with a better move if it is. Section 3.2.3.4 will give an example of ReplaceMove.

Outside the “for loop”, when Move is ready to be returned, RecoverMoves restores the gammas of the moves that were reset after being rejected by ForbiddenMove.
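The control flow described above can be sketched in Python. This is an illustrative reading of the prose, not the pseudocode of Table 3.5: the loop structure, the `PASS` marker, and the callables `is_forbidden` and `replace_move` (standing in for ForbiddenMove and ReplaceMove) are assumptions, and ComputeLocalFeatures is taken to have already updated `gammas`.

```python
import random

PASS = None  # hypothetical marker for a pass move


def choose_move_by_probability(gammas, total_gamma, rng):
    """Sample a move with probability gamma / total_gamma (roulette wheel)."""
    threshold = rng.random() * total_gamma
    running, last_positive = 0.0, PASS
    for move, gamma in gammas.items():
        if gamma <= 0.0:
            continue
        running += gamma
        last_positive = move
        if threshold < running:
            return move
    return last_positive  # guards against floating-point round-off


def move_generator(gammas, is_forbidden, replace_move, rng=random):
    """Return a playout move, or PASS if no move has a nonzero gamma."""
    total_gamma = sum(gammas.values())
    zeroed = {}  # gammas reset to zero, to be restored before returning
    move = PASS
    while total_gamma > 0.0:
        move = choose_move_by_probability(gammas, total_gamma, rng)
        if move is PASS:
            break
        if is_forbidden(move):
            # SetZeroGamma: remove the move from the distribution and retry.
            zeroed[move] = gammas[move]
            total_gamma -= gammas[move]
            gammas[move] = 0.0
            move = PASS
            continue
        move = replace_move(move)  # may substitute a better move
        break
    gammas.update(zeroed)  # RecoverMoves
    return move
```

Checking expensive features only for the sampled move, instead of keeping them incrementally up to date for every move, is the compromise that the ForbiddenMove mechanism makes.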

3.2.3.3 ForbiddenMove

Figure 3.6 gives an example of ForbiddenMove in ERICA’s playout move generator, Black to move. In this example, point A is forbidden because it is a self-atari of 9 stones, which is clearly a bad move. In ERICA, a self-atari move is not forbidden if it forms a nakade shape.

Figure 3.6: An example of ForbiddenMove, Black to move.

3.2.3.4 ReplaceMove

Figure 3.7 gives an example of ReplaceMove in ERICA’s playout move generator, Black to move. In this example, point A is forbidden and replaced with B by the rule “when filling a false eye, if there is a capturable group at one of the diagonal points, then capture the group instead of filling the false eye”.

Figure 3.7: An example of ReplaceMove, Black to move.

3.2.4 Backpropagation

In this section, we present two useful heuristics that improve the performance of RAVE. Section 3.2.4.1 presents the first heuristic, which biases RAVE updates by move distance. Section 3.2.4.2 presents the second heuristic, which fixes RAVE updates for ko threats.

3.2.4.1 Bias RAVE Updates by Move Distance

When updating the RAVE values in a node, the heuristic “Bias RAVE Updates by Move Distance” biases the simulation outcome according to how far from this node the updated move was played. The number of moves between this node and the updated move is defined as the distance of this move, denoted by d. The weight used to bias the simulation outcome is defined as the distance weight, denoted by w. If the simulation outcome is 1, then the updated outcome is 1 − d*w; if the simulation outcome is 0, then the updated outcome is 0 + d*w. Figure 3.8 gives an example.
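The update rule can be written as a small Python sketch. The function names and the layout of the RAVE statistics (a running sum and a count per move) are assumptions made for illustration, and the sketch assumes each move occurs only once in the sequence after the node.

```python
def biased_outcome(outcome, d, w):
    """Shift a 0/1 simulation outcome toward the opposite result by d * w."""
    if outcome == 1:
        return 1.0 - d * w
    return 0.0 + d * w


def update_rave(rave, moves_after_node, outcome, w=0.001):
    """Update the RAVE statistics of one node.

    rave maps a move to (sum of biased outcomes, number of updates);
    moves_after_node lists the moves played from this node onward, so the
    index of a move in the list is its distance d.
    """
    for d, move in enumerate(moves_after_node):
        total, count = rave.get(move, (0.0, 0))
        rave[move] = (total + biased_outcome(outcome, d, w), count + 1)
    return rave


# A win (outcome 1) after the moves A, B, C played from the node, w = 0.001.
rave = update_rave({}, ["A", "B", "C"], outcome=1, w=0.001)
for move in ["A", "B", "C"]:
    print(move, rave[move][0])
```

The farther a move was played from the node, the less its update counts as a full win or a full loss, which is how the move order enters the RAVE statistics.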

Search path: Node 1 → move A → Node 2 → move B → Node 3 → move C …
Suppose the simulation outcome is 1 and w = 0.001.
When updating the RAVE values of Node 1:
For move A, d = 0 and the updated outcome of A is 1 − 0*0.001 = 1.
For move B, d = 1 and the updated outcome of B is 1 − 1*0.001 = 0.999.
For move C, d = 2 and the updated outcome of C is 1 − 2*0.001 = 0.998.

Figure 3.8: An example of “Bias RAVE Updates by Move Distance”.

As far as we know, FUEGO was the first Go-playing program that proposed and used this idea¹¹. This heuristic brings move-sequence information into RAVE. It is worth about 50 Elo in our experiments.

11 The details of FUEGO’s approach to “Bias RAVE Updates by Move Distance” can be found in the FUEGO documentation on the official web site, http://fuego.sourceforge.net/.

3.2.4.2 Fix RAVE Updates for Ko Threats

Figure 3.9 illustrates a situation where this heuristic is applicable. The position is taken from a game played on the KGS Go Server (KGS) between ajahuang [6d] (White) and Zen19D [5d] (Black). The previous move (marked by ∆), played by ZEN, is clearly meaningless, although it is a sente move, or ko threat, that forces White to respond. The correct move at this moment is clearly A, namely to capture the ko. But why did ZEN play a ko threat before capturing the ko?

Figure 3.9: An example to show the occasion of the heuristic “Fix RAVE Updates for Ko Threats”: ajahuang [6d] (White) vs. Zen19D [5d] (Black). White won by resignation.

The problem (probably) stems from RAVE. It is an intrinsic problem of RAVE that, in the root node, the RAVE values of ko threats (such as the previous move marked by ∆) that were searched in the lower levels of the tree are also updated. But a ko threat is supposed to be played after a ko capture. So, in this example, the RAVE values of Black’s ko threats (such as the previous move marked by ∆) should not be updated in the root node. This is the main idea of the heuristic “Fix RAVE Updates for Ko Threats”. Figure 3.10 gives an example showing how this heuristic works in the tree. In this example, the RAVE value of move E in Node 1 is not updated because it is detected as a ko threat move of Node 5. This heuristic is worth about 30 Elo in our experiments.

Figure 3.10: An example of “Fix RAVE Updates for Ko Threats”.
