Additional Techniques - Application of Machine Learning to Contract Bridge Bidding

Beyond the model and the core algorithms introduced in the previous section, we adopt three additional techniques to improve performance and computational efficiency.

The first two techniques focus on improving the performance, and the last technique aims at improving computational efficiency.

Full Update.

In the proposed model, whenever a bidding sequence b is sampled from the UCB algorithms for an instance x, the reward r can be calculated from c[b[ℓ]], and the example ((x, b^k), b[k + 1], r) is formed to update the bidding nodes. A closer look shows that some additional examples can be calculated easily from b. In particular, the cost for calling PASS immediately after k bids can be calculated by c[b[k]], and the cost for selecting a terminal node with a bid label b can be calculated by c[b]. Thus, for each bidding node on the bidding sequence b, we can form additional examples from all the decisions whose reward can be calculated by the above analysis, and include those examples in updating the associated bidding nodes. Such an update scheme is called Full Update, as opposed to the original Single Update scheme in the proposed model.

Penetrative Update.

We consider the UCB algorithms to balance the need for exploration in the proposed model. In some ways, the UCB algorithms are not properly designed for the multi-layer model, and thus can lead to some caveats. For example, in the tree model, the number of instances that pass through a classifier in the top layer can be much larger than the number that pass through one in the bottom layer. Thus, when UCB puts the top-layer classifiers in the exploitation stage, the bottom-layer classifiers may still be in the exploration stage.

Even worse, if the classifiers in the top layers often result in an early PASS, the ones in the bottom layers might not receive enough examples, which results in worse learning performance.

To solve this problem, we consider a probabilistic “penetrative” scheme to continue bidding during training. That is, whenever a classifier predicts a bid that results in an early PASS, we select another bid and call the corresponding Update with some probability p. We require the selected bid not to be on a terminal node (i.e., not to result in an early PASS) and to have the highest UCB term. In other words, with some probability, we hope to generate longer (but good) bidding sequences b to help update the lower layers of the model in this Penetrative Update scheme. The scheme is related to the famous epsilon-greedy algorithm for the contextual bandit problem [18].
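The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the UCB terms are given as a dictionary, `is_terminal` marks every action that ends the bidding (PASS or a bid on a terminal node), and the function name and signature are hypothetical.

```python
import random


def penetrative_choice(ucb, is_terminal, chosen, p, rng=random):
    """Return the action to follow during training.

    ucb:         dict mapping each action to its UCB term
    is_terminal: dict mapping each action to True if it ends the bidding
    chosen:      the action originally predicted by the classifier
    p:           probability of penetrating past an early stop
    """
    if not is_terminal[chosen]:
        return chosen  # the bidding continues anyway
    # Candidates are the bids that keep the sequence going.
    candidates = [a for a in ucb if not is_terminal[a]]
    if candidates and rng.random() < p:
        # Pick the non-terminal bid with the highest UCB term.
        return max(candidates, key=ucb.get)
    return chosen
```

Setting p = 0 recovers the original behaviour, while larger values of p send more examples to the lower layers.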

Delayed Update.

We adopt the contextual bandit algorithms in our model, which were designed for the online scenario where examples arrive one by one. Even with the Sherman-Morrison formula, updating the internal w_m right after an example arrives requires O(d²) time, where d is the dimension of x. The updating step becomes the computational bottleneck of the algorithms. For efficiency, we consider a Delayed Update scheme that does not update w_m immediately after each example is formed, but waits until a pile of examples has been gathered. Experimental results in Chapter 4 will show that such a scheme substantially decreases the training time without loss of performance.
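One possible reading of this scheme is sketched below for a single ridge-regression learner. The class buffers incoming examples and recomputes the weight vector only when the pile is full. For simplicity it solves the ridge system directly once per pile instead of applying the Sherman-Morrison formula; that substitution, the class name, and the pure-Python solver are assumptions made for illustration only.

```python
def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


class DelayedRidge:
    """Ridge regression whose weights are refreshed once per pile."""

    def __init__(self, dim, lam, pile_size):
        # A starts as lam * I, so it stays positive definite for lam > 0.
        self.A = [[lam if i == j else 0.0 for j in range(dim)]
                  for i in range(dim)]
        self.b = [0.0] * dim
        self.w = [0.0] * dim
        self.pile = []
        self.pile_size = pile_size

    def add_example(self, x, r):
        self.pile.append((x, r))
        if len(self.pile) >= self.pile_size:
            self.flush()

    def flush(self):
        # Accumulate A += x x^T and b += r x for the whole pile, then solve.
        for x, r in self.pile:
            for i, xi in enumerate(x):
                self.b[i] += r * xi
                for j, xj in enumerate(x):
                    self.A[i][j] += xi * xj
        self.pile.clear()
        self.w = solve(self.A, self.b)
```

A pile size of 1 corresponds to updating after every example. Larger piles leave the weights stale between flushes, which is the trade-off the experiments in Chapter 4 examine.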

Chapter 4 Experiments

Next, we study the proposed model and compare it with the baseline and optimistic methods. In addition, we compare the model with a well-known computer bridge program, Wbridge5 [19], which has won the computer bridge championship for several years. A randomly generated data set of 100,000 instances (deals) is used in the experiments. We reserve 10,000 instances for validation and another 10,000 for testing, and leave the rest for training. We study two different representations for x: binary features and condensed features. The binary features are represented by a 52-dimensional binary vector, where each dimension represents the presence of the corresponding card. The condensed features contain two parts that are widely used in real-world bridge games and human-designed bidding systems: high card points (HCP) and the number of cards in each suit. HCP is a method for evaluating the round-winning power of a hand. It is calculated by summing up the values of the cards, which are defined as Ace = 4, King = 3, Queen = 2, Jack = 1, and 0 otherwise.

For both representations, a constant dimension is added to reflect the bias term.
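The two representations can be sketched for a single 13-card hand as follows. The card encoding (suit letter followed by rank letter, e.g. "SA" for the ace of spades) and the ordering of the dimensions are assumptions made for illustration. The resulting lengths, 53 and 6, agree with the dimensions reported in Table 4.1.

```python
SUITS = "SHDC"
RANKS = "AKQJT98765432"
HCP_VALUES = {"A": 4, "K": 3, "Q": 2, "J": 1}  # every other rank counts 0


def binary_features(hand):
    """52 presence indicators plus a constant 1 for the bias term."""
    cards = set(hand)
    return [1 if s + r in cards else 0 for s in SUITS for r in RANKS] + [1]


def condensed_features(hand):
    """HCP, the number of cards in each suit, and a constant 1."""
    hcp = sum(HCP_VALUES.get(card[1], 0) for card in hand)
    lengths = [sum(1 for card in hand if card[0] == s) for s in SUITS]
    return [hcp] + lengths + [1]


hand = ["SA", "SK", "SQ", "SJ", "ST", "H9", "H8", "H7",
        "D6", "D5", "D4", "C3", "C2"]
print(condensed_features(hand))  # [10, 5, 3, 3, 2, 1]
```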

Table 4.1: Results of baseline and optimistic methods

Method                                    Dimensions  Baseline  Optimistic
CSOSR - binary                            53          3.9659    2.5657
CSOSR - condensed                         6           3.8329    1.8985
CSTSR - binary                            53          3.9399    2.7270
CSTSR - condensed                         6           3.9428    2.7697
CSTSR - condensed + 2nd order expansion   21          3.8465    2.1106
CSTSR - condensed + 3rd order expansion   56          3.8272    1.9228
Wbridge5                                  N/A         2.9550    N/A

We obtain the cost vectors c from International Match Points (IMP). The IMP is an integer in {0, 1, ···, 24}, widely used for comparing the relative performance of two teams in real-world bridge games [20]. We obtain c by comparing the best possible contract of the deal to each contract and calculating the IMP, where a higher IMP indicates that the contract is farther from the best one and should suffer a higher cost. When transforming the costs to the rewards in the proposed model, we take 24 minus the cost as the reward to keep the rewards non-negative.¹
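The cost and reward computation can be illustrated with a small sketch. It uses the commonly published IMP scale, which should be checked against [20] before being relied upon. The function names and the dictionary of per-contract scores are hypothetical, and the scores in the usage example are illustrative values rather than data from the experiments.

```python
from bisect import bisect_right

# Lower bounds (in points) of the score differences worth 1, 2, ..., 24 IMPs,
# following the commonly published IMP scale (an assumption of this sketch).
IMP_THRESHOLDS = [20, 50, 90, 130, 170, 220, 270, 320, 370, 430, 500, 600,
                  750, 900, 1100, 1300, 1500, 1750, 2000, 2250, 2500, 3000,
                  3500, 4000]
MAX_IMP = 24


def imp(score_difference):
    """Convert an absolute score difference into IMPs (0 to 24)."""
    return bisect_right(IMP_THRESHOLDS, abs(score_difference))


def cost_vector(scores):
    """Cost of each contract: IMP gap to the best contract of the deal."""
    best = max(scores.values())
    return {contract: imp(best - s) for contract, s in scores.items()}


def rewards(costs):
    """Rewards are 24 minus the cost, which keeps them non-negative."""
    return {contract: MAX_IMP - c for contract, c in costs.items()}


scores = {"4S": 420, "2S": 170, "6S": -50}  # illustrative scores only
costs = cost_vector(scores)
print(costs)           # {'4S': 0, '2S': 6, '6S': 10}
print(rewards(costs))  # {'4S': 24, '2S': 18, '6S': 14}
```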

4.1 Baseline and Optimistic Methods

First, we present the performance of the baseline and the optimistic methods in Table 4.1. In CSOSR, SVM with the Gaussian kernel implemented with LIBSVM [21] is used as the base learner. In CSTSR, ridge regression is used as the base learner. Because SVM training is time-consuming, we only sub-sample 20,000 instances for CSOSR. For CSTSR with condensed features, we also extend its capability by considering simple polynomial expansion of the features. For parameters, we consider C ∈ {10⁰, 10} and γ ∈ {10⁻³, 10⁻², 10⁻¹, 10⁰} for CSOSR, and λ ∈ {10⁻⁶, 10⁻⁵, ···, 10} for CSTSR.

We choose the best parameters based on the validation set and report the average test cost in Table 4.1.
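The polynomial expansion of the condensed features mentioned above can be sketched as follows, assuming that the expansion consists of all monomials up to the given degree over the five raw condensed features (HCP and four suit lengths), with the degree-0 monomial acting as the bias term. This assumption is consistent with the dimensions 21 and 56 listed in Table 4.1, but the exact construction used in the experiments is not stated in the text.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_expansion(features, degree):
    """All monomials of total degree 0..degree over the given features."""
    expanded = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(features, d):
            expanded.append(prod(combo))  # the empty product is 1 (bias)
    return expanded


raw = [10, 5, 3, 3, 2]  # HCP and the four suit lengths of one hand
print(len(polynomial_expansion(raw, 2)))  # 21
print(len(polynomial_expansion(raw, 3)))  # 56
```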

Unsurprisingly, we find that the optimistic methods perform much better than their baseline counterparts. This justifies that the information held by both players is valuable, and that it is important to properly exchange information via bidding. In addition, note that the optimistic methods can often achieve lower test cost than the Wbridge5 software. This suggests that the human-designed bidding system within the computer software may have room for improvement. Comparing all the baseline methods, we see that the baseline CSTSR using the 2nd order expansion of the condensed features reaches decent performance with only 21 expanded features. Thus, we will use those features within the proposed model in the following experiments.

¹ One technical detail is that the cost vector c is generated by assuming that the player who can win more rounds for the contract is the declarer. We will discuss the effect at the end of this chapter.

[Figures: average cost per deal versus number of iterations (×10⁴)]

In Section 3.4, three techniques are proposed to improve the model. We first compare Full Update with Single Update. Figure 4.1 shows how the average validation cost varies with the number of iterations on a tree model with ℓ = 4, M = 5 coupled with ridge regression with λ = 10⁻³ and UCB1 with α ∈ {10, 100}. We can easily observe that Full Update outperforms Single Update, which justifies that the additional examples used for Full Update capture valuable information to make the cost estimation more precise. Thus, we adopt Full Update in all the following experiments.

Then, we compareD U with I U. Figure 4.2 shows how the average validation cost varies with the number of iterations on the same tree model used for Figure 4.1. ForD U, we consider piles of size

{10, 100, 1000} instances
