The Multi-Layer Bandit Model - Application of Machine Learning to Contract Bridge Bidding (機器學習於合約橋牌叫牌上之應用)

While each player’s decision making can be modelled by the contextual bandit problem, recall that our goal is to obtain a bidding strategy G that produces a sequence of decisions that satisfy bridge rules. We propose to represent each player’s decision making with layers of “bidding nodes” V. By structuring these nodes carefully, we can ensure that the bridge rules are all satisfied.

We define a bidding node V as a pair (b, g), where b ∈ B is called the bid label that V represents, and g is the bidding function subject to x and b^k. We propose to structure the bidding nodes as a tree with ℓ + 1 layers, where the first layer of the tree contains a single root node with the first bidding function g(x_n, ∅) and a special bid label b indicating the entering of the bidding stage. At each V, g is only allowed to predict PASS or something higher than b to satisfy the bridge rules. Every prediction of g connects V to a child bidding node V′ at the next layer such that the prediction equals the bid label of V′. To control the model complexity, only the lowest M predictions of g connect to non-terminal nodes. All other nodes are designated as terminal nodes, which contain a constant g that always predicts PASS. In addition, all nodes at layer ℓ + 1 are terminal nodes.

Since we form the nodes as a tree, each unique path from the root to V readily represents a bidding sequence b^k. Thus, the classifier g of V only needs to consider the cards x. We call such a structure the tree model, as illustrated in Figure 3.1(a) with ℓ = 3 and M = 2. A variant of the tree model can be obtained by combining the non-terminal nodes that represent the same bid label in each layer. The combination allows the nodes to share their data to learn a better g. We call the variant the layered model, as illustrated in Figure 3.1(b).
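The tree structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the integer encoding of bids (PASS = 0, contract bids 1..35), the names `BiddingNode`, `next_labels` and `build_tree`, and the choice to give a PASS at the root its own child node are ours, not taken from the original text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PASS = 0        # assumed encoding of the PASS call
NUM_BIDS = 35   # contract bids 1C..7NT encoded as 1..35 (assumed)


@dataclass
class BiddingNode:
    label: Optional[int]                 # bid label b; None marks the root
    terminal: bool = False               # a terminal node's g always predicts PASS
    children: Dict[int, "BiddingNode"] = field(default_factory=dict)


def next_labels(label: Optional[int]) -> List[int]:
    """Predictions of g that lead to a child node, lowest first."""
    # Assumption: a PASS at the root does not end the bidding, so it gets a child.
    start = PASS if label is None else label + 1
    return list(range(start, NUM_BIDS + 1))


def build_tree(label: Optional[int], layer: int, num_layers: int, M: int) -> BiddingNode:
    """Build the subtree rooted at a node with the given label (root is layer 1)."""
    node = BiddingNode(label)
    if layer == num_layers:              # every node at layer l + 1 is terminal
        node.terminal = True
        return node
    for rank, bid in enumerate(next_labels(label)):
        if rank < M:                     # the lowest M predictions stay non-terminal
            node.children[bid] = build_tree(bid, layer + 1, num_layers, M)
        else:                            # all other children are terminal
            node.children[bid] = BiddingNode(bid, terminal=True)
    return node


root = build_tree(None, layer=1, num_layers=4, M=2)   # l = 3, M = 2
```

A layered model could be derived from the same sketch by keying non-terminal nodes on (layer, label) so that nodes with the same bid label in a layer share one g.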

Given the model above, a bidding strategy G can be formed by first inputting x_n to g at the root node, following the prediction of g to another node that represents b[1] in the next layer, then inputting x_s to that node, and so on. The process ends when a PASS call is predicted by some g of a non-root node.
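As a rough illustration of how such a strategy walks the nodes, the following sketch alternates between the two hands until a non-root node predicts PASS. The `Node` class, the string bid names, and the toy rules inside each `predict` function are hypothetical placeholders for learned classifiers g.

```python
PASS = "PASS"   # bids are plain strings in this toy example (assumed encoding)


class Node:
    def __init__(self, label, predict, children=None):
        self.label = label                # bid label b (None for the root)
        self.predict = predict            # stands in for the bidding function g
        self.children = children or {}    # prediction -> child node


def bid_sequence(root, x_n, x_s):
    """Follow predictions from the root, alternating between the two hands."""
    hands = (x_n, x_s)
    node, sequence, turn = root, [], 0
    while True:
        bid = node.predict(hands[turn % 2])
        sequence.append(bid)
        if bid == PASS and node is not root:   # PASS by a non-root node ends bidding
            return sequence
        node = node.children[bid]
        turn += 1


# Toy tree: terminal nodes always predict PASS.
two_s = Node("2S", lambda x: PASS)
one_s = Node("1S", lambda x: "2S" if x["spades"] >= 3 else PASS, {"2S": two_s})
passed = Node(PASS, lambda x: PASS)
root = Node(None, lambda x: "1S" if x["spades"] >= 5 else PASS,
            {"1S": one_s, PASS: passed})

print(bid_sequence(root, {"spades": 5}, {"spades": 3}))
```

Because terminal nodes always predict PASS, the loop is guaranteed to stop once it reaches one.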

Figure 3.1: Tree model and layered model; the terminal nodes are not fully drawn

After a particular model structure is decided, the remaining task becomes learning each g from data. We propose using CSTSR with ridge regression, which is among the baseline methods that we have studied, as the learning algorithm, because it is a core part of the LinUCB algorithm that we adopt from the contextual bandit problem. Following the notations that are commonly used in the contextual bandit problem, we consider the reward r, which is defined as the maximum possible cost minus the cost, instead of the cost c. For each possible bid b_m, ridge regression is used to compute a weight vector w_m for estimating the potential reward w_m^T x of making the bid. During prediction, CSTSR predicts with the bid associated with the maximum potential reward. The computation of w_m takes

$$w_m = (X_m^T X_m + \lambda I)^{-1} (X_m^T r_m),$$

where r_m contains all the rewards gathered when the m-th bid b_m is made by g and X_m contains all the x associated with those rewards. λ > 0 is the regularization parameter of ridge regression and I is the identity matrix.
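The closed-form ridge solution and the resulting prediction rule can be checked numerically with a small NumPy sketch. The function names, the toy data, and the value `max_cost = 5.0` are illustrative assumptions; the linear solve is used in place of an explicit inverse but computes the same w_m.

```python
import numpy as np


def ridge_weights(X_m, r_m, lam=1.0):
    """w_m = (X_m^T X_m + lam * I)^{-1} X_m^T r_m, computed with a linear solve."""
    d = X_m.shape[1]
    return np.linalg.solve(X_m.T @ X_m + lam * np.eye(d), X_m.T @ r_m)


def cstsr_predict(weights, x):
    """Return the bid whose estimated reward w_m^T x is largest."""
    return max(weights, key=lambda bid: weights[bid] @ x)


# Toy data for one bid: reward = maximum possible cost - observed cost.
max_cost = 5.0                       # hypothetical maximum possible cost
costs = np.array([3.0, 1.0])
r_m = max_cost - costs               # rewards [2.0, 4.0]
X_m = np.eye(2)                      # two feature vectors x, one per row

w_1s = ridge_weights(X_m, r_m, lam=1.0)
weights = {"1S": w_1s, "PASS": np.zeros(2)}
print(w_1s, cstsr_predict(weights, np.array([1.0, 1.0])))
```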

Algorithm 1 The Proposed Learning Algorithm

Input: Data D = {(x_{n_i}, x_{s_i}, c_i)}_{i=1}^N; a pre-defined model structure with all weights w_m within all CSTSR ridge regression classifiers initialized to 0.

Output: A bidding strategy G based on the learned w_m.

repeat

...

Select the bid b_m with the maximum UCB reward

if b_m = PASS and V is not root then

...

Our final task is to describe the learning algorithm for the model structure with ridge regression. As discussed, we use the cost of the final contract (i.e., the last bid) to form the rewards for intermediate bidding decisions. Then, we follow the UCB algorithms in the contextual bandit problem to update each node. The UCB algorithms assume an online learning scenario in which each x arrives one by one. First, we consider the LinUCB algorithm [15], which balances exploration and exploitation. During the training of each node, LinUCB selects the bid that maximizes

$$w_m^T x + \alpha \sqrt{x^T (X_m^T X_m + \lambda I)^{-1} x},$$

where the first term is the potential reward on which CSTSR relies, and the second term represents the uncertainty of x with respect to the m-th bid. The parameter α > 0 balances exploitation (of rewarding bids) against exploration (of uncertain bids).
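The LinUCB score for a single bid can be computed directly from the quantities defined above. The sketch below is a minimal, unoptimized illustration; the function name and the toy inputs are assumptions, and a practical implementation would likely maintain the inverse incrementally rather than recompute it.

```python
import numpy as np


def linucb_score(x, X_m, r_m, lam=1.0, alpha=1.0):
    """Potential reward w_m^T x plus alpha times the uncertainty of x for bid m."""
    d = x.shape[0]
    A_inv = np.linalg.inv(X_m.T @ X_m + lam * np.eye(d))
    w_m = A_inv @ (X_m.T @ r_m)                 # ridge regression weights
    return w_m @ x + alpha * np.sqrt(x @ A_inv @ x)


x = np.array([1.0, 1.0])

# A bid that has never been made: zero estimated reward, maximal uncertainty.
unseen = linucb_score(x, np.zeros((0, 2)), np.zeros(0))

# A bid with two recorded examples: higher estimated reward, lower uncertainty.
seen = linucb_score(x, np.eye(2), np.array([2.0, 4.0]))
print(unseen, seen)
```

With no data the score reduces to α√(x^T x / λ), which is why untried bids are initially attractive.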

After LinUCB selects the bid for the root node, we follow the bid to the bidding node in the next layer and repeat the selection there, until a PASS call is predicted by LinUCB. Then, we know the cost of the bidding sequence, and all the nodes on the bidding sequence path can be updated with the calculated rewards using ridge regression. The full algorithm is illustrated in Algorithm 1.

Another choice for the UCB algorithms is called UCB1 [17], which replaces the uncertainty term $\sqrt{\cdots}$ in LinUCB with $\sqrt{2 \ln(T) / T_m}$, where T is the number of examples used to learn the entire g, and T_m is the number of examples used to update w_m.
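The UCB1 uncertainty term depends only on counts, not on x, which a short sketch makes concrete. The function name is ours, and returning infinity when T_m = 0 (so that untried bids are selected first) is a common convention that we assume here; the text does not specify this case.

```python
import math


def ucb1_bonus(T, T_m):
    """sqrt(2 ln(T) / T_m); infinite for a bid that has never been updated (assumed)."""
    if T_m == 0:
        return math.inf
    return math.sqrt(2.0 * math.log(T) / T_m)


# The bonus shrinks as a bid accumulates examples while T is held fixed.
for T_m in (1, 10, 50):
    print(T_m, ucb1_bonus(100, T_m))
```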

In each iteration of Algorithm 1, we randomly select an instance x to satisfy the online nature of the UCB algorithms. Then, a bidding sequence is generated with a series of either LinUCB or UCB1 computations. Finally, all the nodes on the bidding sequence path are updated with the calculated rewards.
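The three steps just described (sample an instance, generate a bidding sequence with LinUCB, update every node on the path) might be sketched as follows. This is not a reconstruction of Algorithm 1: it assumes a layered-style model whose nodes are keyed by (layer, label), omits the restriction to the lowest M predictions, encodes PASS as 0, and assumes each cost c_i is an array indexed by the final contract. All names are hypothetical.

```python
import numpy as np

PASS = 0  # assumed encoding: PASS = 0, contract bids = 1..num_bids


class BanditNode:
    """Ridge statistics per bid: A_m = X_m^T X_m + lam*I and z_m = X_m^T r_m."""

    def __init__(self, bids, dim, lam):
        self.A = {b: lam * np.eye(dim) for b in bids}
        self.z = {b: np.zeros(dim) for b in bids}

    def select(self, x, alpha):
        def ucb(b):
            A_inv = np.linalg.inv(self.A[b])
            return (A_inv @ self.z[b]) @ x + alpha * np.sqrt(x @ A_inv @ x)
        return max(self.A, key=ucb)            # ties go to the lowest bid

    def update(self, bid, x, reward):
        self.A[bid] += np.outer(x, x)
        self.z[bid] += reward * x


def train_step(nodes, x_n, x_s, cost, max_cost, num_bids, num_layers,
               alpha=1.0, lam=1.0):
    """Generate one bidding sequence with LinUCB, then update the nodes on its path."""
    hands, label, path = (x_n, x_s), None, []
    for layer in range(1, num_layers):         # layer l + 1 holds only terminal nodes
        start = 1 if label is None else label + 1
        bids = [PASS] + list(range(start, num_bids + 1))
        x = hands[(layer - 1) % 2]             # the two hands alternate by layer
        key = (layer, label)
        if key not in nodes:
            nodes[key] = BanditNode(bids, x.shape[0], lam)
        bid = nodes[key].select(x, alpha)
        path.append((nodes[key], bid, x))
        if bid == PASS and label is not None:  # PASS by a non-root node ends bidding
            break
        label = bid
    contract = max(bid for _, bid, _ in path)  # bids only increase; 0 = passed out
    reward = max_cost - cost[contract]
    for node, bid, x in path:
        node.update(bid, x, reward)
    return [bid for _, bid, _ in path], reward


def train(X_n, X_s, C, num_layers=4, alpha=1.0, lam=1.0, iters=200, seed=0):
    """Repeat train_step on randomly drawn instances (one per iteration)."""
    rng = np.random.default_rng(seed)
    nodes, num_bids, max_cost = {}, C.shape[1] - 1, C.max()
    for _ in range(iters):
        i = rng.integers(len(C))
        train_step(nodes, X_n[i], X_s[i], C[i], max_cost, num_bids,
                   num_layers, alpha, lam)
    return nodes
```

If the loop reaches the last learnable layer without a PASS, the terminal node's PASS is implicit and the final contract is the last bid made. Swapping the uncertainty term inside `select` for the UCB1 term would give the UCB1 variant.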

The uncertainty term is the key component for making the UCB algorithms work. First, we initialize all w_m with zeros, and the uncertainty term is equally large for all possible bids. Therefore, the algorithm distributes instances to different bidding sequences somewhat randomly. Then, the uncertainty term decreases gradually after seeing more examples, which allows the reward term w_m^T x to dominate the decision process. This allows the algorithm to focus on rewarding bidding sequences to fine-tune the bidding decisions.
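A simple special case shows how the LinUCB uncertainty term shrinks. Assume, purely for illustration, that the same x has been used n times to update the m-th bid, so that X_m^T X_m = n x x^T. Applying the Sherman-Morrison formula then gives:

```latex
\begin{aligned}
(\lambda I + n\,x x^T)^{-1}
  &= \frac{1}{\lambda}\left(I - \frac{n\,x x^T}{\lambda + n\,\|x\|^2}\right), \\
x^T (\lambda I + n\,x x^T)^{-1} x
  &= \frac{1}{\lambda}\left(\|x\|^2 - \frac{n\,\|x\|^4}{\lambda + n\,\|x\|^2}\right)
   = \frac{\|x\|^2}{\lambda + n\,\|x\|^2}, \\
\alpha \sqrt{x^T (\lambda I + n\,x x^T)^{-1} x}
  &= \alpha \sqrt{\frac{\|x\|^2}{\lambda + n\,\|x\|^2}}
  \;\longrightarrow\; 0 \quad \text{as } n \to \infty .
\end{aligned}
```

With n = 0 the term equals α‖x‖/√λ for every bid, matching the equally large initial uncertainty, and it decays at a rate of roughly 1/√n as examples accumulate.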
