Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards


Kuan-Hao Huang and Hsuan-Tien Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

{r03922062,htlin}@csie.ntu.edu.tw

Abstract. We study the contextual bandit problem with linear payoff function. In the traditional contextual bandit problem, the algorithm iteratively chooses an action based on the observed context, and immediately receives a reward for the chosen action. Motivated by a practical need in many applications, we study the design of algorithms under the piled-reward setting, where the rewards are received as a pile instead of immediately. We present how the Linear Upper Confidence Bound (LinUCB) algorithm for the traditional problem can be naïvely applied under the piled-reward setting, and prove its regret bound. Then, we extend LinUCB to a novel algorithm, called Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which digests the observed contexts to choose actions more strategically before the piled rewards are received. We prove that LinUCBPR can match LinUCB in the regret bound under the piled-reward setting. Experiments on artificial and real-world datasets demonstrate the strong performance of LinUCBPR in practice.

Keywords: contextual bandit, piled rewards, upper confidence bound

1 Introduction

We study the contextual bandit problem (CBP) [13], which is an interactive process between an algorithm and an environment. In the traditional CBP, the algorithm observes a context from the environment in each time step. Then, the algorithm is asked to strategically choose an action from the action set based on the context, and receives a corresponding feedback, called the reward, while the rewards for the other actions are hidden from the algorithm. The goal of the algorithm is to maximize the cumulative reward over all time steps.

Because only the reward of the chosen action is revealed, the algorithm needs to choose different actions to estimate their goodness, which is called exploration. On the other hand, the algorithm also needs to choose the better actions to maximize the reward, which is called exploitation. Balancing exploration and exploitation is arguably the most important issue in designing CBP algorithms.

ε-Greedy [3] and Linear Upper Confidence Bound (LinUCB) [10] are two representative algorithms for the CBP. ε-Greedy learns one model per action for exploitation and randomly explores different actions with a small probability ε.


LinUCB is based on online ridge regression, and takes the concept of the upper-confidence bound [5, 2] to strategically balance between exploration and exploitation. LinUCB enjoys a strong theoretical guarantee [5] and is state-of-the-art in many practical applications [10].

The traditional CBP setting assumes that the algorithm receives the reward immediately after choosing an action. In some practical applications, however, the environment cannot present the reward to the algorithm immediately. This work is motivated by one such application. Consider an online advertisement system operated by a contextual bandit algorithm. For each user visit (time step), the system (algorithm) receives the information of the user (context) from an ad exchange, and chooses an appropriate ad (action) to display to the user. In this application, the click from the user naturally acts as the reward of the action.

Nevertheless, to reduce the cost of communication, the ad exchange often does not reveal each individual reward immediately after the action is chosen. Instead, the ad exchange stores the individual rewards first, and only sends a pile of rewards back to the system once a sufficient number of rewards has been gathered. We call this scenario the contextual bandit problem under the piled-reward setting.

A related setting in the literature is the delayed-reward setting, where the reward is assumed to arrive several time steps after the algorithm chooses an action. Most existing works on the delayed-reward setting consider constant delays [6] and cannot be easily applied to the piled-reward setting. Several works [8, 9, 12] propose algorithms for bandit problems with arbitrarily-delayed rewards, but their algorithms are non-contextual. Thus, to the best of our knowledge, no existing work has carefully studied the CBP under the piled-reward setting.

In this paper, we study how LinUCB can be applied under the piled-reward setting. We present a naïve use of LinUCB for the setting and prove its theoretical guarantee in the form of the regret bound. The result helps us understand the difference between the traditional setting and the piled-reward setting. Then, we design a novel algorithm, Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which is a variant of LinUCB that allows more strategic use of the context information before the piled rewards are received. We prove that LinUCBPR can match the naïve LinUCB in its regret bound under the piled-reward setting. Experiments on artificial and real-world datasets demonstrate that LinUCBPR results in strong and stable performance in practice.

This paper is organized as follows. Section 2 formalizes the CBP with the piled-reward setting. Section 3 describes our design of LinUCB and LinUCBPR under the piled-reward setting. The theoretical guarantees of the algorithms are analyzed in Section 4. We discuss the experiment results in Section 5 and conclude in Section 6.

2 Preliminaries

We use a bold lower-case symbol like $u$ to denote a column vector, a bold upper-case symbol like $A$ to denote a matrix, $I_d$ to denote the $d \times d$ identity matrix, and $[K]$ to denote the set $\{1, 2, \ldots, K\}$.


We first introduce the CBP under the traditional setting. Let $T$ be the total number of rounds and $K$ be the number of actions. In each round $t \in [T]$, the algorithm observes a context $x_t \in \mathbb{R}^d$ with $\|x_t\|_2 \le 1$ from the environment. Upon observing the context $x_t$, the algorithm chooses an action $a_t$ from the $K$ actions based on the context. Right after choosing $a_t$, the algorithm receives a reward $r_{t,a_t}$ that corresponds to the context $x_t$ and the chosen action $a_t$, while the other rewards $r_{t,a}$ for $a \ne a_t$ are hidden from the algorithm. The goal of the algorithm is to maximize the cumulative reward after $T$ rounds.

Now, we introduce the CBP under the piled-reward setting. Instead of receiving the reward right after choosing an action (and thus right before observing the next context), the setting assumes that the rewards come as a pile after observing multiple contexts in a round. We extend our notation above to the piled-reward setting as follows. In each round $t$, the algorithm sequentially observes $n$ contexts $x_{t_1}, x_{t_2}, \ldots, x_{t_n} \in \mathbb{R}^d$ with $\|x_{t_i}\|_2 \le 1$ from the environment. For simplicity, we assume that $n$ is a fixed number, while all the technical results in this paper can easily be extended to the case where $n$ varies across rounds. We use $t_i$ to denote the $i$-th step in round $t$. For example, $3_5$ means the 5-th step in round 3. Upon observing the context $x_{t_i}$ in round $t$, the algorithm chooses an action $a_{t_i}$ from the $K$ actions based on the context, and observes the next context $x_{t_{i+1}}$. At the end of round $t$, the algorithm receives $n$ rewards $r_{t_1,a_{t_1}}, r_{t_2,a_{t_2}}, \ldots, r_{t_n,a_{t_n}}$ that correspond to $x_{t_i}$ and $a_{t_i}$, while the other rewards $r_{t_i,a}$ for $a \ne a_{t_i}$ are hidden from the algorithm. The goal is again to maximize the cumulative reward $\sum_{t=1}^{T}\sum_{i=1}^{n} r_{t_i,a_{t_i}}$ after $T$ rounds.

In other words, the piled-reward setting assumes that the contexts come at time steps $\{1_1, 1_2, \ldots, 1_n, 2_1, 2_2, \ldots, t_1, t_2, \ldots, T_{n-1}, T_n\}$, while the rewards come after every $n$ contexts as a pile. Note that the traditional setting is a special case of the piled-reward setting with $n = 1$.
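To make the piled-reward protocol concrete, the following minimal sketch simulates the interaction loop described above. It is an illustration only: `env` and `policy` are hypothetical interfaces that do not appear in the paper, where `policy` exposes a `choose` method and an `update_with_rewards` method.

```python
# A minimal sketch of the piled-reward interaction protocol (illustrative only).
# `env` and `policy` are hypothetical objects, not part of the paper.
def run_piled_reward(env, policy, T, n):
    total_reward = 0.0
    for t in range(T):                        # rounds t = 1, ..., T
        chosen = []                           # (context, action) pairs of this round
        for i in range(n):                    # n steps per round, no reward revealed yet
            x = env.next_context()            # observe x_{t_i}
            a = policy.choose(x)              # choose a_{t_i} from the K actions
            chosen.append((x, a))
        rewards = env.reveal_rewards(chosen)  # the pile of n rewards arrives at round end
        policy.update_with_rewards(chosen, rewards)
        total_reward += sum(rewards)
    return total_reward                       # the quantity the algorithm tries to maximize
```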

In this paper, we consider the CBP with linear payoff function. We assume that $r_{t_i,a}$ is connected with $x_{t_i}$ linearly through $K$ hidden weight vectors $u_1, u_2, \ldots, u_K \in \mathbb{R}^d$ with $\|u_a\|_2 \le 1$. That is, $\mathbb{E}[r_{t_i,a} \mid x_{t_i}] = x_{t_i}^\top u_a$. Let $a^*_{t_i} = \arg\max_{a \in [K]} x_{t_i}^\top u_a$ be the optimal action for $x_{t_i}$. We define the regret of an algorithm to be $\bigl(\sum_{t=1}^{T}\sum_{i=1}^{n} r_{t_i,a^*_{t_i}} - \sum_{t=1}^{T}\sum_{i=1}^{n} r_{t_i,a_{t_i}}\bigr)$. The goal of maximizing the cumulative reward is equivalent to minimizing the regret.

Linear Upper Confidence Bound (LinUCB) [5] is a state-of-the-art algorithm for the traditional CBP ($n = 1$). LinUCB maintains $K$ weight vectors $w_{t_1,1}, w_{t_1,2}, \ldots, w_{t_1,K}$ to estimate $u_1, u_2, \ldots, u_K$ at time step $t_1$. The $K$ weight vectors are calculated by ridge regression

$$w_{t_1,a} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Bigl( \|w\|^2 + \|X_{(t-1)_1,a}\, w - r_{(t-1)_1,a}\|^2 \Bigr), \qquad (1)$$

where $X_{(t-1)_1,a}$ is a matrix whose rows are the contexts $x_\tau^\top$ for the time steps $\tau$ before round $t$ with $a_\tau = a$, and $r_{(t-1)_1,a}$ is a column vector whose elements are the corresponding rewards for the contexts in $X_{(t-1)_1,a}$. Let $A_{t_1,a} = \bigl(I_d + X_{(t-1)_1,a}^\top X_{(t-1)_1,a}\bigr)$ and $b_{t_1,a} = X_{(t-1)_1,a}^\top r_{(t-1)_1,a}$. The solution to (1) is $w_{t_1,a} = A_{t_1,a}^{-1} b_{t_1,a}$.


When a new context $x_{t_1}$ comes, LinUCB calculates two terms for each action $a$: the estimated reward $\tilde{r}_{t_1,a} = x_{t_1}^\top w_{t_1,a}$ and the uncertainty $c_{t_1,a} = \sqrt{x_{t_1}^\top A_{t_1,a}^{-1} x_{t_1}}$, and chooses the action with the highest score $\tilde{r}_{t_1,a} + \alpha c_{t_1,a}$, where $\alpha$ is a trade-off parameter. After receiving the reward, LinUCB updates the weight vector $w_{t_1,a_{t_1}}$ immediately, and uses the new weight vector to choose the action for the next context. LinUCB conducts exploration when the chosen action is of high uncertainty. After sufficient (context, action, reward) information is received, $\tilde{r}_{t_1,a}$ shall be close to the expected reward, and $c_{t_1,a}$ will be smaller. Then, LinUCB conducts exploitation with the learned weight vectors to choose the action with the highest expected reward.
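As a concrete illustration of the procedure just described, here is a minimal numpy sketch of LinUCB for the traditional setting. The class and method names are ours (the paper only specifies the mathematics); the update follows equation (1), with $A_a = I_d + X^\top X$ and $b_a = X^\top r$.

```python
import numpy as np

# Minimal LinUCB sketch for the traditional setting (n = 1); illustrative only.
class LinUCB:
    def __init__(self, d, K, alpha):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(K)]     # A_a = I_d + X^T X
        self.b = [np.zeros(d) for _ in range(K)]   # b_a = X^T r

    def choose(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            w_a = A_inv @ b_a                      # ridge-regression solution of (1)
            r_hat = x @ w_a                        # estimated reward
            c = np.sqrt(x @ A_inv @ x)             # uncertainty
            scores.append(r_hat + self.alpha * c)  # upper-confidence score
        return int(np.argmax(scores))

    def update(self, x, a, reward):
        self.A[a] += np.outer(x, x)                # rank-one update with the new context
        self.b[a] += reward * x
```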

3 Proposed Algorithm

We first discuss how LinUCB can be naïvely applied under the piled-reward setting. Then, we extend LinUCB to a more general framework that utilizes the additional information within the contexts before the true rewards are received.

Since no rewards are received before the end of the current round $t$, the naïve LinUCB does not update the model during round $t$, and only takes the fixed $w_{t_1,a}$ and $A_{t_1,a}$ to calculate the estimated reward $\tilde{r}_{t_i,a}$ and the uncertainty $c_{t_i,a}$ for each action $a$. That is, LinUCB only updates $w_{t_1,a}$ before the beginning of round $t$ as the solution to (1), with $(X_{(t-1)_1,a}, r_{(t-1)_1,a})$ under the traditional setting replaced by $(X_{(t-1)_n,a}, r_{(t-1)_n,a})$ under the piled-reward setting. In addition, $A_{t_1,a}$ can be similarly defined from $X_{(t-1)_n,a}$ instead.

The naïve LinUCB can be viewed as a baseline upper confidence bound algorithm under the piled-reward setting. There is a possible drawback of the naïve LinUCB: if similar contexts come repeatedly in the same round, because $w_{t_i,a}$ and $A_{t_i,a}$ stay unchanged within the round, LinUCB will choose similar actions repeatedly. Then, if the chosen action suffers from low reward, LinUCB suffers from making the low-reward choice repeatedly before the end of the round.

The question is: can we do even better? Our idea is that the contexts $x_{t_i}$ received during round $t$ can be utilized to update the model before the rewards come. That is, at time step $t_i$, in addition to the labelled data (context, action, reward) gathered before time step $(t-1)_n$ that LinUCB uses, the unlabelled data (context, action) gathered at time steps $\{t_1, t_2, \ldots, t_{i-1}\}$ can also be included to learn a better model. In other words, we hope to design a semi-supervised learning scheme within round $t$ to guide the upper-confidence bound algorithm towards more strategic exploration within the round.

Our idea is motivated by the regret analysis. In Section 4, we will show that the regret of LinUCB under the piled-reward setting is bounded by the summation of $c_{t_i,a}$ over all time steps. Note that $c_{t_i,a}$ only depends on $x_{t_i}$ and $A_{t_i,a}$, but not on the reward. That is, upon receiving $x_{t_i}$ and choosing an action $a_{t_i}$, the term $c_{t_i,a_{t_i}}$ can readily be updated without the true reward. By updating $c_{t_i,a_{t_i}}$ within the round, the algorithm can explore different actions strategically instead of following similar actions when similar contexts come repeatedly in the same round.


Algorithm 1 LinUCBPR under the piled-reward setting
1: Parameter: $\alpha \in \mathbb{R}^+$
2: Initialize: $\hat{A}_{1_1,a} \leftarrow I_d$, $\hat{b}_{1_1,a} \leftarrow 0_{d \times 1}$, $\hat{w}_{1_1,a} \leftarrow \hat{A}_{1_1,a}^{-1} \hat{b}_{1_1,a}$
3: for $t = 1, 2, 3, \ldots, T$ do
4:   for $i = 1, 2, 3, \ldots, n$ do
5:     Observe $x_{t_i}$ and choose $a_{t_i} = \operatorname{argmax}_{a \in [K]} \bigl( x_{t_i}^\top \hat{w}_{t_i,a} + \alpha \sqrt{x_{t_i}^\top \hat{A}_{t_i,a}^{-1} x_{t_i}} \bigr)$
6:     Calculate the pseudo reward $p_{t_i,a_{t_i}}$
7:     $\hat{A}_{t_{i+1},a_{t_i}} \leftarrow \hat{A}_{t_i,a_{t_i}} + x_{t_i} x_{t_i}^\top$,  $\hat{b}_{t_{i+1},a_{t_i}} \leftarrow \hat{b}_{t_i,a_{t_i}} + x_{t_i} p_{t_i,a_{t_i}}$
8:     $\hat{w}_{t_{i+1},a_{t_i}} \leftarrow \hat{A}_{t_{i+1},a_{t_i}}^{-1} \hat{b}_{t_{i+1},a_{t_i}}$
9:   end for
10:  Receive rewards $r_{t_1,a_{t_1}}, r_{t_2,a_{t_2}}, \ldots, r_{t_n,a_{t_n}}$
11:  for $a \in [K]$ do
12:    $\hat{A}_{(t+1)_1,a} \leftarrow \hat{A}_{t_1,a} + \sum_{a_{t_i}=a} x_{t_i} x_{t_i}^\top$,  $\hat{b}_{(t+1)_1,a} \leftarrow \hat{b}_{t_1,a} + \sum_{a_{t_i}=a} x_{t_i} r_{t_i,a_{t_i}}$
13:    $\hat{w}_{(t+1)_1,a} \leftarrow \hat{A}_{(t+1)_1,a}^{-1} \hat{b}_{(t+1)_1,a}$
14:  end for
15: end for

This idea can be extended to the following framework. We propose to couple each context $x_{t_1}, x_{t_2}, \ldots, x_{t_{i-1}}$ with a pseudo reward $p_{\tau,a_\tau}$, where $\tau$ is the time step, before receiving the true reward $r_{\tau,a_\tau}$. The pseudo reward can then pretend to be the true reward and allow the algorithm to keep updating the model before the true rewards are received. Note that pseudo rewards have been used to speed up exploration in the traditional CBP [4], and can encourage more strategic exploration in our framework. We name the framework Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR). The framework updates the weight vector and the estimated covariance matrix by

$$\hat{w}_{t_i,a} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Bigl( \|w\|^2 + \|X_{(t-1)_n,a}\, w - r_{(t-1)_n,a}\|^2 + \|\hat{X}_{t_{i-1},a}\, w - p_{t_{i-1},a}\|^2 \Bigr), \qquad (2)$$

$$\hat{A}_{t_i,a} = I_d + X_{(t-1)_n,a}^\top X_{(t-1)_n,a} + \hat{X}_{t_{i-1},a}^\top \hat{X}_{t_{i-1},a}, \qquad (3)$$

where $\hat{X}_{t_{i-1},a}$ is a matrix whose rows are the contexts $x_\tau^\top$ with $t_1 \le \tau \le t_{i-1}$ and $a_\tau = a$, and $p_{t_{i-1},a}$ is a column vector with each element representing the corresponding pseudo reward for each context in $\hat{X}_{t_{i-1},a}$.

When receiving the true rewards at the end of round $t$, we discard the changes from the pseudo rewards, and use the true rewards to update the model again. We show the framework of LinUCBPR in Algorithm 1.

The only remaining task is to decide what $p_{\tau,a}$ should be. We study two variants. One is to use $p_{\tau,a} = \tilde{r}_{\tau,a}$, the estimated reward of the action; we name this variant LinUCBPR with estimated reward (LinUCBPR-ER). The other variant is to be even more aggressive and set $p_{\tau,a} = \tilde{r}_{\tau,a} - \beta c_{\tau,a}$, a lower-confidence bound of the reward, where $\beta$ is a trade-off parameter. The lower-confidence bound can be viewed as an underestimated reward, and should allow more exploration within the round, at the cost of more computation. We name this variant LinUCBPR with underestimated reward (LinUCBPR-UR).
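The following numpy sketch illustrates how the pseudo-reward updates of Algorithm 1 can be organized in code. It is our own illustration under the descriptions above: `beta = 0` corresponds to LinUCBPR-ER, `beta > 0` to LinUCBPR-UR, and the class and method names are not from the paper.

```python
import numpy as np

# Sketch of LinUCBPR under the piled-reward setting (cf. Algorithm 1); illustrative only.
class LinUCBPR:
    def __init__(self, d, K, alpha, beta=0.0):     # beta = 0: ER variant, beta > 0: UR variant
        self.alpha, self.beta = alpha, beta
        self.A = [np.eye(d) for _ in range(K)]     # statistics built from true rewards only
        self.b = [np.zeros(d) for _ in range(K)]
        self._start_round()

    def _start_round(self):
        # Within-round copies that also absorb pseudo rewards; discarded at round end.
        self.A_hat = [A.copy() for A in self.A]
        self.b_hat = [b.copy() for b in self.b]
        self.history = []                          # (context, action) pairs of this round

    def choose(self, x):
        scores = []
        for A_a, b_a in zip(self.A_hat, self.b_hat):
            A_inv = np.linalg.inv(A_a)
            scores.append(x @ (A_inv @ b_a) + self.alpha * np.sqrt(x @ A_inv @ x))
        a = int(np.argmax(scores))
        # Pseudo reward: estimated reward, optionally lowered by beta times the uncertainty.
        A_inv = np.linalg.inv(self.A_hat[a])
        p = x @ (A_inv @ self.b_hat[a]) - self.beta * np.sqrt(x @ A_inv @ x)
        self.A_hat[a] += np.outer(x, x)            # keep updating the model within the round
        self.b_hat[a] += p * x
        self.history.append((x, a))
        return a

    def end_round(self, rewards):
        # Discard the pseudo-reward changes; update with the true piled rewards instead.
        for (x, a), r in zip(self.history, rewards):
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x
        self._start_round()
```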


Algorithm 2 BaseLinUCB under the piled-reward setting at round $t$
1: Parameter: $\alpha \in \mathbb{R}^+$, $\Psi_t \subseteq \{1_1, 1_2, \ldots, (t-1)_n\}$
2: $\bar{A}_{t_1} \leftarrow I_{dK} + \sum_{\tau \in \Psi_t} \bar{x}_{\tau,a_\tau} \bar{x}_{\tau,a_\tau}^\top$,  $\bar{b}_{t_1} \leftarrow 0_{dK \times 1} + \sum_{\tau \in \Psi_t} \bar{x}_{\tau,a_\tau} r_{\tau,a_\tau}$,  $\bar{w}_{t_1} \leftarrow \bar{A}_{t_1}^{-1} \bar{b}_{t_1}$
3: for $i = 1, 2, 3, \ldots, n$ do
4:   Observe $x_{t_i}$ and calculate $\bar{x}_{t_i,1}, \bar{x}_{t_i,2}, \ldots, \bar{x}_{t_i,K}$
5:   for $a \in [K]$ do
6:     $\mathrm{width}_{t_i,a} \leftarrow (1+\alpha)\sqrt{\bar{x}_{t_i,a}^\top \bar{A}_{t_1}^{-1} \bar{x}_{t_i,a}}$
7:     $\mathrm{ucb}_{t_i,a} \leftarrow \bar{x}_{t_i,a}^\top \bar{w}_{t_1} + \mathrm{width}_{t_i,a}$
8:   end for
9: end for

4 Theoretical Analysis

In this section, we establish the theoretical guarantee for the regret bound of LinUCB and LinUCBPR-ER under the piled-reward setting. Similar to the analysis of LinUCB in the immediate-reward setting [5], there is a difficulty: the algorithms choose actions based on previous outcomes, so the rewards in each round are not independent random variables. To deal with this problem, we follow the approach of [5]. We modify each algorithm into a base algorithm that assumes independent rewards, and construct a master algorithm that ensures the assumption holds.

Note that [5] takes a CBP setting with one context per action instead of our setting of one context shared by all actions. To keep the notation consistent with [5], we simply cast our setting into theirs by the following steps. We define a $(dK)$-dimensional vector $\bar{u}$ to be the concatenation of $u_1, u_2, \ldots, u_K$, and define a $(dK)$-dimensional context $\bar{x}_{\tau,a}$ per action from $x_\tau$, where $\bar{x}_{\tau,a} = \bigl(0\ 0\ \cdots\ 0\ x_\tau^\top\ 0\ \cdots\ 0\bigr)^\top$ with $x_\tau$ being the $a$-th vector within the concatenation. All $\bar{X}_\tau$, $\bar{A}_\tau$, $\bar{r}_\tau$, $\bar{b}_\tau$, and $\bar{w}_\tau$ can be similarly defined from $\hat{X}_{\tau,a}$, $\hat{A}_{\tau,a}$, $r_{\tau,a}$, $\hat{b}_{\tau,a}$, and $\hat{w}_{\tau,a}$.
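The casting amounts to placing $x_\tau$ into the $a$-th block of a $(dK)$-dimensional vector. A small numpy helper (our own illustration, not from the paper) makes this explicit:

```python
import numpy as np

def lift_context(x, a, K):
    """Place the d-dimensional context x into the a-th block (0-indexed) of a (dK)-vector."""
    d = x.shape[0]
    x_bar = np.zeros(d * K)
    x_bar[a * d:(a + 1) * d] = x   # all other blocks stay zero
    return x_bar
```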

4.1 Regret for LinUCB under the piled-reward setting

Algorithm 2 lists the base algorithm for LinUCB under the piled-reward setting, called BaseLinUCB. We first prove the theoretical guarantee of BaseLinUCB.

Let $\bar{c}_{t_i,a} = \sqrt{\bar{x}_{t_i,a}^\top \bar{A}_{t_1}^{-1} \bar{x}_{t_i,a}}$. We can establish the following lemmas.

Lemma 1 (Chu et al. [5], Lemma 1). Suppose the input time step set $\Psi_t \subseteq \{1_1, 1_2, \ldots, (t-1)_n\}$ given to BaseLinUCB has the property that for fixed contexts $\bar{x}_{t_i,a}$ with $t_i \in \Psi_t$, the corresponding rewards $r_{t_i,a}$ are independent random variables with means $\bar{x}_{t_i,a}^\top \bar{u}$. Then, for some $\alpha = O(\sqrt{\ln(nTK/\delta)})$, we have with probability at least $1 - \delta/(nT)$ that
$$\bigl|\,\bar{x}_{t_i,a}^\top \bar{w}_{t_1} - \bar{x}_{t_i,a}^\top \bar{u}\,\bigr| \le (1+\alpha)\,\bar{c}_{t_i,a}.$$

Note that in Lemma 1, the bound is related to the time steps. We want the bound to be related to the rounds, and hence establish Lemma 2 and Lemma 3.


Lemma 2. Let $\psi_t$ be a subset of $\{t_1, t_2, \ldots, t_n\}$. Suppose $\Psi_{t+1} = \Psi_t \cup \psi_t$ in BaseLinUCB. Then, the eigenvalues of $\bar{A}_{t_1}$ and $\bar{A}_{(t+1)_1}$ can be arranged so that $\lambda_{t_1,j} \le \lambda_{(t+1)_1,j}$ for all $j$, and
$$\bar{c}_{t_i,a}^2 \le 10 \sum_{j=1}^{dK} \frac{\lambda_{(t+1)_1,j} - \lambda_{t_1,j}}{\lambda_{t_1,j}}.$$

Proof. The proof can be done by combining Lemma 2 and Lemma 8 in [5].

Lemma 3. Let $\Phi_{t+1} = \{t \mid t \in [T] \text{ and } \exists\, j \text{ such that } t_j \in \Psi_{t+1}\}$, and assume $|\Phi_{t+1}| \ge 2$. Then
$$\sum_{t_i \in \Psi_{t+1}} \bar{c}_{t_i,a} \le 5n\sqrt{dK\,|\Phi_{t+1}|\,\ln|\Phi_{t+1}|}.$$

Proof. By Lemma 2 and the technique in the proof of Lemma 3 in [5], we have
$$\sum_{t_i \in \Psi_{t+1}} \bar{c}_{t_i,a} \;\le\; \sum_{t_i \in \Psi_{t+1}} \sqrt{10 \sum_{j=1}^{dK} \frac{\lambda_{(t+1)_1,j} - \lambda_{t_1,j}}{\lambda_{t_1,j}}} \;\le\; \sum_{t \in \Phi_{t+1}} n \sqrt{10 \sum_{j=1}^{dK} \frac{\lambda_{(t+1)_1,j} - \lambda_{t_1,j}}{\lambda_{t_1,j}}} \;\le\; 5n\sqrt{dK\,|\Phi_{t+1}|\,\ln|\Phi_{t+1}|}.$$

We construct SupLinUCB for each round, similarly to [5]. Then, we borrow Lemma 14 and Lemma 15 of [2], and extend Lemma 16 of [2] to the following lemma.

Lemma 4. For each $s \in [S]$,
$$\bigl|\Psi_{T+1}^s\bigr| \le 5n \cdot 2^s (1+\alpha) \sqrt{dK\,\bigl|\Phi_{t+1}^s\bigr|\,\ln\bigl|\Phi_{t+1}^s\bigr|}.$$

Based on the lemmas, we can then establish the following theorem for the regret bound of LinUCB under the piled-reward setting.

Theorem 1. For some $\alpha = O(\sqrt{\ln(nTK/\delta)})$, with probability $1-\delta$, the regret of LinUCB under the piled-reward setting is $O\bigl(\sqrt{dn^2TK\ln^3(nTK/\delta)}\bigr)$.

Proof. Let $\Psi^0 = \{1_1, 1_2, \ldots, T_n\} \setminus \bigcup_{s \in [S]} \Psi_{T+1}^s$. Observing that $2^{-S} \le 1/\sqrt{T}$, given the previous lemmas and Jensen's inequality, we have
$$
\begin{aligned}
\text{Regret} &= \sum_{t=1}^{T}\sum_{i=1}^{n}\Bigl(\mathbb{E}\bigl[r_{t_i,a^*_{t_i}}\bigr] - \mathbb{E}\bigl[r_{t_i,a_{t_i}}\bigr]\Bigr)\\
&= \sum_{t_i \in \Psi^0}\Bigl(\mathbb{E}\bigl[r_{t_i,a^*_{t_i}}\bigr] - \mathbb{E}\bigl[r_{t_i,a_{t_i}}\bigr]\Bigr) + \sum_{s=1}^{S}\sum_{t_i \in \Psi_{T+1}^s}\Bigl(\mathbb{E}\bigl[r_{t_i,a^*_{t_i}}\bigr] - \mathbb{E}\bigl[r_{t_i,a_{t_i}}\bigr]\Bigr)\\
&\le \frac{2}{\sqrt{T}}\,\bigl|\Psi^0\bigr| + \sum_{s=1}^{S} 2^{3-s}\,\bigl|\Psi_{T+1}^s\bigr|\\
&\le \frac{2}{\sqrt{T}}\,\bigl|\Psi^0\bigr| + \sum_{s=1}^{S} 40n(1+\alpha)\sqrt{dK\,\bigl|\Phi_{t+1}^s\bigr|\,\ln\bigl|\Phi_{t+1}^s\bigr|}\\
&\le 2n\sqrt{T} + 40n(1+\alpha)\sqrt{dK\ln T}\sum_{s=1}^{S}\sqrt{\bigl|\Phi_{t+1}^s\bigr|}\\
&\le 2n\sqrt{T} + 40n(1+\alpha)\sqrt{dK\ln T}\,\sqrt{ST}.
\end{aligned}
$$


The rest of the proof is almost identical to the proof of Theorem 6 in [2]. By substituting $\alpha = O(\sqrt{\ln(nTK/\delta)})$, replacing $\delta$ with $\delta/((S+1)S)$, substituting $S = \ln(nT)$, and applying Azuma's inequality, we obtain Theorem 1.

Note that if we let $nT = C$ be a constant, the original regret bound under the traditional setting ($n = 1$) in [5] is $O\bigl(\sqrt{dCK\ln^3(CK/\delta)}\bigr)$, while the regret bound under the piled-reward setting is $O\bigl(\sqrt{dnCK\ln^3(CK/\delta)}\bigr)$, which is the original bound multiplied by $\sqrt{n}$.
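As a quick check of this claim, dividing the two bounds with $nT = C$ held fixed gives
$$\frac{\sqrt{d\,n\,C\,K\,\ln^3(CK/\delta)}}{\sqrt{d\,C\,K\,\ln^3(CK/\delta)}} = \sqrt{n},$$
so enlarging the pile size $n$ while keeping the total number of time steps fixed inflates the bound by exactly this $\sqrt{n}$ factor.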

4.2 Regret for LinUCBPR-ER under the piled-reward setting

We first prove two lemmas for LinUCBPR-ER.

Lemma 5. After updating with the context $x_{t_i}$ and the pseudo reward $p_{t_i,a} = \tilde{r}_{t_i,a}$, the estimated reward of LinUCBPR-ER stays the same. That is, $\tilde{r}_{t_{i+1},a} = \tilde{r}_{t_i,a}$.

Proof. Because $p_{t_i,a} = x_{t_i}^\top \hat{w}_{t_i,a} = \tilde{r}_{t_i,a}$, updating with $x_{t_i}$ and $p_{t_i,a}$ will not change $\hat{w}_{t_i,a}$. Thus the estimated reward stays the same.

Lemma 6. After updating with the context $x_{t_i}$ and the pseudo reward $p_{t_i,a} = \tilde{r}_{t_i,a}$, the uncertainty of LinUCBPR-ER for any context is non-increasing. That is, for any $x$,
$$\sqrt{x^\top \hat{A}_{t_{i+1},a}^{-1} x} \le \sqrt{x^\top \hat{A}_{t_i,a}^{-1} x}.$$

Proof. By the Sherman-Morrison formula, we have
$$x^\top \hat{A}_{t_{i+1},a}^{-1} x = x^\top \hat{A}_{t_i,a}^{-1} x - \frac{x^\top \hat{A}_{t_i,a}^{-1} x_{t_i}\, x_{t_i}^\top \hat{A}_{t_i,a}^{-1} x}{1 + x_{t_i}^\top \hat{A}_{t_i,a}^{-1} x_{t_i}} = x^\top \hat{A}_{t_i,a}^{-1} x - \frac{\bigl(x^\top \hat{A}_{t_i,a}^{-1} x_{t_i}\bigr)^2}{1 + x_{t_i}^\top \hat{A}_{t_i,a}^{-1} x_{t_i}}.$$
The second term is greater than or equal to zero, which implies the lemma.
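As a quick numerical sanity check (our own illustration, not part of the paper), the following snippet verifies that the rank-one update $\hat{A} + x_{t_i} x_{t_i}^\top$ never increases $\sqrt{x^\top \hat{A}^{-1} x}$ in any direction, matching the Sherman-Morrison computation above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
A = np.eye(d) + B @ B.T                 # a positive-definite matrix, like I_d + X^T X
x_new = rng.standard_normal(d)          # the context absorbed within the round
A_next = A + np.outer(x_new, x_new)     # rank-one update used by LinUCBPR-ER

for _ in range(1000):                   # uncertainty is non-increasing in every direction
    x = rng.standard_normal(d)
    assert np.sqrt(x @ np.linalg.inv(A_next) @ x) <= np.sqrt(x @ np.linalg.inv(A) @ x) + 1e-12
```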

Similarly, we can construct BaseLinUCBPR-ER and SupLinUCBPR-ER. By Lemma 5, we have that for each time step $t_i$ in round $t$, the estimated reward $\bar{x}_{t_i,a}^\top \bar{w}_{t_i} = \bar{x}_{t_i,a}^\top \bar{w}_{t_1}$ does not change. Furthermore, by Lemma 6, we have $\sqrt{x^\top \hat{A}_{t_i,a}^{-1} x} \le \sqrt{x^\top \hat{A}_{t_1,a}^{-1} x}$. Hence, all the lemmas we need also hold for BaseLinUCBPR-ER. Similar to LinUCB, we can then establish the following theorem. The proof is almost identical to that of Theorem 1.

Theorem 2. For some $\alpha = O(\sqrt{\ln(nTK/\delta)})$, with probability $1-\delta$, the regret of LinUCBPR-ER under the piled-reward setting is $O\bigl(\sqrt{dn^2TK\ln^3(nTK/\delta)}\bigr)$.

5 Experiments

We apply the proposed algorithms to both artificial and real-world datasets to justify that using pseudo rewards is useful. In addition, we follow [1] and take the simple supervised-to-contextual-bandit transformation [7] on 8 multi-class datasets to evaluate our idea.


Table 1. ACR on artificial datasets (mean ± std)

                 d = 10                                      d = 30
                 K = 50                K = 100               K = 50                K = 100
                 n=500     n=1000      n=500     n=1000      n=500     n=1000      n=500     n=1000
Ideal            0.6607    0.6607      0.7061    0.7061      0.3930    0.3930      0.4252    0.4252
                 ±0.0002   ±0.0002     ±0.0002   ±0.0002     ±0.0002   ±0.0002     ±0.0002   ±0.0002
ε-Greedy         0.6265    0.6329      0.6317    0.6538      0.3566    0.3690      0.3537    0.3739
                 ±0.0030   ±0.0016     ±0.0043   ±0.0030     ±0.0030   ±0.0014     ±0.0022   ±0.0026
LinUCB           0.6555    0.6513      0.6866    0.6868      0.3905    0.3880      0.4188    0.4164
                 ±0.0004   ±0.0005     ±0.0011   ±0.0011     ±0.0003   ±0.0004     ±0.0007   ±0.0004
LinUCBPR-ER      0.6591    0.6535      0.7000    0.7040      0.3917    0.3896      0.4227    0.4224
                 ±0.0001   ±0.0002     ±0.0012   ±0.0003     ±0.0002   ±0.0002     ±0.0004   ±0.0004
LinUCBPR-UR      0.6586    0.6533      0.6978    0.7027      0.3911    0.3887      0.4210    0.4215
                 ±0.0001   ±0.0001     ±0.0011   ±0.0003     ±0.0002   ±0.0002     ±0.0002   ±0.0003
QPM-D            0.6552    0.6502      0.6925    0.6860      0.3897    0.3871      0.4172    0.4123
                 ±0.0003   ±0.0004     ±0.0010   ±0.0013     ±0.0003   ±0.0003     ±0.0007   ±0.0010

Fig. 1. ACR versus round (d = 10, K = 100, n = 500)

Fig. 2. Regret versus round (d = 10, K = 100, n = 500)

Artificial Datasets. For each artificial dataset, we first sample unit vectors $u_1, u_2, \ldots, u_K$ uniformly from $\mathbb{R}^d$ to simulate the $K$ actions. In each round $t$, the context $x_{t_i}$ is sampled from a uniform distribution within $\|x_{t_i}\| \le 1$. The reward is generated by $r_{t_i,a_{t_i}} = u_{a_{t_i}}^\top x_{t_i} + \epsilon_{t_i}$, where $\epsilon_{t_i} \in [-0.05, 0.05]$ is uniform random noise. In the experiments, we let $nT = 50000$ be a constant, and consider parameters $d \in \{10, 30\}$, $K \in \{50, 100\}$, and $n \in \{500, 1000\}$.
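A minimal sketch of this data-generation step (our own code; the function name and the exact way of sampling uniformly from the unit ball are our choices, not specified by the paper):

```python
import numpy as np

def make_artificial_bandit(d, K, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((K, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # K unit weight vectors u_1, ..., u_K

    def draw_context():
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)                      # uniform direction
        return v * rng.uniform() ** (1.0 / d)       # radius for a uniform draw in the unit ball

    def reward(x, a):
        return U[a] @ x + rng.uniform(-0.05, 0.05)  # linear payoff plus uniform noise
    return draw_context, reward
```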

We compare the performance of ε-Greedy, LinUCB, LinUCBPR-ER, and LinUCBPR-UR under the piled-reward setting. We also compare with Queued Partial Monitoring with Delays (QPM-D) [9], which uses a queue to handle arbitrarily-delayed rewards. Furthermore, we consider an "ideal" LinUCB under the traditional setting ($n = 1$) to study the difference between the traditional setting and the piled-reward setting. The parameters of the algorithms are selected by grid search, where $\alpha, \beta \in \{0.05, 0.10, \ldots, 1.00\}$ and $\epsilon \in \{0.025, 0.05, \ldots, 0.1\}$. We run the experiment 20 times and show the average cumulative reward (ACR), which is the cumulative reward divided by the number of time steps, in Table 1. From the table, Ideal LinUCB clearly outperforms the others. This verifies that the piled-reward setting introduces difficulty in applying upper-confidence bound algorithms. It also echoes the regret bound in Section 4, where LinUCB under the piled-reward setting suffers some penalty when compared with the original bound.

Next, we focus on the influence of the pseudo rewards. LinUCBPR-ER and LinUCBPR-UR are consistently better than LinUCB on all datasets.


Table 2. Datasets

dataset    D    K
shuttle    9    7
poker      10   10
pendigits  16   10
letter     16   26
satimage   36   6
acoustic   50   3
covtype    54   7
usps       256  10

Table 4. t-test at 95% confidence level (win/tie/loss)

Algorithm \ Competitor   ε-Greedy   LinUCB   LinUCBPR-ER   LinUCBPR-UR   QPM-D
ε-Greedy                 —          0/0/8    0/0/8         0/0/8         0/0/8
LinUCB                   8/0/0      —        0/1/7         2/4/2         1/6/1
LinUCBPR-ER              8/0/0      7/1/0    —             5/3/0         6/2/0
LinUCBPR-UR              8/0/0      2/4/2    0/3/5         —             2/5/1
QPM-D                    8/0/0      1/6/1    0/2/6         1/5/2         —

Table 3. ACR on supervised-to-contextual-bandit datasets (mean ± std)

              shuttle   poker     pendigits  letter    satimage  acoustic  covtype   usps
Ideal         0.9373    0.4866    0.8929     0.6271    0.8344    0.7216    0.6987    0.9358
              ±0.0005   ±0.0075   ±0.0056    ±0.0117   ±0.0014   ±0.0006   ±0.0020   ±0.0012
ε-Greedy      0.8844    0.4766    0.8667     0.4746    0.8062    0.6992    0.6736    0.9009
              ±0.0092   ±0.0086   ±0.0058    ±0.0247   ±0.0024   ±0.0012   ±0.0035   ±0.0023
LinUCB        0.9168    0.4863    0.8876     0.5696    0.8225    0.7103    0.6888    0.9192
              ±0.0068   ±0.0087   ±0.0043    ±0.0176   ±0.0016   ±0.0016   ±0.0039   ±0.0017
LinUCBPR-ER   0.9200    0.4865    0.8901     0.6053    0.8236    0.7112    0.6915    0.9221
              ±0.0029   ±0.0046   ±0.0019    ±0.0137   ±0.0022   ±0.0007   ±0.0021   ±0.0011
LinUCBPR-UR   0.9170    0.4846    0.8872     0.6017    0.8189    0.7099    0.6913    0.9179
              ±0.0027   ±0.0107   ±0.0043    ±0.0167   ±0.0045   ±0.0021   ±0.0014   ±0.0025
QPM-D         0.9166    0.4860    0.8844     0.5585    0.8221    0.7101    0.6915    0.9185
              ±0.0046   ±0.0033   ±0.0044    ±0.0225   ±0.0024   ±0.0012   ±0.0013   ±0.0018

Figures 1 and 2 respectively depict the ACR and the regret along normalized rounds, i.e., $t/T$, when $d = 10$, $K = 100$, and $n = 500$. Note that the LinUCBPR algorithms enjoy an advantage in the early rounds. This is because exploration is generally more important than exploitation in the early rounds, and the LinUCBPR algorithms encourage more strategic exploration by using pseudo rewards.

We take ε-Greedy to compare the effect of conducting exploration within the round based on randomness rather than pseudo rewards. Table 1 suggests that the LinUCBPR algorithms reach much better performance, which justifies the effectiveness of the strategic exploration. We also compare the LinUCBPR algorithms with QPM-D. Table 1 shows that the LinUCBPR algorithms are consistently better than QPM-D. The results again justify the superiority of the LinUCBPR algorithms.

LinUCBPR-ER and LinUCBPR-UR perform quite comparably across all datasets. The results suggest that we do not need to be more aggressive than LinUCBPR-ER. The simple LinUCBPR-ER, which can be efficiently implemented by updating $\hat{A}_{t_i,a}$ only, can readily reach decent performance.

Supervised-to-contextual-bandit Datasets. Next, we take 8 public multi-class datasets¹ (Table 2). We randomly split each dataset into two parts: 30% for parameter tuning and 70% for testing. For each part, we repeatedly present the examples as an infinite data stream. We let $nT = 10000$ for parameter tuning and $nT = 30000$ for testing. We consider $n = 500$ for all datasets. The parameter setting is the same as the one for the artificial datasets.

1 available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
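The transformation we rely on here is the usual supervised-to-bandit recipe from [7] and [1]: each multi-class example provides a context, the classes act as actions, and the reward is 1 when the chosen action matches the hidden label and 0 otherwise. A hedged sketch (our own code, assuming an in-memory dataset) is:

```python
from itertools import cycle

def supervised_to_bandit_stream(X, y):
    """Turn a multi-class dataset into an infinite contextual-bandit stream.

    Yields (context, reward_fn) pairs; the learner only observes the reward of
    the action it picks, never the label itself.
    """
    for x, label in cycle(zip(X, y)):   # repeatedly present the examples as a stream
        yield x, (lambda a, label=label: 1.0 if a == label else 0.0)
```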


Fig. 3. CTR versus round on real-world datasets: (a) R6A, (b) R6B, (c) R6B-clean

Table 3 shows the average results of 20 experiments. LinUCBPR-ER is consistently better than the others, and LinUCBPR-UR is competitive with the others. The results again confirm that the LinUCBPR algorithms are useful under the piled-reward setting, and also again confirm LinUCBPR-ER to be the best algorithm. We further compare these algorithms with a two-sample t-test at the 95% confidence level in Table 4. The results demonstrate the significance of the strong performance of LinUCBPR-ER.

Real-world Datasets. Finally, we use two real-world datasets, R6A and R6B, released by Yahoo! to examine our proposed algorithms. To the best of our knowledge, they are the only two public datasets for the CBP. They first appeared in the ICML 2012 workshop on New Challenges for Exploration and Exploitation 3, and also appear in [10]. They are about news article recommendation.

Note that the action sets of the two datasets are dynamic. To deal with this, we let the algorithms maintain a weight vector $w_{t_i,a}$ for each action. The dimensions of the contexts for R6A and R6B are 6 and 136, respectively. The rewards for both datasets are in $\{0, 1\}$, representing the clicks from users. We note that in R6B, some examples do not come with valid contexts; we remove these examples and form a new dataset, R6B-clean. We use the click-through rate (CTR) to evaluate the algorithms, and use the technique described in [11] to achieve an unbiased off-line evaluation.
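The technique of [11] is commonly known as the replay evaluator: scan the logged events (collected under uniformly random action selection), and whenever the evaluated policy picks the same action as the logged one, count that event's click and let the policy learn from it; otherwise discard the event. A hedged sketch adapted to the piled-reward setting, with assumed field names and the `policy` interface used in the earlier sketches:

```python
def replay_ctr(policy, logged_events, n):
    """Offline CTR estimate in the spirit of Li et al. [11]; illustrative only.

    Each logged event is assumed to be a (context, logged_action, click) triple
    collected under uniformly random action selection.
    """
    matched, clicks, pile = 0, 0, []
    for context, logged_action, click in logged_events:
        if policy.choose(context) == logged_action:    # keep only matching events
            matched += 1
            clicks += click
            pile.append((context, logged_action, click))
            if len(pile) == n:                         # rewards arrive as a pile of size n
                policy.update_with_rewards([(x, a) for x, a, _ in pile],
                                           [c for _, _, c in pile])
                pile = []
    return clicks / max(matched, 1)                    # estimated click-through rate
```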

We split the datasets into two parts: a parameter-tuning part and a testing part. For R6A, we let $nT = 10000$ for parameter tuning and $nT = 300000$ for testing. For R6B and R6B-clean, we let $nT = 10000$ for parameter tuning and $nT = 100000$ for testing. We consider $n = 500$ for each dataset. The parameter setting is the same as the one for the artificial datasets.

Figure 3 shows the experimental results. Unlike the results on the artificial datasets, the CTR curves are non-monotonic. This is possibly because the action set is dynamic, and the better actions may disappear from the action set in the middle of the run, which leads to some drop in CTR.

The LinUCBPR algorithms and QPM-D usually perform better than LinUCB and ε-Greedy on these datasets. LinUCBPR-ER is stable among the better choices, while LinUCBPR-UR and QPM-D can sometimes be inferior. The results again suggest LinUCBPR-ER to be a promising algorithm for the piled-reward setting.


6 Conclusion

We introduce the contextual bandit problem under the piled-reward setting and show how to apply LinUCB to this setting. We also propose a novel algorithm, LinUCBPR, which uses pseudo rewards to encourage strategic exploration that utilizes received contexts that are temporarily without rewards. We prove a regret bound for both LinUCB and LinUCBPR with estimated reward (LinUCBPR-ER), and discuss how the bound compares with the original bound. Empirical results show that LinUCBPR performs better in early time steps and is competitive in the long term. Most importantly, LinUCBPR-ER yields promising performance on all datasets. The results suggest LinUCBPR-ER to be the best choice in practice.

Acknowledgements

We thank the anonymous reviewers and the members of the NTU CLLab for valuable suggestions. This work is partially supported by the Ministry of Science and Technology of Taiwan (MOST 103-2221-E-002-148-MY3) and the Asian Office of Aerospace Research and Development (AOARD FA2386-15-1-4012).

References

1. Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., Schapire, R.E.: Taming the monster: A fast and simple algorithm for contextual bandits. In: ICML. pp. 1638–1646 (2014)
2. Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, 397–422 (2003)
3. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (2002)
4. Chou, K.C., Chiang, C.K., Lin, H.T., Lu, C.J.: Pseudo-reward algorithms for contextual bandits with linear payoff functions. In: ACML. pp. 344–359 (2014)
5. Chu, W., Li, L., Reyzin, L., Schapire, R.E.: Contextual bandits with linear payoff functions. In: AISTATS. pp. 208–214 (2011)
6. Dudík, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., Zhang, T.: Efficient optimal learning for contextual bandits. In: UAI. pp. 169–178 (2011)
7. Dudík, M., Langford, J., Li, L.: Doubly robust policy evaluation and learning. In: ICML. pp. 1097–1104 (2011)
8. Guha, S., Munagala, K., Pal, M.: Multiarmed bandit problems with delayed feedback. arXiv:1011.1161 (2010)
9. Joulani, P., György, A., Szepesvári, C.: Online learning under delayed feedback. In: ICML. pp. 1453–1461 (2013)
10. Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual-bandit approach to personalized news article recommendation. In: WWW. pp. 661–670 (2010)
11. Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: WSDM. pp. 297–306 (2011)
12. Mandel, T., Liu, Y.E., Brunskill, E., Popovic, Z.: The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In: AAAI (2015)
13. Wang, C.C., Kulkarni, S.R., Poor, H.V.: Bandit problems with side observations. IEEE Transactions on Automatic Control 50(3), 338–355 (2005)
