Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards


Kuan-Hao Huang and Hsuan-Tien Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

{r03922062,htlin}@csie.ntu.edu.tw

Abstract. We study the contextual bandit problem with linear payoff function. In the traditional contextual bandit problem, the algorithm iteratively chooses an action based on the observed context, and immediately receives a reward for the chosen action. Motivated by a practical need in many applications, we study the design of algorithms under the piled-reward setting, where the rewards are received as a pile instead of immediately. We present how the Linear Upper Confidence Bound (LinUCB) algorithm for the traditional problem can be naïvely applied under the piled-reward setting, and prove its regret bound. Then, we extend LinUCB to a novel algorithm, called Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which digests the observed contexts to choose actions more strategically before the piled rewards are received. We prove that LinUCBPR can match LinUCB in the regret bound under the piled-reward setting. Experiments on artificial and real-world datasets demonstrate the strong performance of LinUCBPR in practice.

Keywords: contextual bandit, piled rewards, upper confidence bound

1 Introduction

We study the contextual bandit problem (CBP) [13], which is an interactive process between an algorithm and an environment. In the traditional CBP, the algorithm observes a context from the environment in each time step. Then, the algorithm is asked to strategically choose an action from the action set based on the context, and receives a corresponding feedback, called the reward, while the rewards for the other actions are hidden from the algorithm. The goal of the algorithm is to maximize the cumulative reward over all time steps.

Because only the reward of the chosen action is revealed, the algorithm needs to choose different actions to estimate their goodness, which is called exploration. On the other hand, the algorithm also needs to choose the better actions to maximize the reward, which is called exploitation. Balancing exploration and exploitation is arguably the most important issue in designing CBP algorithms.

ε-Greedy [3] and Linear Upper Confidence Bound (LinUCB) [10] are two representative algorithms for the CBP. ε-Greedy learns one model per action for exploitation and randomly explores different actions with a small probability ε.


LinUCB is based on online ridge regression, and takes the concept of the upper-confidence bound [5, 2] to strategically balance between exploration and exploitation. LinUCB enjoys a strong theoretical guarantee [5] and is state-of-the-art in many practical applications [10].

The traditional CBP setting assumes that the algorithm receives the reward immediately after choosing an action. In some practical applications, however, the environment cannot present the reward to the algorithm immediately. This work is motivated by one such application. Consider an online advertisement system operated by a contextual bandit algorithm. For each user visit (time step), the system (algorithm) receives the information of the user (context) from an ad exchange, and chooses an appropriate ad (action) to display to the user. In this application, the click from the user naturally acts as the reward of the action.

Nevertheless, to reduce the cost of communication, the ad exchange often does not reveal each individual reward immediately after the action is chosen. Instead, the ad exchange stores the individual rewards first, and only sends a pile of rewards back to the system once a sufficient number of rewards has been gathered. We call this scenario the contextual bandit problem under the piled-reward setting.

A related setting in the literature is the delayed-reward setting, where the reward is assumed to arrive several time steps after the algorithm chooses an action. Most existing works on the delayed-reward setting consider constant delays [6] and cannot be easily applied to the piled-reward setting. Several works [8, 9, 12] propose algorithms for bandit problems with arbitrarily-delayed rewards, but their algorithms are non-contextual. Thus, to the best of our knowledge, no existing work has carefully studied the CBP under the piled-reward setting.

In this paper, we study how LinUCB can be applied under the piled-reward setting. We present a naïve use of LinUCB for the setting and prove its theoretical guarantee in the form of the regret bound. The result helps us understand the difference between the traditional setting and the piled-reward setting. Then, we design a novel algorithm, Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which is a variant of LinUCB that allows more strategic use of the context information before the piled rewards are received. We prove that LinUCBPR can match the naïve LinUCB in its regret bound under the piled-reward setting. Experiments on artificial and real-world datasets demonstrate that LinUCBPR results in strong and stable performance in practice.

This paper is organized as follows. Section 2 formalizes the CBP with the piled-reward setting. Section 3 describes our design of LinUCB and LinUCBPR under the piled-reward setting. The theoretical guarantees of the algorithms are analyzed in Section 4. We discuss the experiment results in Section 5 and conclude in Section 6.

2 Preliminaries

We use a bold lower-case symbol like $u$ to denote a column vector, a bold upper-case symbol like $A$ to denote a matrix, $I_d$ to denote the $d \times d$ identity matrix, and $[K]$ to denote the set $\{1, 2, \ldots, K\}$.


We first introduce the CBP under the traditional setting. Let $T$ be the total number of rounds and $K$ be the number of actions. In each round $t \in [T]$, the algorithm observes a context $x_t \in \mathbb{R}^d$ with $\|x_t\|_2 \le 1$ from the environment. Upon observing the context $x_t$, the algorithm chooses an action $a_t$ from the $K$ actions based on the context. Right after choosing $a_t$, the algorithm receives a reward $r_{t,a_t}$ that corresponds to the context $x_t$ and the chosen action $a_t$, while the other rewards $r_{t,a}$ for $a \ne a_t$ are hidden from the algorithm. The goal of the algorithm is to maximize the cumulative reward after $T$ rounds.

Now, we introduce the CBP under the piled-reward setting. Instead of receiving the reward right after choosing an action (and thus right before observing the next context), the setting assumes that the rewards come as a pile after observing multiple contexts in a round. We extend our notation above to the piled-reward setting as follows. In each round $t$, the algorithm sequentially observes $n$ contexts $x_{t_1}, x_{t_2}, \ldots, x_{t_n} \in \mathbb{R}^d$ with $\|x_{t_i}\|_2 \le 1$ from the environment. For simplicity, we assume that $n$ is a fixed number, while all the technical results in this paper can easily be extended to the case where $n$ varies across rounds. We use $t_i$ to denote the $i$-th step in round $t$. For example, $3_5$ means the 5-th step in round 3. Upon observing the context $x_{t_i}$ in round $t$, the algorithm chooses an action $a_{t_i}$ from the $K$ actions based on the context, and observes the next context $x_{t_{i+1}}$. At the end of round $t$, the algorithm receives $n$ rewards $r_{t_1,a_{t_1}}, r_{t_2,a_{t_2}}, \ldots, r_{t_n,a_{t_n}}$ that correspond to $x_{t_i}$ and $a_{t_i}$, while the other rewards $r_{t_i,a}$ for $a \ne a_{t_i}$ are hidden from the algorithm. The goal is again to maximize the cumulative reward $\sum_{t=1}^{T}\sum_{i=1}^{n} r_{t_i,a_{t_i}}$ after $T$ rounds.

In other words, the piled-reward setting assumes that the contexts come at time steps $\{1_1, 1_2, \ldots, 1_n, 2_1, 2_2, \ldots, t_1, t_2, \ldots, T_{n-1}, T_n\}$, while the rewards come after every $n$ contexts as a pile. Note that the traditional setting is a special case of the piled-reward setting with $n = 1$.
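To make the piled-reward protocol concrete, the following minimal sketch simulates the interaction loop described above. It is an illustration only: `env` and `policy` are hypothetical interfaces that do not appear in the paper, where `policy` exposes a `choose` method and an `update_with_rewards` method.

```python
# A minimal sketch of the piled-reward interaction protocol (illustrative only).
# `env` and `policy` are hypothetical objects, not part of the paper.
def run_piled_reward(env, policy, T, n):
    total_reward = 0.0
    for t in range(T):                        # rounds t = 1, ..., T
        chosen = []                           # (context, action) pairs of this round
        for i in range(n):                    # n steps per round, no reward revealed yet
            x = env.next_context()            # observe x_{t_i}
            a = policy.choose(x)              # choose a_{t_i} from the K actions
            chosen.append((x, a))
        rewards = env.reveal_rewards(chosen)  # the pile of n rewards arrives at round end
        policy.update_with_rewards(chosen, rewards)
        total_reward += sum(rewards)
    return total_reward                       # the quantity the algorithm tries to maximize
```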

In this paper, we consider the CBP with linear payoff function. We assume that $r_{t_i,a}$ is connected with $x_{t_i}$ linearly through $K$ hidden weight vectors $u_1, u_2, \ldots, u_K \in \mathbb{R}^d$ with $\|u_a\|_2 \le 1$. That is, $\mathbb{E}[r_{t_i,a} \mid x_{t_i}] = x_{t_i}^\top u_a$. Let $a^*_{t_i} = \arg\max_{a \in [K]} x_{t_i}^\top u_a$ be the optimal action for $x_{t_i}$. We define the regret of an algorithm to be $\bigl(\sum_{t=1}^{T}\sum_{i=1}^{n} r_{t_i,a^*_{t_i}} - \sum_{t=1}^{T}\sum_{i=1}^{n} r_{t_i,a_{t_i}}\bigr)$. The goal of maximizing the cumulative reward is equivalent to minimizing the regret.

Linear Upper Confidence Bound (LinUCB) [5] is a state-of-the-art algorithm for the traditional CBP ($n = 1$). LinUCB maintains $K$ weight vectors $w_{t_1,1}, w_{t_1,2}, \ldots, w_{t_1,K}$ to estimate $u_1, u_2, \ldots, u_K$ at time step $t_1$. The $K$ weight vectors are calculated by ridge regression

$$w_{t_1,a} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Bigl( \|w\|^2 + \|X_{(t-1)_1,a}\, w - r_{(t-1)_1,a}\|^2 \Bigr), \qquad (1)$$

where $X_{(t-1)_1,a}$ is a matrix whose rows are the contexts $x_\tau^\top$ for the time steps $\tau$ before round $t$ with $a_\tau = a$, and $r_{(t-1)_1,a}$ is a column vector whose elements are the corresponding rewards for the contexts in $X_{(t-1)_1,a}$. Let $A_{t_1,a} = \bigl(I_d + X_{(t-1)_1,a}^\top X_{(t-1)_1,a}\bigr)$ and $b_{t_1,a} = X_{(t-1)_1,a}^\top r_{(t-1)_1,a}$. The solution to (1) is $w_{t_1,a} = A_{t_1,a}^{-1} b_{t_1,a}$.


When a new context $x_{t_1}$ comes, LinUCB calculates two terms for each action $a$: the estimated reward $\tilde{r}_{t_1,a} = x_{t_1}^\top w_{t_1,a}$ and the uncertainty $c_{t_1,a} = \sqrt{x_{t_1}^\top A_{t_1,a}^{-1} x_{t_1}}$, and chooses the action with the highest score $\tilde{r}_{t_1,a} + \alpha c_{t_1,a}$, where $\alpha$ is a trade-off parameter. After receiving the reward, LinUCB updates the weight vector $w_{t_1,a_{t_1}}$ immediately, and uses the new weight vector to choose the action for the next context. LinUCB conducts exploration when the chosen action is of high uncertainty. After sufficient (context, action, reward) information is received, $\tilde{r}_{t_1,a}$ shall be close to the expected reward, and $c_{t_1,a}$ will be smaller. Then, LinUCB conducts exploitation with the learned weight vectors to choose the action with the highest expected reward.
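As a concrete illustration of the procedure just described, here is a minimal numpy sketch of LinUCB for the traditional setting. The class and method names are ours (the paper only specifies the mathematics); the update follows equation (1), with $A_a = I_d + X^\top X$ and $b_a = X^\top r$.

```python
import numpy as np

# Minimal LinUCB sketch for the traditional setting (n = 1); illustrative only.
class LinUCB:
    def __init__(self, d, K, alpha):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(K)]     # A_a = I_d + X^T X
        self.b = [np.zeros(d) for _ in range(K)]   # b_a = X^T r

    def choose(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            w_a = A_inv @ b_a                      # ridge-regression solution of (1)
            r_hat = x @ w_a                        # estimated reward
            c = np.sqrt(x @ A_inv @ x)             # uncertainty
            scores.append(r_hat + self.alpha * c)  # upper-confidence score
        return int(np.argmax(scores))

    def update(self, x, a, reward):
        self.A[a] += np.outer(x, x)                # rank-one update with the new context
        self.b[a] += reward * x
```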

3 Proposed Algorithm

We first discuss how LinUCB can be naïvely applied under the piled-reward setting. Then, we extend LinUCB to a more general framework that utilizes the additional information within the contexts before the true rewards are received.

Since no rewards are received before the end of the current round $t$, the naïve LinUCB does not update the model during round $t$, and only takes the fixed $w_{t_1,a}$ and $A_{t_1,a}$ to calculate the estimated reward $\tilde{r}_{t_i,a}$ and the uncertainty $c_{t_i,a}$ for each action $a$. That is, LinUCB only updates $w_{t_1,a}$ before the beginning of round $t$ as the solution to (1), with $(X_{(t-1)_1,a}, r_{(t-1)_1,a})$ under the traditional setting replaced by $(X_{(t-1)_n,a}, r_{(t-1)_n,a})$ under the piled-reward setting. In addition, $A_{t_1,a}$ can be similarly defined from $X_{(t-1)_n,a}$ instead.

The naïve LinUCB can be viewed as a baseline upper confidence bound algorithm under the piled-reward setting. There is a possible drawback of the naïve LinUCB: if similar contexts come repeatedly in the same round, because $w_{t_i,a}$ and $A_{t_i,a}$ stay unchanged within the round, LinUCB will choose similar actions repeatedly. Then, if the chosen action suffers from low reward, LinUCB suffers from making the low-reward choice repeatedly before the end of the round.

The question is: can we do even better? Our idea is that the contexts $x_{t_i}$ received during round $t$ can be utilized to update the model before the rewards come. That is, at time step $t_i$, in addition to the labelled data (context, action, reward) gathered before time step $(t-1)_n$ that LinUCB uses, the unlabelled data (context, action) gathered at time steps $\{t_1, t_2, \ldots, t_{i-1}\}$ can also be included to learn a better model. In other words, we hope to design a semi-supervised learning scheme within round $t$ to guide the upper-confidence bound algorithm towards more strategic exploration within the round.

Our idea is motivated by the regret analysis. In Section 4, we will show that the regret of LinUCB under the piled-reward setting is bounded by the summation of $c_{t_i,a}$ over all time steps. Note that $c_{t_i,a}$ only depends on $x_{t_i}$ and $A_{t_i,a}$, but not on the reward. That is, upon receiving $x_{t_i}$ and choosing an action $a_{t_i}$, the term $c_{t_i,a_{t_i}}$ can readily be updated without the true reward. By updating $c_{t_i,a_{t_i}}$ within the round, the algorithm can explore different actions strategically instead of following similar actions when similar contexts come repeatedly in the same round.


Algorithm 1 LinUCBPR under the piled-reward setting
1: Parameter: $\alpha \in \mathbb{R}^+$
2: Initialize: $\hat{A}_{1_1,a} \leftarrow I_d$, $\hat{b}_{1_1,a} \leftarrow 0_{d \times 1}$, $\hat{w}_{1_1,a} \leftarrow \hat{A}_{1_1,a}^{-1} \hat{b}_{1_1,a}$
3: for $t = 1, 2, 3, \ldots, T$ do
4:   for $i = 1, 2, 3, \ldots, n$ do
5:     Observe $x_{t_i}$ and choose $a_{t_i} = \operatorname{argmax}_{a \in [K]} \bigl( x_{t_i}^\top \hat{w}_{t_i,a} + \alpha \sqrt{x_{t_i}^\top \hat{A}_{t_i,a}^{-1} x_{t_i}} \bigr)$
6:     Calculate the pseudo reward $p_{t_i,a_{t_i}}$
7:     $\hat{A}_{t_{i+1},a_{t_i}} \leftarrow \hat{A}_{t_i,a_{t_i}} + x_{t_i} x_{t_i}^\top$,  $\hat{b}_{t_{i+1},a_{t_i}} \leftarrow \hat{b}_{t_i,a_{t_i}} + x_{t_i} p_{t_i,a_{t_i}}$
8:     $\hat{w}_{t_{i+1},a_{t_i}} \leftarrow \hat{A}_{t_{i+1},a_{t_i}}^{-1} \hat{b}_{t_{i+1},a_{t_i}}$
9:   end for
10:  Receive rewards $r_{t_1,a_{t_1}}, r_{t_2,a_{t_2}}, \ldots, r_{t_n,a_{t_n}}$
11:  for $a \in [K]$ do
12:    $\hat{A}_{(t+1)_1,a} \leftarrow \hat{A}_{t_1,a} + \sum_{a_{t_i}=a} x_{t_i} x_{t_i}^\top$,  $\hat{b}_{(t+1)_1,a} \leftarrow \hat{b}_{t_1,a} + \sum_{a_{t_i}=a} x_{t_i} r_{t_i,a_{t_i}}$
13:    $\hat{w}_{(t+1)_1,a} \leftarrow \hat{A}_{(t+1)_1,a}^{-1} \hat{b}_{(t+1)_1,a}$
14:  end for
15: end for

This idea can be extended to the following framework. We propose to couple each context $x_{t_1}, x_{t_2}, \ldots, x_{t_{i-1}}$ with a pseudo reward $p_{\tau,a_\tau}$, where $\tau$ is the time step, before receiving the true reward $r_{\tau,a_\tau}$. The pseudo reward can then pretend to be the true reward and allow the algorithm to keep updating the model before the true rewards are received. Note that pseudo rewards have been used to speed up exploration in the traditional CBP [4], and can encourage more strategic exploration in our framework. We name the framework Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR). The framework updates the weight vector and the estimated covariance matrix by

$$\hat{w}_{t_i,a} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Bigl( \|w\|^2 + \|X_{(t-1)_n,a}\, w - r_{(t-1)_n,a}\|^2 + \|\hat{X}_{t_{i-1},a}\, w - p_{t_{i-1},a}\|^2 \Bigr), \qquad (2)$$

$$\hat{A}_{t_i,a} = I_d + X_{(t-1)_n,a}^\top X_{(t-1)_n,a} + \hat{X}_{t_{i-1},a}^\top \hat{X}_{t_{i-1},a}, \qquad (3)$$

where $\hat{X}_{t_{i-1},a}$ is a matrix whose rows are the contexts $x_\tau^\top$ with $t_1 \le \tau \le t_{i-1}$ and $a_\tau = a$, and $p_{t_{i-1},a}$ is a column vector with each element representing the corresponding pseudo reward for each context in $\hat{X}_{t_{i-1},a}$.

When receiving the true rewards at the end of round $t$, we discard the changes from the pseudo rewards, and use the true rewards to update the model again. We show the framework of LinUCBPR in Algorithm 1.

The only remaining task is to decide what $p_{\tau,a}$ should be. We study two variants. One is to use $p_{\tau,a} = \tilde{r}_{\tau,a}$, the estimated reward of the action; we name this variant LinUCBPR with estimated reward (LinUCBPR-ER). The other variant is to be even more aggressive and set $p_{\tau,a} = \tilde{r}_{\tau,a} - \beta c_{\tau,a}$, a lower-confidence bound of the reward, where $\beta$ is a trade-off parameter. The lower-confidence bound can be viewed as an underestimated reward, and should allow more exploration within the round, at the cost of more computation. We name this variant LinUCBPR with underestimated reward (LinUCBPR-UR).
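The following numpy sketch illustrates how the pseudo-reward updates of Algorithm 1 can be organized in code. It is our own illustration under the descriptions above: `beta = 0` corresponds to LinUCBPR-ER, `beta > 0` to LinUCBPR-UR, and the class and method names are not from the paper.

```python
import numpy as np

# Sketch of LinUCBPR under the piled-reward setting (cf. Algorithm 1); illustrative only.
class LinUCBPR:
    def __init__(self, d, K, alpha, beta=0.0):     # beta = 0: ER variant, beta > 0: UR variant
        self.alpha, self.beta = alpha, beta
        self.A = [np.eye(d) for _ in range(K)]     # statistics built from true rewards only
        self.b = [np.zeros(d) for _ in range(K)]
        self._start_round()

    def _start_round(self):
        # Within-round copies that also absorb pseudo rewards; discarded at round end.
        self.A_hat = [A.copy() for A in self.A]
        self.b_hat = [b.copy() for b in self.b]
        self.history = []                          # (context, action) pairs of this round

    def choose(self, x):
        scores = []
        for A_a, b_a in zip(self.A_hat, self.b_hat):
            A_inv = np.linalg.inv(A_a)
            scores.append(x @ (A_inv @ b_a) + self.alpha * np.sqrt(x @ A_inv @ x))
        a = int(np.argmax(scores))
        # Pseudo reward: estimated reward, optionally lowered by beta times the uncertainty.
        A_inv = np.linalg.inv(self.A_hat[a])
        p = x @ (A_inv @ self.b_hat[a]) - self.beta * np.sqrt(x @ A_inv @ x)
        self.A_hat[a] += np.outer(x, x)            # keep updating the model within the round
        self.b_hat[a] += p * x
        self.history.append((x, a))
        return a

    def end_round(self, rewards):
        # Discard the pseudo-reward changes; update with the true piled rewards instead.
        for (x, a), r in zip(self.history, rewards):
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x
        self._start_round()
```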


Algorithm 2 BaseLinUCB under the piled-reward setting at round $t$
1: Parameter: $\alpha \in \mathbb{R}^+$, $\Psi_t \subseteq \{1_1, 1_2, \ldots, (t-1)_n\}$
2: $\bar{A}_{t_1} \leftarrow I_{dK} + \sum_{\tau \in \Psi_t} \bar{x}_{\tau,a_\tau} \bar{x}_{\tau,a_\tau}^\top$,  $\bar{b}_{t_1} \leftarrow 0_{dK \times 1} + \sum_{\tau \in \Psi_t} \bar{x}_{\tau,a_\tau} r_{\tau,a_\tau}$,  $\bar{w}_{t_1} \leftarrow \bar{A}_{t_1}^{-1} \bar{b}_{t_1}$
3: for $i = 1, 2, 3, \ldots, n$ do
4:   Observe $x_{t_i}$ and calculate $\bar{x}_{t_i,1}, \bar{x}_{t_i,2}, \ldots, \bar{x}_{t_i,K}$
5:   for $a \in [K]$ do
6:     $\mathrm{width}_{t_i,a} \leftarrow (1+\alpha)\sqrt{\bar{x}_{t_i,a}^\top \bar{A}_{t_1}^{-1} \bar{x}_{t_i,a}}$
7:     $\mathrm{ucb}_{t_i,a} \leftarrow \bar{x}_{t_i,a}^\top \bar{w}_{t_1} + \mathrm{width}_{t_i,a}$
8:   end for
9: end for

4 Theoretical Analysis

In this section, we establish the theoretical guarantee for the regret bound of LinUCB and LinUCBPR-ER under the piled-reward setting. Similar to the analysis of LinUCB in the immediate-reward setting [5], there is a difficulty: the algorithms choose actions based on previous outcomes, so the rewards in each round are not independent random variables. To deal with this problem, we follow the approach of [5]. We modify each algorithm into a base algorithm that assumes independent rewards, and construct a master algorithm that ensures the assumption holds.

Note that [5] takes a CBP setting with one context per action instead of our setting of one context shared by all actions. To keep the notation consistent with [5], we simply cast our setting into theirs by the following steps. We define a $(dK)$-dimensional vector $\bar{u}$ to be the concatenation of $u_1, u_2, \ldots, u_K$, and define a $(dK)$-dimensional context $\bar{x}_{\tau,a}$ per action from $x_\tau$, where $\bar{x}_{\tau,a} = \bigl(0\ 0\ \cdots\ 0\ x_\tau^\top\ 0\ \cdots\ 0\bigr)^\top$ with $x_\tau$ being the $a$-th vector within the concatenation. All $\bar{X}_\tau$, $\bar{A}_\tau$, $\bar{r}_\tau$, $\bar{b}_\tau$, and $\bar{w}_\tau$ can be similarly defined from $\hat{X}_{\tau,a}$, $\hat{A}_{\tau,a}$, $r_{\tau,a}$, $\hat{b}_{\tau,a}$, and $\hat{w}_{\tau,a}$.
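The casting amounts to placing $x_\tau$ into the $a$-th block of a $(dK)$-dimensional vector. A small numpy helper (our own illustration, not from the paper) makes this explicit:

```python
import numpy as np

def lift_context(x, a, K):
    """Place the d-dimensional context x into the a-th block (0-indexed) of a (dK)-vector."""
    d = x.shape[0]
    x_bar = np.zeros(d * K)
    x_bar[a * d:(a + 1) * d] = x   # all other blocks stay zero
    return x_bar
```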

4.1 Regret for LinUCB under the piled-reward setting

Algorithm 2 lists the base algorithm for LinUCB under the piled-reward setting, called BaseLinUCB. We first prove the theoretical guarantee of BaseLinUCB.

Let $\bar{c}_{t_i,a} = \sqrt{\bar{x}_{t_i,a}^\top \bar{A}_{t_1}^{-1} \bar{x}_{t_i,a}}$. We can establish the following lemmas.

Lemma 1 (Chu et al. [5], Lemma 1). Suppose the input time step set $\Psi_t \subseteq \{1_1, 1_2, \ldots, (t-1)_n\}$ given to BaseLinUCB has the property that for fixed contexts $\bar{x}_{t_i,a}$ with $t_i \in \Psi_t$, the corresponding rewards $r_{t_i,a}$ are independent random variables with means $\bar{x}_{t_i,a}^\top \bar{u}$. Then, for some $\alpha = O(\sqrt{\ln(nTK/\delta)})$, we have with probability at least $1 - \delta/(nT)$ that
$$\bigl|\,\bar{x}_{t_i,a}^\top \bar{w}_{t_1} - \bar{x}_{t_i,a}^\top \bar{u}\,\bigr| \le (1+\alpha)\,\bar{c}_{t_i,a}.$$

Note that in Lemma 1, the bound is related to the time steps. We want the bound to be related to the rounds, and hence establish Lemma 2 and Lemma 3.


Lemma 2. Let $\psi_t$ be a subset of $\{t_1, t_2, \ldots, t_n\}$. Suppose $\Psi_{t+1} = \Psi_t \cup \psi_t$ in BaseLinUCB. Then, the eigenvalues of $\bar{A}_{t_1}$ and $\bar{A}_{(t+1)_1}$ can be arranged so that $\lambda_{t_1,j} \le \lambda_{(t+1)_1,j}$ for all $j$, and
$$\bar{c}_{t_i,a}^2 \le 10 \sum_{j=1}^{dK} \frac{\lambda_{(t+1)_1,j} - \lambda_{t_1,j}}{\lambda_{t_1,j}}.$$

Proof. The proof can be done by combining Lemma 2 and Lemma 8 in [5].

Lemma 3. Let $\Phi_{t+1} = \{t \mid t \in [T] \text{ and } \exists\, j \text{ such that } t_j \in \Psi_{t+1}\}$, and assume $|\Phi_{t+1}| \ge 2$. Then
$$\sum_{t_i \in \Psi_{t+1}} \bar{c}_{t_i,a} \le 5n\sqrt{dK\,|\Phi_{t+1}|\,\ln|\Phi_{t+1}|}.$$

Proof. By Lemma 2 and the technique in the proof of Lemma 3 in [5], we have
$$\sum_{t_i \in \Psi_{t+1}} \bar{c}_{t_i,a} \;\le\; \sum_{t_i \in \Psi_{t+1}} \sqrt{10 \sum_{j=1}^{dK} \frac{\lambda_{(t+1)_1,j} - \lambda_{t_1,j}}{\lambda_{t_1,j}}} \;\le\; \sum_{t \in \Phi_{t+1}} n \sqrt{10 \sum_{j=1}^{dK} \frac{\lambda_{(t+1)_1,j} - \lambda_{t_1,j}}{\lambda_{t_1,j}}} \;\le\; 5n\sqrt{dK\,|\Phi_{t+1}|\,\ln|\Phi_{t+1}|}.$$

We construct SupLinUCB for each round, similarly to [5]. Then, we borrow Lemma 14 and Lemma 15 of [2], and extend Lemma 16 of [2] to the following lemma.

Lemma 4. For each $s \in [S]$,
$$\bigl|\Psi_{T+1}^s\bigr| \le 5n \cdot 2^s (1+\alpha) \sqrt{dK\,\bigl|\Phi_{t+1}^s\bigr|\,\ln\bigl|\Phi_{t+1}^s\bigr|}.$$

Based on the lemmas, we can then establish the following theorem for the regret bound of LinUCB under the piled-reward setting.

Theorem 1. For some $\alpha = O(\sqrt{\ln(nTK/\delta)})$, with probability $1-\delta$, the regret of LinUCB under the piled-reward setting is $O\bigl(\sqrt{dn^2TK\ln^3(nTK/\delta)}\bigr)$.

Proof. Let $\Psi^0 = \{1_1, 1_2, \ldots, T_n\} \setminus \bigcup_{s \in [S]} \Psi_{T+1}^s$. Observing that $2^{-S} \le 1/\sqrt{T}$, given the previous lemmas and Jensen's inequality, we have
$$
\begin{aligned}
\text{Regret} &= \sum_{t=1}^{T}\sum_{i=1}^{n}\Bigl(\mathbb{E}\bigl[r_{t_i,a^*_{t_i}}\bigr] - \mathbb{E}\bigl[r_{t_i,a_{t_i}}\bigr]\Bigr)\\
&= \sum_{t_i \in \Psi^0}\Bigl(\mathbb{E}\bigl[r_{t_i,a^*_{t_i}}\bigr] - \mathbb{E}\bigl[r_{t_i,a_{t_i}}\bigr]\Bigr) + \sum_{s=1}^{S}\sum_{t_i \in \Psi_{T+1}^s}\Bigl(\mathbb{E}\bigl[r_{t_i,a^*_{t_i}}\bigr] - \mathbb{E}\bigl[r_{t_i,a_{t_i}}\bigr]\Bigr)\\
&\le \frac{2}{\sqrt{T}}\,\bigl|\Psi^0\bigr| + \sum_{s=1}^{S} 2^{3-s}\,\bigl|\Psi_{T+1}^s\bigr|\\
&\le \frac{2}{\sqrt{T}}\,\bigl|\Psi^0\bigr| + \sum_{s=1}^{S} 40n(1+\alpha)\sqrt{dK\,\bigl|\Phi_{t+1}^s\bigr|\,\ln\bigl|\Phi_{t+1}^s\bigr|}\\
&\le 2n\sqrt{T} + 40n(1+\alpha)\sqrt{dK\ln T}\sum_{s=1}^{S}\sqrt{\bigl|\Phi_{t+1}^s\bigr|}\\
&\le 2n\sqrt{T} + 40n(1+\alpha)\sqrt{dK\ln T}\,\sqrt{ST}.
\end{aligned}
$$


The rest of the proof is almost identical to the proof of Theorem 6 in [2]. By substituting $\alpha = O(\sqrt{\ln(nTK/\delta)})$, replacing $\delta$ with $\delta/((S+1)S)$, substituting $S = \ln(nT)$, and applying Azuma's inequality, we obtain Theorem 1.

Note that if we let $nT = C$ be a constant, the original regret bound under the traditional setting ($n = 1$) in [5] is $O\bigl(\sqrt{dCK\ln^3(CK/\delta)}\bigr)$, while the regret bound under the piled-reward setting is $O\bigl(\sqrt{dnCK\ln^3(CK/\delta)}\bigr)$, which is the original bound multiplied by $\sqrt{n}$.
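As a quick check of this claim, dividing the two bounds with $nT = C$ held fixed gives
$$\frac{\sqrt{d\,n\,C\,K\,\ln^3(CK/\delta)}}{\sqrt{d\,C\,K\,\ln^3(CK/\delta)}} = \sqrt{n},$$
so enlarging the pile size $n$ while keeping the total number of time steps fixed inflates the bound by exactly this $\sqrt{n}$ factor.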

4.2 Regret for LinUCBPR-ER under the piled-reward setting

We first prove two lemmas for LinUCBPR-ER.

Lemma 5. After updating with the context $x_{t_i}$ and the pseudo reward $p_{t_i,a} = \tilde{r}_{t_i,a}$, the estimated reward of LinUCBPR-ER stays the same. That is, $\tilde{r}_{t_{i+1},a} = \tilde{r}_{t_i,a}$.

Proof. Because $p_{t_i,a} = x_{t_i}^\top \hat{w}_{t_i,a} = \tilde{r}_{t_i,a}$, updating with $x_{t_i}$ and $p_{t_i,a}$ will not change $\hat{w}_{t_i,a}$. Thus the estimated reward stays the same.

Lemma 6. After updating with the context $x_{t_i}$ and the pseudo reward $p_{t_i,a} = \tilde{r}_{t_i,a}$, the uncertainty of LinUCBPR-ER for any context is non-increasing. That is, for any $x$,
$$\sqrt{x^\top \hat{A}_{t_{i+1},a}^{-1} x} \le \sqrt{x^\top \hat{A}_{t_i,a}^{-1} x}.$$

Proof. By the Sherman-Morrison formula, we have
$$x^\top \hat{A}_{t_{i+1},a}^{-1} x = x^\top \hat{A}_{t_i,a}^{-1} x - \frac{x^\top \hat{A}_{t_i,a}^{-1} x_{t_i}\, x_{t_i}^\top \hat{A}_{t_i,a}^{-1} x}{1 + x_{t_i}^\top \hat{A}_{t_i,a}^{-1} x_{t_i}} = x^\top \hat{A}_{t_i,a}^{-1} x - \frac{\bigl(x^\top \hat{A}_{t_i,a}^{-1} x_{t_i}\bigr)^2}{1 + x_{t_i}^\top \hat{A}_{t_i,a}^{-1} x_{t_i}}.$$
The second term is greater than or equal to zero, which implies the lemma.
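As a quick numerical sanity check (our own illustration, not part of the paper), the following snippet verifies that the rank-one update $\hat{A} + x_{t_i} x_{t_i}^\top$ never increases $\sqrt{x^\top \hat{A}^{-1} x}$ in any direction, matching the Sherman-Morrison computation above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
A = np.eye(d) + B @ B.T                 # a positive-definite matrix, like I_d + X^T X
x_new = rng.standard_normal(d)          # the context absorbed within the round
A_next = A + np.outer(x_new, x_new)     # rank-one update used by LinUCBPR-ER

for _ in range(1000):                   # uncertainty is non-increasing in every direction
    x = rng.standard_normal(d)
    assert np.sqrt(x @ np.linalg.inv(A_next) @ x) <= np.sqrt(x @ np.linalg.inv(A) @ x) + 1e-12
```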

Similarly, we can construct BaseLinUCBPR-ER and SupLinUCBPR-ER. By Lemma 5, we have that for each time step $t_i$ in round $t$, the estimated reward $\bar{x}_{t_i,a}^\top \bar{w}_{t_i} = \bar{x}_{t_i,a}^\top \bar{w}_{t_1}$ does not change. Furthermore, by Lemma 6, we have $\sqrt{x^\top \hat{A}_{t_i,a}^{-1} x} \le \sqrt{x^\top \hat{A}_{t_1,a}^{-1} x}$. Hence, all the lemmas we need also hold for BaseLinUCBPR-ER. Similar to LinUCB, we can then establish the following theorem. The proof is almost identical to that of Theorem 1.

Theorem 2. For some $\alpha = O(\sqrt{\ln(nTK/\delta)})$, with probability $1-\delta$, the regret of LinUCBPR-ER under the piled-reward setting is $O\bigl(\sqrt{dn^2TK\ln^3(nTK/\delta)}\bigr)$.

5 Experiments

We apply the proposed algorithms to both artificial and real-world datasets to justify that using pseudo rewards is useful. In addition, we follow [1] and take the simple supervised-to-contextual-bandit transformation [7] on 8 multi-class datasets to evaluate our idea.


Table 1. ACR on artificial datasets (mean ± std)

                 d = 10                                      d = 30
                 K = 50                K = 100               K = 50                K = 100
                 n=500     n=1000      n=500     n=1000      n=500     n=1000      n=500     n=1000
Ideal            0.6607    0.6607      0.7061    0.7061      0.3930    0.3930      0.4252    0.4252
                 ±0.0002   ±0.0002     ±0.0002   ±0.0002     ±0.0002   ±0.0002     ±0.0002   ±0.0002
ε-Greedy         0.6265    0.6329      0.6317    0.6538      0.3566    0.3690      0.3537    0.3739
                 ±0.0030   ±0.0016     ±0.0043   ±0.0030     ±0.0030   ±0.0014     ±0.0022   ±0.0026
LinUCB           0.6555    0.6513      0.6866    0.6868      0.3905    0.3880      0.4188    0.4164
                 ±0.0004   ±0.0005     ±0.0011   ±0.0011     ±0.0003   ±0.0004     ±0.0007   ±0.0004
LinUCBPR-ER      0.6591    0.6535      0.7000    0.7040      0.3917    0.3896      0.4227    0.4224
                 ±0.0001   ±0.0002     ±0.0012   ±0.0003     ±0.0002   ±0.0002     ±0.0004   ±0.0004
LinUCBPR-UR      0.6586    0.6533      0.6978    0.7027      0.3911    0.3887      0.4210    0.4215
                 ±0.0001   ±0.0001     ±0.0011   ±0.0003     ±0.0002   ±0.0002     ±0.0002   ±0.0003
QPM-D            0.6552    0.6502      0.6925    0.6860      0.3897    0.3871      0.4172    0.4123
                 ±0.0003   ±0.0004     ±0.0010   ±0.0013     ±0.0003   ±0.0003     ±0.0007   ±0.0010

Fig. 1. ACR versus round (d = 10, K = 100, n = 500)

Fig. 2. Regret versus round (d = 10, K = 100, n = 500)

Artificial Datasets. For each artificial dataset, we first sample unit vectors $u_1, u_2, \ldots, u_K$ uniformly from $\mathbb{R}^d$ to simulate the $K$ actions. In each round $t$, the context $x_{t_i}$ is sampled from a uniform distribution within $\|x_{t_i}\| \le 1$. The reward is generated by $r_{t_i,a_{t_i}} = u_{a_{t_i}}^\top x_{t_i} + \epsilon_{t_i}$, where $\epsilon_{t_i} \in [-0.05, 0.05]$ is uniform random noise. In the experiments, we let $nT = 50000$ be a constant, and consider parameters $d \in \{10, 30\}$, $K \in \{50, 100\}$, and $n \in \{500, 1000\}$.
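A minimal sketch of this data-generation step (our own code; the function name and the exact way of sampling uniformly from the unit ball are our choices, not specified by the paper):

```python
import numpy as np

def make_artificial_bandit(d, K, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((K, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # K unit weight vectors u_1, ..., u_K

    def draw_context():
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)                      # uniform direction
        return v * rng.uniform() ** (1.0 / d)       # radius for a uniform draw in the unit ball

    def reward(x, a):
        return U[a] @ x + rng.uniform(-0.05, 0.05)  # linear payoff plus uniform noise
    return draw_context, reward
```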

We compare the performance of ε-Greedy, LinUCB, LinUCBPR-ER, and LinUCBPR-UR under the piled-reward setting. We also compare with Queued Partial Monitoring with Delays (QPM-D) [9], which uses a queue to handle arbitrarily-delayed rewards. Furthermore, we consider an "ideal" LinUCB under the traditional setting ($n = 1$) to study the difference between the traditional setting and the piled-reward setting. The parameters of the algorithms are selected by grid search, where $\alpha, \beta \in \{0.05, 0.10, \ldots, 1.00\}$ and $\epsilon \in \{0.025, 0.05, \ldots, 0.1\}$. We run the experiment 20 times and show the average cumulative reward (ACR), which is the cumulative reward divided by the number of time steps, in Table 1. From the table, Ideal LinUCB clearly outperforms the others. This verifies that the piled-reward setting introduces difficulty in applying upper-confidence bound algorithms. It also echoes the regret bound in Section 4, where LinUCB under the piled-reward setting suffers some penalty when compared with the original bound.

Next, we focus on the influence of the pseudo rewards. LinUCBPR-ER and LinUCBPR-UR are consistently better than LinUCB on all datasets.


Table 2. Datasets

dataset    D    K
shuttle    9    7
poker      10   10
pendigits  16   10
letter     16   26
satimage   36   6
acoustic   50   3
covtype    54   7
usps       256  10

Table 4. t-test at 95% confidence level (win/tie/loss)

Algorithm \ Competitor   ε-Greedy   LinUCB   LinUCBPR-ER   LinUCBPR-UR   QPM-D
ε-Greedy                 —          0/0/8    0/0/8         0/0/8         0/0/8
LinUCB                   8/0/0      —        0/1/7         2/4/2         1/6/1
LinUCBPR-ER              8/0/0      7/1/0    —             5/3/0         6/2/0
LinUCBPR-UR              8/0/0      2/4/2    0/3/5         —             2/5/1
QPM-D                    8/0/0      1/6/1    0/2/6         1/5/2         —

Table 3. ACR on supervised-to-contextual-bandit datasets (mean ± std)

              shuttle   poker     pendigits  letter    satimage  acoustic  covtype   usps
Ideal         0.9373    0.4866    0.8929     0.6271    0.8344    0.7216    0.6987    0.9358
              ±0.0005   ±0.0075   ±0.0056    ±0.0117   ±0.0014   ±0.0006   ±0.0020   ±0.0012
ε-Greedy      0.8844    0.4766    0.8667     0.4746    0.8062    0.6992    0.6736    0.9009
              ±0.0092   ±0.0086   ±0.0058    ±0.0247   ±0.0024   ±0.0012   ±0.0035   ±0.0023
LinUCB        0.9168    0.4863    0.8876     0.5696    0.8225    0.7103    0.6888    0.9192
              ±0.0068   ±0.0087   ±0.0043    ±0.0176   ±0.0016   ±0.0016   ±0.0039   ±0.0017
LinUCBPR-ER   0.9200    0.4865    0.8901     0.6053    0.8236    0.7112    0.6915    0.9221
              ±0.0029   ±0.0046   ±0.0019    ±0.0137   ±0.0022   ±0.0007   ±0.0021   ±0.0011
LinUCBPR-UR   0.9170    0.4846    0.8872     0.6017    0.8189    0.7099    0.6913    0.9179
              ±0.0027   ±0.0107   ±0.0043    ±0.0167   ±0.0045   ±0.0021   ±0.0014   ±0.0025
QPM-D         0.9166    0.4860    0.8844     0.5585    0.8221    0.7101    0.6915    0.9185
              ±0.0046   ±0.0033   ±0.0044    ±0.0225   ±0.0024   ±0.0012   ±0.0013   ±0.0018

Figures 1 and 2 respectively depict the ACR and the regret along normalized rounds, i.e., $t/T$, when $d = 10$, $K = 100$, and $n = 500$. Note that the LinUCBPR algorithms enjoy an advantage in the early rounds. This is because exploration is generally more important than exploitation in the early rounds, and the LinUCBPR algorithms encourage more strategic exploration by using pseudo rewards.

We take ε-Greedy to compare the effect of conducting exploration within the round based on randomness rather than pseudo rewards. Table 1 suggests that the LinUCBPR algorithms reach much better performance, which justifies the effectiveness of the strategic exploration. We also compare the LinUCBPR algorithms with QPM-D. Table 1 shows that the LinUCBPR algorithms are consistently better than QPM-D. The results again justify the superiority of the LinUCBPR algorithms.

LinUCBPR-ER and LinUCBPR-UR perform quite comparably across all datasets. The results suggest that we do not need to be more aggressive than LinUCBPR-ER. The simple LinUCBPR-ER, which can be efficiently implemented by updating $\hat{A}_{t_i,a}$ only, can readily reach decent performance.

Supervised-to-contextual-bandit Datasets. Next, we take 8 public multi-class datasets¹ (Table 2). We randomly split each dataset into two parts: 30% for parameter tuning and 70% for testing. For each part, we repeatedly present the examples as an infinite data stream. We let $nT = 10000$ for parameter tuning and $nT = 30000$ for testing. We consider $n = 500$ for all datasets. The parameter setting is the same as the one for the artificial datasets.

1 available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
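The transformation we rely on here is the usual supervised-to-bandit recipe from [7] and [1]: each multi-class example provides a context, the classes act as actions, and the reward is 1 when the chosen action matches the hidden label and 0 otherwise. A hedged sketch (our own code, assuming an in-memory dataset) is:

```python
from itertools import cycle

def supervised_to_bandit_stream(X, y):
    """Turn a multi-class dataset into an infinite contextual-bandit stream.

    Yields (context, reward_fn) pairs; the learner only observes the reward of
    the action it picks, never the label itself.
    """
    for x, label in cycle(zip(X, y)):   # repeatedly present the examples as a stream
        yield x, (lambda a, label=label: 1.0 if a == label else 0.0)
```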


Fig. 3. CTR versus round on real-world datasets: (a) R6A, (b) R6B, (c) R6B-clean

Table 3 shows the average results of 20 experiments. LinUCBPR-ER is consistently better than the others, and LinUCBPR-UR is competitive with the others. The results again confirm that the LinUCBPR algorithms are useful under the piled-reward setting, and also again confirm LinUCBPR-ER to be the best algorithm. We further compare these algorithms with a two-sample t-test at the 95% confidence level in Table 4. The results demonstrate the significance of the strong performance of LinUCBPR-ER.

Real-world Datasets. Finally, we use two real-world datasets, R6A and R6B, released by Yahoo! to examine our proposed algorithms. To the best of our knowledge, they are the only two public datasets for the CBP. They first appeared in the ICML 2012 workshop on New Challenges for Exploration and Exploitation 3, and also appear in [10]. They are about news article recommendation.

Note that the action sets of the two datasets are dynamic. To deal with this, we let the algorithms maintain a weight vector $w_{t_i,a}$ for each action. The dimensions of the contexts for R6A and R6B are 6 and 136, respectively. The rewards for both datasets are in $\{0, 1\}$, representing the clicks from users. We note that in R6B, some examples do not come with valid contexts; we remove these examples and form a new dataset, R6B-clean. We use the click-through rate (CTR) to evaluate the algorithms, and use the technique described in [11] to achieve an unbiased off-line evaluation.
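The technique of [11] is commonly known as the replay evaluator: scan the logged events (collected under uniformly random action selection), and whenever the evaluated policy picks the same action as the logged one, count that event's click and let the policy learn from it; otherwise discard the event. A hedged sketch adapted to the piled-reward setting, with assumed field names and the `policy` interface used in the earlier sketches:

```python
def replay_ctr(policy, logged_events, n):
    """Offline CTR estimate in the spirit of Li et al. [11]; illustrative only.

    Each logged event is assumed to be a (context, logged_action, click) triple
    collected under uniformly random action selection.
    """
    matched, clicks, pile = 0, 0, []
    for context, logged_action, click in logged_events:
        if policy.choose(context) == logged_action:    # keep only matching events
            matched += 1
            clicks += click
            pile.append((context, logged_action, click))
            if len(pile) == n:                         # rewards arrive as a pile of size n
                policy.update_with_rewards([(x, a) for x, a, _ in pile],
                                           [c for _, _, c in pile])
                pile = []
    return clicks / max(matched, 1)                    # estimated click-through rate
```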

We split the datasets into two parts: a parameter-tuning part and a testing part. For R6A, we let $nT = 10000$ for parameter tuning and $nT = 300000$ for testing. For R6B and R6B-clean, we let $nT = 10000$ for parameter tuning and $nT = 100000$ for testing. We consider $n = 500$ for each dataset. The parameter setting is the same as the one for the artificial datasets.

Figure 3 shows the experimental results. Unlike the results on the artificial datasets, the CTR curves are non-monotonic. This is possibly because the action set is dynamic, and the better actions may disappear from the action set in the middle of the run, which leads to some drop in CTR.

The LinUCBPR algorithms and QPM-D usually perform better than LinUCB and ε-Greedy on these datasets. LinUCBPR-ER is stable among the better choices, while LinUCBPR-UR and QPM-D can sometimes be inferior. The results again suggest LinUCBPR-ER to be a promising algorithm for the piled-reward setting.


6 Conclusion

We introduce the contextual bandit problem under the piled-reward setting and show how to apply LinUCB to this setting. We also propose a novel algorithm, LinUCBPR, which uses pseudo rewards to encourage strategic exploration that utilizes received contexts that are temporarily without rewards. We prove a regret bound for both LinUCB and LinUCBPR with estimated reward (LinUCBPR-ER), and discuss how the bound compares with the original bound. Empirical results show that LinUCBPR performs better in early time steps and is competitive in the long term. Most importantly, LinUCBPR-ER yields promising performance on all datasets. The results suggest LinUCBPR-ER to be the best choice in practice.

Acknowledgements

We thank the anonymous reviewers and the members of the NTU CLLab for valuable suggestions. This work is partially supported by the Ministry of Science and Technology of Taiwan (MOST 103-2221-E-002-148-MY3) and the Asian Office of Aerospace Research and Development (AOARD FA2386-15-1-4012).

References

1. Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., Schapire, R.E.: Taming the monster: A fast and simple algorithm for contextual bandits. In: ICML. pp. 1638–1646 (2014)
2. Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, 397–422 (2003)
3. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (2002)
4. Chou, K.C., Chiang, C.K., Lin, H.T., Lu, C.J.: Pseudo-reward algorithms for contextual bandits with linear payoff functions. In: ACML. pp. 344–359 (2014)
5. Chu, W., Li, L., Reyzin, L., Schapire, R.E.: Contextual bandits with linear payoff functions. In: AISTATS. pp. 208–214 (2011)
6. Dudík, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., Zhang, T.: Efficient optimal learning for contextual bandits. In: UAI. pp. 169–178 (2011)
7. Dudík, M., Langford, J., Li, L.: Doubly robust policy evaluation and learning. In: ICML. pp. 1097–1104 (2011)
8. Guha, S., Munagala, K., Pal, M.: Multiarmed bandit problems with delayed feedback. arXiv:1011.1161 (2010)
9. Joulani, P., György, A., Szepesvári, C.: Online learning under delayed feedback. In: ICML. pp. 1453–1461 (2013)
10. Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual-bandit approach to personalized news article recommendation. In: WWW. pp. 661–670 (2010)
11. Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: WSDM. pp. 297–306 (2011)
12. Mandel, T., Liu, Y.E., Brunskill, E., Popovic, Z.: The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In: AAAI (2015)
13. Wang, C.C., Kulkarni, S.R., Poor, H.V.: Bandit problems with side observations. IEEE Transactions on Automatic Control 50(3), 338–355 (2005)
