Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions


Ku-Chun Chou

Department of Computer Science and Information Engineering, National Taiwan University

Chao-Kai Chiang

Department of Mathematics and Information Technology, University of Leoben

Hsuan-Tien Lin

Department of Computer Science and Information Engineering, National Taiwan University

Chi-Jen Lu

Institute of Information Science, Academia Sinica


We study the contextual bandit problem with linear payoff functions, which is a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner needs to iteratively select an action based on an observed context, and receives a linear score on only the selected action as the reward feedback. Motivated by the observation that better performance is achievable if the rewards on the non-selected actions can also be revealed to the learner, we propose a new framework that feeds the learner with pseudo-rewards, which are estimates of the rewards on the non-selected actions. We argue that the pseudo-rewards should be over-estimates of the true rewards, and propose a forgetting mechanism to decrease the negative influence of the over-estimation in the long run. Then, we couple the two key ideas above with the linear upper confidence bound (LinUCB) algorithm to design a novel algorithm called linear pseudo-reward upper confidence bound (LinPRUCB). We prove that LinPRUCB shares the same order of regret bound as LinUCB, while enjoying the practical advantage of faster reward-gathering in the earlier iterations. Experiments on artificial and real-world data sets justify that LinPRUCB is competitive with and sometimes even better than LinUCB. Furthermore, we couple LinPRUCB with a special parameter to formalize a new algorithm that yields faster computation in updating the internal models while keeping the promising practical performance. The two properties match the real-world needs of the contextual bandit problem and make the new algorithm a favorable choice in practice.

Keywords: contextual bandit, pseudo-reward, upper confidence bound

1. Introduction

We study the contextual bandit problem (Wang et al., 2005), also known as the k-armed bandit problem with context (Auer et al., 2002a), in which a learner needs to interact iteratively with the external environment. In the contextual bandit problem, the learner has to make decisions in a number of iterations. During each iteration, the learner can first access certain information, called the context, about the environment. After seeing the context, the learner is asked to strategically select an action from a set of candidate actions. The external environment then reveals the reward of the selected action to the learner as the feedback,


while hiding the rewards of all other actions. This type of feedback scheme is known as the bandit setting (Auer et al., 2002b). The learner then updates its internal models with the feedback to reach the goal of gathering the most cumulative reward over all the iterations.

The difference of cumulative reward between the best strategy and the learner’s strategy is called the regret, which is often used to measure the performance of the learner.

The contextual bandit problem can be viewed as a more realistic extension of the traditional bandit problem (Lai and Robbins, 1985), which does not provide any context information to the learner. The use of context allows expressing many interesting applications. For instance, consider an online advertising system operated by a contextual bandit learner.

For each user visit (iteration), the advertising algorithm (learner) is provided with the known properties about the user (context), and an advertisement (action) from a pool of relevant advertisements (set of candidate actions) is selected and displayed to that user.

The company's earning (reward) can be calculated based on whether the user clicked the selected advertisement and on the value of the advertisement itself. The reward can then be revealed to the algorithm as the feedback, and maximizing the cumulative reward directly connects to the company's total earning. Many other web applications for advertisement and recommendation can be similarly modeled as a contextual bandit problem (Li et al., 2010, 2011; Pandey et al., 2007), and hence the problem has been attracting much research attention (Auer et al., 2002a,b; Auer, 2002; Kakade et al., 2008; Chu et al., 2011; Chen et al., 2014).

The bandit setting contrasts with the full-information setting in traditional online learning (Kakade et al., 2008; Chen et al., 2014), in which the reward of every action is revealed to the learner regardless of whether the action is selected. Intuitively, it is easier to design learners under the full-information setting, and such learners (full-information learners for short) usually perform better than their siblings that run under the bandit setting (to be shown in Section 2). The performance difference motivates us to study whether we can mimic the full-information setting to design better contextual bandit learners.

In particular, we propose the concept of pseudo-rewards, which embed estimates of the hidden rewards of non-selected actions under the bandit setting. Then, the revealed reward and the pseudo-rewards can be jointly fed to full-information learners to mimic the full-information setting. We study the possibility of feeding pseudo-rewards to linear full-information learners, which compute the estimates by linear ridge regression. In particular, we propose to use pseudo-rewards that over-estimate the hidden rewards. Such pseudo-rewards embed the intention of exploration, which means selecting the action that the learner is less certain about. The intention has been shown useful in many of the existing works on the contextual bandit problem (Auer et al., 2002b; Li et al., 2010). To neutralize the effect of routinely over-estimating the hidden rewards, we further propose a forgetting mechanism that decreases the influence of the pseudo-rewards received in the earlier iterations.

Combining the over-estimated pseudo-rewards, the forgetting mechanism, and the state-of-the-art Linear Upper Confidence Bound (LinUCB) algorithm (Li et al., 2010; Chu et al., 2011), we design a novel algorithm, Linear Pseudo-reward Upper Confidence Bound (LinPRUCB). We prove that LinPRUCB achieves a regret bound matching that of LinUCB. In addition, we demonstrate empirically with artificial and real-world data sets that LinPRUCB enjoys two additional properties. First, with the pseudo-rewards on hand,


LinPRUCB is able to mimic the full-information setting more closely, which leads to faster reward-gathering in earlier iterations when compared with LinUCB. Second, with the over-estimated pseudo-rewards on hand, LinPRUCB is able to express the intention of exploration in the model-updating step rather than the action-selection step of the learner.

Based on the second property, we derive a variant of LinPRUCB, called LinPRGreedy, that performs action selection faster than LinPRUCB and LinUCB while maintaining competitive performance with those two algorithms. The two properties above make our proposed LinPRGreedy a favorable choice in real-world applications that usually demand fast reward-gathering and fast action-selection.

This paper is organized as follows. In Section 2, we give a formal setup of the problem and introduce related works. In Section 3, we describe the framework of using pseudo-rewards and derive LinPRUCB with its regret bound analysis. We present experimental simulations on artificial data sets and large-scale benchmark data sets in Section 4. Finally, we introduce the practical variant LinPRGreedy in Section 5, and conclude in Section 6.

2. Preliminaries

We use a bold lower-case symbol like w to denote a column vector and a bold upper-case symbol like Q to denote a matrix. Let [N] represent the set {1, 2, · · · , N}. The contextual bandit problem consists of T iterations of decision making. In each iteration t, a contextual bandit algorithm (learner) first observes a context x_t ∈ R^d from the environment, and is asked to select an action a_t from the action set [K]. We shall assume that the contexts are bounded by ‖x_t‖_2 ≤ 1. After selecting an action a_t, the algorithm receives the corresponding reward r_{t,a_t} ∈ [0, 1] from the environment, while the other rewards r_{t,a} for a ≠ a_t are hidden from the algorithm. The algorithm should strategically select the actions in order to maximize the cumulative reward Σ_{t=1}^{T} r_{t,a_t} after the T iterations.

In this paper, we study the contextual bandit problem with linear payoff functions (Chu et al., 2011), where the reward is generated by the following process. For any context x_t and any action a ∈ [K], assume that the reward r_{t,a} is a random variable with conditional expectation E[r_{t,a} | x_t] = u_a^T x_t, where u_a ∈ R^d is an unknown vector with ‖u_a‖_2 ≤ 1. The linear relationship allows us to define a_t^* = argmax_{a∈[K]} u_a^T x_t, which is the optimal strategy of action selection in hindsight. Then, the goal of maximizing the cumulative reward can be translated to minimizing the regret of the algorithm, which is defined as

regret(T) = Σ_{t=1}^{T} r_{t,a_t^*} − Σ_{t=1}^{T} r_{t,a_t}.
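The interaction protocol above can be simulated with a short loop. The following is an illustrative sketch, not code from the paper: the random environment, the uniform-random placeholder policy, and the dimensions are our own choices, and the regret is computed against the per-iteration best action in hindsight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 4, 200

# Hidden parameter vectors u_a with ||u_a||_2 = 1 (unknown to the learner).
U = rng.normal(size=(K, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

def expected_reward(x, a):
    return float(U[a] @ x)  # E[r_{t,a} | x_t] = u_a^T x_t

total_reward, best_reward = 0.0, 0.0
for t in range(T):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))       # enforce ||x_t||_2 <= 1
    a = int(rng.integers(K))               # placeholder policy: uniform random
    total_reward += expected_reward(x, a)  # only this reward is revealed
    best_reward += max(expected_reward(x, b) for b in range(K))

regret = best_reward - total_reward        # regret(T) against the best actions
```

Any contextual bandit algorithm in this paper plugs into the `a = ...` line; the rest of the loop is the environment.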

Next, we introduce a family of existing linear algorithms for the contextual bandit problem with linear payoff functions. The algorithms all maintain w_{t,a} as the current estimate of u_a. Then, w_{t,a}^T x_t represents the estimated reward of selecting action a upon seeing x_t. The most naïve algorithm of the family is called Linear Greedy (LinGreedy). During iteration t, with the estimates w_{t,a} on hand, LinGreedy simply selects an action a_t that maximizes the estimated reward. That is, a_t = argmax_{a∈[K]} w_{t,a}^T x_t. Then, LinGreedy computes the weight vectors w_{t+1,a} for the next iteration by ridge regression

w_{t+1,a} = (λ I_d + X_{t,a}^T X_{t,a})^{-1} (X_{t,a}^T r_{t,a}),    (1)

where λ > 0 is a given parameter for ridge regression and I_d is a d × d identity matrix. We use (X_{t,a}, r_{t,a}) to store the observed contexts and rewards only when action a gets selected by the algorithm: X_{t,a} is a matrix that contains x_τ^T as rows, where 1 ≤ τ ≤ t and a_τ = a; r_{t,a} is a vector that contains the corresponding r_{τ,a} as components. That is, each x_τ is only stored in X_{t,a_τ}. Then, it is easy to see that w_{t+1,a} = w_{t,a} for a ≠ a_t. That is, only the weight vector w_{t+1,a} with a = a_t will be updated.
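As a concrete sketch, LinGreedy's selection rule and the per-action ridge-regression update of (1) might look as follows. This is illustrative code under our own naming; the sufficient statistics A_a = X_{t,a}^T X_{t,a} and b_a = X_{t,a}^T r_{t,a} are maintained incrementally so (1) never rebuilds the data matrices.

```python
import numpy as np

class LinGreedy:
    """Greedy linear bandit: select argmax_a w_{t,a}^T x_t, then update
    only the selected action's model by ridge regression (1)."""
    def __init__(self, d, K, lam=1.0):
        self.d, self.lam = d, lam
        # Per-action sufficient statistics: A_a = X_a^T X_a, b_a = X_a^T r_a.
        self.A = [np.zeros((d, d)) for _ in range(K)]
        self.b = [np.zeros(d) for _ in range(K)]
        self.w = [np.zeros(d) for _ in range(K)]

    def select(self, x):
        return int(np.argmax([w @ x for w in self.w]))

    def update(self, x, a, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x
        # w_{t+1,a} = (lam*I_d + X^T X)^{-1} X^T r -- only action a changes.
        self.w[a] = np.linalg.solve(self.lam * np.eye(self.d) + self.A[a],
                                    self.b[a])
```

With noiseless linear rewards, the weight of the played action converges to the hidden u_a, while the weights of unplayed actions never move, which is exactly the myopia discussed next.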

Because only the reward r_{t,a_t} of the selected action a_t is revealed to the algorithm, it is known that LinGreedy suffers from its myopic decisions on action selection. In particular, LinGreedy only focuses on exploitation (selecting the actions that are seemingly more rewarding) but does not conduct exploration (trying the actions that the algorithm is less certain about). Thus, it is possible that LinGreedy would miss the truly rewarding actions in the long run. One major challenge in designing better contextual bandit algorithms is to strike a balance between exploitation and exploration (Auer, 2002).

For LinGreedy, one simple remedy for the challenge is to replace the a_t of LinGreedy by a randomly selected action with a non-zero probability ε in each iteration. This remedy is known as ε-greedy, which was developed in reinforcement learning (Sutton and Barto, 1998), and has been extended to some more sophisticated algorithms like epoch-greedy (Langford and Zhang, 2007) that controls ε dynamically. Both ε-greedy and epoch-greedy explore other potential actions with a non-zero probability, and thus generally perform better than LinGreedy in theory and in practice.

Another possible approach to balancing exploitation and exploration is through the use of the upper confidence bound (Auer, 2002). Upper-confidence-bound algorithms cleverly select an action based on some mixture of the estimated reward and an uncertainty term.

The uncertainty term represents the amount of information that has been received for the candidate action and decreases as more information is gathered during the iterations.

The mixture allows the algorithms to select either an action with high estimated reward (exploitation) or with high uncertainty (exploration).

Linear Upper Confidence Bound (LinUCB) (Chu et al., 2011; Li et al., 2010) is a state-of-the-art representative within the family of upper-confidence-bound algorithms. In addition to calculating the estimated reward w_{t,a}^T x_t like LinGreedy, LinUCB computes the uncertainty term by c_{t,a} = √(x_t^T Q_{t-1,a}^{-1} x_t), where Q_{t-1,a} = λ I_d + X_{t-1,a}^T X_{t-1,a} is the matrix that gets inverted when computing w_{t,a} by (1). For any given α > 0, the upper confidence bound of action a can then be formed by w_{t,a}^T x_t + α c_{t,a}. LinUCB takes this bound for selecting the action to balance exploitation and exploration. That is,

a_t = argmax_{a∈[K]} ( w_{t,a}^T x_t + α c_{t,a} ).    (2)

Then, after receiving the reward r_{t,a_t}, similar to LinGreedy, LinUCB takes ridge regression (1) to update only the weight vector w_{t+1,a_t}.

Note that LinGreedy can then be viewed as an extreme case of LinUCB with α = 0.

The extreme case of LinGreedy enjoys the computational benefit of taking only O(d) time for selecting an action, which is useful in practical applications that need fast action-selection. In contrast, general LinUCB with α > 0 requires O(d²) time to compute the uncertainty term. Nevertheless, LinUCB enjoys a better theoretical guarantee and better practical performance. In particular, for proper choices of a non-zero α, LinUCB enjoys a rather low regret bound (Chu et al., 2011) and reaches state-of-the-art performance in practice (Li et al., 2010).
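A sketch of LinUCB's selection rule (2) on top of the same ridge statistics follows. This is an illustration under our own naming, not the authors' code; for clarity the update recomputes the matrix inverse directly, whereas an efficient implementation would use the Sherman-Morrison rank-one update.

```python
import numpy as np

class LinUCB:
    """Select argmax_a w_{t,a}^T x + alpha * c_{t,a}, where
    c_{t,a} = sqrt(x^T Q_{t-1,a}^{-1} x) and Q_{t-1,a} = lam*I + X^T X."""
    def __init__(self, d, K, alpha=0.5, lam=1.0):
        self.alpha, self.lam, self.d = alpha, lam, d
        self.Qinv = [np.eye(d) / lam for _ in range(K)]  # Q_{0,a}^{-1} = I/lam
        self.b = [np.zeros(d) for _ in range(K)]         # X^T r per action
        self.w = [np.zeros(d) for _ in range(K)]

    def select(self, x):
        scores = [self.w[a] @ x + self.alpha * np.sqrt(x @ self.Qinv[a] @ x)
                  for a in range(len(self.w))]
        return int(np.argmax(scores))

    def update(self, x, a, r):
        # Q_{t,a} = Q_{t-1,a} + x x^T; a rank-one inverse update would be O(d^2).
        Q = np.linalg.inv(self.Qinv[a]) + np.outer(x, x)
        self.Qinv[a] = np.linalg.inv(Q)
        self.b[a] += r * x
        self.w[a] = self.Qinv[a] @ self.b[a]
```

Note how the uncertainty term c_{t,a} for a context shrinks once that context direction has been observed, which is the mechanism that drives exploration.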


3. Linear Pseudo-reward Upper Confidence Bound

In LinUCB (as well as LinGreedy), the estimated weight vectors w_{t+1,a} are computed from (X_{t,a}, r_{t,a}): all the observed contexts and rewards only when action a gets selected by the algorithm. Because of the bandit setting, if an action a_t is selected in iteration t, the rewards r_{t,a} for a ≠ a_t are unknown to the learning algorithm. Since r_{t,a} are unknown, the context x_t is not used to update w_{t+1,a} for any a ≠ a_t, which appears to be a waste of the information in x_t. On the other hand, if we optimistically assume that all the rewards r_{t,a} are revealed to the learning algorithm, which corresponds to the full-information setting in online learning, ideally w_{t+1,a} for every action a can be updated with (x_t, r_{t,a}). Then, w_{t+1,a} shall converge to a decent estimate faster, and the resulting "full-information variant" of LinUCB shall achieve better performance.

We demonstrate the motivation above with the results on a simple artificial data set in Figure 1. More detailed comparisons with artificial data sets will be shown in Section 4.

Each curve represents the average cumulative reward with different types of feedback information. The pink circles represent LinUCB that receives full information about the rewards; the red diamonds represent the original LinUCB that receives the bandit feedback. Unsurprisingly, Figure 1 suggests that the full-information LinUCB outperforms the bandit-information LinUCB by a big gap.

We include another curve, colored black, in Figure 1. The black curve represents the performance of the full-information LinUCB when the received rewards are perturbed by some random white noise within [−0.05, 0.05]. As a consequence, the black curve is worse than the (pink-circle) curve of the full-information LinUCB with noiseless rewards.

Nevertheless, the black curve is still better than the (red-diamond) curve of the bandit-information LinUCB. That is, by updating w_{t+1,a} on all actions a (full information), even with inaccurate rewards, it is possible to improve the original LinUCB that only updates w_{t+1,a_t} of the selected action a_t (bandit information). The results motivate us to study how contextual bandit algorithms can be designed by mimicking the full-information setting.

[Figure 1: a motivating example. Average cumulative reward under three feedback types: Full, Full (Noisy), and Bandit.]

Based on the motivation, we propose a novel framework of Linear Pseudo-reward (LinPR) algorithms. In addition to using (x_t, r_{t,a_t}) to update the weight vector w_{t+1,a_t}, LinPR algorithms compute some pseudo-rewards p_{t,a} for all a ≠ a_t, and take (x_t, p_{t,a}) to update w_{t+1,a} for a ≠ a_t as well. Three questions remain in designing an algorithm under the LinPR framework. How are p_{t,a} computed? How are (x_t, p_{t,a}) used to update w_{t+1,a}? How are w_{t+1,a} used in selecting the next action? We answer those questions below to design a concrete and promising algorithm.

Choosing p_{t,a}. Ideally, for the purpose of mimicking the full-information setting, we hope p_{t,a} to be an accurate estimate of the true reward r_{t,a}. If so, p_{t,a} serves as free feedback without costing the algorithm any action on exploration. Of course, the hope cannot be achieved in general. We propose to use p_{t,a} to over-estimate the true reward instead. The over-estimates tilt w_{t,a} towards those actions a that the algorithm is less certain about when observing x_t. That is, the over-estimates guide the algorithm towards exploring the less certain actions during the model-updating step, which fits the intent to strike a balance between exploitation and exploration (Auer, 2002).

Formally, we propose to borrow the upper confidence bound in LinUCB, which is an over-estimate of the true reward, for computing the pseudo-reward p_{t,a}. That is,

p_{t,a} = w_{t,a}^T x_t + β ĉ_{t,a}, with ĉ_{t,a} = √(x_t^T Q̂_{t-1,a}^{-1} x_t),    (3)

where Q̂_{t-1,a} is a matrix extended from Q_{t-1,a} and will be specified later in (5); β > 0 is a given parameter like α. Similar to the upper confidence bound in LinUCB, the first term w_{t,a}^T x_t computes the estimated reward and the second term β ĉ_{t,a} corresponds to some measure of uncertainty.

With the definition of p_{t,a} in (3), we use the vector p̄_{t,a} to gather all the pseudo-rewards calculated for action a, and the matrix X̄_{t,a} for the corresponding contexts. That is, X̄_{t,a} is a matrix that contains x_τ^T as rows, where 1 ≤ τ ≤ t and a_τ ≠ a; p̄_{t,a} is a vector that contains the corresponding p_{τ,a} as components. Next, we discuss how the information within (X̄_{t,a}, p̄_{t,a}) can be used to properly update w_{t+1,a}.

Updating w_{t+1,a} with (x_t, p_{t,a}). Recall that (X_{t,a}, r_{t,a}) contains all the observed contexts and rewards only when action a gets selected by the algorithm, and (X̄_{t,a}, p̄_{t,a}) contains the contexts and pseudo-rewards only when action a is not selected by the algorithm. Then, a natural attempt is to directly include the pseudo-rewards as if they were the true rewards when updating the model. That is, w_{t+1,a} can be updated by solving the ridge regression problem

w_{t+1,a} = argmin_w ( λ‖w‖² + ‖X_{t,a} w − r_{t,a}‖² + ‖X̄_{t,a} w − p̄_{t,a}‖² ).    (4)

The caveat with the update formula in (4) is that the pseudo-rewards affect the calculation of w_{t+1,a} permanently, even though some of them, especially the earlier ones, may be rather inaccurate. The inaccurate ones may become irrelevant or even misleading in the later iterations. Thus, intuitively we may not want those inaccurate pseudo-rewards to have the same influence on w_{t+1,a} as the more recent, and perhaps more accurate, pseudo-rewards.

Following the intuition, we propose the following forgetting mechanism, which puts more emphasis on recent pseudo-rewards and their contexts than on earlier ones. Formally, we take a forgetting parameter η ∈ [0, 1) to control how fast the algorithm forgets previous pseudo-rewards, where a smaller η corresponds to faster forgetting. Note that every pair (x_t, p_{t,a}) corresponds to an error term (w^T x_t − p_{t,a})² within the objective function of (4). When updating w_{t+1,a}, we treat the most recent pseudo-reward as if it were the true reward, and do not modify the associated term. For the second most recent pseudo-reward, however, we decrease its influence on w_{t+1,a} by multiplying its associated term by η². Similarly, if ℓ_{t,a} pseudo-rewards have been gathered when updating w_{t+1,a}, the term that corresponds to the earliest pseudo-reward would be multiplied by η^{2(ℓ_{t,a}−1)}, which decreases rapidly along ℓ_{t,a} when η < 1.

Equivalently, the forgetting mechanism can be performed as follows. Note that there are ℓ_{t,a} rows in X̄_{t,a} and p̄_{t,a}. Let X̂_{t,a} be an ℓ_{t,a} × d matrix with its i-th row being that of X̄_{t,a} multiplied by the factor η^{ℓ_{t,a}−i}; similarly, let p̂_{t,a} be an ℓ_{t,a}-dimensional vector with its i-th component being that of p̄_{t,a} multiplied by η^{ℓ_{t,a}−i}. We then update w_{t+1,a} by

w_{t+1,a} = argmin_w ( λ‖w‖² + ‖X_{t,a} w − r_{t,a}‖² + ‖X̂_{t,a} w − p̂_{t,a}‖² ),

which yields the closed-form solution

w_{t+1,a} = Q̂_{t,a}^{-1} ( X_{t,a}^T r_{t,a} + X̂_{t,a}^T p̂_{t,a} ), where Q̂_{t,a} = λ I_d + X_{t,a}^T X_{t,a} + X̂_{t,a}^T X̂_{t,a}.    (5)

One remark for the update formula in (5) is that the use of β > 0 to over-estimate the true reward in p_{t,a} is crucial. If only p′_{t,a} = w_{t,a}^T x_t were used as the pseudo-reward, we would be feeding the model with its own output. Then, the weight vector w_{t,a} can readily make zero error on (x_t, p′_{t,a}), and thus provably w_{t+1,a} = w_{t,a} is not updated. When β > 0, however, the weight vectors w_{t,a} for a ≠ a_t effectively embed the intention for exploration when the algorithm is less certain about selecting action a for x_t.
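The equivalence between the explicit η^{ℓ−i} row scaling and a recursive, per-step discounted update can be checked numerically. This is a small consistency check under our reading of the paper: the η² per-step discount on the statistics X̂^T X̂ and X̂^T p̂ is our inference from the row scaling, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, ell, eta = 4, 6, 0.9
xs = rng.normal(size=(ell, d))   # contexts gathered for a non-selected action
ps = rng.uniform(size=ell)       # their pseudo-rewards

# Explicit construction: row i of Xhat is eta^{ell-i} * x_i (i is 1-indexed).
scale = eta ** (ell - np.arange(1, ell + 1))
Xhat = scale[:, None] * xs
phat = scale * ps
V_explicit = Xhat.T @ Xhat       # = sum_i eta^{2(ell-i)} x_i x_i^T
z_explicit = Xhat.T @ phat       # = sum_i eta^{2(ell-i)} p_i x_i

# Recursive construction: discount the running statistics by eta^2 per step.
V, z = np.zeros((d, d)), np.zeros(d)
for x, p in zip(xs, ps):
    V = eta**2 * V + np.outer(x, x)
    z = eta**2 * z + p * x

assert np.allclose(V, V_explicit) and np.allclose(z, z_explicit)
```

The recursive form is what makes the forgetting mechanism cheap: the algorithm never needs to rebuild X̂_{t,a}, only to discount two running statistics.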

Using w_{t,a} for Action-selection. Two natural choices for using w_{t,a} in action selection are either to decide greedily based on the estimated rewards w_{t,a}^T x_t, or to decide in an upper-confidence-bound manner by

a_t = argmax_{a∈[K]} ( w_{t,a}^T x_t + α ĉ_{t,a} ),    (6)

where ĉ_{t,a} = √(x_t^T Q̂_{t-1,a}^{-1} x_t) is upgraded from the c_{t,a} used by LinUCB. We introduce the latter choice here, and will discuss the greedy choice in Section 5.

Combining the pseudo-rewards in (3), the model updating with the forgetting mechanism in (5), and the upper-confidence-bound decision in (6), we derive our proposed Linear Pseudo-reward Upper Confidence Bound (LinPRUCB) algorithm, as shown in Algorithm 1.

Similar to LinUCB, our LinPRUCB selects an action based on an upper confidence bound controlled by the parameter α > 0. The main difference from LinUCB is the loop from Step 7 to Step 16 that updates the weight w_{t+1,a} for all actions a. For the selected action a_t, its corresponding context vector and actual reward are incorporated in Step 9 as LinUCB does. For actions whose rewards are hidden, LinPRUCB computes the pseudo-rewards in Step 11, and applies the forgetting mechanism in Step 12. Finally, Step 15 computes the new weight vectors w_{t+1,a} shown in (5) for the next iteration.


Algorithm 1 Linear Pseudo-Reward Upper Confidence Bound (LinPRUCB)

1: Parameters: α, β, η, λ > 0
2: Initialize w_{1,a} := 0_d, Q̂_{0,a} := λ I_d, V̂_{0,a} := V_{0,a} := 0_{d×d}, ẑ_{0,a} := z_{0,a} := 0_d, for every a ∈ [K]
3: for t := 1 to T do
4:   Observe x_t
5:   Select a_t := argmax_{a∈[K]} w_{t,a}^T x_t + α √(x_t^T Q̂_{t-1,a}^{-1} x_t)
6:   Receive reward r_{t,a_t}
7:   for a ∈ [K] do
8:     if a = a_t then
9:       V_{t,a} := V_{t-1,a} + x_t x_t^T, z_{t,a} := z_{t-1,a} + r_{t,a_t} x_t
10:    else
11:      p_{t,a} := min( w_{t,a}^T x_t + β √(x_t^T Q̂_{t-1,a}^{-1} x_t), 1 )
12:      V̂_{t,a} := η² V̂_{t-1,a} + x_t x_t^T, ẑ_{t,a} := η² ẑ_{t-1,a} + p_{t,a} x_t
13:    end if
14:    Q̂_{t,a} := λ I_d + V_{t,a} + V̂_{t,a}
15:    w_{t+1,a} := Q̂_{t,a}^{-1} (z_{t,a} + ẑ_{t,a})
16:  end for
17: end for
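Algorithm 1 might be sketched in code as follows. This is our reading of the pseudocode, not the authors' implementation: the η² discount on the pseudo-reward statistics is inferred from the forgetting mechanism in Section 3, the statistics that are not touched in a branch carry over unchanged, and all names are ours.

```python
import numpy as np

def lin_prucb(contexts, reward_fn, K, alpha=0.5, beta=0.5, eta=0.9, lam=1.0):
    """Sketch of LinPRUCB. reward_fn(t, a) returns r_{t,a} in [0, 1];
    only the selected action's reward is observed by the learner."""
    T, d = contexts.shape
    V = [np.zeros((d, d)) for _ in range(K)]   # true-reward statistics X^T X
    z = [np.zeros(d) for _ in range(K)]        # true-reward statistics X^T r
    Vh = [np.zeros((d, d)) for _ in range(K)]  # discounted pseudo-reward stats
    zh = [np.zeros(d) for _ in range(K)]
    Qinv = [np.eye(d) / lam for _ in range(K)]
    w = [np.zeros(d) for _ in range(K)]
    total = 0.0
    for t in range(T):
        x = contexts[t]
        # Step 5: upper-confidence-bound action selection.
        at = int(np.argmax([w[a] @ x + alpha * np.sqrt(x @ Qinv[a] @ x)
                            for a in range(K)]))
        r = reward_fn(t, at)                   # Step 6: bandit feedback
        total += r
        for a in range(K):
            if a == at:                        # Steps 8-9: real reward
                V[a] += np.outer(x, x)
                z[a] += r * x
            else:                              # Steps 10-12: pseudo-reward
                p = min(w[a] @ x + beta * np.sqrt(x @ Qinv[a] @ x), 1.0)
                Vh[a] = eta**2 * Vh[a] + np.outer(x, x)
                zh[a] = eta**2 * zh[a] + p * x
            # Steps 14-15: refit w_{t+1,a} from the combined statistics.
            Qinv[a] = np.linalg.inv(lam * np.eye(d) + V[a] + Vh[a])
            w[a] = Qinv[a] @ (z[a] + zh[a])
    return total, w
```

Unlike LinUCB, every action's weight vector moves in every iteration, which is what yields the faster reward-gathering in the early iterations reported in Section 4.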

We will compare LinPRUCB to LinUCB empirically in Section 4. Before doing so, we first establish the theoretical guarantee of LinPRUCB and compare it to that of LinUCB.

The key results are summarized and discussed as follows. Under the assumption that the received rewards are independent from each other, Lemma 2 below shows that LinPRUCB can achieve a regret of (1 + α + ρ) Σ_{t∈[T]} ĉ_{t,a_t} + Õ(√T) with probability 1 − δ, where ρ = O(β). LinUCB under the same condition (Chu et al., 2011), on the other hand, achieves a regret of (1 + α) Σ_{t∈[T]} c_{t,a_t} + Õ(√T) with probability 1 − δ.

One difference between the two bounds is that the bound of LinPRUCB contains an additional term of ρ, which is of O(β) and comes from the pseudo-rewards. By choosing β = O(α), however, the term can always be made comparable to α.

The other difference is that the bound of LinPRUCB contains ĉ_{t,a_t} terms instead of c_{t,a_t}. Recall that ĉ_{t,a} = √(x_t^T Q̂_{t-1,a}^{-1} x_t), with Q̂_{t-1,a} = λ I_d + X_{t-1,a}^T X_{t-1,a} + X̂_{t-1,a}^T X̂_{t-1,a}, while c_{t,a} = √(x_t^T Q_{t-1,a}^{-1} x_t), with Q_{t-1,a} = λ I_d + X_{t-1,a}^T X_{t-1,a}. Since Q̂_{t-1,a} contains the additional term X̂_{t-1,a}^T X̂_{t-1,a} when compared with Q_{t-1,a}, the uncertainty terms ĉ_{t,a} are likely smaller than c_{t,a}, especially in the early iterations. Figure 2 shows one typical case that compares c_{t,a_t} of LinUCB and ĉ_{t,a_t} of LinPRUCB with one artificial data set used in Section 4. We see that ĉ_{t,a_t} are indeed generally smaller than c_{t,a_t}. The empirical observation in Figure 2 and the theoretical bound in Lemma 2 explain how LinPRUCB could perform better than LinUCB through playing with the bias-variance trade-off: decreasing the variance (expressed by ĉ_{t,a_t}) while slightly increasing the bias (expressed by ρ).


[Figure 2: Comparison of the values on the uncertainty terms.]

The regret bounds discussed above rely on the assumption of independent rewards, which is unlikely to hold in general. To deal with this issue, we follow the approach of Chu et al. (2011) and discuss below how LinPRUCB can be modified into a variant algorithm for which a small regret bound can actually be guaranteed without any assumptions. The key technique is to partition the iterations into several parts such that the received rewards in the same part are in fact independent from each other (Auer, 2002). More precisely, Chu et al. (2011) construct a master algorithm named SupLinUCB, which performs the partitioning and calls a subroutine BaseLinUCB (modified from LinUCB) on each part separately. Similarly, we construct another master algorithm SupLinPRUCB, which is similar to SupLinUCB but calls BaseLinPRUCB (modified from LinPRUCB) instead. The detailed descriptions are listed in Appendix A.¹

Then, the big picture of the analysis follows closely that of LinUCB (Auer, 2002; Chu et al., 2011). First, as shown in Appendix A, SupLinPRUCB in iteration t partitions the first t − 1 iterations into S parts: Ψ_t^1, . . . , Ψ_t^S. Using the same proof as that for Lemma 14 by Auer (2002), we can first establish the following lemma.

Lemma 1 For any s ∈ [S], any t ∈ [T], and any fixed sequence of contexts x_τ with τ ∈ Ψ_t^s, the corresponding rewards r_{τ,a_τ} are independent random variables with E[r_{τ,a_τ}] = x_τ^T u_{a_τ}.

Then, we can prove that BaseLinPRUCB, when given such an independence guarantee, can provide a good estimate of the true reward.

Lemma 2 Suppose the input index set Ψ_t ⊆ [t − 1] given to BaseLinPRUCB has the property that for fixed contexts x_τ, for τ ∈ Ψ_t, the corresponding rewards r_{τ,a_τ} are independent random variables with means x_τ^T u_{a_τ}. Then, for some α = O(√(ln(TK/δ))) and ρ = (2 + β)/√(1 − η), we have with probability 1 − δ/T that |x_t^T w_{t,a} − x_t^T u_a| ≤ (1 + α + ρ) ĉ_{t,a} for every a ∈ [K].

Proof For notational convenience, we drop all the subscripts involving t and a below. By definition,

x^T w − x^T u = x^T Q̂^{-1}(X^T r + X̂^T p̂) − x^T Q̂^{-1}(λ I_d + X^T X + X̂^T X̂) u
             = x^T Q̂^{-1} X^T (r − X u) − λ x^T Q̂^{-1} u + x^T Q̂^{-1} X̂^T (p̂ − X̂ u),

so that

|x^T w − x^T u| ≤ |x^T Q̂^{-1} X^T (r − X u)| + λ |x^T Q̂^{-1} u| + |x^T Q̂^{-1} X̂^T (p̂ − X̂ u)|.    (7)

1. The full version that includes the appendix can be found at



The first two terms in (7) together can be bounded by (1 + α)ĉ with probability 1 − δ/T using arguments similar to those of Chu et al. (2011). The third term arises from our use of pseudo-rewards, which makes our analysis different from that of BaseLinUCB. This is where our forgetting mechanism comes to help. By the Cauchy-Schwarz inequality, this term is at most ‖x^T Q̂^{-1} X̂^T‖ ‖p̂ − X̂ u‖, where one can show that ‖x^T Q̂^{-1} X̂^T‖ ≤ ĉ using a similar argument as in Lemma 1 of Chu et al. (2011). Since the i-th entry of the vector p̂ − X̂ u is by definition at most (2 + β) η^{ℓ−i}, we have

‖p̂ − X̂ u‖ ≤ (2 + β) √( Σ_i η^{2(ℓ−i)} ) ≤ (2 + β)/√(1 − η) = ρ.

Combining these bounds together, we have the lemma.

Lemma 2 is the key distinction in our analysis, which justifies our use of pseudo-rewards and the forgetting mechanism. With Lemma 1 and Lemma 2, the rest of the regret analysis is almost identical to that of LinUCB (Auer, 2002; Chu et al., 2011), and the analysis leads to the following main theoretical result. The details of the proof are listed in Appendix A.

Theorem 3 For some α = O(√(ln(TK/δ))), for any β = O(α) and any constant η ∈ [0, 1), SupLinPRUCB with probability 1 − δ achieves a regret of O(√(dKT ln³(KT/δ))).


4. Experiments

Next, we conduct empirical studies on both artificial and real-world data sets to justify the idea of using pseudo-rewards and to compare the performance between LinPRUCB and the state-of-the-art LinUCB algorithm.

Artificial Data Sets. First, we simulate with artificial data sets as follows. Unit vectors u_1, . . . , u_K for the K actions are drawn uniformly from R^d. In iteration t out of the T iterations, a context x_t is first sampled from a uniform distribution within ‖x‖ ≤ 1. Then, the reward r_{t,a_t} = u_{a_t}^T x_t + v_t is generated, where v_t ∈ [−0.05, 0.05] is random white noise. For the simulation, we consider parameters T = 1000, K ∈ {10, 30}, and d ∈ {10, 30}.
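The data-generating process for the simulation might look like the sketch below. Details such as sampling directions on the sphere and scaling radii by r^(1/d) for a uniform ball distribution are our reading of the setup, and for convenience the noise is drawn per (t, a) pair rather than once per iteration.

```python
import numpy as np

def make_artificial(T=1000, K=10, d=10, noise=0.05, seed=0):
    """Artificial contextual-bandit data: hidden unit vectors u_a, contexts
    with ||x|| <= 1, and rewards r_{t,a} = u_a^T x_t + white noise."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(K, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit vectors u_1..u_K
    # Contexts uniform in the unit ball: random direction times radius^(1/d).
    dirs = rng.normal(size=(T, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = rng.uniform(size=(T, 1)) ** (1.0 / d)
    X = dirs * radii
    V = rng.uniform(-noise, noise, size=(T, K))     # v in [-0.05, 0.05]
    R = X @ U.T + V                                 # r_{t,a} = u_a^T x_t + v
    return X, R, U
```

An algorithm is then run on (X, R) with only the selected entry of each reward row revealed per iteration.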

We plot the average cumulative reward (vertical axis) versus the normalized number of iterations (horizontal axis) to illustrate the key ideas of the proposed LinPRUCB algorithm and to compare it with LinUCB. Each curve is the average of 10 runs of simulations. The results are presented in Figure 3. For each (K, d) combination, we use pink-circle marks to represent the curve of LinUCB with full-information feedback, and red-diamond marks to represent the curve of LinUCB with bandit information. Similar to the observations in the motivating Figure 1, the full-information LinUCB performs better than the bandit-information one. Another curve we include is colored yellow, which represents LinGreedy. The three curves serve as references to evaluate the performance of LinPRUCB.

To justify the idea of using over-estimated rewards as pseudo-rewards, we first set β in (3) to be 0, fix the forgetting parameter η = 1, and use grid search for the optimal α ∈ {0, 0.2, . . . , 1.2} based on the final cumulative reward after T iterations. This setting of (α, β, η) represents a LinPRUCB variant that estimates the pseudo-reward without the uncertainty term ĉ_{t,a} and never forgets the previous estimates. We use green-cross marks to represent the curve and call it LinPRUCB(α, 0, 1). Note that we allow LinUCB the same flexibility in choosing α with grid search.

[Figure 3: Comparison of average cumulative reward for T = 1000, with panels (a) d = 10, K = 10; (b) d = 10, K = 30; (c) d = 30, K = 10; (d) d = 30, K = 30. Each panel plots the average cumulative reward against t/T for Full, LinUCB, Greedy, LinPRUCB(β=0, η=1), LinPRUCB(η=1), and LinPRUCB.]

Then, we fix η = 1 and use grid search for the optimal (α, β) within α ∈ {0, 0.2, . . . , 1.2} and β ∈ {0, 0.2, . . . , 1.2}. This setting of (α, β, η) represents a LinPRUCB variant that estimates the pseudo-reward with the uncertainty term in (3) but never forgets the previous estimates. We use light-blue-star marks to represent the curve and call it LinPRUCB(α, β, 1).

Finally, we study the full power of LinPRUCB that not only estimates the pseudo-reward according to (3) but also utilizes the forgetting mechanism. This is done by a grid search on α, β, and η, where η ∈ [0.8, 0.95] with a grid step of 0.05. We use blue-square marks to represent the curve.

From the results plotted in Figure 3, for each given (K, d), LinPRUCB(α, 0, 1) and LinPRUCB(α, β, 1) do not perform well. They are often worse than LinUCB in the end, and are even worse than LinGreedy. The results justify the importance of both the over-estimation with the uncertainty terms ĉ_{t,a} and the forgetting mechanism that we propose.

Next, we compare the actual LinPRUCB with LinUCB in Figure 3. When focusing on the early iterations, LinPRUCB often gathers more rewards than LinUCB for each (K, d). Interestingly, similar phenomena also occur in the curves of LinPRUCB(α, 0, 1) and LinPRUCB(α, β, 1). These results echo our discussions in Section 3 that it can be beneficial to trade some bias for a smaller variance to improve LinUCB. After T iterations, the performance of LinPRUCB is better than or comparable to that of LinUCB, which echoes our theoretical result in Section 3 of the matching regret bound.

The observations above support the key ideas in designing LinPRUCB. By allowing the algorithm to update every w_{t+1,a} instead of only w_{t+1,a_t}, LinPRUCB achieves faster reward-gathering in the earlier iterations, and the over-estimates in the pseudo-rewards and the forgetting mechanism are both essential in maintaining the promising performance of LinPRUCB.
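As a rough illustration of these key ideas, one round of per-action updates could be sketched as below. The dictionary layout, parameter values, and function name are our own illustrative choices rather than the paper's code; the pseudo-reward is clipped at the maximum reward 1 in the spirit of (3), and old pseudo-reward statistics are discounted by η.

```python
import numpy as np

def linprucb_update(model, x, is_chosen, reward=None, beta=0.5, eta=0.9, lam=1.0):
    """One action's model update in a LinPRUCB-style scheme (sketch).

    `model` keeps V, z (true-reward statistics) and Vhat, zhat (pseudo-reward
    statistics, discounted by eta so old over-estimates fade away).
    The implied weight vector is w = (lam*I + V + Vhat)^{-1} (z + zhat).
    """
    d = len(x)
    Q = lam * np.eye(d) + model["V"] + model["Vhat"]
    w = np.linalg.solve(Q, model["z"] + model["zhat"])
    if is_chosen:
        # Selected action: accumulate the true reward as usual.
        model["V"] += np.outer(x, x)
        model["z"] += reward * x
    else:
        # Non-selected action: feed an over-estimated pseudo-reward,
        # clipped at the maximum reward 1.
        bonus = beta * np.sqrt(x @ np.linalg.solve(Q, x))
        pseudo = min(float(w @ x) + bonus, 1.0)
        model["Vhat"] = eta * model["Vhat"] + np.outer(x, x)
        model["zhat"] = eta * model["zhat"] + pseudo * x
    return model
```

Running this update for every action each round is what lets every w_{t+1,a} move, at the cost of extra updating time.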

Real-world Benchmark Data Sets. Next, we extend our empirical study to two real-world benchmark data sets, R6A and R6B, released by Yahoo!. To the best of our knowledge, the two data sets are the only public data sets for the contextual bandit problem. They first appeared in the ICML 2012 Workshop on New Challenges for Exploration and Exploitation. The challenge is a practical problem that requires the algorithm to display news articles on a website strategically. The algorithm has to display a news article (action a_t) during the visit (iteration t) of a user (context x_t), and the user can choose to click (reward r_{t,a_t}) on the news to read more information or simply ignore it. The LinUCB algorithm and its variants are among the leading algorithms in the challenge, and there is a 36-hour run time limitation for all algorithms. This roughly translates to less than 5 ms for selecting an action, and 50 ms for updating the weight vectors w_{t+1,a}.

In R6A, each context x_t is 6-dimensional and each dimension contains a value within [0, 1]. In R6B, each x_t is 136-dimensional with values within {0, 1} and the first dimension being a constant 1. Rather than selecting an action from a fixed set [K], the actions in the data sets are dynamic, since new news articles may emerge over time and old articles may be dropped. The dynamic nature does not affect the LinPRUCB and LinUCB algorithms, which maintain one w_{t,a} per action regardless of the size of the action set. For each data set, the reward r_{t,a_t} equals 1 if the user clicked on the article, and equals 0 otherwise. Then, the click-through rate (CTR) of the algorithm, defined as the ratio of total clicks to the total evaluated iterations, corresponds to the average cumulative reward. In order to achieve an unbiased evaluation when using an offline data set such as R6A or R6B, we use the technique described by Li et al. (2011).
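The replay technique of Li et al. (2011) can be sketched as below, assuming the log was collected by a uniformly random logging policy; the `policy.select` and `policy.update` interface names are hypothetical, chosen only for this illustration.

```python
def replay_evaluate(policy, logged_events):
    """Unbiased offline evaluation by rejection, in the spirit of
    Li et al. (2011).  `logged_events` is a sequence of
    (context, shown_action, reward) triples; an event counts only when
    the evaluated policy picks the same action that was actually shown.
    """
    clicks, matches = 0, 0
    for context, shown_action, reward in logged_events:
        if policy.select(context) == shown_action:
            matches += 1
            clicks += reward
            # Only matched events update the policy, keeping the
            # evaluation consistent with an online run.
            policy.update(context, shown_action, reward)
    return clicks / max(matches, 1)  # CTR over the evaluated iterations
```

With a uniformly random logger, roughly one in K logged events matches, so long logs are needed to evaluate many effective iterations.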

We split each data set into two sets, one for parameter tuning and the other for testing. The tuning set consists of 50,000 visits, and the testing set consists of the remaining 4,000,000 visits. We run grid search on the tuning set for every algorithm that comes with tunable parameters. Then, the parameter combination that achieves the best tuning CTR is coupled with the corresponding algorithm to evaluate the CTR on the testing set.

The experimental results for R6A and R6B are summarized in Table 1 and Table 2. The tables demonstrate that LinPRUCB is competitive to LinUCB, with a 10% test CTR improvement in R6A and comparable CTR performance in R6B. The results justify LinPRUCB as a promising alternative to LinUCB in practical applications.

Interestingly and somewhat surprisingly, for R6B, the best performing algorithm during


Algorithm CTR (tuning) CTR (testing) selection time (seconds) updating time (seconds)

LinPRUCB 0.038 0.070 9.1 3.6

LinUCB 0.039 0.063 9.4 0.7

LinGreedy 0.028 0.037 3.8 0.4

Random 0.024 0.030 0.7 0.0

Table 1: Experimental results on data set R6A. (d = 6, K = 30)

Algorithm CTR (tuning) CTR (testing) selection time (seconds) updating time (seconds)

LinPRUCB 0.046 0.054 1082 5003

LinUCB 0.046 0.053 1166 248

LinGreedy 0.039 0.059 25 201

Random 0.032 0.034 1 0

Table 2: Experimental results on data set R6B. (d = 136, K = 30)

testing is neither LinUCB nor LinPRUCB, but LinGreedy. This is possibly because the data set has a larger d, and thus exploration is a rather challenging task.

We also list the selection time and the updating time for each algorithm on data sets R6A and R6B in Table 1 and Table 2, respectively. The updating time is the time that an algorithm needs to update its models w_{t+1,a} after the reward is revealed. The selection time is the time that an algorithm needs to return an action a_t, which links directly to the loading speed of a web page in this competition. On both data sets, we aggregate the selection time and the updating time over 50,000 iterations and report them in seconds.

The results in Table 1 show that the selection times of LinPRUCB and LinUCB are similar, but LinPRUCB needs more time for updating. The situation becomes more prominent in Table 2 for the larger dimension d. This is because in one iteration, LinUCB only has to update one model w_{t,a}, but LinPRUCB needs to update one model for each action.

We also note that the selection times for LinPRUCB and LinUCB are larger than the updating times. This is somewhat impractical, since in real-world applications a system usually needs to react to users within a short time, while updating on the server can take longer. To meet the real-world requirements, we present a practical variant of the proposed LinPRUCB algorithm in the next section. The variant conducts faster action selection while enjoying similar performance to the original LinPRUCB in some cases.

5. A Practical Variant of LinPRUCB

The practical variant is derived by replacing the action selection step (6) of LinPRUCB with a greedy one: a_t = argmax_{a∈[K]} w_{t,a}^T x_t. That is, we simply set α = 0 for LinPRUCB to drop the calculation of the uncertainty terms ĉ_{t,a}. Then, similar to how LinUCB becomes LinGreedy when α = 0, LinPRUCB becomes the Linear Pseudo-reward Greedy (LinPRGreedy) algorithm.
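The greedy selection step itself reduces to K inner products; a minimal sketch follows, where the (K, d) array layout is an assumption of this illustration rather than the paper's implementation.

```python
import numpy as np

def select_action_greedy(weights, x):
    """LinPRGreedy-style action selection: pick argmax_a w_a^T x_t.

    `weights` is a (K, d) array of per-action weight vectors.  Dropping the
    O(d^2) uncertainty term leaves only K inner products, i.e. O(K d) work
    per selection, which is what makes the selection step fast.
    """
    return int(np.argmax(weights @ x))
```

Exploration is not lost entirely: under this variant it is carried by the optimistic pseudo-rewards fed in during model updating.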

Recall our discussion in Section 2 that LinGreedy is faster in action selection than LinUCB. The main reason is the calculation of the uncertainty terms c_{t,a}, which takes O(d^2) time even when the inverses of all Q_{t−1,a} have been cached during model updating. Nevertheless, LinUCB heavily relies on the uncertainty terms for selecting proper actions, and cannot afford to drop the terms.

Simulation       LinPRUCB         LinUCB           LinPRGreedy      LinGreedy
d = 10, K = 10   0.460 ± 0.010    0.461 ± 0.017    0.454 ± 0.033    0.150 ± 0.016
d = 10, K = 30   0.558 ± 0.005    0.563 ± 0.007    0.504 ± 0.011    0.155 ± 0.006
d = 30, K = 10   0.270 ± 0.008    0.268 ± 0.008    0.271 ± 0.016    0.074 ± 0.010
d = 30, K = 30   0.297 ± 0.003    0.297 ± 0.005    0.255 ± 0.014    0.091 ± 0.009

Table 3: Comparisons of average cumulative reward on artificial data sets.

Algorithm     CTR (tuning)  CTR (testing)  selection time (seconds)  updating time (seconds)
LinPRUCB      0.038         0.070          9.1                       3.6
LinUCB        0.039         0.063          9.4                       0.7
LinPRGreedy   0.036         0.068          4.8                       3.4
LinGreedy     0.028         0.037          3.8                       0.4

Table 4: Experimental results on data set R6A.

Our proposed LinPRUCB, however, may not suffer that much from dropping the terms. In particular, because of our reuse of the upper confidence bound in (3), the term β√(x_t^T Q̂_{t,a}^{-1} x_t) in (3) can readily carry out the job of exploration, and thus αĉ_{t,a} in (6) may not be heavily needed. Thus, LinPRUCB provides the flexibility to move the computation load from the action selection step to the model updating step, which matches the needs of practical applications.

The experimental results on artificial data are summarized in Table 3, where the artificial data sets are generated in the same way as in Figure 3. We can see that although LinPRUCB and LinUCB reach higher cumulative rewards, the performance of LinPRGreedy is close to that of LinPRUCB and LinUCB when K = 10. The results justify our argument that LinPRUCB may not heavily rely on αĉ_{t,a} in (6) in some cases, and thus dropping the terms does not affect the performance much. There is a larger gap between LinPRGreedy and LinUCB when K = 30, though.

The experimental results for R6A and R6B are summarized in Table 4 and Table 5.

The tables are the extensions of Table 1 and Table 2, respectively. Once again, we see that the performance of LinPRGreedy is competitive with LinPRUCB and LinUCB.

Furthermore, note that the major advantage of LinPRGreedy is its selection time. In each table, the selection time of LinPRGreedy is much smaller than those of LinPRUCB and LinUCB, and the gap grows larger for larger dimension d. The small selection time and competitive performance make LinPRGreedy a favorable choice for practical applications.


Algorithm CTR (tuning) CTR (testing) selection time (seconds) updating time (seconds)

LinPRUCB 0.046 0.054 1082 5003

LinUCB 0.046 0.053 1166 248

LinPRGreedy 0.045 0.056 24 4899

LinGreedy 0.039 0.059 25 201

Table 5: Experimental results on R6B.

6. Conclusion

We propose a novel contextual bandit algorithm, LinPRUCB, which combines the ideas of using pseudo-rewards that over-estimate the true rewards, applying a forgetting mechanism to earlier pseudo-rewards, and taking the upper confidence bound in action selection. We prove a regret bound of LinPRUCB matching that of the state-of-the-art algorithm LinUCB, and discuss how the bound echoes the promising performance of LinPRUCB in gathering rewards faster in the earlier iterations. Empirical results on artificial and real-world data sets demonstrate that LinPRUCB compares favorably to LinUCB in the early iterations, and is competitive to LinUCB in the long run. Furthermore, we design a variant of LinPRUCB called LinPRGreedy, which shares similar performance benefits with LinPRUCB while enjoying much faster action selection.


Acknowledgments

The paper originates from parts of the first author's M.S. thesis (Chou, 2012) and the second author's Ph.D. thesis (Chiang, 2014). The authors thank the anonymous reviewers and the oral committee members of the two leading authors for their valuable comments. The completion of the paper was partially supported by the government of Taiwan via NSC 101-2628-E-002-029-MY2 and 100-2221-E-001-008-MY3.


References

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.

Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. Boosting with online binary learners for the multiclass bandit problem. In ICML, 2014.

Chao-Kai Chiang. Toward realistic online learning. PhD thesis, National Taiwan University, 2014.

Ku-Chun Chou. Pseudo-reward algorithms for linear contextual bandit problems. Master’s thesis, National Taiwan University, 2012.


Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, 2011.

Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient bandit algorithms for online multiclass prediction. In ICML, 2008.

T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, 2011.

Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, and Vanja Josifovski. Bandits for taxonomies: A model-based approach. In SDM, 2007.

Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. ISBN 0262193981.

Chih-Chun Wang, Sanjeev R. Kulkarni, and H. Vincent Poor. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50(3):338–355, 2005.


Appendix A. Proof of Theorem 3

The descriptions of BaseLinPRUCB and SupLinPRUCB are given in Algorithm 2 and Algorithm 3, respectively. To bound the regret of our SupLinPRUCB, we follow the analysis in (Auer, 2002; Chu et al., 2011) and replace several critical steps that are related to our algorithm.

Algorithm 2 BaseLinPRUCB

1: Inputs: Ψ_t ⊆ [t − 1]; α, β, η, λ > 0
2: Initialize w_a := 0_d, Q̂_a := λI_d, V_a := V̂_a := 0_{d×d}, ẑ_a := z_a := 0_d, for every a ∈ [K]
3: Observe x_t
4: for a ∈ [K] do
5:   for τ ∈ Ψ_t do
6:     if a_τ = a then
7:       V_a := V_a + x_τ x_τ^T
8:       z_a := z_a + x_τ r_{τ,a}
9:     else
10:      p_{τ,a} := min( w_a^T x_τ + β √(x_τ^T Q̂_a^{-1} x_τ), 1 )
11:      V̂_a := η V̂_a + x_τ x_τ^T
12:      ẑ_a := η ẑ_a + x_τ p_{τ,a}
13:    end if
14:    Q̂_a := λI_d + V_a + V̂_a
15:    w_a := Q̂_a^{-1} (z_a + ẑ_a)
16:  end for
17:  width_{t,a} := (1 + α + ρ) √(x_t^T Q̂_a^{-1} x_t)
18:  ucb_{t,a} := w_a^T x_t + width_{t,a}
19: end for
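For concreteness, Steps 17–18 of BaseLinPRUCB can be sketched numerically as below; the function and argument names are ours, and the default α and ρ values are placeholders rather than the values dictated by the analysis.

```python
import numpy as np

def base_linprucb_scores(x, Qhat_inv, w, alpha=0.5, rho=0.5):
    """Steps 17-18 of BaseLinPRUCB (sketch): `width` is the uncertainty
    radius and `ucb` is the optimistic score later consumed by
    SupLinPRUCB.  `Qhat_inv` is the cached inverse of Qhat_a."""
    width = (1 + alpha + rho) * np.sqrt(x @ Qhat_inv @ x)
    ucb = float(w @ x) + width
    return width, ucb
```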

Let Ψ^(0) = [T] \ ∪_{s∈[S]} Ψ^(s)_{T+1}, which contains those indices not in any of the Ψ^(s)_{T+1}. Let Ψ^(s)_{t,a} denote the subset of Ψ^(s)_t that collects all τ ∈ Ψ^(s)_t such that a_τ = a; thus, Ψ^(s)_t = ∪_{a∈[K]} Ψ^(s)_{t,a}. By definition, the expected regret is Σ_{t∈[T]} (E[r_{t,a_t^*}] − E[r_{t,a_t}]), which can be regrouped as

Σ_{t∈Ψ^(0)} (E[r_{t,a_t^*}] − E[r_{t,a_t}]) + Σ_{s∈[S]} Σ_{a∈[K]} Σ_{t∈Ψ^(s)_{T+1,a}} (E[r_{t,a_t^*}] − E[r_{t,a}]).   (8)

Using Lemma 1 and Lemma 2 stated in Section 3, we can establish a lemma similar to Lemma 15 by Auer (2002), which implies that with probability 1 − δS, the bound in (8) is at most

2√T + Σ_{s∈[S]} Σ_{a∈[K]} 2^{3−s} |Ψ^(s)_{T+1,a}|.   (9)

To bound the second term in (9), we need the following.


Algorithm 3 SupLinPRUCB

1: Initialize S := ln T and Ψ^(s)_1 := ∅ for every s ∈ [S]
2: for t = 1 to T do
3:   s := 1 and Â^(1) := [K]
4:   repeat
5:     Call BaseLinPRUCB with input Ψ^(s)_t to compute width^(s)_{t,a} and ucb^(s)_{t,a} for every a ∈ Â^(s).
6:     if width^(s)_{t,a} > 2^{−s} for some a ∈ Â^(s) then
7:       Select action a_t := a and update: Ψ^(s)_{t+1} := Ψ^(s)_t ∪ {t}, and Ψ^(s')_{t+1} := Ψ^(s')_t for every s' ≠ s.
8:     else if width^(s)_{t,a} ≤ 1/√T for every a ∈ Â^(s) then
9:       Select action a_t := argmax_{a∈Â^(s)} ucb^(s)_{t,a}, and let Ψ^(s)_{t+1} := Ψ^(s)_t for every s ∈ [S].
10:    else
11:      Let Â^(s+1) := {a ∈ Â^(s) | ucb^(s)_{t,a} ≥ u^(s) − 2^{1−s}}, where u^(s) = max_{a'∈Â^(s)} ucb^(s)_{t,a'}. Let s := s + 1.
12:    end if
13:  until an action a_t is selected
14: end for

Lemma 4 For any s ∈ [S] and a ∈ [K],

2^{−s} |Ψ^(s)_{T+1,a}| ≤ 5 (1 + α + ρ) √( d |Ψ^(s)_{T+1,a}| ln |Ψ^(s)_{T+1,a}| ).

Proof From Steps 6 and 7 of SupLinPRUCB, we know that for any s and a, width^(s)_{t,a} > 2^{−s} for any t ∈ Ψ^(s)_{T+1,a}, and therefore

2^{−s} |Ψ^(s)_{T+1,a}| ≤ Σ_{t∈Ψ^(s)_{T+1,a}} width^(s)_{t,a}.

From Step 17 of BaseLinPRUCB, we know that width^(s)_{t,a} = (1 + α + ρ) √(x_t^T Q̂_a^{-1} x_t) for some matrix Q̂_a = I_d + V_a + V̂_a. The matrices Q̂_a, V_a, and V̂_a actually depend on Ψ^(s)_{t,a}, and thus let us denote them more appropriately by Q̂^(s)_{t,a}, V^(s)_{t,a}, and V̂^(s)_{t,a}, respectively, where V^(s)_{t,a} is the sum of x_τ x_τ^T over τ ∈ Ψ^(s)_{t,a}, and V̂^(s)_{t,a} is the sum of x_τ x_τ^T scaled by some power of η, over τ ∈ Ψ^(s)_t \ Ψ^(s)_{t,a}. Using the same proof as for Lemma 3 in (Chu et al., 2011), one can show that for the matrices Q^(s)_{t,a} = I_d + V^(s)_{t,a},

Σ_{t∈Ψ^(s)_{T+1,a}} √( x_t^T (Q^(s)_{t,a})^{-1} x_t ) ≤ 5 √( d |Ψ^(s)_{T+1,a}| ln |Ψ^(s)_{T+1,a}| ).

Note that Q̂^(s)_{t,a} ⪰ Q^(s)_{t,a} and (Q̂^(s)_{t,a})^{-1} ⪯ (Q^(s)_{t,a})^{-1}, which implies that x^T (Q̂^(s)_{t,a})^{-1} x ≤ x^T (Q^(s)_{t,a})^{-1} x for any x ∈ R^d. By combining all the bounds together, we have the lemma.
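The inverse comparison used at the end of the proof is an instance of the standard Loewner-order fact for positive definite matrices (a well-known linear-algebra fact, not specific to this paper):

```latex
A \succeq B \succ 0
  \;\Longrightarrow\; B^{-1} \succeq A^{-1}
  \;\Longrightarrow\; x^{\top} A^{-1} x \;\le\; x^{\top} B^{-1} x
  \quad \text{for all } x \in \mathbb{R}^{d}.
```

It applies here with A = Q̂^(s)_{t,a} and B = Q^(s)_{t,a}, which is valid since V̂^(s)_{t,a} ⪰ 0.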

Using Lemma 4, the second term in (9) can be upper bounded by

40 Σ_{s∈[S]} Σ_{a∈[K]} (1 + α + ρ) √( d |Ψ^(s)_{T+1,a}| ln |Ψ^(s)_{T+1,a}| )
  ≤ 40 (1 + α + ρ) √(d ln T) Σ_{s∈[S]} Σ_{a∈[K]} √|Ψ^(s)_{T+1,a}|
  ≤ 40 (1 + α + ρ) √(d ln T) · √(SKT),

by Jensen's inequality. The rest of the proof is almost identical to that by Auer (2002). That is, by replacing δ with δ/(S + 1), substituting α = O(√(ln(TK/δ))) and ρ = O(β) = O(α), and then applying Azuma's inequality, we obtain our Theorem 3.



