## Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions

Ku-Chun Chou r99922095@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University

Chao-Kai Chiang chao-kai.chiang@unileoben.ac.at

Department of Mathematics and Information Technology, University of Leoben

Hsuan-Tien Lin htlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University

Chi-Jen Lu cjlu@iis.sinica.edu.tw

Institute of Information Science, Academia Sinica

Abstract

We study the contextual bandit problem with linear payoff functions, which is a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner needs to iteratively select an action based on an observed context, and receives a linear score on only the selected action as the reward feedback. Motivated by the observation that better performance is achievable if the other rewards on the non-selected actions can also be revealed to the learner, we propose a new framework that feeds the learner with pseudo-rewards, which are estimates of the rewards on the non-selected actions. We argue that the pseudo-rewards should better contain over-estimates of the true rewards, and propose a forgetting mechanism to decrease the negative influence of the over-estimation in the long run. Then, we couple the two key ideas above with the linear upper confidence bound (LinUCB) algorithm to design a novel algorithm called linear pseudo-reward upper confidence bound (LinPRUCB). We prove that LinPRUCB shares the same order of regret bound as LinUCB, while enjoying the practical observation of faster reward-gathering in the earlier iterations. Experiments on artificial and real-world data sets justify that LinPRUCB is competitive to and sometimes even better than LinUCB. Furthermore, we couple LinPRUCB with a special parameter to formalize a new algorithm that yields faster computation in updating the internal models while keeping the promising practical performance. The two properties match the real-world needs of the contextual bandit problem and make the new algorithm a favorable choice in practice.

Keywords: contextual bandit, pseudo-reward, upper confidence bound

1. Introduction

We study the contextual bandit problem (Wang et al., 2005), also known as the k-armed bandit problem with context (Auer et al., 2002a), in which a learner needs to interact iteratively with the external environment. In the contextual bandit problem, the learner has to make decisions in a number of iterations. During each iteration, the learner can first access certain information, called context, about the environment. After seeing the context, the learner is asked to strategically select an action from a set of candidate actions. The external environment then reveals the reward of the selected action to the learner as the feedback,

while hiding the rewards of all other actions. This type of feedback scheme is known as the bandit setting (Auer et al., 2002b). The learner then updates its internal models with the feedback to reach the goal of gathering the most cumulative reward over all the iterations.

The difference of cumulative reward between the best strategy and the learner’s strategy is called the regret, which is often used to measure the performance of the learner.

The contextual bandit problem can be viewed as a more realistic extension of the traditional bandit problem (Lai and Robbins, 1985), which does not provide any context information to the learner. The use of context allows expressing many interesting applications. For instance, consider an online advertising system operated by a contextual bandit learner.

For each user visit (iteration), the advertising algorithm (learner) is provided with the known properties about the user (context), and an advertisement (action) from a pool of relevant advertisements (set of candidate actions) is selected and displayed to that user.

The company’s earning (reward) can be calculated based on whether the user clicked the selected advertisement and the value of the advertisement itself. The reward can then be revealed to the algorithm as the feedback, and maximizing the cumulative reward directly connects to the company’s total earning. Many other web applications for advertisement and recommendation can be similarly modeled as a contextual bandit problem (Li et al., 2010, 2011; Pandey et al., 2007), and hence the problem has been attracting much research attention (Auer et al., 2002a,b; Auer, 2002; Kakade et al., 2008; Chu et al., 2011; Chen et al., 2014).

The bandit setting contrasts with the full-information setting in traditional online learning (Kakade et al., 2008; Chen et al., 2014), in which the reward of every action is revealed to the learner regardless of whether the action is selected. Intuitively, it is easier to design learners under the full-information setting, and such learners (shorthanded as full-information learners) usually perform better than their siblings that run under the bandit setting (to be shown in Section 2). The performance difference motivates us to study whether we can mimic the full-information setting to design better contextual bandit learners.

In particular, we propose the concept of pseudo-rewards, which embed estimates of the hidden rewards of non-selected actions under the bandit setting. Then, the revealed reward and the pseudo-rewards can be jointly fed to full-information learners to mimic the full-information setting. We study the possibility of feeding pseudo-rewards to linear full-information learners, which compute the estimates by linear ridge regression. In particular, we propose to use pseudo-rewards that over-estimate the hidden rewards. Such pseudo-rewards embed the intention of exploration, which means selecting the action that the learner is less certain about. The intention has been shown useful in many of the existing works on the contextual bandit problem (Auer et al., 2002b; Li et al., 2010). To neutralize the effect of over-estimating the hidden rewards routinely, we further propose a forgetting mechanism that decreases the influence of the pseudo-rewards received in the earlier iterations.

Combining the over-estimated pseudo-rewards, the forgetting mechanism, and the state-of-the-art Linear Upper Confidence Bound (LinUCB) algorithm (Li et al., 2010; Chu et al., 2011), we design a novel algorithm, Linear Pseudo-reward Upper Confidence Bound (LinPRUCB). We prove that LinPRUCB results in a matching regret bound to LinUCB. In addition, we demonstrate empirically with artificial and real-world data sets that LinPRUCB enjoys two additional properties. First, with the pseudo-rewards on hand, LinPRUCB is able to mimic the full-information setting more closely, which leads to faster reward-gathering in earlier iterations when compared with LinUCB. Second, with the over-estimated pseudo-rewards on hand, LinPRUCB is able to express the intention of exploration in the model updating step rather than the action selection step of the learner.

Based on the second property, we derive a variant of LinPRUCB, called LinPRGreedy, which performs action selection faster than LinPRUCB and LinUCB while maintaining competitive performance to those two algorithms. The two properties above make our proposed LinPRGreedy a favorable choice in real-world applications that usually demand fast reward-gathering and fast action-selection.

This paper is organized as follows. In Section 2, we give a formal setup of the problem and introduce related works. In Section 3, we describe the framework of using pseudo-rewards and derive LinPRUCB with its regret bound analysis. We present experimental simulations on artificial data sets and large-scale benchmark data sets in Section 4. Finally, we introduce the practical variant LinPRGreedy in Section 5, and conclude in Section 6.

2. Preliminaries

We use a bold lower-case symbol like w to denote a column vector and a bold upper-case symbol like Q to denote a matrix. Let [N] represent the set {1, 2, · · · , N}. The contextual bandit problem consists of T iterations of decision making. In each iteration t, a contextual bandit algorithm (learner) first observes a context x_t ∈ R^d from the environment, and is asked to select an action a_t from the action set [K]. We shall assume that the contexts are bounded by ‖x_t‖_2 ≤ 1. After selecting an action a_t, the algorithm receives the corresponding reward r_{t,a_t} ∈ [0, 1] from the environment, while the other rewards r_{t,a} for a ≠ a_t are hidden from the algorithm. The algorithm should strategically select the actions in order to maximize the cumulative reward ∑_{t=1}^{T} r_{t,a_t} after the T iterations.
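For concreteness, the interaction protocol above can be sketched as a small simulation loop. The environment and the uniformly random policy below are illustrative stand-ins of our own choosing, shown only to make the protocol concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 4, 200

# Hidden linear payoff parameters u_a with ||u_a||_2 <= 1 (unknown to the learner).
U = rng.normal(size=(K, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

def draw_context():
    x = rng.normal(size=d)
    return x / max(1.0, np.linalg.norm(x))   # enforce ||x_t||_2 <= 1

total = 0.0
for t in range(T):
    x = draw_context()
    a = int(rng.integers(K))                 # placeholder policy: pick uniformly
    r = float(U[a] @ x)                      # only r_{t,a_t} is revealed
    total += r                               # rewards of other actions stay hidden
```

A real contextual bandit algorithm replaces the uniform choice of `a` with a strategic one, which is the topic of the rest of this section.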

In this paper, we study the contextual bandit problem with linear payoff functions (Chu et al., 2011), where the reward is generated by the following process. For any context x_t and any action a ∈ [K], assume that the reward r_{t,a} is a random variable with conditional expectation E[r_{t,a} | x_t] = u_a^T x_t, where u_a ∈ R^d is an unknown vector with ‖u_a‖_2 ≤ 1. The linear relationship allows us to define a_t^* = argmax_{a∈[K]} u_a^T x_t, which is the optimal strategy of action selection in hindsight. Then, the goal of maximizing the cumulative reward can be translated to minimizing the regret of the algorithm, which is defined as regret(T) = ∑_{t=1}^{T} r_{t,a_t^*} − ∑_{t=1}^{T} r_{t,a_t}.

Next, we introduce a family of existing linear algorithms for the contextual bandit problem with linear payoff functions. The algorithms all maintain w_{t,a} as the current estimate of u_a. Then, w_{t,a}^T x_t represents the estimated reward of selecting action a upon seeing x_t. The most naïve algorithm of the family is called Linear Greedy (LinGreedy). During iteration t, with the estimates w_{t,a} on hand, LinGreedy simply selects an action a_t that maximizes the estimated reward. That is, a_t = argmax_{a∈[K]} w_{t,a}^T x_t. Then, LinGreedy computes the weight vectors w_{t+1,a} for the next iteration by ridge regression

w_{t+1,a} = (λI_d + X_{t,a}^T X_{t,a})^{−1} (X_{t,a}^T r_{t,a}),   (1)

where λ > 0 is a given parameter for ridge regression and I_d is a d × d identity matrix. We use (X_{t,a}, r_{t,a}) to store the observed contexts and rewards only when action a gets selected by the algorithm: X_{t,a} is a matrix that contains x_τ^T as rows, where 1 ≤ τ ≤ t and a_τ = a; r_{t,a} is a vector that contains the corresponding r_{τ,a} as components. That is, each x_τ is only stored in X_{t,a_τ}. Then, it is easy to see that w_{t+1,a} = w_{t,a} for a ≠ a_t. That is, only the weight vector w_{t+1,a} with a = a_t will be updated.
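The ridge-regression update (1) can be maintained incrementally by keeping a Gram matrix and a response vector per action. A minimal sketch (variable names are our own; the true vector `u` is only used to generate toy rewards):

```python
import numpy as np

d, lam = 5, 1.0
Q = lam * np.eye(d)      # accumulates lam*I_d + X^T X for one action
z = np.zeros(d)          # accumulates X^T r for the same action

def fold(Q, z, x, r):
    """Fold one observed (context, reward) pair into the per-action statistics."""
    return Q + np.outer(x, x), z + r * x

rng = np.random.default_rng(1)
u = np.array([0.5, -0.2, 0.1, 0.3, -0.4])   # hypothetical true u_a
for _ in range(200):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))        # keep ||x||_2 <= 1
    Q, z = fold(Q, z, x, float(u @ x))      # noiseless toy rewards

w = np.linalg.solve(Q, z)   # the solution w_{t+1,a} in (1), without forming Q^{-1}
```

With enough observations, `w` approaches `u` up to the shrinkage introduced by λ.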

Because only the reward r_{t,a_t} of the selected action a_t is revealed to the algorithm, it is known that LinGreedy suffers from its myopic decisions on action selection. In particular, LinGreedy only focuses on exploitation (selecting the actions that are seemingly more rewarding) but does not conduct exploration (trying the actions that the algorithm is less certain about). Thus, it is possible that LinGreedy would miss the truly rewarding actions in the long run. One major challenge in designing better contextual bandit algorithms is to strike a balance between exploitation and exploration (Auer, 2002).

For LinGreedy, one simple remedy for the challenge is to replace the a_t of LinGreedy by a randomly selected action with a non-zero probability ε in each iteration. This remedy is known as ε-greedy, which was developed from reinforcement learning (Sutton and Barto, 1998), and has been extended to some more sophisticated algorithms like epoch-greedy (Langford and Zhang, 2007) that controls ε dynamically. Both ε-greedy and epoch-greedy explore other potential actions with a non-zero probability, and thus generally perform better than LinGreedy in theory and in practice.
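The ε-greedy remedy can be sketched on top of the linear estimates as follows (a toy illustration with names of our own; ε = 0 recovers LinGreedy):

```python
import numpy as np

rng = np.random.default_rng(2)

def epsilon_greedy_action(W, x, eps):
    """With probability eps explore an action uniformly at random;
    otherwise pick argmax_a w_a^T x, which is LinGreedy's choice."""
    K = W.shape[0]
    if rng.random() < eps:
        return int(rng.integers(K))          # explore
    return int(np.argmax(W @ x))             # exploit

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # toy estimates w_{t,a} as rows
x = np.array([0.9, 0.1])
a = epsilon_greedy_action(W, x, eps=0.0)     # eps = 0: purely greedy, picks action 0
```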

Another possible approach of balancing exploitation and exploration is through the use of the upper confidence bound (Auer, 2002). Upper-confidence-bound algorithms cleverly select an action based on some mixture of the estimated reward and an uncertainty term.

The uncertainty term represents the amount of information that has been received for the candidate action and decreases as more information is gathered during the iterations.

The mixture allows the algorithms to select either an action with high estimated reward (exploitation) or with high uncertainty (exploration).

Linear Upper Confidence Bound (LinUCB) (Chu et al., 2011; Li et al., 2010) is a state-of-the-art representative within the family of upper-confidence-bound algorithms. In addition to calculating the estimated reward w_{t,a}^T x_t like LinGreedy, LinUCB computes the uncertainty term c_{t,a} = √(x_t^T Q_{t−1,a}^{−1} x_t), where Q_{t−1,a} = (λI_d + X_{t−1,a}^T X_{t−1,a}) is the matrix that gets inverted when computing w_{t,a} by (1). For any given α > 0, the upper confidence bound of action a can then be formed by w_{t,a}^T x_t + α c_{t,a}. LinUCB takes this bound for selecting the action to balance exploitation and exploration. That is,

a_t = argmax_{a∈[K]} ( w_{t,a}^T x_t + α c_{t,a} ).   (2)

Then, after receiving the reward r_{t,a_t}, similar to LinGreedy, LinUCB takes ridge regression (1) to update only the weight vector w_{t+1,a_t}.
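The selection rule (2) can be sketched as follows, computing the uncertainty term from the same per-action matrix that ridge regression (1) inverts (the toy numbers below are illustrative):

```python
import numpy as np

def linucb_select(Ws, Qs, x, alpha):
    """Pick argmax_a ( w_a^T x + alpha * sqrt(x^T Q_a^{-1} x) ), as in (2)."""
    scores = []
    for w, Q in zip(Ws, Qs):
        c = np.sqrt(x @ np.linalg.solve(Q, x))   # uncertainty term c_{t,a}
        scores.append(w @ x + alpha * c)
    return int(np.argmax(scores))

d = 3
# Action 0: well-observed (large Gram matrix); action 1: barely explored.
Qs = [np.eye(d) * 100.0, np.eye(d)]
Ws = [np.array([0.3, 0.0, 0.0]), np.zeros(d)]
x = np.array([1.0, 0.0, 0.0])

a_exploit = linucb_select(Ws, Qs, x, alpha=0.0)   # greedy: picks action 0
a_explore = linucb_select(Ws, Qs, x, alpha=1.0)   # bonus favors uncertain action 1
```

The example illustrates the balance: with α = 0 the rule exploits the higher estimate, while a non-zero α makes the large uncertainty of the barely-explored action dominate.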

Note that LinGreedy can then be viewed as an extreme case of LinUCB with α = 0. The extreme case of LinGreedy enjoys the computational benefit of only taking O(d) time for selecting an action, which is useful in practical applications that need fast action-selection. In contrast, general LinUCB with α > 0 requires O(d^2) time in computing the uncertainty term. Nevertheless, LinUCB enjoys a better theoretical guarantee and better practical performance. In particular, for proper choices of a non-zero α, LinUCB enjoys a rather low regret bound (Chu et al., 2011) and reaches state-of-the-art performance in practice (Li et al., 2010).

3. Linear Pseudo-reward Upper Confidence Bound

In LinUCB (as well as LinGreedy), the estimated weight vectors w_{t+1,a} are computed by (X_{t,a}, r_{t,a}): all the observed contexts and rewards only when action a gets selected by the algorithm. Because of the bandit setting, if an action a_t is selected in iteration t, the rewards r_{t,a} for a ≠ a_t are unknown to the learning algorithm. Since r_{t,a} are unknown, the context x_t is not used to update w_{t+1,a} for any a ≠ a_t, which appears to be a waste of the information in x_t. On the other hand, if we optimistically assume that all the rewards r_{t,a} are revealed to the learning algorithm, which corresponds to the full-information setting in online learning, ideally w_{t+1,a} for every action a can be updated with (x_t, r_{t,a}). Then, w_{t+1,a} shall converge to a decent estimate faster, and the resulting “full-information variant” of LinUCB shall achieve better performance.

We demonstrate the motivation above with the results on a simple artificial data set in Figure 1. More detailed comparisons with artificial data sets will be shown in Section 4. Each curve represents the average cumulative reward with different types of feedback information. The pink circles represent LinUCB that receives full information about the rewards; the red diamonds represent the original LinUCB that receives the bandit feedback. Unsurprisingly, Figure 1 suggests that the full-information LinUCB outperforms the bandit-information LinUCB by a big gap.

We include another curve colored black in Figure 1. The black curve represents the performance of the full-information LinUCB when the received rewards are perturbed by some random white noise within [−0.05, 0.05]. As a consequence, the black curve is worse than the (pink-circle) curve of the full-information LinUCB with noise-less rewards. Nevertheless, the black curve is still better than the (red-diamond) curve of the bandit-information LinUCB. That is, by updating w_{t+1,a} on all actions a (full-information), even with inaccurate rewards, it is possible to improve the original LinUCB that only updates w_{t+1,a_t} of the selected action a_t (bandit-information). The results motivate us to study how contextual bandit algorithms can be designed by mimicking the full-information setting.

Figure 1: a motivating example (average cumulative reward versus t/T for the curves Full, Full (Noisy), and Bandit)

Based on the motivation, we propose a novel framework of Linear Pseudo-reward (LinPR) algorithms. In addition to using (x_t, r_{t,a_t}) to update the weight vector w_{t+1,a_t}, LinPR algorithms compute some pseudo-rewards p_{t,a} for all a ≠ a_t, and take (x_t, p_{t,a}) to update w_{t+1,a} for a ≠ a_t as well. Three questions remain in designing an algorithm under the LinPR framework. How are p_{t,a} computed? How are (x_t, p_{t,a}) used to update w_{t+1,a}? How are w_{t+1,a} used in selecting the next action? We answer those questions below to design a concrete and promising algorithm.

Choosing p_{t,a}. Ideally, for the purpose of mimicking the full-information setting, we hope p_{t,a} to be an accurate estimate of the true reward r_{t,a}. If so, p_{t,a} serves as a free feedback without costing the algorithm any action on exploration. Of course, the hope cannot be achieved in general. We propose to use p_{t,a} to over-estimate the true reward instead. The over-estimates tilt w_{t,a} towards those actions a that the algorithm is less certain about when observing x_t. That is, the over-estimates guide the algorithm towards exploring the less certain actions during the model updating step, which fits the intent to strike a balance between exploitation and exploration (Auer, 2002).

Formally, we propose to borrow the upper confidence bound in LinUCB, which is an over-estimate of the true reward, for computing the pseudo-reward p_{t,a}. That is,

p_{t,a} = w_{t,a}^T x_t + β ˆc_{t,a}, with ˆc_{t,a} = √(x_t^T ˆQ_{t−1,a}^{−1} x_t),   (3)

where ˆQ_{t−1,a} is a matrix extended from Q_{t−1,a} and will be specified later in (5); β > 0 is a given parameter like α. Similar to the upper confidence bound in LinUCB, the first term w_{t,a}^T x_t computes the estimated reward and the second term β ˆc_{t,a} corresponds to some measure of uncertainty.
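Computing the pseudo-reward (3) for one non-selected action can be sketched as follows, assuming the extended matrix ˆQ_{t−1,a} is already on hand (the cap at 1, matching the reward range [0, 1], appears in Step 11 of Algorithm 1):

```python
import numpy as np

def pseudo_reward(w, Q_hat, x, beta):
    """p_{t,a} = w^T x + beta * sqrt(x^T Q_hat^{-1} x), capped at 1."""
    c_hat = np.sqrt(x @ np.linalg.solve(Q_hat, x))   # uncertainty term c-hat_{t,a}
    return min(float(w @ x) + beta * c_hat, 1.0)

d = 4
w = np.zeros(d)                 # no estimate yet for this action
Q_hat = np.eye(d)               # lam = 1, no observations folded in yet
x = np.array([1.0, 0.0, 0.0, 0.0])

p = pseudo_reward(w, Q_hat, x, beta=0.5)   # over-estimates the hidden reward
```

Here the estimated reward is 0, yet the pseudo-reward is 0.5 purely from uncertainty, which is exactly the over-estimation that embeds the intention to explore.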

With the definition of p_{t,a} in (3), we use the vector p_{t,a} to gather all the pseudo-rewards calculated for action a, and the matrix ¯X_{t,a} for the corresponding contexts. That is, ¯X_{t,a} is a matrix that contains x_τ^T as rows, where 1 ≤ τ ≤ t and a_τ ≠ a; p_{t,a} is a vector that contains the corresponding p_{τ,a} as components. Next, we discuss how the information within (¯X_{t,a}, p_{t,a}) can be used to properly update w_{t+1,a}.

Updating w_{t+1,a} with (x_t, p_{t,a}). Recall that (X_{t,a}, r_{t,a}) contains all the observed contexts and rewards only when action a gets selected by the algorithm, and (¯X_{t,a}, p_{t,a}) contains the contexts and pseudo-rewards only when action a is not selected by the algorithm. Then, a natural attempt is to directly include the pseudo-rewards as if they were the true rewards when updating the model. That is, w_{t+1,a} can be updated by solving the ridge regression problem of

w_{t+1,a} = argmin_{w∈R^d} ( λ‖w‖^2 + ‖X_{t,a}w − r_{t,a}‖^2 + ‖¯X_{t,a}w − p_{t,a}‖^2 ).   (4)

The caveat with the update formula in (4) is that the pseudo-rewards affect the calculation of w_{t+1,a} permanently, even though some of them, especially the earlier ones, may be rather inaccurate. The inaccurate ones may become irrelevant or even misleading in the latter iterations. Thus, intuitively we may not want those inaccurate pseudo-rewards to have the same influence on w_{t+1,a} as the more recent, and perhaps more accurate, pseudo-rewards.

Following the intuition, we propose the following forgetting mechanism, which puts emphases on more recent pseudo-rewards and their contexts than earlier ones. Formally, we take a forgetting parameter η ∈ [0, 1) to control how fast the algorithm forgets previous pseudo-rewards, where smaller η corresponds to faster forgetting. Note that every pair (x_t, p_{t,a}) corresponds to an error term of (w^T x_t − p_{t,a})^2 within the objective function of (4). When updating w_{t+1,a}, we treat the most recent pseudo-reward as if it were the true reward, and do not modify the associated term. For the second most recent pseudo-reward, however, we decrease its influence on w_{t+1,a} by multiplying its associated term with η^2. Similarly, if ℓ_{t,a} pseudo-rewards have been gathered when updating w_{t+1,a}, the term that corresponds to the earliest pseudo-reward would be multiplied by η^{2(ℓ_{t,a}−1)}, which decreases along ℓ_{t,a} rapidly when η < 1.

Equivalently, the forgetting mechanism can be performed as follows. Note that there are ℓ_{t,a} rows in ¯X_{t,a} and components in p_{t,a}. Let ˆX_{t,a} be an ℓ_{t,a} × d matrix with its i-th row being that of ¯X_{t,a} multiplied by the factor η^{ℓ_{t,a}−i}; similarly, let ˆp_{t,a} be an ℓ_{t,a}-dimensional vector with its i-th component being that of p_{t,a} multiplied by η^{ℓ_{t,a}−i}. We then update w_{t+1,a} by

w_{t+1,a} = argmin_{w∈R^d} ( λ‖w‖^2 + ‖X_{t,a}w − r_{t,a}‖^2 + ‖ˆX_{t,a}w − ˆp_{t,a}‖^2 ),

which yields the closed-form solution

w_{t+1,a} = ˆQ_{t,a}^{−1} ( X_{t,a}^T r_{t,a} + ˆX_{t,a}^T ˆp_{t,a} ), where ˆQ_{t,a} = λI_d + X_{t,a}^T X_{t,a} + ˆX_{t,a}^T ˆX_{t,a}.   (5)
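The closed-form solution (5) can be maintained incrementally: since every row of ˆX_{t,a} is rescaled by η each time a new pseudo-reward arrives, the accumulated ˆX^T ˆX and ˆX^T ˆp terms are multiplied by η^2 per new pseudo-reward. A sketch for a single action under these assumptions (variable names are our own):

```python
import numpy as np

d, lam, eta = 3, 1.0, 0.9

V = np.zeros((d, d));  z = np.zeros(d)          # real part: X^T X and X^T r
V_hat = np.zeros((d, d));  z_hat = np.zeros(d)  # pseudo part, forgetting-weighted

# One real observation for this action.
x1, r1 = np.array([1.0, 0.0, 0.0]), 0.8
V += np.outer(x1, x1);  z += r1 * x1

# One pseudo observation: earlier pseudo terms, had there been any,
# are first downweighted by eta**2 before the new one is added.
x2, p2 = np.array([0.0, 1.0, 0.0]), 0.5
V_hat = eta**2 * V_hat + np.outer(x2, x2)
z_hat = eta**2 * z_hat + p2 * x2

Q_hat = lam * np.eye(d) + V + V_hat             # Q-hat_{t,a} in (5)
w = np.linalg.solve(Q_hat, z + z_hat)           # w_{t+1,a} in (5)
```

In this tiny example Q_hat is diag(2, 2, 1), so w = (0.4, 0.25, 0): the real reward and the pseudo-reward each contribute along their own context direction.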
One remark for the update formula in (5) is that the use of β > 0 to over-estimate the true reward in p_{t,a} is crucial. If only p′_{t,a} = w_{t,a}^T x_t were used as the pseudo-reward, we would be feeding the model with its own output. Then, the weight vector w_{t,a} can readily make zero error on (x_t, p′_{t,a}), and thus provably w_{t+1,a} = w_{t,a} is not updated. When β > 0, however, the weight vectors w_{t,a} for a ≠ a_t effectively embed the intention for exploration when the algorithm is less certain about selecting action a for x_t.

Using w_{t,a} for Action-selection. Two natural choices for using w_{t,a} for action selection are either to decide greedily based on the estimated rewards w_{t,a}^T x_t, or to decide in an upper-confidence-bound manner by

a_t = argmax_{a∈[K]} ( w_{t,a}^T x_t + α ˆc_{t,a} ),   (6)

where ˆc_{t,a} = √(x_t^T ˆQ_{t−1,a}^{−1} x_t) is upgraded from the c_{t,a} used by LinUCB. We introduce the latter choice here, and will discuss the greedy choice in Section 5.

Combining the pseudo-rewards in (3), the model updating with the forgetting mechanism in (5), and the upper-confidence-bound decision in (6), we derive our proposed Linear Pseudo-reward Upper Confidence Bound (LinPRUCB) algorithm, as shown in Algorithm 1. Similar to LinUCB, our LinPRUCB selects an action based on an upper confidence bound controlled by the parameter α > 0. The main difference from LinUCB is the loop from Step 7 to Step 16 that updates the weight w_{t+1,a} for all actions a. For the selected action a_t, its corresponding context vector and actual reward are folded in at Step 9 as LinUCB does. For actions whose rewards are hidden, LinPRUCB computes the pseudo-rewards in Step 11, and applies the forgetting mechanism in Step 12. Finally, Step 15 computes the new weight vectors w_{t+1,a} shown in (5) for the next iteration.

Algorithm 1 Linear Pseudo-Reward Upper Confidence Bound (LinPRUCB)

1: Parameters: α, β, η, λ > 0
2: Initialize w_{1,a} := 0_d, ˆQ_{0,a} := λI_d, ˆV_{0,a} := V_{0,a} := 0_{d×d}, ˆz_{0,a} := z_{0,a} := 0_d, for every a ∈ [K]
3: for t := 1 to T do
4:   Observe x_t
5:   Select a_t := argmax_{a∈[K]} ( w_{t,a}^T x_t + α √(x_t^T ˆQ_{t−1,a}^{−1} x_t) )
6:   Receive reward r_{t,a_t}
7:   for a ∈ [K] do
8:     if a = a_t then
9:       V_{t,a} := V_{t−1,a} + x_t x_t^T, z_{t,a} := z_{t−1,a} + r_{t,a_t} x_t
10:    else
11:      p_{t,a} := min( w_{t,a}^T x_t + β √(x_t^T ˆQ_{t−1,a}^{−1} x_t), 1 )
12:      ˆV_{t,a} := η^2 ˆV_{t−1,a} + x_t x_t^T, ˆz_{t,a} := η^2 ˆz_{t−1,a} + p_{t,a} x_t
13:    end if
14:    ˆQ_{t,a} := λI_d + V_{t,a} + ˆV_{t,a}
15:    w_{t+1,a} := ˆQ_{t,a}^{−1} (z_{t,a} + ˆz_{t,a})
16:  end for
17: end for

(Quantities not explicitly updated in an iteration carry over unchanged, e.g., ˆV_{t,a_t} := ˆV_{t−1,a_t}.)
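Algorithm 1 can be sketched end-to-end on a toy environment as follows. The environment, seeds, and parameter values below are illustrative choices of our own; the η² factor implements the forgetting weights and min(·, 1) caps the pseudo-rewards:

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, T = 4, 3, 300
alpha, beta, eta, lam = 0.4, 0.4, 0.9, 1.0

# Hidden payoff vectors (unknown to the algorithm).
U = rng.normal(size=(K, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

W = np.zeros((K, d))                                   # w_{t,a}
V = np.zeros((K, d, d));  z = np.zeros((K, d))         # real-feedback statistics
V_hat = np.zeros((K, d, d));  z_hat = np.zeros((K, d)) # pseudo-feedback statistics
Q_hat = np.stack([lam * np.eye(d)] * K)                # Q-hat_{t-1,a}

total = 0.0
for t in range(T):
    x = rng.normal(size=d);  x /= max(1.0, np.linalg.norm(x))
    ucb = [W[a] @ x + alpha * np.sqrt(x @ np.linalg.solve(Q_hat[a], x))
           for a in range(K)]
    a_t = int(np.argmax(ucb))                          # Step 5
    r = float(U[a_t] @ x)                              # Step 6 (noiseless toy rewards)
    total += r
    for a in range(K):
        if a == a_t:                                   # Step 9: real feedback
            V[a] += np.outer(x, x);  z[a] += r * x
        else:                                          # Steps 11-12: pseudo feedback
            p = min(W[a] @ x
                    + beta * np.sqrt(x @ np.linalg.solve(Q_hat[a], x)), 1.0)
            V_hat[a] = eta**2 * V_hat[a] + np.outer(x, x)
            z_hat[a] = eta**2 * z_hat[a] + p * x
        Q_hat[a] = lam * np.eye(d) + V[a] + V_hat[a]          # Step 14
        W[a] = np.linalg.solve(Q_hat[a], z[a] + z_hat[a])     # Step 15
```

Note that every weight vector W[a] changes in every iteration, which is the structural difference from LinUCB discussed above.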

We will compare LinPRUCB to LinUCB empirically in Section 4. Before doing so, we first establish the theoretical guarantee of LinPRUCB and compare it to that of LinUCB. The key results are summarized and discussed as follows. Under the assumption that the received rewards are independent from each other, Lemma 2 below shows that LinPRUCB can achieve a regret of (1 + α + ρ)(∑_{t∈[T]} ˆc_{t,a_t}) + Õ(√T) with probability 1 − δ, where ρ = O(β). LinUCB under the same condition (Chu et al., 2011), on the other hand, achieves a regret of (1 + α)(∑_{t∈[T]} c_{t,a_t}) + Õ(√T) with probability 1 − δ.

One difference between the two bounds is that the bound of LinPRUCB contains an additional term of ρ, which is of O(β) and comes from the pseudo-rewards. By choosing β = O(α), however, the term can always be made comparable to α.

The other difference is that the bound of LinPRUCB contains ˆc_{t,a_t} terms instead of c_{t,a_t}. Recall that ˆc_{t,a} = √(x_t^T ˆQ_{t−1,a}^{−1} x_t) with ˆQ_{t−1,a} = λI_d + X_{t−1,a}^T X_{t−1,a} + ˆX_{t−1,a}^T ˆX_{t−1,a}, while c_{t,a} = √(x_t^T Q_{t−1,a}^{−1} x_t) with Q_{t−1,a} = λI_d + X_{t−1,a}^T X_{t−1,a}. Since ˆQ_{t−1,a} contains the additional term ˆX_{t−1,a}^T ˆX_{t−1,a} when compared with Q_{t−1,a}, the uncertainty terms ˆc_{t,a} are likely smaller than c_{t,a}, especially in the early iterations. Figure 2 shows one typical case that compares c_{t,a_t} of LinUCB and ˆc_{t,a_t} of LinPRUCB with one artificial data set used in Section 4. We see that ˆc_{t,a_t} are indeed generally smaller than c_{t,a_t}. The empirical observation in Figure 2 and the theoretical bound in Lemma 2 explain how LinPRUCB could perform better than LinUCB through playing with the bias-variance trade-off: decreasing the variance (expressed by ˆc_{t,a_t}) while slightly increasing the bias (expressed by ρ).

Figure 2: Comparison of the values on the uncertainty terms (ˆc_{t,a_t} of LinPRUCB and c_{t,a_t} of LinUCB over iterations t)

The regret bounds discussed above rely on the assumption of independent rewards, which is unlikely to hold in general. To deal with this issue, we follow the approach of Chu et al. (2011) and discuss below how LinPRUCB can be modified to a variant algorithm for which a small regret bound can actually be guaranteed without any assumptions. The key technique is to partition the iterations into several parts such that the received rewards in the same part are in fact independent from each other (Auer, 2002). More precisely, Chu et al. (2011) construct a master algorithm named SupLinUCB, which performs the partitioning and calls a subroutine BaseLinUCB (modified from LinUCB) on each part separately. Similarly, we construct another master algorithm SupLinPRUCB, which is similar to SupLinUCB but calls BaseLinPRUCB (modified from LinPRUCB) instead.

The detailed descriptions are listed in Appendix A.^{1}

Then, the big picture of the analysis follows closely from that of LinUCB (Auer, 2002; Chu et al., 2011). First, as shown in Appendix A, SupLinPRUCB in iteration t partitions the first t − 1 iterations into S parts: Ψ_t^1, . . . , Ψ_t^S. Using the same proof as that for Lemma 14 by Auer (2002), we can first establish the following lemma.

Lemma 1 For any s ∈ [S], any t ∈ [T], and any fixed sequence of contexts x_τ with τ ∈ Ψ_t^s, the corresponding rewards r_{τ,a_τ} are independent random variables with E[r_{τ,a_τ}] = u_{a_τ}^T x_τ.

Then, we can prove that BaseLinPRUCB, when given such an independence guarantee, can provide a good estimate of the true reward.

Lemma 2 Suppose the input index set Ψ_t ⊆ [t − 1] given to BaseLinPRUCB has the property that for fixed contexts x_τ, for τ ∈ Ψ_t, the corresponding rewards r_{τ,a_τ} are independent random variables with means u_{a_τ}^T x_τ. Then, for some α = O(√(ln(TK/δ))) and ρ = (2 + β)/√(1 − η), we have with probability 1 − δ/T that |x_t^T w_{t,a} − x_t^T u_a| ≤ (1 + α + ρ) ˆc_{t,a} for every a ∈ [K].

Proof For notational convenience, we drop all the subscripts involving t and a below. By definition,

|x^T w − x^T u|
= |x^T ˆQ^{−1}(X^T r + ˆX^T ˆp) − x^T ˆQ^{−1}(λI_d + X^T X + ˆX^T ˆX)u|
= |x^T ˆQ^{−1}X^T(r − Xu) − λx^T ˆQ^{−1}u + x^T ˆQ^{−1}ˆX^T(ˆp − ˆXu)|
≤ |x^T ˆQ^{−1}X^T(r − Xu)| + λ|x^T ˆQ^{−1}u| + |x^T ˆQ^{−1}ˆX^T(ˆp − ˆXu)|.   (7)

1. The full version that includes the appendix can be found at http://www.csie.ntu.edu.tw/~htlin/paper/doc/acml14pseudo.pdf.

The first two terms in (7) together can be bounded by (1 + α)ˆc with probability 1 − δ/T using arguments similar to those by Chu et al. (2011). The third term arises from our use of pseudo-rewards, which makes our analysis different from that of BaseLinUCB. This is where our forgetting mechanism comes to help. By the Cauchy-Schwarz inequality, this term is at most ‖x^T ˆQ^{−1}ˆX^T‖ ‖ˆp − ˆXu‖, where one can show that ‖x^T ˆQ^{−1}ˆX^T‖ ≤ ˆc using a similar argument as Lemma 1 of Chu et al. (2011). Since the i-th entry of the vector ˆp − ˆXu by definition is at most (2 + β)η^{ℓ−i} in magnitude, we have

‖ˆp − ˆXu‖ ≤ (2 + β) √( ∑_{i≤ℓ} η^{2(ℓ−i)} ) ≤ (2 + β)/√(1 − η) = ρ.

Combining these bounds together, we have the lemma.
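The geometric bound in the last display can be checked numerically; a small sanity check with arbitrarily chosen values of ℓ, η, and β (our own choices, for illustration only):

```python
import math

beta, eta, l = 0.5, 0.9, 50

# Left side: (2 + beta) * sqrt( sum over i = 1..l of eta^{2(l - i)} ).
geometric = sum(eta ** (2 * (l - i)) for i in range(1, l + 1))
lhs = (2 + beta) * math.sqrt(geometric)

# Right side: rho = (2 + beta) / sqrt(1 - eta), as in Lemma 2.
rho = (2 + beta) / math.sqrt(1 - eta)

assert lhs <= rho   # the forgetting mechanism keeps the pseudo-reward term bounded
```

The geometric sum stays below 1/(1 − η²) ≤ 1/(1 − η) for any ℓ, which is why the bound ρ holds uniformly over the iterations.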

Lemma 2 is the key distinction in our analysis, which justifies our use of pseudo-rewards and the forgetting mechanism. With Lemma 1 and Lemma 2, the rest of the regret analysis is almost identical to that of LinUCB (Auer, 2002; Chu et al., 2011), and the analysis leads to the following main theoretical result. The details of the proof are listed in Appendix A.

Theorem 3 For some α = O(√(ln(TK/δ))), for any β = O(α) and any constant η ∈ [0, 1), SupLinPRUCB with probability 1 − δ achieves a regret of O(√(dKT ln^3(KT/δ))).

4. Experiments

Next, we conduct empirical studies on both artificial and real-world data sets to justify the idea of using pseudo-rewards and to compare the performance between LinPRUCB and the state-of-the-art LinUCB algorithm.

Artificial Data Sets. First, we simulate with artificial data sets as follows. Unit vectors u_1, . . . , u_K for the K actions are drawn uniformly from R^d. In iteration t out of the T iterations, a context x_t is first sampled from a uniform distribution within ‖x‖ ≤ 1. Then, the reward r_{t,a_t} = u_{a_t}^T x_t + v_t is generated, where v_t ∈ [−0.05, 0.05] is a random white noise. For the simulation, we consider parameters T = 1000, K ∈ {10, 30}, and d ∈ {10, 30}.
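This generation process can be sketched as follows, reading "drawn uniformly" as uniform directions on the unit sphere (one plausible interpretation; the helper names are our own):

```python
import numpy as np

rng = np.random.default_rng(4)
T, K, d = 1000, 10, 10

# Unit vectors u_1, ..., u_K: normalize Gaussian draws for uniform directions.
U = rng.normal(size=(K, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

def draw_context():
    """Sample x uniformly from the unit ball ||x|| <= 1."""
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    return v * rng.random() ** (1.0 / d)     # radius correction for uniformity

def reward(a, x):
    """r_{t,a} = u_a^T x + v_t with white noise v_t in [-0.05, 0.05]."""
    return float(U[a] @ x) + float(rng.uniform(-0.05, 0.05))
```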

We plot the average cumulative reward (vertical axis) versus the normalized number of iterations (horizontal axis) to illustrate the key ideas of the proposed LinPRUCB algorithm and to compare it with LinUCB. Each curve is the average of 10 runs of simulations. The results are presented in Figure 3. For each (K, d) combination, we use pink-circle marks to represent the curve of LinUCB with full-information feedback, and red-diamond marks to represent the curve of LinUCB with bandit-information. Similar to the observations in the motivating Figure 1, the full-information LinUCB performs better than the bandit-information one. Another curve we include is colored yellow, which represents LinGreedy.

The three curves serve as references to evaluate the performance of LinPRUCB.

To justify the idea of using over-estimated rewards as pseudo-rewards, we first set β in (3) to be 0, fix the forgetting parameter η = 1, and use grid search for the optimal α ∈ {0, 0.2, . . . , 1.2} based on the final cumulative reward after T iterations. This setting of (α, β, η) represents a LinPRUCB variant that estimates the pseudo-reward without the uncertainty term ˆc_{t,a} and never forgets the previous estimates. We use green-cross marks to represent the curve and call it LinPRUCB(α, 0, 1). Note that we allow LinUCB the same flexibility in choosing α with grid search.

Figure 3: Comparison of average cumulative reward for T = 1000, with panels (a) d = 10, K = 10; (b) d = 10, K = 30; (c) d = 30, K = 10; (d) d = 30, K = 30.

Then, we fix η = 1 and run a grid search for the optimal (α, β) within α ∈ {0, 0.2, . . . , 1.2} and β ∈ {0, 0.2, . . . , 1.2}. This setting of (α, β, η) represents a LinPRUCB variant that estimates the pseudo-reward with the uncertainty term in (3) but never forgets the previous estimates. We use light-blue-star marks to represent the curve and call it LinPRUCB(α, β, 1).

Finally, we study the full power of LinPRUCB that not only estimates the pseudo- reward according to (3) but also utilizes the forgetting mechanism. This is done by a grid search on α, β, and η, where η ∈ [0.8, 0.95] with a grid step of 0.05. We use blue-square marks to represent the curve.
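The tuning protocol above amounts to an exhaustive search over the parameter grids. A minimal sketch, where `run_linprucb` is a hypothetical callable that runs one simulation with the given parameters and returns the final cumulative reward after T iterations:

```python
import itertools

def tune(run_linprucb, alphas, betas, etas):
    """Pick the (alpha, beta, eta) combination with the best final
    cumulative reward, mirroring the grid-search protocol in the text."""
    best, best_params = float("-inf"), None
    for a, b, e in itertools.product(alphas, betas, etas):
        reward = run_linprucb(alpha=a, beta=b, eta=e)
        if reward > best:
            best, best_params = reward, (a, b, e)
    return best_params

# Grids used in the text: alpha, beta in {0, 0.2, ..., 1.2};
# eta in [0.8, 0.95] with a grid step of 0.05.
```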

From the results plotted in Figure 3, for each given (K, d), LinPRUCB(α, 0, 1) and LinPRUCB(α, β, 1) do not perform well. They are often worse than LinUCB in the end, and sometimes even worse than LinGreedy. The results justify the importance of both the over-estimates with the uncertainty terms ĉ_{t,a} and the forgetting mechanism that we propose.

Next, we compare the actual LinPRUCB with LinUCB in Figure 3. When focusing on the early iterations, LinPRUCB often gathers more rewards than LinUCB for each (K, d). Interestingly, similar phenomena also occur in the curves of LinPRUCB(α, 0, 1) and LinPRUCB(α, β, 1). These results echo our discussion in Section 3 that it can be beneficial to trade some bias for a smaller variance to improve LinUCB. After T iterations, the performance of LinPRUCB is better than or comparable to that of LinUCB, which echoes our theoretical result in Section 3 of the matching regret bound.

The observations above support the key ideas in designing LinPRUCB. By allowing the algorithm to update every w_{t+1,a} instead of only w_{t+1,a_{t}}, LinPRUCB achieves faster reward-gathering in the earlier iterations, and the over-estimates in the pseudo-rewards and the forgetting mechanism are both essential in maintaining the promising performance of LinPRUCB.

Real-world Benchmark Data Sets. Next, we extend our empirical study to two real-world benchmark data sets, R6A and R6B, released by Yahoo!. To the best of our knowledge, the two data sets are the only public data sets for the contextual bandit problem. They first appeared in the ICML 2012 Workshop on New Challenges for Exploration and Exploitation (https://explochallenge.inria.fr/). The challenge is a practical problem that requires the algorithm to display news articles on a website strategically. The algorithm has to display a news article (action a_{t}) during the visit (iteration t) of a user (context x_{t}), and the user can choose to click (reward r_{t,a_{t}}) on the news to read more information or simply ignore it. The LinUCB algorithm and its variants are among the leading algorithms in the challenge, which imposes a 36-hour run-time limitation on all algorithms. This roughly translates to less than 5 ms for selecting an action and 50 ms for updating the weight vectors w_{t+1,a}.

In R6A, each context x_{t} is 6-dimensional and each dimension contains a value within [0, 1]. In R6B, each x_{t} is 136-dimensional with values within {0, 1} and the first dimension being a constant 1. Rather than selecting an action from a fixed set [K], the actions in the data sets are dynamic, since new news articles may emerge over time and old articles may be dropped. The dynamic nature does not affect the LinPRUCB and LinUCB algorithms, which maintain one w_{t,a} per action regardless of the size of the action set. For each data set, the reward r_{t,a_{t}} equals 1 if the user clicks on the article, and equals 0 otherwise. Then, the click-through rate (CTR) of the algorithm, defined as the ratio of total clicks to the total number of evaluated iterations, corresponds to the average cumulative reward. To achieve an unbiased evaluation when using an offline data set such as R6A or R6B, we use the technique described by Li et al. (2011).
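The replay technique can be sketched as follows; the policy interface and log format here are our own simplified assumptions rather than the exact protocol of Li et al. (2011). A logged event is kept only when the evaluated policy selects the same action that the (uniformly random) logging policy displayed, which yields an unbiased CTR estimate:

```python
def replay_ctr(policy, log):
    """Unbiased offline evaluation in the spirit of Li et al. (2011).

    log: iterable of (context, displayed_action, click) tuples, where the
    displayed actions were chosen uniformly at random by the logging policy.
    """
    clicks = matches = 0
    for x, logged_a, click in log:
        if policy.select(x) == logged_a:
            # Accepted event: count it and reveal the reward to the policy.
            matches += 1
            clicks += click
            policy.update(x, logged_a, click)
        # Otherwise the event is discarded: no update, no count.
    return clicks / matches if matches else 0.0
```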

We split each data set into two sets, one for parameter tuning and the other for testing. The tuning set consists of 50,000 visits, and the testing set consists of the remaining 4,000,000 visits. We run a grid search on the tuning set for every algorithm that comes with tunable parameters. Then, the parameter combination that achieves the best tuning CTR is coupled with the corresponding algorithm to evaluate the CTR on the testing set.

The experimental results for R6A and R6B are summarized in Table 1 and Table 2. The tables demonstrate that LinPRUCB is highly competitive with LinUCB, with a 10% test-CTR improvement on R6A and comparable CTR performance on R6B. The results justify LinPRUCB as a promising alternative to LinUCB in practical applications.

Algorithm     CTR (tuning)   CTR (testing)   selection time   updating time
LinPRUCB      0.038          0.070           9.1              3.6
LinUCB        0.039          0.063           9.4              0.7
LinGreedy     0.028          0.037           3.8              0.4
Random        0.024          0.030           0.7              0.0

Table 1: Experimental results on data set R6A. (d = 6, K = 30; times in seconds)

Algorithm     CTR (tuning)   CTR (testing)   selection time   updating time
LinPRUCB      0.046          0.054           1082             5003
LinUCB        0.046          0.053           1166             248
LinGreedy     0.039          0.059           25               201
Random        0.032          0.034           1                0

Table 2: Experimental results on data set R6B. (d = 136, K = 30; times in seconds)

Interestingly and somewhat surprisingly, for R6B, the best-performing algorithm during testing is neither LinUCB nor LinPRUCB, but LinGreedy. This is possibly because the data set has a larger d, which makes exploration a rather challenging task.

We also list the selection time and the updating time of each algorithm on data sets R6A and R6B in Table 1 and Table 2, respectively. The updating time is the time an algorithm needs to update its models w_{t+1,a} after the reward is revealed. The selection time is the time an algorithm needs to return an action a_{t}, which links directly to the loading speed of a web page in this competition. On both data sets, we aggregate the selection time and the updating time over 50,000 iterations and report them in seconds.

The results in Table 1 show that the selection times of LinPRUCB and LinUCB are similar, but LinPRUCB needs more time in updating. The situation becomes more prominent in Table 2 for the larger dimension d. This is because in one iteration, LinUCB only has to update one model w_{t,a}, while LinPRUCB needs to update the model of every action.

We also note that the selection times of LinPRUCB and LinUCB are larger than their updating times. This is not ideal in practice: a real-world system usually needs to react to users within a short time, while updating on the server can take longer. To meet this requirement, we present a practical variant of the proposed LinPRUCB algorithm in the next section. The variant conducts faster action selection while enjoying performance similar to the original LinPRUCB in some cases.

5. A Practical Variant of LinPRUCB

The practical variant is derived by replacing the action selection step (6) of LinPRUCB with a greedy one, a_{t} = argmax_{a∈[K]} w_{t,a}^{>}x_{t}. That is, we simply set α = 0 in LinPRUCB to drop the calculation of the uncertainty terms ĉ_{t,a}. Then, similar to how LinUCB becomes LinGreedy when α = 0, LinPRUCB becomes the Linear Pseudo-reward Greedy (LinPRGreedy) algorithm.
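The computational difference between the two selection rules can be illustrated as follows; the array-based interface is our own simplification, with `W` stacking the weight vectors w_{t,a} and `Qinv` stacking the cached inverses of the per-action matrices:

```python
import numpy as np

def select_ucb(x, W, Qinv, alpha):
    """UCB-style selection: needs the quadratic form x^T Q^{-1} x per
    action, i.e. O(d^2) work per action even with cached inverses."""
    widths = np.sqrt(np.einsum("i,aij,j->a", x, Qinv, x))
    return int(np.argmax(W @ x + alpha * widths))

def select_greedy(x, W):
    """LinPRGreedy selection (alpha = 0): only inner products, O(d) per
    action, hence the much smaller selection time."""
    return int(np.argmax(W @ x))
```

With `alpha = 0` the two rules coincide, which is exactly how LinPRGreedy is obtained from LinPRUCB.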

Recall our discussion in Section 2 that LinGreedy is faster in action selection than LinUCB. The main cost lies in calculating the uncertainty terms c_{t,a}, which takes O(d^{2}) time even when the inverses of all Q_{t−1,a} have been cached during model updating. Nevertheless, LinUCB relies heavily on the uncertainty terms to select proper actions, and cannot afford to drop them.

Simulation        LinPRUCB        LinUCB          LinPRGreedy     LinGreedy
d = 10, K = 10    0.460 ± 0.010   0.461 ± 0.017   0.454 ± 0.033   0.150 ± 0.016
d = 10, K = 30    0.558 ± 0.005   0.563 ± 0.007   0.504 ± 0.011   0.155 ± 0.006
d = 30, K = 10    0.270 ± 0.008   0.268 ± 0.008   0.271 ± 0.016   0.074 ± 0.010
d = 30, K = 30    0.297 ± 0.003   0.297 ± 0.005   0.255 ± 0.014   0.091 ± 0.009

Table 3: Comparison of average cumulative reward on artificial data sets.

Algorithm     CTR (tuning)   CTR (testing)   selection time (seconds)   updating time (seconds)
LinPRUCB      0.038          0.070           9.1                        3.6
LinUCB        0.039          0.063           9.4                        0.7
LinPRGreedy   0.036          0.068           4.8                        3.4
LinGreedy     0.028          0.037           3.8                        0.4

Table 4: Experimental results on data set R6A.

Our proposed LinPRUCB, however, may not suffer that much from dropping the terms. In particular, because of our reuse of the upper confidence bound in (3), the term β √(x_{t}^{>} Q̂_{t,a}^{−1} x_{t}) in (3) can readily carry out the job of exploration, and thus αĉ_{t,a} in (6) may not be heavily needed. Thus, LinPRUCB provides the flexibility to move the computational load of the action selection step to the model updating step, which matches the needs of practical applications.

The experimental results on artificial data are summarized in Table 3, where the artificial data sets are generated in the same way as in Figure 3. Although LinPRUCB and LinUCB reach higher cumulative rewards, the performance of LinPRGreedy is close to that of LinPRUCB and LinUCB when K = 10. The results justify our argument that LinPRUCB may not rely heavily on αĉ_{t,a} in (6) in some cases, and thus dropping the terms does not affect the performance much. There is a larger gap between LinPRGreedy and LinUCB when K = 30, though.

The experimental results for R6A and R6B are summarized in Table 4 and Table 5, which extend Table 1 and Table 2, respectively. Once again, we see that the performance of LinPRGreedy is competitive with that of LinPRUCB and LinUCB.

Furthermore, the major advantage of LinPRGreedy is its selection time. In each table, the selection time of LinPRGreedy is much smaller than that of LinPRUCB and LinUCB, and the gap grows for larger dimension d. The small selection time and competitive performance make LinPRGreedy a favorable choice for practical applications.

Algorithm     CTR (tuning)   CTR (testing)   selection time   updating time
LinPRUCB      0.046          0.054           1082             5003
LinUCB        0.046          0.053           1166             248
LinPRGreedy   0.045          0.056           24               4899
LinGreedy     0.039          0.059           25               201

Table 5: Experimental results on R6B. (times in seconds)

6. Conclusion

We propose a novel contextual bandit algorithm, LinPRUCB, which combines the ideas of using pseudo-rewards that over-estimate the true rewards, applying a forgetting mechanism to earlier pseudo-rewards, and taking the upper confidence bound in action selection. We prove that LinPRUCB matches the regret bound of the state-of-the-art algorithm LinUCB, and discuss how the bound echoes the promising performance of LinPRUCB in gathering rewards faster in the earlier iterations. Empirical results on artificial and real-world data sets demonstrate that LinPRUCB compares favorably to LinUCB in the early iterations, and is competitive with LinUCB in the long run. Furthermore, we design a variant of LinPRUCB called LinPRGreedy, which shares similar performance benefits with LinPRUCB while enjoying much faster action selection.

Acknowledgments

The paper originates from part of the first author's M.S. thesis (Chou, 2012) and the second author's Ph.D. thesis (Chiang, 2014). The authors thank the anonymous reviewers and the oral committee members of the two leading authors for their valuable comments. The completion of the paper was partially supported by the government of Taiwan via NSC 101-2628-E-002-029-MY2 and 100-2221-E-001-008-MY3.

References

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.

Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. Boosting with online binary learners for the multiclass bandit problem. In ICML, 2014.

Chao-Kai Chiang. Toward realistic online learning. PhD thesis, National Taiwan University, 2014.

Ku-Chun Chou. Pseudo-reward algorithms for linear contextual bandit problems. Master’s thesis, National Taiwan University, 2012.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, 2011.

Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient bandit algorithms for online multiclass prediction. In ICML, 2008.

T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, 2011.

Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, and Vanja Josifovski. Bandits for taxonomies: A model-based approach. In SDM, 2007.

Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. ISBN 0262193981.

Chih-Chun Wang, Sanjeev R. Kulkarni, and H. Vincent Poor. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50(3):338–355, 2005.

Appendix A. Proof of Theorem 3

The descriptions of BaseLinPRUCB and SupLinPRUCB are given in Algorithm 2 and Algorithm 3, respectively. To bound the regret of SupLinPRUCB, we follow the analysis of Auer (2002) and Chu et al. (2011) and replace several critical steps that are related to our algorithm.

Algorithm 2 BaseLinPRUCB

1: Inputs: Ψ_{t} ⊆ [t − 1]; α, β, η, λ > 0
2: Initialize w_{a} := 0_{d}, Q̂_{a} := λI_{d}, V_{a} := V̂_{a} := 0_{d×d}, ẑ_{a} := z_{a} := 0_{d}, for every a ∈ [K]
3: Observe x_{t}
4: for a ∈ [K] do
5:   for τ ∈ Ψ_{t} do
6:     if a_{τ} = a then
7:       V_{a} := V_{a} + x_{τ}x_{τ}^{>}
8:       z_{a} := z_{a} + x_{τ}r_{τ,a}
9:     else
10:      p_{τ,a} := min( w_{a}^{>}x_{τ} + β √(x_{τ}^{>} Q̂_{a}^{−1} x_{τ}), 1 )
11:      V̂_{a} := η V̂_{a} + x_{τ}x_{τ}^{>}
12:      ẑ_{a} := η ẑ_{a} + x_{τ}p_{τ,a}
13:    end if
14:    Q̂_{a} := λI_{d} + V_{a} + V̂_{a}
15:    w_{a} := Q̂_{a}^{−1}(z_{a} + ẑ_{a})
16:  end for
17:  width_{t,a} := (1 + α + ρ) √(x_{t}^{>} Q̂_{a}^{−1} x_{t})
18:  ucb_{t,a} := w_{a}^{>}x_{t} + width_{t,a}
19: end for
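A direct Python sketch of Algorithm 2, under our own assumptions about how the history is passed in: `Psi` holds the past indices τ, with `X[tau]`, `actions[tau]`, and `rewards[tau]` the logged context, action, and reward. Variable names follow the pseudocode:

```python
import numpy as np

def base_linprucb(Psi, X, rewards, actions, K, d,
                  alpha, beta, eta, lam, rho, x_t):
    """Compute width_{t,a} and ucb_{t,a} for every action, as in
    Algorithm 2 (BaseLinPRUCB)."""
    width, ucb = np.zeros(K), np.zeros(K)
    for a in range(K):
        V, V_hat = np.zeros((d, d)), np.zeros((d, d))
        z, z_hat = np.zeros(d), np.zeros(d)
        w, Q_hat = np.zeros(d), lam * np.eye(d)
        for tau in Psi:
            x = X[tau]
            if actions[tau] == a:
                V += np.outer(x, x)             # Step 7
                z += x * rewards[tau]           # Step 8
            else:
                # Step 10: pseudo-reward, over-estimated via the UCB.
                p = min(w @ x + beta * np.sqrt(x @ np.linalg.solve(Q_hat, x)), 1.0)
                V_hat = eta * V_hat + np.outer(x, x)   # Step 11 (forgetting)
                z_hat = eta * z_hat + x * p            # Step 12
            Q_hat = lam * np.eye(d) + V + V_hat        # Step 14
            w = np.linalg.solve(Q_hat, z + z_hat)      # Step 15
        width[a] = (1 + alpha + rho) * np.sqrt(x_t @ np.linalg.solve(Q_hat, x_t))
        ucb[a] = w @ x_t + width[a]
    return width, ucb
```

A production implementation would cache Q̂_{a}^{−1} incrementally rather than re-solving inside the loop; the sketch keeps the step-by-step correspondence with the pseudocode instead.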

Let Ψ^{0} = [T] \ ∪_{s∈[S]} Ψ^{(s)}_{T+1}, which contains those indices not in any of the Ψ^{(s)}_{T+1}. Let Ψ^{(s)}_{t,a} denote the subset of Ψ^{(s)}_{t} that collects all τ ∈ Ψ^{(s)}_{t} such that a_{τ} = a; thus, Ψ^{(s)}_{t} = ∪_{a∈[K]} Ψ^{(s)}_{t,a}. By definition, the expected regret is Σ_{t∈[T]} (E[r_{t,a^{*}_{t}}] − E[r_{t,a_{t}}]), which can be regrouped as

  Σ_{t∈Ψ^{0}} (E[r_{t,a^{*}_{t}}] − E[r_{t,a_{t}}]) + Σ_{s∈[S]} Σ_{a∈[K]} Σ_{t∈Ψ^{(s)}_{T+1,a}} (E[r_{t,a^{*}_{t}}] − E[r_{t,a}]).   (8)

Using Lemma 1 and Lemma 2 stated in Section 3, we can establish a lemma similar to Lemma 15 of Auer (2002), which implies that with probability 1 − δS, the bound in (8) is at most

  2√T + Σ_{s∈[S]} Σ_{a∈[K]} 2^{3−s} |Ψ^{(s)}_{T+1,a}|.   (9)

To bound the second term in (9), we need the following.

Algorithm 3 SupLinPRUCB

1: Initialize S := ln T and Ψ^{(s)}_{1} := ∅ for every s ∈ [S]
2: for t = 1 to T do
3:   s := 1 and Â^{(1)} := [K]
4:   repeat
5:     Call BaseLinPRUCB with input Ψ^{(s)}_{t} to compute width^{(s)}_{t,a} and ucb^{(s)}_{t,a} for every a ∈ Â^{(s)}
6:     if width^{(s)}_{t,a} > 2^{−s} for some a ∈ Â^{(s)} then
7:       Select action a_{t} := a and update: Ψ^{(s)}_{t+1} := Ψ^{(s)}_{t} ∪ {t}, and Ψ^{(s')}_{t+1} := Ψ^{(s')}_{t} for every s' ≠ s
8:     else if width^{(s)}_{t,a} ≤ 1/√T for every a ∈ Â^{(s)} then
9:       Select action a_{t} := argmax_{a∈Â^{(s)}} ucb^{(s)}_{t,a}, and let Ψ^{(s)}_{t+1} := Ψ^{(s)}_{t} for every s ∈ [S]
10:      else
11:        Let Â^{(s+1)} := {a ∈ Â^{(s)} | ucb^{(s)}_{t,a} ≥ u^{(s)} − 2^{1−s}}, where u^{(s)} = max_{a'∈Â^{(s)}} ucb^{(s)}_{t,a'}. Let s := s + 1
12:    end if
13:  until an action a_{t} is selected
14: end for

Lemma 4 For any s ∈ [S] and a ∈ [K],

  2^{−s} |Ψ^{(s)}_{T+1,a}| ≤ 5(1 + α + ρ) √( d |Ψ^{(s)}_{T+1,a}| ln |Ψ^{(s)}_{T+1,a}| ).

Proof From Steps 6 and 7 of SupLinPRUCB, we know that for any s and a, width^{(s)}_{t,a} > 2^{−s} for any t ∈ Ψ^{(s)}_{T+1,a}, and therefore

  2^{−s} |Ψ^{(s)}_{T+1,a}| ≤ Σ_{t∈Ψ^{(s)}_{T+1,a}} width^{(s)}_{t,a}.

From Step 17 of BaseLinPRUCB, we know that

  width^{(s)}_{t,a} = (1 + α + ρ) √(x_{t}^{>} Q̂_{a}^{−1} x_{t})

for some matrix Q̂_{a} = I_{d} + V_{a} + V̂_{a}. The matrices Q̂_{a}, V_{a}, and V̂_{a} actually depend on Ψ^{(s)}_{t,a}, and thus let us denote them more appropriately by Q̂^{(s)}_{t,a}, V^{(s)}_{t,a}, and V̂^{(s)}_{t,a}, respectively, where V^{(s)}_{t,a} is the sum of x_{τ}x_{τ}^{>} over τ ∈ Ψ^{(s)}_{t,a}, and V̂^{(s)}_{t,a} is the sum of x_{τ}x_{τ}^{>} scaled by some power of η, over τ ∈ Ψ^{(s)}_{t} \ Ψ^{(s)}_{t,a}. Using the same proof as for Lemma 3 of Chu et al. (2011), one can show that for the matrices Q^{(s)}_{t,a} = I_{d} + V^{(s)}_{t,a},

  Σ_{t∈Ψ^{(s)}_{T+1,a}} √( x_{t}^{>} (Q^{(s)}_{t,a})^{−1} x_{t} ) ≤ 5 √( d |Ψ^{(s)}_{T+1,a}| ln |Ψ^{(s)}_{T+1,a}| ).

Note that Q̂^{(s)}_{t,a} ⪰ Q^{(s)}_{t,a} and (Q̂^{(s)}_{t,a})^{−1} ⪯ (Q^{(s)}_{t,a})^{−1}, which implies that x^{>}(Q̂^{(s)}_{t,a})^{−1}x ≤ x^{>}(Q^{(s)}_{t,a})^{−1}x for any x ∈ R^{d}. By combining all the bounds together, we have the lemma.

Using Lemma 4, the second term in (9) can be upper bounded by

  40 Σ_{s∈[S]} Σ_{a∈[K]} (1 + α + ρ) √( d |Ψ^{(s)}_{T+1,a}| ln |Ψ^{(s)}_{T+1,a}| )
    ≤ 40(1 + α + ρ) √(d ln T) Σ_{s∈[S]} Σ_{a∈[K]} √|Ψ^{(s)}_{T+1,a}|
    ≤ 40(1 + α + ρ) √(d ln T) √(SKT),

by Jensen's inequality. The rest of the proof is almost identical to that of Auer (2002). That is, by replacing δ with δ/(S + 1), substituting α = O(√(ln(TK/δ))) and ρ = O(β) = O(α), and then applying Azuma's inequality, we obtain our Theorem 3.