Shang-Tse Chen SCHEN351@GATECH.EDU

School of Computer Science, Georgia Institute of Technology, Atlanta, GA

Hsuan-Tien Lin HTLIN@CSIE.NTU.EDU.TW

Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

Chi-Jen Lu CJLU@IIS.SINICA.EDU.TW

Institute of Information Science, Academia Sinica, Taipei, Taiwan

### Abstract

We consider the problem of online multiclass prediction in the bandit setting. Compared with the full-information setting, in which the learner receives the true label as feedback after making each prediction, the bandit setting assumes that the learner can only know the correctness of the predicted label. Because the bandit setting is more restricted, it is difficult to design good bandit learners, and currently there are not many of them. In this paper, we propose an approach that systematically converts existing online binary classifiers into promising bandit learners with a strong theoretical guarantee. The approach matches the idea of boosting, which has been shown to be powerful for batch learning as well as online learning. In particular, we establish the weak-learning condition on the online binary classifier, and show that the condition allows automatically constructing a bandit learner of arbitrary strength by combining several of those classifiers. Experimental results on several real-world data sets demonstrate the effectiveness of the proposed approach.

### 1. Introduction

Recently, machine learning problems that involve partial feedback have received an increasing amount of attention (Auer et al., 2002; Flaxman et al., 2005). These problems occur naturally in many modern applications, such as online advertising and recommender systems (Li et al., 2010).

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

For instance, in a recommender system, the partial feedback represents whether the user likes the content recommended by the system, whereas the user's preferences for the other contents that have not been displayed remain unknown. In this paper, we consider one particular learning problem related to partial feedback: the online multiclass prediction problem in the bandit setting (Kakade et al., 2008). In this problem, the learner iteratively interacts with the environment. In each iteration, the learner observes an instance and is asked to predict its label. The main difference between the traditional full-information setting and the bandit setting is the feedback received after each prediction. In the full-information setting, the true label of the instance is revealed, whereas in the bandit setting, only whether the prediction is correct is known. That is, in the bandit setting, the true label remains unknown if the prediction is incorrect. Our goal is to make as few errors as possible in the harsh environment of the bandit setting.

With more restricted information available, it becomes harder to design good learning algorithms in the bandit setting, except for the case of online binary classification, in which the bandit setting and the full-information setting coincide. Thus, it is desirable to find a systematic way to transform the existing online binary classification algorithms, or combine many of them, to get an algorithm that effectively deals with the bandit setting. This motivation calls for boosting (Schapire, 1990), one of the most popular and well-developed ensemble methods in the traditional batch supervised classification framework.

While most studies on boosting focus on the batch setting (Freund & Schapire, 1997; Schapire & Singer, 1999), some works have extended the success of boosting to the online setting (Oza & Russell, 2001; Chen et al., 2012). However, to the best of our knowledge, there is no boosting algorithm yet for the problem of online multiclass prediction in the bandit setting. In this paper, we study the possibility of extending the promising theoretical and empirical results of boosting to the bandit setting.

As in the design and analysis of boosting algorithms in other settings, we need an appropriate assumption on weak learners in order for boosting to work. A stronger assumption makes the design of a boosting algorithm easier, but at the expense of more restricted applicability. To weaken the assumption, we consider binary full-information weak learners, instead of multiclass bandit ones, with the given binary examples constructed through the one-versus-rest decomposition from the multiclass examples (Chen et al., 2009). Following (Chen et al., 2012), we propose a similar assumption which requires such binary weak learners to perform better than random guessing with respect to "smooth" weight distributions over the binary examples. Then we prove that boosting is possible under this assumption by designing a strong bandit algorithm using such binary weak learners. Our bandit algorithm is extended from the full-information one of (Chen et al., 2012), which provides a method to generate such smooth example weights for updating weak learners, as well as appropriate voting weights for combining the predictions of weak learners.

Nevertheless, our extension in this paper is non-trivial. To compute these weights exactly in (Chen et al., 2012), one needs the full-information feedback, which is not available in our bandit setting. With the limited information of bandit feedback, we show how to find good estimators for the example weights as well as for the voting weights, and we prove that they can in fact be used in place of the true weights to make boosting work in the bandit setting.

Our proposed bandit boosting algorithm enjoys nice theoretical properties similar to those of its batch counterpart. In particular, the proposed algorithm can achieve a small error rate if the performance of each weak learner is better than that of random guessing with respect to the carefully generated weight distributions. In addition, the algorithm reaches promising empirical performance on real-world data sets, even when using very simple full-information weak learners.

Finally, let us stress the difference between our work and existing ones on the bandit problem. Unlike existing works, our goal is not to construct one specific bandit algorithm and analyze its regret. Instead, our goal is to study the possibility of a general paradigm for designing bandit algorithms in a systematic way. Note that there are currently only a very small number of bandit algorithms for the multiclass prediction problem, and most seem to be based on linear models (Kakade et al., 2008; Hazan & Kale, 2011).

With the limited power of such linear models, a high error rate is unavoidable in general, so the focus of these works was to reduce the regret, regardless of whether the actual error rate is high. Our result, on the other hand, works for a broader class of classifiers beyond linear ones. We show how to construct a strong bandit algorithm with an error rate close to zero, when we have weak learners which can perform slightly better than random guessing. Here we allow any weak learners, not just linear ones, and they only need to work in the simpler full-information setting rather than in the more challenging bandit setting. Constructing such weak learners may look much less daunting, but we show that they in fact suffice for constructing strong bandit algorithms. We hope that this could open more possibilities for designing better bandit algorithms in the future.

### 2. Boosting in different settings

Before formally describing our boosting framework in the online bandit setting, let us first review the traditional batch setting as well as the online full-information setting.

In the batch setting, the boosting algorithm has the whole training set S = {(x_1, y_1), . . . , (x_T, y_T)} available at the beginning, where each x_t is the feature vector from some space X ⊆ R^{d} and y_t is its label from some space Y.

For the case of binary classification, we assume Y = {−1, +1}, and the boosting algorithm repeatedly calls the batch weak learner for a number of rounds as follows. In round i, it feeds S as well as a probability distribution p^{(i)} over S to the weak learner, which then returns a weak hypothesis h^{(i)} after seeing the whole S and p^{(i)}. It stops at some round N when the strong hypothesis

H(x) = sign(∑_{i=1}^{N} α^{(i)} h^{(i)}(x)),

with α^{(i)} ∈ R being the voting weight of h^{(i)}, achieves a small error rate over S, defined as |{t : H(x_t) ≠ y_t}|/T.

For the case of multiclass classification, we assume Y = {1, . . . , K}, and for simplicity we adopt the one-versus-rest approach to reduce the multiclass problem to a binary one. More precisely, each multiclass example (x_t, y_t) is decomposed into K binary examples ((x_t, k), y_{tk}), k = 1, . . . , K, where y_{tk} is 1 if y_t = k and −1 otherwise. One can then apply the boosting algorithm to such binary examples and use

H(x) = arg max_{k} ∑_{i} α^{(i)}_{k} h^{(i)}(x, k)

as the strong hypothesis for the original multiclass problem.
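The one-versus-rest decomposition and argmax decoding described above can be sketched as follows; this is a minimal illustration of ours, and the function names are not from the paper:

```python
# Hypothetical sketch of the one-versus-rest reduction: each multiclass
# example (x_t, y_t) with y_t in {1, ..., K} becomes K binary examples
# ((x_t, k), y_tk), where y_tk = +1 iff y_t = k and -1 otherwise.

def ovr_decompose(x, y, K):
    """Return the K binary examples ((x, k), y_k) for one multiclass example."""
    return [((x, k), 1 if y == k else -1) for k in range(1, K + 1)]

def ovr_predict(x, K, score):
    """Strong hypothesis H(x) = argmax_k of a per-class score; here `score`
    stands in for the weighted sum of weak hypotheses for class k."""
    return max(range(1, K + 1), key=lambda k: score(x, k))
```

The decoding step simply picks the class whose combined binary score is largest, which is how the strong hypothesis H above resolves the K binary problems back into one multiclass prediction.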

In the online full-information setting, the examples of S are usually considered as chosen adversarially and they arrive one at a time. The online boosting algorithm must decide on some number N of online weak learners to start with. At step t, the boosting algorithm receives x_t and predicts

H(x_t) = arg max_{k} ∑_{i} α^{(i)}_{tk} h^{(i)}_{t}(x_t, k),

where h^{(i)}_{t} is the weak hypothesis provided by the i-th weak learner and α^{(i)}_{tk} is its voting weight. After the prediction, the true label y_t is revealed, and to update each weak learner, we would like to feed it with a probability measure on each binary example, as in batch boosting. However, in the online setting it is hard to determine a good measure of an example without seeing the remaining examples, so we instead only generate a weight w^{(i)}_{tk} for ((x_t, k), y_{tk}), which after normalization corresponds to the measure p^{(i)}_{tk}, for the i-th weak learner. The goal is again to achieve a small error rate over S, given that each weak learner has some positive advantage, defined as ∑_{t,k} p^{(i)}_{tk} y_{tk} h^{(i)}_{t}(x_t, k). Chen et al. (2012) proposed an online boosting algorithm that achieves this goal in the binary case, which can be easily adapted to the multiclass case here.

In this paper, we consider the online multiclass prediction problem in the bandit setting. The setting is similar to the full-information one, except that at step t the boosting algorithm only receives the bandit information of whether its prediction is correct or not. The goal is essentially the same: to achieve a small error rate, given that each weak learner has some positive advantage.

Several issues arise in designing such a bandit boosting algorithm. The standard approach in designing a bandit algorithm is to use a full-information algorithm as a black box, with its needed information replaced by some estimated one. Usually, the only information needed by a full-information algorithm is the gradient of the loss function at each step, and this information is used only once, for updating its next strategy or action. As a result, the performance (regret) of such a bandit algorithm can be easily analyzed based on that of the full-information one, as it is usually expressed as a simple function of the gradients.

For our boosting problem, we would also like to follow this approach, and the only available full-information boosting algorithm with a theoretical guarantee is that of (Chen et al., 2012). However, it is not obvious what to estimate now, since that algorithm involves three online processes which all need the information y_t, but for different purposes. First, the boosting algorithm needs y_t to compute the example weights w^{(i)}_{tk}. Second, the boosting algorithm needs y_t to compute the voting weights α^{(i)}_{tk}. Third, the weak learners also need y_t, in addition to the w^{(i)}_{tk}'s, to update their next hypotheses. Can one single bit of bandit information about y_t be used to get good estimators for all three processes?

Furthermore, as y_t is used in several places and in a more involved way, the bandit algorithm may not be able to use the full-information one as a simple black box, and its performance (error rate) may not be easily derived from that of the full-information one. Finally, it is not clear what appropriate assumption one should make on the weak learners in order for boosting to work in the bandit setting. In fact, it is not even clear what type of weak learners one should use.

Perhaps the most natural choice is to use multiclass bandit algorithms. That is, starting from weak multiclass bandit algorithms, we "boost" them into strong multiclass bandit ones. Surprisingly, we will show that it suffices to use binary full-information algorithms with a positive advantage as weak learners. This not only gives us a stronger result in theory, as a weaker assumption on weak learners is needed, but also provides more possibilities for designing weak learners (and thus strong bandit algorithms) in practice, as most existing multiclass bandit algorithms are linear ones.

We will use the following notation and conventions. For a positive integer n, we let [n] denote the set {1, . . . , n}. For a condition π, we use the notation 1[π], which gives the value 1 if π holds and 0 otherwise. For simplicity, we assume that each x_t has length ‖x_t‖_2 ≤ 1 and each hypothesis h_t comes from some family H with h_t(x_t, k) ∈ [−1, 1].

### 3. Online weak learners

In this section, we study reasonable assumptions on weak learners for allowing boosting to work in the bandit setting.

As mentioned in the previous section, instead of using multiclass bandit algorithms as weak learners, we will use binary full-information ones. A natural assumption to make is for such a binary full-information algorithm to achieve a positive advantage with respect to any example weights.

However, as noted in (Chen et al., 2012), this assumption is too strong to achieve, as one cannot expect an online algorithm to achieve a positive advantage in extreme cases, such as when only the first example has a nonzero weight. Thus, some constraints must be put on the example weights.

To identify an appropriate constraint, let us follow (Chen et al., 2012) and consider the case that each hypothesis h_t consists of K linear functions with h_t(x, k) = ⟨h_{tk}, x⟩, the inner product of the two vectors h_{tk} and x, with ‖h_{tk}‖_2 ≤ 1. When given an example (x_t, k), the weak learner uses h_{tk} to predict the binary label y_{tk}. After that, it receives y_{tk} as well as the example weight w_{tk}, and uses them to update h_{tk} into a new h_{(t+1)k}. We can reduce the task of such a weak learner to the well-known online linear optimization problem, by using the reward function r_{tk}(h_{tk}) = w_{tk} y_{tk} ⟨h_{tk}, x_t⟩, which is linear in h_{tk}. Then we can apply the online gradient descent algorithm of (Zinkevich, 2003) to generate h_{tk} at step t, and a standard regret analysis shows that for some constant c > 0,

∑_t w_{tk} y_{tk} ⟨h_{tk}, x_t⟩ ≥ ∑_t w_{tk} y_{tk} ⟨h^{*}_{k}, x_t⟩ − √(c ∑_t w_{tk}^{2})

for any h^{*}_{k} with ‖h^{*}_{k}‖_2 ≤ 1. Summing over k ∈ [K] and using the Cauchy-Schwarz inequality, we get

∑_{t,k} w_{tk} y_{tk} ⟨h_{tk}, x_t⟩ ≥ ∑_{t,k} w_{tk} y_{tk} ⟨h^{*}_{k}, x_t⟩ − √(cK ∑_{t,k} w_{tk}^{2}).

Let |w| denote the total weight ∑_{t,k} w_{tk}, so that p_{tk} = w_{tk}/|w| is the measure of example (x_t, k). Then by dividing both sides of the inequality above by |w|, we obtain

∑_{t,k} p_{tk} y_{tk} ⟨h_{tk}, x_t⟩ ≥ ∑_{t,k} p_{tk} y_{tk} ⟨h^{*}_{k}, x_t⟩ − √(cK ∑_{t,k} w_{tk}^{2} / |w|^{2}).

Note that ∑_{t,k} p_{tk} y_{tk} ⟨h^{*}_{k}, x_t⟩ is the advantage of the offline learner, and suppose that it is at least 3γ > 0. Moreover, suppose the example weights are large, in the sense that they satisfy the following condition:

|w| ≥ cKB/γ^{2},  (1)

where B ≥ max_{t,k} w_{tk} is a constant that will be fixed later. Then the advantage of the online weak learner becomes

∑_{t,k} p_{tk} y_{tk} ⟨h_{tk}, x_t⟩ ≥ 3γ − √(cKB|w| / |w|^{2}) ≥ 3γ − γ = 2γ.
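The online-gradient-descent weak learner used in this argument can be sketched as follows; this is our own minimal illustration with a fixed step size (the class name and the constant step size are our choices, not the paper's):

```python
import math

# Sketch of a linear weak learner trained by online gradient ascent on the
# linear reward r(h) = w * y * <h, x>, with projection back onto the unit
# ball so that ||h||_2 <= 1 stays satisfied (a Zinkevich-style update).

class LinearWeakLearner:
    def __init__(self, dim, eta=0.1):
        self.h = [0.0] * dim   # current weight vector h_tk
        self.eta = eta         # step size (fixed here for simplicity)

    def predict(self, x):
        # h(x) = <h, x>, a value in [-1, 1] when ||x||_2 <= 1
        return sum(hi * xi for hi, xi in zip(self.h, x))

    def update(self, x, y, w):
        # gradient of the reward w * y * <h, x> with respect to h is w * y * x
        self.h = [hi + self.eta * w * y * xi for hi, xi in zip(self.h, x)]
        norm = math.sqrt(sum(hi * hi for hi in self.h))
        if norm > 1.0:         # project back onto the unit ball
            self.h = [hi / norm for hi in self.h]
```

On a weighted stream of binary examples, this learner's cumulative reward tracks that of the best fixed unit vector up to the square-root regret term in the inequality above.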

This motivates us to propose the following assumption on weak learners, which need not be linear ones.

Assumption 1. There is an online full-information weak learner which can achieve an advantage 2γ > 0 for any sequence of examples and weights satisfying condition (1).

From the discussion above, we have the following.

Lemma 1. Suppose that for any sequence of examples and weights satisfying condition (1), there exists an offline linear hypothesis with an advantage 3γ > 0. Then Assumption 1 holds.

Let us make two remarks on Assumption 1. First, the assumption that a weak learner has a positive advantage is just the assumption that it predicts better than random guessing, which is the standard assumption used by (almost) all previous batch boosting algorithms. Second, the condition (1) on example weights actually makes our assumption weaker, which in turn makes the boosting task harder and our boosting result in the next section stronger.

More precisely, we only require the weak learner to perform well (having a positive advantage) when the weights are large, and we do not care how badly it may perform with small weights. In fact, we will make our boosting algorithm call the weak learner with large weights.

### 4. Our bandit boosting algorithm

In this section we show how to design a bandit boosting algorithm under Assumption 1. Let WL be such an online full-information weak learner; we will run N copies of WL, for some N to be determined later. We follow the approach of reducing the multiclass problem to a binary one as described in Section 2, and we base our bandit boosting algorithm on the full-information one of (Chen et al., 2012) that works for binary classification in the full-information setting. More precisely, at step t we do the following after receiving the feature vector x_t. For each class k ∈ [K], a new feature vector (x_t, k) is created, we obtain a binary weak hypothesis h^{(i)}_{tk}(x) = h^{(i)}_{t}(x, k) from the i-th weak learner, for i ∈ [N], and we form the strong hypothesis

H_t(x) = arg max_{k∈[K]} f_{tk}(x), with f_{tk}(x) = ∑_{i=1}^{N} α^{(i)}_{tk} h^{(i)}_{tk}(x),

where α^{(i)}_{tk} is some voting weight for the i-th weak learner. Then we make our prediction ŷ_t based on H_t(x_t) in some way and receive the feedback 1[ŷ_t = y_t]. Using the feedback, we prepare some example weight w^{(i)}_{tk} to update the i-th weak learner, as well as to compute the next voting weight α^{(i)}_{(t+1)k}, for i ∈ [N]. It remains to show how to set the example weights and the voting weights, as well as how to choose ŷ_t, which we describe and analyze in detail next. The complete algorithm is given in Algorithm 1.

Algorithm 1. Bandit boosting algorithm with online weak learner WL

Input: streaming examples (x_1, y_1), . . . , (x_T, y_T).
Parameters: 0 < δ < 1, 0 ≤ θ < γ < 1/2.
Choose α^{(i)}_{1k} = 1/N and a random h^{(i)}_{1k} for k ∈ [K], i ∈ [N].
for t = 1 to T do
    Let H_t(x) = arg max_{k∈[K]} ∑_{i∈[N]} α^{(i)}_{tk} h^{(i)}_{tk}(x).
    Let p_t(k) = (1 − δ) 1[k = H_t(x_t)] + δ/K for k ∈ [K].
    Predict ŷ_t according to the distribution p_t. Receive the information 1[ŷ_t = y_t].
    for k = 1 to K and i = 1 to N do
        Update w^{(i)}_{tk} according to (4).
        If ŷ_t = k, call WL(h^{(i)}_{tk}, (x_t, k), y_{tk}, w^{(i)}_{tk}) to obtain h^{(i)}_{(t+1)k}; otherwise, let h^{(i)}_{(t+1)k} = h^{(i)}_{tk}.
        Update α^{(i)}_{(t+1)k} according to (6).
    end for
end for
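The exploration step of the algorithm, where the boosted label is played with probability 1 − δ and each label is otherwise explored uniformly, can be sketched as follows (our own illustration; the function names are not from the paper):

```python
import random

# Sketch of the prediction distribution of Algorithm 1:
# p_t(k) = (1 - delta) * 1[k = H_t(x_t)] + delta / K for k = 1..K,
# followed by sampling the actual prediction y_hat_t from p_t.

def prediction_distribution(H_xt, K, delta):
    """Mix the boosted prediction H_xt with uniform exploration."""
    return {k: (1 - delta) * (1 if k == H_xt else 0) + delta / K
            for k in range(1, K + 1)}

def sample_prediction(p_t, rng=random):
    """Draw the predicted label y_hat_t from the distribution p_t."""
    labels, probs = zip(*sorted(p_t.items()))
    return rng.choices(labels, weights=probs, k=1)[0]
```

Because every label keeps probability at least δ/K, the importance-weighted estimators used below stay bounded by B = K/δ.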

The example weight of (x_t, k) for the i-th weak learner used by the full-information algorithm of (Chen et al., 2012) is

w̄^{(i)}_{tk} = min{(1 − γ)^{z^{(i−1)}_{tk}/2}, 1},  (2)

where z^{(0)}_{tk} = 0 and z^{(i−1)}_{tk}, for i − 1 ≥ 1, is defined as

z^{(i−1)}_{tk} = ∑_{j=1}^{i−1} (y_{tk} h^{(j)}_{tk}(x_t) − θ), with θ = γ/(2 + γ),  (3)
which depends on the information y_{tk}. As we are in the bandit setting, we do not have y_{tk} to compute such weights in general. Thus, we balance exploitation with exploration by independently predicting H_t(x_t) with probability 1 − δ and a random label with probability δ, with δ ≤ γ; let ŷ_t denote our prediction. For k ∈ [K], let p_t(k) denote the probability that ŷ_t = k. Then we replace the example weight w̄^{(i)}_{tk} by the estimator

w^{(i)}_{tk} = w̄^{(i)}_{tk}/p_t(k) if ŷ_t = k, and w^{(i)}_{tk} = 0 otherwise,  (4)

which we can compute, because when ŷ_t = k, we do have y_{tk} to compute w̄^{(i)}_{tk}. As p_t(k) ≥ δ/K, we can choose B = K/δ and have w^{(i)}_{tk} ≤ B for any t, k, i. Note that w^{(i)}_{tk} and w̄^{(i)}_{tk} are random variables, and the following shows that the w^{(i)}_{tk}'s are in fact good estimators for the w̄^{(i)}_{tk}'s.
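The importance-weighting idea in (4) can be sketched as follows; this is our own illustration, and the helper names are hypothetical. Dividing by p_t(k) on the event ŷ_t = k is exactly what makes the estimator unbiased:

```python
# Sketch of the estimator (4): the true weight w_bar is only computable
# when y_hat_t = k (since only then is y_tk revealed), and dividing by
# p_t(k) keeps the estimator unbiased, i.e. E[w] = w_bar.

def estimate_weight(w_bar, k, y_hat, p_t):
    """w_tk = w_bar_tk / p_t(k) if y_hat_t = k, else 0."""
    return w_bar / p_t[k] if y_hat == k else 0.0

def expected_estimate(w_bar, k, p_t):
    """Average the estimator over all outcomes y_hat = k' drawn from p_t;
    this recovers w_bar exactly, demonstrating unbiasedness."""
    return sum(p_t[kk] * estimate_weight(w_bar, k, kk, p_t) for kk in p_t)
```

The same construction reappears below for the subgradient estimator used to update the voting weights.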

Claim 1. For any t, k, i, E[w̄^{(i)}_{tk}] = E[w^{(i)}_{tk}]. For any k, i, λ,

Pr[ |∑_t (w̄^{(i)}_{tk} − w^{(i)}_{tk})| > λT ] ≤ 2^{−Ω(λ^{2}T/B^{2})}.

Proof. Observe that any fixing of the randomness up to step t − 1 leaves w̄^{(i)}_{tk} and w^{(i)}_{tk} with the same conditional expectation. Thus, E[w̄^{(i)}_{tk}] = E[w^{(i)}_{tk}]. Moreover, as the random variables M_t = w̄^{(i)}_{tk} − w^{(i)}_{tk}, for t ∈ [T], form a martingale difference sequence, with |M_t| ≤ B, the probability bound follows from Azuma's inequality.

This claim allows us to use w^{(i)}_{tk} as the example weight of (x_t, k) to update the i-th weak learner. However, as each weak learner is assumed to be a full-information one, it also needs the label y_{tk} to update, which we may not know. One may try to take a similar approach as before and feed the weak learner with an estimator which is y_{tk}/p_t(k) when ŷ_t = k and 0 otherwise, but this does not work, as it does not take a value in {−1, 1} as needed by the binary weak learner. Instead, we take a different approach: we only call the weak learner to update when ŷ_t = k, so that we know y_{tk}. That is, when ŷ_t = k, we call the i-th weak learner with w^{(i)}_{tk} and y_{tk}, which can then update and return the next weak hypothesis h^{(i)}_{(t+1)k}; otherwise, we do not call the i-th weak learner to update, and we simply let the next hypothesis h^{(i)}_{(t+1)k} be the current h^{(i)}_{tk}. Another issue is that a weak learner is only assumed to work well when given large example weights satisfying condition (1), and even then, it only works well on those examples which are given to it to update. This is dealt with by the following.

Lemma 2. Let δ ∈ [0, 1], let m be the largest number such that ∑_{t,k} w^{(i)}_{tk} ≥ δKT for every i ≤ m, and let

f̄_{tk}(x) = ∑_{i=1}^{m} (1/m) h^{(i)}_{tk}(x).

Then when T ≥ c_0(K^{2}/δ^{4}) log(K/δ) for a large enough constant c_0,

Pr[ |{(t, k) : y_{tk} f̄_{tk}(x_t) < θ}| > 2δKT ] ≤ δ,

for the parameter θ = γ/(2 + γ) introduced in (3).
Proof. Note that according to the definition, for any t and k, w̄^{(m+1)}_{tk} ≥ 0, and w̄^{(m+1)}_{tk} = 1 if y_{tk} f̄_{tk}(x_t) < θ, as z^{(m)}_{tk} = m(y_{tk} f̄_{tk}(x_t) − θ). This implies that

|{(t, k) : y_{tk} f̄_{tk}(x_t) < θ}| ≤ ∑_{t,k} w̄^{(m+1)}_{tk}.

As ∑_{t,k} w^{(m+1)}_{tk} < δKT by the definition of m, we have

Pr[ |{(t, k) : y_{tk} f̄_{tk}(x_t) < θ}| > 2δKT ] ≤ Pr[ ∑_{t,k} w̄^{(m+1)}_{tk} − ∑_{t,k} w^{(m+1)}_{tk} > δKT ],

which by a union bound and Claim 1 is at most K·2^{−Ω(δ^{2}T/B^{2})} ≤ δ.

The following lemma gives an upper bound on the parameter m defined in Lemma 2.

Lemma 3. Suppose Assumption 1 holds and T ≥ cK/(δ^{2}γ^{2}) for the constant c in condition (1). Then the parameter m in Lemma 2 is at most O(K/(δ^{2}γ^{2})).

Proof. Note that for any i ∈ [m], ∑_{t,k} w^{(i)}_{tk} ≥ δKT ≥ cKB/γ^{2}, with B = K/δ, and the condition (1) is satisfied. Thus from Assumption 1, we have

∑_{i,t,k} w^{(i)}_{tk} y_{tk} h^{(i)}_{tk}(x_t) ≥ 2γ ∑_{i,t,k} w^{(i)}_{tk},  (5)

with the sums over i above, as well as in the rest of the proof, being taken over i ∈ [m]. On the other hand, we have the following claim.

Claim 2. ∑_{i,t,k} w^{(i)}_{tk} y_{tk} h^{(i)}_{tk}(x_t) ≤ O(BKT/γ) + γ ∑_{i,t,k} w^{(i)}_{tk}.

We omit its proof here as it is very similar to that of Lemma 5 in (Servedio, 2003).^{1} Combining the bound in Claim 2 with the inequality (5), we have

γ ∑_{i,t,k} w^{(i)}_{tk} ≤ O(BKT/γ).

^{1} Although that lemma is for the w̄^{(i)}_{tk}'s, its proof can be easily modified to work for the w^{(i)}_{tk}'s, but with an additional factor of B appearing in the term O(BKT/γ) here.

Since ∑_{t,k} w^{(i)}_{tk} ≥ δKT for i ∈ [m] and B = K/δ, we get γmδKT ≤ O(K^{2}T/(δγ)). From this, the required bound on m follows, and we have the lemma.

Let us suppose that T ≥ c_0(K^{2}/δ^{4}) log(K/δ) for a large enough constant c_0, so that both lemmas apply. Then Lemma 2 shows that one can obtain a strong learner by combining the first m weak learners. However, one cannot determine the number m before seeing all the examples, and in fact in our online setting, we need to decide the number N of weak learners even before seeing the first example. Following (Chen et al., 2012), we set N to be the upper bound given by Lemma 3. Then at step t, for each k ∈ [K], we consider the function

f_{tk}(x) = ∑_{i=1}^{N} α^{(i)}_{tk} h^{(i)}_{tk}(x)

and reduce the task of finding such α_{tk} = (α^{(1)}_{tk}, . . . , α^{(N)}_{tk}) to the Online Convex Programming problem. More precisely, we use the N-dimensional probability simplex, denoted by P_N, as the feasible set and define the loss function as

L_{tk}(α) = max{0, θ − y_{tk} ∑_{i=1}^{N} α^{(i)} h^{(i)}_{tk}(x_t)},

which is a convex function of α. However, unlike in (Chen et al., 2012), we are in the bandit setting and thus may not know y_{tk}. To overcome this, we use a similar idea as before to estimate a subgradient ∇L_{tk}(α_{tk}) by

ℓ_{tk} = ∇L_{tk}(α_{tk})/p_t(k) if ŷ_t = k, and ℓ_{tk} = 0 otherwise.

One can then use ℓ_{tk} = (ℓ^{(1)}_{tk}, . . . , ℓ^{(N)}_{tk}) to perform gradient descent to update α_{tk} as in (Chen et al., 2012). However, to get a better theoretical bound, here we choose to perform a multiplicative update on α_{tk} to get α_{(t+1)k} = (α^{(1)}_{(t+1)k}, . . . , α^{(N)}_{(t+1)k}) for step t + 1, with

α^{(i)}_{(t+1)k} = α^{(i)}_{tk} · e^{−η ℓ^{(i)}_{tk}} / Z_{(t+1)k},  (6)

where Z_{(t+1)k} is the normalization factor and η is the learning rate, which we set to δ^{3}/K. Then we have the following.
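The hinge-type loss and the multiplicative update (6) can be sketched as follows; this is our own minimal illustration, with function names of our choosing:

```python
import math

# Sketch of the voting-weight update: a subgradient of the hinge-type loss
# L(alpha) = max{0, theta - y * sum_i alpha_i * h_i}, followed by the
# multiplicative update alpha_i <- alpha_i * exp(-eta * l_i) / Z, which
# keeps alpha on the probability simplex P_N.

def hinge_subgradient(alpha, h_vals, y, theta):
    """Subgradient of L at alpha: -y * h_i on the active (margin < theta)
    branch, and 0 otherwise."""
    margin = y * sum(a * h for a, h in zip(alpha, h_vals))
    return [-y * h if margin < theta else 0.0 for h in h_vals]

def multiplicative_update(alpha, ell, eta):
    """Exponentiated-gradient step with simplex renormalization."""
    unnorm = [a * math.exp(-eta * l) for a, l in zip(alpha, ell)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]
```

Weak learners whose predictions agree with the revealed label see their loss subgradient point downward and hence gain voting weight after normalization.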

Lemma 4. Pr[ ∑_{t,k} L_{tk}(α_{tk}) ≤ O(δKT) ] ≥ 1 − 2δ.

Proof. Following the standard analysis, one can show that for any k ∈ [K] and any ᾱ_k ∈ P_N,

∑_t ⟨ℓ_{tk}, α_{tk} − ᾱ_k⟩ ≤ O((log N)/η) + η ∑_t ‖ℓ_{tk}‖^{2}_{∞} ≤ O(δKT),  (7)

since ‖ℓ_{tk}‖^{2}_{∞} ≤ B^{2} = K^{2}/δ^{2}, η = δ^{3}/K, and N ≤ O(K/δ^{4}). Now note that for any t and k, E[⟨ℓ_{tk}, α_{tk} − ᾱ_k⟩] = E[⟨∇L_{tk}(α_{tk}), α_{tk} − ᾱ_k⟩], because given the randomness up to step t − 1, α_{tk} is fixed and the conditional expectation of ℓ_{tk} equals ∇L_{tk}(α_{tk}). Moreover, as L_{tk}(α_{tk}) − L_{tk}(ᾱ_k) ≤ ⟨∇L_{tk}(α_{tk}), α_{tk} − ᾱ_k⟩ for convex L_{tk}, we have

E[L_{tk}(α_{tk}) − L_{tk}(ᾱ_k)] ≤ E[⟨ℓ_{tk}, α_{tk} − ᾱ_k⟩].

Then using the bound in (7) and applying a similar martingale analysis as before, one can show that for any ᾱ_k ∈ P_N,

Pr[ ∑_{t,k} (L_{tk}(α_{tk}) − L_{tk}(ᾱ_k)) ≤ O(δKT) ] ≥ 1 − δ.

Let ᾱ_k = (ᾱ^{(1)}_{k}, . . . , ᾱ^{(N)}_{k}), with ᾱ^{(i)}_{k} = 1/m for i ≤ m and ᾱ^{(i)}_{k} = 0 for i > m, so that

L_{tk}(ᾱ_k) = max{0, θ − y_{tk} f̄_{tk}(x_t)} ≤ (1 + θ) 1[y_{tk} f̄_{tk}(x_t) < θ].

Then we know from Lemma 2 that

Pr[ ∑_{t,k} L_{tk}(ᾱ_k) ≤ (1 + θ) 2δKT ] ≥ 1 − δ.

Combining the two probability bounds together, we have the lemma.

Finally, recall that to predict each y_t, we independently output H_t(x_t) = arg max_{k∈[K]} f_{tk}(x_t) with probability 1 − δ and a random label with probability δ. Thus, by a Chernoff bound, our algorithm makes at most |{t : H_t(x_t) ≠ y_t}| + 2δT errors with probability 1 − 2^{−Ω(δ^{2}T)} ≥ 1 − δ. On the other hand, as

1[H_t(x_t) ≠ y_t] ≤ ∑_k 1[y_{tk} f_{tk}(x_t) < 0] ≤ ∑_k L_{tk}(α_{tk})/θ,

Lemma 4 implies that

Pr[ |{t : H_t(x_t) ≠ y_t}| ≤ O(δKT/θ) ] ≥ 1 − 2δ.

Consequently, for θ = γ/(2 + γ), we can conclude that our algorithm makes at most

O(KδT/θ) + 2δT ≤ O(KδT/γ)

errors with probability at least 1 − 3δ. Therefore, we have the following, which is the main result of our paper.

Theorem 1. Suppose Assumption 1 holds and T ≥ c_0(K^{2}/δ^{4}) log(K/δ) for a large enough constant c_0. Then our bandit algorithm uses O(K/(δ^{2}γ^{2})) weak learners and makes O(KδT/γ) errors with probability 1 − 3δ.

Note that the error rate of our algorithm is O(Kδ/γ), which can be made smaller than any given ε by setting δ = O(εγ/K), with the requirement on T and the number of weak learners adjusted accordingly. We remark that we did not attempt to optimize our bounds (which we believe can be improved), as our focus was on establishing the possibility of boosting in the bandit setting. Moreover, it does not seem appropriate to compare our error bound with the regret bounds of existing bandit algorithms. This is because existing algorithms are usually based on linear classifiers, which may have large error rates even though their regrets are small.

On the other hand, our boosting algorithm works for any type of classifier and achieves a small error rate as long as we have weak learners which satisfy Assumption 1.

### 5. Experiments

In this section, we validate the empirical performance of the proposed algorithm on several real-world data sets. We compare with two representative algorithms. The first one is Banditron (Kakade et al., 2008), one of the first proposed algorithms for the bandit setting. It is modified from a multiclass variant of the well-known Perceptron algorithm (Rosenblatt, 1962) using the so-called Kesler's construction (Duda & Hart, 1973). By doing some random exploration, it can accurately construct an estimate of the update step of the full-information multiclass Perceptron. The algorithm has a good theoretical guarantee, especially when the data is linearly separable. It can be viewed as a direct modification of a full-information learner (Perceptron) for the bandit setting, without combining learners for boosting.

The second one is Conservative OVA (C-OVA) (Chen et al., 2009), which uses a one-versus-all multiclass-to-binary decomposition similar to our algorithm. But unlike most bandit algorithms, it does not do random exploration at all. Instead, it conservatively updates using whatever it gets from the partial feedback, hence the name.

Note that although it embeds an online binary learning algorithm as its "base learner", it does not perform boosting by combining several base learners as our algorithm does.

Also, C-OVA performs a margin-based decoding of the binary classification results, and hence may not work well with non-margin-based base learners.

To demonstrate the boosting ability of our proposed algorithm, we choose two completely different types of online binary classifiers as our weak learners. The first one is Perceptron, a standard margin-based linear classifier. Note that in (Chen et al., 2009) they use a similar but more complex Online Passive-Aggressive Algorithm (PA) (Crammer et al., 2006) as its internal learner. Since we found little difference in performance on the data sets we tested between the PA algorithm and the Perceptron algorithm, we only report the results using the simpler and more famous Perceptron algorithm to compare fairly with Banditron. The second weak learner we use is Naive Bayes, a simple statistical classifier that estimates the posterior probability of each class using Bayes' theorem and the assumption of conditional independence between features.

Table 1. The data sets used in our experiments.

| Data set  | Car   | DNA   | Nursery | Connect4 | Reuters4 |
|-----------|-------|-------|---------|----------|----------|
| #classes  | 4     | 3     | 5       | 3        | 4        |
| #features | 6     | 180   | 8       | 126      | 346,810  |
| #examples | 1,728 | 3,186 | 12,960  | 67,557   | 673,768  |
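An online binary Naive Bayes weak learner of the kind described above can be sketched as follows; this is our own illustration (the class name, Laplace smoothing, and the restriction to binary features are our choices, not details from the paper):

```python
import math
from collections import defaultdict

# Sketch of an online binary Naive Bayes classifier: it keeps per-class
# counts of binary feature occurrences and scores an instance by the log
# posterior ratio under the conditional-independence assumption, with
# Laplace smoothing so unseen features do not zero out the posterior.

class OnlineNaiveBayes:
    def __init__(self):
        self.class_count = {1: 0.0, -1: 0.0}
        self.feat_count = {1: defaultdict(float), -1: defaultdict(float)}

    def update(self, x, y, w=1.0):
        # x: iterable of binary features; the update is weighted by w
        self.class_count[y] += w
        for j, xj in enumerate(x):
            if xj:
                self.feat_count[y][j] += w

    def score(self, x):
        # log P(y=+1 | x) - log P(y=-1 | x), up to a shared constant
        total = sum(self.class_count.values()) + 2.0
        s = 0.0
        for y in (1, -1):
            sign = 1.0 if y == 1 else -1.0
            ny = self.class_count[y]
            s += sign * math.log((ny + 1.0) / total)   # smoothed prior
            for j, xj in enumerate(x):
                p = (self.feat_count[y][j] + 1.0) / (ny + 2.0)  # Laplace
                s += sign * math.log(p if xj else 1.0 - p)
        return s
```

The sign of the score gives the binary prediction, so this learner plugs into the same one-versus-rest framework as the Perceptron weak learner.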

5.1. Results

We test our algorithm on 5 public real-world data sets from various domains with different sizes: CAR, NURSERY, and CONNECT4 from the UCI machine learning repository (Frank & Asuncion, 2010); DNA from the Statlog project (Michie et al., 1994); and REUTERS4 from the Banditron paper (Kakade et al., 2008). Basic information about these data sets is summarized in Table 1. As described previously, each example is first used for prediction before the disclosure of its label, and the error rate is the number of prediction errors divided by the total number of examples.

All the experiments are repeated 10 times with different random orderings of the examples.
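This predict-then-reveal protocol (a form of progressive evaluation) can be sketched as follows; the `learner` interface and `stream` format are our assumptions for illustration, and only the correctness bit of the bandit feedback ever reaches the learner:

```python
def online_error_rate(learner, stream):
    """Progressive evaluation: each example is predicted before its
    feedback is used for updating. `learner` is assumed to expose
    predict(x) and update(x, y_hat, correct); `stream` yields (x, y)
    pairs, where y is the hidden true label."""
    mistakes = 0
    total = 0
    for x, y in stream:
        y_hat = learner.predict(x)
        correct = (y_hat == y)
        mistakes += (not correct)
        total += 1
        learner.update(x, y_hat, correct)  # bandit feedback only
    return mistakes / total
```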

For fairness of comparison with Banditron, we do not tune any parameters other than the exploration rate δ. We fix the number of weak learners to 100 and the assumed weak-learner advantage γ to 0.1, as in the full-information online boosting algorithm (Chen et al., 2012). For the exploration rate δ, we test a wide range of values to examine the effect of random exploration. The results are shown in Figure 1. Note that C-OVA is not included in this figure, since it performs no random exploration and is parameter-free. One can see that for a reasonable range of values of δ (around 0.01 to 0.1), the performance of our algorithm is quite strong and relatively stable, while setting δ too high or too low results in worse performance, as expected. Table 2 summarizes the average error rate and the standard deviation when the best choice of δ is used for Banditron and for our algorithm.
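The role of δ matches the standard exploration scheme of Banditron-style algorithms: the played label is drawn from a distribution that mixes the greedy prediction with a uniform distribution over all classes, so that with probability roughly δ a random class is explored. A minimal sketch (function name is ours):

```python
import random

def banditron_style_sample(greedy_label, n_classes, delta, rng=random):
    """Sample the played label from the exploration-smoothed distribution
    p(k) = (1 - delta) * [k == greedy_label] + delta / n_classes.
    Returns the label and its probability (the probability is what
    unbiased loss estimates divide by)."""
    probs = [delta / n_classes] * n_classes
    probs[greedy_label] += 1.0 - delta
    r = rng.random()
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p
        if r < acc:
            return k, p
    return n_classes - 1, probs[-1]
```

Setting δ too low starves the learner of feedback about non-predicted classes, while setting it too high wastes rounds on random guesses, matching the U-shaped curves in Figure 1.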

[Figure 1 here. (a)-(e): error rate of Banditron, BanditBoost + Perceptron, and BanditBoost + Naive Bayes under different values of the exploration rate δ on CAR, DNA, NURSERY, CONNECT4, and REUTERS4 (BanditBoost + Naive Bayes omitted for REUTERS4). (f): learning curve of REUTERS4 using the best δ, comparing Banditron, C-OVA + Perceptron, and BanditBoost + Perceptron.]

Table 2. Average (over 10 trials) error rate (%) and standard deviation comparison.

| Data set | Banditron  | C-OVA (Perceptron) | BanditBoost (Perceptron) | C-OVA (Naive Bayes) | BanditBoost (Naive Bayes) |
|----------|------------|--------------------|--------------------------|---------------------|---------------------------|
| CAR      | 29.4 ± 0.9 | 22.8 ± 1.1         | 26.9 ± 2.4               | 30.0 ± 0.1          | 25.1 ± 1.8                |
| DNA      | 26.8 ± 9.0 | 13.5 ± 0.5         | 18.6 ± 0.6               | 42.9 ± 8.3          | 25.1 ± 2.6                |
| NURSERY  | 28.8 ± 1.4 | 17.9 ± 3.0         | 16.0 ± 1.1               | 59.3 ± 8.2          | 28.9 ± 2.1                |
| CONNECT4 | 39.8 ± 0.4 | 28.1 ± 0.2         | 30.8 ± 0.2               | 34.9 ± 2.4          | 34.2 ± 0.4                |
| REUTERS4 | 16.7 ± 0.5 | 8.5 ± 4.2          | 6.3 ± 0.1                | N/A                 | N/A                       |

Let us first focus on the case when the Perceptron is used as the weak learner. Here, the categorical features are transformed into numerical ones by decomposing them into binary vectors. We can see that the proposed bandit boosting algorithm consistently outperforms Banditron on all the data sets, and is also comparable to C-OVA, especially on the larger data sets. To take a closer look at the performance of these algorithms, we plot the learning curve for the largest data set (REUTERS4) in Figure 1(f). One can see that our algorithm begins to outperform the other algorithms once the number of examples is sufficiently large. This is due to the more complex model we use and the need for random exploration, as opposed to the deterministic C-OVA algorithm.

Note that this is in accordance with our analysis in Theorem 1, as the error bound only holds when the number of rounds T is large.

Next, let us examine the situation when the weak learner is switched to Naive Bayes. Note that here we did not test on the REUTERS4 data set due to the slow inference of Naive Bayes on high-dimensional data. Our algorithm consistently performs the best on all the remaining data sets. Moreover, we see a large gap between C-OVA and our algorithm, especially on the DNA and NURSERY data sets. This superiority echoes the earlier conjecture that C-OVA may not work well with non-margin-based base learners. The proposed bandit boosting algorithm, on the other hand, enjoys a stronger theoretical guarantee and works well with various types of weak learners.

### 6. Conclusion

We propose a boosting algorithm that efficiently generates strong multiclass bandit learners by exploiting the abundance of existing online binary learners. The proposed algorithm can be viewed as a careful combination of the online boosting algorithm for binary classification (Chen et al., 2012) and some key estimation techniques from bandit algorithms. While the proposed algorithm is simple, we provide non-trivial theoretical analysis that leads to a sound theoretical guarantee. To the best of our knowledge, the proposed boosting algorithm is the first that comes with such a guarantee. In addition, experimental results on real-world data sets show that the proposed bandit boosting algorithm can be easily coupled with different weak binary learners to reach promising performance.

### References

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Chen, G., Chen, G., Zhang, J., Chen, S., and Zhang, C. Beyond Banditron: A conservative and efficient reduction for online multiclass prediction with bandit setting model. In Proceedings of ICDM, pp. 71–80, 2009.

Chen, S.-T., Lin, H.-T., and Lu, C.-J. An online boosting algorithm with theoretical justifications. In Proceedings of ICML, pp. 1007–1014, 2012.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

Duda, R. O. and Hart, P. E. Pattern Classification and Scene Analysis. Wiley, 1973.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of SODA, pp. 385–394, 2005.

Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Hazan, E. and Kale, S. Newtron: an efficient bandit algorithm for online multiclass prediction. In Proceedings of NIPS, pp. 891–899, 2011.

Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Efficient bandit algorithms for online multiclass prediction. In Proceedings of ICML, pp. 440–447, 2008.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of WWW, pp. 661–670, 2010.

Michie, D., Spiegelhalter, D. J., and Taylor, C. C. Machine Learning, Neural and Statistical Classification. 1994.

Oza, N. C. and Russell, S. Online bagging and boosting. In Proceedings of AISTATS, pp. 105–112, 2001.

Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, 1962.

Schapire, R. E. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.

Servedio, R. A. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:473–489, 2003.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of ICML, pp. 928–936, 2003.