
Boosting with Online Binary Learners for the Multiclass Bandit Problem


Shang-Tse Chen SCHEN351@GATECH.EDU

School of Computer Science, Georgia Institute of Technology, Atlanta, GA

Hsuan-Tien Lin HTLIN@CSIE.NTU.EDU.TW

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

Chi-Jen Lu CJLU@IIS.SINICA.EDU.TW

Institute of Information Science, Academia Sinica, Taipei, Taiwan

Abstract

We consider the problem of online multiclass prediction in the bandit setting. Compared with the full-information setting, in which the learner receives the true label as feedback after making each prediction, the bandit setting assumes that the learner can only know the correctness of the predicted label. Because the bandit setting is more restricted, it is difficult to design good bandit learners, and currently there are not many of them. In this paper, we propose an approach that systematically converts existing online binary classifiers to promising bandit learners with a strong theoretical guarantee. The approach matches the idea of boosting, which has been shown to be powerful for batch learning as well as online learning. In particular, we establish the weak-learning condition on the online binary classifier, and show that the condition allows automatically constructing a bandit learner with arbitrary strength by combining several of those classifiers. Experimental results on several real-world data sets demonstrate the effectiveness of the proposed approach.

1. Introduction

Recently, machine learning problems that involve partial feedback have received an increasing amount of attention (Auer et al., 2002; Flaxman et al., 2005). These problems occur naturally in many modern applications, such as online advertising and recommender systems (Li et al., 2010).


For instance, in a recommender system, the partial feedback represents whether the user likes the content recommended by the system, whereas the user's preferences for the other contents that have not been displayed remain unknown. In this paper, we consider one particular learning problem related to partial feedback: the online multiclass prediction problem in the bandit setting (Kakade et al., 2008). In this problem, the learner iteratively interacts with the environment. In each iteration, the learner observes an instance and is asked to predict its label. The main difference between the traditional full-information setting and the bandit setting is the feedback received after each prediction. In the full-information setting, the true label of the instance is revealed, whereas in the bandit setting, only whether the prediction is correct is known. That is, in the bandit setting, the true label remains unknown if the prediction is incorrect. Our goal is to make as few errors as possible in the harsh environment of the bandit setting.

With more restricted information available, it becomes harder to design good learning algorithms in the bandit setting, except for the case of online binary classification, in which the bandit setting and the full-information setting coincide. Thus, it is desirable to find a systematic way to transform the existing online binary classification algorithms, or combine many of them, to get an algorithm that effectively deals with the bandit setting. The motivation calls for boosting (Schapire, 1990), one of the most popular and well-developed ensemble methods in the traditional batch supervised classification framework.

While most studies on boosting focus on the batch setting (Freund & Schapire, 1997; Schapire & Singer, 1999), some works have extended the success of boosting to the online setting (Oza & Russell, 2001; Chen et al., 2012). However, to the best of our knowledge, there is no boosting algorithm yet for the problem of online multiclass prediction in the bandit setting. In this paper, we study the possibility of extending the promising theoretical and empirical results of boosting to the bandit setting.

As in the design and analysis of boosting algorithms in other settings, we need an appropriate assumption on weak learners in order for boosting to work. A stronger assumption makes the design of a boosting algorithm easier, but at the expense of more restricted applicability. To weaken the assumption, we consider binary full-information weak learners, instead of multiclass bandit ones, with the given binary examples constructed through the one-versus-rest decomposition of the multiclass examples (Chen et al., 2009). Following (Chen et al., 2012), we propose a similar assumption which requires such binary weak learners to perform better than random guessing with respect to "smooth" weight distributions over the binary examples. Then we prove that boosting is possible under this assumption by designing a strong bandit algorithm using such binary weak learners. Our bandit algorithm is extended from the full-information one of (Chen et al., 2012), which provides a method to generate such smooth example weights for updating weak learners, as well as appropriate voting weights for combining the predictions of weak learners.

Nevertheless, our extension in this paper is non-trivial. To compute these weights exactly as in (Chen et al., 2012), one needs the full-information feedback, which is not available in our bandit setting. With the limited information of bandit feedback, we show how to find good estimators for the example weights as well as for the voting weights, and we prove that they can in fact be used to replace the true weights to make boosting work in the bandit setting.

Our proposed bandit boosting algorithm enjoys nice theoretical properties similar to those of its batch counterpart. In particular, the proposed algorithm can achieve a small error rate if the performance of each weak learner is better than that of random guessing with respect to the carefully generated weight distributions. In addition, the algorithm reaches promising empirical performance on real-world data sets, even when using very simple full-information weak learners.

Finally, let us stress the difference between our work and existing ones on the bandit problem. Unlike existing works, our goal is not to construct one specific bandit algorithm and analyze its regret. Instead, our goal is to study the possibility of a general paradigm for designing bandit algorithms in a systematic way. Note that there are currently only a very small number of bandit algorithms for the multiclass prediction problem, and most seem to be based on linear models (Kakade et al., 2008; Hazan & Kale, 2011). With the limited power of such linear models, a high error rate is unavoidable in general, so the focus of these works was to reduce the regret, regardless of whether the actual error rate is high. Our result, on the other hand, works for a broader class of classifiers beyond linear ones. We show how to construct a strong bandit algorithm with an error rate close to zero, when we have weak learners that can perform slightly better than random guessing. Here we allow any weak learners, not just linear ones, and they only need to work in the simpler full-information setting rather than in the more challenging bandit setting. Constructing such weak learners may look much less daunting, but we show that they in fact suffice for constructing strong bandit algorithms. We hope that this could open more possibilities for designing better bandit algorithms in the future.

2. Boosting in different settings

Before formally describing our boosting framework in the online bandit setting, let us first review the traditional batch setting as well as the online full-information setting.

In the batch setting, the boosting algorithm has the whole training set $S = \{(x_1, y_1), \ldots, (x_T, y_T)\}$ available at the beginning, where each $x_t$ is the feature vector from some space $\mathcal{X} \subseteq \mathbb{R}^d$ and $y_t$ is its label from some space $\mathcal{Y}$.

For the case of binary classification, we assume $\mathcal{Y} = \{-1, +1\}$, and the boosting algorithm repeatedly calls the batch weak learner for a number of rounds as follows. In round $i$, it feeds $S$ as well as a probability distribution $p^{(i)}$ over $S$ to the weak learner, which then returns a weak hypothesis $h^{(i)}$ after seeing the whole $S$ and $p^{(i)}$. It stops at some round $N$ when the strong hypothesis $H(x) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha^{(i)} h^{(i)}(x)\right)$, with $\alpha^{(i)} \in \mathbb{R}$ being the voting weight of $h^{(i)}$, achieves a small error rate over $S$, defined as $|\{t : H(x_t) \neq y_t\}|/T$.

For the case of multiclass classification, we assume $\mathcal{Y} = \{1, \ldots, K\}$, and for simplicity we adopt the one-versus-rest approach to reduce the multiclass problem to a binary one. More precisely, each multiclass example $(x_t, y_t)$ is decomposed into $K$ binary examples $((x_t, k), y_{tk})$, $k = 1, \ldots, K$, where $y_{tk}$ is $1$ if $y_t = k$ and $-1$ otherwise. One can then apply the boosting algorithm to such binary examples and use $H(x) = \arg\max_k \sum_i \alpha_k^{(i)} h^{(i)}(x, k)$ as the strong hypothesis for the original multiclass problem.
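To make the reduction concrete, here is a minimal Python sketch of the decomposition (the helper name `ovr_decompose` is ours, purely for illustration; labels are 0-indexed for convenience):

```python
def ovr_decompose(x, y, K):
    """One-versus-rest: turn a multiclass example (x, y) with y in
    {0, ..., K-1} into K binary examples ((x, k), y_k), where
    y_k = +1 if y == k and -1 otherwise."""
    return [((x, k), 1 if y == k else -1) for k in range(K)]
```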

In the online full-information setting, the examples of $S$ are usually considered as chosen adversarially, and they arrive one at a time. The online boosting algorithm must decide on some number $N$ of online weak learners to start with. At step $t$, the boosting algorithm receives $x_t$ and predicts $H(x_t) = \arg\max_k \sum_i \alpha_{tk}^{(i)} h_t^{(i)}(x_t, k)$, where $h_t^{(i)}$ is the weak hypothesis provided by the $i$-th weak learner and $\alpha_{tk}^{(i)}$ is its voting weight. After the prediction, the true label $y_t$ is revealed, and to update each weak learner, we would like to feed it with a probability measure on each binary example, as in batch boosting. However, in the online setting it is hard to determine a good measure of an example without seeing the remaining examples, so we instead only generate a weight $w_{tk}^{(i)}$ for $((x_t, k), y_{tk})$, which after normalization corresponds to the measure $p_{tk}^{(i)}$, for the $i$-th weak learner. The goal is again to achieve a small error rate over $S$, given that each weak learner has some positive advantage, defined as $\sum_{t,k} p_{tk}^{(i)} y_{tk} h_t^{(i)}(x_t, k)$. Chen et al. (2012) proposed an online boosting algorithm that achieves this goal in the binary case, which can be easily adapted to the multiclass case here.

In this paper, we consider the online multiclass prediction problem in the bandit setting. The setting is similar to the full-information one, except that at step $t$ the boosting algorithm only receives the bandit information of whether its prediction is correct or not. The goal is essentially the same: to achieve a small error rate, given that each weak learner has some positive advantage.

Several issues arise in designing such a bandit boosting algorithm. The standard approach in designing a bandit algorithm is to use a full-information algorithm as a black box, with its needed information replaced by some estimated one. Usually, the only information needed by a full-information algorithm is the gradient of the loss function at each step, and this information is used only once, for updating its next strategy or action. As a result, the performance (regret) of such a bandit algorithm can be easily analyzed based on that of the full-information one, as it is usually expressed as a simple function of the gradients.

For our boosting problem, we would also like to follow this approach, and the only available full-information boosting algorithm with a theoretical guarantee is that of (Chen et al., 2012). However, it is not obvious what to estimate now, since that algorithm involves three online processes which all need the information $y_t$, but for different purposes. First, the boosting algorithm needs $y_t$ to compute the example weights $w_{tk}^{(i)}$'s. Second, the boosting algorithm needs $y_t$ to compute the voting weights $\alpha_{tk}^{(i)}$'s. Third, the weak learners also need $y_t$, in addition to the $w_{tk}^{(i)}$'s, to update their next hypotheses. Can one single bit of bandit information about $y_t$ be used to get good estimators for all three processes?

Furthermore, as $y_t$ is used in several places and in a more involved way, the bandit algorithm may not be able to use the full-information one as a simple black box, and its performance (error rate) may not be easily based on that of the full-information one. Finally, it is not clear what appropriate assumption one should make on the weak learners in order for boosting to work in the bandit setting. In fact, it is not even clear what type of weak learners one should use.

Perhaps the most natural choice is to use multiclass bandit algorithms. That is, starting from weak multiclass bandit algorithms, we "boost" them into strong multiclass bandit ones. Surprisingly, we will show that it suffices to use binary full-information algorithms with a positive advantage as weak learners. This not only gives us a stronger result in theory, as a weaker assumption on weak learners is needed, but also provides us more possibilities for designing weak learners (and thus strong bandit algorithms) in practice, as most existing multiclass bandit algorithms are linear ones.

We will use the following notation and conventions. For a positive integer $n$, we let $[n]$ denote the set $\{1, \ldots, n\}$. For a condition $\pi$, we use the notation $1[\pi]$, which gives the value 1 if $\pi$ holds and 0 otherwise. For simplicity, we assume that each $x_t$ has length $\|x_t\|_2 \le 1$ and that each hypothesis $h_t$ comes from some family $\mathcal{H}$ with $h_t(x_t, k) \in [-1, 1]$.

3. Online weak learners

In this section, we study reasonable assumptions on weak learners for allowing boosting to work in the bandit setting.

As mentioned in the previous section, instead of using multiclass bandit algorithms as weak learners, we will use binary full-information ones. A natural assumption to make is for such a binary full-information algorithm to achieve a positive advantage with respect to any example weights.

However, as noted in (Chen et al., 2012), this assumption is too strong, as one cannot expect an online algorithm to achieve a positive advantage in extreme cases, such as when only the first example has a nonzero weight. Thus, some constraints must be put on the example weights.

To identify an appropriate constraint, let us follow (Chen et al., 2012) and consider the case in which each hypothesis $h_t$ consists of $K$ linear functions with $h_t(x, k) = \langle h_{tk}, x \rangle$, the inner product of the two vectors $h_{tk}$ and $x$, with $\|h_{tk}\|_2 \le 1$. When given an example $(x_t, k)$, the weak learner uses $h_{tk}$ to predict the binary label $y_{tk}$. After that, it receives $y_{tk}$ as well as the example weight $w_{tk}$, and uses them to update $h_{tk}$ into a new $h_{(t+1)k}$. We can reduce the task of such a weak learner to the well-known online linear optimization problem, by using the reward function $r_{tk}(h_{tk}) = w_{tk} y_{tk} \langle h_{tk}, x_t \rangle$, which is linear in $h_{tk}$. Then we can apply the online gradient descent algorithm of (Zinkevich, 2003) to generate $h_{tk}$ at step $t$, and a standard regret analysis shows that for some constant $c > 0$,

$$\sum_t w_{tk} y_{tk} \langle h_{tk}, x_t \rangle \ \ge\ \sum_t w_{tk} y_{tk} \langle h_k, x_t \rangle - \sqrt{c \sum_t w_{tk}^2}$$

for any $h_k$ with $\|h_k\|_2 \le 1$. Summing over $k \in [K]$ and using the Cauchy-Schwarz inequality, we get

$$\sum_{t,k} w_{tk} y_{tk} \langle h_{tk}, x_t \rangle \ \ge\ \sum_{t,k} w_{tk} y_{tk} \langle h_k, x_t \rangle - \sqrt{cK \sum_{t,k} w_{tk}^2}.$$

Let $|w|$ denote the total weight $\sum_{t,k} w_{tk}$, so that $p_{tk} = w_{tk}/|w|$ is the measure of example $(x_t, k)$. Then by dividing both sides of the inequality above by $|w|$, we obtain

$$\sum_{t,k} p_{tk} y_{tk} \langle h_{tk}, x_t \rangle \ \ge\ \sum_{t,k} p_{tk} y_{tk} \langle h_k, x_t \rangle - \sqrt{cK \sum_{t,k} \frac{w_{tk}^2}{|w|^2}}.$$

Note that $\sum_{t,k} p_{tk} y_{tk} \langle h_k, x_t \rangle$ is the advantage of the offline learner, and suppose that it is at least $3\gamma > 0$. Moreover, suppose the example weights are large, in the sense that they satisfy the following condition:

$$|w| \ \ge\ cKB/\gamma^2, \qquad (1)$$

where $B \ge \max_{t,k} w_{tk}$ is a constant that will be fixed later. Then the advantage of the online weak learner becomes

$$\sum_{t,k} p_{tk} y_{tk} \langle h_{tk}, x_t \rangle \ \ge\ 3\gamma - \sqrt{\frac{cKB|w|}{|w|^2}} \ \ge\ 3\gamma - \gamma \ =\ 2\gamma.$$
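As a concrete illustration of this reduction, the following is a minimal sketch of such a linear weak learner, assuming projected online gradient ascent on the linear reward; the class name and the fixed step size are our illustrative choices, not prescribed by the paper:

```python
import numpy as np

class OGDLinearWeakLearner:
    """Keeps one linear predictor h_k per class and updates it by online
    gradient ascent on the reward r_tk(h) = w_tk * y_tk * <h, x_t>,
    projecting back onto the unit L2 ball so that ||h_k||_2 <= 1
    (Zinkevich, 2003)."""

    def __init__(self, K, d, eta=0.1):
        self.h = np.zeros((K, d))   # one weight vector per class k
        self.eta = eta              # illustrative fixed step size

    def predict(self, x, k):
        # with ||h_k||_2 <= 1 and ||x||_2 <= 1 the output is in [-1, 1]
        return float(self.h[k] @ x)

    def update(self, x, k, y_tk, w_tk):
        # gradient of the linear reward with respect to h_k
        self.h[k] += self.eta * w_tk * y_tk * x
        norm = np.linalg.norm(self.h[k])
        if norm > 1.0:              # projection onto the unit ball
            self.h[k] /= norm
```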

This motivates us to propose the following assumption on weak learners, which need not be linear ones.

Assumption 1. There is an online full-information weak learner which can achieve an advantage $2\gamma > 0$ for any sequence of examples and weights satisfying condition (1).

From the discussion above, we have the following.

Lemma 1. Suppose that for any sequence of examples and weights satisfying condition (1), there exists an offline linear hypothesis with an advantage $3\gamma > 0$. Then Assumption 1 holds.

Let us make two remarks on Assumption 1. First, the assumption that a weak learner has a positive advantage is just the assumption that it predicts better than random guessing, which is the standard assumption used by (almost) all previous batch boosting algorithms. Second, the condition (1) on example weights actually makes our assumption weaker, which in turn makes the boosting task harder and our boosting result in the next section stronger.

More precisely, we only require the weak learner to perform well (having a positive advantage) when the weights are large, and we do not care how badly it may perform with small weights. In fact, we will make our boosting algorithm call the weak learner with large weights.

4. Our bandit boosting algorithm

In this section we show how to design a bandit boosting algorithm under Assumption 1. Let WL be such an online full-information weak learner; we will run $N$ copies of WL, for some $N$ to be determined later. We follow the approach of reducing the multiclass problem to the binary one as described in Section 2, and we base our bandit boosting algorithm on the full-information one of (Chen et al., 2012) that works for binary classification. More precisely, at step $t$ we do the following after receiving the feature vector $x_t$. For each class $k \in [K]$, a new feature vector $(x_t, k)$ is created, we obtain a binary weak hypothesis $h_{tk}^{(i)}(x) = h_t^{(i)}(x, k)$ from the $i$-th weak learner, for $i \in [N]$, and we form the strong hypothesis

$$H_t(x) = \arg\max_{k \in [K]} f_{tk}(x), \quad \text{with } f_{tk}(x) = \sum_{i=1}^{N} \alpha_{tk}^{(i)} h_{tk}^{(i)}(x),$$

where $\alpha_{tk}^{(i)}$ is some voting weight for the $i$-th weak learner. Then we make our prediction $\hat{y}_t$ based on $H_t(x_t)$ in some way and receive the feedback $1[\hat{y}_t = y_t]$. Using the feedback, we prepare some example weight $w_{tk}^{(i)}$ to update the $i$-th weak learner, as well as to compute the next voting weight $\alpha_{(t+1)k}^{(i)}$, for $i \in [N]$. It remains to show how to set the example weights and the voting weights, as well as how to choose $\hat{y}_t$, which we describe and analyze in detail next. The complete algorithm is given in Algorithm 1.

Algorithm 1 Bandit boosting algorithm with online weak learner WL

Input: streaming examples $(x_1, y_1), \ldots, (x_T, y_T)$.
Parameters: $0 < \delta < 1$, $0 \le \theta < \gamma < 1/2$.
Choose $\alpha_{1k}^{(i)} = \frac{1}{N}$ and random $h_{1k}^{(i)}$ for $k \in [K]$, $i \in [N]$.
for $t = 1$ to $T$ do
    Let $H_t(x) = \arg\max_{k \in [K]} \sum_{i \in [N]} \alpha_{tk}^{(i)} h_{tk}^{(i)}(x)$.
    Let $p_t(k) = (1 - \delta)\, 1[k = H_t(x_t)] + \frac{\delta}{K}$ for $k \in [K]$.
    Predict $\hat{y}_t$ according to the distribution $p_t$. Receive the information $1[\hat{y}_t = y_t]$.
    for $k = 1$ to $K$ and $i = 1$ to $N$ do
        Update $w_{tk}^{(i)}$ according to (4).
        If $\hat{y}_t = k$, call WL$(h_{tk}^{(i)}, (x_t, k), y_{tk}, w_{tk}^{(i)})$ to obtain $h_{(t+1)k}^{(i)}$; otherwise, let $h_{(t+1)k}^{(i)} = h_{tk}^{(i)}$.
        Update $\alpha_{(t+1)k}^{(i)}$ according to (6).
    end for
end for
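To make the control flow concrete, here is a minimal Python skeleton of Algorithm 1; `compute_weights` and `update_alpha` stand for the updates (4) and (6) defined below, the weak learners expose the `predict`/`update` interface sketched in Section 3, and all names are ours rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def bandit_boost(stream, K, N, delta, learners, compute_weights, update_alpha):
    """One pass of Algorithm 1 over a stream of (x, y) pairs; `learners`
    holds the N copies of the online binary weak learner WL."""
    alpha = np.full((N, K), 1.0 / N)        # alpha_{1k}^{(i)} = 1/N
    errors = 0
    for x, y in stream:
        # strong hypothesis H_t(x) = argmax_k sum_i alpha[i,k] h_{tk}^{(i)}(x)
        scores = [sum(alpha[i, k] * learners[i].predict(x, k) for i in range(N))
                  for k in range(K)]
        H = int(np.argmax(scores))
        # exploration: p_t(k) = (1 - delta) 1[k = H_t(x_t)] + delta / K
        p = np.full(K, delta / K)
        p[H] += 1.0 - delta
        y_hat = int(rng.choice(K, p=p))
        correct = (y_hat == y)              # the only feedback we receive
        errors += int(not correct)
        for k in range(K):
            w = compute_weights(k, x, y_hat, p, correct, learners)   # eq. (4)
            # update voting weights with the step-t hypotheses, before WL moves on
            alpha[:, k] = update_alpha(alpha[:, k], k, x, y_hat, p,
                                       correct, learners)            # eq. (6)
            if y_hat == k:
                # y_tk is known exactly when y_hat = k: +1 iff correct
                y_tk = 1 if correct else -1
                for i in range(N):
                    learners[i].update(x, k, y_tk, w[i])
    return errors
```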

The example weight of $(x_t, k)$ for the $i$-th weak learner used by the full-information algorithm of (Chen et al., 2012) is

$$\bar{w}_{tk}^{(i)} = \min\left\{ (1 - \gamma)^{z_{tk}^{(i-1)}/2},\ 1 \right\}, \qquad (2)$$

where $z_{tk}^{(0)} = 0$ and $z_{tk}^{(i-1)}$, for $i - 1 \ge 1$, is defined as

$$z_{tk}^{(i-1)} = \sum_{j=1}^{i-1} \left( y_{tk} h_{tk}^{(j)}(x_t) - \theta \right), \quad \text{with } \theta = \gamma/(2 + \gamma), \qquad (3)$$

which depends on the information $y_{tk}$. As we are in the bandit setting, we do not have $y_{tk}$ to compute such weights in general. Thus, we balance exploitation with exploration by independently predicting $H_t(x_t)$ with probability $1 - \delta$ and a random label with probability $\delta$, with $\delta \le \gamma$; let $\hat{y}_t$ denote our prediction. For $k \in [K]$, let $p_t(k)$ denote the probability that $\hat{y}_t = k$. Then we replace the example weight $\bar{w}_{tk}^{(i)}$ by the estimator

$$w_{tk}^{(i)} = \begin{cases} \bar{w}_{tk}^{(i)}/p_t(k) & \text{if } \hat{y}_t = k, \\ 0 & \text{otherwise}, \end{cases} \qquad (4)$$

which we can compute, because when $\hat{y}_t = k$, we do have $y_{tk}$ to compute $\bar{w}_{tk}^{(i)}$. As $p_t(k) \ge \delta/K$, we can choose $B = K/\delta$ and have $w_{tk}^{(i)} \le B$ for any $t, k, i$. Note that $w_{tk}^{(i)}$ and $\bar{w}_{tk}^{(i)}$ are random variables, and the following shows that the $w_{tk}^{(i)}$'s are in fact good estimators for the $\bar{w}_{tk}^{(i)}$'s.
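Before stating this formally, here is how the estimator (4) can be computed within the skeleton shown after Algorithm 1 (a sketch; the `predict` interface and the default `gamma` are our assumptions):

```python
import numpy as np

def compute_weights(k, x, y_hat, p, correct, learners, gamma=0.1):
    """Estimated example weights w_{tk}^{(i)} of (4) for class k, for all i.
    When y_hat != k the estimator is identically zero; otherwise y_tk is
    known (+1 iff the bandit feedback says the prediction was correct),
    so the full-information weights (2)-(3) can be evaluated and divided
    by p_t(k)."""
    N = len(learners)
    w = np.zeros(N)
    if y_hat != k:
        return w
    y_tk = 1 if correct else -1
    theta = gamma / (2.0 + gamma)
    z = 0.0                                            # z_{tk}^{(0)} = 0
    for i in range(N):
        w_bar = min((1.0 - gamma) ** (z / 2.0), 1.0)   # eq. (2)
        w[i] = w_bar / p[k]                            # eq. (4)
        z += y_tk * learners[i].predict(x, k) - theta  # builds z_{tk}^{(i)} per (3)
    return w
```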

Claim 1. For any $t, k, i$, $\mathbb{E}[\bar{w}_{tk}^{(i)}] = \mathbb{E}[w_{tk}^{(i)}]$. For any $k, i, \lambda$,

$$\Pr\left[ \left| \sum_t \left( \bar{w}_{tk}^{(i)} - w_{tk}^{(i)} \right) \right| > \lambda T \right] \ \le\ 2^{-\Omega(\lambda^2 T / B^2)}.$$

Proof. Observe that any fixing of the randomness up to step $t - 1$ leaves $\bar{w}_{tk}^{(i)}$ and $w_{tk}^{(i)}$ with the same conditional expectation. Thus, $\mathbb{E}[\bar{w}_{tk}^{(i)}] = \mathbb{E}[w_{tk}^{(i)}]$. Moreover, as the random variables $M_t = \bar{w}_{tk}^{(i)} - w_{tk}^{(i)}$, for $t \in [T]$, form a martingale difference sequence with $|M_t| \le B$, the probability bound follows from Azuma's inequality.

This claim allows us to use $w_{tk}^{(i)}$ as the example weight of $(x_t, k)$ to update the $i$-th weak learner. However, as each weak learner is assumed to be a full-information one, it also needs the label $y_{tk}$ to update, which we may not know. One may try to take a similar approach as before and feed the weak learner with an estimator which is $y_{tk}/p_t(k)$ when $\hat{y}_t = k$ and $0$ otherwise, but this does not work, as it does not take a value in $\{-1, 1\}$ as needed by the binary weak learner. Instead, we take a different approach: we only call the weak learner to update when $\hat{y}_t = k$, so that we know $y_{tk}$. That is, when $\hat{y}_t = k$, we call the $i$-th weak learner with $w_{tk}^{(i)}$ and $y_{tk}$, which can then update and return the next weak hypothesis $h_{(t+1)k}^{(i)}$; otherwise, we do not call the $i$-th weak learner to update and we simply let the next hypothesis $h_{(t+1)k}^{(i)}$ be the current $h_{tk}^{(i)}$. Another issue is that a weak learner is only assumed to work well when given large example weights satisfying condition (1), and even then, it only works well on those examples which are given to it to update. This is dealt with by the following.

Lemma 2. Let $\delta \in [0, 1]$, let $m$ be the largest number such that $\sum_{t,k} w_{tk}^{(i)} \ge \delta K T$ for every $i \le m$, and let

$$\bar{f}_{tk}(x) = \sum_{i=1}^{m} \frac{1}{m} h_{tk}^{(i)}(x).$$

Then when $T \ge c_0 (K^2/\delta^4) \log(K/\delta)$ for a large enough constant $c_0$,

$$\Pr\left[ \left| \{(t, k) : y_{tk} \bar{f}_{tk}(x_t) < \theta\} \right| > 2\delta K T \right] \ \le\ \delta,$$

for the parameter $\theta = \gamma/(2 + \gamma)$ introduced in (3).

Proof. Note that according to the definition, for any $t$ and $k$, $\bar{w}_{tk}^{(m+1)} \ge 0$, and $\bar{w}_{tk}^{(m+1)} = 1$ if $y_{tk} \bar{f}_{tk}(x_t) < \theta$, as $z_{tk}^{(m)} = m(y_{tk} \bar{f}_{tk}(x_t) - \theta)$. This implies that

$$\left| \{(t, k) : y_{tk} \bar{f}_{tk}(x_t) < \theta\} \right| \ \le\ \sum_{t,k} \bar{w}_{tk}^{(m+1)}.$$

As $\sum_{t,k} w_{tk}^{(m+1)} < \delta K T$ by the definition of $m$, we have

$$\Pr\left[ \left| \{(t, k) : y_{tk} \bar{f}_{tk}(x_t) < \theta\} \right| > 2\delta K T \right] \ \le\ \Pr\left[ \sum_{t,k} \bar{w}_{tk}^{(m+1)} - \sum_{t,k} w_{tk}^{(m+1)} > \delta K T \right],$$

which by a union bound and Claim 1 is at most $K \cdot 2^{-\Omega(\delta^2 T / B^2)} \le \delta$.

The following lemma gives an upper bound on the parameter $m$ defined in Lemma 2.

Lemma 3. Suppose Assumption 1 holds and $T \ge cK/(\delta^2 \gamma^2)$ for the constant $c$ in condition (1). Then the parameter $m$ in Lemma 2 is at most $O(K/(\delta^2 \gamma^2))$.

Proof. Note that for any $i \in [m]$, $\sum_{t,k} w_{tk}^{(i)} \ge \delta K T \ge cKB/\gamma^2$, with $B = K/\delta$, and condition (1) is satisfied. Thus from Assumption 1, we have

$$\sum_{i,t,k} w_{tk}^{(i)} y_{tk} h_{tk}^{(i)}(x_t) \ \ge\ 2\gamma \sum_{i,t,k} w_{tk}^{(i)}, \qquad (5)$$

with the sums over $i$ above, as well as in the rest of the proof, being taken over $i \in [m]$. On the other hand, we have the following claim.

Claim 2. $\sum_{i,t,k} w_{tk}^{(i)} y_{tk} h_{tk}^{(i)}(x_t) \ \le\ O(BKT/\gamma) + \gamma \sum_{i,t,k} w_{tk}^{(i)}$.

We omit its proof here, as it is very similar to that of Lemma 5 in (Servedio, 2003).¹ Combining the bound in Claim 2 with inequality (5), we have

$$\gamma \sum_{i,t,k} w_{tk}^{(i)} \ \le\ O(BKT/\gamma).$$

Since $\sum_{t,k} w_{tk}^{(i)} \ge \delta K T$ for $i \in [m]$ and $B = K/\delta$, we get $\gamma m \delta K T \le O(K^2 T/(\delta \gamma))$. From this, the required bound on $m$ follows, and we have the lemma.

¹Although that lemma is for the $\bar{w}_{tk}^{(i)}$'s, its proof can be easily modified to work for the $w_{tk}^{(i)}$'s, but with an additional factor of $B$ appearing in the term $O(BKT/\gamma)$ here.

Let us suppose that $T \ge c_0 (K^2/\delta^4) \log(K/\delta)$ for a large enough constant $c_0$, so that both lemmas apply. Then Lemma 2 shows that one can obtain a strong learner by combining the first $m$ weak learners. However, one cannot determine the number $m$ before seeing all the examples, and in fact in our online setting, we need to decide the number $N$ of weak learners even before seeing the first example. Following (Chen et al., 2012), we set $N$ to be the upper bound given by Lemma 3. Then at step $t$, for each $k \in [K]$, we consider the function

$$f_{tk}(x) = \sum_{i=1}^{N} \alpha_{tk}^{(i)} h_{tk}^{(i)}(x)$$

and reduce the task of finding such $\alpha_{tk} = (\alpha_{tk}^{(1)}, \ldots, \alpha_{tk}^{(N)})$ to the online convex programming problem. More precisely, we use the $N$-dimensional probability simplex, denoted by $P_N$, as the feasible set, and define the loss function as

$$L_{tk}(\alpha) = \max\left\{ 0,\ \theta - y_{tk} \sum_{i=1}^{N} \alpha^{(i)} h_{tk}^{(i)}(x_t) \right\},$$

which is a convex function of $\alpha$. However, unlike in (Chen et al., 2012), we are in the bandit setting and thus may not know $y_{tk}$. To overcome this, we use a similar idea as before to estimate a subgradient $\nabla L_{tk}(\alpha_{tk})$ by

$$\ell_{tk} = \begin{cases} \nabla L_{tk}(\alpha_{tk})/p_t(k) & \text{if } \hat{y}_t = k, \\ 0 & \text{otherwise}. \end{cases}$$

One can then use $\ell_{tk} = (\ell_{tk}^{(1)}, \ldots, \ell_{tk}^{(N)})$ to perform gradient descent to update $\alpha_{tk}$ as in (Chen et al., 2012). However, to get a better theoretical bound, here we choose to perform a multiplicative update on $\alpha_{tk}$ to get $\alpha_{(t+1)k} = (\alpha_{(t+1)k}^{(1)}, \ldots, \alpha_{(t+1)k}^{(N)})$ for step $t + 1$, with

$$\alpha_{(t+1)k}^{(i)} = \alpha_{tk}^{(i)} \cdot e^{-\eta \ell_{tk}^{(i)}} / Z_{(t+1)k}, \qquad (6)$$

where $Z_{(t+1)k}$ is the normalization factor and $\eta$ is the learning rate, which we set to $\delta^3/K$.
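Continuing the sketch started after Algorithm 1, the update (6) with the estimated subgradient might be written as follows (a sketch assuming the hinge form of $L_{tk}$ above; the parameter defaults are illustrative):

```python
import numpy as np

def update_alpha(alpha_k, k, x, y_hat, p, correct, learners,
                 delta=0.1, gamma=0.1):
    """Multiplicative update (6) on the probability simplex for the voting
    weights of class k, driven by the importance-weighted subgradient
    estimate ell_tk."""
    N = len(alpha_k)
    K = len(p)
    eta = delta ** 3 / K                      # learning rate eta = delta^3 / K
    ell = np.zeros(N)
    if y_hat == k:                            # otherwise ell_tk = 0
        y_tk = 1 if correct else -1
        theta = gamma / (2.0 + gamma)
        h = np.array([learners[i].predict(x, k) for i in range(N)])
        if theta - y_tk * float(alpha_k @ h) > 0:   # hinge loss is active
            ell = (-y_tk * h) / p[k]          # grad L_tk(alpha_tk) / p_t(k)
    new_alpha = alpha_k * np.exp(-eta * ell)
    return new_alpha / new_alpha.sum()        # divide by Z_{(t+1)k}
```

With the estimators for the example weights (4) and the voting weights (6) in place, we have the following.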

Lemma 4. $\Pr\left[ \sum_{t,k} L_{tk}(\alpha_{tk}) \le O(\delta K T) \right] \ \ge\ 1 - 2\delta$.

Proof. Following the standard analysis, one can show that for any $k \in [K]$ and any $\bar{\alpha}_k \in P_N$,

$$\sum_t \langle \ell_{tk}, \alpha_{tk} - \bar{\alpha}_k \rangle \ \le\ O((\log N)/\eta) + \eta \sum_t \|\ell_{tk}\|^2 \ \le\ O(\delta K T), \qquad (7)$$

since $\|\ell_{tk}\|^2 \le B^2 = K^2/\delta^2$, $\eta = \delta^3/K$, and $N \le O(K/\delta^4)$. Now note that for any $t$ and $k$, $\mathbb{E}[\langle \ell_{tk}, \alpha_{tk} - \bar{\alpha}_k \rangle] = \mathbb{E}[\langle \nabla L_{tk}(\alpha_{tk}), \alpha_{tk} - \bar{\alpha}_k \rangle]$, because given the randomness up to step $t - 1$, $\alpha_{tk}$ is fixed and the conditional expectation of $\ell_{tk}$ equals $\nabla L_{tk}(\alpha_{tk})$. Moreover, as $L_{tk}(\alpha_{tk}) - L_{tk}(\bar{\alpha}_k) \le \langle \nabla L_{tk}(\alpha_{tk}), \alpha_{tk} - \bar{\alpha}_k \rangle$ for convex $L_{tk}$, we have

$$\mathbb{E}\left[ L_{tk}(\alpha_{tk}) - L_{tk}(\bar{\alpha}_k) \right] \ \le\ \mathbb{E}[\langle \ell_{tk}, \alpha_{tk} - \bar{\alpha}_k \rangle].$$

Then using the bound in (7) and applying a similar martingale analysis as before, one can show that for any $\bar{\alpha}_k \in P_N$,

$$\Pr\left[ \sum_{t,k} \left( L_{tk}(\alpha_{tk}) - L_{tk}(\bar{\alpha}_k) \right) \le O(\delta K T) \right] \ \ge\ 1 - \delta.$$

Let $\bar{\alpha}_k = (\bar{\alpha}_k^{(1)}, \ldots, \bar{\alpha}_k^{(N)})$, with $\bar{\alpha}_k^{(i)} = \frac{1}{m}$ for $i \le m$ and $\bar{\alpha}_k^{(i)} = 0$ for $i > m$, so that

$$L_{tk}(\bar{\alpha}_k) = \max\{0,\ \theta - y_{tk} \bar{f}_{tk}(x_t)\} \ \le\ (1 + \theta)\, 1[y_{tk} \bar{f}_{tk}(x_t) < \theta].$$

Then we know from Lemma 2 that

$$\Pr\left[ \sum_{t,k} L_{tk}(\bar{\alpha}_k) \le (1 + \theta)\, 2\delta K T \right] \ \ge\ 1 - \delta.$$

Combining the two probability bounds together, we have the lemma.

Finally, recall that to predict each $y_t$, we independently output $H_t(x_t) = \arg\max_{k \in [K]} f_{tk}(x_t)$ with probability $1 - \delta$ and a random label with probability $\delta$. Thus, by a Chernoff bound, our algorithm makes at most $|\{t : H_t(x_t) \neq y_t\}| + 2\delta T$ errors with probability $1 - 2^{-\Omega(\delta^2 T)} \ge 1 - \delta$. On the other hand, as

$$1[H_t(x_t) \neq y_t] \ \le\ \sum_k 1[y_{tk} f_{tk}(x_t) < 0] \ \le\ \sum_k L_{tk}(\alpha_{tk})/\theta,$$

Lemma 4 implies that

$$\Pr\left[ |\{t : H_t(x_t) \neq y_t\}| \le O(\delta K T/\theta) \right] \ \ge\ 1 - 2\delta.$$

Consequently, for $\theta = \gamma/(2 + \gamma)$, we can conclude that our algorithm makes at most

$$O(K\delta T/\theta) + 2\delta T \ \le\ O(K\delta T/\gamma)$$

errors with probability at least $1 - 3\delta$. Therefore, we have the following, which is the main result of our paper.

Theorem 1. Suppose Assumption 1 holds and $T \ge c_0 (K^2/\delta^4) \log(K/\delta)$ for a large enough constant $c_0$. Then our bandit algorithm uses $O(K/(\delta^2 \gamma^2))$ weak learners and makes $O(K\delta T/\gamma)$ errors with probability $1 - 3\delta$.

Note that the error rate of our algorithm is $O(K\delta/\gamma)$, which can be made at most any $\varepsilon$ by setting $\delta = O(\varepsilon\gamma/K)$, with the requirement on $T$ and the number of weak learners adjusted accordingly. We remark that we did not attempt to optimize our bounds (which we believe can be improved), as our focus was on establishing the possibility of boosting in the bandit setting. Moreover, it does not seem appropriate to compare our error bound with the regret bounds of existing bandit algorithms. This is because existing algorithms are usually based on linear classifiers, which may have large error rates even though their regrets are small. On the other hand, our boosting algorithm works for any type of classifiers and achieves a small error rate as long as we have weak learners which satisfy Assumption 1.

5. Experiments

In this section, we validate the empirical performance of the proposed algorithm on several real-world data sets. We compare with two representative algorithms. The first one is Banditron (Kakade et al., 2008), one of the first proposed algorithms for the bandit setting. It is modified from a multiclass variant of the well-known Perceptron algorithm (Rosenblatt, 1962) using the so-called Kesler's construction (Duda & Hart, 1973). By doing some random exploration, it can accurately construct an estimate of the update step for the full-information multiclass Perceptron. The algorithm has a good theoretical guarantee, especially when the data is linearly separable. It can be viewed as a direct modification of a full-information learner (Perceptron) for the bandit setting, without combining learners for boosting.

The second one is Conservative OVA (C-OVA) (Chen et al., 2009), which uses a one-versus-all multiclass-to-binary decomposition similar to our algorithm. But unlike most of the bandit algorithms, it does not do random exploration at all. Instead, it conservatively updates using whatever it gets from the partial feedback, hence the name. Note that although it embeds an online binary learning algorithm as its "base learner", it does not perform boosting by combining several base learners like our algorithm does. Also, C-OVA performs a margin-based decoding of the binary classification results, and hence may not work well with non-margin-based base learners.

To demonstrate the boosting ability of our proposed algorithm, we choose two completely different types of online binary classifiers as our weak learners. The first one is Perceptron, a standard margin-based linear classifier.

Table 1. The data sets used in our experiments.

Data set   #classes  #features  #examples
Car        4         6          1,728
DNA        3         180        3,186
Nursery    5         8          12,960
Connect4   3         126        67,557
Reuters4   4         346,810    673,768

Note that (Chen et al., 2009) uses a similar but more complex Online Passive-Aggressive Algorithm (PA) (Crammer et al., 2006) as its internal learner. Since we found little difference in performance between the PA algorithm and the Perceptron algorithm on the data sets we tested, we only report the results using the simpler and more famous Perceptron algorithm, to compare fairly with Banditron. The second weak learner we use is Naive Bayes, a simple statistical classifier that estimates the posterior probability of each class using Bayes' theorem and the assumption of conditional independence between features.
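For reference, a weighted online Perceptron exposing the `predict`/`update` interface of our earlier sketches might look as follows; scaling the Perceptron step by the boosting weight is one reasonable way to consume the example weights, not necessarily the authors' exact choice:

```python
import numpy as np

class PerceptronWeakLearner:
    """Mistake-driven Perceptron over the binary one-versus-rest examples
    (x, k), keeping one weight vector per class k."""

    def __init__(self, K, d):
        self.h = np.zeros((K, d))

    def predict(self, x, k):
        # signed score clipped to [-1, 1], matching the assumed hypothesis range
        return float(np.clip(self.h[k] @ x, -1.0, 1.0))

    def update(self, x, k, y_tk, w_tk):
        if y_tk * (self.h[k] @ x) <= 0:     # update only on a mistake
            self.h[k] += w_tk * y_tk * x
```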

5.1. Results

We test our algorithm on five public real-world data sets from various domains and of different sizes: CAR, NURSERY, and CONNECT4 from the UCI machine learning repository (Frank & Asuncion, 2010); DNA from the Statlog project (Michie et al., 1994); and REUTERS4 from the Banditron paper (Kakade et al., 2008). Basic information about these data sets is summarized in Table 1. As described previously, each example is first used for prediction before the disclosure of its label, and the error rate is the number of prediction errors divided by the total number of examples. All the experiments are repeated 10 times with different random orderings of the examples.

For fairness of comparison with Banditron, we do not tune any parameters other than the exploration rate δ. We fix the number of weak learners to 100 and the assumed weak learner advantage γ to 0.1, as in the full-information online boosting algorithm (Chen et al., 2012). For the exploration rate δ, we test a wide range of values to see the effect of random exploration. The results are shown in Figure 1. Note that C-OVA is not included in this figure, since it does not perform random exploration at all and is parameter-free. One can see that for a reasonable range of values of δ (around 0.01 to 0.1), the performance of our algorithm is quite strong and relatively stable, while setting it too high or too low results in worse performance, as expected. Table 2 summarizes the average error rates and standard deviations when the best choices of δ are used in Banditron and in our algorithm.

Let us first focus on the case where Perceptron is used as the weak learner. Here, the categorical features are transformed into numerical ones by decomposition into binary vectors. We can see that the proposed bandit boosting algorithm consistently outperforms Banditron on all the data sets, and is also comparable to C-OVA, especially on larger data sets.

[Figure 1: six panels. Panels (a) CAR, (b) DNA, (c) NURSERY, (d) CONNECT4, and (e) REUTERS4 plot the error rate against the exploration rate δ (log scale, 10^-3 to 10^0) for Banditron, BanditBoost + Perceptron, and BanditBoost + NaiveBayes (panel (e) omits NaiveBayes). Panel (f) plots the learning curve on REUTERS4, error rate against the number of examples, for Banditron, C-OVA + Perceptron, and BanditBoost + Perceptron.]

Figure 1. (a)-(e): error rate using different values of the exploration rate δ. (f): learning curve of REUTERS4 using the best δ.

Table 2. Average (over 10 trials) error rate (%) and standard deviation comparison.

Data set   Banditron    C-OVA (Perceptron)  BanditBoost (Perceptron)  C-OVA (Naive Bayes)  BanditBoost (Naive Bayes)
CAR        29.4 ± 0.9   22.8 ± 1.1          26.9 ± 2.4                30.0 ± 0.1           25.1 ± 1.8
DNA        26.8 ± 9.0   13.5 ± 0.5          18.6 ± 0.6                42.9 ± 8.3           25.1 ± 2.6
NURSERY    28.8 ± 1.4   17.9 ± 3.0          16.0 ± 1.1                59.3 ± 8.2           28.9 ± 2.1
CONNECT4   39.8 ± 0.4   28.1 ± 0.2          30.8 ± 0.2                34.9 ± 2.4           34.2 ± 0.4
REUTERS4   16.7 ± 0.5    8.5 ± 4.2           6.3 ± 0.1                N/A                  N/A

To take a closer look at the performance of these algorithms, we plot the learning curve for the largest data set (REUTERS4) in Figure 1(f). One can see that our algorithm begins to outperform the other algorithms when the number of examples is sufficiently large. This is due to the more complex model we use and the need for random exploration, as opposed to the deterministic C-OVA algorithm. Note that this is in accordance with our analysis in Theorem 1, as the error bound only holds when the number of rounds T is large.

Next, let us examine the situation when the weak learner is switched to Naive Bayes. Note that here we did not test on the REUTERS4 data set, due to the slow inference of Naive Bayes for high-dimensional data. It can be seen that our algorithm consistently achieves the best performance on all the data sets. Moreover, we see a large difference between C-OVA and our algorithm, especially on the DNA and NURSERY data sets. This superiority echoes the earlier conjecture that C-OVA may not work well with non-margin-based base learners. The proposed bandit boosting algorithm, on the other hand, enjoys a stronger theoretical guarantee and works well with various types of weak learners.

6. Conclusion

We propose a boosting algorithm that efficiently generates strong multiclass bandit learners by exploiting the abundance of existing online binary learners. The proposed algorithm can be viewed as a careful combination of the online boosting algorithm for binary classification (Chen et al., 2012) and some key estimation techniques from bandit algorithms. While the proposed algorithm is simple, we show some non-trivial theoretical analysis that leads to a sound theoretical guarantee. To the best of our knowledge, our proposed boosting algorithm is the first one that comes with such a guarantee. In addition, experimental results on real-world data sets show that the proposed bandit boosting algorithm can be easily coupled with different weak binary learners to reach promising performance.


References

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Chen, G., Chen, G., Zhang, J., Chen, S., and Zhang, C. Beyond Banditron: a conservative and efficient reduction for online multiclass prediction with bandit setting model. In Proceedings of ICDM, pp. 71–80, 2009.

Chen, S.-T., Lin, H.-T., and Lu, C.-J. An online boosting algorithm with theoretical justifications. In Proceedings of ICML, pp. 1007–1014, 2012.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

Duda, R. O. and Hart, P. E. Pattern Classification and Scene Analysis. Wiley, 1973.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of SODA, pp. 385–394, 2005.

Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Hazan, E. and Kale, S. Newtron: an efficient bandit algorithm for online multiclass prediction. In Proceedings of NIPS, pp. 891–899, 2011.

Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Efficient bandit algorithms for online multiclass prediction. In Proceedings of ICML, pp. 440–447, 2008.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of WWW, pp. 661–670, 2010.

Michie, D., Spiegelhalter, D. J., and Taylor, C. C. Machine Learning, Neural and Statistical Classification, 1994.

Oza, N. C. and Russell, S. Online bagging and boosting. In Proceedings of AISTATS, pp. 105–112, 2001.

Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, 1962.

Schapire, R. E. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.

Servedio, R. A. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:473–489, 2003.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of ICML, pp. 928–936, 2003.
