Online Learning and the Perceptron Algorithm

5 Machine Learning

5.5 Online Learning and the Perceptron Algorithm

So far we have been considering what is often called the batch learning scenario. You are given a “batch” of data—the training sample S—and your goal is to use it to produce a hypothesis h that will have low error on new data, under the assumption that both S and the new data are sampled from some fixed distribution D. We now switch to the more challenging online learning scenario where we remove the assumption that data is sampled from a fixed probability distribution, or from any probabilistic process at all.

Specifically, the online learning scenario proceeds as follows. At each time t = 1, 2, . . .:

1. The algorithm is presented with an arbitrary example x_t ∈ X and is asked to make a prediction ℓ_t of its label.

2. The algorithm is told the true label of the example c^∗(x_t) and is charged for a mistake if c^∗(x_t) ≠ ℓ_t.

The goal of the learning algorithm is to make as few mistakes as possible in total. For example, consider an email classifier that when a new email message arrives must classify it as “important” or “it can wait”. The user then looks at the email and informs the algorithm if it was incorrect. We might not want to model email messages as independent random objects from a fixed probability distribution, because they often are replies to previous emails and build on each other. Thus, the online learning model would be more appropriate than the batch model for this setting.
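The predict-then-reveal protocol above can be sketched as a short loop. This is only an illustration: the `MajorityLabelLearner` class, the `run_online` function, and the toy stream are hypothetical names and data introduced here, not part of the text.

```python
class MajorityLabelLearner:
    """Toy learner (hypothetical): predicts the label it has seen most often."""

    def __init__(self):
        self.balance = 0  # (# positive labels seen) - (# negative labels seen)

    def predict(self, x):
        return 1 if self.balance >= 0 else -1

    def update(self, x, true_label):
        self.balance += true_label


def run_online(learner, stream):
    """Run the online protocol and return the total number of mistakes."""
    mistakes = 0
    for x, true_label in stream:
        guess = learner.predict(x)      # step 1: predict before seeing the label
        if guess != true_label:         # step 2: the true label is revealed
            mistakes += 1
        learner.update(x, true_label)
    return mistakes


# An arbitrary sequence of (example, label) pairs; no distribution is assumed.
stream = [((1, 0), 1), ((0, 1), -1), ((1, 1), -1), ((0, 0), -1)]
total_mistakes = run_online(MajorityLabelLearner(), stream)
print(total_mistakes)
```

The only quantity measured is the mistake count over the whole sequence, which is what the mistake bounds in this section refer to.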

Intuitively, the online learning model is harder than the batch model because we have removed the requirement that our data consists of independent draws from a fixed probability distribution. Indeed, we will see shortly that any algorithm with good performance in the online model can be converted to an algorithm with good performance in the batch model. Nonetheless, the online model can sometimes be a cleaner model for design and analysis of algorithms.

5.5.1 An Example: Learning Disjunctions

As a simple example, let’s revisit the problem of learning disjunctions in the online model.

We can solve this problem by starting with a hypothesis h = x₁ ∨ x₂ ∨ . . . ∨ x_d and using it for prediction. We will maintain the invariant that every variable in the target disjunction is also in our hypothesis, which is clearly true at the start. This ensures that the only mistakes possible are on examples x for which h(x) is positive but c^∗(x) is negative.

When such a mistake occurs, we simply remove from h any variable set to 1 in x. Because x was negative, none of these variables can be in the target function, so we maintain our invariant. Because h(x) was positive, at least one variable of h is set to 1 in x, so we remove at least one variable from h. Since h starts with d variables, the algorithm makes at most d mistakes total on any series of examples consistent with a disjunction.
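A minimal sketch of this elimination procedure follows, assuming examples are 0/1 tuples of length d with labels in {+1, −1} and variables indexed from 0. The function name `learn_disjunction_online` and the sample data are illustrative only.

```python
def learn_disjunction_online(examples, d):
    """Online disjunction learner: start with all d variables, prune on mistakes."""
    hypothesis = set(range(d))  # indices of variables currently in h
    mistakes = 0
    for x, label in examples:
        prediction = 1 if any(x[i] for i in hypothesis) else -1
        if prediction != label:
            mistakes += 1
            # On consistent data a mistake can only be a false positive,
            # so every variable set to 1 in x is safe to remove.
            hypothesis -= {i for i in range(d) if x[i] == 1}
    return hypothesis, mistakes


# Hypothetical target: x_0 OR x_2 over d = 4 variables.
examples = [
    ((0, 1, 0, 0), -1),
    ((1, 0, 0, 0), 1),
    ((0, 0, 0, 1), -1),
    ((0, 1, 0, 1), -1),
    ((0, 0, 1, 1), 1),
]
hypothesis, mistakes = learn_disjunction_online(examples, 4)
print(sorted(hypothesis), mistakes)
```

On this sequence the learner makes two mistakes (on the first and third examples) and ends with exactly the variables of the assumed target, within the bound of d = 4.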

In fact, we can show this bound is tight by showing that no deterministic algorithm can guarantee to make fewer than d mistakes.

Theorem 5.7 For any deterministic algorithm A there exists a sequence of examples σ and disjunction c^∗ such that A makes at least d mistakes on sequence σ labeled by c^∗.

Proof: Let σ be the sequence e_1, e_2, . . . , e_d where e_j is the example that is zero everywhere except for a 1 in the jth position. Imagine running A on sequence σ and telling A it made a mistake on every example; that is, if A predicts positive on e_j we set c^∗(e_j) = −1 and if A predicts negative on e_j we set c^∗(e_j) = +1. This target corresponds to the disjunction of all x_j such that A predicted negative on e_j, so it is a legal disjunction. Since A is deterministic, the fact that we constructed c^∗ by running A is not a problem: it would make the same mistakes if re-run from scratch on the same sequence and same target. Therefore, A makes d mistakes on this σ and c^∗.

5.5.2 The Halving Algorithm

If we are not concerned with running time, a simple algorithm that guarantees to make at most log₂(|H|) mistakes for a target belonging to any given class H is called the halving algorithm. This algorithm simply maintains the version space V ⊆ H consisting of all h ∈ H consistent with the labels on every example seen so far, and predicts based on majority vote over these functions. Each mistake is guaranteed to reduce the size of the version space V by at least half (hence the name), thus the total number of mistakes is at most log₂(|H|). Note that this can be viewed as the number of bits needed to write a function in H down.
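The halving algorithm can be sketched for a small finite class. In the example below, H is assumed to be the 8 disjunctions over d = 3 Boolean variables (including the empty disjunction, which is always negative); `make_disjunction`, `halving_run`, and the choice of target are hypothetical and serve only as an illustration.

```python
from itertools import combinations, product
from math import log2


def make_disjunction(variables):
    """Return h with h(x) = +1 if any listed coordinate of x is 1, else -1."""
    return lambda x: 1 if any(x[i] for i in variables) else -1


def halving_run(hypotheses, examples):
    """Predict by majority vote over the version space; return (V, mistakes)."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in examples:
        votes = sum(h(x) for h in version_space)
        prediction = 1 if votes >= 0 else -1  # ties broken toward +1
        if prediction != label:
            mistakes += 1  # at least half of V voted wrongly and is removed
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes


d = 3
H = [make_disjunction(s) for k in range(d + 1) for s in combinations(range(d), k)]
target = make_disjunction((1,))  # assumed target: the single variable x_1
examples = [(x, target(x)) for x in product((0, 1), repeat=d)]
version_space, mistakes = halving_run(H, examples)
print(mistakes, "mistakes; bound is", log2(len(H)))
```

Enumerating H explicitly is what makes this approach expensive in general: the version space can be exponentially large even when the mistake bound log₂(|H|) is small.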

5.5.3 The Perceptron Algorithm

The Perceptron algorithm is an efficient algorithm for learning a linear separator in d-dimensional space, with a mistake bound that depends on the margin of separation of the data. Specifically, the assumption is that the target function can be described by a vector w^∗ such that for each positive example x we have x^Tw^∗ ≥ 1 and for each negative example x we have x^Tw^∗ ≤ −1. Note that if we think of the examples x as points in space, then |x^Tw^∗|/|w^∗| is the distance of x to the hyperplane x^Tw^∗ = 0. Thus, we can view our assumption as stating that there exists a linear separator through the origin with all positive examples on one side, all negative examples on the other side, and all examples at distance at least γ = 1/|w^∗| from the separator. This quantity γ is called the margin of separation (see Figure 5.3).

Figure 5.3: Margin of a linear separator.

The guarantee of the Perceptron algorithm will be that the total number of mistakes is at most (R/γ)² where R = max_t|x_t| over all examples x_t seen so far. Thus, if there exists a hyperplane through the origin that correctly separates the positive examples from the negative examples by a large margin relative to the radius of the smallest ball enclosing the data, then the total number of mistakes will be small. The algorithm is very simple and proceeds as follows.

The Perceptron Algorithm: Start with the all-zeroes weight vector w = 0. Then, for t = 1, 2, . . . do:

1. Given example x_t, predict sgn(x^T_tw).

2. If the prediction was a mistake, then update:

(a) If x_t was a positive example, let w ← w + x_t.

(b) If x_t was a negative example, let w ← w − x_t.
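The steps above can be sketched in plain Python. One assumption is made explicit here: a score of exactly 0 is counted as a mistake for either label, which matches the later analysis, where a mistake on a positive example means x_t^T w ≤ 0. The helper `dot`, the function name `perceptron`, and the sample data are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def perceptron(examples, d):
    """Run the Perceptron on (x, label) pairs with label in {+1, -1}."""
    w = [0.0] * d  # start with the all-zeroes weight vector
    mistakes = 0
    for x, label in examples:
        score = dot(x, w)
        # Assumption: a zero score counts as a mistake for either label.
        if label * score <= 0:
            mistakes += 1
            # w <- w + x_t on a positive example, w <- w - x_t on a negative one
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return w, mistakes


data = [
    ((2.0, 1.0), 1),
    ((-1.0, -2.0), -1),
    ((1.0, 3.0), 1),
    ((-3.0, -1.0), -1),
]
w, mistakes = perceptron(data, 2)
print(w, mistakes)
```

On this toy sequence the only mistake is on the first example, where w = 0 gives a score of 0; the resulting w = (2, 1) classifies the remaining three examples correctly.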

While simple, the Perceptron algorithm enjoys a strong guarantee on its total number of mistakes.

Theorem 5.8 On any sequence of examples x₁, x₂, . . ., if there exists a vector w^∗ such that x^T_tw^∗ ≥ 1 for the positive examples and x^T_tw^∗ ≤ −1 for the negative examples (i.e., a linear separator of margin γ = 1/|w^∗|), then the Perceptron algorithm makes at most R²|w^∗|² mistakes, where R = max_t|x_t|.

To get a feel for this bound, notice that if we multiply all entries in all the x_t by 100, we can divide all entries in w^∗ by 100 and it will still satisfy the “if” condition. So the bound is invariant to this kind of scaling, i.e., to what our “units of measurement” are.
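As a quick check of this invariance, suppose every example is rescaled by a factor c > 0 (c = 100 in the text) and w^∗ is divided by c. The symbols x'_t, w', and R' below are introduced only for this illustration.

```latex
\[
x'_t = c\,x_t, \quad w' = \frac{w^*}{c}
\;\Longrightarrow\;
{x'_t}^{T} w' = x_t^{T} w^*,
\qquad
R'^2\,|w'|^2 = (cR)^2 \cdot \frac{|w^*|^2}{c^2} = R^2\,|w^*|^2 .
\]
```

Both the margin condition and the mistake bound are therefore unchanged by the rescaling.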

Proof of Theorem 5.8: Fix some consistent w^∗. We will keep track of two quantities, w^Tw^∗ and |w|². First of all, each time we make a mistake, w^Tw^∗ increases by at least 1.

That is because if x_t is a positive example, then

(w + x_t)^Tw^∗ = w^Tw^∗ + x^T_tw^∗ ≥ w^Tw^∗ + 1,

by definition of w^∗. Similarly, if x_t is a negative example, then

(w − x_t)^Tw^∗ = w^Tw^∗ − x^T_tw^∗ ≥ w^Tw^∗ + 1.

Next, on each mistake, we claim that |w|² increases by at most R². Let us first consider mistakes on positive examples. If we make a mistake on a positive example x_t then we have

(w + x_t)^T(w + x_t) = |w|²+ 2x^T_tw + |x_t|² ≤ |w|²+ |x_t|² ≤ |w|²+ R²,

where the middle inequality comes from the fact that we made a mistake, which means that x^T_tw ≤ 0. Similarly, if we make a mistake on a negative example x_t then we have

(w − x_t)^T(w − x_t) = |w|² − 2x^T_tw + |x_t|² ≤ |w|² + |x_t|² ≤ |w|² + R².

Note that it is important here that we only update on a mistake.

So, if we make M mistakes, then w^Tw^∗ ≥ M, and |w|² ≤ MR², or equivalently, |w| ≤ R√M. Finally, we use the fact that w^Tw^∗/|w^∗| ≤ |w|, which is just saying that the projection of w in the direction of w^∗ cannot be larger than the length of w. This gives us:

M/|w^∗| ≤ R√M

√M ≤ R|w^∗|

M ≤ R²|w^∗|²

as desired.
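The bound of Theorem 5.8 can be checked numerically on synthetic data. The sketch below assumes a hypothetical separator w^∗ = (2, −1), keeps only random points that satisfy the margin condition |x^T w^∗| ≥ 1, and presents them five times in a row (the theorem applies to any sequence, so repeated passes are allowed). All names and the data-generation scheme are illustrative assumptions.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perceptron_mistakes(examples, d):
    """Count Perceptron mistakes; a zero score is treated as a mistake."""
    w = [0.0] * d
    mistakes = 0
    for x, label in examples:
        if label * dot(x, w) <= 0:
            mistakes += 1
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return mistakes


random.seed(0)
w_star = (2.0, -1.0)  # hypothetical target separator
examples = []
while len(examples) < 200:
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    s = dot(x, w_star)
    if abs(s) >= 1.0:  # keep only points satisfying the margin condition
        examples.append((x, 1 if s > 0 else -1))

sequence = examples * 5  # several passes form one longer sequence
M = perceptron_mistakes(sequence, 2)
R_squared = max(dot(x, x) for x, _ in sequence)
bound = R_squared * dot(w_star, w_star)  # R^2 |w*|^2
print(M, "<=", bound)
```

Because every point lies in the square [−1, 1]², R² is at most 2 and |w^∗|² = 5, so the theorem caps the number of mistakes at 10 regardless of how long the sequence is.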

5.5.4 Extensions: Inseparable Data and Hinge Loss

We assumed above that there existed a perfect w^∗ that correctly classified all the examples, e.g., correctly classified all the emails into important versus non-important. This is rarely the case in real-life data. What if even the best w^∗ isn’t quite perfect? We can see what this does to the above proof: if there is an example that w^∗ doesn’t correctly classify, then while the second part of the proof still holds, the first part (the dot product of w with w^∗ increasing) breaks down. However, if this doesn’t happen too often, and also x^T_tw^∗ is just a “little bit wrong” then we will only make a few more mistakes.

To make this formal, define the hinge-loss of w^∗ on a positive example x_t as max(0, 1 − x^T_tw^∗). In other words, if x^T_tw^∗ ≥ 1 as desired then the loss is zero; else, the hinge-loss is the amount the LHS is less than the RHS.²¹ Similarly, the hinge-loss of w^∗ on a negative example x_t is max(0, 1 + x^T_tw^∗). Given a sequence of labeled examples S, define the total hinge-loss L_hinge(w^∗, S) as the sum of hinge-losses of w^∗ on all examples in S.
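Both cases of the definition collapse into max(0, 1 − y · x^T w^∗) when the label y is encoded as +1 or −1. The sketch below uses that encoding; the function names and the sample values are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(w_star, x, label):
    """Hinge loss of w_star on one example; label is +1 or -1."""
    return max(0.0, 1.0 - label * dot(x, w_star))


def total_hinge_loss(w_star, examples):
    """Sum of hinge losses of w_star over a sequence of (x, label) pairs."""
    return sum(hinge_loss(w_star, x, y) for x, y in examples)


w_star = (1.0, 0.0)
S = [
    ((2.0, 0.0), 1),    # x^T w* = 2 >= 1, so the loss is 0
    ((0.5, 0.0), 1),    # correct side but inside the margin: loss 0.5
    ((-3.0, 1.0), -1),  # x^T w* = -3 <= -1, so the loss is 0
    ((0.25, 0.0), -1),  # wrong side: loss 1 + 0.25 = 1.25
]
L = total_hinge_loss(w_star, S)
print(L)
```

Note that an example can incur positive hinge loss even when w^∗ classifies it correctly, as in the second example, because the loss penalizes any shortfall from the margin of 1.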

We now get the following extended theorem.

²¹ This is called “hinge-loss” because as a function of x^T_tw^∗ it looks like a hinge.

Theorem 5.9 On any sequence of examples S = x₁, x₂, . . ., the Perceptron algorithm makes at most

min_{w^∗} ( R²|w^∗|² + 2L_hinge(w^∗, S) )

mistakes, where R = max_t|x_t|.

Proof: As before, each update of the Perceptron algorithm increases |w|² by at most R², so if the algorithm makes M mistakes, we have |w|² ≤ M R².

What we can no longer say is that each update of the algorithm increases w^Tw^∗ by at least 1. Instead, on a positive example we are “increasing” w^Tw^∗ by x^T_tw^∗ (it could be negative), which is at least 1 − L_hinge(w^∗, x_t). Similarly, on a negative example we “increase” w^Tw^∗ by −x^T_tw^∗, which is also at least 1 − L_hinge(w^∗, x_t). If we sum this up over all mistakes, we get that at the end we have w^Tw^∗ ≥ M − L_hinge(w^∗, S), where we are using here the fact that hinge-loss is never negative so summing over all of S is only larger than summing over the mistakes that w made.

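Finally, we just do some algebra. Let L = L_hinge(w^∗, S). If M ≤ L then the claimed bound holds trivially, so assume M > L, which gives w^Tw^∗ ≥ M − L > 0 and lets us square the inequality. So we have:

w^Tw^∗/|w^∗| ≤ |w|

(w^Tw^∗)² ≤ |w|²|w^∗|²

(M − L)² ≤ MR²|w^∗|²

M² − 2ML + L² ≤ MR²|w^∗|²

M − 2L + L²/M ≤ R²|w^∗|²

M ≤ R²|w^∗|² + 2L − L²/M ≤ R²|w^∗|² + 2L

as desired.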
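Theorem 5.9 can also be checked numerically. Since the bound holds for the minimizing w^∗, it holds for any particular w^∗, so the sketch below fixes one hypothetical comparison vector w^∗ = (2, −1), flips about 5% of the labels at random so the data is not separable by it, and compares the mistake count with R²|w^∗|² + 2L. The data-generation scheme and all names are assumptions made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perceptron_mistakes(examples, d):
    """Count Perceptron mistakes; a zero score is treated as a mistake."""
    w = [0.0] * d
    mistakes = 0
    for x, label in examples:
        if label * dot(x, w) <= 0:
            mistakes += 1
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return mistakes


def total_hinge_loss(w_star, examples):
    return sum(max(0.0, 1.0 - y * dot(x, w_star)) for x, y in examples)


random.seed(1)
w_star = (2.0, -1.0)  # hypothetical comparison vector, not necessarily optimal
examples = []
for _ in range(200):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    y = 1 if dot(x, w_star) > 0 else -1
    if random.random() < 0.05:  # flip a small fraction of labels
        y = -y
    examples.append((x, y))

M = perceptron_mistakes(examples, 2)
R_squared = max(dot(x, x) for x, _ in examples)
L = total_hinge_loss(w_star, examples)
bound = R_squared * dot(w_star, w_star) + 2 * L
print(M, "<=", bound)
```

The hinge-loss term here includes both the flipped labels and the correctly labeled points that fall inside the margin, so the bound degrades gracefully rather than failing outright when no perfect separator exists.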

Source: Foundations of Data Science (pages 134-138)