5 Machine Learning
5.5 Online Learning and the Perceptron Algorithm
So far we have been considering what is often called the batch learning scenario. You are given a “batch” of data—the training sample S—and your goal is to use it to produce a hypothesis h that will have low error on new data, under the assumption that both S and the new data are sampled from some fixed distribution D. We now switch to the more challenging online learning scenario where we remove the assumption that data is sampled from a fixed probability distribution, or from any probabilistic process at all.
Specifically, the online learning scenario proceeds as follows. At each time t = 1, 2, . . .:
1. The algorithm is presented with an arbitrary example $x_t \in X$ and is asked to make a prediction $\ell_t$ of its label.
2. The algorithm is told the true label of the example $c^*(x_t)$ and is charged for a mistake if $c^*(x_t) \neq \ell_t$.
The goal of the learning algorithm is to make as few mistakes as possible in total. For example, consider an email classifier that when a new email message arrives must classify it as “important” or “it can wait”. The user then looks at the email and informs the algorithm if it was incorrect. We might not want to model email messages as independent random objects from a fixed probability distribution, because they often are replies to previous emails and build on each other. Thus, the online learning model would be more appropriate than the batch model for this setting.
Intuitively, the online learning model is harder than the batch model because we have removed the requirement that our data consists of independent draws from a fixed probability distribution. Indeed, we will see shortly that any algorithm with good performance in the online model can be converted to an algorithm with good performance in the batch model. Nonetheless, the online model can sometimes be a cleaner model for design and analysis of algorithms.
5.5.1 An Example: Learning Disjunctions
As a simple example, let’s revisit the problem of learning disjunctions in the online model.
We can solve this problem by starting with a hypothesis $h = x_1 \vee x_2 \vee \ldots \vee x_d$ and using it for prediction. We will maintain the invariant that every variable in the target disjunction is also in our hypothesis, which is clearly true at the start. This ensures that the only mistakes possible are on examples x for which h(x) is positive but $c^*(x)$ is negative.
When such a mistake occurs, we simply remove from h any variable set to 1 in x. Such variables cannot be in the target function because x was negative, so we maintain our invariant and remove at least one variable from h. This implies that the algorithm makes at most d mistakes total on any series of examples consistent with a disjunction.
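The elimination procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `OnlineDisjunctionLearner`, the 0-based variable indices, and the small example stream are all assumptions made for this sketch.

```python
from typing import List, Set


class OnlineDisjunctionLearner:
    """Online learner for monotone disjunctions over d Boolean variables."""

    def __init__(self, d: int) -> None:
        # Start with h = x_1 OR ... OR x_d (stored as 0-based indices).
        self.relevant: Set[int] = set(range(d))
        self.mistakes = 0

    def predict(self, x: List[int]) -> bool:
        # h(x) is positive iff some variable still in h is set to 1 in x.
        return any(x[i] == 1 for i in self.relevant)

    def update(self, x: List[int], label: bool) -> None:
        if self.predict(x) == label:
            return
        self.mistakes += 1
        if not label:
            # False positive: drop every variable of h that is set to 1 in x.
            self.relevant -= {i for i in self.relevant if x[i] == 1}


# Hypothetical target disjunction x_0 OR x_3 over d = 5 variables.
d = 5
target = {0, 3}
learner = OnlineDisjunctionLearner(d)
examples = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
for x in examples:
    true_label = any(x[i] == 1 for i in target)
    learner.update(x, true_label)

print(learner.mistakes, sorted(learner.relevant))
```

On this stream the learner errs only on negative examples, and each such error removes at least one variable, which is why the count can never exceed d.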
In fact, we can show this bound is tight by showing that no deterministic algorithm can guarantee to make fewer than d mistakes.
Theorem 5.7 For any deterministic algorithm A there exists a sequence of examples σ and a disjunction $c^*$ such that A makes at least d mistakes on sequence σ labeled by $c^*$.
Proof: Let σ be the sequence $e_1, e_2, \ldots, e_d$ where $e_j$ is the example that is zero everywhere except for a 1 in the jth position. Imagine running A on sequence σ and telling A it made a mistake on every example; that is, if A predicts positive on $e_j$ we set $c^*(e_j) = -1$ and if A predicts negative on $e_j$ we set $c^*(e_j) = +1$. This target corresponds to the disjunction of all $x_j$ such that A predicted negative on $e_j$, so it is a legal disjunction. Since A is deterministic, the fact that we constructed $c^*$ by running A is not a problem: it would make the same mistakes if re-run from scratch on the same sequence and same target. Therefore, A makes d mistakes on this σ and $c^*$.
5.5.2 The Halving Algorithm
If we are not concerned with running time, a simple algorithm that is guaranteed to make at most $\log_2 |H|$ mistakes for a target belonging to any given finite class H is the halving algorithm. This algorithm simply maintains the version space V ⊆ H consisting of all h ∈ H consistent with the labels on every example seen so far, and predicts based on majority vote over these functions. Each mistake is guaranteed to reduce the size of the version space V by at least half (hence the name), so the total number of mistakes is at most $\log_2 |H|$. Note that this can be viewed as the number of bits needed to write down a function in H.
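As a rough sketch of the halving algorithm, the Python below keeps the version space as an explicit list and predicts by majority vote. The function name `halving_run`, the tie-breaking rule (ties predict positive), and the choice of H as the 8 monotone disjunctions over 3 variables are assumptions made for illustration only.

```python
import itertools
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Hypothesis = Callable[[Sequence[int]], bool]


def halving_run(
    hypotheses: Iterable[Hypothesis],
    stream: Iterable[Tuple[Sequence[int], bool]],
) -> Tuple[int, List[Hypothesis]]:
    """Run the halving algorithm; return (mistakes, final version space)."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        positive_votes = sum(1 for h in version_space if h(x))
        # Majority vote over the version space; ties predict positive.
        prediction = 2 * positive_votes >= len(version_space)
        if prediction != label:
            mistakes += 1
        # Keep only the hypotheses consistent with the revealed label.
        version_space = [h for h in version_space if h(x) == label]
    return mistakes, version_space


def make_disjunction(variables: frozenset) -> Hypothesis:
    return lambda x: any(x[i] == 1 for i in variables)


# Hypothetical class H: all monotone disjunctions over 3 variables (|H| = 8).
num_vars = 3
subsets = [
    frozenset(s)
    for r in range(num_vars + 1)
    for s in itertools.combinations(range(num_vars), r)
]
hypotheses = [make_disjunction(s) for s in subsets]
target = make_disjunction(frozenset({0, 2}))
stream = [(x, target(x)) for x in itertools.product([0, 1], repeat=num_vars)]

mistakes, remaining = halving_run(hypotheses, stream)
print(mistakes, "<=", math.log2(len(hypotheses)))
```

Storing the version space explicitly takes time and memory proportional to |H|, which is exactly the running-time cost the text sets aside.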
5.5.3 The Perceptron Algorithm
The Perceptron algorithm is an efficient algorithm for learning a linear separator in d-dimensional space, with a mistake bound that depends on the margin of separation of the data. Specifically, the assumption is that the target function can be described by a vector $w^*$ such that for each positive example x we have $x^T w^* \geq 1$ and for each negative example x we have $x^T w^* \leq -1$. Note that if we think of the examples x as points in space, then $|x^T w^*|/|w^*|$ is the distance of x to the hyperplane $x^T w^* = 0$. Thus, we can view our assumption as stating that there exists a linear separator through the origin with all positive examples on one side, all negative examples on the other side, and all examples at distance at least $\gamma = 1/|w^*|$ from the separator. This quantity γ is called the margin of separation (see Figure 5.3).
Figure 5.3: Margin of a linear separator.
The guarantee of the Perceptron algorithm will be that the total number of mistakes is at most $(R/\gamma)^2$ where $R = \max_t |x_t|$ over all examples $x_t$ seen so far. Thus, if there exists a hyperplane through the origin that correctly separates the positive examples from the negative examples by a large margin relative to the radius of the smallest ball enclosing the data, then the total number of mistakes will be small. The algorithm is very simple and proceeds as follows.
The Perceptron Algorithm: Start with the all-zeroes weight vector $w = 0$. Then, for $t = 1, 2, \ldots$ do:
1. Given example $x_t$, predict $\mathrm{sgn}(x_t^T w)$.
2. If the prediction was a mistake, then update:
(a) If $x_t$ was a positive example, let $w \leftarrow w + x_t$.
(b) If $x_t$ was a negative example, let $w \leftarrow w - x_t$.
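The steps above translate almost directly into code. The following is a minimal pure-Python sketch; the function names `dot` and `perceptron`, the encoding of labels as +1 and -1, and the four-point example stream are assumptions for illustration. One further assumption: a dot product of exactly zero is counted as a mistake for either label, which matches the inequalities used in the analysis of the algorithm.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (x, y) with y in {+1, -1}


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def perceptron(stream: Sequence[Example]) -> Tuple[List[float], int]:
    """Process the examples in order; return (final weights, mistake count)."""
    w = [0.0] * len(stream[0][0])  # start with the all-zeroes vector
    mistakes = 0
    for x, y in stream:
        # Mistake if the sign of x.w disagrees with y (zero counts as wrong).
        if y * dot(x, w) <= 0:
            mistakes += 1
            # w <- w + x on a positive example, w <- w - x on a negative one.
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes


# Hypothetical stream, separable by w* = (1, 1) with x.w* >= 1 or <= -1.
stream = [
    ((2.0, 1.0), +1),
    ((-1.0, -2.0), -1),
    ((1.0, 3.0), +1),
    ((-3.0, -1.0), -1),
]
w, mistakes = perceptron(stream)
print(w, mistakes)
```

Because the weight vector changes only on mistakes, the work per example is a single dot product plus, occasionally, one vector addition.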
While simple, the Perceptron algorithm enjoys a strong guarantee on its total number of mistakes.
Theorem 5.8 On any sequence of examples $x_1, x_2, \ldots$, if there exists a vector $w^*$ such that $x_t^T w^* \geq 1$ for the positive examples and $x_t^T w^* \leq -1$ for the negative examples (i.e., a linear separator of margin $\gamma = 1/|w^*|$), then the Perceptron algorithm makes at most $R^2 |w^*|^2$ mistakes, where $R = \max_t |x_t|$.
To get a feel for this bound, notice that if we multiply all entries in all the $x_t$ by 100, we can divide all entries in $w^*$ by 100 and it will still satisfy the “if” condition. So the bound is invariant to this kind of scaling, i.e., to what our “units of measurement” are.
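The scaling argument can be written out explicitly. In the sketch below, the scale factor c > 0 and the primed symbols are notation introduced here for illustration; the text's example corresponds to c = 100.

```latex
\begin{align*}
x_t' &= c\,x_t, \qquad w'^{*} = w^{*}/c \\
x_t'^{T} w'^{*} &= (c\,x_t)^{T}(w^{*}/c) = x_t^{T} w^{*} \\
R' &= \max_t |x_t'| = c\,R, \qquad |w'^{*}| = |w^{*}|/c \\
R'^{2}\,|w'^{*}|^{2} &= c^{2}R^{2}\cdot |w^{*}|^{2}/c^{2} = R^{2}\,|w^{*}|^{2}
\end{align*}
```

The second line shows the margin condition is unaffected, and the last line shows the mistake bound is unchanged.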
Proof of Theorem 5.8: Fix some consistent $w^*$. We will keep track of two quantities, $w^T w^*$ and $|w|^2$. First of all, each time we make a mistake, $w^T w^*$ increases by at least 1. That is because if $x_t$ is a positive example, then
$$(w + x_t)^T w^* = w^T w^* + x_t^T w^* \geq w^T w^* + 1,$$
by definition of $w^*$. Similarly, if $x_t$ is a negative example, then
$$(w - x_t)^T w^* = w^T w^* - x_t^T w^* \geq w^T w^* + 1.$$
Next, on each mistake, we claim that $|w|^2$ increases by at most $R^2$. Let us first consider mistakes on positive examples. If we make a mistake on a positive example $x_t$ then we have
$$(w + x_t)^T (w + x_t) = |w|^2 + 2 x_t^T w + |x_t|^2 \leq |w|^2 + |x_t|^2 \leq |w|^2 + R^2,$$
where the middle inequality comes from the fact that we made a mistake, which means that $x_t^T w \leq 0$. Similarly, if we make a mistake on a negative example $x_t$ then we have
$$(w - x_t)^T (w - x_t) = |w|^2 - 2 x_t^T w + |x_t|^2 \leq |w|^2 + |x_t|^2 \leq |w|^2 + R^2.$$
Note that it is important here that we only update on a mistake.
So, if we make M mistakes, then $w^T w^* \geq M$, and $|w|^2 \leq M R^2$, or equivalently, $|w| \leq R\sqrt{M}$. Finally, we use the fact that $w^T w^* / |w^*| \leq |w|$, which is just saying that the projection of w in the direction of $w^*$ cannot be larger than the length of w. This gives us:
$$\begin{aligned}
M / |w^*| &\leq R\sqrt{M} \\
\sqrt{M} &\leq R\,|w^*| \\
M &\leq R^2 |w^*|^2
\end{aligned}$$
as desired.
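The bound of Theorem 5.8 can also be checked numerically. The sketch below is only an illustration under stated assumptions: the target vector `w_star`, the uniform sampling of points, the random seed, and the helper names are all hypothetical, and a dot product of exactly zero is counted as a mistake so that the inequalities in the proof apply.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (x, y) with y in {+1, -1}


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def perceptron_mistakes(stream: Sequence[Example]) -> int:
    """Run the Perceptron over the stream and count its mistakes."""
    w = [0.0] * len(stream[0][0])
    mistakes = 0
    for x, y in stream:
        if y * dot(x, w) <= 0:  # zero counts as a mistake, as in the proof
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return mistakes


random.seed(0)
w_star = (1.0, -1.0)  # hypothetical target vector

# Sample points and keep only those satisfying |x.w*| >= 1.
points: List[Tuple[float, float]] = []
while len(points) < 200:
    x = (random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0))
    if abs(dot(x, w_star)) >= 1.0:
        points.append(x)

# Present the same labeled points ten times; the theorem covers any sequence.
stream = [(x, 1 if dot(x, w_star) > 0 else -1) for x in points] * 10

mistakes = perceptron_mistakes(stream)
radius_sq = max(dot(x, x) for x, _ in stream)  # R^2
bound = radius_sq * dot(w_star, w_star)        # R^2 |w*|^2
print(mistakes, "<=", bound)
```

The bound is a worst-case guarantee, so the observed mistake count on random data is typically far below it.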
5.5.4 Extensions: Inseparable Data and Hinge Loss
We assumed above that there existed a perfect $w^*$ that correctly classified all the examples, e.g., correctly classified all the emails into important versus non-important. This is rarely the case in real-life data. What if even the best $w^*$ isn’t quite perfect? We can see what this does to the above proof: if there is an example that $w^*$ doesn’t correctly classify, then while the second part of the proof still holds, the first part (the dot product of w with $w^*$ increasing) breaks down. However, if this doesn’t happen too often, and $x_t^T w^*$ is just a “little bit wrong”, then we will only make a few more mistakes.
To make this formal, define the hinge-loss of $w^*$ on a positive example $x_t$ as $\max(0, 1 - x_t^T w^*)$. In other words, if $x_t^T w^* \geq 1$ as desired then the loss is zero; else, the hinge-loss is the amount by which $x_t^T w^*$ falls short of 1.[21] Similarly, the hinge-loss of $w^*$ on a negative example $x_t$ is $\max(0, 1 + x_t^T w^*)$. Given a sequence of labeled examples S, define the total hinge-loss $L_{\mathrm{hinge}}(w^*, S)$ as the sum of hinge-losses of $w^*$ on all examples in S.
[21] This is called “hinge-loss” because as a function of $x_t^T w^*$ it looks like a hinge.
We now get the following extended theorem.
Theorem 5.9 On any sequence of examples $S = x_1, x_2, \ldots$, the Perceptron algorithm makes at most
$$\min_{w^*} \left( R^2 |w^*|^2 + 2 L_{\mathrm{hinge}}(w^*, S) \right)$$
mistakes, where $R = \max_t |x_t|$.
Proof: As before, each update of the Perceptron algorithm increases $|w|^2$ by at most $R^2$, so if the algorithm makes M mistakes, we have $|w|^2 \leq M R^2$.
What we can no longer say is that each update of the algorithm increases $w^T w^*$ by at least 1. Instead, on a positive example we are “increasing” $w^T w^*$ by $x_t^T w^*$ (it could be negative), which is at least $1 - L_{\mathrm{hinge}}(w^*, x_t)$. Similarly, on a negative example we “increase” $w^T w^*$ by $-x_t^T w^*$, which is also at least $1 - L_{\mathrm{hinge}}(w^*, x_t)$. If we sum this up over all mistakes, we get that at the end we have $w^T w^* \geq M - L_{\mathrm{hinge}}(w^*, S)$, where we are using the fact that hinge-loss is never negative, so summing over all of S can only be larger than summing over the examples on which the algorithm made a mistake.
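Finally, we just do some algebra. Let $L = L_{\mathrm{hinge}}(w^*, S)$. If $M \leq L$ then the claimed bound holds trivially, so assume $M > L$; then $w^T w^* \geq M - L > 0$ and both sides of the inequalities below may be squared. So we have:
$$\begin{aligned}
w^T w^* / |w^*| &\leq |w| \\
(w^T w^*)^2 &\leq |w|^2 |w^*|^2 \\
(M - L)^2 &\leq M R^2 |w^*|^2 \\
M^2 - 2ML + L^2 &\leq M R^2 |w^*|^2 \\
M - 2L + L^2/M &\leq R^2 |w^*|^2 \\
M &\leq R^2 |w^*|^2 + 2L - L^2/M \leq R^2 |w^*|^2 + 2L
\end{aligned}$$
as desired.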
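To make the hinge-loss definition and the bound of Theorem 5.9 concrete, the sketch below computes the total hinge-loss of one fixed comparison vector on a noisy stream and compares the Perceptron's mistake count with $R^2|w^*|^2 + 2L$. Everything specific here is an assumption for illustration: the vector `w_star`, the 10% label-flip rate, the seed, and the helper names. Since the theorem takes a minimum over all $w^*$, checking a single $w^*$ is valid but gives a possibly loose bound.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (x, y) with y in {+1, -1}


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(w_star: Sequence[float], x: Sequence[float], y: int) -> float:
    # max(0, 1 - x.w*) on a positive example, max(0, 1 + x.w*) on a negative.
    return max(0.0, 1.0 - y * dot(x, w_star))


def total_hinge_loss(w_star: Sequence[float], stream: Sequence[Example]) -> float:
    return sum(hinge_loss(w_star, x, y) for x, y in stream)


def perceptron_mistakes(stream: Sequence[Example]) -> int:
    w = [0.0] * len(stream[0][0])
    mistakes = 0
    for x, y in stream:
        if y * dot(x, w) <= 0:  # zero counts as a mistake, as in the proof
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return mistakes


random.seed(1)
w_star = (1.0, -1.0)  # hypothetical comparison vector

# Label points by w_star, then flip about 10% of labels to break separability.
stream: List[Example] = []
for _ in range(300):
    x = (random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0))
    y = 1 if dot(x, w_star) > 0 else -1
    if random.random() < 0.1:
        y = -y
    stream.append((x, y))

mistakes = perceptron_mistakes(stream)
radius_sq = max(dot(x, x) for x, _ in stream)           # R^2
loss = total_hinge_loss(w_star, stream)                 # L
bound = radius_sq * dot(w_star, w_star) + 2.0 * loss    # R^2 |w*|^2 + 2L
print(mistakes, "<=", bound)
```

A flipped label far from the separator contributes a large hinge-loss, so the bound degrades with how badly, not just how often, $w^*$ is wrong.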