
Large-Margin Bounds of Threshold Ensembles

Each ranker in the threshold ensemble model is called a threshold ensemble; it uses an ensemble H_T of confidence functions as the potential function H (cf. Section 3.1 and Chapter 4). The ensemble is of the form

$$H_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x), \qquad \alpha_t \in \mathbb{R}. \tag{3.2}$$

We assume that each confidence function h_t : X → [−1, +1] comes from a hypothesis set H. That is, H_T ∈ span(H) = P. The confidence function reflects a possibly imperfect ordering preference. Note that a special instance of the confidence function is a binary classifier X → {−1, +1}, which matches the fact that binary classification is a special case of ordinal ranking with K = 2 (Rudin et al. 2005). The ensemble linearly combines the ordering preferences with α. Note that we allow α_t to be any real value, which means that it is possible to reverse the ordering preference of h_t in the ensemble when necessary.
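As a concrete illustration, here is a minimal sketch of a threshold ensemble ranker in Python. The names and the toy confidence functions are illustrative only; the decision rule assumed below is the usual threshold rule of (3.1), i.e., the predicted rank is the index of the interval (θ_{y−1}, θ_y] that contains H_T(x).

```python
import numpy as np

# A minimal sketch of a threshold ensemble ranker (names are illustrative).
# h_list holds confidence functions h_t : X -> [-1, +1], alpha their weights,
# and theta the K-1 ordered thresholds.

def H_T(x, h_list, alpha):
    """Potential function H_T(x) = sum_t alpha_t * h_t(x), as in (3.2)."""
    return sum(a * h(x) for a, h in zip(alpha, h_list))

def rank(x, h_list, alpha, theta):
    """Assumed threshold rule: the predicted rank is the index of the interval
    (theta_{y-1}, theta_y] containing H_T(x), i.e. 1 plus the number of
    thresholds that H_T(x) exceeds."""
    return 1 + int(np.sum(H_T(x, h_list, alpha) > np.asarray(theta)))

# Toy usage with K = 4 (three thresholds) and two stump-like confidence functions.
h_list = [lambda x: float(np.sign(x[0] - 0.3)), lambda x: float(np.tanh(x[1]))]
alpha = [0.8, 1.5]
theta = [-1.0, 0.0, 1.0]
print(rank(np.array([0.7, 0.2]), h_list, alpha, theta))  # H_T ~ 1.10 > theta_3, so rank 4
```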

Ensemble models in general have been successfully used for classification and regression (Meir and Rätsch 2003). They not only introduce more stable predictions through the linear combination, but also provide sufficient power for approximating complicated target functions. The threshold ensemble model extends existing ensemble models to ordinal ranking and inherits many useful theoretical properties from them. Next, we discuss one such property: the large-margin bounds.

[Figure 3.2: Margins of a correctly predicted example — the potential value H(x) relative to the thresholds θ_k, the resulting rank r_{H,θ}(x) ∈ {1, 2, 3, 4}, and the margins ρ_k of the example to the thresholds.]

Large-margin bounds typically take the form π(g) ≤ ν(g, ∆) + (complexity term), where ν(g, ∆) = π(g, Z_u, ∆) is an extended form of the training cost with respect to a margin parameter ∆, and the complexity term decreases as N or ∆ increases.

For ordinal ranking using threshold models, Herbrich, Graepel and Obermayer (2000) derived a large-margin bound for any threshold ranker r_θ with a potential function H_v = ⟨v, φ(x)⟩. Unfortunately, the bound is quite restricted, since it is only applicable when ν(r_θ, ∆) = 0 with respect to the classification cost function C_c. In addition, the bound uses a margin definition that contains O(N²) terms, which makes it more complicated to design algorithms that relate to the bound. Another bound was derived by Shashua and Levin (2003). That bound is based on a margin definition of only O(KN) terms and is applicable to the threshold ensemble model. Nevertheless, it is loose when T, the size of the ensemble, is large, because its complexity term grows with T.

Next, we derive novel large-margin bounds of the threshold ensemble model with two widely used cost functions: the classification cost function C_c and the absolute cost function C_a. Similar bounds for the more general cost-sensitive setup will be discussed in Chapter 4. The bounds are extended from the results of Schapire et al. (1998) and are based on a margin definition of O(KN) terms. In addition, our bounds do not require ν(r_θ, ∆) = 0 with respect to C_c, and their complexity terms do not grow with T.

We start by defining the margins of a threshold ensemble, which are illustrated in Figure 3.2. Intuitively, we expect the potential value H(x) to be in the desired interval (θ_{y−1}, θ_y], and we want H(x) to be far from the boundaries (thresholds):

Definition 3.1. Consider a given threshold ensemble r_{H,θ}, where H = H_T.

1. The margin of an example (x, y) with respect to θ_k is defined as
$$\rho_k(x, y) = \begin{cases} H_T(x) - \theta_k, & \text{if } y > k;\\[2pt] \theta_k - H_T(x), & \text{if } y \le k. \end{cases}$$

2. The normalized margin ρ̃_k(x, y) is defined as
$$\tilde\rho_k(x, y) = \rho_k(x, y) \Big/ \left( \sum_{t=1}^{T} |\alpha_t| + \sum_{k=1}^{K-1} |\theta_k| \right).$$

Definition 3.1 is similar to the definition of the SVM margin by Shashua and Levin (2003) and is analogous to the definition of margins in binary classification. A negative ρ_k(x, y) would indicate an incorrect prediction.

For each example (x, y), we can obtain (K−1) margins from Definition 3.1. Two of them are of the most importance. The first one is ρ_{y−1}(x, y), which is the margin to the left (lower) boundary of the desired interval. The other is ρ_y(x, y), which is the margin to the right (upper) boundary. We will give them special names: the left-margin ρ_L(x, y) and the right-margin ρ_R(x, y). Note that by definition, ρ_L(x, 1) = ρ_R(x, K) = ∞.
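For reference, here is a small sketch of Definition 3.1 in code (the helper names and the array layout are illustrative only, not from the text):

```python
import numpy as np

# A small sketch of Definition 3.1. H_value is the potential H_T(x), theta
# holds the K-1 thresholds, and alpha the ensemble weights.

def margins(H_value, y, theta):
    """All K-1 margins rho_k(x, y), k = 1, ..., K-1."""
    theta = np.asarray(theta, dtype=float)
    k = np.arange(1, len(theta) + 1)
    # rho_k = H_T(x) - theta_k if y > k, and theta_k - H_T(x) if y <= k.
    return np.where(y > k, H_value - theta, theta - H_value)

def normalized_margins(H_value, y, theta, alpha):
    """Divide by sum_t |alpha_t| + sum_k |theta_k|."""
    scale = np.sum(np.abs(alpha)) + np.sum(np.abs(theta))
    return margins(H_value, y, theta) / scale

def left_right_margins(H_value, y, theta, K):
    """Left margin rho_{y-1} and right margin rho_y; infinite at the two ends."""
    rho = margins(H_value, y, theta)
    rho_L = rho[y - 2] if y > 1 else np.inf
    rho_R = rho[y - 1] if y < K else np.inf
    return rho_L, rho_R

# Example: H_T(x) = 0.4, y = 2, thresholds (-1, 0, 1).
print(left_right_margins(0.4, 2, [-1.0, 0.0, 1.0], K=4))
# rho_L = 1.4, rho_R = -0.4: the negative right-margin signals a wrong prediction.
```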

Next, we take a closer look at the out-of-sample cost π(r) and see how it connects to the margin definition above.

∆-classification cost: For the classification cost function C_c, if we make a minor assumption that the degenerate cases ρ̃_R(x, y) = 0 are of an infinitesimal probability,

$$\pi_c(r_\theta, F) = \int_{x,y} \llbracket y \ne r_\theta(x) \rrbracket\, dF(x, y)
= \int_{x,y} \llbracket \tilde\rho_L(x, y) \le 0 \ \text{or}\ \tilde\rho_R(x, y) \le 0 \rrbracket\, dF(x, y)\,.$$
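To spell out why the second equality holds, note that the threshold rule places a correct prediction exactly in the desired interval:

```latex
% Correctness of r_theta(x) expressed through the left- and right-margins:
\[
  r_\theta(x) = y
  \;\Longleftrightarrow\;
  \theta_{y-1} < H_T(x) \le \theta_y
  \;\Longleftrightarrow\;
  \rho_L(x, y) > 0 \ \text{ and } \ \rho_R(x, y) \ge 0 .
\]
```

Hence y ≠ r_θ(x) exactly when ρ̃_L(x, y) ≤ 0 or ρ̃_R(x, y) < 0, and the assumption that ρ̃_R(x, y) = 0 occurs with infinitesimal probability lets us replace the strict inequality by ρ̃_R(x, y) ≤ 0 without changing the integral.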

The definition could be generalized by expecting both margins to be larger than ∆. That is, we can define the ∆-classification cost as

$$\pi_c(r_\theta, F, \Delta) \equiv \int_{x,y} \llbracket \tilde\rho_L(x, y) \le \Delta \ \text{or}\ \tilde\rho_R(x, y) \le \Delta \rrbracket\, dF(x, y)\,.$$

Then, π_c(r_θ, F) is just a special case with ∆ = 0. We can also define the in-sample ∆-classification cost ν_c(r_θ, ∆) ≡ π_c(r_θ, Z_u, ∆).

∆-boundary cost: The “or” operation of π_c(r_θ, F, ∆) is not easy to handle in the proof of the coming bounds. An alternative choice is the ∆-boundary cost:

$$\pi_b(r_\theta, F, \Delta) \equiv \int_{x,y} dF(x, y) \cdot
\begin{cases}
\llbracket \tilde\rho_R(x, y) \le \Delta \rrbracket, & \text{if } y = 1;\\[2pt]
\llbracket \tilde\rho_L(x, y) \le \Delta \rrbracket, & \text{if } y = K;\\[2pt]
\tfrac{1}{2}\bigl(\llbracket \tilde\rho_L(x, y) \le \Delta \rrbracket + \llbracket \tilde\rho_R(x, y) \le \Delta \rrbracket\bigr), & \text{otherwise.}
\end{cases}$$

Similarly, the in-sample ∆-boundary cost ν_b is defined by ν_b(r_θ, ∆) ≡ π_b(r_θ, Z_u, ∆).

Note that the ∆-boundary cost and the ∆-classification cost are equivalent up to a constant. That is, for any (r_θ, F, ∆),

$$\tfrac{1}{2}\,\pi_c(r_\theta, F, \Delta) \;\le\; \pi_b(r_\theta, F, \Delta) \;\le\; \pi_c(r_\theta, F, \Delta). \tag{3.3}$$
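To see why (3.3) holds pointwise, fix an example (x, y) and write a = ⟦ρ̃_L(x, y) ≤ ∆⟧ and b = ⟦ρ̃_R(x, y) ≤ ∆⟧ (so a = 0 when y = 1 and b = 0 when y = K, because the corresponding margin is infinite). The integrand of π_c(r_θ, F, ∆) is max(a, b), while the integrand of π_b(r_θ, F, ∆) is max(a, b) for y ∈ {1, K} and (a + b)/2 otherwise. Since a, b ∈ {0, 1},

```latex
\[
  \tfrac{1}{2}\,\max(a, b) \;\le\; \tfrac{1}{2}\,(a + b) \;\le\; \max(a, b),
\]
```

so the ∆-boundary integrand always lies between half of the ∆-classification integrand and the ∆-classification integrand itself; integrating over dF(x, y) gives (3.3).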

∆-absolute cost: Next, we look at the out-of-sample cost based on the absolute cost function C_a. Again, if we make a minor assumption that the degenerate cases ρ̃_k(x, y) = 0 are of an infinitesimal probability,

$$\pi_a(r_\theta, F) = \int_{x,y} \bigl| y - r_\theta(x) \bigr|\, dF(x, y)
= \int_{x,y} \sum_{k=1}^{K-1} \llbracket \tilde\rho_k(x, y) \le 0 \rrbracket\, dF(x, y)\,.$$

Then, the ∆-absolute cost can be defined as

$$\pi_a(r_\theta, F, \Delta) \equiv \int_{x,y} \sum_{k=1}^{K-1} \llbracket \tilde\rho_k(x, y) \le \Delta \rrbracket\, dF(x, y)\,,$$

which takes π_a(r_θ, F) as a special case with ∆ = 0. The in-sample ∆-absolute cost ν_a is similarly defined by ν_a(r_θ, ∆) ≡ π_a(r_θ, Z_u, ∆).
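For concreteness, the three in-sample ∆-costs can be computed directly from the normalized margins. The sketch below is illustrative only (the function name and the margin-matrix layout are my own): it assumes a matrix whose n-th row holds ρ̃_1(x_n, y_n), …, ρ̃_{K−1}(x_n, y_n), together with labels y_n ∈ {1, …, K}.

```python
import numpy as np

def in_sample_delta_costs(rho_tilde, y, K, delta):
    """Return (nu_c, nu_b, nu_a) from an (N, K-1) matrix of normalized margins,
    where rho_tilde[n, k-1] = rho~_k(x_n, y_n) and y[n] is the label in 1..K."""
    rho_tilde = np.asarray(rho_tilde, dtype=float)
    y = np.asarray(y)
    rows = np.arange(len(y))
    # Left/right margins of the desired interval; infinite at the two ends.
    rho_L = np.where(y > 1, rho_tilde[rows, np.maximum(y - 2, 0)], np.inf)
    rho_R = np.where(y < K, rho_tilde[rows, np.minimum(y - 1, K - 2)], np.inf)
    a = (rho_L <= delta).astype(float)
    b = (rho_R <= delta).astype(float)

    nu_c = np.mean(np.maximum(a, b))                    # Delta-classification cost
    nu_b = np.mean(np.where((y == 1) | (y == K),        # Delta-boundary cost
                            np.maximum(a, b), 0.5 * (a + b)))
    nu_a = np.mean(np.sum(rho_tilde <= delta, axis=1))  # Delta-absolute cost
    return nu_c, nu_b, nu_a
```

On any such matrix, the returned values satisfy ν_c/2 ≤ ν_b ≤ ν_c, matching (3.3) for the empirical distribution.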

An important observation for deriving our bounds is that π_a(r_θ, F, ∆) can be written with respect to an additional sampling procedure on k. That is,

$$\pi_a(r_\theta, F, \Delta) = \int_{x,y} \sum_{k=1}^{K-1} \llbracket \tilde\rho_k(x, y) \le \Delta \rrbracket\, dF(x, y)
= (K-1) \int_{x,y,k} \llbracket \tilde\rho_k(x, y) \le \Delta \rrbracket\ \tfrac{1}{K-1}\, dF(x, y)\,.$$

Equivalently, we can define a probability measure dF_E(x, y, k) from dF(x, y) and a uniform distribution over {1, …, K−1} to generate the tuple (x, y, k). Then (1/(K−1)) π_a(r_θ, F, ∆) is the portion of tuples with ρ̃_k(x, y) ≤ ∆ under dF_E(x, y, k). Consider an extended training set Z_E = {(x_n, y_n, k)} with N(K−1) elements. Each element is a possible outcome from dF_E(x, y, k). Note, however, that these elements are not all independent. For example, (x_n, y_n, 1) and (x_n, y_n, 2) are dependent. Thus, we cannot directly use the whole Z_E as a set of independent outcomes from dF_E(x, y, k).

Some subsets of Z_E contain independent outcomes from dF_E(x, y, k). One way to extract such a subset is to choose one k_n uniformly and independently from {1, …, K−1} for each example (x_n, y_n). The resulting subset is denoted Z_S = {(x_n, y_n, k_n)}_{n=1}^{N}. Then, we can obtain a large-margin bound of π_a(r_θ, F):

Theorem 3.2. Consider a negation complete¹ set H, which contains only binary classifiers h : X → {−1, 1} and is of VC-dimension d. Assume that δ > 0 and N > d + K − 1 = d_E. Then, with probability at least 1 − δ over the random choice of the training set Z, every threshold ensemble defined from (3.1) and (3.2) satisfies the following bound for all ∆ > 0:

$$\pi_a(r_\theta, F) \le \nu_a(r_\theta, \Delta) + O\!\left(\frac{K}{\sqrt{N}}\left(\frac{d_E \log^2(N/d_E)}{\Delta^2} + \log\frac{1}{\delta}\right)^{1/2}\right).$$

¹ H is negation complete if and only if h ∈ H ⟺ (−h) ∈ H, where (−h)(x) = −h(x) for all x.

Proof. The key is to use the examples (x_n, y_n, k_n) ∈ Z_S. Let

$$\bigl(X_n^{(k_n)}, Y_n^{(k_n)}\bigr) = \begin{cases} \bigl((x_n, \mathbf{1}_{k_n}),\ +1\bigr), & \text{if } y_n > k_n;\\[2pt] \bigl((x_n, \mathbf{1}_{k_n}),\ -1\bigr), & \text{if } y_n \le k_n, \end{cases} \tag{3.4}$$

where 1_k is a vector of length (K−1) with a single 1 at the k-th dimension and 0 elsewhere. The test examples are constructed similarly with

$$\bigl(X^{(k)}, Y^{(k)}\bigr) \equiv \bigl((x, \mathbf{1}_k),\ 2\,\llbracket y > k \rrbracket - 1\bigr),$$

where (x, y, k) is generated from dF_E(x, y, k). Then, large-margin bounds for the ordinal ranking problem can be inferred from those for the binary classification problem. We first consider an ensemble function g(X^{(k)}) defined by a linear combination of the functions in

$$G = \bigl\{\, \tilde h : \tilde h(X^{(k)}) = h(x),\ h \in H \bigr\} \;\cup\; \{s_\ell\}_{\ell=1}^{K-1}. \tag{3.5}$$

Here s_ℓ(X^{(k)}) is a decision stump on dimension D + ℓ (see Subsection 5.1.2). If the output space of s_ℓ is {−1, 1}, it is not hard to show that the VC-dimension of G is no more than d_E = d + K − 1. Since the proof of Schapire et al. (1998, Theorem 2), which will be applied on G later, only requires a combinatorial counting bound on the possible outputs of s_ℓ, we let

$$s_\ell\bigl(X^{(k)}\bigr) = -\,\frac{\operatorname{sign}\bigl(X^{(k)}[D + \ell] - 0.5\bigr) + 1}{2} = -\,\llbracket k = \ell \rrbracket \in \{-1, 0\}$$

to get a cosmetically cleaner proof. Some different versions of the bound can be obtained by considering s_ℓ(X^{(k)}) ∈ {−1, 1} or by bounding the number of possible outputs of s_ℓ directly by a tighter term.

Without loss of generality, we normalize r_θ such that Σ_{t=1}^{T} |α_t| + Σ_{ℓ=1}^{K−1} |θ_ℓ| = 1.

Then, consider an ensemble function

$$g\bigl(X^{(k)}\bigr) = g(x, \mathbf{1}_k) = H_T(x) - \theta_k = \sum_{t=1}^{T} \alpha_t \tilde h_t\bigl(X^{(k)}\bigr) + \sum_{\ell=1}^{K-1} \theta_\ell\, s_\ell\bigl(X^{(k)}\bigr).$$

An important property of the transform is that for every (X^{(k)}, Y^{(k)}) derived from the tuple (x, y, k), the term Y^{(k)} · g(X^{(k)}) equals ρ̃_k(x, y).
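As a quick sanity check of this property, the toy sketch below (names and the toy stumps are my own) builds one (X^{(k)}, Y^{(k)}) pair as in (3.4), with k drawn uniformly from {1, …, K−1} as in the construction of Z_S, and verifies that Y^{(k)} g(X^{(k)}) equals the normalized margin ρ̃_k(x, y):

```python
import numpy as np

# A toy sanity check of the reduction (3.4); all names here are illustrative.
rng = np.random.default_rng(0)
K, T, D = 4, 3, 2

# Toy confidence functions: decision stumps on random input coordinates.
dims = rng.integers(0, D, size=T)
h = [lambda x, d=int(d): float(np.sign(x[d] - 0.5)) for d in dims]
alpha = rng.normal(size=T)
theta = np.sort(rng.normal(size=K - 1))

# Normalize so that sum_t |alpha_t| + sum_l |theta_l| = 1, as in the proof.
scale = np.abs(alpha).sum() + np.abs(theta).sum()
alpha, theta = alpha / scale, theta / scale

def H_T(x):
    # h~_t(X^(k)) = h_t(x), so the potential only needs the original x.
    return sum(a * h_t(x) for a, h_t in zip(alpha, h))

# One tuple (x, y, k); k is drawn uniformly from {1, ..., K-1}, as in Z_S.
x, y = rng.random(D), 3
k = int(rng.integers(1, K))

X = np.concatenate([x, np.eye(K - 1)[k - 1]])  # X^(k) = (x, 1_k)
Y = 1 if y > k else -1                         # Y^(k) = 2*[[y > k]] - 1

# g(X^(k)) = sum_t alpha_t h~_t(X^(k)) + sum_l theta_l s_l(X^(k)), s_l = -[[k = l]].
g = H_T(x) + sum(theta[l] * (-(k == l + 1)) for l in range(K - 1))

rho_k = H_T(x) - theta[k - 1] if y > k else theta[k - 1] - H_T(x)
assert np.isclose(Y * g, rho_k)  # Y^(k) * g(X^(k)) equals the normalized margin
```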

Because Z_S contains N independent outcomes from dF_E(x, y, k) = dF_E(X^{(k)}, Y^{(k)}), the large-margin theorem (Schapire et al. 1998, Theorem 2) states that with probability at least 1 − δ/2 over the choice of Z_S,

$$\int_{x,y,k} \llbracket Y^{(k)} g\bigl(X^{(k)}\bigr) \le 0 \rrbracket\, dF_E\bigl(X^{(k)}, Y^{(k)}\bigr)
\;\le\; \frac{1}{N} \sum_{n=1}^{N} \llbracket Y_n^{(k_n)} g\bigl(X_n^{(k_n)}\bigr) \le \Delta \rrbracket
+ O\!\left(\frac{1}{\sqrt{N}}\left(\frac{d_E \log^2(N/d_E)}{\Delta^2} + \log\frac{1}{\delta}\right)^{1/2}\right). \tag{3.6}$$

Since Y^{(k)} · g(X^{(k)}) = ρ̃_k(x, y), the left-hand side is (1/(K−1)) π_a(r_θ, F).

Let b_n = ⟦Y_n^{(k_n)} g(X_n^{(k_n)}) ≤ ∆⟧ = ⟦ρ̃_{k_n}(x_n, y_n) ≤ ∆⟧, which is a Boolean random variable with mean (1/(K−1)) Σ_{k=1}^{K−1} ⟦ρ̃_k(x_n, y_n) ≤ ∆⟧. Using Hoeffding's inequality (Hoeffding 1963), when each b_n is chosen independently, with probability at least 1 − δ/2 over the choice of b_n,

$$\frac{1}{N}\sum_{n=1}^{N} b_n
\;\le\; \frac{1}{N}\sum_{n=1}^{N} \frac{1}{K-1}\sum_{k=1}^{K-1} \llbracket \tilde\rho_k(x_n, y_n) \le \Delta \rrbracket
+ O\!\left(\frac{1}{\sqrt{N}}\Bigl(\log\frac{1}{\delta}\Bigr)^{1/2}\right)
\;=\; \frac{1}{K-1}\,\nu_a(r_\theta, \Delta)
+ O\!\left(\frac{1}{\sqrt{N}}\Bigl(\log\frac{1}{\delta}\Bigr)^{1/2}\right). \tag{3.7}$$

The desired result can be proved by combining (3.6) and (3.7) with a union bound.
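For completeness, here is the combination spelled out. The left-hand side of (3.6) is (1/(K−1)) π_a(r_θ, F) and its first right-hand-side term is (1/N) Σ_n b_n, so chaining (3.6) and (3.7) gives, with probability at least 1 − δ,

```latex
\[
  \frac{1}{K-1}\,\pi_a(r_\theta, F)
  \;\le\;
  \frac{1}{K-1}\,\nu_a(r_\theta, \Delta)
  + O\!\left(\frac{1}{\sqrt{N}}
      \left(\frac{d_E \log^2(N/d_E)}{\Delta^2} + \log\frac{1}{\delta}\right)^{1/2}\right),
\]
```

and multiplying both sides by (K−1) yields the O(K/√N) complexity term stated in Theorem 3.2.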

Similarly, if we look at the boundary cost,

$$\pi_b(r_\theta, F, \Delta) = \int_{x,y} \int_{k} \llbracket \tilde\rho_k(x, y) \le \Delta \rrbracket\, dF_E(k \mid x, y)\, dF(x, y)\,,$$

for some probability measure dF_E(k | x, y) on {L, R}. Then, a similar proof leads to the following theorem.

Theorem 3.3. Under the same conditions as in Theorem 3.2,

$$\pi_b(r_\theta, F) \le \nu_b(r_\theta, \Delta) + O\!\left(\frac{1}{\sqrt{N}}\left(\frac{d_E \log^2(N/d_E)}{\Delta^2} + \log\frac{1}{\delta}\right)^{1/2}\right).$$

Then, a large-margin bound of the classification cost can be immediately derived by applying (3.3).

Corollary 3.4. Under the same conditions as in Theorem 3.2,

$$\pi_c(r_\theta, F) \le 2\,\nu_c(r_\theta, \Delta) + O\!\left(\frac{1}{\sqrt{N}}\left(\frac{d_E \log^2(N/d_E)}{\Delta^2} + \log\frac{1}{\delta}\right)^{1/2}\right).$$

Note that because of (3.3) and

$$\frac{1}{K-1}\,\pi_a(r_\theta, F, \Delta) \;\le\; \pi_c(r_\theta, F, \Delta) \;\le\; \pi_a(r_\theta, F, \Delta)\,,$$

we can use either the classification, the boundary, or the absolute cost on the right-hand side and the left-hand side of the bounds, with some changes within O(K).

The bounds above can be generalized when H contains confidence functions rather than binary classifiers. Even more generally, similar bounds can be derived for any threshold model, as shown below. The bounds provide motivation for building algorithms with margin-related formulations.

Theorem 3.5. Let G_E = {g : g(X^{(k)}) = H(x) − θ_k, H ∈ P}. Consider some ε > 0 and ∆ > 0. When each (x_n, y_n) is generated independently from dF(x, y),

$$\mathrm{Prob}\Bigl\{\exists\, r_\theta \in R_P \text{ such that } \pi_a(r_\theta, F) > \nu_a(r_\theta, \Delta) + K\epsilon \Bigr\}
\le 2\,\mathcal{N}\!\left(G_E, \tfrac{\Delta}{2}, \tfrac{\epsilon}{8}, 2N\right) \exp\!\left(-\frac{\epsilon^2 N}{32}\right) + \exp\bigl(-2\epsilon^2 N\bigr).$$

Furthermore,

$$\mathrm{Prob}\Bigl\{\exists\, r_\theta \in R_P \text{ such that } \pi_b(r_\theta, F) > \nu_b(r_\theta, \Delta) + 2\epsilon \Bigr\}
\le 2\,\mathcal{N}\!\left(G_E, \tfrac{\Delta}{2}, \tfrac{\epsilon}{8}, 2N\right) \exp\!\left(-\frac{\epsilon^2 N}{32}\right) + \exp\bigl(-2\epsilon^2 N\bigr),$$

and

$$\mathrm{Prob}\Bigl\{\exists\, r_\theta \in R_P \text{ such that } \pi_c(r_\theta, F) > 2\,\nu_c(r_\theta, \Delta) + 4\epsilon \Bigr\}
\le 2\,\mathcal{N}\!\left(G_E, \tfrac{\Delta}{2}, \tfrac{\epsilon}{8}, 2N\right) \exp\!\left(-\frac{\epsilon^2 N}{32}\right) + \exp\bigl(-2\epsilon^2 N\bigr),$$

where N(G, ∆, ε, N) is the maximum size of the smallest ε-sloppy ∆-cover of G over all possible sets of N examples, as defined by Schapire et al. (1998).

Proof. The proof extends the results of Schapire et al. (1998, Theorem 4) with essentially the same technique as discussed in the proof of Theorem 3.2.

Theorem 3.5 implies that if the term N(G_E, ∆, ε, N) is polynomial in N, the probability that the out-of-sample cost deviates much from the in-sample ∆-cost is small.

Similar to the work of Bartlett (1998), the theorem can be used to provide cost bounds for threshold rankers based on neural networks, which can be thought of as a special form of the threshold ensemble ranker.