Large-Margin Thresholded Ensembles for Ordinal Regression
Hsuan-Tien Lin and Ling Li
Learning Systems Group, California Institute of Technology, U.S.A.
Conf. on Algorithmic Learning Theory, October 9, 2006
Ordinal Regression Problem
Ordinal Regression
what is the age-group of the person in the picture?
[Figure: example pictures of people at different ages, each labeled with its age-group rank]
rank: a finite ordered set of labels $\mathcal{Y} = \{1, 2, \ldots, K\}$
ordinal regression: given a training set $\{(x_n, y_n)\}_{n=1}^{N}$, find a decision function $g$ that predicts the ranks of unseen examples well
e.g. ranking movies, ranking by document relevance, etc.
matching human preferences:
applications in social science and info. retrieval
Properties of Ordinal Regression
regression without a metric:
- a metric may underlie the ranks (e.g. age), but it is not encoded in $\{1,2,3,4\}$
- monotonic invariance: relabeling by $\{2,3,5,7\}$ should not change the results
- general regression deteriorates without the metric

classification with ordered categories:
- small mistake: classify a teenager as a child; big mistake: classify an infant as an adult
- no shuffle invariance: relabeling by $\{3,1,2,4\}$ loses information
- general classification cannot use the ordering information

ordinal regression resides uniquely between classification and regression
Error Functions for Ordinal Regression
two aspects of ordinal regression:
- determine the exact category – discrete nature
- or at least make a close prediction – ordering preference

categorical prediction: classification error $L_C(g,x,y) = [\![\, g(x) \neq y \,]\!]$
close prediction: absolute error $L_A(g,x,y) = |g(x) - y|$
neither perfect; both common
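To make the contrast concrete, here is a minimal sketch in plain Python (function names are ours, not from the paper):

```python
def classification_error(g_x, y):
    """L_C: 1 if the predicted rank differs from the true rank, else 0."""
    return float(g_x != y)

def absolute_error(g_x, y):
    """L_A: how many ranks the prediction is off by."""
    return float(abs(g_x - y))

# L_C treats both mistakes below the same; L_A charges the big one more:
print(classification_error(1, 2), absolute_error(1, 2))  # teenager as child: 1.0 1.0
print(classification_error(4, 1), absolute_error(4, 1))  # infant as adult:   1.0 3.0
```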
Our Contributions
new model for ordinal regression: thresholded ensemble model – combines thresholding and ensemble learning
new generalization bounds for thresholded ensembles – theoretical guarantee of performance
new algorithms for constructing thresholded ensembles – simple and efficient
[Figure: three panels – the target, a traditional regression fit, and our ordinal regression fit]
promising experimental results
Thresholded Ensemble Model
Thresholded Model
commonly used in previous work:
- thresholded perceptrons (PRank, Crammer and Singer, 2005)
- thresholded SVMs (SVOR, Chu and Keerthi, 2005)
prediction procedure:
1. compute a potential function $H(x)$ (e.g. raw perceptron output)
2. quantize $H(x)$ by some ordered $\theta$ to get $g(x)$
[Diagram: the real line of $H(x)$, cut by thresholds $\theta_1 < \theta_2 < \theta_3$ into ranks $g(x) = 1, 2, 3, 4$]
thresholded model:
$$g(x) \equiv g_{H,\theta}(x) = \min\{k : H(x) < \theta_k\}$$
(with the convention $\theta_K = \infty$, so $g(x) = K$ when $H(x) \ge \theta_{K-1}$)
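A minimal sketch of the prediction rule in Python (the function name and the toy thresholds are ours):

```python
def thresholded_predict(H_x, thetas):
    """g_{H,theta}(x) = min{k : H(x) < theta_k}, with theta_K = +infinity.

    `thetas` holds the K-1 ordered thresholds theta_1 <= ... <= theta_{K-1};
    the returned rank is in {1, ..., K}.
    """
    for k, theta_k in enumerate(thetas, start=1):
        if H_x < theta_k:
            return k
    return len(thetas) + 1  # H(x) >= theta_{K-1}: the top rank K

# toy thresholds theta = (-1.0, 0.0, 2.5) for K = 4 ranks:
print(thresholded_predict(-2.0, (-1.0, 0.0, 2.5)))  # 1
print(thresholded_predict(1.0, (-1.0, 0.0, 2.5)))   # 3
print(thresholded_predict(3.0, (-1.0, 0.0, 2.5)))   # 4
```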
Thresholded Ensemble Model
the potential function $H(x)$ is a weighted ensemble:
$$H(x) \equiv H_T(x) = \sum_{t=1}^{T} w_t h_t(x)$$
intuition: combine preferences to estimate the overall confidence – e.g. if many people $h_t$ say a movie $x$ is "good", the confidence $H(x)$ of the movie should be high
good theoretical and algorithmic properties inherited from ensemble learning for classification
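Continuing the sketch, the potential is just a weighted vote; the three "reviewers" below are hypothetical, and `thresholded_predict` is reused from the previous sketch:

```python
def ensemble_potential(x, hypotheses, weights):
    """H_T(x) = sum_t w_t * h_t(x): a weighted vote of base preferences."""
    return sum(w * h(x) for h, w in zip(hypotheses, weights))

# three hypothetical reviewers voting in {-1, +1} on a movie x:
hs = [lambda x: +1.0, lambda x: -1.0, lambda x: +1.0]
ws = [0.5, 0.3, 0.8]
H = ensemble_potential("movie", hs, ws)          # 0.5 - 0.3 + 0.8 = 1.0
print(thresholded_predict(H, (-1.0, 0.0, 2.5)))  # rank 3
```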
New Large-Margin Bounds of Thresholded Ensembles
Margins of Thresholded Ensembles
[Diagram: margins $\rho_1, \rho_2, \rho_3$ – the distances from an example's $H(x)$ to each of the thresholds $\theta_1, \theta_2, \theta_3$]
margin: safe from the boundary
normalized margin for a thresholded ensemble:
$$\bar\rho(x,y,k) = \begin{cases} H_T(x) - \theta_k, & \text{if } y > k \\ \theta_k - H_T(x), & \text{if } y \le k \end{cases} \Bigg/ \left(\sum_{t=1}^{T} |w_t| + \sum_{k=1}^{K-1} |\theta_k|\right)$$
negative margin $\Longleftrightarrow$ wrong prediction:
$$\sum_{k=1}^{K-1} [\![\, \bar\rho(x,y,k) \le 0 \,]\!] = |g(x) - y| = L_A(g,x,y)$$
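A sketch of the margin computation, assuming the normalizer is the 1-norm $\sum_t |w_t| + \sum_k |\theta_k|$ as written above (and reusing `thresholded_predict` from the earlier sketch):

```python
def normalized_margins(H_x, y, thetas, weights):
    """bar_rho(x, y, k) for k = 1, ..., K-1; the normalizer is assumed
    to be sum_t |w_t| + sum_k |theta_k|."""
    norm = sum(abs(w) for w in weights) + sum(abs(t) for t in thetas)
    return [((H_x - t) if y > k else (t - H_x)) / norm
            for k, t in enumerate(thetas, start=1)]

# the number of non-positive margins equals the absolute error |g(x) - y|:
thetas, ws = (-1.0, 0.0, 2.5), (0.5, 0.3, 0.8)
rhos = normalized_margins(1.0, 1, thetas, ws)  # y = 1 but g(x) = 3
assert sum(r <= 0 for r in rhos) == abs(thresholded_predict(1.0, thetas) - 1) == 2
```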
New Large-Margin Bounds for the Model
core results: if $(x_n, y_n)$ are i.i.d. from $\mathcal{D}$, then with probability $> 1-\delta$, for all $\Delta > 0$,
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K-1} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!] + O\!\left(K\sqrt{\frac{1}{N}\left(\frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta}\right)}\right)$$
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_C(g,x,y) \le \frac{2}{N}\sum_{n=1}^{N}\sum_{k=y_n-1}^{y_n} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!] + O\!\left(\sqrt{\frac{1}{N}\left(\frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta}\right)}\right)$$
sketch of the proof (illustrated with $L_A$):
1. reduce ordinal regression examples to dependent binary examples
2. extract i.i.d. binary examples; apply existing classification bounds
3. bound the deviation caused by the i.i.d. extraction

$\Longrightarrow$ large-margin thresholded ensembles could generalize
Reduction to Binary Classification
[Diagram: each threshold $\theta_k$ turns the examples on the $H(x)$ line into binary labels $-$ / $+$]
$K-1$ binary classification problems, one w.r.t. each $\theta_k$

encode $(x,y,k)$ as $\big((X)_k, (Y)_k\big) = \big((x, \mathbf{1}_k),\ \operatorname{sign}(y - k - 0.5)\big)$:
$$\bar\rho(x,y,k) \propto (Y)_k \big(H_T(x) - \langle \theta, \mathbf{1}_k \rangle\big) = \text{binary classifier margin } \rho_C\big((X)_k,(Y)_k\big)$$

key observation (with $k \sim \mathcal{K}$ uniform on $\{1, \ldots, K-1\}$):
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) = \mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{k=1}^{K-1} [\![\, \bar\rho(x,y,k) \le 0 \,]\!] = (K-1)\,\mathbb{E}_{(x,y)\sim\mathcal{D},\,k\sim\mathcal{K}}\, [\![\, \bar\rho(x,y,k) \le 0 \,]\!] = (K-1)\,\mathbb{E}_{((X)_k,(Y)_k)\sim\hat{\mathcal{D}}}\, [\![\, \rho_C\big((X)_k,(Y)_k\big) \le 0 \,]\!]$$

ordinal regression problem $\Longrightarrow$ one big joint binary classification problem
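A sketch of the encoding step ($\mathbf{1}_k$ represented as a plain list; names ours):

```python
def reduce_to_binary(x, y, K):
    """Encode the ordinal example (x, y) as K-1 binary examples:
    (X)_k = (x, 1_k) and (Y)_k = sign(y - k - 0.5), i.e. whether
    the true rank lies above threshold k."""
    examples = []
    for k in range(1, K):
        one_k = [1.0 if j == k else 0.0 for j in range(1, K)]  # the vector 1_k
        Y_k = 1 if y > k else -1
        examples.append(((x, one_k), Y_k))
    return examples

# rank y = 3 of K = 4 lies above thresholds 1 and 2, below threshold 3:
print([Y for (_, Y) in reduce_to_binary("x", 3, 4)])  # [1, 1, -1]
```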
Extraction of Independent Examples
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) = (K-1)\,\mathbb{E}_{((X)_k,(Y)_k)\sim\hat{\mathcal{D}}}\, [\![\, \rho_C\big((X)_k,(Y)_k\big) \le 0 \,]\!]$$
testing distribution $\hat{\mathcal{D}}$ of $\big((X)_k,(Y)_k\big)$: derived from $(x,y,k) \sim \mathcal{D} \times \mathcal{K}$
extended training examples $\hat{S} = \big\{\big((X_n)_k, (Y_n)_k\big)\big\}$: not i.i.d. from $\hat{\mathcal{D}}$; cannot be directly used in existing bounds
i.i.d. subset of $\hat{S}$: randomly choose one $k_n$ for each $n$ (a sketch of this step follows below)
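A toy illustration of that extraction, assuming each $k_n$ is drawn uniformly from $\{1, \ldots, K-1\}$ as in the proof sketch:

```python
import random

def extract_iid_subset(ordinal_examples, K, rng=random):
    """Keep one randomly chosen binary example per ordinal example,
    so the kept triples (x_n, y_n, k_n) are i.i.d. from D x K."""
    return [(x, y, rng.randrange(1, K)) for (x, y) in ordinal_examples]

# each (x_n, y_n) contributes exactly one of its K-1 binary examples:
print(extract_iid_subset([("x1", 2), ("x2", 4)], K=4))
```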
apply the ensemble learning bound (Schapire et al., 1998): if $(x_n,y_n,k_n)$ are i.i.d. from $\mathcal{D} \times \mathcal{K}$, then with prob. $> 1-\delta$, $\forall \Delta > 0$,
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{K-1}{N}\sum_{n=1}^{N} [\![\, \bar\rho(x_n,y_n,k_n) \le \Delta \,]\!] + O\!\left(K\sqrt{\frac{1}{N}\left(\frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta}\right)}\right)$$
can we obtain a deterministic RHS?
Deviation from the Extraction
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{K-1}{N}\sum_{n=1}^{N} [\![\, \bar\rho(x_n,y_n,k_n) \le \Delta \,]\!] + O\!\left(K\sqrt{\frac{1}{N}\left(\frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta}\right)}\right)$$
let $b_n = [\![\, \bar\rho(x_n,y_n,k_n) \le \Delta \,]\!]$: independent binary random variables with means
$$\mu_n = \frac{1}{K-1}\sum_{k=1}^{K-1} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!]$$
extended Chernoff bound: with prob. $> 1-\delta$,
$$\frac{K-1}{N}\sum_{n=1}^{N} b_n \le \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K-1} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!] + O\!\left(\sqrt{\frac{1}{N}\log\frac{1}{\delta}}\right)$$
connection between the bounds and algorithm design? boosting
New Large-Margin Algorithms for Thresholded Ensembles
Boosting for Large-Margin Thresholded Ensembles
existing algorithm (RankBoost, Freund et al., 2003): constructs $H_T$ iteratively with some margin concepts, but no $\theta$

our work:
- RankBoost-AE: extended RankBoost for ordinal regression – obtains $\theta$ by minimizing the training $L_A$ with dynamic programming
- ORBoost: a new boosting formulation for ordinal regression
ORBoost:
simpler and faster than existing approaches;
connects well to large-margin bounds
ORBoost: Ordinal Regression Boosting
inspired by AdaBoost, which operationally solves
$$\min \sum_{n=1}^{N} \exp\big(-\rho(x_n,y_n)\big) \approx \max\ \operatorname{softmin}_n\, \rho(x_n,y_n)$$
ORBoost:
$$\min \sum_{n=1}^{N}\sum_{k} \exp\big(-\rho(x_n,y_n,k)\big) \ \ge\ \text{const} \cdot \sum_{n=1}^{N}\sum_{k} [\![\, \rho(x_n,y_n,k) \le \Delta \,]\!]$$
ORBoost-LR: $k \in \{y_n - 1, y_n\}$ – connects to the bound on $L_C$
ORBoost-All: $k \in \{1, 2, \ldots, K-1\}$ – connects to the bound on $L_A$
algorithmic derivation based on theoretical bounds
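A sketch of the surrogate that ORBoost minimizes, written with unnormalized margins; this shows the objective only, not the per-iteration updates of $h_t$, $w_t$, and $\theta$ derived in the paper:

```python
import math

def orboost_objective(H_values, ys, thetas, variant="all"):
    """sum_n sum_k exp(-rho(x_n, y_n, k)), where
    rho(x, y, k) = H(x) - theta_k if y > k, else theta_k - H(x).

    variant="all": k in {1, ..., K-1}  (ORBoost-All, tied to the L_A bound)
    variant="lr" : k in {y_n - 1, y_n} (ORBoost-LR,  tied to the L_C bound)
    """
    K = len(thetas) + 1
    total = 0.0
    for H_x, y in zip(H_values, ys):
        if variant == "all":
            ks = range(1, K)
        else:  # only the thresholds adjacent to the true rank
            ks = [k for k in (y - 1, y) if 1 <= k <= K - 1]
        for k in ks:
            rho = (H_x - thetas[k - 1]) if y > k else (thetas[k - 1] - H_x)
            total += math.exp(-rho)
    return total

# lower objective <=> larger margins on every (example, threshold) pair:
print(orboost_objective([1.0, -2.0], [3, 1], (-1.0, 0.0, 2.5)))
```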
Advantages of ORBoost
ensemble learning: combines simple preferences to approximate complex targets
thresholding: adaptively estimates the scale to perform ordinal regression
benefits inherited from AdaBoost:
simple implementation
if the $h_t$ are good enough: guarantee on rapidly minimizing $\sum_{n,k} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!]$
– decision function g improves with T
ORBoost not very vulnerable to overfitting in practice
Experimental Results
ORBoost vs. RankBoost
[Bar chart: test absolute error of RankBoost vs. ORBoost on eight benchmark datasets (py, ma, bo, ab, ba, co, ca, ce)]
Results: ORBoost-All significantly better than RankBoost (the best existing boosting approach)
simpler to implement and less vulnerable to overfitting
ORBoost: promising boosting approach for ordinal regression
ORBoost vs. SVOR
[Bar chart: test absolute error of SVOR vs. ORBoost on the same eight benchmark datasets]
Results: ORBoost-All comparable to SVOR (the state-of-the-art algorithm), yet much faster in training (1 hour vs. 2 days on 6000 examples)
ORBoost: could be especially useful for large-scale tasks
Conclusion
thresholded ensemble model: useful for ordinal regression
theoretical reduction: new large-margin bounds
algorithmic reduction: new learning algorithms
ORBoost:
- simpler and better-performing than the existing boosting algorithm
- comparable performance to state-of-the-art algorithms
- fast training and not very vulnerable to overfitting
broader reduction view: many more bounds/algorithms and more general error functions (Li and Lin, NIPS 2006)
Thank you. Questions?