Large-Margin Thresholded Ensembles for Ordinal Regression
Hsuan-Tien Lin
(accepted by ALT ’06, joint work with Ling Li)
Learning Systems Group, Caltech
Workshop Talk at MLSS 2006, Taipei, Taiwan, 07/25/2006
Ordinal Regression Problem
Reduction Method
Algorithmic
1 identify the type of learning problem (ordinal regression)
2 find a premade reduction (thresholded ensemble) and an oracle learning algorithm (AdaBoost)
3 build an ordinal regression rule using the reduction (ORBoost) + data
Theoretical
1 identify the type of learning problem (ordinal regression)
2 find a premade reduction (thresholded ensemble) and known generalization bounds (large-margin ensembles)
3 derive a new bound (large-margin thresholded ensembles) using the reduction + the known bound
this work: a concrete instance of reductions
Ordinal Regression
what is the age-group of the person in the picture?
[figure: four face pictures, each guessed into an age group g ∈ {1, 2, 3, 4}]
rank: a finite ordered set of labels Y = {1, 2, · · · , K}
ordinal regression:
given a training set {(x_n, y_n)}_{n=1}^{N}, find a decision function g that predicts the ranks of unseen examples well
e.g. ranking movies, ranking by document relevance, etc.
matching human preferences:
applications in social science and info. retrieval
Properties of Ordinal Regression
regression without metric:
a metric may underlie the ranks (e.g. age), but it is not encoded in {1, 2, 3, 4}
classification with ordered categories:
small mistake – classify a teenager as a child;
big mistake – classify an infant as an adult
common loss functions:
determine the category: classification error L_C(g, x, y) = [[ g(x) ≠ y ]]
or at least have a close prediction: absolute error L_A(g, x, y) = |g(x) − y|
will talk about L_A only; similar results hold for L_C
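The contrast between the two losses can be sketched in a few lines of Python (a toy illustration; the rank encoding {1, ..., K} follows the slides, the function names are ours):

```python
def loss_classification(pred, y):
    # L_C(g, x, y) = [[g(x) != y]]: 1 for any wrong category, 0 otherwise
    return int(pred != y)

def loss_absolute(pred, y):
    # L_A(g, x, y) = |g(x) - y|: penalizes by how many ranks the guess is off
    return abs(pred - y)

# classify a teenager (rank 2) as a child (rank 1): small mistake
print(loss_classification(1, 2), loss_absolute(1, 2))   # 1 1
# classify an infant (rank 1) as an adult (rank 4): same L_C, much larger L_A
print(loss_classification(4, 1), loss_absolute(4, 1))   # 1 3
```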
Thresholded Ensemble Model
Thresholded Model for Ordinal Regression
naive algorithm for ordinal regression:
1 do general regression on{(xn,yn)}, and get H(x) – general regression performs badly without metric
2 set g(x) =clip(round(H(x)))
– roundoff operation (uniform quantization) cause large error improved and generalized algorithm:
1 estimate a potential function H(x)
2 quantize H(x)by some orderedθto get g(x)
[figure: the H(x) axis divided by ordered thresholds θ_1, θ_2, θ_3 into regions with g(x) = 1, 2, 3, 4]
thresholded model: g(x) ≡ g_{H,θ}(x) = min{ k : H(x) < θ_k }
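A minimal sketch of the thresholded decision rule (the example thresholds are hypothetical; rank K is returned when H(x) clears every threshold):

```python
import bisect

def predict(H_x, thetas):
    # g(x) = min{k : H(x) < theta_k}; thetas = [theta_1, ..., theta_{K-1}]
    # are ordered, and rank K is returned when H(x) exceeds them all.
    return bisect.bisect_right(thetas, H_x) + 1

thetas = [-1.0, 0.5, 2.0]       # hypothetical ordered thresholds, K = 4
print(predict(-3.0, thetas))    # 1: below theta_1
print(predict(0.0, thetas))     # 2: between theta_1 and theta_2
print(predict(5.0, thetas))     # 4: above all thresholds
```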
Thresholded Ensemble Model
the potential function H(x) is a weighted ensemble:
H(x) ≡ H_T(x) = Σ_{t=1}^{T} w_t h_t(x)
intuition: combine preferences to estimate the overall confidence
e.g. if many people, h_t, say a movie x is “good”,
the confidence of the movie, H(x), should be high
h_t can be binary, multi-valued, or continuous
w_t < 0: allows reversing bad preferences
thresholded ensemble model:
ensemble learning for ordinal regression
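The weighted-ensemble potential above can be sketched directly (the base preferences h_t and weights w_t below are illustrative, not from the slides):

```python
def H(x, hs, ws):
    # H(x) = sum_t w_t * h_t(x): weighted votes of the base preferences
    return sum(w * h(x) for h, w in zip(hs, ws))

# two hypothetical binary preferences ("is the movie good?") and weights
hs = [lambda x: 1 if x > 0 else -1,
      lambda x: 1 if x > 2 else -1]
ws = [0.7, 0.5]

print(H(3.0, hs, ws))    # 1.2: both voters say "good", high confidence
print(H(-1.0, hs, ws))   # -1.2: both voters say "bad"
```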
Bounds for Large-Margin Thresholded Ensembles
Margins of Thresholded Ensembles
[figure: thresholds θ_1, θ_2, θ_3 on the H(x) axis with regions g(x) = 1, 2, 3, 4; margins ρ_1, ρ_2, ρ_3 measure each example’s distance to the threshold boundaries]
margin: safe from the boundary
normalized margin for a thresholded ensemble:
ρ̄(x, y, k) = { H_T(x) − θ_k, if y > k
             { θ_k − H_T(x), if y ≤ k
divided by ( Σ_{t=1}^{T} |w_t| + Σ_{k=1}^{K−1} |θ_k| )
negative margin ⇐⇒ wrong prediction:
Σ_{k=1}^{K−1} [[ ρ̄(x, y, k) ≤ 0 ]] = |g(x) − y|
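The relation between non-positive margins and the absolute error can be checked numerically (a toy sketch; the thresholds and weights below are hypothetical):

```python
def predict(H_x, thetas):
    # g(x) = min{k : H(x) < theta_k}; rank K if H(x) clears all thresholds
    for k, theta in enumerate(thetas, start=1):
        if H_x < theta:
            return k
    return len(thetas) + 1

def normalized_margins(H_x, y, thetas, ws):
    # rho_bar(x, y, k): positive iff H(x) lies on the correct side of theta_k
    norm = sum(abs(w) for w in ws) + sum(abs(t) for t in thetas)
    return [(H_x - t if y > k else t - H_x) / norm
            for k, t in enumerate(thetas, start=1)]

thetas, ws = [-1.0, 0.5, 2.0], [0.7, 0.5]   # hypothetical model, K = 4
H_x, y = 1.2, 1                             # true rank 1, H(x) overshoots
rhos = normalized_margins(H_x, y, thetas, ws)
wrong = sum(1 for r in rhos if r <= 0)
# number of non-positive margins equals the absolute error |g(x) - y|
print(wrong, abs(predict(H_x, thetas) - y))  # 2 2
```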
Theoretical Reduction
[figure: each threshold θ_k splits the H(x) axis into its own binary problem, with examples below θ_k labeled − and those above labeled +]
(K − 1) binary classification problems w.r.t. each θ_k:
(X)_k = (x, k), (Y)_k = +/−
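The per-threshold reduction can be written out explicitly (a sketch; the function name is ours):

```python
def to_binary_examples(x, y, K):
    # one ordinal example (x, y) becomes K-1 binary examples,
    # one per threshold theta_k: label +1 iff the true rank exceeds k
    return [((x, k), +1 if y > k else -1) for k in range(1, K)]

# a rank-3 example with K = 4 ranks
print(to_binary_examples("photo", 3, 4))
# [(('photo', 1), 1), (('photo', 2), 1), (('photo', 3), -1)]
```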
(Schapire et al., 1998) binary classification: with probability at least 1 − δ, for all Δ > 0 and binary classifiers g_c,
E_{(X,Y)∼D′} [[ g_c(X) ≠ Y ]] ≤ (1/N) Σ_{n=1}^{N} [[ ρ̄(X_n, Y_n) ≤ Δ ]] + O( (log N)/√N, 1/Δ, √(log(1/δ)) )
(Lin and Li, 2006) ordinal regression: with similar settings, for all thresholded ensembles g,
E_{(x,y)∼D} L_A(g, x, y) ≤ (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K−1} [[ ρ̄(x_n, y_n, k) ≤ Δ ]] + O( K, (log N)/√N, 1/Δ, √(log(1/δ)) )
large-margin thresholded ensembles can generalize
Algorithms for Large-Margin Thresholded Ensembles
Algorithmic Reduction
(Freund and Schapire, 1996) AdaBoost: binary classification by operationally optimizing
min Σ_{n=1}^{N} exp( −ρ(x_n, y_n) ) ≈ max softmin_n ρ̄(x_n, y_n)
(Lin and Li, 2006) ORBoost-LR (left-right):
min Σ_{n=1}^{N} Σ_{k=y_n−1}^{y_n} exp( −ρ(x_n, y_n, k) )
ORBoost-All:
min Σ_{n=1}^{N} Σ_{k=1}^{K−1} exp( −ρ(x_n, y_n, k) )
algorithmic reduction to AdaBoost
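The ORBoost-All objective can be sketched as follows (a toy illustration; the potential function, thresholds, and data are hypothetical, and margins are left unnormalized as in the AdaBoost-style objective):

```python
import math

def orboost_all_loss(data, H, thetas):
    # sum_n sum_{k=1}^{K-1} exp(-rho(x_n, y_n, k))
    total = 0.0
    for x, y in data:
        for k, theta in enumerate(thetas, start=1):
            rho = H(x) - theta if y > k else theta - H(x)  # margin at theta_k
            total += math.exp(-rho)
    return total

thetas = [-1.0, 0.5, 2.0]             # hypothetical ordered thresholds, K = 4
H = lambda x: 0.9 * x                 # hypothetical potential function
data = [(1.0, 2), (-2.0, 1), (3.0, 4)]
print(orboost_all_loss(data, H, thetas))
```

Larger margins mean smaller loss, so driving this objective down pushes every example away from every threshold on the correct side.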
Advantages of ORBoost
ensemble learning: combine simple preferences to approximate complex targets
thresholds: adaptively estimated scales to perform ordinal regression
inherited from AdaBoost:
simple implementation
guarantee on minimizing Σ_{n,k} [[ ρ̄(x_n, y_n, k) ≤ Δ ]]
fast in practice
less vulnerable to overfitting
useful properties are inherited through the reduction
ORBoost Experiments
[figure: bar chart of test absolute error (0 to 2.5) for RankBoost, ORBoost, and SVM on eight benchmark datasets: py, ma, ho, ab, ba, co, ca, ce]
Results (ORBoost-All):
ORBoost-All is simpler than, and much better than, RankBoost (Freund et al., 2003)
ORBoost-All is much faster than, and comparable to, SVM (Chu and Keerthi, 2005)
similar results hold for ORBoost-LR
Conclusion
thresholded ensemble model: useful for ordinal regression
theoretical reduction: new large-margin bounds
algorithmic reduction: new training algorithms – ORBoost
ORBoost:
simpler than existing boosting algorithms
comparable performance to state-of-the-art algorithms
fast training and less vulnerable to overfitting
on-going work: similar reduction techniques for other theoretical and algorithmic results with more general loss functions (Li and Lin, 2006)
Questions?