(1)

Large-Margin Thresholded Ensembles for Ordinal Regression

Hsuan-Tien Lin

(accepted by ALT ’06, joint work with Ling Li)

Learning Systems Group, Caltech

Workshop Talk in MLSS 2006, Taipei, Taiwan, 07/25/2006

(2)

Ordinal Regression Problem

Reduction Method

Algorithmic

1. identify the type of learning problem (ordinal regression)

2. find a premade reduction (thresholded ensemble) and an oracle learning algorithm (AdaBoost)

3. build an ordinal regression rule (ORBoost) using the reduction + data

Theoretical

1. identify the type of learning problem (ordinal regression)

2. find a premade reduction (thresholded ensemble) and known generalization bounds (large-margin ensembles)

3. derive a new bound (large-margin thresholded ensembles) using the reduction + known bound

this work: a concrete instance of reductions

(3)

Ordinal Regression Problem

Ordinal Regression

what is the age-group of the person in the picture?

[figure: example pictures of people labeled with age-groups 2, 1, 2, 3, and 4]

rank: a finite ordered set of labels Y = {1, 2, ..., K}

ordinal regression:

given a training set $\{(x_n, y_n)\}_{n=1}^{N}$, find a decision function g that predicts the ranks of unseen examples well

e.g. ranking movies, ranking by document relevance, etc.

matching human preferences:

applications in social science and info. retrieval

(4)

Ordinal Regression Problem

Properties of Ordinal Regression

regression without metric:

possibly a metric underlying (age), but not encoded in {1, 2, 3, 4}

classification with ordered categories:

small mistake – classify a teenager as a child;

big mistake – classify an infant as an adult

common loss functions:

determine the category: classification error $L_C(g, x, y) = \llbracket g(x) \neq y \rrbracket$

or at least have a close prediction: absolute error $L_A(g, x, y) = |g(x) - y|$

will talk about $L_A$ only; similar for $L_C$
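As a minimal sketch (not from the slides), the two losses in Python, assuming ranks are plain integers in {1, ..., K}; the function names are illustrative only:

```python
def classification_error(pred: int, y: int) -> int:
    """L_C: 1 if the predicted rank differs from the true rank, else 0."""
    return int(pred != y)


def absolute_error(pred: int, y: int) -> int:
    """L_A: how many ranks the prediction is away from the true rank."""
    return abs(pred - y)


# e.g. mistaking a child (1) for a teenager (2) costs 1 under L_A,
# while mistaking an infant (1) for an adult (4) costs 3.
print(absolute_error(2, 1), absolute_error(4, 1))  # 1 3
```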

(5)

Thresholded Ensemble Model

Thresholded Model for Ordinal Regression

naive algorithm for ordinal regression:

1. do general regression on {(x_n, y_n)} and get H(x)
   – general regression performs badly without a metric

2. set g(x) = clip(round(H(x)))
   – the roundoff operation (uniform quantization) causes large error

improved and generalized algorithm:

1. estimate a potential function H(x)

2. quantize H(x) by some ordered thresholds θ to get g(x)

[figure: the real line of H(x), cut by thresholds θ_1 < θ_2 < θ_3 into ranks 1, 2, 3, 4 = g(x)]

thresholded model: $g(x) \equiv g_{H,\theta}(x) = \min\{k : H(x) < \theta_k\}$
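A minimal sketch of the thresholded prediction rule in Python, assuming the thresholds are given as an ascending list θ_1 ≤ ... ≤ θ_{K−1}; when H(x) exceeds every threshold the rule falls through to rank K. Names and toy values are illustrative, not from the talk:

```python
from typing import Callable, Sequence


def thresholded_predict(H: Callable[[float], float],
                        thetas: Sequence[float], x: float) -> int:
    """g(x) = min{k : H(x) < theta_k}; falls through to rank K otherwise."""
    value = H(x)
    for k, theta in enumerate(thetas, start=1):
        if value < theta:
            return k
    return len(thetas) + 1  # rank K


# toy example with K = 4 and hypothetical thresholds -1, 0, 2
g = lambda x: thresholded_predict(lambda v: v, [-1.0, 0.0, 2.0], x)
print(g(-3.0), g(0.5), g(5.0))  # 1 3 4
```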

(6)

Thresholded Ensemble Model

Thresholded Ensemble Model

the potential function H(x) is a weighted ensemble: $H(x) \equiv H_T(x) = \sum_{t=1}^{T} w_t h_t(x)$

intuition: combine preferences to estimate the overall confidence

e.g. if many people, $h_t$, say a movie x is “good”, the confidence of the movie H(x) should be high

$h_t$ can be binary, multi-valued, or continuous

$w_t < 0$: allows reversing bad preferences

thresholded ensemble model:

ensemble learning for ordinal regression
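A small sketch of the potential function as a weighted vote, assuming binary ±1 base learners for concreteness (the slides note they may also be multi-valued or continuous); it plugs into the thresholded rule sketched earlier. All names and values are illustrative:

```python
from typing import Callable, Sequence


def ensemble_potential(hs: Sequence[Callable[[object], float]],
                       ws: Sequence[float], x: object) -> float:
    """H_T(x) = sum_t w_t * h_t(x); a negative w_t reverses a bad preference."""
    return sum(w * h(x) for h, w in zip(hs, ws))


# toy example: three +/-1 "reviewers"; the second one is reversed by w_t < 0
hs = [lambda x: 1.0, lambda x: -1.0, lambda x: 1.0]
ws = [0.5, -0.2, 1.0]
print(ensemble_potential(hs, ws, x=None))  # 0.5 + 0.2 + 1.0 = 1.7
```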

(7)

Bounds for Large-Margin Thresholded Ensembles

Margins of Thresholded Ensembles

[figure: H(x) axis with thresholds θ_1, θ_2, θ_3; margins ρ_1, ρ_2, ρ_3 of the examples to their neighbouring thresholds; ranks 1, 2, 3, 4 = g(x)]

margin: safe from the boundary

normalized margin for thresholded ensemble

$$\bar\rho(x, y, k) = \begin{cases} H_T(x) - \theta_k, & \text{if } y > k \\ \theta_k - H_T(x), & \text{if } y \le k \end{cases} \Bigg/ \left( \sum_{t=1}^{T} |w_t| + \sum_{k=1}^{K-1} |\theta_k| \right)$$

negative margin ⇐⇒ wrong prediction:

$$\sum_{k=1}^{K-1} \llbracket \bar\rho(x, y, k) \le 0 \rrbracket = |g(x) - y|$$
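A sketch that computes the K−1 normalized margins and checks the stated equivalence numerically, assuming the normalization by Σ_t |w_t| + Σ_k |θ_k| as reconstructed above; all names and values are illustrative:

```python
def normalized_margins(H_x: float, y: int, thetas, ws):
    """rho_bar(x, y, k) for k = 1..K-1, normalized by sum|w_t| + sum|theta_k|."""
    Z = sum(abs(w) for w in ws) + sum(abs(t) for t in thetas)
    return [(H_x - t) / Z if y > k else (t - H_x) / Z
            for k, t in enumerate(thetas, start=1)]


def predict(H_x: float, thetas) -> int:
    """g(x) = min{k : H(x) < theta_k}, defaulting to rank K."""
    return next((k for k, t in enumerate(thetas, start=1) if H_x < t),
                len(thetas) + 1)


thetas, ws = [-1.0, 0.0, 2.0], [0.5, -0.2, 1.0]
H_x, y = 1.7, 1
rhos = normalized_margins(H_x, y, thetas, ws)
# the number of non-positive margins equals the absolute error |g(x) - y|
assert sum(r <= 0 for r in rhos) == abs(predict(H_x, thetas) - y)
```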

(8)

Bounds for Large-Margin Thresholded Ensembles

Theoretical Reduction

[figure: the same H(x) axis; relative to each threshold θ_k, examples on the left are labeled “−” and examples on the right are labeled “+”]

(K − 1) binary classification problems w.r.t. each θ_k:

$(X)_k = (x, k), \quad (Y)_k = +/-$

(Schapire et al., 1998) binary classification: with probability at least 1 − δ, for all ∆ > 0 and binary classifiers $g_c$,

$$\mathbb{E}_{(X,Y)\sim D'}\, \llbracket g_c(X) \neq Y \rrbracket \;\le\; \frac{1}{N} \sum_{n=1}^{N} \llbracket \bar\rho(X_n, Y_n) \le \Delta \rrbracket \;+\; O\!\left(\frac{\log N}{N},\ \frac{1}{\Delta},\ \sqrt{\log\frac{1}{\delta}}\right)$$

(Lin and Li, 2006) ordinal regression: with similar settings, for all thresholded ensembles g,

$$\mathbb{E}_{(x,y)\sim D}\, L_A(g, x, y) \;\le\; \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} \llbracket \bar\rho(x_n, y_n, k) \le \Delta \rrbracket \;+\; O\!\left(K,\ \frac{\log N}{N},\ \frac{1}{\Delta},\ \sqrt{\log\frac{1}{\delta}}\right)$$

large-margin thresholded ensembles can generalize

(9)

Algorithms for Large-Margin Thresholded Ensembles

Algorithmic Reduction

(Freund and Schapire, 1996) AdaBoost: binary classification by operationally optimizing

$$\min \sum_{n=1}^{N} \exp\bigl(-\rho(x_n, y_n)\bigr) \;\approx\; \max\ \operatorname{softmin}_n\, \bar\rho(x_n, y_n)$$

(Lin and Li, 2006) ORBoost-LR (left-right):

$$\min \sum_{n=1}^{N} \sum_{k=y_n-1}^{y_n} \exp\bigl(-\rho(x_n, y_n, k)\bigr)$$

ORBoost-All:

$$\min \sum_{n=1}^{N} \sum_{k=1}^{K-1} \exp\bigl(-\rho(x_n, y_n, k)\bigr)$$

algorithmic reduction to AdaBoost
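A sketch of the two ORBoost objectives evaluated as plain functions of the unnormalized margins ρ(x_n, y_n, k); the actual ORBoost algorithms would minimize these by greedily adding base learners, AdaBoost-style, which is not shown here. Function names and the margin helper are illustrative:

```python
import math


def margin(H_x: float, y: int, k: int, thetas) -> float:
    """Unnormalized margin of an example to threshold theta_k."""
    return H_x - thetas[k - 1] if y > k else thetas[k - 1] - H_x


def orboost_all_objective(Hs, ys, thetas) -> float:
    """sum_n sum_{k=1}^{K-1} exp(-rho(x_n, y_n, k))"""
    K = len(thetas) + 1
    return sum(math.exp(-margin(H, y, k, thetas))
               for H, y in zip(Hs, ys) for k in range(1, K))


def orboost_lr_objective(Hs, ys, thetas) -> float:
    """sum_n over only the neighbouring thresholds k = y_n - 1, y_n (clipped to 1..K-1)."""
    K = len(thetas) + 1
    return sum(math.exp(-margin(H, y, k, thetas))
               for H, y in zip(Hs, ys)
               for k in range(max(y - 1, 1), min(y, K - 1) + 1))
```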

(10)

Algorithms for Large-Margin Thresholded Ensembles

Advantages of ORBoost

ensemble learning: combine simple preferences to approximate complex targets

threshold: adaptively estimated scales to perform ordinal regression

inherit from AdaBoost:

simple implementation

guarantee on minimizing $\sum_{n,k} \llbracket \bar\rho(x_n, y_n, k) \le \Delta \rrbracket$

fast

practically less vulnerable to overfitting

useful properties inherited with reduction

(11)

Algorithms for Large-Margin Thresholded Ensembles

ORBoost Experiments

[figure: bar chart of test absolute error (0 to 2.5) comparing RankBoost, ORBoost, and SVM on the py, ma, ho, ab, ba, co, ca, and ce benchmark datasets]

Results (ORBoost-All):

ORBoost-All is simpler than, and performs much better than, RankBoost (Freund et al., 2003)

ORBoost-All is much faster than, and comparable to, SVM (Chu and Keerthi, 2005)

similar for ORBoost-LR

(12)

Conclusion

Conclusion

thresholded ensemble model: useful for ordinal regression

theoretical reduction: new large-margin bounds

algorithmic reduction: new training algorithms – ORBoost

ORBoost:

simplicity over existing boosting algorithms

comparable performance to state-of-the-art algorithms

fast training and less vulnerable to overfitting

on-going work: similar reduction technique for other theoretical and algorithmic results with more general loss functions (Li and Lin, 2006)

Questions?
