(1)

Large-Margin Thresholded Ensembles for Ordinal Regression

Hsuan-Tien Lin and Ling Li

Learning Systems Group, California Institute of Technology, U.S.A.

Conf. on Algorithmic Learning Theory, October 9, 2006

(2)

Ordinal Regression Problem

Ordinal Regression

what is the age-group of the person in the picture?

[Figure: example pictures labeled with age-group ranks g = 2, 1, 2, 3, 4]

rank: a finite ordered set of labels Y = {1, 2, · · · , K}

ordinal regression:

given a training set {(x_n, y_n)}_{n=1}^{N}, find a decision function g that predicts the ranks of unseen examples well

e.g. ranking movies, ranking by document relevance, etc.

matching human preferences:

applications in social science and info. retrieval
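
For concreteness, the data format of such a task might look as follows (a toy illustration only; the feature values and ranks are made up):

    # toy ordinal regression data: N examples with feature vectors x_n and ranks y_n in {1, ..., K}
    import numpy as np

    K = 4                                                  # e.g. four age-groups
    X = np.array([[0.2, 1.5], [0.8, 0.3], [1.1, 2.0]])     # feature vectors x_n
    y = np.array([1, 3, 4])                                # ordinal labels y_n; the order matters, the metric does not

The goal is a decision function g(x) returning one of the K ranks.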

(3)

Ordinal Regression Problem

Properties of Ordinal Regression

regression without metric:

possibly a metric underlying (age), but it is not encoded in {1, 2, 3, 4}

monotonic invariance
– relabeling by {2, 3, 5, 7} should not change the results

general regression deteriorates without the metric

classification with ordered categories:

small mistake – classifying a teenager as a child;

big mistake – classifying an infant as an adult

no shuffle invariance
– relabeling by {3, 1, 2, 4} loses information

general classification cannot use the ordering information

ordinal regression resides uniquely

between classification and regression

(4)

Ordinal Regression Problem

Error Functions for Ordinal Regression

two aspects of ordinal regression:

determine the exact category – discrete nature

or at least make a close prediction – ordering preference

categorical prediction: classification error

L_C(g, x, y) = \llbracket g(x) \neq y \rrbracket

close prediction: absolute error

L_A(g, x, y) = \lvert g(x) - y \rvert

neither is perfect; both are common
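
As a concrete reference, the two error functions can be coded directly (a minimal sketch; the function names are illustrative):

    def classification_error(g_x, y):
        # L_C(g, x, y): 1 if the predicted rank differs from the true rank, else 0
        return float(g_x != y)

    def absolute_error(g_x, y):
        # L_A(g, x, y): how many ranks the prediction is away from the truth
        return float(abs(g_x - y))

    # e.g. predicting 2 when y = 4: L_C = 1 (just "a mistake"), L_A = 2 (a big mistake)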

(5)

Ordinal Regression Problem

Our Contributions

new model for ordinal regression: thresholded ensemble model – combines thresholding and ensemble learning

new generalization bounds for thresholded ensembles – theoretical guarantee of performance

new algorithms for constructing thresholded ensembles – simple and efficient

[Figure: target; traditional regression; our ordinal regression]

promising experimental results

(6)

Thresholded Ensemble Model

Thresholded Model

commonly used in previous work:

thresholded perceptrons (PRank, Crammer and Singer, 2005)

thresholded SVMs (SVOR, Chu and Keerthi, 2005)

prediction procedure:

1 compute a potential function H(x) (e.g. the raw perceptron output)

2 quantize H(x) by some ordered thresholds θ to get g(x)

[Figure: the real line of H(x), divided by thresholds θ_1, θ_2, θ_3 into regions with g(x) = 1, 2, 3, 4]

thresholded model:

g(x) \equiv g_{H,\theta}(x) = \min\{\, k : H(x) < \theta_k \,\}
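
A minimal sketch of this quantization step, assuming (as the min above suggests) that an example whose potential is not below any finite threshold receives the top rank K:

    def predict_rank(H_x, thetas):
        # g(x) = min{k : H(x) < theta_k}, with thetas = (theta_1 <= ... <= theta_{K-1})
        for k, theta_k in enumerate(thetas, start=1):
            if H_x < theta_k:
                return k
        return len(thetas) + 1          # H(x) is above all thresholds: rank K

    # e.g. thetas = [-1.0, 0.5, 2.0] and H(x) = 1.3 give g(x) = 3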

(7)

Thresholded Ensemble Model

Thresholded Ensemble Model

[Figure: the same thresholded axis, with the potential H(x) now given by an ensemble]

the potential function H(x) is a weighted ensemble

H(x) \equiv H_T(x) = \sum_{t=1}^{T} w_t h_t(x)

intuition: combine preferences to estimate the overall confidence

e.g. if many people, h_t, say a movie x is “good”, the confidence H(x) of the movie should be high

good theoretical and algorithmic properties inherited from ensemble learning for classification
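
A sketch of the full thresholded ensemble, assuming base hypotheses h_t that return confidence values (e.g. in {−1, +1}); predict_rank is the quantization sketch from the previous slide:

    def potential(x, hypotheses, weights):
        # H_T(x) = sum_{t=1}^T w_t * h_t(x): a weighted vote of the preferences h_t
        return sum(w * h(x) for h, w in zip(hypotheses, weights))

    def thresholded_ensemble_predict(x, hypotheses, weights, thetas):
        # combine the votes into a confidence, then quantize it by the ordered thresholds
        return predict_rank(potential(x, hypotheses, weights), thetas)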

(8)

New Large-Margin Bounds of Thresholded Ensembles

Margins of Thresholded Ensembles

[Figure: the thresholded axis with margins ρ_1, ρ_2, ρ_3 measuring how far H(x) is from each threshold]

margin: safe from the boundary

normalized margin for the thresholded ensemble:

\bar\rho(x,y,k) = \frac{1}{\sum_{t=1}^{T} \lvert w_t \rvert + \sum_{k=1}^{K-1} \lvert \theta_k \rvert} \cdot \begin{cases} H_T(x) - \theta_k, & \text{if } y > k \\ \theta_k - H_T(x), & \text{if } y \le k \end{cases}

negative margin ⇐⇒ wrong prediction:

\sum_{k=1}^{K-1} \llbracket \bar\rho(x,y,k) \le 0 \rrbracket = \lvert g(x) - y \rvert = L_A(g,x,y)
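
A sketch of the margin computation, assuming the normalizer is the sum of the absolute ensemble weights and thresholds as written above; the closing comment restates the margin/error equivalence (ignoring ties exactly at a threshold):

    import numpy as np

    def normalized_margins(H_x, y, thetas, weights):
        # bar_rho(x, y, k) for k = 1, ..., K-1
        norm = np.sum(np.abs(weights)) + np.sum(np.abs(thetas))
        raw = [H_x - t_k if y > k else t_k - H_x
               for k, t_k in enumerate(thetas, start=1)]
        return np.array(raw) / norm

    # number of non-positive margins  ==  |g(x) - y|  ==  L_A(g, x, y):
    # np.sum(normalized_margins(H_x, y, thetas, w) <= 0) == abs(predict_rank(H_x, thetas) - y)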

(9)

New Large-Margin Bounds of Thresholded Ensembles

New Large-Margin Bounds for the Model

core results: if the (x_n, y_n) are i.i.d. from D, then with prob. > 1 − δ, ∀∆ > 0,

E_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} \llbracket \bar\rho(x_n,y_n,k) \le \Delta \rrbracket + O\left( K \sqrt{ \frac{1}{N}\left( \frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta} \right) } \right)

E_{(x,y)\sim\mathcal{D}}\, L_C(g,x,y) \le \frac{2}{N} \sum_{n=1}^{N} \sum_{k=y_n-1}^{y_n} \llbracket \bar\rho(x_n,y_n,k) \le \Delta \rrbracket + O\left( \sqrt{ \frac{1}{N}\left( \frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta} \right) } \right)

sketch of the proof (illustrated with L_A):

1 reduce ordinal regression examples to dependent binary examples

2 extract i.i.d. binary examples; apply existing classification bounds

3 bound the deviation caused by the i.i.d. extraction

large-margin thresholded ensembles could generalize

(10)

New Large-Margin Bounds of Thresholded Ensembles

Reduction to Binary Classification

x x -

– – θ1 +d + +d d +t ++ +tt t ??

++

1 2 3 4 g(x)

H(x)

K −1 binary classification problems w.r.t. eachθk

encode(x,y,k)as (X)k, (Y)k = (x,1k),sign(y−k−0.5) :

ρ(x¯ ,y,k) ∝ (Y)k HT(x) − hθ,1ki =bin. classifier marginρC (X)k, (Y)k key observation:

E(x,y)∼DLA(g,x,y) = E(x,y)∼DXK−1

k=1 ¯ρ(x,y,k) ≤0

= (K 1)E(x,y)∼D,k∼K ¯ρ(x,y,k) ≤0

= (K 1)E((X)

k,(Y)k)∼ ˆDC (X)k, (Y)k ≤0 ordinal regression problem=⇒

one big joint binary classification problem
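
A sketch of the encoding step; the extended input (x, 1_k) is represented here simply by keeping the threshold index k next to x, and the binary label is sign(y − k − 0.5):

    import numpy as np

    def reduce_to_binary(X, y, K):
        # each (x_n, y_n) becomes K-1 binary examples ((x_n, k), sign(y_n - k - 0.5)), k = 1, ..., K-1
        bin_x, bin_k, bin_label = [], [], []
        for x_n, y_n in zip(X, y):
            for k in range(1, K):
                bin_x.append(x_n)
                bin_k.append(k)
                bin_label.append(1 if y_n > k else -1)   # "should x_n be ranked above threshold k?"
        return np.array(bin_x), np.array(bin_k), np.array(bin_label)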

(11)

New Large-Margin Bounds of Thresholded Ensembles

Extraction of Independent Examples

E_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) = (K-1)\, E_{((X)_k,(Y)_k)\sim\hat{\mathcal{D}}} \llbracket \rho_C\big( (X)_k,(Y)_k \big) \le 0 \rrbracket

testing distribution D̂ of ((X)_k, (Y)_k): derived from (x, y, k) ∼ D × K

extended training examples Ŝ = {((X_n)_k, (Y_n)_k)}:

not i.i.d. from D̂; cannot be directly used in existing bounds

i.i.d. subset of Ŝ: randomly choose one k_n for each n

apply the ensemble learning bound (Schapire et al., 1998):

if the (x_n, y_n, k_n) are i.i.d. from D × K, with prob. > 1 − δ, ∀∆ > 0,

E_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{K-1}{N} \sum_{n=1}^{N} \llbracket \bar\rho(x_n,y_n,k_n) \le \Delta \rrbracket + O\left( K \sqrt{ \frac{1}{N}\left( \frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta} \right) } \right)

can we obtain a deterministic RHS?
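
The i.i.d. extraction itself is just one random threshold index per training example (sampled uniformly here, matching the 1/(K−1) weighting used on the next slide):

    import numpy as np

    def extract_iid_indices(N, K, seed=0):
        # choose one k_n in {1, ..., K-1} per example n; the N chosen extended examples
        # ((X_n)_{k_n}, (Y_n)_{k_n}) are then i.i.d., so the binary bound applies to them
        rng = np.random.default_rng(seed)
        return rng.integers(low=1, high=K, size=N)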

(12)

New Large-Margin Bounds of Thresholded Ensembles

Deviation from the Extraction

E_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{K-1}{N} \sum_{n=1}^{N} \llbracket \bar\rho(x_n,y_n,k_n) \le \Delta \rrbracket + O\left( K \sqrt{ \frac{1}{N}\left( \frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta} \right) } \right)

let b_n = \llbracket \bar\rho(x_n,y_n,k_n) \le \Delta \rrbracket : independent binary r.v. with mean

\mu_n = \frac{1}{K-1} \sum_{k=1}^{K-1} \llbracket \bar\rho(x_n,y_n,k) \le \Delta \rrbracket

extended Chernoff bound: with prob. > 1 − δ,

\frac{K-1}{N} \sum_{n=1}^{N} b_n \le \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} \llbracket \bar\rho(x_n,y_n,k) \le \Delta \rrbracket + O\left( \sqrt{ \frac{1}{N} \log\frac{1}{\delta} } \right)

connection between the bound and algorithm design? boosting
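
To make the expectation step explicit (constants inside the O(·), including any dependence on K, are not tracked here): each b_n averages to μ_n over the random choice of k_n, so

E_{k_n}[\, b_n \,] = \mu_n \quad\Longrightarrow\quad E\left[ \frac{K-1}{N} \sum_{n=1}^{N} b_n \right] = \frac{K-1}{N} \sum_{n=1}^{N} \mu_n = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} \llbracket \bar\rho(x_n,y_n,k) \le \Delta \rrbracket ,

and since the b_n are independent and bounded, the Chernoff-type bound controls the deviation of (K−1)/N Σ_n b_n from this expectation, turning the randomized right-hand side of slide (11) into the deterministic right-hand side of slide (9).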

(13)

New Large-Margin Algorithms for Thresholded Ensembles

Boosting for Large-Margin Thresholded Ensembles

existing algorithm (RankBoost, Freund et al., 2003):

constructs H_T iteratively with some margin concepts, but no θ

our work:

RankBoost-AE: extended RankBoost for ordinal regression
– obtains θ by minimizing the training L_A using dynamic programming

ORBoost: a new boosting formulation for ordinal regression

ORBoost:

simpler and faster than existing approaches;

connects well to the large-margin bounds

(14)

New Large-Margin Algorithms for Thresholded Ensembles

ORBoost: Ordinal Regression Boosting

inspired by AdaBoost, which operationally does

\min \sum_{n=1}^{N} \exp\big( -\rho(x_n, y_n) \big) \;\approx\; \max\ \operatorname{softmin}_n \rho(x_n, y_n)

ORBoost:

\min \sum_{n=1}^{N} \sum_{k} \exp\big( -\rho(x_n, y_n, k) \big) \;\ge\; \text{const.} \cdot \sum_{n=1}^{N} \sum_{k} \llbracket \rho(x_n, y_n, k) \le \Delta \rrbracket

ORBoost-LR: k ∈ {y_n − 1, y_n}

connects to the bound on L_C

ORBoost-All: k ∈ {1, 2, · · · , K − 1}

connects to the bound on L_A

algorithmic derivation based on the theoretical bounds
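
A highly simplified sketch of the ORBoost-All idea — greedily growing the ensemble and re-fitting the thresholds to reduce the exponential loss above. The base learners (decision stumps), the grid search for the weight w_t, and the sort-based ordering safeguard are simplifications for illustration, not the authors' actual update rules:

    import numpy as np

    def exp_loss(H, y, thetas):
        # sum_{n,k} exp(-rho(x_n, y_n, k)) with rho = s_{nk} (H_n - theta_k),
        # s_{nk} = +1 if y_n > k else -1  (ORBoost-All uses every k = 1, ..., K-1)
        K = len(thetas) + 1
        s = np.where(y[:, None] > np.arange(1, K)[None, :], 1.0, -1.0)
        return np.sum(np.exp(-s * (H[:, None] - thetas[None, :])))

    def orboost_all_sketch(X, y, K, T=50):
        N, d = X.shape
        H = np.zeros(N)                       # ensemble potential on the training set
        thetas = np.zeros(K - 1)
        stumps = [(j, t) for j in range(d) for t in np.unique(X[:, j])]
        w_grid = np.linspace(-2.0, 2.0, 81)   # crude line search instead of a closed-form weight
        for _ in range(T):
            # greedy step: the stump and weight that most reduce the exponential loss
            best_loss, best_step = exp_loss(H, y, thetas), np.zeros(N)
            for j, t in stumps:
                h = np.where(X[:, j] > t, 1.0, -1.0)
                for w in w_grid:
                    loss = exp_loss(H + w * h, y, thetas)
                    if loss < best_loss:
                        best_loss, best_step = loss, w * h
            H = H + best_step
            # re-fit each threshold in closed form for the exponential loss
            # (assumes every rank boundary splits the training set into two nonempty parts)
            for k in range(1, K):
                above = y > k
                thetas[k - 1] = 0.5 * np.log(np.sum(np.exp(H[~above])) / np.sum(np.exp(-H[above])))
            thetas = np.sort(thetas)          # crude ordering safeguard, not the paper's mechanism
        return H, thetas

This sketch only tracks the training-set potentials H; a real implementation would also store the chosen stumps and weights so that H_T(x) and the thresholded prediction of slide (6) can be evaluated on unseen examples.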

(15)

New Large-Margin Algorithms for Thresholded Ensembles

Advantages of ORBoost

ensemble learning:

combines simple preferences to approximate complex targets

thresholding:

adaptively estimates the scale to perform ordinal regression

benefits inherited from AdaBoost:

simple implementation

if the h_t are good enough: guarantee on rapidly minimizing

\sum_{n,k} \llbracket \bar\rho(x_n, y_n, k) \le \Delta \rrbracket

– the decision function g improves with T

ORBoost is not very vulnerable to overfitting in practice

(16)

Experimental Results

ORBoost vs. RankBoost

[Figure: bar chart of test absolute error (0 to 2.5) on datasets py, ma, bo, ab, ba, co, ca, ce, comparing RankBoost and ORBoost]

results (ORBoost-All):

significantly better than RankBoost (the best existing boosting approach)

simpler to implement and less vulnerable to overfitting

ORBoost: a promising boosting approach for ordinal regression

(17)

Experimental Results

ORBoost vs. SVOR

[Figure: bar chart of test absolute error (0 to 2.5) on datasets py, ma, bo, ab, ba, co, ca, ce, comparing SVOR and ORBoost]

results (ORBoost-All):

comparable to SVOR (the state-of-the-art algorithm)

much faster in training (1 hour vs. 2 days on 6000 examples)

ORBoost: could be especially useful for large-scale tasks

(18)

Conclusion

Conclusion

thresholded ensemble model: useful for ordinal regression

theoretical reduction: new large-margin bounds

algorithmic reduction: new learning algorithms

ORBoost:

simpler and better-performing than the existing boosting algorithm

comparable performance to state-of-the-art algorithms

fast training and not very vulnerable to overfitting

broader reduction view: many more bounds/algorithms and more general error functions (Li and Lin, NIPS 2006)

Thank you. Questions?
