Large-Margin Thresholded Ensembles for Ordinal Regression
Hsuan-Tien Lin and Ling Li
Learning Systems Group, California Institute of Technology, U.S.A.
Conf. on Algorithmic Learning Theory, October 9, 2006
Ordinal Regression Problem
Ordinal Regression
what is the age-group of the person in the picture?
[Figure: example pictures of people at different ages, each labeled with its age-group rank]
rank: a finite ordered set of labels $\mathcal{Y} = \{1, 2, \ldots, K\}$
ordinal regression: given a training set $\{(x_n, y_n)\}_{n=1}^{N}$, find a decision function $g$ that predicts the ranks of unseen examples well
e.g. ranking movies, ranking by document relevance, etc.
matching human preferences:
applications in social science and info. retrieval
Properties of Ordinal Regression
regression without a metric:
- a metric may underlie the ranks (e.g. age), but it is not encoded in $\{1,2,3,4\}$
- monotonic invariance: relabeling by $\{2,3,5,7\}$ should not change the results
- general regression deteriorates without the metric

classification with ordered categories:
- small mistake: classify a teenager as a child; big mistake: classify an infant as an adult
- no shuffle invariance: relabeling by $\{3,1,2,4\}$ loses information
- general classification cannot use the ordering information

ordinal regression resides uniquely between classification and regression
Error Functions for Ordinal Regression
two aspects of ordinal regression:
- determine the exact category – discrete nature
- or at least make a close prediction – ordering preference

categorical prediction: classification error $L_C(g,x,y) = [\![\, g(x) \neq y \,]\!]$
close prediction: absolute error $L_A(g,x,y) = |g(x) - y|$
neither perfect; both common
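To make the contrast concrete, here is a minimal sketch in plain Python (function names are ours, not from the paper):

```python
def classification_error(g_x, y):
    """L_C: 1 if the predicted rank differs from the true rank, else 0."""
    return float(g_x != y)

def absolute_error(g_x, y):
    """L_A: how many ranks the prediction is off by."""
    return float(abs(g_x - y))

# L_C treats both mistakes below the same; L_A charges the big one more:
print(classification_error(1, 2), absolute_error(1, 2))  # teenager as child: 1.0 1.0
print(classification_error(4, 1), absolute_error(4, 1))  # infant as adult:   1.0 3.0
```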
Our Contributions
new model for ordinal regression: thresholded ensemble model – combines thresholding and ensemble learning
new generalization bounds for thresholded ensembles – theoretical guarantee of performance
new algorithms for constructing thresholded ensembles – simple and efficient
[Figure: three panels – the target, a traditional regression fit, and our ordinal regression fit]
promising experimental results
Thresholded Ensemble Model
Thresholded Model
commonly used in previous work:
- thresholded perceptrons (PRank, Crammer and Singer, 2005)
- thresholded SVMs (SVOR, Chu and Keerthi, 2005)
prediction procedure:
1. compute a potential function $H(x)$ (e.g. raw perceptron output)
2. quantize $H(x)$ by some ordered $\theta$ to get $g(x)$
[Diagram: the real line of $H(x)$, cut by thresholds $\theta_1 < \theta_2 < \theta_3$ into ranks $g(x) = 1, 2, 3, 4$]
thresholded model:
$$g(x) \equiv g_{H,\theta}(x) = \min\{k : H(x) < \theta_k\}$$
(with the convention $\theta_K = \infty$, so $g(x) = K$ when $H(x) \ge \theta_{K-1}$)
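A minimal sketch of the prediction rule in Python (the function name and the toy thresholds are ours):

```python
def thresholded_predict(H_x, thetas):
    """g_{H,theta}(x) = min{k : H(x) < theta_k}, with theta_K = +infinity.

    `thetas` holds the K-1 ordered thresholds theta_1 <= ... <= theta_{K-1};
    the returned rank is in {1, ..., K}.
    """
    for k, theta_k in enumerate(thetas, start=1):
        if H_x < theta_k:
            return k
    return len(thetas) + 1  # H(x) >= theta_{K-1}: the top rank K

# toy thresholds theta = (-1.0, 0.0, 2.5) for K = 4 ranks:
print(thresholded_predict(-2.0, (-1.0, 0.0, 2.5)))  # 1
print(thresholded_predict(1.0, (-1.0, 0.0, 2.5)))   # 3
print(thresholded_predict(3.0, (-1.0, 0.0, 2.5)))   # 4
```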
Thresholded Ensemble Model
the potential function $H(x)$ is a weighted ensemble:
$$H(x) \equiv H_T(x) = \sum_{t=1}^{T} w_t h_t(x)$$
intuition: combine preferences to estimate the overall confidence – e.g. if many people $h_t$ say a movie $x$ is "good", the confidence $H(x)$ of the movie should be high
good theoretical and algorithmic properties inherited from ensemble learning for classification
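Continuing the sketch, the potential is just a weighted vote; the three "reviewers" below are hypothetical, and `thresholded_predict` is reused from the previous sketch:

```python
def ensemble_potential(x, hypotheses, weights):
    """H_T(x) = sum_t w_t * h_t(x): a weighted vote of base preferences."""
    return sum(w * h(x) for h, w in zip(hypotheses, weights))

# three hypothetical reviewers voting in {-1, +1} on a movie x:
hs = [lambda x: +1.0, lambda x: -1.0, lambda x: +1.0]
ws = [0.5, 0.3, 0.8]
H = ensemble_potential("movie", hs, ws)          # 0.5 - 0.3 + 0.8 = 1.0
print(thresholded_predict(H, (-1.0, 0.0, 2.5)))  # rank 3
```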
New Large-Margin Bounds of Thresholded Ensembles
Margins of Thresholded Ensembles
[Diagram: margins $\rho_1, \rho_2, \rho_3$ – the distances from an example's $H(x)$ to each of the thresholds $\theta_1, \theta_2, \theta_3$]
margin: safe from the boundary
normalized margin for a thresholded ensemble:
$$\bar\rho(x,y,k) = \begin{cases} H_T(x) - \theta_k, & \text{if } y > k \\ \theta_k - H_T(x), & \text{if } y \le k \end{cases} \Bigg/ \left(\sum_{t=1}^{T} |w_t| + \sum_{k=1}^{K-1} |\theta_k|\right)$$
negative margin $\Longleftrightarrow$ wrong prediction:
$$\sum_{k=1}^{K-1} [\![\, \bar\rho(x,y,k) \le 0 \,]\!] = |g(x) - y| = L_A(g,x,y)$$
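A sketch of the margin computation, assuming the normalizer is the 1-norm $\sum_t |w_t| + \sum_k |\theta_k|$ as written above (and reusing `thresholded_predict` from the earlier sketch):

```python
def normalized_margins(H_x, y, thetas, weights):
    """bar_rho(x, y, k) for k = 1, ..., K-1; the normalizer is assumed
    to be sum_t |w_t| + sum_k |theta_k|."""
    norm = sum(abs(w) for w in weights) + sum(abs(t) for t in thetas)
    return [((H_x - t) if y > k else (t - H_x)) / norm
            for k, t in enumerate(thetas, start=1)]

# the number of non-positive margins equals the absolute error |g(x) - y|:
thetas, ws = (-1.0, 0.0, 2.5), (0.5, 0.3, 0.8)
rhos = normalized_margins(1.0, 1, thetas, ws)  # y = 1 but g(x) = 3
assert sum(r <= 0 for r in rhos) == abs(thresholded_predict(1.0, thetas) - 1) == 2
```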
New Large-Margin Bounds for the Model
core results: if $(x_n, y_n)$ are i.i.d. from $\mathcal{D}$, then with probability $> 1-\delta$, for all $\Delta > 0$,
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K-1} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!] + O\!\left(K\sqrt{\frac{1}{N}\left(\frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta}\right)}\right)$$
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_C(g,x,y) \le \frac{2}{N}\sum_{n=1}^{N}\sum_{k=y_n-1}^{y_n} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!] + O\!\left(\sqrt{\frac{1}{N}\left(\frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta}\right)}\right)$$
sketch of the proof (illustrated with $L_A$):
1. reduce ordinal regression examples to dependent binary examples
2. extract i.i.d. binary examples; apply existing classification bounds
3. bound the deviation caused by the i.i.d. extraction

$\Longrightarrow$ large-margin thresholded ensembles could generalize
Reduction to Binary Classification
[Diagram: each threshold $\theta_k$ turns the examples on the $H(x)$ line into binary labels $-$ / $+$]
$K-1$ binary classification problems, one w.r.t. each $\theta_k$

encode $(x,y,k)$ as $\big((X)_k, (Y)_k\big) = \big((x, \mathbf{1}_k),\ \operatorname{sign}(y - k - 0.5)\big)$:
$$\bar\rho(x,y,k) \propto (Y)_k \big(H_T(x) - \langle \theta, \mathbf{1}_k \rangle\big) = \text{binary classifier margin } \rho_C\big((X)_k,(Y)_k\big)$$

key observation (with $k \sim \mathcal{K}$ uniform on $\{1, \ldots, K-1\}$):
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) = \mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{k=1}^{K-1} [\![\, \bar\rho(x,y,k) \le 0 \,]\!] = (K-1)\,\mathbb{E}_{(x,y)\sim\mathcal{D},\,k\sim\mathcal{K}}\, [\![\, \bar\rho(x,y,k) \le 0 \,]\!] = (K-1)\,\mathbb{E}_{((X)_k,(Y)_k)\sim\hat{\mathcal{D}}}\, [\![\, \rho_C\big((X)_k,(Y)_k\big) \le 0 \,]\!]$$

ordinal regression problem $\Longrightarrow$ one big joint binary classification problem
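A sketch of the encoding step ($\mathbf{1}_k$ represented as a plain list; names ours):

```python
def reduce_to_binary(x, y, K):
    """Encode the ordinal example (x, y) as K-1 binary examples:
    (X)_k = (x, 1_k) and (Y)_k = sign(y - k - 0.5), i.e. whether
    the true rank lies above threshold k."""
    examples = []
    for k in range(1, K):
        one_k = [1.0 if j == k else 0.0 for j in range(1, K)]  # the vector 1_k
        Y_k = 1 if y > k else -1
        examples.append(((x, one_k), Y_k))
    return examples

# rank y = 3 of K = 4 lies above thresholds 1 and 2, below threshold 3:
print([Y for (_, Y) in reduce_to_binary("x", 3, 4)])  # [1, 1, -1]
```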
Extraction of Independent Examples
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) = (K-1)\,\mathbb{E}_{((X)_k,(Y)_k)\sim\hat{\mathcal{D}}}\, [\![\, \rho_C\big((X)_k,(Y)_k\big) \le 0 \,]\!]$$
testing distribution $\hat{\mathcal{D}}$ of $\big((X)_k,(Y)_k\big)$: derived from $(x,y,k) \sim \mathcal{D} \times \mathcal{K}$
extended training examples $\hat{S} = \big\{\big((X_n)_k, (Y_n)_k\big)\big\}$: not i.i.d. from $\hat{\mathcal{D}}$; cannot be directly used in existing bounds
i.i.d. subset of $\hat{S}$: randomly choose one $k_n$ for each $n$ (a sketch of this step follows below)
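A toy illustration of that extraction, assuming each $k_n$ is drawn uniformly from $\{1, \ldots, K-1\}$ as in the proof sketch:

```python
import random

def extract_iid_subset(ordinal_examples, K, rng=random):
    """Keep one randomly chosen binary example per ordinal example,
    so the kept triples (x_n, y_n, k_n) are i.i.d. from D x K."""
    return [(x, y, rng.randrange(1, K)) for (x, y) in ordinal_examples]

# each (x_n, y_n) contributes exactly one of its K-1 binary examples:
print(extract_iid_subset([("x1", 2), ("x2", 4)], K=4))
```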
apply the ensemble learning bound (Schapire et al., 1998): if $(x_n,y_n,k_n)$ are i.i.d. from $\mathcal{D} \times \mathcal{K}$, then with prob. $> 1-\delta$, $\forall \Delta > 0$,
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{K-1}{N}\sum_{n=1}^{N} [\![\, \bar\rho(x_n,y_n,k_n) \le \Delta \,]\!] + O\!\left(K\sqrt{\frac{1}{N}\left(\frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta}\right)}\right)$$
can we obtain a deterministic RHS?
Deviation from the Extraction
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\, L_A(g,x,y) \le \frac{K-1}{N}\sum_{n=1}^{N} [\![\, \bar\rho(x_n,y_n,k_n) \le \Delta \,]\!] + O\!\left(K\sqrt{\frac{1}{N}\left(\frac{\log^2 N}{\Delta^2} + \log\frac{1}{\delta}\right)}\right)$$
let $b_n = [\![\, \bar\rho(x_n,y_n,k_n) \le \Delta \,]\!]$: independent binary random variables with means
$$\mu_n = \frac{1}{K-1}\sum_{k=1}^{K-1} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!]$$
extended Chernoff bound: with prob. $> 1-\delta$,
$$\frac{K-1}{N}\sum_{n=1}^{N} b_n \le \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K-1} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!] + O\!\left(\sqrt{\frac{1}{N}\log\frac{1}{\delta}}\right)$$
connection between the bounds and algorithm design? boosting
New Large-Margin Algorithms for Thresholded Ensembles
Boosting for Large-Margin Thresholded Ensembles
existing algorithm (RankBoost, Freund et al., 2003): constructs $H_T$ iteratively with some margin concepts, but no $\theta$

our work:
- RankBoost-AE: extended RankBoost for ordinal regression – obtains $\theta$ by minimizing the training $L_A$ with dynamic programming
- ORBoost: a new boosting formulation for ordinal regression
ORBoost:
simpler and faster than existing approaches;
connects well to large-margin bounds
ORBoost: Ordinal Regression Boosting
inspired by AdaBoost, which operationally solves
$$\min \sum_{n=1}^{N} \exp\big(-\rho(x_n,y_n)\big) \approx \max\ \operatorname{softmin}_n\, \rho(x_n,y_n)$$
ORBoost:
$$\min \sum_{n=1}^{N}\sum_{k} \exp\big(-\rho(x_n,y_n,k)\big) \ \ge\ \text{const} \cdot \sum_{n=1}^{N}\sum_{k} [\![\, \rho(x_n,y_n,k) \le \Delta \,]\!]$$
ORBoost-LR: $k \in \{y_n - 1, y_n\}$ – connects to the bound on $L_C$
ORBoost-All: $k \in \{1, 2, \ldots, K-1\}$ – connects to the bound on $L_A$
algorithmic derivation based on theoretical bounds
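A sketch of the surrogate that ORBoost minimizes, written with unnormalized margins; this shows the objective only, not the per-iteration updates of $h_t$, $w_t$, and $\theta$ derived in the paper:

```python
import math

def orboost_objective(H_values, ys, thetas, variant="all"):
    """sum_n sum_k exp(-rho(x_n, y_n, k)), where
    rho(x, y, k) = H(x) - theta_k if y > k, else theta_k - H(x).

    variant="all": k in {1, ..., K-1}  (ORBoost-All, tied to the L_A bound)
    variant="lr" : k in {y_n - 1, y_n} (ORBoost-LR,  tied to the L_C bound)
    """
    K = len(thetas) + 1
    total = 0.0
    for H_x, y in zip(H_values, ys):
        if variant == "all":
            ks = range(1, K)
        else:  # only the thresholds adjacent to the true rank
            ks = [k for k in (y - 1, y) if 1 <= k <= K - 1]
        for k in ks:
            rho = (H_x - thetas[k - 1]) if y > k else (thetas[k - 1] - H_x)
            total += math.exp(-rho)
    return total

# lower objective <=> larger margins on every (example, threshold) pair:
print(orboost_objective([1.0, -2.0], [3, 1], (-1.0, 0.0, 2.5)))
```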
Advantages of ORBoost
ensemble learning: combines simple preferences to approximate complex targets
thresholding: adaptively estimates the scale to perform ordinal regression
benefits inherited from AdaBoost:
simple implementation
if the $h_t$ are good enough: guarantee on rapidly minimizing $\sum_{n,k} [\![\, \bar\rho(x_n,y_n,k) \le \Delta \,]\!]$
– decision function g improves with T
ORBoost not very vulnerable to overfitting in practice
Experimental Results
ORBoost vs. RankBoost
[Bar chart: test absolute error of RankBoost vs. ORBoost on eight benchmark datasets (py, ma, bo, ab, ba, co, ca, ce)]
Results: ORBoost-All significantly better than RankBoost (the best existing boosting approach)
simpler to implement and less vulnerable to overfitting
ORBoost: promising boosting approach for ordinal regression
ORBoost vs. SVOR
[Bar chart: test absolute error of SVOR vs. ORBoost on the same eight benchmark datasets]
Results: ORBoost-All comparable to SVOR (the state-of-the-art algorithm), yet much faster in training (1 hour vs. 2 days on 6000 examples)
ORBoost: could be especially useful for large-scale tasks
Conclusion
thresholded ensemble model: useful for ordinal regression
theoretical reduction: new large-margin bounds
algorithmic reduction: new learning algorithms
ORBoost:
- simpler and better-performing than the existing boosting algorithm
- comparable performance to state-of-the-art algorithms
- fast training and not very vulnerable to overfitting
broader reduction view: many more bounds/algorithms and more general error functions (Li and Lin, NIPS 2006)
Thank you. Questions?