Large-Margin Thresholded Ensembles for Ordinal Regression
Hsuan-Tien Lin
(accepted by ALT ’06, joint work with Ling Li)
Learning Systems Group, Caltech
Workshop Talk at MLSS 2006, Taipei, Taiwan, 07/25/2006
Ordinal Regression Problem
Reduction Method
Algorithmic
1 identify the type of learning problem (ordinal regression)
2 find a premade reduction (thresholded ensemble) and an oracle learning algorithm (AdaBoost)
3 build an ordinal regression rule using the reduction (ORBoost) + data
Theoretical
1 identify the type of learning problem (ordinal regression)
2 find a premade reduction (thresholded ensemble) and known generalization bounds (large-margin ensembles)
3 derive a new bound (large-margin thresholded ensembles) using the reduction + the known bound
this work: a concrete instance of reductions
Ordinal Regression
what is the age-group of the person in the picture?
[figure: four face pictures, each guessed into an age group g ∈ {1, 2, 3, 4}]
rank: a finite ordered set of labels Y = {1, 2, · · · , K}
ordinal regression:
given a training set {(x_n, y_n)}_{n=1}^{N}, find a decision function g that predicts the ranks of unseen examples well
e.g. ranking movies, ranking by document relevance, etc.
matching human preferences:
applications in social science and info. retrieval
Properties of Ordinal Regression
regression without metric:
a metric may underlie the ranks (e.g. age), but it is not encoded in {1, 2, 3, 4}
classification with ordered categories:
small mistake – classify a teenager as a child;
big mistake – classify an infant as an adult
common loss functions:
determine the category: classification error L_C(g, x, y) = [[ g(x) ≠ y ]]
or at least have a close prediction: absolute error L_A(g, x, y) = |g(x) − y|
will talk about L_A only; similar results hold for L_C
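The contrast between the two losses can be sketched in a few lines of Python (a toy illustration; the rank encoding {1, ..., K} follows the slides, the function names are ours):

```python
def loss_classification(pred, y):
    # L_C(g, x, y) = [[g(x) != y]]: 1 for any wrong category, 0 otherwise
    return int(pred != y)

def loss_absolute(pred, y):
    # L_A(g, x, y) = |g(x) - y|: penalizes by how many ranks the guess is off
    return abs(pred - y)

# classify a teenager (rank 2) as a child (rank 1): small mistake
print(loss_classification(1, 2), loss_absolute(1, 2))   # 1 1
# classify an infant (rank 1) as an adult (rank 4): same L_C, much larger L_A
print(loss_classification(4, 1), loss_absolute(4, 1))   # 1 3
```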
Thresholded Ensemble Model
Thresholded Model for Ordinal Regression
naive algorithm for ordinal regression:
1 do general regression on{(xn,yn)}, and get H(x) – general regression performs badly without metric
2 set g(x) =clip(round(H(x)))
– roundoff operation (uniform quantization) cause large error improved and generalized algorithm:
1 estimate a potential function H(x)
2 quantize H(x)by some orderedθto get g(x)
[figure: the H(x) axis divided by ordered thresholds θ_1, θ_2, θ_3 into regions with g(x) = 1, 2, 3, 4]
thresholded model: g(x) ≡ g_{H,θ}(x) = min{ k : H(x) < θ_k }
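A minimal sketch of the thresholded decision rule (the example thresholds are hypothetical; rank K is returned when H(x) clears every threshold):

```python
import bisect

def predict(H_x, thetas):
    # g(x) = min{k : H(x) < theta_k}; thetas = [theta_1, ..., theta_{K-1}]
    # are ordered, and rank K is returned when H(x) exceeds them all.
    return bisect.bisect_right(thetas, H_x) + 1

thetas = [-1.0, 0.5, 2.0]       # hypothetical ordered thresholds, K = 4
print(predict(-3.0, thetas))    # 1: below theta_1
print(predict(0.0, thetas))     # 2: between theta_1 and theta_2
print(predict(5.0, thetas))     # 4: above all thresholds
```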
Thresholded Ensemble Model
the potential function H(x) is a weighted ensemble:
H(x) ≡ H_T(x) = Σ_{t=1}^{T} w_t h_t(x)
intuition: combine preferences to estimate the overall confidence
e.g. if many people, h_t, say a movie x is “good”,
the confidence of the movie, H(x), should be high
h_t can be binary, multi-valued, or continuous
w_t < 0: allows reversing bad preferences
thresholded ensemble model:
ensemble learning for ordinal regression
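The weighted-ensemble potential above can be sketched directly (the base preferences h_t and weights w_t below are illustrative, not from the slides):

```python
def H(x, hs, ws):
    # H(x) = sum_t w_t * h_t(x): weighted votes of the base preferences
    return sum(w * h(x) for h, w in zip(hs, ws))

# two hypothetical binary preferences ("is the movie good?") and weights
hs = [lambda x: 1 if x > 0 else -1,
      lambda x: 1 if x > 2 else -1]
ws = [0.7, 0.5]

print(H(3.0, hs, ws))    # 1.2: both voters say "good", high confidence
print(H(-1.0, hs, ws))   # -1.2: both voters say "bad"
```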
Bounds for Large-Margin Thresholded Ensembles
Margins of Thresholded Ensembles
[figure: thresholds θ_1, θ_2, θ_3 on the H(x) axis with regions g(x) = 1, 2, 3, 4; margins ρ_1, ρ_2, ρ_3 measure each example’s distance to the threshold boundaries]
margin: safe from the boundary
normalized margin for a thresholded ensemble:
ρ̄(x, y, k) = { H_T(x) − θ_k, if y > k
             { θ_k − H_T(x), if y ≤ k
divided by ( Σ_{t=1}^{T} |w_t| + Σ_{k=1}^{K−1} |θ_k| )
negative margin ⇐⇒ wrong prediction:
Σ_{k=1}^{K−1} [[ ρ̄(x, y, k) ≤ 0 ]] = |g(x) − y|
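The relation between non-positive margins and the absolute error can be checked numerically (a toy sketch; the thresholds and weights below are hypothetical):

```python
def predict(H_x, thetas):
    # g(x) = min{k : H(x) < theta_k}; rank K if H(x) clears all thresholds
    for k, theta in enumerate(thetas, start=1):
        if H_x < theta:
            return k
    return len(thetas) + 1

def normalized_margins(H_x, y, thetas, ws):
    # rho_bar(x, y, k): positive iff H(x) lies on the correct side of theta_k
    norm = sum(abs(w) for w in ws) + sum(abs(t) for t in thetas)
    return [(H_x - t if y > k else t - H_x) / norm
            for k, t in enumerate(thetas, start=1)]

thetas, ws = [-1.0, 0.5, 2.0], [0.7, 0.5]   # hypothetical model, K = 4
H_x, y = 1.2, 1                             # true rank 1, H(x) overshoots
rhos = normalized_margins(H_x, y, thetas, ws)
wrong = sum(1 for r in rhos if r <= 0)
# number of non-positive margins equals the absolute error |g(x) - y|
print(wrong, abs(predict(H_x, thetas) - y))  # 2 2
```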
Theoretical Reduction
[figure: each threshold θ_k splits the H(x) axis into its own binary problem, with examples below θ_k labeled − and those above labeled +]
(K − 1) binary classification problems w.r.t. each θ_k:
(X)_k = (x, k), (Y)_k = +/−
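The per-threshold reduction can be written out explicitly (a sketch; the function name is ours):

```python
def to_binary_examples(x, y, K):
    # one ordinal example (x, y) becomes K-1 binary examples,
    # one per threshold theta_k: label +1 iff the true rank exceeds k
    return [((x, k), +1 if y > k else -1) for k in range(1, K)]

# a rank-3 example with K = 4 ranks
print(to_binary_examples("photo", 3, 4))
# [(('photo', 1), 1), (('photo', 2), 1), (('photo', 3), -1)]
```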
(Schapire et al., 1998) binary classification: with probability at least 1 − δ, for all Δ > 0 and binary classifiers g_c,
E_{(X,Y)∼D′} [[ g_c(X) ≠ Y ]] ≤ (1/N) Σ_{n=1}^{N} [[ ρ̄(X_n, Y_n) ≤ Δ ]] + O( (log N)/√N, 1/Δ, √(log(1/δ)) )
(Lin and Li, 2006) ordinal regression: with similar settings, for all thresholded ensembles g,
E_{(x,y)∼D} L_A(g, x, y) ≤ (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K−1} [[ ρ̄(x_n, y_n, k) ≤ Δ ]] + O( K, (log N)/√N, 1/Δ, √(log(1/δ)) )
large-margin thresholded ensembles can generalize
Algorithms for Large-Margin Thresholded Ensembles
Algorithmic Reduction
(Freund and Schapire, 1996) AdaBoost: binary classification by operationally optimizing
min Σ_{n=1}^{N} exp( −ρ(x_n, y_n) ) ≈ max softmin_n ρ̄(x_n, y_n)
(Lin and Li, 2006) ORBoost-LR (left-right):
min Σ_{n=1}^{N} Σ_{k=y_n−1}^{y_n} exp( −ρ(x_n, y_n, k) )
ORBoost-All:
min Σ_{n=1}^{N} Σ_{k=1}^{K−1} exp( −ρ(x_n, y_n, k) )
algorithmic reduction to AdaBoost
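The ORBoost-All objective can be sketched as follows (a toy illustration; the potential function, thresholds, and data are hypothetical, and margins are left unnormalized as in the AdaBoost-style objective):

```python
import math

def orboost_all_loss(data, H, thetas):
    # sum_n sum_{k=1}^{K-1} exp(-rho(x_n, y_n, k))
    total = 0.0
    for x, y in data:
        for k, theta in enumerate(thetas, start=1):
            rho = H(x) - theta if y > k else theta - H(x)  # margin at theta_k
            total += math.exp(-rho)
    return total

thetas = [-1.0, 0.5, 2.0]             # hypothetical ordered thresholds, K = 4
H = lambda x: 0.9 * x                 # hypothetical potential function
data = [(1.0, 2), (-2.0, 1), (3.0, 4)]
print(orboost_all_loss(data, H, thetas))
```

Larger margins mean smaller loss, so driving this objective down pushes every example away from every threshold on the correct side.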
Advantages of ORBoost
ensemble learning: combine simple preferences to approximate complex targets
thresholds: adaptively estimated scales to perform ordinal regression
inherited from AdaBoost:
simple implementation
guarantee on minimizing Σ_{n,k} [[ ρ̄(x_n, y_n, k) ≤ Δ ]]
fast in practice
less vulnerable to overfitting
useful properties are inherited through the reduction
ORBoost Experiments
[figure: bar chart of test absolute error (0 to 2.5) for RankBoost, ORBoost, and SVM on eight benchmark datasets: py, ma, ho, ab, ba, co, ca, ce]
Results (ORBoost-All):
ORBoost-All is simpler than, and much better than, RankBoost (Freund et al., 2003)
ORBoost-All is much faster than, and comparable to, SVM (Chu and Keerthi, 2005)
similar results hold for ORBoost-LR
Conclusion
thresholded ensemble model: useful for ordinal regression
theoretical reduction: new large-margin bounds
algorithmic reduction: new training algorithms – ORBoost
ORBoost:
simpler than existing boosting algorithms
comparable performance to state-of-the-art algorithms
fast training and less vulnerable to overfitting
on-going work: similar reduction techniques for other theoretical and algorithmic results with more general loss functions (Li and Lin, 2006)
Questions?