One-sided Support Vector Regression for Multiclass Cost-sensitive Classification
Han-Hsing Tu and Hsuan-Tien Lin
National Taiwan University
June 23, 2010
Cost-sensitive Classification
Binary Cost-sensitive Classification
[figure: is the patient cold-infected or healthy?]

  actual \ predicted |  cold  | healthy
  -------------------+--------+---------
  cold               |   0    | C_{-1}
  healthy            |  C_1   |   0

binary, cost-matrix based
Multiclass Cost-sensitive Classification
[figure: is the patient H1N1-infected, cold-infected, or healthy?]

error measure = society cost

  actual \ predicted | H1N1 | cold | healthy
  -------------------+------+------+---------
  H1N1               |    0 | 1000 |  100000
  cold               |  100 |    0 |    3000
  healthy            |  100 |   30 |       0

human doctors consider the costs of their decisions;
we want computer-aided diagnosis to behave similarly

multiclass, cost-matrix based
From Cost Matrix to Cost Vector
for an example whose actual underlying status is H1N1:

  prediction   | H1N1 | cold | healthy
  society cost |    0 | 1000 |  100000

only a "row" of the cost matrix is needed per example: the cost vector c
- an H1N1 patient: c = (0, 1000, 100000)
- a healthy patient: c = (100, 30, 0)
- "regular" classification, cost for label 2: c = (1, 0, 1, 1)
- binary cost-sensitive classification, cost for label -1: c = (0, C_{-1})

multiclass, cost-vector based: a very general setup
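A minimal sketch of the cost-vector view (the cost matrix is the H1N1 example above; function names are hypothetical): each example's cost vector is simply the row of the cost matrix for its actual class, and regular classification is the special case with 0/1 cost vectors.

```python
import numpy as np

# cost matrix from the H1N1 example: rows = actual class, columns = predicted class
COST_MATRIX = np.array([
    [0,    1000, 100000],  # actual: H1N1
    [100,     0,   3000],  # actual: cold
    [100,    30,      0],  # actual: healthy
])

def cost_vector(y):
    """The cost vector c of an example is the row of the cost matrix for its class y."""
    return COST_MATRIX[y]

def regular_cost_vector(y, K):
    """Regular classification as a special case: cost 0 for the correct label, 1 otherwise."""
    c = np.ones(K)
    c[y] = 0
    return c

print(cost_vector(0))             # an H1N1 patient: [0 1000 100000]
print(regular_cost_vector(1, 4))  # label 2 of 4 (0-indexed 1): [1. 0. 1. 1.]
```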
Cost-sensitive Classification Setup
Given
N examples, each (input x_n, label y_n, cost c_n) ∈ X × {1, 2, ..., K} × R^K
- will assume c_n[y_n] = 0 = min_{1 ≤ k ≤ K} c_n[k]

Goal
a classifier g(x) that pays a small cost c[g(x)] on a future unseen example (x, y, c)
- will assume c[y] = 0 = min_{1 ≤ k ≤ K} c[k] = c_min
- note: y is not really needed in evaluation

cost-sensitive classification: can express any finite-loss supervised learning task
Our Contributions
  decomposition  | per-class | pair-wise | tournament | err. correcting
  ---------------+-----------+-----------+------------+-----------------
  regular        | OVA       | OVO       | FT         | ECOC
  cost-sensitive | our work  | WAP       | FT         | SECOC

a theoretical and algorithmic study of multiclass cost-sensitive classification, which ...
- introduces a methodology to reduce cost-sensitive classification to regression
- couples the methodology with a novel regression loss for strong theoretical support
- leads to a promising SVM-based algorithm with superior experimental results
Design and Analysis
Key Idea: Cost Estimator
Goal
a classifier g(x) that pays a small cost c[g(x)] on a future unseen example (x, y, c)

if every c[k] were known, the best prediction would be
$$g^*(x) = \operatorname*{argmin}_{1 \le k \le K} c[k]$$

if r_k(x) ≈ c[k] well, an approximately good prediction is
$$g_r(x) = \operatorname*{argmin}_{1 \le k \le K} r_k(x)$$

how to get the cost estimators r_k? regression
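A minimal sketch of g_r (names hypothetical, assuming scikit-learn-style regressors with a `.predict` method): evaluate each cost estimator at x and pick the class with the smallest estimated cost.

```python
import numpy as np

def g_r(x, regressors):
    """g_r(x) = argmin_{1<=k<=K} r_k(x): the class with the smallest estimated cost."""
    estimated_costs = [r.predict(x.reshape(1, -1))[0] for r in regressors]
    return int(np.argmin(estimated_costs))
```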
Cost Estimator by Regression
Given
N examples, each (input x_n, label y_n, cost c_n) ∈ X × {1, 2, ..., K} × R^K

split into K regression datasets, one per class:

  (input, c_n[1]) | (input, c_n[2]) | ... | (input, c_n[K])
  (x_1, 0)        | (x_1, 2)        | ... | (x_1, 1)
  (x_2, 1)        | (x_2, 3)        | ... | (x_2, 5)
      ...         |     ...         |     |     ...
  (x_N, 6)        | (x_N, 1)        | ... | (x_N, 0)
   train r_1      |  train r_2      |     |  train r_K

want: r_k(x) ≈ c[k] for all future (x, y, c) and k

good r_k =⇒ good g_r?
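A minimal sketch of this split (names hypothetical; a plain ridge regressor from scikit-learn stands in for the one-sided regression introduced later): fit one regressor per class on the pairs (x_n, c_n[k]).

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in; the talk later replaces its loss

def fit_cost_estimators(X, C):
    """X: (N, d) inputs x_n; C: (N, K) cost vectors c_n.
    Returns regressors r_1..r_K, where r_k is fit on the pairs (x_n, c_n[k])."""
    K = C.shape[1]
    return [Ridge().fit(X, C[:, k]) for k in range(K)]
```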
Absolute Loss Bound
$$g_r(x) = \operatorname*{argmin}_{1 \le k \le K} r_k(x)$$

Theorem
For any set of regressors (cost estimators) {r_k}_{k=1}^K and for any example (x, y, c) with c[y] = 0,
$$c[g_r(x)] \le \sum_{k=1}^{K} \bigl| r_k(x) - c[k] \bigr|.$$

good r_k =⇒ good g_r? YES!
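A toy numeric check of the theorem (numbers hypothetical): even when the estimates mis-rank the classes, the incurred cost stays below the total absolute regression error.

```python
import numpy as np

c = np.array([0.0, 30.0, 100.0])  # true cost vector, with c[y] = 0 for y = 0
r = np.array([20.0, 10.0, 90.0])  # imperfect cost estimates r_k(x)

g_r = int(np.argmin(r))           # mis-predicts class 1 instead of class 0
incurred = c[g_r]                 # c[g_r(x)] = 30.0
bound = np.abs(r - c).sum()       # 20 + 20 + 10 = 50.0
assert incurred <= bound
print(incurred, bound)
```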
A Pictorial Proof
$$c[g_r(x)] \le \sum_{k=1}^{K} \bigl| r_k(x) - c[k] \bigr|$$

assume c ordered and not degenerate: y = 1; 0 = c[1] < c[2] ≤ ... ≤ c[K]
assume mis-prediction g_r(x) = 2: r_2(x) = min_{1 ≤ k ≤ K} r_k(x) ≤ r_1(x)

[figure: number line marking c[1], c[2], c[3], ..., c[K], with each r_k(x) plotted near c[k]; since g_r(x) = 2, r_2(x) lies left of r_1(x), with gap ∆_1 above c[1] and gap ∆_2 below c[2]]

$$c[2] - \underbrace{c[1]}_{0} \;\le\; \Delta_1 + \Delta_2 \;\le\; \sum_{k=1}^{K} \bigl| r_k(x) - c[k] \bigr|$$
A Closer Look
let ∆_1 ≡ r_1(x) − c[1] and ∆_2 ≡ c[2] − r_2(x)

[figure: the relative positions of c[1], r_2(x), r_1(x), c[2] on the number line in each case, with r_2(x) ≤ r_1(x) throughout]

case ∆_1 ≥ 0 and ∆_2 ≥ 0: c[2] ≤ ∆_1 + ∆_2
case ∆_1 ≤ 0 and ∆_2 ≥ 0: c[2] ≤ ∆_2
case ∆_1 ≥ 0 and ∆_2 ≤ 0: c[2] ≤ ∆_1

in all cases, c[2] ≤ max(∆_1, 0) + max(∆_2, 0) ≤ |∆_1| + |∆_2|
Tighter Bound with One-sided Loss
Define the one-sided loss ξ_k ≡ max(∆_k, 0), with
$$\Delta_k \equiv \begin{cases} r_k(x) - c[k] & \text{if } c[k] = c_{\min} \\ c[k] - r_k(x) & \text{if } c[k] \ne c_{\min} \end{cases}$$

Intuition: ξ_k = 0 encodes ...
- when c[k] = c_min: wish to have r_k(x) ≤ c[k]
- when c[k] ≠ c_min: wish to have r_k(x) ≥ c[k]

$$c[g_r(x)] \;\le\; \underbrace{\sum_{k=1}^{K} \xi_k}_{\text{one-sided loss bound}} \;\le\; \underbrace{\sum_{k=1}^{K} |\Delta_k|}_{\text{absolute loss bound}}$$
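A minimal sketch of the one-sided loss (names hypothetical), with a toy case where it is strictly tighter than the absolute loss: underestimating the minimum cost and overestimating the others incurs no one-sided loss at all.

```python
import numpy as np

def one_sided_loss(r_x, c):
    """Return (xi, |Delta|) per class: xi_k = max(Delta_k, 0), where the sign of
    Delta_k depends on whether c[k] = c_min."""
    delta = np.where(c == c.min(), r_x - c, c - r_x)
    return np.maximum(delta, 0.0), np.abs(delta)

c   = np.array([0.0, 30.0, 100.0])
r_x = np.array([-5.0, 40.0, 120.0])  # below c_min, above the rest: the "wished" directions
xi, abs_delta = one_sided_loss(r_x, c)
print(xi.sum(), abs_delta.sum())     # 0.0 vs. 35.0: the one-sided bound is tighter
```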
The Proposed Reduction Framework
1. transform cost-sensitive examples (x_n, y_n, c_n) into regression examples
   (X_n^(k), Y_n^(k), Z_n^(k)) = (x_n, c_n[k], ±1), where the direction Z_n^(k) is +1 when c_n[k] = c_min (wish r_k below) and −1 otherwise (wish r_k above); see the sketch after the diagram below
2. use a one-sided regression algorithm to get regressors r_k(x)
3. for each new input x, predict its class using g_r(x) = argmin_{1 ≤ k ≤ K} r_k(x)

how to design a good OSR algorithm?

cost-sensitive examples (x_n, y_n, c_n)
  ⇒ regression examples (X_{n,k}, Y_{n,k}, Z_{n,k}), k = 1, ..., K
  ⇒ one-sided regression algorithm
  ⇒ regressors r_k(x), k = 1, ..., K
  ⇒ cost-sensitive classifier g_r(x)
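A minimal sketch of step 1 (names hypothetical): expand each cost-sensitive example into K regression examples carrying the target cost and the penalized direction Z. Steps 2 and 3 correspond to the earlier g_r sketch and the OSR-SVM sketch below.

```python
import numpy as np

def to_regression_examples(X, C):
    """Step 1: each (x_n, y_n, c_n) yields K regression examples
    (X_{n,k}, Y_{n,k}, Z_{n,k}) = (x_n, c_n[k], +/-1)."""
    N, K = C.shape
    examples = []
    for n in range(N):
        c_min = C[n].min()
        for k in range(K):
            Z = +1 if C[n, k] == c_min else -1  # +1: wish r_k(x) <= c[k]; -1: wish r_k(x) >= c[k]
            examples.append((X[n], C[n, k], Z))
    return examples
```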
The Proposed Algorithm
Support Vector Machinery for One-sided Regression
Given
(X_{n,k}, Y_{n,k}, Z_{n,k}) = (x_n, c_n[k], ±1)

Training Goal
all training losses small:
$$\xi_{n,k} = \max\Bigl( \underbrace{Z_{n,k} \cdot \bigl( r_k(X_{n,k}) - Y_{n,k} \bigr)}_{\Delta_{n,k}},\; 0 \Bigr)$$

OSR-SVM for cost-sensitive classification:
$$\min_{w_k, b_k} \frac{1}{2} \langle w_k, w_k \rangle + C \sum_{n=1}^{N} \xi_{n,k}$$
to get r_k(X) = ⟨w_k, φ(X)⟩ + b_k
One-sided Support Vector Regression
Standard Support Vector Regression
$$\min_{w, b} \frac{1}{2} \langle w, w \rangle + C \sum_{n=1}^{N} (\xi_n + \xi_n^*)$$
$$\xi_n = \max\bigl( +(r_k(X_n) - Y_n - \epsilon),\; 0 \bigr), \qquad \xi_n^* = \max\bigl( -(r_k(X_n) - Y_n + \epsilon),\; 0 \bigr)$$

One-sided Support Vector Regression (for each k)
$$\min_{w, b} \frac{1}{2} \langle w, w \rangle + C \sum_{n=1}^{N} \xi_n$$
$$\xi_n = \max\bigl( Z_n \cdot (r_k(X_n) - Y_n),\; 0 \bigr)$$

OSR-SVM: SVR + (ε = 0) + (keep ξ or ξ* according to Z)
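A minimal sketch of the objective being minimized (a linear, unkernelized stand-in for the talk's SVM solver; names, learning rate, and epochs are hypothetical), trained by plain subgradient descent:

```python
import numpy as np

def fit_osr_linear(X, Y, Z, C=1.0, lr=0.01, epochs=500):
    """Minimize (1/2)<w, w> + C * sum_n max(Z_n * (<w, x_n> + b - Y_n), 0)
    by subgradient descent; returns the linear cost estimator r(x) = <w, x> + b."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        delta = Z * (X @ w + b - Y)      # Delta_n = Z_n * (r(X_n) - Y_n)
        active = delta > 0               # examples currently paying one-sided loss
        grad_w = w + C * (Z[active, None] * X[active]).sum(axis=0)
        grad_b = C * Z[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Running this once per class k on the (X, Y, Z) triples from step 1 of the framework yields r_1, ..., r_K, and predicting with g_r completes the reduction.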
OSR-SVM versus OVA-SVM: Formulations
OSR-SVM: g_r(x) = argmin_k r_k(x), with r_k(X) = ⟨w_k, φ(X)⟩ + b_k
$$\min_{w_k, b_k} \frac{1}{2} \langle w_k, w_k \rangle + C \sum_{n=1}^{N} \xi_{n,k}, \qquad \xi_{n,k} = \max\bigl( Z_{n,k} \cdot \bigl( r_k(X_{n,k}) - Y_{n,k} \bigr),\; 0 \bigr)$$

OVA-SVM (−1 for the correct class): g_r(x) = argmin_k r_k(x), with
$$\xi_{n,k} = \max\bigl( Z_{n,k} \cdot r_k(X_{n,k}) + 1,\; 0 \bigr)$$

OVA-SVM: a special case that replaces Y_{n,k} (i.e., c_n[k]) by −Z_{n,k}
Experiments
OSR-SVM versus OVA-SVM: Experiments
[figure: bar chart of avg. test cost for OSR vs. OVA on ten datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)]

OSR: a cost-sensitive extension of OVA
OVA: cost-insensitive SVM

OSR is often significantly better than OVA
OSR-SVM versus WAP/FT/SECOC-SVM
[figure: bar chart of avg. test cost for OSR, WAP, FT, and SECOC on ten datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)]

OSR (per-class): O(K) train/pred
WAP (pair-wise): O(K^2) train/pred
FT (tournament): O(K) train; O(log K) pred
SECOC (err. correcting): O(K) train/pred, with a large constant

speed: FT > OSR > SECOC > WAP
performance: OSR ≈ WAP > FT > SECOC
Conclusion
reduction to regression: a simple way of designing cost-sensitive classification algorithms

theoretical guarantee: absolute and one-sided loss bounds

algorithmic use: a novel and simple algorithm, OSR-SVM

experimental performance of OSR-SVM: leading in the SVM family
Thank you. Questions?