One-sided Support Vector Regression for Multiclass Cost-sensitive Classification
Han-Hsing Tu and Hsuan-Tien Lin
National Taiwan University
June 23, 2010
Cost-sensitive Classification
Binary Cost-sensitive Classification
[figure: is the patient cold-infected or healthy?]

  actual \ predicted |  cold  | healthy
  -------------------+--------+---------
  cold               |   0    | C_{-1}
  healthy            |  C_1   |   0

binary, cost-matrix based
Multiclass Cost-sensitive Classification
[figure: is the patient H1N1-infected, cold-infected, or healthy?]

error measure = society cost

  actual \ predicted | H1N1 | cold | healthy
  -------------------+------+------+---------
  H1N1               |    0 | 1000 |  100000
  cold               |  100 |    0 |    3000
  healthy            |  100 |   30 |       0

human doctors consider the costs of their decisions;
we want computer-aided diagnosis to behave similarly

multiclass, cost-matrix based
From Cost Matrix to Cost Vector
for an example whose actual underlying status is H1N1:

  prediction   | H1N1 | cold | healthy
  society cost |    0 | 1000 |  100000

only a "row" of the cost matrix is needed per example: the cost vector c
- an H1N1 patient: c = (0, 1000, 100000)
- a healthy patient: c = (100, 30, 0)
- "regular" classification, cost for label 2: c = (1, 0, 1, 1)
- binary cost-sensitive classification, cost for label -1: c = (0, C_{-1})

multiclass, cost-vector based: a very general setup
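A minimal sketch of the cost-vector view (the cost matrix is the H1N1 example above; function names are hypothetical): each example's cost vector is simply the row of the cost matrix for its actual class, and regular classification is the special case with 0/1 cost vectors.

```python
import numpy as np

# cost matrix from the H1N1 example: rows = actual class, columns = predicted class
COST_MATRIX = np.array([
    [0,    1000, 100000],  # actual: H1N1
    [100,     0,   3000],  # actual: cold
    [100,    30,      0],  # actual: healthy
])

def cost_vector(y):
    """The cost vector c of an example is the row of the cost matrix for its class y."""
    return COST_MATRIX[y]

def regular_cost_vector(y, K):
    """Regular classification as a special case: cost 0 for the correct label, 1 otherwise."""
    c = np.ones(K)
    c[y] = 0
    return c

print(cost_vector(0))             # an H1N1 patient: [0 1000 100000]
print(regular_cost_vector(1, 4))  # label 2 of 4 (0-indexed 1): [1. 0. 1. 1.]
```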
Cost-sensitive Classification Setup
Given
N examples, each (input x_n, label y_n, cost c_n) ∈ X × {1, 2, ..., K} × R^K
- will assume c_n[y_n] = 0 = min_{1 ≤ k ≤ K} c_n[k]

Goal
a classifier g(x) that pays a small cost c[g(x)] on a future unseen example (x, y, c)
- will assume c[y] = 0 = min_{1 ≤ k ≤ K} c[k] = c_min
- note: y is not really needed in evaluation

cost-sensitive classification: can express any finite-loss supervised learning task
Our Contributions
  decomposition  | per-class | pair-wise | tournament | err. correcting
  ---------------+-----------+-----------+------------+-----------------
  regular        | OVA       | OVO       | FT         | ECOC
  cost-sensitive | our work  | WAP       | FT         | SECOC

a theoretical and algorithmic study of multiclass cost-sensitive classification, which ...
- introduces a methodology to reduce cost-sensitive classification to regression
- couples the methodology with a novel regression loss for strong theoretical support
- leads to a promising SVM-based algorithm with superior experimental results
Design and Analysis
Key Idea: Cost Estimator
Goal
a classifier g(x) that pays a small cost c[g(x)] on a future unseen example (x, y, c)

if every c[k] were known, the best prediction would be
$$g^*(x) = \operatorname*{argmin}_{1 \le k \le K} c[k]$$

if r_k(x) ≈ c[k] well, an approximately good prediction is
$$g_r(x) = \operatorname*{argmin}_{1 \le k \le K} r_k(x)$$

how to get the cost estimators r_k? regression
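A minimal sketch of g_r (names hypothetical, assuming scikit-learn-style regressors with a `.predict` method): evaluate each cost estimator at x and pick the class with the smallest estimated cost.

```python
import numpy as np

def g_r(x, regressors):
    """g_r(x) = argmin_{1<=k<=K} r_k(x): the class with the smallest estimated cost."""
    estimated_costs = [r.predict(x.reshape(1, -1))[0] for r in regressors]
    return int(np.argmin(estimated_costs))
```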
Cost Estimator by Regression
Given
N examples, each (input x_n, label y_n, cost c_n) ∈ X × {1, 2, ..., K} × R^K

split into K regression datasets, one per class:

  (input, c_n[1]) | (input, c_n[2]) | ... | (input, c_n[K])
  (x_1, 0)        | (x_1, 2)        | ... | (x_1, 1)
  (x_2, 1)        | (x_2, 3)        | ... | (x_2, 5)
      ...         |     ...         |     |     ...
  (x_N, 6)        | (x_N, 1)        | ... | (x_N, 0)
   train r_1      |  train r_2      |     |  train r_K

want: r_k(x) ≈ c[k] for all future (x, y, c) and k

good r_k =⇒ good g_r?
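A minimal sketch of this split (names hypothetical; a plain ridge regressor from scikit-learn stands in for the one-sided regression introduced later): fit one regressor per class on the pairs (x_n, c_n[k]).

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in; the talk later replaces its loss

def fit_cost_estimators(X, C):
    """X: (N, d) inputs x_n; C: (N, K) cost vectors c_n.
    Returns regressors r_1..r_K, where r_k is fit on the pairs (x_n, c_n[k])."""
    K = C.shape[1]
    return [Ridge().fit(X, C[:, k]) for k in range(K)]
```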
Absolute Loss Bound
$$g_r(x) = \operatorname*{argmin}_{1 \le k \le K} r_k(x)$$

Theorem
For any set of regressors (cost estimators) {r_k}_{k=1}^K and for any example (x, y, c) with c[y] = 0,
$$c[g_r(x)] \le \sum_{k=1}^{K} \bigl| r_k(x) - c[k] \bigr|.$$

good r_k =⇒ good g_r? YES!
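A toy numeric check of the theorem (numbers hypothetical): even when the estimates mis-rank the classes, the incurred cost stays below the total absolute regression error.

```python
import numpy as np

c = np.array([0.0, 30.0, 100.0])  # true cost vector, with c[y] = 0 for y = 0
r = np.array([20.0, 10.0, 90.0])  # imperfect cost estimates r_k(x)

g_r = int(np.argmin(r))           # mis-predicts class 1 instead of class 0
incurred = c[g_r]                 # c[g_r(x)] = 30.0
bound = np.abs(r - c).sum()       # 20 + 20 + 10 = 50.0
assert incurred <= bound
print(incurred, bound)
```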
A Pictorial Proof
$$c[g_r(x)] \le \sum_{k=1}^{K} \bigl| r_k(x) - c[k] \bigr|$$

assume c ordered and not degenerate: y = 1; 0 = c[1] < c[2] ≤ ... ≤ c[K]
assume mis-prediction g_r(x) = 2: r_2(x) = min_{1 ≤ k ≤ K} r_k(x) ≤ r_1(x)

[figure: number line marking c[1], c[2], c[3], ..., c[K], with each r_k(x) plotted near c[k]; since g_r(x) = 2, r_2(x) lies left of r_1(x), with gap ∆_1 above c[1] and gap ∆_2 below c[2]]

$$c[2] - \underbrace{c[1]}_{0} \;\le\; \Delta_1 + \Delta_2 \;\le\; \sum_{k=1}^{K} \bigl| r_k(x) - c[k] \bigr|$$
A Closer Look
let ∆_1 ≡ r_1(x) − c[1] and ∆_2 ≡ c[2] − r_2(x)

[figure: the relative positions of c[1], r_2(x), r_1(x), c[2] on the number line in each case, with r_2(x) ≤ r_1(x) throughout]

case ∆_1 ≥ 0 and ∆_2 ≥ 0: c[2] ≤ ∆_1 + ∆_2
case ∆_1 ≤ 0 and ∆_2 ≥ 0: c[2] ≤ ∆_2
case ∆_1 ≥ 0 and ∆_2 ≤ 0: c[2] ≤ ∆_1

in all cases, c[2] ≤ max(∆_1, 0) + max(∆_2, 0) ≤ |∆_1| + |∆_2|
Tighter Bound with One-sided Loss
Define the one-sided loss ξ_k ≡ max(∆_k, 0), with
$$\Delta_k \equiv \begin{cases} r_k(x) - c[k] & \text{if } c[k] = c_{\min} \\ c[k] - r_k(x) & \text{if } c[k] \ne c_{\min} \end{cases}$$

Intuition: ξ_k = 0 encodes ...
- when c[k] = c_min: wish to have r_k(x) ≤ c[k]
- when c[k] ≠ c_min: wish to have r_k(x) ≥ c[k]

$$c[g_r(x)] \;\le\; \underbrace{\sum_{k=1}^{K} \xi_k}_{\text{one-sided loss bound}} \;\le\; \underbrace{\sum_{k=1}^{K} |\Delta_k|}_{\text{absolute loss bound}}$$
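A minimal sketch of the one-sided loss (names hypothetical), with a toy case where it is strictly tighter than the absolute loss: underestimating the minimum cost and overestimating the others incurs no one-sided loss at all.

```python
import numpy as np

def one_sided_loss(r_x, c):
    """Return (xi, |Delta|) per class: xi_k = max(Delta_k, 0), where the sign of
    Delta_k depends on whether c[k] = c_min."""
    delta = np.where(c == c.min(), r_x - c, c - r_x)
    return np.maximum(delta, 0.0), np.abs(delta)

c   = np.array([0.0, 30.0, 100.0])
r_x = np.array([-5.0, 40.0, 120.0])  # below c_min, above the rest: the "wished" directions
xi, abs_delta = one_sided_loss(r_x, c)
print(xi.sum(), abs_delta.sum())     # 0.0 vs. 35.0: the one-sided bound is tighter
```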
The Proposed Reduction Framework
1. transform cost-sensitive examples (x_n, y_n, c_n) into regression examples
   (X_n^(k), Y_n^(k), Z_n^(k)) = (x_n, c_n[k], ±1), where the direction Z_n^(k) is +1 when c_n[k] = c_min (wish r_k below) and −1 otherwise (wish r_k above); see the sketch after the diagram below
2. use a one-sided regression algorithm to get regressors r_k(x)
3. for each new input x, predict its class using g_r(x) = argmin_{1 ≤ k ≤ K} r_k(x)

how to design a good OSR algorithm?

cost-sensitive examples (x_n, y_n, c_n)
  ⇒ regression examples (X_{n,k}, Y_{n,k}, Z_{n,k}), k = 1, ..., K
  ⇒ one-sided regression algorithm
  ⇒ regressors r_k(x), k = 1, ..., K
  ⇒ cost-sensitive classifier g_r(x)
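A minimal sketch of step 1 (names hypothetical): expand each cost-sensitive example into K regression examples carrying the target cost and the penalized direction Z. Steps 2 and 3 correspond to the earlier g_r sketch and the OSR-SVM sketch below.

```python
import numpy as np

def to_regression_examples(X, C):
    """Step 1: each (x_n, y_n, c_n) yields K regression examples
    (X_{n,k}, Y_{n,k}, Z_{n,k}) = (x_n, c_n[k], +/-1)."""
    N, K = C.shape
    examples = []
    for n in range(N):
        c_min = C[n].min()
        for k in range(K):
            Z = +1 if C[n, k] == c_min else -1  # +1: wish r_k(x) <= c[k]; -1: wish r_k(x) >= c[k]
            examples.append((X[n], C[n, k], Z))
    return examples
```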
The Proposed Algorithm
Support Vector Machinery for One-sided Regression
Given
(X_{n,k}, Y_{n,k}, Z_{n,k}) = (x_n, c_n[k], ±1)

Training Goal
all training losses small:
$$\xi_{n,k} = \max\Bigl( \underbrace{Z_{n,k} \cdot \bigl( r_k(X_{n,k}) - Y_{n,k} \bigr)}_{\Delta_{n,k}},\; 0 \Bigr)$$

OSR-SVM for cost-sensitive classification:
$$\min_{w_k, b_k} \frac{1}{2} \langle w_k, w_k \rangle + C \sum_{n=1}^{N} \xi_{n,k}$$
to get r_k(X) = ⟨w_k, φ(X)⟩ + b_k
One-sided Support Vector Regression
Standard Support Vector Regression
$$\min_{w, b} \frac{1}{2} \langle w, w \rangle + C \sum_{n=1}^{N} (\xi_n + \xi_n^*)$$
$$\xi_n = \max\bigl( +(r_k(X_n) - Y_n - \epsilon),\; 0 \bigr), \qquad \xi_n^* = \max\bigl( -(r_k(X_n) - Y_n + \epsilon),\; 0 \bigr)$$

One-sided Support Vector Regression (for each k)
$$\min_{w, b} \frac{1}{2} \langle w, w \rangle + C \sum_{n=1}^{N} \xi_n$$
$$\xi_n = \max\bigl( Z_n \cdot (r_k(X_n) - Y_n),\; 0 \bigr)$$

OSR-SVM: SVR + (ε = 0) + (keep ξ or ξ* according to Z)
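A minimal sketch of the objective being minimized (a linear, unkernelized stand-in for the talk's SVM solver; names, learning rate, and epochs are hypothetical), trained by plain subgradient descent:

```python
import numpy as np

def fit_osr_linear(X, Y, Z, C=1.0, lr=0.01, epochs=500):
    """Minimize (1/2)<w, w> + C * sum_n max(Z_n * (<w, x_n> + b - Y_n), 0)
    by subgradient descent; returns the linear cost estimator r(x) = <w, x> + b."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        delta = Z * (X @ w + b - Y)      # Delta_n = Z_n * (r(X_n) - Y_n)
        active = delta > 0               # examples currently paying one-sided loss
        grad_w = w + C * (Z[active, None] * X[active]).sum(axis=0)
        grad_b = C * Z[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Running this once per class k on the (X, Y, Z) triples from step 1 of the framework yields r_1, ..., r_K, and predicting with g_r completes the reduction.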
OSR-SVM versus OVA-SVM: Formulations
OSR-SVM: g_r(x) = argmin_k r_k(x), with r_k(X) = ⟨w_k, φ(X)⟩ + b_k
$$\min_{w_k, b_k} \frac{1}{2} \langle w_k, w_k \rangle + C \sum_{n=1}^{N} \xi_{n,k}, \qquad \xi_{n,k} = \max\bigl( Z_{n,k} \cdot \bigl( r_k(X_{n,k}) - Y_{n,k} \bigr),\; 0 \bigr)$$

OVA-SVM (−1 for the correct class): g_r(x) = argmin_k r_k(x), with
$$\xi_{n,k} = \max\bigl( Z_{n,k} \cdot r_k(X_{n,k}) + 1,\; 0 \bigr)$$

OVA-SVM: a special case that replaces Y_{n,k} (i.e., c_n[k]) by −Z_{n,k}
Experiments
OSR-SVM versus OVA-SVM: Experiments
[figure: bar chart of avg. test cost for OSR vs. OVA on ten datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)]

OSR: a cost-sensitive extension of OVA
OVA: cost-insensitive SVM

OSR is often significantly better than OVA
OSR-SVM versus WAP/FT/SECOC-SVM
[figure: bar chart of avg. test cost for OSR, WAP, FT, and SECOC on ten datasets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)]

OSR (per-class): O(K) train/pred
WAP (pair-wise): O(K^2) train/pred
FT (tournament): O(K) train; O(log K) pred
SECOC (err. correcting): O(K) train/pred, with a large constant

speed: FT > OSR > SECOC > WAP
performance: OSR ≈ WAP > FT > SECOC
Conclusion
reduction to regression: a simple way of designing cost-sensitive classification algorithms

theoretical guarantee: absolute and one-sided loss bounds

algorithmic use: a novel and simple algorithm, OSR-SVM

experimental performance of OSR-SVM: leading in the SVM family
Thank you. Questions?