Cost-sensitive Multiclass Classification via Regression
Hsuan-Tien Lin
Dept. of CSIE, NTU
Talk at NTU CE, 10/15/2013
Joint work with Han-Hsing Tu; parts appeared in ICML ’10
Which Digit Did You Write?
[image: a handwritten digit]
one (1) two (2) three (3) four (4)
• a classification problem
—grouping “pictures” into different “categories”
How can machines learn to classify?
Supervised Machine Learning
[diagram: Parent ⇒ (picture, category) pairs ⇒ Kid's good decision function (brain), built from possibilities]
[diagram: Truth f(x) + noise e(x) ⇒ examples (picture xn, category yn) ⇒ learning algorithm ⇒ good decision function g(x) ≈ f(x), chosen from the learning model {gα(x)}]
challenge:
see only {(xn, yn)} without knowing f(x) or e(x)
⇒ generalize to unseen (x, y) w.r.t. f(x)
Mis-prediction Costs
(is g(x) ≈ f(x)?)
• ZIP code recognition:
1:wrong; 2:right; 3: wrong; 4: wrong
• check value recognition:
1: one-dollar mistake; 2: no mistake;
3: one-dollar mistake; 4: two-dollar mistake
different applications:
evaluate mis-predictions differently
ZIP Code Recognition
[image: a handwritten digit]
1:wrong; 2:right; 3: wrong; 4: wrong
• regular classification problem: only right or wrong
• wrong cost: 1; right cost: 0
• prediction error of g on some (x, y ):
classification cost = ⟦y ≠ g(x)⟧
regular classification:
well-studied, many good algorithms
Check Value Recognition
[image: a handwritten digit]
1:one-dollar mistake; 2:no mistake;
3:one-dollar mistake; 4: two-dollar mistake
• cost-sensitive classification problem:
different costs for different mis-predictions
• e.g. prediction error of g on some (x, y ):
absolute cost =|y − g(x)|
cost-sensitive classification:
new, need more research
What is the Status of the Patient?
[image: a patient]
H1N1-infected cold-infected healthy
• another classification problem
—grouping “patients” into different “status”
Are all mis-prediction costs equal?
Patient Status Prediction
error measure = society cost

actual \ predicted    H1N1    cold    healthy
H1N1                     0    1000     100000
cold                   100       0       3000
healthy                100      30          0
• H1N1 mis-predicted as healthy: very high cost
• cold mis-predicted as healthy: high cost
• cold correctly predicted as cold: no cost
human doctors consider costs of decision;
can computer-aided diagnosis do the same?
What is the Type of the Movie?
[image: a movie] — romance, fiction, or terror?
customer 1, who hates terror but likes romance (error measure = non-satisfaction)

actual \ predicted    romance    fiction    terror
romance                     0          5       100

customer 2, who likes terror and romance

actual \ predicted    romance    fiction    terror
romance                     0          5         3
different customers:
evaluate mis-predictions differently
Cost-sensitive Classification Tasks
movie classification with non-satisfaction
actual \ predicted     romance    fiction    terror
customer 1, romance          0          5       100
customer 2, romance          0          5         3
patient diagnosis with society cost
actual \ predicted    H1N1    cold    healthy
H1N1                     0    1000     100000
cold                   100       0       3000
healthy                100      30          0
check digit recognition with absolute cost C(y, g(x)) = |g(x) − y|
Cost Vector
cost vector c: a row of cost components
• customer 1 on a romance movie: c = (0,5,100)
• an H1N1 patient: c = (0,1000,100000)
• absolute cost for digit 2: c = (1,0,1,2)
• “regular” classification cost for label 2: c = (1, 0, 1, 1)
regular classification:
special case of cost-sensitive classification
Cost-sensitive Classification Setup
Given
N examples, each
(input xn, label yn, cost cn) ∈ X × {1, 2, . . . , K} × R^K
• K = 2: binary; K > 2: multiclass
• will assume cn[yn] = 0 = min_{1≤k≤K} cn[k]
Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y, c)
• will assume c[y] = 0 = cmin = min_{1≤k≤K} c[k]
• note: y not really needed in evaluation
cost-sensitive classification:
can express any finite-loss supervised learning task
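To make the setup concrete, here is a minimal sketch (toy data and names of my own, not from the talk) of what the examples look like and how a classifier is charged its cost:

```python
import numpy as np

# Toy cost-sensitive data set: each example is (input x_n, label y_n, cost vector c_n).
# As in the setup, c_n[y_n] = 0 = min_k c_n[k]; labels are 1..K, stored 0-based here.
X = np.array([[0.1, 0.3], [0.9, 0.5], [0.4, 0.8]])
y = np.array([0, 1, 2])
C = np.array([[0.0, 1.0, 2.0],     # absolute cost around label 1
              [1.0, 0.0, 1.0],     # absolute cost around label 2
              [2.0, 1.0, 0.0]])    # absolute cost around label 3

def average_cost(g, X, C):
    """Average c[g(x)] over the examples; note y itself is not needed for evaluation."""
    predictions = np.array([g(x) for x in X])
    return np.mean(C[np.arange(len(X)), predictions])

# A (useless) constant classifier that always predicts the first class:
print(average_cost(lambda x: 0, X, C))  # -> (0 + 1 + 2) / 3 = 1.0
```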
Our Contribution
                   binary                     multiclass
regular            well-studied               well-studied
cost-sensitive     known (Zadrozny, 2003)     ongoing (our work, among others)
a theoretic and algorithmic study of cost-sensitive classification, which ...
• introduces a methodology to reduce cost-sensitive classification to regression
• provides strong theoretical support for the methodology
• leads to a promising algorithm with superior experimental results
will describe the methodology and an algorithm
Key Idea: Cost Estimator
Goal
a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y , c)
if every c[k] known:
optimal g∗(x) = argmin_{1≤k≤K} c[k]
if rk(x) ≈ c[k] well:
approximately good gr(x) = argmin_{1≤k≤K} rk(x)
how to get the cost estimator rk? regression
Cost Estimator by Per-class Regression
Given
N examples, each (input xn, label yn, cost cn) ∈ X × {1, 2, . . . , K} × R^K
input  cn[1]      input  cn[2]      . . .     input  cn[K]
(x1, 0)           (x1, 2)                     (x1, 1)
(x2, 1)           (x2, 3)                     (x2, 5)
· · ·             · · ·                       · · ·
(xN, 6)           (xN, 1)                     (xN, 0)
  ↓ r1              ↓ r2                        ↓ rK
want: rk(x)≈ c[k] for all future (x, y, c) and k
The Reduction Framework
1 transform cost-sensitive examples (xn, yn, cn) to regression examples (Xn,k, Yn,k) = (xn, cn[k])
2 use your favorite algorithm on the regression examples and get regressors rk(x)
3 for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)
the reduction-to-regression framework:
systematic & easy to implement
cost-sensitive example (xn, yn, cn) ⇒ regression examples (Xn,k, Yn,k), k = 1, · · · , K ⇒ regression algorithm ⇒ regressors rk(x), k ∈ 1, · · · , K ⇒ cost-sensitive classifier gr(x)
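A minimal sketch of the three steps above (my own illustration; it plugs in plain least-squares regression as the "favorite algorithm", which the framework deliberately leaves open):

```python
import numpy as np

def train_reduction_to_regression(X, C):
    """Steps 1 + 2: for each class k, fit a least-squares regressor r_k(x) to c_n[k]."""
    N, K = C.shape
    Xb = np.hstack([X, np.ones((N, 1))])        # add a bias column
    # One weight vector per class: column k of W solves ||Xb w_k - C[:, k]||^2.
    W = np.linalg.lstsq(Xb, C, rcond=None)[0]   # shape (d + 1, K)
    return W

def predict(W, x):
    """Step 3: g_r(x) = argmin_k r_k(x)."""
    xb = np.append(x, 1.0)
    return int(np.argmin(xb @ W))

# Usage with the toy data from the earlier sketch:
X = np.array([[0.1, 0.3], [0.9, 0.5], [0.4, 0.8]])
C = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
W = train_reduction_to_regression(X, C)
print(predict(W, np.array([0.2, 0.3])))
```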
Theoretical Guarantees (1/2)
gr(x) = argmin_{1≤k≤K} rk(x)
Theorem (Absolute Loss Bound)
For any set of regressors (cost estimators) {rk}_{k=1}^K and for any example (x, y, c) with c[y] = 0,
c[gr(x)] ≤ Σ_{k=1}^K |rk(x) − c[k]|.
low-cost classifier⇐= accurate regressor
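As a quick toy check of the bound (numbers of my own, not from the talk): take K = 3, c = (0, 1, 3), and regressors with r1(x) = 0.6, r2(x) = 0.4, r3(x) = 2.5. Then gr(x) = 2 and c[gr(x)] = 1, while Σ_k |rk(x) − c[k]| = 0.6 + 0.6 + 0.5 = 1.7, so the bound holds; with the more accurate estimates r1(x) = 0.1, r2(x) = 1.2, r3(x) = 2.8, gr(x) = 1 and the cost drops to 0.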
Theoretical Guarantees (2/2)
gr(x) = argmin_{1≤k≤K} rk(x)
Theorem (Squared Loss Bound)
For any set of regressors (cost estimators) {rk}_{k=1}^K and for any example (x, y, c) with c[y] = 0,
c[gr(x)] ≤ √(2 Σ_{k=1}^K (rk(x) − c[k])²).
applies to common least-squares regression
A Pictorial Proof
c[gr(x)] ≤ Σ_{k=1}^K |rk(x) − c[k]|
• assume c ordered and not degenerate:
y = 1; 0 = c[1] < c[2] ≤ · · · ≤ c[K]
• assume mis-prediction gr(x) = 2:
r2(x) = min_{1≤k≤K} rk(x) ≤ r1(x)
[figure: a number line marking c[1] = 0, c[2], c[3], . . . , c[K] and the estimates r2(x), r1(x), r3(x), . . . , rK(x); ∆1 spans from c[1] up to r1(x) and ∆2 from r2(x) up to c[2]]

c[2] − c[1] (with c[1] = 0) ≤ ∆1 + ∆2 ≤ Σ_{k=1}^K |rk(x) − c[k]|
An Even Closer Look
let ∆1 ≡ r1(x) − c[1] and ∆2 ≡ c[2] − r2(x)
1 ∆1 ≥ 0 and ∆2 ≥ 0: c[2] ≤ ∆1 + ∆2
2 ∆1 ≤ 0 and ∆2 ≥ 0: c[2] ≤ ∆2
3 ∆1 ≥ 0 and ∆2 ≤ 0: c[2] ≤ ∆1
c[2] ≤ max(∆1, 0) + max(∆2, 0) ≤ |∆1| + |∆2|
[figure: number lines illustrating the three cases for r1(x) and r2(x) relative to c[1] and c[2]]
Tighter Bound with One-sided Loss
Define one-sided loss ξk ≡ max(∆k, 0) with
∆k ≡ rk(x) − c[k]   if c[k] = cmin
∆k ≡ c[k] − rk(x)   if c[k] ≠ cmin
Intuition
• c[k] = cmin: wish to have rk(x) ≤ c[k]
• c[k] ≠ cmin: wish to have rk(x) ≥ c[k]
— both wishes same as ∆k ≤ 0 and hence ξk = 0
One-sided Loss Bound:
c[gr(x)] ≤ Σ_{k=1}^K ξk ≤ Σ_{k=1}^K |∆k|
The Improved Reduction Framework
1 transform cost-sensitive examples (xn, yn, cn) to regression examples
(Xn,k, Yn,k, Zn,k) = (xn, cn[k], 2⟦cn[k] = cn[yn]⟧ − 1)
2 use a one-sided regression algorithm to get regressors rk(x)
3 for each new input x, predict its class using gr(x) = argmin_{1≤k≤K} rk(x)
the reduction-to-OSR framework:
need a good OSR algorithm
cost-sensitive example (xn, yn, cn) ⇒ regression examples (Xn,k, Yn,k, Zn,k), k = 1, · · · , K ⇒ one-sided regression algorithm ⇒ regressors rk(x), k ∈ 1, · · · , K ⇒ cost-sensitive classifier gr(x)
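A small sketch of step 1 above (my own illustration; the function name and layout are made up):

```python
import numpy as np

def to_one_sided_regression_examples(X, C):
    """Expand each (x_n, c_n) into K regression examples (x_n, c_n[k], Z_{n,k}), where
    Z_{n,k} = 2 * [[ c_n[k] == c_n[y_n] ]] - 1, i.e. +1 for the lowest-cost label(s)
    and -1 otherwise (using c_n[y_n] = min_k c_n[k] from the setup)."""
    N, K = C.shape
    Xs, Ys, Zs = [], [], []
    for n in range(N):
        c_min = C[n].min()
        for k in range(K):
            Xs.append(X[n])
            Ys.append(C[n, k])
            Zs.append(1.0 if C[n, k] == c_min else -1.0)
    return np.array(Xs), np.array(Ys), np.array(Zs)
```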
Regularized One-sided Hyper-linear Regression
Given
(Xn,k, Yn,k, Zn,k) = (xn, cn[k], 2⟦cn[k] = cn[yn]⟧ − 1)
Training Goal
all training ξn,k = max(Zn,k · (rk(Xn,k) − Yn,k), 0) small, where ∆n,k ≡ rk(Xn,k) − Yn,k
— will drop k below
min_{w,b}  λ/2 · ⟨w, w⟩ + Σ_{n=1}^N ξn    to get rk(X) = ⟨w, φ(X)⟩ + b
One-sided Support Vector Regression
Regularized One-sided Hyper-linear Regression
min_{w,b}  λ/2 · ⟨w, w⟩ + Σ_{n=1}^N ξn
ξn = max(Zn · (rk(Xn) − Yn), 0)
Standard Support Vector Regression
min_{w,b}  1/(2C) · ⟨w, w⟩ + Σ_{n=1}^N (ξn + ξ∗n)
ξn = max(+1 · (rk(Xn) − Yn − ε), 0)
ξ∗n = max(−1 · (rk(Xn) − Yn + ε), 0)
OSR-SVM = SVR + (ε → 0) + (keep ξn or ξ∗n by Zn)
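A bare-bones sketch of this training problem (my own simplification: a linear r(X) = ⟨w, X⟩ + b trained by plain (sub)gradient descent on the one-sided loss, rather than the kernelized SVR-style solver of OSR-SVM):

```python
import numpy as np

def train_one_sided_regressor(X, Y, Z, lam=0.01, lr=0.01, epochs=200):
    """Minimize  lam/2 * <w, w> + sum_n max(Z_n * (<w, x_n> + b - Y_n), 0)
    by (sub)gradient descent; returns (w, b)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        pred = X @ w + b
        active = (Z * (pred - Y) > 0).astype(float)   # examples with positive one-sided loss
        grad_w = lam * w + X.T @ (active * Z)
        grad_b = np.sum(active * Z)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Training one such regressor per class on the (Xn,k, Yn,k, Zn,k) triples from the earlier sketch and predicting with gr(x) = argmin_k rk(x) gives a rough OSR-style classifier.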
OSR versus Other Reductions
OSR: K regressors
How unlikely (costly) does the example belong to class k?
Filter Tree (FT): K − 1 binary classifiers
Is the lowest cost within labels {1, 4} or {2, 3}?
Is the lowest cost within label {1} or {4}?
Is the lowest cost within label {2} or {3}?
Weighted All Pairs (WAP): K(K − 1)/2 binary classifiers
Is c[1] or c[4] lower?
Sensitive Error Correcting Output Code (SECOC): (T · K) binary classifiers
Is c[1] + c[3] + c[4] greater than some θ?
Experiment: OSR-SVM versus OVA-SVM
[plot: avg. test cost of OSR vs. OVA on ten data sets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)]
• OSR: a cost-sensitive extension of OVA
• OVA: regular SVM
OSR often significantly better than OVA
Experiment: OSR versus FT
[plot: avg. test cost of OSR vs. FT on the same ten data sets]
• OSR (per-class):
O(K ) training, O(K ) prediction
• FT (tournament):
O(K) training, O(log₂ K) prediction
FT faster, but OSR better performed
Experiment: OSR versus WAP
[plot: avg. test cost of OSR vs. WAP on the same ten data sets]
• OSR (per-class):
O(K ) training, O(K ) prediction
• WAP (pairwise):
O(K²) training, O(K²) prediction
OSR faster and comparable performance
Experiment: OSR versus SECOC
[plot: avg. test cost of OSR vs. SECOC on the same ten data sets]
• OSR (per-class):
O(K ) training, O(K ) prediction
• SECOC (error-correcting):
O(K) training, O(K) prediction, but with big constants
OSR faster and much better performance
Conclusion
• reduction to regression:
a simple way of designing cost-sensitive classification algorithms
• theoretical guarantee:
absolute, squared, and one-sided bounds
• algorithmic use:
a novel and simple algorithm: OSR-SVM
• experimental performance of OSR-SVM:
leading in SVM family
more algorithm and application opportunities
Acknowledgments
• Profs. Chih-Jen Lin, Yuh-Jyh Lee, Shou-de Lin for suggestions
• Prof. Chuin-Shan Chen for talk invitation
• Computational Learning Lab @ NTU for discussions