(1)

Cost-sensitive Multiclass Classification via Regression

Hsuan-Tien Lin

Dept. of CSIE, NTU

Talk at NTU CE, 10/15/2013

Joint work with Han-Hsing Tu; parts appeared in ICML ’10

(2)

Which Digit Did You Write?

?

one (1) two (2) three (3) four (4)

a classification problem

—grouping “pictures” into different “categories”

How can machines learn to classify?

(3)

Supervised Machine Learning

[diagram: supervised machine learning, by analogy]

Parent  ⇒  (picture, category) pairs  ⇒  kid's brain (considering possibilities)  ⇒  good decision function

Truth f(x) + noise e(x)  ⇒  examples (picture x_n, category y_n)  ⇒  learning algorithm (choosing from learning model {g_α(x)})  ⇒  good decision function g(x) ≈ f(x)

challenge:

see only {(x_n, y_n)} without knowing f(x) or e(x)

=?⇒ generalize to unseen (x, y) w.r.t. f(x)

(4)

Mis-prediction Costs

(g(x) ≈ f(x)?)

?

ZIP code recognition:

1: wrong; 2: right; 3: wrong; 4: wrong

check value recognition:

1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake

different applications:

evaluate mis-predictions differently

(5)

ZIP Code Recognition

?

1: wrong; 2: right; 3: wrong; 4: wrong

regular classification problem: only right or wrong

wrong cost: 1; right cost: 0

prediction error of g on some (x, y):

classification cost = ⟦y ≠ g(x)⟧

regular classification:

well-studied, many good algorithms

(6)

Check Value Recognition

?

1: one-dollar mistake; 2: no mistake; 3: one-dollar mistake; 4: two-dollar mistake

cost-sensitive classification problem:

different costs for different mis-predictions

e.g. prediction error of g on some (x, y):

absolute cost = |y − g(x)|

cost-sensitive classification:

new, need more research

(7)

What is the Status of the Patient?

?

H1N1-infected cold-infected healthy

another classification problem

—grouping “patients” into different “statuses”

Are all mis-prediction costs equal?

(8)

Patient Status Prediction

error measure = society cost

actual \ predicted    H1N1     cold    healthy
H1N1                     0     1000     100000
cold                   100        0       3000
healthy                100       30          0

H1N1 mis-predicted as healthy: very high cost

cold mis-predicted as healthy: high cost

cold correctly predicted as cold: no cost

human doctors consider costs of decision;

can computer-aided diagnosis do the same?

(9)

What is the Type of the Movie?

? romance fiction terror

customer 1 who hates terror but likes romance, error measure = non-satisfaction

actual \ predicted    romance   fiction   terror
romance                     0         5      100

customer 2 who likes terror and romance

actual \ predicted    romance   fiction   terror
romance                     0         5        3

different customers:

evaluate mis-predictions differently

(10)

Cost-sensitive Classification Tasks

movie classification with non-satisfaction

actual \ predicted       romance   fiction   terror
customer 1, romance            0         5      100
customer 2, romance            0         5        3

patient diagnosis with society cost

actual \ predicted    H1N1     cold    healthy
H1N1                     0     1000     100000
cold                   100        0       3000
healthy                100       30          0

check digit recognition with absolute cost C(y, g(x)) = |g(x) − y|

(11)

Cost Vector

cost vector c: a row of cost components

customer 1 on a romance movie: c = (0, 5, 100)

an H1N1 patient: c = (0, 1000, 100000)

absolute cost for digit 2: c = (1, 0, 1, 2)

“regular” classification cost for label 2: c^(2) = (1, 0, 1, 1)

regular classification:

special case of cost-sensitive classification
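
For concreteness, a small Python sketch (my own illustration, not part of the talk) that builds the two cost vectors above, assuming labels 1, …, K:

```python
import numpy as np

def regular_cost_vector(y, K):
    """Regular classification cost for true label y: 0 if predicted k == y, else 1."""
    c = np.ones(K)
    c[y - 1] = 0.0          # labels are 1..K, arrays are 0-indexed
    return c

def absolute_cost_vector(y, K):
    """Absolute cost |k - y| for each possible prediction k = 1..K."""
    return np.abs(np.arange(1, K + 1) - y).astype(float)

# e.g. digit "2" among {1, 2, 3, 4}
print(regular_cost_vector(2, 4))   # [1. 0. 1. 1.]
print(absolute_cost_vector(2, 4))  # [1. 0. 1. 2.]
```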

(12)

Cost-sensitive Classification Setup

Given

N examples, each (input x_n, label y_n, cost c_n) ∈ X × {1, 2, . . . , K} × R^K

K = 2: binary; K > 2: multiclass

will assume c_n[y_n] = 0 = min_{1≤k≤K} c_n[k]

Goal

a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y, c)

will assume c[y] = 0 = c_min = min_{1≤k≤K} c[k]

note: y not really needed in evaluation

cost-sensitive classification:

can express any finite-loss supervised learning tasks

(13)

Our Contribution

                 binary                    multiclass
regular          well-studied              well-studied
cost-sensitive   known (Zadrozny, 2003)    ongoing (our work, among others)

a theoretical and algorithmic study of cost-sensitive classification, which ...

introduces a methodology to reduce cost-sensitive classification to regression

provides strong theoretical support for the methodology

leads to a promising algorithm with superior experimental results

will describe the methodology and an algorithm

(14)

Key Idea: Cost Estimator

Goal

a classifier g(x) that pays a small cost c[g(x)] on future unseen example (x, y, c)

if every c[k] known: optimal g(x) = argmin_{1≤k≤K} c[k]

if r_k(x) ≈ c[k] well: approximately good g_r(x) = argmin_{1≤k≤K} r_k(x)

how to get cost estimator r_k? regression
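
A minimal sketch of this decision rule (my own illustration; the per-class estimators are assumed to be plain Python callables):

```python
import numpy as np

def predict_cost_sensitive(x, regressors):
    """g_r(x) = argmin_k r_k(x); `regressors` is a list of K fitted cost estimators.
    Returns a label in 1..K."""
    estimated_costs = [r(x) for r in regressors]     # r_k(x) ≈ c[k]
    return int(np.argmin(estimated_costs)) + 1       # +1: labels are 1..K
```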

(15)

Cost Estimator by Per-class Regression

Given

N examples, each (input x_n, label y_n, cost c_n) ∈ X × {1, 2, . . . , K} × R^K

    input, c_n[1]     input, c_n[2]     . . .     input, c_n[K]
    (x_1, 0)          (x_1, 2)                    (x_1, 1)
    (x_2, 1)          (x_2, 3)                    (x_2, 5)
    · · ·
    (x_N, 6)          (x_N, 1)                    (x_N, 0)
      ⇒ r_1             ⇒ r_2                       ⇒ r_K

want: r_k(x) ≈ c[k] for all future (x, y, c) and k

(16)

The Reduction Framework

1. transform cost-sensitive examples (x_n, y_n, c_n) to regression examples (X_{n,k}, Y_{n,k}) = (x_n, c_n[k])

2. use your favorite algorithm on the regression examples and get regressors r_k(x)

3. for each new input x, predict its class using g_r(x) = argmin_{1≤k≤K} r_k(x)

the reduction-to-regression framework:

systematic & easy to implement







[diagram: cost-sensitive examples (x_n, y_n, c_n)  ⇒  regression examples (X_{n,k}, Y_{n,k}), k = 1, · · · , K  ⇒  regression algorithm  ⇒  regressors r_k(x), k = 1, · · · , K  ⇒  cost-sensitive classifier g_r(x)]
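
The three steps translate almost directly into code. Below is a minimal sketch of the reduction-to-regression framework, not the authors' implementation; scikit-learn's Ridge is used as a stand-in for "your favorite" regression algorithm, and the toy data at the end uses the absolute cost:

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_reduction_to_regression(X, C, make_regressor=lambda: Ridge(alpha=1.0)):
    """X: (N, d) inputs; C: (N, K) cost matrix with C[n, k] = c_n[k].
    Trains one regressor r_k per class on the examples (x_n, c_n[k])."""
    K = C.shape[1]
    return [make_regressor().fit(X, C[:, k]) for k in range(K)]

def predict(regressors, X_new):
    """g_r(x) = argmin_k r_k(x), returned as labels in 1..K."""
    R = np.column_stack([r.predict(X_new) for r in regressors])  # (M, K) estimated costs
    return R.argmin(axis=1) + 1

# toy usage: 3 classes, absolute cost c_n[k] = |k - y_n|
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(1, 4, size=100)
C = np.abs(np.arange(1, 4)[None, :] - y[:, None]).astype(float)
regressors = train_reduction_to_regression(X, C)
print(predict(regressors, X[:5]))
```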

(17)

Theoretical Guarantees (1/2)

g_r(x) = argmin_{1≤k≤K} r_k(x)

Theorem (Absolute Loss Bound)

For any set of regressors (cost estimators) {r_k}_{k=1}^K and for any example (x, y, c) with c[y] = 0,

c[g_r(x)] ≤ Σ_{k=1}^K |r_k(x) − c[k]|.

low-cost classifier ⇐= accurate regressor

(18)

Theoretical Guarantees (2/2)

g_r(x) = argmin_{1≤k≤K} r_k(x)

Theorem (Squared Loss Bound)

For any set of regressors (cost estimators) {r_k}_{k=1}^K and for any example (x, y, c) with c[y] = 0,

c[g_r(x)] ≤ sqrt( 2 · Σ_{k=1}^K (r_k(x) − c[k])² ).

applies to common least-squares regression

(19)

A Pictorial Proof

c[g_r(x)] ≤ Σ_{k=1}^K |r_k(x) − c[k]|

assume c ordered and not degenerate: y = 1; 0 = c[1] < c[2] ≤ · · · ≤ c[K]

assume mis-prediction g_r(x) = 2: r_2(x) = min_{1≤k≤K} r_k(x) ≤ r_1(x)

[picture: number line showing c[1], c[2], c[3], . . . , c[K] together with the estimates r_2(x), r_1(x), r_3(x), . . . , r_K(x); the gaps ∆_1 and ∆_2 lie between c[1], c[2] and r_1(x), r_2(x)]

c[2] − c[1] (with c[1] = 0)  ≤  ∆_1 + ∆_2  ≤  Σ_{k=1}^K |r_k(x) − c[k]|

(20)

An Even Closer Look

let ∆_1 ≡ r_1(x) − c[1] and ∆_2 ≡ c[2] − r_2(x)

1. ∆_1 ≥ 0 and ∆_2 ≥ 0: c[2] ≤ ∆_1 + ∆_2

2. ∆_1 ≤ 0 and ∆_2 ≥ 0: c[2] ≤ ∆_2

3. ∆_1 ≥ 0 and ∆_2 ≤ 0: c[2] ≤ ∆_1

c[2] ≤ max(∆_1, 0) + max(∆_2, 0) ≤ |∆_1| + |∆_2|

[picture: the three cases, showing r_2(x) and r_1(x) relative to c[1] and c[2] on the number line]
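
For readability, the chain of inequalities behind the pictorial proof can be written out as follows (a restatement of the slides' argument under their assumptions, not an additional result):

```latex
% assumptions from the slides: y = 1, 0 = c[1] < c[2] \le \cdots \le c[K],
% and mis-prediction g_r(x) = 2, i.e. r_2(x) = \min_{1 \le k \le K} r_k(x) \le r_1(x)
\begin{align*}
c[g_r(x)] = c[2]
  &\le \max(\Delta_1, 0) + \max(\Delta_2, 0) && \text{(case analysis above)} \\
  &\le |\Delta_1| + |\Delta_2|
   =   |r_1(x) - c[1]| + |c[2] - r_2(x)| \\
  &\le \sum_{k=1}^{K} \bigl|\, r_k(x) - c[k] \,\bigr| && \text{(adding the remaining non-negative terms)}
\end{align*}
```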

(21)

Tighter Bound with One-sided Loss

Define one-sided loss ξ_k ≡ max(∆_k, 0) with

∆_k ≡ r_k(x) − c[k]   if c[k] = c_min
∆_k ≡ c[k] − r_k(x)   if c[k] ≠ c_min

Intuition

c[k] = c_min: wish to have r_k(x) ≤ c[k]

c[k] ≠ c_min: wish to have r_k(x) ≥ c[k]

—both wishes same as ∆_k ≤ 0 and hence ξ_k = 0

One-sided Loss Bound:

c[g_r(x)] ≤ Σ_{k=1}^K ξ_k ≤ Σ_{k=1}^K |∆_k|
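
A small Python sketch (my own illustration) of the one-sided loss of a single example, given its cost vector c and the vector of estimates r(x):

```python
import numpy as np

def one_sided_losses(r_x, c):
    """ξ_k = max(Δ_k, 0), where Δ_k = r_k(x) - c[k] if c[k] == c_min
    (we want an under-estimate there) and Δ_k = c[k] - r_k(x) otherwise
    (we want an over-estimate there)."""
    r_x, c = np.asarray(r_x, float), np.asarray(c, float)
    is_min = np.isclose(c, c.min())
    delta = np.where(is_min, r_x - c, c - r_x)
    return np.maximum(delta, 0.0)

# e.g. c = (0, 1, 2), estimates r(x) = (0.3, 0.5, 2.5)
print(one_sided_losses([0.3, 0.5, 2.5], [0, 1, 2]))  # [0.3 0.5 0. ]
```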

(22)

The Improved Reduction Framework

1. transform cost-sensitive examples (x_n, y_n, c_n) to regression examples (X_n^(k), Y_n^(k), Z_n^(k)) = (x_n, c_n[k], 2⟦c_n[k] = c_n[y_n]⟧ − 1)

2. use a one-sided regression algorithm to get regressors r_k(x)

3. for each new input x, predict its class using g_r(x) = argmin_{1≤k≤K} r_k(x)

the reduction-to-OSR framework:

need a good OSR algorithm







[diagram: cost-sensitive examples (x_n, y_n, c_n)  ⇒  regression examples (X_{n,k}, Y_{n,k}, Z_{n,k}), k = 1, · · · , K  ⇒  one-sided regression algorithm  ⇒  regressors r_k(x), k = 1, · · · , K  ⇒  cost-sensitive classifier g_r(x)]
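
A sketch (my own illustration) of step 1 for one class k: the flag Z encodes the desired direction of the estimate, and it relies on the setup's assumption that c_n[y_n] is the minimum entry of c_n:

```python
import numpy as np

def make_osr_examples(X, C, k):
    """Per-class one-sided regression examples for class k (0-indexed here).
    Y = c_n[k]; Z = +1 if c_n[k] is the minimum cost (estimate from below),
    Z = -1 otherwise (estimate from above), i.e. Z = 2*[[c_n[k] = c_n[y_n]]] - 1."""
    Y = C[:, k]
    Z = np.where(np.isclose(Y, C.min(axis=1)), 1.0, -1.0)
    return X, Y, Z
```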

(23)

Regularized One-sided Hyper-linear Regression

Given

(X_{n,k}, Y_{n,k}, Z_{n,k}) = (x_n, c_n[k], 2⟦c_n[k] = c_n[y_n]⟧ − 1)

Training Goal

all training ξ_{n,k} = max(∆_{n,k}, 0) small, where ∆_{n,k} ≡ Z_{n,k} · (r_k(X_{n,k}) − Y_{n,k})  —will drop k

min_{w,b}  λ/2 · ⟨w, w⟩ + Σ_{n=1}^N ξ_n    to get r_k(X) = ⟨w, φ(X)⟩ + b
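
As a rough illustration of this optimization problem (not the paper's kernelized OSR-SVM solver), here is a plain full-batch subgradient-descent sketch for a linear r(x) = ⟨w, x⟩ + b; λ, the learning rate, and the epoch count are arbitrary choices:

```python
import numpy as np

def fit_one_sided_linear(X, Y, Z, lam=1.0, lr=0.01, epochs=200):
    """Minimize  (lam/2)·<w, w> + Σ_n max(Z_n·(<w, x_n> + b − Y_n), 0)
    by full-batch subgradient descent on a linear model (no kernel)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        residual = X @ w + b - Y                 # r(x_n) − Y_n
        active = (Z * residual) > 0              # examples with non-zero one-sided loss
        grad_w = lam * w + X[active].T @ Z[active]
        grad_b = Z[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```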

(24)

One-sided Support Vector Regression

Regularized One-sided Hyper-linear Regression

min_{w,b}  λ/2 · ⟨w, w⟩ + Σ_{n=1}^N ξ_n,    ξ_n = max(Z_n · (r_k(X_n) − Y_n), 0)

Standard Support Vector Regression

min_{w,b}  1/(2C) · ⟨w, w⟩ + Σ_{n=1}^N (ξ_n + ξ_n*),
ξ_n = max(+1 · (r_k(X_n) − Y_n − ε), 0),    ξ_n* = max(−1 · (r_k(X_n) − Y_n + ε), 0)

OSR-SVM = SVR + (ε → 0) + (keep ξ_n or ξ_n* by Z_n)

(25)

OSR versus Other Reductions

OSR: K regressors

How unlikely (costly) does the example belong to class k?

Filter Tree (FT): K − 1 binary classifiers

Is the lowest cost within labels {1, 4} or {2, 3}?

Is the lowest cost within label {1} or {4}?

Is the lowest cost within label {2} or {3}?

Weighted All Pairs (WAP): K(K − 1)/2 binary classifiers

is c[1] or c[4] lower?

Sensitive Error Correcting Output Code (SECOC): (T · K) binary classifiers

is c[1] + c[3] + c[4] greater than some θ?

(26)

Experiment: OSR-SVM versus OVA-SVM

[bar chart: avg. test cost (0 to 350) of OSR vs. OVA on ten benchmark data sets (ir., wi., gl., ve., vo., se., dn., sa., us., le.)]

OSR: a cost-sensitive extension of OVA

OVA: regular SVM

OSR often significantly better than OVA

(27)

Experiment: OSR versus FT

[bar chart: avg. test cost (0 to 300) of OSR vs. FT on the same ten data sets]

OSR (per-class):

O(K) training, O(K) prediction

FT (tournament):

O(K) training, O(log₂ K) prediction

FT faster, but OSR performs better

(28)

Experiment: OSR versus WAP

[bar chart: avg. test cost (0 to 300) of OSR vs. WAP on the same ten data sets]

OSR (per-class):

O(K) training, O(K) prediction

WAP (pairwise):

O(K²) training, O(K²) prediction

OSR faster, with comparable performance

(29)

Experiment: OSR versus SECOC

[bar chart: avg. test cost (0 to 300) of OSR vs. SECOC on the same ten data sets]

OSR (per-class):

O(K) training, O(K) prediction

SECOC (error-correcting): big O(K) training, big O(K) prediction

OSR faster, with much better performance

(30)

Conclusion

reduction to regression:

a simple way of designing cost-sensitive classification algorithms

theoretical guarantee:

absolute, squared and one-sided bounds

algorithmic use:

a novel and simple algorithm, OSR-SVM

experimental performance of OSR-SVM:

leading in the SVM family

more algorithm and application opportunities

(31)

Acknowledgments

Profs. Chih-Jen Lin, Yuh-Jyh Lee, Shou-de Lin for suggestions

Prof. Chuin-Shan Chen for talk invitation

Computational Learning Lab @ NTU for discussions

Thank you. Questions?
