A Boosting Tutorial
Rob Schapire
Princeton University
www.cs.princeton.edu/~schapire
Example: “How May I Help You?”
[Gorin et al.]
• goal: automatically categorize type of call requested by phone
customer (Collect, CallingCard, PersonToPerson, etc.)
• yes I’d like to place a collect call long distance please (Collect)
• operator I need to make a call but I need to bill it to my office
(ThirdNumber)
• yes I’d like to place a call on my master card please (CallingCard)
• I just called a number in sioux city and I musta rang the wrong number
because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
• observation:
• easy to find “rules of thumb” that are “often” correct
• e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
• hard to find single highly accurate prediction rule
The Boosting Approach
• devise computer program for deriving rough rules of thumb
• apply procedure to subset of examples
• obtain rule of thumb
• apply to 2nd subset of examples
• obtain 2nd rule of thumb
• repeat T times
Details
• how to choose examples on each round?
• concentrate on “hardest” examples
(those most often misclassified by previous rules of thumb)
• how to combine rules of thumb into single prediction rule?
• take (weighted) majority vote of rules of thumb
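These two answers can be sketched in Python. The doubling update below is only an illustrative reweighting scheme standing in for the idea of concentrating on hard examples (AdaBoost's exact exponential update appears later in the tutorial), and the function names are invented for this sketch:

```python
def reweight(weights, examples, labels, rule):
    """One round: concentrate weight on the 'hardest' examples,
    i.e. those the current rule of thumb misclassifies.
    (Illustrative doubling scheme, not AdaBoost's exact update.)"""
    new = [w * (2.0 if rule(x) != y else 1.0)
           for w, x, y in zip(weights, examples, labels)]
    total = sum(new)
    return [w / total for w in new]     # keep the weights a distribution

def weighted_majority(rules, votes, x):
    """Combine the rules of thumb by a weighted majority vote."""
    score = sum(v * h(x) for v, h in zip(votes, rules))
    return 1 if score >= 0 else -1
```

For instance, a rule that always predicts +1 on labels (1, 1, −1, −1) doubles the weight of the last two examples before renormalizing, so the next rule of thumb focuses on them.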
Boosting
• boosting = general method of converting rough rules of thumb into highly accurate prediction rule
• technically:
• assume given “weak” learning algorithm that can
consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55%
(in two-class setting)
• given sufficient data, a boosting algorithm can provably
construct single classifier with very high accuracy, say, 99%
Outline of Tutorial
• brief background
• basic algorithm and core theory
• other ways of understanding boosting
• experiments, applications and extensions
Brief Background
The Boosting Problem
• “strong” PAC algorithm
• for any distribution
• ∀ ε > 0, δ > 0
• given polynomially many random examples
• finds classifier with error ≤ ε with probability ≥ 1 − δ
• “weak” PAC algorithm
• same, but only for ε ≥ 1/2 − γ
• [Kearns & Valiant ’88]:
• does weak learnability imply strong learnability?
Early Boosting Algorithms
• [Schapire ’89]:
• first provable boosting algorithm
• call weak learner three times on three modified distributions
• get slight boost in accuracy
• apply recursively
• [Freund ’90]:
• “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
• first experiments using boosting
• limited by practical drawbacks
AdaBoost
• [Freund & Schapire ’95]:
• introduced “AdaBoost” algorithm
• strong practical advantages over previous boosting algorithms
• experiments and applications using AdaBoost:
[Drucker & Cortes ’96]
[Jackson & Craven ’96]
[Freund & Schapire ’96]
[Quinlan ’96]
[Breiman ’96]
[Maclin & Opitz ’97]
[Bauer & Kohavi ’97]
[Schwenk & Bengio ’98]
[Schapire, Singer & Singhal ’98]
[Abney, Schapire & Singer ’99]
[Haruno, Shirai & Ooyama ’99]
[Cohen & Singer ’99]
[Dietterich ’00]
[Schapire & Singer ’00]
[Collins ’00]
[Escudero, Màrquez & Rigau ’00]
[Iyer, Lewis, Schapire et al. ’00]
[Onoda, Rätsch & Müller ’00]
[Tieu & Viola ’00]
[Walker, Rambow & Rogati ’01]
[Rochery, Schapire, Rahim & Gupta ’01]
[Merler, Furlanello, Larcher & Sboner ’01]
[Di Fabbrizio, Dutton, Gupta et al. ’02]
[Qu, Adam, Yasui et al. ’02]
[Tur, Schapire & Hakkani-Tür ’03]
[Viola & Jones ’04]
[Middendorf, Kundaje, Wiggins et al. ’04]
...
• continuing development of theory and algorithms:
[Breiman ’98, ’99]
[Schapire, Freund, Bartlett & Lee ’98]
[Grove & Schuurmans ’98]
[Mason, Bartlett & Baxter ’98]
[Schapire & Singer ’99]
[Cohen & Singer ’99]
[Freund & Mason ’99]
[Domingo & Watanabe ’99]
[Mason, Baxter, Bartlett & Frean ’99, ’00]
[Duffy & Helmbold ’99, ’02]
[Ridgeway, Madigan & Richardson ’99]
[Kivinen & Warmuth ’99]
[Friedman, Hastie & Tibshirani ’00]
[Rätsch, Onoda & Müller ’00]
[Rätsch, Warmuth, Mika et al. ’00]
[Allwein, Schapire & Singer ’00]
[Friedman ’01]
[Koltchinskii, Panchenko & Lozano ’01]
[Collins, Schapire & Singer ’02]
[Demiriz, Bennett & Shawe-Taylor ’02]
[Lebanon & Lafferty ’02]
[Wyner ’02]
[Rudin, Daubechies & Schapire ’03]
[Jiang ’04]
[Lugosi & Vayatis ’04]
[Zhang ’04]
...
Basic Algorithm and Core Theory
A Formal Description of Boosting
• given training set (x1, y1), . . . , (xm, ym)
• yi ∈ {−1, +1} correct label of instance xi ∈ X
• for t = 1, . . . , T:
• construct distribution Dt on {1, . . . , m}
• find weak classifier (“rule of thumb”) ht : X → {−1, +1}
with small error εt on Dt:
εt = PrDt[ht(xi) ≠ yi]
• output final classifier Hfinal
AdaBoost
[with Freund]
• constructing Dt:
• D1(i) = 1/m
• given Dt and ht:
Dt+1(i) = (Dt(i)/Zt) × e^(−αt) if yi = ht(xi), or (Dt(i)/Zt) × e^(αt) if yi ≠ ht(xi)
= (Dt(i)/Zt) exp(−αt yi ht(xi))
where Zt = normalization constant and
αt = (1/2) ln((1 − εt)/εt) > 0
• final classifier:
• Hfinal(x) = sign(Σt αt ht(x))
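As a concrete, hedged illustration, the update and final vote above fit in a few lines of Python. The pool of candidate decision stumps standing in for a weak learning algorithm is an assumption of this sketch, not part of the tutorial:

```python
import math

def adaboost(xs, ys, weak_pool, T):
    """Minimal AdaBoost sketch. xs: instances, ys: labels in {-1,+1},
    weak_pool: candidate weak classifiers (stands in for a weak learner)."""
    m = len(xs)
    D = [1.0 / m] * m                               # D1(i) = 1/m
    H = []                                          # pairs (alpha_t, h_t)
    for _ in range(T):
        # weak learner: pick the candidate with smallest weighted error on D_t
        def err(g):
            return sum(d for d, x, y in zip(D, xs, ys) if g(x) != y)
        h = min(weak_pool, key=err)
        eps = err(h)
        if eps == 0.0:                              # perfect weak classifier
            H.append((1.0, h))
            break
        if eps >= 0.5:                              # weak-learning assumption fails
            break
        alpha = 0.5 * math.log((1 - eps) / eps)     # alpha_t > 0
        # D_{t+1}(i) proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i))
        D = [d * math.exp(-alpha * y * h(x)) for d, x, y in zip(D, xs, ys)]
        Z = sum(D)                                  # normalization constant Z_t
        D = [d / Z for d in D]
        H.append((alpha, h))
    # H_final(x) = sign(sum_t alpha_t h_t(x))
    return lambda x: 1 if sum(a * g(x) for a, g in H) >= 0 else -1

# illustrative weak classifiers: decision stumps on the real line
def stump(theta, p):
    return lambda x: p if x < theta else -p

xs = [1, 2, 3, 4, 5, 6]                 # toy 1-D data: an interval of +1s,
ys = [-1, -1, 1, 1, -1, -1]             # not separable by any single stump
pool = [stump(t + 0.5, p) for t in range(7) for p in (1, -1)]
H_final = adaboost(xs, ys, pool, 3)     # three rounds fit this set exactly
```

No single stump gets the interval right, but the weighted vote of three does; this mirrors how boosting turns rules of thumb that are only somewhat better than random into an accurate combined classifier.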
Toy Example
[figure: toy example of points in the plane]
D1: initial (uniform) distribution on the training examples
weak classifiers = vertical or horizontal half-planes
Round 1: weak classifier h1, with ε1 = 0.30 and α1 = 0.42; reweighting yields D2
Round 2: weak classifier h2, with ε2 = 0.21 and α2 = 0.65; reweighting yields D3
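The vote weights in this example follow from the αt formula above; a quick numerical check (the small gap at round 2 comes from the displayed errors being rounded to two digits):

```python
import math

def alpha(eps):
    """alpha_t = (1/2) ln((1 - eps_t) / eps_t)"""
    return 0.5 * math.log((1 - eps) / eps)

print(round(alpha(0.30), 2))   # 0.42, matching round 1
print(round(alpha(0.21), 2))   # 0.66; the slide's 0.65 reflects the unrounded error
```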