A Boosting Tutorial
Rob Schapire
Princeton University
www.cs.princeton.edu/~schapire
Example: “How May I Help You?”
[Gorin et al.]
• goal: automatically categorize type of call requested by phone
customer (Collect, CallingCard, PersonToPerson, etc.)
• yes I’d like to place a collect call long distance please (Collect)
• operator I need to make a call but I need to bill it to my office
(ThirdNumber)
• yes I’d like to place a call on my master card please (CallingCard)
• I just called a number in sioux city and I musta rang the wrong number
because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
• observation:
• easy to find “rules of thumb” that are “often” correct
• e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
• hard to find single highly accurate prediction rule
The Boosting Approach
• devise computer program for deriving rough rules of thumb
• apply procedure to subset of examples
• obtain rule of thumb
• apply to 2nd subset of examples
• obtain 2nd rule of thumb
• repeat T times
Details
• how to choose examples on each round?
• concentrate on “hardest” examples
(those most often misclassified by previous rules of thumb)
• how to combine rules of thumb into single prediction rule?
• take (weighted) majority vote of rules of thumb
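These two answers can be sketched in Python. The doubling update below is only an illustrative reweighting scheme standing in for the idea of concentrating on hard examples (AdaBoost's exact exponential update appears later in the tutorial), and the function names are invented for this sketch:

```python
def reweight(weights, examples, labels, rule):
    """One round: concentrate weight on the 'hardest' examples,
    i.e. those the current rule of thumb misclassifies.
    (Illustrative doubling scheme, not AdaBoost's exact update.)"""
    new = [w * (2.0 if rule(x) != y else 1.0)
           for w, x, y in zip(weights, examples, labels)]
    total = sum(new)
    return [w / total for w in new]     # keep the weights a distribution

def weighted_majority(rules, votes, x):
    """Combine the rules of thumb by a weighted majority vote."""
    score = sum(v * h(x) for v, h in zip(votes, rules))
    return 1 if score >= 0 else -1
```

For instance, a rule that always predicts +1 on labels (1, 1, −1, −1) doubles the weight of the last two examples before renormalizing, so the next rule of thumb focuses on them.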
Boosting
• boosting = general method of converting rough rules of thumb into highly accurate prediction rule
• technically:
• assume given “weak” learning algorithm that can
consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55%
(in two-class setting)
• given sufficient data, a boosting algorithm can provably
construct single classifier with very high accuracy, say, 99%
Outline of Tutorial
• brief background
• basic algorithm and core theory
• other ways of understanding boosting
• experiments, applications and extensions
Brief Background
The Boosting Problem
• “strong” PAC algorithm
• for any distribution
• ∀ ε > 0, δ > 0
• given polynomially many random examples
• finds classifier with error ≤ ε with probability ≥ 1 − δ
• “weak” PAC algorithm
• same, but only for ε ≥ 1/2 − γ
• [Kearns & Valiant ’88]:
• does weak learnability imply strong learnability?
Early Boosting Algorithms
• [Schapire ’89]:
• first provable boosting algorithm
• call weak learner three times on three modified distributions
• get slight boost in accuracy
• apply recursively
• [Freund ’90]:
• “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
• first experiments using boosting
• limited by practical drawbacks
AdaBoost
• [Freund & Schapire ’95]:
• introduced “AdaBoost” algorithm
• strong practical advantages over previous boosting algorithms
• experiments and applications using AdaBoost:
[Drucker & Cortes ’96]
[Jackson & Craven ’96]
[Freund & Schapire ’96]
[Quinlan ’96]
[Breiman ’96]
[Maclin & Opitz ’97]
[Bauer & Kohavi ’97]
[Schwenk & Bengio ’98]
[Schapire, Singer & Singhal ’98]
[Abney, Schapire & Singer ’99]
[Haruno, Shirai & Ooyama ’99]
[Cohen & Singer ’99]
[Dietterich ’00]
[Schapire & Singer ’00]
[Collins ’00]
[Escudero, Màrquez & Rigau ’00]
[Iyer, Lewis, Schapire et al. ’00]
[Onoda, Rätsch & Müller ’00]
[Tieu & Viola ’00]
[Walker, Rambow & Rogati ’01]
[Rochery, Schapire, Rahim & Gupta ’01]
[Merler, Furlanello, Larcher & Sboner ’01]
[Di Fabbrizio, Dutton, Gupta et al. ’02]
[Qu, Adam, Yasui et al. ’02]
[Tur, Schapire & Hakkani-Tür ’03]
[Viola & Jones ’04]
[Middendorf, Kundaje, Wiggins et al. ’04]
...
• continuing development of theory and algorithms:
[Breiman ’98, ’99]
[Schapire, Freund, Bartlett & Lee ’98]
[Grove & Schuurmans ’98]
[Mason, Bartlett & Baxter ’98]
[Schapire & Singer ’99]
[Cohen & Singer ’99]
[Freund & Mason ’99]
[Domingo & Watanabe ’99]
[Mason, Baxter, Bartlett & Frean ’99, ’00]
[Duffy & Helmbold ’99, ’02]
[Ridgeway, Madigan & Richardson ’99]
[Kivinen & Warmuth ’99]
[Friedman, Hastie & Tibshirani ’00]
[Rätsch, Onoda & Müller ’00]
[Rätsch, Warmuth, Mika et al. ’00]
[Allwein, Schapire & Singer ’00]
[Friedman ’01]
[Koltchinskii, Panchenko & Lozano ’01]
[Collins, Schapire & Singer ’02]
[Demiriz, Bennett & Shawe-Taylor ’02]
[Lebanon & Lafferty ’02]
[Wyner ’02]
[Rudin, Daubechies & Schapire ’03]
[Jiang ’04]
[Lugosi & Vayatis ’04]
[Zhang ’04]
...
Basic Algorithm and Core Theory
A Formal Description of Boosting
• given training set (x1, y1), . . . , (xm, ym)
• yi ∈ {−1, +1} correct label of instance xi ∈ X
• for t = 1, . . . , T:
• construct distribution Dt on {1, . . . , m}
• find weak classifier (“rule of thumb”) ht : X → {−1, +1}
with small error εt on Dt:
εt = PrDt[ht(xi) ≠ yi]
• output final classifier Hfinal
AdaBoost
[with Freund]
• constructing Dt:
• D1(i) = 1/m
• given Dt and ht:
Dt+1(i) = (Dt(i)/Zt) × e^(−αt) if yi = ht(xi), or (Dt(i)/Zt) × e^(αt) if yi ≠ ht(xi)
= (Dt(i)/Zt) exp(−αt yi ht(xi))
where Zt = normalization constant and
αt = (1/2) ln((1 − εt)/εt) > 0
• final classifier:
• Hfinal(x) = sign(Σt αt ht(x))
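As a concrete, hedged illustration, the update and final vote above fit in a few lines of Python. The pool of candidate decision stumps standing in for a weak learning algorithm is an assumption of this sketch, not part of the tutorial:

```python
import math

def adaboost(xs, ys, weak_pool, T):
    """Minimal AdaBoost sketch. xs: instances, ys: labels in {-1,+1},
    weak_pool: candidate weak classifiers (stands in for a weak learner)."""
    m = len(xs)
    D = [1.0 / m] * m                               # D1(i) = 1/m
    H = []                                          # pairs (alpha_t, h_t)
    for _ in range(T):
        # weak learner: pick the candidate with smallest weighted error on D_t
        def err(g):
            return sum(d for d, x, y in zip(D, xs, ys) if g(x) != y)
        h = min(weak_pool, key=err)
        eps = err(h)
        if eps == 0.0:                              # perfect weak classifier
            H.append((1.0, h))
            break
        if eps >= 0.5:                              # weak-learning assumption fails
            break
        alpha = 0.5 * math.log((1 - eps) / eps)     # alpha_t > 0
        # D_{t+1}(i) proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i))
        D = [d * math.exp(-alpha * y * h(x)) for d, x, y in zip(D, xs, ys)]
        Z = sum(D)                                  # normalization constant Z_t
        D = [d / Z for d in D]
        H.append((alpha, h))
    # H_final(x) = sign(sum_t alpha_t h_t(x))
    return lambda x: 1 if sum(a * g(x) for a, g in H) >= 0 else -1

# illustrative weak classifiers: decision stumps on the real line
def stump(theta, p):
    return lambda x: p if x < theta else -p

xs = [1, 2, 3, 4, 5, 6]                 # toy 1-D data: an interval of +1s,
ys = [-1, -1, 1, 1, -1, -1]             # not separable by any single stump
pool = [stump(t + 0.5, p) for t in range(7) for p in (1, -1)]
H_final = adaboost(xs, ys, pool, 3)     # three rounds fit this set exactly
```

No single stump gets the interval right, but the weighted vote of three does; this mirrors how boosting turns rules of thumb that are only somewhat better than random into an accurate combined classifier.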
Toy Example
[figure: toy example of points in the plane]
D1: initial (uniform) distribution on the training examples
weak classifiers = vertical or horizontal half-planes
Round 1: weak classifier h1, with ε1 = 0.30 and α1 = 0.42; reweighting yields D2
Round 2: weak classifier h2, with ε2 = 0.21 and α2 = 0.65; reweighting yields D3
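The vote weights in this example follow from the αt formula above; a quick numerical check (the small gap at round 2 comes from the displayed errors being rounded to two digits):

```python
import math

def alpha(eps):
    """alpha_t = (1/2) ln((1 - eps_t) / eps_t)"""
    return 0.5 * math.log((1 - eps) / eps)

print(round(alpha(0.30), 2))   # 0.42, matching round 1
print(round(alpha(0.21), 2))   # 0.66; the slide's 0.65 reflects the unrounded error
```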