Foundations of Machine Learning (FOML) Lecture 4
Transcription
Kristiaan Pelckmans, September 25, 2015.

Overview
Today:
- AdaBoost.
- Analysis.
- Extensions.
- Discussion.

Overview (Ct’d)
Course:
1. Introduction.
2. Support Vector Machines (SVMs).
3. Probably Approximately Correct (PAC) analysis.
4. Boosting. Fri 2015-09-25.
5. Online Learning. Tue 2015-09-29.
6. Multi-class classification (Kalyam), Fri 2015-10-02.
7. Ranking (Jakob, Fredrik, Andreas), Tue 2015-10-06.
8. Regression (Tatiana, Ruben), Thu 2015-10-08.
9. Stability-based analysis (Tilo, Juozas, Yevgen), Tue 2015-10-13.
10. Dimensionality reduction (Fredrik, Thomas, Ali Basirat), Thu 2015-10-15.
11. Reinforcement learning (Sholeh), Wed 2015-10-21.
12. Presentations of the results of the mini-projects.

Overview (Ct’d)
Mini-projects: AdaBoost
1. Detecting faces.
2. Integral image representation.
3. AdaBoost.
4. Hierarchical classification.
5. Empirical and actual risk.
6. Complexity of weak learners.
7. How about tuning T?
8. Numerical results.

AdaBoost. Average learning and boosting
- How to boost weak learners into a global strong learner?
- Weak learner of a concept class C: for any distribution D and any \delta > 0, the learner returns h_S such that
    \Pr_{S \sim D^m} \left[ R(h_S) \le \tfrac{1}{2} - \gamma \right] \ge 1 - \delta,
  provided m samples are given, with m \ge \mathrm{poly}(1/\delta, n, \mathrm{size}(c)).
- E.g. decision stumps or decision trees.

AdaBoost. Citations
- M. Kearns and L.G. Valiant. Cryptographic limitations on learning Boolean formulae and automata. Technical report, 1988.
- M. Kearns and L.G. Valiant. Efficient distribution-free learning of stochastic concepts, 1990.
- R.E. Schapire. The strength of weak learnability. Machine Learning, 1990.
- Y. Freund. Boosting a weak learning algorithm by majority. COLT, 1990.
- L. Breiman. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.
- Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.
- R.E. Schapire, Y. Freund, P. Bartlett and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5): 1651-1686, 1998.
- D. Mease and A. Wyner. Evidence contrary to the statistical view of boosting. JMLR, 2008.
- R.E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.

AdaBoost (Ct’d)
Initiate D_1(i) = 1/m.
For t = 1, ..., T:
1. Choose a new base classifier h_t \in H based on the weighted dataset, and compute
    \epsilon_t = \sum_{i=1}^m D_t(i) \, I(y_i \neq h_t(x_i)).
2. Let \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.
3. Let Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.
4. Let D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}, for all i.
Then
    g = \mathrm{sign}\left( \sum_{t=1}^T \alpha_t h_t \right).

AdaBoost (Ct’d)
Theorem (bound on the empirical error of g). Assume \epsilon_t \le 1/2 for all t; then
    \hat{R}(g) \le \exp\left( -2 \sum_{t=1}^T \left( \tfrac{1}{2} - \epsilon_t \right)^2 \right).
Proof: First,
    D_{T+1}(i) = \frac{D_T(i) \exp(-\alpha_T y_i h_T(x_i))}{Z_T} = \frac{\exp(-y_i g(x_i))}{m \prod_{t=1}^T Z_t}.
Then
    \hat{R}(g) = \frac{1}{m} \sum_{i=1}^m I(y_i g(x_i) \le 0)
    \le \frac{1}{m} \sum_{i=1}^m \exp(-y_i g(x_i))
    = \frac{1}{m} \sum_{i=1}^m \left( m \prod_{t=1}^T Z_t \right) D_{T+1}(i)
    = \prod_{t=1}^T Z_t.

AdaBoost (Ct’d)
Proof (Ct’d): and
    Z_t = \sum_{i=1}^m D_t(i) \exp(-\alpha_t y_i h_t(x_i))
    = \sum_{i: y_i h_t(x_i) = +1} D_t(i) e^{-\alpha_t} + \sum_{i: y_i h_t(x_i) = -1} D_t(i) e^{\alpha_t}
    = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t}
    = 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \quad (\text{optimal!}),
and hence
    \prod_{t=1}^T Z_t = \prod_{t=1}^T 2 \sqrt{\epsilon_t (1 - \epsilon_t)}
    = \prod_{t=1}^T \sqrt{1 - 4 \left( \tfrac{1}{2} - \epsilon_t \right)^2}
    \le \exp\left( -2 \sum_{t=1}^T \left( \tfrac{1}{2} - \epsilon_t \right)^2 \right).

AdaBoost (Ct’d)
Coordinate descent. Consider the function
    F_t(\alpha) = \sum_{i=1}^m \exp\left( -y_i \sum_{s=1}^t \alpha_s h_s(x_i) \right).
Then, at each iteration t, one chooses the base classifier by solving
    \arg\min_{h_t} \frac{d F_t(\alpha)}{d \alpha_t} = \arg\min_{h_t} (2 \epsilon_t - 1) \, m \prod_{s=1}^{t-1} Z_s.
Moreover,
    \frac{d F_t(\alpha)}{d \alpha_t} = 0 \iff \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.
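The loop above translates almost line by line into code. Below is a minimal sketch in Python/NumPy, assuming decision stumps as the weak learners (one of the examples mentioned in the lecture); the function names (stump_predict, fit_stump, adaboost_train, adaboost_predict) and the toy data are illustrative, not part of the lecture material.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Decision stump: predict +1/-1 by thresholding a single feature."""
    return polarity * np.where(X[:, feature] <= threshold, 1.0, -1.0)

def fit_stump(X, y, D):
    """Pick the stump minimising the weighted error eps_t = sum_i D(i) I(y_i != h(x_i))."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (+1.0, -1.0):
                pred = stump_predict(X, feature, threshold, polarity)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best_err, best = err, (feature, threshold, polarity)
    return best, best_err

def adaboost_train(X, y, T):
    """AdaBoost as on the slides: returns the weak learners and their weights alpha_t."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        (f, thr, pol), eps = fit_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against eps_t in {0, 1}
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 log((1 - eps_t)/eps_t)
        pred = stump_predict(X, f, thr, pol)
        D = D * np.exp(-alpha * y * pred)        # D_{t+1}(i) proportional to D_t(i) exp(-alpha_t y_i h_t(x_i))
        D = D / D.sum()                          # explicit normalisation plays the role of dividing by Z_t
        stumps.append((f, thr, pol))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """g(x) = sign(sum_t alpha_t h_t(x))."""
    score = sum(a * stump_predict(X, f, thr, pol)
                for (f, thr, pol), a in zip(stumps, alphas))
    return np.sign(score)

# Toy usage on synthetic data (illustrative only).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
    stumps, alphas = adaboost_train(X, y, T=20)
    print("training error:", np.mean(adaboost_predict(X, stumps, alphas) != y))
```

Note that renormalising D after the exponential update is exactly the division by Z_t in the slides, so the factor 2 sqrt(eps_t (1 - eps_t)) never has to be computed explicitly.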
AdaBoost (Ct’d)
Relation to logistic regression:
- Zero-one loss: \frac{1}{m} \sum_{i=1}^m I(y_i h(x_i) < 0).
- Hinge loss: \frac{1}{m} \sum_{i=1}^m \max(1 - y_i h(x_i), 0).
- Logistic loss: \frac{1}{m} \sum_{i=1}^m \log(1 + \exp(-2 y_i h(x_i))).
- Boosting loss: \frac{1}{m} \sum_{i=1}^m \exp(-y_i h(x_i)).

AdaBoost (Ct’d)
Overfitting?
- Assume that VCdim(H) = d, and
    H_T = \left\{ \mathrm{sign}\left( \sum_{t=1}^T \alpha_t h_t + b \right) \right\}.
- Then VCdim(H_T) \le 2 (d+1)(T+1) \log_2((T+1) e).
- Hence one needs early stopping.

AdaBoost (Ct’d)
Rademacher analysis. Let H = \{ h : X \to R \} be any hypothesis set, and
    \mathrm{conv}(H) = \left\{ \sum_h \mu_h h : \sum_h \mu_h \le 1, \ \mu_h \ge 0 \right\};
then for any sample S one has (with R_S denoting the empirical Rademacher complexity)
    R_S(H) = R_S(\mathrm{conv}(H)).

AdaBoost (Ct’d)
Proof:
    R_S(\mathrm{conv}(H))
    = E_\sigma \left[ \sup_{h_1,\dots,h_p,\ \mu_1,\dots,\mu_p} \frac{1}{m} \sum_{i=1}^m \sigma_i \sum_{k=1}^p \mu_k h_k(x_i) \right]
    = E_\sigma \left[ \sup_{h_1,\dots,h_p} \sup_{\mu_1,\dots,\mu_p} \sum_{k=1}^p \mu_k \, \frac{1}{m} \sum_{i=1}^m \sigma_i h_k(x_i) \right]
    = E_\sigma \left[ \sup_{h_1,\dots,h_p} \max_k \frac{1}{m} \sum_{i=1}^m \sigma_i h_k(x_i) \right]
    = E_\sigma \left[ \sup_h \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i) \right] = R_S(H).

AdaBoost (Ct’d)
Margin-based analysis:
- L1 margin:
    \rho(x) = \frac{ \sum_{t=1}^T \alpha_t h_t(x) }{ \sum_{t=1}^T |\alpha_t| }.
- On a sample:
    \rho_S = \min_i \frac{ y_i \sum_{t=1}^T \alpha_t h_t(x_i) }{ \sum_{t=1}^T |\alpha_t| }.
- Maximising this margin? As an LP ....

AdaBoost (Ct’d)
Game-theoretic interpretation:
- Loss (negative payoff) matrix M and mixed strategies p and q:
    \min_p \max_q p^T M q = \max_q \min_p p^T M q,
  or equivalently
    \min_p \max_j p^T M e_j = \max_q \min_i e_i^T M q.
- Apply this to boosting with M_{i,t} = y_i h_t(x_i); then
    2 \gamma_* = \min_D \max_{t=1,\dots,T} \sum_{i=1}^m D(i) \, y_i h_t(x_i)
    = \max_\alpha \min_{i=1,\dots,m} \frac{ y_i \sum_{t=1}^T \alpha_t h_t(x_i) }{ \|\alpha\|_1 } = \rho_*.

Conclusions
Boosting:
- AdaBoost: weak → strong learner.
- Analysis.
- Noise - LogitBoost.
- Detecting outliers.
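As a small companion to the margin-based analysis and the loss comparison above, the sketch below computes the empirical L1 margin rho_S and the four surrogate losses, given a matrix H of weak-learner outputs with H[i, t] = h_t(x_i). The names l1_margin and surrogate_losses and the random toy data are illustrative assumptions, not material from the lecture.

```python
import numpy as np

def l1_margin(alpha, H, y):
    """Empirical L1 margin rho_S = min_i y_i sum_t alpha_t h_t(x_i) / ||alpha||_1,
    where H[i, t] = h_t(x_i) in {-1, +1}."""
    scores = H @ alpha                          # sum_t alpha_t h_t(x_i) for each i
    return np.min(y * scores) / np.sum(np.abs(alpha))

def surrogate_losses(y, f):
    """The four losses compared on the slide, evaluated on the margins y_i f(x_i)."""
    m = y * f
    return {
        "zero-one": np.mean(m < 0),
        "hinge":    np.mean(np.maximum(1 - m, 0)),
        "logistic": np.mean(np.log(1 + np.exp(-2 * m))),
        "boosting": np.mean(np.exp(-m)),
    }

# Toy usage with random +/-1 weak-learner outputs (illustrative only).
rng = np.random.default_rng(0)
m, T = 100, 10
H = rng.choice([-1.0, 1.0], size=(m, T))
y = rng.choice([-1.0, 1.0], size=m)
alpha = rng.random(T)
print("rho_S =", l1_margin(alpha, H, y))
print(surrogate_losses(y, H @ alpha))
```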