COMP 598 – Applied Machine Learning
Lecture 11: Ensemble Learning (cont’d)
Instructor: Joelle Pineau ([email protected])
TAs: Pierre-Luc Bacon ([email protected])
Angus Leigh ([email protected])
Class web page: www.cs.mcgill.ca/~jpineau/comp598
Today’s quiz (5 min.)
•  Consider the training data below (marked “+”, “-” for the output), and the test data (marked “A”, “B”, “C”).
–  What class would the 1-NN algorithm predict for each test point?
–  What class would the 3-NN algorithm predict for each test point?
–  Which training points could be removed without causing change in the 3-NN decision rule?
•  Properties of NN algorithm?
[Figure: training points (+/-) and test points A, B, C plotted in the (X1, X2) plane.]
Today’s quiz (5 min.)
•  Consider the training data below (marked “+”, “-” for the output), and the test data (marked “A”, “B”, “C”).
–  What class would the 1-NN algorithm predict for each test point? A=+, B=-, C=-
–  What class would the 3-NN algorithm predict for each test point? A=+, B=-, C=+
–  Which training points could be removed without causing change in the 3-NN decision rule? Removed in the figure on this slide.
•  Properties of NN algorithm?
[Figure: the same training and test points in the (X1, X2) plane, with the removable training points taken out.]
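As a concrete check of the quiz reasoning, here is a minimal k-NN sketch in Python; the coordinates and labels are made-up placeholders, not the points from the slide's figure.

```python
# Minimal k-NN sketch for checking quiz-style answers by hand.
# NOTE: the coordinates and labels below are illustrative placeholders,
# not the actual points from the slide's figure.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k):
    dists = np.linalg.norm(X_train - x_test, axis=1)   # Euclidean distances
    nearest = y_train[np.argsort(dists)[:k]]           # labels of the k closest points
    return Counter(nearest).most_common(1)[0][0]       # majority vote

X_train = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.0], [6.0, 5.0], [7.0, 6.0]])
y_train = np.array(["+", "+", "-", "-", "+"])
test_point = np.array([2.5, 2.0])

for k in (1, 3):
    print(f"{k}-NN prediction:", knn_predict(X_train, y_train, test_point, k))
```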
Random forests (Breiman)
•  Basic algorithm:
–  Use K bootstrap replicates to train K different trees.
–  At each node, pick m variables at random (use m<M, the total number of features).
–  Determine the best test (using normalized information gain).
–  Recurse until the tree reaches maximum depth (no pruning).
•  Comments:
–  Each tree has high variance, but the ensemble uses averaging, which reduces variance.
–  Random forests are very competitive in both classification and regression, but still subject to overfitting.
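A usage sketch of this recipe, assuming scikit-learn is available: n_estimators plays the role of K, max_features the role of m, and trees are grown to full depth with no pruning. Note that scikit-learn splits on Gini impurity or entropy rather than the normalized information gain mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=100,      # K bootstrapped trees
                                max_features="sqrt",   # m < M features per split
                                random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```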
Extremely randomized trees (Geurts et al., 2005)
•  Basic algorithm:
–  Construct K decision trees.
–  Pick m attributes at random (without replacement) and pick a random test involving each attribute.
–  Evaluate all tests (using a normalized information gain metric) and pick the best one for the node.
–  Continue until a desired depth or a desired number of instances (nmin) at the leaf is reached.
•  Comments:
–  Very reliable method for both classification and regression.
–  The smaller m is, the more randomized the trees are; small m is best, especially with large levels of noise. Small nmin means less bias and more variance, but variance is controlled by averaging over trees.
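A usage sketch, assuming scikit-learn's ExtraTreesClassifier as the implementation of this recipe: max_features plays the role of m and min_samples_leaf the role of nmin.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
extra = ExtraTreesClassifier(n_estimators=100,        # K trees
                             max_features="sqrt",     # m attributes per node
                             min_samples_leaf=5,      # nmin instances per leaf
                             random_state=0)
print(cross_val_score(extra, X, y, cv=5).mean())
```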
Randomization in general
•  Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results.
•  Examples:
–  Random feature selection.
–  Random projections.
•  Advantages?
–  Very fast, easy, can handle lots of data.
–  Can circumvent difficulties in optimization.
–  Averaging reduces the variance introduced by randomization.
•  Disadvantages?
–  New prediction may be more expensive to evaluate (go over all trees).
–  Still typically subject to overfitting.
–  Low interpretability compared to standard decision trees.
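A small sketch of one of the examples above, random projections; the dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))              # 500 points in 100 dimensions

k = 10                                       # target dimension
R = rng.normal(size=(100, k)) / np.sqrt(k)   # Gaussian random projection matrix
X_proj = X @ R                               # cheap, data-independent reduction
print(X_proj.shape)                          # (500, 10)
```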
Additive models
•  In an ensemble, the output on any instance is computed by averaging the outputs of several hypotheses.
•  Idea: Don’t construct the hypotheses independently. Instead, new hypotheses should focus on instances that are problematic for existing hypotheses.
–  If an example is difficult, more components should focus on it.
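In symbols (a generic sketch of the additive form, not notation taken from the slides), the ensemble prediction is a weighted sum of M component hypotheses:

```latex
F(x) \;=\; \sum_{m=1}^{M} \beta_m \, h_m(x)
```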
Boosting
•  Recall algorithm:
–  Use the training set to train a simple predictor.
–  Re-weight the training examples, putting more weight on examples that were not properly classified by the previous predictor.
–  Repeat n times.
–  Combine the simple hypotheses into a single, accurate predictor.
[Diagram: the Original Data is turned into weighted sets D1, D2, ..., Dn; each Di is fed to a Weak Learner producing hypothesis Hi; the Final hypothesis is F(H1, H2, ..., Hn).]
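A minimal AdaBoost-style sketch of this loop, using scikit-learn decision stumps as the "simple predictor"; the weight-update formula and names here are a standard illustration, not taken from the slide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y - 1                                # relabel classes as -1 / +1

n_rounds = 20
w = np.full(len(y), 1.0 / len(y))            # uniform example weights to start
learners, alphas = [], []

for t in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)         # train a simple predictor
    pred = stump.predict(X)
    eps = np.sum(w[pred != y])               # weighted training error
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))   # weight of this hypothesis
    w *= np.exp(-alpha * y * pred)           # up-weight misclassified examples
    w /= w.sum()
    learners.append(stump)
    alphas.append(alpha)

# Final hypothesis: sign of the weighted vote F(H1, ..., Hn).
F = np.sign(sum(a * h.predict(X) for a, h in zip(alphas, learners)))
print("training accuracy:", np.mean(F == y))
```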
Notation
•  Assume that examples are drawn independently from some probability distribution P on the set of possible data D.
•  Let JP(h) be the expected error of hypothesis h when data is drawn from P:
JP(h) = Σ_{<x,y>} J(h(x), y) P(<x,y>)
where J(h(x), y) could be the squared error, or the 0/1 loss.
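In practice P is unknown, so JP(h) is usually estimated by averaging the loss over a sample drawn from P; a small sketch with 0/1 loss and a made-up hypothesis.

```python
import numpy as np

def empirical_error(h, X, y):
    return np.mean(h(X) != y)                # average 0/1 loss over <x, y> pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)                # "true" labels for this toy distribution
h = lambda X: (X[:, 0] + X[:, 1] > 0).astype(int)   # some fixed hypothesis
print(empirical_error(h, X, y))
```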
Weak learners
•  Assume we have some “weak” binary classifiers.
E.g. a decision stump is a single-node decision tree: xi > t.
•  “Weak” means JP(h) < 1/2 - γ (assuming 2 classes), where γ > 0.
I.e. the true error of the classifier is only slightly better than random.
•  Questions:
–  How do we re-weight the examples?
–  How do we combine many simple predictors into a single classifier?
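For concreteness, a brute-force decision stump fitter (a sketch with illustrative names); taking a sample_weight argument is what lets boosting re-weight the examples it trains on.

```python
import numpy as np

def fit_stump(X, y, sample_weight):
    best = (np.inf, 0, 0.0, 1)               # (weighted error, feature, threshold, sign)
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, i] > t, 1, -1)
                err = np.sum(sample_weight[pred != y])
                if err < best[0]:
                    best = (err, i, t, sign)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 1] > 0.2, 1, -1)           # labels depend on one feature only
w = np.full(len(y), 1.0 / len(y))
print(fit_stump(X, y, w))                    # recovers feature 1 and a threshold near 0.2
```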
Toy example
[Figure not recoverable from the transcription.]
Toy example: First step
[Figure not recoverable from the transcription.]
Toy example: Second step
[Figure not recoverable from the transcription.]
Toy example: Third step
[Figure not recoverable from the transcription.]
Toy example: Final hypothesis
[Figure not recoverable from the transcription.]
AdaBoost
(Freund & Schapire, 1995)
[Figure: the AdaBoost algorithm pseudocode (not recoverable from the transcription); one annotation reads “weight of weak learner t”.]
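A usage sketch, assuming scikit-learn's AdaBoostClassifier as the implementation; by default it boosts depth-1 decision stumps.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(booster, X, y, cv=5).mean())
```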
Properties of AdaBoost
•  Compared to other boosting algorithms, the main insight is to automatically adapt the error rate at each iteration.
•  Training error on the final hypothesis is at most the bound shown below (recall: γt is how much better than random ht is).
•  AdaBoost gradually reduces the training error exponentially fast.
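The bound referenced above is the standard AdaBoost training-error bound (Freund & Schapire); writing εt = 1/2 - γt for the weighted error of ht, it reads:

```latex
\mathrm{err}_{\mathrm{train}}(h_f) \;\le\; \prod_t 2\sqrt{\varepsilon_t(1-\varepsilon_t)}
\;=\; \prod_t \sqrt{1 - 4\gamma_t^2}
\;\le\; \exp\!\left(-2\sum_t \gamma_t^2\right)
```

So as long as every weak learner beats random guessing by some γt ≥ γ > 0, the training error drops exponentially fast in the number of rounds.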
Real data set: Text Categorization
•  Decision stumps: presence of a word or short phrase. Example: “If the word Clinton appears in the document, predict the document is about politics.”
[Figure: boosting error curves on two text-categorization benchmarks (database: AP; database: Reuters).]
Boosting: Experimental Results
[Freund & Schapire, 1996]
Comparison of C4.5, boosting C4.5, and boosting decision stumps (depth-1 trees) on 27 benchmark datasets.
[Figure: error scatter plots not recoverable from the transcription.]
Bagging vs. Boosting
Comparison of C4.5 versus various other boosting and bagging methods.
[Figure: error scatter plots (0-30%) for boosting FindAttrTest, boosting FindDecRule, boosting C4.5, and bagging C4.5, each against C4.5.]
Bagging vs Boosting
•  Bagging is typically faster, but may get a smaller error reduction (not by much).
•  Bagging works well with “reasonable” classifiers.
•  Boosting works with very simple classifiers. E.g., Boostexter - text classification using decision stumps based on single words.
•  Boosting may have a problem if a lot of the data is mislabeled, because it will focus on those examples a lot, leading to overfitting.
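A quick empirical sketch of this comparison, assuming scikit-learn: bagging full decision trees versus boosting decision stumps, both with library defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
models = {"bagging full trees": BaggingClassifier(n_estimators=50, random_state=0),
          "boosting decision stumps": AdaBoostClassifier(n_estimators=50, random_state=0)}
for name, clf in models.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```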
Why does boosting work?
•  Weak learners have high bias. By combining them, we get more expressive classifiers. Hence, boosting is a bias-reduction technique.
•  AdaBoost looks for a good approximation to the log-odds ratio, within the space of functions that can be captured by a linear combination of the base classifiers.
•  What happens as we run boosting longer? Intuitively, we get more and more complex hypotheses. How would you expect bias and variance to evolve over time?
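The log-odds claim can be made precise with a standard result (stated here as background, not taken from the slides): the population minimizer of the exponential loss E[e^{-yF(x)}] that AdaBoost greedily optimizes is half the log-odds ratio:

```latex
F^*(x) \;=\; \frac{1}{2}\,\log\frac{P(y=+1 \mid x)}{P(y=-1 \mid x)}
```

So the weighted vote built by boosting can be read as an increasingly complex approximation to this quantity.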
A naïve (but reasonable) analysis of generalization error
•  Expect the training error to continue to drop (until it reaches 0).
•  Expect the test error to increase as we get more voters, and hf becomes too complex.
[Figure: sketch of the expected training and test error curves as more voters are added.]
Actual typical run of AdaBoost
•  Test error does not increase even after 1000 rounds (more than 2 million decision nodes!).
•  Test error continues to drop even after training error reaches 0!
•  These are consistent results through many sets of experiments!
•  Conjecture: Boosting does not overfit!
[Figure: training and test error (0-20%) of AdaBoost versus the number of rounds, from 10 to 1000 on a log scale.]
What you should know
•  Ensemble methods combine several hypotheses into one prediction.
•  They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both).
•  Extremely randomized trees are a variance-reduction technique.
•  Bagging is mainly a variance-reduction technique, useful for complex hypotheses.
•  Main idea is to sample the data repeatedly, train several classifiers and average their predictions.
•  Boosting focuses on harder examples, and gives a weighted vote to the hypotheses.
•  Boosting works by reducing bias and increasing classification margin.