Naïve Bayes Classifier
Ke Chen
COMP24111 Machine Learning
Transcription
Outline
• Background
• Probability Basics
• Probabilistic Classification
• Naïve Bayes
  – Principle and Algorithms
  – Example: Play Tennis
• Relevant Issues
• Summary

Background
• There are three ways to establish a classifier:
  a) Model a classification rule directly
     Examples: k-NN, decision trees, perceptron, SVM
  b) Model the probability of class memberships given input data
     Example: perceptron with the cross-entropy cost
  c) Build a probabilistic model of the data within each class
     Examples: naïve Bayes, model-based classifiers
• a) and b) are examples of discriminative classification.
• c) is an example of generative classification.
• b) and c) are both examples of probabilistic classification.

Probability Basics
• Prior, conditional and joint probability for random variables
  – Prior probability: P(X)
  – Conditional probability: P(X1|X2), P(X2|X1)
  – Joint probability: X = (X1, X2), P(X) = P(X1, X2)
  – Relationship: P(X1, X2) = P(X2|X1) P(X1) = P(X1|X2) P(X2)
  – Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1) P(X2)
• Bayes rule
    P(C|X) = P(X|C) P(C) / P(X)
  i.e. Posterior = (Likelihood × Prior) / Evidence

Probability Basics (Quiz)
• We have two six-sided dice. When they are rolled, the following events may occur:
  (A) die 1 lands on side "3", (B) die 2 lands on side "1", and (C) the two dice sum to eight.
  Answer the following questions:
  1) P(A) = ?      2) P(B) = ?      3) P(C) = ?      4) P(A|B) = ?
  5) P(C|A) = ?    6) P(A, B) = ?   7) P(A, C) = ?   8) Is P(A, C) equal to P(A) P(C)?

Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model: P(C|X), with C = c1, …, cL and X = (X1, …, Xn).
    (Slide diagram: a single discriminative probabilistic classifier takes x = (x1, x2, …, xn) as input
    and directly outputs P(c1|x), P(c2|x), …, P(cL|x).)
  – Generative model: P(X|C), with C = c1, …, cL and X = (X1, …, Xn).
    (Slide diagram: one generative probabilistic model is built per class, for Class 1 through Class L;
    for an input x = (x1, x2, …, xn) the models yield P(x|c1), P(x|c2), …, P(x|cL).)

Probabilistic Classification (cont.)
• MAP classification rule
  – MAP: Maximum A Posteriori
  – Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1, …, cL
• Generative classification with the MAP rule
  – Apply Bayes rule to convert the class-conditional likelihoods into posterior probabilities:
      P(C = ci|X = x) = P(X = x|C = ci) P(C = ci) / P(X = x)
                      ∝ P(X = x|C = ci) P(C = ci)   for i = 1, 2, …, L
  – Then apply the MAP rule.
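To make generative classification with the MAP rule concrete, here is a minimal Python sketch, not part of the original slides. The names map_classify, priors and likelihoods are hypothetical placeholders, and the numbers in the usage lines are made up; the point is simply that each class is scored by likelihood × prior and the arg max is returned, which equals the MAP decision because the evidence P(X = x) is the same for every class.

```python
# Minimal sketch of generative classification with the MAP rule (illustrative only).
# `priors` maps each class label to P(C = c); `likelihoods` maps each class label
# to a function that returns P(X = x | C = c) for an input x.

def map_classify(x, priors, likelihoods):
    """Return the class c* maximising P(X = x | C = c) * P(C = c)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = likelihoods[c](x) * prior   # proportional to the posterior P(C = c | X = x)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with made-up numbers: two classes whose likelihoods ignore x entirely.
priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {"yes": lambda x: 0.008, "no": lambda x: 0.058}
print(map_classify(None, priors, likelihoods))   # -> "no"
```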
Naïve Bayes
• Bayes classification
    P(C|X) ∝ P(X|C) P(C) = P(X1, …, Xn|C) P(C)
  Difficulty: learning the joint probability P(X1, …, Xn|C).
• Naïve Bayes classification
  – Assumption: all input features are conditionally independent given the class!
      P(X1, X2, …, Xn|C) = P(X1|X2, …, Xn, C) P(X2, …, Xn|C)
                         = P(X1|C) P(X2, …, Xn|C)
                         = P(X1|C) P(X2|C) ··· P(Xn|C)
  – MAP classification rule: for x = (x1, x2, …, xn), assign x to c* if
      [P(x1|c*) ··· P(xn|c*)] P(c*) > [P(x1|c) ··· P(xn|c)] P(c),   c ≠ c*, c = c1, …, cL

Naïve Bayes: Algorithm (Discrete-Valued Features)
• Learning Phase: given a training set S,
    for each target value ci (ci = c1, …, cL)
      P̂(C = ci) ← estimate P(C = ci) from the examples in S;
      for every feature value xjk of each feature Xj (j = 1, …, n; k = 1, …, Nj)
        P̂(Xj = xjk|C = ci) ← estimate P(Xj = xjk|C = ci) from the examples in S;
  Output: conditional probability tables; for each Xj, a table with Nj × L entries.
• Test Phase: given an unknown instance X′ = (a1, …, an),
  look up the tables and assign the label c* to X′ if
    [P̂(a1|c*) ··· P̂(an|c*)] P̂(c*) > [P̂(a1|c) ··· P̂(an|c)] P̂(c),   c ≠ c*, c = c1, …, cL

Example: Play Tennis
• The training set (shown as a table on the slide) consists of 14 example days, 9 with Play=Yes and 5 with Play=No, each described by Outlook, Temperature, Humidity and Wind.

Example: Learning Phase
  Outlook     Play=Yes  Play=No      Temperature  Play=Yes  Play=No
  Sunny       2/9       3/5          Hot          2/9       2/5
  Overcast    4/9       0/5          Mild         4/9       2/5
  Rain        3/9       2/5          Cool         3/9       1/5

  Humidity    Play=Yes  Play=No      Wind         Play=Yes  Play=No
  High        3/9       4/5          Strong       3/9       3/5
  Normal      6/9       1/5          Weak         6/9       2/5

  P(Play=Yes) = 9/14       P(Play=No) = 5/14

Example: Test Phase
• Given a new instance, predict its label:
    x′ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
• Look up the tables obtained in the learning phase:
    P(Outlook=Sunny|Play=Yes) = 2/9          P(Outlook=Sunny|Play=No) = 3/5
    P(Temperature=Cool|Play=Yes) = 3/9       P(Temperature=Cool|Play=No) = 1/5
    P(Humidity=High|Play=Yes) = 3/9          P(Humidity=High|Play=No) = 4/5
    P(Wind=Strong|Play=Yes) = 3/9            P(Wind=Strong|Play=No) = 3/5
    P(Play=Yes) = 9/14                       P(Play=No) = 5/14
• Decision making with the MAP rule:
    P(Yes|x′) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
    P(No|x′)  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
  Since P(Yes|x′) < P(No|x′), we label x′ as "No".
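The learning and test phases above can be condensed into a few lines of code. The following is a minimal Python sketch, not part of the original slides: rather than estimating the tables from the raw 14-day training set, it copies the conditional probabilities already derived on the learning-phase slide, and it reproduces the unnormalised posteriors 0.0053 and 0.0206 and the resulting "No" decision.

```python
from fractions import Fraction as F

# Conditional probability tables copied from the learning-phase slide
# (Fractions keep the per-class relative frequencies exact).
prior = {"Yes": F(9, 14), "No": F(5, 14)}
cpt = {
    "Yes": {("Outlook", "Sunny"): F(2, 9), ("Temperature", "Cool"): F(3, 9),
            ("Humidity", "High"): F(3, 9), ("Wind", "Strong"): F(3, 9)},
    "No":  {("Outlook", "Sunny"): F(3, 5), ("Temperature", "Cool"): F(1, 5),
            ("Humidity", "High"): F(4, 5), ("Wind", "Strong"): F(3, 5)},
}

def unnormalised_posterior(x, c):
    """[P(x1|c) * ... * P(xn|c)] * P(c) for one class c."""
    score = prior[c]
    for feature_value in x.items():
        score *= cpt[c][feature_value]
    return score

x_new = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}
scores = {c: unnormalised_posterior(x_new, c) for c in prior}
print({c: round(float(s), 4) for c, s in scores.items()})   # {'Yes': 0.0053, 'No': 0.0206}
print(max(scores, key=scores.get))                          # -> 'No', matching the MAP decision
```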
Naïve Bayes: Algorithm (Continuous-Valued Features)
• A continuous feature takes on innumerable values, so conditional probability tables cannot be used.
• The conditional probability is instead often modelled with a normal (Gaussian) distribution:
    P̂(Xj|C = ci) = 1/(√(2π) σji) · exp(−(Xj − μji)² / (2σji²))
  where μji is the mean (average) and σji is the standard deviation of the values of feature Xj over the examples for which C = ci.
• Learning Phase: for X = (X1, …, Xn) and C = c1, …, cL,
  output n × L normal distributions and P(C = ci), i = 1, …, L.
• Test Phase: given an unknown instance X′ = (a1, …, an),
  – instead of looking up tables, calculate the conditional probabilities from the normal distributions obtained in the learning phase;
  – apply the MAP rule to make a decision.

Example: Continuous-Valued Features
• Temperature is naturally a continuous-valued feature.
    Play=Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
    Play=No:  27.3, 30.1, 17.4, 29.5, 15.1
• Estimate the mean and variance for each class:
    μ = (1/N) Σn xn,   σ² = (1/N) Σn (xn − μ)²
    μ_Yes = 21.64, σ_Yes = 2.35;   μ_No = 23.88, σ_No = 7.09
• Learning Phase: output two Gaussian models for P(Temperature|C):
    P̂(x|Yes) = 1/(2.35 √(2π)) · exp(−(x − 21.64)² / (2 · 2.35²))
    P̂(x|No)  = 1/(7.09 √(2π)) · exp(−(x − 23.88)² / (2 · 7.09²))
  (A code sketch reproducing these estimates is given after the Summary.)

Relevant Issues
• Violation of the independence assumption
  – For many real-world tasks, P(X1, …, Xn|C) ≠ P(X1|C) ··· P(Xn|C).
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
  – If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0.
  – In this circumstance, P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0 at test time, regardless of the other feature values.
  – As a remedy, conditional probabilities are estimated with
      P̂(Xj = ajk|C = ci) = (nc + m·p) / (n + m)
    where
      nc: number of training examples for which Xj = ajk and C = ci
      n:  number of training examples for which C = ci
      p:  prior estimate (usually p = 1/t for t possible values of Xj)
      m:  weight given to the prior (number of "virtual" examples, m ≥ 1)
    (a code sketch of this smoothed estimate is also given after the Summary).

Summary
• Naïve Bayes rests on the conditional independence assumption.
  – Training is very easy and fast: it only requires considering each attribute in each class separately.
  – Testing is straightforward: it only requires looking up tables or calculating conditional probabilities with the estimated distributions.
• A popular generative model
  – Performance is competitive with most state-of-the-art classifiers, even in the presence of violations of the independence assumption.
  – Many successful applications, e.g. spam mail filtering.
  – A good candidate as a base learner in ensemble learning.
  – Apart from classification, naïve Bayes can do more…
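As referenced in the continuous-valued example above, here is a minimal Python sketch, not part of the original slides, that re-estimates the two Gaussians for P(Temperature|C) from the listed readings and evaluates them on a new reading. The slide's figures 2.35 and 7.09 match the sample standard deviation with an N − 1 denominator, so statistics.stdev is used; the test value of 20.0 degrees is a hypothetical reading chosen only for illustration.

```python
import math
import statistics

# Temperature readings copied from the example slide.
temps = {
    "Yes": [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8],
    "No":  [27.3, 30.1, 17.4, 29.5, 15.1],
}

# Learning phase: one Gaussian per class. statistics.stdev uses the N-1
# denominator, which reproduces the slide's 2.35 and 7.09.
params = {c: (statistics.mean(xs), statistics.stdev(xs)) for c, xs in temps.items()}
print({c: (round(m, 2), round(s, 2)) for c, (m, s) in params.items()})
# -> {'Yes': (21.64, 2.35), 'No': (23.88, 7.09)}

def gaussian_likelihood(x, mean, std):
    """P_hat(x | c) under the class-conditional normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Test phase for a single hypothetical temperature reading of 20.0 degrees,
# combined with the class priors P(Play=Yes) = 9/14 and P(Play=No) = 5/14.
priors = {"Yes": 9 / 14, "No": 5 / 14}
scores = {c: gaussian_likelihood(20.0, *params[c]) * priors[c] for c in priors}
print(max(scores, key=scores.get))   # MAP decision based on Temperature alone
```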
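The m-estimate remedy for zero conditional probabilities, referenced in the Relevant Issues section, can be sketched as follows. This is a minimal illustration rather than part of the slides; the function name m_estimate is mine, and the usage lines plug in the Outlook counts from the Play Tennis learning-phase table with the common default weight m = 1.

```python
from fractions import Fraction as F

def m_estimate(n_c, n, t, m=1):
    """Smoothed estimate P_hat(X_j = a_jk | C = c_i) = (n_c + m*p) / (n + m),
    with p = 1/t for a feature X_j that has t possible values.

    n_c: number of training examples with X_j = a_jk and C = c_i
    n:   number of training examples with C = c_i
    t:   number of possible values of feature X_j
    m:   weight given to the prior (number of "virtual" examples, m >= 1)
    """
    p = F(1, t)
    return (n_c + m * p) / (n + m)

# Outlook=Overcast never occurs with Play=No (0 of 5 examples), so the raw
# estimate would be 0; with t = 3 outlook values and m = 1 it becomes 1/18.
print(m_estimate(0, 5, 3))   # -> 1/18 instead of 0
print(m_estimate(3, 5, 3))   # Outlook=Sunny given Play=No: (3 + 1/3) / 6 = 5/9
```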