MACHINE LEARNING
What is learning?
• A computer program learns if it improves its performance at some task through experience (T. Mitchell, 1997).
• Any change in a system that allows it to perform better (Simon, 1983).

2 What do we learn
• Descriptions
• Rules for recognizing or classifying objects, states, and events
• Rules for transforming an initial situation to achieve a goal (final state)

3 How do we learn
• Rote learning: storage of computed information.
• Taking advice from others. (Advice may need to be operationalized.)
• Learning from problem-solving experience: remembering experiences and generalizing from them. (May add efficiency but not new knowledge.)
• Learning from examples. (May or may not involve a teacher.)
• Learning by experimentation and discovery. (Decreasing burden on teacher, increasing burden on learner.)

4 Approaches to Machine Learning
• Symbol-based
• Connectionist learning
• Evolutionary learning

5 Inductive Symbol-Based Machine Learning
• Concept learning
• Version space search
• Decision trees: ID3 algorithm
• Explanation-based learning
• Supervised learning
• Reinforcement learning

6 Version space search for concept learning
• Concepts describe classes of objects.
• Concepts consist of feature sets.
• Operations on concept descriptions:
  Generalization: replace a feature with a variable.
  Specialization: instantiate a variable with a feature.

7 Positive and negative examples of a concept
• The concept description has to match all positive examples.
• The concept description has to be false for all negative examples.

8 Plausible descriptions
• The version space represents all the alternative plausible descriptions of the concept.
• A plausible description is one that applies to all known positive examples and to no known negative example.
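The generalization and specialization operations and the plausibility test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple representation, the use of "?" as the variable marker, the function names, and the toy data are all hypothetical and do not come from the slides.

```python
# Assumed representation (not from the slides): a concept description is a
# tuple of feature values, and "?" stands for a variable (any value).
ANY = "?"

def covers(description, example):
    """True if each feature is a variable or equals the example's value."""
    return all(d == ANY or d == e for d, e in zip(description, example))

def generalize(description, example):
    """Replace every feature that disagrees with the example by a variable."""
    return tuple(d if d == e else ANY for d, e in zip(description, example))

def specialize(description, position, value):
    """Instantiate the variable at `position` with a concrete feature."""
    if description[position] != ANY:
        raise ValueError("only a variable can be instantiated")
    return description[:position] + (value,) + description[position + 1:]

def is_plausible(description, positives, negatives):
    """Applies to all known positive examples and to no known negative one."""
    return (all(covers(description, p) for p in positives)
            and not any(covers(description, n) for n in negatives))

# Made-up toy data with features (size, color, shape).
positives = [("small", "red", "ball"), ("large", "red", "ball")]
negatives = [("small", "blue", "cube")]

# Start from the first positive example and generalize over the rest.
s = positives[0]
for p in positives[1:]:
    s = generalize(s, p)

print(s)                                                    # ('?', 'red', 'ball')
print(is_plausible(s, positives, negatives))                # True
print(is_plausible((ANY, ANY, ANY), positives, negatives))  # False
```

The all-variables description fails the test because it also matches the negative example, which is why a learner has to specialize it.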
9 Basic Idea
Given:
• A representation language
• A set of positive and negative examples expressed in that language
Compute:
• A concept description that is consistent with all the positive examples and none of the negative examples

10 Hypotheses
The version space contains two sets of hypotheses:
• G: the most general hypotheses that match the training data
• S: the most specific hypotheses that match the training data
Each hypothesis is represented as a vector of values of the known attributes.

11 Example of a version space
Consider the task of obtaining a description of the concept "Japanese economy car".
The attributes under consideration are: Origin, Manufacturer, Color, Decade, Type.
Training data:
• Positive example: (Japan, Honda, Blue, 1980, Economy)
• Positive example: (Japan, Honda, White, 1980, Economy)
• Negative example: (Japan, Toyota, Green, 1970, Sports)

12 Example continued
A general hypothesis that matches the positive data and does not match the negative data is:
(?, Honda, ?, ?, Economy)
The symbol '?' means that the attribute may take any value. (This hypothesis is not maximally general: (?, Honda, ?, ?, ?) alone already excludes the negative example.)
The most specific hypothesis that matches the positive examples is:
(Japan, Honda, ?, 1980, Economy)

13 Algorithm: Candidate elimination
• Initialize G to contain one element: the most general description (all features are variables).
• Initialize S to empty. (The first positive example then becomes its first element, as in slide 17.)
• Accept a new training example.

14 Process positive examples
• Remove from G any descriptions that do not cover the example.
• Generalize S as little as possible so that the new training example is covered.
• Remove from S all elements that cover negative examples.

15 Process negative examples
• Remove from S any descriptions that cover the negative example.
• Specialize G as little as possible so that the negative example is not covered.
• Remove from G all elements that do not cover the positive examples.

16 Algorithm continued
Continue processing new training examples until one of the following occurs:
• Either S or G becomes empty: there are no consistent hypotheses over the training space.
  Stop.
• S and G are both singleton sets. If they are identical, output their value and stop. If they are different, the training cases were inconsistent; output this result and stop.
• No more training examples remain and G has several hypotheses. The version space is then a disjunction of hypotheses. If the hypotheses agree on a new example, we can classify it; if they disagree, we can take the majority vote.

17 Learning the concept of "Japanese economy car"
Features: Origin, Manufacturer, Color, Decade, Type
POSITIVE EXAMPLE: (Japan, Honda, Blue, 1980, Economy)
Initialize G to the singleton set that includes everything.
Initialize S to the singleton set that includes the first positive example.
G = {(?, ?, ?, ?, ?)}
S = {(Japan, Honda, Blue, 1980, Economy)}

18 Example continued
NEGATIVE EXAMPLE: (Japan, Toyota, Green, 1970, Sports)
Specialize G to exclude the negative example.
G = {(?, Honda, ?, ?, ?), (?, ?, Blue, ?, ?), (?, ?, ?, 1980, ?), (?, ?, ?, ?, Economy)}
S = {(Japan, Honda, Blue, 1980, Economy)}

19 Example continued
POSITIVE EXAMPLE: (Japan, Toyota, Blue, 1990, Economy)
Remove from G the descriptions inconsistent with the positive example.
Generalize S to include the positive example.
G = {(?, ?, Blue, ?, ?), (?, ?, ?, ?, Economy)}
S = {(Japan, ?, Blue, ?, Economy)}

20 Example continued
NEGATIVE EXAMPLE: (USA, Chrysler, Red, 1980, Economy)
Specialize G to exclude the negative example (but staying within the version space, i.e., staying consistent with S).
G = {(?, ?, Blue, ?, ?), (Japan, ?, ?, ?, Economy)}
S = {(Japan, ?, Blue, ?, Economy)}

21 Example continued
POSITIVE EXAMPLE: (Japan, Honda, White, 1980, Economy)
Remove from G the descriptions inconsistent with the positive example.
Generalize S to include the positive example.
G = {(Japan, ?, ?, ?, Economy)}
S = {(Japan, ?, ?, ?, Economy)}
S = G, both singleton => done!

22 Decision trees
A decision tree is a structure that represents a procedure for classifying objects based on their attributes.
Each object is represented as a set of attribute/value pairs and a classification.

23 Example
A set of medical symptoms might be represented as follows:

Name   Cough  Fever  Weight  Pain     Classification
Mary   no     yes    normal  throat   flu
Fred   no     yes    normal  abdomen  appendicitis
Julie  yes    yes    skinny  none     flu
Elvis  yes    no     obese   chest    heart disease

The system is given a set of training instances along with their correct classifications and develops a decision tree based on these examples.

24 Attributes
• If a crucial attribute is not represented, then no decision tree will be able to learn the concept.
• If two training instances have the same representation but belong to different classes, then the attribute set is said to be inadequate: it is impossible for the decision tree to distinguish the instances.

25 ID3 Algorithm (Quinlan, 1986)
ID3(R, C, S)   // R: list of attributes, C: categorical (class) attribute, S: examples
  If all examples from S belong to the same class Cj, return a leaf labeled Cj.
  If R is empty, return a leaf labeled with the most frequent value of C in S.
  Else:
    select the "best" decision attribute A in R, with values v1, v2, …, vn, for the next node;
    divide the training set S into S1, …, Sn according to the values v1, …, vn;
    call ID3(R - {A}, C, S1), ID3(R - {A}, C, S2), …, ID3(R - {A}, C, Sn), i.e. recursively build subtrees T1, …, Tn for S1, …, Sn;
    return a node labeled A whose children are the subtrees T1, T2, …, Tn.

26 Entropy
S: a sample of training examples.
Entropy(S) = expected number of bits needed to encode the classification of an arbitrary member of S.
Information theory: an optimal-length code assigns $-\log_2 p$ bits to a message having probability $p$.
In general, for $c$ different classes:
$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$

27 Entropy of the Training Set
T: a set of records partitioned into C1, C2, …, Ck on the basis of the categorical attribute C.
Probability of each class: $P_i = |C_i| / |T|$
$\mathrm{Info}(T) = -P_1 \log_2 P_1 - \dots - P_k \log_2 P_k$
Info(T) is the information needed to classify an element.
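As a quick check of the entropy formula, here is a minimal Python sketch. The function name `entropy` and the choice to pass the class labels as a plain list are assumptions made for illustration. The labels are the four classifications from the medical example in slide 23.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Sum over classes of -p_i * log2(p_i), measured in bits."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(labels).values())

# Classifications of Mary, Fred, Julie and Elvis from the medical example.
labels = ["flu", "appendicitis", "flu", "heart disease"]

# p(flu) = 1/2, p(appendicitis) = p(heart disease) = 1/4
# so the entropy is 1/2 * 1 + 1/4 * 2 + 1/4 * 2 = 1.5 bits.
print(entropy(labels))          # 1.5
print(entropy(["flu", "flu"]))  # 0.0 (a pure set needs no information)
```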
28 How helpful is an attribute?
X: a non-categorical attribute; {T1, …, Tn} is the split of T according to X.
The entropy of each Tk is:
$\mathrm{Info}(T_k) = -\frac{|T_{k1}|}{|T_k|} \log_2 \frac{|T_{k1}|}{|T_k|} - \dots - \frac{|T_{kc}|}{|T_k|} \log_2 \frac{|T_{kc}|}{|T_k|}$
where c is the number of partitions of Tk produced by the categorical attribute C.
For any k, Info(Tk) reflects how the categorical attribute C splits the set Tk.

29 Information Gain
$\mathrm{Info}(X,T) = \frac{|T_1|}{|T|} \mathrm{Info}(T_1) + \frac{|T_2|}{|T|} \mathrm{Info}(T_2) + \dots + \frac{|T_n|}{|T|} \mathrm{Info}(T_n)$
$\mathrm{Gain}(X,T) = \mathrm{Info}(T) - \mathrm{Info}(X,T) = \mathrm{Entropy}(T) - \sum_{i=1}^{n} \frac{|T_i|}{|T|} \mathrm{Entropy}(T_i)$

30 Information Gain
• Gain(X,T) is the expected reduction in entropy caused by partitioning the examples of T according to the attribute X.
• Gain(X,T) is a measure of the effectiveness of an attribute in classifying the training data.
• The best attribute has maximal Gain(X,T).

31 Example (1)

Name   Hair    Height   Weight   Lotion  Result
Sarah  blonde  average  light    no      sunburned (positive)
Dana   blonde  tall     average  yes     none (negative)
Alex   brown   short    average  yes     none
Annie  blonde  short    average  no      sunburned
Emily  red     average  heavy    no      sunburned
Pete   brown   tall     heavy    no      none
John   brown   average  heavy    no      none
Katie  blonde  short    light    yes     none

32 Example (2)
Attribute "hair":
• Blonde: T1 = {Sarah, Dana, Annie, Katie}
• Brown: T2 = {Alex, Pete, John}
• Red: T3 = {Emily}
T1 is split by C into 2 sets: T11 = {Sarah, Annie}, T12 = {Dana, Katie}
$\mathrm{Info}(T_1) = -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} = -\log_2 \frac{1}{2} = 1$
In a similar way we compute Info(T2) = 0 and Info(T3) = 0.
$\mathrm{Info}(\text{hair},T) = \frac{|T_1|}{|T|} \mathrm{Info}(T_1) + \frac{|T_2|}{|T|} \mathrm{Info}(T_2) + \frac{|T_3|}{|T|} \mathrm{Info}(T_3) = \frac{4}{8} \cdot 1 + \frac{3}{8} \cdot 0 + \frac{1}{8} \cdot 0 = 0.50$
With 3 sunburned and 5 unaffected people, Info(T) ≈ 0.954, so Gain(hair, T) ≈ 0.454. This is the largest gain among the four attributes, so hair is the best attribute.

33 Example (3)
Hair color
  blonde → Lotion
             yes → none
             no  → sunburn
  red    → sunburn
  brown  → none

34 Split Ratio
GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
where SplitInfo(D,T) is the information due to the split of T when D is treated as the categorical attribute.

35 Split Ratio Tree
[Figure: decision tree with root "lotion" (branches: no, yes) and a second test on "Color" (branches: blonde, red, brown); leaf labels as transcribed: none, none, sunburn, none.]

36 More Training Examples
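Returning to the sunburn data of Example (1), the information-gain computation and the ID3 recursion can be sketched in Python as follows. This is a minimal illustration, not Quinlan's implementation. The function names, the nested-dict tree encoding, and the choice to store each example as a (features, label) pair are assumptions. Unlike the ID3(R, C, S) signature in slide 25, the class attribute C is not passed separately.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["hair", "height", "weight", "lotion"]
ROWS = [  # (hair, height, weight, lotion, result), from Example (1)
    ("blonde", "average", "light",   "no",  "sunburned"),  # Sarah
    ("blonde", "tall",    "average", "yes", "none"),       # Dana
    ("brown",  "short",   "average", "yes", "none"),       # Alex
    ("blonde", "short",   "average", "no",  "sunburned"),  # Annie
    ("red",    "average", "heavy",   "no",  "sunburned"),  # Emily
    ("brown",  "tall",    "heavy",   "no",  "none"),       # Pete
    ("brown",  "average", "heavy",   "no",  "none"),       # John
    ("blonde", "short",   "light",   "yes", "none"),       # Katie
]
EXAMPLES = [(dict(zip(ATTRIBUTES, row[:4])), row[4]) for row in ROWS]

def info(examples):
    """Info(T): entropy of the class labels, in bits."""
    total = len(examples)
    counts = Counter(label for _, label in examples)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

def split(examples, attribute):
    """Partition the examples by their value of `attribute`."""
    parts = {}
    for features, label in examples:
        parts.setdefault(features[attribute], []).append((features, label))
    return parts

def info_after_split(examples, attribute):
    """Info(X,T): size-weighted entropy of the partitions."""
    total = len(examples)
    return sum(len(part) / total * info(part)
               for part in split(examples, attribute).values())

def gain(examples, attribute):
    """Gain(X,T) = Info(T) - Info(X,T)."""
    return info(examples) - info_after_split(examples, attribute)

def id3(attributes, examples):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:          # all examples in one class
        return labels[0]
    if not attributes:                 # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3(rest, part)
                   for value, part in split(examples, best).items()}}

for a in ATTRIBUTES:
    print(a, round(info_after_split(EXAMPLES, a), 3), round(gain(EXAMPLES, a), 3))

tree = id3(ATTRIBUTES, EXAMPLES)
print(tree)
```

On this data, hair has the largest gain at the root and lotion is chosen under the blonde branch, so the resulting tree has the same shape as the one in Example (3).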