Specialized English-language material for the machine learning course


Picking the Root Attribute

• The goal is to have the resulting decision tree as small as possible (Occam's Razor).
• The main decision in the algorithm is the selection of the next attribute to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, which originated with the ID3 system of Quinlan.
• Intuitively, a highly disorganized set (positives and negatives thoroughly mixed) has high entropy and requires much information to describe its labels, while a highly organized set has low entropy and requires little information.

Entropy

• Entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is the expected information required to know an element's label:

  Entropy(S) = -p_+ \log(p_+) - p_- \log(p_-)

  where p_+ is the proportion of positive examples in S and p_- is the proportion of negative examples.
• If all the examples belong to the same category: Entropy = 0.
• If the examples are equally mixed (0.5, 0.5): Entropy = 1.
• Why the logarithm? If the probability of + is 0.5, a single bit is required to encode each example's label; if it is 0.8, we need less than 1 bit.
• In general, when p_i is the fraction of examples labeled i:

  Entropy(\{p_1, p_2, \ldots, p_k\}) = -\sum_{i=1}^{k} p_i \log(p_i)

(The slides repeat this definition alongside a plot of the binary entropy curve.)
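To make the formula concrete, here is a minimal Python sketch (not from the slides; the function name `entropy` is just an illustrative choice) using base-2 logarithms, so that the equally mixed case comes out to exactly 1 bit:

```python
import math

def entropy(proportions):
    """Entropy({p_1, ..., p_k}) = -sum_i p_i * log2(p_i), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # equally mixed        -> 1.0 bit per example
print(entropy([1.0, 0.0]))   # pure set             -> 0 bits
print(entropy([0.8, 0.2]))   # skewed, p_+ = 0.8    -> about 0.72 bits, less than 1
```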
Information Gain

• For information gain, subtract the information required after the split from the information required before the split: some expected information is needed to label the examples before the split, and (hopefully less) after it.
• The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:

  Gain(S, a) = Entropy(S) - \sum_{v \in values(a)} \frac{|S_v|}{|S|} Entropy(S_v)

  where S_v is the subset of S for which attribute a has value v, and the entropy of the partitioned data is calculated by weighting the entropy of each partition by its size relative to the original set.
• Partitions that lower entropy correspond to high information gain.

Picking the Root Attribute

• Consider data with two Boolean attributes (A, B):

  <(A=0, B=0), ->: 50 examples
  <(A=0, B=1), ->: 50 examples
  <(A=1, B=0), ->: 3 examples
  <(A=1, B=1), +>: 100 examples

• What should be the first attribute we select?
• Splitting on A: the A=1 branch contains 100 positives and 3 negatives; the A=0 branch contains 100 negatives.
• Splitting on B: the B=1 branch contains 100 positives and 50 negatives; the B=0 branch contains 53 negatives.
• Computing the entropy of the resulting sets shows that the information gain of A is higher, so A should be selected first (the sketch after the table below checks this numerically).

Generalization

• We want assurance of accuracy on future examples, but computations from the training set are only estimates.
• The tree is built by a sequence of decisions, each of which is local, dependent on the earlier decisions, and approximate.
• How good are the resulting trees? This is an interesting question: following our algorithm gives no guarantee.

An Illustrative Example

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal   Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
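As a rough sketch of how the gain comparison plays out on the two examples above (not from the slides; the `entropy` and `information_gain` helpers and the "y" label key are illustrative choices), one could compute Gain(S, a) directly from label counts:

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Entropy of a set summarized by its per-label example counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(examples, attribute, label):
    """Gain(S, a) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    parent = Counter(e[label] for e in examples)
    subsets = defaultdict(Counter)              # label counts of each subset S_v
    for e in examples:
        subsets[e[attribute]][e[label]] += 1
    weighted = sum(sum(sub.values()) / len(examples) * entropy(sub.values())
                   for sub in subsets.values())
    return entropy(parent.values()) - weighted

# Boolean-attribute data set from the slide above (203 examples in total).
ab_data = ([{"A": 0, "B": 0, "y": "-"}] * 50
           + [{"A": 0, "B": 1, "y": "-"}] * 50
           + [{"A": 1, "B": 0, "y": "-"}] * 3
           + [{"A": 1, "B": 1, "y": "+"}] * 100)
print(information_gain(ab_data, "A", "y"))   # about 0.90 bits
print(information_gain(ab_data, "B", "y"))   # about 0.32 bits, so A is chosen first

# The 14 PlayTennis examples from the table above.
columns = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
tennis = [dict(zip(columns, r)) for r in rows]

# The attribute with the highest gain is the one the algorithm would pick as the root.
for a in columns[:-1]:
    print(a, round(information_gain(tennis, a, "PlayTennis"), 3))
```

Running this would confirm that A has the higher gain on the Boolean data and would rank the four PlayTennis attributes as candidates for the root split.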
