The Decision Tree Project

Arnel Curkic, Cato A. Goffeng & Amund Lågbu
March 6, 2014
Contents

1 Introduction
  1.1 About
  1.2 Raw data generation
  1.3 The problem's applications
  1.4 Attributes and conversion
    1.4.1 See5
    1.4.2 Weka

2 What others have done

3 Decision tree tools
  3.1 See5
  3.2 Weka

4 Trees and rules - what is best, and the difference between them

5 Interpretation of output & a review of rules generated
  5.1 See5
  5.2 Weka

6 Characterization of the generalizing ability of decision tree models using cross-validation

7 Classifications with different costs

8 Relevant decision tree options
  8.1 Winnowing
    8.1.1 Test with winnowing
    8.1.2 Test without winnowing
  8.2 Boosting
    8.2.1 A run with 10 boosting trials
    8.2.2 A run without boosting
    8.2.3 Comparison of runs
  8.3 Pruning techniques
  8.4 Algorithms which handle missing attributes

9 Evaluation

10 Appendix
  10.1 Java code created to shuffle .arff file
    10.1.1 main.java
    10.1.2 FileOrganizer.java
    10.1.3 ArrayShuffle.java
  10.2 A Java program for testing a small decision tree
    10.2.1 Main05b.java
    10.2.2 FileReader.java
    10.2.3 DecisionTree.java
    10.2.4 Rule.java
    10.2.5 FlowerClassifier.java
  10.3 Code to create a decision tree based on a tree handling missing data
Chapter 1
Introduction
1.1 About
We have chosen a dataset with information about flowers in the iris plant family. The four attributes in the set describe each flower's sepal length and width and petal length and width, and a class label specifies which species the flower belongs to. The dataset contains 150 instances of flower data [11].
The objective of this task is to create a decision tree which can separate the flowers from each other and determine which species a flower belongs to.
According to Mitchell, overfitting is one possible issue that may occur when a decision tree is generated. Overfitting happens when, for example, a decision tree is fitted too closely to the training data and therefore works poorly on later test data. Mitchell mentions two possible causes: 1) a tree where each branch is grown just deep enough to classify the training samples, and 2) noise in the data, or a training set too small to produce a representative sample of the target function [9].
Overfitting may occur in our project solution, and may be caused by both of the factors above, especially the size of the set: our dataset contains only 150 instances, which is a small training sample. Regardless of the possible problems, we separated the dataset in two parts - one with test data (50 instances), and another with training data (100 instances).
1.2 Raw data generation
The raw data was downloaded as .data files. The information in the files was sorted by flower species, so splitting them directly into training and testing data could lead to complications, since the first or last 50 instances would contain only one species of iris. Instead we reordered the file to make sure that 33-34 instances of each flower species were present in the training set, and 16-17 instances in the test set.
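As an illustration, below is a minimal Java sketch of such a stratified split. The file layout is as described above (one instance per line, species last); the file names are our own placeholders, and this is not the exact code we ran.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StratifiedSplit {
    public static void main(String[] args) throws Exception {
        // Group the instances by species (the last comma-separated field).
        Map<String, List<String>> byClass = new LinkedHashMap<String, List<String>>();
        for (String line : Files.readAllLines(Paths.get("iris.data"), StandardCharsets.UTF_8)) {
            if (line.trim().isEmpty()) continue;
            String species = line.substring(line.lastIndexOf(',') + 1);
            if (!byClass.containsKey(species))
                byClass.put(species, new ArrayList<String>());
            byClass.get(species).add(line);
        }
        // Shuffle within each class and send roughly a third to the test set,
        // giving 16-17 test and 33-34 training instances per species.
        List<String> train = new ArrayList<String>(), test = new ArrayList<String>();
        for (List<String> group : byClass.values()) {
            Collections.shuffle(group);
            int cut = group.size() / 3;
            test.addAll(group.subList(0, cut));
            train.addAll(group.subList(cut, group.size()));
        }
        Files.write(Paths.get("iris.train.data"), train, StandardCharsets.UTF_8);
        Files.write(Paths.get("iris.test.data"), test, StandardCharsets.UTF_8);
    }
}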
1.3 The problem's applications
We chose a dataset that poses a basic classification problem: use a set of values representing the width and length of a flower's sepal and petal to classify it into one of the three species of iris (setosa, versicolour or virginica).
1.4 Attributes and conversion
The attributes used in the dataset are measurements introduced by Sir Ronald Fisher, and the set consists of 50 instances of each of the flower species iris setosa, iris virginica and iris versicolor. Each instance contains its species and the width and length of the flower's petal and sepal. The measurements are all numerical values and are therefore given the continuous type when used in conjunction with See5, while the last attribute, the species, is the classification.
1.4.1 See5
We have installed the Windows version of C5.0 - See5. This is not the full version, but a tutorial version. It supports all the functionality of the full version of See5, except that the number of instances is limited to 400 [10]; since our dataset contains only 150 instances, we have chosen to use this tool.
Two file types are required as a minimum to run a decision tree algorithm with See5: a .data file containing the data and a .names file containing the attributes and classes. A file with test data (.test) is optional, and may be used to test the created decision tree [12].
We have created three files to run our decision tree algorithm: “iris.names”, “iris.data” and “iris.test” (files with the output and the decision tree are generated by See5 automatically).
Both “iris.test” and “iris.data” are written in the same way. The two lines in the example below are copied from “iris.test” and contain five comma-separated values each. The first four values are the attributes of the current instance, while the last is the instance’s class.
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
...
The data file was downloaded from the Internet, while the test file was
generated by moving 50 of the instances in the data file to the test file.
After finishing the data files we created a .names file based on information
provided in the tutorial “C5.0: An Informal Tutorial” [12]. The .names file
contains the following information:
class.
sepalLength: continuous.
sepalWidth: continuous.
petalLength: continuous.
petalWidth: continuous.
class: Iris-setosa,Iris-versicolor,Iris-virginica.
The first line, “class.”, names the attribute that will be predicted (in our case the flower’s class).
Then four attributes are described: “sepalLength”, “sepalWidth”, “petalLength” and “petalWidth”. All are defined as continuous, i.e. numeric, attributes.
1.4.2 Weka
For Weka 3.6 there is no need to convert file formats before using the selected dataset. Weka comes with a folder ‘data’, where the iris dataset is stored in an .arff file. The data is written in the following way:
@RELATION iris
@ATTRIBUTE sepallength REAL
....
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
....
Unlike the files used for generating a decision tree in See5, descriptive information and data are placed together in the .arff file. The lines below “@DATA” are the data in the file. @RELATION names the current relation, and the @ATTRIBUTE tag describes the attributes. REAL is used for numeric attributes (NUMERIC could also have been used). The line “@ATTRIBUTE class...” lists the different flower classes [5].
This file, however, contains all the data about the iris flowers, and is not separated into test and training data. Weka supports functionality to separate a dataset (as shown in figure 1.1), but we have done this manually, selecting the radio button “Supplied test set” and uploading our own set.
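For completeness, loading two such .arff files can also be done through Weka's Java API rather than the GUI. A minimal sketch (the file names are our own placeholders):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Load training and test sets from separate .arff files.
        Instances train = new DataSource("iris-train.arff").getDataSet();
        Instances test = new DataSource("iris-test.arff").getDataSet();
        // The class attribute is the last one in the iris file.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);
        System.out.println(train.numInstances() + " training instances, "
                + test.numInstances() + " test instances.");
    }
}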
Figure 1.1: Different functionality for training/testing.
The set has been generated by moving 50 of the flowers in the iris set into a new .arff file: 17 instances each of iris setosa and iris versicolor, and 16 instances of iris virginica, as shown in figure 1.2, where instances from the training file are passed into the test file.
Figure 1.2: Iris data.
We did not focus on picking the instances in each flower class randomly, since the instances did not appear to be sorted on more levels than flower class.
Chapter 2
What others have done
The iris dataset is a popular dataset as shown on the UCI Machine Learning
Repository website, where it is ranked as number one on the “most popular
data sets” list [11].
The iris dataset is used as one of many in a study where Mingers compares five different methods for pruning decision trees [8]. The five pruning methods were Error-Complexity Pruning (Err-Comp), Critical Value Pruning (Critical), Minimum-Error Pruning (Min-err), Reduced-Error Pruning (Reduce) and Pessimistic Error Pruning (Pessim). The experiment compares these five pruning methods in terms of the size of the pruned tree and its accuracy. The paper concludes that there are significant differences between the five pruning methods. The Min-err method was the least accurate; it also failed to provide a set of trees for the experts to examine. The Pessim pruning method was the crudest, but it was also the quickest and requires no separate test data. The three other pruning methods are very similar overall and produced consistently low error rates over all the data sets, including the iris dataset [8].
Another paper, by Buntine & Niblett [2], also uses the iris dataset as one of many data sets to test whether random splitting increases error. Buntine & Niblett [2] and Mingers [7] reached much the same conclusions. Buntine concludes that random splitting rules perform worse than other measures [2]. Mingers [7] likewise concluded that selecting attributes entirely at random produces trees that are as accurate as those produced by using measures. Selecting random attributes also resulted in larger unpruned decision trees compared to using measures; after pruning there was little difference in tree size [7].
Murthy et al. created a randomized algorithm for building oblique decision trees. The iris dataset is used along with other data sets to test this algorithm. The results showed that the algorithm produced small, accurate trees, and that its computational requirements are quite modest [4].
These are some of many examples where the iris dataset has been used. Throughout this project we will use the iris dataset to help us solve our problem situations. In the earlier research, both Mingers [7] and Buntine [2] use a variety of data sets, differing in size, number of attributes and other aspects, whereas we will only use the iris dataset, which is rather small [11]; as mentioned before, we might therefore experience overfitting.
Chapter 3
Decision tree tools
To perform the task we will use both See5 and Weka for decision tree making. We have therefore performed some basic tests of the tools’ functionality, to check whether we are able to use them.
3.1 See5
We have chosen See5 as a decision tree tool. The start-up window contains six
buttons, and only one is clickable - “Locate Data” (as shown in figure 3.1).
Figure 3.1: See5’s start-up window.
We may upload our data, test and description files by clicking on the selectable button. We are then presented with the screen shown in figure 3.2:
Figure 3.2: GUI presented when uploading files to See5.
After that we are able to create a decision tree by clicking the second button from the left - “Construct Classifier”. We are then presented with the window shown in figure 3.3. We notice that we could use cross-validation to train and test the system with varying test data, but since this is only a basic test, we use the default values and click “Ok”.
Figure 3.3: Window for classifier construction in See5.
See5 then generates a decision tree for us, and also tests the generated tree.
A window presents the output (as illustrated in figure 3.4):
Figure 3.4: Output in See5.
We notice that the decision tree is presented at the top with its different branches. By splitting the plants on whether they have petal length <= 1.9 or petal length > 1.9, the created decision tree is able to classify all the 33 iris setosas at the top of the tree. The algorithm then continues creating branches, and ends up with a tree that misclassifies 3 instances, a three percent error rate on the training data. On the test data the tree produces two percent errors.
All this information is also stored in two files generated automatically by
See5 - iris.tree and iris.out.
3.2
Weka
We have chosen Weka as a decision tree tool. Weka supports visualization of
data sets, as shown in figure 3.5, where the iris data set used in this task is
visualized.
Figure 3.5: Visualization of training data in Weka.
Immediately we discover one of the advantages of using Weka - the good visualization of data. We notice that the small training set may cause an error, since all iris setosas in the training set (colored blue) have smaller petal width than the other flowers, as shown in figure 3.5. A visualization of the full dataset, shown in figure 3.6, reveals that one of the setosa flowers has a larger petal width than the others. This was not visible in the training set, since that set did not contain the flower with petal width greater than the other setosas. The visualizations in Weka make this easier to notice.
Figure 3.6: Visualization of petal width of all 150 iris flowers.
Weka also implements a number of algorithms for decision tree making, among other functionality. These algorithms may be selected as shown in figure 3.7.
Figure 3.7: Algorithms in Weka.
Finally we tested Weka and analyzed the output of the RandomForest algorithm. This algorithm uses a number of trees, called a forest. The trees are generated based on random vectors, and once a large number of trees have been generated, they vote for the most popular class amongst them [1]. The algorithm was used to generate a decision tree test, and the output is shown in figure 3.8. As the output shows, the confusion matrix revealed that one of the iris versicolors was classified as iris virginica during the testing. The others were classified correctly. Since this is only a test of basic functionality in Weka, no further analysis was done.
Figure 3.8: Output from a test of the random forest algorithm available in Weka.
Chapter 4
Trees and rules - what is best, and the difference between them
Decision trees and rules are similar in many ways. One way to learn a set of rules is to first learn a decision tree and then translate the tree into an equivalent set of rules, one rule for each leaf node in the decision tree. A set of rules can similarly be translated into a decision tree. The sequential covering algorithm learns a set of rules by first learning a single accurate rule, then removing the positive examples covered by this rule, and continuing the process over the remaining training examples. This is an efficient, greedy algorithm for learning rule sets, similar to top-down decision tree learning algorithms such as the ID3 algorithm [9]. The main difference is that the ID3 algorithm can be viewed as simultaneous covering, rather than the sequential covering of the rule algorithm [9].
A decision tree model can be converted into a collection of if-then statements (a set of rules). The decision tree presentation is useful when you want to see how attributes in the data split the data into subsets relevant to the problem. The rule set presentation is useful if we want to see how particular groups of items relate to a specific conclusion.
We can look at decision rules as the verbal equivalent of the graphical decision tree.
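To illustrate, the tree we work with in chapter 5 (figure 5.2; thresholds taken from DecisionTree.java in section 10.2.3) translates into one if-then rule per leaf. A sketch in Java, assuming the numeric classes 1, 2 and 3 correspond to setosa, versicolor and virginica:

// One rule per leaf; thresholds from DecisionTree.java (section 10.2.3),
// species names from our assumed mapping of the numeric classes.
static String classify(float petalLength, float petalWidth) {
    if (petalWidth <= 0.5f)  return "Iris-setosa";      // leaf 1
    if (petalWidth > 1.7f)   return "Iris-virginica";   // leaf 2
    if (petalLength <= 4.9f) return "Iris-versicolor";  // leaf 3
    return "Iris-virginica";                            // leaf 4
}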
Chapter 5
Interpretation of output & a review of rules generated
5.1 See5
We used See5 to generate a decision tree for classifying flower species. We used no boosting trials and no cross-validation, but a test set of 50 instances to check the tree built from our training set of 100 instances of flower data.
The screen dump in figure 5.1 shows the output from See5.
Figure 5.1: Output in See5.
The output may be interpreted into the decision tree shown in figure 5.2.
Figure 5.2: Illustration of decision tree output from C5.
The confusion matrix showed that this relatively small tree generated only 2% errors when run on the test data, but a higher error rate on the training data: 5%.
This tree has few rules, and every rule seems to be of importance. The first level of branches separates all the iris setosas from the rest of the iris flowers. The two branches on the next level then separate the iris versicolors from the iris virginicas, with some errors.
Since the generated tree was quite small, and clearly separates the flowers, we have not removed any of the tree’s branches. All the rules seem sensible.
5.2 Weka
For the Weka test we decided to run the training on 100 instances of data, and tried cross-validation. The reason we trained with only 100 data instances is that we later want to check whether any of the rules are unnecessary by running the 50 instances in the test set through the tree. This approach was chosen mainly for the sake of learning.
Since this dataset was sorted, we also shuffled the data before performing the test. A small Java program was created for this purpose; the program code is shown in section 10.1.
In Weka we chose an algorithm for our decision tree making. The J48 algorithm was chosen, as it has been shown to work well with the iris dataset before: the researchers Tiwari, Srivastava and Pandey have compared decision tree algorithms on the iris dataset, and found that the J48 algorithm is a good choice for small to medium data sets such as iris [17].
The J48 algorithm is Weka’s implementation of an algorithm better known as C4.5. This algorithm is able to handle cases with missing data. It is also a divide-and-conquer algorithm which splits a dataset into smaller disjoint sets [6].
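The same experiment can also be scripted against Weka's Java API instead of the GUI; a minimal sketch (the file name refers to the shuffled set produced by the program in section 10.1):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Run {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("shuffled.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // Five-fold cross-validation of Weka's C4.5 implementation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 5, new Random(1));
        System.out.println(eval.toMatrixString("Confusion matrix:"));
    }
}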
Several tests were performed with the training set, and the number of cross-validation folds was adjusted to see if that had an impact on the result. In the end a test with five cross-validation folds was selected. That particular test generated a confusion matrix with seven incorrect classifications, which is shown in figure 5.3. The matrix was linked to the decision tree shown in figure 5.4. This decision tree is further illustrated in figure 5.5.
Figure 5.3: Illustration of an output confusion matrix from Weka.
Figure 5.4: Illustration of an output decision tree from Weka.
Figure 5.5: Illustration of the decision tree shown in figure 5.4.
To check whether any of the rules were unnecessary, a Java program was created for this purpose. The program uses the generated tree to classify flowers. It also modifies the tree by removing sets of branches, and then prints the confusion matrix of every modified decision tree. For simplicity the Java program uses a data file originally modified for Cubist; this file has no flower names as strings, but numbers instead. The program is shown in section 10.2.
After running the program twice - once with the 150 instances in the complete data file and once with the 50 instances in the test file - the program wrote the output shown in figures 5.6 and 5.7.
Figure 5.6: Decision tree test of the complete data set performed with a program
written in Java.
Figure 5.7: Decision tree test of the test data.
According to the confusion matrices for the trees modified with Java, it is possible to shrink the size of the tree and still get good results. If we drop the lowest two branches of the tree, we get a tree which generates only seven errors for the whole dataset, and two errors for the test set (which is just as good as the complete tree on the same set). It is possible that the full tree was overfitted when trained with the 100 instances in the training set, and that a smaller tree would work just as well, or better, with other data.
Mitchell mentions that it is hard to decide when to stop growing a decision tree [9], and further testing may therefore be necessary before drawing a conclusion. But according to William of Ockham’s principle, “Occam’s Razor”, a simple, small tree should always be preferred over a larger one [13], and this principle therefore strengthens the argument for choosing the modified tree instead of the original one. (The confusion matrices for the modified tree mentioned in this section are shown in figures 5.6 and 5.7 as “Confusion matrix without rule 3”.)
The tree mentioned above is illustrated in figure 5.8 - without the two lowest branches.
Figure 5.8: Decision tree with modifications done after a review of a tree’s
generated rules.
Chapter 6
Characterization of the generalizing ability of decision tree models using cross-validation
With a limited dataset the risk of overfitting is high. To limit overfitting of the predictive model to the validation data, we experimented with different techniques offered by See5. Cross-validation enables you to train and test on the same dataset, which in our case was 150 instances. There are multiple ways to perform this type of validation, but the See5 software uses a K-fold algorithm that divides the data evenly over a selected number of folds (subsets); we chose five for this exercise. On the first iteration four subsets are used to train the model while the last subset is used for testing. The subset used for testing is then swapped with one of the training subsets on the next iteration, and this process continues until all subsets have been used both for training and testing.
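As an illustration, a sketch in Java of one common way to form such folds (See5's internal procedure is not documented in detail, so this is only indicative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class FoldSplit {
    // Shuffle the instance indices once, then deal them out round-robin,
    // so each of the k folds gets n/k (here 30 of 150) instances.
    public static List<List<Integer>> makeFolds(int n, int k, long seed) {
        List<Integer> idx = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        List<List<Integer>> folds = new ArrayList<List<Integer>>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<Integer>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(idx.get(i));
        // Iteration f then tests on folds.get(f) and trains on the rest.
        return folds;
    }
}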
Fold    Decision Tree
----    ----------------
        Size    Errors

  1       4      3.3%
  2       4      3.3%
  3       3     13.3%
  4       4      3.3%
  5       4      0.0%

Mean    3.8      4.7%
SE      0.2      2.3%

         (a)   (b)   (c)    <-classified as
        ----  ----  ----
          49     1          (a): class Iris-setosa
                48     2    (b): class Iris-versicolor
                 4    46    (c): class Iris-virginica
This enabled us to test the performance of different predictive models on the same training set. The growth in combined error percentage might seem like a negative, but the result is more generalized and therefore considered less biased.
Another option is to use leave-one-out cross-validation (LOOCV). This is accomplished in See5 by selecting the cross-validation option with a fold count equal to the number of instances in the data.
Mean    4.0      4.7%
SE      0.0      1.7%

         (a)   (b)   (c)    <-classified as
        ----  ----  ----
          49     1          (a): class Iris-setosa
                47     3    (b): class Iris-versicolor
                 3    47    (c): class Iris-virginica
This algorithm uses one instance of the dataset for testing and the remaining data for training, and repeats the process for all the instances. This produced the same error rate, but a lower variability of the means.
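In terms of the fold sketch above, LOOCV is simply the special case where the fold count equals the number of instances:

// Leave-one-out on the full iris set: 150 folds of one instance each.
List<List<Integer>> loocv = FoldSplit.makeFolds(150, 150, 42L);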
Lastly we tried K-fold cross-validation on decreasing datasets to emphasize the problem of small datasets. The table in figure 6.1 shows a clear increase in error rate as we reduce the dataset, compared to our result of 4.7% on 150 instances.
Figure 6.1: Table showing increase in error rate.
Chapter 7
Classifications with different costs
Differing costs will not be a problem for our chosen dataset. Compare this with, for example, a heart disease dataset, where a wrong classification could have major consequences: if a sick patient is diagnosed as healthy, this could lead to big problems. In our case, where we use the iris dataset to classify flowers in the iris family, this is not a major concern. If one flower is wrongly classified it will not cause a problem, and we therefore do not have to use a costs file.
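For reference, had costs mattered, See5 can read an optional costs file (here it would be named iris.costs). To the best of our understanding of the C5.0 tutorial [12], each entry names a predicted class, a true class and the cost of that particular mistake; the exact syntax should be checked against [12]. An illustrative entry:

Iris-virginica, Iris-versicolor: 5

This would make classifying a true iris versicolor as iris virginica five times as costly as an ordinary error.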
Chapter 8
Relevant decision tree options
8.1 Winnowing
Winnowing is used to reduce the number of attributes used in a decision tree, by pre-selecting predictors. Usually this process leads to a different classifier than would otherwise have been produced. Winnowing is, according to RuleQuest Research, best suited for large applications, where many of the attributes probably have only a small impact on the classification task [15].
We have made a small example in See5 which demonstrates the use of winnowing with our small dataset. Classifiers for the complete dataset, consisting of 150 instances, were constructed with See5 both with and without the use of winnowing. The results are presented in 8.1.1 and 8.1.2.
8.1.1 Test with winnowing
The test in figure 8.1 shows that three attributes have been winnowed, and that only the petal width is used as a predictor.
Figure 8.1: Test with winnowing for decision tree construction.
8.1.2 Test without winnowing
The test without winnowing (see figure 8.2) removes no predictors, and two attributes are used when constructing the decision tree - first petal length and then petal width.
Figure 8.2: Test without the use of winnowing.
8.2 Boosting
According to Appel et al., boosting is one of the most used decision tree learning techniques. The technique generates several so-called “weak learners” (trained decision trees with poor performance) and combines them to produce a single, strong classifier [3].
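To make the reweighting idea concrete, below is a minimal two-class AdaBoost-style sketch in Java. The one-attribute threshold stump stands in for the weak learner purely to keep the example self-contained; See5's actual weak learners are full decision trees, and its exact reweighting scheme may differ in detail.

import java.util.Arrays;

public class BoostSketch {

    static int stumpPredict(double[] x, int attr, double thresh) {
        return x[attr] <= thresh ? 0 : 1;
    }

    public static void boost(double[][] X, int[] y, int trials) {
        int n = X.length;
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);                 // start with uniform weights
        for (int t = 0; t < trials; t++) {
            // Fit the stump with the lowest *weighted* error.
            int bestAttr = 0; double bestThresh = 0, bestErr = Double.MAX_VALUE;
            for (int a = 0; a < X[0].length; a++)
                for (double[] row : X) {
                    double err = 0;
                    for (int i = 0; i < n; i++)
                        if (stumpPredict(X[i], a, row[a]) != y[i]) err += w[i];
                    if (err < bestErr) { bestErr = err; bestAttr = a; bestThresh = row[a]; }
                }
            if (bestErr == 0 || bestErr >= 0.5) break;  // weak learner too good/bad
            // Down-weight the cases this trial got right, so the next
            // trial concentrates on the remaining hard cases.
            double beta = bestErr / (1 - bestErr), sum = 0;
            for (int i = 0; i < n; i++) {
                if (stumpPredict(X[i], bestAttr, bestThresh) == y[i]) w[i] *= beta;
                sum += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= sum;    // renormalize to sum 1
            // The final classifier votes over all trials, trial t weighted
            // by log(1 / beta_t).
        }
    }
}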
See5 implements functionality called adaptive boosting. The user may set the number of trials to be generated in See5 - and every trial will generate a decision tree. According to RuleQuest Research, a 10-classifier boosting will reduce the error rate by about 25% [15].
We have performed boosting in See5 by entering the desired number of trials. We have compared a 10-trial boosting with a run without boosting at all. Pruning was not in use during the runs, to reduce interference from other functionality in See5.
8.2.1 A run with 10 boosting trials
The decision tree creation with boosting reduces the error rate to 0% on the training set. The tree uses all four attributes for classifying the flowers. On the test data we get one error: one of the iris versicolors is classified as an iris virginica.
Evaluation on training data (100 cases):

Trial        Decision Tree
-----      ----------------
           Size      Errors

   0         4      3( 3.0%)
   1         4      7( 7.0%)
   2         3     11(11.0%)
   3         3     12(12.0%)
   4         5      3( 3.0%)
   5         3      5( 5.0%)
   6         5      2( 2.0%)
   7         3     36(36.0%)
   8         5     38(38.0%)
   9         4      4( 4.0%)
boost               0( 0.0%)   <<

         (a)   (b)   (c)    <-classified as
        ----  ----  ----
          33                (a): class Iris-setosa
                33          (b): class Iris-versicolor
                      34    (c): class Iris-virginica

Attribute usage:

    100%  petalLength
    100%  petalWidth
     67%  sepalWidth
     40%  sepalLength
Evaluation on test data (50 cases):

Trial        Decision Tree
-----      ----------------
           Size      Errors

   0         4      1( 2.0%)
   1         4      3( 6.0%)
   2         3      5(10.0%)
   3         3      4( 8.0%)
   4         5      2( 4.0%)
   5         3      3( 6.0%)
   6         5      2( 4.0%)
   7         3     23(46.0%)
   8         5     21(42.0%)
   9         4      1( 2.0%)
boost               1( 2.0%)   <<

         (a)   (b)   (c)    <-classified as
        ----  ----  ----
          17                (a): class Iris-setosa
                16     1    (b): class Iris-versicolor
                      16    (c): class Iris-virginica

Time: 0.0 secs
8.2.2 A run without boosting
Without boosting, See5 generates a tree with three errors on the training set. The tree uses only two attributes, and generates one error on the test set.
Evaluation on training data (100 cases):

     Decision Tree
   ----------------
   Size      Errors

     4      3( 3.0%)   <<

         (a)   (b)   (c)    <-classified as
        ----  ----  ----
          33                (a): class Iris-setosa
                31     2    (b): class Iris-versicolor
                 1    33    (c): class Iris-virginica

Attribute usage:

    100%  petalLength
     67%  petalWidth

Evaluation on test data (50 cases):

     Decision Tree
   ----------------
   Size      Errors

     4      1( 2.0%)   <<

         (a)   (b)   (c)    <-classified as
        ----  ----  ----
          17                (a): class Iris-setosa
                16     1    (b): class Iris-versicolor
                      16    (c): class Iris-virginica

Time: 0.0 secs
8.2.3 Comparison of runs
The run with 10 boosting trials created a tree with fewer errors on the training set than the run without boosting, but on the test set both decision trees produced the same number of errors. The decision tree generated with boosting also used more attributes than the tree generated without boosting.
Since our dataset is small, and since both trees produced the same number of errors on the test set, it is difficult to conclude whether the boosted tree is better than the tree generated without boosting. The low error percentage on the training set speaks in the boosted tree’s favour, but the higher number of attributes/nodes speaks in the unboosted tree’s favour, according to Occam’s principle (mentioned in section 5.2).
8.3 Pruning techniques
Pruning is a technique in machine learning for reducing the size of a decision tree by going through each subtree and deciding whether it should be replaced with a leaf or a sub-branch. Turning off Global Pruning in See5 generally results in larger decision trees and rulesets [15]. A sketch of the general idea is shown below.
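Below is a reduced-error-style sketch of the subtree-replacement idea in Java. This is not See5's global pruning, whose statistical criterion differs; the Node type is hypothetical and minimal.

import java.util.ArrayList;
import java.util.List;

public class PruneSketch {
    static class Node {
        Node left, right;          // both null for a leaf
        int attr; double thresh;   // split test: x[attr] <= thresh
        int majorityClass;         // most common training class at this node
        boolean isLeaf() { return left == null; }
    }

    static int classify(Node n, double[] x) {
        while (!n.isLeaf()) n = x[n.attr] <= n.thresh ? n.left : n.right;
        return n.majorityClass;
    }

    // Visit subtrees bottom-up and collapse any whose majority-class leaf
    // would do no worse on the pruning data that reaches it.
    static void prune(Node n, List<double[]> X, List<Integer> y) {
        if (n.isLeaf()) return;
        // Route the pruning data down the split, then prune children first.
        List<double[]> xl = new ArrayList<double[]>(), xr = new ArrayList<double[]>();
        List<Integer> yl = new ArrayList<Integer>(), yr = new ArrayList<Integer>();
        for (int i = 0; i < X.size(); i++) {
            if (X.get(i)[n.attr] <= n.thresh) { xl.add(X.get(i)); yl.add(y.get(i)); }
            else { xr.add(X.get(i)); yr.add(y.get(i)); }
        }
        prune(n.left, xl, yl);
        prune(n.right, xr, yr);
        // Compare errors as a leaf vs. as the (already pruned) subtree.
        int asLeaf = 0, asSubtree = 0;
        for (int i = 0; i < X.size(); i++) {
            if (y.get(i) != n.majorityClass) asLeaf++;
            if (classify(n, X.get(i)) != y.get(i)) asSubtree++;
        }
        if (asLeaf <= asSubtree) { n.left = null; n.right = null; } // collapse
    }
}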
We tested the pruning technique in See5 by turning off the “Global Pruning” option (see figure 3.3). We tested this on the complete iris dataset with 150 instances, and then looked at how the size of the decision tree and the number of rules differ with and without pruning.
Figure 8.3: Test without pruning showing the size of the tree.
Figure 8.4: Test without pruning showing number of rules.
Figure 8.5: Test with pruning showing the size of the tree.
Figure 8.6: Test with pruning showing number of rules.
The results we obtained by turning the pruning option on and off are shown in the figures above. We see that the number of rules is the same both with and without pruning. The size of the decision tree differs - from 5 leaves without pruning to 4 leaves with pruning.
8.4 Algorithms which handle missing attributes
The iris dataset has been modified for this task to illustrate a solution to this general problem: 50 of the attribute values have been replaced with a question mark, which is Weka’s symbol for missing values. This was done by shuffling the data in the .arff file using the Java program in section 10.1, adding question marks to one of the attributes in the first 50 instances in the file, and then shuffling the file again.
Provost and Saar-Tsechansky mention three possible treatments of missing values in a dataset [14]:
1. The instances with missing values may be discarded. This works best if the values are missing at random [14].
2. The values may be acquired. This treatment is not performed by any algorithm, but involves, for example, buying the missing values in the set [14].
3. Imputation. This treatment involves estimating the missing value(s). There are several techniques used by different algorithms to estimate values [14].
We have used Weka’s J48 algorithm during the test. We have also used Weka’s filter “ReplaceMissingValues” to handle the missing values. The filter replaces all missing values for, in this case, numeric attributes with the means from the training data [16].
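As an illustration of treatment 3, here is a sketch of what mean imputation does for numeric attributes. This is our own simplified Java version, not Weka's actual implementation; missing entries are NaN here, where the .arff file uses '?'.

public class MeanImpute {
    public static void impute(double[][] data) {
        int cols = data[0].length;
        for (int c = 0; c < cols; c++) {
            // Mean of the non-missing values in this column.
            double sum = 0; int count = 0;
            for (double[] row : data)
                if (!Double.isNaN(row[c])) { sum += row[c]; count++; }
            double mean = count > 0 ? sum / count : 0.0;
            // Replace every missing entry with that mean.
            for (double[] row : data)
                if (Double.isNaN(row[c])) row[c] = mean;
        }
    }
}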
Figures 8.7 and 8.8 illustrate the output from the tree construction. In the first case the ReplaceMissingValues filter was not in use; in the second run the filter was applied. Both constructions use cross-validation for testing on the full dataset.
To test the tree generated in Weka, it was not enough to just perform cross-validation during creation, since this functionality only tests with the data generated by Weka itself. Instead we created a new decision tree in our Java program, which is able to create confusion matrices. This program is shown in section 10.2. Some code was added to perform the test with the decision tree generated in Weka on the 150 instances of data with no missing values; this code is shown in section 10.3.
Figure 8.7: Output from test with no replacement of missing values.
Figure 8.8: Output from test with replacement of missing values by applying
the filter ’replaceMissingValues’ in Weka.
The tree created while constructing with the use of the replaceMissingValues
filter is shown in figure 8.9.
Figure 8.9: Decision tree created while using the replaceMissingValues filter.
The generated decision tree managed surprisingly well when tested with the Java program. Figure 8.10 shows the tree’s confusion matrix, with only a few errors when tested on the full dataset without missing data.
Figure 8.10: Confusion matrix to the decision tree where the values were generated by Weka.
Chapter 9
Evaluation
The main difficulty we had while working on this assignment was the small number of records in our dataset. This contributed to less precise predictions and to complications, as shown in several of the operations. We believe that increasing the amount of training data would help generate more precise prediction models and produce better results.
Bibliography
[1] Leo Breiman. Random forests. http://oz.berkeley.edu/~breiman/randomforest2001.pdf, 2001.

[2] Wray Buntine and Tim Niblett. Machine learning: A further comparison of splitting rules for decision-tree induction. http://link.springer.com/article/10.1023/A:1022686419106, 1992.

[3] Appel et al. Quickly boosting decision trees - pruning underachieving features early. http://jmlr.org/proceedings/papers/v28/appel13.pdf, 2013.

[4] Murthy et al. OC1: A randomized algorithm for building oblique decision trees. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.6068&rep=rep1&type=pdf, 1993.

[5] Richard Kirkby et al. Attribute-relation file format (ARFF). http://www.cs.waikato.ac.nz/~ml/weka/arff.html, 2008.

[6] A. Kusiak. Decision tree algorithm. http://read.pudn.com/downloads110/ebook/457186/C4.5%20%E5%86%B3%E7%AD%96%E6%A0%91/DecisionT1.pdf.

[7] John Mingers. Machine learning: An empirical comparison of selection measures for decision-tree induction. http://link.springer.com/article/10.1007/BF00116837, 1988.

[8] John Mingers. Machine learning: An empirical comparison of pruning methods for decision tree induction. http://link.springer.com/article/10.1023/A:1022604100933, 1989.

[9] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[10] No named author. Data mining lab. http://cecs.louisville.edu/datamining/dmlab/resources.html.

[11] No named author. Iris data set. http://archive.ics.uci.edu/ml/datasets/Iris.

[12] No named author. C5.0: An informal tutorial. http://www-ia.hiof.no/~rolando/ML/c50tutorial.html#.names, 2003.

[13] M. Nashville. Data mining with decision trees. http://decisiontrees.net/decision-trees-tutorial/tutorial-3-occams-razor/.

[14] F. Provost and M. Saar-Tsechansky. Handling missing values when applying classification models. http://www2.mccombs.utexas.edu/faculty/maytal.saar-tsechansky/ResearchPapers/saar-tsechansky07a.pdf, 2007.

[15] RuleQuest Research. See5: An informal tutorial. http://www.rulequest.com/see5-win.html#OTHER, 2013.

[16] SourceForge. Class ReplaceMissingValues. http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/ReplaceMissingValues.html.

[17] M. Tiwari, V. Srivastava and V. Pandey. Comparative investigation of decision tree algorithms on iris data. http://warse.org/pdfs/2013/ijacst03232013.pdf, 2013.
Chapter 10
Appendix
10.1 Java code created to shuffle .arff file
10.1.1 main.java
package Task01;
public class main {
public static void main(String[] args)
{
String filePath =
"C:\\Users\\amund_000\\Desktop\\Maskinlæring\\Project1\\iris.arff";
FileOrganizer fo = new FileOrganizer(filePath);
if(fo.initFile())
{
ArrayShuffle as = new ArrayShuffle();
if(as.shuffle(fo.getListsWithFlowerInfo()))
{
fo.store(as.getFlowerLists());
}
else
System.out.println("Couldn’t shuffle file.");
}
else
System.out.println("Couldn’t get/init file.");
}
}
10.1.2 FileOrganizer.java
package Task01;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
public class FileOrganizer
{
private String path;
@SuppressWarnings("rawtypes")
private ArrayList[] listsOfFileInfo;
public FileOrganizer(String filePath)
{
this.listsOfFileInfo = new ArrayList[2];
this.listsOfFileInfo[0] = new ArrayList<String>();
this.listsOfFileInfo[1] = new ArrayList<String>();
this.path = filePath;
}
@SuppressWarnings({ "resource", "unchecked" })
public boolean initFile()
{
FileReader fr;
BufferedReader br;
try
{
fr = new FileReader(this.path);
br = new BufferedReader(fr);
boolean storingHeader = true;
String line;
try
{
line = br.readLine();
while(line != null)
{
if(storingHeader)
{
this.listsOfFileInfo[0].add(line);
}
else
{
this.listsOfFileInfo[1].add(line);
//System.out.println(line);
}
if(line.startsWith("@DATA"))
storingHeader = false;
line = br.readLine();
}
}
catch (IOException e)
{
return false;
}
}
catch (FileNotFoundException e)
{
return false;
}
return true;
}
@SuppressWarnings("unchecked")
public ArrayList<String>[] getListsWithFlowerInfo()
{
return (ArrayList<String>[]) this.listsOfFileInfo;
}
public void store(ArrayList<String>[] flowerLists)
{
try
{
PrintWriter writer = new PrintWriter(
    "C:\\Users\\amund_000\\Desktop\\Maskinlæring\\Project1\\shuffled.arff",
    "UTF-8");
for(int i = 0 ; i < flowerLists.length ; i++)
{
for(int j = 0 ; j < flowerLists[i].size() ; j++)
{
writer.write(flowerLists[i].get(j));
writer.println();
}
}
System.out.println("Success!");
writer.close();
}
catch (FileNotFoundException e)
{
e.printStackTrace();
}
catch (UnsupportedEncodingException e)
{
e.printStackTrace();
}
}
}
10.1.3 ArrayShuffle.java
package Task01;
import java.util.ArrayList;
public class ArrayShuffle
{
private ArrayList<String> shuffledList;
private ArrayList<Integer> dataList;
private ArrayList<String>[] finalList;
public ArrayShuffle()
{
this.shuffledList = new ArrayList<String>();
}
public ArrayList<String>[] getFlowerLists()
{
return this.finalList;
}
public boolean shuffle(ArrayList<String>[] listsWithFlowerInfo)
{
int lengthOfFlowerData = listsWithFlowerInfo[1].size();
if(lengthOfFlowerData > 0)
{
dataList = new ArrayList<Integer>();
for(int i = 0 ; i < lengthOfFlowerData ; i++)
{
dataList.add(i);
}
while(dataList.size() > 0)
{
int randomIndex = (int) Math.floor(Math.random()*dataList.size());
this.shuffledList.add
(listsWithFlowerInfo[1].get(dataList.remove(randomIndex)));
}
listsWithFlowerInfo[1] = shuffledList;
this.finalList = listsWithFlowerInfo;
}
else
return false;
return true;
}
}
10.2 A Java program for testing a small decision tree
10.2.1 Main05b.java
package Task05b;
import java.util.ArrayList;
public class Main05b
{
private static final String
    //DATA_PATH = "C:\\Users\\amund_000\\Desktop\\Maskinlæring\\testFiles\\test4\\irisTestData.data",
    DATA_PATH =
        "C:\\Users\\amund_000\\Desktop\\Maskinlæring\\testFiles\\test4\\iris.data",
    TREE_PATH =
        "C:\\Users\\amund_000\\Desktop\\Maskinlæring\\testFiles\\test4\\iris.treeFile";
private static ArrayList<ArrayList<Float>> data;
public static void main(String[] args)
{
_FileReader fileReader = new _FileReader(DATA_PATH);
if(fileReader.fileHasBeenRead())
data = fileReader.getData();
//testData();
DecisionTree dt;
//We could have used tree data read from a file, but since this task is
//somewhat small, we have chosen not to do that.
if(fileReader.ReadTreeData(TREE_PATH))
{
dt = new DecisionTree();
dt.createOrdinaryDecisionTree();
FlowerClassifier fc = new FlowerClassifier(dt);
fc.classifyData(data, 0);
dt.creatDecisionTreeWithoutRule1();
fc.setDecisionTree(dt);
fc.classifyData(data, 1);
dt.creatDecisionTreeWithoutRule2();
fc.setDecisionTree(dt);
fc.classifyData(data, 2);
dt.creatDecisionTreeWithoutRule3();
fc.setDecisionTree(dt);
fc.classifyData(data, 3);
}
}
@SuppressWarnings("unused")
private static void testData()
{
for(int i = 0 ; i < data.size() ; i++)
{
for(int j = 0 ; j < data.get(i).size() ; j++)
{
System.out.print(data.get(i).get(j) + " ");
}
System.out.println();
}
}
}
10.2.2 FileReader.java
package Task05b;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
public class _FileReader
{
private ArrayList<ArrayList<Float>> flowerData;
private boolean dataFileRead;
private String treeData;
public _FileReader(String dataPath)
{
flowerData = new ArrayList<ArrayList<Float>>();
dataFileRead = false;
readFile(dataPath);
}
public boolean fileHasBeenRead()
{
return this.dataFileRead;
}
private void readFile(String dataPath)
{
try
{
FileReader fr = new FileReader(dataPath);
BufferedReader br = new BufferedReader(fr);
String line = br.readLine();
/*
 * Since we know the content of the file, there is no need to check
 * each line for, for example, invalid formatting in this small
 * example code.
 */
while(line != null)
{
getDataFromLine(line);
line = br.readLine();
}
br.close();
dataFileRead = true;
}
catch (FileNotFoundException e)
{
dataFileRead = false;
}
catch(IOException e)
{
dataFileRead = false;
}
}
@SuppressWarnings("unchecked")
private void getDataFromLine(String line)
{
ArrayList<Float> arrList = new ArrayList<Float>();
String[] dataArr = line.split(",");
for(int i = 0 ; i < dataArr.length ; i++)
arrList.add(convertStringToFloat(dataArr[i]));
this.flowerData.add((ArrayList<Float>) arrList.clone());
}
private Float convertStringToFloat(String string)
{
try
{
float f = Float.parseFloat(string);
return f;
}
catch(NumberFormatException e)
{
System.out.println("Could not format the data correctly.\n Exiting program.");
System.exit(0);
}
return 0f;
}
public ArrayList<ArrayList<Float>> getData()
{
return this.flowerData;
}
public String getTreeData()
{
return this.treeData;
}
//The tree-file has been modified to make creation of a tree easier
//to accomplish.
public boolean ReadTreeData(String path)
{
treeData = "";
try
{
FileReader fr = new FileReader(path);
BufferedReader br = new BufferedReader(fr);
String line = br.readLine();
while(line != null)
{
treeData += line;
line = br.readLine();
}
br.close();
return true;
}
catch (FileNotFoundException e)
{
return false;
}
catch(IOException e)
{
return false;
}
}
}
10.2.3 DecisionTree.java
package Task05b;
public class DecisionTree
{
private Rule rootRule;
//The decision tree could have been created based on the
//data read from file,
//instead the tree is hard coded for this small example.
public void createOrdinaryDecisionTree()
{
Rule r3 = new Rule();
r3.createRule(2, 4.9f, false, false, null, null, 2, 3);
Rule r2 = new Rule();
r2.createRule(3, 1.7f, true, false, r3, null, 0, 3);
Rule r1 = new Rule();
r1.createRule(3, .5f, false, true, null, r2, 1, 0);
this.rootRule = r1;
}
public void creatDecisionTreeWithoutRule1()
{
Rule r3 = new Rule();
r3.createRule(2, 4.9f, false, false, null, null, 2, 3);
Rule r2 = new Rule();
r2.createRule(3, 1.7f, true, false, r3, null, 0, 3);
this.rootRule = r2;
}
public void creatDecisionTreeWithoutRule2()
{
Rule r3 = new Rule();
r3.createRule(2, 4.9f, false, false, null, null, 2, 3);
Rule r1 = new Rule();
r1.createRule(3, .5f, false, true, null, r3, 1, 0);
this.rootRule = r1;
}
public void creatDecisionTreeWithoutRule3()
{
Rule r2 = new Rule();
r2.createRule(3, 1.7f, false, false, null, null, 2, 3);
Rule r1 = new Rule();
r1.createRule(3, .5f, false, true, null, r2, 1, 0);
this.rootRule = r1;
}
public Rule getRootRule()
{
return this.rootRule;
}
}
10.2.4 Rule.java
package Task05b;
import java.util.ArrayList;
public class Rule
{
private boolean hasLeftRule, hasRightRule;
private Rule leftRule, rightRule;
private int leftClass, rightClass;
private int attributeNumber;
private float threshold;
public void createRule(int attribute, float thresh, boolean hasLeft,
boolean hasRight, Rule left, Rule right, int classL, int classR)
{
this.attributeNumber = attribute;
this.threshold = thresh;
this.hasLeftRule = hasLeft;
this.hasRightRule = hasRight;
this.leftClass = classL;
this.rightClass = classR;
if(this.hasLeftRule)
this.leftRule = left;
if(this.hasRightRule)
this.rightRule = right;
}
public int checkRule(ArrayList<Float> arrList)
{
if(arrList.get(attributeNumber) <= threshold)
{
if(hasLeftRule)
return this.leftRule.checkRule(arrList);
else
return leftClass;
}
else
{
if(hasRightRule)
return this.rightRule.checkRule(arrList);
else
return rightClass;
}
}
}
10.2.5 FlowerClassifier.java
package Task05b;
import java.util.ArrayList;
public class FlowerClassifier
{
private DecisionTree decisionTree;
private int[][] confusionMatrix;
public FlowerClassifier(DecisionTree dt)
{
this.decisionTree = dt;
}
public void setDecisionTree(DecisionTree dt)
{
this.decisionTree = dt;
}
public void classifyData(ArrayList<ArrayList<Float>> data,
int ruleNotUsed)
{
if(ruleNotUsed == 0)
    System.out.print("Confusion matrix to decision tree with all rules\n"
        + "------------------------------------------------\n");
if(ruleNotUsed == 1)
    System.out.print("Confusion matrix to decision tree without rule 1\n"
        + "------------------------------------------------\n");
if(ruleNotUsed == 2)
    System.out.print("Confusion matrix to decision tree without rule 2\n"
        + "------------------------------------------------\n");
if(ruleNotUsed == 3)
    System.out.print("Confusion matrix to decision tree without rule 3\n"
        + "------------------------------------------------\n");
confusionMatrix = new int[3][3];
for(int i = 0 ; i < confusionMatrix.length ; i++)
for(int j = 0 ; j < confusionMatrix[i].length ; j++)
confusionMatrix[i][j] = 0;
for(int i = 0 ; i < data.size(); i++)
classify(data.get(i));
for(int i = 1 ; i <= 3 ; i++)
System.out.print(i + "\t");
System.out.println();
int classNumber = 1;
for(int i = 0 ; i < confusionMatrix.length ; i++)
{
for(int j = 0 ; j < confusionMatrix[i].length ; j++)
System.out.print(confusionMatrix[i][j] + "\t");
System.out.println("\t | class " + classNumber);
classNumber++;
}
System.out.println();
}
private void classify(ArrayList<Float> arrayList)
{
confusionMatrix[(int)
((float) arrayList.get(arrayList.size() - 1)) -1]
[this.decisionTree.getRootRule().checkRule(arrayList) - 1]++;
}
}
10.3 Code to create a decision tree based on a tree handling missing data
public void createDecisionTreeWithoutMissingValues()
{
Rule r1 = new Rule();
r1.createRule(2, 4.9f, false, false, null, null, 2, 3);
Rule r2 = new Rule();
r2.createRule(3, 1.5f, true, false, r1, null, 0, 2);
Rule r3 = new Rule();
r3.createRule(2, 4.7f, false, true, null, r2, 2, 0);
Rule r4 = new Rule();
r4.createRule(2, 1.9f, false, true, null, r3, 1, 0);
Rule r5 = new Rule();
r5.createRule(3, 1.7f, true, false, r4, null, 0, 3);
Rule r6 = new Rule();
r6.createRule(3, .4f, false, true, null, r5, 1, 0);
this.rootRule = r6;
}