Fisher's Iris dataset
Task: classification
Number of instances: 150
Number of attributes: 4 (numerical)
Type of attribute to be predicted: discrete with 3 classes
Download the data: DataIris
This is one of the most famous datasets used to illustrate classification problems. From 4 characteristics measured on the flowers (sepal length, sepal width, petal length and petal width), the objective is to classify a sample of 150 irises into the following 3 species: versicolor, virginica and setosa. Measurements are in centimeters. Note that the sample is perfectly balanced (50 irises of each species).
Source: R.A. Fisher, "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7(2), 179–188 (1936)
Model with 1 variable
The simplest model uses only one explanatory variable, the petal width:
* If (Petal width is lower than 0.8) then (Iris is rather Setosa)
* If (Petal width is higher than 1.6) then (Iris is rather Virginica)
* Otherwise (Iris is rather Versicolor)
This model correctly classifies 144 of the 150 instances in the sample (96%). It can be represented graphically (orange curve) alongside the experimental data (green points).
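As a minimal sketch, the one-variable model can be read as a crisp threshold function in Python. This is an assumption: the wording "is rather" suggests the rules may be graded rather than strict, and the function name is hypothetical. Species names follow the standard Iris data, in which setosa has the narrowest petals and virginica the widest.

```python
def classify_petal_width(petal_width_cm: float) -> str:
    """Crisp reading of the one-variable model (thresholds 0.8 and 1.6 cm).

    Assumption: strict inequalities, so a width of exactly 0.8 or 1.6
    falls into the "otherwise" branch.
    """
    if petal_width_cm < 0.8:
        return "setosa"
    if petal_width_cm > 1.6:
        return "virginica"
    return "versicolor"


# Petal widths of three well-known rows of Fisher's data, with their species.
samples = [(0.2, "setosa"), (1.4, "versicolor"), (2.5, "virginica")]
for width, species in samples:
    print(f"width={width} predicted={classify_petal_width(width)} actual={species}")
```

The six instances that this kind of rule gets wrong are those whose petal width lies on the "wrong" side of 1.6, which is what the second variable in the next model tries to address.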
Model with 2 variables
This model involves a second variable: the petal length. It is similar to the first model, but includes an additional rule:
* If (Petal width is lower than 0.8) then (Iris is rather Setosa)
* If (Petal width is higher than 1.6) then (Iris is rather Virginica)
* If (Petal length is higher than 5) then (Iris is rather Virginica)
* Otherwise (Iris is rather Versicolor)
It correctly classifies 147 of the 150 instances (98%).
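The two-variable model can be sketched the same way. Again this is a crisp approximation under our own assumptions (strict thresholds, rules checked in order, hypothetical function name), with species names taken from the standard Iris data.

```python
def classify_two_variables(petal_length_cm: float, petal_width_cm: float) -> str:
    """Crisp reading of the two-variable model.

    Assumption: rules are evaluated top to bottom and the first match wins.
    """
    if petal_width_cm < 0.8:
        return "setosa"
    # Either a wide petal or a long petal points to virginica.
    if petal_width_cm > 1.6 or petal_length_cm > 5.0:
        return "virginica"
    return "versicolor"


# Illustrative measurement (not claimed to be a specific row of the dataset):
# the width alone (1.6) would fall in the "otherwise" branch,
# but the length (5.8) triggers the additional rule.
print(classify_two_variables(petal_length_cm=5.8, petal_width_cm=1.6))
```

The extra rule only changes the outcome for flowers whose petal width is between 0.8 and 1.6 but whose petal length exceeds 5.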
Model with a full classification (3 variables)
The following model correctly classifies all 150 instances of the dataset:
* If (Petal width is lower than 0.8) then (Iris is rather Setosa)
* If (Sepal width is not close to 2.6) and (Petal length is higher than 5) then (Iris is rather Virginica)
* If (Sepal width is lower than 2.8) and (Petal width is higher than 1.6) then (Iris is rather Virginica)
* Otherwise (Iris is rather Versicolor)
This model therefore involves 3 variables: the petal width, the petal length and the sepal width. It clearly shows that it is easy to separate the Setosa irises from the other species (their petal width is lower than 0.8). On the other hand, it is more complicated to separate the Versicolor and Virginica species (this is done by the second and third rules).
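A rough Python sketch of the three-variable model follows. The source does not say how "close to 2.6" is measured, so the tolerance `close_tol` (0.15 cm here) is purely a hypothetical choice, as is the function name; with a different tolerance, or with graded rules, the results may differ. Species names follow the standard Iris data.

```python
def classify_three_variables(
    sepal_width_cm: float,
    petal_length_cm: float,
    petal_width_cm: float,
    close_tol: float = 0.15,  # hypothetical: the source does not define "close to"
) -> str:
    """Crisp reading of the three-variable model, first matching rule wins."""
    if petal_width_cm < 0.8:
        return "setosa"
    # "Sepal width is not close to 2.6", read as: farther than close_tol from 2.6.
    not_close_to_2_6 = abs(sepal_width_cm - 2.6) > close_tol
    if not_close_to_2_6 and petal_length_cm > 5.0:
        return "virginica"
    if sepal_width_cm < 2.8 and petal_width_cm > 1.6:
        return "virginica"
    return "versicolor"


# Illustrative measurements as (sepal width, petal length, petal width) in cm.
for flower in [(3.5, 1.4, 0.2), (3.2, 4.7, 1.4), (3.3, 6.0, 2.5), (2.7, 5.1, 1.6)]:
    print(flower, classify_three_variables(*flower))
```

The last measurement shows the role of the sepal width: its petal length exceeds 5, but with this tolerance its sepal width counts as close to 2.6, so the second rule does not fire and the flower stays in the "otherwise" branch.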
The following graph is a "4D" representation of this model:
