Fisher's Iris dataset

 

Task: classification

Number of instances: 150

Number of attributes: 4 (numerical)

Type of attribute to be predicted: discrete with 3 classes

Download the data: DataIris

 

This is one of the most famous datasets used to illustrate classification problems. From 4 characteristics measured on the flowers (sepal length, sepal width, petal length and petal width), the objective is to classify a sample of 150 irises into the following 3 species: versicolor, virginica and setosa. Measurements are in centimeters. Note that the sample is perfectly balanced (50 irises of each species).

Source: R.A. Fisher, "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7(2), 179–188 (1936)

 

Model with 1 variable

The simplest model uses only one explanatory variable, the petal width:

* If (Petal width is lower than 0.8) then (Iris is rather Setosa)

* If (Petal width is higher than 1.6) then (Iris is rather Virginica)

* Otherwise (Iris is rather Versicolor)
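As a minimal sketch, the one-variable rules can be written as a crisp (non-fuzzy) function. The function name `classify_one_variable` is hypothetical, and the hard thresholds only approximate the "rather" wording of the rules. Species names follow the standard Fisher data, in which setosa has the smallest petals and virginica the largest.

```python
def classify_one_variable(petal_width_cm: float) -> str:
    """Crisp approximation of the one-variable rule set (thresholds in cm)."""
    if petal_width_cm < 0.8:
        return "setosa"
    if petal_width_cm > 1.6:
        return "virginica"
    return "versicolor"


# Three rows of Fisher's data:
# (sepal length, sepal width, petal length, petal width), species
samples = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
]

for (_, _, _, petal_width), species in samples:
    print(species, "->", classify_one_variable(petal_width))
```

Only the last value of each row is used, since petal width is the single explanatory variable.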

 

This model correctly classifies 144 of the 150 instances of the sample (96%). It can be represented graphically (orange curve) together with the experimental data (green points):

 

 

Model with 2 variables

This model involves a second variable, the petal length. It is similar to the first model but includes an additional rule:

* If (Petal width is lower than 0.8) then (Iris is rather Setosa)

* If (Petal width is higher than 1.6) then (Iris is rather Virginica)

* If (Petal length is higher than 5) then (Iris is rather Virginica)

* Otherwise (Iris is rather Versicolor)
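The two-variable rules can be sketched the same way. This is a crisp approximation under the same assumptions as before: the name `classify_two_variables` is hypothetical, hard thresholds stand in for the fuzzy "rather" rules, and species names follow the standard Fisher data (setosa has the smallest petals). Because of the crisp thresholds, it need not reproduce the reported accuracy exactly.

```python
def classify_two_variables(petal_length_cm: float, petal_width_cm: float) -> str:
    """Crisp approximation of the two-variable rule set (thresholds in cm)."""
    if petal_width_cm < 0.8:
        return "setosa"
    # Either a wide petal or a long petal points to virginica.
    if petal_width_cm > 1.6 or petal_length_cm > 5.0:
        return "virginica"
    return "versicolor"


# Hypothetical flower: narrow petal (1.4 cm) but long petal (5.6 cm).
# The petal-width rules alone would answer "versicolor";
# the additional petal-length rule changes the answer.
print(classify_two_variables(petal_length_cm=5.6, petal_width_cm=1.4))
```

The additional rule only ever moves flowers from Versicolor to Virginica, which is why it is combined with the petal-width rule in a single condition.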

 

It correctly classifies 147 of the 150 instances (98%):

 

 

Model with a full classification (3 variables)

The following model correctly classifies all 150 instances of the dataset:

 

* If (Petal width is lower than 0.8) then (Iris is rather Setosa)

* If (Sepal width is not close to 2.6) and (Petal length is higher than 5) then (Iris is rather Virginica)

* If (Sepal width is lower than 2.8) and (Petal width is higher than 1.6) then (Iris is rather Virginica)

* Otherwise (Iris is rather Versicolor)

 

This model therefore involves 3 variables: the petal width, the petal length and the sepal width. It clearly shows that it is easy to separate the Setosa irises from the other species (their petal width is lower than 0.8). On the other hand, it is more complicated to separate the Versicolor and Virginica species (this is done by the second and third rules).
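A crisp sketch of the three-variable rules is given here for illustration only. The name `classify_three_variables` and the parameter `tolerance_cm` are assumptions: the text does not say how "close to 2.6" is measured, so a fixed tolerance of 0.15 cm is used as a stand-in for the fuzzy membership. Species names follow the standard Fisher data (setosa has the smallest petals). A crisp version like this is not guaranteed to reach the full classification reported for the fuzzy model.

```python
def classify_three_variables(
    sepal_width_cm: float,
    petal_length_cm: float,
    petal_width_cm: float,
    tolerance_cm: float = 0.15,  # assumed meaning of "close to 2.6"
) -> str:
    """Crisp approximation of the three-variable rule set (values in cm)."""
    if petal_width_cm < 0.8:
        return "setosa"
    not_close_to_2_6 = abs(sepal_width_cm - 2.6) > tolerance_cm
    if not_close_to_2_6 and petal_length_cm > 5.0:
        return "virginica"
    if sepal_width_cm < 2.8 and petal_width_cm > 1.6:
        return "virginica"
    return "versicolor"


# Three rows of Fisher's data: (sepal width, petal length, petal width)
print(classify_three_variables(3.5, 1.4, 0.2))  # a setosa row
print(classify_three_variables(3.2, 4.7, 1.4))  # a versicolor row
print(classify_three_variables(3.3, 6.0, 2.5))  # a virginica row
```

Changing `tolerance_cm` shifts which flowers with a sepal width near 2.6 are handled by the second rule, so it is the parameter to tune when comparing against the fuzzy model.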

 

The following graph is a "4D" representation of this model:

 

 

 
 

BLIASoft Knowledge Discovery - Data mining & predictive analytics software - Fuzzy logic & artificial intelligence

2007-2017 BLIASOLUTIONS - All rights reserved