Toward understanding DNN (deep neural network) well: iris dataset

ksk0629

Keisuke Sato

Posted on February 13, 2022


Introduction

This is the second article in my "toward understanding DNN (deep neural network) well" series. This time I explore the effect of the number of layers and the number of units, using the iris dataset.

GitHub repository: comparison_of_dnn

Note that this is not a "guide"; it is a memo from a beginner, for beginners. If you have any comments, suggestions, or questions whilst reading this article, please let me know in the comments below.

Iris dataset

This is such a famous dataset that most people probably need no explanation of it, but I will take a brief look anyway, since I am a beginner.

We can load this dataset with the sklearn.datasets.load_iris() function. It is a multi-class classification dataset containing 150 samples, each with the following four features.

  • sepal length (cm)
  • sepal width (cm)
  • petal length (cm)
  • petal width (cm)

There are three classes, and the dataset contains the same number of samples for each class (50 each), so it is a three-class classification dataset. As most of us know, it has no missing data, but since this is a tutorial-style article, I check for missing values anyway.

Input:

import sklearn
from sklearn import datasets

iris_df = sklearn.datasets.load_iris(as_frame=True)["frame"]
iris_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

Cool. There are no missing values. Next, I check the basic statistics.

Input:

iris_df.describe().drop(["count"])

Output:

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    target
mean           5.843333          3.057333           3.758000          1.199333  1.000000
std            0.828066          0.435866           1.765298          0.762238  0.819232
min            4.300000          2.000000           1.000000          0.100000  0.000000
25%            5.100000          2.800000           1.600000          0.300000  0.000000
50%            5.800000          3.000000           4.350000          1.300000  1.000000
75%            6.400000          3.300000           5.100000          1.800000  2.000000
max            7.900000          4.400000           6.900000          2.500000  2.000000

Of course, I am interested in analysing the data further, but I do not have the skills for that yet. I will analyse it properly someday.
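That said, the class balance mentioned above is easy to verify with the DataFrame from the snippet above:

iris_df["target"].value_counts()

Each of the three classes appears exactly 50 times.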

Comparison

For the sake of simplicity, I suppose the following conditions.

  • All model settings are fixed except for the number of layers and the number of units in each layer.
  • No data preprocessing is performed.
  • The random seed is fixed.

Most of these conditions can be changed or removed; all you have to do is edit config_iris.yaml. The yaml file contains the following lines.

mlflow:
  experiment_name: iris
  run_name: default
dataset:
  eval_size: 0.25
  test_size: 0.25
  train_size: 0.75
  shuffle: True
dnn:
  n_layers: 3
  n_units_list:
    - 8
    - 4
    - 3
  activation_function_list:
    - relu
    - relu
    - softmax
  seed: 57
dnn_train:
  epochs: 30
  batch_size: 4
  patience: 5

The following changes build a model with five layers (four dense layers plus one output layer): four 8-unit layers with the relu activation function, followed by a 3-unit softmax output layer.

dnn:
  n_layers: 5
  n_units_list:
    - 8
    - 8
    - 8
    - 8
    - 3
  activation_function_list:
    - relu
    - relu
    - relu
    - relu
    - softmax

Note that some of the model's settings are hard-coded, so you have to edit the source to change them. For example, the model's loss function is the sparse categorical cross-entropy, computed by keras.losses.SparseCategoricalCrossentropy(), which is specified in iris_dnn.py:
https://github.com/ksk0629/comparison_of_dnn/blob/8498a7d15ed6a4447f13f9f277e214f4821f46a1/src/iris_dnn.py#L28-L30
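To make the configuration concrete, here is a minimal sketch of how a config shaped like config_iris.yaml could be turned into a Keras model. This is my own illustration, not the repository's actual code; in particular, the adam optimizer below is an assumption.

import yaml
import tensorflow as tf
from tensorflow import keras


def build_model_from_config(path: str) -> keras.Sequential:
    """Build a Sequential model from a config shaped like config_iris.yaml."""
    with open(path) as f:
        config = yaml.safe_load(f)
    dnn = config["dnn"]

    tf.random.set_seed(dnn["seed"])  # fix the seed, as in the experiments

    model = keras.Sequential()
    model.add(keras.Input(shape=(4,)))  # the iris data has four features
    for n_units, activation in zip(dnn["n_units_list"], dnn["activation_function_list"]):
        model.add(keras.layers.Dense(n_units, activation=activation))

    # The loss function is hard-coded, as noted above.
    model.compile(
        optimizer="adam",  # assumption: the repository may use something else
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model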

Results

First, I summarise all the results. The losses and accuracies are as follows.

#layers  #parameters  training loss  evaluation loss  test loss  test accuracy
      2           35          0.166            0.136      0.157          0.947
      2           67          0.086            0.022      0.039          0.974
      2          131          0.086            0.033      0.043          1.0
      2          259          0.09             0.024      0.047          0.974
      3          263          0.104            0.018      0.069          0.974
      4          260          0.123            0.05       0.115          0.947
      5          261          0.089            0.089      0.075          0.974
      6          255          0.138            0.043      0.119          0.947
      7          263          0.091            0.023      0.047          0.974
      8          261          1.099            1.099      1.099          0.316
      9          259          1.099            1.099      1.099          0.316

The test set contains 38 samples: 12 belonging to class 0, 13 belonging to class 1, and 13 belonging to class 2.

I performed 11 experiments to explore the following two things.

  • effect of the number of parameters
  • effect of the number of layers

The first to fourth experiments address the former, and the fourth to eleventh address the latter.

The results show the following.

  • The model with two layers and 67 parameters is the best in terms of the test loss.
  • The model with two layers and 131 parameters is the best in terms of the test accuracy.
  • The models with eight and nine layers are the worst.

This is a bit surprising to me because I expected the best model to have more layers and parameters than the ones above. It is possibly due to the split of the test data, which might be too small to evaluate performance reliably. But at least under the above conditions, the two models with two layers are the best. This possibly means the other models overfitted.

As mentioned later, the vanishing gradient problem occurred in the eight- and nine-layer experiments. That is, eight layers are already too many to learn well, at least with the iris data under the above conditions.

Except for the models that suffered from the vanishing gradient problem and the one that was best in terms of the test accuracy, all of the models classified 36 or 37 of the test samples correctly. Interestingly, one of the misclassified samples is the same for all of them. This possibly implies the split of the test data is not great, in the sense that there is a gap between the distributions of the training data and the test data.

Furthermore, most of the models correctly classified most of the data, which means a DNN is very effective on the iris data even though the model structure is very simple.

two layers with 35 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 4)                 20        

 dense_1 (Dense)             (None, 3)                 15        

=================================================================
Total params: 35
Trainable params: 35
Non-trainable params: 0
_________________________________________________________________
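As a sanity check on the Param # column, a dense layer with n inputs and m units has m × (n + 1) parameters: an m × n weight matrix plus m biases. Here the first layer maps the 4 iris features to 4 units, giving 4 × (4 + 1) = 20 parameters, and the output layer maps those 4 units to the 3 classes, giving 3 × (4 + 1) = 15, for a total of 35. The same formula accounts for all of the model summaries below.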

The final indices are as follows.

  • training loss: 0.166
  • evaluation loss: 0.136
  • test loss: 0.157
  • test accuracy: 0.947

This model correctly classified 36 of the 38 test samples. It looks great, and it actually works great: at least for the iris data, a DNN is a very powerful tool even with a very simple structure.

two layers with 67 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 8)                 40        

 dense_1 (Dense)             (None, 3)                 27        

=================================================================
Total params: 67
Trainable params: 67
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 0.086
  • evaluation loss: 0.022
  • test loss: 0.039
  • test accuracy: 0.974

This model correctly classified 37 of the 38 test samples.

two layers with 131 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 16)                80        

 dense_1 (Dense)             (None, 3)                 51        

=================================================================
Total params: 131
Trainable params: 131
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 0.086
  • evaluation loss: 0.033
  • test loss: 0.043
  • test accuracy: 1.0

This model correctly classified all of the test samples.

two layers with 259 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 32)                160       

 dense_1 (Dense)             (None, 3)                 99        

=================================================================
Total params: 259
Trainable params: 259
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 0.09
  • evaluation loss: 0.024
  • test loss: 0.047
  • test accuracy: 0.974

This model correctly classified 37 of the test samples.

three layers with 263 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 16)                80        

 dense_1 (Dense)             (None, 9)                 153       

 dense_2 (Dense)             (None, 3)                 30        

=================================================================
Total params: 263
Trainable params: 263
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 0.104
  • evaluation loss: 0.018
  • test loss: 0.069
  • test accuracy: 0.974

This model also correctly classified 37 of the test samples.

four layers with 260 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 14)                70        

 dense_1 (Dense)             (None, 9)                 135       

 dense_2 (Dense)             (None, 4)                 40        

 dense_3 (Dense)             (None, 3)                 15        

=================================================================
Total params: 260
Trainable params: 260
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 0.123
  • evaluation loss: 0.05
  • test loss: 0.115
  • test accuracy: 0.947

This model correctly classified 36 of the test samples.

five layers with 261 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 12)                60        

 dense_1 (Dense)             (None, 8)                 104       

 dense_2 (Dense)             (None, 6)                 54        

 dense_3 (Dense)             (None, 4)                 28        

 dense_4 (Dense)             (None, 3)                 15        

=================================================================
Total params: 261
Trainable params: 261
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 0.089
  • evaluation loss: 0.089
  • test loss: 0.075
  • test accuracy: 0.974

This model correctly classified 37 of the test samples.

six layers with 255 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 10)                50        

 dense_1 (Dense)             (None, 8)                 88        

 dense_2 (Dense)             (None, 6)                 54        

 dense_3 (Dense)             (None, 4)                 28        

 dense_4 (Dense)             (None, 4)                 20        

 dense_5 (Dense)             (None, 3)                 15        

=================================================================
Total params: 255
Trainable params: 255
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 0.138
  • evaluation loss: 0.043
  • test loss: 0.119
  • test accuracy: 0.947

This model correctly classified 36 of the test samples (consistent with the test accuracy of 0.947).

seven layers with 263 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 10)                50        

 dense_1 (Dense)             (None, 6)                 66        

 dense_2 (Dense)             (None, 6)                 42        

 dense_3 (Dense)             (None, 6)                 42        

 dense_4 (Dense)             (None, 4)                 28        

 dense_5 (Dense)             (None, 4)                 20        

 dense_6 (Dense)             (None, 3)                 15        

=================================================================
Total params: 263
Trainable params: 263
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 0.091
  • evaluation loss: 0.023
  • test loss: 0.047
  • test accuracy: 0.974

This model correctly classified 37 of the test samples.

eight layers with 261 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 8)                 40        

 dense_1 (Dense)             (None, 6)                 54        

 dense_2 (Dense)             (None, 6)                 42        

 dense_3 (Dense)             (None, 6)                 42        

 dense_4 (Dense)             (None, 4)                 28        

 dense_5 (Dense)             (None, 4)                 20        

 dense_6 (Dense)             (None, 4)                 20        

 dense_7 (Dense)             (None, 3)                 15        

=================================================================
Total params: 261
Trainable params: 261
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 1.099
  • evaluation loss: 1.099
  • test loss: 1.099
  • test accuracy: 0.316

The vanishing gradient problem occurred whilst learning. In fact, the training loss stopped improving almost immediately:

[Figure: training loss curve, which flattens within the first few epochs]

Note that the final loss of 1.099 is almost exactly ln 3 ≈ 1.0986, the cross-entropy of a uniform prediction over three classes, and the test accuracy of 0.316 ≈ 12/38 is what a model that always predicts a single class would achieve. This implies eight layers are too many to learn, at least with the iris data under these conditions.
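If you want to confirm the diagnosis directly rather than infer it from the loss curve, you can inspect the per-layer gradient norms. A minimal sketch, where model and the batch (x_batch, y_batch) are hypothetical stand-ins for one of the models above and a single training batch:

import tensorflow as tf
from tensorflow import keras

# `model`, `x_batch` and `y_batch` are placeholders, not names from the repository.
loss_fn = keras.losses.SparseCategoricalCrossentropy()
with tf.GradientTape() as tape:
    predictions = model(x_batch, training=True)
    loss = loss_fn(y_batch, predictions)

grads = tape.gradient(loss, model.trainable_variables)
for variable, grad in zip(model.trainable_variables, grads):
    # Near-zero norms in the early layers are the signature of vanishing gradients.
    print(variable.name, float(tf.norm(grad)))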

nine layers with 259 parameters

The structure is as follows.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 8)                 40        

 dense_1 (Dense)             (None, 6)                 54        

 dense_2 (Dense)             (None, 6)                 42        

 dense_3 (Dense)             (None, 4)                 28        

 dense_4 (Dense)             (None, 4)                 20        

 dense_5 (Dense)             (None, 4)                 20        

 dense_6 (Dense)             (None, 4)                 20        

 dense_7 (Dense)             (None, 4)                 20        

 dense_8 (Dense)             (None, 3)                 15        

=================================================================
Total params: 259
Trainable params: 259
Non-trainable params: 0
_________________________________________________________________

The final indices are as follows.

  • training loss: 1.099
  • evaluation loss: 1.099
  • test loss: 1.099
  • test accuracy: 0.316

The vanishing gradient problem occurred here too. I had already observed it in the eight-layer experiment; this run was just to check that it really was due to the number of layers, and the problem did occur again.

Conclusion

I explored the effect of the number of layers and the number of parameters with the iris dataset. As a result, I found that the two-layer models are the best in terms of both the test loss and the test accuracy, though this might be due to the small test size. The eight- and nine-layer models learnt nothing because the vanishing gradient problem occurred. This implies that models with eight or more layers are too deep to learn, at least under these conditions.

As mentioned in the Results section, the sample that most of the models misclassified is the same one, and it is as follows.

sepal length (cm)    6.3
sepal width (cm)     2.5
petal length (cm)    4.9
petal width (cm)     1.5
target               1.0
Name: 72, dtype: float64

I guess it is important to check whether or not such a sample is an outlier.
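A quick way to check is to compare the sample against the per-class feature means, reusing iris_df from the snippet at the beginning of this article:

# Mean feature values per class.
print(iris_df.groupby("target").mean())

# The sample that most models misclassified (index 72, true class 1).
print(iris_df.loc[72])

If its petal measurements fall between the class 1 and class 2 averages, that would explain why so many models push it into the wrong class.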

All of the experiments were performed with seed 57. It would be interesting to change the seed and run the same experiments. Note that the seed also affects how the iris data is split into training, evaluation, and test sets. To keep the same test data across seeds, the load_splitted_dataset_with_eval() function in custom_dataset.py needs to be changed:
https://github.com/ksk0629/comparison_of_dnn/blob/8498a7d15ed6a4447f13f9f277e214f4821f46a1/src/custom_dataset.py#L75-L110
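One possible way to do that (a sketch under my own assumptions, not the repository's actual implementation) is to use a constant random_state for the first split, so the test set never changes, and vary only the seed of the train/evaluation split:

from sklearn.model_selection import train_test_split


def split_with_fixed_test(X, y, seed, test_size=0.25, eval_size=0.25):
    """Split into train/eval/test so that the test set is identical across seeds."""
    # First split: the constant random_state keeps the test set fixed.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, random_state=57
    )
    # Second split: this one varies with the experiment's seed.
    X_train, X_eval, y_train, y_eval = train_test_split(
        X_rest, y_rest, test_size=eval_size, shuffle=True, random_state=seed
    )
    return (X_train, y_train), (X_eval, y_eval), (X_test, y_test)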
