Comparison of Machine Learning Algorithms...
Ertugrul
Posted on March 10, 2024
Important: For the Turkish version of this article, click the link.
Turkish: https://dev.to/ertugrulmutlu/makine-ogrenme-algoritmalarinin-karsilastirilmasi-4o0d
In this article we will compare three algorithms: SVM, Decision Tree, and KNN.
The metrics we will compare:
- Accuracy: The ratio of correct predictions to the total number of predictions.
- Macro avg Precision Score: The unweighted average of the precision of each class. Precision is the ratio of true positive predictions to all positive predictions for a class. It shows how many of the samples assigned to a class really belong to it.
- Macro avg Recall Score: The unweighted average of the recall of each class. Recall is the ratio of true positive predictions to the number of samples that actually belong to the class. It shows how many members of a class were detected.
- Macro avg F1 Score: The unweighted average of the F1 score of each class. The F1 score is the harmonic mean of precision and recall (sensitivity), combining both into a single metric.
- Weighted avg Precision Score: The average of the per-class precision, weighted by the number of samples in each class (its support). Larger classes therefore count for more.
- Weighted avg Recall Score: The average of the per-class recall, weighted by the number of samples in each class.
- Weighted avg F1 Score: The average of the per-class F1 score, weighted by the number of samples in each class.
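To make the difference between the macro and weighted averages concrete, here is a minimal pure-Python sketch. The labels are made up for illustration, and the helper names (per_class_scores, macro_avg, weighted_avg) are hypothetical, not part of any library.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every true label."""
    scores = {}
    support = Counter(y_true)
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[label]
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores


def macro_avg(scores, index):
    # Every class counts equally, however few samples it has.
    return sum(s[index] for s in scores.values()) / len(scores)


def weighted_avg(scores, index):
    # Each class is weighted by its support (number of true samples).
    total = sum(s[3] for s in scores.values())
    return sum(s[index] * s[3] for s in scores.values()) / total


# Toy labels, made up for illustration: 4 samples of class 0, 2 of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

scores = per_class_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("accuracy          ", accuracy)
print("macro precision   ", macro_avg(scores, 0))
print("weighted precision", weighted_avg(scores, 0))
print("weighted recall   ", weighted_avg(scores, 1))
```

One thing this shows: the weighted average of recall is always equal to accuracy, because weighting each class's recall by its support simply adds up all the correct predictions.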
First, the definitions of the algorithms.
Instead of writing the definitions myself, I found it more appropriate to point you to sources that explain them properly.
- KNN (K-Nearest Neighbors):
Source:
Video: https://www.youtube.com/watch?v=v5CcxPiYSlA
Article: https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
- DT (Decision Tree):
Source:
Video: https://www.youtube.com/watch?v=ZVR2Way4nwQ
Article: https://medium.com/@MrBam44/decision-trees-91f61a42c724
- SVM (Support Vector Machine):
Source:
Video: https://www.youtube.com/watch?v=1NxnPkZM9bc
Article: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
Now we can get started.
First, let's take a look at the dataset I will use.
Dataset features:
Here, we will analyze our CSV using the Pandas library.
import pandas as pd
csv = pd.read_csv("glass.csv")
print(csv.head())  # first five rows
print(csv.shape)   # (number of rows, number of columns)
To explain the code here in order:
- We import the Pandas library.
- We read the CSV file with the Pandas library.
- Finally, we call the "head" method to preview the first rows and print "shape" to get the number of rows and columns.
The output of this code:
As you can see, it gave us a general overview of the content of the CSV file, along with the number of rows and columns.
This CSV file has:
- 214 rows
- 10 columns
Now let's get the names of the columns:
import pandas as pd
csv = pd.read_csv("glass.csv")
print(csv.columns)
To explain the code here in order:
- We import the Pandas library.
- We read the CSV file with the Pandas library.
- Finally, we print the "columns" attribute to list the column names of the CSV file.
As you can see, we got the names of the columns of the CSV file, and the output also reports the dtype of the column labels.
The CSV file contains the following columns:
- RI (Refractive index)
- Na (Sodium)
- Mg (Magnesium)
- Al (Aluminum)
- Si (Silicon)
- K (Potassium)
- Ca (Calcium)
- Ba (Barium)
- Fe (Iron)
- Type (Glass type)
In this dataset, the type of each glass sample is identified from its refractive index and the chemical elements it contains.
Note: For more detailed information, please visit the Source site.
Source
The site where I downloaded the CSV file:
https://www.kaggle.com/datasets/uciml/glass
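Before training, it can help to check how many samples each glass type has, because uneven classes are exactly what makes the macro and weighted averages differ. The sketch below uses a tiny hand-made DataFrame. Its rows are hypothetical and are NOT taken from glass.csv; only three of the column names are reused.

```python
import pandas as pd

# Hypothetical rows, NOT taken from glass.csv; only the column names are reused.
df = pd.DataFrame(
    {
        "RI": [1.52, 1.51, 1.52, 1.51, 1.53, 1.52],
        "Na": [13.6, 13.9, 13.5, 13.2, 14.1, 12.9],
        "Type": [1, 1, 1, 2, 2, 7],
    }
)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column

# Number of samples per glass type, ordered by the type label.
class_counts = df["Type"].value_counts().sort_index()
print(class_counts)
```

With the real file, replacing the hand-made DataFrame with pd.read_csv("glass.csv") should give the same kind of overview.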
Now let's move on to our plan:
What We Know
- The data in the CSV file needs to be shaped for use in the algorithms.
- The algorithms need to be implemented using a library.
- The results need to be presented graphically.
Let's do the data preparation part.
Preparation of Data
First, let's list the libraries I will use:
- Sklearn
- Pandas
- Numpy
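import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv("glass.csv", sep=",")
target = data.columns[-1]  # the last column, 'Type'
X = np.array(data.drop(columns=[target]))  # every feature column
y = np.array(data[target])  # the label we want to predict
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)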
To explain the code here in order:
- We read our CSV file using ',' as the separator (we use the Pandas library for this operation).
- 'X' contains the features we use to predict the target ('Type'). With this code, we remove the 'Type' column from the data and convert the remaining columns to an array using the NumPy library.
- 'y' is the data we want to predict (i.e. 'Type'). We convert it to an array using the NumPy library, just like 'X'.
- Finally, we split this data into train and test sets. In the simplest terms, the algorithms are trained on the train data, and the test data is then used to measure how accurate they are. (We set the test share to 20% with the test_size parameter, but you can change it if you wish.)
Note: With larger datasets or more complex algorithms you may also need validation data, but we don't need it here because this is a small and simple application.
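If you do want a reproducible split, or a separate validation set, one possible approach is sketched below. It uses random stand-in arrays with the same shape as the glass data (214 rows, 9 features) so that it runs without the CSV file. The random_state and stratify arguments are additions for illustration and are not used in the code above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shape as the glass data (214 rows, 9 features).
# The values are random; only the shapes matter for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(214, 9))
y = np.arange(214) % 6  # six made-up class labels

# random_state fixes the shuffle; stratify keeps class proportions similar
# in both parts. Neither is used in the article's original code.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(x_train.shape, x_test.shape)

# Optional: carve a validation set out of the training portion.
x_fit, x_val, y_fit, y_val = train_test_split(
    x_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)
print(x_fit.shape, x_val.shape)
```

Note that with 214 rows and test_size=0.2 the test set holds 43 samples, since scikit-learn rounds the test size up.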
Yes, our data is ready...
Integration of Algorithms:
Here we will implement our algorithms with the Sklearn library.
- KNN
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=9)
KNN.fit(x_train,y_train)
To explain the code here in order:
- We import the KNeighborsClassifier class from sklearn.neighbors.
- We create the KNN classifier. The n_neighbors parameter sets how many nearest neighbors are considered. (The best value may vary according to the project and dataset.)
- We train the model with the .fit method, using the x_train and y_train data.
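Since the best n_neighbors depends on the data, one common way to pick it is to compare a few candidates with cross-validation. The sketch below is only an illustration: it uses the wine dataset bundled with scikit-learn as a stand-in for glass.csv, the candidate values are arbitrary, and the scaling step is an addition that the code above does not include.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: the wine dataset bundled with scikit-learn, not glass.csv.
X, y = load_wine(return_X_y=True)

mean_scores = {}
for k in (3, 5, 9, 15):
    # Scaling matters for KNN because it compares distances between samples,
    # so features with large ranges would otherwise dominate.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, "best k:", best_k)
```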
- SVM
from sklearn import svm
Svm = svm.SVC(kernel="linear")
Svm.fit(x_train,y_train)
To explain the code here in order:
- We import the svm module from sklearn.
- We create a Support Vector Classification (SVC) model from the svm module. (Briefly, this class performs classification with a support vector machine.) The kernel hyperparameter can be set to 'linear', 'poly', 'rbf' or 'sigmoid'.
- We train the model with the .fit method, using the x_train and y_train data.
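The four kernel options can be compared automatically with a grid search. This is a sketch under stated assumptions rather than the method used in this article: the wine dataset bundled with scikit-learn stands in for glass.csv, and the scaling step and the step names "scale" and "svc" are choices made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: the wine dataset bundled with scikit-learn, not glass.csv.
X, y = load_wine(return_X_y=True)

# The step names ("scale", "svc") are arbitrary labels chosen for this sketch.
pipeline = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Try every kernel mentioned above with 5-fold cross-validation.
param_grid = {"svc__kernel": ["linear", "poly", "rbf", "sigmoid"]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The winning kernel on the stand-in data says nothing about the glass data; the point is only the mechanics of the comparison.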
- Decision Tree
from sklearn.tree import DecisionTreeClassifier
Dt = DecisionTreeClassifier(random_state=9)
Dt.fit(x_train,y_train)
To explain the code here in order:
- We import the DecisionTreeClassifier class from sklearn.tree.
- We create the decision tree classifier. Fixing the random_state parameter makes the result reproducible: the same seed and the same training data produce the same tree.
- We train the model with the .fit method, using the x_train and y_train data.
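The effect of random_state can be checked directly: two trees built with the same seed on the same training data should give identical predictions. The sketch below again uses the wine dataset bundled with scikit-learn as a stand-in for glass.csv, and it also prints the feature importances, which are not discussed in this article.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the wine dataset bundled with scikit-learn, not glass.csv.
X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Two trees with the same seed, trained on the same data.
tree_a = DecisionTreeClassifier(random_state=9).fit(x_train, y_train)
tree_b = DecisionTreeClassifier(random_state=9).fit(x_train, y_train)

same_predictions = np.array_equal(tree_a.predict(x_test), tree_b.predict(x_test))
print("identical predictions:", same_predictions)

# How much each feature contributed to the splits (normalized to sum to 1).
print("feature importances:", tree_a.feature_importances_.round(3))
```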
Now that we have set up our algorithms, we can move on to visualization and comparison.
Visualization and Comparison:
First, let's list the libraries I will use:
- matplotlib: In short, Matplotlib is a visualization library. It is simple to use and helps keep plotting code clean.
All algorithms need to be trained before we can compare them. The code we will use after training:
dt_report = dt.predict_report(3, dt_x_train, dt_x_test, dt_y_train, dt_y_test)
svm_report = Svc.predict_report(3, svc_x_train, svc_x_test, svc_y_train, svc_y_test)
knn_report = Knear.predict_report(3, knn_x_train, knn_x_test, knn_y_train, knn_y_test)
In short, the predict_report method prints the values we want on the screen. (It is a helper from this project's code, not a scikit-learn function.)
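The implementation of predict_report is not shown here, so the following is only a sketch of how such a report could be produced with scikit-learn's classification_report. The helper name report_for and its signature are hypothetical, and the wine dataset bundled with scikit-learn stands in for glass.csv.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def report_for(model, x_train, x_test, y_train, y_test):
    """Hypothetical helper: fit the model and return its report as a dict."""
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    return classification_report(
        y_test, predictions, output_dict=True, zero_division=0
    )


# Stand-in data: the wine dataset bundled with scikit-learn, not glass.csv.
X, y = load_wine(return_X_y=True)
split = train_test_split(X, y, test_size=0.2, random_state=0)

# train_test_split returns x_train, x_test, y_train, y_test in that order.
report = report_for(DecisionTreeClassifier(random_state=9), *split)

print("accuracy:", report["accuracy"])
print("macro avg:", report["macro avg"])
print("weighted avg:", report["weighted avg"])
```

The "macro avg" and "weighted avg" entries each hold precision, recall, and f1-score, which correspond to the seven metrics compared in this article together with accuracy.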
Sample output (taken from the internet):
Now let's move on to the comparison:
- Accuracy
  - Decision_Tree >> 0.6976744186046512
  - KNN >> 0.6511627906976745
  - SVM >> 0.6511627906976745
Here the algorithm with the highest score was Decision Tree; KNN and SVM tied.
- Macro avg Precision Score
  - Decision_Tree >> 0.7226495726495727
  - SVM >> 0.611111111111111
  - KNN >> 0.5030501089324618
Here the algorithm with the highest score was Decision Tree.
- Macro avg Recall Score
  - Decision_Tree >> 0.6472222222222223
  - SVM >> 0.5863095238095238
  - KNN >> 0.4795454545454545
Here the algorithm with the highest score was Decision Tree.
- Macro avg F1 Score
  - Decision_Tree >> 0.6738576238576238
  - SVM >> 0.5548611111111111
  - KNN >> 0.45506715506715506
Here the algorithm with the highest score was Decision Tree.
- Weighted avg Precision Score
  - Decision_Tree >> 0.7241502683363149
  - SVM >> 0.6627906976744186
  - KNN >> 0.6219182246542027
Here the algorithm with the highest score was Decision Tree.
- Weighted avg Recall Score
  - Decision_Tree >> 0.6976744186046512
  - SVM >> 0.6511627906976745
  - KNN >> 0.6511627906976745
Here the algorithm with the highest score was Decision Tree; KNN and SVM tied.
- Weighted avg F1 Score
  - Decision_Tree >> 0.7030168797610657
  - SVM >> 0.6397286821705426
  - KNN >> 0.6020444671607461
Here the algorithm with the highest score was Decision Tree.
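The scores listed above can also be drawn as a grouped bar chart with Matplotlib. This is a minimal sketch and not the plotting code from the repository: the values are copied from the lists above and rounded to three decimals, and the short metric labels and the output file name comparison.png are choices made for this example.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

metrics = [
    "Accuracy", "Macro P", "Macro R", "Macro F1",
    "Weighted P", "Weighted R", "Weighted F1",
]
# Scores copied from the comparison lists, rounded to three decimals.
results = {
    "Decision Tree": [0.698, 0.723, 0.647, 0.674, 0.724, 0.698, 0.703],
    "SVM": [0.651, 0.611, 0.586, 0.555, 0.663, 0.651, 0.640],
    "KNN": [0.651, 0.503, 0.480, 0.455, 0.622, 0.651, 0.602],
}

width = 0.25
fig, ax = plt.subplots(figsize=(10, 4))
for offset, (name, scores) in enumerate(results.items()):
    # Shift each model's bars so the three models sit side by side per metric.
    positions = [i + (offset - 1) * width for i in range(len(metrics))]
    ax.bar(positions, scores, width=width, label=name)

ax.set_xticks(range(len(metrics)))
ax.set_xticklabels(metrics, rotation=30, ha="right")
ax.set_ylim(0, 1)
ax.set_ylabel("Score")
ax.legend()
fig.tight_layout()
fig.savefig("comparison.png")  # hypothetical output file name
```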
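CONCLUSION
In this article, we compared 3 machine learning algorithms and found that Decision Tree performed best on our dataset, scoring highest on all seven metrics. Keep in mind that the test split holds only 43 samples (20% of 214 rows, rounded up), so the accuracy gap between Decision Tree and the other two algorithms corresponds to two predictions.
You can access the code here, and you can change and improve it as you wish.
- CODE: https://github.com/Ertugrulmutlu/Machine_Learning_Alg_Comp
If you have a suggestion, request, or question, please leave a comment or contact me via e-mail.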