Classification for Myocardial Infarction Dataset
Aldi Fianda Putra
Posted on March 1, 2023
What Is Myocardial Infarction?
Myocardial infarction, commonly known as a heart attack, is a very serious heart condition. It occurs when the heart muscle does not receive adequate blood flow, which impairs cardiac function and the circulation of blood throughout the body and can be fatal.
This condition can be detected by a tool called an electrocardiogram, or EKG. This tool records the heartbeat's wave signal, in which each signal is classified into three kinds: Q, R, and S. There are 11 kinds of heartbeat waves, and from these waves a person's heart disease can be classified.
Project Description
This is a project to classify a person's heart disease using the Myocardial Infarction Complications dataset, which can be accessed from this link. In this project, the models used are Naive Bayes, Decision Tree, and Support Vector Machine. However, there are some problems with the dataset: it contains missing values, and most of the features are not normalized. Therefore, imputation needs to be performed, followed by normalization using min-max normalization. This project has also been published in an article that can be accessed on this page.
The dataset has categorical and non-categorical attributes. The non-categorical attributes will be normalized; they include:
- S_AD_KBRIG: systolic blood pressure (mmHg)
- D_AD_KBRIG: diastolic blood pressure (mmHg)
- S_AD_ORIT: systolic blood pressure measured in the ICU (mmHg)
- D_AD_ORIT: diastolic blood pressure measured in the ICU (mmHg)
- K_BLOOD: serum potassium content in the patient's blood (mmol/L)
- NA_BLOOD: serum sodium content in the patient's blood (mmol/L)
- ALT_BLOOD: serum AlAT content in the patient's blood (IU/L)
- AST_BLOOD: serum AsAT content in the patient's blood (IU/L)
- KFK_BLOOD: serum CPK content in the patient's blood (IU/L)
- L_BLOOD: the patient's white blood cell count
- ROE: the patient's ESR (erythrocyte sedimentation rate)
The last attribute, "LET_IS," is the lethal outcome: it classifies the patient's outcome based on the symptoms and complications experienced. There are 8 classes, represented by numbers (a quick check of the class distribution follows this list):
- 0 indicates the patient survived (no lethal outcome).
- 1 indicates the patient has cardiogenic shock.
- 2 indicates the patient has pulmonary edema.
- 3 indicates the patient has a myocardial rupture.
- 4 indicates the patient has congestive heart failure.
- 5 indicates the patient has thromboembolism.
- 6 indicates the patient has asystole.
- 7 indicates that the patient has ventricular fibrillation.
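Because several of these outcomes are rare, it helps to look at how the classes are distributed before modeling. A minimal sketch, assuming the dataset has been loaded into a DataFrame named data as in the next section:
# Count how many patients fall into each LET_IS class;
# heavily skewed counts indicate class imbalance.
print(data['LET_IS'].value_counts().sort_index())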
Preprocessing
The first step is the initial data processing, which involves checking the data for missing values.
import pandas as pd
import numpy as np
# Download the dataset and load it into a DataFrame
!wget https://s3-eu-west-1.amazonaws.com/pstorage-leicester-213265548798/23581310/MyocardialinfarctioncomplicationsDatabase.csv
data = pd.read_csv('MyocardialinfarctioncomplicationsDatabase.csv')
data
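A quick way to check for missing values is to count the NaN entries per column; a minimal sketch:
# Number of missing entries in each column; non-zero counts mean imputation is needed
print(data.isna().sum())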
# Print the value range (max - min) of every feature column
def cetak_rentang(df_input):
    list_fitur = df_input.columns[:-1]
    for fitur in list_fitur:
        nilai_max = df_input[fitur].max()
        nilai_min = df_input[fitur].min()
        print("Range of feature", fitur, "is", nilai_max - nilai_min)
As seen in the two outputs above, the data contains missing values (marked as NaN), and the feature ranges are far from uniform across features. Thus, it is necessary to impute and normalize the data.
To impute the data, each missing value is filled with the mean of its feature within the same class, falling back to the overall feature mean. The following function does this and can then be called on the DataFrame:
def imputasi(df_input):
    list_columns = df_input.columns
    class_column = list_columns[-1]
    for column in list_columns[:-1]:
        # Fill missing values with the rounded mean of the feature within the same class
        df_input[column] = df_input[column].fillna(round(df_input.groupby(class_column)[column].transform('mean'), 0))
        # Fall back to the overall feature mean for any values still missing
        df_input[column] = df_input[column].fillna(df_input[column].mean())
    return df_input
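A minimal usage sketch, applying the function and confirming that no missing values remain:
data = imputasi(data)
# Should print 0 once every missing value has been filled
print(data.isna().sum().sum())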
After imputation, there are no more missing values. The next step is min-max normalization, so that the data has a range between 0 and 1. However, only continuous (non-categorical) features are normalized; discrete features such as age do not need to be normalized. Normalization can be done with the following code:
from sklearn.preprocessing import MinMaxScaler

# Continuous (non-categorical) features to be scaled to [0, 1]
kolomfitur = ['S_AD_KBRIG','D_AD_KBRIG','S_AD_ORIT','D_AD_ORIT','K_BLOOD','NA_BLOOD',
              'ALT_BLOOD','AST_BLOOD','KFK_BLOOD','L_BLOOD','ROE']

normalisasi = MinMaxScaler()
data[kolomfitur] = normalisasi.fit_transform(data[kolomfitur])
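To verify the scaling, the overall minimum and maximum of the normalized columns can be checked; a minimal sketch:
# Each normalized feature should now lie in [0, 1]
print(data[kolomfitur].min().min(), data[kolomfitur].max().max())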
Now that the data no longer contains missing values and the non-categorical features have been normalized, the next step is to carry out the classification process on the dataset.
Classification
Model Building
The models in this project were built using scikit-learn. The code below builds the Naive Bayes, Decision Tree, and SVM models.
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# The three classifiers to be compared
models = []
models.append(('NB', GaussianNB()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('SVM', SVC(gamma='auto')))

# Cross-validation scores and model names are collected here
results = []
names = []
Training Data and Validation Data
The next stage is to split the data into training and test sets. The split used for training and test data is 80:20.
from sklearn.model_selection import train_test_split

array = data.values
X = array[:,0:123]  # feature columns
y = array[:,123]    # LET_IS, the target column
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.filterwarnings("ignore")

# Evaluate each model with stratified 10-fold cross-validation on the training set
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
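Since the per-fold accuracies are collected in results, the three models can also be compared visually; a minimal sketch using matplotlib (not part of the original pipeline):
from matplotlib import pyplot
# Box plot of the 10 cross-validation accuracies for each model
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()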
Model Prediction
Naive Bayes
# Train Naive Bayes on the training set and evaluate on the held-out validation set
model = GaussianNB()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print(classification_report(Y_validation, predictions))
From the report above, it can be seen that Naive Bayes has an accuracy of 0.48. This accuracy can be considered low compared to the other methods. Moreover, classes 4, 6, and 7 get very low precision, recall, and F1-scores of 0.0.
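The zero scores on those classes are easier to interpret with a confusion matrix; a minimal sketch reusing the predictions variable from the block above:
from sklearn.metrics import confusion_matrix
# Rows are true classes, columns are predicted classes; a row whose
# diagonal entry is 0 is a class the model never predicts correctly.
print(confusion_matrix(Y_validation, predictions))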
Decision Tree
# Train the Decision Tree and evaluate on the validation set
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print(classification_report(Y_validation, predictions))
From the report above, it can be seen that the Decision Tree has an accuracy of about 0.99. This accuracy can be considered high compared to the other methods. Even so, class 7 still gets very low precision, recall, and F1-scores of 0.0.
SVM
# Oversample the minority classes in the training set with SMOTE
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, Y_train)

# Note: the resampled data (X_res, y_res) is computed but, as in the original code,
# the SVM below is still fitted on the unbalanced X_train, Y_train
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print(classification_report(Y_validation, predictions))
From the report above, it can be seen that the SVM gets an accuracy of 0.92. This accuracy can be considered high, but below the Decision Tree, so this method sits in the middle of the three. Here, classes 2, 5, and 6 get very low precision, recall, and F1-scores of 0.0.
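To actually make use of the SMOTE oversampling, the classifier would have to be fitted on the resampled data instead; a minimal sketch of that variant (an assumption, not the run whose results are reported above):
# Fit the SVM on the balanced training data produced by SMOTE
model_smote = SVC(gamma='auto')
model_smote.fit(X_res, y_res)
print(classification_report(Y_validation, model_smote.predict(X_validation)))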
Conclusion
MIC, or Myocardial Infarction Complications, is a dataset of patients who experienced heart disease and its possible complications. Preprocessing this dataset is crucial: missing values, unnormalized feature values, and other deficiencies need to be corrected so that predictions on the data can be obtained optimally. Finally, after modeling with three different algorithms, namely Naive Bayes, Decision Tree, and SVM, it is clear that the Decision Tree is the best algorithm for this MIC dataset because it achieves higher accuracy than the other algorithms.