Random forest
F.elicià
Posted on October 23, 2022
Explanation:
Random Forest is a classifier that fits several decision trees on various subsets of a given dataset and combines their predictions by averaging to improve predictive accuracy. For homework #2, I fitted several classifiers, including RandomForestClassifier and ExtraTreesClassifier, to predict the binary response variable TREG1 (whether a person is a smoker or not). All explanatory variables in the dataset, such as age, gender, race, and alcohol use (see dataset), were used to build the final model. After fitting, each variable contributed to the prediction with a different level of importance.
The importances, calculated and sorted in descending order, give this feature importance list:
marever1 0.096374
age 0.083599
DEVIANT1 0.080081
SCHCONN1 0.075221
GPA1 0.074775
DEP1 0.071728
FAMCONCT 0.067389
PARACTV 0.063784
ESTEEM1 0.057945
ALCPROBS1 0.057670
VIOL1 0.048614
ALCEVR1 0.043539
PARPRES 0.039425
WHITE 0.022146
cigavail 0.021671
BLACK 0.018512
BIO_SEX 0.014942
inhever1 0.012832
cocever1 0.012590
PASSIST 0.010221
EXPEL1 0.009777
HISPANIC 0.007991
NAMERICAN 0.005332
ASIAN 0.003844
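The values in the list above add up to 1, which is how scikit-learn normalizes its impurity-based importances. The following is a minimal sketch of that property, not the homework code: it uses synthetic data from make_classification in place of the Add Health file, and the feature names f0 to f5 are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# One non-negative importance per feature; together they sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
print("Sum of importances:", importances.sum())
```

Because the values are relative shares, an importance of about 0.096 for marever1 means that variable accounts for roughly a tenth of the total impurity reduction across the forest, not that it explains a tenth of the outcome.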
Source code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
%matplotlib inline
RND_STATE = 55324
AH_data = pd.read_csv("data/tree_addhealth.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
predictors = data_clean[['BIO_SEX', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN', 'age',
                         'ALCEVR1', 'ALCPROBS1', 'marever1', 'cocever1', 'inhever1', 'cigavail',
                         'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1', 'SCHCONN1', 'GPA1',
                         'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES']]
targets = data_clean.TREG1
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4, random_state=RND_STATE)
print("Predict train shape: ", pred_train.shape)
print("Predict test shape: ", pred_test.shape)
print("Target train shape: ", tar_train.shape)
print("Target test shape: ", tar_test.shape)
classifier = RandomForestClassifier(n_estimators=25, random_state=RND_STATE)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print("Confusion matrix:")
print(confusion_matrix(tar_test, predictions))
print()
print("Accuracy: ", accuracy_score(tar_test, predictions))
important_features = pd.Series(data=classifier.feature_importances_,index=predictors.columns)
important_features.sort_values(ascending=False,inplace=True)
print(important_features)
model = ExtraTreesClassifier(random_state=RND_STATE)
model.fit(pred_train, tar_train)
print(model.feature_importances_)
trees = range(1, 26)  # number of trees in each forest: 1 to 25
accuracy = np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees, random_state=RND_STATE)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees, accuracy)
plt.show()
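The accuracy printed by accuracy_score can also be read directly off the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total count. Here is a small sketch of that arithmetic. The labels are hypothetical and chosen only for illustration; they are not taken from the Add Health data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (illustration only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print("Accuracy from matrix:", accuracy_from_cm)
print("accuracy_score:      ", accuracy_score(y_true, y_pred))
```

Looking at the off-diagonal cells as well shows whether the errors are mostly false positives or false negatives, which a single accuracy number hides.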
Output:
The final model performed well on the test data, reaching an accuracy of 83.4%. The results are shown in this plot:
As the plot shows, even a single tree reaches a good level of accuracy, so the data can be described reasonably well with one tree. On the other hand, adding more trees increases the final accuracy slightly, so the forest predicts the data somewhat more precisely.
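To make the comparison between one tree and many trees more concrete, here is a rough sketch of the idea behind a random forest, built by hand from single decision trees. It is an illustration under assumptions, not the homework code: the data is synthetic (make_classification), each tree is trained on a bootstrap sample, and the trees are combined by a simple majority vote. scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities, so its numbers will differ somewhat.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), binary labels 0/1.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
votes = np.zeros((n_trees, len(y_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Each split considers a random subset of features, as in a forest.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[rows], y_train[rows])
    votes[i] = tree.predict(X_test)

# Majority vote over the first k trees, for k = 1..n_trees (ties go to class 0).
votes_for_one = np.cumsum(votes, axis=0)
k = np.arange(1, n_trees + 1)[:, None]
ensemble_pred = (2 * votes_for_one > k).astype(int)
accuracy_by_k = (ensemble_pred == y_test).mean(axis=1)

for n, acc in zip(range(1, n_trees + 1), accuracy_by_k):
    print(f"{n:2d} trees: accuracy {acc:.3f}")
```

The first entry of accuracy_by_k is the accuracy of a single tree, and later entries show how the vote of a growing ensemble behaves on the same test set.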