Insurance Cost Prediction using Machine Learning with Python.
Oluwafunmilola Obisesan
Posted on January 29, 2023
Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new output values.
In this project, I developed an end-to-end machine learning model using linear regression.
Data cleaning, extensive data visualization, and exploratory data analysis were also carried out.
Data Description:
The dataset used for this project is an insurance-focused dataset that contains columns such as age, sex, bmi, region, and other data, which were used to determine the cost of each person’s insurance.
Steps
- Importing the necessary libraries: NumPy, pandas, Matplotlib, seaborn and scikit-learn were imported.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
%matplotlib inline
- Loading in the dataset: The csv was loaded using the code below:
Insurance = pd.read_csv("https://raw.githubusercontent
- Getting information about the data: To see details such as the type of data in each column, we use the code below:
Insurance.info()
- Checking the statistical description of the data:
Insurance.describe()
- Checking for the number of rows and columns present in the dataset:
Insurance.shape
Data Cleaning and preparation:
Working with “unclean” data leads to inaccuracy in results, so it’s necessary to carry out data cleaning before any analysis or prediction is done.
- Checking for null values:
To check for null values in our dataset, we use the code below:
Insurance.isnull().any()
- Checking for duplicates:
Insurance.duplicated().any()
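The two checks above only report whether a problem exists. As a minimal sketch of how one might count the problems and then handle them, here is the same idea on a small made-up DataFrame. The `toy` frame and its values are hypothetical and are not taken from the insurance dataset, and dropping rows is only one possible cleaning choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing bmi value and one exact duplicate row.
toy = pd.DataFrame({
    "age": [19, 33, 33, 45],
    "bmi": [27.9, 22.7, 22.7, np.nan],
    "charges": [16884.92, 21984.47, 21984.47, 7000.00],
})

null_counts = toy.isnull().sum()       # number of missing values per column
n_duplicates = toy.duplicated().sum()  # rows identical to an earlier row

# One possible cleanup: drop repeated rows, then rows with missing values.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(null_counts)
print("duplicate rows:", n_duplicates)
print(cleaned)
```

Using `.sum()` instead of `.any()` shows how many values are affected, which helps in deciding whether to drop rows or fill the gaps.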
Exploratory Data Analysis:
Exploratory data analysis helps in understanding the patterns, trends and metrics in a dataset. It also helps in detecting outliers and anomalous events.
- Using a correlation matrix to check for correlations among the columns in the dataset:
sns.heatmap(Insurance.select_dtypes(include="number").corr())  # correlate numeric columns only
The correlation matrix shows there’s little or no correlation between “age” and “charges”.
- Checking for the distribution pattern of the “charges” column:
sns.histplot(Insurance['charges'], kde=True)  # distplot is deprecated in recent seaborn releases
- Plotting a pairplot to check the relationships between pairs of columns:
sns.pairplot(Insurance);
Extracting dependent and independent variables:
The dependent variable in this case is “charges”, while the independent variables are the other columns.
X = Insurance.drop(columns = ["charges"])
X.head(5)
y = Insurance["charges"]
y
Splitting the dataset into test and train.
To build a machine learning model, you “train” it on one part of the data and “test” it on the part it has not seen.
So we split our data into “train” and “test” sets, using 80 percent to train the model and the other 20 percent to test it.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 0)
X_train.head()
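To make the 80/20 split concrete, here is a small sketch on made-up arrays. The names `X_demo` and `y_demo` are hypothetical stand-ins, not the insurance data. It shows that `test_size=0.2` sets aside 20 percent of the rows and that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Splitting again with the same random_state gives the same partition.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(len(X_tr), "training rows,", len(X_te), "test rows")
```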
One hot encoding to transform categorical text data
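The data contains some columns that hold text, such as sex, smoker and region.
Since we can’t build the model with text data, we need to convert it into numbers.
Using the sex column as an example, this means assigning 0 to female and 1 to male.
We can do this using one hot encoding, with the code below. The test set is encoded as well and aligned to the training columns, so that it can be passed to the model later:
X_train_ = pd.get_dummies(X_train, columns=["sex", "smoker", "region"], drop_first=True)
X_test_ = pd.get_dummies(X_test, columns=["sex", "smoker", "region"]).reindex(columns=X_train_.columns, fill_value=0)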
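As an illustration of what one hot encoding produces, here is a sketch on a tiny made-up frame. The `train_demo` and `test_demo` frames are hypothetical. It also shows one way, among several, to give held-out rows exactly the same columns as the training data: encode them without `drop_first` and then reindex to the training columns.

```python
import pandas as pd

# Hypothetical rows, loosely shaped like the insurance columns.
train_demo = pd.DataFrame({
    "age": [19, 33, 45],
    "sex": ["female", "male", "female"],
    "region": ["southwest", "northeast", "southeast"],
})
test_demo = pd.DataFrame({"age": [52], "sex": ["male"], "region": ["southwest"]})

# drop_first=True removes the first category of each column (female, northeast),
# so a row with all zeros in that group means "the dropped category".
train_enc = pd.get_dummies(
    train_demo, columns=["sex", "region"], drop_first=True, dtype=int
)

# Encode the held-out rows, then align to the training columns.
# Categories missing from the held-out rows are filled with 0.
test_enc = pd.get_dummies(
    test_demo, columns=["sex", "region"], dtype=int
).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc)
print(test_enc)
```

Aligning to the training columns matters because a model fitted on one set of columns expects the same columns, in the same order, at prediction time.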
Building and fitting the model.
Here is the most interesting part of this project. Now that we are done with data cleaning and converting text data to numbers, we can build our model using the lines of code below:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train_,y_train)
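To get a feel for what `fit` learns, here is a minimal sketch on synthetic data where the true relationship is known. The data and the names `X_demo`, `y_demo` and `demo_model` are made up for illustration. Linear regression estimates one coefficient per feature plus an intercept, and on noise-free data it should recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3x + 2 exactly (no noise).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 3.0 * X_demo[:, 0] + 2.0

demo_model = LinearRegression().fit(X_demo, y_demo)

slope = demo_model.coef_[0]         # learned weight for the single feature
intercept = demo_model.intercept_   # learned constant term

print("slope:", slope, "intercept:", intercept)
```

On the insurance model, the same `coef_` and `intercept_` attributes on `lm` can be inspected to see how each encoded column contributes to the predicted charges.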
Predicting the “test” set results.
Remember that we trained our model on 80 percent of our data. Now that we’ve built the model, we can use it to predict the outcome for the 20 percent we set aside.
Here’s the code for the prediction using our “test” data:
predictions = lm.predict(X_test_)
Now let’s check how accurately our model predicts the “test” set results.
Model evaluation:
To evaluate the accuracy of our model, we’ll use the R2 score.
The R2 score measures the proportion of the variance in the dependent variable (here, “charges”) that is explained by the model.
If the value of the R2 score is 1, the model predicts the test data perfectly; if it’s 0, the model does no better than always predicting the mean of the target, and it can even be negative for a model that does worse than that.
The closer the value of the R2 is to 1, the better the model fits the data.
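The R2 score can also be worked out by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers and a hypothetical helper named `r2_by_hand` to show the three typical cases: perfect predictions, always predicting the mean, and predictions that are worse than the mean.

```python
def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, computed with plain Python."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Made-up target values with a mean of 6.0.
y_true_demo = [3.0, 5.0, 7.0, 9.0]

perfect = r2_by_hand(y_true_demo, [3.0, 5.0, 7.0, 9.0])    # every prediction exact
mean_only = r2_by_hand(y_true_demo, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
reversed_ = r2_by_hand(y_true_demo, [9.0, 7.0, 5.0, 3.0])  # worse than the mean

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```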
To check our R2 score, we use the code below:
from sklearn.metrics import r2_score
r2_score(y_test, predictions)
Not a bad model I must say!
View the entire code here:
https://github.com/heyfunmi/Insurance_Cost_Prediction_using_Machine_Learning_with_Python
See you in another project!
Cheers!!