Building your first machine learning model in Python
Silvester
Posted on May 28, 2024
Machine learning is the use of algorithms that learn patterns from data and improve as they are exposed to more of it. Machine learning models are commonly divided into supervised, unsupervised, and reinforcement learning. The most widely used algorithms fall under supervised learning, and linear regression is usually the first model you will encounter in this category.
Linear regression models come in two forms: simple and multiple. A simple linear model uses one independent variable and one dependent variable, while a multiple linear model has one dependent variable and two or more independent variables. In this article, I will take you through the process of creating your first multiple linear model for predicting the tips that customers give waiters in restaurants.
Getting started
Before we start, there are some technologies that you should be familiar with.
- Basic understanding of Python
- Some familiarity with statistics
- Python libraries including pandas, numpy, matplotlib, seaborn, and scikit-learn
Linear regression
Linear regression is among the simplest but most commonly used algorithms, especially when the focus is to determine how variables are related. A linear regression model aims to find the best-fit line that minimizes the sum of squared differences between the actual and predicted values.
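In its general form, a multiple linear regression model expresses the dependent variable y as a weighted sum of the independent variables, where b0 is the intercept and b1 through bn are the coefficients the model learns from the data:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn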
Linear regression models have many uses, including market analysis, sports analysis, and financial analysis.
Loading and understanding the dataset
We will use the tips dataset embedded in the Seaborn library. The tips dataset contains simulated data on tips that waiters receive in restaurants in addition to other attributes.
For this demonstration, this is the complete Google Colab that I used. We start by loading the necessary libraries and loading the data.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
After loading the libraries, we first list the datasets that ship with the Seaborn library.
print(sns.get_dataset_names())
After looking through the available options and settling on the tips dataset, we can now load it.
tips = sns.load_dataset('tips')
tips.head(5)
The table above shows that there are 7 variables in the dataset. The numerical columns are total_bill, tip, and size, while the categorical columns are sex, smoker, day, and time.
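If you prefer to confirm the column types programmatically rather than by inspecting the table, the info() method prints each column's dtype along with its non-null count:

tips.info()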
For basic statistics, we can use the describe() method.
tips.describe().T
The describe() method gives summary statistics for the numerical variables only. From the output, we can see the mean, standard deviation, minimum, maximum, and percentiles of each variable.
Data visualizations
Distribution of sex variable
sns.countplot(x='sex', data=tips)
plt.title('Distribution of Sex variable')
plt.show()
We can see from the plot above that male customers make up the majority of the customers represented in the dataset.
Total bill variable
sns.histplot(x='total_bill', data=tips)
plt.title('Histogram of the Total bill variable')
plt.show()
The histogram shows that the majority of the bills fall between $10 and $20.
Scatterplot
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Scatter plot of total bill and tip variables')
plt.show()
Correlation plot
num_cols = tips.select_dtypes(include='number')
corr_matrix = num_cols.corr()
sns.heatmap(corr_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()
The scatterplot above shows that tip and total_bill have a clear linear relationship. The correlation plot confirms this: total_bill and tip have a correlation of 0.68, indicating a fairly strong positive correlation.
Model building
Before building the model, the data has to be processed into a format that the machine learning algorithm can work with. Machine learning algorithms require numerical data, so the categorical values must be converted to numbers. There are various approaches for this, such as label encoding and one-hot encoding. For this project, we will use one-hot encoding.
tips = pd.get_dummies(tips, columns=['sex', 'smoker', 'day', 'time'], dtype=int)
One-hot encoding creates a new variable for each category of a categorical variable. For example, the sex variable has Male and Female as values; after encoding with the get_dummies() method, it becomes two new variables named sex_Male and sex_Female. Note that we started our analysis with 7 variables, and after applying one-hot encoding we have 13.
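One detail worth knowing: the two dummy columns produced from a binary variable like sex are redundant, since one is always 1 minus the other, which can introduce multicollinearity into a linear model. If you want to avoid this, get_dummies() accepts a drop_first parameter that drops one level per variable. A minimal sketch, not applied in this walkthrough:

tips_alt = pd.get_dummies(tips, columns=['sex', 'smoker', 'day', 'time'], drop_first=True, dtype=int)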
After encoding the data, we now scale it so that the features fall within a similar range. For example, values in the total_bill column vary between 3 and 50, while for the majority of the remaining columns the values are between 0 and 1. Scaling puts the features on a comparable footing so that no single feature dominates the model simply because of its larger magnitude. For this, we use the MinMaxScaler class from the scikit-learn library.
from sklearn.preprocessing import MinMaxScaler
# Instantiate the scaler
MM = MinMaxScaler()
col_to_scale = ['total_bill']
# Fitting and transforming the scaler
scaled_data = MM.fit_transform(tips[col_to_scale])
# Convert the scaled data into a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=col_to_scale)
# Dropping the original columns to avoid duplication
tips_df = tips.drop(columns=col_to_scale).join(scaled_df)
After scaling the total_bill column, we have the results below. You can see that the values in the total_bill column now range between 0 and 1 like the rest of the variables.
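You can verify the new range with a quick check; after min-max scaling, the minimum should be exactly 0.0 and the maximum exactly 1.0:

print(tips_df['total_bill'].min(), tips_df['total_bill'].max())

One caveat worth noting: for simplicity we fitted the scaler on the full dataset, but in a stricter workflow you would fit it on the training split only, so that no information from the test set leaks into preprocessing.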
Next, we split the data into train and test sets. We will use the training data to train the model and test data to test the performance of our model.
from sklearn.model_selection import train_test_split
X = tips_df.drop(columns='tip')
y = tips_df['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The capital X represents the independent variables (features) that will be fed to our model, and the lowercase y represents the target variable.
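As a quick sanity check on the split: the tips dataset has 244 rows, so with test_size=0.2 we expect roughly 195 rows for training and 49 for testing:

print(X_train.shape, X_test.shape)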
After splitting the data, we now proceed to instantiate the model and fit it to the training data as shown by the code below.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
LR = LinearRegression()
LR.fit(X_train, y_train)
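Once the model is fitted, we can inspect the learned intercept and the coefficient assigned to each feature. This interpretability is one of the main attractions of linear regression:

print("Intercept:", LR.intercept_)
# One learned coefficient per feature in X
for feature, coef in zip(X.columns, LR.coef_):
    print(f"{feature}: {coef:.4f}")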
Model evaluation
After fitting the model to the training data, we now test it on unseen data. Evaluating the model is important because it tells us whether the model's performance is good or bad. For regression models, the common evaluation metrics include the mean absolute error, mean squared error, root mean squared error, and R-squared, among others.
y_pred = LR.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
r2 = r2_score(y_test, y_pred)
print("R-Squared (R2) Score:", r2)
The output is:
Mean Squared Error: 0.7033566017436106
R-Squared (R2) Score: 0.43730181943482493
A mean squared error of about 0.70 is high relative to typical tip amounts, which means that our model is not predicting well, while the R-squared value of about 0.44 means the model explains less than half of the variance in the tips. Ideally, the mean squared error should be low and the R-squared value high.
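The mean absolute error and root mean squared error mentioned earlier are just as easy to compute. RMSE is especially convenient because it is in the same units as the tip itself (dollars), which makes the error easier to interpret. A minimal sketch:

from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)  # square root of the MSE computed above
print("Mean Absolute Error:", mae)
print("Root Mean Squared Error:", rmse)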
Visualizing the results:
plt.figure(figsize=(8,8))
plt.scatter(y_test, y_pred)
# Adding labels to the plot
plt.xlabel("The Actual Tip Amount")
plt.ylabel("The Predicted Tip Amount")
plt.title("Plot of Actual versus Predicted Tip Amount")
plt.plot([0, max(y_test)], [0, max(y_test)], color='green', linestyle='--')
plt.show()
From the plot, we can see that there are many values below the diagonal line. This means that in many cases, the predicted tip amount tends to be lower than the actual tip amount.
Conclusion
In this article, we successfully built our first machine-learning model to predict the tips that customers pay. This regression model has provided us with a starting point to understand the relationship between several independent features and the tip amount. We also saw in the model evaluation that our model did not perform well in predicting the tip amount.
The performance of our model highlights an important aspect of data science and machine learning which is improving models iteratively. To further improve our model, we may have to use feature engineering, perform hyperparameter tuning, or do data quality checks. As you embark on this machine-learning journey, remember that your model may need several improvements before it achieves the desired performance.
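As a first step in that iterative process, cross-validation gives a more reliable estimate of performance than a single train/test split, because every row takes a turn in the test set. A minimal sketch using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation, scoring each fold by R-squared
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("R-squared per fold:", scores)
print("Mean R-squared:", scores.mean())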