Independence of Errors: A Guide to Validating Linear Regression Assumptions

Before fitting a linear regression model to a dataset, it is important to check that the dataset meets the assumptions of linear regression.

The most commonly cited assumptions of linear regression are:

Independence of Errors
Linearity
Homoscedasticity
No Multicollinearity
Normal distribution of Errors

Independence of Errors Assumption

Independence of errors means that the residuals from a model are not correlated with each other, so the value of one error tells you nothing about the value of another. It is also referred to as 'No Autocorrelation.'

Independence of errors is crucial for the reliability of hypothesis tests on the regression coefficients. If errors are not independent, standard statistical tests can yield misleading results, including underestimating or overestimating the significance of variables.

Below are the steps I took to verify the independence of errors in my analysis:

Import Packages

The packages I used for this analysis are pandas and statsmodels. I used the durbin_watson function from the statsmodels.stats.stattools module to run the Durbin-Watson test for independence of errors.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

Feature Engineering

Alongside the other cleaning steps that prepared the data for model fitting, I wrote a group_location function to bucket a categorical column in the dataset. The goal was to tackle its high cardinality and keep the model from becoming needlessly complex.

Prior to this, the column had over 800 unique categories.

Here's a breakdown of what the function achieves:

Calculate Frequency: It first calculates the frequency of each unique category in the 'location' column of the dataframe (df).

Identify Low Frequency Categories: It then identifies which locations appear with a frequency below a specified threshold.

Replace with 'Other': These infrequent categories are replaced with the label 'Other'.

Return Counts: Finally, it returns the new value counts of the modified 'location' column, which now includes the aggregated 'Other' category.
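The four steps above can be sketched roughly as follows. This is a minimal illustration and not necessarily the exact function I used: the toy data, the threshold value of 3, and the use of Series.where are assumptions made for the example.

```python
import pandas as pd


def group_location(df: pd.DataFrame, threshold: int = 10) -> pd.Series:
    """Bucket rare 'location' values into 'Other' and return the new counts.

    Note: this sketch modifies the 'location' column of `df` in place.
    """
    # Step 1: frequency of each unique category in the 'location' column
    location_counts = df["location"].value_counts()

    # Step 2: categories that appear fewer than `threshold` times
    rare_locations = location_counts[location_counts < threshold].index

    # Step 3: keep frequent categories, replace the rare ones with 'Other'
    is_rare = df["location"].isin(rare_locations)
    df["location"] = df["location"].where(~is_rare, "Other")

    # Step 4: value counts of the modified column, including 'Other'
    return df["location"].value_counts()


# Toy data invented for illustration (the real column had 800+ categories)
toy = pd.DataFrame({"location": ["A"] * 5 + ["B"] * 4 + ["C", "D", "E"]})
counts = group_location(toy, threshold=3)
print(counts)
```

In this toy example, the three locations that each appear once are merged into a single 'Other' bucket, while 'A' and 'B' are left untouched.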

The next preprocessing step was to one-hot encode the categorical columns using the get_dummies function in pandas.
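As a small sketch of that encoding step, assuming a toy dataframe with invented 'size' and 'price' columns: the drop_first=True and dtype=int arguments are my choices for the example. Dropping one dummy per category avoids perfect collinearity with the intercept, and integer dummies keep the design matrix numeric for statsmodels.

```python
import pandas as pd

# Toy data invented for illustration; column names other than
# 'location' are hypothetical.
toy = pd.DataFrame(
    {
        "location": ["A", "B", "Other", "A"],
        "size": [2, 3, 1, 4],
        "price": [10.0, 15.0, 7.0, 20.0],
    }
)

# One-hot encode 'location'. drop_first=True removes the first category
# ('A') so the remaining dummies are not collinear with an intercept.
encoded = pd.get_dummies(toy, columns=["location"], drop_first=True, dtype=int)
print(encoded)
```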

Fitting an OLS Model: A Key Step in Testing Error Independence

This step involved:

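splitting the data into the dependent variable (target) and the independent variables (features).
adding a constant term (intercept), because the OLS class in statsmodels does not include one by default and the fit would otherwise be forced through the origin.
fitting the OLS model.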

Durbin-Watson Test

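The Durbin-Watson statistic ranges from 0 to 4 and tests for first-order autocorrelation in the residuals, taken in the order of the rows.
A Durbin-Watson statistic close to 2.0 suggests no autocorrelation.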
Values approaching 0 indicate positive autocorrelation.
Values approaching 4 indicate negative autocorrelation.

Interpreting the Durbin-Watson Statistic:

The result of the Durbin-Watson test indicates no autocorrelation in the residuals of the model. Therefore, I fail to reject the null hypothesis of no autocorrelation.

This result is consistent with the residuals being independent of each other, which satisfies one of the critical assumptions of OLS regression. Note that the test only detects first-order autocorrelation, so it supports the assumption rather than proving it.

Thank you for reading to this point. If you have questions or want to talk about the steps involved in this process, let's discuss in the comment section.
