Independence of Errors: A Guide to Validating Linear Regression Assumptions

ungest

David Usoro

Posted on April 14, 2024


Before training a linear model on a dataset, it is important to check that the dataset meets the assumptions of linear regression.

Some of the most commonly cited assumptions of linear regression are:

  • Independence of Errors

  • Linearity

  • Homoscedasticity

  • No multicollinearity

  • Normal distribution of Errors

Graphs showing some Assumptions for linear regression

Independence of Error Assumption

Independence of errors means that the residuals from a model are not correlated with each other, so the value of one error does not predict the value of another. It is also referred to as 'No Autocorrelation.'

Independence of errors is crucial for the reliability of hypothesis tests on the regression coefficients. If errors are not independent, standard statistical tests can yield misleading results, including underestimating or overestimating the significance of variables.

Below are the steps I took to verify the independence of errors in my analysis:

Import Packages

The packages I used for this analysis are pandas and statsmodels. I used the durbin_watson function from the statsmodels.stats.stattools module to run the Durbin-Watson test for independence of errors.



import pandas as pd
import statsmodels.api as sm

from statsmodels.stats.stattools import durbin_watson





Feature Engineering

Alongside the other cleaning steps that prepared the data for model fitting, I wrote a group_location function that buckets a categorical column in the dataset, to tackle its high cardinality and keep the model from becoming needlessly complex.

group_location function

Prior to this, the column had over 800 unique categories.

unique categories

Here's a breakdown of what the function achieves:

Calculate Frequency: It first calculates the frequency of each unique category in the 'location' column of the dataframe (df).

Identify Low Frequency Categories: It then identifies which locations appear with a frequency below a specified threshold.

Replace with 'Other': These infrequent categories are replaced with the label 'Other'.

Return Counts: Finally, it returns the new value counts of the modified 'location' column, which now includes the aggregated 'Other' category.
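Since the original function only survives as a screenshot, here is a minimal sketch of what a function following those four steps could look like. The threshold value, the `column` parameter, and the toy city names are my own assumptions for illustration, not the values used in the analysis.

```python
import pandas as pd


def group_location(df, threshold=10, column="location"):
    """Bucket rare categories of `column` into 'Other' and return the new counts.

    Note: this sketch modifies `df` in place.
    """
    # 1. Frequency of each unique category
    counts = df[column].value_counts()
    # 2. Categories that appear fewer than `threshold` times
    rare = counts[counts < threshold].index
    # 3. Keep frequent categories, replace the rare ones with 'Other'
    df[column] = df[column].where(~df[column].isin(rare), "Other")
    # 4. Value counts of the modified column
    return df[column].value_counts()


# Toy data (made up for illustration): two frequent and two rare locations
df = pd.DataFrame({"location": ["Lagos"] * 5 + ["Abuja"] * 4 + ["Uyo", "Jos"]})
location_counts = group_location(df, threshold=3)
print(location_counts)
```

With a threshold of 3, the two locations that appear only once are merged into a single 'Other' category, so the column drops from four unique values to three.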

The next step I took for data preprocessing was to encode the categorical columns, using the get_dummies method in pandas.
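As a rough illustration of that encoding step, the sketch below one-hot encodes a toy dataframe. The column names and values are placeholders, and `drop_first=True` and `dtype=int` are choices I am assuming here (dropping one level avoids perfectly collinear dummies once an intercept is added, and integer columns are easier to pass to statsmodels); the original call may have used different options.

```python
import pandas as pd

# Toy dataframe (illustrative only): one categorical and one numeric column
df = pd.DataFrame(
    {
        "location": ["Lagos", "Abuja", "Other", "Lagos"],
        "rooms": [3, 2, 1, 4],
    }
)

# One-hot encode the categorical column.
# drop_first=True drops the first level (alphabetically, 'Abuja') as the baseline.
encoded = pd.get_dummies(df, columns=["location"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```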

Fitting an OLS Model: A Key Step in Testing Error Independence

OLS model fitting

This step involved:

  • splitting the data into dependent and independent variables.

  • adding a constant term (intercept), because the statsmodels OLS class does not include one by default and, without it, the fitted line is forced through the origin.

  • fitting the OLS model

Durbin-Watson Test

durbin_watson test

  • A Durbin-Watson statistic close to 2.0 suggests no autocorrelation.
  • Values approaching 0 indicate positive autocorrelation.
  • Values approaching 4 indicate negative autocorrelation.
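To see where these reference values come from, it helps to compute the statistic by hand. It is the sum of squared differences between consecutive residuals divided by the sum of squared residuals: d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2. The sketch below is a plain-Python version of that formula applied to two made-up residual sequences; the function name and the sequences are mine, for illustration only.

```python
def durbin_watson_stat(residuals):
    """Durbin-Watson statistic computed directly from its definition."""
    # Sum of squared differences between consecutive residuals
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    # Sum of squared residuals
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Long runs of the same sign: neighbouring errors look alike (positive autocorrelation)
smooth = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
# Sign flips at every step: neighbouring errors are opposite (negative autocorrelation)
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_smooth = durbin_watson_stat(smooth)
dw_alternating = durbin_watson_stat(alternating)
print(dw_smooth)       # 0.5
print(dw_alternating)  # 3.5
```

When consecutive residuals are similar, the differences in the numerator are small and the statistic is pulled toward 0. When they keep flipping sign, the differences are large and the statistic is pushed toward 4.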

Interpreting the Durbin-Watson Statistic:

durbin_watson test result

The result of the Durbin-Watson test indicates no autocorrelation in the residuals of the model. Therefore, I fail to reject the null hypothesis of independence of errors.

Because the Durbin-Watson test checks for first-order autocorrelation specifically, this result is consistent with the residuals being independent of each other, which satisfies one of the critical assumptions of OLS regression.

Thank you for reading to this point. If you have questions or want to talk about the steps involved in this process, let's discuss in the comment section.
