Regression Modeling 101: Understanding Different Types of Models and How to Choose the Right One
Anurag Verma
Posted on January 4, 2023
A regression model is a statistical model that is used to predict a continuous variable based on one or more predictor variables. The goal of a regression model is to identify the relationship between the predictor variables and the response variable, and then use that relationship to make predictions about the response variable based on new data.
There are several different types of regression models, including:
Linear regression: Linear regression is a statistical method used to model the relationship between a continuous response variable (also known as the dependent variable) and one or more predictor variables (also known as independent variables). The goal of linear regression is to identify the best-fitting line that describes the relationship between the predictor and response variables. The best-fitting line is determined by finding the values of the coefficients (also known as slopes) that minimize the sum of squared errors between the predicted values and the observed values of the response variable.
Polynomial regression: Polynomial regression is a type of regression model that allows for more complex relationships between the predictor variables and the response variable by using polynomial terms in the model. For example, a quadratic term (x^2) can be included in the model to allow for a relationship that is not linear. Higher-order polynomial terms (e.g. x^3, x^4) can also be included to allow for even more complex relationships.
Logistic regression: Logistic regression is a type of regression model that is used for predicting a binary response variable (i.e. a response variable that can only take on two values). The goal of logistic regression is to model the probability of the response variable taking on one of the two values, given the predictor variables. The model is fit by maximizing the likelihood of the observed data, given the model assumptions.
Multivariate regression: Multivariate regression (in practice, the term is often used interchangeably with multiple regression, as it is here) is a regression model that uses several predictor variables at once. This type of model is useful for examining the relationships between multiple predictor variables and the response variable, and for identifying which predictor variables are most important for predicting the response variable.
Ridge regression: Ridge regression is a type of regression model that is used to address overfitting in linear regression models. Overfitting occurs when the model is too complex and has too many parameters, resulting in poor generalization to new data. Ridge regression addresses overfitting by adding a regularization term to the model that penalizes large coefficients, forcing some of the coefficients to be closer to zero. This helps to reduce the complexity of the model and improve its generalization ability.
Lasso regression: Lasso regression is similar to ridge regression in that it is used to address overfitting in linear regression models. However, instead of the squared (L2) penalty used by ridge regression, lasso uses an absolute-value (L1) penalty on the coefficients, which can shrink some of them exactly to zero. This is useful for identifying a subset of important predictor variables and eliminating less important variables from the model.
Elastic net regression: Elastic net regression is a type of regression model that combines the regularization terms of both ridge and lasso regression. This allows the model to both shrink some of the coefficients towards zero and set some coefficients exactly to zero, depending on the relative importance of the predictor variables.
Stepwise regression: Stepwise regression is a procedure for selecting the most important predictor variables to include in a regression model. The process involves gradually adding or removing variables based on their statistical significance, with the goal of finding the most parsimonious model that best explains the variation in the response variable.
Multivariate adaptive regression splines (MARS): MARS is a type of regression model that is used to model complex, non-linear relationships between the predictor variables and the response variable. The model uses piecewise linear functions to model the relationship, and is particularly useful for modeling relationships that are not well-described by a single linear equation.
Random forest regression: Random forest regression is a type of ensemble model that uses multiple decision trees to make predictions. Each decision tree in the random forest is trained on a different subset of the data and makes predictions based on the variables that are most important for that particular tree. The final prediction is made by averaging the predictions of all of the decision trees in the forest. Random forest regression is particularly useful for modeling complex, non-linear relationships, and can also be used to identify important predictor variables.
Overall, the choice of which type of regression model to use will depend on the characteristics of the data and the specific goals of the analysis. It is important to carefully consider the assumptions of each type of regression model and choose the one that is most appropriate for the data at hand.
Next, we will take each of the models listed above in turn and show how to train it on sample data.
1. Linear regression
Here is an example of linear regression using the Python library scikit-learn. The data used for this example is a synthetic dataset of two predictor variables (x1 and x2) and a response variable (y).
Here is a sample dataset that can be used for linear regression:
x1 | x2 | y |
---|---|---|
1 | 2 | 4 |
2 | 3 | 6 |
3 | 4 | 8 |
4 | 5 | 10 |
5 | 6 | 12 |
To use this data for linear regression, you simply need to load it into a pandas DataFrame and follow the steps outlined below.
For example, you could load the data into a DataFrame like this:
import pandas as pd
# Load the data into a DataFrame
data = pd.DataFrame({
"x1": [1, 2, 3, 4, 5],
"x2": [2, 3, 4, 5, 6],
"y": [4, 6, 8, 10, 12]
})
Next, we import the regression model and split the data into predictor and response variables:
from sklearn.linear_model import LinearRegression
# Split the data into predictor and response variables
X = data[["x1", "x2"]]
y = data["y"]
Next, we can split the data into training and test sets using scikit-learn's train_test_split function:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now that we have our data split into training and test sets, we can create a linear regression model and fit it to the training data:
# Create a linear regression model
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
Finally, we can use the model to make predictions on the test data and evaluate the model's performance using scikit-learn's mean squared error function:
from sklearn.metrics import mean_squared_error
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
This is just a basic example of linear regression using scikit-learn, but there are many other features and options that you can use to fine-tune your model and improve its performance.
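For instance, if you want to see the fitted relationship itself, the learned coefficients and intercept are available on the fitted model. A minimal sketch, continuing from the code above:
# Inspect the fitted line: one coefficient per predictor, plus an intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)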
2. Polynomial regression
Here is an example of how to perform polynomial regression using Python and the scikit-learn library:
Here is the sample data we use in a tabular form:
X | y |
---|---|
1 | 1 |
2 | 4 |
3 | 9 |
4 | 16 |
5 | 25 |
First, we need to import the necessary libraries and load the data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Load the data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])
Next, we can use the PolynomialFeatures function to transform the data into polynomial features:
# Transform the data into polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
Finally, we can fit the polynomial regression model using the transformed data:
# Fit the polynomial regression model
model = LinearRegression()
model.fit(X_poly, y)
# Make predictions on new data
X_new = np.array([[6], [7], [8]])
X_new_poly = poly.transform(X_new)
y_pred = model.predict(X_new_poly)
print(y_pred)
The output of this code will be an array of predictions for the values of y for the new data points [6, 7, 8].
[36. 49. 64.]
Note that this is just a simple example, and in practice you may want to perform additional steps such as cross-validation and hyperparameter tuning to optimize the performance of the model.
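For example, the polynomial degree is a hyperparameter that can be chosen by cross-validation. Here is a minimal sketch of what that tuning could look like with a scikit-learn pipeline (with only five data points the result is purely illustrative):
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Chain the polynomial transform and the linear model, then search over the degree
pipeline = Pipeline([
    ("poly", PolynomialFeatures()),
    ("regression", LinearRegression()),
])
param_grid = {"poly__degree": [1, 2, 3]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)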
3. Logistic regression
Here is an example of logistic regression using sample data in a tabular form:
Sample data:
Age | Gender | Income | Credit Score | Approved for Loan |
---|---|---|---|---|
25 | Male | $50,000 | 750 | Yes |
30 | Female | $40,000 | 700 | Yes |
35 | Male | $60,000 | 650 | No |
40 | Female | $70,000 | 800 | Yes |
45 | Male | $80,000 | 850 | No |
Explanation: In this example, we are trying to predict whether an individual will be approved for a loan based on their age, gender, income, and credit score. The response variable is "Approved for Loan" which is a binary variable (either Yes or No). The predictor variables are Age, Gender, Income, and Credit Score.
Code for logistic regression using Python:
# Import necessary libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Load sample data into a Pandas DataFrame
data = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
'Income': [50000, 40000, 60000, 70000, 80000],
'Credit Score': [750, 700, 650, 800, 850],
'Approved for Loan': ['Yes', 'Yes', 'No', 'Yes', 'No']})
# Create feature matrix (X) and response vector (y)
X = data[['Age', 'Gender', 'Income', 'Credit Score']]
y = data['Approved for Loan']
# Convert categorical variables to dummy variables
X = pd.get_dummies(X)
# Create a logistic regression model
model = LogisticRegression()
# Fit the model to the training data
model.fit(X, y)
# Predict whether a new individual with the following characteristics will be approved for a loan:
# age 30, male, income $40,000, credit score 700
# (the new row must have the same dummy-encoded columns as X)
new_individual = pd.DataFrame({'Age': [30], 'Income': [40000], 'Credit Score': [700],
                               'Gender_Female': [0], 'Gender_Male': [1]})[X.columns]
prediction = model.predict(new_individual)
print(prediction) # Output: ['Yes']
Explanation: In this example, we first load the sample data into a Pandas DataFrame, and then create a feature matrix (X) and response vector (y). We then convert the categorical variable (Gender) to dummy variables using the get_dummies function in Pandas. Next, we create a logistic regression model using the LogisticRegression class from the sklearn library. We then fit the model to the training data using the fit method, and finally use the model to make a prediction for a new individual using the predict method. The output of the model is that the new individual is predicted to be approved for a loan.
It is important to note that this is just a basic example of logistic regression, and there are many other considerations and techniques that can be used to improve the model's performance. For example, you may want to normalize the predictor variables, or use cross-validation to evaluate the model's performance. You may also want to consider using other evaluation metrics, such as the confusion matrix or the AUC (area under the curve) score, to assess the model's accuracy.
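As a rough sketch of those evaluation metrics, scikit-learn provides confusion_matrix and roc_auc_score. Here they are computed on the same five rows used for fitting, since this toy dataset is too small for a proper train/test split; in practice you would evaluate on held-out data:
from sklearn.metrics import confusion_matrix, roc_auc_score
# Predictions on the data used for fitting (illustrative only)
y_pred = model.predict(X)
print(confusion_matrix(y, y_pred))
# AUC needs predicted probabilities for the positive class ('Yes')
pos_index = list(model.classes_).index('Yes')
y_prob = model.predict_proba(X)[:, pos_index]
print(roc_auc_score((y == 'Yes').astype(int), y_prob))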
Additionally, it is important to carefully consider the assumptions of logistic regression and ensure that they are met before using the model. For example, logistic regression assumes that there is a linear relationship between the predictor variables and the log-odds of the response variable, and that the observations are independent of one another; unlike linear regression, it does not require normally distributed errors. If these assumptions are not met, the model may not be appropriate for the data.
Overall, logistic regression is a powerful and widely used tool for predicting binary outcomes, and can be a valuable addition to any data scientist's toolkit. By carefully considering the characteristics of the data and the assumptions of the model, you can use logistic regression to make accurate and reliable predictions.
4. Multivariate regression
Here is an example of multivariate regression using Python with sample data in a tabular form:
Year | Sales | Advertising | Price |
---|---|---|---|
1 | 100 | 50 | 10 |
2 | 110 | 55 | 12 |
3 | 120 | 60 | 14 |
4 | 130 | 65 | 16 |
5 | 140 | 70 | 18 |
In this example, we are trying to predict sales (the response variable) based on two predictor variables: advertising and price. We can build a multivariate regression model in Python using the following code:
import pandas as pd
from sklearn.linear_model import LinearRegression
# read in the data
df = pd.read_csv("sample_data.csv")
# create a Linear Regression model object
model = LinearRegression()
# fit the model using the Advertising and Price columns as predictor variables
# and the Sales column as the response variable
X = df[["Advertising", "Price"]]
y = df["Sales"]
model.fit(X, y)
# view the model coefficients
print(model.coef_)
# view the model intercept
print(model.intercept_)
# view the model R-squared value
print(model.score(X, y))
The output of the model coefficients will give us the estimated effect of each predictor variable on the response variable (i.e. how much the response variable is expected to change for each unit increase in that predictor, holding the other predictors constant). The output of the model intercept will give us the estimated value of the response variable when all predictor variables are equal to zero. The output of the model R-squared value will give us a measure of how well the model explains the variance in the response variable.
Using this multivariate regression model, we can make predictions about sales based on new values of advertising and price. For example, if we wanted to predict sales for a year with $70 of advertising and a price of $20, we could use the following code:
# create a new data frame with the new values of advertising and price
new_data = pd.DataFrame({"Advertising": [70], "Price": [20]})
# make the prediction
prediction = model.predict(new_data)
print(prediction)
This would give us the predicted value of sales based on the values of advertising and price in the new data frame.
It is important to note that this is just a basic example of multivariate regression using Python, and there are many other considerations and techniques that may be relevant depending on the specific goals of the analysis and the characteristics of the data. Some additional considerations for multivariate regression might include:
Handling missing data: If there are missing values in the data, we may need to impute missing values or use techniques such as multiple imputation to handle missing data.
Feature scaling: If the scale of the predictor variables is very different, it may be beneficial to scale the variables so that they are on the same scale. This can help the model converge more quickly and may improve the model's performance.
Model evaluation: It is important to evaluate the performance of the model using appropriate metrics and techniques such as cross-validation to ensure that the model is not overfitting the data; a minimal sketch combining feature scaling and cross-validation appears after this list.
Model selection: If there are multiple potential predictor variables, we may need to select the most important variables to include in the model using techniques such as stepwise regression or regularization methods.
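As a minimal sketch of the scaling and evaluation points above, a scikit-learn pipeline can standardize the predictors and score the model with cross-validation (with only five rows of sample data the numbers are purely illustrative):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
# Standardize the predictors, then fit the linear model, scored by cross-validation
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regression", LinearRegression()),
])
scores = cross_val_score(pipeline, X, y, cv=2, scoring="neg_mean_squared_error")
print(scores.mean())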
5. Ridge regression
Here is an example of using ridge regression on sample data in Python:
First, we can start by importing the necessary libraries:
import numpy as np
from sklearn.linear_model import Ridge
Next, let's define our sample data in a tabular form:
Predictor 1 | Predictor 2 | Response |
---|---|---|
1 | 2 | 5 |
3 | 4 | 9 |
5 | 6 | 13 |
7 | 8 | 17 |
We can then convert this data into arrays that can be used by the ridge regression model:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([5, 9, 13, 17])
Now, we can create a ridge regression model and fit it to the data:
model = Ridge(alpha=1.0)
model.fit(X, y)
The alpha parameter in the model specifies the amount of regularization to apply. A larger alpha value will result in a model with more regularization, which can help to reduce overfitting.
Finally, we can use the model to make predictions on new data:
predictions = model.predict([[9, 10]])
print(predictions)
This will output an array with the prediction for the response variable based on the given predictor variables. In this case, the prediction will be close to 21, falling slightly below it because the ridge penalty shrinks the coefficients.
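Rather than fixing alpha by hand, scikit-learn's RidgeCV can pick it by cross-validation over a small grid. A minimal sketch on the same four data points:
from sklearn.linear_model import RidgeCV
# Try a few regularization strengths and keep the one with the best cross-validated fit
cv_model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
cv_model.fit(X, y)
print(cv_model.alpha_)
print(cv_model.predict([[9, 10]]))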
6. Lasso regression
Below is an example of lasso regression using a sample dataset of housing prices in a city. The goal is to predict the price of a house based on its size (in square feet) and number of bedrooms.
Size (sqft) | Bedrooms | Price ($) |
---|---|---|
2,000 | 3 | 300,000 |
1,500 | 2 | 200,000 |
3,000 | 4 | 400,000 |
1,200 | 3 | 250,000 |
2,500 | 4 | 350,000 |
Here is the Python code for implementing lasso regression using the sample data:
import pandas as pd
from sklearn.linear_model import Lasso
# Load the data into a pandas DataFrame
df = pd.read_csv('housing_prices.csv')
# Define the predictor variables and the response variable
X = df[['Size (sqft)', 'Bedrooms']]
y = df['Price ($)']
# Fit the lasso regression model to the data
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
# Make predictions using the lasso model
predictions = lasso.predict(X)
# Print the model's coefficients
print(lasso.coef_)
In this example, the lasso model is fit to the data using an alpha value of 0.1, which determines the strength of the regularization term. The model is then used to make predictions for the response variable (housing prices) based on the predictor variables (size and number of bedrooms). Finally, the model's coefficients are printed, which indicate the importance of each predictor variable in the model.
Lasso regression is useful in situations where there may be a large number of predictor variables and we want to select only the most important ones for the model. The regularization term helps to shrink the coefficients of the less important variables towards zero, effectively eliminating them from the model. This can improve the interpretability and generalizability of the model.
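Similarly, the alpha value does not have to be fixed by hand; LassoCV can select it by cross-validation. A minimal sketch, assuming the Size and Price columns have been read in as plain numbers (no thousands separators):
from sklearn.linear_model import LassoCV
# Let cross-validation choose the regularization strength over an automatic grid of alphas
lasso_cv = LassoCV(cv=3)
lasso_cv.fit(X, y)
print(lasso_cv.alpha_)
print(lasso_cv.coef_)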
7. Elastic net regression
Here is an example of elastic net regression using Python. First, the sample data in tabular form:
predictor_variable_1 | predictor_variable_2 | predictor_variable_3 | response_variable |
---|---|---|---|
0.5 | 0.7 | 0.3 | 0.6 |
0.8 | 0.2 | 0.9 | 0.7 |
0.1 | 0.5 | 0.7 | 0.3 |
0.3 | 0.6 | 0.4 | 0.5 |
In this example, we are using elastic net regression to predict the response_variable based on the three predictor variables. The alpha parameter in the ElasticNet model controls the amount of regularization, and the l1_ratio parameter controls the balance between the L1 and L2 regularization terms. In this example, we set alpha to 0.1 and l1_ratio to 0.5, which means that the model will use a combination of L1 and L2 regularization. The model is then fit to the training data using the fit() method, and the mean absolute error is used to evaluate the model's performance on the test set.
First, we will start by importing the necessary libraries and the sample data:
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
# load the sample data
data = pd.read_csv('sample_data.csv')
Next, we will split the data into a training set and a test set:
# split the data into a training set and a test set
X = data.drop('response_variable', axis=1)
y = data['response_variable']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Then, we will fit the elastic net regression model to the training data:
# fit the elastic net model to the training data
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
Finally, we can use the model to make predictions on the test set and evaluate its performance:
# make predictions on the test set
predictions = model.predict(X_test)
# evaluate the model's performance
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_test, predictions))
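The alpha and l1_ratio values above were fixed by hand; ElasticNetCV can choose both by cross-validation. A minimal sketch (with only a handful of rows the result is purely illustrative):
from sklearn.linear_model import ElasticNetCV
# Search over a few L1/L2 mixes and an automatic grid of alphas
cv_model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=3)
cv_model.fit(X_train, y_train)
print(cv_model.alpha_, cv_model.l1_ratio_)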
8. Stepwise regression
Here is an example of stepwise regression using the scikit-learn library in Python. scikit-learn does not provide a dedicated stepwise regression class, but its SequentialFeatureSelector performs forward or backward stepwise selection of predictors, which we can pair with an ordinary linear regression model.
First, we will import the necessary libraries and generate some sample data:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, random_state=0)
This generates 100 samples with 10 predictor variables and a continuous response variable.
Next, we will split the data into training and test sets and standardize the predictor variables:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Now we can perform backward stepwise selection using the SequentialFeatureSelector class from scikit-learn and fit a linear regression model on the selected predictors:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
# Select predictors by backward stepwise elimination, keeping 5 of the 10 columns
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, direction='backward')
selector.fit(X_train, y_train)
# Keep only the selected columns
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
# Fit the final model on the selected predictors
model = LinearRegression()
model.fit(X_train_sel, y_train)
The direction parameter specifies whether variables are added (forward) or removed (backward) at each step, and the n_features_to_select parameter specifies how many predictors to keep.
We can then use the model to make predictions on the test set:
y_pred = model.predict(X_test_sel)
Finally, we can evaluate the performance of the model using metrics like mean squared error:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
This will print out the mean squared error of the model on the test set.
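Since the point of stepwise selection is to identify which predictors survive, it is also worth inspecting which of the ten columns the selector kept:
# Boolean mask over the original columns: True means the predictor was kept
print(selector.get_support())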
Here is a sample of the data in tabular form:
Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4 | Predictor 5 | Predictor 6 | Predictor 7 | Predictor 8 | Predictor 9 | Predictor 10 | Response |
---|---|---|---|---|---|---|---|---|---|---|
0.26 | -0.14 | -0.13 | -0.38 | -0.06 | -0.33 | -0.28 | 0.44 | -0.3 | -0.06 | -89.77 |
-0.22 | -0.14 | -0.11 | -0.43 | -0.14 | -0.23 | -0.3 | -0.31 | -0.28 | -0.06 | -93.65 |
0.17 | -0.1 | -0.17 | -0.39 | -0.13 | -0.37 | -0.34 | -0.03 | -0.3 | -0.06 | -80.85 |
-0.34 | -0.2 | -0.15 | -0.34 | -0.11 | -0.32 | -0.29 | -0.4 | -0.27 | -0.06 | -102.47 |
0.34 | -0.12 | -0.17 | -0.34 | -0.11 | -0.34 | -0.27 | 0.03 | -0.3 | -0.06 | -79.15 |
-0.13 | -0.2 | -0.14 | -0.41 | -0.11 | -0.32 | -0.29 | -0.32 | -0.27 | -0.06 | -96.57 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9. Multivariate adaptive regression splines (MARS)
Here is an example of using the Multivariate adaptive regression splines (MARS) model in Python with sample data:
First, we need to install the py-earth package, which provides the MARS model in Python. Note that py-earth is no longer actively maintained, so on newer Python versions you may need to install it from its GitHub repository (scikit-learn-contrib/py-earth) rather than from PyPI:
pip install py-earth
Next, we will import the necessary libraries and load the sample data:
import pandas as pd
from pyearth import Earth
# Load sample data from a CSV file
df = pd.read_csv('sample_data.csv')
The sample data might look something like this:
X1 | X2 | X3 | Y |
---|---|---|---|
5 | 3 | 1 | 7 |
3 | 2 | 4 | 8 |
8 | 1 | 2 | 10 |
2 | 6 | 3 | 9 |
1 | 8 | 6 | 11 |
We can then fit the MARS model to the data using the Earth() function from the py-earth package:
# Create the MARS model
mars_model = Earth()
# Fit the model to the data
mars_model.fit(df[['X1', 'X2', 'X3']], df['Y'])
We can then make predictions using the predict() function:
# Make predictions using the model
predictions = mars_model.predict(df[['X1', 'X2', 'X3']])
Finally, we can evaluate the performance of the model using a metric such as mean squared error:
from sklearn.metrics import mean_squared_error
# Calculate the mean squared error of the predictions
mse = mean_squared_error(df['Y'], predictions)
print(f'Mean squared error: {mse}')
This is a simple example of using the MARS model with sample data, but keep in mind that this model is particularly useful for modeling complex, non-linear relationships, so it may be necessary to tune the model parameters or transform the data in order to get good results.
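If you want to see the piecewise-linear structure the model chose, the fitted Earth object can describe its basis functions; a minimal sketch, assuming the summary() method provided by py-earth:
# Print the hinge (basis) functions and coefficients selected by the MARS model
print(mars_model.summary())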
10. Random forest regression
Here is an example of random forest regression using sample data in tabular form with Python code and an explanation:
Sample Data:
X1 | X2 | Y |
---|---|---|
1 | 2 | 3 |
2 | 3 | 4 |
3 | 4 | 5 |
4 | 5 | 6 |
5 | 6 | 7 |
Explanation: In this example, we have a dataset with two predictor variables (X1 and X2) and a response variable (Y). We want to use random forest regression to model the relationship between X1, X2, and Y, and then predict the value of Y for new data.
Python Code:
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X = [[1,2], [2,3], [3,4], [4,5], [5,6]]
y = [3, 4, 5, 6, 7]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Train the model using the training data
model = RandomForestRegressor()
model.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = model.predict(X_test)
# Evaluate the model's performance
score = model.score(X_test, y_test)
print("Model score:", score)
Explanation: In the code above, we first import the necessary libraries for random forest regression and splitting the data into training and testing sets. Then, we split the data into training and testing sets using the train_test_split function. Next, we create a random forest regression model and fit it to the training data using the fit function. After that, we use the model to make predictions on the testing data using the predict function. Finally, we evaluate the model's performance by calculating the score on the testing data using the score function (for regression models this is the R-squared value). The output of this code will be the model's score, which is a measure of how well the model was able to predict the response variable based on the predictor variables; with only five rows of sample data, this score will vary considerably from run to run.
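Random forests also report how much each predictor contributed to the trees' splits, and the number of trees is the main knob to tune. A minimal sketch continuing from the code above (again, with five rows the numbers are purely illustrative):
# Relative importance of X1 and X2 in the fitted forest
print(model.feature_importances_)
# More trees generally give more stable predictions at the cost of training time
model_100 = RandomForestRegressor(n_estimators=100, random_state=0)
model_100.fit(X_train, y_train)
print(model_100.score(X_test, y_test))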