ANCOVA - Machine Learning
Rutik Bhoyar
Posted on March 17, 2021
What is ANCOVA?
ANCOVA is a blend of analysis of variance (ANOVA) and regression. It is similar to factorial ANOVA, in that it can tell you what additional information you can get by considering one independent variable (factor) at a time, without the influence of the others. It can be used as:
- An extension of multiple regression to compare multiple regression lines,
- An extension of analysis of variance.
Although ANCOVA is usually used when there are differences between your baseline groups (Senn, 1994; Overall, 1993), it can also be used in pretest/posttest analysis when regression to the mean affects your posttest measurement (Bonate, 2000). The technique is also common in non-experimental research (e.g. surveys) and for quasi-experiments (when study participants can’t be assigned randomly). However, this particular application of ANCOVA is not always recommended (Vogt, 1999).
Multiple Regression Analysis
Multiple regression analysis is used to see if there is a statistically significant relationship between sets of variables. It’s used to find trends in those sets of data.
Multiple regression analysis is almost the same as simple linear regression. The only difference between simple linear regression and multiple regression is in the number of predictors (“x” variables) used in the regression.
Simple regression analysis uses a single x variable for each dependent “y” variable. For example: (x1, Y1).
Multiple regression uses multiple “x” variables for each independent variable: (x1)1, (x2)1, (x3)1, Y1).
In one-variable linear regression, you would input one dependent variable (i.e. “sales”) against an independent variable (i.e. “profit”). But you might be interested in how different types of sales effect the regression. You could set your X1 as one type of sales, your X2 as another type of sales and so on.
Extension of Multiple Regression
When used as an extension of multiple regression, ANCOVA can test all of the regression lines to see which have different Y intercepts as long as the slopes for all lines are equal.
Like regression analysis, ANCOVA enables you to look at how an independent variable acts on a dependent variable. ANCOVA removes any effect of covariates, which are variables you don’t want to study. For example, you might want to study how different levels of teaching skills affect student performance in math; It may not be possible to randomly assign students to classrooms. You’ll need to account for systematic differences between the students in different classes (e.g. different initial levels of math skills between gifted and mainstream students).
Example
You might want to find out if a new drug works for depression. The study has three treatment groups and one control group. A regular ANOVA can tell you if the treatment works. ANCOVA can control for other factors that might influence the outcome. For example: family life, job status, or drug use.
Extension of ANOVA
As an extension of ANOVA, ANCOVA can be used in two ways (Leech et. al, 2005):
To control for covariates (typically continuous or variables on a particular scale) that aren’t the main focus of your study.
To study combinations of categorical and continuous variables, or variables on a scale as predictors. In this case, the covariate is a variable of interest (as opposed to one you want to control for).
Within-Group Variance
ANCOVA can explain within-group variance. It takes the unexplained variances from the ANOVA test and tries to explain them with confounding variables (or other covariates). You can use multiple possible covariates. However, more you enter, the fewer degrees of freedom you’ll have. Entering a weak covariate isn’t a good idea as it will reduce the statistical power. The lower the power, the less likely you’ll be able to rely on the results from your test. Strong covariates have the opposite effect: it can increase the power of your test.
Assumptions for ANCOVA
Assumptions are basically the same as the ANOVA assumptions. Check that the following are true before running the test:
Independent variables (minimum of two) should be categorical variables.
The dependent variable and covariate should be continuous variables (measured on an interval scale or ratio scale.)
Make sure observations are independent. In other words, don’t put people into more than one group.
Software can usually check the following assumptions.
Normality: the dependent variable should be roughly normal for each of category of independent variables.
Data should show homogeneity of variance.
The covariate and dependent variable (at each level of independent variable) should be linearly related.
Your data should be homoscedastic of Y for each value of X.
The covariate and the independent variable shouldn’t interact. In other words, there should be homogeneity of regression slopes.
Multiple regression with categorical independent variables or factors: Analysis of covariance (ANCOVA)
Analysis of covariance (ANCOVA) is a linear model that blends ANOVA and linear regression. ANCOVA evaluates whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) often called a treatment, while statistically controlling for the effects of other quantitative or continuous variables that are not of primary interest, known as covariates (CV).
Now lets understand this with the help of example..
Step-1: Reading the dataset and importing required libraries.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
try:
df = pd.read_csv("../datasets/salary_table.csv")
except:
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
df = pd.read_csv(url)
print(df.head())
Step-2: Calculating Normality Test p-value.
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
lm = smf.ols('salary ~ experience', df).fit()
df["residuals"] = lm.resid
print("Normality test p-value %.5f" % sms.jarque_bera(lm.resid)[1])
ax = sns.displot(df, x='residuals', kind="kde", fill=True)
ax = sns.displot(df, x='residuals', kind="kde", hue='management', fill=True)
After the output you can see that, Normality assumption of the residuals can be rejected (p-value < 0.05). There is an effect of the "management" factor, take it into account.
Step-3: One-way ANCOVA
ANOVA: one categorical independent variable, i.e. one factor.
ANCOVA: ANOVA with some covariates.
import statsmodels.api as sm
oneway = smf.ols('salary ~ management + experience', df).fit()
df["residuals"] = oneway.resid
sns.displot(df, x='residuals', kind="kde", fill=True)
print(sm.stats.anova_lm(oneway, typ=2))
print("Normality test p-value %.3f" % sms.jarque_bera(oneway.resid)[1])
Distribution of residuals is still not normal but closer to normality. Both management and experience are significantly associated with salary.
Step-4:
Two-way AN(C)OVA
Ancova with two categorical independent variables, i.e. two factors.
twoway = smf.ols('salary ~ education + management + experience', df).fit()
df["residuals"] = twoway.resid
sns.displot(df, x='residuals', kind="kde", fill=True)
print(sm.stats.anova_lm(twoway, typ=2))
print("Normality test p-value %.3f" % sms.jarque_bera(twoway.resid)[1])
Normality assumption cannot be rejected. Assume it. Education, management and experience are significantly associated with salary.
Step-5:
Comparing two nested models
oneway is nested within twoway. Comparing two nested models tells us if the additional predictors (i.e. education) of the full model significantly decrease the residuals. Such comparison can be done using an 𝐹 -test on residuals:
print(twoway.compare_f_test(oneway)) # return F, pval, df
twoway is significantly better than one way
print(twoway.model.data.param_names)
print(twoway.model.data.exog[:10, :])
Step-6: Contrasts and post-hoc tests
Post hoc tests are an integral part of ANOVA. When you use ANOVA to test the equality of at least three group means, statistically significant results indicate that not all of the group means are equal. However, ANOVA results do not identify which particular differences between pairs of means are significant. Use post hoc tests to explore differences between multiple group means while controlling the experiment-wise error rate.
ttest_exp = twoway.t_test([0, 0, 0, 0, 1])
ttest_exp.pvalue, ttest_exp.tvalue
print(ttest_exp)
# Alternatively, you can specify the hypothesis tests using a string
twoway.t_test('experience')
# Post-hoc is salary of Master different salary of Ph.D?
# ie. t-test salary of Master = salary of Ph.D.
print(twoway.t_test('education[T.Master] = education[T.Ph.D]'))
The code and the dataset is derived from the notebook of duchesnay. Hope it will help you because there is very few material regarding ANCOVA on the internet.
Posted on March 17, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.