ANOVA: Building and Understanding ANOVA in Python 🐍📶
Anand
Posted on June 19, 2024
ANOVA, or Analysis of Variance, is a statistical technique used to determine if there are any statistically significant differences between the means of three or more independent (unrelated) groups. It helps to test hypotheses about differences among group means and is especially useful when comparing multiple groups.
Key Concepts:
Groups or Levels: These are the different categories or treatments being compared. For example, if you're testing the effect of different diets on weight loss, each diet is a group.
Within-Group Variance: This is the variability of data points within each group. It measures how much the data points in a single group deviate from the group mean.
Between-Group Variance: This is the variability between the group means. It measures how much the group means differ from the overall mean.
F-Statistic: ANOVA produces an F-statistic, which is a ratio of the between-group variance to the within-group variance. A higher F-statistic suggests a greater likelihood that the observed differences between group means are real and not due to random chance.
Steps in ANOVA:
1. Formulate Hypotheses:
   - Null Hypothesis (H₀): The means of all groups are equal.
   - Alternative Hypothesis (H₁): At least one group mean is different from the others.
2. Calculate Group Means and the Overall Mean.
3. Calculate Within-Group and Between-Group Variance.
4. Compute the F-Statistic:
   F = Between-Group Variance / Within-Group Variance
5. Compare the F-Statistic to a Critical Value: This critical value is determined by the degrees of freedom and the chosen significance level (often 0.05). If the F-statistic is larger than the critical value, reject the null hypothesis.
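The critical-value comparison in the last step can be sketched with `scipy.stats.f.ppf`, the inverse CDF of the F-distribution. The degrees of freedom and F-statistic below are illustrative values for a three-group, fifteen-observation design (they happen to match the worked example later in this post):

```python
from scipy.stats import f

alpha = 0.05        # chosen significance level
df_between = 2      # k - 1 for k = 3 groups
df_within = 12      # N - k for N = 15 total observations

# Critical value: the F value that leaves alpha probability in the upper tail
f_critical = f.ppf(1 - alpha, df_between, df_within)
print(f"Critical value: {f_critical:.2f}")

# Reject H0 when the observed F-statistic exceeds the critical value
f_statistic = 19.01  # illustrative observed value
print("Reject H0" if f_statistic > f_critical else "Fail to reject H0")
```

Comparing against the critical value and comparing the p-value against alpha are equivalent decisions; the full example below uses the p-value route.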
Assumptions of ANOVA:
- Independence: The samples must be independent of each other.
- Normality: The data in each group should be approximately normally distributed.
- Homogeneity of Variances: The variance among the groups should be approximately equal.
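These assumptions can be checked before running ANOVA. A minimal sketch using scipy's Shapiro-Wilk test (normality) and Levene's test (homogeneity of variances), applied to the same three score groups used in the example below:

```python
import numpy as np
from scipy.stats import shapiro, levene

# The three score groups from the worked example
group1 = np.array([85, 90, 88, 92, 87])
group2 = np.array([78, 85, 80, 83, 82])
group3 = np.array([90, 92, 95, 91, 89])

# Normality: Shapiro-Wilk test per group
# (a p-value above 0.05 gives no evidence against normality)
for name, g in [("group1", group1), ("group2", group2), ("group3", group3)]:
    stat, p = shapiro(g)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Homogeneity of variances: Levene's test across all groups
levene_stat, levene_p = levene(group1, group2, group3)
print(f"Levene's test p = {levene_p:.3f}")
```

If the variances look clearly unequal, Welch's ANOVA is a common alternative; independence, however, comes from the study design and cannot be tested after the fact.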
Types of ANOVA:
- One-Way ANOVA: Used when comparing the means of three or more independent groups based on one factor.
- Two-Way ANOVA: Used when comparing the means based on two factors and can also test for interaction effects between the factors.
- Repeated Measures ANOVA: Used when the same subjects are used for each treatment (e.g., a longitudinal study).
Example Scenario:
We have test scores from three different teaching methods. We want to determine if there is a statistically significant difference between the means of these three groups.
import numpy as np
# Example data: test scores from three different teaching methods
group1 = np.array([85, 90, 88, 92, 87])
group2 = np.array([78, 85, 80, 83, 82])
group3 = np.array([90, 92, 95, 91, 89])
# Combine all groups into a single array
all_data = np.concatenate([group1, group2, group3])
# Calculate group means and overall mean
mean_group1 = np.mean(group1)
mean_group2 = np.mean(group2)
mean_group3 = np.mean(group3)
mean_overall = np.mean(all_data)
# Calculate sum of squares between groups (SSB)
ssb = (len(group1) * (mean_group1 - mean_overall)**2 +
       len(group2) * (mean_group2 - mean_overall)**2 +
       len(group3) * (mean_group3 - mean_overall)**2)
# Calculate sum of squares within groups (SSW)
ssw = (np.sum((group1 - mean_group1)**2) +
       np.sum((group2 - mean_group2)**2) +
       np.sum((group3 - mean_group3)**2))
# Calculate degrees of freedom
df_between = 3 - 1 # Number of groups - 1
df_within = len(all_data) - 3 # Total number of observations - Number of groups
# Calculate mean squares
ms_between = ssb / df_between
ms_within = ssw / df_within
# Calculate the F-statistic
f_statistic = ms_between / ms_within
# Display results
print("ANOVA Results")
print("=============")
print(f"Sum of Squares Between (SSB): {ssb:.2f}")
print(f"Sum of Squares Within (SSW): {ssw:.2f}")
print(f"Degrees of Freedom Between: {df_between}")
print(f"Degrees of Freedom Within: {df_within}")
print(f"Mean Square Between (MSB): {ms_between:.2f}")
print(f"Mean Square Within (MSW): {ms_within:.2f}")
print(f"F-Statistic: {f_statistic:.2f}")
# To determine the p-value, we need to use the F-distribution
from scipy.stats import f
# Calculate the p-value
p_value = 1 - f.cdf(f_statistic, df_between, df_within)
print(f"P-Value: {p_value:.4f}")
# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the group means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the group means.")
Output:
ANOVA Results
=============
Sum of Squares Between (SSB): 252.13
Sum of Squares Within (SSW): 79.60
Degrees of Freedom Between: 2
Degrees of Freedom Within: 12
Mean Square Between (MSB): 126.07
Mean Square Within (MSW): 6.63
F-Statistic: 19.01
P-Value: 0.0002
Reject the null hypothesis: There is a significant difference between the group means.
Explanation:
- Data Preparation: Three groups of test scores are defined and combined into a single array.
- Mean Calculation: Calculate the means for each group and the overall mean.
- Sum of Squares Calculation:
  - SSB (Sum of Squares Between): Measures how far the group means deviate from the overall mean, weighted by group size.
  - SSW (Sum of Squares Within): Measures how far the data points in each group deviate from their own group mean.
- Degrees of Freedom: Calculated for both between-group and within-group variations.
- Mean Squares: Compute the mean squares by dividing the sum of squares by the respective degrees of freedom.
- F-Statistic: Ratio of the mean square between groups to the mean square within groups.
- P-Value: Using the F-distribution to determine the significance of the F-statistic.
- Conclusion: Based on the p-value, decide whether to reject the null hypothesis.
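As a cross-check, `scipy.stats.f_oneway` performs the same one-way ANOVA in a single call and should reproduce the F-statistic and p-value computed by hand above:

```python
import numpy as np
from scipy.stats import f_oneway

# Same test scores as in the worked example
group1 = np.array([85, 90, 88, 92, 87])
group2 = np.array([78, 85, 80, 83, 82])
group3 = np.array([90, 92, 95, 91, 89])

# f_oneway returns the F-statistic and the p-value directly
f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F-Statistic: {f_stat:.2f}")
print(f"P-Value: {p_value:.4f}")
```

Building the statistic manually, as this post does, is a good way to understand what the one-liner computes; in practice the library call is the less error-prone choice.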