Statistics: Crash Course for Data Science. Part I
Maksim Karyagin
Posted on June 21, 2023
Introduction
In the vast landscape of data science education, finding a concise and well-structured course on statistics can be a daunting task. However, fear not, for this course aims to provide just that — a clear and comprehensive journey through the foundations of statistics for data science.
Through this course, you'll acquire the skills to analyze data, extract meaningful insights, and make informed decisions. Whether you're starting your data science journey or looking to strengthen your statistical foundation, this course provides the stepping stones to success.
Foundations
In this first part, we will establish the essential foundations of statistics and their relevance in the field of data science. By comprehending these fundamental principles, you'll gain the necessary groundwork to explore advanced statistical techniques and their practical applications. Let's embark on this journey by delving into key definitions and concepts.
Sample and Variables
In statistical analysis, a sample is a subset of data collected from a larger population. It represents a smaller but representative portion of the entire population. Working with samples allows us to draw inferences about the population as a whole.
Understanding the properties and characteristics of a sample is therefore crucial for statistical analysis.
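To make this concrete, here is a minimal sketch (with made-up numbers) that draws a random sample from a simulated population and compares the sample mean with the population mean:
import numpy as np
# Hypothetical population: 10,000 simulated ages
rng = np.random.default_rng(42)
population = rng.integers(10, 80, size=10_000)
# Draw a random sample of 100 observations without replacement
sample = rng.choice(population, size=100, replace=False)
print("Population mean:", population.mean())
print("Sample mean:", sample.mean())
With a reasonably sized sample, the two means are usually close, which is exactly why samples let us reason about populations.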
Measures of central tendency are usually the first step in describing a sample. They provide a way to summarize and describe the center, or typical value, of a dataset, help us understand the distribution of the data, and make it easier to compare different groups or variables.
There are three commonly used measures of central tendency: the mean, median, and mode.
1) The mean is calculated by summing up all the values in a dataset and dividing the sum by the total number of values. It represents the average value of the dataset.
2) The median is the middle value in a dataset when it is sorted in ascending or descending order. It divides the dataset into two equal halves.
3) The mode is the most frequently occurring value in a dataset. It represents the value that appears with the highest frequency.
Example:
Let's say we have a dataset containing the ages of the main characters in the Harry Potter movies. Now, we will proceed to calculate the measures of central tendency for this dataset.
import numpy as np
from scipy.stats import mode
# Sample data: ages of main characters in the Harry Potter movies
ages = np.array([18, 17, 16, 17, 18, 19, 17, 18, 17, 16])
# Calculate the mean age
mean_age = np.mean(ages)
# Calculate the median age
median_age = np.median(ages)
# Calculate the mode (most frequent age)
mode_age = mode(ages, keepdims=False)
# And our results
print("Mean Age:", mean_age)
print("Median Age:", median_age)
print("Mode Age:", mode_age.mode)
By considering measures of central tendency, we can gain a better understanding of the typical or central values within a dataset, helping us summarize and analyze the data effectively.
Standardization and Z-Transform
Standardization is a technique used to transform variables to a common scale, making them comparable. The Z-transform is one method of standardization that converts a variable into a standard normal distribution with a mean of 0 and a standard deviation of 1.
Example:
Let's consider a dataset of students' test scores. We can standardize the scores using the Z-transform.
import numpy as np
from sklearn.preprocessing import StandardScaler
# Sample data: students' test scores
scores = np.array([75, 80, 85, 90, 95])
# Standardize the scores
scaler = StandardScaler()
standardized_scores = scaler.fit_transform(scores.reshape(-1, 1))
print("Standardized Scores:", standardized_scores)
Standardization allows us to make meaningful comparisons between variables with different scales or units.
Distributions and Normal Distribution
Distributions lie at the core of statistical analysis, characterizing the range of values and their associated probabilities within a dataset. Proficiency in understanding distributions is pivotal since many statistical techniques presuppose specific distributional properties.
Among the multitude of distributions, the normal distribution—also known as the Gaussian distribution—occupies a prominent position due to its prevalence across various domains.
Example:
Let's generate a random dataset following a normal distribution with a mean of 0 and a standard deviation of 1.
import numpy as np
import matplotlib.pyplot as plt
# Generate random data from a normal distribution (mean 0, standard deviation 1)
data = np.random.normal(0, 1, 1000)
# Plotting the distribution
plt.hist(data, bins=30)
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Normal Distribution')
plt.show()
Understanding the characteristics of the normal distribution is essential as many statistical methods rely on its properties.
Central Limit Theorem and Confidence Interval
The Central Limit Theorem (CLT) stands as a fundamental pillar of statistical theory. It states that the sum or average of a large number of independent and identically distributed random variables will converge to a normal distribution, regardless of the original distribution. This theorem forms the bedrock of numerous statistical inference techniques and enables us to make robust conclusions about population parameters based on sample statistics.
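A quick simulation illustrates the idea (the exponential distribution and sample size here are arbitrary choices, just to show the effect): although individual exponential values are strongly skewed, the means of many independent samples form a roughly bell-shaped histogram.
import numpy as np
import matplotlib.pyplot as plt
# Draw 1,000 independent samples of size 50 from a skewed (exponential) distribution
rng = np.random.default_rng(0)
sample_means = rng.exponential(scale=2.0, size=(1000, 50)).mean(axis=1)
# The distribution of sample means is approximately normal, as the CLT predicts
plt.hist(sample_means, bins=30)
plt.xlabel('Sample mean')
plt.ylabel('Frequency')
plt.title('Distribution of Sample Means')
plt.show()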
One vital tool stemming from the CLT is the confidence interval—a range of values within which we estimate a population parameter to lie with a specified level of confidence. Confidence intervals provide an understanding of the uncertainty surrounding our estimates, making them invaluable in drawing meaningful insights from data.
Example:
Let's consider the heights of students in a school. We can calculate the confidence interval for the population mean height using the CLT.
import numpy as np
import scipy.stats as stats
# Sample data: heights (in cm) of students
heights = np.array([165, 170, 175, 160, 155, 180, 185, 170, 168, 172])
# Calculate the 95% confidence interval for the mean (scale = standard error of the mean)
confidence_interval = stats.norm.interval(0.95, loc=np.mean(heights), scale=stats.sem(heights))
print("Confidence Interval:", confidence_interval)
The confidence interval provides a range of values within which the true population parameter is likely to fall. It allows us to estimate the precision and reliability of our sample data.
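As a side note, with only ten observations many practitioners would use the t-distribution rather than the normal one; here is a minimal sketch of that variant on the same data (reusing heights and the imports from the example above):
# t-based interval: slightly wider and more appropriate for small samples
t_interval = stats.t.interval(0.95, df=len(heights) - 1, loc=np.mean(heights), scale=stats.sem(heights))
print("t-based Confidence Interval:", t_interval)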
P-value
The p-value is a statistical measure that helps us determine the strength of evidence against a null hypothesis. It quantifies the probability of obtaining the observed data or more extreme data if the null hypothesis is true. The p-value is a crucial component of hypothesis testing in statistics.
In hypothesis testing (A/B testing included), we start with a null hypothesis (H0), which represents the assumption of no significant difference or effect. The alternative hypothesis (H1) contradicts the null hypothesis and suggests that there is a significant difference or effect present in the data.
The p-value allows us to make an inference about the null hypothesis. If the p-value is small (typically below a predetermined significance level, such as 0.05), we have strong evidence to reject the null hypothesis in favor of the alternative hypothesis.
To calculate the p-value, we compare the test statistic (which depends on the test being conducted) to the distribution of the test statistic under the null hypothesis. The p-value represents the probability of obtaining a test statistic as extreme as or more extreme than the observed test statistic, assuming the null hypothesis is true.
Example:
Let's perform a t-test to compare the heights of male and female characters in the Harry Potter movies. The null hypothesis (H0) is that there is no significant difference in the heights of male and female characters. The alternative hypothesis (H1) is that there is a significant difference.
import numpy as np
import scipy.stats as stats
# Sample data: heights (in cm)
male_heights = np.array([170, 175, 180, 185, 190])
female_heights = np.array([160, 165, 170, 175, 180])
# Perform an independent two-sample t-test
t_statistic, p_value = stats.ttest_ind(male_heights, female_heights)
print("p-value:", p_value)
In this example, if the resulting p-value is less than the significance level (e.g., 0.05), we can reject the null hypothesis and conclude that there is a significant difference in the heights of male and female characters in the Harry Potter movies.
The p-value helps us make informed decisions about the statistical significance of our findings. By considering it alongside other relevant factors, we can draw meaningful conclusions and make data-driven decisions.
See you next time
By exploring the foundational concepts of statistics in this first part of the series, including samples and variables, standardization, distributions, the central limit theorem, confidence intervals, and the p-value, you have established a solid foundation for further exploration into advanced statistical techniques.
In the second (and final) part of the series, we will delve deeper into Student's t-test, analysis of variance (ANOVA), correlation, and regression.
Stay tuned to uncover the surprising link between the T-test and the legendary Guinness Brewery.