Statistical Essentials for Data Analysts: A Beginner's Guide
Anand
Posted on March 6, 2024
Understanding Basic Statistical Terminologies with Python
In this post, we'll explore some fundamental statistical concepts using Python and explain them in detail. We'll be working with a dataset of student scores in an exam, and we'll use Python's statistics module for the calculations and the matplotlib library for visualization.
Let's start by importing the necessary libraries and defining our dataset:
import matplotlib.pyplot as plt
import statistics
# data of student scores in an exam
student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]
Mean, Median, and Mode
The mean is the average value of the dataset, while the median is the middle value when the data is arranged in ascending order. With an even number of values, as here, the median is the average of the two middle values. The mode is the most frequent value in the dataset.
mean_score = statistics.mean(student_scores)
median_score = statistics.median(student_scores)
mode_score = statistics.mode(student_scores)
print("Mean:", mean_score) # Mean: 83.7
print("Median:", median_score) # Median: 85.0
print("Mode:", mode_score) # Mode: 85
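The mean and the median react very differently to extreme values, which is often why analysts report both. The sketch below illustrates this with a hypothetical variant of the dataset in which the lowest score (76) is replaced by 5; that variant is an assumption made purely for illustration and is not part of the original data. It also shows statistics.multimode (Python 3.8+), which returns every tied value, whereas statistics.mode returns only the first one it encounters.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# Hypothetical variant (illustration only): the score 76 becomes 5.
scores_with_outlier = [5 if s == 76 else s for s in student_scores]

# How far does each measure move when one extreme value is introduced?
mean_shift = statistics.mean(student_scores) - statistics.mean(scores_with_outlier)
median_shift = statistics.median(student_scores) - statistics.median(scores_with_outlier)
print("Mean shift:", round(mean_shift, 1))  # Mean shift: 7.1
print("Median shift:", median_shift)        # Median shift: 0.0

# multimode() lists all values tied for most frequent, in order of first appearance.
tied_modes = statistics.multimode([78, 78, 85, 85, 90])
print("Modes:", tied_modes)                 # Modes: [78, 85]
```

A single extreme score pulls the mean down by about 7 points while the median does not move at all.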
Standard Deviation and Variance
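Standard deviation measures how far data points typically lie from the mean, while variance is the average of the squared differences from the mean; the standard deviation is the square root of the variance. Note that statistics.stdev and statistics.variance compute the sample versions, which divide by n - 1 rather than n.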
std_deviation = statistics.stdev(student_scores)
variance = statistics.variance(student_scores)
print("Standard Deviation:", std_deviation) # Standard Deviation: 5.478 (rounded)
print("Variance:", variance) # Variance: 30.011 (rounded)
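To make the difference between sample and population statistics concrete, here is a minimal sketch that computes both variances by hand and checks them against the statistics module. The variable names (squared_diffs, sample_variance, population_variance) are chosen for this illustration only.

```python
import math
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

n = len(student_scores)
mean = statistics.mean(student_scores)

# Sum of squared differences from the mean.
squared_diffs = sum((s - mean) ** 2 for s in student_scores)

sample_variance = squared_diffs / (n - 1)  # what statistics.variance computes
population_variance = squared_diffs / n    # what statistics.pvariance computes

# The hand-computed values should agree with the library functions.
assert math.isclose(sample_variance, statistics.variance(student_scores))
assert math.isclose(population_variance, statistics.pvariance(student_scores))

print("Sample variance:", round(sample_variance, 3))               # 30.011
print("Population variance:", round(population_variance, 2))       # 27.01
print("Sample std dev:", round(math.sqrt(sample_variance), 3))     # 5.478
print("Population std dev:", round(math.sqrt(population_variance), 3))  # 5.197
```

As a rule of thumb, use the sample versions when the scores are a sample from a larger group, and the population versions (statistics.pvariance, statistics.pstdev) when the list contains every value of interest.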
Range and Quartiles
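The range is the difference between the maximum and minimum values in the dataset. Quartiles are the three cut points (Q1, Q2, Q3) that divide the sorted data into four parts of roughly equal size. Here each of Q1 and Q3 is computed as the median of the lower or upper half of the sorted scores, which is one of several common conventions.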
range_score = max(student_scores) - min(student_scores)
sorted_scores = sorted(student_scores)
# Split into lower and upper halves; for an odd count the middle value is left out of both
q1 = statistics.median(sorted_scores[:len(sorted_scores)//2])
q2 = statistics.median(sorted_scores)
q3 = statistics.median(sorted_scores[(len(sorted_scores) + 1)//2:])
print("Range:", range_score) # Range: 16
print("Q1:", q1) # Q1: 78
print("Q2 (Median):", q2) # Q2 (Median): 85.0
print("Q3:", q3) # Q3: 88
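Quartile conventions differ between tools, so results can vary slightly from the median-of-halves values. As a point of comparison, the sketch below uses statistics.quantiles (available since Python 3.8), which interpolates between data points and supports two methods, "exclusive" (the default) and "inclusive".

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# n=4 asks for quartiles; the result is the list [Q1, Q2, Q3].
exclusive = statistics.quantiles(student_scores, n=4)  # method="exclusive" by default
inclusive = statistics.quantiles(student_scores, n=4, method="inclusive")

print("Exclusive:", exclusive)  # Exclusive: [78.0, 85.0, 88.5]
print("Inclusive:", inclusive)  # Inclusive: [78.5, 85.0, 87.25]
```

All three approaches agree on the median (85.0) but place Q1 and Q3 slightly differently, so it is worth stating which method was used when reporting quartiles.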
Interquartile Range (IQR)
The Interquartile Range (IQR) is the range between the first and third quartiles, measuring the spread of data.
iqr = q3 - q1
print("Interquartile Range (IQR):", iqr) # Interquartile Range (IQR): 10
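One common use of the IQR is flagging potential outliers with the 1.5 × IQR rule (often called Tukey's fences). The following is a minimal sketch of that rule applied to the exam scores; the names lower_fence, upper_fence, and outliers are introduced here for illustration, and the 1.5 multiplier is a convention rather than a fixed requirement.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# Quartiles via the median-of-halves approach (the dataset has an even count).
sorted_scores = sorted(student_scores)
half = len(sorted_scores) // 2
q1 = statistics.median(sorted_scores[:half])
q3 = statistics.median(sorted_scores[half:])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in student_scores if s < lower_fence or s > upper_fence]

print("Fences:", lower_fence, upper_fence)  # Fences: 63.0 103.0
print("Outliers:", outliers)                # Outliers: []
```

No score falls outside the fences, so this dataset has no outliers under this rule.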
Correlation Coefficient
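The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables. It always lies between -1 and 1: values near 1 indicate a strong positive relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship. We'll calculate the correlation coefficient between hours studied and test scores.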
def correlation_coefficient(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance: divide by n so it matches the standard deviations below
    covariance = sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(n)) / n
    std_dev_x = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5
    std_dev_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
    correlation = covariance / (std_dev_x * std_dev_y)
    return correlation
hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]
correlation = correlation_coefficient(hours_studied, test_scores)
print("Correlation between hours studied and test scores:", correlation)
# Correlation between hours studied and test scores: 0.9944 (rounded)
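Because a Pearson correlation can never fall outside the range -1 to 1, an independent cross-check is a cheap way to catch scaling mistakes. Here is a minimal sketch using a hypothetical helper, pearson_r, that works directly from sums of squared deviations so that no division by n is needed at all. It also compares against statistics.correlation where that function is available (Python 3.10+).

```python
import math
import statistics

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

def pearson_r(x, y):
    """Hypothetical helper: Pearson's r from sums of deviations (illustration only)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator.
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(hours_studied, test_scores)
print("Pearson r:", round(r, 4))  # Pearson r: 0.9944

# Sanity check: r must lie in [-1, 1].
assert -1.0 <= r <= 1.0

# Cross-check against the standard library where available (Python 3.10+).
if hasattr(statistics, "correlation"):
    assert math.isclose(r, statistics.correlation(hours_studied, test_scores))
```

A value of about 0.99 indicates a very strong positive linear relationship between hours studied and test scores in this small sample.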
Scatter Plot Visualization
Lastly, we'll visualize the relationship between hours studied and test scores using a scatter plot.
plt.scatter(hours_studied, test_scores)
plt.xlabel('Hours Studied')
plt.ylabel('Test Scores')
plt.title('Hours Studied vs. Test Scores')
plt.show()