Esther Njihia
Posted on January 22, 2023
Overview
The Olympic Games are one of the most prestigious and widely-followed sporting events in the world. With a history dating back over a century, the Olympics have seen countless athletes from all corners of the globe compete for glory and honor. In this project, we will use Python, SQL and Tableau to analyze 120 years of Olympics data sourced from Kaggle.
Data Collection
The data for this analysis was sourced from Kaggle and includes information on athletes, teams, events, and medals. The data was collected for all Summer and Winter Olympics from 1896 to 2016.
Data source: Kaggle
Data Cleaning and Preparation
Before we could begin our analysis, the data needed to be cleaned and prepared for use. This involved removing any duplicate or irrelevant data, correcting any errors or inconsistencies, and formatting the data in a way that would be compatible with our analysis tools.
Data Cleaning with Python
Load data
df = pd.read_csv("data/athlete_events.csv")
Calculate the number of missing values in the data
df.isnull().sum()
We got some missing values with the age, height and weight variables. How about we fix that.
Fill in the age missing values with its mean.
df['Age'].fillna(df['Age'].mean(), inplace= True)
Repeat the same for the height and weight variables.
SQL Analysis
Once the data was cleaned and prepared, we used SQL to extract and analyze the data. We were able to answer questions such as:
1.What is the Gender distribution of athletes across different sports and event?
SELECT Sport, Event, Sex, COUNT(Name) as athlete_count
FROM Athlete
GROUP BY Sport, Event, Sex
ORDER BY athlete_count DESC;
2.Which city and season have the most successful athletes?
SELECT City, Season, COUNT(Medal) as Medal_count
FROM Athlete
WHERE Medal IS NOT NULL
GROUP BY City, Season
ORDER BY Medal_count DESC
3.Which athletes are the most successful in terms of number of medals?
SELECT Name, COUNT(Medal) as medal_count
FROM Athlete
GROUP BY Name
ORDER BY medal_count DESC;
4.Trend analysis to identify any patterns in the number of medals over time
By year
SELECT Year, COUNT(Medal) as medal_count
FROM Athlete
GROUP BY Year
ORDER BY Year ;
5.Use statistical analysis to find factors associated with winning more medals.
- first create a sub query to find the total count of medals per city and season
WITH city_season_medals AS (
SELECT City, Season, COUNT(Medal) AS medal_count
FROM Athlete
GROUP BY City, Season
)
- use statistical methods to find the relationship between city & season and the total number of medals. Fo this instance I used z-score. The higher the z-score the higher the correlation.
SELECT City, Season, medal_count, (medal_count - AVG(medal_count) OVER()) / STDEV(medal_count) OVER() AS z_score
FROM city_season_medals
ORDER BY z_score DESC
These are some of the questions that were answered during the SQL analysis.
Visualization with Tableau
To make the data more accessible and understandable, Tableau was used to create visualizations of our findings. Some of the visualizations we created include:
- A pie chart showing the gender distribution across all sports
- A bar chart showing the most successful athletes over time.
- line charts to show trends and patterns in the number of medals won over time
View the whole Olympics dashboard:Olympics Dashboard
Conclusion
Through this analysis, I was able to gain a deeper understanding of the history of the Olympics and the performances of countries and athletes over the past 120 years. By using SQL and Tableau, I was able to extract valuable insights from the data and present them in a clear and accessible way.
Code and Data
The code and data used for this project is available at Github
By using the above techniques, we can also use the data to predict future results, and also to identify patterns and trends that can be used to improve performance and achieve success in future Olympic games.
Posted on January 22, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.