Data Science Zero to Hero - 1.3: Matplotlib, Seaborn & Jupyter Notebooks
StevenMcGown
Posted on July 27, 2023
You like python programs, don't you Squidward?
*crickets*
I'll be here all week!
Jokes aside, if you're like me, you're getting excited about learning new tools for your Data Science/ML/AI journey. So far we've covered NumPy and Pandas, where we learned how to manipulate, process, and analyze numerical and tabular data. These libraries gave us a solid foundation in handling and preparing data for further analysis or modeling. As we delve into Matplotlib and Seaborn inside of Jupyter Notebooks, we're now stepping into the fascinating world of data visualization. Trust me, it only gets better from here!
Table of Contents
- Line Plot
- Using Pandas with Matplotlib
- Scatter Plot
- Bar Plot
- Pie Plot
- Histogram
- Box Plot
- Violin Plot
- Strip Plot
- Pair Plot
- Distribution Plot
- Count Plot
- Heat Map
Data Visualization with Matplotlib and Seaborn in Jupyter Notebooks
Matplotlib is one of the most widely used libraries for creating static, animated, and interactive visualizations in Python. Its extensive functionality and versatility make it a powerful tool for any data scientist or analyst performing Exploratory Data Analysis (EDA).
Once imported, Matplotlib provides a wide variety of plots and charts to visualize data, from simple line and bar plots to more complex scatter plots and histograms. Whether you're trying to spot trends over time, distributions of data, or relationships between variables, Matplotlib has the flexibility to meet your needs.
Seaborn, while built on Matplotlib, enhances its capabilities and introduces more sophisticated visualization tools. It's designed to work seamlessly with Pandas dataframes and makes creating complex plots from dataframes quite straightforward.
With Seaborn, you can create a range of informative and attractive statistical graphics. Heat maps, violin plots, pair plots, and swarm plots are just a few of the more advanced visualizations available.
Both Matplotlib and Seaborn work exceptionally well in Jupyter Notebooks, a popular open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Jupyter Notebooks provide an interactive and intuitive interface for conducting data analysis and visualization.
To use Matplotlib or Seaborn in Jupyter Notebooks, you simply need to import the required libraries and execute your code. The outputs, including all graphs and plots, are then displayed directly under each code cell, making it easy to view and interpret your results in a structured and clear manner.
Typically when people use these libraries, they do the imports with the following aliases:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Now for the fun. Let's see what these libraries can do.
Line Plot
Let's start with a simple example. A line plot is used to display information as a series of data points connected by straight line segments. It's useful for visualizing data over time, also known as time series data. In this example, we can show the unemployment rate each year. In the code we have two lists with an equal number of values in each, and we plot each unemployment rate against its corresponding year.
year = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
unemployment_rate = [9.8, 12, 8, 7.2, 6.9, 7, 6.5, 6.2, 5.5, 6.3]
plt.plot(year, unemployment_rate)
plt.title('unemployment rate vs year')
plt.xlabel('year')
plt.ylabel('unemployment rate')
plt.show()
Side note: Don't ask me where this data came from, it could be wrong for all I know and should only be used for demonstration purposes.
Using Pandas with Matplotlib
Now that we have taken a stab at using matplotlib, let's load a dataset from a csv file into a pandas dataframe. We can take the information from this dataframe and plot it with a variety of different methods. You can download the dataset that I'm using from here:
https://www.kaggle.com/code/sanjanabasu/tips-dataset/input
Also, recall that above we imported pandas as 'pd':
df = pd.read_csv('tips.csv')
# Print DataFrame
print(df.head())
# Outputs:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Scatter Plot
A scatter plot uses dots to represent values for two different numeric variables. The position of each dot represents the x and y values of a single data point, which is useful for visualizing the relationship between two variables. Many scatter plots simply use one color of dots to illustrate that relationship, but in this example I've shown that you can describe the characteristics of your data points better with a little bit of creativity: color encodes sex and marker shape encodes smoker status.
# Prepare df for plotting
total_bill = df['total_bill']
tip = df['tip']
sex = df['sex']
smoker = df['smoker']
# Create a scatter plot
plt.figure(figsize=(10, 6))
colors = {'Male': 'blue', 'Female': 'red'}
smoker_markers = {'Yes': 'x', 'No': 'o'}
# Plot one point per row so each can get its own color and marker
for i in range(len(total_bill)):
    plt.scatter(total_bill[i], tip[i], c=colors[sex[i]], marker=smoker_markers[smoker[i]], s=100)
# Set plot labels and title
plt.xlabel('Amount Due')
plt.ylabel('Gratuity')
plt.title('Scatter Plot of Amount Due vs. Gratuity')
# Add legend for gender and smoker status
# Empty scatter calls draw nothing but register an entry in the legend
for gender_label, color in colors.items():
    plt.scatter([], [], c=color, label=gender_label)
for smoker_label, marker in smoker_markers.items():
    plt.scatter([], [], marker=marker, label='Smoker: ' + smoker_label)
plt.legend(loc='upper right')
plt.grid(True)
plt.show()
We can see from the plot above that there is a positive correlation between the x and y variables. This positive correlation means that the higher the bill is, the higher the tip tends to be. This makes sense if you think about how many people tip based on a percentage of the bill.
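If you'd rather put a number on that relationship, here's a minimal sketch that computes the Pearson correlation coefficient with NumPy. It only uses the five rows printed by df.head() earlier, so treat the result as a toy illustration rather than a statement about the full dataset. The variable names are my own.

```python
import numpy as np

# The five rows shown by df.head() earlier (not the full dataset)
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation between the two inputs
r = np.corrcoef(total_bill, tip)[0, 1]
print(f"Pearson r on the sample rows: {r:.2f}")
```

A value close to 1 means the two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there's little linear relationship. With the full dataframe loaded, df['total_bill'].corr(df['tip']) gives the same kind of number.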
Bar Plot
Bar plots are used to display and compare the number, frequency or other measure (like mean) for different categories. Each bar's height is proportional to the value it represents. In this plot we can see that there are 4 days recorded, Friday, Saturday, Sunday and Thursday. I suppose we can assume that the person who recorded this data only worked and recorded tips on those days.
# Group the data by 'day' and calculate the average 'total_bill' for each day
average_total_bill_by_day = df.groupby('day')['total_bill'].mean()
# Create the bar plot
plt.bar(average_total_bill_by_day.index, average_total_bill_by_day.values)
plt.xlabel('Day of the Week')
plt.ylabel('Average Total Bill')
plt.title('Average Total Bill by Day of the Week')
plt.show()
Pie Plot
Pie plots represent the size of each item in one data series as a share of the total (out of 100%). We've all seen a pie plot before; it's useful when you want to visualize a percentage breakdown of categories.
# Group the data by 'sex' and calculate the total count for each category
sex_counts = df['sex'].value_counts()
# Create the pie plot
plt.figure(figsize=(6, 6))
plt.pie(sex_counts, labels=sex_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Sex')
plt.axis('equal') # Equal aspect ratio ensures that the pie plot is circular.
plt.show()
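The percentages printed on the slices are just each category's share of the total. As a rough sketch of that arithmetic, here it is on the sex column of the five rows printed by df.head() earlier (so the numbers won't match the full dataset), along with how the autopct format string turns a number into a label.

```python
import pandas as pd

# The 'sex' values from the five rows shown by df.head() (not the full dataset)
sex = pd.Series(['Female', 'Male', 'Male', 'Male', 'Female'])

# Share of each category as a percentage of all rows
shares = sex.value_counts(normalize=True) * 100

# autopct='%1.1f%%' is an old-style format string: one decimal place
# followed by a literal percent sign
for label, pct in shares.items():
    print(label, '%1.1f%%' % pct)
```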
Histogram
Histograms show the distribution of numeric data by dividing the data into bins of equal width. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin. We'll get more into distributions in the future, since they're important to understanding the nature of your data.
You may also notice in the code that the argument 'kde' is set to True. KDE stands for Kernel Density Estimation, which overlays a smooth, continuous curve estimating the data's distribution, so the shape you see depends less on where the bin edges happen to fall.
plt.figure(figsize=(8, 6))
sns.histplot(df['total_bill'], kde=True)
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.title('Total Bill Histogram')
plt.show()
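To make the idea behind the KDE curve a bit more concrete, here's a minimal sketch of a Gaussian kernel density estimate written by hand with NumPy. It places a small bell curve on every data point and averages them. The function name, the five sample bills (taken from df.head() earlier), and the bandwidth of 3.0 are all my own choices for illustration; Seaborn picks its bandwidth automatically, so this is not a reproduction of what histplot draws.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance from every grid point to every sample, in units of bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal density evaluated at those distances
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Sum the bumps and normalize so the curve integrates to roughly 1
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Five total_bill values from df.head() (not the full dataset)
bills = [16.99, 10.34, 21.01, 23.68, 24.59]
grid = np.linspace(0, 40, 401)
density = gaussian_kde_sketch(bills, grid, bandwidth=3.0)  # bandwidth is an arbitrary choice

# Approximate the area under the curve with a simple Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(f"Area under the estimated density: {area:.3f}")
```

A smaller bandwidth gives a bumpier curve that hugs individual points, while a larger one gives a smoother curve that can hide detail.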
Box Plot
A box plot, also known as a box and whisker plot, shows the quartiles of the dataset and is useful to visualize the distribution and skewness of your data. It also identifies outliers in your data. We'll go deeper into quartiles in future posts about distributions as well. For now, think of it this way:
A box plot divides your data into four equal parts, with each part representing a quarter of the data points. The "box" in the plot represents the middle 50% of the data, where the lower boundary of the box is the first quartile (Q1) and the upper boundary is the third quartile (Q3). The line inside the box represents the median (Q2), which is the middle value of the dataset.
Additionally, the "whiskers" extend from the box and indicate the range of the data, excluding outliers. Typically, each whisker reaches to the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the box edge, where the IQR is the difference between Q3 and Q1. Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers and are drawn as individual points beyond the whiskers.
plt.figure(figsize=(8, 6))
# Compare two numeric columns from the tips dataset
sns.boxplot(data=df[['total_bill', 'tip']])
plt.title('Box Plot of Total Bill and Tip')
plt.show()
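The quartile and whisker rules described above are easy to check by hand. Here's a small sketch using NumPy on a made-up list of values (hypothetical numbers, not from the tips dataset) that computes the quartiles, the IQR, the 1.5 × IQR fences, and which points would be flagged as outliers. I'm assuming the common 1.5 × IQR convention here; plotting libraries let you change it.

```python
import numpy as np

# Made-up values with one obviously large point (hypothetical data)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# Quartiles: 25th, 50th (median) and 75th percentiles
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: anything outside these limits counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers stop at the most extreme points that are still inside the fences
whisker_low, whisker_high = inside.min(), inside.max()

print("Q1, median, Q3:", q1, median, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```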
Violin Plot
A violin plot plays a similar role to a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables so that those distributions can be compared.
The real value in using a violin plot is that it not only displays the quartile information like a box plot, but it also provides a more detailed view of the data distribution by showing the probability density of the data at different values.
plt.figure(figsize=(10, 6))
sns.violinplot(x='day', y='total_bill', hue='sex', data=df, split=True, palette='muted')
plt.xlabel('Day of the Week')
plt.ylabel('Total Bill')
plt.title('Total Bill Distribution by Day and Sex')
plt.legend(title='Sex', loc='upper right')
plt.show()
Strip Plot
Strip plots represent the distribution of data by drawing every observation as a point, grouped by category. They're a good complement to a box or violin plot in cases where you want to show all observations within each category.
plt.figure(figsize=(8, 6))
# jitter spreads the points sideways so overlapping observations stay visible
sns.stripplot(data=df, x='day', y='total_bill', jitter=True)
plt.title('Strip Plot')
plt.show()
Pair Plot
Pair plots visualize the pairwise relationships between the numeric columns in a dataset, allowing you to quickly identify patterns, correlations, and trends. They are a useful exploratory data analysis tool when dealing with datasets containing multiple numerical variables.
# Color the points by a categorical column from the tips dataset
sns.pairplot(df, hue='sex')
# pairplot creates a grid of axes, so set the title on the whole figure
plt.suptitle('Pair Plot', y=1.02)
plt.show()
Distribution Plot
A distribution plot visualizes the distribution of a univariate set of observations. In Seaborn, this is mainly done through the histplot function, the same one we used for the histogram above.
# Distribution Plot
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='tip', kde=True)
plt.title('Distribution Plot')
plt.show()
Count Plot
A count plot can be thought of as a histogram across a categorical variable, instead of a quantitative one. It shows the counts of observations in each categorical bin.
# Count Plot
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='day')
plt.title('Count Plot')
plt.show()
Heat Map
A heat map is a two-dimensional representation of information with the help of colors. In the context of data visualization, it is used to represent the correlation between different features.
# Heatmap
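# One-hot encode the text columns (sex, smoker, day, time) so every column is numeric
corr = pd.get_dummies(df, dtype=float).corr()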
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Heat Map')
plt.show()
Heat maps are exceptionally powerful as they provide an intuitive and visually striking representation of data. By using a color-coded system to display values on a 2D matrix, heat maps allow us to grasp complex patterns, trends, and relationships within the data at a glance.
We can see a few different things in this plot, and the first thing that might stick out to you is the red diagonal line of boxes filled with 1's. Each of these indicates a perfect correlation of 1, which makes sense because every variable on the diagonal is being compared with itself.
If you look in the first column where total_bill is being compared with tip, we can see that there is a relatively strong correlation. This is in line with our observation earlier that larger bill totals tend to garner larger tips. We can also see that party size is relatively strongly correlated with both tip and total bill, which makes sense.
On the flip side, there's the cool blue side of the spectrum, which indicates negative correlation. If we look at where time_Dinner and day_Thur meet, we can see there is a very strong negative correlation between the two variables. Saturday and Sunday seem to follow the opposite trend.
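Names like time_Dinner and day_Thur aren't columns in the csv. They're what you get when text columns are one-hot encoded, for example with pd.get_dummies, which turns each category into its own 0/1 column. Here's a small sketch on made-up rows (hypothetical values, not taken from the dataset) showing where those column names come from and why two of them can end up strongly negatively correlated.

```python
import pandas as pd

# Made-up rows for illustration only (hypothetical data)
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68],
    'day': ['Sun', 'Thur', 'Sat', 'Thur'],
    'time': ['Dinner', 'Lunch', 'Dinner', 'Lunch'],
})

# Each category becomes its own column named <column>_<category>
encoded = pd.get_dummies(sample, dtype=float)
print(encoded.columns.tolist())

# With every column numeric, corr() can compare all of them pairwise
corr = encoded.corr()
print(corr.loc['day_Thur', 'time_Dinner'])
```

In this toy sample every Thursday row is a lunch and every other row is a dinner, so day_Thur and time_Dinner move in exactly opposite directions.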
Conclusion
Well, we've made it to the end. I hope you have enjoyed this post on data visualization with Matplotlib and Seaborn! I highly recommend using these plots for your data science projects, as they will not only make your analyses more insightful and compelling but also enable you to effectively communicate your findings to others.
Happy visualizing and exploring the exciting world of data science!