Introduction to Data Analysis with Python Part 4: Data Visualisation with Matplotlib
dev_neil_a
Posted on August 22, 2022
- Introduction
- Step 1. Importing Pandas and NumPy
- Step 2. Import From Excel
- Step 3. Validating the Data
- Step 4. Creating New Dataframes for Charting
- Step 5. Creating Charts with Matplotlib
- Conclusion
- Resources
Introduction
In this final part of the multi-part series, I'll be showing you how to create some basic charts using Matplotlib.
The data for the charts will come from a number of Pandas dataframes that will be created from the data that was used in part three.
To recap what was covered previously:
In part one, I covered importing data from a CSV file, cleaning up & converting data and finally exporting it to an Excel file.
In part two, I covered performing mathematical operations against the data that is stored in a dataframe using both Pandas and NumPy.
In part three, I covered how to perform analytical operations against data in a Pandas dataframe to produce figures that could be used for reporting, such as totals.
As with the previous parts, a Jupyter notebook and all the other required files are available in a GitHub repo that is linked in the Resources section.
With that said, let's get started on the series finale. Spoilers, there is no cliffhanger!
Step 1. Importing Pandas and NumPy
First of all, the Pandas and NumPy libraries need to be imported. In addition, Matplotlib will also be imported as it will be required for creating the charts.
# --- %matplotlib inline will ensure that the plots (charts) and figures show up in the notebook.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
If you don't have matplotlib installed, you can install it using pip:
pip install matplotlib
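The bar chart later in this article uses Axes.bar_label, which was added in Matplotlib 3.4, so it may be worth confirming which version is installed. A small sketch of such a check, where the variable names are purely illustrative:

```python
import matplotlib

# Take the major and minor parts of a version string such as "3.7.1".
major, minor = (int(part) for part in matplotlib.__version__.split(".")[:2])

# Axes.bar_label, used for the bar chart labels, needs Matplotlib 3.4 or later.
supports_bar_label = (major, minor) >= (3, 4)
print(matplotlib.__version__, supports_bar_label)
```

If the check prints False, upgrading with pip install --upgrade matplotlib should bring in a recent enough version.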
Step 2. Import From Excel
Once the libraries have been imported, the next step is to get the data imported. This is the same data that was used in part three and there have been no changes to it.
sales_data = pd.read_excel(io = "data/order_data_with_totals.xlsx",
sheet_name = "order_data_with_totals",
dtype = {"order_id": np.int64,
"order_date": "datetime64[ns]",
"customer_id": np.int64,
"customer_first_name": str,
"customer_last_name": str,
"customer_gender": str,
"customer_city": str,
"customer_country": str,
"item_description": str,
"item_qty": np.int64,
"item_price": np.float64,
"order_currency": str,
"order_vat_rate": np.float64,
"order_total_ex_vat_local_currency": np.float64,
"order_total_vat_local_currency": np.float64,
"order_total_inc_vat_local_currency": np.float64,
"order_currency_conversion_rate": np.float64,
"order_total_ex_vat_converted_gbp": np.float64,
"order_total_vat_converted_gbp": np.float64,
"order_total_inc_vat_converted_gbp": np.float64})
Step 3. Validating the Data
Now that the data has been imported from the Excel file into the sales_data dataframe, let's take a look at the data it contains.
Step 3.1. What the Data Looks Like
First, let's have a look at the first five rows of the sales_data dataframe.
sales_data.head(n = 5)
The sales_data dataframe contains more columns than can fit into a single image of the output.
Step 3.2. Check the Columns DataTypes
Next, let's have a look at the datatypes that have been assigned to each column in the sales_data dataframe.
sales_data.dtypes
As expected, all the datatypes match what was specified when the data was imported.
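Checking the datatypes by eye works for a handful of columns, but the comparison can also be done in code. The following is a minimal sketch, assuming a small made-up dataframe (sample_df) with one deliberately wrong column in place of sales_data, and a hypothetical expected_dtypes mapping:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sales_data with one deliberately wrong column.
sample_df = pd.DataFrame({
    "order_id": np.array([1, 2, 3], dtype=np.int64),
    "item_qty": np.array([1.0, 2.0, 3.0]),  # float64, but int64 is expected
    "item_price": np.array([9.99, 4.50, 1.25], dtype=np.float64),
})

# The datatypes each column is expected to have.
expected_dtypes = {
    "order_id": np.int64,
    "item_qty": np.int64,
    "item_price": np.float64,
}

# Collect every column whose actual dtype differs from the expected one.
mismatched_columns = {
    column: str(sample_df[column].dtype)
    for column, expected in expected_dtypes.items()
    if sample_df[column].dtype != np.dtype(expected)
}
print(mismatched_columns)  # {'item_qty': 'float64'}
```

An empty dictionary would mean every column has the datatype that was asked for.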
Step 3.3. Check for NaN (Null) Values
sales_data.isna().sum()
Just as before, there are no NaN values in the dataframe.
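To show what isna().sum() reports when values are missing, here is a small sketch. It assumes a made-up three-row dataframe (sample_df) with two gaps added on purpose, rather than the real sales data:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for sales_data (not the article's data).
sample_df = pd.DataFrame({
    "order_id": np.array([1, 2, 3], dtype=np.int64),
    "order_currency": ["GBP", "EUR", None],
    "item_price": np.array([9.99, np.nan, 4.50], dtype=np.float64),
})

# Count the missing values in each column.
nan_counts = sample_df.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_nans = nan_counts[nan_counts > 0]
print(columns_with_nans)

# Total number of missing values across the whole dataframe.
total_nans = int(nan_counts.sum())
print(total_nans)  # 2
```

For the real sales_data dataframe the total is zero, which is why every column shows 0 in the output.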
Now, let's move on to creating some additional dataframes from the sales_data dataframe that can then be used to create some charts.
Step 4. Creating New Dataframes for Charting
In this section, two new dataframes will be created that will be used for creating the charts. The first dataframe will hold the total number of orders for each currency that was used, and the second will hold the total number of orders for each customer gender.
Unlike in part three, the two dataframes will each be assigned to a variable so they can be referenced when it comes to creating the charts.
Step 4.1. Create Total Number of Orders by Currency Dataframe
# --- Create a variable for the dataframe:
orders_by_currency_df = sales_data.groupby(["order_currency"])\
.size()\
.to_frame("total_number_of_orders")\
.sort_values("total_number_of_orders",
ascending = True)
# --- Show the contents of the dataframe:
orders_by_currency_df
Step 4.2. Create Total Number of Orders by Gender Dataframe
orders_by_gender_df = sales_data.groupby(["customer_gender"])\
.size()\
.to_frame("no_of_orders")\
.sort_values("no_of_orders",
ascending = False)
# --- Show the contents of the dataframe:
orders_by_gender_df
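To see what the groupby chain in the two steps above produces, here is a minimal sketch that runs the same chain on a tiny made-up dataframe. The toy_orders name and its five rows are assumptions for illustration only, not the real order data:

```python
import pandas as pd

# Hypothetical orders: three placed in GBP and two in EUR.
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "order_currency": ["GBP", "EUR", "GBP", "GBP", "EUR"],
})

orders_by_currency = (
    toy_orders.groupby(["order_currency"])   # one group per currency
    .size()                                  # number of rows in each group
    .to_frame("total_number_of_orders")      # Series -> one-column dataframe
    .sort_values("total_number_of_orders", ascending=True)
)

# EUR (2 orders) is listed first, followed by GBP (3 orders).
print(orders_by_currency)
```

The currency ends up as the index of the new dataframe, which is why the chart code later reads the labels from the index and the counts from the single column.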
Step 5. Creating Charts with Matplotlib
So what is Matplotlib? Matplotlib is a Python library for creating charts from data that can come from many different sources. In the examples in this article, the data sources will be the two dataframes that were created earlier from the sales_data dataframe.
A Matplotlib chart consists of a number of elements. The diagram below depicts what each element is.
- Figure: Think of this as a canvas that the chart(s) are placed onto.
- Figure Title: The title of the figure. This can be different to the title given to a chart (or axes). This is not shown in the example diagram.
- Axes: An axes is the container for a chart (also called a plot). An axes sits on top of the figure and there can be more than one axes on a figure.
- Axes Title: This is the title for the axes.
- Y-Axis Label: The label that describes what the y-axis represents.
- X-Axis Label: Does the same as the y-axis label, only it's for the x-axis.
- Tick: A marker on an axis that identifies a value or category from the data source (for example, which currency a bar represents).
- Legend: A list of what each data point on the plot/chart is.
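The elements listed above can be sketched in a few lines of code. This is only an illustration with made-up data and placeholder titles, not one of the charts built later in this article:

```python
import matplotlib.pyplot as plt

# One figure (the canvas) holding two axes (one per chart).
fig, (left_ax, right_ax) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# The figure title sits above all of the axes.
fig.suptitle("Figure Title")

# Each axes has its own title, axis labels, ticks and legend.
left_ax.set_title("Axes Title (left)")
left_ax.set_xlabel("X-Axis Label")
left_ax.set_ylabel("Y-Axis Label")
left_ax.bar(x=["A", "B"], height=[3, 5], label="Example data")
left_ax.legend()

right_ax.set_title("Axes Title (right)")
right_ax.plot([1, 2, 3], [2, 4, 8], label="Example line")
right_ax.legend()
```

Here "A" and "B" become the tick labels on the x-axis of the left chart, and each legend is built from the label given to the plotted data.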
As part of each chart, I've added notes for each section of the code to describe what it does or what its purpose is.
First, let's begin by looking at making a bar chart from the orders_by_currency_df dataframe.
Step 5.1. Orders by Currency as a Bar Chart
There are two ways that you can create charts (plots) with Matplotlib. The first is the plot method that Pandas provides on a dataframe, which uses Matplotlib to create everything for you and offers a limited level of customisation. For example, let's create a quick bar chart using the orders_by_currency_df dataframe:
orders_by_currency_df.plot(kind = "bar")
As you can see, it is a basic bar chart that shows the data and little else. The second way is to use Matplotlib's object-oriented interface, which takes more code but offers the maximum amount of customisation. This is what will be used in the examples going forward, starting with the bar chart below.
# --- Create a list of colors to use for each item in the dataframe.
colors_to_use = ["blue", "red"]
# --- Setup the figure and the axes:
fig, ax = plt.subplots(figsize = (10, 8))
# --- This sets the figure to white. There seems to be a bug in VS Code that can cause
# --- the figure to go dark when using dark mode passed through from the operating system.
fig.set_facecolor("white")
# --- Customise the bar chart axes:
ax.set_title(label = "Total Orders by Currency",
fontdict = {"fontsize": 20,
"color": "black",
"weight": "bold"})
# --- Set the x axis label:
ax.set_xlabel("Currency", fontsize = 16)
plt.xticks(fontsize = 16)
# --- Set the y axis label:
ax.set_ylabel("Number of Orders", fontsize = 16)
plt.yticks(fontsize = 16)
# --- Create a bar plot:
bar_chart = ax.bar(x = orders_by_currency_df.index.values,
height = orders_by_currency_df["total_number_of_orders"],
color = colors_to_use,
tick_label = orders_by_currency_df.index)
# --- Set the label for each bar to appear inside each bar with the value of each currency:
ax.bar_label(container = bar_chart,
label_type = "center",
labels = orders_by_currency_df["total_number_of_orders"],
color = "white",
weight = "bold",
fontsize = 16)
# --- Create a dictionary that maps the currency to the color used.
# --- These will be used in the legend.
currency_cmap = dict(zip(orders_by_currency_df.index.values,
colors_to_use))
patches = [Patch(label = currency,
color = currency_color) for currency, currency_color in currency_cmap.items()]
# --- Add a legend:
ax.legend(handles = patches,
fontsize = 16,
labelcolor = "black",
title = "Currency",
title_fontproperties = {"size": 16,
"weight": "bold"});
You may have noticed that there is a semi-colon at the end of the last line of the code, which is unusual with Python.
The reason for this is that a Jupyter notebook displays the value returned by the last line of a cell. Here that value is the legend object returned by ax.legend, so its text representation would be shown above the chart. Adding the semi-colon suppresses this so you will only see the chart.
Note: The first bar chart at the beginning of this step was created without the semi-colon, so the object name is shown above it.
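The behaviour described above can be sketched outside of the main example. The toy_df dataframe below is a made-up stand-in for orders_by_currency_df, and assigning the result to a name is shown as one possible alternative to the semi-colon:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for orders_by_currency_df.
toy_df = pd.DataFrame(
    {"total_number_of_orders": [2, 3]},
    index=pd.Index(["EUR", "GBP"], name="order_currency"),
)

# DataFrame.plot returns the Matplotlib Axes it drew on. Left as the last
# expression of a notebook cell, that return value is what gets echoed.
ax = toy_df.plot(kind="bar")

# Because the Axes is an ordinary object, it can still be customised.
ax.set_title("Total Orders by Currency")

# ax.legend returns a Legend object. Assigning it to a name means the cell
# no longer ends in a bare expression, so nothing is echoed above the chart.
legend = ax.legend(title="Currency")
```

Either approach works. The semi-colon is simply the shorter of the two.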
Step 5.2. Percentage of Orders by Currency as a Pie Chart
Now let's take the same data used for the bar chart and create a pie chart from it.
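The autopct argument used in the chart for this step controls how each slice's percentage is formatted. As a rough sketch of the arithmetic involved, assuming hypothetical order counts of 2 and 3 rather than the real totals, each value is divided by the sum of all the values, multiplied by 100 and then passed through the format string:

```python
# Hypothetical order counts for two currencies (not the real totals).
order_counts = [2, 3]
total_orders = sum(order_counts)

# The same format string that is passed to autopct in the pie chart.
# %0.2f formats a number to two decimal places and %% is a literal percent sign.
autopct_format = "%0.2f%%"

# Each slice's share of the total, formatted the way it appears on the chart.
slice_labels = [autopct_format % (100 * count / total_orders)
                for count in order_counts]
print(slice_labels)  # ['40.00%', '60.00%']
```

The full chart code for this step follows.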
# --- Create a list of colors to use for each item in the dataframe.
colors_to_use = ["blue", "red"]
# --- Setup the figure and the axes:
fig, ax = plt.subplots(figsize = (14, 10))
# --- This sets the figure to white. There seems to be a bug in VS Code that can cause
# --- the figure to go dark when using dark mode passed through from the operating system.
fig.set_facecolor("white")
# --- Customise the pie chart plot:
ax.set_title(label = "Total Orders by Currency (%)",
fontdict = {"fontsize": 20,
"color": "black",
"weight": "bold"})
# --- Create a pie chart plot.
# --- explode will take one of the pieces out of the pie slightly.
# --- autopct will format the percentages to two decimal places:
patches, texts, pcts = ax.pie(x = orders_by_currency_df["total_number_of_orders"],
labels = orders_by_currency_df.index.values,
explode = (0.2, 0),
autopct = '%0.2f%%',
shadow = False,
colors = colors_to_use,
textprops = {"fontsize": 16,
"weight": "bold"})
# --- Set the color of the percentage to white:
plt.setp(pcts,
color = "white",
weight = "bold")
# --- This will change the color of the text label for each slice to the color the slice used:
for index_pos, patch in enumerate(patches):
    texts[index_pos].set_color(patch.get_facecolor())
# --- Add a legend:
ax.legend(fontsize = 16,
title = "Currency",
loc = "upper left",
title_fontproperties = {"size": 16,
"weight": "bold"});
Step 5.3. Percentage of Orders by Gender as a Pie Chart
Lastly, let's create another pie chart, this time using the orders_by_gender_df dataframe.
# --- Create a list of colors to use for each item in the dataframe.
colors_to_use = ["red", "blue", "purple"]
# --- Setup the figure and the axes:
fig, ax = plt.subplots(figsize = (16, 12))
# --- This sets the figure to white. There seems to be a bug in VS Code that can cause
# --- the figure to go dark when using dark mode passed through from the operating system.
fig.set_facecolor("white")
# --- Customise the pie chart plot:
ax.set_title(label = "Total Orders by Gender (%)",
fontdict = {"fontsize": 20,
"color": "black",
"weight": "bold"})
# --- Create a pie chart plot.
# --- explode will not take any of the pieces out of the pie.
# --- autopct will format the percentages to two decimal places:
patches, texts, pcts = ax.pie(x = orders_by_gender_df["no_of_orders"],
labels = orders_by_gender_df.index.values,
autopct = '%0.2f%%',
explode = (0.0, 0.0, 0.0),
shadow = False,
colors = colors_to_use,
textprops = {"fontsize": 16,
"weight": "bold"})
# --- Set the color of the percentages to white:
plt.setp(pcts,
color = "white",
weight = "bold")
# --- This will change the color of the text label for each slice to the color the slice used:
for index_pos, patch in enumerate(patches):
    texts[index_pos].set_color(patch.get_facecolor())
# --- Add a legend:
ax.legend(fontsize = 16,
labelcolor = "black",
title = "Gender",
loc = "upper left",
title_fontproperties = {"size": 16,
"weight": "bold"});
Conclusion
In this final part of the series, I covered how to use Matplotlib's object-oriented interface to create some basic charts.
There are many different options you can use to customise charts, be that color themes, visualisation styles (such as histograms, scatter plots and line graphs, to name a few) and more.
I would recommend checking the Matplotlib documentation to see what other possibilities are available for you to use.
Thank you for reading and have a good day!
Resources