Introduction to Data Analysis with Python Part 3: Analysing Data Using Pandas and NumPy with Python
dev_neil_a
Posted on August 15, 2022
- Introduction
- Step 1. Importing Pandas and NumPy
- Step 2. Import From Excel
- Step 3. Validating the Data
- Step 4. Basic Analysis Functions with Pandas
- Step 5. Grouping Data by Criteria
- Step 6. Using Mean
- Step 7. Using Median
- Step 8. Working with Dates
- Step 8.1. Recreate the Index for the Dataframe with Dates for the Index
- Step 8.2. Work out the Orders Totals per Day in a Specific Range
- Step 8.3. Show Total Order Values and Mean for Each Year
- Step 8.4. Show Total Order Values and Mean for a Single Year
- Step 8.5. Show Total Order Values and Mean per Month for Each Year
- Resources
Introduction
In this multi-part series, I'll be going over some of the basics of Pandas, NumPy and Matplotlib that I've learned over the past few weeks.
In part one, I covered the following topics:
- Importing data from a CSV file into Pandas.
- Cleaning up the data and removing unusable data.
- Converting between datatypes.
- Exporting and importing data to and from Excel files.
In part two, I covered performing mathematical operations against the data that is stored in a dataframe using both Pandas and NumPy.
In this part, I will be covering how to perform analytical operations against data in a Pandas dataframe to show data that could be used for reporting, such as a total.
As with the previous parts, there is a Jupyter notebook, along with all the other required files, located in a GitHub repo that is linked in the Resources section.
Let's get started on part three.
Step 1. Importing Pandas and NumPy
First of all, the Pandas and NumPy libraries need to be imported.
import pandas as pd
import numpy as np
Step 2. Import From Excel
Once the libraries have been imported, the next step is to get the data imported. The import will use strict datatype enforcement, just as was done the second time the Excel sheet was imported in part two.
The only difference this time is that there are more columns to import and there will be no rows to skip.
sales_data = pd.read_excel(io = "data/order_data_with_totals.xlsx",
sheet_name = "order_data_with_totals",
dtype = {"order_id": np.int64,
"order_date": "datetime64[ns]",
"customer_id": np.int64,
"customer_first_name": str,
"customer_last_name": str,
"customer_gender": str,
"customer_city": str,
"customer_country": str,
"item_description": str,
"item_qty": np.int64,
"item_price": np.float64,
"order_currency": str,
"order_vat_rate": np.float64,
"order_total_ex_vat_local_currency": np.float64,
"order_total_vat_local_currency": np.float64,
"order_total_inc_vat_local_currency": np.float64,
"order_currency_conversion_rate": np.float64,
"order_total_ex_vat_converted_gbp": np.float64,
"order_total_vat_converted_gbp": np.float64,
"order_total_inc_vat_converted_gbp": np.float64})
Step 3. Validating the Data
Now that the data has been imported from the Excel file into the sales_data dataframe, let's take a look at the data it contains.
Step 3.1. What the Data Looks Like
First, let's have a look at the first five rows of the data in the sales_data dataframe.
sales_data.head(n = 5)
The main difference this time is that there are more columns to the right of the sales_data dataframe. I've only shown some of the columns, as the output wouldn't fit if all the columns in the sales_data dataframe were included.
Step 3.2. Check the Columns DataTypes
Next, let's have a look at the datatypes that have been assigned to each column in the sales_data dataframe.
sales_data.dtypes
As expected, all the datatypes match what they were specified to be when the data was imported.
Step 3.3. Check for NaN (Null) Values
sales_data.isna().sum()
There are no NaN values in the sales_data dataframe, as it was cleaned up in part one and the new data and columns that were created in part two were checked for NaN values before the data was exported to a new Excel file.
Step 4. Basic Analysis Functions with Pandas
First up, let's take a look at some of the basic analysis functions that can be performed with Pandas. This is just a small sample of what can be done so I would recommend looking at the Pandas documentation if you need to find out how to perform a specific function against a dataframe.
Step 4.1. Total Number of Orders
Although this isn't strictly a Pandas-specific function, it is useful for showing how many rows there are in the sales_data dataframe.
print(f"Total Number of Orders: {len(sales_data)}")
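As an alternative sketch, the dataframe's shape attribute (a tuple of rows and columns) gives the same count:
# shape is a (rows, columns) tuple, so index 0 is the number of rows (orders)
print(f"Total Number of Orders: {sales_data.shape[0]}")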
Step 4.2. Show Orders with a Total Greater than 50 GBP
Now let's take a look at narrowing down some of the data in the sales_data dataframe. To start with, let's see the first five orders that have a value of greater than (>) 50 GBP in the order_total_inc_vat_converted_gbp column.
sales_data[sales_data["order_total_inc_vat_converted_gbp"] > 50].head(n = 5)
You can substitute the > for < (less than) if you need to find values under 50 GBP.
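For example, here is a minimal sketch of the same filter using < instead, showing the first five orders under 50 GBP:
# Same filter as above, but keeping orders under 50 GBP instead
sales_data[sales_data["order_total_inc_vat_converted_gbp"] < 50].head(n = 5)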
Step 4.3. Show the Highest Value Order
Next, let's use the max() function to find the order in the sales_data dataframe with the highest value in the order_total_inc_vat_converted_gbp column.
sales_data[sales_data["order_total_inc_vat_converted_gbp"] == sales_data["order_total_inc_vat_converted_gbp"].max()]
Step 4.4. Show the Lowest Value Order
Finally, let's use the min() function to find the order in the sales_data dataframe with the lowest value in the order_total_inc_vat_converted_gbp column.
sales_data[sales_data["order_total_inc_vat_converted_gbp"] == sales_data["order_total_inc_vat_converted_gbp"].min()]
Step 5. Grouping Data by Criteria
In this section, I'll be demonstrating how to use the groupby function, along with a few others that can be used with it.
The groupby function groups together rows that share the same value in a given column, or list of columns, that the data needs to be grouped by. The main purpose is so that you can display a set of results for each group, rather than showing every row that matches the specified criteria.
Step 5.1. Total Number of Orders by Currency
To start this section, let's begin by getting a list of the currencies that the orders were placed with and then show the total number of orders for each currency.
sales_data.groupby(["order_currency"]).size()
In the above, groupby will group up the entries found in the order_currency column (GBP and EUR in this case) and size will count how many rows it finds for each currency.
Now, let's make the output a little bit more presentable by putting it into a frame using the to_frame function.
Whilst we are at it, let's sort the results so the currency with the lowest number of orders comes first, using the sort_values function.
sales_data.groupby(["order_currency"])\
.size()\
.to_frame("total_number_of_orders")\
.sort_values("total_number_of_orders",
ascending = True)
That looks better!
Step 5.2. Total Number of Orders by Gender
Using the same method as the above, let's take a look at the total number of orders by the gender of the customers.
sales_data.groupby(["customer_gender"])\
.size()\
.to_frame("no_of_orders")\
.sort_values("no_of_orders",
ascending = False)
Step 5.3. Total Value of Orders by Gender
Now that the total number of orders by gender is known, let's see what the total order values are for each gender. In this step, the agg function will be used to perform the aggregation of the orders by passing a dictionary of columns and the function to use for each. In this example, it will only be the order_total_inc_vat_converted_gbp column.
As part of this, NumPy will be used to perform a sum (np.sum) against the orders that are found for each gender.
Finally, the round function is used to round the results to two decimal places.
sales_data.groupby(["customer_gender"])\
.agg({"order_total_inc_vat_converted_gbp": np.sum})\
.sort_values(["order_total_inc_vat_converted_gbp"],
ascending = False)\
.round(2)
As a side note, you can pass multiple columns and functions with the agg function, such as sum, mean and median. There is an example of how to do this in step 6.2.
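As a rough illustration of passing several functions for one column (not part of the original notebook), a list of functions can be supplied, which produces one output column per function:
# Illustrative sketch: sum, mean and median of the converted order totals per gender
sales_data.groupby(["customer_gender"])\
          .agg({"order_total_inc_vat_converted_gbp": [np.sum, np.mean, np.median]})\
          .round(2)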
Step 6. Using Mean
In this section, the mean function, specifically the np.mean (NumPy) function, will be used to aggregate data in the sales_data dataframe.
What is mean?
The mean is the sum of the numbers in a given column divided by how many numbers there are, whether that is every value in the column or just the values left after filtering.
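As a quick illustrative example (not taken from the order data), the mean of 10, 20 and 60 is (10 + 20 + 60) / 3 = 30:
# (10 + 20 + 60) / 3 = 30.0
np.mean([10, 20, 60])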
Step 6.1. Get the Mean of all the Orders
First, let's run mean against the order_total_inc_vat_converted_gbp column.
print(f"Mean of all orders (Converted to GBP): £{np.mean(sales_data['order_total_inc_vat_converted_gbp']):.2f}")
Just a quick note: :.2f will only show two decimal places in the output.
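For example, as a quick illustration of the format specifier:
# :.2f formats the number to two decimal places
print(f"£{1234.5678:.2f}")  # prints £1234.57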
Step 6.2. Get the Mean of Orders by Country
Next, let's use agg to get the mean of the order_total_inc_vat_local_currency and order_total_inc_vat_converted_gbp columns and use groupby to group up the results by country and the currency used.
sales_data.groupby(["customer_country", "order_currency"])\
.agg({"order_total_inc_vat_local_currency": np.mean,
"order_total_inc_vat_converted_gbp": np.mean},)\
.sort_values(["order_total_inc_vat_converted_gbp"],
ascending = False)\
.round(2)
From the above, the results are grouped up by country and currency.
Step 6.3. Get the Mean of Orders by Country, City and Gender
Following on from the previous example, let's take it a little bit further. This time, let's use three columns for the grouping and do a mean of just the order_total_inc_vat_converted_gbp column.
sales_data.groupby(["customer_country", "customer_city", "customer_gender"])\
.agg({"order_total_inc_vat_converted_gbp": np.mean})\
.sort_values(["order_total_inc_vat_converted_gbp"],
ascending = False)\
.round(2)
The results this time are grouped up by country, city and gender. This allows for a deeper dive into the customers by each of these criteria and could be expanded further to see what they buy.
Step 7. Using Median
Now that mean has been covered, let's take a look at using median. The process for using median is very much the same as for mean; you just substitute mean with median.
What is median?
The median is the middle number that is found by putting all the values in a column in order (lowest to highest) and then taking the number in the middle. If there are two middle numbers (which happens when there is an even number of entries), the mean of those two numbers is taken to give the median.
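As a quick illustration (not taken from the order data), for the four values 10, 20, 30 and 40, the two middle numbers are 20 and 30, so the median is 25:
# Even number of values, so the median is the mean of 20 and 30
np.median([10, 20, 30, 40])  # 25.0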
The two examples below are the same as the examples used in step 6, only they use np.median (NumPy median) instead of np.mean (NumPy mean).
Step 7.1. Get the Median of Orders by Country
sales_data.groupby(["customer_country"])\
.agg({"order_total_inc_vat_converted_gbp": np.median},)\
.sort_values(["order_total_inc_vat_converted_gbp"],
ascending = False)\
.round(2)
Step 7.2. Get the Median of Orders by Country, City and Gender
sales_data.groupby(["customer_country", "customer_city", "customer_gender"])\
.agg({"order_total_inc_vat_converted_gbp": np.median})\
.sort_values(["order_total_inc_vat_converted_gbp"],
ascending = False)\
.round(2)
Step 8. Working with Dates
This final section will focus on using dates with Pandas dataframes. Working with dates can be done in a number of ways, but I'll be covering just a couple. Feel free to leave a comment with ways that you typically use dates with Pandas dataframes.
Step 8.1. Recreate the Index for the Dataframe with Dates for the Index
First, we will recreate the index for the sales_data dataframe using the values stored in the order_date column. This allows the sales_data dataframe to be treated as a time series-based dataframe, which is faster to work with when filtering and aggregating by dates and / or times.
sales_data.set_index([sales_data["order_date"]],
inplace = True)
sales_data.index.rename("date",
inplace = True)
sales_data.head(n = 2)
The index (first column on the left) is now showing the same date per row as order_date.
Step 8.2. Work out the Orders Totals per Day in a Specific Range
Now that the sales_data dataframe is indexed with dates, let's make use of it. To do this, the loc function is used. This function will find entries in the index that fall within the criteria supplied.
In the example below, loc will search for values in the index that fall between the first of April 2020 and the tenth of April 2020.
From there, it will group the orders found by the order_date column and perform a sum on the order_total_inc_vat_converted_gbp column for any orders placed on each day. The output will show each day and the total for that day.
start_date = "2020-04-01"
end_date = "2020-04-10"
sales_data.loc[start_date : end_date]\
.groupby(["order_date"])\
.agg({"order_total_inc_vat_converted_gbp": np.sum})\
.sort_values("order_date",
ascending = True)\
.round(2)
Step 8.3. Show Total Order Values and Mean for Each Year
Next up, let's take a look at getting the total and the mean for each year that is found in the sales_data dataframe index. To do this, the index will be resampled using the resample function, which groups up the sales_data dataframe index and entries by year ("Y").
This won't impact the sales_data dataframe as it isn't getting reassigned with the resampled data.
In addition, the agg function will be used in a slightly different way. This time, we will use a form that allows us to specify the name of the output column for each column that is aggregated. For example, instead of order_total_inc_vat_converted_gbp, the column name in the output will be year_total_gbp.
sales_data.resample("Y")\
.agg(year_total_gbp = ("order_total_inc_vat_converted_gbp", np.sum),
year_mean_gbp = ("order_total_inc_vat_converted_gbp", np.mean))\
.round(2)
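As the resampled output above isn't reassigned, sales_data stays as it is. If you did want to keep the yearly summary for later use, you could assign it to a new dataframe; the variable name yearly_summary below is just an example:
# Keep the yearly totals and means in a separate dataframe, leaving sales_data untouched
yearly_summary = sales_data.resample("Y")\
                           .agg(year_total_gbp = ("order_total_inc_vat_converted_gbp", np.sum),
                                year_mean_gbp = ("order_total_inc_vat_converted_gbp", np.mean))\
                           .round(2)
yearly_summary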
Step 8.4. Show Total Order Values and Mean for a Single Year
Now that you have seen it done for each year, let's do it again but only show the results for one year instead.
To do this, the loc function is used first, which filters the index for the year that is specified. In this case, the year will be 2020, but there is data for 2021 as well if you want to try it.
year_to_use = 2020
sales_data.loc[sales_data.index.year == year_to_use]\
.resample("Y")\
.agg(year_total_gbp = ("order_total_inc_vat_converted_gbp", np.sum),
year_mean_gbp = ("order_total_inc_vat_converted_gbp", np.mean))
Step 8.5. Show Total Order Values and Mean per Month for Each Year
Lastly, let's get the order total and mean for each month ("M") that is in the sales_data dataframe index.
sales_data.resample("M")\
.agg(month_total_gbp = ("order_total_inc_vat_converted_gbp", np.sum),
month_mean_gbp = ("order_total_inc_vat_converted_gbp", np.mean))\
.round(2)
If you change resample("M") to resample("Q"), the results for each quarter of the year will be shown.
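For example, a quarterly version could look something like this:
# Same aggregation, grouped by quarter instead of month
sales_data.resample("Q")\
          .agg(quarter_total_gbp = ("order_total_inc_vat_converted_gbp", np.sum),
               quarter_mean_gbp = ("order_total_inc_vat_converted_gbp", np.mean))\
          .round(2)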
Resources