Feature Engineering - Min/Max Aggregate
Mage
Posted on March 2, 2022
TLDR
In this lesson, we’ll learn about the aggregate functions min() and max(), and see how they’re helpful in analyzing and understanding the data.
Outline
- Data Aggregation
- Why is it necessary
- Definition
- Example
- How to code
Data Aggregation
Data aggregation is the summarization of data: many values are reduced to a single value that describes them. Some of the most common aggregate functions are min(), max(), mean(), count(), and sum().
Why is it necessary
Data aggregation is part of the data analysis process, an early and critical step in model building. It lets us delve deeper into the data and helps us understand it better.
Definition
In this lesson, we’ll explore min() and max() functions in detail.
min(): This function helps us find the minimum or least value in a feature or column.
max(): This function helps us find the maximum or highest value in a feature or column.
We can apply aggregate functions in 2 different ways:
Case-1: Apply aggregate functions on a single feature or column i.e., analyzing each column individually.
Case-2: Apply aggregate functions on groups i.e., we’ll group rows and analyze each group individually.
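The two cases can be sketched with pandas as follows. This is a minimal illustration, not the lesson's original code: the column names follow the example in the next section, and the prices are made up.

```python
import pandas as pd

# Made-up prices for illustration; not taken from the lesson's dataset.
df = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Chair", "Laptop", "Desk", "Chair"],
    "Price": [900.0, 150.0, 45.0, 1200.0, 200.0, 60.0],
})

# Case-1: aggregate a single column as a whole.
lowest_price = df["Price"].min()
highest_price = df["Price"].max()

# Case-2: aggregate within groups of rows that share a "Product" value.
lowest_per_product = df.groupby("Product")["Price"].min()
highest_per_product = df.groupby("Product")["Price"].max()

print(lowest_price, highest_price)
print(lowest_per_product)
print(highest_per_product)
```

Case-1 returns one number per column, while Case-2 returns one number per group.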
Example
Consider a dataset with 2 columns, “Product” and “Price”. Let’s apply the aggregate functions min() and max() to find the minimum and maximum value in the “Price” column for each product.
Grouping is a 3-step process, as shown below:
Step-1: Split the rows into groups based on the “Product” column.
There are 3 unique products (Laptop, Desk, Chair) in the “Product” column, so the rows are split into 3 groups.
Step-2: Find the minimum price of each unique product. The maximum price is found the same way.
Step-3: Display the output. For this, we’ll combine each group’s output to form a data frame and display the data frame.
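The three steps above can be written out one at a time, which may make it easier to see what a grouped aggregation does. This is a sketch under assumptions: the prices are invented, and the output column name "Min Price" is chosen here for illustration.

```python
import pandas as pd

# Made-up prices for illustration; not taken from the lesson's dataset.
df = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Chair", "Laptop", "Desk", "Chair"],
    "Price": [900.0, 150.0, 45.0, 1200.0, 200.0, 60.0],
})

# Step-1: split the rows into one sub-frame per unique product.
groups = {product: group for product, group in df.groupby("Product")}

# Step-2: find the minimum price inside each group.
min_prices = {product: group["Price"].min() for product, group in groups.items()}

# Step-3: combine the per-group results into a single data frame.
result = pd.DataFrame({
    "Product": list(min_prices.keys()),
    "Min Price": list(min_prices.values()),
})
print(result)
```

Swapping min() for max() in Step-2 gives the highest price per product instead.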
How to code
In recent years, the popularity of ridesharing has skyrocketed. The key benefits of ridesharing are that it’s inexpensive, convenient, and allows anyone to easily travel from one location to another.
Service providers frequently change prices based on time, traffic, the number of cabs available, and other factors. As costs fluctuate, it's beneficial to offer users a range of prices for a specific route. So, with the help of rides data, let’s find the minimum and maximum prices for each unique route.
Step-1:
First, let’s group rides by source, keeping each ride’s destination and price. To do this, we’ll iterate through the rows of the rides data and use each “source” as a dictionary key whose value is a list of (destination, price) pairs. The final result should be as shown below.
Output format: {'sourceA': [(destination1, price1), (destination1, price2), ...], 'sourceB': [(destination1, price1), (destination1, price2), ...], ...}
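A minimal sketch of this step in plain Python is shown below. The rides list is a hypothetical stand-in for the real rides data, with rows assumed to be (source, destination, price) tuples, and the rows themselves are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for the rides data: (source, destination, price).
rides = [
    ("Haymarket Square", "North Station", 3.0),
    ("Haymarket Square", "North Station", 32.5),
    ("Haymarket Square", "West End", 27.5),
    ("Haymarket Square", "West End", 3.0),
    ("North Station", "Haymarket Square", 7.0),
]

# Each source maps to a list of (destination, price) pairs.
rides_by_source = defaultdict(list)
for source, destination, price in rides:
    rides_by_source[source].append((destination, price))

print(dict(rides_by_source))
```

Using defaultdict(list) avoids checking whether a source has been seen before appending to its list.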
Step-2:
Find minimum price
By comparing the prices of routes with the same starting location and destination, we'll find the minimum price of each route.
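One way to do this comparison, assuming the rides are already grouped in the Step-1 format, is to keep a running minimum per (source, destination) pair. The grouped data below is made up for illustration.

```python
# Grouped rides in the Step-1 format; the rows are made up for illustration.
rides_by_source = {
    "Haymarket Square": [
        ("North Station", 3.0), ("North Station", 32.5),
        ("West End", 27.5), ("West End", 3.0),
    ],
    "North Station": [("Haymarket Square", 7.0)],
}

min_price = {}
for source, trips in rides_by_source.items():
    for destination, price in trips:
        route = (source, destination)
        # Keep the smaller of the current price and the best price seen so far.
        min_price[route] = min(price, min_price.get(route, price))

for (source, destination), price in min_price.items():
    print(f"{source} -> {destination}: {price}")
```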
Find maximum price
By comparing the prices of routes with the same starting location and destination, we'll find the maximum price of each route.
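The maximum can be found with the same loop, using max() in place of min(). As before, this is a sketch that assumes the Step-1 dictionary format, and the rows are made up for illustration.

```python
# Grouped rides in the Step-1 format; the rows are made up for illustration.
rides_by_source = {
    "Haymarket Square": [
        ("North Station", 3.0), ("North Station", 32.5),
        ("West End", 27.5), ("West End", 3.0),
    ],
    "North Station": [("Haymarket Square", 7.0)],
}

max_price = {}
for source, trips in rides_by_source.items():
    for destination, price in trips:
        route = (source, destination)
        # Keep the larger of the current price and the best price seen so far.
        max_price[route] = max(price, max_price.get(route, price))

for (source, destination), price in max_price.items():
    print(f"{source} -> {destination}: {price}")
```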
From the output, we see that the price from “Haymarket Square” to “North Station” ranges from 3.0 to 32.5, the price from “Haymarket Square” to “West End” ranges from 3.0 to 27.5, and so on.
Group rows of the same route, and find the minimum and maximum price of each individual route.
Pandas has a built-in method, groupby(), that groups the rows of a dataset. Used together with min() and max(), it finds the minimum and maximum values of each unique group.
Find minimum price
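A minimal pandas sketch for the minimum is shown below. The column names "destination" and "price" are assumptions about the rides data (the lesson only names "source"), and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; column names "destination" and "price" are assumed.
rides = pd.DataFrame({
    "source": ["Haymarket Square", "Haymarket Square", "Haymarket Square",
               "Haymarket Square", "North Station"],
    "destination": ["North Station", "North Station", "West End",
                    "West End", "Haymarket Square"],
    "price": [3.0, 32.5, 27.5, 3.0, 7.0],
})

# Group rows that share the same route, then take the lowest price per group.
min_prices = rides.groupby(["source", "destination"])["price"].min()
print(min_prices)
```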
Find maximum price
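The maximum works the same way with max(). The sketch below uses the same assumed column names ("destination" and "price") and made-up rows, and also shows agg() as one way to get both values in a single table.

```python
import pandas as pd

# Made-up rows; column names "destination" and "price" are assumed.
rides = pd.DataFrame({
    "source": ["Haymarket Square", "Haymarket Square", "Haymarket Square",
               "Haymarket Square", "North Station"],
    "destination": ["North Station", "North Station", "West End",
                    "West End", "Haymarket Square"],
    "price": [3.0, 32.5, 27.5, 3.0, 7.0],
})

# Group rows that share the same route, then take the highest price per group.
max_prices = rides.groupby(["source", "destination"])["price"].max()
print(max_prices)

# Both aggregates at once: one row per route, with "min" and "max" columns.
price_range = rides.groupby(["source", "destination"])["price"].agg(["min", "max"])
print(price_range)
```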
Magical no code solution
For quick analysis and results, try our product, Mage. Our service features an "Edit data" area with multiple aggregation options. Apart from analyzing the data, you can create a new column and store the aggregation results that help in further analysis of the data.
Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮