Product developers’ guide to getting started with AI — Part 2: Surfing through dataframes
Mage
Posted on September 23, 2021
TLDR
Surfing through data is a quick and simple way to put all relevant information on the top. To go from a searching experience to a surfing experience, all it takes is a strong filtering, sorting, and grouping procedure.
Outline
- Introduction
- Before we begin
- Filtering dataframes
- Conditions
- Grouping dataframes
- Sorting dataframes
- Conclusion
Introduction
In the past, people had to use the Dewey Decimal System in order to discover new information to answer questions and perform research. It was a tedious and monotonous process that required you to pick a topic, then look through each in an unchanging order. Following the Dewey Decimal System, the search results for a person looking for answers on “How to cook” is the exact same for another, even when they are looking for a different answer.
Nowadays, when people have a question or want to learn, they surf the web to find an answer. You found this article and clicked it probably because it has the answers to what you’re looking for. Search engines visit sites that collect tons of data and employ techniques to decide which piece of data goes on top.
To surf through a dataframe follow these 3 steps: filter, order, and group. This transforms your user experience from searching to surfing.
Before we begin
In this guide, we’ll be using the Titanic dataset along with Google Collab. We’ll import the dataset, look at the metadata to find the best filters, then slice groups of data down in order to cleanly surf. If you need a refresher on anything mentioned above, please refer back to part 1 to learn how to set up the environment and view metadata.
Filtering Dataframes
Back on the Titanic, everyone is boarding the ship and you find a ticket on the floor. But it doesn’t say anything else about the name of the person on it. From the writing you can see the 1st 4 digits, 3734, with the remaining being too dirty to figure out. Being the good samaritan that you are, you tell the staff and begin your search for the owner.
Let’s examine entries of the dataframe for the ticket and the name.
To select specifically two columns we can use what we learned in part 1 with loc and iloc, or we may use items. In Pandas, the filter method takes in the items method in order to display columns. For the case of the Ticket, we want to display the ‘Name’ and ‘Ticket’ Column. If you are unfamiliar with this, refer back to part 1 and use info to find the metadata.
When comparing strings a great filter to use is contains. In this case we check if the ticket number contains 3734.
We see that the owner is a young man named William Henry Allen.
Great job! We found him, but he looks lost. We chat with William, and he has a request for us. He got separated at the checkpoint and needs help locating his business partners. He forgot their names, but recalls that they’re a group of 4 middle-aged men embarking in group Q.
We have a new problem, but unlike the ticket, this time all we know is that we’re searching for middle-aged men in group Q. None of these values are unique so we can’t only filter. We’ll need to start expanding our search repertoire with conditionals.
Conditions
Similar to other programming languages, Pandas supports filtering with conditional operators. There are 2 types of operators we will be focusing on, relational and logical. In Pandas, to apply a condition to a dataframe, the syntax is df[(“conditions”)].
The relational operators are > (greater than), >= (greater than or equal to), < (less than), <= (less than or equal to) , == (equals), != (not equals) and are used to search through a dataframe by comparing all values in a column with a fixed value. Then, to compare multiple columns we chain the conditionals of relational operators together with logical operators & (and), | (or) , ~ (not).
To find all middle-aged men in group Q, let’s break it down into 2 parts. The 1st part is to capture all the people in group Q. We can use a single relational operator on the embarked column, df[‘Embarked] == ‘Q’. For the males, df[‘Sex’] == ‘male’. Alternatively, we can use “not” to invert it as df[‘Sex ~== ‘female’. Finally we chain the conditionals together using & (AND) to get all males in group Q.
Additionally, use these (see info) boolean operators to compare dataframes.
Grouping Dataframes
The data is reduced. We begin analysis on how many are in our search space that match what we’re looking for. In this case we’re searching for the right age range.
Groupby is useful to get the count of how many values match, and from there we can begin slicing the data with head. Start by breaking down the ages into 3 groups, adolescent to adulthood, middle-aged, and senior. Then define rules for the 3 groups. The ranges will be 0–31, 32–55, and 56+ respectively. We know that age must be defined, so we clear up all empty age values. The ‘[‘ is inclusive and ‘(‘ is exclusive.
Create a range of bins and pass it as a parameter for groupby. Use groupby to find the rows of middle-aged and extract.
Cut down the data into chunks of 0–31, 32–55, and 56+
Sorting Dataframes
The 4 fellows we are looking for are in the middle of the dataframe. We sort it in order and slice specifically starting from 9 to 13 (9+4) to find only the middle-aged men in group Q.
There we have it! We’ve reduced the search space from 891 passengers aboard the ship all the way down to 4. Inspecting the dataframe, we see that there are 4 possible people.
Mr. Patrick Dooley, Mr. John Bourke, Mr. James Farrell, and Dr. William Edward Minahan
Conclusion
William should be able to reunite with his partners now. Just in time as the foghorn blows and the ship begins to sail off. Stay tuned for what happens to the Titanic as it embarks across the sea. We’ll take a deeper look, in part 3, “Terraforming Dataframes”.
Posted on September 23, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.