Using pipes in Pandas: a journey of flow and data
Fernando
Posted on December 8, 2023
The Pandas library is one of the most important Python libraries to learn, especially if you want to work in data science. A common challenge is managing the cleaning and transformation of data without writing code that is long and convoluted.
What is imperative programming?
Imperative programming is a style that focuses on the "how" rather than the "what", detailing the precise instructions for Pandas to execute. You explicitly define each step required to manipulate and transform the data, usually by modifying the same DataFrame one statement at a time.
import pandas as pd

# Load the data, then modify the same DataFrame step by step
df = pd.read_csv('dataset.csv')
df.drop_duplicates(inplace=True)
df.dropna(subset=['column1', 'column2'], inplace=True)

try:
    df['date'] = pd.to_datetime(df['date'])
except ValueError:
    print("Invalid dates found, dropping rows...")
    # Turn unparseable dates into NaT, then drop those rows
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    df.dropna(subset=['date'], inplace=True)

df.rename({'old_name': 'new_name'}, axis='columns', inplace=True)
This is a straightforward approach for beginners, but it can become cumbersome and hard to maintain for complex data manipulation tasks. As the number of steps grows, the code becomes lengthy and difficult to follow.
Programming with a functional approach
Let's walk through a cleaning and transformation process for a small CSV file. Writing each step as its own function lets us track every change we make to the data and check that each one does what we expect, which reduces the risk of errors.
Id,Age,Name,Genre Likes,Height,Weight
12345,20,John Doe,Rock,5'10",150
23456,21,Jane Smith,Pop,5'7",130
34567,22,Alice Johnson,Country,5'9",160
45678,23,Bob Brown,Hip Hop,6'0",180
56789,24,Charlie Wilson,Classical,5'8",140
67890,25,Diana Ross,Jazz,5'11",170
78901,26,Elvis Presley,Blues,5'11",185
89012,27,Mariah Carey,R&B,5'9",175
90123,28,Michael Jackson,Soul,5'10",165
101234,,Madonna,Dance,5'7",145
111235,30,Lady Gaga,Electronic,5'1",120
121236,31,Katy Perry,Indie,5'8",135
131237,32,Taylor Swift,Folk,5'10",155
141238,33,Beyonce,Alternative,5'7",160
151239,34,Pink,Metal,5'7",140
161240,35,Justin Bieber,,5'9",
171241,36,Selena Gomez,Reggae,5'5",130
181242,37,Ed Sheeran,Ska,5'8",160
191243,38,Ariana Grande,Funk,5'0",125
201244,39,Dua Lipa,Disco,5'8",145
211245,40,Billie Eilish,Latin,5'3",115
221246,41,The Weeknd,K-Pop,5'11",175
231247,42,Olivia Rodrigo,J-Pop,5'6",130
241248,43,Imagine Dragons,Vocal,5'11",180
251249,44,Maroon 5,,6'0",190
As you can see, some values are missing from the dataset (an Age, two Genre Likes entries, and a Weight), so we need to clean the data before using it. Here is the code that we will use for the cleaning process.
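import pandas as pd

def clean_missed_data(df):
    # dropna(inplace=True) returns None and would break the chain,
    # so return the new DataFrame instead
    return df.dropna()

def filter_by_age(df):
    # Keep only rows where Age is 40 or less
    return df[df["Age"] <= 40]

data = pd.read_csv("../data/data.csv")
cleaned_df = data.pipe(clean_missed_data).pipe(filter_by_age)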
The code above demonstrates Pandas' pipe() method, which lets you chain several transformations on a DataFrame. Each function receives the DataFrame returned by the previous step, so the pipeline reads in the same order in which the steps run.
By following this approach, we avoid cluttering the code with extra variables and assignment statements.
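To make the mechanics concrete: df.pipe(f) is simply f(df), and any extra positional or keyword arguments given to pipe() are forwarded to f. The sketch below is only an illustration; the tiny DataFrame is made up, and filter_by_max_age is a hypothetical helper that does not appear in the code above.

```python
import pandas as pd

# Small made-up frame, for illustration only
df = pd.DataFrame({"Name": ["Ann", "Ben", "Cy"], "Age": [20.0, None, 45.0]})

def drop_missing(df):
    # dropna() without inplace returns a new DataFrame, so the chain can continue
    return df.dropna()

def filter_by_max_age(df, max_age):
    # Hypothetical helper: keep rows whose Age is at most max_age
    return df[df["Age"] <= max_age]

# Extra keyword arguments passed to pipe() are forwarded to the function
piped = df.pipe(drop_missing).pipe(filter_by_max_age, max_age=40)

# The same result written as nested calls, which has to be read inside-out
nested = filter_by_max_age(drop_missing(df), max_age=40)

print(piped.equals(nested))
```

Both versions produce the same DataFrame; the piped form just lists the steps in the order they are applied.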
To extend the pipeline, we will now add two more functions: one converts heights given in feet and inches to meters, and the other converts weights in pounds to kilograms.
def inch_to_meters(x):
    # Leave missing values untouched
    if pd.isna(x):
        return x
    # Heights look like 5'10": feet before the apostrophe, inches before the quote
    ft, inc = x.split("'")
    inches = inc.split("\"")[0]
    return ((12 * int(ft)) + int(inches)) * 0.0254

def convert_feet_to_meters(df):
    df['Height'] = df['Height'].apply(inch_to_meters)
    return df

def convert_pounds_to_kilograms(df):
    df['Weight'] = df['Weight'] * 0.453592
    return df

universal_df = cleaned_df.pipe(convert_feet_to_meters).pipe(convert_pounds_to_kilograms)
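As a quick sanity check on the conversion arithmetic, here is the first row of the dataset (John Doe, 5'10", 150 lb) worked through by hand. This is a standalone sketch: height_to_meters is a hypothetical name, and the constants are the same ones used above. It should give roughly 1.78 m and roughly 68 kg.

```python
INCH_TO_M = 0.0254    # meters per inch
LB_TO_KG = 0.453592   # kilograms per pound, rounded as in the article

def height_to_meters(text):
    # Hypothetical helper: parse a string such as 5'10" into meters
    feet, rest = text.split("'")
    inches = rest.rstrip('"')
    return (12 * int(feet) + int(inches)) * INCH_TO_M

john_height = height_to_meters("5'10\"")  # 12 * 5 + 10 = 70 inches
john_weight = 150 * LB_TO_KG

print(round(john_height, 3), round(john_weight, 2))
```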
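When using pipe() in Pandas, be aware that it does not copy the DataFrame: it simply passes it to your function. A function that assigns to a column, like the two converters above, therefore modifies the DataFrame it receives. To avoid changing the original, work on a copy inside the function.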
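A minimal sketch of the copy approach, assuming made-up data and a hypothetical function name (convert_pounds_to_kilograms_copy is not part of the code above):

```python
import pandas as pd

def convert_pounds_to_kilograms_copy(df):
    # Hypothetical non-mutating variant: copy first, then modify the copy
    out = df.copy()
    out["Weight"] = out["Weight"] * 0.453592
    return out

# Made-up data, for illustration only
original = pd.DataFrame({"Name": ["John Doe"], "Weight": [150.0]})
converted = original.pipe(convert_pounds_to_kilograms_copy)

print(original["Weight"].iloc[0])   # still in pounds
print(converted["Weight"].iloc[0])  # now in kilograms
```

Another option is df.assign(Weight=df["Weight"] * 0.453592), which also returns a new DataFrame and leaves the input untouched. Copying costs extra memory on large datasets, so whether it is worth it depends on whether you still need the original.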