Pandas 101 - pt. ii: Practicing with DataFrames

hugoestradas

Hugo Estrada S.

Posted on June 29, 2020

As usual, the repository with all the notebooks for this three-part series on Pandas is here:

https://github.com/hugoestradas/Pandas_101


In my previous lecture I showed you the basics of the two main data structures of the Pandas library: Series and DataFrames.

Let's focus on the DataFrames for now.

1) Main Methods of Pandas

I'll start by creating our dummy DataFrame for this section:

Alt Text

In lecture pt. i I renamed the DataFrame's column headings, but there are other ways of doing this.

You can rename the column headings by assigning a new list of names:

Alt Text
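
As a minimal sketch of this approach (the DataFrame and its values are made up for illustration; only the "col one", "col two" and "col three" naming follows the text), assigning a list to df.columns might look like this:

```python
import pandas as pd

# Hypothetical dummy DataFrame; the values are illustrative only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Assign a new list of names; its length must match the number of columns.
df.columns = ["col one", "col two", "col three"]
print(df.columns.tolist())
```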

If you'd like to keep the column names but clean up their formatting, such as special characters or spaces between words, you can use the "str.replace()" method:

Alt Text
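
A short sketch, assuming the goal is to swap spaces for underscores (the replacement character is my choice, not necessarily the one used in the notebook):

```python
import pandas as pd

# Hypothetical dummy DataFrame with spaces in its column names.
df = pd.DataFrame({"col one": [1, 2], "col two": [3, 4], "col three": [5, 6]})

# Replace every space in the column names with an underscore.
# regex=False treats the pattern as a literal string.
df.columns = df.columns.str.replace(" ", "_", regex=False)
print(df.columns.tolist())
```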

By default, the DataFrame has a numerical index.
You might want to change it according to your needs:

Alt Text
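
One way to do this, sketched with made-up labels, is to assign a list of labels to df.index (calling set_index() on an existing column is another option):

```python
import pandas as pd

# Hypothetical dummy DataFrame; it starts with the default 0, 1, 2 index.
df = pd.DataFrame({"col one": [1, 2, 3], "col two": [10, 20, 30]})

# Replace the numerical index with custom labels (illustrative values).
df.index = ["x", "y", "z"]

# Rows can now be looked up by label.
print(df.loc["y"])
```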

If it just so happens that you need to edit the entire DataFrame, you can do so with apply and a lambda function (more about that here: https://dev.to/hugoestradas/5-cool-python-tricks-4gcl). This lets you edit the information in the DataFrame according to your needs, for example adding twice the value of "col three" to the "col two" column:

Alt Text
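
A minimal sketch of that operation, assuming a row-wise apply (the data is invented, and the notebook's exact expression may differ):

```python
import pandas as pd

# Hypothetical dummy DataFrame; the values are illustrative only.
df = pd.DataFrame({
    "col one": [1, 2, 3],
    "col two": [10, 20, 30],
    "col three": [100, 200, 300],
})

# axis=1 passes each row to the lambda, which returns the new "col two" value.
df["col two"] = df.apply(lambda row: row["col two"] + 2 * row["col three"], axis=1)
print(df)
```

For simple arithmetic like this, the column expression df["col two"] + 2 * df["col three"] gives the same values without a lambda.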

If you need to delete a record from the DataFrame, it comes in handy to have a well-defined index when using the drop() function:

Alt Text
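
A small sketch of dropping a record by its index label (the labels and data are hypothetical):

```python
import pandas as pd

# Hypothetical dummy DataFrame with a labelled index.
df = pd.DataFrame(
    {"col one": [1, 2, 3], "col two": [10, 20, 30]},
    index=["x", "y", "z"],
)

# drop() returns a new DataFrame without the row labelled "y".
df = df.drop("y")
print(df)
```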

As a Data Scientist you might want to create copies of your DataFrame:

Alt Text
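
A quick sketch showing that a copy made with copy() is independent of the original (the variable name backup is hypothetical):

```python
import pandas as pd

# Hypothetical dummy DataFrame.
df = pd.DataFrame({"col one": [1, 2, 3], "col two": [10, 20, 30]})

# copy() makes a deep copy by default, so later edits to df do not affect it.
backup = df.copy()

# Change the original; the copy keeps the old value.
df.loc[0, "col one"] = 99
print(backup.loc[0, "col one"], df.loc[0, "col one"])
```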

You might want to keep only certain data in the DataFrame. You can achieve this using .loc with the help of some comparison and logical operators:

Alt Text
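
As an illustration, here is a sketch that combines two conditions with & inside .loc (the thresholds and data are made up):

```python
import pandas as pd

# Hypothetical dummy DataFrame.
df = pd.DataFrame({"col one": [1, 2, 3], "col two": [10, 20, 30]})

# Each condition needs its own parentheses; & means "and", | means "or".
filtered = df.loc[(df["col one"] > 1) & (df["col two"] < 30)]
print(filtered)
```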

2) Multi-Indexing

Let's say you're reading a book with lots of chapters. In the index you look up a title, which leads you to the specific page or chapter you're looking for. In Pandas, an Index is a lot like this: it works like an address, and it's how any data point across a DataFrame or Series can be accessed.

For this section I'll need a bigger and more complex DataFrame, which I'm going to create from a .csv file (which you can find here: https://www.kaggle.com/mokosan/lord-of-the-rings-character-data/download):

Alt Text

Since I'm using Azure Databricks I'm uploading the file into DBFS (the Databricks File System), but you might not need to do this (depending on which Jupyter-like tool you're using).

If you want to know more about your DataFrame simply use ".info()", or if you want to take a glimpse at it use ".head()":

Alt Text

A MultiIndex is a multi-level, or hierarchical, index that allows you to have multiple columns acting as a row identifier, with each index column related to another through a parent/child relationship.

To start this exercise, I first need to obtain the original DataFrame’s index labels. We can use this code:

Alt Text
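
A sketch of that check on a small stand-in DataFrame (the column names and rows here are hypothetical, not the actual columns of the Kaggle file):

```python
import pandas as pd

# Hypothetical stand-in for the character data.
characters = pd.DataFrame({
    "name": ["Frodo", "Galadriel", "Gimli"],
    "race": ["Hobbit", "Elf", "Dwarf"],
    "gender": ["Male", "Female", "Male"],
})

# The default index has no name, so the only entry is None.
print(characters.index.names)
```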

The output, a “FrozenList”, is a Pandas-specific construct used to show the index label(s) of a DataFrame.
Here, we see the value is “None”, as this is the default value of a DataFrame’s index.

In order to create a MultiIndex with the original DataFrame, all we need to do is pass a list of columns into the .set_index() Pandas function like this:

Alt Text

Alt Text
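
Here is a minimal sketch of the same idea with two index columns instead of four (the column names and data are hypothetical stand-ins for the real file):

```python
import pandas as pd

# Hypothetical stand-in for the character data.
characters = pd.DataFrame({
    "name": ["Frodo", "Galadriel", "Gimli"],
    "race": ["Hobbit", "Elf", "Dwarf"],
    "gender": ["Male", "Female", "Male"],
})

# Passing a list of columns to set_index() builds a MultiIndex from them.
multiindex = characters.set_index(["race", "gender"])
print(multiindex)
print(multiindex.index.names)
```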

You can see that the new DataFrame called “multiindex” has been organized so that there are now four columns that make up the index. We can check this by looking at the index names once more:

Alt Text

The previous “None” value has been replaced by the names of the four columns we assigned to be our new index.
Each index value in the regular, unaltered DataFrame would just be a number from 0 to 730 (because the DataFrame has 731 rows). To show you what each index value is in our newly created “multiindex”, we can use this line of code:

Alt Text
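
One way to inspect those values, sketched on hypothetical data, is the .index.values attribute, where each entry is a tuple with one value per index level (the notebook may use a different call):

```python
import pandas as pd

# Hypothetical stand-in for the character data.
characters = pd.DataFrame({
    "name": ["Frodo", "Galadriel", "Gimli"],
    "race": ["Hobbit", "Elf", "Dwarf"],
    "gender": ["Male", "Female", "Male"],
})
multiindex = characters.set_index(["race", "gender"])

# Each index value is a tuple such as ("Hobbit", "Male").
index_values = multiindex.index.values
print(index_values)
```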

When creating a DataFrame with a MultiIndex directly, make sure to append the .set_index() call to the end of the line of code that creates it, like this:

Alt Text
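
One reading of this is chaining .set_index() directly onto the call that creates the DataFrame. A sketch under that assumption, using in-memory CSV text with hypothetical columns in place of the real file:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file on disk.
csv_text = "race,gender,name\nHobbit,Male,Frodo\nElf,Female,Galadriel\n"

# Create the DataFrame and set the MultiIndex in a single chained statement.
multiindex = pd.read_csv(io.StringIO(csv_text)).set_index(["race", "gender"])
print(multiindex)
```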

3) Select Columns by Data Type

For this section I'm going to use another dataset (which you can find in the repository's "data" folder):

Alt Text

Let's start by checking all the data types of the DataFrame:

Alt Text

If you're interested exclusively in numeric columns, use the "select_dtypes" method:

Alt Text
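
A small sketch of selecting only the numeric columns (the DataFrame is a made-up stand-in for the dataset in the repository):

```python
import pandas as pd

# Hypothetical stand-in dataset with mixed column types.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": pd.Series([1999, 2005, 2012], dtype="int64"),
    "rating": pd.Series([7.5, 8.1, 6.9], dtype="float64"),
    "seen": [True, False, True],
})

# "number" matches integer and float columns; boolean columns are not included.
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())
```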

Or... maybe the opposite of that:

Alt Text
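
The reverse selection can be sketched with the exclude parameter (again on made-up data):

```python
import pandas as pd

# Hypothetical stand-in dataset with mixed column types.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": pd.Series([1999, 2005, 2012], dtype="int64"),
    "rating": pd.Series([7.5, 8.1, 6.9], dtype="float64"),
    "seen": [True, False, True],
})

# Keep every column that is NOT numeric.
non_numeric = df.select_dtypes(exclude="number")
print(non_numeric.columns.tolist())
```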

You can even be more specific by creating a list of the data types you're interested in:

Alt Text
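
A sketch with an explicit list of dtypes (the list name wanted_types and the chosen types are my own picks for illustration):

```python
import pandas as pd

# Hypothetical stand-in dataset with mixed column types.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": pd.Series([1999, 2005, 2012], dtype="int64"),
    "rating": pd.Series([7.5, 8.1, 6.9], dtype="float64"),
    "seen": [True, False, True],
})

# Only columns whose dtype appears in this list are kept.
wanted_types = ["int64", "bool"]
selected = df.select_dtypes(include=wanted_types)
print(selected.columns.tolist())
```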

4) Reduce the DataFrame Size

Pandas DataFrames are designed to fit into memory, and sometimes it's necessary to reduce the size of a DataFrame to work with it smoothly.

Let's start by getting the current size of the DataFrame:

Alt Text
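
Two common ways to check this are info(memory_usage="deep") and memory_usage(deep=True). A sketch on made-up data, so the size printed will not match the figure quoted in this post:

```python
import pandas as pd

# Hypothetical stand-in dataset.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [1999, 2005, 2012],
    "rating": [7.5, 8.1, 6.9],
})

# info() prints a summary that ends with the total memory usage.
df.info(memory_usage="deep")

# memory_usage() returns bytes per column (plus the index); sum them for a total.
total_bytes = int(df.memory_usage(deep=True).sum())
print(f"{total_bytes / 1024:.1f} KB")
```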

Currently it is using 30.4 KB.

This is a very small DataFrame, but if it just so happens that you're experiencing performance problems, or you're not able to read the DataFrame at all, these are the steps to follow to reduce its size.

First, keep only the columns that you really need to work with. You can do this by creating a list of them:

Alt Text

After that, simply recreate the DataFrame with the columns you specified, and see how considerably its size shrinks (from 30.4 KB to 13.6 KB):

Alt Text
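
A sketch of these two steps, assuming the DataFrame is recreated with read_csv and its usecols parameter (the CSV text, column names and the list name columns_needed are hypothetical):

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real file.
csv_text = (
    "title,year,rating,notes\n"
    "A,1999,7.5,long free-text note\n"
    "B,2005,8.1,another long note\n"
)

# Read everything first to get a baseline size.
full = pd.read_csv(io.StringIO(csv_text))

# Then read only the columns that are actually needed.
columns_needed = ["title", "year"]
small = pd.read_csv(io.StringIO(csv_text), usecols=columns_needed)

full_bytes = int(full.memory_usage(deep=True).sum())
small_bytes = int(small.memory_usage(deep=True).sum())
print(full_bytes, small_bytes)
```

Reading only the needed columns avoids loading the rest into memory in the first place, which matters most when the full file is too large to read comfortably.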
