Data Visualization with Python pt. i

hugoestradas

Hugo Estrada S.

Posted on January 15, 2021


First things first: all the code I cover in this lecture is right here:

https://github.com/hugoestradas/Data_Visualisation_with_Python.git


Part 1: Using Matplotlib for the Very First Time

Matplotlib is a popular data visualization library for Python.

I'm going to use it because it's fairly easy to use and, of the many Python data visualization libraries, it's the most commonly used one.

With Matplotlib you'll be able to create many different types of charts.

Let's see how to create a line chart with Matplotlib:

The first thing to do is to import the Matplotlib module:
Alt Text

Now to start plotting data I'll use the following line:

Alt Text
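In case the screenshots don't come through, here's a minimal sketch of the import and the plotting line. The `plt` alias is the usual convention, and I'm assuming the notebook follows it:

```python
import matplotlib.pyplot as plt

# First list: values for the x axis; second list: the matching y values
plt.plot([1, 2, 3], [1, 4, 9])
```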

This line says "put 1, 2 and 3 on the 'x' axis, and 1, 4 and 9 on the 'y' axis".
To show this plot, the following line is necessary:

Alt Text

It is possible to add labels for the 'x' and 'y' axes and a title for the whole plot:

Alt Text

The whole cell should look like this, and the end plot should be the following:

Alt Text
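Put together, the cell could look roughly like the sketch below. The label and title strings are placeholders I picked for illustration, not necessarily the ones used in the notebook:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [1, 4, 9]

plt.plot(x, y)

# Placeholder texts; use whatever describes your own data
plt.xlabel("x values")
plt.ylabel("y values")
plt.title("My first line chart")

# Render the figure
plt.show()
```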

It's also possible to plot multiple lines on the same plot:

Alt Text
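A minimal sketch of the idea: every call to `plt.plot()` before `plt.show()` draws onto the same figure. The second series (`doubles`) is a hypothetical one added only to have something to compare against:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]
squares = [1, 4, 9]   # the line from the first example
doubles = [2, 4, 6]   # hypothetical second line, for illustration only

# Each call adds one more line to the current plot
plt.plot(x, squares)
plt.plot(x, doubles)
plt.show()
```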

And the plot looks like this:

Alt Text

To make clear which line is which, you can name each one using the "plt.legend" function:

Alt Text
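One way to do this, as a sketch, is to pass `plt.legend()` a list of names in the same order the lines were plotted. The series and their names here are made up for illustration:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]
squares = [1, 4, 9]
doubles = [2, 4, 6]   # hypothetical second line

plt.plot(x, squares)
plt.plot(x, doubles)

# Names are matched to the lines in plotting order
plt.legend(["squares", "doubles"])
```

An alternative is to pass `label="squares"` to each `plt.plot()` call and then call `plt.legend()` with no arguments.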

And the plot looks like this:

Alt Text

It is possible to export the plot as an image as well:

Alt Text
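A minimal sketch using `plt.savefig()`. The file name is a placeholder, and the image format is inferred from its extension:

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])

# Save before calling plt.show(); outside a notebook the figure
# may already be cleared by the time show() returns
plt.savefig("first_plot.png")
```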

Part 2: Using Pandas

Pandas is a Python library that helps you import, organize and process data. Its dataframes are similar to "dataframes" in the R language.

Let's create a dataframe in Pandas, select data with Boolean indexing and finally plot using the same Pandas dataframe:

Alt Text

This is the data I'll be using:

Alt Text

To create a dataframe to store this data, I'm going to create some dummy data as a dictionary with three keys: 'year', 'attendees' and 'average age'.

Alt Text

And after executing the cell, the displayed dataframe should look like this one:

Alt Text

I can assign this newly created dataframe to a variable called 'df' (the conventional variable name for a dataframe in Pandas):

Alt Text
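Here's a sketch of both steps. The three keys come from the text above, but the numbers are illustrative values I made up, not the ones in the original screenshot:

```python
import pandas as pd

# Illustrative values only; substitute your own data
data = {
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
}

# Each dictionary key becomes a column, each list its values
df = pd.DataFrame(data)
print(df)
```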

And the result should be the same:

Alt Text

There are three columns in this dummy dataframe. You can select a single column from it, for example:

Alt Text

The type of this new data is something called a "Pandas Series".

It's similar to a regular Python list and to a NumPy array, if you're familiar with the NumPy library.

Alt Text
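As a quick sketch, selecting a column with square brackets and checking its type could look like this (same made-up values as before):

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# Square brackets with a column name return that column as a Series
years = df["year"]
print(type(years))
print(isinstance(years, pd.Series))  # True
```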

Knowing this, you can apply an inequality operation on the series with df['year'] < 2010:

Alt Text

This returns a series of Boolean values:

Alt Text

Let's store the output into a variable:

Alt Text

Using the Boolean Series you can select only the part of the data where the year is earlier than 2010. This is called "Boolean indexing":

Alt Text
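The whole sequence, as a sketch: compare, store the Boolean Series, then use it to filter the dataframe. The variable names `earlier_than_2010` and `early_events` are my own, and the values are still the made-up ones:

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# The comparison is applied to every element and yields a Boolean Series
earlier_than_2010 = df["year"] < 2010
print(earlier_than_2010.tolist())  # [True, False, False]

# Passing the Boolean Series to the dataframe keeps only the True rows
early_events = df[earlier_than_2010]
print(early_events)
```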

Imagine that you want to examine how the number of attendees has changed over the last three events.

To best figure this out, you might want to plot the number of attendees against the year:

Alt Text
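A sketch of that plot, passing dataframe columns straight to `plt.plot()` (values made up for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# A Series can be used wherever plt.plot() accepts a list
plt.plot(df["year"], df["attendees"])
plt.show()
```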

This line puts the year on the x axis and the attendees on the y axis, and the result is the following:

Alt Text

If you want to plot the number of attendees and the average age on the same plot, you can just call 'plt.plot()' multiple times:

Alt Text
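Roughly like the sketch below. The legend call is my own addition so the two lines can be told apart, and the values are still the made-up ones:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# Both lines share the same x axis (the year)
plt.plot(df["year"], df["attendees"])
plt.plot(df["year"], df["average age"])

# Added for readability; names follow plotting order
plt.legend(["attendees", "average age"])
plt.show()
```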

Part 3: Importing Data with Pandas

For this example, the sample data I'm going to use is the following .csv file:

Alt Text

It is a list of countries and their basic demographics, with one entry every five years from 1952 to 2007.

To import this .csv file, make sure that you either know the path of the file or that both the notebook and the .csv file are in the same location within the Jupyter instance.

Alt Text

This dataset is pretty small, but in real-world scenarios, if you want a glimpse of the data you're dealing with, all you need to do is use the 'head()' method:

Alt Text
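Here's a sketch of the import and the `head()` call. So that it runs on its own, it reads a tiny in-memory stand-in instead of the real file. The file name `countries.csv`, the column names and every number below are assumptions of mine, so check them against your own file:

```python
import io

import pandas as pd

# Stand-in for the real file, with made-up placeholder numbers.
# With the real file you would write: data = pd.read_csv("countries.csv")
csv_text = """country,year,population,gdpPerCapita
Afghanistan,1952,1000,100.0
Afghanistan,1957,1100,110.0
Afghanistan,1962,1200,120.0
Albania,1952,500,200.0
Albania,1957,550,220.0
Albania,1962,600,240.0
"""

data = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default
print(data.head())
```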

This gives you the first five rows of the dataframe:

Alt Text

Now, suppose you would like to plot how GDP per capita has changed over time in Afghanistan.

To do that, it's necessary to isolate the data about Afghanistan from the data variable.
To select the country column you can write:

Alt Text

or

Alt Text

Either syntax does exactly the same thing:

Alt Text
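As a sketch, the two forms are attribute access and square brackets. The stand-in data and its column names are assumptions, as before:

```python
import io

import pandas as pd

# Made-up stand-in rows; the real file has many more
csv_text = """country,year,gdpPerCapita
Afghanistan,1952,100.0
Afghanistan,1957,110.0
Albania,1952,200.0
"""
data = pd.read_csv(io.StringIO(csv_text))

by_attribute = data.country   # attribute-style access
by_key = data["country"]      # bracket-style access

# Both return the same Series
print(by_attribute.equals(by_key))  # True
```

The bracket form is the safer habit, since attribute access does not work for column names containing spaces or names that clash with existing dataframe methods.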

Using this and Boolean Indexing you can select only the data about Afghanistan with:

Alt Text

And to plot it:

Alt Text
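Filtering and plotting together could look like the sketch below. The `gdpPerCapita` column name and all the numbers are placeholders of mine; the real file may name its columns differently:

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows with placeholder numbers
csv_text = """country,year,gdpPerCapita
Afghanistan,1952,100.0
Afghanistan,1957,110.0
Afghanistan,1962,120.0
Albania,1952,200.0
Albania,1957,220.0
Albania,1962,240.0
"""
data = pd.read_csv(io.StringIO(csv_text))

# Boolean indexing: keep only the rows where country is Afghanistan
afghanistan = data[data.country == "Afghanistan"]

# Year on the x axis, GDP per capita on the y axis
plt.plot(afghanistan.year, afghanistan.gdpPerCapita)
plt.title("GDP per capita in Afghanistan")
plt.show()
```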

And the final plot is the following:

Alt Text
