Francesco Tisiot
Posted on June 23, 2021
The world is seldom perfect, and the same goes for real-life data. Read on to find out how to clean up the data in your PostgreSQL database so that it's usable for analysis.
Creating information is like any creative process. It starts with finding the materials (in this case, datasources) and continues with cleaning, joining and wrangling datasets. By the time you finish you've built something beautiful and gained new insights to share - and then you start all over again.
One of the most widely-used languages in the creation of information is Python, loved by data scientists, engineers and analysts for its great ecosystem of existing libraries for data wrangling.
This blog post explores three of such libraries: pandas
, ddlgenerator
and psycopg2
. These three enable us to clean a dataset and push it to a PostgreSQL database where the data can later be queried and exposed to a huge variety of company figures.
Today, the objective is simple: we will explore a happiness dataset and try to find out where in the world we should move to have joyful life!
0. Get the dataset
As mentioned before, begin by finding a dataset. Kaggle is a website widely used in the data science community, providing datasets used for challenges, competitions or learning. They have this nice dataset about World Happiness which only requires a login to be downloaded. It contains 5 CSV files, one per year, listing the happiness ranking of various countries together with some other indicators.
Let's create a folder named happiness
and a subfolder named data
for storing the file files:
mkdir -p happiness/data
cd happiness
Next, download the data from Kaggle and place the 5 CSVs in the happiness/data
folder.
1. Clean the data
As it happens frequently, the source datasets are not always tidy and clean. This is valid for our happiness dataset too.
When browsing the files we can note that 2015.csv
and 2016.csv
share the same format of column names and data. The 2017.csv
has the Region
column missing, some others named and ordered differently (Generosity
and Trust
are swapped). The same applies to the files 2018.csv
and 2019.csv
where we can't find the confidence intervals anymore. Moreover, the year is only contained in the file name, and not in a column.
Before starting our cleaning efforts, it's a good practice to create a separate virtual environment every time we start a new Python project. This ensures our default Python environment is not overloaded with possibly conflicting libraries. We can create a virtual environment and enable it like this:
python -m venv python_pg
source python_pg/bin/activate
The files need some work in order to produce a consistent view of the data, so let's put on our Python superhero mask. Start by installing the pandas library, which will help us with data manipulation and analysis, by executing the following code in your terminal:
pip install pandas
Now we can create a file named prepare.py
in our happiness
folder and start including some Python code in it. We'll start by importing the library installed earlier, together with os
and glob
which should be installed by default and enable the interaction with the operative system and pathname pattern expansion respectively.
import pandas as pd
import glob, os
Next, define the column names that we're going to use in our target DataFrame named all_years_df
to store the dataset once cleaned. Add the following lines to the prepare.py
file:
column_names = ['survey_yr', 'country', 'overall_rank',
'score', 'gdp', 'generosity', 'freedom',
'social_support', 'life_exp', 'gov_trust']
all_years_df = pd.DataFrame()
Now it's time to bring order to chaos and parse all the CSV files in the data
folder. Add the following code to the prepare.py
file:
files = glob.glob('data/*.csv')
for fp in files:
# Handling file names as new column
file_year = os.path.basename(fp).split('.')[0]
file_content = pd.read_csv(fp).assign(Year=int(file_year)-2000)
# Picking the right column for each file based on the year
uniformed_columns = find_columns(file_content, file_year)
# Conforming column names and appending the parsed year data to all_years_df
uniformed_columns.columns = column_names
all_years_df = pd.concat([all_years_df, uniformed_columns])
all_years_df.to_csv('results.csv', index=False)
The glob
function returns the list of files in the happiness/data
folder, that we loop over. For each file, we read the filename with os.path.basename
, split the name on the dot (.
) and take the first part, extracting only the year information. The file_content
line uses pandas' read_csv
function to load the file data and populates the new column Year
(using the assign
function) with the last two digits of the year (int(file_year)-2000
).
Next we call the find_columns
function, which we'll define soon, to select the correct columns from the source file depending on the year.
After that, we consolidate the column names in the uniformed_columns
DataFrame containing only one year of data, and then append it to the all_years_df
DataFrame that will contain the whole dataset at the end of the process. Lastly we store the end result in a CSV file named results.csv
in the happiness
folder.
Picking the correct columns
The missing part is the find_columns
function definition where we consolidate the column names and order. As mentioned above, depending on the year, we need to select the correct columns from the source file. Add the following code to the prepare.py
file, just after the import
statements, to perform the trick on various years:
def find_columns(file_content, file_year):
if file_year in ['2015', '2016']:
uniformed_columns = file_content[[
'Year', 'Country', 'Happiness Rank', 'Happiness Score',
'Economy (GDP per Capita)', 'Generosity', 'Freedom', 'Family',
'Health (Life Expectancy)', 'Trust (Government Corruption)'
]]
elif file_year in ['2017']:
uniformed_columns = file_content[[
'Year', 'Country', 'Happiness.Rank', 'Happiness.Score',
'Economy..GDP.per.Capita.', 'Generosity', 'Freedom', 'Family',
'Health..Life.Expectancy.', 'Trust..Government.Corruption.'
]]
else:
uniformed_columns = file_content[[
'Year', 'Country or region', 'Overall rank', 'Score',
'GDP per capita', 'Generosity', 'Freedom to make life choices',
'Social support', 'Healthy life expectancy',
'Perceptions of corruption'
]]
return uniformed_columns
The function picks the correct column for each year and returns the resulting DataFrame.
It's now time to check that the pipeline is working. With the terminal positioned on the happiness
folder let's execute the following:
python prepare.py
Now check the file results.csv
in the happiness
folder to verify that the code worked. The results should look like this:
2. Create a PostgreSQL instance
The second step is to store our data in a PostgreSQL table. But before storing, we need a PG Database, let's create one with Aiven CLI in our terminal
avn service create demo-pg \
-t pg \
--cloud google-europe-west3 \
-p hobbyist
The above command creates a PostgreSQL instance named demo-pg
on google-europe-west3
using a hobbyist
plan. You can review all PostgreSQL plans in our pricing page. To connect to it, we'll need the service uri
containing the information like host
, port
, and the default avnadmin
user's password
. We can retrieve it with:
avn service get demo-pg --format '{service_uri}'
Now a couple of minutes of relax, while waiting for the instance to be ready with the following call in the terminal
avn service wait demo-pg
3. Push the data to PostgreSQL
Once the service is running, go back to Python and prepare the data push.
We'll use the ddlgenerator library to automatically generate the SQL statement to create the PostgreSQL table, and psycopg2 to interact with the database. We can install both with the following line in the terminal:
pip install psycopg2 ddlgenerator
Now create a new push.py
file and add the import and connect to the database statements. Replace in the following code the <SERVICE_URI_OUTPUT>
parameter with the output of the avn service get
command above:
import psycopg2
conn = psycopg2.connect('<SERVICE_URI_OUTPUT>')
The connection is established. We can now create the table and push the data. ddlgenerator
provides a nice functionality to create both table DDLs and insert statements from a CSV file. We can append the following code to the push.py
file:
from ddlgenerator.ddlgenerator import Table
table = Table('results.csv', table_name='HAPPINESS')
sql = table.sql('postgresql', inserts=True)
After importing the Table
class from ddlgenerator
, the code creates an instance parsing the results.csv
file, with a target table name of HAPPINESS
. The last line creates the CREATE
table definition SQL for PostgreSQL, including the insert statements for each line (inserts=True
). If we print the SQL statement it should look like the following:
DROP TABLE IF EXISTS happiness;
CREATE TABLE happiness (
survey_yr INTEGER NOT NULL,
country VARCHAR(24) NOT NULL,
overall_rank INTEGER NOT NULL,
score DECIMAL(15, 14) NOT NULL,
gdp DECIMAL(17, 16) NOT NULL,
generosity DECIMAL(16, 16) NOT NULL,
freedom DECIMAL(16, 16) NOT NULL,
social_support DECIMAL(16, 15) NOT NULL,
life_exp DECIMAL(17, 16) NOT NULL,
gov_trust DECIMAL(16, 16)
);
INSERT INTO happiness (survey_yr, country, overall_rank, score, gdp, generosity, freedom, social_support, life_exp, gov_trust) VALUES (19, 'Finland', 1, 7.769, 1.34, 0.153, 0.596, 1.587, 0.986, 0.393);
INSERT INTO happiness (survey_yr, country, overall_rank, score, gdp, generosity, freedom, social_support, life_exp, gov_trust) VALUES (19, 'Denmark', 2, 7.6, 1.383, 0.252, 0.592, 1.573, 0.996, 0.41);
...
We can now execute the sql statement against the database by adding the following code to the push.py
file
cur = conn.cursor()
cur.execute(sql)
conn.commit()
cur.close()
conn.close()
We created a cursor, executed the sql statement, committed the change to the database and closed both cursor and connection. It's now time to run the code. With the same terminal on the happiness
folder, run the following:
python push.py
Congrats, the data is now loaded in PostgreSQL.
Query the Data in PostgreSQL
Let's verify that the data is correctly populated in the PostgreSQL happiness
table. Create a new Python file query.py
with the following content (as before, amend the connection service uri
):
import psycopg2
conn = psycopg2.connect('<SERVICE_URI_OUTPUT>')
cur = conn.cursor()
cur.execute('''
SELECT SURVEY_YR, COUNTRY, OVERALL_RANK
from HAPPINESS
WHERE OVERALL_RANK <= 3
ORDER BY SURVEY_YR, OVERALL_RANK
''')
print(cur.fetchall())
cur.close()
conn.close()
As per the push.py
file, we are creating a connection to PostgreSQL, then executing a query to check which countries were in the top three position over the various years. The result should be similar to:
[
(15, 'Switzerland', 1), (15, 'Iceland', 2), (15, 'Denmark', 3),
(16, 'Denmark', 1), (16, 'Switzerland', 2), (16, 'Iceland', 3),
(17, 'Norway', 1), (17, 'Denmark', 2), (17, 'Iceland', 3),
(18, 'Finland', 1), (18, 'Norway', 2), (18, 'Denmark', 3),
(19, 'Finland', 1), (19, 'Denmark', 2), (19, 'Norway', 3)
]
Looks like Northern European countries are a great place to live! Take particular notice of how Aiven's home country, Finland
, has been on top in the last two years.
The data, now available in a PostgreSQL relational table, can be queried and exposed by all major analytics and reporting tools, making it accessible by a wider audience.
Wrapping up
Finding the right dataset is just the beginning: cleaning, wrangling and publishing the data creates information, which helps people in their decision-making. Python is a great fit in this area and widely adopted in the data community. The pandas
, ddlgenerator
and psycopg2
libraries mentioned in the blog enable an easy data handling and push to PostgreSQL. This makes for a solid base for data discovery or corporate reporting.
Some more info in this area:
- Aiven for PostgreSQL: where you can find all the info about Aiven's offering
- Aiven PostgreSQL supported extensions: where you can browse the extensions we support
- pandas, ddlgenerator and psycopg2: where you can find documentation about the libraries used
Posted on June 23, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.