Lucas Barret
Posted on March 20, 2023
Introduction
Sometimes you reach a point where you connect dots between different things. I am currently digging into SQL, and for each new concept I learned, I tried to illustrate it with a Ruby project.
Here is the thing: I love Ruby, it is a cool scripting language. But I have to admit it is not the best language for analytical work. Plus, it is fun sometimes to learn new things and practice our skills another way. So I decided to use Polars and Python to get another point of view.
Wait, what? Polars is not SQL. What am I talking about? How can I learn SQL using Polars and Python?
My point is that what matters are the concepts, and many of them (not all, I suppose) are common to both SQL and Polars: relational algebra or window functions, for example.
This article is an introduction to using Polars and Python to learn the concepts of SQL.
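To give a taste of that overlap, here is a minimal sketch of a window function in Polars, built on a toy DataFrame of my own (the column names are made up for illustration). In SQL terms it is COUNT(*) OVER (PARTITION BY language).
import polars as pl

# A tiny made-up DataFrame, only to illustrate the shared concept
repos = pl.DataFrame({
    "repository": ["a/x", "b/y", "c/z", "d/w"],
    "language": ["Python", "Python", "Ruby", "Ruby"],
})

# COUNT(*) OVER (PARTITION BY language) becomes .over() in Polars
with_counts = repos.with_columns(
    pl.count().over("language").alias("repos_in_language")
)
print(with_counts)
Same concept, two syntaxes: the PARTITION BY in SQL becomes the .over() in Polars.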
Baby step
That said, before diving into really complex subjects, we should first learn the ABCs of our API.
So as a first project, I wanted to do some basic analytics on GitHub data that I found on Kaggle. If you are interested, here is the dataset.
Like I said, let's take Polars gently and see what we can do. Let's discover our dataset by reading the CSV and displaying the first 5 repository names.
import polars as pl

# Read the CSV into a DataFrame
df = pl.read_csv("path/to/github_dataset.csv")

# Select the first five repository names
first_five_repo = df.select(
    pl.col("repositories").head(5)
)
print(first_five_repo)
This is the equivalent of the following in SQL, if we suppose that our CSV is a table named git_repos.
SELECT repositories FROM git_repos LIMIT 5
And it gives us this result:
shape: (5, 1)
┌───────────────────────────┐
│ repositories │
│ --- │
│ str │
╞═══════════════════════════╡
│ octocat/Hello-World │
│ EddieHubCommunity/support │
│ ethereum/aleth │
│ localstack/localstack │
│ education/classroom │
└───────────────────────────┘
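As a side note, Polars also has a lazy API, which is conceptually closer to how a SQL engine plans a query before running it. A minimal sketch, assuming the same file path as above:
import polars as pl

# scan_csv builds a query plan instead of loading the whole file right away,
# a bit like a SQL planner handling SELECT repositories FROM git_repos LIMIT 5
first_five_repo = (
    pl.scan_csv("path/to/github_dataset.csv")
    .select(pl.col("repositories"))
    .head(5)
    .collect()  # execute the plan
)
print(first_five_repo)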
It is a good beginning, so let's dive a bit deeper into Polars and do a bit more analytics and reporting.
A little more advanced stuff
Let's see what the top 5 languages in our CSV file are; I think you are going to be really surprised:
# Count repositories per language, then keep the 5 biggest groups
top_5 = df.groupby("language").agg(
    [pl.count().alias("count")]
).sort("count", descending=True).head(5)
print(top_5)
The equivalent in PostgreSQL would be something like this:
SELECT language, COUNT(*) AS count FROM git_repos GROUP BY language ORDER BY count DESC LIMIT 5;
And eventually we end up with this result:
shape: (5, 2)
┌────────────┬───────┐
│ language ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪═══════╡
│ JavaScript ┆ 253 │
│ Python ┆ 155 │
│ NULL ┆ 145 │
│ HTML ┆ 72 │
│ Java ┆ 44 │
└────────────┴───────┘
Yes, it seems that NULL is the 3rd favorite language of people around the world 😅.
Let's just filter out the null values to get a better idea of the language distribution:
# Keep only the rows where a language is set
filtered_language = df.filter(pl.col("language") != "NULL")

top_5 = filtered_language.groupby("language").agg(
    [pl.count().alias("count")]
).sort("count", descending=True).head(5)
print(top_5)
This is, I would say, more or less equivalent to one of the following queries:
SELECT language, COUNT(*) AS count FROM git_repos WHERE language IS NOT NULL GROUP BY language ORDER BY count DESC LIMIT 5;

-- Or, in the same spirit but with slightly more advanced SQL, using a CTE:
WITH filtered_language AS (SELECT * FROM git_repos WHERE language IS NOT NULL)
SELECT language, COUNT(*) AS count FROM filtered_language
GROUP BY language ORDER BY count DESC LIMIT 5;
What is cool about using Python and Polars for all of this is that you get some SQL concepts, or something close to them, for free. For example, here we have declared a filtered_language value which contains all the columns of our DataFrame but without any NULL language.
It is close to Common Table Expressions in SQL, as seen in the second way of writing this query.
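To make the parallel explicit, here is a small sketch that names each intermediate DataFrame the way you would name chained CTEs. The step names are mine, and I use is_not_null() here since it maps directly to WHERE language IS NOT NULL.
# Each named DataFrame plays the role of a CTE
filtered_language = df.filter(pl.col("language").is_not_null())

language_counts = filtered_language.groupby("language").agg(
    [pl.count().alias("count")]
)

top_5 = language_counts.sort("count", descending=True).head(5)
print(top_5)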
Conclusion
I have to say that I am pleased with the flexibility of Polars, which is of course close to Pandas. It adds a layer of abstraction above our lovable relational algebra. I am not throwing Ruby away, but it is true that when you want to do analytics on data, it is less natural to use Ruby than Python.
Nevertheless, Python and Polars will never replace SQL, since they are not meant for the same usage, and keeping my SQL skills growing is still a goal for me along with Ruby. I will continue my quest in SQL and data, trying to improve my skills and share all of it with you.
Keep in Touch
On Twitter : @yet_anotherdev