Turn a Pandas DataFrame into an API
Eric P Green
Posted on June 10, 2021
Pandas DataFrames are my favorite way to manipulate data in Python. In fact, the end product of many of my small analytics projects is just a data frame containing my results.
I used to dump my dataframes to CSV files and save them to GitHub. But recently, I've been using Beneath, a data sharing service I'm building, to save my dataframes and simultaneously turn them into a full-blown API with a website. It's great when I need to hand off a dataset to clients or integrate the data into a frontend.
In this post, I'll show you how that works! I'm going to fetch GitHub commits, analyze them, and use Beneath to turn the result into an API.
Set up Beneath
To get started, you need to install the Beneath pip module and log in with a free Beneath account. It's pretty easy and the docs already cover it. Just follow these steps.
Make sure to remember your username as you'll need it in a minute!
Let's analyze some data
I think GitHub activity is a fascinating, underexplored data source. Let's scratch the surface and look at commits to... Pandas! Here's a quick script to fetch the pandas source code and aggregate some daily stats on the number of commits and contributors:
import io
import pandas as pd
import subprocess
# Get all Pandas commit timestamps
repo = "pandas-dev/pandas"
cmd = f"""
if [ -d "repo" ]; then rm -Rf "repo"; fi;
git clone https://github.com/{repo}.git repo;
cd repo;
echo "timestamp,contributor";
git log --pretty=format:"%ad,%ae" --date=iso
"""
res = subprocess.run(cmd, capture_output=True, shell=True).stdout.decode()
# Group by day and count number of commits and contributors
df = (
    pd.read_csv(
        io.StringIO(res),
        parse_dates=["timestamp"],
        date_parser=lambda col: pd.to_datetime(col, utc=True),
    )
    .resample(rule="d", on="timestamp")["contributor"]
    .agg(commits="count", contributors="nunique")
    .rename_axis("day")
    .reset_index()
)
Now, the df variable contains our insights. If you're following along, you can change the repo variable to scrape another GitHub project. Just beware that some major repos can take a long time to analyze (I'm looking at you, torvalds/linux).
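To see what the aggregation step does without cloning a repo, here is a minimal sketch that runs the same kind of daily resample on a handful of made-up commits. The sample rows and email addresses are invented for illustration, and the sketch uses a list-style agg followed by a rename rather than the named aggregation in the script above, so treat it as an approximation of that step rather than a copy of it.

```python
import io

import pandas as pd

# Made-up sample in the same "timestamp,contributor" shape as the git log output.
sample = """timestamp,contributor
2021-05-01 10:00:00 +0200,a@example.com
2021-05-01 12:30:00 +0000,b@example.com
2021-05-01 15:00:00 +0000,a@example.com
2021-05-01 23:30:00 -0500,a@example.com
2021-05-04 09:00:00 +0000,c@example.com
"""

raw = pd.read_csv(io.StringIO(sample))
# Convert the mixed UTC offsets to UTC so commits are bucketed into UTC days.
raw["timestamp"] = pd.to_datetime(raw["timestamp"], utc=True)

daily = (
    raw.resample("D", on="timestamp")["contributor"]
    .agg(["count", "nunique"])  # commits per day, distinct emails per day
    .rename(columns={"count": "commits", "nunique": "contributors"})
    .rename_axis("day")
    .reset_index()
)
print(daily)
```

Two things to notice in the result: the late-evening commit with the -0500 offset lands on May 2 once converted to UTC, and May 3 still appears as a row with zero commits, because resample fills in the empty days between the first and last commit.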
Save the DataFrame to Beneath
First, we'll create a new project to store our results. I'll do that from the command line, but you can also use the web console:
beneath project create USERNAME/github-fun
Just replace USERNAME with your own username.
Now, we're ready to publish the dataframe. We do it with a simple one-liner directly in Python (well, I split it over multiple lines, but it's still just one call):
import beneath
await beneath.write_full(
    table_path="USERNAME/github-fun/pandas-commits",
    records=df,
    key=["day"],
    description="Daily commits to https://github.com/pandas-dev/pandas",
)
There are a few things going on here. Let's go through them:
- The table_path gives the full path for the output table, including our username and project.
- We use the records parameter to pass our DataFrame.
- We provide a key for the data. The auto-generated API uses the key to index the data so we can quickly filter records. By default, Beneath will use our DataFrame's index as the key, but I prefer setting it manually.
- The description parameter adds some documentation to the dataset that will be shown at the top of the table's page.
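One practical note: a bare top-level await like the call above works in Jupyter or IPython, but a plain .py script needs an event loop to drive it. Here is a minimal sketch of one way to do that with asyncio. The publish function name is made up for illustration (it is not part of the Beneath client), and the beneath import sits inside the function only so the module can be loaded on a machine where the package isn't installed.

```python
import asyncio


async def publish(df, table_path):
    """Write df to Beneath with the same write_full call shown above.

    publish is a hypothetical helper name, not part of the Beneath client.
    """
    # Imported lazily so this module can be loaded without beneath installed.
    import beneath

    await beneath.write_full(
        table_path=table_path,
        records=df,
        key=["day"],
        description="Daily commits to https://github.com/pandas-dev/pandas",
    )


# In a script, run the coroutine with asyncio.run instead of a bare await:
# asyncio.run(publish(df, "USERNAME/github-fun/pandas-commits"))
```

In a notebook you can skip the wrapper entirely and keep the plain await.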
And that's it! Now let's explore the results.
Explore your data
You can now head over to the web console and browse the data and its API docs. Mine's at https://beneath.dev/epg/github-fun/table:pandas-commits (if you used the same project and table names, you can just replace my username epg with your own).
You can also share or publish the data. Permissions are managed at the project layer, so just head over to the project page and add members or flip the project settings to public.
Use the API
Now that the data is in Beneath, anyone with access can use the API. On the "API" tab of the table page, we get auto-generated code snippets for integrating the dataset.
For example, we can load the dataframe back into Python:
import beneath
df = await beneath.load_full("USERNAME/github-fun/pandas-commits")
Or we can query the REST API and get the commit info for every day in May 2021:
curl https://data.beneath.dev/v1/USERNAME/github-fun/pandas-commits \
  -d type=index \
  -d filter='{"day":{"_gte":"2021-05-01","_lt":"2021-06-01"}}' \
  -G
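The same request can also be put together in Python using only the standard library. This is a rough sketch under a few assumptions: the helper names build_index_query_url and fetch_records are made up for illustration, the table is assumed to be public (authentication isn't covered here), and the response is assumed to be JSON. The URL and the type and filter parameters are taken from the curl command.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen

BASE_URL = "https://data.beneath.dev/v1"


def build_index_query_url(table_path, start_day, end_day):
    """Build the query URL for records with start_day <= day < end_day."""
    day_filter = {"day": {"_gte": start_day, "_lt": end_day}}
    params = {
        "type": "index",
        # Compact JSON, URL-encoded by urlencode below.
        "filter": json.dumps(day_filter, separators=(",", ":")),
    }
    return f"{BASE_URL}/{table_path}?{urlencode(params)}"


def fetch_records(url):
    """GET the URL and parse the body as JSON (assumes a public table)."""
    with urlopen(url) as response:
        return json.load(response)


url = build_index_query_url(
    "USERNAME/github-fun/pandas-commits", "2021-05-01", "2021-06-01"
)
print(url)
# records = fetch_records(url)  # uncomment once USERNAME points at a real table
```

Building the filter with json.dumps instead of a hand-written string makes it easier to change the date range without worrying about quoting.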
Or use the React hook to read data directly into the frontend:
import { useRecords } from "beneath-react";

const App = () => {
  const { records, loading, error } = useRecords({
    table: "USERNAME/github-fun/pandas-commits",
    query: {
      type: "index",
      filter: '{"day":{"_gte":"2021-05-01","_lt":"2021-06-01"}}'
    }
  })
  ...
}
Check out the API tab of my dataframe in the Beneath console to see all the ways to use the data.
That's it
That's it! We used Beneath to turn a Pandas DataFrame into an API. If you have any questions, I'm online most of the time in Beneath's Discord (I love to chat about data science, so you're also welcome to just say hi 👋). And let me know if you publish a cool dataset that I can spotlight in the featured projects!