When building applications within the confines of a single-threaded, synchronous language, the limitations become very obvious very quickly. The first thing that comes to mind is writes: the very definition of an I/O-bound task. When writing data to files (or databases), each "write" action intentionally occupies a thread until the write is complete. This makes a lot of sense for ensuring data integrity in most systems. For example, if two operations simultaneously attempt to update a database record, which one is correct? Alternatively, if a script requires an HTTP request to succeed before continuing, how could we move on until we know the request succeeded?
HTTP requests are among the most common thread-blocking operations. When we write scripts that expect data from an external third party, we introduce a myriad of uncertainties that can only be resolved by the request itself, such as response latency, the nature of the data we expect to receive, or whether the request will succeed at all. Even when working with APIs we're confident in, no operation is sure to succeed until it's complete. Hence, we're "blocked."
As applications grow in complexity to support more simultaneous user interactions, software is moving away from the paradigm of being executed linearly. So while we might not be sure that a specific request succeeds or a database write is completed, this can be acceptable as long as we have ways to handle and mitigate these issues gracefully.
A Problem Worthy of Asynchronous Execution
How long do you suppose it would take a Python script to execute a few hundred HTTP requests, parse each response, and write the output to a single file? If you were to use requests in a simple for loop, you'd need to wait a fair amount of time for Python to execute each request, open a file, write to it, close it, and move on to the next.
Let's put asyncio's ability to improve script efficiency to an actual test. We'll execute two I/O-blocking actions per task for a few hundred URLs: executing and parsing an HTTP request and writing the desired result to a single file. The input for our experiment will be a ton of URLs, with the expected output to be metadata parsed from those URLs. Let's see how long it takes to do this for hundreds of URLs.
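Before touching real URLs, here's a rough sketch of why the comparison should favor asyncio. It doesn't make any HTTP requests; the latency is simulated with sleep calls, and the URLs, the 50 ms delay, and the function names are all placeholders invented for illustration:

```python
import asyncio
import time

SIMULATED_LATENCY = 0.05  # seconds per fake "request" (assumed value)
URLS = [f"https://example.com/post-{i}" for i in range(20)]  # placeholder URLs


def fetch_blocking(url: str) -> str:
    """Stand-in for a blocking HTTP call: the thread sleeps until it is 'done'."""
    time.sleep(SIMULATED_LATENCY)
    return f"response for {url}"


async def fetch_async(url: str) -> str:
    """Stand-in for a non-blocking call: control returns to the event loop."""
    await asyncio.sleep(SIMULATED_LATENCY)
    return f"response for {url}"


def run_sequential() -> float:
    """Fetch every URL one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for url in URLS:
        fetch_blocking(url)
    return time.perf_counter() - start


async def fetch_all() -> list[str]:
    """Start every fake fetch at once and wait for all of them."""
    return await asyncio.gather(*(fetch_async(url) for url in URLS))


def run_concurrent() -> float:
    """Fetch every URL concurrently and return the elapsed seconds."""
    start = time.perf_counter()
    asyncio.run(fetch_all())
    return time.perf_counter() - start


sequential_seconds = run_sequential()
concurrent_seconds = run_concurrent()
print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

The sequential version pays the latency once per URL, while the concurrent version pays it roughly once in total, since every fake request waits at the same time.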
This site has roughly two hundred published posts of its own, which makes it a great guinea pig for this little experiment. I've created a CSV that contains the URLs to these posts, which will be our input. Here's a sneak peek below:
For each URL found in our input CSV, our script will fetch the URL, parse the page, and write some choice data to a single CSV. The result will resemble the below example:
| title | description | primary_tag | url | published_at |
| --- | --- | --- | --- | --- |
| Intro to Asynchronous Python with Asyncio | Execute multiple tasks concurrently in Python with Asyncio: Python's built-in async library. | | | |
We're going to need three key Python libraries to pull this off:
Asyncio: Python's bread-and-butter library for running asynchronous IO-bound tasks. The library has somewhat built itself into the Python core language, introducing async/await keywords that denote when a function is run asynchronously and when to wait on such a function (respectively).
Aiohttp: On the client side, aiohttp fills the same role as Python's requests library, except that its requests are asynchronous. Alternatively, aiohttp can be used inversely: as an application web server that handles incoming requests and serves responses, but that's a tale for another time.
Aiofiles: Makes disk I/O (such as creating files and writing bytes to them) a non-blocking task by delegating file operations to a thread pool, so that multiple writes can be in flight without blocking one another, even when multiple tasks are bound to the same file.
BONUS: Dependencies to Optimize Speed
aiohttp can execute requests even faster by simply installing a few supplemental libraries. These libraries are cchardet (character encoding detection), aiodns (asynchronous DNS resolution), and brotlipy (lossless compression). I'd highly recommend installing these using the conveniently provided shortcut below (take it from me, I'm a stranger on the internet):
Preparing an Asynchronous Script/Application
We're going to structure this script like any other Python script. Our main module, aiohttp_aiofiles_tutorial, will handle all of our logic. config.py and main.py both live outside the main module and offer our script some basic configuration and an entry point, respectively:
/export is simply an empty directory where we'll write our output file.
The /data submodule contains the input CSV mentioned above, and some basic logic to parse it. Not much to write home about, but if you're curious, the source is available in the GitHub repo.
Kicking Things Off
With sleeves rolled high, we start with the obligatory script "entry point," main.py. This initiates the core function in /aiohttp_aiofiles_tutorial called init_script():
It looks like we're running a single function/coroutine, init_script(), via asyncio.run(), which seems counter-intuitive at first glance. Isn't the point of asyncio to run multiple coroutines concurrently, you ask?
Indeed it is! init_script() is a coroutine that calls other coroutines. Some of these coroutines create tasks out of other coroutines, others execute them, etc. asyncio.run() creates an event loop that keeps running until the target coroutine is done, which includes every coroutine that the parent coroutine awaits along the way. So, if we keep things clean, asyncio.run() is a one-time call to initialize a script.
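To make the "one coroutine that calls other coroutines" idea concrete, here's a minimal sketch of an entry point. The two child coroutines are hypothetical stand-ins invented for this example (the real init_script() lives in the tutorial's package and does actual work):

```python
import asyncio


async def write_header() -> str:
    """Hypothetical child coroutine; stands in for real setup work."""
    await asyncio.sleep(0)  # yield to the event loop once
    return "header written"


async def fetch_everything() -> str:
    """Hypothetical child coroutine; stands in for the fetching stage."""
    await asyncio.sleep(0)
    return "pages fetched"


async def init_script() -> list[str]:
    """Top-level coroutine: awaits the other coroutines in order."""
    return [await write_header(), await fetch_everything()]


if __name__ == "__main__":
    # One call creates the event loop, runs init_script() to completion,
    # and closes the loop afterwards.
    print(asyncio.run(init_script()))
```

However many coroutines end up nested inside init_script(), the entry point still only needs that single asyncio.run() call.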
Initializing Our Script
Here's where the fun begins. We've established that the purpose of our script is to output a single CSV file, and that's where we'll start: by creating and opening an output file within the context of which our entire script will operate:
Our script begins by opening a file context with aiofiles. As long as our script operates inside the context of an open async file via async with aiofiles.open() as outfile:, we can write to this file constantly without worrying about opening and closing the file.
Compare this to the synchronous (default) implementation of handling file I/O in Python, with open() as outfile:. By using aiofiles, we can write data to the same file from multiple sources at virtually the same time.
EXPORT_FILEPATH happens to target a CSV ( /export/hackers_pages_metadata.csv ). Every CSV needs a row of headers; hence our one-off usage of await outfile.write() to write headers immediately after opening our CSV:
Moving Along
Below is the fully fleshed-out version of __init__.py that will ultimately put our script into action. The most notable addition is the introduction of the execute_fetcher_tasks() coroutine; we'll dissect this one piece at a time:
execute_fetcher_tasks() is broken out mainly to organize our code. This coroutine accepts outfile as a parameter, which will serve as the destination for data we end up parsing. Taking this line-by-line:
async with ClientSession(headers=HTTP_HEADERS) as session: aiohttp lets us open a client-side session backed by a connection pool, which by default allows up to 100 active connections at a single time. Because we're going to make under 200 requests, all of them fit into roughly two rounds of connections, so the time it takes to fetch all these URLs will be comparable to the time it takes Python to fetch two URLs back to back under normal circumstances.
create_tasks(): We're about to define this function, which accepts three parameters. The first is the async ClientSession we just opened a line earlier. Next, we have the urls_to_fetch variable (imported earlier in our script). This is a simple Python list of strings, where each string is a URL parsed from our earlier "input" CSV. That logic is handled elsewhere via a simple function (and is not important for the purpose of this tutorial). Lastly, our outfile is passed along, as we'll be writing to this file later. With these parameters, create_tasks() will create a task for each of our 174 URLs, each of which will fetch the contents of the given URL and write the parsed result to our outfile. Tasks are scheduled on the event loop as soon as they're created, but none of them gets a chance to run until our coroutine yields control at an await, which happens via...
asyncio.gather(*task_list): Asyncio's gather() function runs a collection of awaitables concurrently inside the currently running event loop and waits until all of them have finished. Once this kicks off, the speed benefits of asynchronous I/O will become immediately apparent.
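The combination of "a cap of 100 at a time" and "gather everything" can be sketched without any network at all. To be clear, this is an analogy rather than how aiohttp works internally: aiohttp enforces its limit inside its own connector, whereas the sketch below uses an asyncio.Semaphore, and fake_fetch() is a made-up stand-in for a real request:

```python
import asyncio

MAX_CONNECTIONS = 100  # mirrors the pool size described above
URL_COUNT = 174        # number of URLs in the tutorial's input


async def fake_fetch(index: int, limiter: asyncio.Semaphore, stats: dict) -> int:
    """Hypothetical fetch: waits for a free slot, then 'downloads' for 10 ms."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
    return index


async def fetch_all() -> tuple[list[int], int]:
    """Run every fake fetch concurrently, never exceeding MAX_CONNECTIONS."""
    limiter = asyncio.Semaphore(MAX_CONNECTIONS)
    stats = {"active": 0, "peak": 0}
    # gather() waits for every awaitable and returns results in input order.
    results = await asyncio.gather(
        *(fake_fetch(i, limiter, stats) for i in range(URL_COUNT))
    )
    return results, stats["peak"]


results, peak = asyncio.run(fetch_all())
print(f"{len(results)} fetches finished, at most {peak} in flight at once")
```

Note that gather() hands results back in the order the awaitables were passed in, regardless of which one finished first.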
Creating Asyncio Tasks
If you recall, a Python Task wraps a coroutine that the event loop will execute in the future. In addition, each task can be temporarily put on hold so that other tasks can run. To create a task, we pass along a coroutine that has already been called with the proper parameters.
I broke out create_tasks() to return a list of Python Tasks, where each "task" will fetch one of our URLs:
A few notable things about Asyncio Tasks:
We're defining the work to be done upfront. Creating a Task doesn't run its code on the spot; it schedules the coroutine on the event loop, which starts it once the current coroutine yields control. Our script will essentially run the same function 174 times concurrently, with different parameters. It makes sense that we'd want to define these tasks upfront.
Defining tasks is quick and straightforward. In an instant, each URL from our CSV will have a corresponding Task created and added to task_list.
With our tasks prepared, there's only one thing left to do to kick them all off and get the party started. That's where the asyncio.gather(*task_list) line from __init__.py comes into play.
Asyncio's Task object is a class in itself with its attributes and methods, essentially providing a wrapper with ways to check task status, cancel tasks, and so forth.
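Here's a small sketch of those Task behaviors in action. It is simplified on purpose: this create_tasks() takes only a list of URLs (the tutorial's version accepts a session and an outfile as well), and fetch_stub() is a hypothetical placeholder for the real per-URL coroutine:

```python
import asyncio


async def fetch_stub(url: str) -> str:
    """Hypothetical stand-in for the per-URL coroutine."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def create_tasks(urls: list[str]) -> list[asyncio.Task]:
    """Wrap one coroutine per URL in a Task and return the list."""
    return [asyncio.create_task(fetch_stub(url)) for url in urls]


async def main() -> dict:
    urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs
    task_list = await create_tasks(urls)

    # The tasks are scheduled at this point, but none has finished.
    done_before = [task.done() for task in task_list]

    results = await asyncio.gather(*task_list)
    done_after = [task.done() for task in task_list]

    # A Task can also be cancelled before it completes.
    spare = asyncio.create_task(fetch_stub("https://example.com/unused"))
    spare.cancel()
    await asyncio.gather(spare, return_exceptions=True)

    return {
        "done_before": done_before,
        "results": results,
        "done_after": done_after,
        "spare_cancelled": spare.cancelled(),
    }


report = asyncio.run(main())
print(report)
```

The done() and cancelled() checks are the kind of status helpers the Task wrapper provides on top of a plain coroutine.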
Executing our Tasks
Back in create_tasks(), we created tasks that each execute a coroutine called fetch_url_and_save_data(). This coroutine does three things:
Make an async request to the given task's URL via aiohttp's session context (handled by async with session.get(url) as resp:)
Read the body of the response as a string.
Parse the response body by passing html to our last function, parse_html_page_metadata(), and write the resulting metadata to our file:
When fetching a URL via an aiohttp ClientSession, calling the .text() method on the response (await resp.text()) will return the body of the response as a string. This is not to be confused with .read(), which returns a bytes object (useful for pulling media files or anything besides a string).
If you're keeping track, we're now three "contexts" deep:
We started our script by opening an aiofiles.open() context, which will remain open until our script is complete. This allows us to write to our outfile from any task for the duration of our script.
After writing headers to our CSV file, we opened a persistent client request session with async with ClientSession() as session, which allows us to make requests en masse as long as the session is open.
In the snippet above, we've entered a third and final context: the response context for a single URL (via async with session.get(url) as resp). Unlike the other two contexts, we'll be entering and leaving this context 174 times (once per URL).
Inside each URL response context is where we finally start producing some output. This leaves us with our final bit of logic (await parse_html_page_metadata(html, url)), which parses each URL response and returns some scraped metadata from the page. We then write said metadata to our outfile on the next line, await outfile.write(f"{page_metadata}\n").
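If the nesting is hard to picture, this sketch counts how many times each of the three contexts gets entered. None of it touches aiofiles or aiohttp; open_context() is a hypothetical stand-in built with contextlib, and the URLs are placeholders:

```python
import asyncio
from collections import Counter
from contextlib import asynccontextmanager

entered = Counter()  # how many times each context was entered


@asynccontextmanager
async def open_context(name: str):
    """Hypothetical stand-in for aiofiles.open(), ClientSession(), or session.get()."""
    entered[name] += 1
    try:
        yield name
    finally:
        await asyncio.sleep(0)  # pretend to release a resource


async def fetch_one(url: str) -> None:
    # Innermost context: entered and exited once per URL.
    async with open_context("response"):
        await asyncio.sleep(0.01)


async def init_script(urls: list[str]) -> None:
    async with open_context("outfile"):      # stays open for the whole run
        async with open_context("session"):  # stays open for all requests
            await asyncio.gather(*(fetch_one(url) for url in urls))


asyncio.run(init_script([f"https://example.com/{i}" for i in range(174)]))
print(dict(entered))
```

The two outer contexts are entered exactly once, while the innermost one is entered once per URL.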
Write Parsed Metadata to CSV
How are we planning to rip metadata out of HTML pages, you ask? With BeautifulSoup, of course! With the HTML of an HTTP response in hand, we use bs4 to parse each URL response and return values for each of the columns in our outfile: title, description, primary_tag, published_at, and url.
These five values are returned as a comma-separated string, then written to our outfile CSV as a single row.
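One thing worth watching when building rows by hand: a title containing a comma will shift every column after it. The sketch below shows one way to guard against that with Python's csv module. It is not the tutorial's implementation; it swaps BeautifulSoup for the standard library's html.parser, only extracts three of the five columns, and assumes the title and description live in the `<title>` tag and the description `<meta>` tag:

```python
import csv
import io
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and the description <meta> tag (assumed locations)."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") == "description":
            self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page_to_csv_row(html: str, url: str) -> str:
    """Return 'title,description,url' as one CSV row, quoting where needed."""
    parser = MetadataParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(
        [parser.title.strip(), parser.description.strip(), url]
    )
    return buffer.getvalue().rstrip("\n")


sample_html = (
    "<html><head><title>Async, Explained</title>"
    '<meta name="description" content="Tasks and event loops"></head></html>'
)
row = parse_page_to_csv_row(sample_html, "https://example.com/async")
print(row)
```

Because the title contains a comma, csv.writer wraps it in quotes so the row still splits into exactly three columns when read back.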
Run the Jewels, Run the Script
Let's take this bad boy for a spin. I threw a timer into __init__.py to log the number of seconds that elapse for the duration of the script:
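A timer like this can be as simple as two clock readings. Here's a minimal sketch, assuming we wrap the asyncio.run() call itself; init_script() below is a hypothetical stand-in that just sleeps, and the log message format is made up:

```python
import asyncio
import time


async def init_script() -> None:
    """Hypothetical stand-in for the real coroutine; just waits briefly."""
    await asyncio.sleep(0.05)


def timed_run() -> float:
    """Run init_script() and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    asyncio.run(init_script())
    return time.perf_counter() - start


elapsed = timed_run()
print(f"Executed in {elapsed:0.2f} seconds.")
```

time.perf_counter() is a good fit here since it's meant for measuring short durations.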
Mash that mfing make run command if you're following along in the repo (or just punch in python3 main.py). Strap yourself in:
The higher end of our script's execution time is 3 seconds. A typical Python request takes 1-2 seconds to complete, so fetching all 174 URLs one at a time would take somewhere around 3 to 6 minutes. That puts our speedup at roughly 60 to 100 times faster for a sample of this size.
Writing async scripts in Python surely takes more effort, but not hundreds or thousands of times more effort. Even if it isn't speed you're after, handling the volume of larger-scale applications renders Asyncio absolutely critical. For example, if your chatbot or webserver is in the middle of handling a user's request, what happens when a second user attempts to interact with your app in the meantime? Oftentimes the answer is nothing: User 1 gets what they want, and User 2 is stuck talking to a blocked thread.
Anyway, seeing is believing. Here's the source code for this tutorial:
Get up and running by cloning this repository and running make deploy:
$ git clone https://github.com/hackersandslackers/aiohttp-aiofiles-tutorial.git
$ cd aiohttp-aiofiles-tutorial
$ make deploy