How To Make Functional Programming in Python Go Fast

I've previously written about the Python library pfun that allows you to write Python code in a purely functional style by providing a functional effect system. Simply put, the pfun effect system works by representing side-effects as a recursive data-structure that can then be interpreted at the edge of your program to produce real effects, very much inspired by how effects are managed in Haskell or the purely functional ZIO Scala library.

In this post, we'll study how such an effect system can be implemented in Python, how it impacts performance of Python programs, and why the pfun effect system has particularly good performance. We wont spend any time on why you would want to manage side-effects with an effect system. If you want to read more about that, check out one of my previous posts.

To illustrate how the pfun effect system works, let's define our own toy effect system that models interaction with the console. Each interaction with the console will be represented by a type that can be interpreted to produce an effect. To that end, lets define a super-type that all console effects will inherit from:

class Console:
    def interpret(self):
        raise NotImplementedError()

The first effect that comes to mind when thinking about console interaction is printing. Let's define a type that represents console printing:

from dataclasses import dataclass


@dataclass
class PrintLine(Console):
    line: str
    rest: Console

    def interpret(self):
        print(self.line)
        return self.rest.interpret()

Notice that PrintLine has a field rest that allows you to compose it with other console effects. When a PrintLine is interpreted, it prints something to the console and continues interpretation of the composed console effect.

Next, let's handle reading from the console:

from typing import Callable


@dataclass
class ReadLine(Console):
    rest: Callable[[str], Console]

    def interpret(self):
        return self.rest(input()).interpret()

Similarly to PrintLine, ReadLine has a callable field rest that allows you to continue the effect interpretation with the result of reading from the console.

Finally, we need a way to terminate a console program. Lets define a leaf type that just wraps a value, but doesn't continue interpretation recursively:

from typing import Any


@dataclass
class Return(Console):
    val: Any

    def interpret(self):
        return self.val

We can use our model of console interaction to build data-structures that express sequences of operations that interacts with the console:

program = PrintLine(
    'what is your name? ',
    ReadLine(
        lambda name: PrintLine(f'Hello {name}!',
                               Return(None))
    )
)

program on its own is simply a Python value. During its definition, no side-effects are actually carried out. In fact, nothing really happens until program.interpret() is invoked, at which point all the effects represented by program are carried out sequentially.

Using an effect system of this sort to express side-effects obviously comes with the overhead of interpreting the effect data-structure in Python, which may be significant depending on the depth of the effect data-structure.

But what's worse is that interpreting PrintLine and ReadLine instances requires recursively interpreting its sub effects. Since this recursive call is in tail position in both cases, it could in theory be tail-call-optimized away, but alas Python famously doesn't perform tail-call-optimization. This means that a sufficiently deep data-structure will eventually overflow the Python stack.

Concretely, the following program:

p = PrintLine('', Return(None))
for _ in range(10000):
    p = PrintLine('', p)
p.interpret()

Causes the following error:

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)

----> 1 p.interpret()

... in interpret(self)
      5     def interpret(self):
      6         print(self.line)
----> 7         return self.rest.interpret()
      8
      9

... last 1 frames repeated, from the frame below ...

... in interpret(self)
      5     def interpret(self):
      6         print(self.line)
----> 7         return self.rest.interpret()
      8
      9

RecursionError: maximum recursion depth exceeded while calling a Python object

The solution to this problem is to use another data-structure called a trampoline that can interpret the recursive function calls in constant stack space:

@dataclass
class Trampoline:
    def interpret(self):
        raise NotImplementedError()


@dataclass
class Done(Trampoline):
    val: Any

    def interpret(self):
        return self.val


@dataclass
class Call(Trampoline):
    thunk: Callable[[], Trampoline]

    def interpret(self):
        trampoline = self
        while not isinstance(trampoline, Done):
            trampoline = trampoline.thunk()
        return trampoline.val

With this in hand we can re-write our Console effect system as follows:

@dataclass
class Return(Console):
    val: Any

    def interpret(self):
        return Done(self.val)


@dataclass
class PrintLine(Console):
    line: str
    rest: Console

    def interpret(self):
        print(self.line)
        return Call(lambda: self.rest.interpret())


@dataclass
class ReadLine(Console):
    rest: Callable[[str], Console]

    def interpret(self):
        return Call(lambda: self.rest(input()).interpret())

Notice how all the implementations of interpret returns immediately, instead of directly performing a recursive call. This allows us to interpret Console values of arbitrary depth without running into RecursionError.

Unfortunately, this solution makes the performance of our Console effect system even worse, since now we've also added the overhead of interpreting the trampoline data-structure.

The effect system in pfun is conceptually similar to the Console effect system we've built here (although it can do much, much more than interacting with the console). To compose effects in the pfun effect system, we use a function called and_then which takes as an argument a callback function from the result of an effect to a new effect. for example:

from pfun import files


read_foo = files.read('foo.txt')
read_foo.and_then(lambda content: files.write('bar.txt', content))

read_foo.and_then(...) produces a new effect that is similar in concept to PrintLine and ReadLine in that, when interpreted, it will first interpret read_foo, pass the result to the callback function, and then interpret the effect returned by that callback.

Interpreting deep effect data-structures produced by long sequences of calls to and_then would lead to RecursionError just as with our initial version of the Console effect system, were it not for the fact that the pfun effect system also uses a trampolining mechanism in its own interpreter. Unfortunately, running this trampolined interpreter in Python is pretty slow because it involves looping over Python values and calling Python functions, both of which are not terribly fast.

Luckily, Python integrates nicely with C code through a feature called extension types. Running an interpreter for an effect system in C is much, much faster than running it in Python.

In pfun 0.12.0 the interpreter for the effect system was completely re-written as a Python C extension. Lets do some benchmarking to see how big a difference this actually makes in terms of performance. We'll use the performance of the the Scala library ZIO as a baseline, as the pfun effect system draws most of its inspiration from there. ZIO has a fairly extensive benchmarking suite. The most obvious benchmark for testing raw interpreter speed, without any parallelism, is called deepLeftBind (bind is the canonical name for the and_then operation, also called flatMap in Scala):

import zio._

def deepLeftBind(): UIO[Int] = {
    var i  = 0
    var io = IO.effectTotal(i)
    while (i < 10000) {
        io = io.flatMap(i => IO.effectTotal(i))
        i += 1
    }
    io
}

unsafeRun(deepLeftBind())

The equivalent in the pfun effect system looks like this:

from pfun.effect import success


def deep_left_bind():
    i = 0
    e = success(i)
    while i < 10000:
        e = e.and_then(success)
        i += 1
    return e

deep_left_bind().run(None)

Each benchmark is executed 100 times in a loop in a single Python interpreter and Scala REPL respectively, measuring execution time in milliseconds each time. For the ZIO version, 5 warmup executions are also performed before the actual benchmarking loop to prevent time used for JVM class loading etc. to influence to the outcome.

Below I've plotted the mean execution time in milliseconds over the 100 executions for three benchmarks:

pfun 0.11.5 which uses a pure Python interpreter for its effect system
pfun 0.12.1 which uses a C interpreter for its effect system
ZIO 1.0.8

Clearly, pfun 0.12.1 is much, much faster than version 0.11.5. Both ZIO and pfun 0.12.1 are so much faster than pfun 0.11.5 that it's hard to see how pfun 0.12.1 and ZIO compare to each-other. Let's zoom in on just those two:

It's not surprising that there's still some room for improvement in the performance of pfun compared to ZIO, as a lot of work has gone into optimizing the ZIO interpreter. Maybe pfun will never catch up to the performance of ZIO as some of the overhead of the pfun implementation is fundamental to having a Python interface: In order to work with pfun from the Python side, pfun must accept Python callback functions. Jumping back and forth between Python and C during effect interpretation introduces overhead that will be difficult to get rid of, while still keeping the pfun api ergonomical. That being said, the current C implementation of the pfun effect system is only the first version, so there is almost without a doubt room for further optimization of the interpreter itself.

To learn more about pfun, check out the documentation or head over to the github repository.

Blog

How To Make Functional Programming in Python Go Fast

Sune Debel

Join Our Newsletter. No Spam, Only the good stuff.

Related