Recursion vs Loop: a low-level analysis

Introduction

Sometimes when writing code we have to decide between using loop or recursion. How do both work under the hood? In terms of performance which one is the best to pick and why?

PS.: CPU architecture -> x86_64 | RAM -> DDR4 16G

1. Study Case 📝

Let's implement a function that receives a number and adding it one by one, arrives at the value received.

1.1. Recursion

int sum(int n) {
    if (n == 0) {
        return 0;
    }

    return 1 + sum(n - 1);
}

1.2. Loop

int sum(int n) {
    int result = 0;

    for (int i = 1; i <= n; i++) {
        result += 1;
    }

    return result;
}

1.3. Time each function

#include <stdio.h>
#include <time.h>

// sum functions here

int main() {
    int n = 70000;

    clock_t start = clock();

    sum(n);

    clock_t end = clock();

    double cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("sum(%d) Time elapsed = %.7f\n", n, cpu_time_used);

    return 0;
}

2. Compiling the code

Let's compile but not assemble it. We can do it by using the gcc compiler with the flag -S. Since we are interested in the recursive and loop functions, we will only examine these assembly instructions:

2.1. Recursion

sum:
.LFB0:
    .cfi_startproc
    endbr64
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movl    %edi, -4(%rbp)
    cmpl    $0, -4(%rbp)
    jne .L2
    movl    $0, %eax
    jmp .L3
.L2:
    movl    -4(%rbp), %eax
    subl    $1, %eax
    movl    %eax, %edi
    call    sum
    addl    $1, %eax
.L3:
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   sum, .-sum
    .section    .rodata

2.2. Loop

sum:
.LFB0:
    .cfi_startproc
    endbr64
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    %edi, -20(%rbp)
    movl    $0, -8(%rbp)
    movl    $1, -4(%rbp)
    jmp .L2
.L3:
    addl    $1, -8(%rbp)
    addl    $1, -4(%rbp)
.L2:
    movl    -4(%rbp), %eax
    cmpl    -20(%rbp), %eax
    jle .L3
    movl    -8(%rbp), %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   sum, .-sum
    .section    .rodata

3. Executing both programs

Execution of sum(70000) using both approaches 5 times each. Elapsed time in seconds:

Loop	Recursion
0.0001380	0.0012100
0.0001230	0.0011630
0.0001370	0.0011390
0.0001710	0.0012150
0.0001380	0.0011810

Why is the loop approach 10 times faster?

4. Cause of the performance penalty 📉

At the assembly code, we have to pay attention to some instructions in order to understand the overhead of the recursion approach. These are:

pushq %rbp
movq %rsp, %rbp
leave or popq %rbp

In short, pushq %rbp saves the base pointer of the previous function. movq %rsp, %rbp initializes the base pointer to the current function, subq $16, %rsp alocates 16 bytes on the top of the stackframe - by subtracting 16 bytes from the initial %rsp, since it is initialy equal to the %rbp: the stack grows from high memory addresses to the lower ones. leave sets %rsp to %rbp, then pops the top of the stack, thus restoring the function that called it.

This process is repeated in each function call. It significantly increases the interaction with memory - L caches cannot help that much in this case - the performance penalty comes precisely from that.

5. Comparing the instructions in both cases

5.1. Loop

movl    %edi, -20(%rbp)
movl    $0, -8(%rbp)
movl    $1, -4(%rbp)

These three instructions load into memory the initial values of n, result and i. %edi is the register used to pass the argument to the function, so the value of n is saved into the address -20(%rbp). i and result are initially set to 1 and 0, that is why $1 and $0 are loaded into memory addresses -4(%rbp) and -8(%rbp), respectively.

The loop starts at jmp .L2. movl -4(%rbp), %eax loads data into %eax register so it can be used to compare in cmpl -20(%rbp), %eax. jle .L3 jumps to .L3 if %eax is less or equal to -20(%rbp), namely i <= n;.

In L3., addl $1, -8(%rbp) will increase -8(%rbp) by one - result -, then do the same to i: addl $1, -4(%rbp), i.e., i++

This process is executed until the comparison is false, then popq %rbp and ret are executed, respectively poping the top of the stack and returning sum.

5.2. Recursion

Recursion instructions:

movl %edi, -4(%rbp) takes the argument - sent via %edi register - and saves it 4 bytes bellow the value of the pointer saved in %rbp. cmpl $0, -4(%rbp) takes the value and compares it to 0, jne .L2 sets that, if the value in -4(%rbp) is not equal to 0, the code jumps to .L2 block.

At .L2, movl -4(%rbp), %eax will load the value of -4(%rbp) to %eax, then subl $1, %eax subtract it by 1 and save it in the register itself. In movl %eax, %edi, the value of register %eax is loaded into another register: %edi. As we saw, this register is responsible for passing arguments to functions. The argument is passed to call sum, so allocating more addresses onto the top of the stackframe and repeting the whole process of recursion. addl $1, %eax will increase by 1 and save the value in the register %eax. When recursion reach the base case, movl $0, %eax - so placing 0 as return of sum(n - 1) - will be executed, then jmp.L3 will jump to .L3 and execute leave and ret.

6. Segmentation Fault using recursion

Once it is clear how recursion works under the hood, it is much easier to see how a Segmentation Fault can take place. Each process allocates a finite space in memory, since recursions can grown the stackframe indefinitely, therefore Segmentation Fault can occur.

E.g., let's set int n = 1000000; and see how it performs:

linux@x86_64:~$ ./recursive_sum
Segmentation fault (core dumped)

Once using loop approach there is no indefinite function calls, there is no risk of the Stack Overflow that happens under recursion.

Blog