From Source to Binaries: The journey of a C++ program
abhinav the builder
Posted on October 15, 2020
If you couldn't already tell, I love Bjarne, and by extension, I love C++. In this article, I go over how a C++ compiler turns a program into binaries, and why I love C++. This is partly for my own benefit: writing things down helps me understand how C++ works.
There are only two kinds of languages: the ones people complain about and the ones nobody uses.
- Bjarne Stroustrup, The C++ Programming Language
I got pretty inspired by HaoranWang's CRUST and thus wanted to write my own compiler for C. I'll probably stick to Rust for that. Also, shoutout to ShivyC, that's where I got this idea from.
Let's get started with the compilation pipeline!
What you see above is the compilation flow taken from NerdyElectronics.com.
For the purpose of this article, we will use a simple addition problem with predefined values.
//a.cpp program
#include <iostream>
using namespace std;
int main()
{
    int firstNumber = 2, secondNumber = 4, sumOfTwoNumbers;
    // sum of two numbers is stored in variable sumOfTwoNumbers
    sumOfTwoNumbers = firstNumber + secondNumber;
    // Prints sum
    cout << firstNumber << " + " << secondNumber << " = " << sumOfTwoNumbers;
    return 0;
}
If you go back to the diagram, you can see we are currently at the preprocessing stage. Let's have a quick look at the translation unit. A translation unit is what the compiler proper actually receives: your source file after the preprocessor has pasted in the header files and expanded the macros.
You can dump your translation unit using the following command
g++ <filename>.cpp -E
The dump looks something like this
# 1 "a.cpp"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "a.cpp"
# 1 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 1 3
# 36 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 3
# 37 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 3
# 1 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 1 3
# 236 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 3
# 236 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 3
namespace std
{
typedef unsigned int size_t;
typedef int ptrdiff_t;
It's too damn long to post here (because it literally pastes in the whole iostream header and everything that header includes, which runs to thousands of lines), but run it on your system if you're curious.
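If the iostream dump is too noisy, a file with no includes at all shows the preprocessor's job more clearly. Here's a minimal sketch; the file name macro_demo.cpp, the SQUARE and OFFSET macros, and the squareOfSum function are all made up for illustration and are not part of a.cpp.

```cpp
// macro_demo.cpp (hypothetical file, not part of the original example)
// Run `g++ macro_demo.cpp -E` to see this file after preprocessing.

// A function-like macro: the preprocessor replaces it textually,
// before the compiler proper ever sees the code.
#define SQUARE(x) ((x) * (x))

// An object-like macro: a plain token substitution.
#define OFFSET 1

// constexpr is only here so the result can be checked at compile time.
constexpr int squareOfSum(int a, int b)
{
    // After preprocessing, the return statement reads roughly:
    //     return ((a + b) * (a + b)) + 1;
    return SQUARE(a + b) + OFFSET;
}
```

With no headers involved, the -E output should be just a handful of line markers followed by the function with both macros already replaced.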
Assembly Code
Detouring for a bit: there's this rumour going around that the closer you are to the hardware, the faster you will be. While there is a modicum of truth to this, "slower" languages like Python are often slow because they're interpreted, or are memory hogs due to dynamic typing. There are plenty of Python-to-C/C++ compilers, and plenty of projects that help you run Python "faster". Don't, for the love of God, develop something in a certain language just because it is "closer to the hardware".
Anyway, now run this on the a.cpp file we had
g++ a.cpp -S
This writes the assembly to a.s, and you'll have something like this
.file "a.cpp"
.text
.section .rdata,"dr"
__ZStL19piecewise_construct:
.space 1
.lcomm __ZStL8__ioinit,1,1
.def ___main; .scl 2; .type 32; .endef
LC0:
.ascii " + \0"
LC1:
.ascii " = \0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB1502:
.cfi_startproc
leal 4(%esp), %ecx
.cfi_def_cfa 1, 0
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
.cfi_escape 0x10,0x5,0x2,0x75,0
movl %esp, %ebp
pushl %ecx
.cfi_escape 0xf,0x3,0x75,0x7c,0x6
subl $36, %esp
call ___main
movl $2, -12(%ebp)
movl $4, -16(%ebp)
movl -12(%ebp), %edx
movl -16(%ebp), %eax
addl %edx, %eax
movl %eax, -20(%ebp)
movl -12(%ebp), %eax
movl %eax, (%esp)
movl $__ZSt4cout, %ecx
call __ZNSolsEi
subl $4, %esp
movl $LC0, 4(%esp)
movl %eax, (%esp)
call __ZStlsISt11char_
Again, too damn long, try it out on your own system! The output is generated for your target architecture (the dump above is 32-bit x86; finding out what your own system's architecture is makes a good exercise). Each architecture has its own instruction set that its processor understands, so your compiler splits the work into two steps:
- Create an Abstract Syntax Tree
- Generate architecture dependent instructions
Let's go over what that means.
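Before getting into that, it can help to line up the arithmetic in the middle of the dump above with the source. This is a sketch of my reading of those instructions, not compiler documentation; the function name sumOfTwoNumbersDemo is made up, and the stack offsets are whatever that particular 32-bit MinGW build happened to choose.

```cpp
// Each comment quotes the instruction(s) from the -S dump above that
// correspond to the C++ statement next to it. Offsets are relative to %ebp
// and will differ on other targets and compiler versions.
// constexpr (C++14) is only here so the result can be checked at compile
// time; it is not in the original a.cpp.
constexpr int sumOfTwoNumbersDemo()
{
    int firstNumber = 2;    // movl $2, -12(%ebp)
    int secondNumber = 4;   // movl $4, -16(%ebp)

    // movl -12(%ebp), %edx    load firstNumber into edx
    // movl -16(%ebp), %eax    load secondNumber into eax
    // addl %edx, %eax         eax = eax + edx
    // movl %eax, -20(%ebp)    store the result in sumOfTwoNumbers
    int sumOfTwoNumbers = firstNumber + secondNumber;

    return sumOfTwoNumbers;
}
```

Everything before these instructions in the dump sets up the stack frame, and everything after is the chain of calls that implement the cout << ... line.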
Abstract Syntax Trees
An abstract syntax tree (AST) is, well, abstracted away from the target architecture. However, that's not where the "abstract" part of the term comes from. According to Wikipedia, abstract refers to the fact that "it does not refer to every detail appearing in the real syntax, but rather just structural or content related details". ASTs are produced by syntax analysis (parsing), and any syntactically valid program can be turned into one. For our code, this is what the AST looks like
Here's how to do it yourself
g++ -fdump-tree-all-graph a.cpp -o a
dot -Tpng a.cpp.013t.cfg.dot -o a.png
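This is rendered with GraphViz, so install it for your command line. You can also copy-paste the contents of a.cpp.013t.cfg.dot into any online GraphViz visualizer. The number in the file name (013t here) can differ between GCC versions, so check which .cfg.dot file was actually written. Strictly speaking, this particular dump is the control-flow graph of GCC's intermediate representation, which is built from the AST rather than being the AST itself, but it gives a good picture of how the compiler sees the program.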
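To make the idea of an AST concrete, here's a toy one for the expression firstNumber + secondNumber with the values from a.cpp. Everything in it (Node, makeLiteral, makeAdd, evaluate, show, demo) is invented for illustration; GCC's real trees carry far more information, so treat this as a sketch of the concept only.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// A toy AST node: either an integer literal or a binary "+" with two children.
struct Node {
    enum class Kind { Literal, Add };
    Kind kind = Kind::Literal;
    int value = 0;                 // used when kind == Literal
    std::unique_ptr<Node> left;    // used when kind == Add
    std::unique_ptr<Node> right;   // used when kind == Add
};

std::unique_ptr<Node> makeLiteral(int v)
{
    auto n = std::make_unique<Node>();
    n->kind = Node::Kind::Literal;
    n->value = v;
    return n;
}

std::unique_ptr<Node> makeAdd(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
{
    auto n = std::make_unique<Node>();
    n->kind = Node::Kind::Add;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Walk the tree bottom-up and compute its value.
int evaluate(const Node& n)
{
    if (n.kind == Node::Kind::Literal) return n.value;
    return evaluate(*n.left) + evaluate(*n.right);
}

// Print the tree as a parenthesised expression, e.g. "(2 + 4)".
std::string show(const Node& n)
{
    if (n.kind == Node::Kind::Literal) return std::to_string(n.value);
    return "(" + show(*n.left) + " + " + show(*n.right) + ")";
}

// Build the tree for 2 + 4 and evaluate it.
int demo()
{
    auto tree = makeAdd(makeLiteral(2), makeLiteral(4));
    assert(show(*tree) == "(2 + 4)");
    return evaluate(*tree);
}
```

Notice what the tree leaves out: whitespace, semicolons and the exact spelling of the source are gone, and only the structure (an addition with two operands) survives. That is the "abstract" part.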
Object File and Linking
An object file contains object code, which is essentially machine code (or some intermediate code). It is the "object" of the compiling process, as you can see in this classic article. The reason I didn't use that fancy "Phases of Compiling Process" chart is that it kind of abstracts away the real process of compilation. In due time, we will talk about that too. Before we go ahead, create your object (.o) file using this.
g++ a.cpp -c
Now, let's look at linking, which happens after you create your object files. The object files are linked together to create a single executable file. For this, let me divide the program into a header and a main CPP file.
//a.h
#include <stdio.h>
void printLinker()
{
    printf("Hello World");
}
Now, let's call that from a.cpp
#include "a.h"
int main()
{
    printLinker();
    return 0;
}
Finally, to show the linking, let's create another source file
//We will name this a2.cpp
void printLinker();
Compiling a.cpp would give me a Hello World, as expected. But we need to see the linking, right?
g++ a.h -c
g++ a.cpp -c
g++ a2.cpp -c
Now we have our .o files plus a .gch (a precompiled header; when one exists the compiler uses it, otherwise it falls back to the regular header). Let's link!
gcc a.o a2.o -o a2.exe
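Nice, you see how we just passed in the two object files and linked them? (Plain gcc is enough here because the code only uses printf; anything that touches the C++ standard library should be linked with g++.) Now we just need to run a2.exe, which should print Hello World.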
./a2.exe
Hello World
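One caveat about the layout above: a.h defines printLinker right in the header, which is fine with a single includer but would give duplicate-definition errors at link time if two source files included it. The more usual split is a declaration in the header and the definition in exactly one source file. Here's a sketch of that split, collapsed into one file so it can be compiled as-is; the names linkerMessage, callThroughDeclaration and checkLinkingSketch are hypothetical.

```cpp
#include <cassert>
#include <string>

// --- what the header would contain: a declaration only ---
// It tells the compiler the name and signature, but emits no code.
std::string linkerMessage();

// --- what the caller's source file would contain ---
// This compiles with only the declaration in sight. In a separate object
// file, the call is recorded as an undefined symbol to be resolved later.
std::string callThroughDeclaration()
{
    return linkerMessage();
}

// --- what a second source file would contain: the definition ---
// When the pieces live in separate files, it is the linker that connects
// the call above to this body.
std::string linkerMessage()
{
    return "Hello World";
}

// Quick self-check so the sketch can be verified from a plain main().
void checkLinkingSketch()
{
    assert(callThroughDeclaration() == "Hello World");
}
```

Split across real files, each .cpp would be compiled with -c on its own, and only the final link step would fail if the definition were missing.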
If you want to see what lies inside these object files, you can use the nm tool.
nm a.o
You get the following
00000000 b .bss
00000000 d .data
00000000 r .eh_frame
U ___main
00000015 T _main
U _printf
00000000 r .rdata
00000000 r .rdata$zzz
00000000 t .text
00000000 T __Z7print_av
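You can do the same for the other object file! The output is neatly sectioned: T marks a symbol that is defined in the code (.text) section, and U marks one that is undefined in this file and left for the linker to find, such as _printf. The last line is the C++-mangled name of the function that came in from the header (the output above shows a function named print_a; for printLinker the symbol would read __Z11printLinkerv).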
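If the mangled names look cryptic, a small experiment helps. The sketch below uses made-up functions (twice, plain_add, selfCheck); the symbol names in the comments are what I'd expect from GCC's usual mangling scheme, and on 32-bit MinGW each one gets an extra leading underscore, as in the output above.

```cpp
#include <cassert>

// Overloads share a source-level name, so the compiler encodes the
// parameter types into the symbol name ("mangling").
int twice(int x)       { return 2 * x; }   // _Z5twicei  (i = int)
double twice(double x) { return 2 * x; }   // _Z5twiced  (d = double)

// extern "C" turns mangling off, which is why C functions such as printf
// show up in nm under their plain names.
extern "C" int plain_add(int a, int b) { return a + b; }   // plain_add

// A function with no parameters ends in 'v' (void), like the last line
// of the nm output above.
void selfCheck()                                           // _Z9selfCheckv
{
    assert(twice(3) == 6);
    assert(twice(1.5) == 3.0);
    assert(plain_add(2, 4) == 6);
}
```

Compile it with g++ -c and run nm on the resulting object file to compare against the comments.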
Whew, that was a lot, but that's how a compiler compiles. Let me just recap the stages real quick.
- Preprocessing
- Compilation
- Assembly
- Linking
That's about it, folks! See you around in part 2, which I will update here.