I was bored, so I built a programming language

faraazahmad

Syed Faraaz Ahmad

Posted on March 24, 2021

I was bored, so I built a programming language

One fine evening in the first lockdown of 2020, there I was, sitting idly, not knowing what to do.

Calmly freaking out.

You see, I really needed something to do. I had been doing a few web related projects on the side and that was something I didn't want to do any more, at least for a while. So I looked into doing something "closer to the metal", something much lower level than sending requests back and forth to a web server. So I quickly fired up Learn X by doing Y and searched for something interesting, eventually ending up on Building your own Lisp (We all have a Lisp phase, it was just my turn).

image

For the uninitiated, Lisp is (simply put) a family of programming languages whose syntax looks more or less like this (You can quickly see where the parentheses jokes stem from).

(defun area-circle(rad)
   "Calculates area of a circle with given radius"
   (terpri)
   (format t "Radius: ~5f" rad)
   (format t "~%Area: ~10f" (* 3.141592 rad rad)))
(area-circle 10)
Enter fullscreen mode Exit fullscreen mode

But I didn't just want to create any Lisp, I wanted to make some modifications as per my preference. So there was only 1 way to go about doing this.

Coming up with a syntax

What got me into lisps and writing Lisp was (suprisingly) the Emacs text editor (or Emacs OS for the hardcore users). What was I doing going anywhere even remotely near Emacs? I don't know, I'll tell you some other day. You see, writing Emacs Lisp is the recommended way of configuring Emacs. Even though I didn't write a lot of Emacs Lisp, a few of the keywords I saw there didn't feel very natural to me (especially coming from using Ruby). So one of the major goals would be to have more "natural" sounding keywords.

image

You can't talk about Lisp without talking about the parentheses. A big advantage in readability for other languages is that everything isn't inside a parenthesis. Square brackets usually denote an array, braces usually denote code blocks, while parenthesis are used for function calls. It leads to easier visual grepping of code, not so much in Lisp, especially as a beginner. So I need to reduce the usage of parentheses, but I cant just change all of its syntax because then it might not stay a Lisp anymore (Ship of Theseus, anyone?). I think Clojure strikes a good balance between staying a Lisp and improving visual grepping. So another goal would be to reduce the reliance on parentheses so that the code can be more readable.

'(1 2 3)     ; list
[1 2 3]      ; vector
#{1 2 3}     ; set
{:a 1, :b 2} ; map
Enter fullscreen mode Exit fullscreen mode

(Example from Clojure syntax guide)

So it began

After setting all these requirements for the language, I set out to finally start building it. I chose Rust to build it because it seemed like the right tool to do so because of it's speed and memory safety. I was building my own programming language, I felt like a "real programmer" (no such thing btw) so naturally I started tweeting about it.

Starting simple

I wanted to start off with something as simple as possible, like adding a few numbers. It felt like the perfect starting point. I'll be evaluating a simple arithmetic expression written in Lisp, for example, 2 * 3 + 1 - 4, which would be written as (- (+ 1 (* 2 3)) 4) in Lisp. To the unaware, this is in the form of Polish notation, just think of it as a simple function with multiple arguments. (+ 1 2 3) is simply the sum of 1, 2 & 3. Its result can then be used as an argument to a parent function, and then that gets evaluated. Similarly the whole expression can be solved using the method below.

image
(Source: Crafting Interpreters)

After building the initial prototype I tested it with (+ 2 (- 4 6) 8). Doing it manually, you can tell it's 4 - 6 + 2 + 8 = 8. And if you see the result in the bottom red rectangle, the result is indeed 8.

Alt Text

Laying the building blocks

Now that we've had something to play around with, it's time to grow up and lay the foundations for a proper compiler. The first step for any compiler is to read the source file and turn it into a stream machine readable objects called tokens. This step is called lexical analysis, or lexing. It simply maps all possible characters in the programming language to a Token object. Like so:

Character Token
( Token { kind: OpenParen, value: None }
let Token { kind: Identifier, value: Some("let") }
-?([0-9])+ Token { kind: Number, value: Some(getVal()) }

In this way you can build a mapping for token and their various types, so that its easier to build your syntax tree in the next step.

Cool, now that I can generate tokens, it was time to build on top of this abstraction and give this stream of tokens a proper structure so that it's easier to work with. We're going to go examine every token and make sense of it all by generating a syntax tree. This step is called parsing and at its end, we going to have something like this:

Abstract syntax tree
(Source: Abstract Syntax tree:Wikipedia)

So I go about trying to build a syntax tree, starting with defining a basic grammar in the form of an enum which looked something like this

pub enum Expr {
    Constant(Value),
    Builtin(Ident), 
    Call(Box<Expr>, Vec<Expr>),
    While {
        condition: Box<Expr>,
        body: Vec<Expr>,
    },
}
Enter fullscreen mode Exit fullscreen mode

Reading the source code of the Rust programming language was very helpful and I might have stolen some concepts from it. Let me just quickly define what these types are:

  • Constant: Any literal of type Value (number, string, boolean)
  • Builtin: Any builtin token like an operator (+,-,*,/) or a keyword (let, print, if)
  • Call: this is simple, any function call (truth be told, everything should just be a function call since Lisp is a functional language)
  • While: A while loop (duh)

(If you're wondering, Box is Rust's built-in type for pointers, broadly speaking.)

This is the gist of it, rest of the process was just finding ways to build this syntax tree using the token stream I had. Slowly but surely, I was making progress.

Alt text

At this point, the syntax tree felt robust enough for what I wanted the language to do for now. It was time to move on to bigger and better things.

LLVM

Low..Level..Virtual Machine? good guess, but not anymore. What is LLVM? let's first take a step back and see why it's needed in the first place. You see, a compiler has 2 sides - a frontend and a backend - the former containing lexing and parsing, while the latter has intermediate code generation, optimisation and target code generation. What I'd built until now was just the frontend, and I had the internal representation that can be used by the backend.

To build all the backend from scratch, here's what I would have to do. Firstly, limit myself to an architecture since supporting multiple processor architectures was not an option, simply because of the intricacies of the way processors work. After selecting the most popular processor architecture I'd need to get a good grasp of its assembly language. You see, because of the different ways microprocessors are designed and made, they all have different instruction sets. To support an instruction set, you need to take the code the user has written in your language and translate it in the form of instructions of that specific set. Processors can work in weird ways, sometimes for compatibility reasons and sometimes just because they can. So when you optimise your generated code, you have to take all that into account. Let's say you do manage to support one architecture successfully but now there's (somehow) popular demand for your language on some other architecture, you will have to repeat this process all over again, and then again for another architecture, and so on. Don't even get me started on how you would debug errors that are specific to only some architectures. It would be a huge mess.

LLVM being an abstraction

This is where LLVM comes in, it provides a convenient abstraction layer over all the instruction sets that you can target. You only need to worry about translating to its own internal representation (LLVM IR). LLVM will then handle all the code optimisation to its IR, and then for the instruction set you're targetting. But it's still a good idea to first optimise your code and then give it to LLVM, instead of just expecting LLVM to do it all.

So to use LLVM, all I needed to do was to take my AST and convert it into LLVM IR. I did it using Inkwell the LLVM API in Rust. It is a wrapper around the C API that LLVM has built, and on top of that provides the memory safety guaranteed by Rust. Oh, and this is what LLVM IR looks like in text format (Don't try to understand it, feel it):

; ModuleID = 'example'
source_filename = "/home/faraaz/hello.tp"

@format_string = private unnamed_addr constant [4 x i8] c"%f \00", align 1

define i32 @main(i32 %0) {
entry:
  %num = alloca double
  store double 4.200000e+02, double* %num
  %num1 = load double, double* %num
  %printf = call double (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @format_string, i32 0, i32 0), double %num1)
  ret i32 0
}

declare double @printf(i8*, ...)
Enter fullscreen mode Exit fullscreen mode

Code in LLVM IR that sets a variable to 420 and prints it

Making it all legit

Until now, executing the compiler's binary did only one thing and it wasn't very configurable or helpful for someone using it for the first time. I wanted it to have a proper Command Line Interface (CLI), something like this:

CLI for htop tool

So I used the Clap crate (aka package) to build the CLI and I think it turned out all right.

CLI for Tisp

Bringing it all together

Now that I have the compiler set up, and the language has features like storing variables, performing arithmetic, looping (nested loops don't work yet) and printing to screen, it was time to make a small program with it. Here's me writing a program to compute the first 5 Fibonacci numbers in Tisp and then compile and run it:

Here's the project if you want to check it out yourself.

GitHub logo faraazahmad / tisp

A toy programming language based on Lisp and built in Rust & LLVM

Tisp (Toy Lisp)

A Lisp-like programming language that is typed and compiled. It aims to support multiple processor architectures by being built upon LLVM. It takes inspiration from programming languages like Rust, Lisp and Elixir.

Current working example

A program to compute first 5 fibonacci numbers:

(let first 0)
(let second 1)
(let fib)
(let n 0)

(while (< n 5)
    (let fib (+ first second))
    (let second first)
    (let first fib)

    (let n (+ n 1))
    (print fib)
)
Enter fullscreen mode Exit fullscreen mode

Features to build

  • Convert raw code into token stream
  • Convert token stream into Expression tree
  • Handle multiple levels of nested expressions
  • Have multiple (independent) expressions per file
  • Generate LLVM IR for the currently supported features
  • add CLI flag to emit llvm
  • add while loop
  • Declare variables
  • add…

Here's what I learned

Programming languages are hard. Its' hard to read code in the first place. To understand and learn one is way harder. One can only imagine how hard it is to build one, let alone build one that is widely used. This project was mostly me trying to answer the question "How hard can it really be?", I guess I have my answer now. But it's also kinda easy low-key. Once you dip your toes in the hard parts, the next time you do it, it feels a little easier. Every day you do something hard, it gets a little easier, but you gotta do it everyday, that's the hard part.

💖 💪 🙅 🚩
faraazahmad
Syed Faraaz Ahmad

Posted on March 24, 2021

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related