Crafting Interpreters in OCaml - Finishing the Lexer
Nikhil Thomas
Posted on May 22, 2019
This builds on my last post.
To follow along the source material, check out the Scanning chapter of Crafting Interpreters
Moving on to section 4.5.2 of the book, we have multi-character operators. Cool! Specifically, we have the !=
, ==
, <=
and >=
operators. Bob defines a match
method that takes some expected character and checks to see if the current character is some character. If we're at the end, it's not. If the current character isn't the expected character, it's false. Else, we advance and return true. Okay that seems straightforward.
Let's define a match
function. Remember that we don't have any global variables, so we're going to have to pass in context as well as the character. Since we want to conditionally mutate current
, it makes sense that our function takes and returns a scanner_context
object. We also want to maintain that boolean return, so let's return a tuple.
We also need our match function to look at the character in the next position, so we'll need a helper function.
In our scanner.mli
file we can define the following:
val next_char: scanner_context -> char option
val match_character: char -> scanner_context -> bool * scanner_context
This defines a function that takes a character and a context and returns a 2-tuple of boolean and context types.
Now we can write some code
let next_char context =
try Some (String.get context.source (context.current)) with Invalid_argument _ -> None
let match_character expected_char context = if (is_at_end context) then (false, context) else
match next_char context with
| Some c when c = expected_char -> (true, advance context)
| _ -> (false, context)
You may note that next_char
and current_char
look really similar. I don't really have a good solution for that but that's okay for now.
Now we can add some new cases to our match statement. We're going to have to write a lot of cases that look something like
let x = match match_character '=' advanced_context with
| (true, ctx) -> add_non_literal_token BANG_EQUAL ctx
| (false, ctx) -> add_non_literal_token BANG ctx
But for all four above two-character operators. Since I'm finally being lazy, let's write a helper. Our new function needs to take an expected character to match, a token for the true case, a token for the false case, and a context. It will return a new context with the token added. That signature will look like
val add_conditional_non_literal_token: char -> token_type -> token_type -> scanner_context -> scanner_context
Neat. Now we can write that function
let add_conditional_non_literal_token expected_char token_if_true token_if_false context =
match (match_character expected_char context) with
| (true, new_context) -> add_non_literal_token token_if_true new_context
| (false, new_context) -> add_non_literal_token token_if_false new_context
And our scan_token
function now has the following new clauses:
let scan_token context =
let advanced_context = advance context in
match (current_char advanced_context) with
| Some '(' -> add_non_literal_token LEFT_PAREN advanced_context
| Some ')' -> add_non_literal_token RIGHT_PAREN advanced_context
| Some '{' -> add_non_literal_token LEFT_BRACE advanced_context
| Some '}' -> add_non_literal_token RIGHT_BRACE advanced_context
| Some ',' -> add_non_literal_token COMMA advanced_context
| Some '.' -> add_non_literal_token DOT advanced_context
| Some '-' -> add_non_literal_token MINUS advanced_context
| Some '+' -> add_non_literal_token PLUS advanced_context
| Some ';' -> add_non_literal_token SEMICOLON advanced_context
| Some '*' -> add_non_literal_token STAR advanced_context
| Some '!' -> add_conditional_non_literal_token '=' BANG_EQUAL BANG advanced_context
| Some '=' -> add_conditional_non_literal_token '=' EQUAL_EQUAL EQUAL advanced_context
| Some '>' -> add_conditional_non_literal_token '=' GREATER_EQUAL GREATER advanced_context
| Some '<' -> add_conditional_non_literal_token '=' LESS_EQUAL LESS advanced_context
| _ -> failwith "Disallowed character"
So far we're doing great! Section 4.6 adds some more lexeme handling. Specifically, comments. Let's describe in words what Bob's code does.
If we see a slash, check to see if the next character is also a slash. If it is, just more forward until we either see a newline or reach the end of the file. If the next character ISN'T a slash, then just add a slash token. In our world we don't have multiline comments (whew).
So if we look a little ahead, we also have this peek
function that, if we're at the end, returns a null character, else the character at the current position. We can use another char option
here since we don't need nulls in this magical ML world.
It turns out we already wrote this function! next_char
does the exact same thing. So we'll just use that. If you want, you can alias it as let peek = next_char
but I won't.
We can use our match_character
helper here to create another helper function. We want a function that either adds a slash token, or just moves on. Let's call that function add_conditional_slash
for now. It'll take context and return context.
val add_conditional_slash: scanner_context -> scanner_context
let add_conditional_slash context =
let rec consume_line context =
match (not (next_char context = Some '\n')), not (is_at_end context)) with
| (true, true) -> consume_line (advance context)
| _ -> context
in
if match_character '/' context then consume_line context else add_non_literal_token SLASH context
Let's explain this code:
We nest a function to consume the entire line that says "if the next character isn't a newline AND we're not at the end of the file, then move forward a character and check again. Once either we see a newline or end of file, return the context object.
Then we simply say if the next character is a slash, consume the line. Else, add a slash token. Maybe later we'll pull out this nested function but for now I think it's okay, we have no other use cases for it.
Right - so we have some default cases that Bob lists. Let's translate those by either returning context, or in the case of a newline, increment the line in our context.
let scan_token context =
let advanced_context = advance context in
match (current_char advanced_context) with
| Some '(' -> add_non_literal_token LEFT_PAREN advanced_context
| Some ')' -> add_non_literal_token RIGHT_PAREN advanced_context
| Some '{' -> add_non_literal_token LEFT_BRACE advanced_context
| Some '}' -> add_non_literal_token RIGHT_BRACE advanced_context
| Some ',' -> add_non_literal_token COMMA advanced_context
| Some '.' -> add_non_literal_token DOT advanced_context
| Some '-' -> add_non_literal_token MINUS advanced_context
| Some '+' -> add_non_literal_token PLUS advanced_context
| Some ';' -> add_non_literal_token SEMICOLON advanced_context
| Some '*' -> add_non_literal_token STAR advanced_context
| Some '!' -> add_conditional_non_literal_token '=' BANG_EQUAL BANG advanced_context
| Some '=' -> add_conditional_non_literal_token '=' EQUAL_EQUAL EQUAL advanced_context
| Some '>' -> add_conditional_non_literal_token '=' GREATER_EQUAL GREATER advanced_context
| Some '<' -> add_conditional_non_literal_token '=' LESS_EQUAL LESS advanced_context
| Some ' ' -> advanced_context
| Some '\r' -> advanced_context
| Some '\t' -> advanced_context
| Some '\n' -> {advanced_context with line = advanced_context.line + 1}
| _ -> failwith "Disallowed character"
Not so bad! This is looking more and more readable. How rad is this? Let's handle literals and then write some tests.
Bob defers all string handling to a function called string
. In words, it does the following.
As long as the next character isn't a quote, and as long as we're not at the end just consume. If we see a newline, increment line and keep going. If we see the end of the file along the way, throw an error. Once we see the closing quote, advance one more, then grab the string and add a String token. Okay! Let's make our own function. We'll call it add_string_literal
which will take and output a scanner_context.
There's some conditonal line incrementing logic that I'm going to make it's own function just for ease as well.
val increment_line_if_newline: scanner_context -> scanner_context
val add_string_literal: scanner_context -> scanner_context
let increment_line_if_newline context =
match next_char context with
| Some '\n' -> {context with line = context.line+1}
| _ -> context
let rec add_string_literal context =
if( not ( next_char context = Some('"')) && (not (is_at_end context)) ) then
context |> increment_line_if_newline |> advance |> add_string_literal
else if (is_at_end context) then failwith "Unterminated string" else
let str = (String.sub context.source (context.start + 1) (context.current - 1)) in
add_literal_token STRING (Some (STRING_LITERAL str)) context
It's a bit gnarly and imperative. I'd like to find a way to pattern match out. BUT now we can add the following to our scan_tokens
| Some '"' -> add_string_literal advanced_context
Starting to get exciting now. Let's also handle number literals in section 4.6.2. The rules of numbers are simple - if there is a .
character, it must have 1 or more digits on both sides. So let's try and implement this.
We need a function is_digit
that takes a char option
and returns true or false, and then a add_number_literal
function that takes and returns a scanner_context
. We'll also create a peek_next
along the way that takes a scanner_context
and returns char option
.
To make things easy, we should write a function that simply captures digits. Let's call it capture_digits
. It'll just abstract out the Java code of while (isDigit(peek())) advance();
. We can also write a function to defer to above with capture_decimal
In our interface:
val is_digit: char option -> bool
val capture_digits: scanner_context -> scanner_context
val capture_decimal: scanner_context -> scanner_context
val peek_next: scanner_context -> char option
val add_number_literal: scanner_context -> scanner_context
And back in scanner.ml
let peek_next context =
try Some (String.get context.source (context.current + 1)) with Invalid_argument _ -> None
let is_digit character = match character with
| Some c when c >= '0' && c <= '9' -> true
| _ -> false
let rec capture_digits context =
if is_digit (next_char context) then context |> advance |> capture_digits else context
let capture_decimal context = match (next_char context) with
| Some '.' when (is_digit peek_next context) -> context |> advance |> capture_digits
| _ -> context
let add_number_literal context =
let new_context = context |> capture_digits |> capture_decimal in
let stringified_number = (String.sub new_context.source new_context.start (new_context.current - new_context.start)
let parsed_number = int_of_string stringified_number in
add_literal_token NUMBER NUMBER_LITERAL(stringified_number) new_context
Then just add in scan_tokens
the following clause
| digit when (is_digit digit) -> add_number_literal advanced_context
Spicy. If you go ahead and dune runtest
you should get some compiler errors about unused characters but no syntax or interface errors. Woohoo!
Let's finish up identifiers real quick. Once again, skimming ahead in Bob's chapter we see the claim if isAlpha
or in English, "if the next lexeme begins with a letter or underscore, capture everything that's alphanumeric after and assume it's an identifier.
Our version of is_alpha
will take a char option
and return boolean. So will is_alpha_numeric
. Our add_identifier_literal
will need to take a scanner context and return one. Let's create a new literal_type
type literal_type = STRING_LITERAL of string | NUMBER_LITERAL of float | IDENTIFIER_LITERAL of string
This new type will just wrap our identifiers that users create. If a user defines a top levek function, it's an IDENTIFIER_LITERAL. If it's a usual keyword like AND, it's just an IDENTIFIER.
To our interface to define implementations.
val is_alpha: char option -> bool
val is_alpha_numeric: char option -> bool
val add_identifier_literal: scanner_context -> scanner_context
And our implementation
let is_alpha character = match character with
| Some c when c >= 'a' && c <='z' -> true
| Some c when c >= 'A' && c <='Z' -> true
| Some c when c = '_' -> true
| _ -> false
let is_alpha_numeric character = (is_alpha character) || (is_digit character)
let rec add_identifier_literal context =
if (is_alpha_numeric (next_char context)) then context |> advance |> add_identifier_literal
else
let substring = (String.sub context.source context.start (context.start - context.current)) in
match substring with
| "and" -> add_non_literal_token AND context
| "class" -> add_non_literal_token CLASS context
| "else" -> add_non_literal_token ELSE context
| "false" -> add_non_literal_token FALSE context
| "for" -> add_non_literal_token FOR context
| "fun" -> add_non_literal_token FUN context
| "if" -> add_non_literal_token IF context
| "nil" -> add_non_literal_token NIL context
| "or" -> add_non_literal_token OR context
| "print" -> add_non_literal_token PRINT context
| "return" -> add_non_literal_token RETURN context
| "super" -> add_non_literal_token SUPER context
| "this" -> add_non_literal_token THIS context
| "true" -> add_non_literal_token TRUE context
| "var" -> add_non_literal_token VAR context
| "while" -> add_non_literal_token WHILE context
| _ -> add_literal_token IDENTIFIER (Some (IDENTIFIER_LITERAL substring)) context
and then we can add in our scan_tokens
clause the last piece.
| alpha when (is_alpha alpha) -> add_identifier_literal advanced_context
And now if you dune runtest
one more time the only issue you should get is the call to Scanner.scan_tokens
in olox.ml
. Just go ahead and open Scanner
at the top to fix that.
We now have a working scanner. In the next chapter we'll comment out the unfinished code in olox.ml
and write some unit tests.
Posted on May 22, 2019
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.