Building a Toy Lexer in Ruby
Honeybadger Staff
Posted on August 4, 2020
This article was originally written by Alex Braha Stoll on the Honeybadger Developer Blog.
Full Source on Github
A complete implementation of the Stoffle programming language is available at GitHub. This reference implementation has a lot of comments to help make reading the code easier. Feel free to open an issue if you find bugs or have questions.
In this blog post, we're going to implement the lexer for Stoffle, a toy programing language built entirely in Ruby. You can read more about this project in the first part of this series.
Learning how to build a lexer, even if it's a simple one, will make you a better programmer. It will help you understand how programming languages really work and will open doors to other fields, such as natural-language processing. It might even help you ace that technical interview.
What's a Lexer?
Lexers, also known as scanners, convert strings of characters into groups of tokens. Imagine we have the following line of code:
my_var = 1
This code may mean something to us as human beings, but to the computer, it's just text.
Our lexer should be able to parse this text and generate a data structure that looks something like this:
[:identifier, :'=', :number]
Once parsed, we no longer need the orgininal text. Implementing our programming language will be a matter of transforming and interpreting the data structure generated by the lexer.
How do you Build a Lexer?
If I were to give an elevator pitch of our implementation strategy, it would sound something like this:
We read the source code character by character, trying to assign each character to a specific "lexeme." Lexemes tell us which tokens to use in our final data structure.
Here are some helpful definitions:
Lexeme: the sequence of characters that matches the pattern associated with a token. Stoffle uses an
fn
token to define a function, and to produce this token, our lexer must find the lexeme"fn"
present in the source code.Token: a data structure that conveys the meaning of a lexeme. For example, the lexeme
"42.42"
would be interpreted as a number. Its token would probably include the numeric value42.42
.
Introducing Stoffle
Before we can build a lexer for Stoffle's syntax, we need to have some idea of what the syntax actually is.
The snippet below shows a simple Stoffle program. It sums all integers between two numbers.
fn sum_integers: first_integer, last_integer
i = first_integer
sum = 0
while i <= last_integer
sum = sum + i
i = i + 1
end
println(sum)
end
sum_integers(1, 100)
At this point, our objective is not to comprehend every corner of the syntax, in which characters and combinations thereof are allowed, or every bit of the semantics (i.e., the meanings of characters and their combinations), but to start gaining a general sense of the Stoffle programming language.
The Mathematician Who Inspired our Stoffle Sample Program
While thinking of a simple and suitable sample program, stories that a friend of mine used to tell about different mathematicians came to mind. Carl Friedrich Gauss supposedly figured out on his own, at only 7 years old, a formula to sum up numbers in a series.
Stoffle's Reserved Words
After reading the sample above, you may be thinking: "this looks a lot like Ruby!". Well, you're darn right if that crossed your mind! It's no mystery that here at Honeybadger, we adore the Ruby programming language. Stoffle is heavily inspired by Ruby; however, there are some minor differences to keep things fresh.
Let's review the main differences between Ruby and Stoffle:
- In Stoffle, we use
fn
instead ofdef
to declare functions; - We indicate the parameters a function receives by inserting a
:
after the function's name instead of using matching()
; - We always use
()
when calling a function, even when it is not receiving any parameters.
Before being able to actually build a lexer for Stoffle, it's necessary to spell out the whole list of characters/keywords we will be looking for. So, here we go:
( ) : , . + - * / ! = == != > < >= <= " #
true false and or
if else elsif
end
fn
return
while
println
nil
Getting Started: Implementing a Token
A token is a data structure. In our implementation, this will be expressed as a Ruby class with a couple attributes. Here are its essential bits:
require 'forwardable'
module Stoffle
class Token
extend Forwardable
attr_reader :type, :lexeme, :literal, :location
def_delegators :@location, :line, :col, :length
def initialize(type, lexeme, literal, location)
@type = type
@lexeme = lexeme
@literal = literal
@location = location
end
# ...
end
end
The @type
holds a symbol to identify which token we're dealing with. Our convention is to expect the symbol to be the lexeme (i.e., a string matching the token) converted to a Symbol type; for example, the symbol expected for the "("
lexeme is :'('
.
The exception is tokens whose pattern is a rule instead of a fixed set of characters. Identifiers, such as keywords and variable and function names, in Stoffle must be started by a letter and can also have numbers and underlines in their composition. As an example, the token object created for a my_variable
identifier will have the :identifier
symbol in its @type
attribute.
The @lexeme
is the string matching the pattern expected by a token. Here are some examples: ":"
, ">="
, "my_var"
.
The @literal
is optional and will be used for tokens that have a runtime value associated with them. For example, a :number
token may have "2020"
as the lexeme and 2020.0
as its value.
The @location
holds a Ruby Struct
containing the line and the column where we can find the lexeme at the source code. It also has the length of the lexeme. We won't go into detail about this localization mechanism, but you can check the Location
implementation at GitHub.
The Lexer's Core Logic
The most important methods of the Lexer
class are responsible for giving life to the basic strategy we discussed above. The public API of the Lexer
class is very simple; after creating the object, we just have to call #start_tokenization
to get the list of tokens that compose the source code that was supplied. #start_tokenization
calls the private method #tokenize
repeatedly to have each token produced. #tokenize
- as we will see shortly - interacts with many other smaller, private methods, each responsible for dealing with a specific case (some examples: #string
, #number
, and #identifier
).
Let's start from the beginning with #start_tokenization
:
class Lexer
# ...
def initialize(source)
@source = source
@tokens = []
@line = 0
@next_p = 0
@lexeme_start_p = 0
end
def start_tokenization
while source_uncompleted?
tokenize
end
tokens << Token.new(:eof, '', nil, after_source_end_location)
end
# ...
end
We loop through the source code until we reach the end of the file. As we consume new characters, we try to match them to various tokens that are part of Stoffle. To mark the end of the source in a clean manner, the method ends by adding a final :eof
(eof stands for end of file) token to the instance variable @tokens
.
The meat of the implementation is in the method we repeatedly call from #start_tokenization
, the #tokenize
method:
# inside Stoffle::Lexer
def tokenize
self.lexeme_start_p = next_p
token = nil
c = consume
return if WHITESPACE.include?(c)
return ignore_comment_line if c == '#'
if c == "\n"
self.line += 1
tokens << token_from_one_char_lex(c) if tokens.last&.type != :"\n"
return
end
token =
if ONE_CHAR_LEX.include?(c)
token_from_one_char_lex(c)
elsif ONE_OR_TWO_CHAR_LEX.include?(c)
token_from_one_or_two_char_lex(c)
elsif c == '"'
string
elsif digit?(c)
number
elsif alpha_numeric?(c)
identifier
end
if token
tokens << token
else
raise('Unknown character.')
end
end
The #tokenize
method is responsible, along with the help of auxiliary methods, for analyzing the current character, classifying it, and then emitting the appropriate token. Depending on the scenario, it may be necessary to consume multiple next characters before we're able to make a decision on which token to produce.
Before we explore the methods responsible for determining whether a character or group of characters match with the pattern associated with each specific token, we have to comprehend the mechanism used to progress through the source code. Its implementation happens mostly in two methods, #lookahead
and #consume
. First, there is the #lookahead
method:
# inside Stoffle::Lexer
def lookahead(offset = 1)
lookahead_p = (next_p - 1) + offset
return "\0" if lookahead_p >= source.length
source[lookahead_p]
end
To move through the source code, we use three "pointers". These are variables holding integers whose values are a position in the array of characters that is our source code (the instance variable @source
). Two of them - @lexeme_start_p
and @next_p
- are instance variables. The third one, lookahead_p
, is local to #lookahead
.
Both @lexeme_start_p
and @next_p
are initialized to 0 when a Lexer
object is created. In our lexer, we're always consuming one character at a time. @next_p
has the job of pointing to the next character in the pipeline. The task of @lexeme_start_p
is a bit different, to track the position of the character that started the current lexeme (remember, some lexemes are made of multiple chars!).
The #lookahead
method returns a character that is offset from the char we're currently consuming by an arbitrary number of positions. This offset is given by the value of the parameter offset
, defaulting to 1. Notice that since the offset is from the character currently being consumed, calling #lookahead
with an offset of 1 is equivalent to source[next_p]
. The #lookahead
implementation also includes a guard to protect us against going after the end of the source.
A simple but crucial bit is that when comparing lookahead_p
with source.length
we use >=
because source
is a plain old array whose index starts at 0. Notice also that when using #lookahead
, we are only looking and not consuming (i.e., moving @next_p
) characters.
If you understood what I elucidated above, #consume
will be a stroll in the park and requires no further explanation:
def consume
c = lookahead
self.next_p += 1
c
end
Whitespace Chars, Comments, and Newlines
Now, let's go back to #tokenize
and inspect it part-by-part. First, let's look at the beginning of its body:
# inside Stoffle::Lexer#tokenize
self.lexeme_start_p = next_p
token = nil
c = consume
return if WHITESPACE.include?(c)
return ignore_comment_line if c == '#'
if c == "\n"
self.line += 1
tokens << token_from_one_char_lex(c) if tokens.last&.type != :"\n"
return
end
Every time we start the process of trying to emit the next token, we have to set @lexeme_start_p
to the value being held by @next_p
since this is, potentially, the pointer to the beginning of the next lexeme. We then consume the character we are about to analyze by calling our old and dear #consume
method.
Next, we start by checking three uninteresting but relevant scenarios. If our character matches one of the chars we are classifying as whitespace, we don't generate a token and simply move on by returning. The same is true if we detect our character is a #
, but in this case, we ignore all chars until the end of the line, not only the current one. If you want to check out the implementation of #ignore_comment_line
, please take a look at Stoffle's GitHub repo.
The last lines of this segment are quite significant. As you might have realized, Stoffle does not use any explicit terminator, such as the ;
we see in many popular programming languages. In Ruby, a newline is used to terminate a line of code.
Here, we could go for more sophisticated rules, but since Stoffle is an educational project, it's better to keep it as simple as possible. In Stoffle, every newline at the end of a non-empty line is considered meaningful. It is trivial and easy to implement rule of thumb, but it has some consequences we must be aware of. The following snippet, which would be valid in Ruby, is not valid Stoffle code:
two = 1 +
1
If we try to later on run a program that is splitting a line in this fashion, we'll get an error. I think this is something we can live with and a reasonable compromise to keep things as uncomplicated as possible. So, if we find out the current char is a \n
, we have to do two things:
- Advance a counter that keeps track of the current line we are processing (this will be used for precise error reporting, but it's out of the scope of this post);
- We only emit a newline if the last token is not also a newline token. A newline token marks the end of a significant line of code, so we don't want to emit tokens for blank lines that are only being used to organize the source code.
One and Two Character Lexemes, Strings, Numbers, and Identifiers
Let's continue delving into #tokenize
. Now, the main part of the method:
# inside Stoffle::Lexer#tokenize
token =
if ONE_CHAR_LEX.include?(c)
token_from_one_char_lex(c)
elsif ONE_OR_TWO_CHAR_LEX.include?(c)
token_from_one_or_two_char_lex(c)
elsif c == '"'
string
elsif digit?(c)
number
elsif alpha_numeric?(c)
identifier
end
The first two branches of this conditional expression are straightforward. We first determine whether our character matches one of the multiple lexemes of length 1. If so, we emit the appropriate token. In case that doesn't happen, we test against a special group of lexemes containing 1 or 2 characters. This group contains lexemes such as "="
and "=="
.
The strategy here is not difficult; if the current character matches a lexeme of size 1 belonging to this group, we just make sure we're in fact dealing with the lexeme in question by taking a look at the next character in the source. The idea is to always "bite" as many chars as we can! This appproach guarantees, for example, that ==
results in the equality token being produced (and not the assignment token twice).
It's time now for us to inspect the #string
method, which is called when we know we are dealing with a string:
# inside Stoffle::Lexer
def string
while lookahead != '"' && source_uncompleted?
self.line += 1 if lookahead == "\n"
consume
end
raise 'Unterminated string error.' if source_completed?
consume # consuming the closing '"'.
lexeme = source[(lexeme_start_p)..(next_p - 1)]
# the actual value of the string is the content between the double quotes.
literal = source[(lexeme_start_p + 1)..(next_p - 2)]
Token.new(:string, lexeme, literal, current_location)
end
Here, we use the #lookahead
method we described earlier (combined with the also familiar #consume
) to keep consuming characters until the next char to be processed is the closing "
. If we can't find the closing "
before the end of the source, we raise an error. Causing our lexer to fail in such a manner is definitely not the best approach, but it will do for now.
Finally, we use our pointers to extract the lexeme and the actual value of the string. We include double quotes for the lexeme but not for the value.
In case our character being processed at #tokenize
failed the string test, we next determine whether it is a number. If so, here's how we handle it and generate the associated token:
# inside Stoffle::Lexer
def number
consume_digits
# Look for a fractional part.
if lookahead == '.' && digit?(lookahead(2))
consume # consuming the '.' character.
consume_digits
end
lexeme = source[lexeme_start_p..(next_p - 1)]
Token.new(:number, lexeme, lexeme.to_f, current_location)
end
In this method, we continue moving through the source as long as the next characters are digits; this is basically what #consume_digits
does. Note that we do take numbers with a fractional part into consideration. To reduce complexity, all numbers in Stoffle are floats, so we make sure to leverage Ruby's #to_f
method to make the process of generating an actual float value for the token a piece of cake.
For characters that had no match up to this point, we perform a final verification to see if we are dealing with an alphanumeric. If positive, the method #identifier
is the one used to handle the task:
# inside Stoffle::Lexer
def identifier
while alpha_numeric?(lookahead)
consume
end
identifier = source[lexeme_start_p..(next_p - 1)]
type =
if KEYWORD.include?(identifier)
identifier.to_sym
else
:identifier
end
Token.new(type, identifier, nil, current_location)
end
In this case, the plan is to keep consuming the next characters until we find one that is not a number, letter, or underscore, which is what #alpha_numeric?
is verifying. For identifiers, there are two scenarios. We may be dealing with a language keyword (such as fn
or while
) or an identifier defined by the programmer (e.g., a variable name). Since keywords can't be used to name a variable or a function, we first try to match the lexeme against one of Stoffle's keywords. Not being able to do so, we are sure that the right thing to do is generate a generic :identifier
token.
Finally, here's the last bit of #tokenize
:
# inside Stoffle::Lexer#tokenize
if token
tokens << token
else
raise('Unknown character.')
end
We push the most recently generated token to @tokens
. However, if no token was emitted, we can be sure at this point that we hit a character unsupported by Stoffle, and we should then ring the alarm.
Your Next Steps
Hooray, we have just gone through the principal parts of our lexer! Since it would be exhausting to explore all the minor details, I urge you to scan the complete implementation at GitHub. If you like, take a look at the test suite included, as the examples provided there may help illuminate some detail of the implementation that you may have found mysterious at first glance.
Other Usages of a Lexer
Now that you understand the basics on how to implement a lexer, you may be thinking of where else lexers are applied. A curious usage is in the field of natural language processing, in which a lexer can be used to transform raw input, either text or sound, into a more structured list of tokens to be leveraged by higher layers in a processing pipeline. Expect, however, these lexers to be significantly more complex than the one we have just implemented.
Parting Thoughts
Our little language is starting to come to life! We now have a working lexer that is able to produce the valuable tokens our parser will need to do its job. The star of the next post in the series will be the parser itself. We'll see you soon in the next step of bringing Stoffle into being!
Posted on August 4, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 30, 2024