Building a Toy Lexer in Ruby

honeybadger_staff

Honeybadger Staff

Posted on August 4, 2020

Building a Toy Lexer in Ruby

This article was originally written by Alex Braha Stoll on the Honeybadger Developer Blog.

Full Source on Github

A complete implementation of the Stoffle programming language is available at GitHub. This reference implementation has a lot of comments to help make reading the code easier. Feel free to open an issue if you find bugs or have questions.

In this blog post, we're going to implement the lexer for Stoffle, a toy programing language built entirely in Ruby. You can read more about this project in the first part of this series.

Learning how to build a lexer, even if it's a simple one, will make you a better programmer. It will help you understand how programming languages really work and will open doors to other fields, such as natural-language processing. It might even help you ace that technical interview.

What's a Lexer?

Lexers, also known as scanners, convert strings of characters into groups of tokens. Imagine we have the following line of code:

my_var = 1
Enter fullscreen mode Exit fullscreen mode

This code may mean something to us as human beings, but to the computer, it's just text.

Our lexer should be able to parse this text and generate a data structure that looks something like this:

[:identifier, :'=', :number]
Enter fullscreen mode Exit fullscreen mode

Once parsed, we no longer need the orgininal text. Implementing our programming language will be a matter of transforming and interpreting the data structure generated by the lexer.

How do you Build a Lexer?

If I were to give an elevator pitch of our implementation strategy, it would sound something like this:

We read the source code character by character, trying to assign each character to a specific "lexeme." Lexemes tell us which tokens to use in our final data structure.

Here are some helpful definitions:

  • Lexeme: the sequence of characters that matches the pattern associated with a token. Stoffle uses an fn token to define a function, and to produce this token, our lexer must find the lexeme "fn" present in the source code.

  • Token: a data structure that conveys the meaning of a lexeme. For example, the lexeme "42.42" would be interpreted as a number. Its token would probably include the numeric value 42.42.

Introducing Stoffle

Before we can build a lexer for Stoffle's syntax, we need to have some idea of what the syntax actually is.

The snippet below shows a simple Stoffle program. It sums all integers between two numbers.

fn sum_integers: first_integer, last_integer
  i = first_integer
  sum = 0
  while i <= last_integer
    sum = sum + i

    i = i + 1
  end

  println(sum)
end

sum_integers(1, 100)
Enter fullscreen mode Exit fullscreen mode

At this point, our objective is not to comprehend every corner of the syntax, in which characters and combinations thereof are allowed, or every bit of the semantics (i.e., the meanings of characters and their combinations), but to start gaining a general sense of the Stoffle programming language.

The Mathematician Who Inspired our Stoffle Sample Program

While thinking of a simple and suitable sample program, stories that a friend of mine used to tell about different mathematicians came to mind. Carl Friedrich Gauss supposedly figured out on his own, at only 7 years old, a formula to sum up numbers in a series.

Stoffle's Reserved Words

After reading the sample above, you may be thinking: "this looks a lot like Ruby!". Well, you're darn right if that crossed your mind! It's no mystery that here at Honeybadger, we adore the Ruby programming language. Stoffle is heavily inspired by Ruby; however, there are some minor differences to keep things fresh.

Let's review the main differences between Ruby and Stoffle:

  • In Stoffle, we use fn instead of def to declare functions;
  • We indicate the parameters a function receives by inserting a : after the function's name instead of using matching ();
  • We always use () when calling a function, even when it is not receiving any parameters.

Before being able to actually build a lexer for Stoffle, it's necessary to spell out the whole list of characters/keywords we will be looking for. So, here we go:

( ) : , . + - * / ! = == != > < >= <= " #
true false and or
if else elsif
end
fn
return
while
println
nil
Enter fullscreen mode Exit fullscreen mode

Getting Started: Implementing a Token

A token is a data structure. In our implementation, this will be expressed as a Ruby class with a couple attributes. Here are its essential bits:

require 'forwardable'

module Stoffle
  class Token
    extend Forwardable

    attr_reader :type, :lexeme, :literal, :location
    def_delegators :@location, :line, :col, :length

    def initialize(type, lexeme, literal, location)
      @type = type
      @lexeme = lexeme
      @literal = literal
      @location = location
    end
    # ...
  end
end
Enter fullscreen mode Exit fullscreen mode

The @type holds a symbol to identify which token we're dealing with. Our convention is to expect the symbol to be the lexeme (i.e., a string matching the token) converted to a Symbol type; for example, the symbol expected for the "(" lexeme is :'('.

The exception is tokens whose pattern is a rule instead of a fixed set of characters. Identifiers, such as keywords and variable and function names, in Stoffle must be started by a letter and can also have numbers and underlines in their composition. As an example, the token object created for a my_variable identifier will have the :identifier symbol in its @type attribute.

The @lexeme is the string matching the pattern expected by a token. Here are some examples: ":", ">=", "my_var".

The @literal is optional and will be used for tokens that have a runtime value associated with them. For example, a :number token may have "2020" as the lexeme and 2020.0 as its value.

The @location holds a Ruby Struct containing the line and the column where we can find the lexeme at the source code. It also has the length of the lexeme. We won't go into detail about this localization mechanism, but you can check the Location implementation at GitHub.

The Lexer's Core Logic

The most important methods of the Lexer class are responsible for giving life to the basic strategy we discussed above. The public API of the Lexer class is very simple; after creating the object, we just have to call #start_tokenization to get the list of tokens that compose the source code that was supplied. #start_tokenization calls the private method #tokenize repeatedly to have each token produced. #tokenize - as we will see shortly - interacts with many other smaller, private methods, each responsible for dealing with a specific case (some examples: #string, #number, and #identifier).

Let's start from the beginning with #start_tokenization:

class Lexer
  # ...
  def initialize(source)
    @source = source
    @tokens = []
    @line = 0
    @next_p = 0
    @lexeme_start_p = 0
  end

  def start_tokenization
    while source_uncompleted?
      tokenize
    end

    tokens << Token.new(:eof, '', nil, after_source_end_location)
  end
  # ...
end
Enter fullscreen mode Exit fullscreen mode

We loop through the source code until we reach the end of the file. As we consume new characters, we try to match them to various tokens that are part of Stoffle. To mark the end of the source in a clean manner, the method ends by adding a final :eof (eof stands for end of file) token to the instance variable @tokens.

The meat of the implementation is in the method we repeatedly call from #start_tokenization, the #tokenize method:

# inside Stoffle::Lexer
def tokenize
  self.lexeme_start_p = next_p
  token = nil

  c = consume

  return if WHITESPACE.include?(c)
  return ignore_comment_line if c == '#'
  if c == "\n"
    self.line += 1
    tokens << token_from_one_char_lex(c) if tokens.last&.type != :"\n"

    return
  end

  token =
    if ONE_CHAR_LEX.include?(c)
      token_from_one_char_lex(c)
    elsif ONE_OR_TWO_CHAR_LEX.include?(c)
      token_from_one_or_two_char_lex(c)
    elsif c == '"'
      string
    elsif digit?(c)
      number
    elsif alpha_numeric?(c)
      identifier
    end

  if token
    tokens << token
  else
    raise('Unknown character.')
  end
end
Enter fullscreen mode Exit fullscreen mode

The #tokenize method is responsible, along with the help of auxiliary methods, for analyzing the current character, classifying it, and then emitting the appropriate token. Depending on the scenario, it may be necessary to consume multiple next characters before we're able to make a decision on which token to produce.

Before we explore the methods responsible for determining whether a character or group of characters match with the pattern associated with each specific token, we have to comprehend the mechanism used to progress through the source code. Its implementation happens mostly in two methods, #lookahead and #consume. First, there is the #lookahead method:

# inside Stoffle::Lexer
def lookahead(offset = 1)
  lookahead_p = (next_p - 1) + offset
  return "\0" if lookahead_p >= source.length

  source[lookahead_p]
end
Enter fullscreen mode Exit fullscreen mode

To move through the source code, we use three "pointers". These are variables holding integers whose values are a position in the array of characters that is our source code (the instance variable @source). Two of them - @lexeme_start_p and @next_p - are instance variables. The third one, lookahead_p, is local to #lookahead.

Both @lexeme_start_p and @next_p are initialized to 0 when a Lexer object is created. In our lexer, we're always consuming one character at a time. @next_p has the job of pointing to the next character in the pipeline. The task of @lexeme_start_p is a bit different, to track the position of the character that started the current lexeme (remember, some lexemes are made of multiple chars!).

The #lookahead method returns a character that is offset from the char we're currently consuming by an arbitrary number of positions. This offset is given by the value of the parameter offset, defaulting to 1. Notice that since the offset is from the character currently being consumed, calling #lookahead with an offset of 1 is equivalent to source[next_p]. The #lookahead implementation also includes a guard to protect us against going after the end of the source.

A simple but crucial bit is that when comparing lookahead_p with source.length we use >= because source is a plain old array whose index starts at 0. Notice also that when using #lookahead, we are only looking and not consuming (i.e., moving @next_p) characters.

If you understood what I elucidated above, #consume will be a stroll in the park and requires no further explanation:

def consume
  c = lookahead
  self.next_p += 1
  c
end
Enter fullscreen mode Exit fullscreen mode

Whitespace Chars, Comments, and Newlines

Now, let's go back to #tokenize and inspect it part-by-part. First, let's look at the beginning of its body:

# inside Stoffle::Lexer#tokenize
self.lexeme_start_p = next_p
token = nil

c = consume

return if WHITESPACE.include?(c)
return ignore_comment_line if c == '#'
if c == "\n"
  self.line += 1
  tokens << token_from_one_char_lex(c) if tokens.last&.type != :"\n"

  return
end
Enter fullscreen mode Exit fullscreen mode

Every time we start the process of trying to emit the next token, we have to set @lexeme_start_p to the value being held by @next_p since this is, potentially, the pointer to the beginning of the next lexeme. We then consume the character we are about to analyze by calling our old and dear #consume method.

Next, we start by checking three uninteresting but relevant scenarios. If our character matches one of the chars we are classifying as whitespace, we don't generate a token and simply move on by returning. The same is true if we detect our character is a #, but in this case, we ignore all chars until the end of the line, not only the current one. If you want to check out the implementation of #ignore_comment_line, please take a look at Stoffle's GitHub repo.

The last lines of this segment are quite significant. As you might have realized, Stoffle does not use any explicit terminator, such as the ; we see in many popular programming languages. In Ruby, a newline is used to terminate a line of code.

Here, we could go for more sophisticated rules, but since Stoffle is an educational project, it's better to keep it as simple as possible. In Stoffle, every newline at the end of a non-empty line is considered meaningful. It is trivial and easy to implement rule of thumb, but it has some consequences we must be aware of. The following snippet, which would be valid in Ruby, is not valid Stoffle code:

  two = 1 +
    1
Enter fullscreen mode Exit fullscreen mode

If we try to later on run a program that is splitting a line in this fashion, we'll get an error. I think this is something we can live with and a reasonable compromise to keep things as uncomplicated as possible. So, if we find out the current char is a \n, we have to do two things:

  • Advance a counter that keeps track of the current line we are processing (this will be used for precise error reporting, but it's out of the scope of this post);
  • We only emit a newline if the last token is not also a newline token. A newline token marks the end of a significant line of code, so we don't want to emit tokens for blank lines that are only being used to organize the source code.

One and Two Character Lexemes, Strings, Numbers, and Identifiers

Let's continue delving into #tokenize. Now, the main part of the method:

# inside Stoffle::Lexer#tokenize
token =
  if ONE_CHAR_LEX.include?(c)
    token_from_one_char_lex(c)
  elsif ONE_OR_TWO_CHAR_LEX.include?(c)
    token_from_one_or_two_char_lex(c)
  elsif c == '"'
    string
  elsif digit?(c)
    number
  elsif alpha_numeric?(c)
    identifier
  end
Enter fullscreen mode Exit fullscreen mode

The first two branches of this conditional expression are straightforward. We first determine whether our character matches one of the multiple lexemes of length 1. If so, we emit the appropriate token. In case that doesn't happen, we test against a special group of lexemes containing 1 or 2 characters. This group contains lexemes such as "=" and "==".

The strategy here is not difficult; if the current character matches a lexeme of size 1 belonging to this group, we just make sure we're in fact dealing with the lexeme in question by taking a look at the next character in the source. The idea is to always "bite" as many chars as we can! This appproach guarantees, for example, that == results in the equality token being produced (and not the assignment token twice).

It's time now for us to inspect the #string method, which is called when we know we are dealing with a string:

# inside Stoffle::Lexer
def string
  while lookahead != '"' && source_uncompleted?
    self.line += 1 if lookahead == "\n"
    consume
  end
  raise 'Unterminated string error.' if source_completed?

  consume # consuming the closing '"'.
  lexeme = source[(lexeme_start_p)..(next_p - 1)]
  # the actual value of the string is the content between the double quotes.
  literal = source[(lexeme_start_p + 1)..(next_p - 2)]

  Token.new(:string, lexeme, literal, current_location)
end
Enter fullscreen mode Exit fullscreen mode

Here, we use the #lookahead method we described earlier (combined with the also familiar #consume) to keep consuming characters until the next char to be processed is the closing ". If we can't find the closing " before the end of the source, we raise an error. Causing our lexer to fail in such a manner is definitely not the best approach, but it will do for now.

Finally, we use our pointers to extract the lexeme and the actual value of the string. We include double quotes for the lexeme but not for the value.

In case our character being processed at #tokenize failed the string test, we next determine whether it is a number. If so, here's how we handle it and generate the associated token:

# inside Stoffle::Lexer
def number
  consume_digits

  # Look for a fractional part.
  if lookahead == '.' && digit?(lookahead(2))
    consume # consuming the '.' character.
    consume_digits
  end

  lexeme = source[lexeme_start_p..(next_p - 1)]
  Token.new(:number, lexeme, lexeme.to_f, current_location)
end
Enter fullscreen mode Exit fullscreen mode

In this method, we continue moving through the source as long as the next characters are digits; this is basically what #consume_digits does. Note that we do take numbers with a fractional part into consideration. To reduce complexity, all numbers in Stoffle are floats, so we make sure to leverage Ruby's #to_f method to make the process of generating an actual float value for the token a piece of cake.

For characters that had no match up to this point, we perform a final verification to see if we are dealing with an alphanumeric. If positive, the method #identifier is the one used to handle the task:

# inside Stoffle::Lexer
def identifier
  while alpha_numeric?(lookahead)
    consume
  end

  identifier = source[lexeme_start_p..(next_p - 1)]
  type =
    if KEYWORD.include?(identifier)
      identifier.to_sym
    else
      :identifier
    end

  Token.new(type, identifier, nil, current_location)
end
Enter fullscreen mode Exit fullscreen mode

In this case, the plan is to keep consuming the next characters until we find one that is not a number, letter, or underscore, which is what #alpha_numeric? is verifying. For identifiers, there are two scenarios. We may be dealing with a language keyword (such as fn or while) or an identifier defined by the programmer (e.g., a variable name). Since keywords can't be used to name a variable or a function, we first try to match the lexeme against one of Stoffle's keywords. Not being able to do so, we are sure that the right thing to do is generate a generic :identifier token.

Finally, here's the last bit of #tokenize:

# inside Stoffle::Lexer#tokenize
if token
  tokens << token
else
  raise('Unknown character.')
end
Enter fullscreen mode Exit fullscreen mode

We push the most recently generated token to @tokens. However, if no token was emitted, we can be sure at this point that we hit a character unsupported by Stoffle, and we should then ring the alarm.

Your Next Steps

Hooray, we have just gone through the principal parts of our lexer! Since it would be exhausting to explore all the minor details, I urge you to scan the complete implementation at GitHub. If you like, take a look at the test suite included, as the examples provided there may help illuminate some detail of the implementation that you may have found mysterious at first glance.

Other Usages of a Lexer

Now that you understand the basics on how to implement a lexer, you may be thinking of where else lexers are applied. A curious usage is in the field of natural language processing, in which a lexer can be used to transform raw input, either text or sound, into a more structured list of tokens to be leveraged by higher layers in a processing pipeline. Expect, however, these lexers to be significantly more complex than the one we have just implemented.

Parting Thoughts

Our little language is starting to come to life! We now have a working lexer that is able to produce the valuable tokens our parser will need to do its job. The star of the next post in the series will be the parser itself. We'll see you soon in the next step of bringing Stoffle into being!

💖 💪 🙅 🚩
honeybadger_staff
Honeybadger Staff

Posted on August 4, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related