LL(1) parsing in Scala

David Berry

Posted on July 13, 2023

From Wikipedia, an LL parser (Left-to-right, leftmost derivation) is a top-down parser for a restricted context-free language. An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence; an LL(1) parser therefore has 1 token of lookahead. The lookahead lets you make decisions based on upcoming tokens without removing them from the input. Many modern parsers are top-down parsers that take a source file, turn it into a stream of tokens, and parse those tokens into an abstract syntax tree or some other data structure.

Scala has a Source object that allows you to create an Iterator[Char] from files or strings. We will use this as the basis of the input for our parser. Unfortunately, iterators do not allow you to peek or look ahead: once you call next() you have consumed the next Char. What we need is an iterator that lets you peek at the next item without consuming it. Below is a PeekableIterator class that wraps an iterator and provides peek and poll methods.
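For example, a minimal sketch of creating the input; the file name here is hypothetical:

import scala.io.Source

// Source extends Iterator[Char], whether built from a string or a file
val fromString: Iterator[Char] = Source.fromString("The dog ran.")
val fromFile: Iterator[Char] = Source.fromFile("input.txt") // hypothetical file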

import scala.util.Try

/**
 * A peekable iterator that allows the caller to peek at the next item without
 * moving the current position.
 *
 * @param it an Iterator
 * @tparam T the type of value in the iterator
 */
class PeekableIterator[T](private val it: Iterator[T]) extends Peekable[T]:
  // eagerly load the first token; a Failure marks an exhausted iterator
  private var nextToken: Try[T] = Try(it.next())
  def peek: Try[T] = nextToken
  def poll: Try[T] =
    val currentToken = nextToken
    nextToken = Try(it.next())
    currentToken
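The class extends a Peekable trait that the post does not show. A minimal definition consistent with how it is used here would be:

import scala.util.Try

// assumed definition: a source of items whose next item can be inspected without consuming it
trait Peekable[T]:
  def peek: Try[T] // look at the next item without advancing
  def poll: Try[T] // return the next item and advance past it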

PeekableIterator takes an iterator parameter and immediately calls next() to load the first token. nextToken is a Try[T] so that it captures either the successful token or the failure. peek returns nextToken but does not call next() on the iterator. poll, on the other hand, returns nextToken and then calls next() to advance to the following token.
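A quick sketch of that behavior:

val letters = new PeekableIterator("ab".iterator)
letters.peek // Success(a) -- does not consume
letters.peek // Success(a) -- still the same element
letters.poll // Success(a) -- consumes and advances
letters.peek // Success(b)
letters.poll // Success(b)
letters.poll // Failure(NoSuchElementException) -- input exhausted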

Now that we can look ahead, let's define the goal of the parser. The parser should parse English text and return a list of sentences, where each sentence is a list of words. The return type is Vector[Vector[String]], as Vector has better append performance than List. So how do we go from a Peekable[Char] to a Vector[Vector[String]]?
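For example, the input "The dog ran. The cat sat!" should produce:

Vector(Vector("The", "dog", "ran"), Vector("The", "cat", "sat"))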

Most parsers start with a lexical analysis (tokenizing) step followed by a parsing step. Our parsing goal is simple and does not strictly require an LL(1) parser, but we use LL(1) parsing techniques to give readers an example they can adapt for more complicated parsing. We will therefore process the source file with both a tokenizing step and a parsing step. The tokenizing step is just another parser, usually referred to as a lexer. A first implementation could make step 1 a lexer from Char to Vector[String] and step 2 a parser from Vector[String] to Vector[Vector[String]], as sketched below.
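A hypothetical sketch of that eager first design; note that mkString reads the entire source before a single token is produced:

// step 1 as an eager lexer: Char => Vector[String], materializing every token up front
// (simplified to whitespace-only splitting for illustration)
def tokenizeAll(chars: Iterator[Char]): Vector[String] =
  chars.mkString.split("\\s+").filter(_.nonEmpty).toVector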

The shortcoming of this design is that the entire file is turned into tokens before any token is consumed by the parser. If you try to parse an extremely large file, you could run out of memory building the Vector[String] before parsing a single token. It would be more efficient, in both time and space, to transform the source file into tokens in a stream-like way, where tokens are created lazily and consume the source input only as needed.

Three ways to solve this problem were considered: LazyList, Akka Streams, and chained iterators. Adding Akka Streams seemed like overkill for a simple parser, though it may be the subject of a future article. LazyList's immutability and memoization made it complicated to create and consume one token at a time. That left chaining iterators together.

The goal of iterator chaining is to allow one iterator to use another iterator to lazily create its next item. To demonstrate this I will chain an Iterator[Char] to an Iterator[String]. The chaining is done as an apply on the PeekableIterator companion object:

/**
 * Companion object for creating PeekableIterators from an iterator or by chaining iterators.
 */
object PeekableIterator:
  def apply[T](it: Iterator[T]): PeekableIterator[T] = new PeekableIterator[T](it)

  def apply[TInput, TOutput](it1: Iterator[TInput], getNext: PeekableIterator[TInput] => TOutput): PeekableIterator[TOutput] =
    val input = PeekableIterator[TInput](it1)
    PeekableIterator[TOutput](new Iterator[TOutput] {
      // there is another output item as long as the underlying input has not failed
      override def hasNext: Boolean = input.peek.isSuccess
      // each call lazily builds one output item from the wrapped input
      override def next(): TOutput = getNext(input)
    })

The second apply method takes an Iterator and a getNext function, which are used to dynamically create an iterator that chains, or wraps, the first iterator. Its use is shown in the creation of the lexer for the parser:

    val lexer = PeekableIterator[Char, String](source, lazyLexer)

lazyLexer is the method used as the getNext for the dynamically created iterator. It takes a Peekable[Char] as its parameter and returns a String:

  // method used to do the actual tokenizing; it builds and returns one token
  private def lazyLexer(tokenizer: Peekable[Char]): String =
    // skip whitespace and any punctuation that does not end a sentence
    while tokenizer.peek.isSuccess && (tokenizer.peek.get.isWhitespace || tokenizer.peek.get.isOtherPunctuation) do
      tokenizer.poll

    // represent the end of a sentence as an empty String
    if tokenizer.peek.isSuccess && tokenizer.peek.get.isEndOfSentence then
      tokenizer.poll
      return ""

    // if we are at the end of input, pass the failure on
    if tokenizer.peek.isFailure then
      throw tokenizer.peek.failed.get

    // build a word
    val sb = new StringBuilder()
    while tokenizer.peek.isSuccess && !(tokenizer.peek.get.isWhitespace || tokenizer.peek.get.isPunctuation) do
      sb.append(tokenizer.poll.get)

    // return the word
    sb.toString()
  end lazyLexer
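The lexer relies on Char extensions (isEndOfSentence, isOtherPunctuation, and isPunctuation) that the post does not show. A minimal definition consistent with their use might be; the exact character sets are my assumption:

extension (c: Char)
  // punctuation that terminates a sentence
  def isEndOfSentence: Boolean = c == '.' || c == '!' || c == '?'
  // punctuation that separates words but does not end a sentence
  def isOtherPunctuation: Boolean = ",;:'\"()".contains(c)
  // any punctuation the lexer cares about
  def isPunctuation: Boolean = isEndOfSentence || isOtherPunctuation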

Finally, the parser takes the words from the lexer and builds a Vector[Vector[String]] to represent a list of sentences. It is written to ignore empty sentences.

  def parse(source: Source): Vector[Vector[String]] =
    var sentences: Vector[Vector[String]] = Vector.empty
    var sentence: Vector[String] = Vector.empty
    // create the lexer by chaining the source to lazyLexer
    val lexer = PeekableIterator[Char, String](source, lazyLexer)

    while lexer.peek.isSuccess do
      if lexer.peek.get.isEmpty then
        // ignore empty sentences
        if sentence.nonEmpty then
          sentences = sentences :+ sentence
          sentence = Vector.empty
        end if
        lexer.poll
      else
        sentence = sentence :+ lexer.poll.get
      end if
    end while

    if sentence.nonEmpty then
      sentences = sentences :+ sentence
    end if

    sentences
  end parse

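Putting it all together; this assumes parse and lazyLexer live on a parser object, and the name Ll1Parser here is hypothetical:

import scala.io.Source

val sentences = Ll1Parser.parse(Source.fromString("Hello world. Scala is fun!"))
// sentences == Vector(Vector("Hello", "world"), Vector("Scala", "is", "fun"))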

The code from this article can be found here: ll1-parser
