Compiler Series Part 5: Lexical analysis
James
Posted on August 24, 2019
From this post onwards, I will explain sections of my compiler and the thought that went into designing them. I will link to the relevant files in my compiler's repository, which can be found here.
Defining basic tokens
The smallest "atoms" of my language (aside from individual characters) are called tokens. The scanner's job is to recognise valid tokens in the input text and reject invalid ones. The following BNF describes some basic tokens.
<digit>      ::= [0-9]
<number>     ::= <digit>+ '.' <digit>+
<name>       ::= [a-zA-Z][a-zA-Z0-9]*
<string_lit> ::= '"' <any character except '"'>* '"'
<op>         ::= ('+' | '-' | '*' | '/' | '^' | '%' | '=' | '>' | '<' | '!' | '|' | '&')+
<whitespace> ::= (' ' | '\n' | '\t')+
<comment>    ::= '#' <rest of line>
Names will make up identifiers and keywords, while string literals hold quoted text. Operators can be any combination of the symbols above; the scanner then recognises certain combinations, and if the input characters do not form a recognised token, it throws an error. The scanner also discards whitespace and comments.
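The character classes in the BNF translate directly into small predicate functions. The sketch below is illustrative: the helper names (isOpChar and friends) are my own, not the compiler's actual API.

```cpp
#include <cctype>
#include <string>

// Is this one of the operator symbols from the <op> rule?
bool isOpChar(char c) {
    return std::string("+-*/^%=><!|&").find(c) != std::string::npos;
}

// <name> starts with a letter...
bool isNameStart(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

// ...and continues with letters or digits.
bool isNameChar(char c)  { return std::isalnum(static_cast<unsigned char>(c)) != 0; }
```

Casting to `unsigned char` before calling the `<cctype>` functions avoids undefined behaviour on platforms where `char` is signed.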
A lexeme is a sequence of characters which matches one of the token patterns. My complete list of token types looks like this:
Token Types {
PLUS, MINUS, STAR, SLASH, HAT, MOD, INC, DEC,
EQ, LESS, GREATER, LEQ, GREQ, NEQ,
AND, OR, NOT,
ASSIGN, CONDITIONAL, COLON,
LEFTPAREN, RIGHTPAREN, LEFTSQ, RIGHTSQ, COMMA,
NUMBER, STRING, IDENTIFIER, BOOL,
BEGIN, IF, ENDIF, ELSE, THEN,
FOR, IN, ENDFOR,
DEFINE, ENDDEF, EXT,
NEWLINE, END
}
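In C++ this list maps naturally onto an enumeration. The declaration below is a sketch of how it might look; pairing the type with its lexeme in a Token struct is my assumption, not necessarily the compiler's actual layout.

```cpp
#include <string>

// One enumerator per token type from the list above.
enum class TokenType {
    PLUS, MINUS, STAR, SLASH, HAT, MOD, INC, DEC,
    EQ, LESS, GREATER, LEQ, GREQ, NEQ,
    AND, OR, NOT,
    ASSIGN, CONDITIONAL, COLON,
    LEFTPAREN, RIGHTPAREN, LEFTSQ, RIGHTSQ, COMMA,
    NUMBER, STRING, IDENTIFIER, BOOL,
    BEGIN, IF, ENDIF, ELSE, THEN,
    FOR, IN, ENDFOR,
    DEFINE, ENDDEF, EXT,
    NEWLINE, END
};

// A token carries its type plus the matched characters (the lexeme).
struct Token {
    TokenType type;
    std::string lexeme;
};
```

Using `enum class` rather than a plain `enum` keeps names like IF and FOR scoped, so they cannot collide with anything else in the program.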
Some language theory1
The scanner recognises a language which is type 3 on the Chomsky Hierarchy2. A type 3 language can be recognised/parsed by regular expressions and finite state machines. I could implement a literal state machine for the scanner, but I will not, as this approach is tedious to write. Similarly, I could have written a complex regex to split the input into tokens, but this obscures a lot of the logic for me. I started this project to learn, not to write a compiler in as few lines as possible.
Later on, the parser will interpret a type 2 language using a pushdown automaton (an automaton which uses a stack to hold information; in my implementation, this stack is the C++ function call stack). The difference between the two language classes is the main reason to have a separate lexer and parser3.
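To preview how the call stack stands in for the automaton's stack, here is a minimal sketch (not from the compiler) that checks balanced parentheses, something no regular expression can do. Each nested '(' pushes a stack frame by recursing; the matching ')' pops it by returning.

```cpp
#include <string>

// Returns true if the parentheses seen from position i are balanced.
// The recursion depth mirrors the nesting depth: the C++ call stack
// is the pushdown automaton's stack.
bool matchParens(const std::string& s, size_t& i) {
    while (i < s.size()) {
        if (s[i] == '(') {
            ++i;                                  // "push": enter a nested group
            if (!matchParens(s, i)) return false;
            if (i >= s.size() || s[i] != ')') return false;
            ++i;                                  // "pop": close the group
        } else if (s[i] == ')') {
            return true;                          // let the caller consume ')'
        } else {
            ++i;                                  // ignore other characters
        }
    }
    return true;
}

bool balanced(const std::string& s) {
    size_t i = 0;
    return matchParens(s, i) && i == s.size();
}
```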
A tool called flex4 can generate scanner state machines in C; it is a modern implementation of the classic Unix lex. I decided that this tool would introduce too much overhead into my code.
Algorithm
The basic algorithm for the scanner is below. Functions such as getString() and getOp() correspond to the rules in the BNF above.
getNextTok =>
    nextChar()
    if next char is '#', skip to end of line, then nextChar()
    skip whitespace
    if next char is a digit: getNumber() and return a number token
    if next char is alphabetic: getName()
        if the string is a keyword, return a keyword token
        else return an identifier token
    if next char is an operator symbol: getOp()
        if the operator string is recognised, return an operator token
        else error
    if next char is '"': getString() and return a string token
    if next char is recognised punctuation: return a punctuation token
    otherwise: error
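The pseudocode above can be sketched as a C++ class. This is a simplified stand-in, not the compiler's real interface: the class layout, the string-typed tokens, and the tiny keyword set are all assumptions made to keep the example self-contained.

```cpp
#include <cctype>
#include <string>

struct Token { std::string type; std::string text; };

class Scanner {
public:
    explicit Scanner(std::string src) : src_(std::move(src)) {}

    // One step of getNextTok: dispatch on the next character's class.
    Token getNextTok() {
        skipCommentsAndWhitespace();
        if (pos_ >= src_.size()) return {"END", ""};
        char c = src_[pos_];
        if (std::isdigit(static_cast<unsigned char>(c))) return getNumber();
        if (std::isalpha(static_cast<unsigned char>(c))) return getName();
        if (isOpChar(c)) return getOp();
        return {"ERROR", std::string(1, c)};   // unrecognised input
    }

private:
    // Comments run from '#' to end of line; whitespace is discarded.
    void skipCommentsAndWhitespace() {
        while (pos_ < src_.size()) {
            char c = src_[pos_];
            if (c == '#') { while (pos_ < src_.size() && src_[pos_] != '\n') ++pos_; }
            else if (c == ' ' || c == '\t' || c == '\n') ++pos_;
            else break;
        }
    }
    Token getNumber() {
        size_t start = pos_;
        while (pos_ < src_.size() &&
               (std::isdigit(static_cast<unsigned char>(src_[pos_])) || src_[pos_] == '.'))
            ++pos_;
        return {"NUMBER", src_.substr(start, pos_ - start)};
    }
    Token getName() {
        size_t start = pos_;
        while (pos_ < src_.size() && std::isalnum(static_cast<unsigned char>(src_[pos_])))
            ++pos_;
        std::string text = src_.substr(start, pos_ - start);
        // Placeholder keyword check; the real scanner has the full keyword list.
        bool kw = (text == "if" || text == "for" || text == "define");
        return {kw ? "KEYWORD" : "IDENTIFIER", text};
    }
    Token getOp() {
        size_t start = pos_;
        while (pos_ < src_.size() && isOpChar(src_[pos_])) ++pos_;
        return {"OP", src_.substr(start, pos_ - start)};
    }
    static bool isOpChar(char c) {
        return std::string("+-*/^%=><!|&").find(c) != std::string::npos;
    }
    std::string src_;
    size_t pos_ = 0;
};
```

Calling getNextTok() repeatedly on "x = 1 + 2 # comment" yields an identifier, an operator, a number, an operator, a number, and finally an END token; the comment never reaches the caller.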
Writing a scanner is not too difficult, and the skill comes in useful for many projects where you need to parse input of any kind.
Finally, here is the scanner from my compiler.
- Aho, A., Lam, M., Sethi, R. and Ullman, J. (2006). Compilers: Principles, Techniques, and Tools. 2nd ed. Pearson Education Inc. ↩
- Let's Build a Compiler, by Jack Crenshaw - Part 7: LEXICAL SCANNING ↩