# Writing a tokenizer
ndesmic
Posted on December 18, 2023
Several projects I'm working on require parsing. While there are go-to packages that do this work, part of what I'm trying to do is build tools from scratch. Parsing is probably one of the largest tasks I have yet to tackle directly. At the same time, I think there's a lack of small parsers that aren't full of esoteric tricks, so maybe there's value in it. Of course, parsing is really several pieces that each deserve their own look, so for this post we'll look at the first step: tokenizing (otherwise known as "lexing"). Hopefully we can make something generic enough that we can add support for other languages without too much trouble or manual work.
## Anatomy of parsing
"Parsing" I find tends to mean a lot of things related to taking text and making a computer program or structured data. Without getting to deep into how computers interpret computer programs, there are two general parts: the "front-end" which is going from text to the AST (abstract syntax tree), a data structure that represents the grammar of the program, and the "backend" which is going from the AST through layers of verification, optimization, and intermediate languages until something, somewhere can execute the program or emit something else.
The front-end has two parts: "tokenization," which takes the text and chunks it into semantic bits like keywords, literals, operators, etc., and "parsing," which takes the tokens and arranges them into the AST based on the parsing rules.
The image above has a lot of shortcuts for brevity (real ASTs have a lot more layers) but hopefully it gets the idea across. To start, and to reduce complexity, we'll just worry about the first part: taking text and forming it into a stream of higher-level tokens.
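To make that concrete, here's roughly what the pipeline looks like for a tiny snippet. The token names follow the tokenizer we'll build below; the AST shape is purely illustrative and doesn't follow any particular spec.

```js
// Source text:
//   const x = 1 + 2;

// Tokenizer output - a flat stream of tagged chunks (index properties omitted):
//   { type: "const" }, { type: "identifier", value: "x" }, { type: "=" },
//   { type: "number-literal", value: 1 }, { type: "+" },
//   { type: "number-literal", value: 2 }, { type: ";" }, { type: END }

// Parser output - a tree that encodes the grammar (heavily abridged):
const ast = {
    type: "VariableDeclaration",
    kind: "const",
    id: "x",
    init: {
        type: "BinaryExpression",
        operator: "+",
        left: { type: "NumberLiteral", value: 1 },
        right: { type: "NumberLiteral", value: 2 },
    },
};
```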
## Tokenizer Class
We'll start with a tokenizer class. It's actually pretty simple: it takes some configuration about which tokens to look for in the constructor, and it has a method `tokenize` that returns an iterator that sends back the tokens. This is done so that we don't need a lot of big buffers; we can essentially stream tokens from the input as the consumer needs more. The tokens to look for will be a list of objects `{ matcher: RegExp, type: string }`; each will try to match, and if it does, a new token of `type` is added to the output. We'll also add an optional property `valueExtractor` that lets us extract a value to reference later. This will be mostly useful for things like numbers, strings, and comments.
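As a concrete example, a single entry in that configuration could look something like this (the number rule here is just a placeholder):

```js
const numberTokenType = {
    matcher: /-?[0-9]+\.?[0-9]*/,               // RegExp tried at the current position
    type: "number-literal",                     // tag put on the emitted token
    valueExtractor: match => parseFloat(match), // optional: derive a value from the raw text
};
```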
## Regular Expressions
The way we can find text pieces is via `RegExp`s. The naïve assumption would be that we can just use these for everything, and that simply isn't the case. The reason comes from the Chomsky hierarchy of languages. Regular expressions are called that because they can parse a regular language, which you can generally think of as anything non-recursive (in terms of nesting). Beyond that, we need a recursive processor to parse context-free languages, which is where pretty much all programming languages lie. So regular expressions can help us out for the token part, but you can't parse HTML with a regex; it will never work completely.
`RegExp`s can be kind of difficult to deal with. When we parse, we want to do so at least somewhat efficiently. What this means is we don't want to make string clones, e.g. `text.slice(index)`, just to scan from where we've read so far with a regex (see the comments below for a discussion on how this might be incorrect intuition). So we need the regex to start at the index and leave the string alone as much as possible.
We can do this by setting `lastIndex` on the `RegExp` instance. It's a weird API, and it means we need to clone the regex for each test (or reset it). For this to work, we also need to use the sticky flag `y`. Since `RegExp` flags are read-only, and because it's not a great API to force the user to pass in the sticky flag on every `RegExp` or expect the tokenizer to fail, we'll go with the clone method.
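Here's a quick throwaway example of what the sticky flag buys us (not part of the tokenizer itself): with `y`, `exec` only matches exactly at `lastIndex` instead of scanning forward, which is the behavior we want.

```js
const source = `let x = 42;`;
const digits = /[0-9]+/;

// Clone the pattern with the sticky flag and anchor it at an index.
const sticky = new RegExp(digits.source, "y");

sticky.lastIndex = 8;                  // "42" starts at index 8
console.log(sticky.exec(source)?.[0]); // "42" - matched exactly at lastIndex

sticky.lastIndex = 4;                  // "x" is at index 4
console.log(sticky.exec(source));      // null - it won't scan ahead to find "42"
```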
```js
//tokenizer.js
export const END = Symbol("END");

export class Tokenizer {
    #tokenTypes;
    constructor(tokenTypes) {
        this.#tokenTypes = tokenTypes;
    }
    *tokenize(text) {
        let index = 0;
        while (index < text.length) {
            let hasMatch = false;
            for (const { matcher, type, valueExtractor } of this.#tokenTypes) {
                const currentMatcher = new RegExp(matcher.source, "y");
                currentMatcher.lastIndex = index;
                const matched = currentMatcher.exec(text);
                if (matched !== null) {
                    index += matched[0].length;
                    if (type != null) {
                        const token = { type, index };
                        if (valueExtractor) {
                            token.value = valueExtractor(matched[0]);
                        }
                        yield token;
                    }
                    hasMatch = true;
                }
            }
            if (!hasMatch) {
                throw new Error(`Unexpected token at index ${index}`);
            }
        }
        yield { type: END };
    }
}
```
Here's the basic tokenizer. You pass in a list of tokens to match, with the `matcher` being a `RegExp`. We start at index 0 and then try to match each token. We do so by cloning the `RegExp`, adding the `y` flag, and then running `exec`. If we get a match we advance the index by that much and output the token; if we don't, we try another token. If none matched (we need the boolean `hasMatch` to break out of the double loop because nobody understands labeled loops) then we throw a syntax error (now you know why compilers always use this particular wording). Note that this is also pretty naïve. We could do error recovery here by advancing the index by one and then trying again. That might be more useful if you were making a code editor, because code is often incorrect while typing and throwing an error per keystroke is pretty harsh. Anyway, we repeat this process until there is no more input.
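If you wanted that more forgiving behavior, the no-match branch could emit an error token and skip a character instead of throwing. A minimal sketch (the "error" type name is made up here, not something the tokenizer above defines):

```js
if (!hasMatch) {
    // Recovery variant: record the offending character and keep tokenizing.
    yield { type: "error", index, value: text[index] };
    index += 1;
}
```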
And yes, it is a generator; that `*` in front of `tokenize` is important.
## Lookahead
Many parsers/tokenizers have a concept of lookahead. This is how many characters you need to be able to see ahead before you can definitively decide what you have. This is a performance thing, so you don't waste a lot of time trying to parse something and then find out it's another thing. Imagine if we decided to make a language where strings were just plain text but ended with a character like `$`. Whether or not that's a good idea from a dev perspective is debatable, but this would require infinite lookahead because you could not tell whether something was an identifier or a string until you reached the end. For a pathological case you could think of a series of 1 million characters: you'd need to parse all of them until the end, and if you were wrong you'd have to do it again. Not great. Most languages enforce a cap on lookahead; for example, CSS uses a lookahead of 3 characters. In our implementation we have a lookahead of 1; note the `(?=\r?\n|$)` in the comment matcher (defined below), which is looking ahead one character.
One thing to note is the special `END` symbol. It's very common to add a token just for this so that we know we're at the end. We can make this token universal by exporting a symbol, and we'll always output it when we're done.
Another thing is that we allow token types to be `null`, which means they are excluded from the final stream. This is useful for whitespace, which is not useful when actually interpreting the input but needs to be removed in a consistent way.
To use the tokenizer we pass in some token definitions:
```js
//javascript-tokenizer.js (unfinished obviously!)
import { Tokenizer } from "../src/tokenizer.js";

export const javascriptTokenizer = new Tokenizer([
    { matcher: /[ \t]+/, type: null },
    { matcher: /\r?\n/, type: "line-break" },
    { matcher: /{/, type: "{" },
    { matcher: /}/, type: "}" },
    { matcher: /"[^"\r\n]+"/, type: "string-literal", valueExtractor: match => match.slice(1, -1) },
]);
```
```js
//javascript-tokenizer.test.js
// import paths depend on your project layout
import { assertEquals } from "https://deno.land/std/testing/asserts.ts";
import { END } from "../src/tokenizer.js";
import { javascriptTokenizer } from "./javascript-tokenizer.js";

Deno.test("javascriptTokenizer can parse space", () => {
    const tokens = [...javascriptTokenizer.tokenize(` `)];
    assertEquals(tokens, [{ type: END }]);
});
Deno.test("javascriptTokenizer can parse left curly brace", () => {
    const tokens = [...javascriptTokenizer.tokenize(`{`)];
    assertEquals(tokens, [{ type: "{", index: 1 }, { type: END }]);
});
Deno.test("javascriptTokenizer can parse strings (double quote)", () => {
    const tokens = [...javascriptTokenizer.tokenize(`"foo"`)];
    assertEquals(tokens, [
        { type: "string-literal", value: `foo`, index: 5 },
        { type: END },
    ]);
});
```
## Javascript tokens
So to get these tokens from some text input, we need to know which tokens to look for. To do so we can look at the 2024 (as of the time of writing) version of the ECMAScript specification (the spec for JavaScript is called ECMAScript for licensing reasons, or so I'm told; Oracle actually owns a trademark on "JavaScript"). We can find the spec here: https://tc39.es/ecma262/. We're not going to create a fully compliant implementation, at least not now, as there are lots of edge cases we probably don't need to consider at this point; they would just inflate the code, and we can add them as they come up.
The rest of this is mostly making tests and passing them for each type of token.
## Notes
- Whitespace `/[ \t]+/` - this isn't too hard, just tabs and spaces (ignoring less common Unicode stuff that would be spec-compliant). Just make sure to coalesce them, as multiple whitespace tokens are unnecessary. `\s` is too general because it picks up line-breaks, which we might want to differentiate.
- Line breaks - these can match `\r\n` or `\n` depending on the platform.
- Line comments `/\/\/(.*?)(?=\r?\n|$)/` - This takes everything after `//` up to either the end of the file or the line-break (but without taking the line-break token itself). You might wonder, like I did, about syntax in the comments like JSDoc or ts-ignore type of stuff. I think in this case we'll have a secondary tokenizer/parser that will go over the comment value if we ever need it, so we'll also keep the value around.
- Identifiers `/[a-zA-Z$_][a-zA-Z0-9$_]*/` - These are pretty much all non-quoted text that aren't keywords. They represent names of variables, methods, etc. These should never start with numbers.
- String literals `/"[^"\n]+"/` - This can be a little tricky as there are 3 types of strings. The example is for double quotes but you can just replace the double quotes with single quotes for that type (if you try to use a single `RegExp`, make sure the quote types match). Note that this won't account for `\` escapes and line continuations because those get complex. For backticks it's a bit simpler, ``/`[^`]+`/``, as these don't need to stop at line breaks. We'll also skip template literals due to complexity as well.
- Number literals `/-?[0-9]+\.?[0-9]*(?![a-zA-Z$_])/` - Identifiers can't start with numbers, which is how we tell the difference. Numbers can have an optional negative sign and an optional decimal part, but the character following the literal cannot be the start of an identifier (hence the negative lookahead). We'll ignore bigints, hex/oct/bin literals, underscores and scientific notation.
- Syntax symbols `/;/` - There are a number of little syntax symbols to match, like semicolons. You'll need to make sure you are properly escaping the regex, e.g. `/\./` for "dot". You might also need to know the difference between an equals sign and a double or triple equals sign. We could use a negative lookahead in the regex but a more efficient way is to simply put the longer rules first.
Here's a very abridged set of tokens for javascript:
```js
//javascript-tokenizer.js
import { Tokenizer } from "../src/tokenizer.js";

export const javascriptTokenizer = new Tokenizer([
    { matcher: /[ \t]+/, type: null },
    { matcher: /\r?\n/, type: "line-break" },
    { matcher: /\/\/(.*?)(?=\r?\n|$)/, type: "comment", valueExtractor: match => match.slice(2) },
    { matcher: /"[^"\r\n]+"/, type: "string-literal", valueExtractor: match => match.slice(1, -1) },
    { matcher: /'[^'\r\n]+'/, type: "string-literal", valueExtractor: match => match.slice(1, -1) },
    { matcher: /`[^`]+`/, type: "string-literal", valueExtractor: match => match.slice(1, -1) },
    { matcher: /-?[0-9]+\.?[0-9]*(?![a-zA-Z$_])/, type: "number-literal", valueExtractor: match => parseFloat(match) },
    { matcher: /{/, type: "{" },
    { matcher: /}/, type: "}" },
    { matcher: /\[/, type: "[" },
    { matcher: /\]/, type: "]" },
    { matcher: /\(/, type: "(" },
    { matcher: /\)/, type: ")" },
    { matcher: /;/, type: ";" },
    { matcher: /:/, type: ":" },
    { matcher: /,/, type: "," },
    { matcher: /\.\.\./, type: "..." },
    { matcher: /\./, type: "." },
    { matcher: /\*\*/, type: "**" },
    { matcher: /\*/, type: "*" },
    { matcher: /===/, type: "===" },
    { matcher: /==/, type: "==" },
    { matcher: /=>/, type: "=>" },
    { matcher: /=/, type: "=" },
    { matcher: /!==/, type: "!==" },
    { matcher: /!=/, type: "!=" },
    { matcher: /&&/, type: "&&" },
    { matcher: /&/, type: "&" },
    { matcher: /\^/, type: "^" },
    { matcher: /~/, type: "~" },
    { matcher: /!/, type: "!" },
    { matcher: /\|\|/, type: "||" },
    { matcher: /\|/, type: "|" },
    { matcher: /\+\+/, type: "++" },
    { matcher: /\+/, type: "+" },
    { matcher: /\-\-/, type: "--" },
    { matcher: /\-/, type: "-" },
    { matcher: /\\/, type: "\\" },
    { matcher: /%/, type: "%" },
    { matcher: /\?\?/, type: "??" },
    { matcher: /\?/, type: "?" },
    { matcher: />=/, type: ">=" },
    { matcher: /<=/, type: "<=" },
    { matcher: />>/, type: ">>" },
    { matcher: />/, type: ">" },
    { matcher: /<</, type: "<<" },
    { matcher: /</, type: "<" },
    { matcher: /null/, type: "null" },
    { matcher: /true/, type: "true", valueExtractor: x => x },
    { matcher: /false/, type: "false", valueExtractor: x => x },
    { matcher: /import/, type: "import" },
    { matcher: /export/, type: "export" },
    { matcher: /from/, type: "from" },
    { matcher: /as/, type: "as" },
    { matcher: /for/, type: "for" },
    { matcher: /while/, type: "while" },
    { matcher: /in/, type: "in" },
    { matcher: /of/, type: "of" },
    { matcher: /break/, type: "break" },
    { matcher: /continue/, type: "continue" },
    { matcher: /do/, type: "do" },
    { matcher: /if/, type: "if" },
    { matcher: /else/, type: "else" },
    { matcher: /switch/, type: "switch" },
    { matcher: /case/, type: "case" },
    { matcher: /default/, type: "default" },
    { matcher: /function/, type: "function" },
    { matcher: /return/, type: "return" },
    { matcher: /yield/, type: "yield" },
    { matcher: /await/, type: "await" },
    { matcher: /try/, type: "try" },
    { matcher: /catch/, type: "catch" },
    { matcher: /finally/, type: "finally" },
    { matcher: /throw/, type: "throw" },
    { matcher: /new/, type: "new" },
    { matcher: /class/, type: "class" },
    { matcher: /super/, type: "super" },
    { matcher: /let/, type: "let" },
    { matcher: /const/, type: "const" },
    { matcher: /this/, type: "this" },
    { matcher: /[a-zA-Z$_][a-zA-Z0-9$_]*/, type: "identifier", valueExtractor: x => x },
]);
```
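As a quick sanity check of the "longer rules first" ordering, a test along these lines (using the same imports as the test file above) should pass:

```js
Deno.test("javascriptTokenizer prefers longer operators", () => {
    const tokens = [...javascriptTokenizer.tokenize(`a === b`)];
    assertEquals(tokens, [
        { type: "identifier", value: "a", index: 1 },
        { type: "===", index: 5 },
        { type: "identifier", value: "b", index: 7 },
        { type: END },
    ]);
});
```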
## Optimization
Note that there are some optimizations that could be made at the expense of readability. In real life we might do things like making the type an integer, or, instead of a string copy of the value, giving back the offset and length into the original string to keep tokens small and copy-less. I think some high-performance tokenizers actually operate on bytes to avoid the string-decoding overhead. What I don't want to do is hand-write the tokenizer per language, because I want this to be a building block to build up parsers easily, so I didn't look into many of these.
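For instance, a copy-less token could carry just an offset and a length, and later stages would slice the original source only when they actually need the characters. A sketch of that idea (not what the class above does):

```js
// e.g. yield { type: "string-literal", start: 10, length: 5 } instead of a value string
function tokenText(source, token) {
    return source.slice(token.start, token.start + token.length);
}
```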
## Conclusion
Building the tokenizer part isn't too hard, but you do need a lot of tests to make sure what you are building works. Most actual languages like JavaScript have very complex rules, but you can do useful things with less. It's also probably a good idea to stick to the spec's names; I'm playing a bit fast and loose, and that might come back to bite me when I need to use the exact wording the spec uses to ensure compliance.
## Code
https://github.com/ndesmic/jsparse/tree/v0.1 (Note that this is an earlier version that I improved for this post so it won't match up exactly)