Writing a toy ECMAScript parser
ndesmic
Posted on December 28, 2023
Last time we wrote a simple tokenizer covering a subset of the most used javascript tokens. This time we'll take it to the next step: making an Abstract Syntax Tree (AST). This works by taking a series of tokens and pattern matching them to create what the ECMAScript spec calls a ParseNode. These nodes have children, as you might expect of a tree. This step is just about making the tree. Once you have it you can do all sorts of things: interpretive execution, code re-writing, formatting, compilation into a lower representation like WASM, transpilation to a different high-level language, bundling, type-checking, you name it.
What goes in the parser vs tokenizer
You might wonder which things go in the tokenizer versus the parser, since you could technically do some things in both places. It's hard to give a firm rule, but generally, if you were making your own language, things that aren't recursive would go in the tokenizer. Another way to think of it: the tokenizer handles things that can be parsed with just a RegExp. Think string literals versus string templates. A string literal is pretty much an end point, but a template can recursively have things inside of it, so it can't be parsed with a pure RegExp. If you are using an existing language spec, it might help you decide.
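To make that concrete, here's a small hedged example: a template can contain an expression which itself contains another template, so no single RegExp can track the nesting.
// a string literal is flat; a single RegExp like /"[^"]*"/ can match it
const literal = "hello";
// a template can nest arbitrarily deep, which a RegExp can't track
const template = `outer ${`inner ${1 + 2}`} done`;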
Parsing basics
Backus-Naur form (BNF)
The most common way you'll see parsing rules expressed is Backus-Naur form (BNF). It looks something like this:
AdditionExpression : MultiplicationExpression + AdditionExpression
| MultiplicationExpression
MultiplicationExpression : Float * MultiplicationExpression
| Float
Float : (number)
The idea here is that you have the type of node you want to produce on the left of the colon (called a "production") and the right side is how you make it. The bar separates the alternative ways in which the production can be made. So here AdditionExpression is made of either a MultiplicationExpression plus another AdditionExpression, or just a single MultiplicationExpression (so a MultiplicationExpression is also an AdditionExpression). Note that this is recursive: the right side keeps expanding so that we can express things like 1*2 + 2*3 + 3*4. These middle-tier productions are called "nonterminals" because they do not terminate and do not represent a leaf node. Eventually we get to Float, which contains a number and is never made of anything else; we call this a "terminal". Eventually all rules must lead to terminals or we'll never end.
Note that it's important to specify addition expressions as being made up of multiplication expressions rather than the other way around. This is because of operator precedence. If we did it the other way, addition would group first when traversing the tree, which is not how normal precedence rules work. Typically you follow normal math order, with comparisons as the weakest grouping (as is the case with javascript), but it depends on the language spec. Just make sure you consider this if creating your own rules.
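As a rough illustration of why the layering matters, here's approximately the tree the grammar above produces for 1 + 2 * 3; the multiplication groups lower in the tree, so it binds tighter:
AdditionExpression
├── MultiplicationExpression
│   └── Float (1)
├── +
└── AdditionExpression
    └── MultiplicationExpression
        ├── Float (2)
        ├── *
        └── MultiplicationExpression
            └── Float (3)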
Parsing algorithms
One thing to note is where the recursion happens. The recursive term in the above example was placed on the right of the operator. That's because if it were placed on the left we'd have a problem: if AdditionExpression needs to parse an AdditionExpression at the start, we'll recurse infinitely. Luckily, left recursion can be changed to right recursion by moving terms around. The downside is that we introduce some performance issues, because now both branches of AdditionExpression start with the same nonterminal, MultiplicationExpression. This means we might need to test both branches to see if there was a match because we don't know which one we have. To improve on that we'd need to look ahead at least one token so that we don't fall down a branch of unnecessary work.
There are a ton of parsing algorithms you might see, like LL(1), LR(1), etc. The letters determine the type of parsing algorithm (which determines which types of grammar it can handle) and the number indicates how much lookahead is needed. Although, from what I understand, that's just a nerd stat; as long as it's a finite number it doesn't change much.
Since we're tentatively looking at the ECMAScript spec, and ECMAScript is left-recursive (look at a rule like AdditiveExpression) with finite lookahead, our chosen algorithm has to be able to handle that. Since I also want ease of implementation and small code size, I'm going with a recursive-descent parser with backtracking (that is, unbounded lookahead, which is not O(1)). There are plenty of other ways to do this that might be faster or easier to express (but have more complex algorithms). This means that where we see left recursion we'll need to refactor it into right recursion, which will take more rules in general. Consider the following left-recursive grammar:
AdditionExpression : AdditionExpression + MultiplicationExpression
| MultiplicationExpression
MultiplicationExpression : MultiplicationExpression * Float
| Float
Float : (number)
we can reformat it into
AdditionExpression : MultiplicationExpression AdditionSuffixExpression
AdditionSuffixExpression : + MultiplicationExpression AdditionSuffixExpression
| ε
MultiplicationExpression : Float MultiplicationSuffixExpression
MultiplicationSuffixExpression : * Float MultiplicationSuffixExpression
| ε
Float : (number)
Here the ε (epsilon) symbol matches nothing, no token at all, and it's how we end the recursion.
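As a rough derivation, here's how 1 + 2 + 3 expands under the rewritten grammar (skipping the MultiplicationExpression : Float MultiplicationSuffixExpression steps); each suffix either consumes another + term or ends in ε:
AdditionExpression
=> MultiplicationExpression AdditionSuffixExpression
=> 1 AdditionSuffixExpression
=> 1 + MultiplicationExpression AdditionSuffixExpression
=> 1 + 2 AdditionSuffixExpression
=> 1 + 2 + MultiplicationExpression AdditionSuffixExpression
=> 1 + 2 + 3 ε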
Parser class
export function nonterminal(type){
return { type: "nonterminal", matcher: type };
}
export class Parser {
#tokenizer;
#productions;
#goalProduction;
constructor(tokenizer, productions, goalProduction){
this.#tokenizer = tokenizer;
this.#productions = productions;
this.#goalProduction = goalProduction;
if (!this.#productions[goalProduction]) throw new Error(`Goal production ${goalProduction} did not exist.`);
}
parse(text){
const tokens = [...this.#tokenizer.tokenize(text)];
const production = this.produce(tokens, this.#goalProduction, this.#productions[this.#goalProduction]);
if(production.failed){
throw new Error(`Syntax Error`);
}
return {
type: production.type,
children: production.children
};
}
produce(tokens, productionType, productionRules, startIndex = 0){
for(const rule of productionRules){
const match = this.matchRule(tokens, rule, startIndex);
if(!match.failed){
return {
failed: false,
type: productionType,
children: match.matches,
length: match.length
};
}
}
return { failed: true };
}
matchRule(tokens, rule, startIndex){
const matches = [];
let offset = 0;
for (const part of rule) {
if (typeof (part) === "string") {
if (tokens[startIndex + offset].type === part) {
matches.push(tokens[startIndex + offset]);
offset++;
} else {
return { length: offset, failed: true };
}
} else if (part.type === "nonterminal") {
if (!this.#productions[part.matcher]){
throw new Error(`Nonterminal production ${part.matcher} did not exist.`);
}
const production = this.produce(
tokens,
part.matcher,
this.#productions[part.matcher],
startIndex + offset
);
if (!production.failed) {
matches.push({ type: production.type, children: production.children });
offset += production.length
} else {
return { length: offset, failed: true };
}
}
}
return {
failed: false,
length: offset,
matches
};
}
}
We create a parser using our parse rules, which look like the BNF forms above. I'm using a convention of TitleCase for nonterminals, which will help with clarity and debugging. We also pass in our tokenizer from last time. While I tried to make the tokenizer able to stream data, I'm not sure that's really possible at the parser level. Certainly a language like HTML can do it, but for something like javascript I'm not sure it makes sense, and it would complicate things, so for now we'll just convert the token stream back into an array, which dumps the whole file into memory again. Oh well.
export const parser = new Parser(
new Tokenizer([
{ matcher: / /, type: null },
{ matcher: /\d+/, type: "number" },
{ matcher: /\*/, type: "*" },
{ matcher: /\+/, type: "+" }
]),
{
"AdditiveExpression" : [
[nonterminal("MultiplicativeExpression"), "+", nonterminal("AdditiveExpression")],
[nonterminal("MultiplicativeExpression")]
],
"MultiplicativeExpression": [
[nonterminal("Float"), "*", nonterminal("MultiplicativeExpression")],
[nonterminal("Float")]
],
"Float": [
["number"]
]
}, "AdditionExpression")
Each rule is an array of variants for making the production. Each variant is an array that represents a sequence of tokens or productions. We use the nonterminal helper to create nonterminals, which get looked up by name. Strings will exactly match a token type, and arrays of strings will match one of a set of token types. In this way we can match the grammar pretty much verbatim to its BNF, which gives us a nice flexible tool to build from. Like a regex for grammars. The final parameter is the goal production. This is the highest-level production of our "top-down" parser, so it might represent the module in javascript or whatever entity represents a single file in other languages.
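As a quick sanity check, here's roughly how it's used (assuming the Tokenizer from last time is imported alongside the Parser):
const ast = parser.parse("1 + 2 * 3");
console.log(ast.type); // "AdditiveExpression"
// ast.children is roughly [MultiplicativeExpression node, "+" token, AdditiveExpression node]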
Reading through the javascript spec, I noticed a couple things:
- Most productions have an optional yield or await in front of them
- Some productions make assertions that a particular token is not in front of them
- Some productions take one of several tokens
We could just expand these into individual production rules but that's not very ergonomic. To fix that I added some helpers, optional and not, which create special matchers that will either optionally validate and add a token, or instantly fail a rule if a token is present. I also made it so that arrays work like an OR and validate one of the variants (we won't need anything special for AND because that's the default for items in the rule array).
Here's the updated matchRule.
// parser.js - at top
export function not(type){
return { type: "not", matcher: type }
}
export function optional(type){
return { type: "optional", matcher: type };
}
// parser.js - at matchRule
matchRule(tokens, rule, startIndex){
const matches = [];
let offset = 0;
for (const part of rule) {
if (typeof (part) === "string" || typeof(part) === "symbol") {
if (tokens[startIndex + offset].type === part) {
matches.push(tokens[startIndex + offset]);
offset++;
} else {
return { length: offset, failed: true };
}
} else if (Array.isArray(part)) {
if (part.includes(tokens[startIndex + offset].type)) {
matches.push(tokens[startIndex + offset]);
offset++
} else {
return { length: offset, failed: true };
}
} else if (part.type === "nonterminal") {
if (!this.#productions[part.matcher]){
throw new Error(`Nonterminal production ${part.matcher} did not exist.`);
}
const production = this.produce(
tokens,
part.matcher,
this.#productions[part.matcher],
startIndex + offset
);
if (!production.failed) {
matches.push({ type: production.type, children: production.children });
offset += production.length
} else {
return { length: offset, failed: true };
}
} else if (part.type === "not") {
if (tokens[startIndex + offset].type !== part.matcher) {
continue;
} else {
return { length: offset, failed: true };
}
} else if (part.type === "optional") {
if (tokens[startIndex + offset].type === part.matcher) {
matches.push(tokens[startIndex + offset]);
offset++;
}
}
}
return {
failed: false,
length: offset,
matches
};
}
Now you can make rules like this:
const parser = new Parser(
new Tokenizer([
{ matcher: /\ /, type: null },
{ matcher: /\n/, type: "\n" },
{ matcher: /foo/, type: "foo" },
{ matcher: /bar/, type: "bar" },
{ matcher: /qux/, type: "qux" }
]),
{
Foobar: [
[nonterminal("Foo"), not("\n"), nonterminal("Bar")],
[nonterminal("FooThenBar")]
],
FooThenBar: [
[nonterminal("Foo"), "\n", nonterminal("Bar")],
],
Foo: [
["foo"]
],
Bar: [
["bar"]
]
},
"Foobar"
);
Here we get a very different AST depending on whether there's a linebreak between the foo and bar. The rules are almost like a mini language unto themselves...
Left recursion -> Right recursion
What I didn't fix is left recursion. This still needs to be handled by the user (or we'd need to use a bottom-up parsing algorithm). I'm doing the former for now. Let's go back to our original left-recursive grammar:
AdditionExpression : AdditionExpression + MultiplicationExpression
| MultiplicationExpression
MultiplicationExpression : MultiplicationExpression * Float
| Float
Float : (number)
As discussed above we can refactor this to make it right-recursive:
{
"AdditiveExpression": [
[nonterminal("MultiplicativeExpression"), nonterminal("AdditiveSuffixExpression")],
],
"AdditiveSuffixExpression": [
["+", nonterminal("MultiplicativeExpression"), nonterminal("AdditiveSuffixExpression")],
[]
],
"MultiplicativeExpression": [
[nonterminal("Float"), nonterminal("MultiplicativeSuffixExpression")]
],
"MultiplicativeSuffixExpression": [
["*", nonterminal("Float"), nonterminal("MultiplicativeSuffixExpression")],
[]
],
"Float": [
["number"]
]
}
Here ε becomes an empty rule []. The problem is that now our AST looks really messy, and if we're trying to follow a left-recursive spec it's not going to work out right. What we need is a way to reformat these nodes.
The simplest way I found to reduce this problem is to introduce another rule type: virtual. This behaves like nonterminal except that instead of adding a new nonterminal node, it takes its children and adds them to the current node, meaning the virtual node will not appear in the AST. This isn't perfect though. Aside from having to know which nodes to mark as virtual, the AST is not equivalent, as you can get AdditiveExpressions with more than 2 values. This isn't necessarily a big problem for addition, since we know how to add 3 things (and order doesn't matter), but it could become a problem later if we don't reduce to binary expressions (my current thinking is this can be accounted for in evaluation, but we'll see..). For now, I think this is okay; we'll deal with any issues when they come up but keep it in mind.
//parser.js - in matchRule, previous nonterminal case
} else if (part.type === "nonterminal" || part.type === "virtual") {
if (!this.#productions[part.matcher]){
throw new Error(`Nonterminal production ${part.matcher} did not exist.`);
}
const production = this.produce(
tokens,
part.matcher,
this.#productions[part.matcher],
startIndex + offset
);
if (!production.failed) {
if(part.type === "virtual"){
matches.push(...production.children);
} else {
matches.push({ type: production.type, children: production.children });
}
offset += production.length
} else {
return { length: offset, failed: true };
}
The final optimization I want to make is to disregard any leaf-level tokens that don't have values. There's no need to pollute the AST with a bunch of syntax symbols: they don't do anything, and their meaning is already encoded in the matched nonterminal. So let's remove them.
//parser.js - in matchRule
matchRule(tokens, rule, startIndex){
const matches = [];
let offset = 0;
for (const part of rule) {
const currentToken = tokens[startIndex + offset];
if (typeof (part) === "string" || typeof(part) === "symbol") {
if (currentToken.type === part) {
if(currentToken.value){
matches.push(currentToken);
}
offset++;
} else {
return { length: offset, failed: true };
}
} else if (Array.isArray(part)) {
if (part.includes(currentToken.type)) {
if(currentToken.value){
matches.push(currentToken);
}
offset++
} else {
return { length: offset, failed: true };
}
} else if (part.type === "nonterminal" || part.type === "virtual") {
if (!this.#productions[part.matcher]){
throw new Error(`Nonterminal production ${part.matcher} did not exist.`);
}
const production = this.produce(
tokens,
part.matcher,
this.#productions[part.matcher],
startIndex + offset
);
if (!production.failed) {
if(part.type === "virtual"){
matches.push(...production.children);
} else {
matches.push({ type: production.type, children: production.children });
}
offset += production.length
} else {
return { length: offset, failed: true };
}
} else if (part.type === "not") {
if (currentToken.type !== part.matcher) {
continue;
} else {
return { length: offset, failed: true };
}
} else if (part.type === "optional") {
if (currentToken.type === part.matcher) {
if(currentToken.value){
matches.push(currentToken);
}
offset++;
}
} else if (part.type === "nonremovable") {
if (currentToken.type === part.matcher) {
matches.push(currentToken);
offset++;
} else {
return { length: offset, failed: true };
}
}
}
return {
failed: false,
length: offset,
matches
};
}
This should cut down on noise. If any of these tokens are actually necessary for something, we can use nonremovable to make sure they stay. There might be a case for combinations like optional plus nonremovable or optional plus virtual, but the same idea applies.
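Like virtual, the nonremovable helper isn't shown above; a minimal sketch that matches how matchRule consumes it would be:
// parser.js - at top; creates a rule part whose token is always kept
// in the AST, even when the token has no value
export function nonremovable(type){
    return { type: "nonremovable", matcher: type };
}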
Notes on Javascript parsing
The official resource to follow is https://tc39.es/ecma262/. It can be a bit dense but it's mostly readable.
Automatic semicolon insertion
A nasty artefact of javascript is that there was a time when they thought it would be a great idea to not require semicolons at the end of statements. Instead, virtual semicolons are added to the token stream in certain cases. This is known as "automatic semicolon insertion" or ASI. I feel like we know better now, and this adds a lot of complexity, so I think we'll just drop it. If you didn't put a semicolon, it's not implied.
Module vs Script
Javascript has two modes of parsing: one for scripts and one for modules. Scripts are basically old-style javascript, the stuff you would inject into a page without type=module, as well as commonjs. Since there's no point in dealing with scripts for new code in 2023, we can eliminate that class of things entirely.
Strict mode
Javascript has a concept of strict mode, in which certain things are not allowed or behave a bit differently. This is mostly relevant for non-modules, as modules are strict mode by default (actually they are slightly stricter than strict mode). This is complexity we can leave out if we assume everything is a module.
Significant whitespace
Some production rules consider things like line breaks (e.g. you can't break a line in the middle of the rule). This creates a problem because we remove line breaks at the token level. Including them means every rule needs to account for them, which pollutes the grammar. I'm not sure how to best deal with this problem.
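For reference, the classic case of this in the spec is the [no LineTerminator here] assertion in ReturnStatement; a line break after return changes the meaning of the program:
function a() {
    return 1; // returns 1
}
function b() {
    return    // the grammar forbids the expression from starting on the next line...
    1;        // ...so standard javascript (via ASI) reads this as `return; 1;`
}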
Implementation 1 (expressions)
Javascript has a lot of expression types; I only added a few and it was already a lot.
export const javascriptParser = new Parser(
javascriptTokenizer,
{
"AdditionExpression": [
[nonterminal("MultiplicativeExpression"), virtual("AdditionSuffixExpression")]
],
"AdditionSuffixExpression": [
[["+", "-"], nonterminal("MultiplicativeExpression"), virtual("AdditionSuffixExpression")],
[]
],
"MultiplicativeExpression": [
[nonterminal("ExponentiationExpression"), virtual("MultiplicativeSuffixExpression")]
],
"MultiplicativeSuffixExpression": [
[["*", "/", "%"], nonterminal("ExponentiationExpression"), virtual("MultiplicativeSuffixExpression")],
[]
],
"ExponentiationExpression": [
[nonterminal("UpdateExpression"), "**", nonterminal("ExponentiationExpression")],
[nonterminal("UnaryExpression")]
],
"UnaryExpression": [
[nonterminal("UpdateExpression")],
["typeof", nonterminal("UnaryExpression")],
["-", nonterminal("UnaryExpression")],
["~", nonterminal("UnaryExpression")],
["!", nonterminal("UnaryExpression")],
["await", nonterminal("UnaryExpression")],
],
"UpdateExpression": [
[nonterminal("LeftHandSideExpression")],
["++", nonterminal("UnaryExpression")],
["--", nonterminal("UnaryExpression")]
],
"LeftHandSideExpression": [
[nonterminal("NewExpression")],
],
"NewExpression": [
[nonterminal("MemberExpression")],
["new", nonterminal("MemberExpression")]
],
"MemberExpression":[
[nonterminal("PrimaryExpression")]
],
"PrimaryExpression": [
[nonterminal("Literal")]
],
"Literal" : [
[nonterminal("NumericLiteral")]
],
"NumericLiteral": [
["number-literal"]
]
}, "AdditionExpression");
Deno.test("javascriptParser can parse math expressions", () => {
const ast = javascriptParser.parse(`1 + 2`);
assertEquals(ast, {
type: "AdditionExpression",
children: [{
type: "MultiplicativeExpression",
children: [{
type: "ExponentiationExpression",
children: [{
type: "UnaryExpression",
children: [{
type: "UpdateExpression",
children: [{
type: "LeftHandSideExpression",
children: [{
type: "NewExpression",
children: [{
type: "MemberExpression",
children: [{
type: "PrimaryExpression",
children: [{
type: "Literal",
children: [{
type: "NumericLiteral",
children: [{
type: "number-literal",
value: 1
}]
}]
}]
}]
}]
}]
}]
}]
}]
}]
},
{
type: "MultiplicativeExpression",
children: [{
type: "ExponentiationExpression",
children: [{
type: "UnaryExpression",
children: [{
type: "UpdateExpression",
children: [{
type: "LeftHandSideExpression",
children: [{
type: "NewExpression",
children: [{
type: "MemberExpression",
children: [{
type: "PrimaryExpression",
children: [{
type: "Literal",
children: [{
type: "NumericLiteral",
children: [{
type: "number-literal",
value: 2
}]
}]
}]
}]
}]
}]
}]
}]
}]
}]
}]
});
});
This is perhaps enough to implement some basic math ops.
Implementation 2 (static module analysis)
For my actual task I wanted a reliable way to find the imports and exports of a javascript module. This task is more top-down than bottom-up, and my feeling is that I won't need to implement the entire grammar to get what I need. We'll start with a very simplified grammar that only parses import statements and nothing else (it also won't handle all the various cases yet). The top-level production, as noted above, is a Module, and we consume a series of terminals (tokens) and nonterminals. You might also notice that lists are defined recursively, which is representationally handy, if maybe a bit strange the first time you see it; these list rules had to be fixed for right recursion.
//esm-parser.js
import { Parser, nonterminal, virtual, not } from "../src/parser.js";
import { javascriptTokenizer } from "./javascript-tokenizer.js";
import { END } from "../src/tokenizer.js";
export const esmParser = new Parser(
javascriptTokenizer,
{
"Module": [
[nonterminal("ModuleBody")],
[]
],
"ModuleBody": [
[nonterminal("ModuleItemList")]
],
"ModuleItemList": [ //fixed for right-recursion
[nonterminal("ModuleItem"), virtual("ModuleItemListSuffix")],
],
"ModuleItemListSuffix": [
[nonterminal("ModuleItem"), virtual("ModuleItemListSuffix")],
[]
],
"ModuleItem": [
[nonterminal("ImportDeclaration")],
//[nonterminal("ExportDeclaration")],
//[nonterminal("StatementListItem")],
[nonterminal("OtherList")]
],
"OtherList": [
[nonterminal("Other"), virtual("OtherListSuffix")],
],
"OtherListSuffix": [
[nonterminal("Other"), virtual("OtherListSuffix")],
[]
],
"Other": [
[not(["export", END])]
],
"ImportDeclaration": [
["import", nonterminal("ImportClause"), nonterminal("FromClause"), ";"],
["import", nonterminal("ModuleSpecifier")]
],
"ImportClause": [
[nonterminal("ImportedDefaultBinding")],
[nonterminal("NameSpaceImport")],
[nonterminal("NamedImports")],
[nonterminal("ImportedDefaultBinding"), ",", nonterminal("NameSpaceImport")],
[nonterminal("ImportedDefaultBinding", ",", nonterminal("NamedImports"))]
],
"ImportedDefaultBinding": [
[nonterminal("ImportedBinding")]
],
"NameSpaceImport": [
["*", "as", nonterminal("ImportedBinding")]
],
"NamedImports": [
["{", "}"],
["{", nonterminal("ImportsList"), "}"],
["{", nonterminal("ImportsList"), ",", "}"]
],
"FromClause": [
["from", nonterminal("ModuleSpecifier")]
],
"ImportsList": [ //fix for right-recursion
[nonterminal("ImportSpecifier"), virtual("ImportsListSuffix")],
],
"ImportsListSuffix": [
[",", nonterminal("ImportSpecifier"), virtual("ImportsListSuffix")],
[]
],
"ImportSpecifier": [
[nonterminal("ModuleExportName"), "as", nonterminal("ImportedBinding")],
[nonterminal("ImportedBinding")],
],
"ModuleExportName": [
["identifier"],
["string-literal"]
],
"ModuleSpecifier": [
["string-literal"]
],
"ImportedBinding": [
["identifier"],
[nonterminal("BindingIdentifier")]
],
"BindingIdentifier": [
[nonterminal("Identifier")],
["yield"],
["await"]
],
"Identifier": [
["identifier"]
]
}, "Module");
OtherList is a production I made up which matches everything that's not an import statement. This allows us to scan a module for imports without considering the rest of the grammar.
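With that in place, here's a hedged sketch of the end goal: walking the resulting tree for module specifiers. collectModuleSpecifiers is a hypothetical helper, not part of the parser:
// esm-analyzer.js (hypothetical) - gather every ModuleSpecifier value
function collectModuleSpecifiers(node, results = []) {
    if (node.type === "ModuleSpecifier") {
        // after token filtering, the string-literal token is the only child
        results.push(node.children[0].value);
    }
    for (const child of node.children ?? []) { // raw tokens have no children
        collectModuleSpecifiers(child, results);
    }
    return results;
}

const ast = esmParser.parse(`import { foo } from "./foo.js";`);
console.log(collectModuleSpecifiers(ast)); // ["./foo.js"], give or take quoting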
Deno.test("esmParser can parse NamedImport { foo as bar, baz as qux }", () => {
const tokens = [...javascriptTokenizer.tokenize("{ foo as bar, baz as qux }")];
const { type, children, failed } = esmParser.produce(tokens, `NamedImports`);
assertEquals(type, "NamedImports");
assertEquals(children, [
{
type: "ImportsList",
children: [
{
type: "ImportSpecifier",
children: [
{
type: "ModuleExportName",
children: [
{
type: "identifier",
value: "foo"
}
]
},
{
type: "ImportedBinding",
children: [
{
type: "identifier",
value: "bar"
}
]
}
]
},
{
type: "ImportSpecifier",
children: [
{
type: "ModuleExportName",
children: [
{
type: "identifier",
value: "baz"
}
]
},
{
type: "ImportedBinding",
children: [
{
type: "identifier",
value: "qux"
}
]
}
]
}
]
}
]);
assertEquals(failed, false);
});
Cleanup
After playing around and trying to implement more grammar, I noticed the mini grammar I use for making productions from tokens needed to be a bit more structured. As it turns out, these matchers do two orthogonal things:
1) Decide whether to include or exclude the token/production in the AST
2) Match a token in a particular way
So in order to make it robust I came up with a table of how they should work:
type | match | AST
---|---|---
nonterminal | nonterminal | include
virtualNonterminal | nonterminal | children
excludeNonterminal | nonterminal | exclude
T \| T[] | equals | include
exclude(T \| T[]) | equals | exclude
include(T \| T[]) | equals | include
not(T \| T[]) | not | include
excludeNot(T \| T[]) | not | exclude
optional(T \| T[]) | optional | include
excludeOptional(T \| T[]) | optional | exclude
Here T is either a string or symbol that matches a token type. Bare strings and arrays can be considered a short-hand, but there's now an explicit version of everything. I'm also considering making TitleCased strings match productions as a short-hand and formalizing the conventions, but I didn't include that yet. This gives us the final parser with some fairly flexible behavior.
//parser.js
export function nonterminal(type) {
return { type: "nonterminal", include: true, matcher: type };
}
export function virtualNonterminal(type) {
return { type: "nonterminal", include: "children", matcher: type };
}
export function excludeNonterminal(type) {
return { type: "nonterminal", include: false, matcher: type };
}
export function include(type) {
return { type: "equals", include: true, matcher: type };
}
export function exclude(type) {
return { type: "equals", include: false, matcher: type };
}
export function not(type) {
return { type: "not", include: true, matcher: type }
}
export function excludeNot(type) {
return { type: "not", include: false, matcher: type }
}
export function optional(type) {
return { type: "optional", include: true, matcher: type };
}
export function excludeOptional(type) {
return { type: "optional", include: false, matcher: type };
}
export class Parser {
#tokenizer;
#productions;
#goalProduction;
constructor(tokenizer, productions, goalProduction) {
this.#tokenizer = tokenizer;
this.#productions = productions;
this.#goalProduction = goalProduction;
if (!this.#productions[goalProduction]) throw new Error(`Goal production ${goalProduction} did not exist.`);
}
parse(text) {
const tokens = [...this.#tokenizer.tokenize(text)];
const production = this.produce(tokens, this.#goalProduction);
if (production.failed) {
throw new Error(`Syntax Error`);
}
if (production.length != tokens.length && production.length != tokens.length - 1) { //if just END is left that's probably okay, we don't need to force them to make a production that includes it.
throw new Error(`Syntax Error: not all content read.`);
}
return {
type: production.type,
children: production.children
};
}
debug(tokens, productionType, startIndex, label) {
const token = tokens[startIndex];
const tokenInfo = token.type.toString();
console.log(`${label} ${[productionType]} index: ${startIndex}, type: "${tokenInfo}"${token.value ? `, value: ${token.value}` : ""}`);
}
produce(tokens, productionType, startIndex = 0) {
const productionRules = this.#productions[productionType];
//this.debug(tokens, productionType, startIndex, "Producing");
for (const rule of productionRules) {
const match = this.matchRule(tokens, rule, startIndex);
//this.debug(tokens, productionType, startIndex, match.failed ? "Backtracked" : "Produced");
if (!match.failed) {
return {
failed: false,
type: productionType,
children: match.matches,
length: match.length
};
}
}
return { failed: true };
}
matchRule(tokens, rule, startIndex) {
const matches = [];
let offset = 0;
for (const part of rule) {
const currentToken = tokens[startIndex + offset];
const isBasicPart = typeof (part) === "string" || typeof (part) === "symbol" || Array.isArray(part);
if (isBasicPart || part.type === "equals") {
const shouldInclude = (isBasicPart && currentToken.value)
|| part.include;
if (currentToken.type === part
|| currentToken.type === part.matcher
|| Array.isArray(part) && part.includes(currentToken.type)
|| Array.isArray(part.matcher) && part.matcher.includes(currentToken.type)) {
if (shouldInclude) {
matches.push(currentToken);
}
offset++;
} else {
return { length: offset, failed: true };
}
} else if (part.type === "nonterminal") {
if (!this.#productions[part.matcher]) {
throw new Error(`Nonterminal production ${part.matcher} did not exist.`);
}
const production = this.produce(
tokens,
part.matcher,
startIndex + offset
);
if (!production.failed) {
if (part.include === "children") {
matches.push(...production.children);
} else if(part.include){
matches.push({ type: production.type, children: production.children });
}
offset += production.length
} else {
return { length: offset, failed: true };
}
} else if (part.type === "not") {
if ((!Array.isArray(part.matcher) && currentToken.type !== part.matcher)
|| (Array.isArray(part.matcher) && !part.matcher.includes(currentToken.type))) {
if(part.include){
matches.push(currentToken)
}
offset++;
} else {
return { length: offset, failed: true };
}
} else if (part.type === "optional") {
if (currentToken.type === part.matcher
|| Array.isArray(part.matcher) && part.matcher.includes(currentToken.type)) {
if (part.include) {
matches.push(currentToken);
}
offset++;
}
}
}
return {
failed: false,
length: offset,
matches
};
}
}
There are also some commented-out debug lines which are useful if you're trying to figure out why your grammar isn't working as expected.
Condensing the AST
One thing is that the generated AST is huge because it contains all the productions from the very top. One nice optimization would be to condense it so only the actually important productions show up: if a production matched exactly one thing, we can hoist that child up in place of the production node. I'm keeping it optional since it's not clear at all that this will work like I think it does. But maybe the full tree could be useful for stringified output?
//parser.js - produce method, before the successful token return
if(match.matches.length === 1 && this.#condensed){
const condensedProduction = {
failed: false,
type: match.matches[0].type,
children: match.matches,
length: match.length
};
if(match.matches[0].value){
condensedProduction.value = match.matches[0].value;
}
if (match.matches[0].children) {
condensedProduction.children = match.matches[0].children;
}
return condensedProduction;
}
return {
failed: false,
type: productionType,
children: match.matches,
length: match.length
};
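The #condensed flag referenced above isn't wired up in the snippets shown; a minimal sketch, assuming it's exposed as a constructor option, might look like this:
// parser.js (hypothetical wiring) - accept an options bag and stash the flag
export class Parser {
    #tokenizer;
    #productions;
    #goalProduction;
    #condensed;
    constructor(tokenizer, productions, goalProduction, options = {}) {
        this.#tokenizer = tokenizer;
        this.#productions = productions;
        this.#goalProduction = goalProduction;
        this.#condensed = options.condensed ?? false; // off by default
        if (!this.#productions[goalProduction]) throw new Error(`Goal production ${goalProduction} did not exist.`);
    }
    // ...rest of the class unchanged...
}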
Conclusion
So we came up with a basic parsing framework and it's okay, I guess. Enough to get the job done and not too much code (< 150 lines, not counting the token helpers). It kinda wound up being a parser unto itself with token modifiers, but I'm not sure there's a better way to get rid of those. Left recursion is still a problem because we can't express those grammars directly; they have to be manually massaged to work. Aside from changing to a bottom-up algorithm, we might also be able to apply transforms on the incoming grammar to create those virtual production types. But even then we'd have the problem where the AST could be ambiguous, because not everything would be a binary expression, and we'd have to lean on associativity rules in evaluation (if that even works; I'm not clear it would in all cases).
Backtracking is a performance problem. It would be nicer to have a more efficient way to deal with lookaheads so we don't need backtracking and can stay fast.
A last problem is error reporting. We don't do this very well. It would be nicer to give the user the exact index a problem happened at, even better with some context info and what sorts of things were expected. This probably requires a little more design, because the tokens might need to carry those indices and have them flow through the whole thing. In general it's also really hard to debug your grammar because of all the recursion.
Honestly, there's a lot to complain about but I feel it took getting here to really find all these issues. I'm still pretty happy with what we have. Certainly, my understanding of compiler design got a bit deeper now that I've experienced some of these issues first-hand.
Code