Building Zerocalc, part II - evaluating then parsing
Michal Ciesielski
Posted on June 16, 2024
In part I we built a simple tokenizer while explaining how rustc
converts an input text into a stream of tokens. The next step is to transform the stream of tokens into a representation that will be easy to evaluate. This time we will not follow the rustc
implementation. The rustc
parser is a complex system that produces multiple representations of the code, including abstract syntax tree (AST), high-, mid-, and low-level intermediate language. Those representations are used to perform different checks and transformations such as borrow checking or macro expansion before the low-level code is handed over to LLVM for actual machine code generation. We do not need all that complexity for a simple calculator.
Evaluating
Before we build a parser let's think about what the parser will produce. There are multiple ways of representing and evaluating mathematical expressions such as using an expression tree or following a reverse polish notation. The latter is more interesting as it is closer to what the real machine does.
Let's consider what the machine has to do to add two numbers 1+2
- Store
1
in a register - Store
2
in a register - Call the
add
instruction to add the two registers - Store the result in a register
We can write this as the following pseudo-assembly code. This code uses a stack instead of registers, but the idea is very similar:
push 1
push 2
add // adds 2 numbers from the top of the stack and stores the result back to the stack
pop // get the result from the stack
The above expression can be represented in a reverse polish notation as: 1 2 +
. A more interesting example, which requires considering operator precedence, will be 1 + 2 * 3
. This example in reverse polish notation will be 1 2 3 * +
.
Our calculator will need a representation that can store values and operations on a stack. Let's define two enums for that:
enum Expression {
Val(i32),
BinOp(Op)
}
enum Op {
Add,
Mul
}
An entire mathematical expression or a program to evaluate is a list of expressions:
type Program = Vec<Expression>;
With this representation evaluating mathematical expressions becomes very simple:
fn evaluate(p: &Program) -> i32 {
let mut stack: Vec<i32> = vec![];
for exp in p {
match exp {
Expression::Val(i) => stack.push(*i),
Expression::BinOp(Op::Add) => {
let a = stack.pop().unwrap();
let b = stack.pop().unwrap();
stack.push(a+b);
},
Expression::BinOp(Op::Mul) => {
let a = stack.pop().unwrap();
let b = stack.pop().unwrap();
stack.push(a*b);
}
}
};
//final result is now on the stack
stack.pop().unwrap()
}
#[test]
fn test_add() {
let program = vec![
Expression::Val(1),
Expression::Val(2),
Expression::BinOp(Op::Add)
];
assert_eq!(3, evaluate(&program));
}
The above code ignores error checking, such as overflow errors. I will leave it up to the reader to consider solving that issue. Hint: one possible solution is to use a number wrapper that can keep an actual number and the NaN (not-a-number) term.
Parsing
Our parsing algorithm will use information about the precedence of operations. Literal (value) has the highest precedence - if we see it, we immediately store it in our output program. Then we consider, in order, multiplication and addition. We have 2 extra tokens: Eof (end of the input stream) and Unknown (used for initial values of the tokens) - they will have the lowest precedence. Let's define precedence as a function:
fn precedence(token_kind: lexer::TokenKind) -> i32 {
use lexer::TokenKind::*;
match token_kind {
Unknown | Eof => 0,
Add => 1,
Mul => 2,
Literal(_) => 3
}
}
Our parser, in addition to storing the source being parsed and an output program, needs a few extra things:
- a variable to track the current position in the source so that token text can be extracted,
- variables to store the current and next token
We will use Cursor
structure we defined in part I to tokenize the input string. The lifetime of our parser will be bound to the lifetime of the source code being parsed.
struct Parser<'src> {
source: &'src str,
cursor: lexer::Cursor<'src>,
pos: usize,
current_token: lexer::Token,
next_token: lexer::Token,
program: Program
}
We will also implement a few extra methods, that will help us with parsing:
-
from_source
- a constructor to create the parser from the source code string -
init
- to initialize parsing -
bump
- to advance the parser, tracking its position -
current_value
- to extract the string value of a token using the current position
impl<'src> Parser<'src> {
fn from_source(source: &'src str) -> Parser<'src> {
Parser {
source,
cursor: lexer::Cursor::new(source),
pos: 0,
current_token: lexer::Token::new(lexer::TokenKind::Unknown, 0),
next_token: lexer::Token::new(lexer::TokenKind::Unknown, 0),
program: vec![]
}
}
fn init(&mut self) {
self.next_token = self.cursor.advance_token();
}
fn bump(&mut self) {
self.pos += self.current_token.len;
self.current_token = mem::replace(&mut self.next_token, self.cursor.advance_token());
}
fn current_value(&self) -> &'src str {
&self.source[self.pos..self.pos + self.current_token.len]
}
}
Note the mem:replace
function used in the bump
method. In Rust, we cannot simply move ownership from one variable to another while leaving the old variable pointing to nothing. We could clone the value, but we only need one copy, so we can use the replace
function instead, moving ownership, and assigning a new value in one go. Internally the replace
function uses unsafe code to swap references without copying the actual content of the target object.
For the parsing algorithm, we are going to iterate over tokens and:
- Compare the precedence of the next token and current token - if the precedence of the next token is higher, call parsing recursively
- Push the current expression on the stack.
Akin to the recursive descent parsing method, we will match tokens and delegate parsing each type of token to specialized methods.
fn parse(&mut self) {
self.init();
while self.next_token.kind != lexer::TokenKind::Eof {
self.parse_next(precedence(self.next_token.kind));
}
}
fn parse_next(&mut self, current_precedence: i32) {
while precedence(self.next_token.kind) >= current_precedence {
self.bump();
match self.current_token.kind {
lexer::TokenKind::Add => self.parse_binary_op(Op::Add),
lexer::TokenKind::Mul => self.parse_binary_op(Op::Mul),
lexer::TokenKind::Literal(_) => self.parse_literal(),
_ => (),
}
}
}
fn parse_binary_op(&mut self, op: Op) {
//continue parsing while tokens are of higher or equal precedence
self.parse_next(precedence(self.current_token.kind));
self.program.push(
Expression::BinOp(op)
);
}
fn parse_literal(&mut self) {
let i: i32 = self.current_value().parse().unwrap();
self.program.push(
Expression::Val(i)
);
}
Let's check if it works:
#[test]
fn test_parse() {
let source = "1+2*3";
let expected = vec![
Expression::Val(1),
Expression::Val(2),
Expression::Val(3),
Expression::BinOp(Op::Mul),
Expression::BinOp(Op::Add)
];
let mut parser = Parser::from_source(source);
parser.parse();
assert_eq!(expected, parser.program);
}
That's it! Our basic parser is ready to be enhanced with more operations and proper error handling.
Sources:
Posted on June 16, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.