Building Zerocalc, part III - errors and grammar

michal1024

Michal Ciesielski

Posted on June 23, 2024

Building Zerocalc, part III - errors and grammar

The basic parsing method presented in part II works well for simple expressions that consist of binary operators and literals. Our calculator must support more complex expressions, including unary operators, parenthesis, and function calls. In addition, we'd like to be able to communicate parsing errors to the user pointing them to the potential source of the problem.

Errors

When a parser hits an issue, the user will usually get a message explaining the problem the parser hit and where in the code it happened:

Sample parsing error

To return such errors, we need to store the message and the span of input code the error relates to:

pub struct Span {
    pos: usize,
    len: usize
}

impl Span {
    pub fn new(pos: usize, len: usize) -> Self {
        Span {
            pos,
            len
        }
    }
}

pub struct Error {
    message: String,
    span: Span
}

impl Error {
    pub fn new(message: &str, span: Span) -> Self {
        Error {
            message: String::from(message),
            span
        }
    }
Enter fullscreen mode Exit fullscreen mode

Sometimes the error is detected by a function coming from a library we use and the error type is different than our Error. In this case, it will be handy to wrap that error in our Error structure to have consistent error flow through the call stack:

impl Error {
    pub fn wrap<T: Display>(t: T) -> Self {
        Error {
            message: format!("{t}"),
            span: Span::new(0, 0)
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

We can use the wrap method with Rust's handy Result::map_err method:

pub fn parse_int(input: &str) -> Result<i128, Error> {
    input.parse().map_err(Error::wrap)
}
Enter fullscreen mode Exit fullscreen mode

Grammar

Let's now look into improving our parser to handle more complex expressions. We are following recursive descent parsing (RDP) - but how exactly such a parser should be built? Wikipedia's definition of RDP states:

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

We need to build a grammar and then create a procedure for each non-terminal of the grammar. A non-terminal is an element of the grammar defined by other terminals and non-terminals. A terminal is the actual token we read. A grammar for basic addition can be defined as follows:

exp: fact + fact
fact: literal
Enter fullscreen mode Exit fullscreen mode

The exp (expression) consists of two factors and + operator. A factor is a simple literal. Grammar like this defines a single addition operation 2+2. To define a chain of additions, we need to add recursion to the grammar:

exp: fact + exp | fact
fact: literal
Enter fullscreen mode Exit fullscreen mode

This can define anything from a single literal 1 to any number of additions 1+2+3.... The fact exp is on the right-hand side of the fact + exp term is not an accident - parsing left-recursive grammars is more complicated, and for RDP we need to use right-side recursion.

Next, we need to consider operator precedence. The 1+2*3 should be computed as 1+(2*3) and 1*2+3 should become (1*2)+3. It means that for lower precedence operations like + each side can be a literal or a higher-precedence operation. so for + and * we can write:

exp: exp1 + exp | exp1
exp1: fact * exp1 | fact
fact: literal
Enter fullscreen mode Exit fullscreen mode

It's a bit tricky but as the old joke says, "to understand recursion you first need to understand recursion".

Now we can write full grammar for the expressions we want to handle in our calculator. In addition to basic mathematical operations, we want to handle parenthesis, identifiers (which may be constants like pi or e), and functions like sin.

exp: exp1 | empty
exp1: exp2 op1 exp1 | exp2
op1: + | -
exp2: exp3 op2 exp2| exp3
op2: * | / | %
exp3: term op3 exp3 | term
op3: ^
fact: +fact | -fact | (exp1) | func | id | literal
func: id(exp1)
Enter fullscreen mode Exit fullscreen mode

Parsing

The idea for the parser implementation follows the RDP definition: for each non-terminal, we write a function that will parse this non-terminal. The function will return true if parsing consumed a terminal or false if parsing ended with an empty statement. This will help us check whether all sides of the expression returned the actual result. Example function for parsing a non-terminal:

Parsing a non-terminal

I added comments showing pieces of grammar each function is responsible for:

    pub fn parse(&mut self) -> Result<bool, Error> {
        self.init();
        self.parse_exp()
    }


    // exp: exp1 | empty
    pub fn parse_exp(&mut self) -> Result<bool, Error> {
        match self.next_token.kind {
            lexer::TokenKind::Eof => Ok(false),
            _ => self.parse_exp1()
        }
    }


    // exp1: exp2 op1 exp1 | exp2
    // op1: + | -
    fn parse_exp1(&mut self) -> Result<bool, Error> {
        let has = self.parse_exp2()?;
        match self.next_token.kind {
            kind @(lexer::TokenKind::Add | lexer::TokenKind::Sub) => {
                self.bump();
                if self.parse_exp1()? {
                    self.parse_binary_op(kind)?;
                    Ok(true)
                } else {
                    self.error(ERR_EOF)
                }
            },
            _ => Ok(has),
        }
    }

    // exp2: exp3 op2 exp2| exp3
    // op2: * | / | %
    fn parse_exp2(&mut self) -> Result<bool, Error> {
        let has = self.parse_exp3()?;
        match self.next_token.kind {
            kind @(lexer::TokenKind::Mul | lexer::TokenKind::Div | lexer::TokenKind::Mod) => {
                self.bump();
                if self.parse_exp2()? {
                    self.parse_binary_op(kind)?;
                    Ok(true)
                } else {
                    self.error(ERR_EOF)
                }
            },
            _ => Ok(has)
        }
    }

    // exp3: fact op3 exp3 | fact
    // op3: ^
    fn parse_exp3(&mut self) -> Result<bool, Error> {
        let has = self.parse_fact()?;
        match self.current_token.kind {
            kind @lexer::TokenKind::Pow => {
                self.bump();
                if self.parse_exp3()? {
                    self.parse_binary_op(kind)?;
                    Ok(true)
                } else {
                    self.error(ERR_EOF)
                }
            },
            _ => Ok(has)
        }
    }

    // fact: +fact | -fact | (exp1) | func | id | literal
    // func: id(exp1)
    fn parse_fact(&mut self) -> Result<bool, Error> {
        self.bump();
        match self.current_token.kind {
            kind@ (lexer::TokenKind::Add |lexer::TokenKind::Sub) => {
                if !self.parse_fact()? {
                    return self.error("Unary operator needs expression");
                }
                self.parse_unary_op(kind)
            }
            lexer::TokenKind::Lpar => {
                let has = self.parse_exp1()?;
                if self.next_token.kind != lexer::TokenKind::Rpar {
                    return self.error("Missing closing parenthesis");
                };
                self.bump();
                Ok(has)
            },
            lexer::TokenKind::Literal(kind) => {
                self.parse_literal(kind)
            },
            lexer::TokenKind::Eof => Ok(false),
            // etc for id, function...
            _ => self.error(ERR_UNEXP)
        }
    }
Enter fullscreen mode Exit fullscreen mode

I shortened parse_fact a bit but you get the idea. When parsing a terminal, we just add it to the output program I described in the previous post:


    fn parse_literal(&mut self, l: lexer::LiteralKind) -> Result<bool, Error> {
        let val = match l {
            lexer::LiteralKind::Int => self.parse_int()?,
            lexer::LiteralKind::Float => self.parse_float()?
        };
        self.program.push(Expression::Val(val));
        Ok(true)
    }

    fn parse_binary_op(&mut self, kind: lexer::TokenKind) -> Result<bool, Error>{
        let op = match kind {
            lexer::TokenKind::Add => Op::Add,
            lexer::TokenKind::Sub => Op::Sub,
            lexer::TokenKind::Div => Op::Div,
            lexer::TokenKind::Mul => Op::Mul,
            lexer::TokenKind::Mod => Op::Mod,
            lexer::TokenKind::Pow => Op::Pow,
            _ => return self.error("Invalid binary operator")
        };
        self.program.push(Expression::BinaryOp(op));
        Ok(true)
    }
Enter fullscreen mode Exit fullscreen mode

And that's it! We can now parse very complex mathematical expressions.

Sources:

  1. https://dev.to/nathan20/how-to-handle-errors-in-rust-a-comprehensive-guide-1cco
  2. https://en.wikipedia.org/wiki/Recursive_descent_parser
💖 💪 🙅 🚩
michal1024
Michal Ciesielski

Posted on June 23, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related