Lexer - Expression calculator
Pavel Kutáč
Posted on April 21, 2021
Smart calculators are everywhere. You can type a whole expression and it will evaluate it with proper operator priorities and handle parentheses. This article, and the series it starts, describes how to build your own smart calculator.
🇨🇿 You can read this article in Czech at kutac.cz
There are many ways to solve this problem. I will focus on an implementation with a lexer, which tokenizes the input, and a parser, which converts the tokens into an Abstract Syntax Tree.
🔗 The whole code can be found on GitHub in arxeiss/go-expression-calculator repository.
Lexer and tokens
A lexer performs lexical analysis, which is the process of converting an input string into a list of tokens. Each token has its own meaning and can carry a value. The first step is to define all the token types we will need.
type TokenType uint8

const (
	EOL TokenType = iota // End Of Line
	Unknown
	LPar
	RPar
	Exponent
	Multiplication
	Division
	FloorDiv
	Modulus
	Addition
	Substraction
	Number
	Identifier
	Whitespace
)

type Token struct {
	tType  TokenType
	value  float64
	idName string
}
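As an illustration, the expression 2 * (x + 1) could be represented by the following slice of tokens. This literal is only a hypothetical sketch of the data model; it assumes the code lives in the same package as the Token definition above and is not taken from the repository.

// Hypothetical illustration: the tokens the lexer should produce for "2 * (x + 1)".
tokens := []*Token{
	{tType: Number, value: 2},
	{tType: Multiplication},
	{tType: LPar},
	{tType: Identifier, idName: "x"},
	{tType: Addition},
	{tType: Number, value: 1},
	{tType: RPar},
	{tType: EOL}, // an End Of Line token closes every tokenized expression
}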
"Cheating" with regular expression
A lexer can be implemented as a finite automaton. But that can get quite complicated, so I used a regular expression instead. The final regex is in the code below and also on regex101.com. The main reason I decided to use a regular expression is to support multiple number formats, such as 123, 5.48, .13, 71.41e2, 9.731e+3, 4.43e-4, and all other combinations not listed here.
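To convince yourself that the number part of the regex really covers all of these formats, you can test that group in isolation. The snippet below is only a small experiment, not part of the calculator; the anchors ^ and $ are added so each sample string has to match as a whole.

package main

import (
	"fmt"
	"regexp"
)

// The number-matching group from the token regex, anchored for whole-string matching.
var numRegexp = regexp.MustCompile(`^(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:e[+-]?[0-9]+)?$`)

func main() {
	for _, s := range []string{"123", "5.48", ".13", "71.41e2", "9.731e+3", "4.43e-4"} {
		fmt.Println(s, numRegexp.MatchString(s)) // prints true for every sample
	}
}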
Simplified Lexer
The code below is just a simplified version of the lexer that is actually used; some edge cases are not caught by it. The full code, with all tests, can be found on GitHub.
The regular expression splits the input string into parts, which are then converted into tokens.
var tokenRegexp = regexp.MustCompile(`\(|\)|\*\*|\^|//|%|\+|\-|\*|/|(?P<num>(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:e[+-]?[0-9]+)?)|(?P<id>(?i)[a-z_][a-z0-9_]*)|(?P<ws>\s+)`)
type Lexer struct {
	expr string
}

func NewLexer(expression string) *Lexer {
	return &Lexer{expr: expression}
}

func (l *Lexer) Tokenize() ([]*Token, error) {
	expr := make([]*Token, 0)
	for _, match := range tokenRegexp.FindAllStringSubmatch(l.expr, -1) {
		t := &Token{}
		switch {
		case match[2] != "":
			t.tType = Identifier
			t.idName = match[2]
		case match[1] != "":
			t.tType = Number
			var err error
			if t.value, err = strconv.ParseFloat(match[1], 64); err != nil {
				return nil, TokenError(t, ErrInvalidNumber)
			}
		case match[3] != "": // Ignore whitespace for now and skip it
			continue
		default:
			t.tType = operatorTokenType(match[0])
		}
		expr = append(expr, t)
	}
	expr = append(expr, &Token{tType: EOL})
	return expr, nil
}
func operatorTokenType(operator string) TokenType {
	switch operator {
	case "(":
		return LPar
	case ")":
		return RPar
	case "^", "**":
		return Exponent
	case "*":
		return Multiplication
	case "/":
		return Division
	case "//":
		return FloorDiv
	case "%":
		return Modulus
	case "+":
		return Addition
	case "-":
		return Substraction
	}
	return Unknown
}
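To try the simplified lexer, it can be wired up roughly like this. This is only a usage sketch; it assumes everything above sits in one main package (with its own regexp and strconv imports), and the printed format is just for demonstration.

package main

import (
	"fmt"
	"log"
)

func main() {
	tokens, err := NewLexer("2 + 3.5 * (x - 1)").Tokenize()
	if err != nil {
		log.Fatal(err)
	}
	for _, t := range tokens {
		// Print the raw token fields for inspection.
		fmt.Printf("type=%d idName=%q value=%v\n", t.tType, t.idName, t.value)
	}
}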
Why do we need tokens?
It would be possible to write a single module that parses and computes the result at once. But splitting the logic into multiple modules is one of the best practices; see the Single-responsibility principle and, optionally, the other design principles from SOLID.
If there were a requirement to support inputs like 5 PLUS 8 TIMES 4, only the lexer would need to change, not the rest of the program. The output of the lexer would still be the same, as the sketch below shows.
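A word-based lexer could simply map such keywords onto the very same token types, so the parser and everything after it stay untouched. The mapping below is purely hypothetical (only PLUS and TIMES come from the example above) and is not part of the repository.

// Hypothetical mapping of word operators onto the existing token types.
var wordOperators = map[string]TokenType{
	"PLUS":  Addition,
	"MINUS": Substraction,
	"TIMES": Multiplication,
	"OVER":  Division,
}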