Building a syntax tree from scratch using André’s Method
André Ovalle
Posted on November 21, 2023
Suppose you are taking a course on compilers or finite automata and you have to construct a syntax tree for a given regular expression. You know how to do it manually, but you struggle to find a simple and efficient way to do it programmatically. You search online for solutions, but most of them involve complicated rules and conditions. You are about to give up when you stumble upon this blog post that offers a novel method for building syntax trees.
Introduction
This method, which I call André’s Method (after myself 😂), is based on an adaptation and extension of another method that works well for mathematical expressions, but not for regular expressions. The goal of this method is to take any regular expression and generate the correct syntax tree that can be used to create finite automata, either deterministic or non-deterministic. The method consists of two main steps:
- Shunting Yard Algorithm
- Postfix expression backtracking
Before explaining each step in detail, I will first state the assumptions and requirements that the method relies on. The following rules must be followed:
The regular expression must be well-formed: This means that the parentheses must be balanced and the operators must be placed correctly. Otherwise, the method may produce incorrect trees.
The regular expression must indicate concatenation explicitly: This means that a special symbol (such as ‘.’) must be used to mark where concatenation occurs in the expression. To illustrate the method, we will use the dot “.” as the symbol for concatenation. For example, the regular expression a(ba)abb would be rewritten as a.(b.a).a.b.b with the dot symbol.
Define the precedence or hierarchy of the operators that will be used in the syntax tree. The precedence determines which operations are performed first in the expression. For this example, we will use a modified version of the usual precedence rules. Next, we will explain each step of the method in more detail.
Shunting Yard Algorithm
The Shunting Yard Algorithm is an algorithm that converts an expression with operators and operands from infix notation to postfix notation. The algorithm uses two queues: one is called the stack queue, which stores the operators according to their precedence and follows the LIFO (Last In First Out) principle. The other queue is called the output or postfix queue, which stores the resulting expression in postfix notation.
The Shunting Yard Algorithm requires a modified version of the operator precedence to work properly. We will describe how to set up the input and output parameters and steps of this algorithm, as well as the importance of operator precedence.
Input: Regular expression in infix
Output: Regular expression in postfix
- Declare the stack queue for operators
- Declare the output queue for postfix expression
- Define the operator precedence by removing the closing parenthesis from the precedence and moving the opening parenthesis to have the lowest precedence.
- For each character in the expression:
- If the character is an operand (not an operator or a parenthesis):
- Enqueue the character to the output queue.
- If the character is an operator or a parenthesis:
- If the character is an opening parenthesis:
- Push the parenthesis to the stack queue.
- If the character is an operator:
- Check the precedence of the operator that is at the top of the stack queue:
- If the stack queue is empty or the precedence of the operator at the top is lower than that of the character:
- Push the operator to the stack queue.
- If the precedence of the operator at the top is higher than or equal to that of the character:
- Pop the operator from the stack queue and enqueue it to the output queue.
- Push the character to the stack queue
- If the character is a closing parenthesis:
- Pop all the operators from the stack queue and enqueue them to the output queue until an opening parenthesis is found at the top of the stack queue. Pop the opening parenthesis from the stack queue, but do not enqueue it to the output queue.
- If the character is an opening parenthesis:
- If the character is an operand (not an operator or a parenthesis):
- If the stack queue is not empty:
- Pop all the operators from the stack queue and enqueue them to the output queue.
- Return the regular expression in postfix.
The operator precedence follows a standard hierarchy for regular expressions, so depending on the operators that are used, the precedence table would look something like this:
Precedence | # Operator Children |
Operators |
---|---|---|
4 | N.A | ( |
3 | 1 | * + ? |
2 | 2 | . |
1 | 2 | | |
0 | N.A | ) |
Postfix expression backtracking
After converting the regular expression to postfix notation using the Shunting Yard Algorithm, we need to backtrack the postfix expression to build the syntax tree. The backtracking algorithm is the unique part of this method that I have developed. I have not found any other source that uses this algorithm to construct syntax trees for regular expressions. If you know of any prior publication that describes the same algorithm, please let me know and I will give proper credit. Now, let’s see how the algorithm works.
The backtracking algorithm reads the postfix expression from right to left and fills the syntax tree in a reverse postorder fashion. This means that it fills the right node first, then the left node, and then the parent node. Whenever it creates a new leaf node, it checks if the maximum number of children matches the number of children that it has. The number of children is basically the number of operands that an operator can take. For “|” and “.” (concatenation) operators, the number of children is two, while for the rest of the operators it is one. The operands have zero children, which makes them leaves.
Now that we have explained the algorithm, we can specify the input and output parameters and the formal steps of the method.
Input: Regular expression in postfix
Output: Syntax tree for regular expression
Algorithm:
- Reverse the input regular expression.
- Create the root node with the first character of the reversed expression.
- Set the focus node to the root node.
- For each character of the reversed expression (excluding the first character):
- Create a new node with the character.
- If the focus node can have two children:
- If the focus node has no right child:
- Insert the new node as the right child of the focus node.
- Set the focus node to the new node.
- Otherwise:
- Insert the new node as the left child of the focus node.
- Set the focus node to the new node.
- If the focus node has no right child:
- If the focus node can have one child:
- Insert the new node as the left child of the focus node.
- Set the focus node to the new node.
- While the focus node is not the root node and the focus node has all its possible children:
- Set the focus node to its parent node
- Return the syntax tree that was created
This way, we have our beautiful syntax tree that will not lower our grade in our compiler project. Unless, of course, one of the 100 expressions that we tested in our compiler during our presentation failed and that’s why we went from having a 90 to a 50 :D.
Code
You can find samples of code written in Python here. Please remember to cite me if you are going to use my code so that you don’t get accused of plagiarism 👀.
Bibliography
Dr Dijkstra .E.W, Original description of the Shunting yard algorithm. Retrieved from: https://www.cs.utexas.edu/~EWD/MCReps/MR35.PDF
Posted on November 21, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 30, 2024