Creating Regular Expression
Akira
Posted on November 13, 2024
I created a toy regular expression engine to understand how regular expression works.
Process Flow
A regular expression represents a pattern for strings, such as /ab*(a|c)+/. However, this expression can be difficult to handle directly in a program. Therefore, the first step is to convert it into a format that is easier to work with programmatically.
This format is typically represented as an AST (Abstract Syntax Tree). An AST is a data structure that represents the structure of the regular expression in a tree format. By creating an AST, it becomes easier to analyze the regular expression and process it for pattern matching.
Next, we perform code generation based on the AST. Code generation is the process of creating instructions that a computer can execute. By analyzing the AST, we can generate the necessary commands and processing steps to create code that can actually handle the regular expression.
Finally, we evaluate the given string to see if it matches the regular expression using the generated code. The goal is to execute the generated code to determine whether the given string matches the regular expression pattern.
This approach allows effective handling of regular expressions, enabling pattern matching and searching within strings.
Parsing to AST
First, we parse the regular expression into an AST.
Consider a regular expression such as ab*(c|d)
as an example.
AST for ab*(c|d)
Concat
|
---------------------------------
| | |
Literal('a') Repeat0(Literal('b')) or
|
-----------
| |
Literal('c') Literal('d')
Representing this tree structure as structs makes it easier to handle in a program.
enum RegexAst {
Literal(char),
Concat(Vec<RegexAst>),
Or(Box<RegexAst>, Box<RegexAst>),
Repeat0(Box<RegexAst>), // Repeat 0 or more times
...other symbols like + and ?
}
With such a struct definition, the AST for ab*(c|d)
can be represented as follows:
Concat([
Literal('a'),
Repeat0(Literal('b')),
Or(
Concat([Literal('c')]),
Concat([Literal('d')])
)
])
Code Generation
In code generation, we analyze the AST and generate code to process the regular expression. Specifically, we define a set of instructions to handle the regular expression, and based on this, we generate the code.
First, we define a register machine as the data structure to handle regular expressions. A register machine consists of a finite set of registers and a program counter (which indicates the current instruction to execute). Using this register machine, we execute the regular expression processing.
We define the commands to use. This register machine has four commands:
- char: Matches a specific character. If the current character matches the specified character, it moves to the next command.
- match: Checks for a match with the regular expression. If the specified pattern matches the current string, it moves to the next command.
- jump: Jumps to a specified location, changing the execution point of the command.
- split: Branches to two different commands, each with its own target command. This can allow processing of two branches.
By combining these commands, we can represent the processing of regular expressions. By analyzing the AST and generating commands for each node, we also generate control structures like loops and conditional branches as needed.
With just these four commands, we can evaluate a regular expression.
The sample regular expression (ab*(c|d)
) can be represented with only these four commands, as shown below.
0001: Char(a)
0002: Split 0003, 0005
0003: Char(b)
0004: Jump(0002)
0005: Split 0006, 0008
0006: Char(c)
0007: Jump(0009)
0008: Char(d)
0009: Match
The numbers on the left are the PC (Program Counter).
Evaluation
This may be a bit challenging to understand, so let’s walk through the generated code step by step.
First, the Char(a) command is executed, checking if the current character is 'a'. If it doesn’t match, the regular expression does not match. If it matches, we increment the PC by one and jump to 0002.
If it matches, the Split 0003, 0005 command is executed, branching the program counter into two locations, 0003 and 0005. This represents the
b*
part of the expression, which can repeat 0 or more times. If 0005 is chosen, 0003 is skipped, so the 'b' character is not evaluated. However, if it jumps to 0003, it evaluates 'b'. If it matches, the PC is incremented to 0004, which instructs to jump back to 0002. This branches again into 0003 and 0005, and if 0003 is selected, 'b' is re-evaluated, allowing 'b' to match consecutively. Jumping to 0005 exits this branch, resulting in theb*
evaluation for zero or more repetitions of 'b'.Next, it branches into 0006 and 0008. This allows for a match of either 'c' or 'd'. At location 0006, the Char(c) command is executed, checking if the current character is 'c'. If it matches, it moves the PC to 0007, which jumps to 0009, ending the process. At location 0008, the Char(d) command is executed, checking if the current character is 'd'. Here too, the PC advances, and ultimately, the Match command confirms a match with the regular expression and the program ends.
Thus, this sequence of commands efficiently processes the pattern /ab*(c|d)/. Each command reflects the structure of the regular expression, with appropriate branching and repetition achieved through program counter control.
Evaluation is conducted using a loop.
For instance, to evaluate a string like char = xabd
:
-
char[0] == x
, so process at 0001 ends, moving to the next check. -
char[1] == a
, so 0001 passes, and the PC increments. - At 0002, branches to 0003 and 0005 are evaluated.
- At 0003, 'b' matches, so we change the evaluated character to 'd' and increment the PC to 0004.
This process continues, evaluating one character at a time.
Posted on November 13, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.