Parsing Python (Pogo Pt: 6)
Chig Beef
Posted on January 22, 2024
Intro
In this series I am creating a transpiler from Python to Golang called Pogo. In the last post we created an object called a Structure
, which we are going to use in this post to create an abstract syntax tree through the use of a parser.
What is a parser
A parser in most simple compilers will become the largest part of the project. This is because it has to check for many different syntax cases. The parser turns the slice of tokens that we received from the lexer into a tree structure. We do this because code is not linear like an array, but everything works in group. For example, an if statement would be a parent node to multiple print statements, and in that if statement we could have a loop. In this way we can create complicated tree structures through parsing the code.
The Parser Struct
The Parser
as to keep track of 4 values. First, the source, which is the slice of Token
s we got from the lexer. We also need to keep track of our current position in the source, which is very similar to the lexer. We also need the current token we are looking at, and lastly we need a slice of int
s for markers. Markers will be saved positions that we can use in case we need to go back in the source.
type Parser struct {
curPos int
curToken Token
source []Token
markers []int
}
Creating Helper Functions
We need to effectively be able to move around the source freely, so we will make some functions to help us with this.
Firstly, we need a function to move us onto the next token in the source.
func (p *Parser) nextToken() {
p.curPos++
if p.curPos >= len(p.source) {
p.curToken = Token{} // Nil
} else {
p.curToken = p.source[p.curPos]
}
}
This should look familiar to code in the lexer, and simply gets the next token, and if there isn't one, just returns an empty value.
Now usually once we have the next token, we want something useful, and comments will get in the way of that, so we need a way to move over them without deleting them.
func (p *Parser) nextTokenNoNotes() []Structure {
p.nextToken()
sts := []Structure{}
for p.curToken.code == tokenCode["COMMENT_ONE"] || p.curToken.code == tokenCode["NEWLINE"] {
sc := structureCode["COMMENT_ONE"]
if p.curToken.code == tokenCode["NEWLINE"] {
sc = structureCode["NEWLINE"]
}
sts = append(sts, Structure{[]Structure{}, sc, p.curToken.text})
p.nextToken()
}
return sts
}
This function creates a slice of Structure
s to hold all the newlines and comments. We keep appending to this slice as long as we can see notes and newlines. Since structure codes and token codes don't align perfectly, we need to check within the loop which Token
we got and create the Structure
accordingly
Next, we have a function that can tell us what the future token is going to be.
func (p *Parser) peek() Token {
if p.curPos >= len(p.source)-1 {
return Token{}
}
return p.source[p.curPos+1]
}
Similarly, it would be good to move back just one token, in case we had a choice and we chose the wrong path.
func (p *Parser) rollBack() {
p.curPos--
if p.curPos < len(p.source) {
p.curToken = Token{} // Nil
} else {
p.curToken = p.source[p.curPos]
}
}
There are quite a few similarities between these functions, find which token we want, check if it exists, return accordingly.
func (p *Parser) setMarker() {
p.markers = append(p.markers, p.curPos)
}
func (p *Parser) gotoMarker() {
p.curPos = p.markers[len(p.markers)-1]
p.curToken = p.source[p.curPos]
p.markers = p.markers[:len(p.markers)-1]
}
These 2 functions are used for when we are deep in a parsing branch and need to be pulled out. Before we get into the branch, we set a marker as a safety net. As soon as we realize that we are in the wrong branch, we call gotoMarker
and it will pull as back to the most recently placed marker. gotoMarker
also removes that marker.
Indentation
The syntax of Python that will make our lives slightly harder is the fact it uses indentation to group code rather than literally anything else. In Python we don't end up with a clear character marker that shows us that a block has ended, as apposed to Go which tells us with a }
or even Ruby that uses end
. So now we have to figure out when groups end by ourselves.
All we know is that when the indentation of this line is lower than the previous line's indentation, we have a block end.
func (p *Parser) replaceIndents(input []Token) []Token {
indents := []int{0}
curIndex := 0
for i := 0; i < len(input); i++ {
if input[i].code == tokenCode["NEWLINE"] {
curIndex++
indents = append(indents, 0)
} else if input[i].code == tokenCode["INDENT"] {
indents[curIndex]++
}
}
output := []Token{}
curIndex = 0
for i := 0; i < len(input); i++ {
if input[i].code == tokenCode["NEWLINE"] {
curIndex++
if indents[curIndex] < indents[curIndex-1] {
for j := 0; j < indents[curIndex-1]-indents[curIndex]; j++ {
output = append(output, Token{tokenCode["ANTI_COLON"], ":"})
}
}
}
if input[i].code == tokenCode["INDENT"] {
continue
}
output = append(output, input[i])
}
return output
}
To start of this code, we are going to count the indentation level of each line. We do this by storing these indentation levels in a slice of int
s. We loop over the source, incrementing the indentation level every time we come across and indent, but when we find a newline we create another line in the slice to store more information.
Now we need to start placing in the block ends, which I have called anti-colons because they do the opposite of a colon in Python.
Firstly, lets look at the second if statement in the loop. This if statement simply checks if we have an indent, and by using continue
we don't add it to output
, which will be the slice of Token
s to replace our source. In this way our final source code should contain no indents.
The first if statement, similar to the if statement in the first loop, checks if we are starting a new line, and if we are then we need to increment our counter so we are looking at the indentation level of the correct line.
Now we need to check whether we should be adding an anti-colon in here, and we only need to if the indentation level of our current line is less than the indentation level of the previous line, therefore implying we have finished a block.
If this is the case, we add not one anti-colon, but as many as we need to match up the indentation levels. We need to do this incase the last line was a nested structure, which usually ends in 2 }
s.
This all works so far until you end the Python source file with indentation, because there's no next line to check whether it's got lower indentation than the previous line.
To fix this, we keep count up all of the colons and anti-colons in the code. If there is a difference then we need to add more anti-colons to the end, which is a simple process.
We end the function by returning output
with an extra newline, which helps keep us from leaving the bounds of the source (adds a margin of error).
Checking the Source
Now that we have a way of moving around the source, and have made the source easier to work with, we can now implement more methods to simplify the checking process.
Firstly, we need a way to check whether the current token we are looking at is what we expect.
func (p *Parser) checkToken(tokenKey string) (Structure, error) {
if p.curToken.code == tokenCode[tokenKey] {
return Structure{[]Structure{}, structureCode[tokenKey], p.curToken.text}, nil
}
return Structure{}, errors.New("[Parse (checkToken)] Expected " + tokenKey + ", got " + p.curToken.text)
}
This is a very simple method that we can use all the time, and we check whether the given error is nil
to decide whether we got what we wanted.
What if we have a bunch of tokens in a row that we want to check? We could use checkToken
repeatedly, but then we would be using checkToken
and error checking for every check, which can bloat the code quickly. Instead we will create another method that simply checks whether the next range of tokens is correct.
func (p *Parser) checkTokenRange(tokenKeys []string) ([]Structure, error) {
structures := []Structure{}
for i := 0; i < len(tokenKeys); i++ {
temp, err := p.checkToken(tokenKeys[i])
if err != nil {
return structures, err
}
structures = append(structures, temp)
p.nextToken()
}
return structures, nil
}
This method returns the first error it comes across, or a slice of Structure
s which should correlate with what we were checking for.
Another useful method would be to simplify a simply choice between tokens that could be possible. For example, if we are assigning a value to a variable, that value could be an identifier, or any literal. And since this isn't the only case where this occurs, it would be nice to keep this process simple.
func (p *Parser) checkTokenChoices(tokenKeys []string) (Structure, error) {
for i := 0; i < len(tokenKeys); i++ {
if p.curToken.code == tokenCode[tokenKeys[i]] {
return Structure{[]Structure{}, structureCode[tokenKeys[i]], p.curToken.text}, nil
}
}
errText := ""
for i := 0; i < len(tokenKeys); i++ {
errText += tokenKeys[i]
errText += " or "
}
errText = errText[:len(errText)-4]
return Structure{}, errors.New("[Parse (checkToken)] Expected " + errText + ", got " + p.curToken.text)
}
Similar to checkTokenRange
we decide whether our token is valid by comparing it against a slice of keys. The main section we are interested in is the first loop. This simply returns as soon as it has found a valid option.
The rest of the code is just for creating the error text, which will list all of the values that we were looking for.
Now, we have gone through a lot of code so I'll explain the rest of the parser in the next post. The parser did end up being quite large, even after the generalizations we just made with these methods
Next
In the next post we will create the parse
method on the parser, which, similar to lex
on the lexer, will be the main method that will return our output, which in this case will be an abstract syntax tree. This method will utilize many more methods to create this tree, with each method specializing in blocks, statements, comparisons, etc.
Posted on January 22, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.