Defining the Tokens (Pogo Pt: 3)
Chig Beef
Posted on January 19, 2024
Intro
In this series I am creating a transpiler from Python to Golang called Pogo. So far, we have constructed a typing system for Python and emulated the target Go code. In this post we will start defining the tokens that will be used by the compiler's lexer.
Our source code
Here is the current Python code we are attempting to compile.
from GoType import *

# Code
for i in range(2, 100):
    prime: bool = True
    for j in range(2, i):
        if i%j == 0:
            prime = False
    if prime:
        print(i)
From this code, we can start looking for important tokens that we need to check for.
What is a Token?
In a compiler, a token is the smallest meaningful unit of the source. In English, this would be words, punctuation, and possibly newlines.
In our code, tokens will be things like number literals such as 2, strings, the if keyword, and brackets.
Identifying the Tokens Manually
Walking through the code by hand, we can pick out these tokens: from, IDENTIFIER, import, ASTERISK, \n, single-line comment, for, in, range, (, number, separator, ), colon, indent, assign, bool, modulus, and equality.
These are all the tokens with duplicates removed (I may have missed some, but these are the general tokens).
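For example, walking through the first loop line by hand (the lexer will automate this in the next post):

for i in range(2, 100):
→ for, IDENTIFIER (i), in, range, (, number (2), separator, number (100), ), colon, \n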
Creating the Token Struct
To usefully pass these tokens around to other sections of the compiler, such as the parser, we need to make a struct.
A token needs its text and its code. The code is unique for every type of token, so you can check for left brackets, because all left brackets will have the same code. The text is used to differentiate tokens that share a code. In the token list above you may have noticed IDENTIFIER; this is (but is not limited to) any variable name, and variable names aren't all the same, so we need a way to tell them apart.
Put simply, the struct looks like this.
type Token struct {
	code int
	text string
}
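As a tiny sketch of how this plays out (using the tokenCode map defined in the next section), two identifier tokens share a code but differ in their text:

// A minimal sketch: both tokens are identifiers, so they share a code,
// but the text tells us which variable each one names.
iTok := Token{code: tokenCode["IDENTIFIER"], text: "i"}
primeTok := Token{code: tokenCode["IDENTIFIER"], text: "prime"}
fmt.Println(iTok.code == primeTok.code) // true: same kind of token
fmt.Println(iTok.text == primeTok.text) // false: different variables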
Defining the Token Codes
Now all that is left for tokens is defining the unique token codes. These codes will be kept in a map, specifically a map[string]int, which means we give the map a string and it returns our code as a signed integer.
Here is the whole map.
var tokenCode map[string]int = map[string]int{
	// Not implemented
	"ILLEGAL": -1,
	// Keywords
	"K_IMPORT": 0,
	"K_FROM":   1,
	"K_FOR":    2,
	"K_IN":     3,
	"K_IF":     4,
	"K_ELIF":   5,
	"K_ELSE":   6,
	// In-built funcs
	"IB_PRINT": 32,
	"IB_RANGE": 33,
	// Boolean operators
	"BO_NOT": 64,
	"BO_AND": 65,
	"BO_OR":  66,
	// Math operators
	"MO_PLUS":   67,
	"MO_SUB":    68,
	"MO_MUL":    69,
	"MO_DIV":    70,
	"MO_MODULO": 71,
	// Other
	"IDENTIFIER":   128,
	"NEWLINE":      129,
	"INDENT":       130,
	"L_PAREN":      131,
	"R_PAREN":      132,
	"L_BLOCK":      133,
	"R_BLOCK":      134,
	"L_SQUIRLY":    135,
	"R_SQUIRLY":    136,
	"SEP":          137,
	"COLON":        138,
	"ASSIGN":       139,
	"UNDETERMINED": 140,
	"COMMENT_ONE":  141,
	// Literals
	"L_BOOL":   160,
	"L_INT":    161,
	"L_STRING": 162,
	// Comparison operators
	"CO_EQUALS":     192,
	"CO_NOT_EQUALS": 193,
	"CO_GT":         194,
	"CO_GT_EQUALS":  195,
	"CO_LT":         196,
	"CO_LT_EQUALS":  197,
}
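With the codes in place, here is a rough sketch of how the lexer (coming in the next post) might turn a scanned word into a Token. classifyWord is a hypothetical helper name, not part of Pogo:

// A sketch of mapping a scanned word to a Token.
func classifyWord(word string) Token {
	switch word {
	case "import":
		return Token{code: tokenCode["K_IMPORT"], text: word}
	case "for":
		return Token{code: tokenCode["K_FOR"], text: word}
	case "print":
		return Token{code: tokenCode["IB_PRINT"], text: word}
	default:
		// Anything we don't recognize is assumed to be a variable name.
		return Token{code: tokenCode["IDENTIFIER"], text: word}
	}
}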
You may notice first that "ILLEGAL" is the only negative value. This makes it easy to identify and keeps it completely separate from every other value. "ILLEGAL" can be used for values that are not implemented, whether or not they ever will be.
We have an in-built functions section, which contains print and range. Although range is a class, for this project (especially at this stage) it will be a lot simpler to think of it as a function with one use.
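For intuition, a loop like for i in range(2, 100): maps onto a plain Go for loop, roughly the shape Pogo should eventually emit (the exact output is the emitter's job, later in the series):

// Roughly the Go equivalent of: for i in range(2, 100):
for i := 2; i < 100; i++ {
	// loop body
}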
You may have also noticed the "UNDETERMINED" value. When the lexer comes across the character '*', what should it save it as? Usually this is the multiplication operator; however, in Python it serves a second purpose in import statements. So we have to admit that we don't know which one it is yet. It isn't an error; we just have to wait until we have more context to figure out what this token actually is.
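As a sketch of what that deferred decision could look like (resolveStar is a hypothetical helper; the real disambiguation will happen in a later stage):

// A hypothetical sketch: once we can see the previous token, we can
// decide what '*' really is. After an import keyword it is a wildcard,
// so we leave it undetermined for the import handling; anywhere else
// it must be multiplication.
func resolveStar(prev, star Token) Token {
	if prev.code == tokenCode["K_IMPORT"] {
		return star
	}
	star.code = tokenCode["MO_MUL"]
	return star
}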
Right below this token code is the "COMMENT_ONE" token, which is used for single-line comments (multi-line comments will be implemented later). Although most compilers discard comments, we want to keep them and carry them over into the emitted Go code, so we need a token that lets us track each comment and make sure it isn't lost.
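A sketch of how the lexer might capture one (readComment is a hypothetical helper; the real lexer arrives next post):

// A hypothetical sketch: everything from '#' to the end of the line
// becomes a single COMMENT_ONE token, so the comment survives into
// the emitted Go code.
func readComment(src string, pos int) (Token, int) {
	start := pos
	for pos < len(src) && src[pos] != '\n' {
		pos++
	}
	return Token{code: tokenCode["COMMENT_ONE"], text: src[start:pos]}, pos
}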
Next
In the next post we will use these tokens and the Python source code above to develop the lexer for our compiler! This is the first step in the compilation process, which leads on to parsing, semantic analysis, optimizing (optional), and emitting.