Defining the Tokens (Pogo Pt: 3)

chigbeef_77

Chig Beef

Posted on January 19, 2024

Intro

In this series I am creating a transpiler from Python to Golang called Pogo. So far, we have constructed a typing system for Python and emulated the target Go code. In this post we will start defining the tokens that will be used by the compiler's lexer.

Our source code

Here is the current Python code we are attempting to compile.

from GoType import *

# Code
for i in range(2, 100):
    prime: bool = True

    for j in range(2, i):
        if i%j == 0:
            prime = False

    if prime:
        print(i)

From this code, we can start looking for important tokens that we need to check for.

What is a Token?

In a compiler, a token is the smallest meaningful unit of the source text. In English, these would be words, punctuation, and possibly newlines.
In our code, tokens will be things like number literals such as 2, strings, the if keyword, and brackets.

Identifying the Tokens Manually

By walking through the code we can pick out these tokens:
from, IDENTIFIER, import, ASTERISK, \n (newline), single-line comment, for, in, range, (, number, separator, ), colon, indent, assign, bool, modulus, equality.
These are all the distinct tokens (I may have missed some, but these are the general ones). For example, the line prime: bool = True breaks down into an identifier, a colon, the bool keyword, an assignment, and a boolean literal.

Creating the Token Struct

To pass these tokens around to other sections of the compiler, such as the parser, we need a struct.
Each token needs its text and its code. The code is unique to each kind of token, so you can check for left brackets easily: every left bracket carries the same code. The text is used to differentiate tokens that share a code. In the token list above you may have noticed IDENTIFIER; an identifier is (among other things) any variable name, and variable names are not all the same, so we need the text to tell them apart.
Put simply, the struct looks like this.

type Token struct {
    code int
    text string
}
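
As a rough sketch of how these two fields get used (the numeric codes here are the ones assigned in the next section, and the standalone main is only for illustration), the lexer might build and compare tokens like this:

package main

import "fmt"

type Token struct {
    code int
    text string
}

func main() {
    // Two hypothetical tokens: the identifier "prime" and a left parenthesis.
    // 128 and 131 are the codes given to IDENTIFIER and L_PAREN below.
    identifier := Token{code: 128, text: "prime"}
    leftParen := Token{code: 131, text: "("}

    // Tokens of the same kind share a code; the text tells instances apart.
    fmt.Println(identifier.code == leftParen.code) // false
    fmt.Println(identifier.text)                   // prime
}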

Defining the Token Codes

Now all that is left for the tokens is defining unique token codes. These codes will be kept in a map, specifically a map[string]int, which means we give the map a string (the token's name) and it gives us back the code as an int.
Here is the whole map.

var tokenCode map[string]int = map[string]int{
    // Not implemented
    "ILLEGAL": -1,

    // Keywords
    "K_IMPORT": 0,
    "K_FROM":   1,
    "K_FOR":    2,
    "K_IN":     3,
    "K_IF":     4,
    "K_ELIF":   5,
    "K_ELSE":   6,

    // In-Built Funcs
    "IB_PRINT": 32,
    "IB_RANGE": 33,

    // Bool operators
    "BO_NOT": 64,
    "BO_AND": 65,
    "BO_OR":  66,

    // Math operators
    "MO_PLUS":   67,
    "MO_SUB":    68,
    "MO_MUL":    69,
    "MO_DIV":    70,
    "MO_MODULO": 71,

    // Other
    "IDENTIFIER":   128,
    "NEWLINE":      129,
    "INDENT":       130,
    "L_PAREN":      131,
    "R_PAREN":      132,
    "L_BLOCK":      133,
    "R_BLOCK":      134,
    "L_SQUIRLY":    135,
    "R_SQUIRLY":    136,
    "SEP":          137,
    "COLON":        138,
    "ASSIGN":       139,
    "UNDETERMINED": 140,
    "COMMENT_ONE":  141,

    // Literals
    "L_BOOL":   160,
    "L_INT":    161,
    "L_STRING": 162,

    // Comparison operators
    "CO_EQUALS":     192,
    "CO_NOT_EQUALS": 193,
    "CO_GT":         194,
    "CO_GT_EQUALS":  195,
    "CO_LT":         196,
    "CO_LT_EQUALS":  197,
}

You may notice first that "ILLEGAL" is the only negative value. This keeps it easy to identify and completely separate from every other value. "ILLEGAL" can be used for anything that is not implemented, whether it will be implemented in the future or not.
We have an in-built functions section, which contains print and range. Although range is a class, for this project (especially at this stage), it will be a lot simpler to think of it as a function with one use.
You may have also noticed the "UNDETERMINED" value. When the lexer comes across the character '*', what should it save it as? Usually this is the multiplication operator, but in Python it has a second use in import statements. So we have to admit that we don't know yet which one it is; that isn't an error, we just have to wait until we have more context to figure out what this token actually is.
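
This is not necessarily how Pogo will resolve it, but as a minimal sketch of the idea, a later pass could look at the token that came before an UNDETERMINED '*' to decide what it really is (resolveAsterisk is a hypothetical helper, not part of Pogo):

package main

import "fmt"

// resolveAsterisk is a hypothetical helper: it decides what an UNDETERMINED
// '*' actually is by looking at the code of the token that preceded it.
// The code 0 is "K_IMPORT" in the map above.
func resolveAsterisk(prevCode int) string {
    if prevCode == 0 {
        return "import wildcard" // as in: from GoType import *
    }
    return "multiplication" // as in: i * j
}

func main() {
    fmt.Println(resolveAsterisk(0))   // previous token was "import": import wildcard
    fmt.Println(resolveAsterisk(128)) // previous token was an identifier: multiplication
}
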
Just below "UNDETERMINED" in the map is the "COMMENT_ONE" token. This is used for single-line comments (multi-line comments will be implemented later). Although most compilers simply discard comments, we want to keep them and carry them through into the emitted Go code, so we need a token to track each comment and make sure it isn't lost.
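
To make the map's role concrete, here is a small sketch (using a trimmed-down copy of tokenCode, since the full map lives above) of the kind of token stream the lexer should eventually produce for the line if prime: from our source code:

package main

import "fmt"

type Token struct {
    code int
    text string
}

// A trimmed-down copy of the tokenCode map above, just for this example.
var tokenCode = map[string]int{
    "K_IF":       4,
    "IDENTIFIER": 128,
    "COLON":      138,
    "NEWLINE":    129,
}

func main() {
    // The tokens we would expect the lexer to emit for "if prime:".
    tokens := []Token{
        {code: tokenCode["K_IF"], text: "if"},
        {code: tokenCode["IDENTIFIER"], text: "prime"},
        {code: tokenCode["COLON"], text: ":"},
        {code: tokenCode["NEWLINE"], text: "\n"},
    }

    for _, t := range tokens {
        fmt.Printf("%d %q\n", t.code, t.text)
    }
}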

Next

In the next post we will start using these tokens and the Python source code we have created to develop the lexer for our compiler! This is the first step in the compilation process, which leads on to parsing, semantic analysis, optimizing (optional), and emitting.
