Building a stack-based virtual machine, part 4 - code
James Harton
Posted on September 5, 2018
Welcome back.
Now that we have our instructions and our stack, we need a program that we can run. I've called this Code
.
So what is code?
Let's look at a really simple assembly program to get some ideas:
@0 = 123
@1 = 456
.main:
push @0
push @1
add
This is not a real assembly program, but a simplified version I came up with for the stack-vm
debugging output which is loosely based on the LLVM IR text format. It shows us some important things about a program. A program needs:
-
Data - here the constants
123
and456
are assigned to the names@0
and@1
. -
Labels - in this case the label
main
points to the first instruction (instruction pointer = 0). In order to have functions or flow control we need to be able to label sections of code so that we can easily jump to them. We don't jump directly to instruction pointers because we'd probably make a mistake (remember GOTO?). -
Instructions - the meat and veg of our program; here we use the instruction
push
twice with one argument each time and then we call theadd
instruction.
So what does this boil down to? Our Code
struct needs to contain:
- A "data section" - or a list of constant operands.
- A "program section" - or a list of instruction op codes and indexes into the data section.
- A mapping of labels to instruction pointers (or indexes into the program section).
- A list of instruction names (symbols) stored by op code for easier debugging.
With all that in mind here's our Code
type:
pub struct Code<T> {
pub symbols: Vec<(usize, String)>,
pub code: Vec<usize>,
pub data: Vec<T>,
pub labels: Vec<(usize, String)>
}
We could write our software by just inserting values directly into these vectors but it's not very common to write assembly code by hand. It's much more common to generate code programmatically as an artifact of compiling a higher level language. In order to do that we're going to create our own builder (sometimes called a "code generator" or "assembler").
So what does our builder need to know in order to function?
- An instruction table as outlined in part 3. This is so that we can have access to labels, opcodes and instruction arities.
- Data, Labels and Instructions much like our Code type.
Why separate
Code
andBuilder
if they're mostly the same? I'm glad you asked.
We could conflate the two use cases together but the main reason they're separated is because the builder needs access to the instruction table for verification when inserting instructions, but code assumes that this process has already been done and can be used to serialising and deserialising to disk or for execution.
Here's our Builder
:
pub struct Builder<'a, T: 'a> {
pub instruction_table: &'a InstructionTable<T>,
pub instructions: Vec<usize>,
pub labels: WriteOnceTable<usize>,
pub data: Vec<T>,
}
impl<'a, T> Builder<'a, T> {
pub fn new(instruction_table: &'a InstructionTable<T>) -> Builder<T> {
let mut labels = WriteOnceTable::new();
labels.insert("main", 0);
Builder {
instruction_table: &instruction_table,
instructions: vec![],
labels: labels,
data: vec![],
}
}
}
Don't worry about
WriteOnceTable
- it's just a wrapper aroundHashMap
that only lets you set a value once.
So what behaviour does our builder need?
- Add instructions and their arguments into the program.
- Insert labels.
So let's cover off adding instructions first because it's by far the more complex part of the code.
impl<'a, T> Builder<'a, T> {
pub fn push(&mut self, name: &str, args: Vec<T>) {
let instr = self
.instruction_table
.by_name(name)
.expect(&format!("Unable to find instruction with name {:?}", name));
if args.len() != instr.arity {
panic!("Instruction {} has arity of {}, but you provided {} arguments.", instr.name, instr.arity, args.len())
}
self.instructions.push(instr.op_code);
self.instructions.push(instr.arity);
for arg in args {
let pos = self.push_data(arg);
self.instructions.push(pos);
}
}
}
So given the name of the instruction and a collection of operands we take the following steps:
- Look up the instruction in the instruction table.
- Verify that the number of arguments we've been given and the instruction arity match.
- Push the instruction's opcode into the program.
- Push the instruction's arity into the program. This might seem like it's not necessary but is needed for the
Code
type to be independent of theInstructionTable
. - Push each argument into the data section and it's index into the program.
Now on to labels:
impl<'a, T> Builder<'a, T> {
pub fn label(&mut self, name: &str) {
let idx = self.instructions.len();
self.labels.insert(name, idx);
}
}
This is pretty straight forward; we simply look up the number of instructions currently in the program and store the name pointing to it.
The last bit of the job is being able to convert a Builder
into a Code
. This is a relatively straight forward task:
- Grab all the instruction names (symbols) from the instruction table.
- Grab the program section.
- Grab the data section.
- Grab the labels and sort them by their IP.
- Stuff them all in the Code object.
impl<'a, T> From<Builder<'a, T>> for Code<T> {
fn from(builder: Builder<T>) -> Code<T> {
let symbols = builder.instruction_table.symbols();
let code = builder.instructions;
let data = builder.data;
let mut labels = vec![];
for key in builder.labels.keys() {
let idx = builder.labels.get(key).unwrap();
labels.push((*idx, key.clone()));
}
labels.sort_by(|lhs, rhs| {
lhs.0.cmp(&rhs.0)
});
Code {
symbols: symbols,
code: code,
data: data,
labels: labels
}
}
}
Phew! That's a pretty dense episode for you to watch. I haven't included the tests, but believe me they're all there. In the next episode we'll wire everything up into a machine and actually run a program!
Posted on September 5, 2018
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.