Register-based VMs with custom instruction sets

13 October 2024
C/C++,
compilers,
algorithms

Virtual machines are separated on the basis of their corresponding execution model. Traditional stack-based VMs rely on stack operations whereas register-based VMs use units called registers for operand storage and manipulation.

RISC (ARM, MIPS & RISC-V) is a common ISA which contrasts with CISC (x86, et al). Both share a common subset of instructions which may or may not work exactly the same but their differences are mostly architecture-dependent or historical reasons in implementation design.

RISC has faster processing and higher CPI than CISC and also appeals to syntax—that's why ARM is faster than x86 assembly (generally) and comes with the benefit of having a reduced instruction set which works in favor of low end devices or embedded systems.

Related: Average cycles per instruction (CPI)

Average cycles per instruction (CPI) is a rather ambiguous metric that scales with the reduction in the instruction set (cf. instructions per cycle in ARM vs. x86), among other factors. Typically, it is a measure of total CPU clock cycles and the frequency of specific instructions in a running program.

$\sum_{i = 1}^{n} C P I_{i} \times F_{i}$ $where F_{i} = \frac{n_{i}}{instruction count}$ $n_{i} = recurrence count of the instruction$

This is independent of multithreaded running programs, however, in most systems and particularly POSIX threads, processes share resources like the stack pointer, general-purpose registers, signals, scheduling properties and other metadata, thereby reducing average CPI of a single program. Again, this is not guaranteed and will usually depend on implementation.

A basic interpreter for a language would likely support a subset of common assembly instructions like storage and arithmetic operations. We can model the central structure that represents this (final source).

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_REGISTERS (1 << 4)
#define MEMORY_SIZE   (1 << 12)

struct vm {
  uint32_t registers[NUM_REGISTERS];
  uint32_t memory[MEMORY_SIZE / 4];
  uint32_t pc;
};

The VM consists of three main entries:

Registers based on a total count.
Memory from a predefined size.
A program counter, pc.

In this case, the registers are anonymous and general-purpose. The program counter is a special index that indicates the current executing instruction. The entire program is loaded as a list of instructions and is procedurally looped over until it issues a termination opcode. An opcode is a unique identifier for a particular instruction.

enum instr {
  NOP   = 0x00,
  LOAD  = 0x01,
  STORE = 0x02,
  ADD   = 0x03,
  SUB   = 0x04,
  MUL   = 0x05,
  DIV   = 0x06,
  JMP   = 0x07,
  JZ    = 0x08,
  JNZ   = 0x09,
  HALT  = 0x0A
};

Before sequential execution, the instructions are decoded by a separate method. The instruction format of our VM is based on encoding/decoding a uint32_t instruction.

A byte is 8 bits. An immediate value is a literal. A 32-bit unsigned integer has a byte range of $2^{8} - 1$ . Generally, a k-bit integer has a byte range of $2^{k} - 1$ . Left shifting $1$ by $p$ is equivalent to raising $2$ by $p$ , i.e. $1 ≪ p \equiv 2^{p}$ . Right shifting $p$ by 1 is equivalent to taking the base-2 logarithm of $p$ , i.e. $p ≫ 1 \equiv l o g_{2} p$ .

For this, the first 8 bits (most significant byte) represents the opcode, which is extracted with a right shift ( $32 - 24 = 8$ , representing the MSB). The next 8 bits represent the 1st register (masked with performing AND on 0xFF, i.e. $2^{8} - 1 = 255 = 0xFF$ , after right shifting $24 - 8 = 16$ ), the next 8 bits the second ( $16 - 8 = 8$ ) and the last remaining byte is reserved for an immediate value.

void execute_instr(struct vm *vm, uint32_t instr) {
  uint32_t opcode = instr >> 24;
  uint32_t r1 = (instr >> 16) & 0xFF;
  uint32_t r2 = (instr >> 8) & 0xFF;
  uint32_t r3 = instr & 0xFF;

  // sequential execution is based on decoded instructions (opcodes & registers)
}

In the same execute_instr definition, the unpacked instructions consisting of opcodes and registers are handled, with load/store depending on only two registers per instruction definition. LOAD copies from the memory into the register list by indexing with the registers in its parameters, STORE does the inverse.

switch (opcode) {
  case NOP:
    break;
  case LOAD:
    vm->registers[r1] = vm->memory[r2];
    break;
  case STORE:
    vm->memory[r2] = vm->registers[r1];
    break;
}

The arithmetic instructions are straightforward and will depend on an additional register to read from. The first register is used to store the result. The current implementation will allow both immediate addressing and register addressing since other modes rely on VM-specific pointers to compute the effective memory address (EA) based on offsets.

JMP will point the program counter to the specified register. JZ and JNZ check for equality with zero. This step is only useful when implementing routines that allow jumping based on flags, the common one being equality checks between immediates, registers, etc. In such a procedure, when the values are equal, a specific register is set to 0 (carry flag), implying a carry-over. It is also the case with unsigned integer addition/subtraction in actual assembly.

switch (opcode) {
  case JMP:
    vm->pc = r1;
    break;
  case JZ:
    if (vm->registers[r1] == 0) vm->pc = r2;
    break;
  case JNZ:
    if (vm->registers[r1] != 0) vm->pc = r2;
    break;
  case HALT:
    return;
  default:
    fprintf(stderr, "unknown opcode: %u\n", opcode);
    exit(1);
}

We can now write routines for loading a specific program into memory based on a VM instance. The program is ideally a const uint32_t array, trivially passable by value and immutable by definition.

Additionally, the memory field is looped over after being copied from the program as mentioned earlier on. This step is abstracted away, it usually serves as an entrypoint for additional functionality like customized addressing modes or optimizations like JIT compilation.

void load_program(struct vm *vm, const uint32_t *program, size_t program_size) {
  for (size_t i = 0; i < program_size; i++) {
    vm->memory[i] = program[i];
  }
}

void run(struct vm *vm) {
  while (1) {
    // index with n=sizeof(uint32_t) blocks
    uint32_t instr = vm->memory[vm->pc / 4];

    vm->pc += 4; // skip n=sizeof(uint32_t) blocks
    execute_instr(vm, instr);

    if ((instr >> 24) == HALT) break;
  }
}

Fixed-length instruction encoding is the method that is used in this VM, contrasting with CISC-esque variable-length operand specifications. This implies that given operations are only applicable based on the operands that are decoded from the corresponding instruction.

We can now run a simple program. This will require encoding instructions based on the reverse order of decoding, left shifting the top-most 8 bits, OR-ing with the middle and last 8 bits, forming a uint32_t. Normal C integers do not guarantee fixed-width length which is why we use stdint.h.

uint32_t program[] = {
  (LOAD  <<  24) | (0 << 16) | (10 << 8),
  (LOAD  <<  24) | (1 << 16) | (20 << 8),
  (MUL   <<  24) | (2 << 16) | (0 <<  8) | 1,
  (STORE <<  24) | (2 << 16) | (30 << 8),
  (HALT  <<  24)
};

int main() {
  struct vm vm = {.pc = 0};
  // initialize memory values at locations 10 and 20
  vm.memory[10] = 5;
  vm.memory[20] = 7;
  load_program(&vm, program, sizeof(program) / sizeof(program[0]));
  run(&vm);
  // display the result stored in memory[30]
  printf("memory @ address 0x30: %u\n", vm.memory[30]);
  return 0;
}

Inserting debug statements across the switch statement cases or printing the VM fields will provide decent introspection of the VM state, particularly the program counter.

The debug logs from the final source include traces which, in my opinion, should be the first item to set up when working with VMs/allocators in C/C++. It helps visualize alignment and other properties which can be easily frustrating.

memory (before load & run):
--------------------------
m0: 0 m1: 0 m2: 0 m3: 0 m4: 0 m5: 0 m6: 0 m7: 0 m8: 0 m9: 0 m10: 0 m11: 0 m12: 0 m13: 0 m14: 0 m15: 0 m16: 0 m17: 0 m18: 0 m19: 0 m20: 0 m21: 0 m22: 0 m23: 0 m24: 0 m25: 0 m26: 0 m27: 0 m28: 0 m29: 0 m30: 0 m31: 0

registers (before load & run):
-----------------------------
r0: 0 r1: 0 r2: 0 r3: 0 r4: 0 r5: 0 r6: 0 r7: 0 r8: 0 r9: 0 r10: 0 r11: 0 r12: 0 r13: 0 r14: 0 r15: 0 program loaded: 0x1000a00 @ memory idx [0]
program loaded: 0x1011400 @ memory idx [1]
program loaded: 0x5020001 @ memory idx [2]
program loaded: 0x2021e00 @ memory idx [3]
program loaded: 0xa000000 @ memory idx [4]
memory @ address 0x30: 35

memory (after load & run):
--------------------------
m0: 16779776 m1: 16847872 m2: 84017153 m3: 33693184 m4: 167772160 m5: 0 m6: 0 m7: 0 m8: 0 m9: 0 m10: 5 m11: 0 m12: 0 m13: 0 m14: 0 m15: 0 m16: 0 m17: 0 m18: 0 m19: 0 m20: 7 m21: 0 m22: 0 m23: 0 m24: 0 m25: 0 m26: 0 m27: 0 m28: 0 m29: 0 m30: 35 m31: 0

registers (after load & run):
-----------------------------
r0: 5 r1: 7 r2: 35 r3: 0 r4: 0 r5: 0 r6: 0 r7: 0 r8: 0 r9: 0 r10: 0 r11: 0 r12: 0 r13: 0 r14: 0 r15: 0

So far, the VM memory (after loading and running) contains garbage values which are actually the encoded representation of the user program. Leaking user programs into memory is dangerous.

Aside from flushing the memory, the current implementation relies entirely on uin32_t without delegating memory values to uint8_t pointers which are reasonably sized.

The key takeaway in the program definition is the actual (base-10) representation:

uint32_t program[] = {
  16779776,
  16847872,
  84017153,
  33693184,
  167772160,
};

This can alternatively be represented by converting the individual values to hexadecimal:

uint32_t program[] = {
  0x1000A00,
  0x1011400,
  0x5020001,
  0x2021E00,
  0xA000000,
};

It might appear as though no significant change is visible but with VMs where instructions are encoded in unsigned integers above 32-bits, the large range of encoded program instructions can be reduced to compact hexidecimal literals. In most cases, this is referred to as the bytecode of the virtual machine and allows for portability. Advanced VMs might reserve a bytecode representation similar, but vastly different, to assembly formats.

Some VMs allow streaming a program as a byte array over networks or other UNIX streamable interfaces, which is mostly common on the server-side or specialized programs like drivers.

Anecdote: The applicability of VMs, particularly in interpreters, is primarily for evaluation but serves the additional purpose of portability in servers since program representations can be packed in a compact form.