Intuition of JIT compilers with C and x86-64
If you have ever used an interpreted language, you're likely familiar with JIT compilers, or at least have heard of them.
Well, let's demystify this black box.
JIT compilation is a dynamic compilation technique: a subprogram within an engine synthesizes code at runtime by tracing frequently executed blocks, compiling them to machine code, and executing the compiled versions on subsequent calls.
The approach we'll use relies on hand-encoded x86-64 machine code that we copy into executable memory.
Consider a program where we need to re-run the factorial function n times. When it is compiled ahead of time, there is virtually no call overhead (perhaps besides setting up the stack frame). For interpreted languages, being able to run frequently used functions directly as machine code is a massive improvement since it skips the entire evaluation step.
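To make the contrast concrete, here is a toy bytecode interpreter for the same factorial; the opcodes are invented for illustration, and the point is just that every step pays a dispatch branch the machine-code version doesn't.
#include <stdint.h>

/* Hypothetical opcodes for a factorial loop. */
enum { OP_MUL_ACC, OP_DEC, OP_JNZ_BACK, OP_RET };

int interp_factorial(int n) {
    const uint8_t code[] = { OP_MUL_ACC, OP_DEC, OP_JNZ_BACK, OP_RET };
    int acc = 1, pc = 0;
    if (n == 0) return 1;                /* zero case, as in the assembly */
    for (;;) {
        switch (code[pc]) {              /* one dispatch branch per opcode */
        case OP_MUL_ACC:  acc *= n; pc++; break;
        case OP_DEC:      n--;      pc++; break;
        case OP_JNZ_BACK: pc = (n != 0) ? 0 : pc + 1; break;
        case OP_RET:      return acc;
        }
    }
}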
An issue with this is that the JIT compiler needs to be contextually aware of the running program as well as the target triple and architecture of the host device. AST nodes are lowered to encoding functions that emit whatever form of assembly or machine code is required.
In the case of engines, this means the bytecode-to-machine-code translator itself must be portable. There are numerous other edge cases, some of which we'll take a look at (e.g. profiling).
We can now define the factorial function in x64/C. I particularly like using the Defuse website, which can convert syntactically valid assembly to a shellcode representation.
const uint8_t x86_encoded_factorial[] = {
0x55, // push rbp
0x48, 0x89, 0xe5, // mov rbp, rsp
0x89, 0xf8, // mov eax, edi
0x83, 0xf8, 0x00, // cmp eax, 0
0x74, 0x11, // je zero_case
0xb9, 0x01, 0x00, 0x00, 0x00, // mov ecx, 1
0x48, 0x0f, 0xaf, 0xc8, // loop: imul rcx, rax
0xff, 0xc8, // dec eax
0x75, 0xf8, // jnz loop (rel8 = -8, back to the imul)
0x89, 0xc8, // mov eax, ecx
0x5d, // pop rbp
0xc3, // ret
0xb8, 0x01, 0x00, 0x00, 0x00, // zero_case: mov eax, 1
0x5d, // pop rbp
0xc3 // ret
};
Why use a const byte array? What about inline assembly?
Both live in the running program's address space, but with a byte array we can easily modify page permissions for read-write-execute policies.
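For contrast, the same function with GCC extended inline assembly (a sketch, assuming GCC/Clang syntax) is baked into the read-execute .text section at compile time, so its permissions can't be flipped at runtime the way an mmap'd buffer's can.
int factorial_inline(int n) {
    int result;
    __asm__ (
        "mov %1, %%ecx\n\t"        /* counter = n */
        "mov $1, %%eax\n\t"        /* accumulator = 1 */
        "test %%ecx, %%ecx\n\t"
        "jz 2f\n\t"                /* zero case: return 1 */
        "1: imul %%ecx, %%eax\n\t"
        "dec %%ecx\n\t"
        "jnz 1b\n\t"
        "2:"
        : "=a"(result)
        : "r"(n)
        : "ecx", "cc");
    return result;
}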
There is nothing new in the encoded factorial itself.
- push rbp and mov rbp, rsp define the function prologue: save the previous frame pointer and set up a new stack frame.
- mov eax, edi moves the input to eax.
- mov ecx, 1 is required since the output starts at 1.
- imul rcx, rax multiplies rcx (the accumulated output) by rax (the current value).
- dec eax decrements eax to progress toward the base case.
- mov eax, ecx moves the output to the return register.
Other instructions are pretty much self-explanatory (e.g. jumping to the zero case).
JIT function type #
The function type definition is very basic.
typedef int (*jit_func_t)(int);
Allocating executable memory #
Interestingly, we'll just allocate a buffer with mmap, sized to one page (4 kB), and later copy the byte array into it.
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Mapped writable first; made executable only after the code is copied. */
void *allocate_executable_memory(size_t size) {
  void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) {
    fprintf(stderr, "error allocating executable memory: %s\n",
            strerror(errno));
    exit(EXIT_FAILURE);
  }
  return mem;
}
If, for whatever reason, there is a need for read-only memory, the same permission flags can protect the region; unwanted access will then fault (a segfault) rather than silently succeed. At times the interpreter may be running 3rd-party source code on the host machine, so a real-world implementation will likely sandbox the JIT memory model.
Copying encoded assembly to a memory region #
Allocate a page of memory (aligned).
uint8_t *program_buffer = (uint8_t *)allocate_executable_memory(PAGE_SIZE);
Copy the factorial function code to the allocated buffer.
memcpy(program_buffer, x86_encoded_factorial, sizeof(x86_encoded_factorial));
Change memory protection to read and execute or panic.
if (mprotect(program_buffer, PAGE_SIZE, PROT_READ | PROT_EXEC) == -1) {
fprintf(stderr, "error setting memory protection: %s\n", strerror(errno));
exit(EXIT_FAILURE);
}
Cast the memory buffer to a function pointer.
jit_func_t factorial = (jit_func_t)program_buffer;
Memory at the program_buffer raw pointer represents a valid function, and the signature is compatible by definition.
Is this even possible?
In C, you can just do things.
Try this on a small set of integers and compare with an equivalent Fibonacci implementation.
for (int i = 0; i <= 10; i++) {
printf("fac(%d) = %d\n", i, factorial(i));
}
Always unmap memory regions or panic. The operating system will reclaim the mapping at process exit anyway, so this is mostly about hygiene in long-running programs.
if (munmap(program_buffer, PAGE_SIZE) == -1) {
fprintf(stderr, "error unmapping memory: %s\n", strerror(errno));
exit(EXIT_FAILURE);
}
The output up until this point is reliably valid.
fac(0) = 1
fac(1) = 1
fac(2) = 2
fac(3) = 6
fac(4) = 24
fac(5) = 120
fac(6) = 720
fac(7) = 5040
fac(8) = 40320
fac(9) = 362880
fac(10) = 3628800
I occasionally try compiling with -fno-stack-protector and -z execstack to check whether hardened toolchain defaults interfere with casting the buffer to a function pointer (they shouldn't here, since the page permissions come from mmap and mprotect rather than the stack).
Profiling hot path execution #
What usually happens is that JIT compilers visit AST nodes, scan for blocks, and inject custom tracing functions. Since we're writing this directly in C, there is no immediate access to an AST, but the idea can still be demonstrated concretely, albeit in an unnecessarily verbose way.
This particular heuristic is referred to as lightweight profiling.
It's actually very basic. Start by defining a profile.
#include <inttypes.h>

#define BLOCK_PROFILE(block) \
  static uint64_t block##_counter = 0; \
  ++block##_counter; \
  printf("%s called %" PRIu64 " times\n", #block, block##_counter);
Expanding the macro defines a static counter, increments it, and does some stub printing. It may be apparent that this approach could hoist the JIT-related functionality outside the existing function, but the procedure will always run at function entry, immediately before the function body executes.
Now we can define stub block-level statements, i.e. functions.
void func1() { BLOCK_PROFILE(func1) }
void func2() { BLOCK_PROFILE(func2) }
void func3() { BLOCK_PROFILE(func3) }
In the profiler's calling function, stress test the functions by calling them repeatedly.
for (int i = 0; i < 10; i++) {
func1();
if (i % 2 == 0) {
func2();
}
}
for (int i = 0; i < 5; i++) {
func3();
}
Normally, there would be a threshold that will determine whether a function should be compiled or ignored based on the counter.
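A minimal sketch of what that could look like; JIT_THRESHOLD and jit_compile_block are invented names, not a real API.
#define JIT_THRESHOLD 1000

#define BLOCK_PROFILE_JIT(block) \
  static uint64_t block##_counter = 0; \
  if (++block##_counter == JIT_THRESHOLD) { \
    jit_compile_block(#block); /* hypothetical: emit machine code for this block */ \
  }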
The stress test above should yield an output like this.
func1 called 1 times
func2 called 1 times
func1 called 2 times
func1 called 3 times
func2 called 2 times
func1 called 4 times
func1 called 5 times
func2 called 3 times
func1 called 6 times
func1 called 7 times
func2 called 4 times
func1 called 8 times
func1 called 9 times
func2 called 5 times
func1 called 10 times
func3 called 1 times
func3 called 2 times
func3 called 3 times
func3 called 4 times
func3 called 5 times
Profiling is not just limited to hot path detection, which is a relatively tiny aspect of it. There are other techniques for general profiling, e.g. edge-based profiling, which is quite common in heap profilers in the context of managing object lifetimes.
Cf. include/v8.h
/**
* HeapSnapshotEdge represents a directed connection between heap
* graph nodes: from retainers to retained nodes.
*/
class V8_EXPORT HeapGraphEdge {
public:
enum Type {
kContextVariable = 0, // A variable from a function context.
kElement = 1, // An element of an array.
kProperty = 2, // A named object property.
kInternal = 3, // A link that can't be accessed from JS,
// thus, its name isn't a real property name
// (e.g. parts of a ConsString).
kHidden = 4, // A link that is needed for proper sizes
// calculation, but may be hidden from user.
kShortcut = 5, // A link that must not be followed during
// sizes calculation.
kWeak = 6 // A weak reference (ignored by the GC).
};
/** Returns edge type (see HeapGraphEdge::Type). */
Type GetType() const;
/* ... */
};
Encoding AST nodes #
In case it hasn't been clear, instruction encoding for most assembly languages is challenging to implement, and full spec-compliance is not really feasible by hand. You could use a parser module and a secure 3rd-party assembler for basic JIT routines. It will still be a lot of work, since JIT compilers are expected to optimize programs beyond what static compilers can, thanks to runtime awareness and additional type-level context.
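As a sketch of the third-party assembler route, the Keystone engine exposes a C API for assembling text at runtime; the usage below is to the best of my recollection of that API, so check the project's headers.
#include <keystone/keystone.h>
#include <stdio.h>

int print_encoding(const char *asm_text) {
    ks_engine *ks;
    unsigned char *encoding;
    size_t size, count;

    if (ks_open(KS_ARCH_X86, KS_MODE_64, &ks) != KS_ERR_OK)
        return -1;

    /* Keystone allocates the output buffer for the encoded bytes. */
    if (ks_asm(ks, asm_text, 0, &encoding, &size, &count) != 0) {
        ks_close(ks);
        return -1;
    }

    for (size_t i = 0; i < size; i++)
        printf("0x%02x, ", encoding[i]);
    printf("\n");

    ks_free(encoding);
    ks_close(ks);
    return 0;
}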
Let's take the example of inline caching in V8.
Inline caching avoids the performance tax of type resolution and symbol lookup by caching execution metadata at call sites. There are different modes, but the most basic one is based on structural typing.
If every object seen at a call site exhibits the same structure, a single cached lookup applies, i.e. the site is monomorphic. Accesses involving objects of different structures upgrade the site to polymorphic (and eventually megamorphic) states. Calling with a monomorphic representation lets the IC save the property offset for the structure instead of walking the prototype chain on each access.
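To ground this in C, here is a minimal sketch of a shape-based cache at a single call site; Shape, Object, and the cache layout are all invented for illustration.
typedef struct Shape { int id; int x_offset; } Shape;
typedef struct Object { Shape *shape; char fields[32]; } Object;

/* Per-call-site inline cache: remembers the last shape and offset. */
typedef struct { Shape *cached_shape; int cached_offset; } CallSiteIC;

int load_x(CallSiteIC *ic, Object *obj) {
    if (ic->cached_shape == obj->shape) {
        /* monomorphic fast path: reuse the cached offset */
        return *(int *)(obj->fields + ic->cached_offset);
    }
    /* slow path: resolve the offset, then cache it for next time */
    ic->cached_shape = obj->shape;
    ic->cached_offset = obj->shape->x_offset;
    return *(int *)(obj->fields + ic->cached_offset);
}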
Why is this useful?
Hypothetically, the engine can load the property access from a fixed offset in the object.
mov rax, [rcx + 16] ; load with offset
Regular programs can follow an adaptive optimization approach because the JIT can inline the cached property offsets or method addresses directly into the machine code; if the assumption ever breaks, the engine can always deoptimize safely.
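In machine code, the inlined fast path is typically protected by a guard. A hypothetical sketch, assuming the shape id is stored at the object's base and 0x2a is the cached shape id:
cmp dword [rcx], 0x2a  ; guard: does the object still have the cached shape id?
jne deoptimize         ; bail out to the slower, generic path
mov rax, [rcx + 16]    ; fast path: load with the cached offset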
AST nodes will usually use a high-level encoder or assembler.
/* __x86_encode, __x86_MAX_INSTR_SIZE, __x86_reg and __JIT__ are
 * placeholders for a hypothetical assembler interface. */
void emit_mov(const char *dest, int imm) {
  char buffer[__x86_MAX_INSTR_SIZE];
  snprintf(buffer, sizeof(buffer), "mov %s, %d", dest, imm);
  __x86_encode(buffer);
}

void emit_add(const char *dest, int imm) {
  char buffer[__x86_MAX_INSTR_SIZE];
  snprintf(buffer, sizeof(buffer), "add %s, %d", dest, imm);
  __x86_encode(buffer);
}

__JIT__ int add(int a, int b) {
  BLOCK_PROFILE(add)
  return a + b;
}

/* Lowering for add: move the first operand into a scratch register,
 * add the second; the result lives in __x86_reg. */
void jit_lower_add(int a, int b) {
  emit_mov(__x86_reg, a);
  emit_add(__x86_reg, b);
}
Register allocation of __x86_reg is managed by the assembler with a graph coloring algorithm, and the entire add function can undergo JIT compilation.
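As a closing aside, a toy version of the graph coloring idea: virtual registers are nodes, simultaneous liveness forms edges, and physical registers are colors. This is a greedy sketch with invented data; real allocators add spilling and coalescing.
#define NUM_VREGS 4
#define NUM_PREGS 2

/* interference[i][j] = 1 when vregs i and j are live at the same time */
static const int interference[NUM_VREGS][NUM_VREGS] = {
    {0, 1, 0, 0},
    {1, 0, 1, 0},
    {0, 1, 0, 1},
    {0, 0, 1, 0},
};

/* Greedily assign each vreg the lowest physical register not used by an
 * interfering neighbor; returns the first vreg that would need a spill,
 * or -1 if every vreg got a register. */
int assign_registers(int color[NUM_VREGS]) {
    for (int v = 0; v < NUM_VREGS; v++) {
        int used[NUM_PREGS] = {0};
        for (int u = 0; u < v; u++)
            if (interference[v][u] && color[u] >= 0)
                used[color[u]] = 1;
        color[v] = -1;
        for (int p = 0; p < NUM_PREGS; p++)
            if (!used[p]) { color[v] = p; break; }
        if (color[v] < 0)
            return v;  /* would need to spill vreg v */
    }
    return -1;
}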