Intuition of JIT compilers with C and x86-64
If you have ever used an interpreted language, you're likely familiar with JIT compilers, or at least have heard of them.
Well, let's demystify this black box.
JIT compilation is a dynamic compilation technique: a subprogram within an engine synthesizes code at runtime by tracing frequently executed blocks, compiling them to machine code, and executing the compiled versions on subsequent calls.
The approach we'll use relies on hand-encoded x86-64 machine code that we copy into executable memory.
Consider a program where we need to re-run the factorial function n times. When it is compiled ahead of time, there is virtually no call overhead (perhaps besides setting up the stack frame). For interpreted languages, being able to run frequently used functions directly as machine code is a massive improvement since it skips the entire evaluation step.
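To make the contrast concrete, here is a toy bytecode interpreter for the same factorial; the opcodes are invented for illustration, and the point is just that every step pays a dispatch branch the machine-code version doesn't.
#include <stdint.h>

/* Hypothetical opcodes for a factorial loop. */
enum { OP_MUL_ACC, OP_DEC, OP_JNZ_BACK, OP_RET };

int interp_factorial(int n) {
    const uint8_t code[] = { OP_MUL_ACC, OP_DEC, OP_JNZ_BACK, OP_RET };
    int acc = 1, pc = 0;
    if (n == 0) return 1;                /* zero case, as in the assembly */
    for (;;) {
        switch (code[pc]) {              /* one dispatch branch per opcode */
        case OP_MUL_ACC:  acc *= n; pc++; break;
        case OP_DEC:      n--;      pc++; break;
        case OP_JNZ_BACK: pc = (n != 0) ? 0 : pc + 1; break;
        case OP_RET:      return acc;
        }
    }
}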
An issue with this is that the JIT compiler needs to be contextually aware of the running program as well as the target triple and architecture of the host device. AST nodes are lowered to encoding functions that emit whatever form of assembly or machine code is required.
In the case of engines, this means the bytecode-to-machine-code translator itself must be portable. There are numerous other edge cases, some of which we'll take a look at (e.g. profiling).
We can now define the factorial function in x64/C. I particularly like using the Defuse website, which can convert syntactically valid assembly to a shellcode representation.
const uint8_t x86_encoded_factorial[] = {
0x55, // push rbp
0x48, 0x89, 0xe5, // mov rbp, rsp
0x89, 0xf8, // mov eax, edi
0x83, 0xf8, 0x00, // cmp eax, 0
0x74, 0x11, // je zero_case
0xb9, 0x01, 0x00, 0x00, 0x00, // mov ecx, 1
0x48, 0x0f, 0xaf, 0xc8, // loop: imul rcx, rax
0xff, 0xc8, // dec eax
0x75, 0xf8, // jnz loop (rel8 = -8, back to the imul)
0x89, 0xc8, // mov eax, ecx
0x5d, // pop rbp
0xc3, // ret
0xb8, 0x01, 0x00, 0x00, 0x00, // zero_case: mov eax, 1
0x5d, // pop rbp
0xc3 // ret
};
Why use a const byte array? What about inline assembly?
Both live in the running program's address space, but with a byte array we can easily modify page permissions for read-write-execute policies.
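For contrast, the same function with GCC extended inline assembly (a sketch, assuming GCC/Clang syntax) is baked into the read-execute .text section at compile time, so its permissions can't be flipped at runtime the way an mmap'd buffer's can.
int factorial_inline(int n) {
    int result;
    __asm__ (
        "mov %1, %%ecx\n\t"        /* counter = n */
        "mov $1, %%eax\n\t"        /* accumulator = 1 */
        "test %%ecx, %%ecx\n\t"
        "jz 2f\n\t"                /* zero case: return 1 */
        "1: imul %%ecx, %%eax\n\t"
        "dec %%ecx\n\t"
        "jnz 1b\n\t"
        "2:"
        : "=a"(result)
        : "r"(n)
        : "ecx", "cc");
    return result;
}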
There is nothing new in the encoded factorial itself.
- push rbp and mov rbp, rsp define the function prologue: save the previous frame pointer and set up a new stack frame.
- mov eax, edi moves the input to eax.
- mov ecx, 1 is required since the output starts at 1.
- imul rcx, rax multiplies rcx (the accumulated output) by rax (the current value).
- dec eax decrements eax to progress toward the base case.
- mov eax, ecx moves the output to the return register.
Other instructions are pretty much self-explanatory (e.g. jumping to the zero case).
JIT function type #
The function type definition is very basic.
typedef int (*jit_func_t)(int);
Allocating executable memory #
Interestingly, we'll just allocate a buffer with mmap, sized to one page (4 kB), and later copy the byte array into it.
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Mapped writable first; made executable only after the code is copied. */
void *allocate_executable_memory(size_t size) {
  void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) {
    fprintf(stderr, "error allocating executable memory: %s\n",
            strerror(errno));
    exit(EXIT_FAILURE);
  }
  return mem;
}
If, for whatever reason, there is a need for read-only memory, the same permission flags can protect the region; unwanted access will then fault (a segfault) rather than silently succeed. At times the interpreter may be running 3rd-party source code on the host machine, so a real-world implementation will likely sandbox the JIT memory model.
Copying encoded assembly to a memory region #
Allocate a page of memory (aligned).
uint8_t *program_buffer = (uint8_t *)allocate_executable_memory(PAGE_SIZE);
Copy the factorial function code to the allocated buffer.
memcpy(program_buffer, x86_encoded_factorial, sizeof(x86_encoded_factorial));
Change memory protection to read and execute or panic.
if (mprotect(program_buffer, PAGE_SIZE, PROT_READ | PROT_EXEC) == -1) {
fprintf(stderr, "error setting memory protection: %s\n", strerror(errno));
exit(EXIT_FAILURE);
}
Cast the memory buffer to a function pointer.
jit_func_t factorial = (jit_func_t)program_buffer;
Memory at the program_buffer raw pointer represents a valid function, and the signature is compatible by definition.
Is this even possible?
In C, you can just do things.
Try this on a small set of integers and compare with an equivalent Fibonacci implementation.
for (int i = 0; i <= 10; i++) {
printf("fac(%d) = %d\n", i, factorial(i));
}
Always unmap memory regions or panic. The operating system will reclaim the mapping at process exit anyway, so this is mostly about hygiene in long-running programs.
if (munmap(program_buffer, PAGE_SIZE) == -1) {
fprintf(stderr, "error unmapping memory: %s\n", strerror(errno));
exit(EXIT_FAILURE);
}
The output up until this point is reliably valid.
fac(0) = 1
fac(1) = 1
fac(2) = 2
fac(3) = 6
fac(4) = 24
fac(5) = 120
fac(6) = 720
fac(7) = 5040
fac(8) = 40320
fac(9) = 362880
fac(10) = 3628800
I occasionally try compiling with -fno-stack-protector and -z execstack to check whether hardened toolchain defaults interfere with casting the buffer to a function pointer (they shouldn't here, since the page permissions come from mmap and mprotect rather than the stack).
Profiling hot path execution #
What usually happens is that JIT compilers visit AST nodes, scan for blocks, and inject custom tracing functions. Since we're writing this directly in C, there is no immediate access to an AST, but the idea can still be demonstrated concretely, albeit in an unnecessarily verbose way.
This particular heuristic is referred to as lightweight profiling.
It's actually very basic. Start by defining a profile.
#include <inttypes.h>

#define BLOCK_PROFILE(block) \
  static uint64_t block##_counter = 0; \
  ++block##_counter; \
  printf("%s called %" PRIu64 " times\n", #block, block##_counter);
Expanding the macro defines a static counter, increments it, and does some stub printing. It may be apparent that this approach could hoist the JIT-related functionality outside the existing function, but the procedure will always run at function entry, immediately before the function body executes.
Now we can define stub block-level statements, i.e. functions.
void func1() { BLOCK_PROFILE(func1) }
void func2() { BLOCK_PROFILE(func2) }
void func3() { BLOCK_PROFILE(func3) }
In the profiler's calling function, stress test the functions by calling them repeatedly.
for (int i = 0; i < 10; i++) {
func1();
if (i % 2 == 0) {
func2();
}
}
for (int i = 0; i < 5; i++) {
func3();
}
Normally, there would be a threshold that will determine whether a function should be compiled or ignored based on the counter.
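A minimal sketch of what that could look like; JIT_THRESHOLD and jit_compile_block are invented names, not a real API.
#define JIT_THRESHOLD 1000

#define BLOCK_PROFILE_JIT(block) \
  static uint64_t block##_counter = 0; \
  if (++block##_counter == JIT_THRESHOLD) { \
    jit_compile_block(#block); /* hypothetical: emit machine code for this block */ \
  }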
The stress test above should yield an output like this.
func1 called 1 times
func2 called 1 times
func1 called 2 times
func1 called 3 times
func2 called 2 times
func1 called 4 times
func1 called 5 times
func2 called 3 times
func1 called 6 times
func1 called 7 times
func2 called 4 times
func1 called 8 times
func1 called 9 times
func2 called 5 times
func1 called 10 times
func3 called 1 times
func3 called 2 times
func3 called 3 times
func3 called 4 times
func3 called 5 times
Profiling is not just limited to hot path detection, which is a relatively tiny aspect of it. There are other techniques for general profiling, e.g. edge-based profiling, which is quite common in heap profilers in the context of managing object lifetimes.
Cf. include/v8.h
/**
* HeapSnapshotEdge represents a directed connection between heap
* graph nodes: from retainers to retained nodes.
*/
class V8_EXPORT HeapGraphEdge {
public:
enum Type {
kContextVariable = 0, // A variable from a function context.
kElement = 1, // An element of an array.
kProperty = 2, // A named object property.
kInternal = 3, // A link that can't be accessed from JS,
// thus, its name isn't a real property name
// (e.g. parts of a ConsString).
kHidden = 4, // A link that is needed for proper sizes
// calculation, but may be hidden from user.
kShortcut = 5, // A link that must not be followed during
// sizes calculation.
kWeak = 6 // A weak reference (ignored by the GC).
};
/** Returns edge type (see HeapGraphEdge::Type). */
Type GetType() const;
/* ... */
};
Encoding AST nodes #
In case it hasn't been clear, instruction encoding for most assembly languages is challenging to implement, and full spec-compliance is not really feasible by hand. You could use a parser module and a secure 3rd-party assembler for basic JIT routines. It will still be a lot of work, since JIT compilers are expected to optimize programs beyond what static compilers can, thanks to runtime awareness and additional type-level context.
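As a sketch of the third-party assembler route, the Keystone engine exposes a C API for assembling text at runtime; the usage below is to the best of my recollection of that API, so check the project's headers.
#include <keystone/keystone.h>
#include <stdio.h>

int print_encoding(const char *asm_text) {
    ks_engine *ks;
    unsigned char *encoding;
    size_t size, count;

    if (ks_open(KS_ARCH_X86, KS_MODE_64, &ks) != KS_ERR_OK)
        return -1;

    /* Keystone allocates the output buffer for the encoded bytes. */
    if (ks_asm(ks, asm_text, 0, &encoding, &size, &count) != 0) {
        ks_close(ks);
        return -1;
    }

    for (size_t i = 0; i < size; i++)
        printf("0x%02x, ", encoding[i]);
    printf("\n");

    ks_free(encoding);
    ks_close(ks);
    return 0;
}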
Let's take the example of inline caching in V8.
Inline caching avoids the performance tax of type resolution and symbol lookup by caching execution metadata at call sites. There are different modes, but the most basic one is based on structural typing.
If every object seen at a call site exhibits the same structure, a single cached lookup applies, i.e. the site is monomorphic. Accesses involving objects of different structures upgrade the site to polymorphic (and eventually megamorphic) states. Calling with a monomorphic representation lets the IC save the property offset for the structure instead of walking the prototype chain on each access.
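To ground this in C, here is a minimal sketch of a shape-based cache at a single call site; Shape, Object, and the cache layout are all invented for illustration.
typedef struct Shape { int id; int x_offset; } Shape;
typedef struct Object { Shape *shape; char fields[32]; } Object;

/* Per-call-site inline cache: remembers the last shape and offset. */
typedef struct { Shape *cached_shape; int cached_offset; } CallSiteIC;

int load_x(CallSiteIC *ic, Object *obj) {
    if (ic->cached_shape == obj->shape) {
        /* monomorphic fast path: reuse the cached offset */
        return *(int *)(obj->fields + ic->cached_offset);
    }
    /* slow path: resolve the offset, then cache it for next time */
    ic->cached_shape = obj->shape;
    ic->cached_offset = obj->shape->x_offset;
    return *(int *)(obj->fields + ic->cached_offset);
}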
Why is this useful?
Hypothetically, the engine can load the property access from a fixed offset in the object.
mov rax, [rcx + 16] ; load with offset
Regular programs can follow an adaptive optimization approach because the JIT can inline the cached property offsets or method addresses directly into the machine code; if the assumption ever breaks, the engine can always deoptimize safely.
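In machine code, the inlined fast path is typically protected by a guard. A hypothetical sketch, assuming the shape id is stored at the object's base and 0x2a is the cached shape id:
cmp dword [rcx], 0x2a  ; guard: does the object still have the cached shape id?
jne deoptimize         ; bail out to the slower, generic path
mov rax, [rcx + 16]    ; fast path: load with the cached offset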
AST nodes will usually use a high-level encoder or assembler.
/* __x86_encode, __x86_MAX_INSTR_SIZE, __x86_reg and __JIT__ are
 * placeholders for a hypothetical assembler interface. */
void emit_mov(const char *dest, int imm) {
  char buffer[__x86_MAX_INSTR_SIZE];
  snprintf(buffer, sizeof(buffer), "mov %s, %d", dest, imm);
  __x86_encode(buffer);
}

void emit_add(const char *dest, int imm) {
  char buffer[__x86_MAX_INSTR_SIZE];
  snprintf(buffer, sizeof(buffer), "add %s, %d", dest, imm);
  __x86_encode(buffer);
}

__JIT__ int add(int a, int b) {
  BLOCK_PROFILE(add)
  return a + b;
}

/* Lowering for add: move the first operand into a scratch register,
 * add the second; the result lives in __x86_reg. */
void jit_lower_add(int a, int b) {
  emit_mov(__x86_reg, a);
  emit_add(__x86_reg, b);
}
Register allocation of __x86_reg is managed by the assembler with a graph coloring algorithm, and the entire add function can undergo JIT compilation.
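As a closing aside, a toy version of the graph coloring idea: virtual registers are nodes, simultaneous liveness forms edges, and physical registers are colors. This is a greedy sketch with invented data; real allocators add spilling and coalescing.
#define NUM_VREGS 4
#define NUM_PREGS 2

/* interference[i][j] = 1 when vregs i and j are live at the same time */
static const int interference[NUM_VREGS][NUM_VREGS] = {
    {0, 1, 0, 0},
    {1, 0, 1, 0},
    {0, 1, 0, 1},
    {0, 0, 1, 0},
};

/* Greedily assign each vreg the lowest physical register not used by an
 * interfering neighbor; returns the first vreg that would need a spill,
 * or -1 if every vreg got a register. */
int assign_registers(int color[NUM_VREGS]) {
    for (int v = 0; v < NUM_VREGS; v++) {
        int used[NUM_PREGS] = {0};
        for (int u = 0; u < v; u++)
            if (interference[v][u] && color[u] >= 0)
                used[color[u]] = 1;
        color[v] = -1;
        for (int p = 0; p < NUM_PREGS; p++)
            if (!used[p]) { color[v] = p; break; }
        if (color[v] < 0)
            return v;  /* would need to spill vreg v */
    }
    return -1;
}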