Endianness in the JVM bytecode layout
Ever disassembled a Java class file and tried to inspect its layout? The layout of the bytecode is fixed and ordered, and each method's code section is filled with JVM stack operations (e.g. OP_IMUL). A full disassembler is extremely complicated and in most cases impractical for consumers (except when interfacing). Instead, write the bytecode. Multi-byte values in a class file are stored in big-endian order, so the most significant byte always goes out first.
#include <cstdint>
#include <fstream>

// Writes a 16-bit value in big-endian order: most significant byte first.
void write_uint16_be(std::ofstream& stream, std::uint16_t value) {
    stream.put(static_cast<char>((value >> 8) & 0xFF)); // <-- MSB
    stream.put(static_cast<char>(value & 0xFF));        // <-- LSB
}

// Writes a 32-bit value in big-endian order, one byte per put() call.
void write_uint32_be(std::ofstream& stream, std::uint32_t value) {
    stream.put(static_cast<char>((value >> 24) & 0xFF)); // <-- MSB
    stream.put(static_cast<char>((value >> 16) & 0xFF));
    stream.put(static_cast<char>((value >> 8) & 0xFF));
    stream.put(static_cast<char>(value & 0xFF));         // <-- LSB
}
write_uint16_be encodes a 16-bit (u2) field by writing its most significant byte (MSB) and then its least significant byte (LSB), which is big-endian ordering. The encoding uses a right shift (>>) to move the higher-order byte into the low eight bits, followed by a bitwise AND (& 0xFF) to mask off any remaining higher bits.
For instance:
- Writing 0x1234 produces 0x12 0x34 in the stream, where 0x12 (MSB) is written first.
write_uint32_be encodes a 32-bit (u4) field using four sequential writes, starting from the most significant byte:
- Writing 0x12345678 produces the sequence 0x12 0x34 0x56 0x78.
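The two worked examples above can be checked in memory. The following is a minimal sketch, not the original code: it assumes the writers are generalized to take any std::ostream (so a std::ostringstream can capture the bytes), and the names put_u2_be, put_u4_be, to_hex and check_examples are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Same byte order as write_uint16_be, but accepting any std::ostream
// (an assumption made here so the output can be captured in memory).
void put_u2_be(std::ostream& out, std::uint16_t value) {
    out.put(static_cast<char>((value >> 8) & 0xFF));  // MSB first
    out.put(static_cast<char>(value & 0xFF));         // LSB last
}

// A u4 is written as two big-endian halves: upper 16 bits, then lower 16 bits.
void put_u4_be(std::ostream& out, std::uint32_t value) {
    put_u2_be(out, static_cast<std::uint16_t>(value >> 16));
    put_u2_be(out, static_cast<std::uint16_t>(value & 0xFFFF));
}

// Hypothetical helper: renders raw bytes as space-separated lowercase hex.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string text;
    for (unsigned char byte : bytes) {
        if (!text.empty()) text += ' ';
        text += digits[byte >> 4];
        text += digits[byte & 0x0F];
    }
    return text;
}

// Re-checks the two examples from the text.
void check_examples() {
    std::ostringstream u2;
    put_u2_be(u2, 0x1234);
    assert(to_hex(u2.str()) == "12 34");

    std::ostringstream u4;
    put_u4_be(u4, 0x12345678);
    assert(to_hex(u4.str()) == "12 34 56 78");
}
```

Because the bytes are produced by shifting and masking rather than by copying the integer's memory, the output does not depend on the byte order of the machine running the writer.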
Compare the following class file disassembly featuring invokedynamic, by Ben Evans (Java Magazine, Oracle). The disassembler renders the bytecode as text, which is far more intuitive for Java developers to work with.
public static void main(java.lang.String[]) throws java.lang.Exception;
Code:
0: invokedynamic #2, 0 // InvokeDynamic
// #0:run:()Ljava/lang/Runnable;
5: astore_1
6: new #3 // class java/lang/Thread
9: dup
10: aload_1
11: invokespecial #4 // Method java/lang/Thread."<init>":
// (Ljava/lang/Runnable;)V
14: astore_2
15: aload_2
16: invokevirtual #5 // Method java/lang/Thread.start:()V
19: aload_2
20: invokevirtual #6 // Method java/lang/Thread.join:()V
23: return
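To connect the listing back to raw bytes, here is a rough sketch that assembles its first four instructions into a byte vector. The opcode values are the ones defined by the JVM specification; the names Code, emit_u1, emit_u2_be and assemble_prefix are hypothetical and exist only for this illustration. The constant pool indices (#2, #3) are copied from the listing as-is.

```cpp
#include <cstdint>
#include <vector>

// Opcode values as defined by the JVM specification.
constexpr std::uint8_t OP_ASTORE_1      = 0x4c;
constexpr std::uint8_t OP_DUP           = 0x59;
constexpr std::uint8_t OP_INVOKEDYNAMIC = 0xba;
constexpr std::uint8_t OP_NEW           = 0xbb;

using Code = std::vector<std::uint8_t>;

void emit_u1(Code& code, std::uint8_t value) {
    code.push_back(value);
}

// Big-endian u2: high byte first, then low byte.
void emit_u2_be(Code& code, std::uint16_t value) {
    code.push_back(static_cast<std::uint8_t>(value >> 8));
    code.push_back(static_cast<std::uint8_t>(value & 0xFF));
}

// Assembles offsets 0..9 of the listing.
Code assemble_prefix() {
    Code code;
    // 0: invokedynamic #2, 0  (opcode, u2 index, two zero bytes = 5 bytes)
    emit_u1(code, OP_INVOKEDYNAMIC);
    emit_u2_be(code, 2);
    emit_u2_be(code, 0);
    // 5: astore_1  (no operands)
    emit_u1(code, OP_ASTORE_1);
    // 6: new #3  (opcode plus u2 index = 3 bytes)
    emit_u1(code, OP_NEW);
    emit_u2_be(code, 3);
    // 9: dup  (no operands)
    emit_u1(code, OP_DUP);
    return code;  // the next instruction (aload_1) would start at offset 10
}
```

The resulting vector holds ba 00 02 00 00 4c bb 00 03 59, and its length of 10 matches the offset the disassembler prints for aload_1.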
In the context of JVM bytecode, these encoding functions are needed for many fields:
- Constant pool entries: CONSTANT_Class and CONSTANT_NameAndType contain u2 indices into the constant pool.
- Attributes: The attribute_length field of Code or LineNumberTable attributes is represented as a u4.
- Instruction operands: ldc_w takes a 16-bit constant pool index as its operand, stored high byte first, as the class file format requires.
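The three cases in the list can be sketched with the same big-endian helpers. This is an illustrative outline only: the tag and opcode values (7, 12, 0x13) and the field order come from the JVM specification, while Bytes and every emit_* function name are hypothetical, and the index and length values in any call are placeholders rather than values from a real class file.

```cpp
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

void emit_u1(Bytes& out, std::uint8_t value) {
    out.push_back(value);
}

void emit_u2_be(Bytes& out, std::uint16_t value) {
    out.push_back(static_cast<std::uint8_t>(value >> 8));    // high byte
    out.push_back(static_cast<std::uint8_t>(value & 0xFF));  // low byte
}

void emit_u4_be(Bytes& out, std::uint32_t value) {
    emit_u2_be(out, static_cast<std::uint16_t>(value >> 16));
    emit_u2_be(out, static_cast<std::uint16_t>(value & 0xFFFF));
}

constexpr std::uint8_t TAG_CLASS         = 7;     // CONSTANT_Class
constexpr std::uint8_t TAG_NAME_AND_TYPE = 12;    // CONSTANT_NameAndType
constexpr std::uint8_t OP_LDC_W          = 0x13;  // ldc_w opcode

// CONSTANT_Class_info: u1 tag, u2 name_index.
void emit_constant_class(Bytes& out, std::uint16_t name_index) {
    emit_u1(out, TAG_CLASS);
    emit_u2_be(out, name_index);
}

// CONSTANT_NameAndType_info: u1 tag, u2 name_index, u2 descriptor_index.
void emit_name_and_type(Bytes& out, std::uint16_t name_index,
                        std::uint16_t descriptor_index) {
    emit_u1(out, TAG_NAME_AND_TYPE);
    emit_u2_be(out, name_index);
    emit_u2_be(out, descriptor_index);
}

// Common attribute header: u2 attribute_name_index, u4 attribute_length.
void emit_attribute_header(Bytes& out, std::uint16_t attribute_name_index,
                           std::uint32_t attribute_length) {
    emit_u2_be(out, attribute_name_index);
    emit_u4_be(out, attribute_length);
}

// ldc_w: opcode followed by a 16-bit constant pool index, high byte first.
void emit_ldc_w(Bytes& out, std::uint16_t constant_pool_index) {
    emit_u1(out, OP_LDC_W);
    emit_u2_be(out, constant_pool_index);
}
```

For example, calling emit_ldc_w with index 0x0102 appends the bytes 13 01 02, and emit_attribute_header with name index 9 and length 29 appends 00 09 00 00 00 1d.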