elric neumann

Padding and alignment in structs

Modern processors are designed to read memory in fixed-size chunks, matching the word size of the architecture. When data is properly aligned, the processor can fetch it in a single operation. Misaligned access, on the other hand, often requires multiple memory operations, leading to a performance penalty or, in some cases, hardware exceptions.

The relationship between memory alignment and cache performance adds another layer of complexity. Modern CPU caches operate on cache lines, usually 64 bytes (e.g., Intel and AMD x86 parts and many ARM cores), 128 bytes (e.g., Apple's M-series chips, NEC's SX-Aurora TSUBASA Vector Engine), or even 256 bytes (Fujitsu's A64FX, the processor behind the Fugaku supercomputer) in size.
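
Where the standard library provides it, C++17 exposes the cache-line size the compiler assumes through std::hardware_destructive_interference_size. It is a compile-time constant rather than a runtime probe, so it may differ from the CPU the binary actually runs on; still, it is a reasonable sketch of a starting point for alignment decisions.

#include <cstdio>
#include <new>

int main() {
  // Compile-time hint for the destructive-interference (cache line) size
  // assumed by the implementation; may differ from the actual hardware.
  std::printf("assumed cache line size: %zu bytes\n",
              std::hardware_destructive_interference_size);
}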

Let $M$ represent the set of all valid memory addresses in a system. Each address $m \in M$ is a non-negative integer in the range $[0, 2^n - 1]$, where $n$ is the address bus width.

For a data type $T$, we define its alignment requirement $\alpha(T)$ as:

$\alpha(T) = 2^k \quad \text{where } k \in \mathbb{N}_0$

The alignment constraint for a memory address $m$ storing data of type $T$ is expressed as

$m \bmod \alpha(T) = 0$
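
As a minimal runtime check (the helper below is my own, not a standard API), $\alpha(T)$ maps to alignof(T) in C++ and the constraint is just a modulo test on the address:

#include <cstdint>
#include <cstdio>

// True if p satisfies T's alignment requirement, i.e. the address is a
// multiple of alignof(T).
template <typename T>
bool is_aligned_for(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

int main() {
  double d = 0.0;
  alignas(alignof(int)) char buf[16];
  std::printf("&d aligned for double: %d\n", is_aligned_for<double>(&d));
  std::printf("buf + 1 aligned for int: %d\n", is_aligned_for<int>(buf + 1));
}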

Alignment follows a set of well-defined rules that vary based on data types and architecture specifications. In most modern architectures, primitive data types must be aligned to addresses that are multiples of their sizes. For instance, 4-byte integers typically align on 4-byte boundaries, while 8-byte doubles align on 8-byte boundaries. This requirement leads to the introduction of padding bytes in structures, where the compiler inserts unused bytes to ensure proper alignment of subsequent members.

Consider a basic structure containing members of different sizes.

struct basic_struct {
    char a;
    int b;
    char c;
};

struct packed_struct {
    char a;
    int b;
    char c;
} __attribute__((packed));

In the above example, basic_struct may occupy more memory than the sum of its member sizes due to padding, while packed_struct forces the compiler to pack members tightly together, typically sacrificing some access performance to save memory. On a typical ABI, the memory layout of basic_struct includes padding bytes after 'a' (so that 'b' lands on its alignment boundary) and after 'c' (so that the structure size is a multiple of its alignment).
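
To make that concrete, here is a sketch of how the two layouts typically come out on a mainstream 64-bit ABI with a 4-byte, 4-byte-aligned int; the exact values are implementation-defined, so treat the numbers as assumptions rather than guarantees.

#include <cstddef>

struct basic_struct  { char a; int b; char c; };
struct packed_struct { char a; int b; char c; } __attribute__((packed));

// b is pushed to offset 4 by 3 padding bytes after a, and 3 trailing padding
// bytes after c round the size up to a multiple of alignof(int).
static_assert(offsetof(basic_struct, b) == 4, "padding after a");
static_assert(sizeof(basic_struct) == 12, "6 payload bytes + 6 padding bytes");

// The packed variant drops all padding.
static_assert(sizeof(packed_struct) == 6, "no padding");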

The size and placement of padding bytes depend on multiple factors, including the order of members, each member's alignment requirement, the target architecture's ABI, and any packing attributes or pragmas applied to the structure.

Understanding these principles becomes particularly relevant when dealing with arrays of structures, where padding can significantly impact memory usage and cache efficiency.

Consider an array of basic_struct: each element must be aligned according to the most stringent alignment requirement of any member, so the per-element padding is replicated across the entire array and can add up to substantial memory overhead.
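
A quick compile-time check of that overhead, under the same assumed 12-byte layout (implementation-defined, so the figure is an assumption):

#include <cstddef>

struct basic_struct { char a; int b; char c; };

// 1000 elements carry 6000 bytes of payload but occupy 12000 bytes:
// the padding is replicated for every element of the array.
static_assert(sizeof(basic_struct[1000]) == 12 * 1000,
              "half of the array is padding on this layout");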

Modern processors rely on memory subsystems with multi-level cache hierarchies and complex prefetching logic. These hardware features work best when data is properly aligned, which is why structure layout deserves deliberate design rather than indiscriminately throwing packing pragmas at the data.

For a structure $S$ with members $m_1, m_2, \ldots, m_n$ of types $T_1, T_2, \ldots, T_n$ respectively, we define:

$\text{size}(S) = \text{pad}(o_n + \text{size}(T_n),\ \alpha(S))$

where:

  - $\text{pad}(x, a) = a \cdot \lceil x / a \rceil$ rounds $x$ up to the nearest multiple of $a$,
  - $o_n$ is the offset of the last member and $T_n$ its type, and
  - $\alpha(S)$ is the alignment requirement of the structure (defined below).

The offset $o_i$ of member $i$ is recursively defined as:

$o_1 = 0, \qquad o_i = \text{pad}(o_{i-1} + \text{size}(T_{i-1}),\ \alpha(T_i)) \quad \text{for } i > 1$

The padding required after member $i$ is given by:

$p_i = o_{i+1} - (o_i + \text{size}(T_i))$

The alignment of $S$ is the strictest (largest) alignment requirement among its members: $\alpha(S) = \max_i \alpha(T_i)$.
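
Applying these definitions to basic_struct from earlier, and assuming $\text{size}(\text{char}) = 1$, $\text{size}(\text{int}) = 4$, $\alpha(\text{int}) = 4$:

$o_1 = 0, \quad o_2 = \text{pad}(0 + 1,\ 4) = 4, \quad o_3 = \text{pad}(4 + 4,\ 1) = 8$

$\text{size}(S) = \text{pad}(8 + 1,\ 4) = 12, \qquad \alpha(S) = 4$

so 6 bytes of payload end up occupying 12 bytes.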

Consider a more complex structure that demonstrates how alignment requirements cascade through nested structures:

struct nested_struct {
    char a;
    struct {
        int x;
        char y;
        double z;
    } nested;
    short c;
    int d;
};

The compiler must consider not only the alignment requirements of individual members but also the alignment requirement of the nested structure as a whole. This can lead to surprising amounts of padding, especially when structures contain members with different alignment requirements.

The start of the structure must be aligned to the strictest requirement of any of its members, and each member must in turn be placed at an offset satisfying its own requirement. As mentioned above, the nested structure's alignment requirement is determined by its most strictly aligned member; that requirement then dictates the placement of the entire nested structure within the containing structure, introducing additional padding.
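
Here is a sketch of how this typically plays out on an x86-64 / LP64 ABI; the offsets are implementation-defined, so the values asserted below are assumptions rather than guarantees. The inner structure has alignment 8 because of z, which pulls the whole nested member to offset 8 and leaves 7 padding bytes after a.

#include <cstddef>

struct nested_struct {
    char a;
    struct inner {
        int x;
        char y;
        double z;
    } nested;   // inner layout: x@0, y@4, 3 bytes padding, z@8 -> 16 bytes, align 8
    short c;
    int d;
};

// Typical LP64 layout: a@0, 7 bytes padding, nested@8..23, c@24,
// 2 bytes padding, d@28 -> total 32 bytes, alignment 8.
static_assert(offsetof(nested_struct, nested) == 8, "7 padding bytes after a");
static_assert(offsetof(nested_struct, d) == 28, "2 padding bytes after c");
static_assert(sizeof(nested_struct) == 32, "the inner alignment cascades outward");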

For a cache line size $L$, the cache line crossing penalty function $\phi(m, s)$ for accessing $s$ bytes starting at address $m$ is:

$\phi(m, s) = \left\lceil \frac{m + s}{L} \right\rceil - \left\lfloor \frac{m}{L} \right\rfloor$

A structure $S$ is optimally cache-aligned if:

$\forall m \in M : (m \bmod L = 0) \Rightarrow \phi(m, \text{size}(S)) = 1$
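
For example, with $L = 64$, a 48-byte object placed on a line boundary gives $\phi(0, 48) = \lceil 48/64 \rceil - 0 = 1$, while the same object starting at offset 32 into a line spans two lines: $\phi(32, 48) = \lceil 80/64 \rceil - 0 = 2$.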

The LLVM compiler infrastructure provides an implementation of structure layout calculations that serves as a good reference for understanding how modern compilers handle these challenges.

LLVM's approach involves tracking both size and alignment requirements throughout compilation.

class StructLayout final : public TrailingObjects<StructLayout, TypeSize> {
  TypeSize StructSize;
  Align StructAlignment;
  unsigned IsPadded : 1;
  unsigned NumElements : 31;

public:
  TypeSize getSizeInBytes() const { return StructSize; }

  TypeSize getSizeInBits() const { return 8 * StructSize; }

  Align getAlignment() const { return StructAlignment; }

  bool hasPadding() const { return IsPadded; }

  /// Given a valid byte offset into the structure, returns the structure
  /// index that contains it.
  unsigned getElementContainingOffset(uint64_t FixedOffset) const;

  MutableArrayRef<TypeSize> getMemberOffsets() {
    return llvm::MutableArrayRef(getTrailingObjects<TypeSize>(), NumElements);
  }

  ArrayRef<TypeSize> getMemberOffsets() const {
    return llvm::ArrayRef(getTrailingObjects<TypeSize>(), NumElements);
  }

  TypeSize getElementOffset(unsigned Idx) const {
    assert(Idx < NumElements && "Invalid element idx!");
    return getMemberOffsets()[Idx];
  }

  TypeSize getElementOffsetInBits(unsigned Idx) const {
    return getElementOffset(Idx) * 8;
  }
};

There are a few things to keep in mind here:

  1. The running size (StructSize) is adjusted to maintain alignment
  2. Each member's alignment requirements affect both its own placement and the overall structure alignment
  3. The final structure size must be rounded up to satisfy the structure's overall alignment requirements
  4. Member offsets are calculated and stored for use during code generation
  5. hasPadding does not consider nested elements, which are visited separately (?)

If we have a basic struct with a few fields, the IR code is as below (clang -S -emit-llvm).

%struct.A = type { i8, i32, double }

@__const.main.a = private unnamed_addr constant %struct.A { i8 1, i32 2, double 3.000000e+00 }, align 8

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = alloca %struct.A, align 8
  store i32 0, ptr %1, align 4
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %2, ptr align 8 @__const.main.a, i64 16, i1 false)
  ret i32 0
}
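
For context, a C translation unit along these lines (my reconstruction, not the post's original source) produces essentially that IR at -O0:

struct A {
  char a;    // i8
  int b;     // i32
  double c;  // double
};

int main(void) {
  // An initialized aggregate: clang materializes the constant
  // @__const.main.a and copies it into the stack slot with llvm.memcpy,
  // padding bytes included.
  struct A a = {1, 2, 3.0};
  (void)a;
  return 0;
}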

The struct %struct.A contains:

  - an i8 (1 byte, 1-byte alignment),
  - an i32 (4 bytes, 4-byte alignment), and
  - a double (8 bytes, 8-byte alignment).

Each field in a struct must align to its natural alignment, and the overall struct alignment is determined by the largest alignment requirement among its fields.

The overall struct alignment will be 8 bytes, because double (the largest field) requires 8-byte alignment.

But padding bytes are added where necessary. Therefore:

  - 3 bytes of padding follow the i8 so that the i32 starts at offset 4, and
  - the double already lands on offset 8, so no padding is needed before it (and none after, since 16 is a multiple of 8).

Thus, the total size of %struct.A is:

1 (i8) + 3 (padding) + 4 (i32) + 8 (double) = 16 bytes.

Both the global constant @__const.main.a and the stack-allocated struct %2 will have the same layout. The llvm.memcpy intrinsic copies 16 bytes, which includes both the actual data and the padding—which guarantees that the in-memory layout of the copied struct matches the original.

Beyond the basic layout algorithm, there are several structure-layout optimization strategies that LLVM can apply, depending on specific factors.

Optional optimizations can break ABI compatibility, which may or may not matter for a given project; if you need behavior beyond the defaults, you would have to write a custom pass. As such, the actual implementation carries considerable complexity to handle edge cases while conforming to cross-platform ABI requirements.


Structure padding becomes even more complex when considering cross-platform development. Different architectures may have different alignment requirements, and data structures that work well on one platform may perform poorly on another.

This becomes particularly important when developing programs that need to run fast across multiple platforms or when dealing with serialized data structures.

Consider the following structures and their layout implications across different platforms:

struct messed_up_cross_platform_struct {
    int16_t a;
    int32_t b;
    int64_t c;
    int8_t  d;
};

struct aligned_cross_platform_struct {
    int64_t c;
    int32_t b;
    int16_t a;
    int8_t  d;
};

struct platform_specific_requirements {
    void* ptr;
    int32_t val;
    int16_t short_val;
    char flag;
};

The first structure might seem natural, but it can lead to different amounts of padding on different platforms. The second structure is arranged to minimize padding on most platforms by ordering members from largest to smallest alignment requirements. The third structure demonstrates how pointer sizes affect layout across different architectures.
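
Under a common LP64 ABI (an assumption; other ABIs will differ), the reordering alone shrinks the structure by a third:

#include <cstddef>
#include <cstdint>

struct messed_up_cross_platform_struct {
    int16_t a;
    int32_t b;
    int64_t c;
    int8_t  d;
};

struct aligned_cross_platform_struct {
    int64_t c;
    int32_t b;
    int16_t a;
    int8_t  d;
};

// a@0, 2 bytes padding, b@4, c@8, d@16, 7 trailing padding bytes -> 24 bytes.
static_assert(sizeof(messed_up_cross_platform_struct) == 24, "LP64 assumption");
// c@0, b@8, a@12, d@14, 1 trailing padding byte -> 16 bytes.
static_assert(sizeof(aligned_cross_platform_struct) == 16, "LP64 assumption");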

When writing cross-platform code, consider the different alignment requirements across architectures, varying structure packing defaults between compilers, platform-specific alignment pragmas and attributes, serialization and data exchange requirements, and pointer size variations between 32-bit and 64-bit platforms.


Rust brings a slightly different perspective to structure layout and memory alignment.

It provides several representation attributes that control structure layout:

#[repr(C)]
struct CCompatible {
    a: i8,
    b: i32,
    c: i8,
}

#[repr(packed)]
struct Packed {
    a: i8,
    b: i32,
    c: i8,
}

#[repr(align(8))]
struct Aligned {
    a: i8,
    b: i32,
    c: i8,
}

These attributes provide different guarantees and optimizations:

  - repr(C) guarantees a C-compatible layout: members in declaration order with C padding rules,
  - repr(packed) removes padding entirely, at the cost of potentially misaligned fields, and
  - repr(align(N)) raises the minimum alignment of the type to N bytes.

Rust's type system refuses to create safe references (&T or &mut T) to fields of repr(packed) structures, so such accesses must go through copies or explicitly unsafe raw-pointer code, which helps developers avoid inadvertent misaligned access. Misusing repr(packed), e.g., dereferencing a misaligned raw pointer, is undefined behavior.


On mainstream hardware, given enough iterations, the impact of structure padding and alignment on performance can be substantial. Let's write a quick benchmark demonstrating the performance difference between aligned and unaligned access.

#include <chrono>
#include <cstdio>
#include <vector>

struct aligned_struct {
  double a;
  int b;
  char c;
};

struct unaligned_struct {
  char c;
  double a;
  int b;
} __attribute__((packed));

void benchmark() {
  const int iterations = 1000000;
  std::vector<aligned_struct> aligned(1000);
  std::vector<unaligned_struct> unaligned(1000);

  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iterations; i++) {
    for (auto& s : aligned) {
      s.a += 1.0;
      s.b += 1;
    }
  }

  auto aligned_duration = std::chrono::high_resolution_clock::now() - start;

  start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iterations; i++) {
    for (auto& s : unaligned) {
      s.a += 1.0;
      s.b += 1;
    }
  }

  auto unaligned_duration = std::chrono::high_resolution_clock::now() - start;

  // Report the measurements and touch the data so the loops are not
  // trivially optimized away.
  using std::chrono::duration_cast;
  using std::chrono::milliseconds;
  std::printf("aligned:   %lld ms\n",
              (long long)duration_cast<milliseconds>(aligned_duration).count());
  std::printf("unaligned: %lld ms\n",
              (long long)duration_cast<milliseconds>(unaligned_duration).count());
  std::printf("checksum:  %f\n", aligned[0].a + unaligned[0].a);
}

int main() { benchmark(); }

On a cold run, the aligned version completes in 794 ms compared to 828 ms for the unaligned version (Intel x86 processor).

Below is a log of the access time durations for each case.

Iteration   Aligned (high resolution, ms)   Unaligned (high resolution, ms)
1           794                             828
2           808                             873
3           783                             824
4           819                             837
5           810                             853

What we can conclude from these figures is that even on an architecture with hardware support for unaligned access, misaligned structures carry a measurable penalty; other architectures penalize misaligned access far more heavily, or fault outright. Cf. benchmark source.

The performance impact becomes even more pronounced when accessing data in tight loops, working with SIMD instructions, running cache-sensitive workloads, or processing large arrays of structures.

For an array of structures, the address of the $i$-th element is given by:

$\text{addr}(i) = \text{base} + i \cdot \text{pad}(\text{size}(S),\ \alpha(S))$

where $\text{base}$ is the starting address of the array.
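
Note that $\text{size}(S)$ is already a multiple of $\alpha(S)$ by construction, so $\text{pad}(\text{size}(S),\ \alpha(S)) = \text{size}(S)$; for the 12-byte basic_struct assumed earlier this reduces to $\text{addr}(i) = \text{base} + 12i$, and every element stays aligned as long as base itself is.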

However, this also introduces a waste factor $\omega$ for a structure $S$, which we may as well formalize.

$\omega(S) = \frac{\text{size}(S) - \sum_{i=1}^{n} \text{size}(T_i)}{\text{size}(S)}$
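
For basic_struct under the layout assumed above, $\omega(S) = (12 - 6)/12 = 0.5$: half of every element is padding.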

The optimal structure layout minimizes:

$\min_{\text{layout}} \sum_{i=1}^{n} \left( E(t_i) + \omega(S) \right)$

subject to alignment constraints:

$\forall i : o_i \bmod \alpha(T_i) = 0$

We haven't yet defined $E(t)$, the expected memory access time for a structure member, which can be modeled as:

$E(t) = t_{\text{base}} + t_{\text{misalign}} \cdot \mathbb{1}(m \bmod \alpha(T) \neq 0) + t_{\text{cache}} \cdot \phi(m, \text{size}(T))$

where:

  - $t_{\text{base}}$ is the baseline access time for an aligned, cache-resident access,
  - $t_{\text{misalign}}$ is the additional penalty when the address is misaligned,
  - $t_{\text{cache}}$ is the cost per cache line touched, and
  - $\mathbb{1}(\cdot)$ is the indicator function.

Several practical techniques follow from these implications of structure padding.

Cache line alignment, demonstrated in the cache_aligned struct below, keeps the 64-byte array within a single cache line and gives SIMD loads and stores naturally aligned operands by aligning the data to CPU cache-line boundaries with alignas(64).

struct cache_aligned {
    alignas(64) double data[8];
};

We can also use bit fields to pack multiple values into minimal space, storing flags and states in just a few bits rather than full bytes. I see this pattern a lot, though I have never actually used it myself.

struct bitfield_optimized {
  unsigned int flags : 4;
  unsigned int mode : 2;
  unsigned int state : 2;
};

A flexible array member allows variable-sized structures, so metadata and the actual data can be stored contiguously.

struct flexible_array {
  size_t size;
  alignas(8) char data[];
};
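
A typical allocation pattern (a sketch; error handling reduced to a null check, and flexible array members are standard C99 but only a common extension in C++) reserves the header and the payload in one contiguous block:

#include <cstddef>
#include <cstdlib>
#include <cstring>

struct flexible_array {
  std::size_t size;
  alignas(8) char data[];   // flexible array member
};

// Allocate the header plus n payload bytes as a single block so that the
// metadata and the data share locality.
flexible_array* make_flexible(std::size_t n) {
  auto* fa = static_cast<flexible_array*>(
      std::malloc(sizeof(flexible_array) + n));
  if (fa) {
    fa->size = n;
    std::memset(fa->data, 0, n);
  }
  return fa;
}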

For a bit field $B$ with $n$ fields of widths $w_1, w_2, \ldots, w_n$, the packing constraint is:

$\sum_{i=1}^{n} w_i \leq \text{size}(T) \cdot 8$

where $T$ is the underlying integral type.

The bit offset $b_i$ for field $i$ is:

$b_i = \left( \sum_{j=1}^{i-1} w_j \right) \bmod (\text{size}(T) \cdot 8)$
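
For bitfield_optimized above, with $T$ = unsigned int (assumed to be 32 bits) and widths $w = (4, 2, 2)$: $\sum_i w_i = 8 \leq 32$ and $b_1 = 0$, $b_2 = 4$, $b_3 = 6$, so all three fields fit in the first byte of a single 4-byte unit (the exact bit order within that unit is implementation-defined).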

Consider a structure $S$ with a flexible array member $F$; the total size for $n$ elements is:

$\text{size}_{\text{total}}(S, n) = \text{pad}(o_F + n \cdot \text{size}(T_F),\ \alpha(S))$

where:

  - $o_F$ is the offset of the flexible array member $F$,
  - $T_F$ is the element type of $F$, and
  - $\alpha(S)$ is the alignment requirement of $S$.
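
For the flexible_array structure above, on an LP64 target where size_t is 8 bytes (an assumption), $o_F = 8$, $\text{size}(T_F) = 1$, and $\alpha(S) = 8$ because of the alignas(8), so $\text{size}_{\text{total}}(S, n) = \text{pad}(8 + n,\ 8)$; for example, 20 payload bytes round up to a 32-byte footprint.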

