Portable vectorization in C++23
We will refer to stdx as an alias for std::experimental::parallelism_v2.
Intrinsics are generally used for performance-critical SIMD workloads but are ugly and ABI-specific. Over the years, vector types have accumulated many changes, such as overloaded operators replacing hand-unrolled loops, which is why the experimental portable SIMD proposal is more expressive and scales to different extension widths (SSE, AVX2 or AVX-512).
Consider the 4-element SIMD vector below (x86-64 clang 17.0.1, -O3 -std=c++23, Ice Lake).
volatile stdx::simd<float, stdx::simd_abi::fixed_size<4>> v;
volatile __m128 z = _mm_set_ps(0, 0, 0, 0);
The relevant instructions to initialize the vectors are as follows (cf. Godbolt test).
_GLOBAL__sub_I_:
vxorps xmm0, xmm0, xmm0
vmovaps xmmword ptr [rip + z], xmm0
ret
v:
.zero 16
z:
.zero 16
Both v and z are allocated 16 zeroed bytes, but the std::experimental vector needs no explicit initialization code, whereas __m128 required vxorps and vmovaps in the static initializer. It is easier to reason about lazy constructs than about functions that eagerly emit instructions, similar to the rationale for avoiding calloc's eager zero-initialization in favor of malloc in C.
To be clear, I would suggest writing platform-specific assembly and linking it in for real performance work; otherwise there are virtually no gains from intrinsics.
There are a few primitives that we will deal with: stdx::simd_abi, which selects the platform ABI the vectorization targets; stdx::simd<T, Abi>, a high-level wrapper over a vector with additional metadata; and stdx::simd_mask<T, Abi>, a mask for conditional vector operations used with stdx::where.
stdx::simd<T, Abi> #
- T: The scalar type (e.g. float, double, int).
- Abi: Specifies the SIMD register size and behavior. Common options:
  - stdx::simd_abi::fixed_size<N>: A fixed number of elements (N).
  - stdx::simd_abi::native: The platform's default SIMD width.
  - stdx::simd_abi::compatible: The most portable ABI for the platform, safe to use across translation units.
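As a rough sketch (the variable names are illustrative), the same element type can be instantiated under different ABI tags:
stdx::simd<float, stdx::simd_abi::fixed_size<4>> v_fixed{};   // always 4 lanes
stdx::simd<float, stdx::simd_abi::native<float>> v_native{};  // platform's natural width (e.g. 8 lanes with AVX2)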
In stdx::simd_mask<T, Abi>, each element is a boolean. We can define a mask for vector elements representing positive floats.
stdx::simd_mask<float> mask = v > 0.0f;
stdx::simd_abi defines tags for the SIMD ABI (Parallelism TS v2). Aliases may be used:
- stdx::simd_abi::scalar for no SIMD (fallback to scalar operations).
- stdx::simd_abi::native, which uses the platform's default SIMD width.
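The TS also provides convenience aliases for these tags, which later snippets use; a minimal sketch:
stdx::native_simd<float> nf{};        // simd<float, simd_abi::native<float>>
stdx::fixed_size_simd<float, 4> ff{}; // simd<float, simd_abi::fixed_size<4>>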
Vector operations #
Vertical operations work lane-by-lane across whole vectors and yield another vector. Horizontal operations perform a reduction within a single vector and return a scalar.
Element-wise arithmetic ops: +, -, *, /, %.
float a_in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
float b_in[4] = {5.0f, 6.0f, 7.0f, 8.0f};
// element_aligned: the raw arrays carry no extra alignment guarantee.
stdx::simd<float, stdx::simd_abi::fixed_size<4>> a(a_in, stdx::element_aligned);
stdx::simd<float, stdx::simd_abi::fixed_size<4>> b(b_in, stdx::element_aligned);
stdx::simd<float, stdx::simd_abi::fixed_size<4>> c = a + b;
Element-wise comparison ops: ==, !=, <, >, <=, >=. These return a stdx::simd_mask.
auto mask = a > b;  // stdx::fixed_size_simd_mask<float, 4>; true where a[i] > b[i]
Element-wise bitwise operations: &, |, ^, ~.
int x_in[4] = {1, 2, 3, 4};
int y_in[4] = {5, 6, 7, 8};
stdx::simd<int, stdx::simd_abi::fixed_size<4>> x(x_in, stdx::element_aligned);
stdx::simd<int, stdx::simd_abi::fixed_size<4>> y(y_in, stdx::element_aligned);
stdx::simd<int, stdx::simd_abi::fixed_size<4>> z = x & y;
Reductions #
Reduce a SIMD object to a scalar with either:
- stdx::reduce: Sum all elements (a different binary operation may be supplied).
- stdx::hmin, stdx::hmax: Minimum/maximum element.
float sum = stdx::reduce(a); // sum of all elements in `a`
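stdx::reduce also accepts a custom binary operation, and the horizontal minimum/maximum are free functions; a short sketch reusing a from above:
float product = stdx::reduce(a, std::multiplies<>{}); // fold with multiplication
float lo = stdx::hmin(a);                             // smallest element
float hi = stdx::hmax(a);                             // largest element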
Masking and conditional execution #
Use stdx::where to apply operations conditionally; there are multiple overloads of it.
stdx::where(mask, out) = a + b; // add where mask is true
Mask reductions provide further predicates, e.g. all_of, any_of, none_of, some_of.
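These are free functions over a simd_mask; a short sketch reusing the mask from the comparison above:
bool every = stdx::all_of(mask);  // every lane satisfied a[i] > b[i]
bool some = stdx::any_of(mask);   // at least one lane did
int hits = stdx::popcount(mask);  // number of true lanes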
Broadcasting #
Apply a scalar value across all lanes of a SIMD vector.
stdx::simd<float> v = 3.14f; // elements set to 3.14f
Load and store #
Load data into a SIMD object with a fixed size.
float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
stdx::simd<float, stdx::simd_abi::fixed_size<4>> v(data, stdx::element_aligned);
Store data back by copying into an array.
v.copy_to(data, stdx::element_aligned);
Compiler hints for alignment #
We use stdx::vector_aligned to assert alignment: passed to copy_to or copy_from, it promises that the source or destination buffer satisfies the vector type's alignment requirement. It may make sense to use alignas(64) on raw data before a copy_from.
Alternatively, instead of hard-coding 64-byte alignment, we can use the platform's cache-line size.
#ifdef __cpp_lib_hardware_interference_size
using std::hardware_constructive_interference_size;
using std::hardware_destructive_interference_size;
#else
// 64 bytes on x86-64 │ L1_CACHE_BYTES │ L1_CACHE_SHIFT │ __cacheline_aligned │ ...
constexpr std::size_t hardware_constructive_interference_size = 64;
constexpr std::size_t hardware_destructive_interference_size = 64;
#endif
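As a sketch, raw storage can be over-aligned to the cache line and then loaded with the vector_aligned flag (buf is an illustrative name):
alignas(hardware_destructive_interference_size)
    float buf[stdx::native_simd<float>::size()] = {};
stdx::native_simd<float> nv;
nv.copy_from(buf, stdx::vector_aligned);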
Swizzling #
The TS does not ship a dedicated primitive for reordering elements in a SIMD object (there is no stdx::shuffle). Instead, we may need a permutation index vector when element-wise reordering is necessary; direct indexing can then reconstruct the vector in the required order. When the permutation is arbitrary, it is a shuffle (general reordering); when it is a fixed selection, it is a swizzle.
using simd_t = stdx::fixed_size_simd<int, 4>;  // 4 lanes to match the 4-element array
alignas(stdx::memory_alignment_v<simd_t>) int data[] = {1, 2, 3, 4};
simd_t v;
v.copy_from(data, stdx::vector_aligned);
// Swap in memory to reverse the order, then reload the permuted data.
std::swap(data[0], data[3]);
std::swap(data[1], data[2]);
v.copy_from(data, stdx::vector_aligned);
x86 architectures offer shuffle intrinsics, e.g. _mm_shuffle_ps in SSE or _mm256_permutevar8x32_epi32 in AVX2.
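In portable code, the index-based reconstruction can also be sketched with the generator constructor, reusing simd_t and v from above (idx is an illustrative permutation table):
constexpr int idx[] = {3, 2, 1, 0};          // reverse the four lanes
simd_t r([&](int i) { return v[idx[i]]; });  // r holds v's lanes in reverse order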
Splitting and concatenation #
Split with stdx::split and concatenate with stdx::concat. Reconstruction could still require extraction and masking depending on the intermediate values, similar to the shuffle above, except that the std::swap becomes a std::memcpy.
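A minimal sketch, assuming an 8-lane fixed-size vector: split it into two 4-lane halves, then concatenate the halves in swapped order.
stdx::fixed_size_simd<float, 8> w([](int i) { return float(i); }); // 0, 1, ..., 7
auto halves = stdx::split<4, 4>(w);              // std::tuple of two 4-lane vectors
auto swapped = stdx::concat(std::get<1>(halves), // 4, 5, 6, 7, 0, 1, 2, 3
                            std::get<0>(halves));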
Element-wise function application #
Math functions can be applied to each element.
stdx::simd<float> out = stdx::sqrt(v);
Type conversion #
Convert between SIMD types of the same width with stdx::static_simd_cast.
stdx::simd<int, stdx::simd_abi::fixed_size<4>> int_vec =
stdx::static_simd_cast<stdx::simd<int, stdx::simd_abi::fixed_size<4>>>(
v);
We could trivially write a vectorized dot product function for two float arrays.
float dot_product(const float* a, const float* b, size_t size);
Initialize the accumulator.
stdx::simd<float> sum = 0.0f;
size_t i = 0;
Process elements in chunks of SIMD width. This step could be parallelized separately.
for (; i + stdx::simd<float>::size() <= size;
i += stdx::simd<float>::size()) {}
Load aligned data into SIMD register.
/*..*/ {
stdx::simd<float> va(a + i, stdx::vector_aligned);
stdx::simd<float> vb(b + i, stdx::vector_aligned);
}
Perform element-wise multiplication and accumulation then reduce the sum vector.
/*..*/ {
sum += va * vb;
}
return stdx::reduce(sum);
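Assembling the pieces, and adding a scalar tail for any elements left over when size is not a multiple of the SIMD width, a full sketch of the function could look like this:
float dot_product(const float* a, const float* b, size_t size) {
  stdx::simd<float> sum = 0.0f;
  size_t i = 0;
  // Vectorized main loop over full SIMD-width chunks.
  for (; i + stdx::simd<float>::size() <= size;
       i += stdx::simd<float>::size()) {
    stdx::simd<float> va(a + i, stdx::vector_aligned);
    stdx::simd<float> vb(b + i, stdx::vector_aligned);
    sum += va * vb;
  }
  float result = stdx::reduce(sum);
  // Scalar tail for the remaining elements.
  for (; i < size; ++i) {
    result += a[i] * b[i];
  }
  return result;
}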
In usage, the float arrays are aligned accordingly.
alignas(hardware_destructive_interference_size)
float a[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
alignas(hardware_destructive_interference_size)
float b[] = {8.0f, 7.0f, 6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f};
size_t size = sizeof(a) / sizeof(a[0]);
In a basic timing test, the SIMD version consistently outperforms a plain scalar loop (-O3 -march=native).
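For reference, the scalar baseline assumed in that comparison is a straightforward loop:
float dot_product_scalar(const float* a, const float* b, size_t size) {
  float result = 0.0f;
  for (size_t i = 0; i < size; ++i) {
    result += a[i] * b[i];
  }
  return result;
}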
C++26 improves on std::experimental::parallelism_v2, especially for features that were previously exclusive to intrinsics.