Portable vectorization in C++23
We will refer to stdx as an alias for std::experimental::parallelism_v2.
Intrinsics are generally used for performance-critical SIMD workloads but are ugly and ABI-specific. Over the years, vector types have accumulated many changes, such as overloaded operators replacing hand-unrolled loops, which is why the experimental portable SIMD proposal is more expressive and scales to different extension widths (SSE, AVX2 or AVX-512).
Consider the 4-element SIMD vector below (x86-64 clang 17.0.1, -O3 -std=c++23, Ice Lake).
volatile stdx::simd<float, stdx::simd_abi::fixed_size<4>> v;
volatile __m128 z = _mm_set_ps(0, 0, 0, 0);
The relevant instructions to initialize the vectors are as follows (cf. Godbolt test).
_GLOBAL__sub_I_:
vxorps xmm0, xmm0, xmm0
vmovaps xmmword ptr [rip + z], xmm0
ret
v:
.zero 16
z:
.zero 16
Both v and z are allocated 16 zeroed bytes, but the std::experimental vector needs no explicit initialization code, whereas __m128 required vxorps and vmovaps in the static initializer. It is easier to reason about lazy constructs than about functions that eagerly emit instructions, similar to the rationale for avoiding calloc's eager zero-initialization in favor of malloc in C.
To be clear, I would suggest writing platform-specific assembly and linking it in for real performance work; otherwise there are virtually no gains from intrinsics.
There are a few primitives that we will deal with: stdx::simd_abi, which selects the platform ABI the vectorization targets; stdx::simd<T, Abi>, a high-level wrapper over a vector with additional metadata; and stdx::simd_mask<T, Abi>, a mask for conditional vector operations used with stdx::where.
stdx::simd<T, Abi> #
- T: The scalar type (e.g. float, double, int).
- Abi: Specifies the SIMD register size and behavior. Common options:
  - stdx::simd_abi::fixed_size<N>: A fixed number of elements (N).
  - stdx::simd_abi::native: The platform's default SIMD width.
  - stdx::simd_abi::compatible: The most portable ABI for the platform, safe to use across translation units.
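As a rough sketch (the variable names are illustrative), the same element type can be instantiated under different ABI tags:
stdx::simd<float, stdx::simd_abi::fixed_size<4>> v_fixed{};   // always 4 lanes
stdx::simd<float, stdx::simd_abi::native<float>> v_native{};  // platform's natural width (e.g. 8 lanes with AVX2)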
In stdx::simd_mask<T, Abi>, each element is a boolean. We can define a mask for vector elements representing positive floats.
stdx::simd_mask<float> mask = v > 0.0f;
stdx::simd_abi defines tags for the SIMD ABI (Parallelism TS v2). Aliases may be used:
- stdx::simd_abi::scalar for no SIMD (fallback to scalar operations).
- stdx::simd_abi::native, which uses the platform's default SIMD width.
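The TS also provides convenience aliases for these tags, which later snippets use; a minimal sketch:
stdx::native_simd<float> nf{};        // simd<float, simd_abi::native<float>>
stdx::fixed_size_simd<float, 4> ff{}; // simd<float, simd_abi::fixed_size<4>>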
Vector operations #
Vertical operations work lane-by-lane across whole vectors and yield another vector. Horizontal operations perform a reduction within a single vector and return a scalar.
Element-wise arithmetic ops: +, -, *, /, %.
float a_in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
float b_in[4] = {5.0f, 6.0f, 7.0f, 8.0f};
// element_aligned: the raw arrays carry no extra alignment guarantee.
stdx::simd<float, stdx::simd_abi::fixed_size<4>> a(a_in, stdx::element_aligned);
stdx::simd<float, stdx::simd_abi::fixed_size<4>> b(b_in, stdx::element_aligned);
stdx::simd<float, stdx::simd_abi::fixed_size<4>> c = a + b;
Element-wise comparison ops: ==, !=, <, >, <=, >=. These return a stdx::simd_mask.
auto mask = a > b;  // stdx::fixed_size_simd_mask<float, 4>; true where a[i] > b[i]
Element-wise bitwise operations: &, |, ^, ~.
int x_in[4] = {1, 2, 3, 4};
int y_in[4] = {5, 6, 7, 8};
stdx::simd<int, stdx::simd_abi::fixed_size<4>> x(x_in, stdx::element_aligned);
stdx::simd<int, stdx::simd_abi::fixed_size<4>> y(y_in, stdx::element_aligned);
stdx::simd<int, stdx::simd_abi::fixed_size<4>> z = x & y;
Reductions #
Reduce a SIMD object to a scalar with either:
- stdx::reduce: Sum all elements (a different binary operation may be supplied).
- stdx::hmin, stdx::hmax: Minimum/maximum element.
float sum = stdx::reduce(a); // sum of all elements in `a`
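stdx::reduce also accepts a custom binary operation, and the horizontal minimum/maximum are free functions; a short sketch reusing a from above:
float product = stdx::reduce(a, std::multiplies<>{}); // fold with multiplication
float lo = stdx::hmin(a);                             // smallest element
float hi = stdx::hmax(a);                             // largest element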
Masking and conditional execution #
Use stdx::where to apply operations conditionally; there are multiple overloads of it.
stdx::where(mask, out) = a + b; // add where mask is true
Mask reductions provide further predicates, e.g. all_of, any_of, none_of, some_of.
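These are free functions over a simd_mask; a short sketch reusing the mask from the comparison above:
bool every = stdx::all_of(mask);  // every lane satisfied a[i] > b[i]
bool some = stdx::any_of(mask);   // at least one lane did
int hits = stdx::popcount(mask);  // number of true lanes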
Broadcasting #
Apply a scalar value across all lanes of a SIMD vector.
stdx::simd<float> v = 3.14f; // elements set to 3.14f
Load and store #
Load data into a SIMD object with a fixed size.
float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
stdx::simd<float, stdx::simd_abi::fixed_size<4>> v(data, stdx::element_aligned);
Store data back by copying into an array.
v.copy_to(data, stdx::element_aligned);
Compiler hints for alignment #
We use stdx::vector_aligned to assert alignment: passed to copy_to or copy_from, it promises that the source or destination buffer satisfies the vector type's alignment requirement. It may make sense to use alignas(64) on raw data before a copy_from.
Alternatively, instead of hard-coding 64-byte alignment, we can use the platform's cache-line size.
#ifdef __cpp_lib_hardware_interference_size
using std::hardware_constructive_interference_size;
using std::hardware_destructive_interference_size;
#else
// 64 bytes on x86-64 │ L1_CACHE_BYTES │ L1_CACHE_SHIFT │ __cacheline_aligned │ ...
constexpr std::size_t hardware_constructive_interference_size = 64;
constexpr std::size_t hardware_destructive_interference_size = 64;
#endif
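As a sketch, raw storage can be over-aligned to the cache line and then loaded with the vector_aligned flag (buf is an illustrative name):
alignas(hardware_destructive_interference_size)
    float buf[stdx::native_simd<float>::size()] = {};
stdx::native_simd<float> nv;
nv.copy_from(buf, stdx::vector_aligned);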
Swizzling #
The TS does not ship a dedicated primitive for reordering elements in a SIMD object (there is no stdx::shuffle). Instead, we may need a permutation index vector when element-wise reordering is necessary; direct indexing can then reconstruct the vector in the required order. When the permutation is arbitrary, it is a shuffle (general reordering); when it is a fixed selection, it is a swizzle.
using simd_t = stdx::fixed_size_simd<int, 4>;  // 4 lanes to match the 4-element array
alignas(stdx::memory_alignment_v<simd_t>) int data[] = {1, 2, 3, 4};
simd_t v;
v.copy_from(data, stdx::vector_aligned);
// Swap in memory to reverse the order, then reload the permuted data.
std::swap(data[0], data[3]);
std::swap(data[1], data[2]);
v.copy_from(data, stdx::vector_aligned);
x86 architectures offer shuffle intrinsics, e.g. _mm_shuffle_ps in SSE or _mm256_permutevar8x32_epi32 in AVX2.
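In portable code, the index-based reconstruction can also be sketched with the generator constructor, reusing simd_t and v from above (idx is an illustrative permutation table):
constexpr int idx[] = {3, 2, 1, 0};          // reverse the four lanes
simd_t r([&](int i) { return v[idx[i]]; });  // r holds v's lanes in reverse order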
Splitting and concatenation #
Split with stdx::split and concatenate with stdx::concat. Reconstruction could still require extraction and masking depending on the intermediate values, similar to the shuffle above, except that the std::swap becomes a std::memcpy.
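A minimal sketch, assuming an 8-lane fixed-size vector: split it into two 4-lane halves, then concatenate the halves in swapped order.
stdx::fixed_size_simd<float, 8> w([](int i) { return float(i); }); // 0, 1, ..., 7
auto halves = stdx::split<4, 4>(w);              // std::tuple of two 4-lane vectors
auto swapped = stdx::concat(std::get<1>(halves), // 4, 5, 6, 7, 0, 1, 2, 3
                            std::get<0>(halves));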
Element-wise function application #
Math functions can be applied to each element.
stdx::simd<float> out = stdx::sqrt(v);
Type conversion #
Convert between SIMD types of the same width with stdx::static_simd_cast.
stdx::simd<int, stdx::simd_abi::fixed_size<4>> int_vec =
stdx::static_simd_cast<stdx::simd<int, stdx::simd_abi::fixed_size<4>>>(
v);
We could trivially write a vectorized dot product function for two float arrays.
float dot_product(const float* a, const float* b, size_t size);
Initialize the accumulator.
stdx::simd<float> sum = 0.0f;
size_t i = 0;
Process elements in chunks of SIMD width. This step could be parallelized separately.
for (; i + stdx::simd<float>::size() <= size;
i += stdx::simd<float>::size()) {}
Load aligned data into SIMD register.
/*..*/ {
stdx::simd<float> va(a + i, stdx::vector_aligned);
stdx::simd<float> vb(b + i, stdx::vector_aligned);
}
Perform element-wise multiplication and accumulation then reduce the sum vector.
/*..*/ {
sum += va * vb;
}
return stdx::reduce(sum);
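Assembling the pieces, and adding a scalar tail for any elements left over when size is not a multiple of the SIMD width, a full sketch of the function could look like this:
float dot_product(const float* a, const float* b, size_t size) {
  stdx::simd<float> sum = 0.0f;
  size_t i = 0;
  // Vectorized main loop over full SIMD-width chunks.
  for (; i + stdx::simd<float>::size() <= size;
       i += stdx::simd<float>::size()) {
    stdx::simd<float> va(a + i, stdx::vector_aligned);
    stdx::simd<float> vb(b + i, stdx::vector_aligned);
    sum += va * vb;
  }
  float result = stdx::reduce(sum);
  // Scalar tail for the remaining elements.
  for (; i < size; ++i) {
    result += a[i] * b[i];
  }
  return result;
}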
In usage, the float arrays are aligned accordingly.
alignas(hardware_destructive_interference_size)
float a[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
alignas(hardware_destructive_interference_size)
float b[] = {8.0f, 7.0f, 6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f};
size_t size = sizeof(a) / sizeof(a[0]);
In a basic timing test, the SIMD version consistently outperforms a plain scalar loop (-O3 -march=native).
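For reference, the scalar baseline assumed in that comparison is a straightforward loop:
float dot_product_scalar(const float* a, const float* b, size_t size) {
  float result = 0.0f;
  for (size_t i = 0; i < size; ++i) {
    result += a[i] * b[i];
  }
  return result;
}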
C++26 improves on std::experimental::parallelism_v2, especially for features that were previously exclusive to intrinsics.