
Cooperative multitasking in Rust

Cooperative multitasking traces its origins to early work on coroutines and non-preemptive scheduling. In the paper Design of a Separable Transition-Diagram Compiler (Conway, 1963), coroutines are formalized as units of execution that voluntarily yield control, in contrast with preemptive threading models that rely on scheduler interrupts.

The idea gained traction in resource-constrained environments, e.g. embedded and real-time (RT) systems, where deterministic execution and minimal context-switch overhead are prioritized.

Rust adopted stackless coroutines (via async/await), which aligns with this tradition but diverges from stackful coroutine implementations (cf. Go's goroutines). Stackless coroutines avoid allocating separate stacks, instead encoding state transitions as finite automata within the compiler-generated Future type.

This design was likely influenced by Haskell's continuation-passing style, and it preserves zero-cost abstractions: cooperative tasks incur no heap allocation unless they are explicitly spawned onto an executor.

  1. Stackful coroutines:

    • Require per-task stacks (~2–8 kB), which creates memory bloat in high-concurrency workloads.
    • Context switching involves register saves/restores and stack pointer swaps, and references held across suspension points are opaque to Rust's borrow checker, risking dangling references.
  2. Stackless coroutines:

    • Represent task state as an enum variant, stored inline within the Future struct.
    • State transitions are resolved at compile time, so borrows across suspension points are checked statically rather than at runtime.
// 1. Compiler-generated state machine for an async block.

async fn example() {
    let a = 1;
    yield_once().await;
    let b = &a;
    yield_once().await;
    println!("{}", b);
}

// 2. Desugaring into a manual Future impl.
//    (pseudocode of the FSM representation; names like some_transition_fn are illustrative)

enum ExampleFuture {
    Start { a: i32 },
    Await1 { a: i32, b: &i32 },
    Completed,
}

impl Future for ExampleFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, ctx: &mut Context) -> Poll<()> {
        match self.some_transition_fn() {
            ExampleFutureTransitionState::Start(a) => {
                // ...
            }
            // ----->
            // 3. states encode live variables at each await point
        }
    } // <----- live bindings die after poll
}

Rust guarantees that references (e.g. &a) cannot outlive the data they borrow, and it enforces this at compile time rather than with runtime checks. Because b borrows a across an await point, the generated future is self-referential, which is why poll takes Pin<&mut Self>. Compile-time rigor on this level makes cooperative multitasking workable in environments where use-after-free or undefined behavior is intolerable, namely safety-critical systems.

Custom schedulers & bypassing the default executor #

While Rust's tokio and async-std executors are tuned for general-purpose workloads, domain-specific scenarios call for custom schedulers. In most cases, you just need to know what you're doing with wakers and polling.

Let's say we need a batch-processing executor for high-throughput execution of CPU-bound tasks.

We will need a RawWakerVTable, which specifies the behavior of a raw waker. From the documentation:

The pointer passed to all functions inside the vtable is the data pointer from the enclosing RawWaker object.

Any writes or accesses through that data pointer that bypass the vtable's contract are UB, and several issues track this, e.g. issue #79889 (resolved), some of which may not be fixable at the language level.

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// A scheduler that batches ready tasks into fixed-size groups,
/// minimizing cache thrashing through sequential execution.
struct BatchScheduler {
    tasks: Vec<Pin<Box<dyn Future<Output = ()>>>>,
    batch_size: usize,
}

impl BatchScheduler {
    fn run(&mut self) {
        // noop waker vtable
        const NOOP_WAKER_VTABLE: RawWakerVTable = RawWakerVTable::new(
            |_| RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE),
            |_| {}, |_| {}, |_| {},
        );

        // noop waker
        let waker = unsafe {
            Waker::from_raw(RawWaker::new(
                std::ptr::null(),
                &NOOP_WAKER_VTABLE,
            ))
        };

        let mut ctx = Context::from_waker(&waker);
        let mut current_batch = Vec::with_capacity(self.batch_size);

        for task in &mut self.tasks {
            if let Poll::Ready(()) = task.as_mut().poll(&mut ctx) {
                continue;
            }

            current_batch.push(task);

            if current_batch.len() == self.batch_size {
                for t in &mut current_batch {
                    let _ = t.as_mut().poll(&mut ctx);
                }

                current_batch.clear();
            }
        }

        // Poll any leftover tasks from a final, partially filled batch.
        for t in &mut current_batch {
            let _ = t.as_mut().poll(&mut ctx);
        }
    }
}

Tasks deemed Pending are grouped and repolled in sequence, which exploits spatial locality in CPU caches; it is batch polling in a nutshell.

There are no wakeup notifications: the no-op RawWaker simply discards wakes, on the assumption that tasks are CPU-bound and will complete within a fixed number of polls.
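As an aside, recent versions of std ship a built-in no-op waker; assuming a toolchain where Waker::noop() is stable, the unsafe vtable dance collapses to:

// Waker::noop() returns a &'static Waker, so no vtable or unsafe is needed.
let mut ctx = Context::from_waker(Waker::noop());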

Realistically, how would we use this?

Define routines in the batch scheduler with batch_size set to 2.

use std::time::Duration;
use std::thread::sleep;

let mut scheduler = BatchScheduler {
    batch_size: 2,
    tasks: vec![
        Box::pin(async {
            println!("task 1 starting");
            sleep(Duration::from_secs(1));
            println!("task 1 finished");
        }),
        Box::pin(async {
            println!("task 2 starting");
            sleep(Duration::from_secs(2));
            println!("task 2 finished");
        }),
        Box::pin(async {
            println!("task 3 starting");
            sleep(Duration::from_secs(3));
            println!("task 3 finished");
        }),
    ],
};

scheduler.run();

Retry logic with state tracking #

In cases where explicit error handling is required, a manual Future implementation bypasses the limitations of async/await's syntactic sugar. The following Retry future implements exponential backoff; jitter is discussed below but left out of the code.

We start by designing a future that wraps a delay for the executor. It always wakes the context so that it gets polled again, which is why we call wake_by_ref (effectively busy-polling, acceptable for a demonstration).

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, Waker};
use std::time::{Duration, Instant};

struct Delay {
    end_time: Instant,
}

impl Future for Delay {
    type Output = ();

    fn poll(self: Pin<&mut Self>, ctx: &mut Context<'_>) -> Poll<Self::Output> {
        if Instant::now() >= unsafe { self.get_unchecked_mut() }.end_time {
            Poll::Ready(())
        } else {
            ctx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

The unsafe block is sound here because Delay contains no self-referential data and we are only reading a field.
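In fact, since Pin<&mut Self> dereferences to &Self, the same poll can be written without any unsafe; a minimal equivalent inside the same impl block:

fn poll(self: Pin<&mut Self>, ctx: &mut Context<'_>) -> Poll<Self::Output> {
    // Pin<&mut Delay> derefs to &Delay, so the field read needs no unsafe.
    if Instant::now() >= self.end_time {
        Poll::Ready(())
    } else {
        ctx.waker().wake_by_ref();
        Poll::Pending
    }
}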

We now proceed to the retry future, which holds the metadata the executor needs across polls.

struct Retry<F, O, E, Fut> {
    operation: F,
    max_retries: usize,
    current_attempt: usize,
    delay: Option<Pin<Box<Delay>>>,
    operation_future: Option<Pin<Box<Fut>>>,
    last_error: Option<E>,
    _phantom: std::marker::PhantomData<(O, Fut)>,
}

Retry tracks attempts, delays, and the current operation so as to encapsulate all transient state without relying on async/await's implicit captures. While omitted here, jitter (e.g. randomized delays) can be injected by changing the delay; see the sketch below.
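A possible shape for that injection, assuming the rand crate (also used in the test later); backoff_with_jitter is a hypothetical helper, not part of Retry:

fn backoff_with_jitter(attempt: u32) -> Duration {
    // Base exponential delay (the same 2^attempt * 100 ms used below)
    // plus up to 50% random slack to avoid synchronized retries.
    let base_ms = 2u64.pow(attempt) * 100;
    let jitter_ms = rand::random::<u64>() % (base_ms / 2 + 1);
    Duration::from_millis(base_ms + jitter_ms)
}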

impl<F, O, E, Fut> Future for Retry<F, O, E, Fut>
where
    F: Fn() -> Fut,
    Fut: Future<Output = Result<O, E>>,
{
    type Output = Result<O, E>;

    fn poll(mut self: Pin<&mut Self>, ctx: &mut Context) -> Poll<Self::Output> {
        let this = unsafe { self.get_unchecked_mut() };

        // 1. Check if we've exceeded max retries.
        if this.current_attempt > this.max_retries {
            return Poll::Ready(Err(this.last_error.take().unwrap()));
        }

        // 2. Manage delay if defined.
        if let Some(delay) = this.delay.as_mut() {
            match delay.as_mut().poll(ctx) {
                Poll::Pending => {
                    // 3. Wait for said delay.
                    ctx.waker().wake_by_ref();
                    return Poll::Pending;
                }
                // 4. Otherwise, retry.
                Poll::Ready(_) => {
                    println!("retry attempt {} after backoff", this.current_attempt);
                    this.delay = None;
                }
            }
        }

        // 5. If there is no operation future, create one.
        if this.operation_future.is_none() {
            this.operation_future = Some(Box::pin((this.operation)()));
        }

        // 6. Immediately poll the operation future.
        if let Some(mut operation) = this.operation_future.take() {
            match operation.as_mut().poll(ctx) {
                Poll::Ready(Ok(val)) => {
                    println!("op succeeded!");
                    return Poll::Ready(Ok(val));
                }

                // 7. Increment attempt only if less than max_retries.
                Poll::Ready(Err(e)) if this.current_attempt < this.max_retries => {
                    this.current_attempt += 1;
                    let delay_ms = 2u64.pow(this.current_attempt as u32) * 100;
                    println!("op failed. retrying in {}ms...", delay_ms);

                    this.last_error = Some(e);
                    this.delay = Some(Box::pin(Delay {
                        end_time: Instant::now() + Duration::from_millis(delay_ms),
                    }));

                    // 8. Wake the context to ensure we poll again.
                    ctx.waker().wake_by_ref();

                    return Poll::Pending;
                }

                Poll::Ready(Err(e)) => {
                    return Poll::Ready(Err(e));
                }

                Poll::Pending => {
                    // 9. Put the operation future back.
                    this.operation_future = Some(operation);
                    ctx.waker().wake_by_ref();
                    return Poll::Pending;
                }
            }
        }

        // 10. /!\ WE SHOULD NEVER EVER REACH HERE!! ARRRGH!
        Poll::Pending
    }
}

Using these futures is straightforward when we only need to simulate an async operation that might fail. Random numbers that we conditionally treat as failures, retried by the executor when it awaits the retry future's operation, make for a convenient example.

However, there is another piece of advice.

Always handle transient errors.

We define a placeholder for the output error. Reasoning about custom error types is easier than chasing unexpected panics, and doing so should be common practice in general.

#[derive(Debug)]
struct RetryOutputError {
    message: String,
}

Proceeding with the implementation, we have:

let retry_future = Retry {
    operation: || async {
        let delay = Delay {
            end_time: Instant::now() + Duration::from_millis(100),
        };

        delay.await;

        let random_value = rand::random::<u8>() % 10;

        match random_value {
            0..=2 => Ok("passed"),
            _ => Err(RetryOutputError {
                message: "transient error".into(),
            }),
        }
    },
    max_retries: 3,
    current_attempt: 0,
    delay: None,
    operation_future: None,
    last_error: None,
    _phantom: std::marker::PhantomData,
};

println!("starting op with max 3 retries...");

futures::executor::block_on(async {
    match retry_future.await {
        Ok(result) => println!("outcome: op succeeded: {}", result),
        Err(e) => println!("abort due to: op failed after all retries: {:?}", e),
    }
});

We should get output similar to this:

starting op with max 3 retries...
op failed. retrying in 200ms...
retry attempt 1 after backoff
op failed. retrying in 400ms...
retry attempt 2 after backoff
op failed. retrying in 800ms...
retry attempt 3 after backoff
op succeeded!
outcome: op succeeded: passed

Idempotent API calls to unreliable services can use this mechanism directly, provided retries are bounded and cooperate with backpressure.

An implementation of poll should strive to return quickly, and should not block. Returning quickly prevents unnecessarily clogging up threads or event loops. If it is known ahead of time that a call to poll may end up taking a while, the work should be offloaded to a thread pool (or something similar) to ensure that poll can return quickly (cf. Future).
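For reference, a minimal sketch of that offloading pattern; the Offloaded type and its spawn helper are illustrative names, not part of this post's earlier code. The blocking work runs on a plain thread, and poll only checks a flag and registers the waker.

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};
use std::thread;

struct Offloaded {
    state: Arc<Mutex<(bool, Option<Waker>)>>,
}

impl Offloaded {
    fn spawn(work: impl FnOnce() + Send + 'static) -> Self {
        let state = Arc::new(Mutex::new((false, None::<Waker>)));
        let shared = state.clone();

        // The heavy work runs off the async thread; poll stays cheap.
        thread::spawn(move || {
            work();
            let mut s = shared.lock().unwrap();
            s.0 = true;
            if let Some(waker) = s.1.take() {
                waker.wake();
            }
        });

        Offloaded { state }
    }
}

impl Future for Offloaded {
    type Output = ();

    fn poll(self: Pin<&mut Self>, ctx: &mut Context<'_>) -> Poll<()> {
        let mut s = self.state.lock().unwrap();
        if s.0 {
            Poll::Ready(())
        } else {
            // Register the waker so the worker thread can notify us.
            s.1 = Some(ctx.waker().clone());
            Poll::Pending
        }
    }
}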

Why do we use exponential backoff?

The pattern itself is common in network operations where you want to give temporary failures a chance to resolve themselves, but not wait forever. The increasing delay helps prevent overwhelming the system in cases where there are genuine issues.

Segmented stacks for deep async recursion #

Cooperative systems traditionally struggle with deep recursion due to stack overflow risks. Even though stackless coroutines avoid per-task stacks, deeply nested async calls can still exhaust memory. Segmented stacks, a hybrid approach inspired by Go's early stack-splitting and Rust's generator model, allow state machines to be chained dynamically so that recursion is handled without preallocation.

When is this useful? Tree and graph traversals where recursion depth is data-dependent and unpredictable.

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

struct RecursiveTask {
    depth: usize,
    child: Option<Pin<Box<RecursiveTask>>>,
}

impl Future for RecursiveTask {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, ctx: &mut Context) -> Poll<Self::Output> {
        println!("recursing at depth: {}", self.depth);

        if self.depth == 0 {
            return Poll::Ready(());
        }

        if self.child.is_none() {
            // lazy allocate the next segment
            self.child = Some(Box::pin(RecursiveTask {
                depth: self.depth - 1,
                child: None,
            }));
        }

        match self.child.as_mut().unwrap().as_mut().poll(ctx) {
            Poll::Ready(()) => Poll::Ready(()),
            Poll::Pending => Poll::Pending,
        }
    }
}

In this case, the executor polls the root task, which dynamically chains child segments.

Each recursive level allocates its state machine only when first polled (lazy allocation), which prevents upfront memory bloat. Will you need preallocation? Perhaps; in that case the chain behaves more like a pool of preallocated segments, as in the sketch below.
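A hypothetical eager variant, for contrast (preallocated is not used anywhere else in this post):

// Build the whole chain up front, trading lazy allocation for
// predictable memory usage.
fn preallocated(depth: usize) -> RecursiveTask {
    let mut task = RecursiveTask { depth: 0, child: None };
    for d in 1..=depth {
        task = RecursiveTask {
            depth: d,
            child: Some(Box::pin(task)),
        };
    }
    task
}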

The child field acts as a stack segment. The executor traverses the chain without a contiguous stack.

We will once again use a no-op vtable and raw waker. Manually poll the Future until it's ready.

const NOOP_WAKER_VTABLE: RawWakerVTable = RawWakerVTable::new(
    |_| RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE),
    |_| {},
    |_| {},
    |_| {},
);

let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE)) };
let mut ctx = Context::from_waker(&waker);

let mut task = Pin::new(Box::new(RecursiveTask {
    depth: 5,
    child: None,
}));

loop {
    match task.as_mut().poll(&mut ctx) {
        Poll::Ready(_) => {
            println!("recursive task completed!");
            break;
        }

        Poll::Pending => {
            continue;
        }
    }
}

The output is conclusive.

recursing at depth: 5
recursing at depth: 4
recursing at depth: 3
recursing at depth: 2
recursing at depth: 1
recursing at depth: 0
recursive task completed!

LLVM also models segmented stacks as incremental allocation instead of a single monolithic, worst-case-sized chunk at thread initialization. Stack pivoting, covered later, is related, although recovery from an undersized stack is out of scope for this post (hint: a prologue inserted before the actual function prologue can predefine the stack pointer, effective addresses, etc.).

Priority inversion mitigation via yield-aware mutexes #

Priority inversion is quite common in real-time (RT) systems. It occurs when a high-priority task is blocked by a lower-priority one holding a shared resource.

A yield-aware mutex integrates priority metadata into the lock's wait queue so the executor can reorder task polling based on priority.

I tried the naïve approach, but it turns out a PriorityWaker wrapper is required to make Waker comparable. The wrapper also depends on the core traits for partial and total ordering as well as equality comparisons.

use std::cmp::{Eq, Ordering, PartialEq, PartialOrd};
use std::collections::BinaryHeap;
use std::future::Future;
use std::ops::{Deref, DerefMut};
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

#[derive(Clone)]
struct PriorityWaker {
    priority: u8,
    waker: Waker,
}

impl PartialEq for PriorityWaker {
    fn eq(&self, other: &Self) -> bool {
        self.priority == other.priority
    }
}

impl Eq for PriorityWaker {}

impl PartialOrd for PriorityWaker {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for PriorityWaker {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority.cmp(&other.priority)
    }
}

BinaryHeap is a max-heap, so ordering PriorityWaker by priority directly keeps the highest priority at the top of the heap (and partial_cmp must agree with cmp to uphold the trait contract).

A guard is required for the priority mutex. Wake the highest priority waker when the lock is released.

struct PriorityMutex<T> {
    data: Mutex<(T, BinaryHeap<PriorityWaker>)>,
}

impl<T> PriorityMutex<T> {
    fn new(value: T) -> Self {
        PriorityMutex {
            data: Mutex::new((value, BinaryHeap::new())),
        }
    }

    fn lock(&self, priority: u8) -> impl Future<Output = PriorityMutexGuard<'_, T>> {
        struct LockFuture<'a, T> {
            mutex: &'a PriorityMutex<T>,
            priority: u8,
        }

        impl<'a, T> Future for LockFuture<'a, T> {
            type Output = PriorityMutexGuard<'a, T>;

            fn poll(self: Pin<&mut Self>, ctx: &mut Context<'_>) -> Poll<Self::Output> {
                let mut guard = self.mutex.data.lock().unwrap();

                if guard.1.is_empty() {
                    // 1. Lock acquired, store priority for future contention.
                    guard.1.push(PriorityWaker {
                        priority: self.priority,
                        waker: ctx.waker().clone(),
                    });

                    Poll::Ready(PriorityMutexGuard {
                        guard: Some(guard),
                        mutex: self.mutex,
                    })
                } else {
                    // 2. Yield and re-insert with priority.
                    guard.1.push(PriorityWaker {
                        priority: self.priority,
                        waker: ctx.waker().clone(),
                    });

                    ctx.waker().wake_by_ref();
                    Poll::Pending
                }
            }
        }

        LockFuture {
            mutex: self,
            priority,
        }
    }
}

struct PriorityMutexGuard<'a, T> {
    guard: Option<std::sync::MutexGuard<'a, (T, BinaryHeap<PriorityWaker>)>>,
    mutex: &'a PriorityMutex<T>,
}

impl<'a, T> Deref for PriorityMutexGuard<'a, T> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        &self.guard.as_ref().unwrap().0
    }
}

impl<'a, T> DerefMut for PriorityMutexGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.guard.as_mut().unwrap().0
    }
}

impl<'a, T> Drop for PriorityMutexGuard<'a, T> {
    fn drop(&mut self) {
        if let Some(mut guard) = self.guard.take() {
            if let Some(priority_waker) = guard.1.pop() {
                priority_waker.waker.wake();
            }
        }
    }
}

DerefMut is required so that the protected value can be mutated through the guard (e.g. *value += 1 in the test below).

Consider the test case below, where we spin up threads and join them while using the priority mutex.

use std::thread;
use std::time::Duration;

let mutex = Arc::new(PriorityMutex::new(0));

let mutex1 = mutex.clone();
let mutex2 = mutex.clone();

let handle1 = thread::spawn(move || {
    const NOOP_WAKER_VTABLE: RawWakerVTable = RawWakerVTable::new(
        |_| RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE),
        |_| {},
        |_| {},
        |_| {},
    );

    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE)) };
    let mut ctx = Context::from_waker(&waker);

    let mut lock_future = mutex1.lock(2);
    let mut pinned_future = unsafe { Pin::new_unchecked(&mut lock_future) };

    match pinned_future.as_mut().poll(&mut ctx) {
        Poll::Ready(mut value) => {
            println!("task 1 acquired lock with priority 2");
            *value += 1;
            thread::sleep(Duration::from_millis(100));
            println!("task 1 releasing lock");
        }

        Poll::Pending => println!("task 1 could not acquire lock"),
    }
});

let handle2 = thread::spawn(move || {
    const NOOP_WAKER_VTABLE: RawWakerVTable = RawWakerVTable::new(
        |_| RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE),
        |_| {},
        |_| {},
        |_| {},
    );

    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE)) };
    let mut ctx = Context::from_waker(&waker);

    let mut lock_future = mutex2.lock(1);
    let mut pinned_future = unsafe { Pin::new_unchecked(&mut lock_future) };

    match pinned_future.as_mut().poll(&mut ctx) {
        Poll::Ready(mut value) => {
            println!("task 2 acquired lock with priority 1");
            *value += 2;
            println!("task 2 releasing lock");
        }

        Poll::Pending => println!("task 2 could not acquire lock"),
    }
});

handle1.join().unwrap();
handle2.join().unwrap();

With this in place, the output shows task 1 (priority 2) acquiring the lock before task 2 (priority 1):

task 1 acquired lock with priority 2
task 1 releasing lock
task 2 acquired lock with priority 1
task 2 releasing lock

In general, the executor must poll tasks in priority order using a priority queue.

Waiters are stored in a max-heap such that the highest-priority task is resumed first. Tasks yield on contention, meaning the executor is signaled to repoll based on updated priorities.

Try comparing this approach with Peterson's lock.

Thread-local stealing for heterogeneous workloads #

Work-stealing schedulers are great at load balancing but incur overhead from atomic operations. A thread-local stealing model partitions tasks into thread-bound queues, stealing only when local queues are exhausted, thereby reducing cross-thread synchronization.

use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::thread;
use std::time::Duration;

struct LocalQueue {
    tasks: VecDeque<Pin<Box<dyn Future<Output = ()> + Send>>>,
    steal_counter: AtomicUsize,
}

impl LocalQueue {
    fn new() -> Self {
        Self {
            tasks: VecDeque::new(),
            steal_counter: AtomicUsize::new(0),
        }
    }

    fn push(&mut self, task: Pin<Box<dyn Future<Output = ()> + Send>>) {
        self.tasks.push_back(task);
    }

    fn steal(&mut self, target: &mut LocalQueue) -> usize {
        let steal_count = self.steal_counter.load(Ordering::Relaxed);
        let len = self.tasks.len();

        if len > 1 {
            let split = len / 2;
            let stolen_tasks = self.tasks.drain(..split).collect::<VecDeque<_>>();

            target.tasks.extend(stolen_tasks);
            self.steal_counter.store(steal_count + 1, Ordering::Relaxed);
        }

        steal_count
    }
}

An executor loop polls the local queue with a no-op waker and vtable.

fn executor_loop(queue: &mut LocalQueue) {
    const NOOP_WAKER_VTABLE: RawWakerVTable = RawWakerVTable::new(
        |_| RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE),
        |_| {},
        |_| {},
        |_| {},
    );

    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_WAKER_VTABLE)) };
    let mut ctx = Context::from_waker(&waker);

    while let Some(mut task) = queue.tasks.pop_front() {
        if task.as_mut().poll(&mut ctx) == Poll::Pending {
            queue.tasks.push_back(task);
        }
    }
}

Let's take a sample case that is not randomized: push some tasks into a queue, then create a second queue to steal into.

async fn test_task(id: usize) {
    println!("task {} started", id);
    thread::sleep(Duration::from_millis(100 * id as u64));
    println!("task {} completed", id);
}

fn main() {
    let mut queue = LocalQueue::new();

    for i in 0..5 {
        let task = Box::pin(test_task(i));
        queue.push(task);
    }

    let mut steal_queue = LocalQueue::new();

    println!(
        "before stealing: main queue size = {}, steal queue size = {}",
        queue.tasks.len(),
        steal_queue.tasks.len()
    );

    queue.steal(&mut steal_queue);

    println!(
        "after stealing: main queue size = {}, steal queue size = {}",
        queue.tasks.len(),
        steal_queue.tasks.len()
    );

    executor_loop(&mut queue);
    executor_loop(&mut steal_queue);
}

The two queues together still hold the original five tasks, which makes sense: stealing drains the first half (rounded down) of the victim's queue, leaving sizes of 3 and 2. Since the main queue runs first, tasks 2–4 execute before the stolen tasks 0 and 1.

before stealing: main queue size = 5, steal queue size = 0
after stealing: main queue size = 3, steal queue size = 2
task 2 started
task 2 completed
task 3 started
task 3 completed
task 4 started
task 4 completed
task 0 started
task 0 completed
task 1 started
task 1 completed

Locality preservation keeps tasks on their origin thread unless they are stolen, which improves cache usage.

Notice that roughly half the queue is stolen at once to amortize synchronization costs. This is referred to as bulk stealing.

Mixed workloads, such as web servers with both CPU-bound and I/O-bound tasks, are the typical candidates for this pattern, since locality reduces contention.

Lock-free task termination with epoch-based reclamation #

Terminating tasks in async Rust risks use-after-free if pending futures retain references to dropped data. Epoch-based reclamation (EBR), adapted from concurrent data structures (K. Fraser, 2004), defers deallocation until no task can still reference the data.

For this, you only need an atomic epoch and a bunch of garbage (literally). A guard holds a local epoch and compare-exchanges the global epoch against it. Frees are deferred: reclamation only happens once the garbage (a Vec<*mut u8>) exceeds 64 bytes, which matches the cache line size of many modern CPU architectures.

Memory ordering: Acquire for reading the current, not-yet-reclaimed epoch, SeqCst for the read-modify-write, and Relaxed on failure.

A quick use case may be shared caches in async databases where entries must outlive transient tasks.

Tasks entering a critical section increment the global epoch. Ideally, deferred frees are batched and reclaimed only when all active tasks have exited the epoch.
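A minimal sketch of that shape, with hypothetical names (EpochCollector, EpochGuard, pin, defer_free) and the per-task bookkeeping elided; a production implementation such as crossbeam-epoch tracks per-participant epochs to prove that reclamation is safe:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

struct EpochCollector {
    global_epoch: AtomicUsize,
    garbage: Mutex<Vec<*mut u8>>,
}

struct EpochGuard<'a> {
    collector: &'a EpochCollector,
    local_epoch: usize,
}

impl EpochCollector {
    // Entering a critical section: read the global epoch (Acquire) and
    // try to advance it past the local one (SeqCst on success, Relaxed
    // on failure).
    fn pin(&self) -> EpochGuard<'_> {
        let local_epoch = self.global_epoch.load(Ordering::Acquire);
        let _ = self.global_epoch.compare_exchange(
            local_epoch,
            local_epoch + 1,
            Ordering::SeqCst,
            Ordering::Relaxed,
        );

        EpochGuard { collector: self, local_epoch }
    }

    // Defer a free; reclaim in a batch once the garbage exceeds 64 bytes.
    unsafe fn defer_free(&self, ptr: *mut u8) {
        let mut garbage = self.garbage.lock().unwrap();
        garbage.push(ptr);

        if garbage.len() * std::mem::size_of::<*mut u8>() > 64 {
            for p in garbage.drain(..) {
                // SAFETY (sketch only): assumes every pointer came from
                // Box::into_raw and that no pinned task can still see it.
                unsafe { drop(Box::from_raw(p)) };
            }
        }
    }
}

impl Drop for EpochGuard<'_> {
    fn drop(&mut self) {
        // In a full implementation, exiting the epoch would be recorded here
        // so the collector knows when every task has left local_epoch.
        let _ = (self.collector, self.local_epoch);
    }
}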

Stack pivoting for hybrid async/synchronous code #

Legacy C libraries often require stackful contexts, e.g. FFI callbacks.

Stack pivoting temporarily swaps the async executor's stack pointer to a dedicated stack to integrate stackful code within stackless coroutines.

Embedded systems that bridge async Rust drivers with legacy real-time OS APIs requiring an intermediate stack are a good case for this.

A stack pivot is a basic structure that can hold a sync stack.

struct StackPivot {
    sync_stack: Vec<u8>,
}

A vector could be used for automatic memory management.

fn new(stack_size: usize) -> Self {
    let sync_stack = vec![0; stack_size];
    StackPivot { sync_stack }
}

Enter the sync context.

unsafe fn enter_sync<F, R>(&self, f: F) -> R
where
    F: FnOnce() -> R,
{}

Preinitialize the top of stack and return value. Recall that the stack grows downwards.

let sync_stack_top = self.sync_stack.as_ptr().add(self.sync_stack.len());
let mut result: usize;

A real implementation would call an external libc-style function; for reference we can use a stub that simply invokes the closure.

extern "C" fn call_closure<F, R>(f: F) -> R
where
    F: FnOnce() -> R,
{
    f()
}

Use inline assembly to switch stack and call function.

asm!(
    "push rbp", // 1. Save the current stack ptr.
    "mov rbp, rsp",

    "mov rsp, {stack_top}", // 2. Switch to our sync stack.

    "call {func}", // 3. Call the function.

    "mov rsp, rbp", // 4. Restore original stack.
    "pop rbp",

    stack_top = in(reg) sync_stack_top,
    func = sym call_closure::<F, R>,
    lateout("rax") result,

    clobber_abi("C"), // 5. Clobber registers that might be used.
    // no `nostack`/`preserves_flags` options here: this block pushes and calls
);

Convert the result back to the expected type.

std::mem::transmute_copy(&result)

Allocate 16 kB of stack size and call a function that works within that frame.

let pivot = Arc::new(StackPivot::new(16 * 1024));

unsafe {
    pivot.enter_sync(|| {
        some_legacy_c_function();
        0
    });
}

We directly manipulate the stack pointer (rsp on x86-64) to switch contexts.

The async task keeps using its compiler-generated state machine, while synchronous code runs on a custom, preallocated stack that behaves the way we expect.

Cooperative multitasking, in this context, becomes a framework for resource-aware concurrency.

References #