Machine: oci-saulire — ARM64 (aarch64), 4 vCPUs, 24 GB RAM, Debian stable
Compiler: rustc 1.95.0, release mode
Tokio commit: c6d58ce (baseline)
Benchmark harness: criterion via cargo bench --bench rt_multi_threaded
Patch: tokio-batch-pop.patch

Tokio’s scheduler is work-stealing1. Every worker gets its own local deque. When a worker runs out of tasks, it steals from neighbors. That part works great.

But tasks spawned outside the runtime need somewhere else to go. Overflow tasks do too. They land in the inject queue, a single MPSC structure that everyone shares. Workers check it on an adaptive interval. That interval targets roughly 61 tasks per 200 μs, clamped between 2 and 1272.

Under burst loads, every task contends for the same mutex. That single mutex becomes the bottleneck.

Batching the global tick

On a global queue tick, baseline tokio pulls exactly one task. The worker acquires the lock, pops one item, releases the lock, runs the task, and repeats later. Here is the code:

1
worker.handle.next_remote_task().or_else(|| self.next_local_task())

One task. One lock. Every time.

The idle path already knew a better trick. It popped n tasks, ran the first, and queued the rest locally. The patch extracts that logic into Core::batch_from_inject. Now both paths use it. The tick path caps at 32. The idle path caps at half the local queue to avoid overflow:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
fn next_task(&mut self, worker: &Worker) -> Option<Notified> {
    if self.tick % self.global_queue_interval == 0 {
        self.tune_global_queue_interval(worker);

        const MAX_BATCH: usize = 32;

        self.batch_from_inject(worker, MAX_BATCH)
            .or_else(|| self.next_local_task())
    } else {
        self.next_local_task()
            .or_else(|| self.batch_from_inject(worker, self.run_queue.max_capacity() / 2))
    }
}

Let me walk through batch_from_inject. It first checks if the inject queue has anything. Then it computes a safe cap based on free slots and the caller’s limit. It never takes zero items. The .max(1) guarantees at least one.

The batch size scales with congestion. It divides inject_len by num_workers, adds one, and clamps the result. With LOCAL_QUEUE_CAPACITY = 2563 and four workers, a flooded queue yields inject_len / 4 + 1, capped at 32.

Then it locks once, pops n tasks, unlocks, and pushes the extras into the local queue. One lock instead of n locks. That is the entire trick.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
fn batch_from_inject(&mut self, worker: &Worker, max_batch: usize) -> Option<Notified> {
    if worker.inject().is_empty() {
        return None;
    }

    let cap = self.run_queue.remaining_slots().min(max_batch).max(1);

    let n = (worker.inject().len() / worker.handle.shared.remotes.len() + 1)
        .clamp(1, cap);

    let mut synced = worker.handle.shared.synced.lock();
    let mut tasks = unsafe { worker.inject().pop_n(&mut synced.inject, n) };

    let ret = tasks.next();
    self.run_queue.push_back(tasks);
    ret
}

Why 32? Crossbeam-deque’s steal_batch uses the same number. The cap only kicks in when the queue is really backed up.

Fairness note: Injected tasks go to the back of the local queue. But workers pop from the back using LIFO ordering. So those injected tasks run before older local work sitting deeper in the queue. With cap=32, at most 31 tasks can jump ahead per tick. The worker cycles back to original local work long before the next global tick arrives. The bound is small and explicit.

Cap choice

I benchmarked different caps on busy2:

capbusy2 medianvs baseline
217.28 ms−49%
411.58 ms−66%
86.48 ms−81%
163.99 ms−88%
322.74 ms−92%
642.42 ms−93%
1282.23 ms−93%

The throughput knee sits right at 32. Beyond that, the curve flattens quickly.

But throughput is not the only reason for the cap. Fairness matters more. With cap=128, one tick could grab 128 remote tasks, run one, and push 127 to the back of the local queue. Existing local work would sit behind a wall of converted-remote tasks. 32 captures 92% of the throughput gain without that risk.

When inject_len stays below 124 on four workers, the formula never hits the cap at all. Shallow queues see smaller batches automatically. The fairness bound gets even tighter in practice.

Prior art

Tokio already batch-moves tasks internally. push_overflow in queue.rs3 moves LOCAL_QUEUE_CAPACITY / 2 tasks into the inject queue in one operation. The inject queue exposes push_batch for bulk insert. The APIs existed. The scheduler simply never called pop_n from the global tick path.

Crossbeam’s Chase-Lev deque4 batches by design. Its steal_batch grabs roughly half the queue, capped at 32, in a single CAS5. Same idea: one synchronization event, multiple items moved.

Results

BenchmarkBaselineBatch-popΔ
spawn_many_local10.05 ms9.74 msnoise
spawn_many_remote_idle6.99 ms7.12 msnoise
spawn_many_remote_busy17.36 ms6.92 msnoise
spawn_many_remote_busy233.96 ms2.74 ms−92%
ping_pong1.17 ms1.16 msnoise
yield_many11.30 ms11.70 msnoise

busy2 creates a large burst from outside the runtime while every worker stays saturated. Baseline acquires the inject mutex once per task. Batch-pop acquires it once per batch. The improvement is structural; it reproduces cleanly.

Risks

  • Fairness: at most 31 injected tasks can jump ahead of existing local work per tick. Larger caps risk burying local work indefinitely.
  • inject().len() costs an AtomicUsize::load(Acquire). No lock is taken, but it still crosses cache lines on every tick.
  • Low-volume traffic: a steady trickle of remote tasks might wait slightly longer inside local queues. The +1 floor and the cap keep this bounded.

Batch-pop uses tokio’s existing pop_n API. It applies a standard work-stealing amortization pattern. Whether upstream accepts it depends on the fairness trade-off.



  1. Carl Lerche, “Making the Tokio scheduler 10x faster,” Tokio Blog, October 13, 2019. https://tokio.rs/blog/2019-10-scheduler ↩︎

  2. tokio/src/runtime/scheduler/multi_thread/stats.rs, lines 33–39 and worker.rs:1063. https://github.com/tokio-rs/tokio/blob/c6d58ce/tokio/src/runtime/scheduler/multi_thread/stats.rs#L33 ↩︎

  3. tokio/src/runtime/scheduler/multi_thread/queue.rs, push_overflow method. https://github.com/tokio-rs/tokio/blob/c6d58ce/tokio/src/runtime/scheduler/multi_thread/queue.rs#L246 ↩︎ ↩︎

  4. David Chase and Yossi Lev. 2005. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures (SPAA ‘05). ACM, 21–28. https://doi.org/10.1145/1073970.1073974 ↩︎

  5. crossbeam-deque v0.8, Stealer::steal_batch documentation. https://docs.rs/crossbeam-deque/0.8/crossbeam_deque/struct.Stealer.html#method.steal_batch ↩︎