I watched a 4-vCPU ARM64 box crawl because of a single mutex, and the bottleneck wasn’t even in my application code.

I ran everything on oci-saulire, a Debian stable machine with an ARM64 (aarch64) chip, 4 vCPUs, and 24 GB RAM. I built with rustc 1.95.0 in release mode, used Tokio commit c6d58ce as the baseline, and measured with criterion via cargo bench --bench rt_multi_threaded. The patch I’m testing is tokio-batch-pop.patch.

Tokio’s scheduler is work-stealing [1]: each worker owns a local deque and steals from peers when idle. Tasks spawned from outside the runtime, or that overflow a local queue, land in a separate inject queue shared by all workers.

Workers poll it on an adaptive interval that targets ~61 tasks per 200 μs, clamped between 2 and 127 [2].
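To make the clamping concrete, here is a minimal sketch of how such an interval could be derived from a measured mean poll time. The function name, the constants, and the fallback to 61 when no measurement exists are my own assumptions for illustration; this is not Tokio's actual tuning code.

```rust
/// Target wall-clock gap between inject-queue polls (200 μs, from the text).
pub const TARGET_GAP_NANOS: f64 = 200_000.0;
/// Clamp bounds from the text.
pub const MIN_INTERVAL: u32 = 2;
pub const MAX_INTERVAL: u32 = 127;
/// Assumed fallback when there is no measurement yet (hypothetical choice).
pub const DEFAULT_INTERVAL: u32 = 61;

/// Hypothetical helper: how many local tasks to run between inject-queue
/// polls, given an estimate of the mean task poll time in nanoseconds.
pub fn tuned_interval(mean_poll_nanos: f64) -> u32 {
    if mean_poll_nanos.is_nan() || mean_poll_nanos <= 0.0 {
        return DEFAULT_INTERVAL;
    }
    // Float-to-int casts saturate in Rust, so very fast tasks cannot overflow.
    let tasks = (TARGET_GAP_NANOS / mean_poll_nanos) as u32;
    tasks.clamp(MIN_INTERVAL, MAX_INTERVAL)
}
```

Under these assumptions, tasks that take about 3.3 μs each give an interval near 61, very slow tasks bottom out at 2, and very fast tasks top out at 127.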

Under burst loads the inject queue becomes a serialization point: every task pushed to it or popped from it goes through the same mutex.

If we can batch the global tick, can we turn that mutex from a hard ceiling into a footnote and actually keep all four ARM64 cores busy under burst load?

Batching the global tick

I used to think a global queue tick meant a full sweep. Baseline proved me wrong. It pulls exactly one task and returns to local work.

That single steal is the entire global tick. What happens when the global queue starts backing up faster than one task at a time?

worker.handle.next_remote_task().or_else(|| self.next_local_task())

You know what beats writing new batching logic from scratch? Finding it already written. The idle path was already batching: it pops n tasks, executes the first, and queues the rest locally.

This patch extracts that logic into a Core::batch_from_inject helper and reuses it on both paths.

The tick path caps the batch at 32. The idle path caps at half the local queue to avoid overflow.

If we can unify batching this cleanly, what other scheduler paths are duplicating work they don’t need to?

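fn next_task(&mut self, worker: &Worker) -> Option<Notified> {
    if self.tick % self.global_queue_interval == 0 {
        // Global tick: re-tune the interval, then pull a bounded batch from
        // the inject queue before falling back to local work.
        self.tune_global_queue_interval(worker);

        const MAX_BATCH: usize = 32;

        self.batch_from_inject(worker, MAX_BATCH)
            .or_else(|| self.next_local_task())
    } else {
        // Normal tick: local work first. If the local queue is empty (the
        // idle path), refill from the inject queue, up to half its capacity.
        self.next_local_task()
            .or_else(|| self.batch_from_inject(worker, self.run_queue.max_capacity() / 2))
    }
}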

I used to assume the global injector sprayed tasks evenly across workers. It doesn’t.

Instead, batch_from_inject pops a batch proportional to inject_len / num_workers + 1, bounded by the caller’s cap. It pushes the remainder into the local queue and returns the first task directly.

That immediate hand-off skips a queue operation, but the real question is whether that batch size actually matches your hardware layout—what happens when inject_len is barely larger than num_workers?

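fn batch_from_inject(&mut self, worker: &Worker, max_batch: usize) -> Option<Notified> {
    // Nothing queued remotely, so skip the lock entirely.
    if worker.inject().is_empty() {
        return None;
    }

    // Never take more than the local queue can hold, but always allow one
    // task, because the first task is returned directly rather than queued.
    let cap = self.run_queue.remaining_slots().min(max_batch).max(1);

    // This worker's share of the inject queue, bounded by the cap.
    let n = (worker.inject().len() / worker.handle.shared.remotes.len() + 1)
        .clamp(1, cap);

    // One lock acquisition covers the whole batch.
    let mut synced = worker.handle.shared.synced.lock();
    let mut tasks = unsafe { worker.inject().pop_n(&mut synced.inject, n) };

    // Hand the first task back to the caller; push the rest locally.
    let ret = tasks.next();
    self.run_queue.push_back(tasks);
    ret
}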
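The sizing arithmetic is easy to check in isolation. The sketch below lifts the two expressions for cap and n out of batch_from_inject into a free function over plain integers. The function name and parameter list are mine, and it assumes num_workers is at least 1.

```rust
/// Batch size for one pull from the inject queue, using the same arithmetic
/// as `batch_from_inject`. Assumes `num_workers >= 1`.
pub fn batch_size(
    inject_len: usize,
    num_workers: usize,
    remaining_slots: usize,
    max_batch: usize,
) -> usize {
    // Bounded by free local slots and the caller's cap, but never below 1.
    let cap = remaining_slots.min(max_batch).max(1);
    // This worker's share of the queue, plus one, clamped into [1, cap].
    (inject_len / num_workers + 1).clamp(1, cap)
}
```

With four workers and an empty local queue, an inject queue of 5 tasks gives a batch of 2, so a queue barely longer than the worker count still moves more than one task per lock. A full local queue (zero remaining slots) still yields a batch of 1, which is the task returned directly.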

I didn’t pick 32 because it looks clean in a config file. MAX_BATCH = 32 matches crossbeam-deque’s steal_batch size exactly, and the arithmetic leaves no wiggle room.

Run the numbers with 4 workers and LOCAL_QUEUE_CAPACITY = 256 [3], and you get inject_len / 4 + 1, capped at 32.

Here’s the part that surprised me. Injected tasks land at the back of the local queue, but because the worker pops from the back—LIFO—they run before older local work that sits deeper in the queue.

With cap=32 this head-of-line blocking is bounded to at most 31 tasks per tick. The worker cycles back to original-local work well before the next global tick.

So the cap isn’t just a performance knob—it’s a fairness guarantee you can count in tasks. If you add a fifth worker, do you trust yourself to remember that the denominator changes?

Cap choice

You usually hit a wall after two doublings. With busy2 (spawn_many_remote_busy2), I kept cranking the batch cap and the median latency kept collapsing, until it suddenly didn’t.

At a cap of 2, the benchmark limped along at 17.28 ms, a 49% reduction from baseline. Doubling to 4 dropped it to 11.58 ms, a 66% reduction, and doubling again to 8 hit 6.48 ms, an 81% reduction.

By 16, I was staring at 3.99 ms, an 88% reduction. The knee in the curve landed at 32, where the median fell to 2.74 ms, a 92% reduction from baseline.

Past that, the returns evaporated. 64 gave me 2.42 ms, and 128 only managed 2.23 ms, both 93% below baseline.

That final doubling from 64 to 128 bought me less than 0.2 ms.

So where is the actual limit? If a cap of 128 barely beats 32, the bottleneck has already shifted from mutex contention to something deeper in the pipeline.

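| cap | busy2 median | vs baseline |
|-----|--------------|-------------|
| 2   | 17.28 ms     | −49%        |
| 4   | 11.58 ms     | −66%        |
| 8   | 6.48 ms      | −81%        |
| 16  | 3.99 ms      | −88%        |
| 32  | 2.74 ms      | −92%        |
| 64  | 2.42 ms      | −93%        |
| 128 | 2.23 ms      | −93%        |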
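The percentages can be recomputed from the medians. The short sketch below does that against the 33.96 ms busy2 baseline from the results table, and also reports how many milliseconds each doubling saves. The constant and function names are mine; the numbers are the measured medians above.

```rust
/// busy2 baseline median in milliseconds.
pub const BASELINE_MS: f64 = 33.96;

/// (cap, busy2 median in ms) pairs from the cap sweep.
pub const MEDIANS: [(u32, f64); 7] = [
    (2, 17.28),
    (4, 11.58),
    (8, 6.48),
    (16, 3.99),
    (32, 2.74),
    (64, 2.42),
    (128, 2.23),
];

/// Percentage reduction relative to the baseline, rounded to a whole percent.
pub fn reduction_pct(baseline_ms: f64, median_ms: f64) -> i64 {
    ((1.0 - median_ms / baseline_ms) * 100.0).round() as i64
}

/// Milliseconds saved by each doubling, as (new cap, ms saved vs previous cap).
pub fn gain_per_doubling() -> Vec<(u32, f64)> {
    MEDIANS
        .windows(2)
        .map(|pair| (pair[1].0, pair[0].1 - pair[1].1))
        .collect()
}
```

The per-doubling savings shrink from several milliseconds at the low end to about 1.25 ms going from 16 to 32, then to roughly 0.3 ms and 0.2 ms for the last two steps.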

The throughput knee is at 32. Everything beyond that is just diminishing returns.

But I capped it there for a larger reason: fairness. With 128, a single global tick can grab 128 remote tasks, execute one, and push the other 127 to the back of the local queue.

Your existing local work gets buried behind a wall of converted-remote tasks. Cap at 32 and you still get a 92% reduction, about half a millisecond behind the 128 result, without that risk.

Below inject_len = 124 on four workers, the formula never even hits the cap. Shallow queues see smaller batches, and the fairness bound tightens even further.
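The 124 figure falls straight out of the formula, and it moves with the worker count. Here is a small sketch that computes the queue depth at which inject_len / workers + 1 first reaches the cap, along with the per-tick bound on deferred tasks. Both helper names are hypothetical.

```rust
/// Smallest inject-queue length at which `inject_len / workers + 1`
/// reaches `cap`. Assumes `cap >= 1` and `workers >= 1`.
pub fn saturation_len(cap: usize, workers: usize) -> usize {
    cap.saturating_sub(1) * workers
}

/// Upper bound on tasks pushed to the local queue in one pull:
/// the batch minus the one task that is returned directly.
pub fn max_deferred(cap: usize) -> usize {
    cap.saturating_sub(1)
}
```

With a cap of 32 the threshold is 124 for four workers, 155 for five, and 248 for eight, while the deferred-task bound stays at 31 regardless of worker count.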

How much lower can we drop the cap on eight workers before the throughput curve actually breaks?

Prior art

You don’t need new queue code to speed up Tokio’s scheduler. The batching machinery is already sitting there.

push_overflow in queue.rs [3] moves LOCAL_QUEUE_CAPACITY / 2 tasks into the inject queue in one go. The inject queue also exposes push_batch for batch insert.

The API wasn’t missing. The scheduler simply never called pop_n from the global tick path.

Crossbeam’s Chase-Lev deque [4], the de facto Rust work-stealing implementation, batches by design. Its steal_batch grabs roughly half the queue (capped at 32) in a single CAS [5]. Same principle: one synchronization, multiple items moved.
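The "half the queue, capped at 32" rule can be modelled with two plain VecDeques. This is a single-threaded sketch of the sizing policy only, not crossbeam's lock-free implementation. Rounding the half up is my assumption, and steal_half is a made-up name.

```rust
use std::collections::VecDeque;

/// Upper bound on tasks moved per steal, matching the cap quoted above.
pub const MAX_BATCH: usize = 32;

/// Move roughly half of `src` (rounded up, at most `MAX_BATCH` items) from
/// the front of `src` to the back of `dst`. Returns the number moved.
pub fn steal_half<T>(src: &mut VecDeque<T>, dst: &mut VecDeque<T>) -> usize {
    let n = ((src.len() + 1) / 2).min(MAX_BATCH);
    // In a real deque this transfer is the part covered by one
    // synchronization step; here it is an ordinary drain.
    dst.extend(src.drain(..n));
    n
}
```

A victim holding 10 tasks gives up 5, and one holding 100 gives up 32, so the cost of synchronizing is spread over the whole batch rather than paid per task.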

If both the overflow and steal paths already move tasks in bulk, why is the global tick path still asking for one at a time?

Results

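| Benchmark               | Baseline | Batch-pop | Δ     |
|-------------------------|----------|-----------|-------|
| spawn_many_local        | 10.05 ms | 9.74 ms   | noise |
| spawn_many_remote_idle  | 6.99 ms  | 7.12 ms   | noise |
| spawn_many_remote_busy1 | 7.36 ms  | 6.92 ms   | noise |
| spawn_many_remote_busy2 | 33.96 ms | 2.74 ms   | −92%  |
| ping_pong               | 1.17 ms  | 1.16 ms   | noise |
| yield_many              | 11.30 ms | 11.70 ms  | noise |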

I love a benchmark that isolates exactly one bottleneck. busy2 slams the runtime with a large burst of external spawns while every worker is already saturated. That single condition makes the injector mutex the dominant cost.

Baseline acquires that mutex once per task. Batch-pop acquires it once per batch. The improvement is structural, not noise, and it reproduces cleanly every time.
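To see the structural difference without a benchmark harness, here is a toy model that counts lock acquisitions while draining a mutex-protected queue. ToyInject and drain_all are stand-ins I made up for illustration; they share nothing with Tokio's inject queue beyond the "one mutex, one FIFO" shape.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Toy stand-in for the inject queue: a mutex-protected FIFO plus a counter
/// recording how many times the lock was taken.
pub struct ToyInject {
    queue: Mutex<VecDeque<u32>>,
    lock_count: AtomicUsize,
}

impl ToyInject {
    pub fn with_tasks(n: u32) -> Self {
        ToyInject {
            queue: Mutex::new((0..n).collect()),
            lock_count: AtomicUsize::new(0),
        }
    }

    /// Pop up to `n` tasks under a single lock acquisition.
    pub fn pop_n(&self, n: usize) -> Vec<u32> {
        let mut q = self.queue.lock().unwrap();
        self.lock_count.fetch_add(1, Ordering::Relaxed);
        let take = n.min(q.len());
        let batch: Vec<u32> = q.drain(..take).collect();
        batch
    }

    pub fn lock_count(&self) -> usize {
        self.lock_count.load(Ordering::Relaxed)
    }
}

/// Drain the queue `batch` tasks at a time.
/// Returns (tasks drained, lock acquisitions), counting the final empty pop.
pub fn drain_all(inject: &ToyInject, batch: usize) -> (usize, usize) {
    let mut drained = 0;
    loop {
        let tasks = inject.pop_n(batch);
        if tasks.is_empty() {
            break;
        }
        drained += tasks.len();
    }
    (drained, inject.lock_count())
}
```

Draining 1000 tasks one at a time takes 1001 lock acquisitions in this model; draining them 32 at a time takes 33. The model says nothing about how expensive each acquisition is under contention, only how many of them there are.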

The real question isn’t whether batching wins under this kind of contention. It’s whether your production workload looks enough like busy2 for the mutex to matter.

Risks

  • Fairness: bounded to at most 31 injected tasks ahead of existing local work per tick. Larger caps raise that bound and risk burying local work for longer.
  • inject().len() is an AtomicUsize::load(Acquire): no lock, but still a cache-coherence operation on every call to batch_from_inject.
  • Low-volume traffic: a small steady trickle of remote tasks could see slightly worse latency if batch-pop defers them into local queues. The +1 floor and cap keep this bounded.

Batch-pop uses tokio’s existing pop_n API and applies a standard work-stealing amortization pattern. Whether upstream accepts it depends on the fairness trade-off.



  1. Carl Lerche, “Making the Tokio scheduler 10x faster,” Tokio Blog, October 13, 2019. https://tokio.rs/blog/2019-10-scheduler

  2. tokio/src/runtime/scheduler/multi_thread/stats.rs, lines 33–39 and worker.rs:1063. https://github.com/tokio-rs/tokio/blob/c6d58ce/tokio/src/runtime/scheduler/multi_thread/stats.rs#L33

  3. tokio/src/runtime/scheduler/multi_thread/queue.rs, push_overflow method. https://github.com/tokio-rs/tokio/blob/c6d58ce/tokio/src/runtime/scheduler/multi_thread/queue.rs#L246

  4. David Chase and Yossi Lev. 2005. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures (SPAA ’05). ACM, 21–28. https://doi.org/10.1145/1073970.1073974

  5. crossbeam-deque v0.8, Stealer::steal_batch documentation. https://docs.rs/crossbeam-deque/0.8/crossbeam_deque/struct.Stealer.html#method.steal_batch