Two #[inline] hints on the mpsc receive path shave 14.7% off large
messages (32 KB) and 11% off medium ones (512 B). Small messages (8 B)
barely move. No regressions >5% across the full suite.
Patch: tokio-mpsc-recv-inline.patch
The call chain
Every rx.recv().await walks three layers inside tokio:
chan::Rx::recv() (async poll wrapper) → list::Rx::pop() (advance cursor, reclaim blocks) → Block::read() (ptr::read(slot)) → option devirt → caller
recv() is pub; pop() and read() are pub(crate). All three live in the
same crate (under tokio/src/sync/mpsc/), just in different modules, so
visibility is not the obstacle.
The problem is release-mode codegen. Rust splits each crate into multiple
codegen units (CGUs); Cargo’s release profile defaults to 16, so a crate like
tokio is carved into up to 16 chunks that are compiled in parallel. Functions
in different CGUs are compiled separately and linked together, much like
cross-crate calls. The compiler generally won’t inline across a CGU boundary
unless the function is marked #[inline] or LTO is enabled.
So pop() calls read() through a real call/return boundary.
#[inline] is a hint that tells the compiler to keep this body’s IR available
to every CGU that calls it, so it can be inlined even across CGU boundaries.
After adding it to both pop() and read(), the call chain collapses into the
caller.
The optimizer now sees through both Option layers and routes the ptr::read
straight to where the caller needs it, skipping intermediate copies.
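To make the shape of the chain concrete, here is a minimal, synchronous sketch of three layers that each hand an Option back to the layer above. The types Read, Block, ListRx and ChanRx, and the try_recv and new methods, are hypothetical stand-ins written for this illustration, not tokio's actual definitions. The sketch uses Option::take in safe code where the real Block::read() uses ptr::read. The #[inline] attributes sit where the patch is described as placing them, on the two inner functions. Whether the optimizer actually inlines them remains its decision.

```rust
/// Hypothetical stand-in for tokio's `block::Read<T>`.
#[allow(dead_code)] // `Closed` is never constructed in this sketch
pub enum Read<T> {
    Value(T),
    Closed,
}

/// Innermost layer: owns the slots. This sketch uses `Option::take`
/// to stay in safe code instead of reading a raw slot.
pub struct Block<T> {
    slots: Vec<Option<T>>,
}

impl<T> Block<T> {
    #[inline] // hint: keep the body available for inlining into `pop`
    pub fn read(&mut self, index: usize) -> Option<Read<T>> {
        self.slots.get_mut(index)?.take().map(Read::Value)
    }
}

/// Middle layer: tracks the read cursor.
pub struct ListRx<T> {
    block: Block<T>,
    index: usize,
}

impl<T> ListRx<T> {
    #[inline] // hint: keep the body available for inlining into the caller
    pub fn pop(&mut self) -> Option<Read<T>> {
        let read = self.block.read(self.index)?;
        self.index += 1; // advance only after a successful read
        Some(read)
    }
}

/// Outer layer: the public entry point (synchronous here, for simplicity).
pub struct ChanRx<T> {
    list: ListRx<T>,
}

impl<T> ChanRx<T> {
    pub fn new(items: Vec<T>) -> Self {
        let slots = items.into_iter().map(Some).collect();
        ChanRx { list: ListRx { block: Block { slots }, index: 0 } }
    }

    /// Unwraps both `Option` layers and hands the value to the caller.
    pub fn try_recv(&mut self) -> Option<T> {
        match self.list.pop()? {
            Read::Value(value) => Some(value),
            Read::Closed => None,
        }
    }
}

fn main() {
    let mut rx = ChanRx::new(vec![[0u8; 512], [1u8; 512]]);
    while let Some(msg) = rx.try_recv() {
        println!("received {} bytes, first byte {}", msg.len(), msg[0]);
    }
}
```

With a payload as large as the 512-byte arrays used in main, each Option returned by value is a potential copy, which is the cost the inlined chain gives the optimizer a chance to remove.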
Numbers
Runs on oci-saulire (ARM64, 24 GB), rustc 1.95.0, criterion --sample-size 50 --measurement-time 3, n=3 runs with a 15 s cooldown. We report the minimum of
the medians.
| benchmark | size | baseline | patched | Δ |
|---|---|---|---|---|
| recv_only/small_1000 | 8 B | 37.8 µs | 37.4 µs | −1.1% |
| recv_only/medium_1000 | 512 B | 100.9 µs | 89.8 µs | −11.0% |
| recv_only/large_1000 | 32 KB | 4.80 ms | 4.09 ms | −14.7% |
Per-message:
| size | baseline | patched |
|---|---|---|
| 8 B | ~37.8 ns | ~37.4 ns |
| 512 B | ~100.9 ns | ~89.8 ns |
| 32 KB | ~4.80 µs | ~4.09 µs |
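The Δ column and the per-message figures follow from simple arithmetic, sketched below. The helper names delta_percent and per_message_ns are made up for this example, and the code assumes each benchmark iteration receives 1000 messages, as the _1000 suffix suggests. Because the table values are already rounded, a recomputed Δ can differ from the published one in the last digit.

```rust
/// Relative change from `baseline` to `patched`, in percent.
/// Negative means the patched build is faster.
fn delta_percent(baseline: f64, patched: f64) -> f64 {
    (patched - baseline) / baseline * 100.0
}

/// Per-message cost in nanoseconds, given the time for one iteration in
/// microseconds and the number of messages received per iteration.
fn per_message_ns(iteration_us: f64, messages: u32) -> f64 {
    iteration_us * 1000.0 / f64::from(messages)
}

fn main() {
    // (size, baseline µs, patched µs), copied from the recv_only table;
    // the 32 KB row is converted from ms to µs.
    let rows = [
        ("8 B", 37.8, 37.4),
        ("512 B", 100.9, 89.8),
        ("32 KB", 4800.0, 4090.0),
    ];
    for (size, baseline, patched) in rows {
        println!(
            "{size}: {:.1} ns -> {:.1} ns per message ({:+.1}%)",
            per_message_ns(baseline, 1000),
            per_message_ns(patched, 1000),
            delta_percent(baseline, patched),
        );
    }
}
```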
Full-suite cross-check (no regressions >5%):
| benchmark | baseline | patched | Δ |
|---|---|---|---|
| contention/bounded | 1.14 ms | 1.17 ms | +2.6% |
| contention/bounded_recv_many | 974 µs | 932 µs | −4.3% |
| uncontented/bounded | 485 µs | 487 µs | +0.4% |
| send/medium_1000 | 706 ns | 701 ns | −0.7% |
| send/large_1000 | 12.4 µs | 12.0 µs | −3.2% |
The contention benchmarks moved by +2.6% and −4.3%, in opposite directions; swings of this size are normal for multi-threaded runs on a shared host. Not a result.
Why the scaling
For 8 B objects, queue bookkeeping (atomic loads, cursor math, branch mispredicts) dominates. Eliminating two call frames saves ~0.4 ns per message, which is barely measurable.
For 32 KB objects, the dominant cost is memory traffic. Before the patch,
ptr::read(slot) copies 32 KB into a temporary Option<Read<T>> in
read(), then into another Option in pop(), then finally into the
user’s variable. After inlining, the compiler can route the read directly to
the final destination, saving one or two 32 KB copies per message. Across 1000
messages that’s 32–64 MB less memory traffic per iteration, which is
consistent with the 0.71 ms the large benchmark saves.
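As a back-of-envelope check on that estimate, the sketch below multiplies out the traffic saved and asks what copy rate would account for the measured difference of 4.80 ms − 4.09 ms. It assumes KB means 1024 bytes, that each iteration receives 1000 messages, and that the whole speedup comes from the eliminated copies. The last assumption is a simplification, so the resulting rate is a sanity check rather than a measurement. The names bytes_saved and implied_gb_per_s are made up for this example.

```rust
const MESSAGE_BYTES: u64 = 32 * 1024; // assumes 1 KB = 1024 bytes
const MESSAGES: u64 = 1000; // messages per benchmark iteration

/// Bytes no longer copied per iteration if `copies` intermediate copies
/// of each message are eliminated.
fn bytes_saved(copies: u64) -> u64 {
    copies * MESSAGE_BYTES * MESSAGES
}

/// Copy rate (GB/s, decimal) that would explain saving `seconds`
/// by no longer moving `bytes`.
fn implied_gb_per_s(bytes: u64, seconds: f64) -> f64 {
    bytes as f64 / seconds / 1e9
}

fn main() {
    // Baseline minus patched for recv_only/large_1000, in seconds.
    let saved_seconds = (4.80 - 4.09) * 1e-3;
    for copies in 1..=2 {
        let bytes = bytes_saved(copies);
        println!(
            "{copies} copy/copies eliminated: {:.1} MB less traffic, implied rate {:.1} GB/s",
            bytes as f64 / 1e6,
            implied_gb_per_s(bytes, saved_seconds),
        );
    }
}
```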
The fix
Two lines: one #[inline] attribute on list::Rx::pop() and one on Block::read().
No new public API. No behavior change. No MSRV bump.
Reproduce
Patch ready to apply against tokio:master (c6d58ce7): tokio-mpsc-recv-inline.patch
References
- tokio/src/sync/mpsc/block.rs: Block::read()
- tokio/src/sync/mpsc/list.rs: list::Rx::pop()
- tokio/src/sync/mpsc/chan.rs: Receiver::recv() wrapper