Inlining Tokio MPSC recv: removing the async tax

Two #[inline] hints on the mpsc receive path shave 14.7% off receive time for large messages (32 KB) and 11% for medium ones (512 B). Small messages (8 B) barely move. No regressions >5% across the full suite.

Patch: tokio-mpsc-recv-inline.patch

The call chain

Every rx.recv().await walks three layers inside tokio:

flowchart LR
    A["chan::Rx::recv()<br/>async poll wrapper"] --> B["list::Rx::pop()<br/>advance cursor, reclaim blocks"]
    B --> C["Block::read()<br/>ptr::read(slot)"]
    C --> D["option devirt<br/>→ caller"]

recv() is pub; pop() and read() are pub(crate). All three live in the same crate (tokio, under tokio/src/sync/mpsc/), just in different modules, so visibility is not what keeps them from being inlined.

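The problem is release-mode codegen. Rust splits each crate into multiple codegen units (CGUs); Cargo's release profile defaults to 16, so a crate like tokio is carved into up to 16 chunks that compile in parallel. Functions that land in different CGUs are compiled separately and linked together, much like cross-crate calls. Across a CGU boundary the compiler is unlikely to inline unless the function is marked #[inline] or link-time optimization does it. Cargo's default release profile runs only thin local LTO within the crate, whose heuristics give no guarantee; lto = "fat" or codegen-units = 1 would let the optimizer see the whole chain, at the cost of build time.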

So pop() calls read() through a real call/return boundary:

pop() [CGU 3]
  CALL read() [CGU 7]
    ...
  RET

#[inline] is a hint, not a guarantee: it keeps the function body available to every CGU that calls it, so the optimizer can inline it even across CGU boundaries. After adding it to both pop() and read(), the call chain collapses into the caller:

recv() poll closure
  [inlined pop()]
    [inlined read()]
      ptr::read(slot) → direct into destination

The optimizer now sees through both Option layers and routes the ptr::read straight to where the caller needs it, skipping intermediate copies.
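To make the shape of that chain concrete, here is a minimal, single-threaded sketch. Every type and method in it (`Block`, `Rx`, `Read`, `from_values`) is a hypothetical stand-in with none of tokio's atomics or block reclamation. It only shows three layers that each wrap the payload in an `Option`, with `#[inline]` on the two inner ones, as in the patch.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for what `read()` hands back; not tokio's real type.
pub enum Read<T> {
    Value(T),
    Closed,
}

/// Toy block of slots (assumption: single-threaded, no atomics).
/// Values that are never read are leaked, which is safe but wasteful.
pub struct Block<T> {
    slots: Vec<MaybeUninit<T>>,
    ready: Vec<bool>,
}

impl<T> Block<T> {
    pub fn from_values(values: Vec<T>) -> Self {
        let ready = vec![true; values.len()];
        let slots = values.into_iter().map(MaybeUninit::new).collect();
        Block { slots, ready }
    }

    /// Innermost layer: moves the payload out of its slot.
    #[inline]
    pub fn read(&mut self, index: usize) -> Option<Read<T>> {
        if index == self.slots.len() {
            return Some(Read::Closed); // past the last slot: treat as closed
        }
        if !*self.ready.get(index)? {
            return None;
        }
        self.ready[index] = false;
        // SAFETY: `ready[index]` was true, so the slot is initialized, and
        // clearing the flag above ensures it is never read twice.
        let value = unsafe { self.slots[index].assume_init_read() };
        Some(Read::Value(value))
    }
}

pub struct Rx<T> {
    block: Block<T>,
    index: usize,
}

impl<T> Rx<T> {
    pub fn new(values: Vec<T>) -> Self {
        Rx { block: Block::from_values(values), index: 0 }
    }

    /// Middle layer: advances the cursor and forwards the read.
    #[inline]
    pub fn pop(&mut self) -> Option<Read<T>> {
        let read = self.block.read(self.index)?;
        if matches!(read, Read::Value(_)) {
            self.index += 1;
        }
        Some(read)
    }

    /// Outer layer: the only place the payload needs to end up.
    pub fn recv(&mut self) -> Option<T> {
        match self.pop()? {
            Read::Value(value) => Some(value),
            Read::Closed => None,
        }
    }
}
```

Without inlining, each layer may materialize its own `Option<Read<T>>` return value. With the two inner bodies visible inside `recv()`, the optimizer is free to write the payload once. Whether it actually does so depends on the compiler version and payload type.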

Numbers

Runs on oci-saulire (ARM64, 24 GB), rustc 1.95.0, criterion --sample-size 50 --measurement-time 3, n=3 runs with 15 s cooldown. We report the minimum of the three medians.
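The "minimum of medians" reduction can be sketched as follows, assuming each run is reduced to the median of its samples and the smallest of those is reported. The function names are illustrative and are not part of criterion.

```rust
/// Median of one run's samples; `None` for an empty run.
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Smallest per-run median across all runs (empty runs are skipped).
pub fn min_of_medians(runs: &[Vec<f64>]) -> Option<f64> {
    runs.iter()
        .filter_map(|run| median(run))
        .min_by(|a, b| a.total_cmp(b))
}
```

Taking the minimum across runs favors the least-disturbed run, which suits a shared host where noise mostly makes timings slower rather than faster.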

benchmark	size	baseline	patched	Δ
`recv_only/small_1000`	8 B	37.8 µs	37.4 µs	−1.1%
`recv_only/medium_1000`	512 B	100.9 µs	89.8 µs	−11.0%
`recv_only/large_1000`	32 KB	4.80 ms	4.09 ms	−14.7%

Per-message:

size	baseline	patched
8 B	~37.8 ns	~37.4 ns
512 B	~100.9 ns	~89.8 ns
32 KB	~4.80 µs	~4.09 µs
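Both derived columns follow from the raw timings by simple arithmetic. The sketch below (helper names are illustrative) recomputes them, assuming each benchmark iteration receives 1000 messages as the `_1000` suffix suggests. With the rounded table values the large case comes out near −14.8%, so the reported −14.7% presumably comes from unrounded timings.

```rust
/// Relative change in percent; negative means the patched build is faster.
pub fn delta_percent(baseline: f64, patched: f64) -> f64 {
    (patched - baseline) / baseline * 100.0
}

/// Converts a per-iteration time in microseconds to nanoseconds per message.
pub fn per_message_ns(total_us: f64, messages: u32) -> f64 {
    total_us * 1_000.0 / f64::from(messages)
}
```

For example, `delta_percent(100.9, 89.8)` is about −11.0, and `per_message_ns(100.9, 1000)` is about 100.9 ns.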

Full-suite cross-check (no regressions >5%):

benchmark	baseline	patched	Δ
`contention/bounded`	1.14 ms	1.17 ms	+2.6%
`contention/bounded_recv_many`	974 µs	932 µs	−4.3%
`uncontented/bounded`	485 µs	487 µs	+0.4%
`send/medium_1000`	706 ns	701 ns	−0.7%
`send/large_1000`	12.4 µs	12.0 µs	−3.2%

The contention benchmarks moved by +2.6% and −4.3%, in opposite directions. That is normal noise for multi-threaded runs on a shared host, not a result.

Why the scaling

For 8 B objects, queue bookkeeping (atomic loads, cursor math, branch mispredicts) dominates. Eliminating two call frames saves ~0.4 ns — barely measurable.

For 32 KB objects, the dominant cost is memory traffic. Before the patch, ptr::read(slot) copies 32 KB into a temporary Option<Read<T>> in read(), then into another Option in pop(), and finally into the user’s variable. After inlining, the compiler can route the read directly to the final destination, saving one or two 32 KB copies per message. Across 1000 messages that is roughly 32–64 MB less data copied, which is consistent with the ~0.7 ms the large benchmark gains per iteration.
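The copy-volume estimate is plain multiplication. A small sketch, assuming 32 KB means 32 × 1024 bytes and that one or two whole-payload copies are removed per message (the helper name is illustrative):

```rust
/// Bytes no longer copied per benchmark iteration.
pub fn bytes_saved(messages: u64, payload_bytes: u64, copies_removed: u64) -> u64 {
    messages * payload_bytes * copies_removed
}
```

`bytes_saved(1000, 32 * 1024, 1)` is 32,768,000 bytes (about 32.8 MB) and two copies double that, which matches the 32–64 MB range quoted above.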

The fix

Two lines:

--- a/tokio/src/sync/mpsc/block.rs
+++ b/tokio/src/sync/mpsc/block.rs
@@ -149,6 +149,7 @@ impl<T> Block<T> {
     /// To maintain safety, the caller must ensure:
     ///
     /// * No concurrent access to the slot.
+    #[inline]
     pub(crate) unsafe fn read(&self, slot_index: usize) -> Option<Read<T>> {
         let offset = offset(slot_index);

--- a/tokio/src/sync/mpsc/list.rs
+++ b/tokio/src/sync/mpsc/list.rs
@@ -317,6 +317,7 @@ impl<T> Rx<T> {
     }
 
     /// Pops the next value off the queue.
+    #[inline]
     pub(crate) fn pop(&mut self, tx: &Tx<T>) -> Option<block::Read<T>> {
         // Advance `head`, if needed
         if !self.try_advancing_head() {

No new public API. No behavior change. No MSRV bump.

Reproduce

cargo bench -p benches --bench sync_mpsc -- recv_only
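For readers without the benchmark crate at hand, the idea of a receive-only measurement can be illustrated with the standard library alone. This sketch deliberately swaps in `std::sync::mpsc` and `Instant` for tokio and criterion, so its timings are not comparable to the table above. The function name and structure are assumptions, not the actual `recv_only` benchmark.

```rust
use std::hint::black_box;
use std::sync::mpsc;
use std::time::{Duration, Instant};

/// Pre-fills a channel with `count` messages of `N` bytes, then times only
/// the receive loop. Returns the number of messages received and the time taken.
pub fn time_recv_only<const N: usize>(count: usize) -> (usize, Duration) {
    let (tx, rx) = mpsc::channel::<[u8; N]>();
    for _ in 0..count {
        tx.send([0u8; N]).expect("receiver is still alive");
    }
    drop(tx); // closing the sender lets the receive loop terminate

    let start = Instant::now();
    let mut received = 0;
    while let Ok(message) = rx.recv() {
        black_box(message); // keep the optimizer from discarding the payload
        received += 1;
    }
    (received, start.elapsed())
}
```

Calling `time_recv_only::<512>(1000)` mirrors the medium case in shape: sends happen outside the timed region, so only the receive path is measured.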

Patch ready to apply against tokio:master (c6d58ce7): tokio-mpsc-recv-inline.patch

References

tokio/src/sync/mpsc/block.rs — Block::read()
tokio/src/sync/mpsc/list.rs — list::Rx::pop()
tokio/src/sync/mpsc/chan.rs — chan::Rx::recv() wrapper

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
