Two #[inline] hints on the mpsc receive path shave 14.7% off large messages (32 KB) and 11% off medium (512 B). Small objects (8 B) barely move. No regressions >5% across the full suite.

Patch: tokio-mpsc-recv-inline.patch


The call chain

Every rx.recv().await walks three layers inside tokio:

flowchart LR
    A["chan::Rx::recv(): async poll wrapper"] --> B["list::Rx::pop(): advance cursor, reclaim blocks"]
    B --> C["Block::read(): ptr::read(slot)"]
    C --> D["option devirt → caller"]

recv() is pub. pop() and read() are pub(crate). They all live in the same crate (tokio/src/sync/mpsc/), just different modules. So they can see each other.

The problem is release-mode codegen. Rust splits each crate into multiple codegen units (CGUs); the release profile defaults to codegen-units = 16, so a crate like tokio is carved into up to 16 chunks that compile in parallel. Functions in different CGUs are compiled separately and linked together, much like cross-crate calls. The compiler generally won’t inline across a CGU boundary unless the function is marked #[inline] or LTO is enabled.

So pop() calls read() through a real call/return boundary:

pop() [CGU 3]
  CALL read() [CGU 7]
    ...
  RET

#[inline] is a hint to the compiler: keep this function’s body available to every CGU that calls it, so it can be inlined even across CGU boundaries. After adding it to both pop() and read(), the call chain collapses into the caller:

recv() poll closure
  [inlined pop()]
    [inlined read()]
      ptr::read(slot) → direct into destination

The optimizer now sees through both Option layers and routes the ptr::read straight to where the caller needs it, skipping intermediate copies.
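To make the layering concrete, here is a minimal sketch of the same shape: an inner read() that returns Option<Read<T>>, a pop() that forwards it, and a recv() that unwraps it for the caller. The Block, Rx and Read types below are simplified, hypothetical stand-ins (safe code over a Vec, no atomics, no ptr::read), not tokio's actual implementation. They only show where the two #[inline] hints sit relative to the Option layers.

```rust
/// Hypothetical stand-in for tokio's `block::Read<T>`.
#[allow(dead_code)]
enum Read<T> {
    Value(T),
    Closed,
}

/// Hypothetical block of slots. The real one is lock-free and uses `ptr::read`.
struct Block<T> {
    slots: Vec<Option<T>>,
}

impl<T> Block<T> {
    /// Innermost layer: move the value out of a slot, wrapped in an `Option`.
    #[inline]
    fn read(&mut self, slot_index: usize) -> Option<Read<T>> {
        self.slots.get_mut(slot_index)?.take().map(Read::Value)
    }
}

/// Hypothetical receiver: one block plus a cursor.
struct Rx<T> {
    block: Block<T>,
    index: usize,
}

impl<T> Rx<T> {
    /// Middle layer: read the current slot, then advance the cursor.
    #[inline]
    fn pop(&mut self) -> Option<Read<T>> {
        let read = self.block.read(self.index)?;
        self.index += 1;
        Some(read)
    }

    /// Outer layer: unwrap both `Option` layers for the caller.
    fn recv(&mut self) -> Option<T> {
        match self.pop()? {
            Read::Value(value) => Some(value),
            Read::Closed => None,
        }
    }
}

fn main() {
    // 512-byte payloads, matching the "medium" benchmark size.
    let slots = vec![Some([7u8; 512]), Some([9u8; 512])];
    let mut rx = Rx { block: Block { slots }, index: 0 };

    while let Some(message) = rx.recv() {
        println!("received {} bytes, first byte {}", message.len(), message[0]);
    }
}
```

With both inner functions marked #[inline], the optimizer is allowed to see the whole chain at the recv() call site. Whether it actually elides the intermediate Option values depends on the compiler version and optimization level.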


Numbers

Runs on oci-saulire (ARM64, 24 GB), rustc 1.95.0, criterion --sample-size 50 --measurement-time 3, n=3 runs with a 15 s cooldown between them. We report the minimum of the per-run medians.

| benchmark | size | baseline | patched | Δ |
|---|---|---|---|---|
| recv_only/small_1000 | 8 B | 37.8 µs | 37.4 µs | −1.1% |
| recv_only/medium_1000 | 512 B | 100.9 µs | 89.8 µs | −11.0% |
| recv_only/large_1000 | 32 KB | 4.80 ms | 4.09 ms | −14.7% |

Per-message:

| size | baseline | patched |
|---|---|---|
| 8 B | ~37.8 ns | ~37.4 ns |
| 512 B | ~100.9 ns | ~89.8 ns |
| 32 KB | ~4.80 µs | ~4.09 µs |
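The Δ and per-message columns follow from simple arithmetic on the batch timings. The sketch below recomputes them, assuming each recv_only benchmark iteration receives 1000 messages (as the _1000 suffix suggests). The helper names are illustrative and not part of the benchmark suite. Because the inputs are the rounded table values, the large case comes out near −14.8% rather than the reported −14.7%, which presumably was computed from unrounded measurements.

```rust
/// Relative change from `baseline` to `patched`, in percent.
/// Negative means the patched build is faster.
fn relative_change_pct(baseline: f64, patched: f64) -> f64 {
    (patched - baseline) / baseline * 100.0
}

/// Time per message, in the same unit as `batch_time`.
fn per_message(batch_time: f64, messages: u32) -> f64 {
    batch_time / f64::from(messages)
}

fn main() {
    // (benchmark, baseline in µs, patched in µs), taken from the table above.
    // The large case is converted from ms to µs.
    let rows = [
        ("recv_only/small_1000", 37.8, 37.4),
        ("recv_only/medium_1000", 100.9, 89.8),
        ("recv_only/large_1000", 4800.0, 4090.0),
    ];

    for (name, baseline_us, patched_us) in rows {
        let delta = relative_change_pct(baseline_us, patched_us);
        // Convert µs per message to ns per message for readability.
        let baseline_ns = per_message(baseline_us, 1000) * 1000.0;
        let patched_ns = per_message(patched_us, 1000) * 1000.0;
        println!("{name}: {delta:+.1}% ({baseline_ns:.1} ns -> {patched_ns:.1} ns per message)");
    }
}
```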

Full-suite cross-check (no regressions >5%):

| benchmark | baseline | patched | Δ |
|---|---|---|---|
| contention/bounded | 1.14 ms | 1.17 ms | +2.6% |
| contention/bounded_recv_many | 974 µs | 932 µs | −4.3% |
| uncontented/bounded | 485 µs | 487 µs | +0.4% |
| send/medium_1000 | 706 ns | 701 ns | −0.7% |
| send/large_1000 | 12.4 µs | 12.0 µs | −3.2% |

The contention benchmarks moved between −4.3% and +2.6%, which is normal for multi-threaded runs on a shared host. We do not count that as a result.


Why the scaling

For 8 B objects, queue bookkeeping (atomic loads, cursor math, branch mispredicts) dominates. Eliminating two call frames saves ~0.4 ns per message, which is barely measurable.

For 32 KB objects, the dominant cost is memory traffic. Before the patch, ptr::read(slot) copies 32 KB into a temporary Option<Read<T>> in read(), then into another Option in pop(), then finally into the user’s variable. After inlining, the compiler can route the read directly to the final destination, saving one or two 32 KB copies per message. Across 1000 messages that’s roughly 32–64 MB less memory traffic per benchmark iteration, which is what the benchmark measures.
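The traffic estimate can be checked with a few lines of arithmetic. This sketch assumes the 32 KB payload is 32 × 1024 bytes and that each iteration handles 1000 messages. The constant and function names are illustrative. It also prints the size of an Option wrapping a payload-sized array, to show that each intermediate Option layer is at least as large as the message itself.

```rust
use std::mem::size_of;

/// Payload size, assuming "32 KB" means 32 * 1024 bytes.
const MESSAGE_BYTES: u64 = 32 * 1024;
/// Messages per benchmark iteration, assumed from the `_1000` suffix.
const MESSAGES: u64 = 1000;

/// Bytes of copy traffic avoided per iteration if `copies_saved`
/// intermediate copies are elided for every message.
fn saved_bytes(copies_saved: u64) -> u64 {
    copies_saved * MESSAGE_BYTES * MESSAGES
}

fn main() {
    for copies_saved in 1..=2 {
        let bytes = saved_bytes(copies_saved);
        println!(
            "{copies_saved} copy/copies elided: {bytes} bytes (~{:.1} MB)",
            bytes as f64 / 1e6
        );
    }

    // Each Option layer must hold the full payload, so moving it moves
    // at least MESSAGE_BYTES.
    println!(
        "size_of::<Option<[u8; 32 * 1024]>>() = {} bytes",
        size_of::<Option<[u8; 32 * 1024]>>()
    );
}
```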


The fix

Two lines:

--- a/tokio/src/sync/mpsc/block.rs
+++ b/tokio/src/sync/mpsc/block.rs
@@ -149,6 +149,7 @@ impl<T> Block<T> {
     /// To maintain safety, the caller must ensure:
     ///
     /// * No concurrent access to the slot.
+    #[inline]
     pub(crate) unsafe fn read(&self, slot_index: usize) -> Option<Read<T>> {
         let offset = offset(slot_index);

--- a/tokio/src/sync/mpsc/list.rs
+++ b/tokio/src/sync/mpsc/list.rs
@@ -317,6 +317,7 @@ impl<T> Rx<T> {
     }
 
     /// Pops the next value off the queue.
+    #[inline]
     pub(crate) fn pop(&mut self, tx: &Tx<T>) -> Option<block::Read<T>> {
         // Advance `head`, if needed
         if !self.try_advancing_head() {

No new public API. No behavior change. No MSRV bump.


Reproduce

cargo bench -p benches --bench sync_mpsc -- recv_only

Patch ready to apply against tokio:master (c6d58ce7): tokio-mpsc-recv-inline.patch


References