Two #[inline] hints on the mpsc receive path cut receive time by 14.7% for large messages (32 KB) and 11% for medium ones (512 B). Small messages (8 B) barely move. Nothing in the full suite regresses by more than 5%.

Patch: tokio-mpsc-recv-inline.patch


How a single .await turns into three calls

Every time you write rx.recv().await, what actually runs? Three different functions spread across three files.

chan::Rx::recv() wraps the async machinery and returns a Future. Behind that sits list::Rx::pop(), which advances the cursor and reclaims empty blocks. Then Block::read() pulls the value out with ptr::read(slot). The data takes a detour through Option wrappers at each layer.

flowchart LR
    A["chan::Rx::recv()<br/>async poll wrapper"] --> B["list::Rx::pop()<br/>advance cursor, reclaim blocks"]
    B --> C["Block::read()<br/>ptr::read(slot)"]
    C --> D["Option wrappers<br/>→ caller"]

Why three layers? Tokio got there by separating concerns. The channel surface lives in chan.rs. The linked list of blocks lives in list.rs. The memory block lives in block.rs. Every function involved is at least pub(crate), so each module can call into the others and the compiler could, in principle, fuse them. The question is whether it will.
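
The layering can be sketched in miniature. The types below (`Block`, `ListRx`, `ChanRx`, `try_recv`) are hypothetical stand-ins, not Tokio's real definitions, and the sketch stays in safe Rust instead of using ptr::read. What it shows is the shape of the path: each layer returns its own Option, and the value passes through all of them on the way to the caller.

```rust
/// Hypothetical stand-in for the block layer (block.rs in the post).
pub struct Block<T> {
    slots: Vec<Option<T>>,
}

impl<T> Block<T> {
    /// Layer 3: move the value out of one slot, leaving None behind.
    /// The real code uses ptr::read on a raw slot; this sketch stays safe.
    pub fn read(&mut self, slot_index: usize) -> Option<T> {
        self.slots.get_mut(slot_index)?.take()
    }
}

/// Hypothetical stand-in for the list layer (list.rs in the post).
pub struct ListRx<T> {
    block: Block<T>,
    index: usize,
}

impl<T> ListRx<T> {
    /// Layer 2: read at the cursor and advance it only on success.
    pub fn pop(&mut self) -> Option<T> {
        let value = self.block.read(self.index)?;
        self.index += 1;
        Some(value)
    }
}

/// Hypothetical stand-in for the channel surface (chan.rs in the post).
pub struct ChanRx<T> {
    list: ListRx<T>,
}

impl<T> ChanRx<T> {
    /// Build a receiver whose single block is pre-filled with `items`.
    pub fn new(items: Vec<T>) -> Self {
        let slots = items.into_iter().map(Some).collect();
        ChanRx {
            list: ListRx {
                block: Block { slots },
                index: 0,
            },
        }
    }

    /// Layer 1: the synchronous core of what an async recv() would poll.
    pub fn try_recv(&mut self) -> Option<T> {
        self.list.pop()
    }
}
```

Calling `try_recv()` on `ChanRx::new(vec![10, 20, 30])` yields the three values in order and then None. If none of the layers is inlined, each hop is a place where a large T may be copied.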


Why the compiler left the seams visible

Rust splits each crate into codegen units (CGUs). A release build defaults to 16. The compiler optimizes each unit in parallel, then links the results back together. While optimizing one CGU, the compiler often cannot see the body of a function that lives in another, unless that function is marked for inlining.

Without #[inline], the call between pop() (in CGU 3, say) and read() (in CGU 7) is a real CALL and RET:

pop() [CGU 3]
  CALL read() [CGU 7]
    ...
  RET

That means read() returns its Option into a stack slot in pop(). Then pop() wraps its own Option around that result. Then the async wrapper in recv() awaits it. The optimizer cannot see through the CGU wall, so it cannot eliminate the copies.

#[inline] does one simple thing. It keeps the function’s IR available across CGU boundaries. The compiler is free to fold the body into its caller, kill the intermediate stack slot, and route ptr::read straight to the final destination.
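
To show where the attribute goes, here is a minimal sketch with hypothetical names (`Payload`, `read_slot`, `read_slot_outlined`); it is not Tokio's code. The two functions are identical apart from the attribute, so they always return the same value. Only the generated machine code can differ, and even with the hint the optimizer still decides whether to inline.

```rust
use std::ptr;

/// Hypothetical payload, kept smaller than the post's 32 KB to limit stack use.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Payload(pub [u8; 4096]);

/// With the hint: the body stays available to callers in other codegen units,
/// so the optimizer may fold it into the caller and skip the temporary.
///
/// # Safety
/// `slot` must be valid for reads, properly aligned, and initialized.
#[inline]
pub unsafe fn read_slot(slot: *const Payload) -> Option<Payload> {
    Some(unsafe { ptr::read(slot) })
}

/// Forced out of line, for comparison only: always a real call, and the
/// Option<Payload> comes back through a return slot in the caller.
///
/// # Safety
/// Same requirements as `read_slot`.
#[inline(never)]
pub unsafe fn read_slot_outlined(slot: *const Payload) -> Option<Payload> {
    Some(unsafe { ptr::read(slot) })
}
```

Comparing the assembly of two callers, one per variant, is a reasonable way to check whether the temporary copy actually disappears on a given compiler version.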


How the patch collapses the stack

We add #[inline] to two functions:

  • Block::read() in block.rs
  • list::Rx::pop() in list.rs

After the hint, the optimizer sees this:

recv() poll closure
  [inlined pop()]
    [inlined read()]
      ptr::read(slot) → direct into destination

Both Option wrappers can be optimized away. The ptr::read no longer writes to a temporary; it writes straight to where the user needs the value. For a 32 KB struct, that skips one or two full copies per message.


The numbers

I ran these on oci-saulire (ARM64, 24 GB) with rustc 1.95.0. Criterion settings: --sample-size 50 --measurement-time 3, three runs, 15 s cooldown between each. I report the minimum of medians.
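
For readers who want to apply the same "minimum of medians" rule to their own runs, a minimal sketch follows. The function names are hypothetical and the post does not publish per-run samples, so any input values are the reader's own.

```rust
/// Median of one run's samples; None for an empty run.
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        // Even count: average the two middle samples.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// Smallest per-run median; None if there are no runs or any run is empty.
pub fn min_of_medians(runs: &[Vec<f64>]) -> Option<f64> {
    let mut best: Option<f64> = None;
    for run in runs {
        let m = median(run)?;
        best = Some(match best {
            Some(b) if b <= m => b,
            _ => m,
        });
    }
    best
}
```

Taking the minimum across runs favors the least disturbed run, which suits a shared host where noise mostly adds time.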

| benchmark | size | baseline | patched | Δ |
|---|---|---|---|---|
| recv_only/small_1000 | 8 B | 37.8 µs | 37.4 µs | −1.1% |
| recv_only/medium_1000 | 512 B | 100.9 µs | 89.8 µs | −11.0% |
| recv_only/large_1000 | 32 KB | 4.80 ms | 4.09 ms | −14.7% |

Per-message cost:

| size | baseline | patched |
|---|---|---|
| 8 B | ~37.8 ns | ~37.4 ns |
| 512 B | ~100.9 ns | ~89.8 ns |
| 32 KB | ~4.80 µs | ~4.09 µs |
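
The Δ and per-message figures follow from two small formulas, sketched here with hypothetical helper names. The count of 1000 messages per iteration is an assumption inferred from the `_1000` suffix in the benchmark names. Recomputing from the rounded table values gives about −14.8% for the large case, so the published −14.7% was presumably derived from unrounded timings.

```rust
/// Relative change from baseline to patched, in percent (negative = faster).
pub fn delta_percent(baseline: f64, patched: f64) -> f64 {
    (patched - baseline) / baseline * 100.0
}

/// Per-message cost in nanoseconds, given a per-iteration total in microseconds.
/// `messages` is assumed to be 1000 for the recv_only benchmarks.
pub fn per_message_ns(total_us: f64, messages: u32) -> f64 {
    total_us * 1_000.0 / f64::from(messages)
}
```

For example, `delta_percent(100.9, 89.8)` is roughly −11.0, and `per_message_ns(100.9, 1000)` is roughly 100.9.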

The full suite shows nothing alarming:

| benchmark | baseline | patched | Δ |
|---|---|---|---|
| contention/bounded | 1.14 ms | 1.17 ms | +2.6% |
| contention/bounded_recv_many | 974 µs | 932 µs | −4.3% |
| uncontented/bounded | 485 µs | 487 µs | +0.4% |
| send/medium_1000 | 706 ns | 701 ns | −0.7% |
| send/large_1000 | 12.4 µs | 12.0 µs | −3.2% |

Contention numbers moved between −4.3% and +2.6%. That is normal on a shared host, and I do not treat it as a real signal.


Why bigger payloads win more

For 8 B objects, the cost is almost all queue bookkeeping. Atomic loads, cursor math, and branch mispredicts dominate the trace. Removing two call frames saves maybe a fraction of a nanosecond. That is lost in the noise.

For 32 KB objects, memory traffic dominates. Before the patch, read() copies 32 KB into a temporary Option<Read<T>>. Then pop() copies that into another Option. Then the caller finally gets it. After inlining, the compiler routes ptr::read directly to the user’s variable. One or two extra 32 KB copies disappear. Over 1000 messages that is 32–64 MB less memory traffic. The benchmark captures exactly that.
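
The traffic estimate is simple multiplication. A minimal sketch with hypothetical helper names, assuming 32 KB means 32 × 1024 bytes and 1000 messages per iteration:

```rust
/// Bytes of copy traffic avoided when `copies_removed` full copies of a
/// `payload_bytes`-sized value disappear for each of `messages` messages.
pub fn saved_copy_bytes(payload_bytes: u64, messages: u64, copies_removed: u64) -> u64 {
    payload_bytes * messages * copies_removed
}

/// Convert bytes to decimal megabytes (1 MB = 1,000,000 bytes).
pub fn bytes_to_mb(bytes: u64) -> f64 {
    bytes as f64 / 1_000_000.0
}
```

With one copy removed, `saved_copy_bytes(32 * 1024, 1000, 1)` is 32,768,000 bytes, about 32.8 MB; with two it is about 65.5 MB. That matches the rough 32–64 MB range quoted above.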


The fix

Two lines. No new API. No behavior change. No MSRV bump.

--- a/tokio/src/sync/mpsc/block.rs
+++ b/tokio/src/sync/mpsc/block.rs
@@ -149,6 +149,7 @@ impl<T> Block<T> {
     /// To maintain safety, the caller must ensure:
     ///
     /// * No concurrent access to the slot.
+    #[inline]
     pub(crate) unsafe fn read(&self, slot_index: usize) -> Option<Read<T>> {
         let offset = offset(slot_index);

--- a/tokio/src/sync/mpsc/list.rs
+++ b/tokio/src/sync/mpsc/list.rs
@@ -317,6 +317,7 @@ impl<T> Rx<T> {
     }
 
     /// Pops the next value off the queue.
+    #[inline]
     pub(crate) fn pop(&mut self, tx: &Tx<T>) -> Option<block::Read<T>> {
         // Advance `head`, if needed
         if !self.try_advancing_head() {

Reproduce

cargo bench -p benches --bench sync_mpsc -- recv_only

Patch ready to apply against tokio:master (c6d58ce7): tokio-mpsc-recv-inline.patch


Terminology

| Term | Meaning |
|---|---|
| CGU | Codegen unit: a parallel compilation chunk inside one crate |
| #[inline] | A hint that keeps a function’s IR available for cross-unit inlining |
| ptr::read | Copies a value out of a raw pointer without running Drop on the source |
| pub(crate) | Visible to every module inside the same crate, but not outside it |
| Option | Rust’s optional-value enum (Some or None); wrapping a large value in it can cost an extra copy when the optimizer cannot see through the call |
| recv().await | The async call that waits for the next message on a channel |
| MSRV | Minimum Supported Rust Version |

References