Inlining Tokio MPSC recv: removing the async tax

Two #[inline] hints on the mpsc receive path shave 14.7% off large messages (32 KB) and 11% off medium (512 B). Small objects (8 B) barely move. No regressions >5% across the full suite.

Patch: tokio-mpsc-recv-inline.patch

How a single `.await` turns into three calls

Every time you write rx.recv().await, what actually runs? Three different functions spread across three files.

chan::Rx::recv() wraps the async machinery; it is the poll function that the Future behind .await drives. Behind that sits list::Rx::pop(), which advances the cursor and reclaims empty blocks. Then Block::read() pulls the value out with ptr::read(slot). The data takes a detour through Option wrappers at each layer.

flowchart LR
    A["chan::Rx::recv()
async poll wrapper"] --> B["list::Rx::pop()
advance cursor, reclaim blocks"]
    B --> C["Block::read()
ptr::read(slot)"]
    C --> D["Option wrappers unwrapped
→ caller"]

Why three layers? Tokio got there by separating concerns. The channel surface lives in chan.rs. The linked list of blocks lives in list.rs. The memory block lives in block.rs. Every function on this path is at least pub(crate), so each module can call into the others and the compiler could, in principle, fuse them. The question is whether it will.
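To make the layering concrete, here is a minimal std-only sketch of the same shape. It is a toy model, not Tokio's code: the types `Read`, `Block`, `ListRx`, and `ChanRx`, the `from_values` constructor, and the `Vec<Option<T>>` storage are all assumptions made for illustration. What it does show is how one value passes through an `Option<Read<T>>` at two layers before the outer layer turns it into a `Poll`.

```rust
use std::task::Poll;

/// Toy stand-in for the value the block layer hands back (assumed shape).
#[derive(Debug, PartialEq)]
pub enum Read<T> {
    Value(T),
    Closed,
}

/// Layer 3: owns the slots. `None` in a slot means "nothing to read here".
pub struct Block<T> {
    slots: Vec<Option<T>>,
    closed: bool,
}

impl<T> Block<T> {
    pub fn read(&mut self, slot_index: usize) -> Option<Read<T>> {
        match self.slots.get_mut(slot_index).and_then(|slot| slot.take()) {
            Some(value) => Some(Read::Value(value)),
            None if self.closed => Some(Read::Closed),
            None => None,
        }
    }
}

/// Layer 2: tracks the cursor and forwards to the block.
pub struct ListRx<T> {
    block: Block<T>,
    index: usize,
}

impl<T> ListRx<T> {
    pub fn pop(&mut self) -> Option<Read<T>> {
        let ret = self.block.read(self.index);
        if let Some(Read::Value(_)) = ret {
            self.index += 1; // advance only after a successful read
        }
        ret
    }
}

/// Layer 1: translates the result into a `Poll` for the async caller.
pub struct ChanRx<T> {
    list: ListRx<T>,
}

impl<T> ChanRx<T> {
    /// Hypothetical constructor for the demo: pre-fills the block.
    pub fn from_values(values: Vec<T>, closed: bool) -> Self {
        let slots = values.into_iter().map(Some).collect();
        ChanRx { list: ListRx { block: Block { slots, closed }, index: 0 } }
    }

    pub fn poll_recv(&mut self) -> Poll<Option<T>> {
        match self.list.pop() {
            Some(Read::Value(value)) => Poll::Ready(Some(value)),
            Some(Read::Closed) => Poll::Ready(None),
            None => Poll::Pending,
        }
    }
}
```

Each layer returns the payload by value inside its wrapper, so whether the value is moved once or several times depends entirely on how much of this chain the optimizer can see at once.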

Why the compiler left the seams visible

Rust splits each crate into codegen units (CGUs). Cargo’s release profile defaults to 16 per crate. The compiler compiles each unit in parallel, then links them back together. A function in one CGU cannot see the body of a function in another unless the body is explicitly marked for inlining.

Without #[inline], the call between pop() (in CGU 3, say) and read() (in CGU 7) is a real CALL and RET:

pop() [CGU 3]
  CALL read() [CGU 7]
    ...
  RET

That means read() returns its Option into a stack slot in pop(). Then pop() wraps its own Option around that result. Then the poll wrapper in recv() unpacks it and hands the value to the caller. The optimizer cannot see through the CGU wall, so it cannot eliminate the copies.

#[inline] does one simple thing. It keeps the function’s IR available across CGU boundaries. The compiler is free to fold the body into its caller, kill the intermediate stack slot, and route ptr::read straight to the final destination.
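As a rough illustration of where the attribute goes, the sketch below puts `#[inline]` on a leaf function and its caller in two separate modules, next to `#[inline(never)]` twins for comparison. The module names mirror the post, but the function bodies, the `u64` slots, and the `*_outlined` variants are assumptions invented for the demo. Both attributes are hints about code generation only; the results are identical either way.

```rust
mod block {
    /// Leaf function. `#[inline]` asks rustc to make the body available to
    /// callers in other codegen units; it is a hint, not a guarantee.
    #[inline]
    pub(crate) fn read(slots: &[Option<u64>], index: usize) -> Option<u64> {
        slots.get(index).copied().flatten()
    }

    /// Same body with inlining forbidden, handy when comparing disassembly.
    #[inline(never)]
    pub(crate) fn read_outlined(slots: &[Option<u64>], index: usize) -> Option<u64> {
        slots.get(index).copied().flatten()
    }
}

mod list {
    use super::block;

    /// Caller that the optimizer may fuse with `block::read`.
    #[inline]
    pub(crate) fn pop(slots: &[Option<u64>], cursor: &mut usize) -> Option<u64> {
        let value = block::read(slots, *cursor)?;
        *cursor += 1;
        Some(value)
    }

    /// Caller that always pays for a real call into `block::read_outlined`.
    #[inline(never)]
    pub(crate) fn pop_outlined(slots: &[Option<u64>], cursor: &mut usize) -> Option<u64> {
        let value = block::read_outlined(slots, *cursor)?;
        *cursor += 1;
        Some(value)
    }
}
```

Building this with optimizations and comparing the generated assembly for `pop` and `pop_outlined` is one way to check whether the hint was honored on a given toolchain.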

How the patch collapses the stack

We add #[inline] to two functions:

Block::read() in block.rs
list::Rx::pop() in list.rs

After the hint, the optimizer sees this:

recv() poll closure
  [inlined pop()]
    [inlined read()]
      ptr::read(slot) → direct into destination

Both Option wrappers are optimized away. The ptr::read no longer writes to a temporary; it writes straight to where the user needs the value. For a 32 KB struct, that skips one or two full copies per message.
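For readers who have not used `ptr::read`, here is a small sketch of the operation at the bottom of that stack. `Slot<T>` is a hypothetical single-slot container, not Tokio's `Block`; the `full` flag is an assumption that stands in for the real ready bits. The point is only that `ptr::read` moves the bytes out and leaves the source untouched, so the code has to track that the slot is now logically empty.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Hypothetical single-slot container for illustration.
pub struct Slot<T> {
    value: MaybeUninit<T>,
    full: bool,
}

impl<T> Slot<T> {
    pub fn new() -> Self {
        Slot { value: MaybeUninit::uninit(), full: false }
    }

    /// Stores a value. Panics if the slot is already occupied.
    pub fn write(&mut self, value: T) {
        assert!(!self.full, "slot already holds a value");
        self.value.write(value);
        self.full = true;
    }

    /// Moves the value out, or returns `None` if the slot is empty.
    pub fn read(&mut self) -> Option<T> {
        if !self.full {
            return None;
        }
        // Clear the flag first so the value can never be read twice.
        self.full = false;
        // SAFETY: `full` was true, so the slot holds an initialized `T`.
        Some(unsafe { ptr::read(self.value.as_ptr()) })
    }
}

impl<T> Drop for Slot<T> {
    fn drop(&mut self) {
        if self.full {
            // SAFETY: `full` guarantees the value is initialized and unread.
            unsafe { self.value.assume_init_drop() };
        }
    }
}
```

Because `read` returns `Option<T>` by value, a caller in another function receives the whole payload inside that wrapper unless the optimizer can inline the call and write to the final destination directly.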

The numbers

I ran these on oci-saulire (ARM64, 24 GB) with rustc 1.95.0. Criterion settings: --sample-size 50 --measurement-time 3, three runs, 15 s cooldown between each. I report the minimum of medians.
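The "minimum of medians" reduction can be sketched as follows. This is an illustration of the arithmetic only, assuming each run is available as a list of per-sample timings; the function names are made up here and Criterion is not involved.

```rust
/// Median of one run's samples. Returns `None` for an empty run.
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even count: average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Smallest median across all runs. Empty runs are skipped.
pub fn min_of_medians(runs: &[Vec<f64>]) -> Option<f64> {
    runs.iter()
        .filter_map(|run| median(run))
        .min_by(|a, b| a.total_cmp(b))
}
```

Taking the median inside a run discards outliers, and taking the minimum across runs favors the run least disturbed by other tenants on the host.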

benchmark	size	baseline	patched	Δ
`recv_only/small_1000`	8 B	37.8 µs	37.4 µs	−1.1%
`recv_only/medium_1000`	512 B	100.9 µs	89.8 µs	−11.0%
`recv_only/large_1000`	32 KB	4.80 ms	4.09 ms	−14.7%

Per-message cost:

size	baseline	patched
8 B	~37.8 ns	~37.4 ns
512 B	~100.9 ns	~89.8 ns
32 KB	~4.80 µs	~4.09 µs
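The two tables are related by simple arithmetic, sketched below. The helper names are hypothetical; the only assumption taken from the benchmark names is that each `_1000` case receives 1000 messages per iteration.

```rust
/// Relative change from baseline to patched, in percent (negative = faster).
pub fn percent_change(baseline: f64, patched: f64) -> f64 {
    (patched - baseline) / baseline * 100.0
}

/// Average cost of one message, given the time for a whole batch.
pub fn per_message_ns(batch_ns: f64, messages: u32) -> f64 {
    batch_ns / f64::from(messages)
}
```

For the medium case, `percent_change(100.9, 89.8)` comes out at about −11.0, and `per_message_ns(100_900.0, 1000)` gives about 100.9 ns, matching the rows above.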

The full suite shows nothing alarming:

benchmark	baseline	patched	Δ
`contention/bounded`	1.14 ms	1.17 ms	+2.6%
`contention/bounded_recv_many`	974 µs	932 µs	−4.3%
`uncontented/bounded`	485 µs	487 µs	+0.4%
`send/medium_1000`	706 ns	701 ns	−0.7%
`send/large_1000`	12.4 µs	12.0 µs	−3.2%

Contention numbers moved between −4.3% and +2.6%. Swings of that size are normal on a shared host, so I do not treat them as a real signal.
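The "no regressions over 5%" check is easy to automate. The sketch below hard-codes the rows from the table above; the `Row` type and both function names are hypothetical. Each row keeps the unit it has in the table, which is fine because only the ratio within a row matters.

```rust
/// One benchmark row: both timings must share the same unit.
pub struct Row {
    pub name: &'static str,
    pub baseline: f64,
    pub patched: f64,
}

/// Rows copied from the full-suite table.
pub fn suite_rows() -> Vec<Row> {
    vec![
        Row { name: "contention/bounded", baseline: 1.14, patched: 1.17 },          // ms
        Row { name: "contention/bounded_recv_many", baseline: 974.0, patched: 932.0 }, // µs
        Row { name: "uncontented/bounded", baseline: 485.0, patched: 487.0 },       // µs
        Row { name: "send/medium_1000", baseline: 706.0, patched: 701.0 },          // ns
        Row { name: "send/large_1000", baseline: 12.4, patched: 12.0 },             // µs
    ]
}

/// Names of benchmarks that got slower by more than `threshold_pct` percent.
pub fn regressions(rows: &[Row], threshold_pct: f64) -> Vec<&'static str> {
    rows.iter()
        .filter(|row| (row.patched - row.baseline) / row.baseline * 100.0 > threshold_pct)
        .map(|row| row.name)
        .collect()
}
```

With a 5% threshold the list is empty; tightening it to 2% flags only `contention/bounded`.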

Why bigger payloads win more

For 8 B objects, the cost is almost all queue bookkeeping. Atomic loads, cursor math, and branch mispredicts dominate the trace. Removing two call frames saves maybe a fraction of a nanosecond. That is lost in the noise.

For 32 KB objects, memory traffic dominates. Before the patch, read() copies 32 KB into a temporary Option<Read<T>>. Then pop() copies that into another Option. Then the caller finally gets it. After inlining, the compiler routes ptr::read directly to the user’s variable. One or two extra 32 KB copies disappear. Over 1000 messages that is 32–64 MB less memory traffic. The benchmark captures exactly that.
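Two quick checks back up that reasoning. The sketch assumes a `Read<T>` enum shaped like the toy one used earlier in this post's explanation (a value or a closed marker) and a 32 KB byte array as the payload; neither is taken from Tokio's source. The first function shows that the wrapper stores the payload inline, so moving the wrapper moves at least 32 KB. The second reproduces the traffic estimate.

```rust
use std::mem::size_of;

/// Assumed shape of the wrapper returned by the lower layers.
#[allow(dead_code)]
pub enum Read<T> {
    Value(T),
    Closed,
}

/// 32 KB payload, matching the large benchmark case.
pub type Payload = [u8; 32 * 1024];

/// Size in bytes of the doubly wrapped payload.
pub fn wrapped_size() -> usize {
    size_of::<Option<Read<Payload>>>()
}

/// Bytes moved by `extra_copies` redundant copies of every message.
pub fn extra_copy_bytes(payload_bytes: u64, messages: u64, extra_copies: u64) -> u64 {
    payload_bytes * messages * extra_copies
}
```

`extra_copy_bytes(32 * 1024, 1000, 1)` is 32,768,000 bytes and two copies double that, which is where the 32–64 MB figure comes from.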

The fix

Two lines. No new API. No behavior change. No MSRV bump.

--- a/tokio/src/sync/mpsc/block.rs
+++ b/tokio/src/sync/mpsc/block.rs
@@ -149,6 +149,7 @@ impl<T> Block<T> {
     /// To maintain safety, the caller must ensure:
     ///
     /// * No concurrent access to the slot.
+    #[inline]
     pub(crate) unsafe fn read(&self, slot_index: usize) -> Option<Read<T>> {
         let offset = offset(slot_index);

--- a/tokio/src/sync/mpsc/list.rs
+++ b/tokio/src/sync/mpsc/list.rs
@@ -317,6 +317,7 @@ impl<T> Rx<T> {
     }
 
     /// Pops the next value off the queue.
+    #[inline]
     pub(crate) fn pop(&mut self, tx: &Tx<T>) -> Option<block::Read<T>> {
         // Advance `head`, if needed
         if !self.try_advancing_head() {

Reproduce

cargo bench -p benches --bench sync_mpsc -- recv_only

Patch ready to apply against tokio:master (c6d58ce7): tokio-mpsc-recv-inline.patch

Terminology

Term	Meaning
CGU	Codegen unit: a parallel compilation chunk inside one crate
`#[inline]`	A hint that keeps a function’s IR available for cross-unit inlining
`ptr::read`	Copies a value out of a raw pointer without running Drop on the source
`pub(crate)`	Visible to every module inside the same crate, but not outside it
`Option`	Rust’s enum for an optional value (`Some(T)` or `None`); it stores `T` inline, so moving an `Option` of a large `T` moves the whole payload
`recv().await`	The async call that waits for the next message on a channel
MSRV	Minimum Supported Rust Version

References

tokio/src/sync/mpsc/block.rs: Block::read()
tokio/src/sync/mpsc/list.rs: list::Rx::pop()
tokio/src/sync/mpsc/chan.rs: chan::Rx::recv() wrapper

The experiments in this post were run with AI assistance. I wrote the words and checked every number myself.
