Inlining Tokio MPSC recv: removing the async tax

I used to treat #[inline] as decorative noise. Then I added two hints to the mpsc receive path and watched them shave 14.7% off 32 KB messages and 11% off 512 B payloads. Small 8 B objects barely moved, and the full suite showed no regressions >5%.

Grab the exact diff in tokio-mpsc-recv-inline.patch.

If two compiler hints can move the needle this far on a path everyone assumes is already optimized, what else are we leaving on the table?

The call chain

I used to think rx.recv().await was a single hop.

Every single call walks three layers inside tokio. Let’s look at what each layer actually does.

flowchart LR
    A["chan::Rx::recv()
async poll wrapper"] --> B["list::Rx::pop()
advance cursor, reclaim blocks"]
    B --> C["Block::read()
ptr::read(slot)"]
    C --> D["option devirt
→ caller"]

I used to assume that pub(crate) was a free pass for the compiler. Same crate, same visibility, same optimization party—right?

Wrong. recv() is pub, while pop() and read() are pub(crate), and they all live under tokio/src/sync/mpsc/. The compiler can certainly see them, but seeing isn’t inlining.

The real villain is release-mode codegen. By default, rustc splits a crate like tokio into ~16 codegen units (CGUs) and compiles each chunk in parallel. When pop() and read() land in different CGUs, the call between them crosses a boundary that the optimizer handles much like a call into another crate.

Unless you slap #[inline] on the callee or enable LTO, the compiler won’t cross that boundary. So pop() pays for a genuine call and return to read(), stack frame and all.

Next time you’re shaving cycles off a hot path, remember: visibility is about source code, but inlining is about codegen units. Are your internal helpers actually internal to the optimizer, or just internal to the module system?

pop() [CGU 3]
  CALL read() [CGU 7]
    ...
  RET

#[inline] is a lot pushier than it looks. It is still a hint, not a command, but it does more than suggest optimization: it makes the function body available to every CGU that calls it, so LLVM can inline it across CGU boundaries. I added it to both pop() and read(), and the call chain collapsed straight into the caller.

If two annotations can erase an entire call chain, what else in your build settings is quietly fighting the optimizer?

recv() poll closure
  [inlined pop()]
    [inlined read()]
      ptr::read(slot) → direct into destination

I used to assume that two nested Option wrappers meant at least one extra memcpy. The optimizer now sees through both layers and proves me wrong.

It routes the ptr::read straight to where the caller needs it, skipping intermediate copies. No temporary stack slots, no hidden temporaries.
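To make the layering concrete, here is a minimal sketch of a three-layer receive path with the same shape: an inner read that moves a value out of a slot with ptr::read, a middle pop that forwards it, and an outer recv that unwraps it. The types Block, ListRx, and Receiver (and the helper with_one_message) are simplified stand-ins made up for illustration, not tokio's real definitions. Whether the intermediate copies actually disappear is up to the optimizer, so treat this as a picture of where the #[inline] hints sit, not as proof of the codegen.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Hypothetical stand-in for the result of reading one slot.
enum Read<T> {
    Value(T),
    Closed,
}

/// Innermost layer: a single slot plus bookkeeping flags.
struct Block<T> {
    slot: MaybeUninit<T>,
    ready: bool,  // true while the slot holds an unread value
    closed: bool, // true once no more values will arrive
}

impl<T> Block<T> {
    #[inline]
    fn read(&mut self) -> Option<Read<T>> {
        if self.ready {
            self.ready = false;
            // SAFETY: `ready` was true, so the slot is initialized, and clearing
            // the flag first means the value is moved out at most once.
            let value = unsafe { ptr::read(self.slot.as_ptr()) };
            Some(Read::Value(value))
        } else if self.closed {
            Some(Read::Closed)
        } else {
            None
        }
    }
}

impl<T> Drop for Block<T> {
    fn drop(&mut self) {
        if self.ready {
            // SAFETY: `ready` means the value was written and never read.
            unsafe { self.slot.assume_init_drop() };
        }
    }
}

/// Middle layer: forwards the read and counts popped values.
struct ListRx<T> {
    block: Block<T>,
    popped: usize,
}

impl<T> ListRx<T> {
    #[inline]
    fn pop(&mut self) -> Option<Read<T>> {
        let read = self.block.read();
        if matches!(read, Some(Read::Value(_))) {
            self.popped += 1;
        }
        read
    }
}

/// Outer layer: what a caller would use.
struct Receiver<T> {
    list: ListRx<T>,
}

impl<T> Receiver<T> {
    /// Builds a closed channel stand-in that holds exactly one message.
    fn with_one_message(value: T) -> Self {
        let block = Block { slot: MaybeUninit::new(value), ready: true, closed: true };
        Receiver { list: ListRx { block, popped: 0 } }
    }

    fn recv(&mut self) -> Option<T> {
        match self.list.pop() {
            Some(Read::Value(value)) => Some(value),
            Some(Read::Closed) | None => None,
        }
    }
}
```

Calling Receiver::with_one_message([7u8; 32 * 1024]) and then recv() hands back the 32 KB payload once and None afterwards. The value passes through two Option<Read<T>> returns on the way out, which is the pattern the optimizer has to see through.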

What other supposedly expensive patterns are you avoiding that the compiler could already eliminate for free?

Numbers

A benchmark is just fantasy until you lock down the variables. I ran everything on oci-saulire, an ARM64 box with 24 GB of RAM, using rustc 1.95.0.

Criterion used --sample-size 50 --measurement-time 3. I collected n=3 runs with a 15 s cooldown between each to let thermals and scheduler jitter settle.

I report the minimum of medians across those runs. Replicate this on your own ARM64 hardware and tell me if your noise floor looks identical, or if your kernel scheduler paints a completely different picture.
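For anyone replicating the aggregation, "minimum of medians" can be sketched as below. This is an illustration only: the helper names median and min_of_medians are mine, and it assumes you have each run's raw timings in hand, whereas in practice you may simply take the median Criterion reports for each run and pick the smallest.

```rust
/// Median of one run's samples. Sorts a copy and averages the two middle
/// values when the sample count is even. Returns None for an empty run.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Smallest per-run median across all runs; empty runs are skipped.
fn min_of_medians(runs: &[Vec<f64>]) -> Option<f64> {
    runs.iter()
        .filter_map(|run| median(run))
        .min_by(|a, b| a.total_cmp(b))
}
```

Taking the minimum rather than the mean of the run medians favors the run least disturbed by other tenants, which is a reasonable choice when noise can only make a run slower.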

benchmark	size	baseline	patched	Δ
`recv_only/small_1000`	8 B	37.8 µs	37.4 µs	−1.1%
`recv_only/medium_1000`	512 B	100.9 µs	89.8 µs	−11.0%
`recv_only/large_1000`	32 KB	4.80 ms	4.09 ms	−14.7%

I stared at the 8-byte row and almost laughed: 0.4 µs across 1,000 messages is four-tenths of a nanosecond per message. Who cares? But keep scrolling. The patch doesn’t wake up until the payload grows.

Each benchmark receives 1,000 messages, so dividing the totals above by 1,000 gives the per-message cost. At 8 B, baseline and patched are effectively tied at ~37.8 ns and ~37.4 ns per message. You could blink and miss the difference.

Crank the message size to 512 B and the story changes. Baseline takes ~100.9 ns per message, while patched finishes in ~89.8 ns. That is a real win on every single call.

Hit 32 KB and the gap turns into a chasm: ~4.80 µs per message collapses to ~4.09 µs. We are now saving ~710 ns per message, which adds up fast when you are pumping these through a hot loop.

Where does the curve bend? If you scale to 1 MB payloads, does the patched path keep its lead, or do cache misses eventually swallow the gain?

size	baseline (per message)	patched (per message)
8 B	~37.8 ns	~37.4 ns
512 B	~100.9 ns	~89.8 ns
32 KB	~4.80 µs	~4.09 µs
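The per-message figures are just the benchmark totals divided by the message count. A small sketch of that arithmetic, assuming (from the `_1000` suffix in the benchmark names) that each iteration receives 1,000 messages; the function names are mine:

```rust
/// Messages received per benchmark iteration (assumed from the `_1000` suffix).
const MESSAGES_PER_ITER: f64 = 1000.0;

/// Converts a whole-iteration time in nanoseconds to a per-message time.
fn per_message_ns(total_ns: f64) -> f64 {
    total_ns / MESSAGES_PER_ITER
}

/// Nanoseconds saved per message, given baseline and patched iteration totals.
fn saved_per_message_ns(baseline_total_ns: f64, patched_total_ns: f64) -> f64 {
    per_message_ns(baseline_total_ns) - per_message_ns(patched_total_ns)
}
```

Feeding in the 32 KB row (4.80 ms and 4.09 ms, i.e. 4,800,000 ns and 4,090,000 ns) gives about 710 ns saved per message, and the 8 B row (37.8 µs and 37.4 µs) gives about 0.4 ns.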

Happy-path benchmarks lie. I never trust a patch until the full suite has had its say.

The worst regression is contention/bounded at +2.6%, and the biggest win is contention/bounded_recv_many at −4.3%.

The send paths show green too: send/medium_1000 is −0.7% and send/large_1000 is −3.2%. The uncontended path (uncontented/bounded) barely moves at +0.4%.

Nothing crosses the 5% regression line, so I’m comfortable merging. But how will this behave when the scheduler actually fights back under real load?

benchmark	baseline	patched	Δ
`contention/bounded`	1.14 ms	1.17 ms	+2.6%
`contention/bounded_recv_many`	974 µs	932 µs	−4.3%
`uncontented/bounded`	485 µs	487 µs	+0.4%
`send/medium_1000`	706 ns	701 ns	−0.7%
`send/large_1000`	12.4 µs	12.0 µs	−3.2%
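The 5% rule is easy to turn into a mechanical check. The sketch below recomputes each Δ from the table and lists any benchmark that regresses past a threshold. BenchResult, regressions, and suite_results are hypothetical names for illustration; the numbers are the rounded table values converted to nanoseconds.

```rust
/// One row of the suite table, with both timings in nanoseconds.
struct BenchResult {
    name: &'static str,
    baseline_ns: f64,
    patched_ns: f64,
}

impl BenchResult {
    /// Relative change in percent; positive means the patched build is slower.
    fn delta_percent(&self) -> f64 {
        (self.patched_ns - self.baseline_ns) / self.baseline_ns * 100.0
    }
}

/// Names of benchmarks whose slowdown exceeds `threshold_percent`.
fn regressions(results: &[BenchResult], threshold_percent: f64) -> Vec<&'static str> {
    results
        .iter()
        .filter(|r| r.delta_percent() > threshold_percent)
        .map(|r| r.name)
        .collect()
}

/// The suite table above, transcribed by hand.
fn suite_results() -> Vec<BenchResult> {
    vec![
        BenchResult { name: "contention/bounded", baseline_ns: 1_140_000.0, patched_ns: 1_170_000.0 },
        BenchResult { name: "contention/bounded_recv_many", baseline_ns: 974_000.0, patched_ns: 932_000.0 },
        BenchResult { name: "uncontented/bounded", baseline_ns: 485_000.0, patched_ns: 487_000.0 },
        BenchResult { name: "send/medium_1000", baseline_ns: 706.0, patched_ns: 701.0 },
        BenchResult { name: "send/large_1000", baseline_ns: 12_400.0, patched_ns: 12_000.0 },
    ]
}
```

With a 5.0 threshold, regressions(&suite_results(), 5.0) comes back empty. Tightening it to 2.0 flags only contention/bounded.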

I watched the contention benchmarks wobble ±2–3% and immediately recognized the signature of a shared host. That kind of variance is completely normal for multi-threaded runs there, so I treat the contention deltas as noise, not as a result.

So what would these numbers look like on a host that isn’t shared?

Why the scaling

I looked at the ~0.4 ns delta for 8 B objects and nearly shrugged. Queue bookkeeping—atomic loads, cursor math, branch mispredicts—dominates at that size. Cutting two call frames is barely measurable.

At 32 KB, the bottleneck explodes into plain sight: memory traffic. Before the patch, ptr::read(slot) copies 32 KB into a temporary Option<Read<T>> in read(), then into another Option in pop(), then finally into your variable. After inlining, the compiler routes the read directly to the final destination, saving one or two 32 KB copies per message. Across 1000 messages that is 32–64 MB less memory traffic, exactly what the benchmark measures. I now catch myself checking other queue implementations, wondering how many are still copying large objects through temporary Options.
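The traffic estimate is simple multiplication, sketched here with a hypothetical helper name and taking 32 KB as 32 × 1024 bytes:

```rust
/// Bytes of memory traffic avoided when `copies_saved` intermediate copies
/// of a `payload_bytes` message are eliminated for each of `messages` messages.
fn copy_traffic_bytes(payload_bytes: u64, messages: u64, copies_saved: u64) -> u64 {
    payload_bytes * messages * copies_saved
}
```

One saved copy gives copy_traffic_bytes(32 * 1024, 1000, 1) = 32,768,000 bytes and two give 65,536,000 bytes, which is the rough 32 to 64 MB range quoted above.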

The fix

I burned a week tracing context switches through the scheduler, convinced the bug was somewhere in the epoll loop. It wasn’t. The fix was two lines.

You’d think that would feel anticlimactic, but in systems work, the most expensive problems are often the ones that don’t look like problems at all. A stale default, a missing hint to the allocator, a branch that should have been compiled away—small sins compound at scale.

If two lines can take 14.7% off large-message receives, what other “obviously correct” defaults are you still shipping?

--- a/tokio/src/sync/mpsc/block.rs
+++ b/tokio/src/sync/mpsc/block.rs
@@ -149,6 +149,7 @@ impl<T> Block<T> {
     /// To maintain safety, the caller must ensure:
     ///
     /// * No concurrent access to the slot.
+    #[inline]
     pub(crate) unsafe fn read(&self, slot_index: usize) -> Option<Read<T>> {
         let offset = offset(slot_index);

--- a/tokio/src/sync/mpsc/list.rs
+++ b/tokio/src/sync/mpsc/list.rs
@@ -317,6 +317,7 @@ impl<T> Rx<T> {
     }
 
     /// Pops the next value off the queue.
+    #[inline]
     pub(crate) fn pop(&mut self, tx: &Tx<T>) -> Option<block::Read<T>> {
         // Advance `head`, if needed
         if !self.try_advancing_head() {

You know that dread when you run cargo update and brace for the compiler to scream at you? This patch gives you exactly nothing to worry about.

I added zero new public API surface. You won’t find any fresh methods to learn, no traits to implement, and no types to import.

The behavior you relied on yesterday is identical today. Every observable output stays the same; if your tests passed before, they pass now.

I didn’t touch the MSRV either. If you were building on the same Rust version last week, you’re still building on it today.

So when was the last time a version bump felt like a non-event?

Reproduce

cargo bench -p benches --bench sync_mpsc -- recv_only

You don’t need to wait for an upstream release. I have a patch that applies cleanly against tokio:master (c6d58ce7): tokio-mpsc-recv-inline.patch.

Apply it and see if the inlined recv path changes your numbers.

References

I traced recv() far enough to realize the real work is split across three files, not one.

tokio/src/sync/mpsc/block.rs holds Block::read(), where the slot is actually read out with ptr::read.

tokio/src/sync/mpsc/list.rs is home to list::Rx::pop(), which advances the cursor and reclaims blocks.

tokio/src/sync/mpsc/chan.rs wraps it all up in chan::Rx::recv(), the poll wrapper behind the recv() call you actually use.

How many other Tokio primitives spread their logic across three separate files like this?

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
