Two #[inline] hints on the mpsc receive path shave 14.7% off large
messages (32 KB) and 11% off medium (512 B). Small objects (8 B) barely
move. No regressions >5% across the full suite.
Patch: tokio-mpsc-recv-inline.patch
How a single .await turns into three calls
Every time you write rx.recv().await, what actually runs? Three different functions spread across three files.
chan::Rx::recv() wraps the async machinery and returns a Future. Behind that sits list::Rx::pop(), which advances the cursor and reclaims empty blocks. Then Block::read() pulls the value out with ptr::read(slot). The data takes a detour through Option wrappers at each layer.
flowchart LR
A["chan::Rx::recv()
async poll wrapper"] --> B["list::Rx::pop()
advance cursor, reclaim blocks"]
B --> C["Block::read()
ptr::read(slot)"]
C --> D["option devirt
→ caller"]
Why three layers? Tokio got there by separating concerns. The channel surface lives in chan.rs. The block list lives in list.rs. The memory block lives in block.rs. Each module is visible to the others because every function is at least pub(crate), so the compiler could fuse them. The question is whether it will.
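To make the layering concrete, here is a minimal, synchronous toy model of the three layers. It is a sketch and not Tokio's code: the types Block, ListRx, and ChanRx, the VecDeque storage, and the non-async recv() are all assumptions made for illustration. The real path is async and lock-free.

```rust
use std::collections::VecDeque;

/// Stand-in for block.rs: a chunk of message slots (hypothetical type).
pub struct Block<T> {
    slots: VecDeque<T>,
}

impl<T> Block<T> {
    pub fn new(values: Vec<T>) -> Self {
        Block { slots: values.into() }
    }

    /// Layer 3: pull the next value out of the block, if any.
    pub fn read(&mut self) -> Option<T> {
        self.slots.pop_front()
    }
}

/// Stand-in for list.rs: a queue of blocks with the cursor at the front.
pub struct ListRx<T> {
    blocks: VecDeque<Block<T>>,
}

impl<T> ListRx<T> {
    /// Layer 2: read from the head block, reclaiming it once it is empty.
    pub fn pop(&mut self) -> Option<T> {
        loop {
            let head = self.blocks.front_mut()?;
            if let Some(value) = head.read() {
                return Some(value);
            }
            // The head block is exhausted: drop it and advance the cursor.
            self.blocks.pop_front();
        }
    }
}

/// Stand-in for chan.rs: the surface the user calls.
pub struct ChanRx<T> {
    list: ListRx<T>,
}

impl<T> ChanRx<T> {
    pub fn from_blocks(blocks: Vec<Vec<T>>) -> Self {
        let blocks: VecDeque<Block<T>> = blocks.into_iter().map(Block::new).collect();
        ChanRx { list: ListRx { blocks } }
    }

    /// Layer 1: forwards to pop(), which forwards to read().
    pub fn recv(&mut self) -> Option<T> {
        self.list.pop()
    }
}
```

Each layer hands an Option to the one above it, which is the detour the post describes. Whether those hand-offs become real copies depends on what the optimizer can see.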
Why the compiler left the seams visible
Rust splits each crate into codegen units (CGUs). A release build defaults to 16. The compiler compiles the units in parallel, then links them back together. A function in one CGU cannot see the body of a function in another unless the body is explicitly marked for inlining.
Without #[inline], the call between pop() (in CGU 3, say) and read() (in CGU 7) is a real CALL and RET.
That means read() returns its Option into a stack slot in pop(). Then pop() wraps its own Option around that result. Then the async wrapper in recv() awaits it. The optimizer cannot see through the CGU wall, so it cannot eliminate the copies.
#[inline] does one simple thing. It keeps the function’s IR available across CGU boundaries. The compiler is free to fold the body into its caller, kill the intermediate stack slot, and route ptr::read straight to the final destination.
How the patch collapses the stack
We add #[inline] to two functions:
- Block::read() in block.rs
- list::Rx::pop() in list.rs
After the hint, the optimizer can see through both calls. The two intermediate Option wrappers can be optimized away. The ptr::read no longer writes to a temporary; it writes straight to where the user needs the value. For a 32 KB struct, that skips one or two full copies per message.
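The shape of the change can be sketched on a toy model. This is not the patch itself: the single-slot Block, the ListRx wrapper, the Payload type, and every signature here are assumptions for illustration. Only the placement of #[inline] on read() and pop() mirrors what the post describes.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Hypothetical 32 KB payload, matching the size of the "large" benchmark.
pub struct Payload(pub [u8; 32 * 1024]);

/// Toy single-slot block. A value that is never read is leaked, not dropped,
/// which is acceptable for a sketch.
pub struct Block<T> {
    slot: MaybeUninit<T>,
    full: bool,
}

impl<T> Block<T> {
    pub fn new(value: T) -> Self {
        Block { slot: MaybeUninit::new(value), full: true }
    }

    /// The hint lets callers in other codegen units see this body.
    #[inline]
    pub fn read(&mut self) -> Option<T> {
        if !self.full {
            return None;
        }
        self.full = false;
        // SAFETY: `full` was true, so the slot is initialized, and the flag
        // is now cleared, so the value is never read out twice.
        Some(unsafe { ptr::read(self.slot.as_ptr()) })
    }
}

/// Toy list layer that forwards to the block.
pub struct ListRx<T> {
    head: Block<T>,
}

impl<T> ListRx<T> {
    pub fn new(value: T) -> Self {
        ListRx { head: Block::new(value) }
    }

    /// Second hinted function: lets read() fold through into the caller.
    #[inline]
    pub fn pop(&mut self) -> Option<T> {
        self.head.read()
    }
}
```

The attribute is only a hint. Whether the copies actually disappear depends on the compiler version and build flags, so the generated assembly or a benchmark is the real check.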
The numbers
I ran these on oci-saulire (ARM64, 24 GB) with rustc 1.95.0. Criterion settings: --sample-size 50 --measurement-time 3, three runs, 15 s cooldown between each. I report the minimum of medians.
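For readers unfamiliar with the "minimum of medians" aggregation, here is a small sketch. It assumes each run is available as a list of sample times; the function names are hypothetical and this is not how Criterion itself reports results.

```rust
/// Median of one run's samples; None for an empty run.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even count: average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Take the median of every run, then keep the smallest one.
fn min_of_medians(runs: &[Vec<f64>]) -> Option<f64> {
    runs.iter()
        .filter_map(|run| median(run))
        .min_by(|a, b| a.total_cmp(b))
}
```

The median damps outliers inside a run, and taking the minimum across runs favors the run least disturbed by other tenants on the host.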
| benchmark | size | baseline | patched | Δ |
|---|---|---|---|---|
| recv_only/small_1000 | 8 B | 37.8 µs | 37.4 µs | −1.1% |
| recv_only/medium_1000 | 512 B | 100.9 µs | 89.8 µs | −11.0% |
| recv_only/large_1000 | 32 KB | 4.80 ms | 4.09 ms | −14.7% |
Per-message cost:
| size | baseline | patched |
|---|---|---|
| 8 B | ~37.8 ns | ~37.4 ns |
| 512 B | ~100.9 ns | ~89.8 ns |
| 32 KB | ~4.80 µs | ~4.09 µs |
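The per-message figures are the benchmark totals divided by the message count. A minimal sketch of that arithmetic, assuming 1000 messages per iteration as the benchmark names suggest; the helper names are hypothetical:

```rust
/// Convert a total in microseconds into nanoseconds per message.
fn per_message_ns(total_us: f64, messages: u32) -> f64 {
    total_us * 1000.0 / f64::from(messages)
}

/// Relative change from baseline to patched, in percent (negative is faster).
fn delta_percent(baseline: f64, patched: f64) -> f64 {
    (patched - baseline) / baseline * 100.0
}
```

For the medium case, per_message_ns(100.9, 1000) gives about 100.9 ns and delta_percent(100.9, 89.8) gives about −11.0%. Because the table values are rounded, recomputed deltas can differ from the published ones in the last digit.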
The full suite shows nothing alarming:
| benchmark | baseline | patched | Δ |
|---|---|---|---|
| contention/bounded | 1.14 ms | 1.17 ms | +2.6% |
| contention/bounded_recv_many | 974 µs | 932 µs | −4.3% |
| uncontented/bounded | 485 µs | 487 µs | +0.4% |
| send/medium_1000 | 706 ns | 701 ns | −0.7% |
| send/large_1000 | 12.4 µs | 12.0 µs | −3.2% |
Contention numbers wobbled between −4.3% and +2.6%. That is normal on a shared host, so I do not treat it as a real signal.
Why bigger payloads win more
For 8 B objects, the cost is almost all queue bookkeeping. Atomic loads, cursor math, and branch mispredicts dominate the trace. Removing two call frames saves maybe a fraction of a nanosecond. That is lost in the noise.
For 32 KB objects, memory traffic dominates. Before the patch, read() copies 32 KB into a temporary Option<Read<T>>. Then pop() copies that into another Option. Then the caller finally gets it. After inlining, the compiler routes ptr::read directly to the user’s variable. One or two extra 32 KB copies disappear. Over 1000 messages that is 32–64 MB less memory traffic. The benchmark captures exactly that.
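The traffic estimate is simple multiplication. A sketch, assuming 32 KB means 32 × 1024 bytes and counting only the bytes written by the extra copies; the constant and function names are hypothetical:

```rust
/// Payload size of the "large" benchmark, in bytes.
const PAYLOAD_BYTES: u64 = 32 * 1024;
/// Messages received per benchmark iteration.
const MESSAGES: u64 = 1000;

/// Bytes written by `extra_copies` redundant copies of every message.
fn extra_copy_bytes(extra_copies: u64) -> u64 {
    PAYLOAD_BYTES * MESSAGES * extra_copies
}
```

One avoided copy is 32,768,000 bytes and two are 65,536,000 bytes, which is the 32–64 MB range quoted above.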
The fix
Two lines. No new API. No behavior change. No MSRV bump.
Reproduce
Patch ready to apply against tokio:master (c6d58ce7): tokio-mpsc-recv-inline.patch
Terminology
| Term | Meaning |
|---|---|
| CGU | Codegen unit: a parallel compilation chunk inside one crate |
| #[inline] | A hint that keeps a function’s IR available for cross-unit inlining |
| ptr::read | Copies a value out of a raw pointer without running Drop on the source |
| pub(crate) | Visible to every module inside the same crate, but not outside it |
| Option | Rust’s optional-value enum (Some or None); nesting it around a large value can cost extra copies when calls are not inlined |
| recv().await | The async call that waits for the next message on a channel |
| MSRV | Minimum Supported Rust Version |
References
- tokio/src/sync/mpsc/block.rs: Block::read()
- tokio/src/sync/mpsc/list.rs: list::Rx::pop()
- tokio/src/sync/mpsc/chan.rs: Receiver::recv() wrapper