Tokio MPSC Sweep: message size vs latency

I wanted to know how much tokio’s async runtime slows down message passing. I ran a head-to-head benchmark against crossbeam on two different machines. Here is where the async tax bites hardest, and where it disappears.

How I benchmarked

I used Criterion.rs¹ to measure latency. Each iteration sends 1000 messages through a bounded channel with capacity 1000, then receives all 1000 back. I varied the payload across eight sizes: 8 B, 64 B, 256 B, 512 B, 1 KB, 4 KB, 16 KB, and 32 KB.

Criterion.rs collected 50 samples per test. Each sample ran for 3 seconds after a 1-second warmup. I ran the whole suite on two machines: an ARM64 OCI Ampere A1 with 24 GB RAM, and an x86-64 AMD Ryzen 9 9900X with 64 GB RAM.
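The iteration shape described above can be sketched with the standard library alone. This is a rough illustration, not the benchmark code: `std::sync::mpsc::sync_channel` stands in for the tokio and crossbeam channels, the `[u8; N]` payload type is an assumption, and `fill_then_drain` is a hypothetical helper name.

```rust
use std::sync::mpsc::sync_channel;
use std::time::{Duration, Instant};

/// One iteration: send `messages` payloads of N bytes into a bounded channel,
/// then receive them all. Returns the bytes received and the elapsed time.
fn fill_then_drain<const N: usize>(messages: usize) -> (usize, Duration) {
    // Stand-in channel; the post measures tokio and crossbeam, not std.
    let (tx, rx) = sync_channel::<[u8; N]>(messages);
    let start = Instant::now();
    for _ in 0..messages {
        // Capacity equals the message count, so this send never blocks.
        tx.send([0u8; N]).expect("receiver is alive");
    }
    let mut received = 0;
    for _ in 0..messages {
        received += rx.recv().expect("sender is alive").len();
    }
    (received, start.elapsed())
}
```

For example, `fill_then_drain::<4096>(1000)` moves 4,096,000 bytes through the channel. A single timed call like this is far noisier than a sampled Criterion.rs measurement, so it only shows the shape of the work.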

Results

Data table

Size	Tokio ARM64 (µs)	Crossbeam ARM64 (µs)	ARM64 ratio	Tokio x86-64 (µs)	Crossbeam x86-64 (µs)	x86-64 ratio
8 B	101.7	73.6	1.38×	56.7	29.5	1.92×
64 B	138.2	135.9	1.02×	71.5	49.0	1.46×
256 B	204.0	190.2	1.07×	93.9	58.3	1.61×
512 B	301.7	255.6	1.18×	133.6	81.7	1.64×
1 KB	484.3	412.6	1.17×	190.1	192.9	0.98×
4 KB	1,629.5	1,279.3	1.27×	604.6	609.6	0.99×
16 KB	7,917.7	5,763.3	1.37×	3,456.5	2,652.3	1.30×
32 KB	17,548.0	12,519.0	1.40×	7,579.6	5,822.5	1.30×
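Each ratio is the tokio time divided by the crossbeam time. The sketch below recomputes it from the x86-64 columns and lists the sizes where the two crates are within 2% of each other. The helper names and the 2% tolerance are my own choices for illustration.

```rust
/// Tokio-to-crossbeam ratio, rounded to two decimals like the table.
fn ratio(tokio_us: f64, crossbeam_us: f64) -> f64 {
    (tokio_us / crossbeam_us * 100.0).round() / 100.0
}

/// Sizes where the unrounded ratio is within `tolerance` of 1.0.
fn ties<'a>(rows: &[(&'a str, f64, f64)], tolerance: f64) -> Vec<&'a str> {
    rows.iter()
        .filter(|(_, tokio, crossbeam)| (tokio / crossbeam - 1.0).abs() <= tolerance)
        .map(|(size, _, _)| *size)
        .collect()
}

/// (size, tokio µs, crossbeam µs) copied from the x86-64 columns.
const X86_64: [(&str, f64, f64); 8] = [
    ("8 B", 56.7, 29.5),
    ("64 B", 71.5, 49.0),
    ("256 B", 93.9, 58.3),
    ("512 B", 133.6, 81.7),
    ("1 KB", 190.1, 192.9),
    ("4 KB", 604.6, 609.6),
    ("16 KB", 3456.5, 2652.3),
    ("32 KB", 7579.6, 5822.5),
];
```

`ties(&X86_64, 0.02)` returns only the 1 KB and 4 KB rows. Because the inputs here are the rounded table values, a recomputed ratio can differ from the published one in the last digit.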

What I noticed

On x86-64, tokio and crossbeam are effectively tied at 1 KB and 4 KB, with ratios of 0.98× and 0.99×. At those sizes the cost of copying the payload appears to swallow the async overhead.

On ARM64, tokio is always slower. The gap ranges from 1.02× at 64 B up to 1.40× at 32 KB. The async tax never fully disappears on this chip.

Small messages hurt most on x86-64. At 8 B, tokio is 1.92× slower. Crossbeam’s tight loop benefits from good branch prediction and low-latency atomics there, while tokio still pays runtime coordination costs even when the message is tiny.

At 32 KB the ratios land at 1.30× on x86-64 and 1.40× on ARM64. The gap stays in a similar range on both architectures at large sizes, which suggests the bottleneck shifts from scheduling logic to plain memcpy bandwidth.

How the overhead works

The gap is not about throughput. Both crates move the same bytes. The difference is per-message coordination overhead.
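To put a number on the per-message gap, the table values can be divided by the message count. This assumes each table entry is the time for one iteration of 1000 messages, as described in the benchmark setup. The helper name is hypothetical.

```rust
/// Messages sent per benchmark iteration, per the setup described in the post.
const MESSAGES_PER_ITERATION: f64 = 1000.0;

/// Extra time tokio spends per message, in nanoseconds,
/// given per-iteration times in microseconds.
fn extra_ns_per_message(tokio_us: f64, crossbeam_us: f64) -> f64 {
    // 1 µs = 1000 ns, then split the difference across the messages.
    (tokio_us - crossbeam_us) * 1000.0 / MESSAGES_PER_ITERATION
}
```

With the 8 B rows, this gives about 27.2 ns per message on x86-64 (56.7 µs vs 29.5 µs) and about 28.1 ns on ARM64 (101.7 µs vs 73.6 µs).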

Tokio’s mpsc does more work on every send and receive. It registers wakers², advances the future’s state machine, and relies on the runtime to poll the task. On x86-64, out-of-order execution and large caches can overlap that bookkeeping with the memory copy, so the overhead hides at medium sizes.
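The kind of bookkeeping described above can be seen in a toy receive future driven by hand. This is a minimal sketch using only the standard library. It is not tokio’s implementation, and `Shared`, `Recv`, `send`, `CountingWaker`, and `recv_once_by_hand` are all hypothetical names.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Toy queue shared by sender and receiver (not tokio's layout).
#[derive(Default)]
struct Shared {
    queue: VecDeque<u64>,
    recv_waker: Option<Waker>, // the parked receiver, if any
}

/// Future for a toy `recv`: resolves once a message is queued.
struct Recv(Arc<Mutex<Shared>>);

impl Future for Recv {
    type Output = u64;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        let mut shared = self.0.lock().unwrap();
        match shared.queue.pop_front() {
            Some(msg) => Poll::Ready(msg),
            None => {
                // Bookkeeping: store a waker so the sender can signal us.
                shared.recv_waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

/// Toy `send`: enqueue, then wake the receiver if one is parked.
fn send(shared: &Arc<Mutex<Shared>>, msg: u64) {
    let waker = {
        let mut s = shared.lock().unwrap();
        s.queue.push_back(msg);
        s.recv_waker.take()
    };
    if let Some(w) = waker {
        w.wake();
    }
}

/// Waker that only counts wake-ups; a real runtime would reschedule the task.
#[derive(Default)]
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Drive one receive by hand. Returns (message, polls, wake-ups).
fn recv_once_by_hand() -> (u64, usize, usize) {
    let shared = Arc::new(Mutex::new(Shared::default()));
    let counter = Arc::new(CountingWaker::default());
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(Recv(shared.clone()));

    let mut polls = 1;
    assert!(fut.as_mut().poll(&mut cx).is_pending()); // empty queue: waker stored
    send(&shared, 42); // enqueue, then wake the stored waker
    polls += 1;
    let msg = match fut.as_mut().poll(&mut cx) {
        Poll::Ready(m) => m,
        Poll::Pending => unreachable!("a message was queued before this poll"),
    };
    (msg, polls, counter.0.load(Ordering::SeqCst))
}
```

In this toy, one message that arrives after the receiver parked costs two polls, one waker clone, and one wake-up. A blocking channel has its own parking cost in the same situation, so this shows the shape of the async bookkeeping rather than its measured price.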

On ARM64, smaller caches and simpler cores expose every extra instruction. The bookkeeping is visible in the latency.

This makes ARM64 a great reality check for MPSC optimizations. An improvement that shows up there is probably real, whereas on x86-64 it could be masked by the fast memory subsystem.

Glossary

Term	What it means
`tokio::sync::mpsc`	Tokio’s async multi-producer, single-consumer channel
`crossbeam::channel`	Crossbeam’s lock-free synchronous channel
Criterion.rs	Statistics-driven benchmark harness for Rust
bounded	Fixed capacity; senders wait when full (tokio suspends the task, crossbeam blocks the thread)
waker	Object that notifies the runtime a task may resume
OoO	Out-of-order execution; CPU reorders instructions

¹ A statistics-driven benchmark harness that fits a linear regression to estimate time per iteration and flags outlier samples.
² An async primitive that tells the scheduler a suspended task should be polled again.

The experiments in this post were run with AI assistance. I wrote the words and checked every number myself.
