Tokio MPSC Sweep: message size vs latency

Benchmarking tokio::sync::mpsc against crossbeam::channel across two architectures to see where the async tax bites hardest — and where it disappears.

Method

Messages: 1000 bounded sends + 1000 recvs per iteration (capacity = 1000)
Sizes: 8 B, 64 B, 256 B, 512 B, 1 KB, 4 KB, 16 KB, 32 KB
Tool: Criterion.rs, 50 samples, 3 s measurement (warmup 1 s)
Machines:
- ARM64: OCI Ampere A1, 24 GB RAM
- x86-64: AMD Ryzen 9 9900X, 64 GB RAM
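
The per-iteration workload described above can be sketched with the standard library alone. This is a minimal sketch and not the benchmark that produced the numbers: it swaps in `std::sync::mpsc::sync_channel` as a stand-in for both tokio and crossbeam, and the fixed-size array payload, the fill-then-drain order, and the `fill_then_drain` helper are assumptions for illustration. The real measurements were taken with Criterion.rs.

```rust
use std::sync::mpsc::sync_channel;
use std::time::Instant;

/// Fill a bounded channel with `messages` payloads of `N` bytes each,
/// then drain it. Returns the number of payload bytes received.
/// Hypothetical helper: the payload type and send/recv order are assumptions.
fn fill_then_drain<const N: usize>(messages: usize) -> usize {
    // Capacity equals the message count, so no send ever blocks.
    let (tx, rx) = sync_channel::<[u8; N]>(messages);

    for i in 0..messages {
        tx.send([i as u8; N]).expect("receiver is alive");
    }

    let mut bytes = 0;
    for _ in 0..messages {
        let msg = rx.recv().expect("sender is alive");
        bytes += msg.len();
    }
    bytes
}

fn main() {
    // One "iteration" in the sense of the Method section: 1000 sends + 1000 recvs.
    let start = Instant::now();
    let bytes = fill_then_drain::<64>(1000);
    println!("moved {bytes} bytes in {:?}", start.elapsed());
}
```

A single timed run like this is far noisier than a Criterion measurement; it only shows the shape of the work inside one iteration.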

Results

Data table

Size	Tokio ARM64 (µs)	Crossbeam ARM64 (µs)	ARM64 ratio (Tokio/Crossbeam)	Tokio x86-64 (µs)	Crossbeam x86-64 (µs)	x86-64 ratio (Tokio/Crossbeam)
8 B	101.7	73.6	1.38×	56.7	29.5	1.92×
64 B	138.2	135.9	1.02×	71.5	49.0	1.46×
256 B	204.0	190.2	1.07×	93.9	58.3	1.61×
512 B	301.7	255.6	1.18×	133.6	81.7	1.64×
1 KB	484.3	412.6	1.17×	190.1	192.9	0.98×
4 KB	1,629.5	1,279.3	1.27×	604.6	609.6	0.99×
16 KB	7,917.7	5,763.3	1.37×	3,456.5	2,652.3	1.30×
32 KB	17,548.0	12,519.0	1.40×	7,579.6	5,822.5	1.30×
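
The ratio columns are simply the tokio time divided by the crossbeam time for the same size and machine. A short sketch that recomputes them from the table values, assuming only the numbers printed above (the `ratio` helper and the constant names are illustrative):

```rust
/// Rows copied from the results table: (size, tokio µs, crossbeam µs).
const ARM64: [(&str, f64, f64); 8] = [
    ("8 B", 101.7, 73.6),
    ("64 B", 138.2, 135.9),
    ("256 B", 204.0, 190.2),
    ("512 B", 301.7, 255.6),
    ("1 KB", 484.3, 412.6),
    ("4 KB", 1629.5, 1279.3),
    ("16 KB", 7917.7, 5763.3),
    ("32 KB", 17548.0, 12519.0),
];

const X86_64: [(&str, f64, f64); 8] = [
    ("8 B", 56.7, 29.5),
    ("64 B", 71.5, 49.0),
    ("256 B", 93.9, 58.3),
    ("512 B", 133.6, 81.7),
    ("1 KB", 190.1, 192.9),
    ("4 KB", 604.6, 609.6),
    ("16 KB", 3456.5, 2652.3),
    ("32 KB", 7579.6, 5822.5),
];

/// Values above 1.0 mean tokio is slower than crossbeam.
fn ratio(tokio_us: f64, crossbeam_us: f64) -> f64 {
    tokio_us / crossbeam_us
}

fn main() {
    for (arm, x86) in ARM64.iter().zip(X86_64.iter()) {
        println!(
            "{:>6}  ARM64 {:.2}x  x86-64 {:.2}x",
            arm.0,
            ratio(arm.1, arm.2),
            ratio(x86.1, x86.2)
        );
    }
}
```

Because the table values are already rounded to 0.1 µs, a recomputed ratio can differ from the printed one in the last digit.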

What stands out

On x86-64, tokio and crossbeam are indistinguishable at 1 KB–4 KB (0.98–0.99×). At those sizes the async overhead appears to be hidden behind the cost of copying the payload. The parity does not last: the ratio is back at 1.30× by 16 KB.
On ARM64, tokio is slower at every size. The ratio ranges from 1.02× (64 B) to 1.40× (32 KB), so the async tax never fully amortises.
Small-message overhead is worst on x86-64 (1.92× at 8 B). Branch prediction and low-latency atomics make crossbeam’s tight loop exceptionally efficient there, while tokio still pays the runtime coordination cost.
At 16 KB and 32 KB the ratios level off at 1.30× on x86-64 and 1.37–1.40× on ARM64, suggesting the bottleneck shifts from scheduling logic toward memcpy bandwidth.

Interpretation

The gap is not throughput (both move the same bytes) but per-message coordination overhead. Tokio’s mpsc does more work per operation: waker bookkeeping, state-machine progress, runtime polling. On x86-64 with aggressive OoO execution and large caches, this overhead can overlap with the memory copy and vanish at medium sizes. On ARM64 (smaller caches, simpler cores), the same bookkeeping is more exposed.
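
To put a number on "per-message coordination overhead", the per-iteration gap can be divided by the message count. A minimal sketch, assuming the table values are per-iteration times covering the 1000 messages described in the Method section (the helper name is hypothetical):

```rust
/// Messages sent and received per benchmark iteration (from the Method section).
const MESSAGES_PER_ITERATION: f64 = 1000.0;

/// Extra time tokio spends per message relative to crossbeam, in nanoseconds.
/// Inputs are per-iteration times in microseconds.
fn gap_ns_per_message(tokio_us: f64, crossbeam_us: f64) -> f64 {
    let gap_ns = (tokio_us - crossbeam_us) * 1000.0; // µs -> ns
    gap_ns / MESSAGES_PER_ITERATION
}

fn main() {
    // (label, tokio µs, crossbeam µs), copied from the results table.
    let rows = [
        ("ARM64   8 B", 101.7, 73.6),
        ("x86-64  8 B", 56.7, 29.5),
        ("x86-64  1 KB", 190.1, 192.9),
        ("ARM64   32 KB", 17548.0, 12519.0),
        ("x86-64  32 KB", 7579.6, 5822.5),
    ];
    for (label, tokio_us, crossbeam_us) in rows {
        println!(
            "{label:<14} {:>8.1} ns/message",
            gap_ns_per_message(tokio_us, crossbeam_us)
        );
    }
}
```

For the 8 B rows this works out to roughly 28 ns per message on ARM64 and 27 ns on x86-64. At 32 KB the gap is roughly 5.0 µs per message on ARM64 and 1.8 µs on x86-64, so the absolute gap is far from constant across sizes, which is worth keeping in mind when reading the ratios.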

This makes ARM64 a better platform for validating MPSC optimisations: any win there is likely real, not just hidden by a fast memory subsystem.

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
