I wanted to know how much tokio’s async runtime slows down message passing. I ran a head-to-head benchmark against crossbeam on two different machines. Here is where the async tax bites hardest, and where it disappears.
How I benchmarked
I used Criterion.rs[1] to measure latency. Each iteration sends 1000 messages through a bounded channel with capacity 1000, then receives all 1000 back. I varied the payload across eight sizes: 8 B, 64 B, 256 B, 512 B, 1 KB, 4 KB, 16 KB, and 32 KB.
Criterion.rs collected 50 samples per test, with a 3-second measurement time after a 1-second warmup. I ran the whole suite on two machines: an ARM64 OCI Ampere A1 with 24 GB RAM, and an x86-64 AMD Ryzen 9 9900X with 64 GB RAM.
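To make the shape of one iteration concrete, here is a minimal sketch of the send-1000-then-receive-1000 pattern. It is not the benchmark code: it uses `std::sync::mpsc::sync_channel` as a stand-in for both tokio and crossbeam so it runs without extra crates, it times with `Instant` instead of Criterion.rs, and the `Vec<u8>` payload and the `run_iteration` name are assumptions of mine.

```rust
use std::sync::mpsc::sync_channel;
use std::time::{Duration, Instant};

const MESSAGES: usize = 1000; // messages per iteration, as in the benchmark
const CAPACITY: usize = 1000; // bounded channel capacity, as in the benchmark

/// Runs one iteration: send MESSAGES payloads, then receive them all.
/// Returns the elapsed time and the total number of payload bytes received.
/// Hypothetical helper; the payload type (Vec<u8>) is an assumption.
fn run_iteration(payload_size: usize) -> (Duration, usize) {
    // Stand-in channel: std's bounded channel, NOT tokio or crossbeam.
    let (tx, rx) = sync_channel::<Vec<u8>>(CAPACITY);
    let start = Instant::now();
    for _ in 0..MESSAGES {
        // Capacity equals the message count, so send never blocks here.
        tx.send(vec![0u8; payload_size]).expect("receiver is alive");
    }
    let mut bytes = 0;
    for _ in 0..MESSAGES {
        bytes += rx.recv().expect("sender is alive").len();
    }
    (start.elapsed(), bytes)
}
```

Calling `run_iteration(64)` returns the wall-clock time for one 64 B iteration plus the byte count. Note that this sketch also times the payload allocation, which a real harness may or may not include.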
Results
Data table
| Size | Tokio ARM64 (µs) | Crossbeam ARM64 (µs) | Ratio ARM64 (Tokio / Crossbeam) | Tokio x86-64 (µs) | Crossbeam x86-64 (µs) | Ratio x86-64 (Tokio / Crossbeam) |
|---|---|---|---|---|---|---|
| 8 B | 101.7 | 73.6 | 1.38× | 56.7 | 29.5 | 1.92× |
| 64 B | 138.2 | 135.9 | 1.02× | 71.5 | 49.0 | 1.46× |
| 256 B | 204.0 | 190.2 | 1.07× | 93.9 | 58.3 | 1.61× |
| 512 B | 301.7 | 255.6 | 1.18× | 133.6 | 81.7 | 1.64× |
| 1 KB | 484.3 | 412.6 | 1.17× | 190.1 | 192.9 | 0.98× |
| 4 KB | 1,629.5 | 1,279.3 | 1.27× | 604.6 | 609.6 | 0.99× |
| 16 KB | 7,917.7 | 5,763.3 | 1.37× | 3,456.5 | 2,652.3 | 1.30× |
| 32 KB | 17,548.0 | 12,519.0 | 1.40× | 7,579.6 | 5,822.5 | 1.30× |
What I noticed
On x86-64, tokio and crossbeam are effectively tied at 1 KB and 4 KB, with ratios of 0.98× and 0.99×. At those sizes the cost of copying the payload appears to swallow the async overhead, although the gap reopens to 1.30× at 16 KB and 32 KB.
On ARM64, tokio is always slower. The gap ranges from 1.02× at 64 B up to 1.40× at 32 KB. The async tax never fully disappears on this chip.
Small messages hurt most on x86-64. At 8 B, tokio is 1.92× slower. Crossbeam’s tight loop loves branch prediction and low-latency atomics there. Tokio still pays runtime coordination costs even when the message is tiny.
At 32 KB the ratios are 1.40× on ARM64 and 1.30× on x86-64, and both are nearly unchanged from 16 KB. That plateau tells me the bottleneck shifts from scheduling logic to plain memcpy bandwidth.
How the overhead works
The gap is not about throughput. Both crates move the same bytes. The difference is per-message coordination overhead.
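A quick way to put a number on that per-message cost is to divide the gap between the two crates by the message count. The sketch below assumes the table values are times for one full 1000-message iteration, which is my reading of the setup; the function name is hypothetical.

```rust
/// Messages per benchmark iteration, as stated in the method section.
const MESSAGES_PER_ITERATION: f64 = 1000.0;

/// Extra time per message (in nanoseconds) that tokio spends over crossbeam.
/// Assumes both inputs are per-iteration times in microseconds.
fn extra_ns_per_message(tokio_us: f64, crossbeam_us: f64) -> f64 {
    // 1 µs = 1000 ns, then spread the gap over every message in the iteration.
    (tokio_us - crossbeam_us) * 1000.0 / MESSAGES_PER_ITERATION
}
```

With the 8 B row, `extra_ns_per_message(56.7, 29.5)` comes out at roughly 27 ns on x86-64 and `extra_ns_per_message(101.7, 73.6)` at roughly 28 ns on ARM64, under that assumption.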
Tokio’s mpsc does more work on every send and receive. It registers wakers[2], advances state machines, and interacts with the runtime. On x86-64, out-of-order execution and large caches can overlap that bookkeeping with the memory copy. The overhead hides at medium sizes.
On ARM64, smaller caches and simpler cores expose every extra instruction. The bookkeeping is visible in the latency.
This makes ARM64 a great reality check for MPSC optimizations. A win there is probably real. It is not just buried under a fast x86-64 memory subsystem.
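To show what the waker bookkeeping mentioned above looks like in practice, here is a toy single-threaded sketch built only on the standard library. It is not tokio's implementation, and every name in it (`CountingWaker`, `Shared`, `Recv`, `send`, `demo`) is hypothetical. It only illustrates the two extra steps an async receive can involve: parking a waker when the queue is empty, and waking it on send.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Counts how often the "runtime" is told that a task may resume.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Toy queue shared by sender and receiver (illustration only).
#[derive(Default)]
struct Shared {
    queue: VecDeque<u64>,
    rx_waker: Option<Waker>, // parked by a receiver that found the queue empty
}

/// Future that resolves once a message is available.
struct Recv {
    shared: Rc<RefCell<Shared>>,
}

impl Future for Recv {
    type Output = u64;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        let mut shared = self.shared.borrow_mut();
        match shared.queue.pop_front() {
            Some(msg) => Poll::Ready(msg),
            None => {
                // Bookkeeping step 1: register the waker before going idle.
                shared.rx_waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

/// Sender side: enqueue, then wake a parked receiver if there is one.
fn send(shared: &Rc<RefCell<Shared>>, msg: u64) {
    let mut s = shared.borrow_mut();
    s.queue.push_back(msg);
    // Bookkeeping step 2: notify the task that registered interest.
    if let Some(waker) = s.rx_waker.take() {
        waker.wake();
    }
}

/// Drives one receive by hand and reports (message, polls, wakes).
fn demo() -> (u64, usize, usize) {
    let counter = Arc::new(CountingWaker { wakes: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);

    let shared = Rc::new(RefCell::new(Shared::default()));
    let mut recv = Recv { shared: shared.clone() };
    let mut polls = 0;

    // First poll: the queue is empty, so the future parks its waker.
    polls += 1;
    assert!(Pin::new(&mut recv).poll(&mut cx).is_pending());

    send(&shared, 42); // finds the parked waker and wakes it once

    // Second poll: the message is available now.
    polls += 1;
    let msg = match Pin::new(&mut recv).poll(&mut cx) {
        Poll::Ready(m) => m,
        Poll::Pending => unreachable!("message was just sent"),
    };
    (msg, polls, counter.wakes.load(Ordering::SeqCst))
}
```

In this toy, one message that arrives after the receiver has gone idle costs two polls, a waker clone, and a wake call. A synchronous channel has no equivalent of those steps on its fast path, which is the kind of extra work the ARM64 numbers seem to expose.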
Glossary
| Term | What it means |
|---|---|
| tokio::sync::mpsc | Tokio’s async multi-producer, single-consumer channel |
| crossbeam::channel | Crossbeam’s lock-free synchronous channel |
| Criterion.rs | Statistics-driven benchmark harness for Rust |
| bounded | Fixed capacity; when full, crossbeam senders block and tokio senders wait asynchronously |
| waker | Object that notifies the runtime a task may resume |
| OoO | Out-of-order execution; CPU reorders instructions |