This benchmark compares tokio::sync::mpsc against crossbeam::channel on two architectures to see where the async tax bites hardest and where it disappears.
Method
- Messages: 1000 bounded sends + 1000 recvs per iteration (capacity = 1000)
- Sizes: 8 B, 64 B, 256 B, 512 B, 1 KB, 4 KB, 16 KB, 32 KB
- Tool: Criterion.rs, 50 samples, 3 s measurement (warmup 1 s)
- Machines:
  - ARM64: OCI Ampere A1, 24 GB RAM
  - x86-64: AMD Ryzen 9 9900X, 64 GB RAM
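The workload in the list above can be sketched without the benchmarked crates. What follows is a minimal, std-only illustration of one iteration, assuming the 1000 sends and 1000 recvs run back to back on a single thread (a capacity of 1000 allows this without blocking). It uses `std::sync::mpsc::sync_channel` purely as a stand-in: the actual measurements were taken with tokio::sync::mpsc and crossbeam::channel under Criterion.rs, whose harness is not shown here. The names `run_iteration` and `mean_iteration_time` are hypothetical.

```rust
use std::hint::black_box;
use std::sync::mpsc::sync_channel;
use std::time::{Duration, Instant};

/// Messages per iteration and channel capacity, as listed in the method.
const MESSAGES: usize = 1000;

/// One iteration: 1000 bounded sends, then 1000 recvs, with N-byte payloads.
/// Assumption: fill-then-drain on one thread; std's channel is only a stand-in
/// for the tokio and crossbeam channels that were actually measured.
fn run_iteration<const N: usize>() -> usize {
    let (tx, rx) = sync_channel::<[u8; N]>(MESSAGES);
    for _ in 0..MESSAGES {
        // Never blocks: the channel can hold all MESSAGES items at once.
        tx.send(black_box([0u8; N])).expect("receiver dropped");
    }
    let mut bytes = 0;
    for _ in 0..MESSAGES {
        let msg = rx.recv().expect("sender dropped");
        // black_box keeps the compiler from optimising the payload away.
        bytes += black_box(msg).len();
    }
    bytes
}

/// Crude mean time per iteration. Criterion.rs adds warmup, sampling and
/// outlier analysis that this sketch does not attempt.
fn mean_iteration_time<const N: usize>(iterations: u32) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(run_iteration::<N>());
    }
    start.elapsed() / iterations
}
```

Calling `mean_iteration_time::<64>(100)` would time the 64 B case. The payload size is a const generic so that each size is a distinct array type moved by value through the channel, which is one plausible way to make the copy cost scale with the message size.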
Results
Data table
| Size | Tokio ARM64 (µs) | Crossbeam ARM64 (µs) | Ratio ARM64 (Tokio/Crossbeam) | Tokio x86-64 (µs) | Crossbeam x86-64 (µs) | Ratio x86-64 (Tokio/Crossbeam) |
|---|---|---|---|---|---|---|
| 8 B | 101.7 | 73.6 | 1.38× | 56.7 | 29.5 | 1.92× |
| 64 B | 138.2 | 135.9 | 1.02× | 71.5 | 49.0 | 1.46× |
| 256 B | 204.0 | 190.2 | 1.07× | 93.9 | 58.3 | 1.61× |
| 512 B | 301.7 | 255.6 | 1.18× | 133.6 | 81.7 | 1.64× |
| 1 KB | 484.3 | 412.6 | 1.17× | 190.1 | 192.9 | 0.98× |
| 4 KB | 1,629.5 | 1,279.3 | 1.27× | 604.6 | 609.6 | 0.99× |
| 16 KB | 7,917.7 | 5,763.3 | 1.37× | 3,456.5 | 2,652.3 | 1.30× |
| 32 KB | 17,548.0 | 12,519.0 | 1.40× | 7,579.6 | 5,822.5 | 1.30× |
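The ratio columns are consistent with Tokio time divided by Crossbeam time, so values above 1 mean Tokio is slower. As a small sketch, the x86-64 ratios can be recomputed from the timings in the table; the `Row` type and function names below are illustrative only.

```rust
/// One table row: payload label and mean times in microseconds.
struct Row {
    size: &'static str,
    tokio_us: f64,
    crossbeam_us: f64,
}

/// x86-64 timings copied from the table above.
const X86_64: [Row; 8] = [
    Row { size: "8 B", tokio_us: 56.7, crossbeam_us: 29.5 },
    Row { size: "64 B", tokio_us: 71.5, crossbeam_us: 49.0 },
    Row { size: "256 B", tokio_us: 93.9, crossbeam_us: 58.3 },
    Row { size: "512 B", tokio_us: 133.6, crossbeam_us: 81.7 },
    Row { size: "1 KB", tokio_us: 190.1, crossbeam_us: 192.9 },
    Row { size: "4 KB", tokio_us: 604.6, crossbeam_us: 609.6 },
    Row { size: "16 KB", tokio_us: 3456.5, crossbeam_us: 2652.3 },
    Row { size: "32 KB", tokio_us: 7579.6, crossbeam_us: 5822.5 },
];

/// Ratio as used in the table: Tokio time divided by Crossbeam time.
fn ratio(row: &Row) -> f64 {
    row.tokio_us / row.crossbeam_us
}

/// One formatted line per row, for comparison against the table.
fn ratio_report(rows: &[Row]) -> Vec<String> {
    rows.iter()
        .map(|r| format!("{}: {:.2}x", r.size, ratio(r)))
        .collect()
}
```

Ratios recomputed from the rounded timings can differ from the published column in the last digit (the 1 KB row comes out at roughly 0.985), presumably because the published ratios were derived from unrounded measurements.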
What stands out
- On x86-64, tokio and crossbeam are indistinguishable at 1 KB and 4 KB (0.98× and 0.99×), which suggests the async overhead is hidden behind the cost of copying the payload at those sizes.
- On ARM64, tokio is consistently slower — the gap ranges from 1.02× (64 B) to 1.40× (32 KB). The async tax never fully amortises.
- Small-message overhead is worst on x86-64 (1.92× at 8 B). A likely explanation is that branch prediction and low-latency atomics make crossbeam’s tight loop exceptionally efficient there, while tokio still pays the runtime coordination cost.
- At 16 KB and 32 KB the ratio levels off on both architectures, at about 1.30× on x86-64 and 1.37–1.40× on ARM64, suggesting the bottleneck shifts from scheduling logic to memcpy bandwidth.
Interpretation
The gap is not about throughput (both channels move the same bytes) but about per-message coordination overhead. Tokio’s mpsc does more work per operation: waker bookkeeping, state-machine progress, runtime polling. On x86-64, with aggressive out-of-order execution and large caches, this overhead can overlap with the memory copy and vanish at medium sizes. On the ARM64 machine tested here (smaller caches, simpler cores), the same bookkeeping is more exposed.
This makes ARM64 a better platform for validating MPSC optimisations: any win there is likely real, not just hidden by a fast memory subsystem.
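To put the per-message framing in concrete units, the difference between the two timings can be divided by the number of messages. The sketch below assumes the table values are mean times per iteration of 1000 messages; the function names are hypothetical.

```rust
/// Messages per iteration, from the method section.
const MESSAGES_PER_ITERATION: f64 = 1000.0;

/// Extra time Tokio spends per message relative to Crossbeam, in nanoseconds.
/// Inputs are per-iteration times in microseconds (assumed meaning of the
/// table values). A negative result means Tokio was faster.
fn extra_ns_per_message(tokio_us: f64, crossbeam_us: f64) -> f64 {
    let gap_ns = (tokio_us - crossbeam_us) * 1000.0; // µs -> ns
    gap_ns / MESSAGES_PER_ITERATION
}

/// Applies the calculation to a list of (tokio_us, crossbeam_us) pairs.
fn extra_ns_for_rows(rows: &[(f64, f64)]) -> Vec<f64> {
    rows.iter()
        .map(|&(tokio_us, crossbeam_us)| extra_ns_per_message(tokio_us, crossbeam_us))
        .collect()
}
```

Under that assumption, the 8 B row works out to roughly 28 ns of extra time per message on ARM64 (101.7 vs 73.6 µs) and roughly 27 ns on x86-64 (56.7 vs 29.5 µs). The absolute gap is similar on both machines at that size, while the ratio is larger on x86-64 because the crossbeam baseline is faster there.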