I wanted to know how much tokio’s async runtime slows down message passing. I ran a head-to-head benchmark against crossbeam on two different machines. Here is where the async tax bites hardest, and where it disappears.

How I benchmarked

I used Criterion.rs [1] to measure latency. Each iteration sends 1000 messages through a bounded channel with capacity 1000, then receives all 1000 back. I varied the payload across eight sizes: 8 B, 64 B, 256 B, 512 B, 1 KB, 4 KB, 16 KB, and 32 KB.

Criterion.rs collected 50 samples per test over a 3-second measurement window, after a 1-second warmup. I ran the whole suite on two machines: an ARM64 OCI Ampere A1 with 24 GB RAM, and an x86-64 AMD Ryzen 9 9900X with 64 GB RAM.
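To make the workload shape concrete, here is a minimal sketch of one iteration. It is not the benchmark code: the standard library's `std::sync::mpsc::sync_channel` stands in for both crates so the snippet needs no dependencies, and the `Vec<u8>` payload type and the names `PAYLOAD_SIZES`, `MESSAGES`, and `send_then_receive` are my assumptions for illustration.

```rust
use std::sync::mpsc::sync_channel;

/// Payload sizes from the text, in bytes (8 B through 32 KB).
const PAYLOAD_SIZES: [usize; 8] = [8, 64, 256, 512, 1024, 4096, 16 * 1024, 32 * 1024];

/// Messages per iteration, which is also the channel capacity.
const MESSAGES: usize = 1000;

/// One iteration: fill a bounded channel with `MESSAGES` payloads, then
/// drain it on the same thread. Returns the total number of bytes received.
/// Assumption: payloads are heap-allocated `Vec<u8>` buffers; the original
/// benchmark may use a different payload type.
fn send_then_receive(payload_bytes: usize) -> usize {
    // Capacity equals the message count, so no send ever waits on a full channel.
    let (tx, rx) = sync_channel::<Vec<u8>>(MESSAGES);

    for _ in 0..MESSAGES {
        tx.send(vec![0u8; payload_bytes]).expect("receiver is still alive");
    }

    let mut received_bytes = 0;
    for _ in 0..MESSAGES {
        received_bytes += rx.recv().expect("sender is still alive").len();
    }
    received_bytes
}
```

In a real harness, a function like this would be the body that Criterion.rs times, called once per entry in `PAYLOAD_SIZES` for each channel implementation.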

Results

Data table

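| Size | Tokio ARM64 (µs) | Crossbeam ARM64 (µs) | Ratio ARM64 | Tokio x86-64 (µs) | Crossbeam x86-64 (µs) | Ratio x86-64 |
|---|---|---|---|---|---|---|
| 8 B | 101.7 | 73.6 | 1.38× | 56.7 | 29.5 | 1.92× |
| 64 B | 138.2 | 135.9 | 1.02× | 71.5 | 49.0 | 1.46× |
| 256 B | 204.0 | 190.2 | 1.07× | 93.9 | 58.3 | 1.61× |
| 512 B | 301.7 | 255.6 | 1.18× | 133.6 | 81.7 | 1.64× |
| 1 KB | 484.3 | 412.6 | 1.17× | 190.1 | 192.9 | 0.98× |
| 4 KB | 1,629.5 | 1,279.3 | 1.27× | 604.6 | 609.6 | 0.99× |
| 16 KB | 7,917.7 | 5,763.3 | 1.37× | 3,456.5 | 2,652.3 | 1.30× |
| 32 KB | 17,548.0 | 12,519.0 | 1.40× | 7,579.6 | 5,822.5 | 1.30× |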
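Each ratio is the tokio time divided by the crossbeam time for the same size and machine, so values above 1 mean tokio is slower. A small sketch of that arithmetic, with function names that are mine rather than anything from the benchmark:

```rust
/// Tokio time divided by crossbeam time for the same payload size.
/// A result above 1.0 means tokio took longer.
fn slowdown_ratio(tokio_us: f64, crossbeam_us: f64) -> f64 {
    tokio_us / crossbeam_us
}

/// Formats a ratio to two decimal places, matching the table's precision.
fn format_ratio(ratio: f64) -> String {
    format!("{:.2}", ratio)
}
```

For the 8 B row, `format_ratio(slowdown_ratio(101.7, 73.6))` gives `1.38` on ARM64 and `format_ratio(slowdown_ratio(56.7, 29.5))` gives `1.92` on x86-64.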

What I noticed

On x86-64, tokio and crossbeam are effectively tied at 1 KB and 4 KB, with ratios of 0.98× and 0.99×. At those sizes the cost of moving the payload appears to swallow the async overhead, although the gap reopens to 1.30× at 16 KB.

On ARM64, tokio is always slower. The gap ranges from 1.02× at 64 B up to 1.40× at 32 KB. The async tax never fully disappears on this chip.

Small messages hurt most on x86-64. At 8 B, tokio is 1.92× slower. Crossbeam’s tight loop loves branch prediction and low-latency atomics there. Tokio still pays runtime coordination costs even when the message is tiny.

At the large end the ratios level off: 1.37× and 1.40× on ARM64 at 16 KB and 32 KB, and 1.30× at both sizes on x86-64. That tells me the bottleneck shifts from scheduling logic to plain memcpy bandwidth.

How the overhead works

The gap is not about throughput. Both crates move the same bytes. The difference is per-message coordination overhead.
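One way to put a number on that per-message cost is to spread the time difference over the messages in an iteration. This sketch assumes the table values are the time for one full iteration of 1000 messages; the function name is hypothetical.

```rust
/// Extra time tokio spends per message compared with crossbeam, in nanoseconds.
/// Assumption: both inputs are microseconds for one iteration of `messages` messages.
fn per_message_gap_ns(tokio_us: f64, crossbeam_us: f64, messages: u32) -> f64 {
    // 1 µs = 1000 ns, then divide across the messages in the iteration.
    (tokio_us - crossbeam_us) * 1000.0 / f64::from(messages)
}
```

Under that assumption, the 8 B rows work out to about 28.1 ns per message on ARM64 (`per_message_gap_ns(101.7, 73.6, 1000)`) and about 27.2 ns on x86-64 (`per_message_gap_ns(56.7, 29.5, 1000)`).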

Tokio’s mpsc does more work on every send and receive. It registers wakers [2], advances future state machines, and goes through the runtime’s poll loop. On x86-64, out-of-order execution and large caches can overlap that bookkeeping with the memory copy. The overhead hides at medium sizes.

On ARM64, smaller caches and simpler cores expose every extra instruction. The bookkeeping is visible in the latency.
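The waker bookkeeping is easier to picture with a toy example. The sketch below uses only the standard library and is not tokio's implementation: `CountingWaker`, `ReadyOnSecondPoll`, and `drive` are hypothetical names, and the hand-written loop stands in for a runtime. It shows the extra round trip a pending operation costs: poll, register interest through the waker, get polled again.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical waker that only counts how often it is invoked.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical future: pending on the first poll, ready on the second.
struct ReadyOnSecondPoll {
    polled_once: bool,
}

impl Future for ReadyOnSecondPoll {
    type Output = u64;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        if self.polled_once {
            Poll::Ready(42)
        } else {
            self.polled_once = true;
            // Signal that this future wants to be polled again.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Polls the future in a loop until it completes.
/// Returns (output, number of polls, number of wake calls).
fn drive() -> (u64, usize, usize) {
    let counter = Arc::new(CountingWaker { wakes: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);

    let mut future = ReadyOnSecondPoll { polled_once: false };
    // The future holds no self-references, so it is `Unpin` and `Pin::new` is allowed.
    let mut pinned = Pin::new(&mut future);

    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(value) = pinned.as_mut().poll(&mut cx) {
            return (value, polls, counter.wakes.load(Ordering::SeqCst));
        }
    }
}
```

Calling `drive()` returns `(42, 2, 1)`: two polls and one wake for a single value. A synchronous channel receive has no equivalent of that second pass.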

This makes ARM64 a great reality check for MPSC optimizations. A win there is probably real. It is not just buried under a fast x86-64 memory subsystem.

Glossary

| Term | What it means |
|---|---|
| tokio::sync::mpsc | Tokio’s async multi-producer, single-consumer channel |
| crossbeam::channel | Crossbeam’s lock-free synchronous channel |
| Criterion.rs | Statistics-driven benchmark harness for Rust |
| bounded | Fixed capacity; senders wait when full (crossbeam blocks the thread, tokio suspends the task) |
| waker | Object that notifies the runtime a task may resume |
| OoO | Out-of-order execution; CPU reorders instructions |

  1. A statistics-driven benchmark harness that uses linear regression to estimate per-iteration time and flags outlier samples.

  2. An async primitive that tells the scheduler a suspended task should be polled again.