I always assumed the gap between async and sync channels was just noise. Then I benchmarked tokio::sync::mpsc against crossbeam::channel across two architectures to see exactly where the async tax bites hardest—and where it disappears.

On one architecture, the penalty was impossible to miss. On the other, it practically vanished.

That split is what makes this worth your time. If the async tax can disappear on one architecture and dominate on another, why are you still choosing channels by habit?

Method

I set out to stress-test channels across two completely different architectures, and I wanted every parameter locked down so you can reproduce the pain.

Each iteration fires 1,000 bounded sends and 1,000 receives through a channel with capacity fixed at 1,000. I swept eight payload sizes (8 B, 64 B, 256 B, 512 B, 1 KB, 4 KB, 16 KB, and 32 KB) because cache behavior changes dramatically between a payload that fits inside a single cache line and one that spans hundreds of them.
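To make the shape of one iteration concrete, here is a minimal sketch, not the actual harness. It assumes a single thread that fills the channel and then drains it, sends each payload by value as a fixed-size byte array, and uses std::sync::mpsc::sync_channel as a dependency-free stand-in for both crates. The names run_iteration and report are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::time::{Duration, Instant};

const MESSAGES: usize = 1_000; // sends (and receives) per iteration, per the text
const CAPACITY: usize = 1_000; // bounded channel capacity, per the text

/// Runs one iteration with an N-byte payload: fill the channel, then drain it.
/// Returns how many messages were received and the elapsed wall time.
fn run_iteration<const N: usize>() -> (usize, Duration) {
    // Stand-in channel: the real benchmark would swap in tokio or crossbeam here.
    let (tx, rx) = sync_channel::<[u8; N]>(CAPACITY);
    let start = Instant::now();

    // Capacity equals the message count, so none of these sends block.
    for _ in 0..MESSAGES {
        tx.send([0u8; N]).expect("receiver is still alive");
    }

    let mut received = 0;
    for _ in 0..MESSAGES {
        let payload = rx.recv().expect("sender is still alive");
        assert_eq!(payload.len(), N);
        received += 1;
    }
    (received, start.elapsed())
}

/// Prints a one-line summary for a single payload size.
fn report<const N: usize>() {
    let (received, elapsed) = run_iteration::<N>();
    println!("{N:>6} B: {received} messages in {elapsed:?}");
}

fn main() {
    // The eight payload sizes from the sweep.
    report::<8>();
    report::<64>();
    report::<256>();
    report::<512>();
    report::<1024>();
    report::<4096>();
    report::<16384>();
    report::<32768>();
}
```

A single Instant measurement like this is far noisier than Criterion's sampled runs, so treat its output as a sanity check rather than a result.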

Criterion.rs handled the timing: 50 samples, 3 s measurement windows, and a 1 s warmup to keep the CPU governor from skewing early runs.

The ARM64 box is an OCI Ampere A1 with 24 GB RAM. The x86-64 workhorse is an AMD Ryzen 9 9900X with 64 GB RAM.

Neither machine is short on memory, so on each platform any gap between tokio and crossbeam is almost certainly the implementation, not the hardware gasping for headroom. So which platform actually wins when the bytes start flying?

Results

Data table

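| Size | Tokio ARM64 (µs) | Crossbeam ARM64 (µs) | ARM64 ratio | Tokio x86-64 (µs) | Crossbeam x86-64 (µs) | x86-64 ratio |
|---|---:|---:|---:|---:|---:|---:|
| 8 B | 101.7 | 73.6 | 1.38× | 56.7 | 29.5 | 1.92× |
| 64 B | 138.2 | 135.9 | 1.02× | 71.5 | 49.0 | 1.46× |
| 256 B | 204.0 | 190.2 | 1.07× | 93.9 | 58.3 | 1.61× |
| 512 B | 301.7 | 255.6 | 1.18× | 133.6 | 81.7 | 1.64× |
| 1 KB | 484.3 | 412.6 | 1.17× | 190.1 | 192.9 | 0.98× |
| 4 KB | 1,629.5 | 1,279.3 | 1.27× | 604.6 | 609.6 | 0.99× |
| 16 KB | 7,917.7 | 5,763.3 | 1.37× | 3,456.5 | 2,652.3 | 1.30× |
| 32 KB | 17,548.0 | 12,519.0 | 1.40× | 7,579.6 | 5,822.5 | 1.30× |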

What stands out

I expected tokio to pay some async tax across the board, but x86-64 had other plans. At 1 KB and 4 KB, tokio and crossbeam are indistinguishable at 0.98–0.99×, most likely because the cost of copying the payload hides the async overhead at those sizes.

ARM64 refuses to let tokio off that easy. The gap starts at 1.38× at 8 B, narrows to 1.02× at 64 B, then climbs back up to 1.40× at 32 KB, and the async tax never fully amortises.

Where tokio really bleeds is tiny messages on x86-64. At 8 B, the overhead hits 1.92×. My best guess is that branch prediction and low-latency atomics make crossbeam’s tight loop exceptionally efficient there, while tokio still pays the runtime coordination cost.

But look at 16 KB and 32 KB on both architectures: the ratio stops swinging and settles at 1.30× on x86-64 and 1.37–1.40× on ARM64. That plateau suggests the bottleneck has stopped being scheduling logic and become memcpy bandwidth.

If your workload is a firehose of small control messages, that 1.92× gap is the difference between saturating a core and leaving headroom. At what payload size does your own runtime stop mattering and the memory bus take over?
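A ratio hides how small the absolute numbers are. As a rough worked example, the sketch below converts a few rows of the table into extra nanoseconds per message, assuming each iteration's time divides evenly across its 1,000 send/receive pairs. Row, ratio, and extra_ns_per_message are hypothetical names, and only a subset of the rows is included.

```rust
/// One row of the results table: mean time per 1,000-message iteration, in µs.
struct Row {
    size: &'static str,
    tokio_us: f64,
    crossbeam_us: f64,
}

// Assumption: iteration time is spread evenly over the 1,000 send/receive pairs.
const MESSAGES_PER_ITERATION: f64 = 1_000.0;

// Values copied from the data table (subset of rows).
const X86_64: [Row; 3] = [
    Row { size: "8 B", tokio_us: 56.7, crossbeam_us: 29.5 },
    Row { size: "1 KB", tokio_us: 190.1, crossbeam_us: 192.9 },
    Row { size: "32 KB", tokio_us: 7_579.6, crossbeam_us: 5_822.5 },
];
const ARM64: [Row; 3] = [
    Row { size: "8 B", tokio_us: 101.7, crossbeam_us: 73.6 },
    Row { size: "1 KB", tokio_us: 484.3, crossbeam_us: 412.6 },
    Row { size: "32 KB", tokio_us: 17_548.0, crossbeam_us: 12_519.0 },
];

/// Tokio time divided by crossbeam time, as in the Ratio columns.
fn ratio(row: &Row) -> f64 {
    row.tokio_us / row.crossbeam_us
}

/// Extra nanoseconds tokio spends per message relative to crossbeam.
/// Negative values mean tokio was the faster of the two.
fn extra_ns_per_message(row: &Row) -> f64 {
    (row.tokio_us - row.crossbeam_us) * 1_000.0 / MESSAGES_PER_ITERATION
}

fn main() {
    for (arch, rows) in [("x86-64", &X86_64), ("ARM64", &ARM64)] {
        for row in rows {
            println!(
                "{arch:>6} {:>5}: ratio {:.2}, extra {:.1} ns per message",
                row.size,
                ratio(row),
                extra_ns_per_message(row)
            );
        }
    }
}
```

On the 8 B rows this works out to roughly 27 ns of extra time per message on x86-64 and 28 ns on ARM64, which is nearly the same absolute cost measured against very different crossbeam baselines.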

Interpretation

I used to assume the bottleneck was raw throughput until I measured it. Both channels move the same bytes; the real drag is per-message coordination overhead.

Tokio’s mpsc does more work per operation: waker bookkeeping, state-machine progress, runtime polling.

x86-64 tolerates this overhead. Its aggressive out-of-order execution and large caches let that bookkeeping overlap with the memory copy, so the overhead vanishes at medium message sizes.

ARM64 does not grant that overlap. Smaller caches and simpler cores keep the same bookkeeping in the critical path, so the cost remains visible even at medium sizes.

That is exactly why I now reach for ARM64 first when I am validating MPSC optimisations. A speedup there is likely real, not just hidden by a fast memory subsystem.

When did you last run your channel benchmarks on a machine that actually punishes coordination overhead?