I ran an allocator shootout for async Rust on a 4-core ARM64 server. I wanted to know which allocator handles Tokio tasks and channels best. The answer depends heavily on allocation size.

jemalloc regresses large-message MPSC by 62% versus std on ARM64. At 16 KB messages it is 1.62× slower than glibc; at 32 KB the gap narrows to 1.23× but snmalloc and mimalloc are both ~1.93× faster than std.

The root cause is madvise(..., MADV_DONTNEED), not cache misses. jemalloc calls MADV_DONTNEED 228,455 times during the 32 KB benchmark. It aggressively returns physical pages to the OS. The next write triggers a demand-zero minor page fault. snmalloc uses MADV_FREE (38,796 calls) which lazily defers reclamation. No forced re-faulting happens there.

jemalloc tuning can fix it, but only with the correct env var and only for specific patterns. tikv-jemallocator 0.6.1 is built with JEMALLOC_PREFIX="_rjem_", so the environment variable must be _RJEM_MALLOC_CONF, not MALLOC_CONF. With _RJEM_MALLOC_CONF=tcache_max:4096, jemalloc’s 32 KB MPSC time drops by 62% and actually beats snmalloc on the single-sender workload. However, the same tuning hurts single-threaded allocation (+96%) and multi-sender contention (+10%).

The penalty is allocation-size dependent, not async-pattern dependent. Reproducing tokio’s own sync_mpsc benchmark with 5 senders, 1000 messages: usize payloads show all allocators within 4%, but 32 KB Vec payloads reproduce the exact same jemalloc regression.

jemalloc wins on small-object churn. spawn_many_local (10,000 task spawns) is 2.02× faster under jemalloc than std. Thread-local slab caches dominate when objects stay under jemalloc’s small-class threshold.

snmalloc and mimalloc remain broadly safer on this machine. They win or tie on 10 of 13 reported cases. They avoid the large-message cliff entirely. No MALLOC_CONF archaeology is required.

Method

All runs on a single machine to eliminate cross-hardware noise.

  • Machine: OCI Ampere A1, 4 vCore ARM64 (aarch64), 24 GB RAM
  • OS: Debian 13 (kernel 6.12.86+deb13-arm64)
  • Page size: 4 KB
  • Compiler: rustc 1.95.0, cargo 1.95.0
  • Libraries: tokio 1.52.3, criterion 0.5.1
  • Allocators: tikv-jemallocator 0.6.1 (jemalloc 5.3.0), mimalloc 0.1.50 (mimalloc v2.x), snmalloc-rs 0.7.4, glibc 2.41 (std)

I built one benchmark binary per allocator. Each binary sets a different #[global_allocator] and links the same lib.rs routines. I used Criterion.rs to collect timings. It runs 50 samples with a 3 s measurement window and 1 s warmup. The build profile is Cargo’s built-in bench. That means opt-level 3, debug assertions off, and equivalent to --release.

Reproduction:

1
2
3
4
5
6
git clone <repo> allocator-shootout
cd allocator-shootout
cargo bench --bench std_alloc -- --noplot
cargo bench --bench jemalloc --features jemalloc -- --noplot
cargo bench --bench mimalloc --features mimalloc -- --noplot
cargo bench --bench snmalloc --features snmalloc -- --noplot

Raw numbers: bench_results.csv
Perf counters: perf_stat_results.txt
Tuning sweep: jemalloc_tuning_results.txt
Strace traces: jemalloc_strace.txt, snmalloc_strace.txt, mimalloc_strace.txt, std_strace.txt

Results

spawn_many: tasks spawned and awaited

spawn_many latency by allocator

Tasksstd (µs)jemalloc (µs)mimalloc (µs)snmalloc (µs)snmalloc vs std
1 000748.0630.3594.4577.41.30×
5 0003 524.73 053.82 847.82 727.71.29×
10 0007 387.56 413.55 499.05 216.41.42×

All three alternatives beat std. The win grows with task count. Each task carries a Task box, a join handle, and at least one waker allocation. On ARM64 these small frequent allocations are expensive under glibc’s default arena strategy.

mpsc: latency vs message size

mpsc latency by allocator and message size

Sizestd (µs)jemalloc (µs)mimalloc (µs)snmalloc (µs)Best vs std
8 B139.5113.3112.6112.51.24× snmalloc
64 B139.3119.4114.5114.21.22× snmalloc
256 B170.7138.4115.2117.31.48× mimalloc
512 B177.5161.0127.6125.01.42× snmalloc
1 KB211.3203.5138.4138.91.53× mimalloc
4 KB264.3377.1267.5212.91.24× snmalloc
16 KB1 077.81 741.0637.7593.01.82× snmalloc
32 KB1 875.32 299.2984.4970.71.93× snmalloc

Two regimes appear here. Small messages show snmalloc ≈ mimalloc ≪ jemalloc < std. The gap is only 1.2–1.5×. Large messages flip the story. jemalloc drops below std. At 16 KB it is 1.62× slower than glibc; at 32 KB it is still 1.23× slower. Meanwhile snmalloc and mimalloc are nearly 2× faster than std.

The threshold where jemalloc inverts sits between 4 KB and 16 KB. That is the boundary between allocator slab caches and fresh extent allocation.

block_on_alloc: Vec churn inside block_on

block_on_alloc latency by allocator

Casestd (µs)jemalloc (µs)mimalloc (µs)snmalloc (µs)Best vs std
100 × 64 B2.121.481.361.301.63× snmalloc
100 × 4 KB11.005.848.836.551.88× jemalloc
1 000 × 64 B20.3713.9712.7211.901.71× snmalloc
1 000 × 4 KB109.9757.7887.5364.741.90× jemalloc

jemalloc wins on large individual allocations. Those 4 KB blobs hit its huge-page path and thread-local cache. snmalloc wins on high-count small batches. mimalloc is competitive but never leads.

alloc_stress: random-size Vec churn

alloc_stress latency by allocator

Allocsstd (µs)jemalloc (µs)mimalloc (µs)snmalloc (µs)Best vs std
1009.373.835.454.472.45× jemalloc
1 000141.6174.3134.3111.41.27× snmalloc

jemalloc dominates the small-batch case. Its thread-local caching is excellent there. At 1 000 allocs snmalloc’s global-free-list batching takes over and it leads.

Root cause: MADV_DONTNEED vs MADV_FREE

I traced memory syscalls with strace -e memory on the 32 KB MPSC case. strace overhead slows each iteration, so Criterion adapts by running fewer iterations. The counts below are per benchmark iteration. That is one pass of 1 000 sends plus 1 000 recvs. Raw captured traces: jemalloc_strace.txt, snmalloc_strace.txt, mimalloc_strace.txt, std_strace.txt

Allocatormadvise / iterbrk / iterTypical adviceIterations captured
jemalloc~1,140~0.3MADV_DONTNEED200
snmalloc~90MADV_FREE3 825
mimalloc<0.020(various)3 825
std0~38n/a1 275

jemalloc issues roughly 130× more madvise calls per iteration than snmalloc. snmalloc’s calls use MADV_FREE rather than MADV_DONTNEED.

MADV_DONTNEED tells the kernel to drop physical pages immediately. The virtual addresses stay valid. The next write triggers a demand-zero minor page fault. The kernel must allocate a fresh physical page and zero it. On ARM64 with 4 KB pages and smaller TLBs than x86-64, this faulting is proportionally more expensive.

MADV_FREE tells the kernel the pages may be reclaimed later. They stay valid in RAM until reclaim actually happens. No forced re-faulting occurs on subsequent writes.

jemalloc is aggressively returning memory. snmalloc is deferring. The benchmark repeatedly allocates and frees 32 KB buffers through a bounded channel. That workload pings the MADV_DONTNEED / demand-fault cycle perfectly.

I confirmed the per-iteration fault rate with perf stat -e page-faults. Criterion’s adaptive iteration count means totals vary. Rates are reproducible.

AllocatorPage faults / iterSys % of elapsed
jemalloc~1,580~73%
std~1,400~47%
mimalloc<0.1<1%
snmalloc<0.1~4%

jemalloc spends most of elapsed time in the kernel servicing those faults. This is not a cache-miss problem. Cache miss rates are comparable across allocators. It is a syscall-level page-fault storm.

Is this a jemalloc bug?

No. It is documented behavior. The jemalloc man page explicitly describes the two purge paths:

“A lazy extent purge function (e.g. implemented via madvise(...,MADV_FREE)) can delay purging indefinitely and leave the pages within the purged virtual memory range in an indeterminate state, whereas a forced extent purge function immediately purges, and the pages within the virtual memory range will be zero-filled the next time they are accessed.”1

I followed the source code in jemalloc 5.3.0. The cascade lives in extent_dalloc_wrapper() inside src/extent.c. On Linux with default retain:true, the flow works like this:

  1. ehooks_dalloc_will_fail() returns true because opt_retain is enabled. munmap is skipped entirely.
  2. extent_decommit_wrapper() calls pages_decommit()pages_commit_impl(). That returns true immediately because os_overcommits is true on Linux. Decommit is a no-op on overcommitting systems.
  3. Fall through to forced purgepages_purge_forced()madvise(..., MADV_DONTNEED).
  4. Next access to those pages = demand-zero fault.

The 228,455 MADV_DONTNEED calls are not a fallback after failed syscalls. They are the only allocator action that actually runs on this path.

jemalloc’s own TUNING.md frames this as a CPU-versus-memory trade-off:

“Decay time determines how fast jemalloc returns unused pages back to the operating system, and therefore provides a fairly straightforward trade-off between CPU and memory usage. Shorter decay time purges unused pages faster to reduces memory usage (usually at the cost of more CPU cycles spent on purging)."2

The 62% regression is simply the “more CPU cycles” side of that trade-off. ARM64’s stricter TLB behavior amplifies it. The benchmark’s tight allocation loop also magnifies the cost. On a memory-constrained server the same behavior would be correct. On a 24 GB machine running allocation-heavy async workloads it is not.

Could jemalloc be tuned out of this?

A common first attempt is to set MALLOC_CONF. On tikv-jemallocator 0.6.1 that variable is silently ignored. The build uses JEMALLOC_PREFIX="_rjem_", so the correct name is _RJEM_MALLOC_CONF. Any testing with MALLOC_CONF will measure the default regardless of the string passed.

I retested with _RJEM_MALLOC_CONF on the 32 KB MPSC case.

TunePage faultsPer-iteration time
Baseline~2 019 6352.35 ms
retain:false~2 018 9742.36 ms
dirty_decay_ms:0,muzzy_decay_ms:0~2 032 0912.33 ms
narenas:1~2 031 0852.33 ms
tcache:false9490.86 ms
tcache_max:4096~1 0230.89 ms

What changed

tcache:false and tcache_max:4096 do fix the regression. They improve this benchmark by 2.6–2.7×. The page-fault count drops from ~1,500 faults/iteration to near-zero. The other knobs (retain, decay, arenas) genuinely do nothing.

Why does limiting the tcache help? With default tcache_max, large deallocated chunks sit in the thread cache. When the cache fills, jemalloc purges them with madvise(..., MADV_DONTNEED). The next allocation from the same thread immediately reclaims from the tcache and touches those zeroed pages. That triggers a demand-zero fault. The strace data shows this clearly: baseline jemalloc triggers ~1,140 madvise calls per iteration, each followed by a fault on reuse.

With tcache_max:4096, allocations ≥ 4 KB bypass the tcache. They flow through the arena’s extent management. Notably, the madvise count does not drop. It rises slightly to ~1,300 calls/iteration. Those purged extents are no longer on the immediate-reuse hot path. The next allocation tends to come from a different, non-purged extent. Demand-zero faults collapse from ~1,580/iteration to near-zero.

This fix is so pattern-dependent because it only helps when the benchmark immediately reclaims from the same thread’s cache. When a single thread does all the work (block_on_alloc_st) or when many senders share the channel (tokio_mpsc_large), bypassing the tcache forces allocations through the slower arena path. Things get worse.

The catch — pattern dependence

The fix is not universal.

PatternBaseline jemalloctcache_max:4096Change
mpsc/32 KB (1 sender)2.35 ms0.89 ms−62%
mpsc/16 KB (1 sender)1.79 ms0.83 ms−54%
mpsc/1 KB (1 sender)209 µs205 µs−2% ✅
mpsc/8 B (1 sender)113 µs113 µs0% ✅
tokio_mpsc_large/5×1000_32 KB (5 senders)4.27 ms4.70 ms+10%
block_on_alloc_st/1000×32 KB (single-threaded)390 µs763 µs+96%
spawn_many/10 0006.41 ms6.55 ms~+2% (within noise)

tcache_max:4096 wins for cross-thread, single-pair channel workloads with multi-KB messages. It hurts single-threaded and multi-sender patterns. The mechanism is workload-specific. It is not a general jemalloc panacea.

Practical option

If a workload is dominated by large MPSC messages and locked into jemalloc, _RJEM_MALLOC_CONF=tcache_max:4096 is worth testing. A more conservative starting point is _RJEM_MALLOC_CONF=tcache_max:16384. That fixes 32 KB while keeping 16 KB in cache. 16 KB stays at ~1.78 ms.

For most users, snmalloc and mimalloc avoid the cliff without tuning. tcache_max:4096 has no effect on small-object churn like spawn_many. It stays within noise. It hurts single-threaded large allocations and multi-sender channel contention.

Is this tokio-specific, or universal?

I replicated six patterns drawn directly from tokio’s own benchmark suite. I wanted to test whether the effect is specific to multi-threaded async, channels, or something else.

PatternTypestd (µs)jemalloc (µs)snmalloc (µs)Key finding
spawn_many_local/10kMT spawn10 6935 2895 393jemalloc wins (2× std)
spawn_many_remote_idle/10kMT spawn8 0506 2375 021snmalloc wins
mpsc_contention/5x1000MT channel (usize)1 2361 2161 194All tied within 4%
mpsc_large/5x1000_32KBMT channel (32 KB)4 4854 3143 299jemalloc loses — reproduces cliff
block_on_alloc_st/1000x32KBST alloc only402390366snmalloc wins; not async-specific
spawn_st/1kST spawn336275244snmalloc wins; not MT-specific

Three conclusions emerge.

usize contention barely matters for the allocator. Small objects short-circuit jemalloc’s MADV_DONTNEED path.

Large Vec<u8> payloads reproduce the cliff. That happens even with tokio’s own sync_mpsc contention pattern. The issue is not the custom harness.

Single-threaded block_on with 32 KB allocs shows the same penalty. This is not an async-specific or multi-thread-specific pathology. It is an allocator-allocation-size interaction. Tokio channels expose it when messages are multi-KB.

Summary

WorkloadWinnerstd vs bestKey driver
spawn_manysnmalloc ≈ mimalloc1.42×Small-object fast path
mpsc smallsnmalloc ≈ mimalloc1.24–1.53×Thread cache, low coordination
mpsc large (untuned)snmalloc ≈ mimalloc~1.93×MADV_FREE avoids fault storm
mpsc large (jemalloc, tcache_max:4096)jemalloc~2.6× vs untuned jemallocBypass tcache to avoid madvise reuse cycle
block_on_alloc (small)snmalloc1.71×Batch deallocation
block_on_alloc (large)jemalloc1.90×Huge-page / slab path
alloc_stress (small)jemalloc2.45×Thread-local cache
alloc_stress (large)snmalloc1.27×Global-free-list batching

Glossary

TermMeaning
madviseLinux syscall that tells the kernel how a memory range will be used.
MADV_DONTNEEDAdvice to drop physical pages immediately; next write faults.
MADV_FREEAdvice to lazily reclaim pages; no immediate fault on reuse.
straceTool that traces syscalls made by a process.
perf statLinux profiler that counts hardware and software events.
Criterion.rsRust benchmarking harness that handles statistics and warming.
tcachejemalloc’s per-thread cache for recently freed small objects.
arenajemalloc’s logical partition of the heap assigned to threads.

Appendix: Does this reproduce on x86-64?

All numbers so far come from an OCI Ampere A1. That is ARM64 with 4 vCore. To test architecture specificity I ran the same suite on an AMD Ryzen 9 9900X. That machine has 16C/32T, 64 GB DDR5, x86-64, Debian, kernel 6.x, and 4 KB pages.

Key results on x86-64

Workloadstdjemallocmimallocsnmalloctcache_max:4096
mpsc/16 KB387 µs728 µs (+88%)226 µs241 µs394 µs (+2%)
mpsc/32 KB744 µs1 004 µs (+35%)382 µs406 µs393 µs (−47%)
block_on_alloc/1000×4 KB43.6 µs21.5 µs (−51%)37.9 µs20.9 µs (−52%)23.2 µs (−47%)
spawn_many/10 0002.62 ms3.13 ms (+19%)3.04 ms3.51 ms4.48 ms (+71%)
tokio_mpsc_large/5×1000_32KB902 µs1.80 ms (+100%)1.74 ms1.52 ms2.78 ms (+208%)

Three findings stand out.

1. The jemalloc regression is present on x86-64 too, and at some sizes it is worse. At 16 KB, jemalloc is 1.88× slower than std on x86-64. That is 1.62× on ARM64. At 32 KB the gap narrows to 1.35× versus 1.23× on ARM64. The direction is identical. The magnitude depends on the exact size threshold between slab and extent allocation on each platform.

2. tcache_max:4096 fixes the single-sender large-MPSC case on x86-64 as well. 32 KB drops from 1.0 ms to 393 µs. That is a 61% improvement. It matches the 62% seen on ARM64. The mechanism is the same. Bypassing the thread cache avoids immediate reuse of MADV_DONTNEED-purged extents.

3. The pattern-dependence is identical. The same tuning hurts multi-sender contention. It rises +208% on tokio_mpsc_large on x86-64. That is +10% on ARM64. It also hurts single-threaded allocation. spawn_many rises +71% on x86-64 versus ~0% on ARM64. The variance in magnitude is platform-specific. The sign is the same.

strace confirms the same syscall pattern

strace -f -e memory on the 32 KB MPSC case shows the same mechanism. The x86-64 machine runs more iterations per sample, so absolute counts are lower.

Allocatormadvise callsIterationsMADV_DONTNEED / iterMADV_FREE / iter
jemalloc4223 8250.11<0.001
std485 1000.009
mimalloc318 9250.0030
snmalloc567 6500.0060.001

jemalloc issues roughly 12× more MADV_DONTNEED calls per iteration than std on x86-64. snmalloc again uses MADV_FREE instead. The ratio is lower than on ARM64 where it was ~130×. The direction is unchanged. The effect is allocator behavior, not an ARM64 pathology.

Platform differences worth noting

  • spawn_many: std wins on x86-64 (2.62 ms) while snmalloc wins on ARM64 (5.22 ms). glibc’s arena strategy performs better on the larger x86-64 cache hierarchy.
  • Absolute latencies: x86-64 is roughly 2–3× faster across the board. That is expected from a much faster CPU. The relative allocator rankings, however, are preserved.

Take

jemalloc is not “bad on ARM64.” It is appropriately aggressive for memory-tight servers. It is inappropriately aggressive for allocation-heavy, page-reuse workloads. Its MADV_DONTNEED strategy triggers a demand-zero fault storm. That happens when large buffers are repeatedly allocated and freed. Multi-KB MPSC messages produce exactly that pattern. The same regression appears on x86-64, so this is allocator behavior.

The practical rule is still: if a Tokio service pushes > 4 KB through channels, test snmalloc or mimalloc first. They avoid the cliff with no configuration.

If the binary is already committed to jemalloc, _RJEM_MALLOC_CONF=tcache_max:4096 can recover the loss on single-pair channel workloads. It is a sharp knife. Measure the actual pattern before deploying it. The only tuning that matters here is controlling the tcache size threshold.

One final, counter-intuitive result: perf stat shows cache-miss rates are higher for snmalloc than jemalloc on the large MPSC case (1.51% vs 0.96%). snmalloc is winning despite worse cache behavior. Its page-fault count is near-zero. The bottleneck is kernel entry and exit, not the cache hierarchy.


  1. jemalloc 5.3.0 man page, extent_hooks_t section. Source: jemalloc.xml in the tikv-jemalloc-sys 0.6.1 build. ↩︎

  2. jemalloc 5.3.0, TUNING.md↩︎