  1. jemalloc regresses large-message MPSC by up to 62% versus std on ARM64. At 16 KB messages it is 1.62× slower than glibc; at 32 KB the gap narrows to 1.23×, while snmalloc and mimalloc are both roughly 1.9× faster than std.
  2. The root cause is madvise(..., MADV_DONTNEED), not cache misses. jemalloc calls MADV_DONTNEED 228,455 times during the 32 KB benchmark, aggressively returning physical pages to the OS. The next write to any returned page triggers a demand-zero minor page fault. snmalloc uses MADV_FREE (38,796 calls) which lazily defers reclamation — no forced re-faulting.
  3. jemalloc tuning can fix it, but only with the correct env var and only for specific patterns. tikv-jemallocator 0.6.1 is built with JEMALLOC_PREFIX="_rjem_", so the environment variable must be _RJEM_MALLOC_CONF, not MALLOC_CONF. With _RJEM_MALLOC_CONF=tcache_max:4096, jemalloc’s 32 KB MPSC time drops by 62% and actually beats snmalloc on the single-sender workload. However, the same tuning hurts single-threaded allocation (+96%) and multi-sender contention (+10%).
  4. The penalty is allocation-size dependent, not async-pattern dependent. Reproducing tokio’s own sync_mpsc benchmark with 5 senders, 1000 messages: usize payloads show all allocators within 4%, but with 32 KB Vec payloads jemalloc again falls well behind snmalloc (about 31% slower), although in that run it roughly ties std.
  5. jemalloc wins on small-object churn. spawn_many_local (10,000 task spawns) is 2.02× faster under jemalloc than std. Thread-local slab caches dominate when objects stay under jemalloc’s small-class threshold.
  6. snmalloc and mimalloc remain broadly safer on this machine. They win or tie on 10 of 13 reported cases, avoid the large-message cliff entirely, and require no MALLOC_CONF archaeology.

Method

All runs in the main results were performed on a single machine to eliminate cross-hardware noise.

  • Machine: OCI Ampere A1, 4 vCore ARM64 (aarch64), 24 GB RAM
  • OS: Debian 13 (kernel 6.12.86+deb13-arm64)
  • Page size: 4 KB
  • Compiler: rustc 1.95.0, cargo 1.95.0
  • Libraries: tokio 1.52.3, criterion 0.5.1
  • Allocators: tikv-jemallocator 0.6.1 (jemalloc 5.3.0), mimalloc 0.1.50 (mimalloc v2.x), snmalloc-rs 0.7.4, glibc 2.41 (std)

Each bench binary sets a different #[global_allocator] and links the same lib.rs routines. Criterion.rs config: 50 samples, 3 s measurement, 1 s warmup. Build profile is Cargo’s built-in bench (opt-level 3, debug assertions off, equivalent to --release).
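As an illustration of how a binary can select its allocator, here is a minimal sketch that switches `#[global_allocator]` on a Cargo feature. It is an assumption about one possible arrangement, not the repository's actual layout: the feature names are taken from the `--features` flags in the reproduction commands, and `allocator_name` is a hypothetical helper added for logging.

```rust
// Sketch only: one source file that picks its allocator by Cargo feature.
// The feature names are assumed to match the `--features` flags used below.

#[cfg(feature = "jemalloc")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

#[cfg(feature = "mimalloc")]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

#[cfg(feature = "snmalloc")]
#[global_allocator]
static GLOBAL: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

// No feature selected: fall back to the system allocator (glibc on this machine).
#[cfg(not(any(feature = "jemalloc", feature = "mimalloc", feature = "snmalloc")))]
#[global_allocator]
static GLOBAL: std::alloc::System = std::alloc::System;

/// Hypothetical helper: report which allocator this binary was built with.
pub fn allocator_name() -> &'static str {
    if cfg!(feature = "jemalloc") {
        "jemalloc"
    } else if cfg!(feature = "mimalloc") {
        "mimalloc"
    } else if cfg!(feature = "snmalloc") {
        "snmalloc"
    } else {
        "std"
    }
}
```

Printing `allocator_name()` at the start of a run is a cheap guard against benchmarking the wrong allocator by accident.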

Reproduction:

```sh
git clone <repo> allocator-shootout
cd allocator-shootout
cargo bench --bench std_alloc -- --noplot
cargo bench --bench jemalloc --features jemalloc -- --noplot
cargo bench --bench mimalloc --features mimalloc -- --noplot
cargo bench --bench snmalloc --features snmalloc -- --noplot
```

Raw numbers: bench_results.csv
Perf counters: perf_stat_results.txt
Tuning sweep: jemalloc_tuning_results.txt
Strace traces: jemalloc_strace.txt, snmalloc_strace.txt, mimalloc_strace.txt, std_strace.txt

Results

spawn_many: tasks spawned and awaited

spawn_many latency by allocator

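| Tasks | std (µs) | jemalloc (µs) | mimalloc (µs) | snmalloc (µs) | snmalloc vs std |
|---|---|---|---|---|---|
| 1 000 | 748.0 | 630.3 | 594.4 | 577.4 | 1.30× |
| 5 000 | 3 524.7 | 3 053.8 | 2 847.8 | 2 727.7 | 1.29× |
| 10 000 | 7 387.5 | 6 413.5 | 5 499.0 | 5 216.4 | 1.42× |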

All three alternatives beat std. The win grows with task count because each task carries a Task box, a join handle, and at least one waker allocation. On ARM64 these small frequent allocations are expensive under glibc’s default arena strategy.

mpsc: latency vs message size

mpsc latency by allocator and message size

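| Size | std (µs) | jemalloc (µs) | mimalloc (µs) | snmalloc (µs) | Best vs std |
|---|---|---|---|---|---|
| 8 B | 139.5 | 113.3 | 112.6 | 112.5 | 1.24× snmalloc |
| 64 B | 139.3 | 119.4 | 114.5 | 114.2 | 1.22× snmalloc |
| 256 B | 170.7 | 138.4 | 115.2 | 117.3 | 1.48× mimalloc |
| 512 B | 177.5 | 161.0 | 127.6 | 125.0 | 1.42× snmalloc |
| 1 KB | 211.3 | 203.5 | 138.4 | 138.9 | 1.53× mimalloc |
| 4 KB | 264.3 | 377.1 | 267.5 | 212.9 | 1.24× snmalloc |
| 16 KB | 1 077.8 | 1 741.0 | 637.7 | 593.0 | 1.82× snmalloc |
| 32 KB | 1 875.3 | 2 299.2 | 984.4 | 970.7 | 1.93× snmalloc |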

Two regimes with very different allocator behavior:

  • Small messages (≤ 1 KB): snmalloc ≈ mimalloc ≤ jemalloc < std. jemalloc ties the leaders at 8 B and falls steadily behind them as size grows; the best allocator is 1.2–1.5× faster than std.
  • Large messages (≥ 16 KB): jemalloc drops below std. At 16 KB it is 1.62× slower than glibc; at 32 KB it is still 1.23× slower. Meanwhile snmalloc and mimalloc are nearly 2× faster than std.

jemalloc falls behind std somewhere between 1 KB and 4 KB (at 4 KB it is already 1.43× slower), and the gap is widest at 16 KB, where allocations move from allocator slab caches to fresh extent/page allocation.
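For readers who want to picture the workload, the following is a rough sketch of one single-sender pass. It uses `std::sync::mpsc::sync_channel` as a stand-in for tokio's bounded channel so that it needs no external crates; the function name `mpsc_pass` and the channel capacity are assumptions, and the real benchmark harness may differ.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// One benchmark-style pass (sketch): a sender thread allocates `count`
/// buffers of `size` bytes and moves them through a bounded channel. The
/// receiver drops each buffer, so every message is freed on a different
/// thread than the one that allocated it.
pub fn mpsc_pass(count: usize, size: usize, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    let sender = thread::spawn(move || {
        for _ in 0..count {
            // Fresh allocation per message; the non-zero fill writes every byte,
            // so every page of the buffer is touched.
            tx.send(vec![0xA5u8; size]).expect("receiver hung up");
        }
        // `tx` is dropped here, which ends the receive loop below.
    });

    let mut bytes = 0;
    for msg in rx {
        bytes += msg.len();
        // `msg` is dropped (freed) here, on the receiving thread.
    }

    sender.join().expect("sender panicked");
    bytes
}
```

Calling `mpsc_pass(1_000, 32 * 1024, 64)` mirrors the shape of the 32 KB row: 1 000 allocations on one side and 1 000 frees on the other.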

block_on_alloc: Vec churn inside block_on

block_on_alloc latency by allocator

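| Case | std (µs) | jemalloc (µs) | mimalloc (µs) | snmalloc (µs) | Best vs std |
|---|---|---|---|---|---|
| 100 × 64 B | 2.12 | 1.48 | 1.36 | 1.30 | 1.63× snmalloc |
| 100 × 4 KB | 11.00 | 5.84 | 8.83 | 6.55 | 1.88× jemalloc |
| 1 000 × 64 B | 20.37 | 13.97 | 12.72 | 11.90 | 1.71× snmalloc |
| 1 000 × 4 KB | 109.97 | 57.78 | 87.53 | 64.74 | 1.90× jemalloc |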

jemalloc wins on large individual allocations (4 KB blobs) where its huge-page path and thread-local cache pay off. snmalloc wins on high-count small batches. mimalloc is competitive but never leads.

alloc_stress: random-size Vec churn

alloc_stress latency by allocator

| Allocs | std (µs) | jemalloc (µs) | mimalloc (µs) | snmalloc (µs) | Best vs std |
|---|---|---|---|---|---|
| 100 | 9.37 | 3.83 | 5.45 | 4.47 | 2.45× jemalloc |
| 1 000 | 141.6 | 174.3 | 134.3 | 111.4 | 1.27× snmalloc |

jemalloc dominates the small-batch case (likely due to excellent thread-local caching). At 1 000 allocs snmalloc’s global-free-list batching takes over and it leads.

Root cause: MADV_DONTNEED vs MADV_FREE

strace -e memory on the 32 KB MPSC case reveals the mechanism. Because strace overhead slows each iteration, Criterion adapts by running fewer iterations; the counts below are per benchmark iteration (single pass of 1 000 sends + 1 000 recvs). Raw captured traces: jemalloc_strace.txt, snmalloc_strace.txt, mimalloc_strace.txt, std_strace.txt.

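| Allocator | madvise / iter | brk / iter | Typical advice | Iterations captured |
|---|---|---|---|---|
| jemalloc | ~1,140 | ~0.3 | MADV_DONTNEED | 200 |
| snmalloc | ~9 | 0 | MADV_FREE | 3 825 |
| mimalloc | <0.02 | 0 | (various) | 3 825 |
| std | 0 | ~38 | n/a | 1 275 |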

jemalloc issues roughly 130× more madvise calls per iteration than snmalloc, and snmalloc’s calls use MADV_FREE rather than MADV_DONTNEED.
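Per-iteration figures like these can be derived from a raw trace with a few lines of code. The sketch below is an assumption about how one might do it, not the script used for this post: `madvise_counts` and `per_iteration` are hypothetical helpers, and the parser expects strace's usual `madvise(addr, len, ADVICE) = ret` line shape.

```rust
use std::collections::BTreeMap;

/// Count madvise calls per advice flag in `strace -e memory` output (sketch).
/// Calls that strace splits into "<unfinished ...>" / "resumed" pairs are
/// skipped, so counts from heavily interleaved `-f` traces are a lower bound.
pub fn madvise_counts(trace: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for line in trace.lines() {
        // With `-f`, lines carry a pid prefix, so search instead of matching the start.
        let Some(start) = line.find("madvise(") else { continue };
        let args_start = start + "madvise(".len();
        let Some(len) = line[args_start..].find(')') else { continue };
        let args = &line[args_start..args_start + len];
        // The advice flag is the last comma-separated argument.
        let advice = args.rsplit(',').next().unwrap_or("").trim();
        if !advice.is_empty() {
            *counts.entry(advice.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Normalize a total call count by the number of benchmark iterations captured.
pub fn per_iteration(total_calls: usize, iterations: usize) -> f64 {
    total_calls as f64 / iterations as f64
}
```

With the jemalloc totals quoted earlier, `per_iteration(228_455, 200)` gives roughly 1,142 calls per iteration, which is where the ~1,140 in the table comes from.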

MADV_DONTNEED tells the kernel to drop physical pages for the mapped range immediately. The virtual addresses stay valid, but the next write to any page in that range triggers a demand-zero minor page fault — the kernel must allocate a fresh physical page and zero it. On ARM64 with a 4 KB page size and smaller TLBs than x86-64, this faulting is proportionally more expensive.

MADV_FREE tells the kernel the pages may be reclaimed under memory pressure, but they stay valid in RAM until reclaim actually happens. No forced re-faulting on subsequent writes.
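The demand-zero behavior of MADV_DONTNEED is easy to observe directly. The sketch below is Linux-only and makes several assumptions: it declares `mmap`, `madvise`, and `munmap` by hand instead of using the `libc` crate, hard-codes the Linux constant values for x86-64 and aarch64, and needs a toolchain that accepts `unsafe extern` blocks (Rust 1.82 or later). `byte_after_dontneed` is a hypothetical name.

```rust
use std::io;
use std::os::raw::{c_int, c_void};

// Linux values (x86-64 and aarch64), written out so no `libc` crate is needed.
const PROT_READ: c_int = 0x1;
const PROT_WRITE: c_int = 0x2;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;
const MADV_DONTNEED: c_int = 4;

unsafe extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: c_int, flags: c_int, fd: c_int, offset: i64) -> *mut c_void;
    fn madvise(addr: *mut c_void, len: usize, advice: c_int) -> c_int;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

/// Write a marker byte into a private anonymous mapping, purge the range with
/// MADV_DONTNEED, and return what the next read of that byte sees.
pub fn byte_after_dontneed() -> io::Result<u8> {
    const LEN: usize = 4 * 4096;
    unsafe {
        let p = mmap(
            std::ptr::null_mut(),
            LEN,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS,
            -1,
            0,
        );
        if p as isize == -1 {
            return Err(io::Error::last_os_error());
        }
        let bytes = p as *mut u8;
        // First touch: the kernel backs the page with physical memory.
        bytes.write_volatile(0xAB);
        if madvise(p, LEN, MADV_DONTNEED) != 0 {
            let err = io::Error::last_os_error();
            munmap(p, LEN);
            return Err(err);
        }
        // This access faults in a fresh zero-filled page; the marker is gone.
        let seen = bytes.read_volatile();
        munmap(p, LEN);
        Ok(seen)
    }
}
```

On Linux the function returns 0 rather than 0xAB: the address stayed valid, but the data did not, and the read had to go through the page-fault path to get a new page. The MADV_FREE case is deliberately left out because its post-purge contents are indeterminate.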

jemalloc is aggressively returning memory; snmalloc is deferring. The benchmark repeatedly allocates and frees 32 KB buffers through a bounded channel, which is exactly the workload that keeps cycling through MADV_DONTNEED and demand-zero faults.

perf stat -e page-faults on the 32 KB MPSC case confirms the per-iteration rate. Criterion’s adaptive iteration count means totals vary; rates are reproducible.

| Allocator | Page faults / iter | Sys % of elapsed |
|---|---|---|
| jemalloc | ~1,580 | ~73% |
| std | ~1,400 | ~47% |
| mimalloc | <0.1 | <1% |
| snmalloc | <0.1 | ~4% |

jemalloc spends most of elapsed time in the kernel servicing those faults. This is not a cache-miss problem — cache miss rates are comparable across allocators. It is a syscall-level page-fault storm.
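Fault counts can also be sampled from inside the process, which is handy when `perf` is unavailable. This is a minimal Linux-only sketch under stated assumptions: it reads the `minflt` field from `/proc/self/stat`, assumes 4 KB pages, and both function names are hypothetical. It is not how the numbers above were collected.

```rust
use std::fs;

/// Minor page faults of the current process so far
/// (Linux: field 10, `minflt`, of /proc/self/stat).
pub fn minor_faults() -> Option<u64> {
    let stat = fs::read_to_string("/proc/self/stat").ok()?;
    // The command name (field 2) may contain spaces, so split after its closing ')'.
    let rest = stat.rsplit_once(')')?.1;
    // After the ')': state, ppid, pgrp, session, tty_nr, tpgid, flags, minflt, ...
    rest.split_whitespace().nth(7)?.parse().ok()
}

/// Allocate `pages` fresh 4 KB pages, write one byte into each, and return
/// how many minor faults the process took while doing so.
pub fn faults_for_first_touch(pages: usize) -> Option<u64> {
    let before = minor_faults()?;
    let mut buf = vec![0u8; pages * 4096];
    for i in (0..buf.len()).step_by(4096) {
        // Volatile write so the store is not optimized away.
        unsafe { buf.as_mut_ptr().add(i).write_volatile(1) };
    }
    let after = minor_faults()?;
    drop(buf);
    Some(after.saturating_sub(before))
}
```

Wrapping a single benchmark pass between two `minor_faults()` calls gives a per-iteration fault count comparable to the table above. The exact delta depends on the allocator and on transparent huge pages, so treat it as a rate indicator rather than an exact page count.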

Is this a jemalloc bug?

No — it is documented behavior. The jemalloc man page explicitly describes the two purge paths:

“A lazy extent purge function (e.g. implemented via madvise(...,MADV_FREE)) can delay purging indefinitely and leave the pages within the purged virtual memory range in an indeterminate state, whereas a forced extent purge function immediately purges, and the pages within the virtual memory range will be zero-filled the next time they are accessed.” [1]

The source code (jemalloc 5.3.0) shows the cascade in extent_dalloc_wrapper() (src/extent.c). On Linux with default retain:true, the flow is:

  1. ehooks_dalloc_will_fail() returns true because opt_retain is enabled — munmap is skipped entirely.
  2. extent_decommit_wrapper() calls pages_decommit() → pages_commit_impl(), which returns true (failure) immediately because os_overcommits is true on Linux. Decommit is a no-op on overcommitting systems.
  3. Fall through to forced purge → pages_purge_forced() → madvise(..., MADV_DONTNEED).
  4. Next access to those pages = demand-zero fault.

The 228,455 MADV_DONTNEED calls are not a fallback after failed syscalls — they are the only allocator action that actually runs on this path.

jemalloc’s own TUNING.md frames this as a CPU-versus-memory trade-off:

“Decay time determines how fast jemalloc returns unused pages back to the operating system, and therefore provides a fairly straightforward trade-off between CPU and memory usage. Shorter decay time purges unused pages faster to reduces memory usage (usually at the cost of more CPU cycles spent on purging).” [2]

The 62% regression is simply the “more CPU cycles” side of that trade-off, amplified by ARM64’s stricter TLB behavior and the benchmark’s tight allocation loop. On a memory-constrained server the same behavior would be the correct choice; on a 24 GB machine running allocation-heavy async workloads it is not.

Could jemalloc be tuned out of this?

A common first attempt is to set MALLOC_CONF. On tikv-jemallocator 0.6.1 that variable is silently ignored — the build uses JEMALLOC_PREFIX="_rjem_", so the correct name is _RJEM_MALLOC_CONF. Any testing with MALLOC_CONF will measure the default configuration regardless of what string you pass.
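To avoid the wrong-variable trap in scripted runs, the variable name can be derived from the prefix instead of typed by hand. The sketch below assumes jemalloc's convention of upper-casing the symbol prefix in front of `MALLOC_CONF`, which matches the `_rjem_` → `_RJEM_MALLOC_CONF` pair described above. Both helper names are hypothetical, and the cargo arguments simply mirror the reproduction commands.

```rust
use std::process::Command;

/// Name of the runtime-config variable for a jemalloc build with the given
/// symbol prefix (assumed convention: upper-cased prefix + "MALLOC_CONF").
pub fn conf_env_var(prefix: &str) -> String {
    format!("{}MALLOC_CONF", prefix.to_uppercase())
}

/// Build (but do not run) the jemalloc bench command with a tuning string
/// applied through the prefixed variable.
pub fn tuned_bench_command(conf: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["bench", "--bench", "jemalloc", "--features", "jemalloc", "--", "--noplot"])
        .env(conf_env_var("_rjem_"), conf);
    cmd
}
```

Calling `tuned_bench_command("tcache_max:4096").status()` would launch the run. Setting the variable on the child process only keeps the tuning from leaking into other jemalloc-linked tools in the same shell.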

Retested with _RJEM_MALLOC_CONF on the 32 KB MPSC case:

| Tune | Page faults | Per-iteration time |
|---|---|---|
| Baseline | ~2 019 635 | 2.35 ms |
| retain:false | ~2 018 974 | 2.36 ms |
| dirty_decay_ms:0,muzzy_decay_ms:0 | ~2 032 091 | 2.33 ms |
| narenas:1 | ~2 031 085 | 2.33 ms |
| tcache:false | 949 | 0.86 ms |
| tcache_max:4096 | ~1 023 | 0.89 ms |

What changed

tcache:false and tcache_max:4096 do fix the regression, by 2.6–2.7× on this benchmark. Total page faults fall from about 2.0 million to roughly 1,000 for the whole run, which takes the per-iteration rate from ~1,580 faults to near-zero. The other knobs (retain, decay, arenas) genuinely do nothing.

Why does limiting the tcache help? With default tcache_max, large deallocated chunks sit in the thread cache. When the cache fills, jemalloc purges them with madvise(..., MADV_DONTNEED). The next allocation from the same thread immediately reclaims from the tcache and touches those zeroed pages → demand-zero fault. The strace data shows this clearly: baseline jemalloc triggers ~1,140 madvise calls per iteration, each followed by a fault on reuse.

With tcache_max:4096, allocations ≥ 4 KB bypass the tcache and flow through the arena’s extent management. Notably, the madvise count does not drop — in fact it rises slightly to ~1,300 calls/iteration — but those purged extents are no longer on the immediate-reuse hot path. The next allocation tends to come from a different, non-purged extent, so demand-zero faults collapse from ~1,580/iteration to near-zero.

This is why the fix is so pattern-dependent: it only helps when the benchmark’s allocation pattern immediately reclaims from the same thread’s cache. When a single thread does all the work (block_on_alloc_st) or when many senders share the channel (tokio_mpsc_large), bypassing the tcache forces allocations through the slower arena path and makes things worse.

The catch — pattern dependence

The fix is not universal:

| Pattern | Baseline jemalloc | tcache_max:4096 | Change |
|---|---|---|---|
| mpsc/32 KB (1 sender) | 2.35 ms | 0.89 ms | −62% |
| mpsc/16 KB (1 sender) | 1.79 ms | 0.83 ms | −54% |
| mpsc/1 KB (1 sender) | 209 µs | 205 µs | −2% ✅ |
| mpsc/8 B (1 sender) | 113 µs | 113 µs | 0% ✅ |
| tokio_mpsc_large/5×1000_32 KB (5 senders) | 4.27 ms | 4.70 ms | +10% |
| block_on_alloc_st/1000×32 KB (single-threaded) | 390 µs | 763 µs | +96% |
| spawn_many/10 000 | 6.41 ms | 6.55 ms | ~+2% (within noise) |
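The Change column is a plain relative difference against the untuned jemalloc time. As a small worked sketch (the helper names are hypothetical), the percentages and the corresponding speedup factor can be recomputed like this:

```rust
/// Relative change of `tuned` versus `baseline`, in percent.
/// Negative means the tuned run is faster.
pub fn pct_change(baseline: f64, tuned: f64) -> f64 {
    (tuned - baseline) / baseline * 100.0
}

/// Speedup factor of `tuned` over `baseline` (values above 1.0 mean faster).
pub fn speedup(baseline: f64, tuned: f64) -> f64 {
    baseline / tuned
}
```

For the 32 KB row, `pct_change(2.35, 0.89)` is about −62% and `speedup(2.35, 0.89)` is about 2.6×, so a time reduction of 62% and a 2.6× speedup describe the same measurement.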

tcache_max:4096 is a win for cross-thread, single-pair channel workloads with multi-KB messages, but it hurts single-threaded and multi-sender patterns. The mechanism is workload-specific, not a general jemalloc panacea.

Practical option

If your service is dominated by large MPSC messages and you are locked into jemalloc, _RJEM_MALLOC_CONF=tcache_max:4096 is worth testing. A more conservative starting point is _RJEM_MALLOC_CONF=tcache_max:16384, which fixes 32 KB while keeping 16 KB in cache (16 KB stays at ~1.78 ms).

For most users, snmalloc and mimalloc avoid the cliff without requiring per-workload tuning. tcache_max:4096 has no effect on small-object churn like spawn_many (within noise) and hurts single-threaded large allocations and multi-sender channel contention.

Is this tokio-specific, or universal?

I replicated six patterns drawn directly from tokio’s own benchmark suite to test whether the effect is specific to multi-threaded async, channels, or something else.

| Pattern | Type | std (µs) | jemalloc (µs) | snmalloc (µs) | Key finding |
|---|---|---|---|---|---|
| spawn_many_local/10k | MT spawn | 10 693 | 5 289 | 5 393 | jemalloc wins (2× std) |
| spawn_many_remote_idle/10k | MT spawn | 8 050 | 6 237 | 5 021 | snmalloc wins |
| mpsc_contention/5x1000 | MT channel (usize) | 1 236 | 1 216 | 1 194 | All tied within 4% |
| mpsc_large/5x1000_32KB | MT channel (32 KB) | 4 485 | 4 314 | 3 299 | jemalloc loses; reproduces cliff |
| block_on_alloc_st/1000x32KB | ST alloc only | 402 | 390 | 366 | snmalloc wins; not async-specific |
| spawn_st/1k | ST spawn | 336 | 275 | 244 | snmalloc wins; not MT-specific |

Three conclusions emerge:

  1. usize contention (tokio’s real sync_mpsc bench): allocator barely matters. Small objects short-circuit jemalloc’s MADV_DONTNEED path.
  2. Large Vec<u8> payloads reproduce the gap even with tokio’s own sync_mpsc contention pattern: jemalloc is about 31% slower than snmalloc, although here it roughly ties std. The issue is not our custom harness.
  3. Single-threaded block_on with 32 KB allocs shows the same ordering (snmalloc ahead of jemalloc), though the gap shrinks to about 6–7%. This is not an async-specific or multi-thread-specific pathology. It is an interaction between allocator and allocation size that happens to be exposed by Tokio channels when messages are multi-KB.

Summary

| Workload | Winner | std vs best | Key driver |
|---|---|---|---|
| spawn_many | snmalloc ≈ mimalloc | 1.42× | Small-object fast path |
| mpsc small | snmalloc ≈ mimalloc | 1.22–1.53× | Thread cache, low coordination |
| mpsc large (untuned) | snmalloc ≈ mimalloc | ~1.93× | MADV_FREE avoids fault storm |
| mpsc large (jemalloc, tcache_max:4096) | jemalloc | ~2.6× vs untuned jemalloc | Bypass tcache to avoid madvise reuse cycle |
| block_on_alloc (small) | snmalloc | 1.71× | Batch deallocation |
| block_on_alloc (large) | jemalloc | 1.90× | Huge-page / slab path |
| alloc_stress (small) | jemalloc | 2.45× | Thread-local cache |
| alloc_stress (large) | snmalloc | 1.27× | Global-free-list batching |

Appendix: Does this reproduce on x86-64?

All numbers so far are from an OCI Ampere A1 (ARM64, 4 vCore). To test whether the effect is architecture-specific, I ran the same benchmark suite on an AMD Ryzen 9 9900X (16C/32T, 64 GB DDR5, x86-64, Debian, kernel 6.x, 4 KB pages).

Key results on x86-64

| Workload | std | jemalloc | mimalloc | snmalloc | tcache_max:4096 |
|---|---|---|---|---|---|
| mpsc/16 KB | 387 µs | 728 µs (+88%) | 226 µs | 241 µs | 394 µs (+2%) |
| mpsc/32 KB | 744 µs | 1 004 µs (+35%) | 382 µs | 406 µs | 393 µs (−47%) |
| block_on_alloc/1000×4 KB | 43.6 µs | 21.5 µs (−51%) | 37.9 µs | 20.9 µs (−52%) | 23.2 µs (−47%) |
| spawn_many/10 000 | 2.62 ms | 3.13 ms (+19%) | 3.04 ms | 3.51 ms | 4.48 ms (+71%) |
| tokio_mpsc_large/5×1000_32KB | 902 µs | 1.80 ms (+100%) | 1.74 ms | 1.52 ms | 2.78 ms (+208%) |

Three findings stand out.

1. The jemalloc regression is present on x86-64 too, and at some sizes it is worse. At 16 KB, jemalloc is 1.88× slower than std on x86-64 versus 1.62× on ARM64. At 32 KB the gap narrows to 1.35× versus 1.23× on ARM64. The direction is identical; the magnitude depends on the exact size threshold between slab and extent allocation on each platform.

2. tcache_max:4096 fixes the single-sender large-MPSC case on x86-64 as well. 32 KB drops from 1.0 ms to 393 µs — a 61% improvement, almost exactly the 62% seen on ARM64. The mechanism is the same: bypassing the thread cache avoids the immediate-reuse of MADV_DONTNEED-purged extents.

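3. The pattern-dependence has the same sign. Relative to untuned jemalloc, the same tuning hurts multi-sender contention (tokio_mpsc_large: 1.80 ms → 2.78 ms, +54%, versus +10% on ARM64) and task spawning (spawn_many: 3.13 ms → 4.48 ms, +43%, versus ~+2% on ARM64). The +208% and +71% figures in the table are measured against std, not against untuned jemalloc. The magnitude is platform-specific, but the direction is the same.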

strace confirms the same syscall pattern

strace -f -e memory on the 32 KB MPSC case shows the same mechanism at much lower absolute counts (the x86-64 machine runs more iterations per sample):

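| Allocator | madvise calls | Iterations | MADV_DONTNEED / iter | MADV_FREE / iter |
|---|---|---|---|---|
| jemalloc | 422 | 3 825 | 0.11 | <0.001 |
| std | 48 | 5 100 | 0.009 | |
| mimalloc | 31 | 8 925 | 0.003 | 0 |
| snmalloc | 56 | 7 650 | 0.006 | 0.001 |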

jemalloc issues roughly 12× more MADV_DONTNEED calls per iteration than std on x86-64 (0.11 versus 0.009), and snmalloc again uses MADV_FREE instead. The ratio is lower than the ~130× seen on ARM64, which compared jemalloc with snmalloc rather than std, but the direction is unchanged. The effect is an allocator behavior, not an ARM64 pathology.

Platform differences worth noting

  • spawn_many: std wins on x86-64 (2.62 ms) while snmalloc wins on ARM64 (5.22 ms). glibc’s arena strategy performs better on the larger x86-64 cache hierarchy.
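  • Absolute latencies: x86-64 is roughly 2–3× faster across the board, as expected from a much faster CPU. On the single-sender MPSC cases the relative ranking is preserved (snmalloc ≈ mimalloc < std < jemalloc); spawn_many and tokio_mpsc_large are the exceptions, where std leads on x86-64.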

Take

jemalloc is not “bad on ARM64.” It is appropriately aggressive for memory-tight servers and inappropriately aggressive for allocation-heavy, page-reuse workloads. Its MADV_DONTNEED strategy triggers a demand-zero fault storm when large buffers are repeatedly allocated and freed — the exact pattern produced by multi-KB MPSC messages. The same regression appears on x86-64 (see Appendix), so this is allocator behavior, not architecture-specific.

The practical rule is still: if your Tokio service pushes > 4 KB through channels, test snmalloc or mimalloc first. They avoid the cliff with no configuration.

If you are already committed to jemalloc, _RJEM_MALLOC_CONF=tcache_max:4096 can recover the loss on single-pair channel workloads, but it is a sharp knife: measure your own pattern before deploying it. The only tuning that matters here is controlling the tcache size threshold.

One final, counter-intuitive result: perf stat shows cache-miss rates are higher for snmalloc than jemalloc on the large MPSC case (1.51% vs 0.96%). snmalloc is winning despite worse cache behavior because its page-fault count is near-zero. The bottleneck on this workload is kernel entry/exit, not cache hierarchy.


  1. jemalloc 5.3.0 man page, extent_hooks_t section. Source: jemalloc.xml in the tikv-jemalloc-sys 0.6.1 build.

  2. jemalloc 5.3.0, TUNING.md.