Allocator shootout for async Rust on ARM64

jemalloc regresses large-message MPSC by 62% versus std on ARM64. At 16 KB messages it is 1.62× slower than glibc; at 32 KB the gap narrows to 1.23× but snmalloc and mimalloc are both ~1.93× faster than std.
The root cause is madvise(..., MADV_DONTNEED), not cache misses. jemalloc calls MADV_DONTNEED 228,455 times during the 32 KB benchmark, aggressively returning physical pages to the OS. The next write to any returned page triggers a demand-zero minor page fault. snmalloc uses MADV_FREE (38,796 calls) which lazily defers reclamation — no forced re-faulting.
jemalloc tuning can fix it, but only with the correct env var and only for specific patterns. tikv-jemallocator 0.6.1 is built with JEMALLOC_PREFIX="_rjem_", so the environment variable must be _RJEM_MALLOC_CONF, not MALLOC_CONF. With _RJEM_MALLOC_CONF=tcache_max:4096, jemalloc’s 32 KB MPSC time drops by 62% and actually beats snmalloc on the single-sender workload. However, the same tuning hurts single-threaded allocation (+96%) and multi-sender contention (+10%).
The penalty is allocation-size dependent, not async-pattern dependent. Reproducing tokio’s own sync_mpsc benchmark with 5 senders and 1000 messages: usize payloads put all allocators within 4%, but with 32 KB Vec payloads jemalloc again trails snmalloc, by 1.31× (4 314 µs vs 3 299 µs).
jemalloc wins on small-object churn. spawn_many_local (10,000 task spawns) is 2.02× faster under jemalloc than std. Thread-local slab caches dominate when objects stay under jemalloc’s small-class threshold.
snmalloc and mimalloc remain broadly safer on this machine. They win or tie on 10 of 13 reported cases, avoid the large-message cliff entirely, and require no MALLOC_CONF archaeology.

Method

All ARM64 runs used a single machine to eliminate cross-hardware noise.

Machine: OCI Ampere A1, 4 vCore ARM64 (aarch64), 24 GB RAM
OS: Debian 13 (kernel 6.12.86+deb13-arm64)
Page size: 4 KB
Compiler: rustc 1.95.0, cargo 1.95.0
Libraries: tokio 1.52.3, criterion 0.5.1
Allocators: tikv-jemallocator 0.6.1 (jemalloc 5.3.0), mimalloc 0.1.50 (mimalloc v2.x), snmalloc-rs 0.7.4, glibc 2.41 (std)

Each bench binary sets a different #[global_allocator] and links the same lib.rs routines. Criterion.rs config: 50 samples, 3 s measurement, 1 s warmup. Build profile is Cargo’s built-in bench (opt-level 3, debug assertions off, equivalent to --release).
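
The per-allocator bench binaries are not reproduced here, so the following is only a sketch of the mechanism. It wraps the system allocator in a hypothetical CountingAlloc (a name invented for this example) to show where #[global_allocator] goes. In a binary built with the jemalloc feature the static would instead be a tikv_jemallocator::Jemalloc, with mimalloc::MiMalloc and snmalloc_rs::SnMalloc playing the same role for the other two crates.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the system allocator and counts calls.
pub struct CountingAlloc;

static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Exactly one static per binary may carry this attribute. Swapping the type
// here is all it takes to change the allocator for the whole program.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocation requests seen so far.
pub fn alloc_calls() -> usize {
    ALLOC_CALLS.load(Ordering::Relaxed)
}

/// Total bytes requested so far.
pub fn alloc_bytes() -> usize {
    ALLOC_BYTES.load(Ordering::Relaxed)
}
```

Because the attribute is applied per binary, keeping one bench target per allocator and sharing the workload code through lib.rs keeps the measured routines identical.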

Reproduction:

git clone <repo> allocator-shootout
cd allocator-shootout
cargo bench --bench std_alloc -- --noplot
cargo bench --bench jemalloc --features jemalloc -- --noplot
cargo bench --bench mimalloc --features mimalloc -- --noplot
cargo bench --bench snmalloc --features snmalloc -- --noplot

Raw numbers: bench_results.csv
Perf counters: perf_stat_results.txt
Tuning sweep: jemalloc_tuning_results.txt
Strace traces: jemalloc_strace.txt, snmalloc_strace.txt, mimalloc_strace.txt, std_strace.txt

Results

spawn_many: tasks spawned and awaited

Tasks	std (µs)	jemalloc (µs)	mimalloc (µs)	snmalloc (µs)	snmalloc vs std
1 000	748.0	630.3	594.4	577.4	1.30×
5 000	3 524.7	3 053.8	2 847.8	2 727.7	1.29×
10 000	7 387.5	6 413.5	5 499.0	5 216.4	1.42×

All three alternatives beat std at every task count. snmalloc’s lead is 1.29–1.30× at 1 000 and 5 000 tasks and widens to 1.42× at 10 000. Each task carries a Task box, a join handle, and at least one waker allocation, and on ARM64 these small, frequent allocations are expensive under glibc’s default arena strategy.

mpsc: latency vs message size

Size	std (µs)	jemalloc (µs)	mimalloc (µs)	snmalloc (µs)	Best vs std
8 B	139.5	113.3	112.6	112.5	1.24× snmalloc
64 B	139.3	119.4	114.5	114.2	1.22× snmalloc
256 B	170.7	138.4	115.2	117.3	1.48× mimalloc
512 B	177.5	161.0	127.6	125.0	1.42× snmalloc
1 KB	211.3	203.5	138.4	138.9	1.53× mimalloc
4 KB	264.3	377.1	267.5	212.9	1.24× snmalloc
16 KB	1 077.8	1 741.0	637.7	593.0	1.82× snmalloc
32 KB	1 875.3	2 299.2	984.4	970.7	1.93× snmalloc

Two regimes with very different allocator behavior:

Small messages (≤ 1 KB): in latency, snmalloc ≈ mimalloc ≤ jemalloc < std. The best allocator beats std by 1.2–1.5×.
Large messages (≥ 16 KB): jemalloc falls behind std. At 16 KB it is 1.62× slower than glibc; at 32 KB it is still 1.23× slower. Meanwhile snmalloc and mimalloc are nearly 2× faster than std.

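The size threshold where jemalloc inverts lies between 1 KB and 4 KB: at 1 KB it still edges out std (203.5 µs vs 211.3 µs), at 4 KB it is already 1.43× slower (377.1 µs vs 264.3 µs), and the gap peaks at 16 KB. This is consistent with a transition from allocator slab caches to fresh extent/page allocation.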

block_on_alloc: Vec churn inside block_on

Case	std (µs)	jemalloc (µs)	mimalloc (µs)	snmalloc (µs)	Best vs std
100 × 64 B	2.12	1.48	1.36	1.30	1.63× snmalloc
100 × 4 KB	11.00	5.84	8.83	6.55	1.88× jemalloc
1 000 × 64 B	20.37	13.97	12.72	11.90	1.71× snmalloc
1 000 × 4 KB	109.97	57.78	87.53	64.74	1.90× jemalloc

jemalloc wins both 4 KB cases, probably because its huge-page path and thread-local cache pay off. snmalloc wins both 64 B cases, at low and high counts alike. mimalloc is competitive but never leads.

alloc_stress: random-size Vec churn

Allocs	std (µs)	jemalloc (µs)	mimalloc (µs)	snmalloc (µs)	Best vs std
100	9.37	3.83	5.45	4.47	2.45× jemalloc
1 000	141.6	174.3	134.3	111.4	1.27× snmalloc

jemalloc dominates the 100-alloc case (likely due to excellent thread-local caching). At 1 000 allocs it falls behind even std (174.3 µs vs 141.6 µs), and snmalloc’s global-free-list batching takes the lead.

Root cause: `MADV_DONTNEED` vs `MADV_FREE`

strace -e memory on the 32 KB MPSC case reveals the mechanism. Because strace overhead slows each iteration, Criterion adapts by running fewer iterations; the counts below are per benchmark iteration (single pass of 1 000 sends + 1 000 recvs). Raw captured traces: jemalloc_strace.txt, snmalloc_strace.txt, mimalloc_strace.txt, std_strace.txt.

Allocator	`madvise` / iter	`brk` / iter	Typical advice	Iterations captured
jemalloc	~1,140	~0.3	`MADV_DONTNEED`	200
snmalloc	~9	0	`MADV_FREE`	3 825
mimalloc	<0.02	0	(various)	3 825
std	0	~38	n/a	1 275

jemalloc issues roughly 130× more madvise calls per iteration than snmalloc, and snmalloc’s calls use MADV_FREE rather than MADV_DONTNEED.
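
The per-iteration figures can be derived from a raw trace with a small counter. The sketch below assumes strace's usual line shape (`madvise(addr, len, MADV_…) = 0`, optionally prefixed with `[pid N]` under -f). The type and function names are invented for illustration, and the iteration count has to be supplied from Criterion's output.

```rust
/// Tally of madvise calls by advice flag (hypothetical helper type).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct MadviseCounts {
    pub dontneed: u64,
    pub free: u64,
    pub other: u64,
}

/// Counts madvise calls in strace output, one call per matching line.
pub fn count_madvise(trace: &str) -> MadviseCounts {
    let mut counts = MadviseCounts::default();
    for line in trace.lines() {
        // Under -f each line may start with "[pid N] ", so search instead of
        // matching the line start. "<... madvise resumed>" lines contain no
        // "madvise(" and are therefore not counted a second time.
        let pos = match line.find("madvise(") {
            Some(pos) => pos,
            None => continue,
        };
        let call = &line[pos..];
        if call.contains("MADV_DONTNEED") {
            counts.dontneed += 1;
        } else if call.contains("MADV_FREE") {
            counts.free += 1;
        } else {
            counts.other += 1;
        }
    }
    counts
}

/// Normalises a total by the number of benchmark iterations captured.
pub fn per_iteration(total: u64, iterations: u64) -> f64 {
    total as f64 / iterations as f64
}
```

With the totals quoted earlier, 228,455 MADV_DONTNEED calls over 200 captured iterations works out to roughly 1,140 per iteration, which is the figure in the table.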

MADV_DONTNEED tells the kernel to drop physical pages for the mapped range immediately. The virtual addresses stay valid, but the next write to any page in that range triggers a demand-zero minor page fault — the kernel must allocate a fresh physical page and zero it. On ARM64 with a 4 KB page size and smaller TLBs than x86-64, this faulting is proportionally more expensive.

MADV_FREE tells the kernel the pages may be reclaimed under memory pressure, but they stay valid in RAM until reclaim actually happens. No forced re-faulting on subsequent writes.

jemalloc is aggressively returning memory; snmalloc is deferring. The benchmark repeatedly allocates and frees 32 KB buffers through a bounded channel, which is exactly the workload that keeps re-entering the MADV_DONTNEED / demand-fault cycle.
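
The shape of that workload can be sketched without tokio. The version below is a stand-in, not the benchmark itself: it uses std::sync::mpsc::sync_channel in place of tokio's bounded channel so that it needs no external crates, and the function name and the capacity parameter are assumptions. What it preserves is the pattern that matters: every buffer is allocated on one thread and freed on another.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends `count` freshly allocated buffers of `msg_size` bytes through a
/// bounded channel and returns the number of bytes the receiver saw.
pub fn mpsc_churn(msg_size: usize, count: usize, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    let producer = thread::spawn(move || {
        for i in 0..count {
            // Each message is a new heap allocation made on the sender thread.
            let buf = vec![i as u8; msg_size];
            tx.send(buf).expect("receiver hung up");
        }
        // `tx` is dropped here, which ends the receiver loop below.
    });

    let mut total = 0;
    for buf in rx {
        total += buf.len();
        // `buf` is dropped at the end of each loop pass, so the free happens
        // on the receiver thread, not the thread that allocated it.
    }

    producer.join().expect("producer panicked");
    total
}
```

Calling mpsc_churn(32 * 1024, 1000, 64) mirrors one pass of 1 000 sends and 1 000 receives at the 32 KB size; the capacity of 64 is arbitrary.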

perf stat -e page-faults on the 32 KB MPSC case confirms the per-iteration rate. Criterion’s adaptive iteration count means totals vary; rates are reproducible.

Allocator	Page faults / iter	Sys % of elapsed
jemalloc	~1,580	~73%
std	~1,400	~47%
mimalloc	<0.1	<1%
snmalloc	<0.1	~4%

jemalloc spends most of elapsed time in the kernel servicing those faults. This is not a cache-miss problem — cache miss rates are comparable across allocators. It is a syscall-level page-fault storm.
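
A rough in-process cross-check is possible on Linux by reading the process's own minor-fault counter. This is a sketch under stated assumptions: it relies on the /proc/self/stat layout documented in proc(5), where minflt is field 10, and the helper names are hypothetical. Unlike perf stat -e page-faults, it counts minor faults only.

```rust
use std::fs;

/// Extracts `minflt` (field 10 of /proc/<pid>/stat) from one stat line.
pub fn parse_minflt(stat: &str) -> Option<u64> {
    // The command name (field 2) may contain spaces or parentheses, so split
    // after the last ')' rather than on whitespace from the start.
    let rest = &stat[stat.rfind(')')? + 1..];
    // `rest` begins at field 3 (state), so field 10 is at index 7.
    rest.split_whitespace().nth(7)?.parse().ok()
}

/// Minor page faults taken so far by the current process (Linux only).
pub fn minor_faults() -> Option<u64> {
    parse_minflt(&fs::read_to_string("/proc/self/stat").ok()?)
}

/// Runs `work` and returns how many minor faults occurred meanwhile.
/// Returns None where /proc/self/stat is unavailable or unparsable.
pub fn faults_during<F: FnOnce()>(work: F) -> Option<u64> {
    let before = minor_faults()?;
    work();
    Some(minor_faults()?.saturating_sub(before))
}
```

Wrapping a single benchmark pass in faults_during gives a per-iteration figure comparable in spirit to the table above, though transparent huge pages and allocator warm-up can shift the absolute numbers.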

Is this a jemalloc bug?

No — it is documented behavior. The jemalloc man page explicitly describes the two purge paths:

“A lazy extent purge function (e.g. implemented via madvise(...,MADV_FREE)) can delay purging indefinitely and leave the pages within the purged virtual memory range in an indeterminate state, whereas a forced extent purge function immediately purges, and the pages within the virtual memory range will be zero-filled the next time they are accessed.”¹

The source code (jemalloc 5.3.0) shows the cascade in extent_dalloc_wrapper() (src/extent.c). On Linux with default retain:true, the flow is:

1. ehooks_dalloc_will_fail() returns true because opt_retain is enabled, so munmap is skipped entirely.
2. extent_decommit_wrapper() calls pages_decommit() → pages_commit_impl(), which returns true (failure) immediately because os_overcommits is true on Linux. Decommit is a no-op on overcommitting systems.
3. Execution falls through to the forced purge: pages_purge_forced() → madvise(..., MADV_DONTNEED).
4. The next access to those pages is a demand-zero fault.

The 228,455 MADV_DONTNEED calls are not a fallback after failed syscalls — they are the only allocator action that actually runs on this path.
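
The cascade described above can be condensed into a toy decision function. This is a model of the described flow, not jemalloc code: the enum and function are invented for illustration, it assumes the default extent hooks, and it ignores the lazy-purge branch that sits after the forced purge.

```rust
/// What ultimately happens to a deallocated extent in this simplified model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DallocOutcome {
    /// The mapping is returned with munmap.
    Munmap,
    /// The pages are decommitted.
    Decommit,
    /// madvise(..., MADV_DONTNEED): the next access is a demand-zero fault.
    ForcedPurge,
}

/// Models the fall-through order: unmap, then decommit, then forced purge.
pub fn dalloc_outcome(retain: bool, os_overcommits: bool) -> DallocOutcome {
    if !retain {
        // Step 1 only succeeds when retain is off.
        return DallocOutcome::Munmap;
    }
    if !os_overcommits {
        // Step 2 only does real work on systems that do not overcommit.
        return DallocOutcome::Decommit;
    }
    // Step 3: the only action left on Linux with default settings.
    DallocOutcome::ForcedPurge
}
```

Under the Linux defaults named in the text (retain enabled, overcommit on), the model lands on the forced purge without any earlier step doing work.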

jemalloc’s own TUNING.md frames this as a CPU-versus-memory trade-off:

“Decay time determines how fast jemalloc returns unused pages back to the operating system, and therefore provides a fairly straightforward trade-off between CPU and memory usage. Shorter decay time purges unused pages faster to reduces memory usage (usually at the cost of more CPU cycles spent on purging).”²

The 62% regression is simply the “more CPU cycles” side of that trade-off, amplified by ARM64’s stricter TLB behavior and the benchmark’s tight allocation loop. On a memory-constrained server the same behavior would be the correct choice; on a 24 GB machine running allocation-heavy async workloads it is not.

Could jemalloc be tuned out of this?

A common first attempt is to set MALLOC_CONF. On tikv-jemallocator 0.6.1 that variable is silently ignored — the build uses JEMALLOC_PREFIX="_rjem_", so the correct name is _RJEM_MALLOC_CONF. Any testing with MALLOC_CONF will measure the default configuration regardless of what string you pass.
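
One way to avoid measuring the wrong configuration is a startup guard that refuses to run when only the unprefixed variable is set. The sketch below is an illustration, not part of tikv-jemallocator: the function names are hypothetical, and it only inspects the environment, it does not configure jemalloc.

```rust
use std::env;

pub const PREFIXED_VAR: &str = "_RJEM_MALLOC_CONF";
pub const UNPREFIXED_VAR: &str = "MALLOC_CONF";

/// Returns the configuration string the prefixed build will actually read,
/// or an error when only the ignored, unprefixed variable is present.
pub fn resolve_conf(
    prefixed: Option<&str>,
    unprefixed: Option<&str>,
) -> Result<Option<String>, String> {
    match (prefixed, unprefixed) {
        // The prefixed variable wins whenever it is set.
        (Some(conf), _) => Ok(Some(conf.to_string())),
        // Only the unprefixed one is set: the tuning would be silently lost.
        (None, Some(ignored)) => Err(format!(
            "{}={} is set but will be ignored; use {} instead",
            UNPREFIXED_VAR, ignored, PREFIXED_VAR
        )),
        // Neither is set: the benchmark runs with defaults.
        (None, None) => Ok(None),
    }
}

/// Convenience wrapper that reads both variables from the process environment.
pub fn resolve_conf_from_env() -> Result<Option<String>, String> {
    let prefixed = env::var(PREFIXED_VAR).ok();
    let unprefixed = env::var(UNPREFIXED_VAR).ok();
    resolve_conf(prefixed.as_deref(), unprefixed.as_deref())
}
```

Calling resolve_conf_from_env() at the top of a bench binary and aborting on Err turns a silently ignored setting into a visible failure.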

Retested with _RJEM_MALLOC_CONF on the 32 KB MPSC case:

Tune	Page faults	Per-iteration time
Baseline	~2 019 635	2.35 ms
`retain:false`	~2 018 974	2.36 ms
`dirty_decay_ms:0,muzzy_decay_ms:0`	~2 032 091	2.33 ms
`narenas:1`	~2 031 085	2.33 ms
`tcache:false`	949	0.86 ms
`tcache_max:4096`	~1 023	0.89 ms

What changed

tcache:false and tcache_max:4096 do fix the regression, by 2.6–2.7× on this benchmark. The page-fault count drops from ~1,580 faults/iteration to near-zero. The other knobs (retain, decay, arenas) make no measurable difference here.

Why does limiting the tcache help? With default tcache_max, large deallocated chunks sit in the thread cache. When the cache fills, jemalloc purges them with madvise(..., MADV_DONTNEED). The next allocation from the same thread immediately reclaims from the tcache and touches those zeroed pages → demand-zero fault. The strace data shows this clearly: baseline jemalloc triggers ~1,140 madvise calls per iteration, each followed by a fault on reuse.

With tcache_max:4096, allocations larger than 4 KB bypass the tcache and flow through the arena’s extent management. Notably, the madvise count does not drop (it rises slightly, to ~1,300 calls/iteration), but those purged extents are no longer on the immediate-reuse hot path. The next allocation tends to come from a different, non-purged extent, so demand-zero faults collapse from ~1,580/iteration to near-zero.

This is why the fix is so pattern-dependent: it only helps when the benchmark’s allocation pattern immediately reclaims from the same thread’s cache. When a single thread does all the work (block_on_alloc_st) or when many senders share the channel (tokio_mpsc_large), bypassing the tcache forces allocations through the slower arena path and makes things worse.

The catch — pattern dependence

The fix is not universal:

Pattern	Baseline jemalloc	`tcache_max:4096`	Change
mpsc/32 KB (1 sender)	2.35 ms	0.89 ms	−62% ✅
mpsc/16 KB (1 sender)	1.79 ms	0.83 ms	−54% ✅
mpsc/1 KB (1 sender)	209 µs	205 µs	−2% ✅
mpsc/8 B (1 sender)	113 µs	113 µs	0% ✅
tokio_mpsc_large/5×1000_32 KB (5 senders)	4.27 ms	4.70 ms	+10% ❌
block_on_alloc_st/1000×32 KB (single-threaded)	390 µs	763 µs	+96% ❌
spawn_many/10 000	6.41 ms	6.55 ms	~+2% (within noise)

tcache_max:4096 is a win for cross-thread, single-pair channel workloads with multi-KB messages, but it hurts single-threaded and multi-sender patterns. The mechanism is workload-specific, not a general jemalloc panacea.
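
The Change column is a plain relative difference against untuned jemalloc. A minimal helper (the name is hypothetical) makes the baseline explicit, which matters because the same tuned time looks very different when measured against std instead:

```rust
/// Relative change of `tuned` against `baseline`, in percent.
/// Negative values mean the tuned run is faster.
pub fn percent_change(baseline: f64, tuned: f64) -> f64 {
    (tuned - baseline) / baseline * 100.0
}

/// Speedup factor of `tuned` over `baseline` (values above 1.0 are faster).
pub fn speedup(baseline: f64, tuned: f64) -> f64 {
    baseline / tuned
}
```

For the first row, percent_change(2.35, 0.89) is about −62 and speedup(2.35, 0.89) is about 2.6; both arguments must be in the same unit.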

Practical option

If your service is dominated by large MPSC messages and you are locked into jemalloc, _RJEM_MALLOC_CONF=tcache_max:4096 is worth testing. A more conservative starting point is _RJEM_MALLOC_CONF=tcache_max:16384, which fixes the 32 KB case while keeping 16 KB allocations in the tcache; the trade-off is that the 16 KB case stays at its regressed ~1.78 ms.
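
Which message sizes a given setting affects can be reasoned about with a one-line predicate. This is a simplified model, not jemalloc's logic: it assumes the threshold is inclusive (jemalloc's documentation describes tcache_max as the maximum size class to cache), it ignores size-class rounding, and the function names are invented.

```rust
/// True when an allocation of `alloc_size` bytes would still be served by
/// the thread cache under the given `tcache_max` (inclusive threshold).
pub fn served_by_tcache(alloc_size: usize, tcache_max: usize) -> bool {
    alloc_size <= tcache_max
}

/// Returns the sizes from `sizes` that bypass the thread cache.
pub fn bypassing_sizes(sizes: &[usize], tcache_max: usize) -> Vec<usize> {
    sizes
        .iter()
        .copied()
        .filter(|&size| !served_by_tcache(size, tcache_max))
        .collect()
}
```

For the benchmark sizes 1 KB, 4 KB, 16 KB and 32 KB, a threshold of 4096 sends the 16 KB and 32 KB messages around the tcache, while 16384 reroutes only the 32 KB ones, which matches the behaviour described for the two settings.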

For most users, snmalloc and mimalloc avoid the cliff without requiring per-workload tuning. tcache_max:4096 has no effect on small-object churn like spawn_many (within noise) and hurts single-threaded large allocations and multi-sender channel contention.

Is this tokio-specific, or universal?

I replicated six patterns drawn directly from tokio’s own benchmark suite to test whether the effect is specific to multi-threaded async, channels, or something else.

Pattern	Type	std (µs)	jemalloc (µs)	snmalloc (µs)	Key finding
`spawn_many_local/10k`	MT spawn	10 693	5 289	5 393	jemalloc wins (2× std)
`spawn_many_remote_idle/10k`	MT spawn	8 050	6 237	5 021	snmalloc wins
`mpsc_contention/5x1000`	MT channel (`usize`)	1 236	1 216	1 194	All tied within 4%
`mpsc_large/5x1000_32KB`	MT channel (32 KB)	4 485	4 314	3 299	jemalloc loses — reproduces cliff
`block_on_alloc_st/1000x32KB`	ST alloc only	402	390	366	snmalloc wins; not async-specific
`spawn_st/1k`	ST spawn	336	275	244	snmalloc wins; not MT-specific

Three conclusions emerge:

usize contention (tokio’s real sync_mpsc bench): allocator barely matters. Small objects short-circuit jemalloc’s MADV_DONTNEED path.
Large Vec<u8> payloads reproduce the gap even with tokio’s own sync_mpsc contention pattern: jemalloc is 1.31× slower than snmalloc (4 314 µs vs 3 299 µs), although it roughly ties std here. The issue is not our custom harness.
Single-threaded block_on with 32 KB allocs shows the same ordering at a much smaller margin (jemalloc 390 µs vs snmalloc 366 µs). This is not an async-specific or multi-thread-specific pathology. It is an interaction between allocator and allocation size that happens to be exposed by Tokio channels when messages are multi-KB.

Summary

Workload	Winner	std vs best	Key driver
spawn_many	snmalloc ≈ mimalloc	1.42×	Small-object fast path
mpsc small	snmalloc ≈ mimalloc	1.22–1.53×	Thread cache, low coordination
mpsc large (untuned)	snmalloc ≈ mimalloc	~1.93×	`MADV_FREE` avoids fault storm
mpsc large (jemalloc, `tcache_max:4096`)	jemalloc	~2.6× vs untuned jemalloc	Bypass tcache to avoid madvise reuse cycle
block_on_alloc (small)	snmalloc	1.71×	Batch deallocation
block_on_alloc (large)	jemalloc	1.90×	Huge-page / slab path
alloc_stress (100 allocs)	jemalloc	2.45×	Thread-local cache
alloc_stress (1 000 allocs)	snmalloc	1.27×	Global-free-list batching

Appendix: Does this reproduce on x86-64?

All numbers so far are from an OCI Ampere A1 (ARM64, 4 vCore). To test whether the effect is architecture-specific, I ran the same benchmark suite on an AMD Ryzen 9 9900X (16C/32T, 64 GB DDR5, x86-64, Debian, kernel 6.x, 4 KB pages).

Key results on x86-64

Workload	std	jemalloc	mimalloc	snmalloc	`tcache_max:4096`
mpsc/16 KB	387 µs	728 µs (+88%)	226 µs	241 µs	394 µs (+2%)
mpsc/32 KB	744 µs	1 004 µs (+35%)	382 µs	406 µs	393 µs (−47%)
block_on_alloc/1000×4 KB	43.6 µs	21.5 µs (−51%)	37.9 µs	20.9 µs (−52%)	23.2 µs (−47%)
spawn_many/10 000	2.62 ms	3.13 ms (+19%)	3.04 ms	3.51 ms	4.48 ms (+71%)
tokio_mpsc_large/5×1000_32KB	902 µs	1.80 ms (+100%)	1.74 ms	1.52 ms	2.78 ms (+208%)

Percentages in parentheses are relative to std on the same row. Three findings stand out.

1. The jemalloc regression is present on x86-64 too, and at some sizes it is worse. At 16 KB, jemalloc is 1.88× slower than std on x86-64 versus 1.62× on ARM64. At 32 KB the gap narrows to 1.35× versus 1.23× on ARM64. The direction is identical; the magnitude depends on the exact size threshold between slab and extent allocation on each platform.

2. tcache_max:4096 fixes the single-sender large-MPSC case on x86-64 as well. 32 KB drops from 1.0 ms to 393 µs — a 61% improvement, almost exactly the 62% seen on ARM64. The mechanism is the same: bypassing the thread cache avoids the immediate-reuse of MADV_DONTNEED-purged extents.

3. The pattern-dependence has the same sign. Measured against untuned jemalloc, the same tuning hurts multi-sender contention (tokio_mpsc_large: 1.80 ms → 2.78 ms, +54%, versus +10% on ARM64) and task spawning (spawn_many: 3.13 ms → 4.48 ms, +43%, versus ~+2% on ARM64). The +208% and +71% shown in the table are relative to std. The magnitude is platform-specific, but the sign is the same.

strace confirms the same syscall pattern

strace -f -e memory on the 32 KB MPSC case shows the same mechanism at much lower absolute counts (the x86-64 machine runs more iterations per sample):

Allocator	`madvise` calls	Iterations	MADV_DONTNEED / iter	MADV_FREE / iter
jemalloc	422	3 825	0.11	<0.001
std	48	5 100	0.009	—
mimalloc	31	8 925	0.003	0
snmalloc	56	7 650	0.006	0.001

jemalloc issues roughly 12× more MADV_DONTNEED calls per iteration than std on x86-64. snmalloc’s few calls are mostly MADV_DONTNEED here (0.006 vs 0.001 MADV_FREE per iteration), so jemalloc’s lead over snmalloc in forced purges is roughly 18×, down from ~130× in total madvise calls on ARM64. The direction is unchanged. The effect is an allocator behavior, not an ARM64 pathology.

Platform differences worth noting

spawn_many: std wins on x86-64 (2.62 ms) while snmalloc wins on ARM64 (5.22 ms). glibc’s arena strategy performs better on the larger x86-64 cache hierarchy.
Absolute latencies: x86-64 is roughly 2–3× faster on most cases, as expected from a much faster CPU. The relative allocator rankings on the single-sender MPSC cases are preserved; spawn_many and tokio_mpsc_large are the exceptions, where std leads on x86-64.

Take

jemalloc is not “bad on ARM64.” It is appropriately aggressive for memory-tight servers and inappropriately aggressive for allocation-heavy, page-reuse workloads. Its MADV_DONTNEED strategy triggers a demand-zero fault storm when large buffers are repeatedly allocated and freed — the exact pattern produced by multi-KB MPSC messages. The same regression appears on x86-64 (see Appendix), so this is allocator behavior, not architecture-specific.

The practical rule is still: if your Tokio service pushes messages of 4 KB or more through channels, test snmalloc or mimalloc first. They avoid the cliff with no configuration.

If you are already committed to jemalloc, _RJEM_MALLOC_CONF=tcache_max:4096 can recover the loss on single-pair channel workloads, but it is a sharp knife: measure your own pattern before deploying it. The only tuning that matters here is controlling the tcache size threshold.

One final, counter-intuitive result: perf stat shows cache-miss rates are higher for snmalloc than jemalloc on the large MPSC case (1.51% vs 0.96%). snmalloc is winning despite worse cache behavior because its page-fault count is near-zero. The bottleneck on this workload is kernel entry/exit, not cache hierarchy.

¹ jemalloc 5.3.0 man page, extent_hooks_t section. Source: jemalloc.xml in the tikv-jemalloc-sys 0.6.1 build.
² jemalloc 5.3.0, TUNING.md.

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.
