Allocator shootout for async Rust on ARM64

I ran an allocator shootout for async Rust on a 4-core ARM64 server. I wanted to know which allocator handles Tokio tasks and channels best. The answer depends heavily on allocation size.

jemalloc regresses large-message MPSC by 62% versus std on ARM64. At 16 KB messages it is 1.62× slower than glibc; at 32 KB the gap narrows to 1.23× but snmalloc and mimalloc are both ~1.93× faster than std.

The root cause is madvise(..., MADV_DONTNEED), not cache misses. jemalloc calls MADV_DONTNEED 228,455 times during the 32 KB benchmark. It aggressively returns physical pages to the OS. The next write triggers a demand-zero minor page fault. snmalloc uses MADV_FREE (38,796 calls) which lazily defers reclamation. No forced re-faulting happens there.

jemalloc tuning can fix it, but only with the correct env var and only for specific patterns. tikv-jemallocator 0.6.1 is built with JEMALLOC_PREFIX="_rjem_", so the environment variable must be _RJEM_MALLOC_CONF, not MALLOC_CONF. With _RJEM_MALLOC_CONF=tcache_max:4096, jemalloc’s 32 KB MPSC time drops by 62% and actually beats snmalloc on the single-sender workload. However, the same tuning hurts single-threaded allocation (+96%) and multi-sender contention (+10%).

The penalty is allocation-size dependent, not async-pattern dependent. Reproducing tokio’s own sync_mpsc benchmark with 5 senders, 1000 messages: usize payloads show all allocators within 4%, but 32 KB Vec payloads reproduce the exact same jemalloc regression.

jemalloc wins on small-object churn. spawn_many_local (10,000 task spawns) is 2.02× faster under jemalloc than std. Thread-local slab caches dominate when objects stay under jemalloc’s small-class threshold.

snmalloc and mimalloc remain broadly safer on this machine. They win or tie on 10 of 13 reported cases. They avoid the large-message cliff entirely. No MALLOC_CONF archaeology is required.

Method

All runs were made on a single machine to eliminate cross-hardware noise.

Machine: OCI Ampere A1, 4 vCore ARM64 (aarch64), 24 GB RAM
OS: Debian 13 (kernel 6.12.86+deb13-arm64)
Page size: 4 KB
Compiler: rustc 1.95.0, cargo 1.95.0
Libraries: tokio 1.52.3, criterion 0.5.1
Allocators: tikv-jemallocator 0.6.1 (jemalloc 5.3.0), mimalloc 0.1.50 (mimalloc v2.x), snmalloc-rs 0.7.4, glibc 2.41 (std)

I built one benchmark binary per allocator. Each binary sets a different #[global_allocator] and links the same lib.rs routines. I used Criterion.rs to collect timings, with 50 samples, a 3 s measurement window, and a 1 s warmup. The build profile is Cargo’s built-in bench profile, which inherits from release: opt-level 3 with debug assertions off.
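As a minimal sketch of the per-binary switch, the block below shows how a single `#[global_allocator]` static selects the allocator for a whole binary. `CountingAlloc`, `ALLOC_CALLS`, and `count_allocs` are hypothetical names introduced for illustration; the wrapper only forwards to the system allocator so that the sketch runs without extra crates. It is not the benchmark repository's code.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the system allocator and counts calls.
pub struct CountingAlloc;

/// Number of `alloc` calls seen so far (hypothetical counter).
pub static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// One static per binary decides where every Box, Vec, and String comes from.
// A jemalloc binary would name the crate's allocator type here instead, e.g.
// `static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;`
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Allocates `n` buffers of `size` bytes and returns how many allocator
/// calls were observed while doing so.
pub fn count_allocs(n: usize, size: usize) -> usize {
    let before = ALLOC_CALLS.load(Ordering::Relaxed);
    let bufs: Vec<Vec<u8>> = (0..n).map(|_| vec![1u8; size]).collect();
    let after = ALLOC_CALLS.load(Ordering::Relaxed);
    drop(bufs);
    after - before
}
```

Because the static is resolved at link time, the shared routines never need to know which allocator they run on.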

Reproduction:

git clone <repo> allocator-shootout
cd allocator-shootout
cargo bench --bench std_alloc -- --noplot
cargo bench --bench jemalloc --features jemalloc -- --noplot
cargo bench --bench mimalloc --features mimalloc -- --noplot
cargo bench --bench snmalloc --features snmalloc -- --noplot

Raw numbers: bench_results.csv
Perf counters: perf_stat_results.txt
Tuning sweep: jemalloc_tuning_results.txt
Strace traces: jemalloc_strace.txt, snmalloc_strace.txt, mimalloc_strace.txt, std_strace.txt

Results

spawn_many: tasks spawned and awaited

Tasks	std (µs)	jemalloc (µs)	mimalloc (µs)	snmalloc (µs)	snmalloc vs std
1 000	748.0	630.3	594.4	577.4	1.30×
5 000	3 524.7	3 053.8	2 847.8	2 727.7	1.29×
10 000	7 387.5	6 413.5	5 499.0	5 216.4	1.42×

All three alternatives beat std. The win grows with task count. Each task carries a Task box, a join handle, and at least one waker allocation. On ARM64 these small frequent allocations are expensive under glibc’s default arena strategy.

mpsc: latency vs message size

Size	std (µs)	jemalloc (µs)	mimalloc (µs)	snmalloc (µs)	Best vs std
8 B	139.5	113.3	112.6	112.5	1.24× snmalloc
64 B	139.3	119.4	114.5	114.2	1.22× snmalloc
256 B	170.7	138.4	115.2	117.3	1.48× mimalloc
512 B	177.5	161.0	127.6	125.0	1.42× snmalloc
1 KB	211.3	203.5	138.4	138.9	1.53× mimalloc
4 KB	264.3	377.1	267.5	212.9	1.24× snmalloc
16 KB	1 077.8	1 741.0	637.7	593.0	1.82× snmalloc
32 KB	1 875.3	2 299.2	984.4	970.7	1.93× snmalloc

Two regimes appear here. For small messages the latency ordering is snmalloc ≈ mimalloc ≤ jemalloc < std, and the best allocator is only 1.2–1.5× faster than std. Large messages flip the story. jemalloc becomes slower than std: at 16 KB it is 1.62× slower than glibc, and at 32 KB it is still 1.23× slower. Meanwhile snmalloc and mimalloc are nearly 2× faster than std at 32 KB.

jemalloc first falls behind std at 4 KB (377.1 µs vs 264.3 µs, 1.43× slower), and the gap widens sharply between 4 KB and 16 KB. That range is the boundary between allocator slab caches and fresh extent allocation.
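The ratios quoted above are plain quotients of the table's mean latencies. A small sketch, with hypothetical helper names `speedup` and `round2`, makes the arithmetic explicit:

```rust
/// Speedup of `candidate` over `baseline`; values above 1.0 mean the
/// candidate is faster. Swap the arguments to get a slowdown factor.
pub fn speedup(baseline_us: f64, candidate_us: f64) -> f64 {
    baseline_us / candidate_us
}

/// Rounds to two decimals, the precision used in the tables.
pub fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

With the 32 KB row, `round2(speedup(1875.3, 970.7))` gives the 1.93× for snmalloc, and `round2(speedup(2299.2, 1875.3))` gives the 1.23× slowdown for jemalloc. The 16 KB row gives `round2(speedup(1741.0, 1077.8))`, the 1.62× figure.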

block_on_alloc: Vec churn inside block_on

Case	std (µs)	jemalloc (µs)	mimalloc (µs)	snmalloc (µs)	Best vs std
100 × 64 B	2.12	1.48	1.36	1.30	1.63× snmalloc
100 × 4 KB	11.00	5.84	8.83	6.55	1.88× jemalloc
1 000 × 64 B	20.37	13.97	12.72	11.90	1.71× snmalloc
1 000 × 4 KB	109.97	57.78	87.53	64.74	1.90× jemalloc

jemalloc wins on large individual allocations. Those 4 KB blobs hit its huge-page path and thread-local cache. snmalloc wins on high-count small batches. mimalloc is competitive but never leads.

alloc_stress: random-size Vec churn

Allocs	std (µs)	jemalloc (µs)	mimalloc (µs)	snmalloc (µs)	Best vs std
100	9.37	3.83	5.45	4.47	2.45× jemalloc
1 000	141.6	174.3	134.3	111.4	1.27× snmalloc

jemalloc dominates the small-batch case. Its thread-local caching is excellent there. At 1 000 allocs snmalloc’s global-free-list batching takes over and it leads.

Root cause: `MADV_DONTNEED` vs `MADV_FREE`

I traced memory syscalls with strace -e memory on the 32 KB MPSC case. strace overhead slows each iteration, so Criterion adapts by running fewer iterations. The counts below are per benchmark iteration. That is one pass of 1 000 sends plus 1 000 recvs. Raw captured traces: jemalloc_strace.txt, snmalloc_strace.txt, mimalloc_strace.txt, std_strace.txt

Allocator	`madvise` / iter	`brk` / iter	Typical advice	Iterations captured
jemalloc	~1,140	~0.3	`MADV_DONTNEED`	200
snmalloc	~9	0	`MADV_FREE`	3 825
mimalloc	<0.02	0	(various)	3 825
std	0	~38	n/a	1 275

jemalloc issues roughly 130× more madvise calls per iteration than snmalloc. snmalloc’s calls use MADV_FREE rather than MADV_DONTNEED.
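Because Criterion adapts the iteration count, whole-run totals only become comparable once divided by the iterations captured. The sketch below shows that normalization; `per_iter` and `ratio` are hypothetical helper names.

```rust
/// Average number of events per benchmark iteration, from a whole-run total.
pub fn per_iter(total_events: u64, iterations: u64) -> f64 {
    total_events as f64 / iterations as f64
}

/// How many times larger `a` is than `b`.
pub fn ratio(a: f64, b: f64) -> f64 {
    a / b
}
```

For jemalloc, `per_iter(228_455, 200)` is about 1,142, which is the ~1,140 in the table. Dividing the two per-iteration figures, `ratio(1140.0, 9.0)`, gives about 127, the "roughly 130×" quoted above.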

MADV_DONTNEED tells the kernel to drop physical pages immediately. The virtual addresses stay valid. The next write triggers a demand-zero minor page fault. The kernel must allocate a fresh physical page and zero it. On ARM64 with 4 KB pages and smaller TLBs than x86-64, this faulting is proportionally more expensive.

MADV_FREE tells the kernel the pages may be reclaimed later. They stay valid in RAM until reclaim actually happens. No forced re-faulting occurs on subsequent writes.
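The zero-fill behavior of MADV_DONTNEED can be observed directly. The following is a minimal Linux-only sketch, not part of the benchmark suite: it maps one anonymous page, dirties it, applies MADV_DONTNEED, and reads it back. The numeric constants are copied by hand for x86-64 and aarch64 Linux so that no `libc` crate is needed; treat them as assumptions on any other target. `dontneed_zeroes_page` is a hypothetical name.

```rust
use std::ffi::{c_int, c_void};

// Hand-copied Linux values (x86-64 and aarch64); assumptions elsewhere.
const PROT_READ: c_int = 1;
const PROT_WRITE: c_int = 2;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;
const MADV_DONTNEED: c_int = 4;

unsafe extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: c_int, flags: c_int,
            fd: c_int, offset: i64) -> *mut c_void;
    fn madvise(addr: *mut c_void, len: usize, advice: c_int) -> c_int;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

/// Returns Some(true) if a dirtied anonymous page reads back as zero after
/// MADV_DONTNEED, and None if the mapping or the madvise call failed.
pub fn dontneed_zeroes_page() -> Option<bool> {
    const LEN: usize = 4096;
    unsafe {
        let page = mmap(std::ptr::null_mut(), LEN, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if page as isize == -1 {
            return None; // MAP_FAILED
        }
        let first_byte = page as *mut u8;
        // First touch: demand-zero fault, after which the page is dirty.
        std::ptr::write_volatile(first_byte, 0xAB);
        if madvise(page, LEN, MADV_DONTNEED) != 0 {
            munmap(page, LEN);
            return None;
        }
        // Touching the range again faults in a fresh zero page.
        let after = std::ptr::read_volatile(first_byte);
        munmap(page, LEN);
        Some(after == 0)
    }
}
```

On Linux the function should return `Some(true)`: the old contents are gone and the second access pays for a new zeroed page, which is the cost the fault counts measure.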

jemalloc is aggressively returning memory. snmalloc is deferring. The benchmark repeatedly allocates and frees 32 KB buffers through a bounded channel. That workload hits the MADV_DONTNEED / demand-fault cycle on every pass.
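The allocation pattern itself is easy to reproduce outside Tokio. The sketch below is a std-only analog that uses `std::sync::mpsc::sync_channel` and one producer thread; it is not the Tokio harness used for the measurements. The function name `churn` and the channel capacity are assumptions, since the capacity used in the benchmark is not stated here.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends `messages` freshly allocated buffers of `size` bytes through a
/// bounded channel and returns the number of bytes the receiver saw.
pub fn churn(messages: usize, size: usize, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    let producer = thread::spawn(move || {
        for i in 0..messages {
            // Allocate on the sender thread. The fill byte is never zero,
            // so every byte of the buffer is actually written.
            let buf = vec![(i % 251) as u8 + 1; size];
            if tx.send(buf).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the receiver loop below.
    });

    let mut total = 0;
    for buf in rx {
        total += buf.len();
        // `buf` is freed here, on the receiver thread.
    }
    producer.join().expect("producer thread panicked");
    total
}
```

Each buffer is allocated on one thread and freed on another, so with 32 KB payloads the allocator sees a steady stream of large cross-thread frees, which is the pattern described above.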

I confirmed the per-iteration fault rate with perf stat -e page-faults. Criterion’s adaptive iteration count means totals vary. Rates are reproducible.

Allocator	Page faults / iter	Sys % of elapsed
jemalloc	~1,580	~73%
std	~1,400	~47%
mimalloc	<0.1	<1%
snmalloc	<0.1	~4%

jemalloc spends most of elapsed time in the kernel servicing those faults. This is not a cache-miss problem. Cache miss rates are comparable across allocators. It is a syscall-level page-fault storm.
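For a quick check without perf, minor faults can also be read from `/proc/self/stat` on Linux. This is an alternative sketch, not the method used for the table above; `parse_minflt` and `minor_faults` are hypothetical helpers, and the field position follows the proc(5) layout in which `minflt` is the tenth field.

```rust
/// Extracts `minflt` (minor page faults) from the text of /proc/<pid>/stat.
pub fn parse_minflt(stat: &str) -> Option<u64> {
    // The command name is parenthesised and may contain spaces or ')',
    // so parsing starts after the last ')'.
    let rest = &stat[stat.rfind(')')? + 1..];
    // Fields after comm: state ppid pgrp session tty_nr tpgid flags minflt
    rest.split_whitespace().nth(7)?.parse().ok()
}

/// Minor faults taken so far by the current process.
/// Returns None where /proc/self/stat is unavailable.
pub fn minor_faults() -> Option<u64> {
    let stat = std::fs::read_to_string("/proc/self/stat").ok()?;
    parse_minflt(&stat)
}
```

Sampling `minor_faults()` before and after a fixed number of iterations and dividing the difference by that number gives a per-iteration rate comparable to the figures above. The counter covers the whole process, so it includes runtime and harness faults.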

Is this a jemalloc bug?

No. It is documented behavior. The jemalloc man page explicitly describes the two purge paths:

“A lazy extent purge function (e.g. implemented via madvise(...,MADV_FREE)) can delay purging indefinitely and leave the pages within the purged virtual memory range in an indeterminate state, whereas a forced extent purge function immediately purges, and the pages within the virtual memory range will be zero-filled the next time they are accessed.”¹

I followed the source code in jemalloc 5.3.0. The cascade lives in extent_dalloc_wrapper() inside src/extent.c. On Linux with default retain:true, the flow works like this:

1. ehooks_dalloc_will_fail() returns true because opt_retain is enabled. munmap is skipped entirely.
2. extent_decommit_wrapper() calls pages_decommit() → pages_commit_impl(). That returns true immediately because os_overcommits is true on Linux. Decommit is a no-op on overcommitting systems.
3. Fall through to forced purge → pages_purge_forced() → madvise(..., MADV_DONTNEED).
4. Next access to those pages = demand-zero fault.

The 228,455 MADV_DONTNEED calls are not a fallback after failed syscalls. They are the only allocator action that actually runs on this path.
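The cascade can be summarised as a small decision function. This is a toy model of only the three branches described above, written for illustration. It is not jemalloc's code, the names `Release` and `release_step` are hypothetical, and the real extent_dalloc_wrapper() handles more cases than this.

```rust
/// Which release mechanism ends up doing real work (toy model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Release {
    /// The mapping is returned to the OS with munmap.
    Munmap,
    /// The pages are decommitted.
    Decommit,
    /// The pages are purged with madvise(..., MADV_DONTNEED).
    ForcedPurge,
}

/// Models the fall-through order described in the text.
pub fn release_step(retain: bool, os_overcommits: bool) -> Release {
    if !retain {
        // Without retain, the munmap path is not skipped.
        return Release::Munmap;
    }
    if !os_overcommits {
        // Decommit only does something when the OS does not overcommit.
        return Release::Decommit;
    }
    // Linux defaults: retain on, overcommit on, so only the purge runs.
    Release::ForcedPurge
}
```

Under this model the Linux defaults, `release_step(true, true)`, land on `Release::ForcedPurge`, which is the only step that issues a syscall on this path.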

jemalloc’s own TUNING.md frames this as a CPU-versus-memory trade-off:

“Decay time determines how fast jemalloc returns unused pages back to the operating system, and therefore provides a fairly straightforward trade-off between CPU and memory usage. Shorter decay time purges unused pages faster to reduces memory usage (usually at the cost of more CPU cycles spent on purging).”²

The 62% regression is simply the “more CPU cycles” side of that trade-off. ARM64’s stricter TLB behavior amplifies it. The benchmark’s tight allocation loop also magnifies the cost. On a memory-constrained server the same behavior would be correct. On a 24 GB machine running allocation-heavy async workloads it is not.

Could jemalloc be tuned out of this?

A common first attempt is to set MALLOC_CONF. On tikv-jemallocator 0.6.1 that variable is silently ignored. The build uses JEMALLOC_PREFIX="_rjem_", so the correct name is _RJEM_MALLOC_CONF. Any testing with MALLOC_CONF will measure the default regardless of the string passed.
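A startup guard can catch the misnamed variable before any measurement is taken. The sketch below is hypothetical: `conf_var_name` and `ignored_conf` are illustrative helpers, and the rule "upper-cased prefix followed by MALLOC_CONF" is inferred from the `_rjem_` → `_RJEM_MALLOC_CONF` case above rather than taken from the crate's documentation.

```rust
/// Name of the configuration variable for a given symbol prefix
/// (assumed rule: upper-cased prefix + "MALLOC_CONF").
pub fn conf_var_name(prefix: &str) -> String {
    format!("{}MALLOC_CONF", prefix.to_uppercase())
}

/// Returns the value of a plain MALLOC_CONF that a prefixed build would
/// ignore, or None if the configuration looks consistent.
/// `lookup` abstracts the environment so the check can be tested.
pub fn ignored_conf<F>(lookup: F, prefix: &str) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    let expected = conf_var_name(prefix);
    if expected != "MALLOC_CONF" && lookup(&expected).is_none() {
        // Only the unprefixed name is set: its value will not be read.
        return lookup("MALLOC_CONF");
    }
    None
}
```

In a real binary the lookup would be `|k| std::env::var(k).ok()`, and a `Some(value)` result is a good reason to print a warning before running the benchmark.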

I retested with _RJEM_MALLOC_CONF on the 32 KB MPSC case.

Tune	Page faults	Per-iteration time
Baseline	~2 019 635	2.35 ms
`retain:false`	~2 018 974	2.36 ms
`dirty_decay_ms:0,muzzy_decay_ms:0`	~2 032 091	2.33 ms
`narenas:1`	~2 031 085	2.33 ms
`tcache:false`	949	0.86 ms
`tcache_max:4096`	~1 023	0.89 ms

What changed

tcache:false and tcache_max:4096 do fix the regression. They improve this benchmark by 2.6–2.7×. The page-fault count drops from ~1,580 faults/iteration to near-zero. The other knobs (retain, decay, arenas) make no measurable difference.

Why does limiting the tcache help? With default tcache_max, large deallocated chunks sit in the thread cache. When the cache fills, jemalloc purges them with madvise(..., MADV_DONTNEED). The next allocation from the same thread immediately reclaims from the tcache and touches those zeroed pages. That triggers a demand-zero fault. The strace data shows this clearly: baseline jemalloc triggers ~1,140 madvise calls per iteration, each followed by a fault on reuse.

With tcache_max:4096, allocations larger than 4 KB bypass the tcache. They flow through the arena’s extent management. Notably, the madvise count does not drop. It rises slightly to ~1,300 calls/iteration. The difference is that those purged extents are no longer on the immediate-reuse hot path. The next allocation tends to come from a different, non-purged extent. Demand-zero faults collapse from ~1,580/iteration to near-zero.

The fix is pattern-dependent because it only helps when the benchmark immediately reclaims from the same thread’s cache. When a single thread does all the work (block_on_alloc_st) or when many senders share the channel (tokio_mpsc_large), bypassing the tcache forces allocations through the slower arena path, and things get worse.

The catch — pattern dependence

The fix is not universal.

Pattern	Baseline jemalloc	`tcache_max:4096`	Change
mpsc/32 KB (1 sender)	2.35 ms	0.89 ms	−62% ✅
mpsc/16 KB (1 sender)	1.79 ms	0.83 ms	−54% ✅
mpsc/1 KB (1 sender)	209 µs	205 µs	−2% ✅
mpsc/8 B (1 sender)	113 µs	113 µs	0% ✅
tokio_mpsc_large/5×1000_32 KB (5 senders)	4.27 ms	4.70 ms	+10% ❌
block_on_alloc_st/1000×32 KB (single-threaded)	390 µs	763 µs	+96% ❌
spawn_many/10 000	6.41 ms	6.55 ms	~+2% (within noise)

tcache_max:4096 wins for cross-thread, single-pair channel workloads with multi-KB messages. It hurts single-threaded and multi-sender patterns. The mechanism is workload-specific. It is not a general jemalloc panacea.

Practical option

If a workload is dominated by large MPSC messages and locked into jemalloc, _RJEM_MALLOC_CONF=tcache_max:4096 is worth testing. A more conservative starting point is _RJEM_MALLOC_CONF=tcache_max:16384. That fixes the 32 KB case while keeping 16 KB allocations in the tcache, so the 16 KB case stays at its untuned ~1.78 ms.
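The variable has to be in the environment before the process starts, so a launcher is one place to set it. A minimal sketch using `std::process::Command` follows; `tuned_command` is a hypothetical helper and `./my-service` in the usage note is a placeholder path.

```rust
use std::process::Command;

/// Builds a command that starts `program` with the prefixed jemalloc
/// configuration variable set to `malloc_conf`.
pub fn tuned_command(program: &str, malloc_conf: &str) -> Command {
    let mut cmd = Command::new(program);
    // The prefixed name matters: a plain MALLOC_CONF would not be read
    // by a build that uses the `_rjem_` prefix.
    cmd.env("_RJEM_MALLOC_CONF", malloc_conf);
    cmd
}
```

For example, `tuned_command("./my-service", "tcache_max:16384").status()` would run the placeholder binary with the more conservative setting, which makes it straightforward to compare both thresholds against the real workload.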

For most users, snmalloc and mimalloc avoid the cliff without tuning. tcache_max:4096 has no effect on small-object churn like spawn_many. It stays within noise. It hurts single-threaded large allocations and multi-sender channel contention.

Is this tokio-specific, or universal?

I replicated six patterns drawn directly from tokio’s own benchmark suite. I wanted to test whether the effect is specific to multi-threaded async, channels, or something else.

Pattern	Type	std (µs)	jemalloc (µs)	snmalloc (µs)	Key finding
`spawn_many_local/10k`	MT spawn	10 693	5 289	5 393	jemalloc wins (2× std)
`spawn_many_remote_idle/10k`	MT spawn	8 050	6 237	5 021	snmalloc wins
`mpsc_contention/5x1000`	MT channel (`usize`)	1 236	1 216	1 194	All tied within 4%
`mpsc_large/5x1000_32KB`	MT channel (32 KB)	4 485	4 314	3 299	jemalloc loses — reproduces cliff
`block_on_alloc_st/1000x32KB`	ST alloc only	402	390	366	snmalloc wins; not async-specific
`spawn_st/1k`	ST spawn	336	275	244	snmalloc wins; not MT-specific

Three conclusions emerge.

usize contention barely matters for the allocator. Small objects short-circuit jemalloc’s MADV_DONTNEED path.

Large Vec<u8> payloads reproduce the cliff. That happens even with tokio’s own sync_mpsc contention pattern. The issue is not the custom harness.

Single-threaded block_on with 32 KB allocs shows the same penalty. This is not an async-specific or multi-thread-specific pathology. It is an allocator-allocation-size interaction. Tokio channels expose it when messages are multi-KB.

Summary

Workload	Winner	std vs best	Key driver
spawn_many	snmalloc ≈ mimalloc	1.42×	Small-object fast path
mpsc small	snmalloc ≈ mimalloc	1.24–1.53×	Thread cache, low coordination
mpsc large (untuned)	snmalloc ≈ mimalloc	~1.93×	`MADV_FREE` avoids fault storm
mpsc large (jemalloc, `tcache_max:4096`)	jemalloc	~2.6× vs untuned jemalloc	Bypass tcache to avoid madvise reuse cycle
block_on_alloc (small)	snmalloc	1.71×	Batch deallocation
block_on_alloc (large)	jemalloc	1.90×	Huge-page / slab path
alloc_stress (small)	jemalloc	2.45×	Thread-local cache
alloc_stress (large)	snmalloc	1.27×	Global-free-list batching

Glossary

Term	Meaning
`madvise`	Linux syscall that tells the kernel how a memory range will be used.
`MADV_DONTNEED`	Advice to drop physical pages immediately; next write faults.
`MADV_FREE`	Advice to lazily reclaim pages; no immediate fault on reuse.
`strace`	Tool that traces syscalls made by a process.
`perf stat`	Linux profiler that counts hardware and software events.
Criterion.rs	Rust benchmarking harness that handles warm-up, sampling, and statistics.
tcache	jemalloc’s per-thread cache for recently freed objects up to tcache_max bytes.
arena	jemalloc’s logical partition of the heap assigned to threads.

Appendix: Does this reproduce on x86-64?

All numbers so far come from an OCI Ampere A1. That is ARM64 with 4 vCore. To test architecture specificity I ran the same suite on an AMD Ryzen 9 9900X. That machine has 16C/32T, 64 GB DDR5, x86-64, Debian, kernel 6.x, and 4 KB pages.

Key results on x86-64

Workload	std	jemalloc	mimalloc	snmalloc	jemalloc with `tcache_max:4096`
mpsc/16 KB	387 µs	728 µs (+88%)	226 µs	241 µs	394 µs (+2%)
mpsc/32 KB	744 µs	1 004 µs (+35%)	382 µs	406 µs	393 µs (−47%)
block_on_alloc/1000×4 KB	43.6 µs	21.5 µs (−51%)	37.9 µs	20.9 µs (−52%)	23.2 µs (−47%)
spawn_many/10 000	2.62 ms	3.13 ms (+19%)	3.04 ms	3.51 ms	4.48 ms (+71%)
tokio_mpsc_large/5×1000_32KB	902 µs	1.80 ms (+100%)	1.74 ms	1.52 ms	2.78 ms (+208%)

Three findings stand out.

1. The jemalloc regression is present on x86-64 too, and at both sizes it is worse. At 16 KB, jemalloc is 1.88× slower than std on x86-64, versus 1.62× on ARM64. At 32 KB the gap narrows to 1.35×, versus 1.23× on ARM64. The direction is identical. The magnitude depends on the exact size threshold between slab and extent allocation on each platform.

2. tcache_max:4096 fixes the single-sender large-MPSC case on x86-64 as well. 32 KB drops from 1.0 ms to 393 µs. That is a 61% improvement. It matches the 62% seen on ARM64. The mechanism is the same. Bypassing the thread cache avoids immediate reuse of MADV_DONTNEED-purged extents.

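3. The pattern dependence has the same sign. The percentages in the x86-64 table are relative to std, so they are not directly comparable with the ARM64 figures, which are relative to untuned jemalloc. On tokio_mpsc_large the tuning raises x86-64 latency from 1.80 ms to 2.78 ms. That is +208% versus std and about +54% versus untuned jemalloc, compared with +10% versus untuned jemalloc on ARM64. spawn_many rises from 3.13 ms to 4.48 ms on x86-64. That is +71% versus std and about +43% versus untuned jemalloc, where ARM64 stayed within noise. The magnitude is platform-specific. The sign is the same.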

strace confirms the same syscall pattern

strace -f -e memory on the 32 KB MPSC case shows the same mechanism. The x86-64 machine runs more iterations per sample, so absolute counts are lower.

Allocator	`madvise` calls	Iterations	MADV_DONTNEED / iter	MADV_FREE / iter
jemalloc	422	3 825	0.11	<0.001
std	48	5 100	0.009	—
mimalloc	31	8 925	0.003	0
snmalloc	56	7 650	0.006	0.001

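jemalloc issues roughly 12× more MADV_DONTNEED calls per iteration than std on x86-64, and roughly 15× more madvise calls than snmalloc (0.11 vs ~0.007 per iteration). On ARM64 the jemalloc-to-snmalloc ratio was ~130×, so the gap is smaller here. In this trace snmalloc’s few calls are split between MADV_DONTNEED and MADV_FREE. The direction is unchanged. The effect is allocator behavior, not an ARM64 pathology.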

Platform differences worth noting

spawn_many: std wins on x86-64 (2.62 ms) while snmalloc wins on ARM64 (5.22 ms). glibc’s arena strategy performs better on the larger x86-64 cache hierarchy.
Absolute latencies: x86-64 is roughly 2–3× faster across the board. That is expected from a much faster CPU. On the single-sender large-MPSC cases the relative ranking is preserved: snmalloc and mimalloc lead, std follows, and untuned jemalloc is last.

Take

jemalloc is not “bad on ARM64.” It is appropriately aggressive for memory-tight servers. It is inappropriately aggressive for allocation-heavy, page-reuse workloads. Its MADV_DONTNEED strategy triggers a demand-zero fault storm. That happens when large buffers are repeatedly allocated and freed. Multi-KB MPSC messages produce exactly that pattern. The same regression appears on x86-64, so this is allocator behavior.

The practical rule is still: if a Tokio service pushes messages larger than 4 KB through channels, test snmalloc or mimalloc first. They avoid the cliff with no configuration.

If the binary is already committed to jemalloc, _RJEM_MALLOC_CONF=tcache_max:4096 can recover the loss on single-pair channel workloads. It is a sharp knife. Measure the actual pattern before deploying it. The only tuning that matters here is controlling the tcache size threshold.

One final, counter-intuitive result: perf stat shows cache-miss rates are higher for snmalloc than jemalloc on the large MPSC case (1.51% vs 0.96%). snmalloc is winning despite worse cache behavior. Its page-fault count is near-zero. The bottleneck is kernel entry and exit, not the cache hierarchy.

¹ jemalloc 5.3.0 man page, extent_hooks_t section. Source: jemalloc.xml in the tikv-jemalloc-sys 0.6.1 build.
² jemalloc 5.3.0, TUNING.md.

The experiments in this post were run with AI assistance. I wrote the words and checked every number myself.
