Allocator shootout: why your 'fast' allocator might be 62% slower

Ever wonder why a high-performance Rust app suddenly crawls when message sizes grow? I benchmarked four allocators on an ARM64 machine to find out. In the worst case, jemalloc ran 62% slower than the standard library's default allocator.

Method Glossary

Method	One-sentence explanation
`glibc` (std)	The default system allocator on Linux, derived from `ptmalloc2`; it serves most requests from `brk`-grown heap arenas and switches to `mmap` for requests above a size threshold.
`jemalloc`	A general-purpose allocator from Meta that uses thread-local caches and fine-grained size classes to reduce lock contention.
`mimalloc`	Microsoft’s allocator that prioritizes small, fixed-size object caches and aggressive thread-local freelists.
`snmalloc`	Microsoft’s research allocator that uses message-passing between threads and prefers `MADV_FREE` over `MADV_DONTNEED`.

I ran the shootout on an Ampere A1 instance with two separate benchmarks: a task-spawn microbenchmark and an MPSC channel throughput test. For `usize` payloads in the channel, all four allocators landed within 4% of one another. The differences only show up when the workload or payload size changes.

For small-object churn, I measured task spawn throughput with tiny allocations. jemalloc pulled ahead here because its thread-local slab caches eliminate most kernel calls for allocations under a few kilobytes. It finished 2.02× faster than glibc in this specific test.

Large objects in the channel benchmark tell a completely different story. I bumped the MPSC message size to 16 KB and 32 KB Vec<u8> blobs and re-ran the same throughput test. Here’s what happened:

Payload size	jemalloc vs std	mimalloc vs std	snmalloc vs std
16 KB	1.62× slower	~2× faster	~2× faster
32 KB	1.23× slower	~2× faster	~2× faster
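For readers who want to try the same shape of test, here is a minimal sketch of a large-payload MPSC throughput run. It is not the harness behind the numbers above: it uses `std::sync::mpsc` as a stand-in for whatever channel the original benchmark used, and the function name `run_mpsc`, the message count, and the channel bound of 1024 are all illustrative assumptions. The allocator under test would be installed separately with `#[global_allocator]`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Sends `messages` freshly allocated `Vec<u8>` payloads of `payload_bytes`
/// each through a bounded channel and returns (bytes received, elapsed time).
fn run_mpsc(payload_bytes: usize, messages: usize) -> (usize, Duration) {
    // Bounded so a slow consumer cannot let gigabytes pile up in the queue.
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(1024);
    let start = Instant::now();

    let producer = thread::spawn(move || {
        for _ in 0..messages {
            // One allocation per message; the non-zero fill writes every byte,
            // so every page of the buffer is touched.
            let buf = vec![0xAB_u8; payload_bytes];
            tx.send(buf).expect("receiver hung up");
        }
        // `tx` is dropped here, which ends the receive loop below.
    });

    let mut received = 0usize;
    for buf in rx {
        received += buf.len();
        // `buf` is dropped here, so the free happens on the consumer thread.
    }
    producer.join().expect("producer panicked");
    (received, start.elapsed())
}

fn main() {
    for &size in &[std::mem::size_of::<usize>(), 16 * 1024, 32 * 1024] {
        let (bytes, elapsed) = run_mpsc(size, 100_000);
        println!("{size:>6} B payload: {bytes} bytes in {elapsed:?}");
    }
}
```

Allocating on one thread and freeing on another is the part that matters here, since it is the pattern a channel of owned buffers forces on the allocator.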

I suspected cache misses at first, but perf did not show a spike in LLC misses. So I reached for strace to watch the syscall profile. During the 32 KB benchmark, jemalloc called madvise(..., MADV_DONTNEED) 228,455 times. That is not a typo.

MADV_DONTNEED tells the kernel to drop physical pages immediately. The virtual addresses stick around, but the next write to any of those pages triggers a demand-zero minor page fault. You pay a kernel tax on the first touch of every previously freed page, and with buffers being purged and reused in a tight loop, that first touch happens over and over.
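If you want to see the re-faulting without strace, one rough option is to read the process's minor-fault counter before and after a workload. The sketch below does that by parsing `/proc/self/stat`, where `minflt` is field 10. It is Linux-only, the helper names are my own, and the loop in `main` is a placeholder workload rather than the benchmark from this post.

```rust
use std::fs;

/// Extracts `minflt` (field 10 of /proc/<pid>/stat) from the file's contents.
fn parse_minflt(stat: &str) -> Option<u64> {
    // The comm field (field 2) is parenthesised and may itself contain
    // spaces or ')', so resume parsing after the last ')'.
    let rest = &stat[stat.rfind(')')? + 1..];
    // The remaining fields start at field 3 (state); minflt is the 8th of them.
    rest.split_whitespace().nth(7)?.parse().ok()
}

/// Minor page faults taken so far by this process (Linux only).
fn minor_faults() -> Option<u64> {
    parse_minflt(&fs::read_to_string("/proc/self/stat").ok()?)
}

fn main() {
    let before = minor_faults();
    // Placeholder workload: allocate, write, and free 32 KB buffers.
    for _ in 0..10_000 {
        let buf = vec![1u8; 32 * 1024];
        std::hint::black_box(&buf);
    }
    let after = minor_faults();
    match (before, after) {
        (Some(b), Some(a)) => println!("minor faults during workload: {}", a - b),
        _ => println!("could not read /proc/self/stat (not Linux?)"),
    }
}
```

Comparing this delta across allocators on the same workload gives a quick signal of how often freed pages are being handed back to the kernel and faulted in again.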

snmalloc avoids this trap entirely. It uses MADV_FREE instead, which marks pages as reclaimable but does not force the kernel to take them back immediately. The pages stay resident until memory pressure actually hits, so there is no forced re-faulting and no performance cliff.

You can tune jemalloc out of this behavior, but the knob is easy to miss. Since tikv-jemallocator uses a prefixed symbol namespace, the standard MALLOC_CONF environment variable does not work. You have to export _RJEM_MALLOC_CONF instead.¹
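Because jemalloc reads its configuration while it initializes, the variable has to be in the environment before the benchmark process starts; setting it from inside the process is likely too late. A small launcher is one way to do A/B runs. This is a sketch under assumptions: `bench_command` is a hypothetical helper, `./target/release/bench` is a placeholder path, and the option string is the one discussed in this post.

```rust
use std::process::Command;

/// Builds a command that runs `bench_path` with the prefixed jemalloc
/// configuration variable set in the child's environment.
fn bench_command(bench_path: &str, malloc_conf: &str) -> Command {
    let mut cmd = Command::new(bench_path);
    cmd.env("_RJEM_MALLOC_CONF", malloc_conf);
    cmd
}

fn main() -> std::io::Result<()> {
    // Placeholder path; pass your own benchmark binary as the first argument.
    let bench = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "./target/release/bench".to_string());
    let status = bench_command(&bench, "tcache_max:4096").status()?;
    println!("benchmark exited with {status}");
    Ok(())
}
```

Running the same binary once through the launcher and once directly gives a tuned and an untuned result from identical code.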

Setting _RJEM_MALLOC_CONF=tcache_max:4096 drops the 32 KB MPSC time by 62% and pushes jemalloc past snmalloc for single-sender workloads. The tuning is not free, though:

Workload	Effect of `tcache_max:4096`
32 KB single-sender MPSC	−62% time
Single-threaded allocation	+96% time
Multi-sender contention	+10% time
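The claim that tuned jemalloc passes snmalloc at 32 KB can be sanity-checked from the two tables. The sketch below assumes that "1.23× slower" means 1.23 times std's run time and that the −62% is measured against untuned jemalloc's time; `relative_time` is an illustrative helper, not part of any benchmark code.

```rust
/// Run time relative to std after applying a percentage time change
/// to a baseline that is already `slowdown_vs_std` times std's time.
fn relative_time(slowdown_vs_std: f64, percent_time_change: f64) -> f64 {
    slowdown_vs_std * (1.0 + percent_time_change / 100.0)
}

fn main() {
    // Inputs taken from the tables: 1.23x slower at 32 KB, -62% time when tuned.
    let tuned = relative_time(1.23, -62.0);
    println!(
        "tuned jemalloc: {:.3}x std's time, {:.2}x faster than std",
        tuned,
        1.0 / tuned
    );
}
```

Under those assumptions tuned jemalloc takes roughly 0.47 of std's time, a little over 2.1× faster, which edges out the "~2× faster" reported for snmalloc. The margin is thin given how approximate that "~2×" is.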

The real villain here is not async overhead or channel contention; it is allocation size. Move `usize` payloads through a channel and every allocator behaves the same. Move 32 KB `Vec<u8>` blobs and you are suddenly dependent on how the allocator negotiates with the kernel.

For most ARM64 deployments, snmalloc and mimalloc are the safer defaults. They win or tie in ten out of thirteen test cases and do not require reading man pages to dodge a cliff.

How many production “performance regressions” are actually just the kernel fighting with an allocator’s default settings? Probably more than we think.

¹ tikv-jemallocator 0.6.1 and later use the _RJEM_ prefix; earlier versions may differ.

This post was researched and drafted with AI assistance. I picked the ideas, reviewed every claim, and verified all benchmarks.