Your “fast” allocator might be the reason your high-performance Rust app is hitting a wall. I tested four global allocators on an ARM64 machine by linking the same Tokio MPSC benchmark to each one in turn and measuring end-to-end latency. jemalloc clocked in at 1.62× the time of glibc for 16 KB messages. At 32 KB the gap narrowed to 1.23×, but snmalloc and mimalloc both finished in roughly half the time — about 1.93× faster than std.
I assumed cache misses were the culprit. The real cause is way more boring and a lot more annoying: it’s a system call.
Method glossary
| Method | What it does |
|---|---|
| spawn_many | Spawns and awaits a configurable number of async tasks to stress small-object allocation paths. |
| mpsc | Sends messages of varying sizes through a bounded Tokio multi-producer single-consumer channel. |
| block_on_alloc | Allocates Vec<u8> blobs inside block_on to measure allocation churn during executor blocking. |
| alloc_stress | Randomly sizes and drops Vec instances to exercise mixed-size allocation and deallocation. |
| tokio_mpsc_large | A multi-sender variant of the MPSC benchmark with five concurrent senders. |
| block_on_alloc_st | A single-threaded version of block_on_alloc that isolates allocator path behavior. |
| strace | Traces every system call issued by a process to identify kernel-level overhead. |
| MADV_DONTNEED | A Linux madvise flag that immediately drops physical backing pages and forces demand-zero faults on reuse. |
| MADV_FREE | A Linux madvise flag that lazily marks pages reclaimable without forcing immediate re-faulting. |
| MALLOC_CONF | The standard environment variable for jemalloc tuning; ignored by prefixed builds. |
| _RJEM_MALLOC_CONF | The actual environment variable that controls tikv-jemallocator when built with JEMALLOC_PREFIX="_rjem_". |
| retain | A jemalloc boolean option that controls whether unused virtual memory is retained or unmapped. |
| dirty_decay_ms | A jemalloc tuning knob that sets how long dirty pages sit before decay to muzzy. |
| muzzy_decay_ms | A jemalloc tuning knob that sets how long muzzy pages sit before being purged. |
| narenas | A jemalloc option that sets the number of allocation arenas. |
| tcache | A jemalloc boolean option that enables or disables per-thread caching entirely. |
| tcache_max | A jemalloc tuning knob that sets the maximum allocation size cached per thread. |
Results
spawn_many: tasks spawned and awaited
The spawn_many benchmark spawns and awaits a configurable number of async tasks, then records total wall-clock latency. I linked the same binary against four different global allocators and ran it on an ARM64 Ampere A1. Each task allocates a Task box, a join handle, and at least one waker — all small objects that hit the allocator’s fast path repeatedly.
| Tasks | std (µs) | jemalloc (µs) | mimalloc (µs) | snmalloc (µs) | snmalloc vs std |
|---|---|---|---|---|---|
| 1 000 | 748.0 | 630.3 | 594.4 | 577.4 | 1.30× |
| 5 000 | 3 524.7 | 3 053.8 | 2 847.8 | 2 727.7 | 1.29× |
| 10 000 | 7 387.5 | 6 413.5 | 5 499.0 | 5 216.4 | 1.42× |
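The gap is widest at 10 000 tasks (1.42×, versus roughly 1.3× at the two smaller counts). My best explanation is glibc's default arena strategy: every task makes several small, frequent allocations, and on ARM64 those appear to pay a higher penalty than on x86-64.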
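The mechanism behind linking one benchmark against several allocators is Rust's #[global_allocator] attribute. The sketch below is a std-only stand-in, not the harness that produced the numbers above: CountingAlloc, ALLOCS, small_object_churn, and allocations_during are hypothetical names, and the wrapper simply forwards to the system allocator. A real comparison would instead name an allocator type from a crate such as tikv-jemallocator, mimalloc, or snmalloc-rs on that one static.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the system allocator and counts requests.
struct CountingAlloc;

/// Number of allocation requests seen so far, process-wide.
static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Every Box, Vec, and String in the program now goes through CountingAlloc.
// Swapping allocators means changing only the type named on this static.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Allocates and drops `count` small boxed values, loosely imitating the
/// per-task small objects described above (an assumption, not Tokio's layout).
fn small_object_churn(count: usize) {
    for i in 0..count {
        black_box(Box::new([i; 4]));
    }
}

/// Runs `work` and returns how many allocations were requested while it ran.
/// The counter is process-wide, so allocations from other threads are included.
fn allocations_during<F: FnOnce()>(work: F) -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    work();
    ALLOCS.load(Ordering::Relaxed) - before
}
```

Calling allocations_during(|| small_object_churn(10_000)) reports at least one allocation per boxed value, a quick way to confirm how often a workload reaches the allocator before comparing allocators on it.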
mpsc: latency vs message size
The mpsc benchmark sends messages of different sizes through a bounded Tokio channel and measures round-trip latency. I linked the same binary against each allocator and swept message size from 8 B to 32 KB.
| Size | std (µs) | jemalloc (µs) | mimalloc (µs) | snmalloc (µs) | Best vs std |
|---|---|---|---|---|---|
| 8 B | 139.5 | 113.3 | 112.6 | 112.5 | 1.24× snmalloc |
| 64 B | 139.3 | 119.4 | 114.5 | 114.2 | 1.22× snmalloc |
| 256 B | 170.7 | 138.4 | 115.2 | 117.3 | 1.48× mimalloc |
| 512 B | 177.5 | 161.0 | 127.6 | 125.0 | 1.42× snmalloc |
| 1 KB | 211.3 | 203.5 | 138.4 | 138.9 | 1.53× mimalloc |
| 4 KB | 264.3 | 377.1 | 267.5 | 212.9 | 1.24× snmalloc |
| 16 KB | 1 077.8 | 1 741.0 | 637.7 | 593.0 | 1.82× snmalloc |
| 32 KB | 1 875.3 | 2 299.2 | 984.4 | 970.7 | 1.93× snmalloc |
Two distinct patterns showed up in the numbers. For small messages, 8 B through 1 KB, snmalloc and mimalloc run 1.2–1.5× faster than std, with jemalloc in between. From 4 KB up the story flips: jemalloc falls behind std (377.1 µs against 264.3 µs at 4 KB), and at 16 KB and 32 KB snmalloc and mimalloc lead std by nearly 2×.
The crossover where jemalloc goes from faster than std to slower sits between 1 KB and 4 KB, and the gap is widest at 16 KB. That range is plausibly where the allocator moves from slab caches to fresh extent and page allocation.
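The allocation pattern behind this table is a buffer allocated on one thread and freed on another. The sketch below reproduces that pattern with std only, so it is not the Tokio benchmark: it swaps in std::sync::mpsc::sync_channel and an OS thread for Tokio's bounded channel and tasks, and run_once, sweep, and the capacity of 64 are hypothetical choices.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::{Duration, Instant};

/// Sends `messages` buffers of `msg_size` bytes through a bounded channel.
/// The producer thread allocates each buffer; the receiving thread frees it.
/// Returns the elapsed time and the total number of bytes received.
fn run_once(msg_size: usize, messages: usize) -> (Duration, usize) {
    // Capacity 64 is an arbitrary choice for this sketch.
    let (tx, rx) = sync_channel::<Vec<u8>>(64);
    let start = Instant::now();

    let producer = thread::spawn(move || {
        for _ in 0..messages {
            // A non-zero fill forces every page of the buffer to be written.
            let buf = vec![1u8; msg_size];
            if tx.send(buf).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the receive loop below.
    });

    let mut received = 0usize;
    for buf in rx {
        received += buf.len();
        // `buf` is dropped (freed) here, on a different thread than its allocation.
    }
    producer.join().expect("producer thread panicked");

    (start.elapsed(), received)
}

/// Repeats the run for the message sizes used in the table above.
fn sweep(messages: usize) -> Vec<(usize, Duration)> {
    [8, 64, 256, 512, 1024, 4096, 16 * 1024, 32 * 1024]
        .iter()
        .map(|&size| (size, run_once(size, messages).0))
        .collect()
}
```

Absolute timings from this stand-in will not match the table, since the channel and scheduler differ. What carries over is the cross-thread allocate-and-free shape that the larger sizes stress.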
block_on_alloc: Vec churn inside block_on
The block_on_alloc benchmark allocates Vec<u8> blobs inside block_on to stress the allocator during executor blocking. I timed two sizes — 64 B and 4 KB — at counts of 100 and 1 000 allocations.
| Case | std (µs) | jemalloc (µs) | mimalloc (µs) | snmalloc (µs) | Best vs std |
|---|---|---|---|---|---|
| 100 × 64 B | 2.12 | 1.48 | 1.36 | 1.30 | 1.63× snmalloc |
| 100 × 4 KB | 11.00 | 5.84 | 8.83 | 6.55 | 1.88× jemalloc |
| 1 000 × 64 B | 20.37 | 13.97 | 12.72 | 11.90 | 1.71× snmalloc |
| 1 000 × 4 KB | 109.97 | 57.78 | 87.53 | 64.74 | 1.90× jemalloc |
jemalloc wins both 4 KB cases, likely because its thread-local cache and huge-page path pay off for larger individual allocations. snmalloc wins both 64 B cases. mimalloc stays competitive but never takes first place.
alloc_stress: random-size Vec churn
The alloc_stress benchmark randomly sizes and drops Vec instances to exercise mixed-size allocation paths. I ran it for 100 and 1 000 allocations and recorded total latency.
| Allocs | std (µs) | jemalloc (µs) | mimalloc (µs) | snmalloc (µs) | Best vs std |
|---|---|---|---|---|---|
| 100 | 9.37 | 3.83 | 5.45 | 4.47 | 2.45× jemalloc |
| 1 000 | 141.6 | 174.3 | 134.3 | 111.4 | 1.27× snmalloc |
jemalloc dominates the 100-allocation run — likely because its thread-local cache handles small batches efficiently. At 1 000 allocations, snmalloc’s global-free-list batching takes over and it leads.
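A mixed-size churn loop of this kind can be sketched with std only. This is an illustration of the pattern, not the benchmark's source: XorShift64, alloc_stress, the size cap, and the drop policy (free a random live buffer about half the time) are all assumptions made for the example.

```rust
/// Tiny deterministic PRNG so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Allocates `count` buffers with sizes in 1..=max_size and frees a random
/// live buffer roughly half the time. Returns the total bytes requested.
fn alloc_stress(count: usize, max_size: usize, seed: u64) -> usize {
    assert!(max_size > 0, "max_size must be at least 1");
    // Xorshift must not start from zero, so clamp the seed.
    let mut rng = XorShift64(seed.max(1));
    let mut live: Vec<Vec<u8>> = Vec::new();
    let mut total = 0usize;

    for _ in 0..count {
        let size = 1 + (rng.next_u64() % max_size as u64) as usize;
        live.push(vec![1u8; size]);
        total += size;

        // Interleave frees with allocations so sizes mix in the free lists.
        if rng.next_u64() % 2 == 0 {
            let idx = (rng.next_u64() % live.len() as u64) as usize;
            live.swap_remove(idx);
        }
    }
    total
}
```

Because the generator is seeded, two runs with the same arguments request the same sequence of sizes, which keeps allocator comparisons on equal footing.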
Root cause: MADV_DONTNEED vs MADV_FREE
I ran everything on a 4 vCore ARM64 Ampere A1 instance with 24 GB of RAM, Debian 13, rustc 1.95.0, and tokio 1.52.3. The test setup links the same routines to different global allocators to keep noise out.
The mpsc benchmark with 32 KB messages showed jemalloc taking 1.23× as long as glibc and snmalloc running 1.93× faster than std. I suspected cache misses. What actually exposed the cause was strace, a tool that traces every system call a process makes.
jemalloc called madvise with MADV_DONTNEED 228,455 times during the 32 KB run. MADV_DONTNEED tells the kernel to drop physical pages immediately. The virtual addresses stay valid, but the next write to any page triggers a demand-zero minor page fault: the kernel has to allocate a fresh physical page and zero it. On ARM64 with 4 KB pages and smaller TLBs than x86-64, that faulting is proportionally more expensive.
snmalloc took a different path. It called madvise with MADV_FREE 38,796 times. MADV_FREE tells the kernel the pages may be reclaimed under memory pressure, but they stay in RAM until reclaim actually happens, so subsequent writes do not force a re-fault.
I confirmed this with strace -e memory on the 32 KB MPSC case. jemalloc issued roughly 130× more madvise calls per iteration than snmalloc, and snmalloc’s calls used MADV_FREE rather than MADV_DONTNEED.
The benchmark repeatedly allocates and frees 32 KB buffers through a bounded channel. That pattern pings the MADV_DONTNEED / demand-fault cycle over and over. jemalloc is aggressively returning memory. snmalloc is deferring.
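Counting madvise calls by flag from a saved strace log is easy to automate. The sketch below assumes the usual strace line shape, such as madvise(0xffff9c000000, 32768, MADV_DONTNEED) = 0, optionally prefixed with [pid N] when following threads. MadviseCounts and tally_madvise are hypothetical names, and the addresses in any sample input are made up.

```rust
/// Per-flag totals for madvise calls found in a strace log.
#[derive(Debug, Default, PartialEq)]
struct MadviseCounts {
    dontneed: usize,
    free: usize,
    other: usize,
}

/// Scans strace output line by line and tallies madvise calls by advice flag.
/// Lines for other system calls are ignored.
fn tally_madvise(trace: &str) -> MadviseCounts {
    let mut counts = MadviseCounts::default();
    for line in trace.lines() {
        // Search within the line so a leading "[pid N] " prefix is tolerated.
        if let Some(pos) = line.find("madvise(") {
            let call = &line[pos..];
            if call.contains("MADV_DONTNEED") {
                counts.dontneed += 1;
            } else if call.contains("MADV_FREE") {
                counts.free += 1;
            } else {
                counts.other += 1;
            }
        }
    }
    counts
}
```

Dividing each total by the number of benchmark iterations in the traced run gives a per-iteration figure that can be compared across allocators.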
Can jemalloc be tuned out of this?
The first thing most people try is MALLOC_CONF, and so did I. Nothing happened. tikv-jemallocator 0.6.1 silently ignores that variable because the build uses JEMALLOC_PREFIX="_rjem_", so the name it actually reads is _RJEM_MALLOC_CONF. Until I switched, I had been measuring the default config the whole time.
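A startup check can catch this mistake before any measurement is taken. The sketch below only inspects the two environment variables named above and relies on the behavior observed here with tikv-jemallocator 0.6.1; ConfStatus, classify, and check_env are hypothetical names, and nothing in it queries jemalloc itself.

```rust
use std::env;

/// What a startup check can conclude from the two variables.
#[derive(Debug, PartialEq)]
enum ConfStatus {
    /// The prefixed variable is set; these options should reach the allocator.
    Active(String),
    /// Only the unprefixed variable is set; a prefixed build will not read it.
    LikelyIgnored(String),
    /// Neither variable is set; jemalloc runs with its defaults.
    Defaults,
}

/// Pure decision logic, kept separate from the environment so it is testable.
fn classify(unprefixed: Option<String>, prefixed: Option<String>) -> ConfStatus {
    match (prefixed, unprefixed) {
        (Some(conf), _) => ConfStatus::Active(conf),
        (None, Some(conf)) => ConfStatus::LikelyIgnored(conf),
        (None, None) => ConfStatus::Defaults,
    }
}

/// Reads both variables from the process environment and classifies them.
fn check_env() -> ConfStatus {
    classify(
        env::var("MALLOC_CONF").ok(),
        env::var("_RJEM_MALLOC_CONF").ok(),
    )
}
```

Logging the result of check_env() at the top of a benchmark run makes a LikelyIgnored configuration visible immediately instead of after a round of confusing numbers.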
I tested several jemalloc tuning knobs — retain, dirty_decay_ms, muzzy_decay_ms, narenas, tcache, and tcache_max — by passing each through _RJEM_MALLOC_CONF and recording page faults and per-iteration time.
| Tune | Page faults | Per-iteration time |
|---|---|---|
| Baseline | ~2 019 635 | 2.35 ms |
| retain:false | ~2 018 974 | 2.36 ms |
| dirty_decay_ms:0,muzzy_decay_ms:0 | ~2 032 091 | 2.33 ms |
| narenas:1 | ~2 031 085 | 2.33 ms |
| tcache:false | 949 | 0.86 ms |
| tcache_max:4096 | ~1 023 | 0.89 ms |
The tcache:false and tcache_max:4096 settings changed the game. With _RJEM_MALLOC_CONF=tcache_max:4096, jemalloc’s 32 KB MPSC time dropped by 62% — it actually beat snmalloc on the single-sender workload. But the same tuning hurt single-threaded allocation by 96% and multi-sender contention by 10%.
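One way to collect a page-fault figure like the ones in the table above from inside a Linux process is to read the minflt field of /proc/self/stat before and after the workload. This is only one option and may not be how the table was produced. parse_minflt, minor_faults, and faults_during are hypothetical names.

```rust
use std::fs;

/// Extracts minflt (minor page faults) from the text of /proc/<pid>/stat.
fn parse_minflt(stat: &str) -> Option<u64> {
    // The comm field is wrapped in parentheses and may itself contain spaces,
    // so split at the last ')' instead of on whitespace from the start.
    let after_comm = &stat[stat.rfind(')')? + 1..];
    // Fields after comm: state, ppid, pgrp, session, tty_nr, tpgid, flags, minflt.
    after_comm.split_whitespace().nth(7)?.parse().ok()
}

/// Minor faults taken by the current process so far.
/// Returns None where /proc/self/stat is unavailable (for example, off Linux).
fn minor_faults() -> Option<u64> {
    let stat = fs::read_to_string("/proc/self/stat").ok()?;
    parse_minflt(&stat)
}

/// Runs `work` and returns how many minor faults the process took meanwhile.
fn faults_during<F: FnOnce()>(work: F) -> Option<u64> {
    let before = minor_faults()?;
    work();
    Some(minor_faults()?.saturating_sub(before))
}
```

The counter covers the whole process, so runtime startup and other threads contribute to it. Differences between configurations are more meaningful than any single absolute value.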
I compared the tuned and untuned configurations across five specific patterns: mpsc at 16 KB and 32 KB, tokio_mpsc_large with five senders, block_on_alloc_st single-threaded, and spawn_many at 10 000 tasks.
| Pattern | Baseline jemalloc | tcache_max:4096 | Change |
|---|---|---|---|
| mpsc/32 KB (1 sender) | 2.35 ms | 0.89 ms | −62% ✅ |
| mpsc/16 KB (1 sender) | 1.79 ms | 0.83 ms | −54% ✅ |
| tokio_mpsc_large/5×1000_32 KB (5 senders) | 4.27 ms | 4.70 ms | +10% ❌ |
| block_on_alloc_st/1000×32 KB (single-threaded) | 390 µs | 763 µs | +96% ❌ |
| spawn_many/10 000 | 6.41 ms | 6.55 ms | ~+2% (within noise) |
tcache_max:4096 is a win for cross-thread, single-pair channel workloads with multi-KB messages. It hurts single-threaded and multi-sender patterns. The mechanism is workload-specific, not a general jemalloc panacea.
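The Change column is plain relative change against the baseline, and the same pair of numbers can be restated as a speedup factor. A minimal sketch with hypothetical helper names pct_change and speedup, using the figures from the table above:

```rust
/// Relative change of `tuned` against `baseline`, in percent.
/// Negative means the tuned run took less time.
fn pct_change(baseline: f64, tuned: f64) -> f64 {
    (tuned - baseline) / baseline * 100.0
}

/// How many times faster the tuned run is than the baseline.
fn speedup(baseline: f64, tuned: f64) -> f64 {
    baseline / tuned
}

/// (pattern, baseline, tuned) triples copied from the table above.
/// Units match within each row (ms or µs), so the ratios are unit-free.
const ROWS: [(&str, f64, f64); 5] = [
    ("mpsc/32 KB", 2.35, 0.89),
    ("mpsc/16 KB", 1.79, 0.83),
    ("tokio_mpsc_large", 4.27, 4.70),
    ("block_on_alloc_st", 390.0, 763.0),
    ("spawn_many/10 000", 6.41, 6.55),
];

/// Formats one line per row with the rounded percentage change.
fn change_report() -> Vec<String> {
    ROWS.iter()
        .map(|&(name, base, tuned)| format!("{}: {:+.0}%", name, pct_change(base, tuned)))
        .collect()
}
```

For the 32 KB single-sender row, speedup(2.35, 0.89) comes out near 2.6, which is the same result as the 62% reduction expressed as a factor.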
I used strace again to understand why limiting the tcache worked. Baseline jemalloc triggered 1,140 madvise calls per iteration, each followed by a fault on reuse. Large deallocated chunks sat in the thread cache, got purged with MADV_DONTNEED, and were immediately reclaimed and zeroed.
Setting tcache_max:4096 forces allocations larger than 4 KB to bypass the tcache. The madvise count rose to 1,300 calls per iteration. But those purged extents weren't on the immediate-reuse hot path anymore, so demand-zero faults collapsed from 1,580 per iteration to almost zero.
This fix only works for specific patterns. It helps when buffers are allocated on one thread and freed on another, as in the single-pair channel, because the purged extents drop off the immediate-reuse path. When I ran block_on_alloc_st or tokio_mpsc_large, bypassing the tcache forced everything through the slower arena path and tanked performance.
Summary
| Workload | Winner | Best vs std | Key driver |
|---|---|---|---|
| spawn_many | snmalloc ≈ mimalloc | 1.42× | Small-object fast path |
| mpsc small | snmalloc ≈ mimalloc | 1.22–1.53× | Thread cache, low coordination |
| mpsc large (untuned) | snmalloc ≈ mimalloc | ~1.93× | MADV_FREE avoids fault storm |
| mpsc large (jemalloc, tcache_max:4096) | jemalloc | ~2.6× vs untuned jemalloc | Bypass tcache to avoid madvise reuse cycle |
| block_on_alloc (small) | snmalloc | 1.71× | Batch deallocation |
| block_on_alloc (large) | jemalloc | 1.90× | Huge-page / slab path |
| alloc_stress (100 allocs) | jemalloc | 2.45× | Thread-local cache |
| alloc_stress (1 000 allocs) | snmalloc | 1.27× | Global-free-list batching |
The penalty is allocation-size dependent, not async-pattern dependent. I measured this by running each benchmark linked to each allocator and comparing end-to-end latency on the same ARM64 machine. For most people on ARM64, snmalloc and mimalloc are the safer bets. One of the two takes first place in 14 of the 17 cases tabulated above, they avoid the large-message cliff entirely, and they require no MALLOC_CONF archaeology.
I ran these tests on an OCI Ampere A1 — ARM64, 4 vCore, 24 GB RAM, Debian 13, kernel 6.12.86, rustc 1.95.0, tokio 1.52.3. It makes me wonder: how many other “performance regressions” in production are actually just the OS and the allocator having a disagreement about page reclamation?







