GPA is Dead. Long Live the Debug Allocator

Calder-Ty · February 9, 2025, 4:57am

I was doing some studying on the General Purpose Allocator (and maybe even found a bug) when I found this gem in some commits from yesterday:

I think this is a great move. While the GPA has been good for handling “general” cases (i.e. unpredictable memory usage patterns), the real selling point of the GPA is its debugging features: stack traces, leak detection, etc.
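For anyone who has not used those debugging features, here is a minimal sketch of the leak check, assuming Zig 0.14 where the type is named `std.heap.DebugAllocator`. The `run` helper and its parameters are made up for illustration.

```zig
const std = @import("std");

/// Hypothetical helper: allocates `n` bytes and frees them unless `leak` is set.
fn run(allocator: std.mem.Allocator, n: usize, leak: bool) !void {
    const buf = try allocator.alloc(u8, n);
    if (!leak) allocator.free(buf);
}

pub fn main() !void {
    var debug_allocator: std.heap.DebugAllocator(.{}) = .init;

    // Pass `true` as the last argument to see a leak report with a stack trace.
    try run(debug_allocator.allocator(), 64, false);

    // deinit() returns a std.heap.Check: .ok, or .leak if memory was left allocated.
    const check = debug_allocator.deinit();
    std.debug.print("leak check: {s}\n", .{@tagName(check)});
}
```

The leak report itself goes through the standard logging path, so the return value is mostly useful for asserting in tests or for choosing an exit code.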

vulpesx · February 9, 2025, 5:15am

:p

pfech · February 9, 2025, 7:13am

What is Smp?

cryptocode · February 9, 2025, 9:46am

The name comes from symmetric multiprocessing (see the Wikipedia article of that name).

It’s a thread-safe allocator intended to be faster than DebugAllocator (which also supports a thread-safe mode).
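To make the thread-safety point concrete, here is a small sketch that shares `std.heap.smp_allocator` across a few threads with no extra locking. It assumes Zig 0.14 and a build that is not `-fsingle-threaded`; the `worker` and `runWorkers` names are made up for illustration.

```zig
const std = @import("std");

/// Each worker allocates, touches, and frees its own buffer.
fn worker(allocator: std.mem.Allocator, ok: *bool) void {
    const buf = allocator.alloc(u8, 256) catch return;
    defer allocator.free(buf);
    @memset(buf, 0xAA);
    ok.* = buf[255] == 0xAA;
}

/// Spawns four workers that all use the same allocator concurrently.
fn runWorkers(allocator: std.mem.Allocator) !bool {
    var results = [_]bool{false} ** 4;
    var threads: [4]std.Thread = undefined;
    {
        var spawned: usize = 0;
        // Join whatever was spawned, even if a later spawn fails.
        defer for (threads[0..spawned]) |t| t.join();
        while (spawned < threads.len) : (spawned += 1) {
            threads[spawned] = try std.Thread.spawn(.{}, worker, .{ allocator, &results[spawned] });
        }
    }
    for (results) |r| if (!r) return false;
    return true;
}

pub fn main() !void {
    const ok = try runWorkers(std.heap.smp_allocator);
    std.debug.print("all workers succeeded: {}\n", .{ok});
}
```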

Luke · February 9, 2025, 9:55am

There is an entry in the devlog about it: Devlog ⚡ Zig Programming Language and a discussion on lobste.rs: No-Libc Zig Now Outperforms Glibc Zig | Lobsters.

ericlang · February 9, 2025, 12:48pm

So… what is THE non-debug allocator, now that name / functionality has changed?

squeek502 · February 9, 2025, 1:08pm

There isn’t one yet. See the “How to use it” and “Follow-up issues” sections of

github.com/ziglang/zig

introduce std.heap.SmpAllocator

ziglang:master ← ziglang:fast-gpa

opened 10:35PM - 07 Feb 25 UTC

andrewrk

+326 -43

An allocator that is designed for ReleaseFast optimization mode, with multi-threading enabled.

This allocator is a singleton; it uses global state and only one should be instantiated for the entire process.

This is a "sweet spot" - the implementation is about 200 lines of code and yet competitive with glibc performance.

## Basic Design

Each thread gets a separate freelist, however, the data must be recoverable when the thread exits. We do not directly learn when a thread exits, so occasionally, one thread must attempt to reclaim another thread's resources.

Above a certain size, those allocations are memory mapped directly, with no storage of allocation metadata. This works because the implementation refuses resizes that would move an allocation from small category to large category or vice versa.

Each allocator operation checks the thread identifier from a threadlocal variable to find out which metadata in the global state to access, and attempts to grab its lock. This will usually succeed without contention, unless another thread has been assigned the same id. In the case of such contention, the thread moves on to the next thread metadata slot and repeats the process of attempting to obtain the lock.

By limiting the thread-local metadata array to the same number as the CPU count, ensures that as threads are created and destroyed, they cycle through the full set of freelists.

## Performance Data Points

This is building hello world with glibc vs SmpAllocator:

* master branch (`0.14.0-dev.3145+6a6e72fff`) `stage3/bin/zig build -p glibc -Doptimize=ReleaseFast -Dno-lib -Dforce-link-libc`
* this branch, `stage3/bin/zig build -p SmpAllocator -Doptimize=ReleaseFast -Dno-lib`, which now uses SmpAllocator rather than DebugAllocator with this build configuration

```
Benchmark 1 (24 runs): glibc/bin/zig build-exe ../test/standalone/simple/hello_world/hello.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           211ms ± 9.91ms     193ms …  237ms          4 (17%)        0%
  peak_rss           73.2MB ±  708KB    71.9MB … 74.3MB          0 ( 0%)        0%
  cpu_cycles         1.16G  ± 9.10M     1.14G  … 1.18G           0 ( 0%)        0%
  instructions       2.32G  ± 81.4K     2.32G  … 2.32G           1 ( 4%)        0%
  cache_references   86.5M  ±  299K     86.1M  … 87.3M           2 ( 8%)        0%
  cache_misses       7.77M  ± 85.3K     7.62M  … 7.90M           0 ( 0%)        0%
  branch_misses      7.11M  ± 33.1K     7.05M  … 7.21M           1 ( 4%)        0%
Benchmark 2 (24 runs): SmpAllocator/bin/zig build-exe ../test/standalone/simple/hello_world/hello.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           208ms ± 7.30ms     196ms …  224ms          0 ( 0%)          -  1.3% ±  2.4%
  peak_rss           79.1MB ±  817KB    77.8MB … 81.2MB          1 ( 4%)        💩+  8.0% ±  0.6%
  cpu_cycles         1.15G  ± 16.9M     1.12G  … 1.18G           0 ( 0%)          -  0.8% ±  0.7%
  instructions       2.22G  ± 28.1K     2.22G  … 2.22G           0 ( 0%)        ⚡-  4.1% ±  0.0%
  cache_references   82.8M  ±  407K     82.1M  … 84.1M           1 ( 4%)        ⚡-  4.3% ±  0.2%
  cache_misses       7.93M  ± 96.6K     7.74M  … 8.12M           0 ( 0%)        💩+  2.1% ±  0.7%
  branch_misses      7.35M  ± 23.6K     7.30M  … 7.40M           0 ( 0%)        💩+  3.4% ±  0.2%
```

A particularly allocation-heavy ast-check:

```
Benchmark 1 (32 runs): glibc/bin/zig ast-check ../lib/compiler_rt/udivmodti4_test.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           156ms ± 6.58ms     151ms …  173ms          4 (13%)        0%
  peak_rss           45.0MB ± 20.9KB    45.0MB … 45.1MB          1 ( 3%)        0%
  cpu_cycles          766M  ± 10.2M      754M  …  796M           0 ( 0%)        0%
  instructions       3.19G  ± 12.7      3.19G  … 3.19G           0 ( 0%)        0%
  cache_references   4.12M  ±  498K     3.88M  … 6.13M           3 ( 9%)        0%
  cache_misses        128K  ± 2.42K      125K  …  134K           0 ( 0%)        0%
  branch_misses      1.14M  ±  215K      925K  … 1.43M           0 ( 0%)        0%
Benchmark 2 (34 runs): SmpAllocator/bin/zig ast-check ../lib/compiler_rt/udivmodti4_test.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           149ms ± 1.87ms     146ms …  156ms          1 ( 3%)        ⚡-  4.9% ±  1.5%
  peak_rss           39.6MB ±  141KB    38.8MB … 39.6MB          2 ( 6%)        ⚡- 12.1% ±  0.1%
  cpu_cycles          750M  ± 3.77M      744M  …  756M           0 ( 0%)        ⚡-  2.1% ±  0.5%
  instructions       3.05G  ± 11.5      3.05G  … 3.05G           0 ( 0%)        ⚡-  4.5% ±  0.0%
  cache_references   2.94M  ± 99.2K     2.88M  … 3.36M           4 (12%)        ⚡- 28.7% ±  4.2%
  cache_misses       48.2K  ± 1.07K     45.6K  … 52.1K           2 ( 6%)        ⚡- 62.4% ±  0.7%
  branch_misses       890K  ± 28.8K      862K  … 1.02M           2 ( 6%)        ⚡- 21.8% ±  6.5%
```

Building the self-hosted compiler:

```
Benchmark 1 (3 runs): glibc/bin/zig build -Dno-lib -p trash
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          12.2s  ± 99.4ms    12.1s  … 12.3s           0 ( 0%)        0%
  peak_rss            975MB ± 21.7MB     951MB …  993MB          0 ( 0%)        0%
  cpu_cycles         88.7G  ± 68.3M     88.7G  … 88.8G           0 ( 0%)        0%
  instructions        188G  ± 1.40M      188G  …  188G           0 ( 0%)        0%
  cache_references   5.88G  ± 33.2M     5.84G  … 5.90G           0 ( 0%)        0%
  cache_misses        383M  ± 2.26M      381M  …  385M           0 ( 0%)        0%
  branch_misses       368M  ± 1.77M      366M  …  369M           0 ( 0%)        0%
Benchmark 2 (3 runs): SmpAllocator/fast/bin/zig build -Dno-lib -p trash
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          12.2s  ± 49.0ms    12.2s  … 12.3s           0 ( 0%)          +  0.0% ±  1.5%
  peak_rss            953MB ± 3.47MB     950MB …  957MB          0 ( 0%)          -  2.2% ±  3.6%
  cpu_cycles         88.4G  ±  165M     88.2G  … 88.6G           0 ( 0%)          -  0.4% ±  0.3%
  instructions        181G  ± 6.31M      181G  …  181G           0 ( 0%)        ⚡-  3.9% ±  0.0%
  cache_references   5.48G  ± 17.5M     5.46G  … 5.50G           0 ( 0%)        ⚡-  6.9% ±  1.0%
  cache_misses        386M  ± 1.85M      384M  …  388M           0 ( 0%)          +  0.6% ±  1.2%
  branch_misses       377M  ±  899K      377M  …  378M           0 ( 0%)        💩+  2.6% ±  0.9%
```

[more performance data points](https://github.com/andrewrk/CarmensPlayground?tab=readme-ov-file#example-runs)

## How to use it

Put something like this in your main function:

```zig
var debug_allocator: std.heap.DebugAllocator(.{}) = .init;

pub fn main() !void {
    const gpa, const is_debug = gpa: {
        if (native_os == .wasi) break :gpa .{ std.heap.wasm_allocator, false };
        break :gpa switch (builtin.mode) {
            .Debug, .ReleaseSafe => .{ debug_allocator.allocator(), true },
            .ReleaseFast, .ReleaseSmall => .{ std.heap.smp_allocator, false },
        };
    };
    defer if (is_debug) {
        _ = debug_allocator.deinit();
    };
}
```

## Follow-up issues

* Provide some kind of abstraction that does the above logic for choosing an allocator
* Look into asking for page align rather than slab align
* Look into restartable sequences
* Look into VirtualAlloc2 to improve PageAllocator
* #12484
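The slot-cycling idea from the "Basic Design" section of the PR description can be illustrated with a toy sketch. This is not the SmpAllocator source: `Slot`, `slots`, `slot_index`, and `lockSlot` are hypothetical names, a fixed count of 4 stands in for the CPU count, and the freelists themselves are left out.

```zig
const std = @import("std");

// Hypothetical per-slot state; a real allocator would keep its freelists here.
const Slot = struct {
    mutex: std.Thread.Mutex = .{},
};

// Assumption: 4 slots stand in for "one slot per CPU".
var slots = [_]Slot{.{}} ** 4;

// Each thread remembers the slot it locked most recently.
threadlocal var slot_index: usize = 0;

/// Returns a locked slot. The caller must call `slot.mutex.unlock()` when done.
fn lockSlot() *Slot {
    var i = slot_index;
    // Try the remembered slot first; on contention, move on to the next one.
    while (true) : (i = (i + 1) % slots.len) {
        if (slots[i].mutex.tryLock()) {
            slot_index = i;
            return &slots[i];
        }
    }
}

pub fn main() void {
    const slot = lockSlot();
    defer slot.mutex.unlock();
    std.debug.print("locked slot {d}\n", .{slot_index});
}
```

Because the common case is an uncontended `tryLock` on the slot the thread used last time, the fast path stays cheap, and capping the number of slots keeps the freelists in circulation as threads come and go.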

deckarep · February 17, 2025, 1:23am

Hello!

So is my understanding correct that the new SmpAllocator is now intended to be the default production-grade (release mode) allocator?

The reason I ask is that it looks like it’s thread-safe by default, but if I’m working on a traditional single-threaded game loop, for example, I might prefer to opt in to thread safety instead. In the Zig docs I know there is a ThreadSafeAllocator that can wrap child allocators to provide the necessary synchronization, and I liked that opt-in choice.

This isn’t intended as a nitpick or criticism; I’m just trying to prepare for what to use, since the GPA is no longer recommended as an actual general-purpose allocator and should only be used in Debug builds.
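For reference, the opt-in wrapping mentioned above looks roughly like this. It is a minimal sketch assuming the Zig 0.14 standard library, where `std.heap.ThreadSafeAllocator` has a `child_allocator` field and guards each call with a mutex; the `sumWith` helper is made up for illustration.

```zig
const std = @import("std");

/// Hypothetical helper: allocates a small slice, fills it, and returns the sum.
fn sumWith(allocator: std.mem.Allocator) !u32 {
    const items = try allocator.alloc(u32, 8);
    defer allocator.free(items);
    var total: u32 = 0;
    for (items, 0..) |*item, i| {
        item.* = @intCast(i);
        total += item.*;
    }
    return total;
}

pub fn main() !void {
    // An arena is not thread-safe on its own.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    // Opt in to synchronization by wrapping the child allocator.
    var thread_safe: std.heap.ThreadSafeAllocator = .{
        .child_allocator = arena.allocator(),
    };

    const total = try sumWith(thread_safe.allocator());
    std.debug.print("sum: {d}\n", .{total});
}
```

A single-threaded program can simply pass `arena.allocator()` directly and skip the wrapper, which is the opt-in choice the post describes.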

Thanks y’all.

squeek502 · February 17, 2025, 1:42am

From here:

std.heap.SmpAllocator fills the niche of -OReleaseFast -fno-single-threaded

There’s currently no allocator tailored towards -OReleaseFast -fsingle-threaded, but I would assume such an allocator would be welcomed.

(see also my last comment about follow up issues)