Do we really need a thread-safe ArenaAllocator by default?

I noticed that the ArenaAllocator in the master branch has been changed to a lock-free thread-safe implementation. I have serious concerns about this change, so I would like to get more explanation.

Although I am a heavy user of ArenaAllocator, nearly 99% of my usage needs are on a single thread. I mainly use ArenaAllocator to amortize the allocation and deallocation overhead of the global allocator (including the thread-safety overhead of the global allocator) and to improve cache locality. The default thread-safe ArenaAllocator weakens its performance advantage. Even with a lock-free implementation, atomic operations still affect optimization.

Another point I am deeply skeptical about regarding the default thread-safe ArenaAllocator is that I believe it encourages an anti-pattern of concurrent high-frequency writes to the same bump allocator. The ArenaAllocator is a bump allocator, where allocated memory is contiguous, so threads can easily access the same cache line, leading to a false-sharing disaster. This is also the key reason why I almost never use ArenaAllocator across threads.

4 Likes

What would actually be persuasive is a post that says “I’ve run the 0.15 ArenaAllocator head-to-head with the thread-safe 0.16 version and performance is N% degraded on a single thread, is that really worth it?”.

What you’ve presented is a hypothetical, and it’s a hypothetical which can be checked. Without that it’s not at all clear that there’s a problem here.

But the general issue you’re gesturing at is common to standard libraries, it’s not Zig-specific. Stdlib code should perform correctly, with decent performance, in a wide range of circumstances. I think guarantees like “if the backing allocator is thread-safe, then the arena is thread-safe also” are useful to have in general-purpose code.

Anything can be misused; conversely, misuse is optional. Multi-threaded arena use can prevent false sharing by always asking for cache-aligned memory, and always in whole cache-line units, for instance. That’s a decision for user code to make.
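As a hedged sketch of that discipline (illustrative code, not anything from the stdlib; `std.atomic.cache_line` is the stdlib’s cache-line size constant):

```zig
const std = @import("std");

// Illustrative only: give each thread's hot data its own cache line via an
// align() annotation, so concurrent writes from different threads never
// invalidate each other's lines. The same discipline applies to arena
// requests: ask for cache-aligned memory in whole cache-line units.
const PerThread = struct {
    counter: std.atomic.Value(u64) align(std.atomic.cache_line) = std.atomic.Value(u64).init(0),
};
```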

But it’s certainly the case that, due to that very property, it’s often possible to beat the standard library with code optimized for one’s use case. For example, MemoryPool is generic by backing allocator, which is nice and flexible, but it means that preheats are requested one allocation at a time, and there are faster ways to do that. But those make assumptions about the allocator which aren’t safe to make about the Allocator interface.
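As an illustrative contrast (hypothetical code, not std.heap.MemoryPool’s actual implementation): a preheat written against the generic Allocator interface must make one call per node, while an allocator-specific version could grab a single slab:

```zig
const std = @import("std");

// Generic preheat: one backing-allocator call per node, because the Allocator
// interface promises nothing about contiguity or batch frees.
fn preheatGeneric(comptime Node: type, gpa: std.mem.Allocator, slots: []*Node) !void {
    for (slots) |*slot| slot.* = try gpa.create(Node);
}

// Slab preheat: one backing-allocator call total. Only safe under the extra
// assumption that the nodes may live and die together as one slab.
fn preheatSlab(comptime Node: type, gpa: std.mem.Allocator, n: usize) ![]Node {
    return try gpa.alloc(Node, n);
}
```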

6 Likes

To this point, I also expect a healthy ecosystem of alternative allocators to pop up in the Zig community. C already has plenty, and people use different ones depending on their taste. Zig will be similar. I expect the std library to have allocators that cover the blanket cases, the most common expected uses. Then the community can come up with others.

Like mnemnion said, it will come down to testing and providing data. And you can always copy the old ArenaAllocator code and use that directly.

5 Likes

If you want numbers:

In my hobby project (a simple OO language with GC based automatic memory management), one of the benchmark programs creates lots of objects on the heap.

Since my language probably will never share objects between threads, it was worth creating a custom allocator.

I wrote this in my commit message (it was a while ago):

- Use a custom allocator, which is the SMPAllocator, but stripped down for a single-threaded program.

This alone reduces the wall-clock time from 37s to 32s for test/benchmark/binary_trees.moin on my Windows machine.

I think it would be worth being able to tell the compiler that our binary will be single-threaded; then the allocators could detect that at comptime and avoid generating code for thread safety.

https://codeberg.org/tumirnix/exla/commit/1d3f466e0575a9217941c5109b243fc1c4174174

I’m the person who implemented the lock-free version of ArenaAllocator, and from my measurements (which I’ve included in the PR) there really isn’t any measurable difference for single-threaded usage, at least not in the Zig compiler codebase (which uses single-threaded arenas quite liberally) or in the benchmark I’ve included.
The hot path is one atomic RMW and one atomic CAS plus a bit of regular arithmetic, and often you won’t call alloc directly anyway: some data structure like an ArrayList sits in between, making calls to alloc even less frequent.
And if you do run into perf issues, the old implementation is only ~100 LOC (without the fancy reset stuff).
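For readers curious what such a hot path can look like, here is a minimal single-chunk sketch (not the actual stdlib code; it omits the CAS fallback that installs a fresh chunk when the current one runs out, and it ignores alignment):

```zig
const std = @import("std");

const Bump = struct {
    buf: []u8,
    offset: std.atomic.Value(usize) = std.atomic.Value(usize).init(0),

    // One atomic RMW reserves the bytes; the real implementation follows a
    // failed reservation with a CAS that swaps in a new chunk.
    fn alloc(self: *Bump, len: usize) ?[]u8 {
        const start = self.offset.fetchAdd(len, .monotonic);
        if (start + len > self.buf.len) return null; // chunk exhausted
        return self.buf[start .. start + len];
    }
};
```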

5 Likes

If there are performance issues in real projects caused by this it would also be feasible to do something similar to what FixedBufferAllocator does and expose both an allocator() and a threadSafeAllocator() function to provide two Allocator implementations that can operate on the same underlying data structure.
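For reference, here is roughly how that pattern reads with FixedBufferAllocator today (a sketch; both views share one underlying buffer):

```zig
const std = @import("std");

// allocator() skips synchronization; threadSafeAllocator() guards the same
// state with atomics. The suggestion above is an analogous pair on
// ArenaAllocator.
fn demo(buf: []u8) !void {
    var fba = std.heap.FixedBufferAllocator.init(buf);
    const fast = fba.allocator(); // single-threaded view
    const safe = fba.threadSafeAllocator(); // thread-safe view
    _ = try fast.alloc(u8, 16);
    _ = try safe.alloc(u8, 16);
}
```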

1 Like

FWIW, “multiple threads hammering on a single shared atomic is bad” is a qualitative thing that doesn’t really need a benchmark, imo, there’s a whole book on how to avoid that :rofl:

1 Like

You can do this with the build system, add .single_threaded = true to the build options.
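Roughly like this in build.zig (a sketch; field names have shifted between Zig versions, and the executable name here is made up):

```zig
// Sketch of a build.zig fragment: .single_threaded = true sets
// builtin.single_threaded for the whole binary, letting std code drop
// thread-safety machinery at comptime.
const exe = b.addExecutable(.{
    .name = "my_vm", // hypothetical name
    .root_source_file = b.path("src/main.zig"),
    .target = target,
    .optimize = optimize,
    .single_threaded = true,
});
```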

Does that change things?

This is much the opposite of “thread-safe code used in a single-threaded program”, however.

1 Like

This is related: `std.heap.SmpAllocator` asserts `!builtin.single_threaded` · Issue #25992 · ziglang/zig · GitHub

This is still unanswered, but maybe the answer is what I noticed: the SMPAllocator was developed with only multi-threaded use in mind?

It seemed to me back then that the SMP allocator is the most efficient for my use-case (because the Debug allocator contains overhead, an ArenaAllocator is only usable in special cases, a Page allocator is only useful as a backend for other allocators).

My use-case is of course not typical for Zig (an OO language VM with GC), I’m talking about millions of allocations in a tight loop.

I could try using comptime elimination of that code, maybe next weekend. If the results are promising, I’ll post them here.

2 Likes

This is probably a case where you want to take ownership over allocations then. Although as a separate matter, it seems to me that Zig should never “charge extra” for atomic operations when building single-threaded (does it? did we determine that? I guess if you can’t build SmpAllocator in single-threaded mode, that’s hard to answer, huh).

It’s not uncommon for language runtimes to provide their own malloc etc, they’re probably the most allocation sensitive kind of code one can write. Near the top at least.

Zig’s arenas, memory pools, and so on are not the final word on those designs. The benefit of composition, wrapping another allocator, is flexibility and generality; the cost is speed, and the rationale is that memory allocation “shouldn’t be hot”. In your case, that’s hard to avoid.

I didn’t realize that the stdlib basically doesn’t have a tuned-up allocator optimized for single-threaded use until just now. Maybe one of the third-party allocators on zigistry will work for you?

1 Like

Adding another qualitative idea here: I really like that when I jump to definition on a piece of zig code, the implementations are so simple and easy to understand. I agree it’s normal for programming language stdlibs to try to offer extreme flexibility / compromise, but I wouldn’t jump to the conclusion this is good only because it’s typical.

I think there’s a case to be made that ArenaAllocator and, e.g., a ConcurrentArenaAllocator should be broken into two structs for this readability/discoverability argument. I can imagine myself as a Zig learner reading these side by side and studying the differences.

1 Like