Do we really need a thread-safe ArenaAllocator by default?

I noticed that the ArenaAllocator in the master branch has been changed to a lock-free thread-safe implementation. I have serious concerns about this change, so I would like to get more explanation.

Although I am a heavy user of ArenaAllocator, nearly 99% of my usage needs are on a single thread. I mainly use ArenaAllocator to amortize the allocation and deallocation overhead of the global allocator (including the thread-safety overhead of the global allocator) and to improve cache locality. The default thread-safe ArenaAllocator weakens its performance advantage. Even with a lock-free implementation, atomic operations still affect optimization.

I am also deeply skeptical of the default thread-safe ArenaAllocator because I believe it encourages an anti-pattern: concurrent high-frequency writes to the same bump allocator. The ArenaAllocator is a bump allocator whose allocated memory is contiguous, so threads can easily touch the same cache line, leading to a false-sharing disaster. This is also the key reason why I almost never use ArenaAllocator across threads.
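To make the concern concrete, here is a hedged sketch (illustrative only, not taken from the std implementation) of why consecutive bump allocations invite false sharing: two objects created back-to-back from the same arena will typically sit on the same cache line, so writes from two threads would ping-pong that line between cores.

```zig
const std = @import("std");

// Illustrative sketch: consecutive arena allocations are adjacent,
// so two independently-owned counters can share one cache line.
pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const a = arena.allocator();

    const x = try a.create(u64); // bump allocation
    const y = try a.create(u64); // usually lands right next to x

    // If thread 1 hammers x.* and thread 2 hammers y.*, both writes
    // hit the same cache line: false sharing, even with no data race.
    std.debug.print("x at 0x{x}, y at 0x{x}\n", .{ @intFromPtr(x), @intFromPtr(y) });
}
```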

6 Likes

What would actually be persuasive is a post that says “I’ve run the 0.15 ArenaAllocator head-to-head with the thread-safe 0.16 version, and performance is N% degraded on a single thread; is that really worth it?”.

What you’ve presented is a hypothetical, and it’s a hypothetical which can be checked. Without that it’s not at all clear that there’s a problem here.

But the general issue you’re gesturing at is common to standard libraries, it’s not Zig-specific. Stdlib code should perform correctly, with decent performance, in a wide range of circumstances. I think guarantees like “if the backing allocator is thread-safe, then the arena is thread-safe also” are useful to have in general-purpose code.

Anything can be misused; conversely, misuse is optional. Multi-threaded arena use can prevent false sharing by always requesting cache-aligned memory, and always in whole cache-line units, for instance. That’s a decision for user code to make.
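As a hedged sketch of that user-code decision (the wrapper name here is made up for illustration), padding values out to whole cache lines keeps adjacent arena allocations on separate lines:

```zig
const std = @import("std");

// Illustrative only: aligning a field to the cache line rounds the
// struct's size up to a whole line, so two of these allocated
// back-to-back from an arena never share a cache line.
fn CacheAligned(comptime T: type) type {
    return struct {
        value: T align(std.atomic.cache_line),
    };
}

// Usage sketch: give each thread its own padded slot.
// const slot = try arena.allocator().create(CacheAligned(u64));
```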

But it’s certainly the case that, due to that very property, it’s often possible to beat the standard library with code optimized for one’s use case. For example, MemoryPool is generic by backing allocator, which is nice and flexible, but it means that preheats are requested one allocation at a time, and there are faster ways to do that. But those make assumptions about the allocator which aren’t safe to make about the Allocator interface.

9 Likes

To this point, I also expect a healthy ecosystem of alternative allocators to pop up in the Zig community. C already has plenty, and people use different ones depending on their taste. Zig will be similar. I expect the std library to have allocators covering the blanket cases, the most common expected uses. Then the community can come up with others.

Like mnemnion said, it will come down to testing and providing data. And you can always copy the old ArenaAllocator code and use that directly.

6 Likes

If you want numbers:

In my hobby project (a simple OO language with GC based automatic memory management), one of the benchmark programs creates lots of objects on the heap.

Since my language probably will never share objects between threads, it was worth creating a custom allocator.

I wrote this in my commit message (it was a while ago):

- Use a custom allocator, which is the SMPAllocator, but stripped down for a single-threaded program.

This alone reduces the wall-clock time from 37s to 32s for test/benchmark/binary_trees.moin on my Windows machine.

I think it would be worthwhile to be able to tell the compiler that our binary will be single-threaded; then the allocators could detect that at comptime and avoid generating code for MT-safety.

https://codeberg.org/tumirnix/exla/commit/1d3f466e0575a9217941c5109b243fc1c4174174

2 Likes

I’m the person that’s implemented the lock-free version of ArenaAllocator and from my measurements (which I’ve included in the PR) there really isn’t any measurable difference for single-threaded usage, at least not in the Zig compiler codebase (which uses single-threaded arenas quite liberally) or in the benchmark I’ve included.
The hot-path is one atomic RMW and one atomic CAS plus a bit of regular arithmetic, and oftentimes you won’t call alloc directly anyway but have some data structure like an ArrayList in-between which will make calls to alloc even less frequent.
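For readers who haven’t opened the PR, a hedged sketch of the general shape of a lock-free bump hot path (an illustration, not the actual std code; the CAS mentioned above would live in the slow path that installs a new chunk):

```zig
const std = @import("std");

// Illustrative sketch: one atomic fetchAdd claims bytes from the
// current chunk; exhaustion falls through to a slow path (not shown)
// that would CAS in a fresh chunk.
const Chunk = struct {
    buf: []u8,
    offset: std.atomic.Value(usize) = .{ .raw = 0 },

    fn tryAlloc(self: *Chunk, len: usize) ?[*]u8 {
        const start = self.offset.fetchAdd(len, .monotonic);
        if (start + len > self.buf.len) return null; // slow path takes over
        return self.buf.ptr + start;
    }
};
```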
And if you do run into perf issues, the old implementation is only ~100 LOC (without the fancy reset stuff).

9 Likes

If there are performance issues in real projects caused by this it would also be feasible to do something similar to what FixedBufferAllocator does and expose both an allocator() and a threadSafeAllocator() function to provide two Allocator implementations that can operate on the same underlying data structure.
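For reference, FixedBufferAllocator already exposes both views over the same underlying buffer, roughly like this:

```zig
const std = @import("std");

pub fn main() !void {
    var buf: [1024]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buf);

    const fast = fba.allocator(); // single-threaded, no atomics
    const safe = fba.threadSafeAllocator(); // shareable across threads

    _ = try fast.alloc(u8, 16);
    _ = try safe.alloc(u8, 16);
}
```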

2 Likes

FWIW, “multiple threads hammering on a single shared atomic is bad” is a qualitative thing that doesn’t really need a benchmark, imo; there’s a whole book on how to avoid that :rofl:

4 Likes

You can do this with the build system, add .single_threaded = true to the build options.
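A hedged build.zig fragment showing the option (the exact field placement has moved around between Zig versions, so check your release):

```zig
// In build.zig: a single-threaded build makes builtin.single_threaded
// comptime-true, so std code can drop its MT paths entirely.
const exe = b.addExecutable(.{
    .name = "app",
    .root_source_file = b.path("src/main.zig"),
    .target = target,
    .optimize = optimize,
    .single_threaded = true,
});
```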

Does that change things?

This is much the opposite of “thread-safe code used in a single-threaded program”, however.

1 Like

This is related: `std.heap.SmpAllocator` asserts `!builtin.single_threaded` · Issue #25992 · ziglang/zig · GitHub

This is still unanswered, but maybe the answer is what I noticed: the SMPAllocator is developed with only MT in mind?

It seemed to me back then that the SMP allocator is the most efficient for my use-case (because the Debug allocator contains overhead, an ArenaAllocator is only usable in special cases, a Page allocator is only useful as a backend for other allocators).

My use-case is of course not typical for Zig (an OO language VM with GC), I’m talking about millions of allocations in a tight loop.

I could try using comptime elimination of that code, maybe next weekend. If the results are promising, I’ll post them here.
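A hedged sketch of the comptime elimination idea (illustrative only, not the actual SmpAllocator code):

```zig
const std = @import("std");
const builtin = @import("builtin");

// Illustrative: branch on builtin.single_threaded at comptime so the
// atomic path is never compiled into a single-threaded binary.
fn bumpAdd(counter: *usize, n: usize) usize {
    if (builtin.single_threaded) {
        const old = counter.*;
        counter.* = old + n; // plain load/store, no lock prefix
        return old;
    } else {
        return @atomicRmw(usize, counter, .Add, n, .monotonic);
    }
}
```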

2 Likes

This is probably a case where you want to take ownership over allocations then. Although as a separate matter, it seems to me that Zig should never ā€œcharge extraā€ for atomic operations when building single-threaded (does it? did we determine that? I guess if you can’t build SmpAllocator in single-threaded mode, that’s hard to answer, huh).

It’s not uncommon for language runtimes to provide their own malloc etc, they’re probably the most allocation sensitive kind of code one can write. Near the top at least.

Zig’s arenas, memory pools, and so on aren’t the final word on what they can be. The benefit of composition, wrapping another allocator, is flexibility and generality; the cost is speed, and the rationale is that memory allocation “shouldn’t be hot”. In your case, that’s hard to avoid.

I didn’t realize that the stdlib basically doesn’t have a tuned-up allocator optimized for single-threaded use until just now. Maybe one of the third-party allocators on zigistry will work for you?

1 Like

Adding another qualitative idea here: I really like that when I jump to definition on a piece of zig code, the implementations are so simple and easy to understand. I agree it’s normal for programming language stdlibs to try to offer extreme flexibility / compromise, but I wouldn’t jump to the conclusion this is good only because it’s typical.

I think there’s a case to be made that ArenaAllocator and, e.g., a ConcurrentArenaAllocator should be broken into two structs for this readability/discoverability argument. I can imagine myself as a Zig learner reading these side by side and studying the differences.

10 Likes

One bitter lesson I’ve learned here is that, to a large degree, this is just a product of language age. I was able to say exactly that about Rust’s std around 1.0, but it is significantly harder to read today due to the increased amount of optimization (especially the specialization stuff) and language lawyering (writing code in the form that passes Miri).

The converse, positive lesson: if you want to educate yourself by reading source code, don’t read the current version, read the historical first public release, where the code works, but is still simple.

2 Likes

Yeah, purpose-built code is always going to be simpler to follow and understand, as it can make the tradeoffs that support it. Std containers and code that has to work in many different settings don’t have that luxury, and it’s more difficult to come up with a (good) solution. The current ArenaAllocator seems well engineered, considering it still maintains the old performance characteristics.

FWIW, back in my C++ days I saw quite substantial performance differences in a custom shared-pointer implementation when switching the refcount updates between atomic operations and regular increment/decrement.

This was easily 1..2 decades back though, maybe on modern CPUs it doesn’t matter anymore (yet at the same time, I would be careful about generalising across all modern CPUs).

(still though, making thread-safety for high-frequency operations the general case just feels wrong to me, it should be opt-in, or at the least allow to opt-out - since usually interactions between threads should be low-frequency and (IMHO!) thread-safety should be ensured by the caller)

2 Likes

The measurements I’ve taken were all taken on the same machine (my laptop) so I definitely see that point (and I really respect your opinion in general).

Still I think this whole discussion is a bit premature. It’s not like ā€˜thread-safety by default’ is something that comes with using the Zig language, it’s quite the opposite: if you think that a different implementation would be better for the code you’re working on and ideally have the measurements to back that up, the Allocator interface makes it very easy to just not use std.heap.ArenaAllocator in your code.

In terms of versatility, a single-threaded arena is a more specialized piece of code than a thread-safe one, so I think the latter is a better fit for a standard library. And IMO with the arrival of std.process.Init it’s just very convenient for the ‘default’ init.arena to be thread-safe.

If it turns out that there are severe performance problems on some machines even with limited use, there’s still plenty of time until 1.0 to add the old implementation back, however I’d like to see actual measurements supporting that decision.

1 Like

I would expect the primary impact here to be not on the CPU side, but on the compiler side. Atomic operations generally inhibit optimizations around them. E.g.:

var a: i32 = 0;
var b: i32 = 0;

export fn f() void {
    a += 1;
    b += 1;
    a += 1;
}

export fn g() void {
    a += 1;
    _ = @atomicRmw(i32, &b, .Add, 1, .acq_rel);
    a += 1;
}
f:
        push    rbp
        mov     rbp, rsp
        inc     dword ptr [rip + example.b]
        add     dword ptr [rip + example.a], 2
        pop     rbp
        ret

g:
        push    rbp
        mov     rbp, rsp
        inc     dword ptr [rip + example.a]
        lock            inc     dword ptr [rip + example.b]
        inc     dword ptr [rip + example.a]  // Second load/store round-trip
        pop     rbp
        ret

An atomic there prevents folding two increments into one, which, again, matters not in and of itself, but as a signal that the compiler is loath to reason about memory state across atomics.

2 Likes

Very much agree with both major points, but I’d vote for the general name, ArenaAllocator, to be the general-purpose (thread-safe) one, and for another name, like NonthreadedArenaAllocator (I’m sure there are better name options), to be given to the special simple case, for purely sequential use. Then the less studious code writer will simply choose the basic name, ArenaAllocator, and later, when parallelizing the code, can safely ignore details like this. (Or, alternately, when optimizing the code for a single-thread/single-processor situation, will study the options and find NonthreadedArenaAllocator to be available.)

2 Likes

Not entirely. I’ve learned that when I want some insight into a libc function, I should read musl, because glibc is going to be some monstrous beast of a macro-injected rat’s nest, which obscures more than it illuminates. But musl will give me the goods. The C language is the same age either way.

So culture matters too. Zig is designed around legibility; Rust is… not, though its culture does favor it. So we have that going for us. It’s only mitigating, though: the general point that optimization tends to make code harder to understand is sound. But I also remember reading just-pre-1.0 Rust code from the standard library, and… we’re starting from a higher baseline of legibility IMHO.

Rust syntax does a lot. It has to, to be Rust, but it’s not easy to look at. Someone should do some blog posts comparing and contrasting the syntax of both languages, because I think there’s some wisdom there. I’d probably conclude that there’s a reason I find Zig easier to read, and it might even stay that way, who knows.

7 Likes

My take on this is that the performance of std library functions is considered so critical, and must be optimal in so many use cases, that readability becomes a lesser goal, especially over time.

For my app code, I find readability is more a function of whether I spend the time to make the code clear before moving on, rather than the language. It really helps to have a language like Zig that is clear and simple, but the main burden is still on me.

2 Likes