I noticed that the ArenaAllocator in the master branch has been changed to a lock-free thread-safe implementation. I have serious concerns about this change, so I would like to get more explanation.
Although I am a heavy user of ArenaAllocator, nearly 99% of my usage needs are on a single thread. I mainly use ArenaAllocator to amortize the allocation and deallocation overhead of the global allocator (including the thread-safety overhead of the global allocator) and to improve cache locality. The default thread-safe ArenaAllocator weakens its performance advantage. Even with a lock-free implementation, atomic operations still affect optimization.
Another point that makes me deeply skeptical of the default thread-safe ArenaAllocator is that I believe it encourages an anti-pattern: concurrent high-frequency writes to the same bump allocator. The ArenaAllocator is a bump allocator, where allocated memory is contiguous, so threads can easily end up touching the same cache line, leading to a false-sharing disaster. This is also the key reason why I almost never use ArenaAllocator across threads.
What's actually persuasive is a post which says "I've run the 0.15 ArenaAllocator head-to-head with the thread-safe 0.16 version and performance is N% degraded on a single thread, is that really worth it?".
What you've presented is a hypothetical, and it's a hypothetical which can be checked. Without that, it's not at all clear that there's a problem here.
But the general issue you're gesturing at is common to standard libraries; it's not Zig-specific. Stdlib code should perform correctly, with decent performance, in a wide range of circumstances. I think guarantees like "if the backing allocator is thread-safe, then the arena is thread-safe also" are useful to have in general-purpose code.
Anything can be misused; conversely, avoiding the misuse is optional work the user can take on. Multi-threaded arena use can prevent false sharing by always asking for cache-aligned memory, and always in whole cache-line units, for instance. That's a decision for user code to make.
But it's certainly the case that, due to that very property, it's often possible to beat the standard library with code optimized for one's use case. For example, MemoryPool is generic over its backing allocator, which is nice and flexible, but it means that preheats are requested one allocation at a time, and there are faster ways to do that. But those make assumptions about the allocator which aren't safe to make about the Allocator interface.
On this point, I also expect a healthy ecosystem of alternative allocators to pop up in the Zig community. C already has plenty, and people use different ones depending on their taste. Zig will be similar. I expect the std library to have allocators that cover the blanket cases, the most common expected uses. Then the community can come up with others.
Like mnemnion said, it will come down to testing and providing data. And you can always copy the old ArenaAllocator code and use that directly.
In my hobby project (a simple OO language with GC based automatic memory management), one of the benchmark programs creates lots of objects on the heap.
Since my language probably will never share objects between threads, it was worth creating a custom allocator.
I wrote this in my commit message (it was a while ago):
- Use a custom allocator, which is the SMPAllocator, but stripped down for a single-threaded program.
This alone reduces the wall-clock time from 37s to 32s
for test/benchmark/binary_trees.moin on my Windows machine.
I think it would be worthwhile to be able to tell the compiler that our binary will be single-threaded; the allocators could then detect that at comptime and avoid generating code for MT-safety.
I'm the person that implemented the lock-free version of ArenaAllocator, and from my measurements (which I've included in the PR) there really isn't any measurable difference for single-threaded usage, at least not in the Zig compiler codebase (which uses single-threaded arenas quite liberally) or in the benchmark I've included.
The hot path is one atomic RMW and one atomic CAS plus a bit of regular arithmetic, and oftentimes you won't call alloc directly anyway but will have some data structure like an ArrayList in between, which makes calls to alloc even less frequent.
And if you do run into perf issues, the old implementation is only ~100 LOC (without the fancy reset stuff).
If there are performance issues in real projects caused by this it would also be feasible to do something similar to what FixedBufferAllocator does and expose both an allocator() and a threadSafeAllocator() function to provide two Allocator implementations that can operate on the same underlying data structure.
FWIW, "multiple threads hammering on a single shared atomic is bad" is a qualitative thing that doesn't really need a benchmark, imo; there's a whole book on how to avoid that.
This is still unanswered, but maybe the answer is what I noticed: the SMPAllocator is developed with only MT in mind?
It seemed to me back then that the SMP allocator is the most efficient for my use-case (because the Debug allocator contains overhead, an ArenaAllocator is only usable in special cases, a Page allocator is only useful as a backend for other allocators).
My use-case is of course not typical for Zig (an OO language VM with GC); I'm talking about millions of allocations in a tight loop.
I could try using comptime elimination of that code, maybe next weekend. If the results are promising, I'll post them here.
This is probably a case where you want to take ownership over allocations then. Although as a separate matter, it seems to me that Zig should never "charge extra" for atomic operations when building single-threaded (does it? did we determine that? I guess if you can't build SmpAllocator in single-threaded mode, that's hard to answer, huh).
It's not uncommon for language runtimes to provide their own malloc etc; they're probably the most allocation-sensitive kind of code one can write. Near the top at least.
Zig's arenas and memory pools, and so on, aren't the final word on what such allocators can be. The benefit of composition, wrapping another allocator, is flexibility and generality; the cost is speed, and the rationale is that memory allocation "shouldn't be hot". In your case, that's hard to avoid.
I didn't realize until just now that the stdlib basically doesn't have a tuned-up allocator optimized for single-threaded use. Maybe one of the third-party allocators on zigistry will work for you?
Adding another qualitative idea here: I really like that when I jump to definition on a piece of Zig code, the implementations are so simple and easy to understand. I agree it's normal for programming language stdlibs to try to offer extreme flexibility and compromise, but I wouldn't jump to the conclusion that this is good only because it's typical.
I think there's a case to be made that ArenaAllocator and, e.g., a ConcurrentArenaAllocator should be broken into two structs for this readability/discoverability argument. I can imagine myself as a Zig learner reading these side by side and studying the differences.
One bitter lesson I've learned here is that, to a large degree, this is just a product of language age. I was able to say exactly that about Rust's std around 1.0, but it is significantly harder to read today due to the increased amount of optimization (especially the specialization stuff) and language lawyering (writing code in the form that passes Miri).
The converse, positive lesson: if you want to educate yourself by reading source code, don't read the current version; read the historical first public release, where the code works but is still simple.
Yeah, purpose-built code is always going to be simpler to follow and understand, as it can make the tradeoffs that suit it. Std containers and code that has to work in many different settings don't have that luxury, and it's more difficult to come up with a (good) solution. The current ArenaAllocator seems well engineered considering it still maintains the old performance characteristics.
FWIW, back in my C++ days I saw quite substantial differences updating the refcount in a custom shared-pointer implementation when switching between atomic operations and regular increment/decrement.
This was easily one or two decades back though; maybe on modern CPUs it doesn't matter anymore (yet at the same time, I would be careful about generalising across all modern CPUs).
(Still though, making thread-safety for high-frequency operations the general case just feels wrong to me. It should be opt-in, or at the least allow opting out, since usually interactions between threads should be low-frequency and (IMHO!) thread-safety should be ensured by the caller.)
The measurements I've taken were all on the same machine (my laptop), so I definitely see that point (and I really respect your opinion in general).
Still, I think this whole discussion is a bit premature. It's not like "thread-safety by default" is something that comes with using the Zig language; it's quite the opposite: if you think that a different implementation would be better for the code you're working on, and ideally have the measurements to back that up, the Allocator interface makes it very easy to just not use std.heap.ArenaAllocator in your code.
In terms of versatility, a single-threaded arena is a more specialized piece of code than a thread-safe one, so I think the latter is a better fit for a standard library. And IMO, with the arrival of std.process.Init, it's just very convenient for the "default" init.arena to be thread-safe.
If it turns out that there are severe performance problems on some machines even with limited use, there's still plenty of time until 1.0 to add the old implementation back; however, I'd like to see actual measurements supporting that decision.
I would expect the primary impact here to be not on the CPU side, but on the compiler side. Atomic operations generally inhibit optimizations around them. E.g.:
var a: i32 = 0;
var b: i32 = 0;
export fn f() void {
a += 1;
b += 1;
a += 1;
}
export fn g() void {
a += 1;
_ = @atomicRmw(i32, &b, .Add, 1, .acq_rel);
a += 1;
}
The atomic there prevents folding the two increments into one, which, again, matters not in and of itself, but as a signal that the compiler is loath to reason about memory state across atomics.
Very much agree with both major points, but I'd vote for the general name, ArenaAllocator, to be the general-purpose (thread-safe) one, and for another name, like NonthreadedArenaAllocator (or I'm sure there are better name options), to be given to the special simple case, for purely sequential use. Then the less studious code writer will simply choose the basic name, ArenaAllocator, and later, when parallelizing the code, can safely ignore details like this. (Or, alternately, when optimizing the code for a single thread/processor situation, will study the options and find the NonthreadedArenaAllocator to be available.)
Not entirely. I've learned that when I want some insight into a libc function, I should read musl, because glibc is going to be some monstrous beast of a macro-injected rat's nest, which obscures more than it illuminates. But musl will give me the goods. The C language is the same age either way.
So culture matters too. Zig is designed around legibility, Rust is… not, and Zig's culture favors it as well. So we have that going for us. It's only mitigating; the general point that optimization tends to make code harder to understand is sound. But I also remember reading just-pre-1.0 Rust code from the standard library, and… we're starting from a higher baseline of legibility IMHO.
Rust syntax does a lot. It has to, to be Rust, but it's not easy to look at. Someone should do some blog posts comparing and contrasting the syntax of both languages, because I think there's some wisdom there. I'd probably conclude that there's a reason I find Zig easier to read, and it might even stay that way, who knows.
My take on this is that the performance of std library functions is considered so critical, and must be optimal in so many use cases, that readability becomes a lesser goal, especially over time.
For my app code, I find readability is more a function of whether I spend the time to make the code clear before moving on, rather than the language. It really helps to have a language like Zig that is clear and simple, but the main burden is still on me.