Allocators in Zig and what can be better in another language

Sorry if this isn’t the right category, I think it might fit better in brainstorming but I can’t post there.

I’m working on an allocator API for Mojo, and I wanted to get thoughts and opinions from (1) other systems programmers and (2) the community around what I think may be one of the only widely used custom allocator APIs. Most of my work is in places where the lack of async (or another facility for managing coroutines) causes some fairly substantial problems, so I haven’t been able to use Zig as much as I would like. Instead, most of my time is spent with C++ (mostly when I need to talk to GPUs) or Rust (where I feel the pain of the lack of a stable allocator API fairly deeply), with some C SIMD kernels sprinkled in because WG21 still hasn’t gotten around to adding restrict to ISO C++. So, I need to lean on all of you for how Zig has managed custom allocators, since I tend to write what I need and I don’t have people re-using my code most of the time.

Background on Mojo

Mojo takes most of its inspiration from Rust, C++ and Swift, with the end goal for its type system looking like a mixture of Zig and a theorem prover like Lean 4. The general consensus is that we’re going to have a reflection API that looks like Zig’s comptime, and aside from struct manipulation many of the capabilities of comptime already exist in Mojo (ex: you can make a List of types, iterate over it, and make a HashTable from the second type to the fifth type). However, Mojo is leaning on algebraic types to integrate better with how most programmers expect types to work, and because they are more fleshed out in type theory, which gives us a solid theoretical foundation to eventually specify things like “any function which does not perform blocking syscalls and does not take any locks” as a function argument and have the compiler enforce that. Mojo also has linear types, which can be used to implement pointers that you must eventually free for the program to compile.

If you go look at Mojo, you’ll see a lot of AI branding, because “we can do AI better” is probably the best/only way to get VCs to pay for a new systems programming language. It is a general-purpose language that’s designed to deal with heterogeneous hardware, and it has the very attractive quality of trying to make the compiler as “dumb” as possible. By “dumb” I mean that the autovectorization pass in LLVM, which has many, many, many issues, is disabled, and the compiler does a more or less direct mapping from SIMD[DType.float32, 16].__add__ to vaddps ymm, ymm, ymm; if you put that in a loop, the compiler won’t get in your way and try to use the potentially slower (as on Skylake-X) AVX-512 version. This is a very nice property for high-performance code, since we can be generic over SIMD width, and when you say “I want the AVX2 version”, you get the AVX2 version even on an AVX-512-capable CPU, with a single implementation of the kernel.

Mojo is also in an odd place where it can do GPU compute and SIMD very well, but we’re mostly calling out to Python or directly to libc for IO.

Things I’m already planning to address

VTables only when you need them

Algebraic types mean that Mojo can easily devirtualize everything that’s known at compile time. This means that if you just want to call out to libc malloc/realloc/free then the allocator struct is a zero-sized type. If you need a bit of state, you can store just that state and not the vtable. If you need runtime flexibility, then trait objects give you a vtable.
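To make the devirtualization point concrete, here is a rough Rust analogue (all trait and type names are hypothetical, not Mojo's actual API): a stateless allocator is a zero-sized type, and generic dispatch means no vtable exists unless you explicitly opt into dynamic dispatch with `dyn`.

```rust
// Sketch: a zero-sized allocator type that statically dispatches to the
// global (libc-backed) allocator. No state, no vtable.
use std::alloc::{alloc, dealloc, Layout};

pub trait RawAllocator {
    fn allocate(&self, layout: Layout) -> *mut u8;
    /// Safety: `ptr` must come from `allocate` with the same `layout`.
    unsafe fn deallocate(&self, ptr: *mut u8, layout: Layout);
}

/// Zero-sized: a container parameterized over `Libc` pays nothing to carry it.
pub struct Libc;

impl RawAllocator for Libc {
    fn allocate(&self, layout: Layout) -> *mut u8 {
        unsafe { alloc(layout) }
    }
    unsafe fn deallocate(&self, ptr: *mut u8, layout: Layout) {
        dealloc(ptr, layout)
    }
}
```

A container generic over `A: RawAllocator` monomorphizes away the indirection entirely; only `&dyn RawAllocator` would bring a vtable back.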

Breaking Up Capabilities

One pattern I frequently find myself implementing for IO buffer recycling is to have a MPMC queue of free buffers. Being able to use a consumer handle on a queue like that as an allocator would be great, but the consumer handles can’t deallocate. So, this means that Allocator, Reallocator and Deallocator are separate traits. This does produce a bit of a mess when implementing an allocator since most allocators should do alloc + free just fine, but I’m leaning towards the venerable tradition of “writing reusable library code is harder so that application code is easier to write”.
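A minimal Rust sketch of that capability split, with hypothetical trait and type names: the consumer handle implements only the allocation trait, so the inability to deallocate is enforced by the type system rather than by convention. (A `Mutex<VecDeque>` stands in for a real MPMC queue.)

```rust
// Sketch: splitting allocation and deallocation capabilities so a
// free-buffer-queue consumer can act as an allocator without being
// able to free.
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

pub trait BufAlloc { fn alloc(&self) -> Option<Vec<u8>>; }
pub trait BufDealloc { fn dealloc(&self, buf: Vec<u8>); }

#[derive(Clone)]
pub struct FreeList(pub Arc<Mutex<VecDeque<Vec<u8>>>>);

/// Consumer side: can only take buffers out of the queue.
pub struct Consumer(pub FreeList);
impl BufAlloc for Consumer {
    fn alloc(&self) -> Option<Vec<u8>> {
        self.0 .0.lock().unwrap().pop_front()
    }
}

/// Producer side: can only put buffers back.
pub struct Producer(pub FreeList);
impl BufDealloc for Producer {
    fn dealloc(&self, buf: Vec<u8>) {
        self.0 .0.lock().unwrap().push_back(buf)
    }
}
```

A `Consumer` never implements `BufDealloc`, so handing one to library code statically rules out frees through it.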

Parameterizing Allocators by Type

This is one I’m still on the fence about, but I think the best way to handle things like arena/slab allocators which have particular type requirements is to make it so that Allocator[T] means you can allocate a T, and then that is implemented for all of the types the allocator can provide. For example, a slab allocator might implement Allocator[T] for all T which are 64 bytes, which can be done via generics (extension[T] Allocator[T] for SlabAllocator[64] requires sizeof[T]() == 64: ...)
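Stable Rust can't quite express the `requires sizeof[T]() == 64` clause, so this sketch (hypothetical names throughout) falls back to a runtime assert where Mojo would reject the instantiation at compile time; the shape of the typed API is the point, not the check.

```rust
// Sketch: a slab allocator over fixed-size blocks with a typed interface.
// Alignment is ignored for brevity; a real slab would align its blocks.
pub struct SlabAllocator<const BLOCK: usize> {
    blocks: Vec<[u8; BLOCK]>,
    next: usize,
}

impl<const BLOCK: usize> SlabAllocator<BLOCK> {
    pub fn new(capacity: usize) -> Self {
        Self { blocks: vec![[0u8; BLOCK]; capacity], next: 0 }
    }
}

pub trait TypedAlloc<T> {
    fn alloc_one(&mut self) -> Option<*mut T>;
}

// "Implemented for all T the allocator can provide": in Mojo this would be
// a compile-time `requires sizeof[T]() == BLOCK`; the assert is a stand-in.
impl<T, const BLOCK: usize> TypedAlloc<T> for SlabAllocator<BLOCK> {
    fn alloc_one(&mut self) -> Option<*mut T> {
        assert_eq!(std::mem::size_of::<T>(), BLOCK, "T must exactly fill a block");
        let block = self.blocks.get_mut(self.next)?;
        self.next += 1;
        Some(block.as_mut_ptr() as *mut T)
    }
}
```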

Integrating Ownership

Mojo has a Rust-like ownership system, although it’s designed to be less annoying. This means that the type system can express “Allocator which can only allocate once” as a zero-overhead abstraction. This can be used to turn a buffer into an “allocator” for things which only allocate once with no overhead over having a version which just takes a buffer. With some type system tricks, this can be expanded to any number of allocations.
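Rust's move semantics can approximate the linear-types version of this: if `alloc` takes the allocator by value, a second allocation is a compile error. A hypothetical sketch:

```rust
// Sketch: an "allocator" that can only allocate once, because allocating
// consumes it. The abstraction compiles down to just handing you the buffer.
pub struct OnceAlloc<const N: usize> {
    pub buf: [u8; N],
}

impl<const N: usize> OnceAlloc<N> {
    // Taking `self` by value means a second `alloc` call is rejected by
    // the compiler: the allocator has been moved out of.
    pub fn alloc(self) -> [u8; N] {
        self.buf
    }
}
```

Calling `a.alloc()` twice on the same `a` fails with a use-of-moved-value error, which is the "can only allocate once" guarantee, at zero runtime cost.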

Algebraic Effects

Some allocators are infallible (like a single buffer that only handles one allocation of a comptime known size), others are not (malloc). In the case where the allocator is infallible, error handling is unwelcome overhead. There are also some cases in distributed systems where it is actually useful to have an async allocator (Distributed shared memory via RDMA or CXL). Later on, this will likely get expanded to handle whether the allocator can block, whether it takes locks, and other concerns for various areas of systems programming alongside the language-level effect system.
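One way to model the infallible case without a full effect system is to put fallibility in an associated type, as in this hypothetical Rust sketch: `Infallible` has no values, so the compiler knows the error branch is dead and callers pay nothing for "handling" it.

```rust
// Sketch: fallibility encoded in the allocator's type rather than tracked
// as an effect.
use std::convert::Infallible;

pub trait FallibleAlloc {
    type Error;
    fn alloc(&mut self, len: usize) -> Result<Vec<u8>, Self::Error>;
}

/// Stand-in for an infallible allocator (e.g. a single pre-reserved buffer
/// with a comptime-known request size). `Error = Infallible` is uninhabited,
/// so the Err arm is statically unreachable.
pub struct AlwaysSucceeds;

impl FallibleAlloc for AlwaysSucceeds {
    type Error = Infallible;
    fn alloc(&mut self, len: usize) -> Result<Vec<u8>, Infallible> {
        Ok(vec![0u8; len])
    }
}
```

An effect system generalizes this: async, blocking, and lock-taking would each get the same "statically absent when the allocator doesn't need it" treatment.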

Things I’m looking for

I know that there are all sorts of weird things out there, so let me know what kinds of weird things you’ve had to do to allocate memory that Zig’s normal allocator API can’t handle. Anything with a program counter is in bounds (especially weird AI/ML accelerators), and FPGAs are “nice to have” since Mojo can technically use CIRCT, but the meaning of “allocate memory” is a bit hazy when using HLS.

What do you think Zig’s allocator API is missing?

Any thoughts on my current plans?

Thanks for any feedback you can provide!


check this long post

In short, I’d like to get the following information from an Allocator:

  • thread safety
  • lifetime
  • reuse after free

Right now I have no way of knowing whether the allocator provided by a caller supports any of the above, so I have to write a comment or use gpa as the identifier for the allocator in order to hint at the required functionality.

This is phrased as if using VC money is the only way to make a language happen, and the Zig community is probably the worst place to start a discussion based on that assumption. It also sounds like you’re tricking the VCs into doing your bidding (“to get VCs to pay”) instead of all collaboratively chasing an exit, and that’s another questionable statement. Not because you shouldn’t “trick” VCs, but because it’s nowhere near as easy as you make it sound, and if I had to bet money, I would bet all of it on Mojo eventually doing exactly what the VCs want it to do. Not because I have anything against Mojo, or because I deem you guys incapable, but because statistically this is where almost all tech infra startups without an obviously good business model end up.

I dunno, this is the vtable, what do you think is missing?

pub const VTable = struct {
    /// Return a pointer to `len` bytes with specified `alignment`, or return
    /// `null` indicating the allocation failed.
    ///
    /// `ret_addr` is optionally provided as the first return address of the
    /// allocation call stack. If the value is `0` it means no return address
    /// has been provided.
    alloc: *const fn (*anyopaque, len: usize, alignment: Alignment, ret_addr: usize) ?[*]u8,

    /// Attempt to expand or shrink memory in place.
    ///
    /// `memory.len` must equal the length requested from the most recent
    /// successful call to `alloc`, `resize`, or `remap`. `alignment` must
    /// equal the same value that was passed as the `alignment` parameter to
    /// the original `alloc` call.
    ///
    /// A result of `true` indicates the resize was successful and the
    /// allocation now has the same address but a size of `new_len`. `false`
    /// indicates the resize could not be completed without moving the
    /// allocation to a different address.
    ///
    /// `new_len` must be greater than zero.
    ///
    /// `ret_addr` is optionally provided as the first return address of the
    /// allocation call stack. If the value is `0` it means no return address
    /// has been provided.
    resize: *const fn (*anyopaque, memory: []u8, alignment: Alignment, new_len: usize, ret_addr: usize) bool,

    /// Attempt to expand or shrink memory, allowing relocation.
    ///
    /// `memory.len` must equal the length requested from the most recent
    /// successful call to `alloc`, `resize`, or `remap`. `alignment` must
    /// equal the same value that was passed as the `alignment` parameter to
    /// the original `alloc` call.
    ///
    /// A non-`null` return value indicates the resize was successful. The
    /// allocation may have same address, or may have been relocated. In either
    /// case, the allocation now has size of `new_len`. A `null` return value
    /// indicates that the resize would be equivalent to allocating new memory,
    /// copying the bytes from the old memory, and then freeing the old memory.
    /// In such case, it is more efficient for the caller to perform the copy.
    ///
    /// `new_len` must be greater than zero.
    ///
    /// `ret_addr` is optionally provided as the first return address of the
    /// allocation call stack. If the value is `0` it means no return address
    /// has been provided.
    remap: *const fn (*anyopaque, memory: []u8, alignment: Alignment, new_len: usize, ret_addr: usize) ?[*]u8,

    /// Free and invalidate a region of memory.
    ///
    /// `memory.len` must equal the length requested from the most recent
    /// successful call to `alloc`, `resize`, or `remap`. `alignment` must
    /// equal the same value that was passed as the `alignment` parameter to
    /// the original `alloc` call.
    ///
    /// `ret_addr` is optionally provided as the first return address of the
    /// allocation call stack. If the value is `0` it means no return address
    /// has been provided.
    free: *const fn (*anyopaque, memory: []u8, alignment: Alignment, ret_addr: usize) void,
};

Thread safety and allocator lifetimes are enforced by the compiler as they are with anything else (Mojo takes that much from Rust). Allocators that hand out references are an interesting idea, since it would force you to free everything before the allocator is destroyed, and Mojo’s UnsafePointer does have a place to put an Origin (similar to a lifetime in Rust) if you want to bind pointers to the lifetime of the allocator. Reusing after a free would probably get modeled by ownership, unless you’re thinking of something like bump allocators, where frees don’t really give you space back. I’m not sure how I would model that.
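Here is a rough Rust sketch of the lifetime-binding idea (Rust lifetimes standing in for Mojo Origins): a bump arena hands out borrows of itself, so the borrow checker rejects any use after the arena is gone, and individual frees simply don't exist. For brevity the exclusive `&mut self` borrow only permits one live allocation at a time; a real arena would use interior mutability.

```rust
// Sketch: a bump allocator whose handed-out slices are lifetime-bound to it.
pub struct Bump {
    buf: Vec<u8>,
    used: usize,
}

impl Bump {
    pub fn new(cap: usize) -> Self {
        Self { buf: vec![0; cap], used: 0 }
    }

    // The returned slice borrows the arena, so it cannot outlive it, and
    // there is no free: dropping the arena reclaims everything at once.
    pub fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        let start = self.used;
        let end = start.checked_add(len)?;
        if end > self.buf.len() {
            return None; // bump arenas never hand space back on "free"
        }
        self.used = end;
        Some(&mut self.buf[start..end])
    }
}
```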


If you go look at Mojo, you’ll see a lot of AI branding, because “We can do AI better” is probably the best/only way to get VCs to pay for a new systems programming language.

This is phrased as if using VC money is the only way to make a language happen, and the Zig community is probably the worst place to start a discussion based on that assumption. It also sounds like you’re tricking the VCs into doing your bidding (“to get VCs to pay”) instead of all collaboratively chasing an exit, and that’s another questionable statement. Not because you shouldn’t “trick” VCs, but because it’s nowhere near as easy as you make it sound, and if I had to bet money, I would bet all of it on Mojo eventually doing exactly what the VCs want it to do. Not because I have anything against Mojo, or because I deem you guys incapable, but because statistically this is where almost all tech infra startups without an obviously good business model end up.

What I meant was that it’s difficult to hire a few dozen people to work on a language full time without VC money. And the VCs are getting what they want: MAX, the AI inference framework built on top of Mojo, is faster than PyTorch by a pretty good margin, and it compared very favorably to TensorRT on Nvidia GPUs back when we could still publish benchmarks against TensorRT, while still being portable. Modular makes money by offering AI inference infrastructure more cheaply than other options, either directly or through a partner who uses MAX, and letting the fact that inference of open-weight models is a commodity drive people to them, since Modular can easily pivot to the cheapest hardware per token to keep prices low. Mojo is a tool in service of that goal, but the best way to make it serve that goal is to make it a very capable systems language.

I dunno, this is the vtable, what do you think is missing?

  • There’s no way to express an allocator that can only hand out memory slices of a given size, or for a given type, at comptime, meaning you have to have an error-handling path for “the user passed in a request I can’t satisfy”.
  • Infallible allocators still pay the overhead of error handling.
  • For arena allocators, resize/remap often make no sense.
  • I can’t make any of the functions async if I’m in a distributed shared-memory environment where it might take 20ms to ask a coordinator node for some memory, or to return said memory and get error information.
  • This is more language-level, but Zig doesn’t have a good way to make sure you free with the right kind of allocator. In a language with linear types, I can “name” an allocator at compile time with some unique ID and then make it return pointers which have that ID embedded in them. If I try to free one of those pointers with the wrong type of allocator, or one of the same type but with the wrong ID, I get a compiler error.
  • Which leads into the next one: some allocators have other kinds of pointers they want to use, such as wide pointers in HPC (e.g. a (node_id, ptr) tuple), any of the mechanisms used to implement highmem on Linux to allow for >4GB of memory on a 32-bit system, or pointers which use type-system features for enhanced safety without a runtime cost. The vtable doesn’t account for things which look “sufficiently pointer-like” to be used as a pointer for all intents and purposes. Another example would be Boost’s offset_ptr, which is often used in environments with multiple processes that want to share memory but not necessarily map it at the same address. An allocator which distributes that memory should probably hand out something like that.
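As a concrete illustration of the last point, here is a minimal offset pointer in the spirit of (but not API-compatible with) Boost’s offset_ptr: it stores a base-relative offset instead of an address, so each process resolves it against its own mapping of the shared region. An allocator API that wants to hand these out needs an associated pointer type rather than a hardcoded `[*]u8`.

```rust
// Sketch: a relocatable "pointer" that is just an offset into a region.
pub struct OffsetPtr {
    pub offset: usize, // offset from the start of the shared region
}

impl OffsetPtr {
    // Each process resolves against its own base address for the mapping;
    // the stored value is identical across processes.
    pub fn resolve(&self, base: *mut u8) -> *mut u8 {
        base.wrapping_add(self.offset)
    }
}
```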

std.heap.MemoryPool(T)?


From what I can see, I can’t stick std.heap.MemoryPool(T) behind a normal allocator API without first making sure that the requested alignment is doable and that the request is the correct size. This means I suffer runtime overhead for things which should be checked at comptime. This is especially bad since memory pools/arena allocators are often hot loop constructs for IO heavy applications.

Some things I’ve learned after familiarizing myself deeper with Zig’s allocators:

  • The VTable is curious; it’s not your typical malloc/free. See, for example, how resize can say nope, and it’s up to the layers above the vtable to provide realloc semantics (the motivation here is that, for things like ArrayList, you don’t want to memcpy everything on realloc, only the allocated part).
  • The VTable is curious by its presence. I remember reading somewhere that it’s pretty important to statically compile your allocator in, so that you can inline the hot path, and then Zig doesn’t do that. But good Zig code doesn’t actually allocate all that much, so I’d expect that Zig might not be a good language to learn performance allocation patterns from.
  • Zig uses call-site dependency injection for allocations, meaning that it’s idiomatic not to store a pointer to the allocator as part of the collection, but rather to pass the allocator in to the subset of functions that actually need to allocate. This is surprisingly flexible! My takeaway is that Zig code ends up looking roughly like Rust code, in that, while you do minimize allocations (somewhat more so in Zig), it’s still considered normal to thread a gpa around and allocate stuff when convenient. This is not the pattern we use at TigerBeetle, where we have an absolute “no allocation after init” policy. Nonetheless, we are able to re-use Zig’s hash table, because CSDI for allocators allows TigerStyle static allocation as a subset.
  • “No allocation after startup” really does work for distributed databases. Not only can you write those kinds of things without a heap, you don’t even notice that there’s no heap! That’s a major update for me; I thought that working on TigerBeetle would be a lot of pain for a lot of gain, but it’s … mostly just the gain? Though I suspect it depends a lot on the domain; I don’t see how you could write a compiler in a similar style without having really silly constraints like “you can have at most 10k functions in your program”.
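The CSDI pattern described above, sketched in Rust with hypothetical names: the collection stores no allocator, and only the operation that actually allocates takes one (the inner Vec still uses the global allocator here, which is why this is only a sketch of the shape, not of Zig's semantics).

```rust
// Sketch: call-site dependency injection. The collection carries no
// allocator; callers supply one to the functions that need it.
pub trait Gpa {
    fn alloc_bytes(&self, n: usize) -> Vec<u8>;
}

pub struct Heap;
impl Gpa for Heap {
    fn alloc_bytes(&self, n: usize) -> Vec<u8> { vec![0; n] }
}

pub struct List {
    pub items: Vec<Vec<u8>>,
}

impl List {
    pub fn new() -> Self {
        Self { items: Vec::new() } // no allocator stored
    }

    // Only the growing operation needs an allocator, so only it takes one.
    pub fn push_buffer<A: Gpa>(&mut self, gpa: &A, n: usize) {
        self.items.push(gpa.alloc_bytes(n));
    }
}
```

Because the allocator arrives per call, a caller with a static-allocation policy can pass a fixed-buffer implementation to init-time calls and nothing afterwards, which is how the TigerStyle subset falls out of the same API.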

The VTable is curious; it’s not your typical malloc/free. See, for example, how resize can say nope, and it’s up to the layers above the vtable to provide realloc semantics (the motivation here is that, for things like ArrayList, you don’t want to memcpy everything on realloc, only the allocated part).

I agree, and I actually prefer to have a resize when I write my own allocators (usually called try_realloc_in_place or similar). I didn’t think about that optimization for ArrayList, but that makes a lot of sense.

The VTable is curious by its presence. I remember reading somewhere that it’s pretty important to statically compile your allocator in, so that you can inline the hot path, and then Zig doesn’t do that. But good Zig code doesn’t actually allocate all that much, so I’d expect that Zig might not be a good language to learn performance allocation patterns from.

I want to draw a distinction between “reaching past the allocator API” and “asking the OS for more memory”. For the sake of generality, it’s useful for libraries to be able to pull from both a general-purpose allocator and something more like a memory pool or arena allocator. For a general-purpose allocator, I’ll ask the OS for my 8/16/32/etc GB at the start and then never touch it again as I slice that memory up into more specialized allocators, such as arenas or per-request bump allocators. I know that TigerBeetle does something similar. Not asking the OS for memory in a hot loop shouldn’t be that hard in most applications, but not asking for IO buffers is much harder to avoid in an IO-intensive application. While good Zig avoids the former, Zig programs still deal with the latter, and those are the more interesting areas to look at from a UX and API design perspective. I already know how I plan to inline the hot path even for bump/arena allocators, so long as they are comptime-known, so Zig not doing that isn’t as relevant. The reason I’m looking to Zig is that it has a culture of custom allocators beyond what most languages have, so it seemed like a good place to ask for opinions on allocators. I have a few other places around high-performance networking where I’m also asking for opinions, since programs that need to do IO 50 million times a second tend to have strong opinions on memory management, or at least their authors do.

Zig uses call-site dependency injection for allocations, meaning that it’s idiomatic not to store a pointer to the allocator as part of the collection, but rather to pass the allocator in to the subset of functions that actually need to allocate. This is surprisingly flexible! My takeaway is that Zig code ends up looking roughly like Rust code, in that, while you do minimize allocations (somewhat more so in Zig), it’s still considered normal to thread a gpa around and allocate stuff when convenient. This is not the pattern we use at TigerBeetle, where we have an absolute “no allocation after init” policy. Nonetheless, we are able to re-use Zig’s hash table, because CSDI for allocators allows TigerStyle static allocation as a subset.

CSDI is something I was aware of from the time I’ve spent learning Zig’s basics, and I agree it is a very nice feature to have, since it can provide more flexibility to the caller. However, if I tell most Python devs (a community Mojo heavily draws from, since Mojo is syntactically a member of the Python language family) to “just thread around an allocator”, I will get a resounding rejection, so I also need something where the collection owns the allocator. Most likely, a lot of collections will get implemented as a CSDI version, and then a wrapper around that which stores an allocator for use with RAII.

The capability to have a “no allocation after init” policy is also quite desirable for Mojo, since GPUs have shrunk memory resources back down to a place where developers are forced to care again, and doing allocations on a GPU involves a lot of compute that is better spent on matmuls. Allocators that allocate memory from somewhere other than the CPU, or in memory which is coherent to some subset of devices on the system (ex: coherent iGPU + CPU and a non-coherent GPU) are also more expensive since they frequently involve IOMMU manipulation, making up-front allocations again more desirable.

“No allocation after startup” really does work for distributed databases. Not only can you write those kinds of things without a heap, you don’t even notice that there’s no heap! That’s a major update for me; I thought that working on TigerBeetle would be a lot of pain for a lot of gain, but it’s … mostly just the gain? Though I suspect it depends a lot on the domain; I don’t see how you could write a compiler in a similar style without having really silly constraints like “you can have at most 10k functions in your program”.

My own distributed DBs don’t quite go that far (mostly due to deficiencies in Rust and their being research projects), but I am very aware of the value of never talking to the OS in a hot loop as a heavy user of DPDK. However, as you point out, other domains are less willing or able to deal with these constraints, so I can’t reasonably propose a design for allocators which doesn’t allow dynamic allocation, even if that works for databases. Asking people who primarily program in Python to jump directly to up-front allocation may also be a bit of a bridge too far.

One thing that I think is missing from Zig’s allocator API, and from allocators in general, is a way to facilitate DOD design without losing type safety by just allocating bytes. I would love to see an allocator where you can pass a list of types at comptime and get an allocator capable of allocating neatly packed elements by type, or even one step further, by field of those types. You can probably hack something together in Zig that would achieve that goal, but having an API capable of handling what is effectively a pool of memory pools would be extremely convenient, and it would eliminate one thin layer of having a bunch of ArrayLists, one per type, or a bunch of MultiArrayLists. To be fair, it’s quite niche, but there is a lot of code out there that could get a free lift in performance just from better cache coherency.
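A rough Rust sketch of that pool-of-pools idea, with hand-written impls standing in for what comptime iteration over a type list would generate: each type in the list gets its own densely packed storage behind one typed interface, which is where the cache-coherency win comes from.

```rust
// Sketch: a "pool of pools" keyed by type. Handing out indices instead of
// pointers keeps the sketch safe and short.
pub trait TypedPool<T> {
    fn alloc(&mut self, value: T) -> usize;
    fn get(&self, idx: usize) -> &T;
}

#[derive(Default)]
pub struct Pools {
    f32s: Vec<f32>,
    u64s: Vec<u64>,
}

// In Zig/Mojo these impls would be generated from a comptime type list;
// here they are written out by hand for two example types.
impl TypedPool<f32> for Pools {
    fn alloc(&mut self, v: f32) -> usize {
        self.f32s.push(v);
        self.f32s.len() - 1
    }
    fn get(&self, i: usize) -> &f32 { &self.f32s[i] }
}

impl TypedPool<u64> for Pools {
    fn alloc(&mut self, v: u64) -> usize {
        self.u64s.push(v);
        self.u64s.len() - 1
    }
    fn get(&self, i: usize) -> &u64 { &self.u64s[i] }
}
```

Going "one step further, by field" would mean splitting each type into per-field columns, i.e. generating a MultiArrayList-style layout per type from the same list.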

Sounds like an opportunity to do something interesting. Let us know when you’ve built it.


There’s a bunch of compiler work that needs to happen first, so it’s going to be a bit. However, I’ll come back here when it goes live.
