Allocator.allocWithScalar

So I’m working on a library which uses bitsets extensively. It allocates slices of memory, which need to be zero.

In C, of course, we use calloc for this, and for good reason: in hosted systems, there tends to be zeroed memory available, and if that happens to be the case, we get our zeroes for free.

I was a bit surprised to not find an Allocator.calloc, so I went digging.

Looking around, I was able to find an issue for Allocator.allocZeroed, which was closed as a duplicate of #2090, where I found the reasoning (quotes are @andrewrk):

I want to note that the Allocator API intentionally does not have a allocate-and-zero-initialize function, and likewise Zig itself does not have the concept of zero initialization. It was removed in 6a5e61a.

Because:

This goes against the core language design that “zero initialization” isn’t meaningful. This is one area where Go and Zig differ strongly.

The reasoning here makes sense to me. There’s even a note in the documentation of std.mem.zeroes asking us to consider very carefully whether this is necessary, and observing that use of it might be a code smell.

I’m also strongly opposed to the idea that memory should have a zero default. It causes inefficiency when not needed, it encourages buggy and lazy code, and it subverts the use of optionals: all of these things are Bad.

This leaves me with a problem, however: I want optimal code. I’m following the salutary Ziggish practice of taking an allocator for the functions which allocate memory, but this means that the compiler can’t optimize @memset(mySlice, 0). Since the allocator is provided at runtime, and there is no guarantee that any given allocator will provide zeroed memory, it can’t know whether it’s safe to eliminate the @memset while compiling. Meanwhile, I know that release mode can mostly give me my zeros for free, but the above means that it will run @memset anyway.
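
To make the dilemma concrete, here’s a minimal sketch of the pattern (the bitset helper is just an illustration):

const std = @import("std");

// With a runtime-chosen allocator, the compiler can't prove the returned
// memory is already zero, so the @memset has to stay.
fn newBitset(gpa: std.mem.Allocator, nbits: usize) ![]u64 {
    const words = try gpa.alloc(u64, (nbits + 63) / 64);
    @memset(words, 0); // always executed, even if gpa handed back fresh zero pages
    return words;
}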

If I were to reach for calloc for internal allocations, and copy the final values over to the user-provided allocator, that would take control away from the user, and make the library useful in fewer circumstances. And I have no other need for libc, so it would be a shame to link it just for this one thing.

I use a lot of Julia as well, and there are dozens of posts where people are trying to find the fastest way to get zeroed memory for their algorithm. I fully support rejecting “make zero a meaningful value”, and discouraging lazy programming which sets everything to zero out of habits formed using different languages.

But there’s still this way that zero is a special value, namely, an allocator is very likely to have a bunch of zeroes sitting around, waiting to hand over. At the same time, I follow the reasoning that allocators shouldn’t provide a special zeroing alloc. The most common value to initialize memory to is always “what I put in it once I get it”, but zero is unquestionably the most common value after uninitialized, and happens to be very cheap to provide, in many cases of interest.

So the proposal is: add allocWithScalar(self: Allocator, comptime T: type, n: usize, comptime val: T), which returns a slice []T where all elements are val. This will only compile if the user can provide a comptime-known value for val.

This way zero isn’t special, but if val is 0, allocators which have zeroed memory around can provide it optimally. Otherwise it falls back to @memset(newAllocation, val) before returning the memory.
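
In terms of observable behavior, the fallback path is equivalent to this sketch (the whole point of the proposal is that an allocator holding val-filled memory could skip the @memset):

const std = @import("std");
const Allocator = std.mem.Allocator;

// Fallback semantics of the proposed function, written as a free function.
pub fn allocWithScalar(gpa: Allocator, comptime T: type, n: usize, comptime val: T) ![]T {
    const slice = try gpa.alloc(T, n);
    @memset(slice, val); // an allocator with val-filled memory on hand could elide this
    return slice;
}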

This is a function which would pull its weight. Zero isn’t a special value anymore, but it respects the reality that many allocators can provide zero cheaply. It expresses intention explicitly, favors correct code, and doesn’t present the same moral hazard of tempting people to use it “just in case”.

Remember, the compiler can’t optimize alloc + memset, only the allocator can do that.

One can even imagine creating allocators which pre-fill with a non-zero value, and which could provide that value, instead of zero, cheaply. This isn’t so far-fetched that I can’t come up with an example: consider a garbage-collected allocator which prefills some pages with the gray bit already set. That can be SIMD optimized, and then allocation can spare a few cycles while the mutator is running.

Or a specialized MemPool which page-allocates a known default, and resets on free. You can implement that now by writing functions to only take the special allocator, but then you can no longer swap in a testing allocator, logging allocator, or anything but your specialized implementation.

Or even simpler, let’s say that a program uses a lot of bitsets where most bits are expected to be one. Same basic argument: a specialized allocator can request pages from the OS and SIMD-fill them with 0xff, then provide that faster at the point of use.

It’s true that zeroed memory is the most common kind to want, and the cheapest to provide, in the general case. I’m making a case here that providing the general mechanism isn’t trying to sneak a zero allocator in the back door, but is instead a powerful and general mechanism, providing control and optimal code.

I think this wouldn’t lead to code bloat. Let’s say a given allocator can provide one value cheaply (in most cases this is zero), so allocWithScalar has a comptime if statement: one branch is the cheap value, the other branch uses @memset. That shouldn’t monomorphize into more functions than would be useful to provide, right? Which is one function for an allocator which doesn’t have preinitialized memory available, and two functions for the ones which do.
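
To make that concrete, here’s a sketch of the comptime if inside a hypothetical allocator that can hand out zeroed memory cheaply (Self, backing, and cheapZeroedAlloc are made-up names):

// Comptime resolution leaves exactly one branch per instantiation, so at most
// two specializations per element type are ever emitted.
pub fn allocWithScalar(self: *Self, comptime T: type, n: usize, comptime val: T) ![]T {
    if (comptime val == 0) { // assuming a numeric T, for simplicity
        return self.cheapZeroedAlloc(T, n); // hand out pre-zeroed memory
    }
    const slice = try self.backing.alloc(T, n);
    @memset(slice, val);
    return slice;
}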

If I’m right about that (unclear), then allocWithScalar would become the one obvious way to allocate to a default value. Either you get a performance boost, or you don’t, but it clearly expresses the intention “give me memory preinitialized to this value”, and the not-optimal case is identical to alloc followed by @memset.

I didn’t want to add this as a proposal to the issue board, because the “proposal” button says “no thank you, at this time”, and one must be respectful of that sort of choice.

By beginning the discussion here, it can be examined on its merits, without imposing a maintenance burden on core.

I don’t know, but I wonder whether picking one specific value becomes meaningless in the absence of pre-existing optimizations specialized for zero-initialized memory.

I think one of the reasons people want to use zero-initialized memory is that there can be special machinery that makes it a cheap operation, for example memory pages that are simply flagged as containing only zeroes. I doubt these things exist for arbitrary values, and if they don’t exist, then I don’t know why you would hardcode any specific value. Wouldn’t that just turn out to be the same in performance as just setting it?

Are there lower-level / OS operations that make this more performant for values other than zero?

The only thing I can imagine at the moment is pre-filling a page with a specific value, mapping it to many virtual addresses, and setting it to copy-on-write, to prefill the virtual address space with a specific value while still allowing modification.
But I am not familiar with the details and variety of OS APIs. Are there other ways?
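
For what it’s worth, a rough Linux-only sketch of that idea using std.posix (API details are from memory and have shifted between Zig versions, e.g. std.mem.page_size):

const std = @import("std");
const posix = std.posix;

// Fill a memfd once, then hand out private (copy-on-write) mappings of it,
// so every mapping starts out as the fill pattern.
fn cowPrefilledMapping(fill: u8, len: usize) ![]align(std.mem.page_size) u8 {
    const fd = try posix.memfd_create("prefilled", 0);
    defer posix.close(fd);
    try posix.ftruncate(fd, len);

    // Write the pattern into the shared backing object once.
    const shared = try posix.mmap(null, len, posix.PROT.READ | posix.PROT.WRITE, .{ .TYPE = .SHARED }, fd, 0);
    @memset(shared, fill);
    posix.munmap(shared);

    // Private mappings see the pattern; the first write to a page copies it.
    return posix.mmap(null, len, posix.PROT.READ | posix.PROT.WRITE, .{ .TYPE = .PRIVATE }, fd, 0);
}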


I wouldn’t say it’s meaningless. There are other values where setting an allocation to a specific default makes sense; an example I gave is a bitset where you want all values to start true and get toggled false. There are others.

I was pretty up-front about the fact that the big performance win for allocWithScalar would be when val happens to be zero. I gave some examples of cases where having the ability to allocate to any default value, not just zero, could cooperate with specialized memory allocators to improve performance.

But it’s better to provide a general mechanism which works for everything, rather than specialize a specific allocZero. The fact that zero-allocation happens to be more efficient on existing systems is a contingent fact, something which merely happens to be the case. Zig is a general purpose language which should work just as well on custom systems which haven’t been designed yet.

So yes, the general case of using allocWithScalar for any given value just defaults to alloc with a @memset to that value. But this allows other parts of the system, be it the allocator or the host OS, to provide more efficient implementations for any useful value. There’s no reason to privilege zero here just because it happens to be what most operating systems set a page to before allocating it.

The point of the proposal is to allow an existing, and rather important, optimization, but to do so without privileging zero just because it happens to be the cheap value.

I’m looking at the proposal and I’d like to point something out from a library/affordances standpoint. This function signature:

allocWithScalar(self: Allocator, comptime T: type, n: usize, comptime val: T)

…has a few warning signals for me. The first issue is that since T is generic, you can hand this structures that have slices. Then if you do that, the pointers in val’s slices will be copied to every element. We may want that behaviour (such as setting each instance to something with a slice pointing to an empty string), but we may not.

Making val generic may be solving one problem but stepping into a bunch of others when really what we’re looking for here is closer to wanting “zeroed memory”.

I’ve noticed this around assignment semantics in Zig: it’s trivial to solve these problems for scalar values like integers and floats, but it gets trickier as you move into user-defined types.


It’s a reasonable thing to bring up. Zig’s comptime type system can’t currently do things like narrow T to include only primitive types. An implementation of allocWithScalar could impose that limitation with comptime reflection, but that can’t be expressed in the signature, so it can’t be guaranteed that every function implementing the interface would do it.
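
For illustration, the check itself is a one-liner with reflection (tag spellings here follow recent Zig; older versions capitalize them as .Int, .Float, .Bool):

fn assertScalar(comptime T: type) void {
    switch (@typeInfo(T)) {
        .int, .float, .bool => {},
        else => @compileError(@typeName(T) ++ " is not a primitive scalar type"),
    }
}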

It isn’t 100% clear to me what you can and can’t initialize at comptime; whether pointers are allowed at all, for instance. I seem to recall that structs have to be extern or packed to have a comptime-known layout, and that packed structs can’t have pointers.

I’m fairly sure that Zig recently eliminated the ability to carry a comptime pointer to mutable data over into runtime. So if T had pointers in it, slices or otherwise, those pointers would have to point somewhere safe. This is actually safer than a runtime val, which could have undefined pointers in it. All pointers visible at comptime also last the duration of the program. If the pointer were an optional, then it could be null, and Zig will force code to check for that before referencing it.

But, as a counterexample, a slice pointing to an .rodata string might be a perfectly reasonable thing to allocate to every memory region. It’s pretty far off the intended use path, but where’s the harm? You now have a slice of slices and they all point to the same data.

“zero is a special case so let’s provide a special allocator for it” has already been rejected. My aim here is to provide an alternative which won’t raise the same objections, but which will also give the cheap access to zero-initialized memory which is already available. The availability of free zeros is a good reason to provide an allocator which can give them to you, but it isn’t a good reason to make that a special case.

Making val a runtime value is a bad idea. That imposes a branch on what would otherwise be branchless: the comptime resolution of the function call would compile a specialization, if that were available, or just alloc and @memset.

Still, the intention is to use it with primitive values, and sure, mostly to set them to zero. If Zig had “semi-generic” types like anyprimitive or the like, I’d be quite content to limit the use of allocWithScalar to those types.

Zig doesn’t allow null pointers. Isn’t that the main reason why zero initialization is discouraged?

The proposed allocWithScalar cannot be implemented like this.
The allocator interface uses runtime function pointers to call the implementation, and runtime function pointers cannot carry comptime parameters such as val.
So in practice allocWithScalar could not be implemented more efficiently than alloc + @memset.

Instead what you could do is create your own ZeroingAllocator interface (without needing to change the standard library).
Inside of your project you can basically just pass it around as usual.
When converting from a zig allocator to a ZeroingAllocator you could check whether the allocator is on some list of allocators that automatically zero memory. If not you would need to insert a wrapper allocator that does the memset.
It could look something like this:

const ZeroingAllocator = struct {
    allocator: Allocator,
    zeroAlloc: *const fn..., // Could also put this into the VTable, but that's a bit more difficult.
    fn defaultZeroAlloc(...) ... {
        const result = allocator.rawAlloc(...);
        @memset(result, 0);
        return result;
    }
    pub fn init(alloc: Allocator) ZeroingAllocator {
        if (alloc.vtable == &YourFancyZeroingAllocator.VTable) {
            return .{
                .allocator = alloc,
                .zeroAlloc = &YourFancyZeroingAllocator.zeroAlloc,
            };
        } else if (alloc.vtable == &YourOtherZeroingAllocator.VTable) {
            // ...
        } else if (alloc.vtable == &PageAllocator.VTable) {
            // ... You could even integrate this with some of zig's allocators
        }
        return .{
            .allocator = alloc,
            .zeroAlloc = &defaultZeroAlloc,
        };
    }
    pub fn alloc()...
    pub fn free()...
    pub fn zeroAlloc()...
    ...
};

This would have a small cost every time you want to convert from a Zig allocator to the new interface, but if you internally only use your ZeroingAllocator, that conversion is the responsibility of the user of your project (and they can easily take care of it, since the conversion is clearly visible in code).
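
Hypothetical usage, assuming zeroAlloc(T, n) and free as convenience wrappers on the interface above:

// Pay the detection cost once, at the boundary; use zeroAlloc freely afterwards.
const zalloc = ZeroingAllocator.init(std.heap.page_allocator);
const bits = try zalloc.zeroAlloc(u64, 1024); // zeroed for free when the backing allocator allows it
defer zalloc.free(bits);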


I think it’s easier to implement an allocator that calls calloc, and use that as a parent for other allocators.


It would call for some re-engineering, yes. The allocator interface already comptime-specializes on a comptime T in alloc, though. We should take a crack at figuring out how that can work; it’s clearly not impossible. I do understand your point: the current vtable doesn’t support this, and you can’t comptime-specialize a function pointer. That means this calls for some careful engineering.

Zeroing allocators already exist, for instance, Zig programs can just call calloc from libc. The idea here is to provide a general interface, so that libraries can be written to take a generic allocator, and any allocator capable of providing fast zeros (or any other fast value) can do that.
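
For reference, wrapping calloc in the standard Allocator interface is not much code; a sketch against the vtable shape of roughly Zig 0.11-0.13 (the exact signatures have changed between releases):

const std = @import("std");
const Allocator = std.mem.Allocator;
const c = @cImport(@cInclude("stdlib.h")); // requires linking libc

pub const CallocAllocator = struct {
    pub fn allocator(self: *CallocAllocator) Allocator {
        return .{ .ptr = self, .vtable = &vtable };
    }

    const vtable = Allocator.VTable{ .alloc = alloc, .resize = resize, .free = free };

    fn alloc(_: *anyopaque, len: usize, log2_align: u8, _: usize) ?[*]u8 {
        // calloc only guarantees malloc alignment; refuse stricter requests.
        if ((@as(usize, 1) << @intCast(log2_align)) > @alignOf(std.c.max_align_t)) return null;
        const ptr = c.calloc(1, len) orelse return null;
        return @ptrCast(ptr);
    }

    fn resize(_: *anyopaque, buf: []u8, _: u8, new_len: usize, _: usize) bool {
        return new_len <= buf.len; // shrink in place only
    }

    fn free(_: *anyopaque, buf: []u8, _: u8, _: usize) void {
        c.free(buf.ptr);
    }
};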

I consider the practice of every function which allocates taking an allocator to be central to the Zig philosophy; it’s one of the main reasons I’m betting on the language.

It’s essential that this interface be as powerful as memory allocation in C, and it’s a blunt fact that this isn’t currently true.

While we’re comparing it to C, the difference is, of course, calloc. There’s a reason modern OSes provide zeroed pages, and not, say, 0xdeadbeef like early Berkeley Unix used to: there’s a large class of algorithms which start with big regions of zeroed-out memory, including nearly every numeric recipe. Providing those zeroes as quickly as the memory can be allocated is crucial to implementing such algorithms efficiently.

What I’m arguing is, why stop at calloc? Current Zig practice means that if you’re implementing an algorithm which expects a lot of zeros, you have a choice: accept an allocator and take the efficiency hit, or pull in a specialized allocator and take the choice away from the user.

This isn’t ideal! What would be better is to have a way to tell the allocator what you want T to be initialized to, so that an allocator can be specialized to provide that.

In a way, C doesn’t even have an edge on Zig here, because in C, calloc is a global. My issue is that Zig authors shouldn’t have to choose between degenerating to C style global allocation, with all its flaws, and getting maximum efficiency.

I believe that’s part of it, yes. Maybe the right thing to do is to revisit the allocZero question, but I don’t think so. Finding a way for allocator implementations to cooperate with the allocator interface, to provide an optimized implementation of any default value, is strictly more powerful than special-casing zero.

We can make up stories all day about ways this could be used which aren’t just efficient zero-allocation, but it isn’t really necessary to do so. allocZero has one concrete advantage, which is that it would be only one more pointer in the vtable. But there are obviously ways of structuring things so that allocWithScalar is possible. I don’t want to YOLO an implementation off the top of my head, but C++ does some very sophisticated things with vtable structure which could be drawn on for inspiration.

Easier, yes. But I hope I’ve explained why that isn’t ideal. Tying library implementations to allocators is a regression, the Zig way is better. We should be able to have it both ways.

I am not convinced that adding explicit support to the interface is beneficial for a general-purpose allocator interface. (Only some users will make use of it, so it seems wrong to always add it; it seems much better in some specialized interface that is only used when it is relevant.)

What I would find more interesting would be some standardized way to signal to other parts of the program what kind of allocator they are expected to pass in.

With arenas you already have the case that you no longer have to call free on your allocated things. With this it would be good to have some way of knowing that the memory was pre-initialized.


But I tend to think that maybe none of this is needed at all and the solution is to communicate what kind of allocator is expected and then provide that.

That is probably better than inventing new type system features that try to prevent you from doing things, only to end up with a heavier and more complicated Rust-like language, which is what you wanted to avoid to begin with.

Sometimes the solution is just to write simple code and fix the code if it is broken, instead of having many layers of auto-adapting machinery.


Sure, but I think you’d need to fundamentally change the allocator interface in order to achieve the proposed behavior.

That’s exactly what my proposed user-space interface is doing? It can take a generic zig allocator by calling ZeroingAllocator.init on it.
Additionally it gives the advantage that the user stops and notices “Oh, this library function takes a ZeroingAllocator, maybe I should give it one that fulfills that property for free”, whereas if this was baked into the Zig allocators, you wouldn’t be able to convey the nuance that your library can benefit from these special allocators.


I have a zeroing allocator in my own code, actually. The allocator provided by zigar basically creates an ArrayBuffer object on the JavaScript side, which is always zeroed out. Since the existence of allocators behaving in this manner is inevitable, it’s actually kind of unsafe not to have something like an allocZeroed().


It’s a good approach, possibly a better one. It would need to be part of Zig’s std to really allow code to take advantage of it.

It’s still special-casing zero memory, though. On the one hand, that’s fine: zeroed memory is important enough that having an allocator interface for it would pay off, and I speculate that the reasoning which led to #2090 and #11559 being closed doesn’t necessarily imply opposition to providing an interface for requesting fast zero allocation, and fulfilling that with an allocator which provides it. The possibility of a ZeroedAllocator was discussed in #11559, and it would solve the dilemma which initially prompted me to post this topic. The gimme library linked in that issue exists, but it just uses @memset, and saving myself one statement per allocation is not the goal here.

Regarding the two points above:

I see several good reasons to reify allocators in the type system.

The current implementation is that an allocator is anything which provides a specific vtable, and this has achieved a remarkable combination of simplicity and power.

But let’s consider a construct which didn’t take this approach, and was instead reified: slices.

A slice could just be a convention of using structs with a pointer and a length, provided by std.

const Slice = struct {
    ptr: *anyopaque,
    len: usize,
};

This could be supplemented with some built-ins to do the requisite operations.

But it’s good that slices are a compiler-known primitive, with special syntax. From a user perspective, it makes them easier to use, easier than ‘thin’ pointers, which increases safety. It also means that the compiler knows what a slice is, which makes it easier (and easier is faster) to emit optimal machine code for slices. In many cases this results in in-bounds safety with no extra cost, such as casting &array to a slice and inlining the comptime-known value of its length.
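
A two-line example of that last point:

var buf: [16]u8 = undefined;
const s: []u8 = &buf; // length 16 is comptime-known, so bounds checks on s can fold away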

Another example is tagged unions and switch statements. I’ve written tagged-union code in C, Zig, and Julia, and Zig offers the best experience there. The language doesn’t have to reify tagged unions, but the benefits are worth the additional complexity.

Making allocators something the compiler knows about should be seriously considered. It wouldn’t add extra complexity to the language; we’ll always have allocators, and they are by definition a distinct sort of Zig construct. If anything, reifying them would simplify the use of them, and therefore, the language as a whole.

Making allocators a specific thing that the compiler is aware of would enable several good things. One is that allocation can be tracked as a matter of course; this would allow annotating that a function doesn’t allocate. It would offer opportunities for third-party code to implement static analyzers which provide move semantics, ownership semantics, and proofs of bounds for heap use, and this is an incomplete list.

Also, it would provide more opportunity to elide the vtable dereferencing, thereby emitting better code. This is directly comparable to Zig structs, which are allowed to rearrange fields for optimal layout and codegen. A reified Allocator type could be compiled directly, with or without vtables, and definitely without imposing a specific internal layout for them.

I strongly believe that the sort of effect tracking and code proving I’m gesturing at here is the future of Zig. Not built into the compiler, making every compile cycle slow, and baking in some specific memory policy, but simply the core language design making it easy for third party tools to provide high-assurance static analysis.

It’s a good idea on its own merits, and it neatly solves any problem with specializing vtables to specific applications. If the compiler knows what an allocator is, it can take care of that. This doesn’t have to lead to C++ insanity where every possible abstraction is exposed to the compiler, and it won’t; the core team would never do that. We don’t need private/abstract/virtual/protected/friend or anything that goes along with that. But we do need allocators, and we have allocators. Reifying them into the type system lets the compiler do things which it can’t do now.

Such a change should be carefully considered, but it should be considered. It would cause little to no disruption to ordinary Zig code, because a function with alloc: Allocator would continue to work exactly as it does now, maybe with some forward-compatible changes in what the interface exposes. User-provided allocators would have to adapt to the change, but this is reasonable.

Not entirely coincidentally, this would allow for enough flexibility to implement an allocWithScalar, but that’s far from the major benefit. Just a happy confluence.

There isn’t as much performance benefit here as some would think, unless you write a very specific allocator tuned for handing out large zeroed blocks.

The big benefit of calloc isn’t that it has pre-zeroed memory; it’s that it doesn’t fault in the entire allocation up front by touching every byte. (Even then, that might not always be desirable behavior: if you are going to have to mmap more pages, you might want to take the zeroing hit up front, because pre-faulting in the pages needed will help the hot loop; and if you are using mmap’s MAP_POPULATE or mlock to request more pages, it might not be that much more of a cost anyway.)

To hand back freshly zeroed memory without cost, that memory either needs to be freshly requested from the OS (paying the mmap cost, so it’s expensive and only usable for large allocations), or the allocator needs to keep track of all the small blocks of memory, knowing which have been used and freed already. That’s doable by adding another bit flag to each potential block of memory: now it isn’t just free/in-use but fresh/freed/in-use. And then you have to have a way to make it fast to find fresh vs. freed memory.

The cost is going to be heavy memory fragmentation if you are always requesting zeroes, or taking a hit every once in a while to zero freed blocks (that might amortize better, but you will have less predictable alloc/free times).

Calloc takes the approach of only really optimizing the large allocations. On small allocations it is mostly just malloc+memset. On larger allocations it will get fresh pages; since it doesn’t memset those, they remain just virtual pages.

The bigger win here would be to do a non-temporal memset (gcc’s memset will use MOVNT if the buffer is a certain fraction of cache size, I think). I don’t know about Zig’s relationship to that, though. NT prefetch doesn’t help as much here.


The flip side of this is that when you do have a need for large allocations of zeroed memory, the savings can be substantial, and this need is common in numerics, where very large arrays of various dimensions need to be preallocated to zero. This led to the ArrayAllocators.jl package in the Julia ecosystem.

Feedback on my proposal has led me to the conclusion that it’s not a good one, though. A ZeroedAllocator interface might be the way to go, and useful applications for it might actually be niche enough that the standard library needn’t concern itself with this.

My point was that if you really need a large pre-zeroed block, just use the page allocator or mmap directly. I don’t think the page allocator keeps a pool; I think it just munmaps everything immediately on free, so those two ways are probably almost the same thing.

At best you can make an Allocator interface that catches large allocations and returns newly mapped pages, and passes all the smaller or non-zero requests to another allocator: basically what calloc does.
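
That dispatch is easy to picture (the threshold here is made up; real allocators tune it):

const std = @import("std");

// Large zeroed requests go straight to fresh OS pages, which are already
// zero; everything else goes to the backing allocator plus a memset.
fn zeroedAlloc(backing: std.mem.Allocator, len: usize) ![]u8 {
    const big_threshold = 128 * 1024; // hypothetical cutoff
    if (len >= big_threshold) return std.heap.page_allocator.alloc(u8, len);
    const buf = try backing.alloc(u8, len);
    @memset(buf, 0);
    return buf;
}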

I don’t think OSes keep pre-zeroed pages around anymore. Things like that and prefetching have really fallen out of favor because of cache pollution and CPU issues.