Fast memory and GeneralPurposeAllocator

I have 2 questions about memory.

  1. The first is: I am building a tree where each node has a variable number of children.
    The children array grows by 1 whenever a child is added.
    There are a lot of nodes; that’s why I do not want to pre-allocate.
    What is the fastest allocator for that? Or the fastest way to do it?

  2. How can I use this deinit() to check for memory leaks?

var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer gpa.deinit(); // how?
const allocator = gpa.allocator();
// go with code using allocator

I’ll answer in reverse order:

How to detect leaks:

var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer {
    if (gpa.detectLeaks()) {
        // do stuff
    }
    _ = gpa.deinit(); // both functions return values, which must not be silently discarded
}
const allocator = gpa.allocator();
// go with code using allocator

Selecting allocators is more about program behavior and memory-usage patterns. If the lifetimes of all the nodes are tied together, then you want to consider the ArenaAllocator. If you need to push and pop nodes frequently, then the GPA might be a good choice. What are your plans for the data you will be allocating?
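For example, an arena for nodes that all die together might look like this (a minimal sketch; the u32 node type is just a stand-in):

const std = @import("std");

// Minimal sketch: an arena on top of the GPA, for nodes whose lifetimes
// are all tied together. Individual frees become no-ops; deinit releases
// everything at once.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var arena = std.heap.ArenaAllocator.init(gpa.allocator());
    defer arena.deinit(); // frees every node in one go

    const allocator = arena.allocator();
    const node = try allocator.create(u32); // u32 stands in for a tree node
    node.* = 42;
    // no per-node destroy needed
}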


The data is a word tree, loaded from a text file. An advanced type of Directed Acyclic Graph. There is only a “load from text file” step that builds the tree; from then on it is static. The children (probably, though maybe I can omit this) need to be ordered as well.
So during loading there is a lot of resizing / inserting going on.

edit: sorted is not needed.

There is a fun little trick to pre-allocate memory, without actually using any physical memory:
You can tell the OS to reserve a region of virtual memory without actually mapping it to physical memory. Then whenever the list needs to be resized, you tell the OS to commit more of the reserved memory.
Such a list doesn’t need to move any memory around on resizing; instead, growing is just a syscall.
Additionally this has the advantage of stable pointers, so you could for example store the node children pointers directly, without having to worry about it changing whenever you add more nodes to the graph.
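In std terms the idea looks roughly like this (a POSIX-only sketch; the std.posix names are from Zig 0.12-era and vary between versions, and Windows would use VirtualAlloc/MEM_RESERVE instead):

const std = @import("std");

pub fn main() !void {
    // Reserve 1 GiB of address space; no physical memory is used yet.
    const reserved = try std.posix.mmap(
        null,
        1 << 30,
        std.posix.PROT.NONE, // inaccessible until committed
        .{ .TYPE = .PRIVATE, .ANONYMOUS = true },
        -1,
        0,
    );
    defer std.posix.munmap(reserved);

    // Commit the first 64 KiB once the list needs to grow into it.
    try std.posix.mprotect(reserved[0 .. 64 * 1024], std.posix.PROT.READ | std.posix.PROT.WRITE);

    reserved[0] = 1; // usable now; pointers stay stable as the region grows
}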

I implemented a version of this for my game here: Cubyz/src/utils/list.zig at master · PixelGuys/Cubyz · GitHub (note: this only works on x86 at the moment)


Are the nodes all the same size, or do they (for instance) point to slices of different sizes?

If they’re all the same size, a MemoryPool might be a good choice. It takes a backing allocator and specializes in allocating and freeing one type, which makes allocation cheaper, potentially by quite a bit.

But if you have to also allocate e.g. slices of varied sizes, then you’d need to have a second allocator around for that, which is more complexity than you need (and the performance advantage probably isn’t there either).
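For the same-size case, usage is compact; a minimal sketch (the Node type here is just a placeholder):

const std = @import("std");

const Node = struct {
    data: u8,
    children: []Node,
};

test "MemoryPool sketch" {
    var pool = std.heap.MemoryPool(Node).init(std.testing.allocator);
    defer pool.deinit();

    const node = try pool.create(); // served from the pool's free list when possible
    node.* = .{ .data = 'a', .children = &[_]Node{} };
    pool.destroy(node); // slot is recycled rather than returned to the backing allocator
}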

Even in ReleaseFast mode, the GPA is not especially fast. Personally, I ignore this, because there’s an issue to replace it and I’m fine with just benefiting from that work when it becomes available.

There’s a nice handful of custom Zig allocators available, and it’s fairly easy to set things up so your code uses the GPA in safe modes, for all the memory error detection, and switches to one of those in performance modes. I’d suggest not bothering with this until you’re otherwise performance-optimizing your code, though.


First I need to get more raw…
Each Node takes up too much memory because of the array overhead.
I am at 800 MB, which should be 200 max.

So somehow I need to allocate raw memory and keep my own count.

const Node = packed struct {
    data: u8,
    count: u8,
    children: *Node, // or ?*Node
};

This looks like a good candidate for a MemoryPool. Since you said it’s a DAG structure, children won’t appear on leaf nodes, so you definitely want ?*Node for the .children field. An optional pointer is a nullable pointer in disguise, so it’s “free” in terms of memory.

Keep in mind that if you’re testing memory pressure in Debug or (I think?) ReleaseSafe modes, the results will be meaningless, because there’s considerable overhead added to make memory-bug detection viable.

A MemoryPool uses a custom stride based on the size of the type it specializes in, so there should be no bookkeeping overhead other than that of the backing allocator. It’s well worthwhile reading the source code for these things, to get a sense of how they should perform.

I have only heard of MemoryPool; that will require some investigation.
What is the way to access ‘children’ in that case?
And how do I ‘grow’ the children when needed?

I was playing around with allocator.alloc, but that again gives a slice, which means 16 bytes of overhead per node.

An optional can be replaced by a sentinel value. This can sometimes simplify syntax / logic, and it can save storage (at least until Zig gets better at optimizing the memory usage of optionals).

Optional pointers (and only pointers) use a sentinel 0 for null. There are plans afoot to add more niche optimizations for optionals, but right now it’s pointers.
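You can check both claims directly:

const std = @import("std");

test "optional sizes" {
    // ?*T reuses address 0 as the null representation, so it costs nothing extra.
    try std.testing.expectEqual(@sizeOf(*u8), @sizeOf(?*u8));
    // Non-pointer optionals currently pay for a separate tag plus padding.
    try std.testing.expect(@sizeOf(?u21) > @sizeOf(u21));
}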

I almost always choose to follow the logic of the language, rather than details of the status quo implementation. I was just working on code with a ?u21 field. That should take up four bytes, but it currently takes eight.

So there’s a choice there: I could complicate everything else I’m doing by using, say, a u22 with a high value as a sentinel meaning null, or I can work with the language and get the advantages of an optional, and then wake up one day, build Zig master, and presto, four bytes.

I nearly always choose the latter course. Realistically, an extra half-word here and there will make zero actual difference in this specific case. If you’re planning on having a million copies of a struct with a ?u21 field then yeah, maybe you have to pick a sentinel and fix it later when Zig catches up.

But it’s been my experience that people think they have that kind of problem far more often than they actually do.


I find it confusing that your field is called children when you say it is a Directed Acyclic Graph and also call it a word tree. I would expect children for a tree; for a DAG I would expect something like next or out or edges, something that actually describes what kind of directed edge is formed to that other node.

So is it a tree, a trie, or a DAG?
I think the most specific one that applies matters. My suspicion is that it is a Trie - Wikipedia containing all the words loaded from the file.

I mostly point that out because having an accurate mental model of the needed data-structure makes it much easier to suggest compact memory layouts.

Directed Acyclic Graphs are more general than tries and thus also more difficult to handle, because they have more flexibility: for example, they can have multiple start nodes (or internal nodes with multiple incoming edges), whereas trees and tries have a single root node and are more constrained.

Naively implemented Tries are nice from an educational perspective, but have very poor memory utilization.

But creating a less naive implementation requires knowledge about the operations that are required to be efficient for where it is used later.

For example, it makes a big difference whether you just want to know whether a word is present in the data structure, or whether you also want to be able to discover similar words that share a common prefix.

So what are the actual operations you need to support?

If you just wanted presence, you could use a hash table or a HAMT.
With a hash table, however, you have less potential for key compression (reusing prefix data).
(I am working (on and off) on a HAMT implementation, but there are still a bunch of things I want to clean up and figure out.)

A HAMT has the benefit that it still has the trie characteristic of deeper nodes reusing data from nodes that are closer to the root (if there is a lot of similar prefix data that can save memory).

The naive trie implementations store 1 character per node.
The HAMT instead stores n branches per node (where n is a power of 2, usually 8, 16, 32, or 64) together with a bitmap that is used to store only the slots which aren’t empty; this avoids spending a lot of bytes on null pointers.
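The child lookup in such a bitmap node boils down to a mask and a popcount; a rough sketch (the names are illustrative, not from an actual library):

const std = @import("std");

// Sketch of a bitmap-compressed node: only occupied slots are stored, and
// a child's position in the packed array is the number of set bits below
// its slot in the bitmap.
const BitmapNode = struct {
    bitmap: u32, // bit i set => slot i has a child
    children: [*]*BitmapNode, // densely packed, one entry per set bit

    fn getChild(self: *const BitmapNode, slot: u5) ?*BitmapNode {
        const bit = @as(u32, 1) << slot;
        if (self.bitmap & bit == 0) return null;
        // Count the set bits below `slot` to find the packed index.
        const index = @popCount(self.bitmap & (bit - 1));
        return self.children[index];
    }
};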

So far I haven’t needed a search trie, but if I did I would probably go with something like this: Bitwise trie with bitmap - Wikipedia and the related paper Fast And Space Efficient Trie Searches - Phil Bagwell
(I just found that paper through the article, I eventually need to read it in detail)

However, the AMT and the HAMT are quite closely related, with the latter building on the former, so it seems reasonable for a library that supports HAMTs to also support AMTs (so I am considering supporting AMTs in my WIP library).

details of my Implementation

My current implementation already supports disabling hashing; in that case it just consumes bits from the key directly, but it doesn’t have functions to iterate over ranges from a start key to an end key yet. (It should be possible to build these functions; they just don’t make sense for a hashed trie, so I hadn’t thought of that use case yet.)

It also would be possible to avoid storing the parts of the key which are encoded in the path and only store the tail, but so far I haven’t bothered implementing that because I didn’t have a good use case for it (and it adds implementation complexity); in search tries it might be relevant, though.

My implementation has support for pointer based nodes and index based nodes and a custom allocator for variable sized slices that have a predictable range of sizes (essentially an allocator optimized for allocating hamt-nodes).

The advantage of index based nodes is that you save memory by essentially cutting the u64 pointers in half and storing u32 indices instead.
(I actually use something more complex than an index, to support indexing into a bunch of different blocks of memory that contain slices of one particular size (for incremental allocation and efficient reuse, without wasting space on redundant slice metadata), so what I use is more like a handle, but it fits in a u32)

Its support for both index and pointer use is interesting, but also kind of ugly and I am split on whether I want to separate it into two implementations or not.

I wanted to figure out the things about its implementation that I am not satisfied with before publishing it, but I am beginning to think it may make more sense to publish it while I am still not satisfied with it, so that the community can help me find solutions to some of the ugly things about it.


Depending on what you actually want to do with your datastructure, you might also be interested in this post and the topic in general:


I already did this thing in Delphi a long time ago and want to optimize it. Currently it only supports up to 31 letters in a Scrabble game, for letters in the range up to 255. So no Unicode, because that slows things down.

Someone invented the term ‘GADDAG’ for it.
Here is my first implementation, which is not tested yet.

See the method ‘add_word’ for some details.

The implementation is maybe naive, but it worked quite well. Still, I am looking for an easier / faster / less memory-consuming structure. A node can have up to 32 children (including a begin-of-word node).
And those children can have 32 children too.
The main point is that you can jump in anywhere in a word and go forwards or backwards, so when generating moves you hardly have any misses!

The main problem with the Node children is the relatively enormous overhead of allocator.alloc.
I would like to allocate a growing raw pointer, which seems impossible!
The current way takes 800 MB (around 10 million nodes), while it should occupy at least 4 times less.

When this works I want to optimize it further, maybe putting everything in one large sequence of Nodes and maybe joining equal paths.

At least the memory is reasonably performant now: building the roughly 10 million nodes in the tree from 350,000 words is done in 1.3 seconds.

Overhead in allocated size or time required to allocate?

If you need to grow arbitrary nodes by 1 element, I would just keep a table of size-specific freelists. You should be able to determine statically what the possible sizes are, so you can have an array of pointers that are the heads of the freelists.

Then when allocating, you first check the freelist and otherwise allocate the needed size. On deallocating, you add the element to the freelist, keeping it allocated for the next node that needs that size.

You also could keep a count to limit the number of things you add to a freelist; past a certain maximum you could free them instead of adding them (because if a freelist grows past a certain size, it indicates that it isn’t actually being allocated from).

Then when you are done building, you can free all the remaining nodes from the freelists.
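A sketch of that scheme (the names are illustrative, and ArrayList method names like popOrNull shift a bit between Zig versions):

const std = @import("std");

// One freelist per possible children-count: lists[n] holds released blocks
// of exactly n elements, ready for reuse.
fn SizeClassFreelists(comptime T: type, comptime max_len: usize) type {
    return struct {
        const Self = @This();

        allocator: std.mem.Allocator,
        lists: [max_len + 1]std.ArrayListUnmanaged([]T),

        fn init(allocator: std.mem.Allocator) Self {
            return .{ .allocator = allocator, .lists = .{.{}} ** (max_len + 1) };
        }

        fn alloc(self: *Self, n: usize) ![]T {
            // Reuse a block of the exact size if one was freed earlier.
            if (self.lists[n].popOrNull()) |block| return block;
            return self.allocator.alloc(T, n);
        }

        fn free(self: *Self, block: []T) !void {
            // Keep the block for the next node that needs this size
            // (a real version might cap the list length, as described above).
            try self.lists[block.len].append(self.allocator, block);
        }
    };
}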

Now that is something I have to think over :slight_smile:
Yes, the memory overhead: a node is 1 byte of data and a few children (like 8, or maybe 15 or so). The children slice alone is already 16 bytes, which is what makes the whole thing so big.
The children should really be sequential in memory for access speed.

When creating it in Delphi I had just a raw memory pointer, which I cast into an array, thus avoiding the overhead.

This is not specific to your problem, but a general note on allocator selection: I like to use the GPA a lot because of the debug features, but in release builds, I just swap it for something else like the c-allocator. This can be done with a few lines of code and is fully comptime. I can share an example if anyone wants/needs it.

I have zero knowledge of C. I tried the c_allocator, but got a bunch of errors about linking C.
I read it is faster, so always eager to learn…

Yes, you do have to link libc when using the c_allocator. That’s just one extra line in the build script, but some people or some projects want to avoid this, I guess.
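(For reference, that line in build.zig is just this, assuming exe is your executable’s compile step:)

exe.linkLibC(); // makes std.heap.c_allocator usable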
I’ll not write up any code today though. It’s nighttime in Germany and I’m just on my phone now.


You can see my thing here: trecker/src/main.zig at main · GigaGrunch/trecker · GitHub
At the very beginning of the main function, I’m selecting the allocator. The actual decision making is in the build.zig file, which passes it to the project via build options.
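A variant of the same idea, keyed off the optimize mode instead of a build option, could look like this (a sketch, not the exact code from the repo):

const std = @import("std");
const builtin = @import("builtin");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // builtin.mode is comptime-known, so only one branch survives compilation.
    const allocator = if (builtin.mode == .Debug)
        gpa.allocator() // leak detection and safety checks while developing
    else
        std.heap.c_allocator; // faster in release builds; requires linkLibC()

    const data = try allocator.alloc(u8, 64);
    defer allocator.free(data);
}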

I don’t know what you are referring to, do you mean the children slice?


I made some rough calculations. I think the StencilVector implementation I have would work very well for your use case, and the custom allocator too.

You could use the StencilVector directly to implement your nodes. This would give you a compact memory layout and no need to manually sort the children of nodes; the only part which would need some figuring out is how the building actually works.

I think your current code uses mutation of nodes to build the data structure, so it mixes the node identity with changes to the size of the data associated with that identity.

Because a StencilVector is allocated with a specific element count, it can’t change size directly, but that is relatively easy to work around: while building, you would keep an array of StencilVector references (either indices or pointers).

When creating a new node, you would add it to the array; the index is the logical identity of the node, and the value is the index/pointer to the current StencilVector that represents the node data. When a node grows, the value is replaced with the new StencilVector, while the identity stays the same.

When the building is done and there won’t be any size changes anymore, you could do a pass over all the logical ids and replace them with the direct refs (index or pointer), thus removing that extra indirection.

I think this would achieve much better space utilization than a solution that tries to make the nodes change size while directly keeping an identity; the existing Node also has an indirection that isn’t needed, namely the children slice.

For this StencilVector solution, the extra indirection would only be needed while the data structure is being constructed, and it can be removed after building.
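A sketch of that build-time indirection (generic code; BlockRef stands in for the StencilVector’s index/pointer type, which isn’t public):

const std = @import("std");

const NodeId = u32; // stable logical identity while building
const BlockRef = u32; // ref to the node's current fixed-size block

const Builder = struct {
    refs: std.ArrayList(BlockRef), // NodeId -> current block

    fn init(allocator: std.mem.Allocator) Builder {
        return .{ .refs = std.ArrayList(BlockRef).init(allocator) };
    }

    fn newNode(self: *Builder, block: BlockRef) !NodeId {
        const id: NodeId = @intCast(self.refs.items.len);
        try self.refs.append(block);
        return id;
    }

    // Growing a node allocates a bigger block and swaps the ref;
    // every NodeId held elsewhere stays valid.
    fn grow(self: *Builder, id: NodeId, new_block: BlockRef) void {
        self.refs.items[id] = new_block;
    }
};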

Here are the rough calculations:

Scrabble node sizes:
---------------------------------

Orig Node
node size: 1 byte data + 3 bytes padding + 16 bytes slice
min node size: 20 (0 children)
max node size: 640 (31 children)
---------------------------------

multiarraylist Node
node size: 4 bytes start of children + 1 byte children count + 1 byte data
min node size: 6 (0 children)
max node size: 192 (31 children)
---------------------------------

PointerNode StencilVector
MaskInt: u31
MetaBits: 33
---------------------------------
min node size: 8 (0 children)
max node size: 256 (31 children)
---------------------------------

IndexNode StencilVector
MaskInt: u31
MetaBits: 1
---------------------------------
min node size: 8 (0 children)
max node size: 132 (31 children)
---------------------------------

IndexNode StencilVector (avoid bad flags layout)
MaskInt: u31
MetaBits: 1
---------------------------------
min node size: 4 (0 children)
max node size: 128 (31 children)
plus 1-12 bytes for flags (using DOD) for 1-32 nodes
---------------------------------

The only slightly annoying thing is the IndexNode variant being 8 bytes for empty children, but when you think about it, it doesn’t matter: (if I am simulating things correctly in my head) we don’t actually need to store node.data.code when using these StencilVectors, and that means there are only 2^3 = 8 different possible empty nodes (based on the remaining 3 boolean flags), so we can just create those as static instances. Further, we can check which of them are actually used and only create those.

Thus a node ref directly tells us whether it is one of the empty nodes, so we don’t need to store flags for them. Alternatively, we can use Data-Oriented Design and store the nodes and flags as SoA, so the flags become either a single byte/nibble in a flags array or 3 bits in three separate bit arrays.
Or store 1 bit in the meta bit and 2 bits in two separate bit arrays.
(There might be a whole lot of ways to actually deal with these flags.)

If that doesn’t work for some reason, we also could use the remaining single meta bit, to indicate that the bytes need to be interpreted differently for some kind of small node optimization.

The reason why I say that node.data.code isn’t needed is that, when using StencilVectors, every node that refers to other nodes already stores the same information in its bitmask; only external references or code that does calculations would need to hold on to that information until it is encoded in a bitmask again. External references would basically be StencilVector index + u5.

There might be more tricks possible, if I understood the algorithm and the connections between the nodes fully. I guess that could be found out by actually trying to implement it.

But this should be enough info so that you could calculate some rough numbers about how much memory each variant would require.

const Element = extern struct {
    ptr: ?*Element align(1),
    count: u8 align(1),
    data: u8 align(1),
};

This one is finally 10 bytes. But I have no clue how to handle the memory allocation. Everything I tried crashes mercilessly.