Fast memory and GeneralPurposeAllocator

Sze · November 17, 2024, 2:11am

I think with MultiArrayList you would get most of the benefit without manual allocation or the more complex things I am doing with StencilVectors, something like this:

const Data = struct {
    first_child: u32,
    data: u8,
    count: u8,
};

const Gaddag = struct {
    const NodeData = std.MultiArrayList(Data);
    const NodeIds = std.ArrayListUnmanaged(u32); // id -> index

    nodes: NodeData,
    ids: NodeIds,

    pub fn init() Gaddag { // or use a named empty value for usage with 0.14 decl literals
        return .{
            .nodes = .{},
            .ids = .{},
        };
    }
};

Then create some function to resize a node given its id to a new size.
The old node elements get added to a freelist, the new nodes are gotten from a free list or appended to the MultiArrayList, the id is updated to point to the new index.

The part that is better about StencilVectors is that they allow you to index into the StencilVector directly by using the bitpattern, so there is no need for scanning through the child element to find the correct child, so if you have a whole lot of lookups that might increase the performance quite a bit. (But I don’t really know why I would care for scrabble (are you writing a simulator again that runs millions of iterations?))

ericlang · November 17, 2024, 10:19am

The idea of the stencilvectors I still have to grasp but the ‘freelist’ idea I get, I believe.
BTW: For the dutch language I have 353.688 words which leads to a total of 10.881.796 nodes in our gad-dag. For the english list I have nr of nodes is 6.592.251 for 267.751 words.

So first let’s check if I understand the freelist idea…
Nodes are stored simply in a list.
When the child nodes have to grow

A ‘re-usable hole’ arises inside our buffer. We keep track of these re-usable holes in some dictionary or hash for later use (location + available length). Probably fill these holes with zeroes for clarity.
If we find a re-usable hole of the correct length we copy the children there and otherwise copy the new-length-children to the end of the buffer.

I am not making some simulator now I just want the algorithm to be as fast as possible for any scrabbleboard (15x15) for any rack of letters (7) for every language I support.

I don’t know what you are referring to, do you mean the children slice?
Yes that one only is 16 bytes for nothing, even if the array is of zero length.

Later on - but not now - I will enhance the whole thing for unicode, any set of characters, any size of board, any size of rack, up until reasonable limits.
This whole thing was born from my rant about “the unicode mess”. To practise unicode. Very soon I got into performance problems, but as said that is for later.

ericlang · November 17, 2024, 3:02pm

Ok I think I figured it out. Insanely fast. I will update github and send link when ready.

ericlang · November 17, 2024, 8:31pm

git updated. speed is quite ok. loading time is around 400 ms.
scrabble/src/gaddag.zig at main · ericlangedijk/scrabble · GitHub.

@Sve is this about the freelist strategy you described?

Sze · November 17, 2024, 10:12pm

Yes the freelists are used like what I meant.

I think I would use self.freelists[index] directly most of the time or maybe sometimes const fl = &self.freelists[index], I don’t think the get_freelist function is really useful.

I think you overuse range for loops, I would change the loops that operate on freelists to iterate over the slices, so instead of:

for (0..32) |i|
{
    self.freelists[i].deinit();
}

this:

for (&self.freelists) |*f| f.deinit();

instead of:

for(1..32) |i|
{
    const freelist = self.get_freelist(@intCast(i));
    self.wasted += @truncate(freelist.items.len * i);
    for (freelist.items) |idx|
    {
        @memset(self.nodes.items[idx..idx + i], Node.EMPTYNODE);
    }
}

for (0..32) |i|
{
    var freelist = self.get_freelist(@intCast(i));
    freelist.clearAndFree();
}

I don’t think you need to special case the zero freelist, I think it should result in a no-op?:

for(&self.freelists, 0..) |*f, i| {
    self.wasted += @truncate(f.items.len * i);
    for (f.items) |idx| @memset(self.nodes.items[idx..][0..i], Node.EMPTYNODE);
    f.clearAndFree();
}

I also used re-slicing instead of the [idx..idx + i] arithmetic, I just like it better, I think it is mostly the same to the compiler.

ericlang · November 17, 2024, 10:19pm

You are a genius. And right: the code is much shorter.
Reslicing? Where? (ah you mean for(&self.freelists, 0..)
(btw I read your post with these iterators. interesting)
And: Thank you very much! You put me on the path I needed!!

Sze · November 17, 2024, 11:38pm

With reslicing I mean using self.nodes.items[idx..][0..i] instead of self.nodes.items[idx..idx + i] , I like that with the former you don’t have to mention idx twice.

Thank you for the interesting conversations!

ericlang · November 18, 2024, 12:47am

And I reduced memory use from 800 MB downto 75.