Flat Memory Model Tree Search Brainstorm (scrabble)

As a sequel to this link which got a bit out of hand:
memory and trees

I created a quite well-performing word graph, where each letter node contains children that can follow that letter. So a letter node can have at most 32 children.

pub fn find_child_node(self: *Graph, node: Node, charcode: CharCode) ?Node
{
    if (node.count == 0) return null;
    const children = self.nodes.items[node.child_ptr .. node.child_ptr + node.count];
    for (children) |run| if (run.data.code == charcode) return run;
    return null;
}

In the past, and now again, I came to three conclusions, which I think are still valid:

  1. Sorting the children does not speed things up. A binary search is much slower than a linear scan for only 32 items.
  2. Even one little early-exit check inside the loop, if (run.data.code > charcode) return null, slows the search down.
  3. Adding a 32-bit bitmask for pre-checking whether a child exists does not speed things up either!

Obviously (also in my old Delphi code) a simple loop without branching is insanely fast.
BTW: I do not use Node pointers, because the Nodes are very small.

One question I have: is the zig compiler able to SIMD these kind of loops?

If not would some crazy SIMD instruction speed things up or not?
Something like:
SIMD.gather_all_charcodes_from_the_addressed_slice(node.count);
and then:
SIMD.is_any_charcode_the_same_as_input_charcode()
and then:
return the found node (which is in the chunk of memory)
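Zig (via LLVM) can sometimes auto-vectorize such loops, but the comparison can also be written explicitly with @Vector. A minimal sketch of what that could look like; the names are hypothetical and it assumes the child charcodes have first been gathered into a zero-padded 32-byte buffer:

```zig
const std = @import("std");

// Sketch: explicit SIMD child lookup with @Vector, assuming the up-to-32
// child charcodes were copied into a zero-padded [32]u8 buffer.
// All names here are illustrative, not from the original Graph code.
fn findChildSimd(codes: [32]u8, count: u8, charcode: u8) ?u5 {
    const needle: @Vector(32, u8) = @splat(charcode);
    const vec: @Vector(32, u8) = codes;
    // One bit per lane: set where the lane equals the searched charcode.
    const hits: u32 = @bitCast(vec == needle);
    // Mask off the padding lanes beyond `count`.
    const lanes = if (count >= 32) ~@as(u32, 0) else (@as(u32, 1) << @intCast(count)) - 1;
    const valid = hits & lanes;
    if (valid == 0) return null;
    return @intCast(@ctz(valid)); // position of the matching child
}

test "findChildSimd" {
    var codes = [_]u8{0} ** 32;
    codes[0] = 5;
    codes[1] = 9;
    try std.testing.expectEqual(@as(?u5, 1), findChildSimd(codes, 2, 9));
    try std.testing.expectEqual(@as(?u5, null), findChildSimd(codes, 2, 7));
}
```

Whether this beats the plain loop would have to be measured; for only 32 tiny elements the scalar loop may already saturate the hardware.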

Currently I am building the scrabble move generator, which is already quite promising:
it calculates 100,000 moves, including the score, in 1/100 of a second.

Another question: is there any performance difference between these parameter types: *Graph and *const Graph?

As a bonus for clarity: this is my Node.

/// 6 byte node. or 10 bytes with mask
pub const Node = extern struct
{
    pub const EMPTYNODE: Node = Node.init(0);

    /// When cleaning up the graph, we fill the unused nodes (from the remaining freelists) with invalid nodes.
    pub const INVALIDNODE: Node = Node.invalidnode();

    pub const Data = packed struct
    {
        /// The 5 bits character code
        code: CharCode = 0,
        /// Is this node the beginning of a word?
        is_bow: bool = false,
        /// Is this node the end of a word?
        is_eow: bool = false,
        /// Is this node a whole word?
        is_whole_word: bool = false,
    };

    /// Our 1 byte node data
    data: Data align(1),
    /// maximum count is 32
    count: u8 align(1),
    /// Index into Graph.nodes. Children are always sequential in memory.
    child_ptr: u32 align(1),
    // The charcode mask of children (not yet decided if we need that, probably not)
    //mask: CharCodeMask align(1),

    fn init(charcode: CharCode) Node
    {
        return Node {.data = Data {.code = charcode }, .count = 0, .child_ptr = 0 };
    }

    fn invalidnode() Node
    {
        return Node
        {
            .data = Data {},
            .count = 0,
            .child_ptr = 0xffffffff,
        };
    }
};

If you use a bitmask you don't need the loop; instead you can calculate the index directly from the bitmask. That is the whole idea behind the StencilVector, it just does this to calculate the index:

inline fn index(mask: MaskInt, bit_val: MaskInt) u16 {
    return @popCount(mask & (bit_val -% 1));
}
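For illustration, a small self-contained version of this mask-to-index computation (the names are mine, assuming a u32 mask where bit c is set when a child with charcode c exists):

```zig
const std = @import("std");

// Assumed layout: bit c of `mask` is set iff a child with charcode c exists.
// The dense index of that child is the number of set bits below bit c.
fn childIndex(mask: u32, charcode: u5) ?u32 {
    const bit = @as(u32, 1) << charcode;
    if (mask & bit == 0) return null; // no child with this charcode
    return @popCount(mask & (bit - 1));
}

test "childIndex" {
    // children exist for charcodes 1, 3 and 4 -> mask 0b11010
    try std.testing.expectEqual(@as(?u32, 0), childIndex(0b11010, 1));
    try std.testing.expectEqual(@as(?u32, 1), childIndex(0b11010, 3));
    try std.testing.expectEqual(@as(?u32, 2), childIndex(0b11010, 4));
    try std.testing.expectEqual(@as(?u32, null), childIndex(0b11010, 2));
}
```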

Yes, this afternoon in my car it entered my mind that this phenomenon must be the stencil vector. I think I can directly implement it.

This u32 will add 50 MB :slight_smile:

No, with a StencilVector and indices you don't need the count and child_ptr fields; instead of them you have a single u32 mask. The count is @popCount(mask) and the elements are located directly after the mask. Additionally this increases memory locality, since you don't need to load other cachelines.

I think you could also get rid of data, because other nodes already store it via what is encoded in the bitmask of the nodes that refer to the current node. This information can be acquired and kept around while traversing from one node to another, and only needs to be stored for nodes which are used solely as entry points (if such nodes exist).
So I think it would save 1-2 bytes per node.

Putting the child nodes directly after the parent will be quite a change in my crazy building algorithm! But true, that would again save 10 MB * (u32 + u8). I will try that later on!
Data I will definitely keep; the whole system is based on this little byte.

The only way I can see this structured way of storing working is building the tree and after that 'reorganizing' it into a new memory block.

The nodes would look like this:

index | 0    | 1    | 2       | 3       | 4       | 5    | 6       | 7       | 8    | ...
kind  | mask | mask | child 1 | child 2 | child 3 | mask | child 1 | child 2 | mask | ...
value | 0000 | 1011 | 2       | 3       | 4       | 1001 | 6       | 7       | 0010 | ...

id    | 0 | 1 |  2 |  3 |  4 | 5 |  6 |  7 |  8 | ...
index | 0 | 1 | 11 | 15 | 20 | 5 | 33 | 35 |  8 | ...

kind is just a description; the mask values are bitfields telling us how many children exist and in which logical slots (each corresponding to a specific data.code). After every mask there are child-count many entries that hold the ids of the children.

The id is mapped to the index of the mask of the child node. This extra indirection allows us to move nodes and change their size.

Once building is done you can iterate over all children and replace the ids with the indices; then you can throw away the id->index mapping (array).
But this means that you can no longer add new nodes.
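The finalize pass described above could look roughly like this. A sketch under the assumed flat layout: one []u32 where each mask is immediately followed by @popCount(mask) child ids, plus the temporary id->index array (names are illustrative):

```zig
// Sketch: replace child ids with direct indices, after which the
// id_to_index mapping can be thrown away. Layout assumption: `nodes`
// is a flat []u32 where each mask is immediately followed by
// @popCount(mask) child ids.
fn finalize(nodes: []u32, id_to_index: []const u32) void {
    var i: usize = 0;
    while (i < nodes.len) {
        const count = @popCount(nodes[i]);
        for (nodes[i + 1 .. i + 1 + count]) |*child| {
            child.* = id_to_index[child.*]; // id -> index, in place
        }
        i += 1 + count;
    }
}
```

After this single linear walk every child entry points directly at its mask, so lookups no longer need the extra indirection.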


Btw. I am working on pulling the HAMT / StencilVector stuff into a WIP package.


I understand the concept, I believe. When my scrabble algorithm works I will come back here…

I have now published the implementation, HAMT and StencilArrays; maybe you can take a look at it and play around with it to see whether you can use it for your use case.

I think it also might make sense to come up with a minimal API that describes the inputs, the building, and the operations that are done on the built data structure, to basically create a benchmark where people can plug in their solution to your use case. Maybe we could even turn it into a challenge topic.

I think doing that may have the potential that somebody finds a whole other way to approach the problem. While the GADDAG seems clever, I am not completely convinced that it is actually a data structure well suited to how the hardware operates. This is just a hunch that the algorithm may be a local maximum that distracts from seeing a solution that is way better. It is difficult for me to say whether there is anything to that hunch; it would require some more in-depth research and might just end up with no answer. So far I haven't spent a whole lot of time on graph-related problems.

After reading this from GADDAG - Wikipedia:

A GADDAG is a specialization of a Trie, containing states and branches to other GADDAGs. It is distinct for its storage of every reversed prefix of every word in a dictionary. This means every word has as many representations as it does letters; since the average word in most Scrabble regulation dictionaries is 5 letters long, this makes the GADDAG about 5 times as big as a simple DAWG.

And from the referenced paper https://ericsink.com/downloads/faster-scrabble-gordon.pdf :

Placing letter sets on arcs avoids designating states as final or not

It seems similar to the things I was thinking about, but maybe even sharing more aggressively.

Compression
A GADDAG (or DAWG) could be represented in various expanded or compressed forms
[…]
The simplest compressed representation is a single array of 32-bit words. States are a bit map indicating which letters have arcs. Each arc is encoded in 32-bit words as in the expanded representation. In this combined array, arcs directly follow the states they originate from in alphabetical order.
Compression has two disadvantages.

The first is that encoding destination states in arcs limits the potential size of the lexicon. Under expanded representation, a state is just the first index of a two-dimensional array, whereas, under compression, a state is the index of its bit-map in a single array of both arcs and states. This second index grows much faster, so that a larger lexicon would allow arcs to be encoded in 32 bits under expanded representation than under compression.

The second disadvantage is the time it takes to find arcs. Each arc is addressed directly under expanded representation. Under compression, the bit map in a state indicates if a given letter has an associated arc. One must count how many of the preceding bits are set and then skip that many 32-bit words to actually find the arc. Although the actual number of preceding bits that are set is usually small, each preceding bit must be examined to compute the correct offset of the associated arc.

I think the 2nd disadvantage isn't actually true (on modern hardware), because we have hardware instructions that allow computing mask->index efficiently.

I think the first disadvantage might be preventable by instead approaching it like the 2D mapping that is also mentioned in the paper, see at the end.

The minimized and compressed DAWG structure represents the lexicon with an impressive 4.3 bits per character, less than half the number of bits required to represent the lexicon as text (including one byte per word for a delimiter). Appel and Jacobson achieved an even better ratio by encoding many arcs in 24 bits. The GADDAG minimized more effectively than the DAWG, being just less than 5 times larger rather than the 6.7 times larger expected with equally effective minimization.

Almost seems to me like the ideal would be to store the data as a DAWG and expand it to a GADDAG for use. I wonder whether it would be possible to construct the GADDAG from the DAWG.


One thing that I would try is to use the HAMT directly to represent the 2D array in a way that is sparsely allocated; I think that might lead to a representation that can be accessed like the expanded form but also saves memory.
Either the HAMT could be used without hashing, which enables prefix sharing and ordered traversal, so the state + letter could be combined into a single key.
Or maybe using hashing, where a big mapping maps to many small ones.

But I am not completely sure and probably should stop before my sleepy brain talks nonsense.

Yes, a GADDAG is crazy. As in my code:
Let's take the word "instances". It is added to the tree as follows:

1. i + nstances
2. ni + stances
3. sni + tances
4. tsni + ances
5. atsni + nces
6. natsni + ces
7. cnatsni + es
8. ecnatsni + s
9. secnatsni

The character before the + is marked as "is_bow" (begin-of-word).
A sentinel bow-node is inserted at index 0 of the children of that node.
The character i at #9 is marked as "is_whole_word".
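For illustration, the insertion order above can be generated with a small helper; this is a hypothetical sketch, not the author's builder:

```zig
const std = @import("std");

// Sketch: print the GADDAG entries for one word as
// "reversed-prefix + suffix", matching the list above.
fn printGaddagEntries(word: []const u8, writer: anytype) !void {
    for (1..word.len + 1) |i| {
        // the reversed prefix word[0..i]
        var j: usize = i;
        while (j > 0) : (j -= 1) try writer.writeByte(word[j - 1]);
        if (i < word.len) {
            try writer.writeAll(" + ");
            try writer.writeAll(word[i..]);
        }
        try writer.writeByte('\n');
    }
}

test "printGaddagEntries" {
    var buf: [256]u8 = undefined;
    var fbs = std.io.fixedBufferStream(&buf);
    try printGaddagEntries("cat", fbs.writer());
    try std.testing.expectEqualStrings("c + at\nac + t\ntac\n", fbs.getWritten());
}
```

This makes the size blow-up concrete: a word of n letters produces n entries, which is where the roughly 5x factor from the Wikipedia quote comes from.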

I never got into sharing paths.

Algorithm almost finished, so I will come back soon here.


And that is exactly the link I based my scrabble on a loooooooong time ago.
I added some small optimizations as well.


OK!! When the child nodes are sorted, this crazy bitmask indexing function works.
It speeds up my scrabble by 400,000 moves per second. But there is more to optimize still, now that I can use the mask.

1 Like