Flat Memory Model Tree Search Brainstorm (scrabble)

As a sequel to this link which got a bit out of hand:
memory and trees

I created a quite well-performing word graph, where each letter node contains the children that can follow that letter. So a letter node can have at most 32 children.

pub fn find_child_node(self: *Graph, node: Node, charcode: CharCode) ?Node
{
    if (node.count == 0) return null;
    const children = self.nodes.items[node.child_ptr..node.child_ptr + node.count];
    for (children) |run| if (run.data.code == charcode) return run;
    return null;
}

In the past, and now again, I came to three conclusions, which I think are still valid:

  1. Sorting the children does not speed things up. A binary search is much slower than a linear scan for only 32 items.
  2. Even one little check inside the loop (if (run.data.code > charcode) return null) slows the search down.
  3. Adding a 32-bit bitmask for pre-checking whether a child exists does not speed things up either!

Obviously (also in my old Delphi code) a simple loop without branching is insanely fast.
BTW: I do not use Node pointers because the Nodes are very small.

One question I have: is the Zig compiler able to auto-vectorize (SIMD) these kinds of loops?

If not, would some crazy SIMD instruction speed things up or not?
Something like:
SIMD.gather_all_charcodes_from_the_addressed_slice(node.count);
and then:
SIMD.is_any_charcode_the_same_as_input_charcode()
and then:
return the found node (which is in the chunk of memory)
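Something close to that sketch can be written with Zig's @Vector builtin. Below is a minimal sketch, not the code from the graph above: the name findLane, the fixed 32-lane padding, and the 0xFF filler value (which can never match a 5-bit charcode) are my assumptions.

```zig
const std = @import("std");

/// Sketch: load the (up to 32) child codes as one vector, compare all
/// lanes against the wanted charcode at once, and return the first
/// matching lane. Unused lanes must be padded with an impossible value
/// such as 0xFF so they can never match a 5-bit code.
fn findLane(codes: [32]u8, wanted: u8) ?u5 {
    const v: @Vector(32, u8) = codes; // array coerces to vector
    const needle: @Vector(32, u8) = @splat(wanted);
    const hits = v == needle; // lane-wise compare -> @Vector(32, bool)
    if (std.simd.firstTrue(hits)) |lane| return @intCast(lane);
    return null;
}
```

Whether this beats the plain loop is something only a benchmark can tell: the scalar loop usually exits early, while the vector version always touches all 32 lanes.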

Currently I am building the Scrabble move generator, which is already quite promising:
it calculates 100,000 moves, including their scores, in 1/100 of a second.

Another question: is there any performance difference between these parameter types: *Graph or *const Graph?

As a bonus for clarity: this is my Node.

/// 6-byte node, or 10 bytes with mask
pub const Node = extern struct
{
    pub const EMPTYNODE: Node = Node.init(0);

    /// When cleaning up the graph, we fill the unused nodes (from the remaining freelists) with invalid nodes.
    pub const INVALIDNODE: Node = Node.invalidnode();

    pub const Data = packed struct
    {
        /// The 5-bit character code
        code: CharCode = 0,
        /// Is this node the beginning of a word?
        is_bow: bool = false,
        /// Is this node the end of a word?
        is_eow: bool = false,
        /// Is this node a whole word?
        is_whole_word: bool = false,
    };

    /// Our 1 byte node data
    data: Data align(1),
    /// maximum count is 32
    count: u8 align(1),
    /// Index into Graph.nodes. Children are always sequential in memory.
    child_ptr: u32 align(1),
    // The charcode mask of children (not yet decided if we need that, probably not)
    //mask: CharCodeMask align(1),

    fn init(charcode: CharCode) Node
    {
        return Node {.data = Data {.code = charcode }, .count = 0, .child_ptr = 0 };
    }

    fn invalidnode() Node
    {
        return Node
        {
            .data = Data {},
            .count = 0,
            .child_ptr = 0xffffffff,
        };
    }
};

If you use a bitmask, you don’t need the loop; instead you can calculate the index directly from the bitmask. That is the whole idea behind the StencilVector, which just does this to calculate the index:

inline fn index(mask: MaskInt, bit_val: MaskInt) u16 {
    return @popCount(mask & (bit_val -% 1));
}
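A worked example of the popcount trick, with MaskInt taken to be u32 for concreteness: a mask of 0b1011 means children exist for codes 0, 1 and 3. To locate the child for code 3, take bit_val = 1 << 3 = 0b1000; then bit_val -% 1 = 0b0111 keeps only the lower bits, and their popcount is the child's slot.

```zig
const std = @import("std");

// Same function as above, specialized to u32 for this example.
inline fn index(mask: u32, bit_val: u32) u16 {
    return @popCount(mask & (bit_val -% 1));
}

pub fn main() void {
    std.debug.assert(index(0b1011, 1 << 0) == 0); // code 0 -> slot 0
    std.debug.assert(index(0b1011, 1 << 1) == 1); // code 1 -> slot 1
    std.debug.assert(index(0b1011, 1 << 3) == 2); // code 3 -> slot 2
}
```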

Yes, this afternoon in my car it entered my mind that this phenomenon must be the stencil vector. I think I can implement it directly.

This u32 will add 50 MB 🙂

No, with a StencilVector and indices you don’t need the count and child_ptr fields; instead of them you have a single u32 mask. The count is @popCount(mask), and the elements are located directly after the mask. Additionally, this improves memory locality, since no other cache lines need to be loaded.
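A child lookup in that flat layout could look roughly like this; a sketch under my own naming, where findChild, the []u32 slice of raw words, and the 5-bit code parameter are all assumptions:

```zig
// Sketch of a StencilVector-style lookup in one flat u32 array:
// nodes[node_index] is a node's mask; its @popCount(mask) children
// occupy the slots immediately after it.
fn findChild(nodes: []const u32, node_index: u32, code: u5) ?u32 {
    const mask = nodes[node_index];
    const bit = @as(u32, 1) << code;
    if (mask & bit == 0) return null; // no child for this code
    const slot = @popCount(mask & (bit - 1)); // rank of the set bit
    return nodes[node_index + 1 + slot];
}
```

Note there is no loop and no separate count or child_ptr load: one word holds everything needed to find the child.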

I think you could also get rid of data, because other nodes already store it: a node’s code is encoded in the bitmask of the nodes that refer to it. So this information can be acquired and kept around while traversing from one node to another, and only needs to be stored for nodes that are used purely as entry points (if such nodes exist).
So I think it would save 1–2 bytes per node.

Putting the child nodes directly after the parent will be quite a change in my crazy building algorithm! But true, that would again save 10 MB * (u32 + u8). I will try that later on!
Data I will definitely keep; the whole system is based on this little byte.

The only way I can see to store it in this structured way is to build the tree first and afterwards ‘reorganize’ it into a new memory block.

The nodes would look like this:

index | 0    | 1    | 2       | 3       | 4       | 5    | 6       | 7       | 8    | ...
kind  | mask | mask | child 1 | child 2 | child 3 | mask | child 1 | child 2 | mask | ...
value | 0000 | 1011 | 2       | 3       | 4       | 1001 | 6       | 7       | 0010 | ...

id    | 0 | 1 |  2 |  3 |  4 | 5 |  6 |  7 |  8 | ...
index | 0 | 1 | 11 | 15 | 20 | 5 | 33 | 35 |  8 | ...

kind is just a description. The mask values are bitfields telling us how many children there are and in which logical slots (each corresponding to a specific data.code); after every mask there are child-count many slots that hold the ids of the children.

The id is mapped to the index of the mask of the child node. This extra indirection allows us to move nodes and change their size.

Once building is done, you can iterate over all children and replace the ids with the indices; then you can throw away the id->index mapping (array).
But this means that you can no longer add new nodes.
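That finalize pass could be sketched as follows, using the flat u32 layout and the id->index table from the example above (the names finalize and id_to_index are mine):

```zig
/// Sketch: walk the flat array mask by mask, and rewrite every child
/// slot from a node id to that node's direct index in the array.
/// After this the id_to_index table can be discarded.
fn finalize(nodes: []u32, id_to_index: []const u32) void {
    var i: usize = 0;
    while (i < nodes.len) {
        const count: usize = @popCount(nodes[i]); // children of this mask
        for (nodes[i + 1 .. i + 1 + count]) |*child| {
            child.* = id_to_index[child.*]; // id -> index
        }
        i += 1 + count; // skip mask plus its children
    }
}
```

On the example table, the children of the mask at index 1 hold ids 2, 3, 4 and would become indices 11, 15, 20.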


Btw. I am working on pulling the HAMT / StencilVector stuff into a WIP package.


I understand the concept, I believe. When my scrabble algorithm works I will come back here…