A little related tale to this… I am using the “gaddag” structure in my program, which takes around 100 MB in memory.
The makers of the Macondo Scrabble engine say that their gaddag fits in around 5 MB. Some words about the structure are here:
I cannot follow and/or believe that, if you add nodes in the “gaddag” way, this fits in 5 MB for, let's say, 250,000 words.
(The word CARE would be entered here as ERAC, RAC@E, AC@RE, C@ARE)
Does anyone understand if it is true and how?
I only had a rough look at the text, but it seems very highly optimized.
Unlike other implementations of similar structures, it allows sharing the tail end of a node.
This makes me think that it may save on a lot of nodes through structural sharing.
It also seems to me that the “Kurnia Word Graph” may be a different graph and not really a gaddag, but that is just a hunch; finding that out would require reverse engineering the code, and I don’t really want to stare at Rust code.
I know these source lists are around 250,000 words.
And yes: mine is 20 times bigger. That is why I cannot believe it. Or they use some insane, completely different structure. I will have to check it out…
edit: ah, I missed your previous post. No… I also don’t want to stare at Rust. I will focus on my engine for now and keep smart gaddags for later.
Using standard pointers will bloat the structure. The original implementation rolled its own pointers. If memory serves from 30 years ago, 22 bits were used for pointing within its own large array, while the other 10 bits encoded other essential information. I recall having about 100K words encoded. I believe the SPE article provides lots of clues about this.
Also, there are lots of duplicate nodes, so FSA minimization is very effective.
Depending on the lexicon, 22-bit addresses might be insufficient for the unminimized version, so you may have to use standard pointers and then minimize it before hand-rolling a small structure that can easily stay in memory.
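Roughly, such a hand-rolled node might look like the following 32-bit record (only a guess at the layout; the exact split of the 10 “info” bits is not spelled out here):

const PackedNode = packed struct(u32) {
    child_index: u22, // index into one large node array instead of a real pointer
    info: u10,        // letter, end-of-word flag, last-sibling flag, ...
};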
I optimized the gaddag. It now uses a 32-bit node:
u24 for the “childpointer”
u5 for the stored character
2 flag bits
1 ‘terminator’ flag indicating this is the last child.
So now I am in the 40 MB range without any compression for 360,000 words.
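For illustration, that layout could be written as a Zig packed struct roughly like this (a sketch; the field names and the meaning of the two flag bits are my guesses):

const Node = packed struct(u32) {
    childpointer: u24, // index of the first child in one big node array
    char: u5,          // letter code (26 letters plus the “@” separator fit in 5 bits)
    flag_a: u1,        // one of the two flag bits, e.g. is_whole_word
    flag_b: u1,        // the other flag bit
    is_last_child: u1, // the ‘terminator’: this node is the last sibling
};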
Now I want to compress the tree by eliminating all equal subtrees.
I am trying to recursively hash all nodes so that each subtree has a hash code which is “the sum key of all its children”, but I keep getting collisions (or my algorithm is wrong).
My first (very) naive implementation looks like this (a little bit simplified).
fn recursive_hash_node(nodes: []Node, node: *Node) u64 {
    // initial key of this node alone (zobrist hashing, like in chess)
    var key = key_of_node(node);
    const children = get_children(nodes, node);
    if (children.len == 0) {
        node.key = key;
        return key;
    }
    for (children) |*child| {
        const childkey = recursive_hash_node(nodes, child);
        key ^= childkey; // how to combine?
        key = key *% 0x9e3779b97f4a7c15; // how to combine?
    }
    node.key = key; // how to combine with all children?
    return key;
}
Could I somehow use some of Zig’s standard hash functions to avoid collisions?
How should I combine the keys?
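For what it is worth, one way this could look with a standard hasher such as std.hash.Wyhash (just a sketch built on the same helpers as above; it replaces the xor/multiply combining step, but, as the reply below points out, it cannot rule out collisions entirely):

const std = @import("std");

// Feed the node's own key and all child keys, in order, into one Wyhash
// stream instead of xor-ing them together.
fn subtree_key(nodes: []Node, node: *Node) u64 {
    var h = std.hash.Wyhash.init(0x5eed); // arbitrary fixed seed
    const own = key_of_node(node);        // the per-node zobrist key from above
    h.update(std.mem.asBytes(&own));
    const children = get_children(nodes, node);
    for (children) |*child| {
        const childkey = subtree_key(nodes, child); // hash the subtree first
        h.update(std.mem.asBytes(&childkey));       // the order of children matters
    }
    const key = h.final();
    node.key = key;
    return key;
}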
With hashing you have to expect collisions, unless the probability of a collision is extremely small (which you only get with a good hash function and a hash value with many bits, which in turn can mean that the hash itself wastes space by being too big), or unless you have a perfect hash function.
Usually, to get a perfect hash function, the inputs need to be sparse enough (not cover the entire value space of possible inputs), and you need to spend a long time computing a hash function that doesn’t produce any collision for the inputs you actually use (which may also be impossible or not worth it for certain sets of inputs).
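For a rough sense of scale (my own back-of-the-envelope estimate, using the roughly 10 million nodes mentioned later in the thread): with a well-mixed 64-bit hash the birthday bound gives about

p ≈ n² / 2^65 ≈ (10^7)² / 3.7·10^19 ≈ 3·10^-6

so an accidental collision is unlikely but not impossible, while a weak combining function has far fewer effective random bits and will collide all the time.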
My guess would be that in your case both of these aren’t very promising approaches for compression. You might get better results by instead trying to figure out which nodes are worth sharing versus storing inline directly as values. You could also try to apply some of the more classic compression algorithms, either by adapting them to your data structure or by writing your data structure out as a binary blob, compressing that with different compression algorithms, and comparing which one gives you the best result.
I think I would start with the latter (compressing the binary data with available compression algorithms), and if that doesn’t give you satisfying results, you could create a histogram over all your sub-trees to see whether there is still enough redundancy to expect better compression results from a custom compression method. The histogram gives you the frequencies of the individual sub-trees, which should give you some insight into how compressible the data is, by letting you calculate how much space you would save through sharing.
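A minimal sketch of that counting step, assuming every node already carries the subtree key computed in the earlier post, could look like this:

const std = @import("std");

// Count how often each subtree (identified by its key) occurs in the node pool.
// The counts tell you how much space structural sharing could save.
fn subtree_histogram(allocator: std.mem.Allocator, nodes: []const Node) !std.AutoHashMap(u64, u32) {
    var counts = std.AutoHashMap(u64, u32).init(allocator);
    errdefer counts.deinit();
    for (nodes) |node| {
        const gop = try counts.getOrPut(node.key);
        if (!gop.found_existing) gop.value_ptr.* = 0;
        gop.value_ptr.* += 1;
    }
    return counts;
}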
If all that isn’t enough to give you a satisfying solution or an idea for something custom, you could look into Asymmetric numeral systems - Wikipedia, which I don’t know a lot about, but it looks like a modern approach that can give good results, and it also seems to be used by a lot of implementations. At least it looks like an interesting thing to research and get familiar with.
This would be my approach, but so far I haven’t had a reason to do something like this, so others who have actually tried it may have more practical experience and insights.
Calling these “trees” bothers me a bit, because these nodes have a more complicated structure than simple tree nodes, don’t they?
For example, I think a node can have more than one incoming edge, which makes it much more difficult to share subtrees without accidentally sharing something that shouldn’t be shared.
Trie, tree, whatever.
No, the structure is extremely simple from a tree standpoint.
Each letter can contain a set of subletters.
And then there is just one boolean: is_whole_word.
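Conceptually (before packing anything into 32-bit nodes) it is just something like this; the names here are only illustrative:

const ConceptualNode = struct {
    letter: u8,                 // one letter (or the “@” separator)
    is_whole_word: bool,        // the single boolean
    children: []ConceptualNode, // the set of sub-letters
};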
Just an idea, but you could make a challenge out of it, create a git repository with the 40 MB file and the nodes struct and maybe some boilerplate to benchmark different implementations that decompress to that same data, then people can add their own implementations to it.
So somebody who wants to participate only needs to pick a name and description and provide a compress and decompress method, choosing whatever approach they find interesting.
Just reading/learning about GADDAGs now, so take this with a grain of salt, but if it really is just a DAWG/DAFSA/MA-FSA with more word variations added to it, then the same techniques for minimizing nodes in a DAFSA might work.
However, the DAFSA minimization technique I’m familiar with requires the words to be sorted before construction, so what I’d try would be:
Generate the list of words and all their reversed-prefix variations (see the sketch below)
Sort the generated list
Construct a trie and minimize it like you would a DAFSA
Here’s the resource I used for the construction/minimization step:
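For the first step, generating the reversed-prefix variations of a single word could look roughly like this (a sketch based on the CARE → ERAC, RAC@E, AC@RE, C@ARE example earlier in the thread; the function name and allocation strategy are just placeholders):

const std = @import("std");

// For each split point i, reverse the first i letters and append '@' plus
// the rest (no '@' when the whole word is reversed),
// e.g. CARE -> C@ARE, AC@RE, RAC@E, ERAC.
fn gaddagVariations(allocator: std.mem.Allocator, word: []const u8) ![][]u8 {
    const out = try allocator.alloc([]u8, word.len);
    for (out, 0..) |*slot, k| {
        const i = k + 1; // split after i letters
        const suffix = word[i..];
        const sep: usize = if (suffix.len == 0) 0 else 1;
        const entry = try allocator.alloc(u8, word.len + sep);
        for (word[0..i], 0..) |c, j| entry[i - 1 - j] = c; // reversed prefix
        if (sep == 1) {
            entry[i] = '@';
            @memcpy(entry[i + 1 ..], suffix);
        }
        slot.* = entry;
    }
    return out;
}

The order of the generated entries does not matter here, since the whole list gets sorted in the next step anyway.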
Thx.
I managed to compress from 10 million nodes down to 2.6 million, while there are 1.45 million unique nodes.
So the rebuild algorithm is not yet perfect (the subtree rebuilding is not ready yet).
In a while I will upload the crazy trial-and-error code…