A little related tale to this… I am using the “gaddag” structure in my program, which takes around 100 MB in memory.
The makers of the Macondo Scrabble engine say that their gaddag fits in around 5 MB. Some words about the structure are here:
I cannot follow and/or believe that, if you add nodes in the “gaddag” way, this fits in 5 MB for, let's say, 250,000 words.
(The word CARE would be entered here as ERAC, RAC@E, AC@RE, C@ARE)
Does anyone understand if it is true and how?
I only had a rough look at the text, but it seems very highly optimized.
Unlike other implementations of similar structures, it allows sharing the tail end of a node.
This makes me think that it may save on a lot of nodes through structural sharing.
It also seems to me that the “Kurnia Word Graph” may be a different graph and not really a gaddag, but that is just a hunch; finding that out would require reverse engineering the code, and I don’t really want to stare at Rust code.
I know these source lists are around 250,000 words.
And yes: mine is 20 times bigger. That is why I cannot believe it. Or they use some insane, completely different structure. I will have to check it out…
edit: ah, I missed your previous post. No… I also don’t want to stare at Rust. I will focus on my engine for now and keep smart gaddags for later.
Using standard pointers will bloat the structure. The original implementation rolled its own pointers. If memory serves from 30 years ago, 22 bits were used for pointing within its own large array, while the other 10 bits encoded other essential information. I recall having about 100K words encoded. I believe the SPE article provides lots of clues about this.
Also, there are lots of duplicate nodes, so FSA minimization is very effective.
Depending on the lexicon, 22-bit addresses might be insufficient for the unminimized version, so you may have to use standard pointers and then minimize it before hand-rolling a small structure that can easily stay in memory.
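Roughly, such a hand-rolled node might look like the following 32-bit record (only a guess at the layout; the exact split of the 10 “info” bits is not spelled out here):

const PackedNode = packed struct(u32) {
    child_index: u22, // index into one large node array instead of a real pointer
    info: u10,        // letter, end-of-word flag, last-sibling flag, ...
};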
I optimized the gaddag. It now uses a 32-bit node:
u24 for the “childpointer”
u5 for the stored character
2 flag bits
1 ‘terminator’ flag indicating this is the last child.
So now I am in the 40 MB range without any compression for 360,000 words.
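For illustration, that layout could be written as a Zig packed struct roughly like this (a sketch; the field names and the meaning of the two flag bits are my guesses):

const Node = packed struct(u32) {
    childpointer: u24, // index of the first child in one big node array
    char: u5,          // letter code (26 letters plus the “@” separator fit in 5 bits)
    flag_a: u1,        // one of the two flag bits, e.g. is_whole_word
    flag_b: u1,        // the other flag bit
    is_last_child: u1, // the ‘terminator’: this node is the last sibling
};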
Now I want to compress the tree by eliminating all equal subtrees.
I am trying to recursively hash all nodes so that each subtree has a hash code which is “the sum key of all its children”, but I keep getting collisions (or my algorithm is wrong).
My first (very) naive implementation looks like this (a little bit simplified).
fn recursive_hash_node(nodes: []Node, node: *Node) u64 {
    // initial key of this node alone (zobrist hashing, like in chess)
    var key = key_of_node(node);
    const children = get_children(nodes, node);
    if (children.len == 0) {
        node.key = key;
        return key;
    }
    for (children) |*child| {
        const childkey = recursive_hash_node(nodes, child);
        key ^= childkey; // how to combine?
        key = key *% 0x9e3779b97f4a7c15; // how to combine?
    }
    node.key = key; // how to combine with all children?
    return key;
}
Could I somehow use some of Zig’s standard hash functions to avoid collisions?
How should I combine the keys?
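For what it is worth, one way this could look with a standard hasher such as std.hash.Wyhash (just a sketch built on the same helpers as above; it replaces the xor/multiply combining step, but, as the reply below points out, it cannot rule out collisions entirely):

const std = @import("std");

// Feed the node's own key and all child keys, in order, into one Wyhash
// stream instead of xor-ing them together.
fn subtree_key(nodes: []Node, node: *Node) u64 {
    var h = std.hash.Wyhash.init(0x5eed); // arbitrary fixed seed
    const own = key_of_node(node);        // the per-node zobrist key from above
    h.update(std.mem.asBytes(&own));
    const children = get_children(nodes, node);
    for (children) |*child| {
        const childkey = subtree_key(nodes, child); // hash the subtree first
        h.update(std.mem.asBytes(&childkey));       // the order of children matters
    }
    const key = h.final();
    node.key = key;
    return key;
}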
With hashing you have to expect collisions, unless the probability of a collision is extremely small (which you only get with a good hash function and a hash value with many bits, which in turn can mean that the hash itself wastes space by being too big), or unless you have a perfect hash function.
Usually, to get a perfect hash function, the inputs need to be sparse enough (not cover the entire value space of possible inputs), and you need to spend a long time computing a hash function that doesn’t produce any collision for the inputs you actually use (which may also be impossible or not worth it for certain sets of inputs).
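For a rough sense of scale (my own back-of-the-envelope estimate, using the roughly 10 million nodes mentioned later in the thread): with a well-mixed 64-bit hash the birthday bound gives about

p ≈ n² / 2^65 ≈ (10^7)² / 3.7·10^19 ≈ 3·10^-6

so an accidental collision is unlikely but not impossible, while a weak combining function has far fewer effective random bits and will collide all the time.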
My guess would be that in your case both of these aren’t very promising approaches for compression. You might get better results by instead trying to figure out which nodes are worth sharing versus storing inline directly as values. You could also try to apply some of the more classic compression algorithms, either by adapting them to your data structure or by writing your data structure out as a binary blob, compressing that with different compression algorithms, and comparing which one gives you the best result.
I think I would start with the latter (compressing the binary data with available compression algorithms), and if that doesn’t give you satisfying results, you could create a histogram over all your sub-trees to see whether there is still enough redundancy to expect better compression results from a custom compression method. The histogram gives you the frequencies of the individual sub-trees, which should give you some insight into how compressible the data is, by letting you calculate how much space you would save through sharing.
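A minimal sketch of that counting step, assuming every node already carries the subtree key computed in the earlier post, could look like this:

const std = @import("std");

// Count how often each subtree (identified by its key) occurs in the node pool.
// The counts tell you how much space structural sharing could save.
fn subtree_histogram(allocator: std.mem.Allocator, nodes: []const Node) !std.AutoHashMap(u64, u32) {
    var counts = std.AutoHashMap(u64, u32).init(allocator);
    errdefer counts.deinit();
    for (nodes) |node| {
        const gop = try counts.getOrPut(node.key);
        if (!gop.found_existing) gop.value_ptr.* = 0;
        gop.value_ptr.* += 1;
    }
    return counts;
}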
If all that isn’t enough to give you a satisfying solution or an idea for something custom, you could look into Asymmetric numeral systems - Wikipedia, which I don’t know a lot about, but it looks like a modern approach that can give good results, and it also seems to be used by a lot of implementations. At least it looks like an interesting thing to research and get familiar with.
This would be my approach, but so far I haven’t had a reason to do something like this, so others who have actually tried it may have more practical experience and insights.
Calling these “trees” bothers me a bit, because these nodes have a more complicated structure than simple tree nodes, don’t they?
For example, I think a node can have more than one incoming edge, which makes it much more difficult to share subtrees without accidentally sharing something that shouldn’t be shared.
Trie, tree, whatever.
No, the structure is extremely simple from a tree standpoint.
Each letter can contain a set of subletters.
And then there is just one boolean: is_whole_word.
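Conceptually (before packing anything into 32-bit nodes) it is just something like this; the names here are only illustrative:

const ConceptualNode = struct {
    letter: u8,                 // one letter (or the “@” separator)
    is_whole_word: bool,        // the single boolean
    children: []ConceptualNode, // the set of sub-letters
};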
Just an idea, but you could make a challenge out of it, create a git repository with the 40 MB file and the nodes struct and maybe some boilerplate to benchmark different implementations that decompress to that same data, then people can add their own implementations to it.
So somebody who wants to participate only needs to pick a name and description and provide a compress and decompress method, choosing whatever approach they find interesting.
Just reading/learning about GADDAGs now, so take this with a grain of salt, but if it really is just a DAWG/DAFSA/MA-FSA with more word variations added to it, then the same techniques for minimizing nodes in a DAFSA might work.
However, the DAFSA minimization technique I’m familiar with requires the words to be sorted before construction, so what I’d try would be:
Generate the list of words and all their reversed-prefix variations (see the sketch below)
Sort the generated list
Construct a trie and minimize it like you would a DAFSA
Here’s the resource I used for the construction/minimization step:
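For the first step, generating the reversed-prefix variations of a single word could look roughly like this (a sketch based on the CARE → ERAC, RAC@E, AC@RE, C@ARE example earlier in the thread; the function name and allocation strategy are just placeholders):

const std = @import("std");

// For each split point i, reverse the first i letters and append '@' plus
// the rest (no '@' when the whole word is reversed),
// e.g. CARE -> C@ARE, AC@RE, RAC@E, ERAC.
fn gaddagVariations(allocator: std.mem.Allocator, word: []const u8) ![][]u8 {
    const out = try allocator.alloc([]u8, word.len);
    for (out, 0..) |*slot, k| {
        const i = k + 1; // split after i letters
        const suffix = word[i..];
        const sep: usize = if (suffix.len == 0) 0 else 1;
        const entry = try allocator.alloc(u8, word.len + sep);
        for (word[0..i], 0..) |c, j| entry[i - 1 - j] = c; // reversed prefix
        if (sep == 1) {
            entry[i] = '@';
            @memcpy(entry[i + 1 ..], suffix);
        }
        slot.* = entry;
    }
    return out;
}

The order of the generated entries does not matter here, since the whole list gets sorted in the next step anyway.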
Thx.
I managed to compress from 10 million nodes down to 2.6 million, while there are 1.45 million unique nodes.
So the rebuild algorithm is not yet perfect (the subtree rebuilding is not ready yet).
In a while I will upload the crazy trial-and-error code…