Compact and efficient 'named character reference' data structure (for HTML tokenization)

Got a bit carried away exploring a very narrow corner of HTML tokenization and ended up with this:

I wrote an in-depth article on the subject as well, with comparisons to how Firefox/Chrome/Safari approach the problem:

(note that, even if you (very understandably) don’t care about this specific topic, I think there’s still some worthwhile stuff in there)

The Zig implementation has also been ported to C++ and is now used in the Ladybird browser (initial PR, follow-up improvements)


Wow, this is so awesome! I can now say TIL DAFSA is a thing!

I started a rendering engine with an HTML and CSS parser as a way to learn a bit about Zig. My approach was very fast and loose, to the point that I barely read the specs; I just wrote it the way I remembered CSS and HTML behaving, and it “worked”. So I really appreciate the detail and knowledge sharing you’ve put into this.


Thanks! I’m trying my best to avoid the temptation to start writing my own HTML parser, since I know how quickly that can get out-of-hand if spec compliance is the goal.


Resurrecting this, since it was mentioned relatively recently and I missed it the first time around.

I’ve solved similar problems (not open source alas) using a variation on Liang’s algorithm, used for identifying hyphenation opportunities in TeX.

It ends up being very similar to what you came up with, but not identical: a “packed trie with suffix compression”. My hunch is that this would work just about as well, rather than obviously better or worse.

The thesis lays out the theory rather well, I think. If one really wants to dig in, TeX itself is a literate program, so the actual implementation in book form is also unusually expository and has some good techniques in it, such as how the trie itself is constructed before packing using only arrays.

Which it has to, since the version of Pascal that TeX is written in doesn’t even have pointers. It’s probably more accurate to say that it doesn’t use pointers: Knuth did this so that the code would be more portable and more readily translated to other languages.


I believe the idea is mostly the same, if I understand correctly. From a quick skim, here’s my understanding of the main difference (mostly based on Figure 7). In both approaches, after suffix compression, the remaining nodes are packed into an array:

  • The packed trie may have gaps in the final array representation and can always use O(1) lookups for the next character (i.e. when searching for 'A', you add some number N to the index stored in the parent node (in the paper, N is 1 for 'A') and then compare only that slot’s stored character against the search character)
  • The DAFSA has no gaps in the final array. Instead, lists of children are stored contiguously and it uses an O(n) linear scan to search for a child (i.e. starting at the ‘index of the first child’ stored in the parent node, you do a linear scan until you find a node with a matching character or a node with a “last sibling” flag set)

So, my understanding is that using the packed trie would trade off having more elements in the final array for faster child search.
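To make the contrast concrete, here’s a rough C++ sketch of the two child-lookup strategies. The node layouts (and the simple `'a'`-based offset scheme for the packed trie) are illustrative assumptions, not the actual Zig or Ladybird implementations:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed DAFSA node layout: children of a node are stored contiguously,
// and the last child in each sibling list carries a flag.
struct DafsaNode {
    char ch;              // character on the edge into this node
    bool last_sibling;    // true if this is the final child in the list
    uint32_t first_child; // index of this node's first child; 0 = none
};

// O(n) child search: linear scan through the contiguous sibling list.
int find_child_dafsa(const std::vector<DafsaNode>& nodes, uint32_t first_child, char c) {
    if (first_child == 0) return -1;
    for (uint32_t i = first_child;; i++) {
        if (nodes[i].ch == c) return (int)i;
        if (nodes[i].last_sibling) return -1;
    }
}

// Assumed packed-trie layout: each node stores a base index; the child for
// character c lives at base + offset(c), and the slot's stored character is
// compared against c to detect absent children.
struct PackedTrieNode {
    char ch;       // character this slot was packed for (0 = empty slot/gap)
    uint32_t base; // base index used to address this node's children
};

// O(1) child search: one index computation plus one comparison.
int find_child_packed(const std::vector<PackedTrieNode>& slots, uint32_t base, char c) {
    uint32_t i = base + (uint32_t)(c - 'a' + 1); // offset scheme is an assumption
    if (i >= slots.size() || slots[i].ch != c) return -1;
    return (int)i;
}
```

The packed trie pays for its O(1) lookup with empty slots (gaps) in the array, while the DAFSA’s array stays gap-free at the cost of scanning siblings.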

For the particular problem of named character references (and even more specifically my implementation):

  • For named character references, we also want to be able to use the data structure as a map. The DAFSA’s O(n) search provides a convenient way to achieve this (via minimal perfect hashing) while only needing to add a few more bits of information to each node (explanation)
  • For the set of named character references, the length of lists of children decreases extremely quickly the further you delve into the trie. That is, the first and second ‘layers’ would benefit the most by far from the O(n) → O(1) child search optimization, while the remaining lists of children are short enough (max length is 13) that I’m not sure it would matter much (i.e. I know they are short enough that using a binary search instead of a linear search is worse).
  • I was able to implement O(1) search for the first and second layers using a separate methodology (and actually slightly decrease the overall size of the data in the process) (explanation)
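The minimal-perfect-hashing trick mentioned above can be sketched generically: give each node a count of the words reachable from it, then, while matching a word, accumulate the counts of every skipped sibling plus one for each end-of-word node passed through. This is an illustration of the general technique (the node layout and field names are assumptions), not the actual implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Assumed DAFSA node layout with one extra field ("number") that enables
// minimal perfect hashing: the count of words reachable from this node
// (including this node itself if it ends a word).
struct Node {
    char ch;
    bool end_of_word;
    bool last_sibling;
    uint32_t first_child; // index of first child; 0 = no children
    uint32_t number;      // words reachable from this node
};

// Returns a unique index in [1, word_count] for a word in the set,
// or 0 if the word is not in the set.
uint32_t unique_index(const std::vector<Node>& nodes, std::string_view word) {
    uint32_t index = 0;
    uint32_t i = nodes[0].first_child; // nodes[0] is the root
    for (size_t k = 0; k < word.size(); k++) {
        if (i == 0) return 0; // no children to search
        // Linear scan of the sibling list; every skipped sibling
        // contributes all of its reachable words to the index.
        while (nodes[i].ch != word[k]) {
            index += nodes[i].number;
            if (nodes[i].last_sibling) return 0;
            i++;
        }
        // Passing through an end-of-word node counts one more word.
        if (nodes[i].end_of_word) index++;
        if (k + 1 == word.size())
            return nodes[i].end_of_word ? index : 0;
        i = nodes[i].first_child;
    }
    return 0;
}
```

With the word set {"a", "ab", "b"}, this assigns a → 1, ab → 2, b → 3, so the index can be used directly as a key into a parallel array of values. The indexing falls out of the same linear scan the child search already does, which is why it costs only a few extra bits per node.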

So, ultimately, my current implementation (1) uses O(1) search where it matters most, and (2) takes advantage of the O(n) search in order to gain minimal perfect hashing.

(it would be interesting to directly compare a typical DAFSA vs a packed trie, though)
