Compact and efficient 'named character reference' data structure (for HTML tokenization)

Resurrecting this, since it was mentioned relatively recently and I missed it the first time around.

I’ve solved similar problems (not open source alas) using a variation on Liang’s algorithm, used for identifying hyphenation opportunities in TeX.

It ends up being very similar to what you came up with, but not identical: a “packed trie with suffix compression”. My hunch is that this would work just about as well, rather than obviously better or worse.

The thesis lays out the theory rather well, I think. If one were to really want to dig in, TeX itself is a literate program, so the actual implementation in book form is also unusually expository and has some good techniques to it, such as how the trie itself is constructed before packing using only arrays.

Which it has to, since the version of Pascal TeX is written in doesn’t even have pointers. Probably more accurate to say that it doesn’t use pointers, Knuth did this so that the code would be more portable, and more readily translated to other languages.

1 Like