Hash alternative for a blunt array

Sze · December 31, 2025, 4:34am

You don’t need a bit array to find the beginning of a codepoint (see quote below).

And if you want to get the codepoint index I think it would be easier to just put fence-posts as separators into a big string to separate it into regions of smaller codepoint-sequences (for example for batched parallel processing in a linear forward way), you just have to (UTF-8 - Wikipedia):

… it is self-synchronizing so searches for short strings or characters are possible; and the start of a code point can be found from a random position by backing up at most 3 bytes. The values chosen for the lead bytes means sorting a list of UTF-8 strings puts them in the same order as sorting UTF-32 strings.

The UTF-8 bytes already contain enough information to find the codepoint starts.

Finding the codepoint index from random-access could be done by just updating a counter while iterating from the previous fence-post, basically the fence-post could serve as index structure to skip part way into the string and then iterate the remaining code points.

I think bit array could still be useful while transforming data or to mark special things / properties about the string (reminds me of Hyper-optimizing the Zig Parser). I think another thing to note is that you still could be in the middle of a grapheme cluster, so just finding the beginning of a code point isn’t enough.

I am not very experienced with unicode stuff, the conversation just reminded me of fragments I read here and there.