I’m pleased to announce the completion of my first major Zig project, runeset.
A RuneSet is a fast data structure for working with sets of UTF-8 encoded data. It is primarily intended for abstract characters, a.k.a. Unicode scalar values, but is also capable of representing surrogate codepoints and overlong encodings.
The RuneSet struct is an implicit data structure consisting of a contiguous slice of usize. Membership tests use no hashing; instead, they combine popcounts with mask-based bit tests. This is both faster and, for many sets of practical significance, considerably more compact.
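To give a flavor of the technique, here's an illustrative sketch of popcount-and-mask testing over a plain bitmap. This is not RuneSet's actual word layout, just the primitive operations it builds on:

```zig
const std = @import("std");

/// Membership is a single mask test against the word holding bit `n`.
fn contains(words: []const u64, n: u16) bool {
    const mask = @as(u64, 1) << @as(u6, @intCast(n % 64));
    return words[n / 64] & mask != 0;
}

/// Rank (the count of set bits below `n`) is a sum of popcounts; in a
/// structure like RuneSet, ranks of this kind locate the next word to
/// consult, taking the place of hashing.
fn rank(words: []const u64, n: u16) usize {
    var count: usize = 0;
    for (words[0 .. n / 64]) |w| count += @popCount(w);
    const below = (@as(u64, 1) << @as(u6, @intCast(n % 64))) - 1;
    return count + @popCount(words[n / 64] & below);
}

test "mask membership and rank" {
    const words = [_]u64{ 0b10, 0b100 }; // bits 1 and 66 set
    try std.testing.expect(contains(&words, 66));
    try std.testing.expectEqual(@as(usize, 1), rank(&words, 66));
}
```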
These benefits are particularly acute for set operations: subsetting, equality (which is simply a direct comparison of the memory), union, intersection, and difference. RuneSets are immutable: any set operation takes an allocator and returns a new set.
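In usage, that looks roughly like the sketch below; the names createFromString, setUnion, and deinit are illustrative stand-ins, not necessarily the library's exported API:

```zig
const std = @import("std");
const RuneSet = @import("runeset").RuneSet;

// Hypothetical names throughout: createFromString, setUnion, and deinit
// are stand-ins for whatever the library actually exports.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const vowels = try RuneSet.createFromString("aeiou", allocator);
    defer vowels.deinit(allocator);
    const greek = try RuneSet.createFromString("αεηιουω", allocator);
    defer greek.deinit(allocator);

    // Immutability in practice: the union allocates and returns a
    // third set, leaving both operands untouched.
    const both = try vowels.setUnion(greek, allocator);
    defer both.deinit(allocator);
}
```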
The design is of my own devising: original, if not necessarily novel (it’s difficult to find relevant papers by searching for strings like “UTF-8 character sets”).
The library includes Runes, a non-decoded representation of Unicode codepoints, which are generally faster to work with than round-tripping through the integer value of the codepoint. This design is based on how Julia encodes characters, which it switched to after discovering the performance cost of encoding and decoding codepoints into and out of UTF-8 every time a String is operated on.
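A minimal sketch of the idea, storing the raw UTF-8 bytes instead of a decoded u21 (the actual Rune definition in runeset may differ):

```zig
const std = @import("std");

// Sketch only: a rune as its raw UTF-8 bytes (up to four), so set
// membership and comparison never pay for a decode.
const Rune = packed struct(u32) {
    a: u8, // lead byte
    b: u8 = 0,
    c: u8 = 0,
    d: u8 = 0,

    // Continuation-byte validation elided for brevity.
    fn fromSlice(str: []const u8) ?Rune {
        if (str.len == 0) return null;
        const len = std.unicode.utf8ByteSequenceLength(str[0]) catch return null;
        if (str.len < len) return null;
        var r = Rune{ .a = str[0] };
        if (len > 1) r.b = str[1];
        if (len > 2) r.c = str[2];
        if (len > 3) r.d = str[3];
        return r;
    }
};

test "rune from slice" {
    const r = Rune.fromSlice("é").?; // U+00E9 is 0xC3 0xA9 in UTF-8
    try std.testing.expectEqual(@as(u8, 0xC3), r.a);
    try std.testing.expectEqual(@as(u8, 0xA9), r.b);
}
```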
The library has 100% line coverage, which you can verify by installing kcov and running zig build cov. It is intended as part of a more complete and complex pattern-matching library, which will take some time to complete.
Interesting. What kind of things are you going to use this for? Generally I don’t find codepoints that useful, as they don’t really represent a human-perceived character, but maybe there are some use cases I’m not aware of (glyph rendering maybe?).
While it’s true that they aren’t a complete expression of Unicode, they are in fact quite useful, especially for strings normalized to composed form (NFC).
It’s for pattern matching, essentially. From that perspective, Farmer Bob here (👨‍🌾) is a string, not a character, so no different from needing to match “cat|dog”. But for cases where detection of a set of Unicode characters is the desired pattern (and I emphasize, this is in reality very useful indeed), you can use a RuneSet. If the string is in NFC form, the vast majority of scripts in use can be recognized on a codepoint-by-codepoint basis.
If you check the PCRE documentation you’ll see that its Unicode support is quite primitive and limited. This is part of improving on that. Think of it as a specialized and efficient DFA for matching collections of codepoints; that’s basically accurate.
Ideally, the compiler for the pattern-matching library will detect grapheme clusters being used inside something like a set construct, and Do The Right Thing by turning it into an ordered choice of the actual set codepoint-characters followed by the grapheme cluster or clusters. That would involve a great deal of additional work, so it’s probably not going to make it into the first release of the matching library.
Normalization is itself an interesting application for RuneSets. The “official” approach is to decompose and then recompose, but this is inefficient (and production libraries don’t use it).
Instead, one can use a RuneSet to identify codepoints which are candidates for composition. You’d use the combining marks, rather than the letters, and then check behind the combining mark to see if a normalizable character is there. Probably using a RuneSet representing the valid combinables for each diacritic and so on.
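Something like this sketch, under stated assumptions: matchOne is an assumed membership test returning the byte length of a matching rune at the start of a slice (or null), and the two sets are hypothetical data:

```zig
const std = @import("std");
const RuneSet = @import("runeset").RuneSet;

// Assumed for this sketch: `mark` holds one combining mark (say U+0301,
// COMBINING ACUTE ACCENT) and `bases` the letters it composes with;
// `matchOne` is an assumed method, not confirmed API.
fn hasCompositionCandidate(mark: RuneSet, bases: RuneSet, str: []const u8) bool {
    var prev: usize = 0;
    var i: usize = 0;
    while (i < str.len) {
        const len = std.unicode.utf8ByteSequenceLength(str[i]) catch return false;
        if (i > 0 and mark.matchOne(str[i..]) != null) {
            // Found the combining mark: check behind it for a base
            // letter which can compose with it.
            if (bases.matchOne(str[prev..]) != null) return true;
        }
        prev = i;
        i += len;
    }
    return false;
}
```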
The blog post I linked to is just a didactic implementation; production code uses large tables to drive a finite automaton which does effectively what I’ve described above. RuneSets are smaller, in general, and at least comparable in performance.
You might want a separate routine for Hangeul composition, because the algorithm I sketched out above, while it would work well for almost every other composition case, would behave poorly there. But detecting isolated jamo with a RuneSet, as part of the main loop, is easy, and that could call out to a more efficient code pathway.
There’s actually something I could use some pointers on from someone more experienced with the language.
RuneSet generation uses an allocator, so it can’t be done at comptime, but many applications won’t be runtime-dynamic, and the data structure is just a slice of u64. So I’d like to enable build steps which create RuneSets in a way that lets them be statically included in a program. What’s the best way to go about this?
My implementation of Normalization has to be the most naive implementation ever, so I wonder if I could incorporate this to make it better.
Edit: Hmmm, if I’m understanding correctly, this data structure could become the primary encoding / lookup structure for the whole zg library. Right now I’m using multi-stage tables (compressed for efficient embedding in the binary) and then straight up array indexing for the lookup. The array lookup is hard to beat in terms of speed, but perhaps the speed difference isn’t that large and a significant reduction in space needed for the data could be achieved.
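For anyone following along, roughly what that lookup looks like (shapes and contents here are placeholders, not zg’s actual tables):

```zig
const std = @import("std");

// Sketch of a two-stage table: stage1 maps the high bits of a codepoint
// to a block index, stage2 holds the per-codepoint values.
fn lookup(stage1: []const u16, stage2: []const u8, cp: u21) u8 {
    const block: usize = stage1[cp >> 8];
    return stage2[block * 256 + (cp & 0xFF)];
}

test "two-stage lookup" {
    // One block covering codepoints 0..255; 'A' maps to 7.
    var stage2 = [_]u8{0} ** 256;
    stage2['A'] = 7;
    const stage1 = [_]u16{0};
    try std.testing.expectEqual(@as(u8, 7), lookup(&stage1, &stage2, 'A'));
}
```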
Define an array using the calculated length, populate it at comptime, and return it
This would probably end up looking like a cross between std.unicode.utf8ToUtf16LeStringLiteral (for the ‘calculating the length at comptime’ part) and std.StaticStringMap.initComptime (for the ‘returning a comptime-generated instance of a struct’ part; StaticStringMap is also an example of a type that can be created either at comptime with initComptime or at runtime with init, which takes an allocator).
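A minimal sketch of that shape, using a plain byte bitmap rather than the real RuneSet layout: one comptime function calculates the length, a second defines an array of that length, populates it, and returns it:

```zig
const std = @import("std");

// Calculates how many u64 words the bitmap needs for these bytes.
fn wordCount(comptime bytes: []const u8) usize {
    var max: usize = 0;
    for (bytes) |b| max = @max(max, b);
    return max / 64 + 1;
}

// The return type uses the comptime-calculated length, as
// std.unicode.utf8ToUtf16LeStringLiteral does.
fn byteSetLiteral(comptime bytes: []const u8) [wordCount(bytes)]u64 {
    var words = [_]u64{0} ** wordCount(bytes);
    for (bytes) |b| {
        words[b / 64] |= @as(u64, 1) << @as(u6, @intCast(b % 64));
    }
    return words;
}

// Evaluating at container scope embeds the table in the binary's
// constant data, with no allocator involved.
const abc_set = byteSetLiteral("abcde");

test "comptime-built set" {
    try std.testing.expect(abc_set['a' / 64] & (@as(u64, 1) << ('a' % 64)) != 0);
}
```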
Correct me if I’m wrong, but RuneSet seems to only be able to encode boolean data. In other words, it could be used to implement changesWhenCaseFolded (and give a result for an entire UTF-8 string without needing to act on u21 codepoints), but couldn’t be used for implementing caseFold itself.
Yes, this is entirely correct. It’s a set: it answers one yes or no question about the status of a UTF-8 bit sequence in the broadest sense. But only one.
It can also do all the other basic things one would expect of a set, and you can use it for some oddball things like “given a string, produce another string which is all unique codepoints in the original string, sorted in lexical order”. But for mapping any and all codepoints to one of many properties, what zg is doing is already optimal.
That said, normalization spends a lot of effort looking for normalization opportunities, which is a yes-or-no question. RuneSets may be able to boost efficiency there. Remains to be seen.
Glad to see that you got this published! I’ll have to reread it now that you’ve got the official one released. PCRE is quite limited on codepoint support (that’s why I decided to avoid it altogether), so this is quite cool to see.
True. I was initially under the impression that it could answer the one-to-many questions which are common in a Unicode lib. But as @mnemnion says, for specific yes / no predicates in the normalization process, it could be quite handy.
runeset v0.9.0 adds runeset.ordinalMatch(slice). When the first codepoint in the slice matches the set, this returns its order in the set. So runes.ordinalMatch("c") for a set “abcde” will return 2. Invalid sequences and unmatched runes both return null.
This makes RuneSets also suitable for the sort of one-to-many matches we were discussing earlier.
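For instance (a sketch which assumes ordinalMatch returns ?usize; the set and table here are made-up example data, not part of runeset):

```zig
const RuneSet = @import("runeset").RuneSet;

// Made-up example data: `upper` would be a RuneSet built from "ABCDE",
// and `lowered` is a table parallel to the set's lexical order.
const lowered = [_][]const u8{ "a", "b", "c", "d", "e" };

fn foldFirst(upper: RuneSet, str: []const u8) ?[]const u8 {
    // Order of the first rune in `str` within the set, or null for
    // invalid sequences and non-members.
    const ord = upper.ordinalMatch(str) orelse return null;
    return lowered[ord];
}
```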
It also adds runeset.serialize, which writes a RuneSet out as Zig source code.