Runerip: DFA-based UTF-8 decoder and validator

mnemnion · September 6, 2024, 9:36pm

Hot off the presses, and two days old, I am pleased to present the initial release of runerip.

This is a small library which implements a UTF-8 validator and decoder. The algorithm is prior art, and being curious about these things, I decided to implement it in Zig and see what happens.

Benchmarks

The library is in an initial state, offering validation, counting of runes, and an iterating decoder for same. Other functionality is contemplated, and some exists in stub form, but would need polishing up.

The demo section of the repo has a short demo text, which is built into six test programs, one each of runerip and the standard library, for three applications of interest. Tests were performed with hyperfine on an M1 Max running macOS Sonoma.

The first tests counts codepoints. On this test, runerip is comfortably faster than the standard library, on the order of 2.3-2.4X.

The second test decodes codepoints, summing them together to convince the optimizer to include the decoding. Here runerip also does well, on the order of 2x faster than standard.

The third test simply validates that the string is UTF-8, and here, the standard library ekes out a victory, with runerip about 0.9x as fast. Zig std uses a highly tuned implementation for this task, including its own lookup DFA. The runerip algorithm is exceedingly simple in comparison, but not in any useful or advantageous way.

Considerations

Overall, I believe the speed gains here are real, and would apply to other tasks not yet implemented, such as transcoding to UTF-16. This comes at a modest cost, needing 376 bytes for the lookup tables, making this technique less suitable for embedded or otherwise highly constrained systems.

For the test data, the optimizations in std which favor ASCII are largely not realized in gains. I tried adding them to runerip to no effect, but then again, the sample data is not ASCII-dominated, and many real texts of interest (Zig source code is a noteworthy example) are so dominated. Of course, these optimizations may be added to a runerip implementation as well, my thinking is that it wouldn’t be excessive to have e.g. utf8CountCodepointsExpectAscii for use when multibyte runes are expected to be rare. It may also be the case that unilaterally including them has no negative impact on use cases with minimal ASCII, this is an empirical question which can only be answered with more sample data.

I did my level best to make a fair comparison, in the subset of Unicode tasks which are so far deployed. The runerip DFA doesn’t offer an obvious way, unlike the standard approach, to signal what sort of ill-formed data is encountered, and it throws on the offset where a sequence is determined to be ill-formed, despite that sometimes several preceding bytes are found invalid in the process.

This is not a durable disadvantage, as I see it, for a few reasons. One is that caring about the specifics of how a validation fails is rare, in comparison to taking some useful action in the face of failure. One example of taking action is to issue a replacement character instead of the bad bytes, and the original implementation has a couple approaches there which I’ll get around to adding eventually. Finally, it would not be difficult to add a relatively-slow-path fixup step which identifies the problem category in more detail, and this should have no impact on the speed of the happy path.

Basically, there are many approaches to coping with invalid sequences, and it all depends on the task at hand. The basic approach here is compatible with any of those, including that taken by the standard library.

An additional strength of the runerip approach (which is, to be clear, the Björn Hörhmann approach) is that it generalizes well. Creating additional DFA tables to match e.g. WTF-8 instead of UTF-8 is a simple matter. Zig source code is a subset of UTF-8 which disallows most C0 control bytes, and should, but at the moment does not, disallow C1 sequences as well. This too is a simple matter of adjusting the DFAs involved to reject those when encountered.

As it stands, runerip is an experiment, a proof of concept. If anyone should care to run the demos, and report back if timings are materially different on other architectures and systems, I would be delighted to hear about it. As I find time, I expect I’ll polish the rough edges, and add a few more functions to the collection.

Runes

Some might have noticed my habit of referring to what the Unicode consortium calls a ‘scalar value’ as runes. This is habitual, and canonical, in Go language circles, but scarce elsewhere (Odin also does so). What’s up with that?

Consider the alternatives! Scalar value, while official, also smells like it. Originally Unicode used “character”, but that is unsuitable several times over: some runes are not characters in any reasonable sense, while many perceived characters are comprised of several runes. wchar carries considerable baggage, usually signalling that the broken UCS-2 encoding is assumed by such a program. Even something like ‘glyph’ doesn’t fit the bill, having an established meaning in text rendering which is sure to collide with using it for the atomic unit of text itself.

rune then, while a bit whimsical, is a clear and short (especially, short) term for a scalar value. Although Go is the only large community of practice to use it at present, it is one of the first terms ever used for what it is. The coinage is from Plan 9, where UTF-8 was designed and first implemented, and it is mainly as an homage to that invention that I use the word. Generally inventing something comes with the privilege of naming it, at least, until Microsoft gets involved.

dee0xeed · September 7, 2024, 6:49pm

Btw, what does rip in runerip mean? Is it “R.I.P.” (pun intended) or “some strict boundaries within utf-8 byte flow”?

mnemnion · September 7, 2024, 7:47pm

Ripping is vernacular for decoding or transcoding, as in “ripping a CD to MP3”. I haven’t added UTF-16 transcoding yet, but I do intend to.

Ripping also sounds fast .

ethindp · September 9, 2024, 7:01pm

Do you plan to add things like normalization forms/verification?

mnemnion · September 10, 2024, 2:16pm

Highly recommend @dude_the_builder’s zg library for your advanced Unicode needs, including normalization forms. I guess by “verification” you mean “of a normalized form”, rather than “as well-formed UTF-8”, which runerip does (but not faster than std in my preliminary tests).

With runerip I’m planning to stick with what you can find in std.unicode, not the identical interface (the upcoming standard library audit will definitely visit the Unicode library) but the same area of concern, along with some things which are currently done by the Zig parser/tokenizer.

Preliminary tests suggest that it could be a viable replacement for maybe half of that library, the extra space requirements are no problem for a computer which can actually run the compiler. But those are preliminary, and don’t include a few essentials.

I expect I’ll write some more rune component libraries over time, but this one will focus on decoding, counting, validating, and transcoding of UTF-8, and a couple useful variants of that encoding. Anything which goes from 16 to 8 is out of scope, the algorithm is unidirectional.

mnemnion · September 18, 2024, 8:10pm

Update: I have added an implementation of
utf8ToUtf16Le to runerip. Head-to-head with the standard library version, runerip is more than twice as fast, about 2.25X according to hyperfine.

As before, I didn’t add the SIMD fast path for ASCII, since the demo text is multibyte-heavy and includes some in the second line, which might even anti-optimize the standard library implementation. But fair is fair, not every text out there has a bunch of ASCII as a prefix. I don’t think this would make a meaningful difference in the results however.

My opinion has grown that user code can mostly predict in advance whether it should expect ASCII-dominated text (source code mostly fits this bill) or whether no such assumption can be made, and that it will be worthwhile to offer expectAscii variants of various fundamental operations on UTF-8. These would not just start in a fast-path ASCII mode, but also look for opportunities to return to it, with a running counter of when runes are ASCII-only, aligned with offsets with SIMD-friendly addresses.

Furthermore, by unrolling the validation logic for each rune, runerip is now faster at pure validation as well, by a hair more than 1.5X. The recent landing of labeled switch continues on the master branch reminded me that branch predictors like it better when they have more branches to work with, and it seemed like that was what the standard library version was doing which runerip was not. The new version is complex enough that I really need to back it up with a good set of failing tests, but any bugs there would be in details which wouldn’t affect performance.

After this experiment, I’m confident that this algorithm is meaningfully faster for all operations on UTF-8, and that these functions would make a good replacement for their counterparts in the standard library. I would love to see some number on other architectures, and different test code, but these are comfortable leads.

There are some notes in std.unicode about rethinking/replacing some aspects of the API, and I’ve been giving some thought, and doing some research, towards a proposal for that. It would make more sense to tackle it all at once. No, that proposal will not hinge on calling everything runes, Zig itself doesn’t have strings or runes and it’s likely best to keep it that way.

I’ll continue to putter around with runerip in the meantime, adding a WTF-8 DFA for example. It’s unlikely that will affect the basic conclusion.