A while back, I shared runerip, an exploration of a better way to deal with UTF-8. Results were promising! But I think I’m the only one around here who wants to call codepoints “runes”.
Time to announce: unicoder. This covers the same area of responsibility as std.unicode, but is faster, easier to use, and encourages better practices.
It’s organized into namespaces: codepoint, utf8, wtf8 and so on. More than that, there are three ‘strategies’.
The base libraries, like utf8, use the .exact strategy: they validate as they go, and return an error when they hit invalid input. But there’s also utf8.lossy, which has the same basic collection of functions, but handles errors using Substitution of Maximal Subparts with U+FFFD: each ill-formed run of bytes is replaced by a single U+FFFD. Finally, there’s utf8.valid, which is optimal for known-good data.
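For anyone unfamiliar with Substitution of Maximal Subparts, here is a minimal sketch of what a lossy codepoint count looks like under that rule. The function `countCodepointsLossy` is a hypothetical helper written against plain byte slices for illustration. It is not unicoder’s implementation or API, and it only counts rather than decodes.

```zig
const std = @import("std");

/// Hypothetical helper, not part of unicoder or std.
/// Counts codepoints, counting each maximal ill-formed subpart
/// as one U+FFFD, per the Unicode Standard's recommended practice.
fn countCodepointsLossy(bytes: []const u8) usize {
    var count: usize = 0;
    var i: usize = 0;
    while (i < bytes.len) {
        const lead = bytes[i];
        i += 1;
        // Every lead byte yields exactly one codepoint: either the
        // decoded scalar or one U+FFFD for the maximal subpart.
        count += 1;

        // Continuation bytes still needed, and the allowed range
        // for the next one (narrower for E0, ED, F0 and F4).
        var need: usize = 0;
        var lo: u8 = 0x80;
        var hi: u8 = 0xBF;
        switch (lead) {
            0x00...0x7F => continue, // ASCII
            0xC2...0xDF => {
                need = 1;
            },
            0xE0 => {
                need = 2;
                lo = 0xA0; // rejects overlong forms
            },
            0xE1...0xEC, 0xEE...0xEF => {
                need = 2;
            },
            0xED => {
                need = 2;
                hi = 0x9F; // rejects surrogates
            },
            0xF0 => {
                need = 3;
                lo = 0x90; // rejects overlong forms
            },
            0xF1...0xF3 => {
                need = 3;
            },
            0xF4 => {
                need = 3;
                hi = 0x8F; // rejects values above U+10FFFF
            },
            else => continue, // stray continuation or invalid lead: one U+FFFD
        }

        // Consume continuation bytes while they fit; stop at the first
        // byte that does not, leaving it to start a new sequence.
        while (need > 0 and i < bytes.len) : (need -= 1) {
            const b = bytes[i];
            if (b < lo or b > hi) break;
            i += 1;
            lo = 0x80;
            hi = 0xBF;
        }
    }
    return count;
}
```

The point of the rule is that a truncated sequence such as `F1 80 80` collapses into one replacement character instead of three, while a byte that can never start or continue a sequence at that position gets its own.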
It’s not quite a drop-in replacement, but close:
```zig
const cp1 = try std.unicode.utf8CountCodepoints(str);
const cp2 = try unicoder.utf8.countCodepoints(str);
assert(cp1 == cp2);
```
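The `try` in both calls is there because counting validates as it goes. As a small sketch of what that means on the std side only, the hypothetical wrapper `countOrNull` below turns any validation error into `null`. The error set unicoder returns is not shown here, so this makes no assumptions about it.

```zig
const std = @import("std");

/// Hypothetical wrapper, for illustration only: returns the codepoint
/// count, or null if std.unicode reports the input as invalid UTF-8.
fn countOrNull(bytes: []const u8) ?usize {
    return std.unicode.utf8CountCodepoints(bytes) catch null;
}
```

With valid input this gives the same number as the bare `try` form. With a stray `0xFF` byte it gives `null` instead of propagating the error.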
I’m satisfied with the functionality provided, and open to feedback on how it’s organized, and what, if anything, might be worth adding.
Kick the tires! Let me know what you think.