ANN: Unicoder

A while back, I shared runerip, an exploration of a better way to deal with UTF-8. Results were promising! But I think I’m the only one around here who wants to call codepoints “runes”.

Time to announce: unicoder. This covers the same area of responsibility as std.unicode, but is faster, easier to use, and encourages better practices.

It’s organized into namespaces: codepoint, utf8, wtf8 and so on. More than that, there are three ‘strategies’.

The base libraries, like utf8, use the .exact strategy: they validate as they go, and throw an error when grumpy. But there’s also utf8.lossy, which has the same basic collection of functions, while handling errors using Substitution of Maximal Subparts with U+FFFD. Also, utf8.valid, which is optimal for known-good data.

It’s not quite a drop-in replacement, but close:

const cp1 = try std.unicode.utf8CountCodepoints(str);
const cp2 = try unicoder.utf8.countCodepoints(str);
assert(cp1 == cp2);
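
For anyone who hasn't poked at strict validation before, here is a rough sketch of what "validate as you go" means on the std side of that comparison. It uses only std.unicode calls. That the unicoder .exact functions reject the same input is an assumption based on the description above, not something this snippet checks.

```zig
const std = @import("std");

test "strict counting: valid input counts, invalid input errors" {
    // "héllo" is 6 bytes but 5 codepoints, since é encodes as 0xC3 0xA9.
    const good = "h\xc3\xa9llo";
    try std.testing.expectEqual(@as(usize, 5), try std.unicode.utf8CountCodepoints(good));

    // 0xFF can never appear in well-formed UTF-8, so validation fails...
    const bad = "abc\xff";
    try std.testing.expect(!std.unicode.utf8ValidateSlice(bad));

    // ...and the counting function returns an error instead of a count.
    if (std.unicode.utf8CountCodepoints(bad)) |_| {
        return error.TestUnexpectedResult;
    } else |_| {}
}
```

Save it to a file and run it with zig test. A lossy strategy would presumably count the stray 0xFF as one U+FFFD rather than erroring, but that is my reading of the description, so check the library before relying on it.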

I’m satisfied with the functionality provided, and open to feedback on how it’s organized, and what, if anything, might be worth adding.

Kick the tires! Let me know what you think.

Nice work! Great to have a more thought-through scheme for partitioning/naming. That clarifies the use cases and behaviors better than the very ugly names in std.

Thanks! I tried to strike a balance between making them easy to find and trimming some of the hideous length.

std favors clarity over brevity, and mostly that works out, but std.unicode has utf16CodeUnitSequenceLength which is 27 codeunits long, so. Yeah.