utf8Decode is deprecated, but the documentation doesn’t say what alternatives exist. If I want to decode exactly one utf8 character into a u21, what should I use instead?
Instantiate a Utf8View and then use the Utf8Iterator to get the unicode code points.
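```zig
const std = @import("std");

test {
    const s = "Γειά";
    // init validates that the whole string is well-formed UTF-8.
    const view = try std.unicode.Utf8View.init(s);
    var iter = view.iterator();
    // Each step yields one decoded code point as a u21.
    while (iter.nextCodepoint()) |u| {
        std.debug.print("{u}", .{u});
    }
}
```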
```zig
const ascii = "A";
const utf8: u21 = ascii[0];
```
This does not decode utf8 input. Replace A with a smiley and observe the failure:
```zig
const input = "😊";
// Takes only the first byte (0xF0); nothing is decoded here.
const utf8: u21 = input[0];
const view = try std.unicode.Utf8View.init(input);
var iter = view.iterator();
const next = iter.nextCodepoint().?;
std.debug.print("{u} != {u}\n", .{ utf8, next });
```
Output:
ð != 😊
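Then you (should) know that you have a 4-byte UTF-8 sequence and can use utf8Decode4:
```zig
const input = "😊";
// utf8Decode4 takes a [4]u8 by value, hence the slice-to-array dereference.
const utf8 = try std.unicode.utf8Decode4(input[0..4].*);
std.debug.print("utf8: {u}\n", .{utf8});
```
Output:
utf8: 😊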
Not sure what your point is. I just latched onto dimdin’s code to demonstrate that your answer did not decode utf8 at all (which is what OP asked about).
Yep that works, thanks!
Feels like a lot of ceremony for reading a single character but utf8 is not trivial so it makes sense.
Once you need to iterate grapheme clusters and such, I highly recommend atman/zg on Codeberg, which provides Unicode text processing for Zig projects.
Yeah I’ve seen zg and it looks like a great library! Luckily I don’t need to deal with those nuances of utf8 yet.
That’s clear to me. It’s only when you work with UTF-8 that you usually know how many bytes you want to convert, and there are functions for that.
When dealing with utf8 I think the usual case is that you don’t know how many bytes an encoded codepoint will have, so I think the way @dimdin has shown is more pragmatically relevant.
If you have a Unicode scalar value encoded as UTF8, you do know how many bytes it takes. You can just look at the first byte, and Zig conveniently gives you a function to do just that: utf8ByteSequenceLength.
Conversely, if you have a scalar value in an integer form, then you can use utf8CodepointSequenceLength to check how many bytes it will take when encoded as UTF8.
In both cases, the byte counts are easily accessible in a well-defined place, take O(1) to retrieve, and don’t require any memory allocations. To me, this definitely qualifies as “you know how many bytes you’ll need”.
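To make the two lookups concrete, here is a minimal sketch using the smiley from earlier in the thread. It assumes a recent Zig standard library in which both functions return an error union around a u3. The test name is just illustrative.

```zig
const std = @import("std");

test "sequence lengths are available up front" {
    const input = "😊";

    // Encoded form: the first byte alone tells us the sequence length.
    const encoded_len = try std.unicode.utf8ByteSequenceLength(input[0]);
    try std.testing.expectEqual(@as(u3, 4), encoded_len);

    // Scalar form: the code point value tells us how many bytes it needs.
    const scalar: u21 = 0x1F60A; // U+1F60A, the same smiley
    const needed_len = try std.unicode.utf8CodepointSequenceLength(scalar);
    try std.testing.expectEqual(@as(u3, 4), needed_len);
}
```

Neither call looks past the first byte or allocates, which is the O(1) property described above.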
Sure, once you start processing the bytes, but not before that.
My point was that it can have different lengths and that you will have to switch based on what length it is, so if you avoid Utf8View you just invent a different API for what the Utf8View already does, and I think it is more convenient to use it than to reinvent that manually every time. Which is why I called it more pragmatically relevant.
Maybe you have a better API for it but that would be a different argument.
As long as the functions used within utf8Decode are not deprecated themselves, you could also just copy the function into your own code and keep using it to decode a single valid code point. Then keep an eye on further changes to the standard library, which may provide a different way once the function is actually removed.
Absolutely.
The problem with utf8Decode is that it’s error-prone: it allegedly takes a slice of any size, suggesting it is capable of decoding UTF8 strings of any length. And it does “work” as long as the string has one codepoint, but otherwise goes into an unreachable branch at runtime. You have to carefully slice the input to only include the first codepoint, which requires some knowledge of how the UTF8 representation works, not to mention which one of those other functions I mentioned to call first.
Sure, if you are aware of all this, then you’ll make sure to only pass slices of appropriate length. (Or indeed, copy the entire switch and dispatch to utf8DecodeN functions yourself). But that’s a lot of extra requirements placed upon the caller, and failure to abide by them results not in an error but in an actual UB. It is no surprise that this function is deprecated when tools like Utf8View are equally easy to use and provide more control for the most common use case, i.e. decoding an entire string.
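For anyone who does want to go the "copy the switch yourself" route, a rough sketch of what that might look like follows. decodeFirstCodepoint and its two error names are hypothetical, not part of std, and the sketch assumes a Zig version where utf8Decode2/3/4 take fixed-size arrays by value (as in the utf8Decode4 call earlier in the thread). It reads the length from the first byte, so the caller does not have to pre-slice the input.

```zig
const std = @import("std");

/// Hypothetical helper (not part of std): decodes the code point that starts
/// at the beginning of `bytes` and ignores anything after it.
fn decodeFirstCodepoint(bytes: []const u8) !u21 {
    if (bytes.len == 0) return error.EndOfInput;
    // The first byte determines how long the sequence is (1 to 4 bytes).
    const len = try std.unicode.utf8ByteSequenceLength(bytes[0]);
    if (bytes.len < len) return error.TruncatedInput;
    return switch (len) {
        1 => bytes[0],
        2 => try std.unicode.utf8Decode2(bytes[0..2].*),
        3 => try std.unicode.utf8Decode3(bytes[0..3].*),
        4 => try std.unicode.utf8Decode4(bytes[0..4].*),
        else => unreachable, // utf8ByteSequenceLength only returns 1 through 4
    };
}

test "decode only the first code point" {
    try std.testing.expectEqual(@as(u21, 'A'), try decodeFirstCodepoint("ABC"));
    try std.testing.expectEqual(@as(u21, 0x1F60A), try decodeFirstCodepoint("😊 and more"));
    try std.testing.expectError(error.EndOfInput, decodeFirstCodepoint(""));
}
```

Unlike Utf8View.init, this only validates the bytes of the first sequence, so it is no substitute for the view when the rest of the string matters.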