utf8Decode is deprecated, what are the alternatives?

utf8Decode is deprecated but the documentation doesn’t say what alternatives exist. If I want to decode exactly one utf8 character into a u21, what should I use instead?

Instantiate a Utf8View and then use its Utf8Iterator to get the Unicode code points.

const std = @import("std");

test {
    const s = "Γειά";
    const view = try std.unicode.Utf8View.init(s);
    var iter = view.iterator();
    while (iter.nextCodepoint()) |u| {
        std.debug.print("{u}", .{u});
    }
}

const ascii = "A";
const utf8: u21 = ascii[0];

This does not decode utf8 input. Replace A with a smiley and observe the failure:

    const input = "😊";
    const utf8: u21 = input[0];
    const view = try std.unicode.Utf8View.init(input);
    var iter = view.iterator();
    const next = iter.nextCodepoint().?;
    std.debug.print("{u} != {u}\n", .{ utf8, next });

Output:

ð != 😊

Then you (should) know that you have a 4-byte UTF-8 sequence and can use utf8Decode4:

    const input = "😊";
    const utf8 = try std.unicode.utf8Decode4(input[0..4].*);
    std.debug.print("utf8: {u}\n", .{utf8});

utf8: 😊

Not sure what your point is. I just built on dimdin’s code to demonstrate that your answer did not decode UTF-8 at all (which is what OP asked about).

Yep that works, thanks!

Feels like a lot of ceremony for reading a single character, but UTF-8 is not trivial, so it makes sense.

Once you need to iterate grapheme clusters and such, I highly recommend atman/zg on Codeberg, which provides Unicode text processing for Zig projects.

Yeah, I’ve seen zg and it looks like a great library! Luckily I don’t need to deal with those nuances of UTF-8 yet.

That’s clear to me. My point is only that when you work with UTF-8, you usually know how many bytes you want to convert, and there are functions for that.

When dealing with utf8 I think the usual case is that you don’t know how many bytes an encoded codepoint will have, so I think the way @dimdin has shown is more pragmatically relevant.

If you have a Unicode scalar value encoded as UTF-8, you do know how many bytes it takes. You can just look at the first byte, and Zig conveniently gives you a function to do just that: utf8ByteSequenceLength.

Conversely, if you have a scalar value in integer form, you can use utf8CodepointSequenceLength to check how many bytes it will take when encoded as UTF-8.

In both cases, the byte counts are easily accessible in a well-defined place, take O(1) to retrieve, and don’t require any memory allocations. To me, this definitely qualifies as “you know how many bytes you’ll need”.
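
As a rough illustration of those two lookups, here is a minimal sketch. The test name and the choice of the smiley (U+1F60A) as sample input are mine, and it assumes a Zig version where both functions return an error union around a u3:

```zig
const std = @import("std");

test "sequence lengths are available from a single value" {
    const input = "😊";

    // Length of the encoded sequence, derived from the first byte alone.
    const encoded_len = try std.unicode.utf8ByteSequenceLength(input[0]);
    try std.testing.expectEqual(@as(u3, 4), encoded_len);

    // Length a scalar value will occupy once it is encoded as UTF-8.
    const scalar: u21 = 0x1F60A; // the same smiley, as an integer
    const needed_len = try std.unicode.utf8CodepointSequenceLength(scalar);
    try std.testing.expectEqual(@as(u3, 4), needed_len);
}
```

Both calls can fail (an invalid start byte, or a value too large to be a code point), which is why each one is wrapped in try.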

Sure, once you start processing the bytes, but not before that.

My point was that a code point can have different encoded lengths, and you will have to switch on that length. So if you avoid Utf8View, you just end up inventing a different API for what Utf8View already does. I think using it is more convenient than reinventing that manually every time, which is why I called it more pragmatically relevant.

Maybe you have a better API for it but that would be a different argument.
As long as the functions used within utf8Decode are not deprecated themselves, you could also just copy the function into your own code and keep using it to decode a single valid code point. Then keep an eye on further changes to the standard library, which may provide a different way once utf8Decode is actually removed.
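
A local replacement along those lines might look roughly like this. It is only a sketch: decodeFirstCodepoint and the two error names are hypothetical, not part of std, and it assumes a Zig version where utf8Decode2, utf8Decode3 and utf8Decode4 take fixed-size arrays by value (as in the utf8Decode4 call earlier in the thread):

```zig
const std = @import("std");

/// Hypothetical helper, not part of std: decodes the first code point
/// of `bytes` and ignores anything that follows it.
fn decodeFirstCodepoint(bytes: []const u8) !u21 {
    if (bytes.len == 0) return error.EmptyInput;

    // The first byte alone tells us how long the sequence is.
    const len = try std.unicode.utf8ByteSequenceLength(bytes[0]);
    if (bytes.len < len) return error.TruncatedInput;

    return switch (len) {
        1 => bytes[0],
        2 => try std.unicode.utf8Decode2(bytes[0..2].*),
        3 => try std.unicode.utf8Decode3(bytes[0..3].*),
        4 => try std.unicode.utf8Decode4(bytes[0..4].*),
        else => unreachable, // utf8ByteSequenceLength only yields 1 to 4
    };
}

test "decode only the first code point" {
    const cp = try decodeFirstCodepoint("😊 and more text");
    try std.testing.expectEqual(@as(u21, 0x1F60A), cp);
}
```

Unlike the deprecated function, this version slices the input itself, so the caller does not have to trim it to exactly one code point first.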

Absolutely.

The problem with utf8Decode is that it’s error-prone: it ostensibly takes a slice of any size, suggesting it can decode UTF-8 strings of any length. It does “work” as long as the string holds exactly one codepoint, but otherwise it can hit an unreachable branch at runtime. You have to carefully slice the input to include only the first codepoint, which requires some knowledge of how the UTF-8 representation works, not to mention which of those other functions I mentioned to call first.

Sure, if you are aware of all this, then you’ll make sure to only pass slices of appropriate length. (Or indeed, copy the entire switch and dispatch to the utf8DecodeN functions yourself.) But that places a lot of extra requirements on the caller, and failure to abide by them results not in an error but in actual UB. It is no surprise that this function is deprecated when tools like Utf8View are equally easy to use and provide more control for the most common use case, i.e. decoding an entire string.
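
To make the contrast concrete, here is a small sketch of how Utf8View behaves on bad and on multi-codepoint input. The sample strings and the test name are my own, and it assumes Utf8View.init reports malformed input as error.InvalidUtf8:

```zig
const std = @import("std");

test "Utf8View reports malformed input as an error" {
    // 0xF0 starts a 4-byte sequence, but two continuation bytes are missing.
    const truncated = "\xF0\x9F";
    try std.testing.expectError(error.InvalidUtf8, std.unicode.Utf8View.init(truncated));

    // Well-formed input with several code points is walked one at a time,
    // with no manual slicing by the caller.
    const view = try std.unicode.Utf8View.init("a\u{e9}\u{1F60A}");
    var iter = view.iterator();
    var count: usize = 0;
    while (iter.nextCodepoint()) |_| count += 1;
    try std.testing.expectEqual(@as(usize, 3), count);
}
```

Validation happens once in init, so the iterator can hand back plain u21 values afterwards.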
