utf8Decode is deprecated, what are the alternatives?

ajoino · February 22, 2025, 8:16pm

utf8Decode is deprecated but the documentation doesn’t say what alternatives exist. If I want to decode exactly one utf8 character into a u21, what should I use instead?

dimdin · February 22, 2025, 9:19pm

Instantiate a Utf8View and then use the Utf8Iterator to get the unicode code points.

const std = @import("std");

test {
    const s = "Γειά";
    const view = try std.unicode.Utf8View.init(s);
    var iter = view.iterator();
    while (iter.nextCodepoint()) |u| {
        std.debug.print("{u}", .{u});
    }
}

chrboesch · February 22, 2025, 9:34pm

const ascii = "A";
const utf8: u21 = ascii[0];

cryptocode · February 22, 2025, 10:15pm

This does not decode utf8 input. Replace A with a smiley and observe the failure:

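    const std = @import("std");

    test {
        const input = "😊";
        // Widening the first byte (0xF0) to u21 yields U+00F0, not the emoji.
        const utf8: u21 = input[0];
        const view = try std.unicode.Utf8View.init(input);
        var iter = view.iterator();
        const next = iter.nextCodepoint().?;
        std.debug.print("{u} != {u}\n", .{ utf8, next });
    }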

Output:

ð != 😊

chrboesch · February 22, 2025, 10:51pm

Then you (should) know that you have a 4-byte UTF-8 sequence and can use utf8Decode4:

 const input = "😊";
 const utf8 = try std.unicode.utf8Decode4(input[0..4].*);
 std.debug.print("utf8: {u}\n", .{utf8});

utf8: 😊

cryptocode · February 22, 2025, 10:51pm

Not sure what your point is. I just latched onto dimdin’s code to demonstrate that your answer did not decode UTF-8 at all (which is what OP asked about).

ajoino · February 22, 2025, 10:54pm

Yep that works, thanks!

Feels like a lot of ceremony for reading a single character but utf8 is not trivial so it makes sense.

cryptocode · February 22, 2025, 10:55pm

Once you need to iterate grapheme clusters and such, I highly recommend atman/zg (on Codeberg), which provides Unicode text processing for Zig projects.

ajoino · February 22, 2025, 10:56pm

Yeah I’ve seen zg and it looks like a great library! Luckily I don’t need to deal with those nuances of utf8 yet

chrboesch · February 22, 2025, 10:58pm

That’s clear to me. My point is only that when you work with UTF-8, you will usually know how many bytes you want to convert, and there are functions for that.

Sze · February 22, 2025, 11:39pm

When dealing with utf8 I think the usual case is that you don’t know how many bytes an encoded codepoint will have, so I think the way @dimdin has shown is more pragmatically relevant.

Xion · February 23, 2025, 7:52am

If you have a Unicode scalar value encoded as UTF8, you do know how many bytes it takes. You can just look at first byte, and Zig conveniently gives you a function to do just that: utf8ByteSequenceLength.

Conversely, if you have a scalar value in an integer form, then you can use utf8CodepointSequenceLength to check how many bytes it will take when encoded as UTF8.

In both cases, the byte counts are easily accessible in a well-defined place, take O(1) to retrieve, and don’t require any memory allocations. To me, this definitely qualifies as “you know how many bytes you’ll need”.
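
As a minimal sketch of the two length lookups mentioned above (the test name and the sample string are only illustrative; the two `std.unicode` functions are the ones named in this post):

```zig
const std = @import("std");

test "sequence lengths from the first byte and from the scalar value" {
    const s = "Γ😊"; // Γ is U+0393 (2 bytes), 😊 is U+1F60A (4 bytes)

    // Length of the encoded sequence, read off the first byte only.
    const first_len = try std.unicode.utf8ByteSequenceLength(s[0]);
    try std.testing.expectEqual(@as(u3, 2), first_len);

    // The next sequence starts right after the first one.
    const second_len = try std.unicode.utf8ByteSequenceLength(s[first_len]);
    try std.testing.expectEqual(@as(u3, 4), second_len);

    // Going the other way: how many bytes a scalar value needs when encoded.
    try std.testing.expectEqual(@as(u3, 4), try std.unicode.utf8CodepointSequenceLength(0x1F60A));
}
```

Both calls return an error union: a byte that cannot start a sequence, or a value too large to encode, is reported as an error rather than as a length.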

Sze · February 23, 2025, 8:24am

Sure, once you start processing the bytes, but not before that.

My point was that a code point can have different lengths and that you have to switch on that length. If you avoid Utf8View, you just end up inventing a different API for what Utf8View already does. I think using it is more convenient than reinventing that logic manually every time, which is why I called it more pragmatically relevant.

Maybe you have a better API for it, but that would be a different argument.
As long as the functions used within utf8Decode are not themselves deprecated, you could also copy the function into your own code and keep using it to decode a single valid code point. Then keep an eye on the standard library for a replacement once utf8Decode is actually removed.

Xion · February 23, 2025, 10:10am

Absolutely.

The problem with utf8Decode is that it’s error-prone: it ostensibly takes a slice of any size, suggesting it is capable of decoding UTF-8 strings of any length. And it does “work” as long as the slice holds exactly one code point, but otherwise it goes into an unreachable branch at runtime. You have to carefully slice the input to include only the first code point, which requires some knowledge of how the UTF-8 representation works, not to mention knowing which of those other functions I mentioned to call first.

Sure, if you are aware of all this, then you’ll make sure to only pass slices of appropriate length (or indeed, copy the entire switch and dispatch to the utf8DecodeN functions yourself). But that’s a lot of extra requirements placed upon the caller, and failure to abide by them results not in an error but in actual undefined behavior. It is no surprise that this function is deprecated when tools like Utf8View are equally easy to use and provide more control for the most common use case, i.e. decoding an entire string.
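
For reference, the "copy the switch yourself" approach could look roughly like the sketch below. `decodeFirst`, `error.EmptyInput`, and `error.TruncatedInput` are hypothetical names invented for this example and are not part of std. It also assumes the array-taking utf8DecodeN signatures used earlier in this thread (`utf8Decode4(input[0..4].*)`), so it may need adjusting for other Zig versions.

```zig
const std = @import("std");

/// Hypothetical helper, not part of std: decodes only the first code point
/// of `bytes` and ignores whatever follows it.
fn decodeFirst(bytes: []const u8) !u21 {
    if (bytes.len == 0) return error.EmptyInput; // hypothetical error name

    // The first byte alone tells us how long the sequence is (1 to 4 bytes).
    const len = try std.unicode.utf8ByteSequenceLength(bytes[0]);
    if (bytes.len < len) return error.TruncatedInput; // hypothetical error name

    return switch (len) {
        1 => bytes[0],
        2 => try std.unicode.utf8Decode2(bytes[0..2].*),
        3 => try std.unicode.utf8Decode3(bytes[0..3].*),
        4 => try std.unicode.utf8Decode4(bytes[0..4].*),
        else => unreachable, // utf8ByteSequenceLength only returns 1..4
    };
}

test "decodeFirst returns the first code point" {
    try std.testing.expectEqual(@as(u21, 0x1F60A), try decodeFirst("😊 and more"));
    try std.testing.expectEqual(@as(u21, 'A'), try decodeFirst("A"));
}
```

Unlike passing an arbitrary slice to utf8Decode, the length checks here turn empty or truncated input into ordinary errors instead of hitting `unreachable`. For anything beyond a single code point, Utf8View still seems like the less fragile choice.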