Handling `std.unicode.Utf8Iterator` Errors

I was looking at the implementation of `std.unicode.Utf8Iterator` as of d7f90722f7:

```zig
pub const Utf8Iterator = struct {
    bytes: []const u8,
    i: usize,

    pub fn nextCodepointSlice(it: *Utf8Iterator) ?[]const u8 {
        if (it.i >= it.bytes.len) {
            return null;
        }

        const cp_len = utf8ByteSequenceLength(it.bytes[it.i]) catch unreachable;
        it.i += cp_len;
        return it.bytes[it.i - cp_len .. it.i];
    }

    pub fn nextCodepoint(it: *Utf8Iterator) ?u21 {
        const slice = it.nextCodepointSlice() orelse return null;
        return utf8Decode(slice) catch unreachable;
    }

    /// Look ahead at the next n codepoints without advancing the iterator.
    /// If fewer than n codepoints are available, then return the remainder of the string.
    pub fn peek(it: *Utf8Iterator, n: usize) []const u8 {
        const original_i = it.i;
        defer it.i = original_i;

        var end_ix = original_i;
        var found: usize = 0;
        while (found < n) : (found += 1) {
            const next_codepoint = it.nextCodepointSlice() orelse return it.bytes[original_i..];
            end_ix += next_codepoint.len;
        }

        return it.bytes[original_i..end_ix];
    }
};
```

It feels bad that if we hit invalid Unicode, we just get a panic from one of the `catch unreachable` calls. Would there be appetite for allowing error handling here, or at minimum putting a doc comment on the struct to document that it will panic upon encountering invalid Unicode?

As a workaround I’ve just made a user-space implementation with desired modifications so that I can handle invalid chunks while iterating.
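For anyone curious what such a workaround might look like, here is a minimal sketch, not the actual code I ended up with. `CheckedUtf8Iterator` and the two error names `Utf8TruncatedSequence` and `Utf8InvalidSequence` are made up for illustration; only `utf8ByteSequenceLength` and `utf8ValidateSlice` come from `std.unicode`. It returns an error instead of hitting `unreachable`, and leaves the position unchanged on error so the caller decides how to recover.

```zig
const std = @import("std");

/// Hypothetical user-space variant of Utf8Iterator that returns errors
/// instead of asserting that the input is valid.
const CheckedUtf8Iterator = struct {
    bytes: []const u8,
    i: usize = 0,

    /// Returns the next codepoint's bytes, null at end of input, or an
    /// error if the bytes at the current position are not valid UTF-8.
    /// On error `i` is left unchanged, so the caller can skip ahead.
    pub fn nextCodepointSlice(it: *CheckedUtf8Iterator) !?[]const u8 {
        if (it.i >= it.bytes.len) return null;

        // Fails with error.Utf8InvalidStartByte on a bad lead byte.
        const cp_len = try std.unicode.utf8ByteSequenceLength(it.bytes[it.i]);
        // Illustrative error name: the sequence runs past the end of input.
        if (it.i + cp_len > it.bytes.len) return error.Utf8TruncatedSequence;

        const slice = it.bytes[it.i .. it.i + cp_len];
        // Illustrative error name: bad continuation byte, overlong form, etc.
        if (!std.unicode.utf8ValidateSlice(slice)) return error.Utf8InvalidSequence;

        it.i += cp_len;
        return slice;
    }
};

test "skip an invalid byte and keep going" {
    var it = CheckedUtf8Iterator{ .bytes = "a\xffb" };
    try std.testing.expectEqualStrings("a", (try it.nextCodepointSlice()).?);
    try std.testing.expectError(error.Utf8InvalidStartByte, it.nextCodepointSlice());
    it.i += 1; // caller-chosen recovery: step over the bad byte
    try std.testing.expectEqualStrings("b", (try it.nextCodepointSlice()).?);
}
```

Stepping over a single byte on error is only one possible recovery policy; a caller could just as well stop, or emit a replacement character.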

Oh, never mind, I just found `Utf8View`, which checks UTF-8 validity during `init`.
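A small sketch of that route, in case it helps someone else. `countCodepoints` is a made-up helper for illustration. `Utf8View.init` validates the whole slice up front and returns `error.InvalidUtf8` on failure, so the iterator obtained from the view should never reach its `unreachable` branches.

```zig
const std = @import("std");

/// Hypothetical helper: counts codepoints, returning error.InvalidUtf8
/// for malformed input instead of panicking mid-iteration.
fn countCodepoints(bytes: []const u8) !usize {
    // Validation happens here, once, for the entire slice.
    const view = try std.unicode.Utf8View.init(bytes);
    var it = view.iterator();

    var count: usize = 0;
    while (it.nextCodepoint()) |_| count += 1;
    return count;
}

test "valid input is counted, invalid input is an error" {
    try std.testing.expectEqual(@as(usize, 5), try countCodepoints("héllo"));
    try std.testing.expectError(error.InvalidUtf8, countCodepoints("\xff"));
}
```

The trade-off is that validation is all-or-nothing: one bad byte anywhere rejects the whole slice, so this does not help if you want to recover and keep iterating past an invalid chunk.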


There’s also zg’s code_point module, which provides iterators that always return a codepoint: any malformed sequence is replaced with U+FFFD (�), following the Substitution of Maximal Subparts practice.

If the code must not proceed when the byte sequence is not a valid UTF-8 string, then this is not appropriate. Otherwise it’s ideal.
