How to handle optional functions causing unreachable in the header of a while loop?

This question has a certain relationship to the other question I asked. (No need to read! Totally different question)

But I already figured out many things since then because of the immense help I’ve gotten from all of you, for which I’m super grateful. If this question is too stupid to ask, or I’m getting too spammy, please feel free to remove it!

I haven’t had much time these days. I only spend like 1-2 hours, not every day, to learn anything new after my first question.

Here are my two files that I need to show:
common.zig (This code works, but it’s wild how it’s written)

const latinizer = @import("converters/latinizer.zig");
const std = @import("std");

// All here's for the learning purposes
pub fn getCyrillicToLatinicHashedMap() ![1102][2]u8 {
    var hashed: [1102][2]u8 = undefined;
    var iterator = std.unicode.Utf8Iterator{ .bytes = latinizer.cyrillic_to_latinic, .i = 0 };
    var map_iterator = std.unicode.Utf8Iterator{ .bytes = latinizer.cyrillic_to_latinic_map, .i = 0 };

    while (iterator.nextCodepoint()) |char| {
        _ = try std.unicode.utf8Encode(map_iterator.nextCodepoint() orelse unreachable, &hashed[char]);
        std.debug.print("{} - {s}\n", .{ char, hashed[char] });
    }
    return hashed;
}

main.zig (The project structure hasn’t been decided yet, I’m still learning the basics)

const std = @import("std");
const transliterator = @import("lib.zig");

pub fn main() !void {
    const cyrl_to_lat_hashed_map = try transliterator.common.getCyrillicToLatinicHashedMap();
    const stdin_reader = std.io.getStdIn().reader();
    var buf: [1024]u8 = undefined;

    std.debug.print("Input cyrilic text: ", .{});
    _ = try stdin_reader.readUntilDelimiterOrEof(&buf, '\n');
    std.debug.print("\n{s}", .{buf});

    var buf_copy: [1024]u8 = undefined;
    // Don't judge me pwease. I don't know enough of Zig yet, okay?
    for (buf, 0..) |value, i| {
        buf_copy[i] = value;
    }
    var buf_iter = std.unicode.Utf8Iterator{ .bytes = &buf_copy, .i = 0 };

    var i: usize = 0;
    while (buf_iter.nextCodepoint()) |char| {
        var map_iter = std.unicode.Utf8Iterator{ .bytes = &cyrl_to_lat_hashed_map[char], .i = 0 };
        while (map_iter.nextCodepoint()) |map_char| { // The issue is here
            i += try std.unicode.utf8Encode(map_char, buf[i..]);
            break;
        }
        // if (char == undefined) break;
    }
    std.debug.print("Result: {s}\n", .{buf});
}

The project compiles well, but here’s roughly what I get:

...
1093 - x�
1098 - ʼ // This okina needs two bytes to encode
1099 -  �
1100 -  �
1101 - e�
Input cyrilic text: Хатоликлар // My input
...

thread 359846 panic: attempt to unwrap error: Utf8InvalidStartByte
/usr/lib/zig/std/unicode.zig:28:5: 0x1069578 in utf8ByteSequenceLength (zigxatolik)
    return switch (first_byte) {
    ^
/usr/lib/zig/std/unicode.zig:402:69: 0x1039ca0 in nextCodepointSlice (zigxatolik)
        const cp_len = utf8ByteSequenceLength(it.bytes[it.i]) catch unreachable;
                                                                    ^
/usr/lib/zig/std/unicode.zig:408:44: 0x1035957 in nextCodepoint (zigxatolik)
        const slice = it.nextCodepointSlice() orelse return null;
                                           ^
/home/sohro/projects/zigxatolik/src/main.zig:23:38: 0x10363e8 in main (zigxatolik)
        while (map_iter.nextCodepoint()) |map_char| {
                                     ^
/usr/lib/zig/std/start.zig:524:37: 0x1035545 in posixCallMainAndExit (zigxatolik)
            const result = root.main() catch |err| {
                                    ^
/usr/lib/zig/std/start.zig:266:5: 0x1035061 in _start (zigxatolik)
    asm volatile (switch (native_arch) {
    ^
???:?:?: 0x0 in ??? (???)

Is it possible to handle the unreachable that the function is causing in the header of the while loop? I tried orelse break and while () {} else break;, but I don’t have enough understanding of optionals, or any knowledge that is needed here yet.

while (map_iter.nextCodepoint()) |map_char| { // The issue is here

No need to apologize. A big purpose of the forum is to help those trying to learn Zig.

I think the problem is that you are instantiating Utf8Iterator directly, but it expects to be created with some invariants already checked: Utf8Iterator assumes that the bytes it is given to iterate are known to be valid UTF-8. The “correct” way to create one is through a Utf8View, which performs the invariant check for you.

(try Utf8View.init(bytes)).iterator()

Another option is to check the invariant yourself with utf8ValidateSlice(bytes).
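Both options can be sketched roughly as follows. This is only a minimal example, assuming the std.unicode functions named in this thread; the helper countCodepoints and the sample inputs are made up for illustration.

```zig
const std = @import("std");

// Hypothetical helper: counts code points, validating the input first.
fn countCodepoints(bytes: []const u8) !usize {
    // Utf8View.init checks the whole slice and fails with error.InvalidUtf8
    // instead of letting the iterator hit `unreachable` later.
    const view = try std.unicode.Utf8View.init(bytes);
    var it = view.iterator();
    var count: usize = 0;
    while (it.nextCodepoint()) |_| count += 1;
    return count;
}

pub fn main() !void {
    // Two bytes of 0xaa, like an `undefined` buffer in a debug build.
    const garbage = [_]u8{ 0xaa, 0xaa };

    std.debug.print("{d}\n", .{try countCodepoints("Пог")});

    // Manual check: only build a Utf8Iterator directly once this returns true.
    if (!std.unicode.utf8ValidateSlice(&garbage)) {
        std.debug.print("garbage is not valid UTF-8\n", .{});
    }
}
```

Going through Utf8View turns the invalid input into an ordinary error that try or catch can deal with, which is not possible once the iterator has reached its internal unreachable.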


I think OP’s issue is that they already encoded the UTF-8 earlier, in getCyrillicToLatinicHashedMap, so it should be valid without any checks, but for some reason it is not.


Can you add a print statement like this in the while loop so we can see what it is trying to decode?

std.debug.print("{s} {any}\n", .{&cyrl_to_lat_hashed_map[char], &cyrl_to_lat_hashed_map[char]});

My guess is that it tries to decode an undefined value, which is .{0xaa, 0xaa} in debug mode.


It looks like it’s because of the garbage that’s in buf due to using undefined.

Use the success return value from readUntilDelimiterOrEof; it’s a slice that contains only the data that was read and nothing else.

Also, it doesn’t look like you need to copy buf, but if you do, use @memcpy(&buf_copy, &buf).
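A rough illustration of the "use the returned slice" pattern. To stay independent of stdin, this sketch uses std.fmt.bufPrint as a stand-in for readUntilDelimiterOrEof (an assumption made for the example): both write into part of a caller-provided buffer and return a slice covering only that part.

```zig
const std = @import("std");

pub fn main() !void {
    var buf: [16]u8 = undefined;

    // Stand-in for the read call: fills the start of buf and returns a
    // slice of just the bytes that were written.
    const filled = try std.fmt.bufPrint(&buf, "{s}", .{"Пог"});
    std.debug.print("{d} of {d} bytes are meaningful\n", .{ filled.len, buf.len });

    // If a copy is really needed: @memcpy requires both sides to have the
    // same length, so copy only the filled part.
    var buf_copy: [16]u8 = undefined;
    @memcpy(buf_copy[0..filled.len], filled);
    std.debug.print("{s}\n", .{buf_copy[0..filled.len]});
}
```

Everything in buf past filled.len is still undefined, which is why iterating or printing the whole array runs into garbage.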


To add to what @vulpesx said, this:

    _ = try stdin_reader.readUntilDelimiterOrEof(&buf, '\n');

sticks out. Discarding a return value should only be done very intentionally. For readUntilDelimiterOrEof, it’s basically guaranteed that discarding the return value is a bug, since you need to know the length of the portion of the buffer that was filled.


Thank you all for your response!

Yes, the code was trying to decode bytes past the actual data I wrote in buf. Those bytes were set by the compiler in debug mode when buf was assigned undefined (it’s 0xaa, i.e. 10101010, I think. I read about it in one of Andrew’s own posts).

So, for learning purposes, I wanted to break out of the while loop when it reached the garbage data.

My current code

pub fn main() !void {
    const cyrl_to_lat_hashed_map = try transliterator.common.getCyrillicToLatinicHashedMap();
    const stdin_reader = std.io.getStdIn().reader();
    var buf: [1024]u8 = undefined;
    var result: [1024]u8 = undefined;

    std.debug.print("Input cyrilic text: ", .{});
    const buf_slice: []u8 = try stdin_reader.readUntilDelimiterOrEof(&buf, '\n') orelse unreachable;

    var buf_iter = std.unicode.Utf8Iterator{ .bytes = buf_slice[0..buf_slice.len], .i = 0 };

    var i: usize = 0;
    while (buf_iter.nextCodepoint()) |char| {
        // if (char > 1102) continue;
        var map_iter = std.unicode.Utf8Iterator{ .bytes = &cyrl_to_lat_hashed_map[char], .i = 0 };
        while (map_iter.nextCodepoint()) |map_char| {
            i += try std.unicode.utf8Encode(map_char, result[i..]);
            break;
        }
    }
    std.debug.print("Result: {s}\n", .{result[0..i]});
}

The output

...
1093 - x�
1098 - ʼ
1099 -  �
1100 -  �
1101 - e�
Input cyrilic text: Пог

Result: Pog
--task finished--

[Process exited 0]

This whole time I was thinking in C: in-place functions mutating the input directly, fewer variables defined, everything fitting in one index, etc. ;')

I don’t know… maybe I should use utf16 (wchars in C) for everything in this project. This way, there’s less overhead, though I need to write my own io reader/writer functions I guess(?)


Some minor things:

  • buf_slice[0..buf_slice.len] does nothing, buf_slice is already a slice with length buf_slice.len
  • const buf_slice: []u8 = could just be const buf_slice =, no need to specify the type

Also, I think you’re overcomplicating the mapping part. Just using a function with a switch would work if all mappings are from one code point to another code point:

const std = @import("std");

pub fn main() !void {
    const stdin_reader = std.io.getStdIn().reader();
    var buf: [1024]u8 = undefined;
    var result: [1024]u8 = undefined;

    std.debug.print("Input cyrilic text: ", .{});
    const buf_slice = try stdin_reader.readUntilDelimiterOrEof(&buf, '\n') orelse return;

    var buf_iter = std.unicode.Utf8Iterator{ .bytes = buf_slice, .i = 0 };

    var i: usize = 0;
    while (buf_iter.nextCodepoint()) |code_point| {
        const mapped_code_point = transliterateCyrillicToLatin(code_point);
        i += try std.unicode.utf8Encode(mapped_code_point, result[i..]);
    }

    std.debug.print("Result: {s}\n", .{result[0..i]});
}

fn transliterateCyrillicToLatin(c: u21) u21 {
    return switch (c) {
        // ...
        'Б' => 'B',
        'б' => 'b',
        'В' => 'V',
        'в' => 'v',
        // ...
        'П' => 'P',
        'о' => 'o',
        'г' => 'g',
        // ...
        else => c,
    };
}

If the switch ends up being slower than you’d like it to be, then you could still use it to generate a lookup table at comptime like so:

// Just a proof-of-concept, the range could be narrowed if
// you only care about certain characters
const CyrillicTransliteration = struct {
    // https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block)
    const cyrillic_block_start = 0x0400;
    const cyrillic_block_end = 0x04FF;
    const cyrillic_block_len = cyrillic_block_end - cyrillic_block_start + 1;

    const map = map: {
        var buf: [cyrillic_block_len]u21 = undefined;
        for (&buf, 0..) |*mapped, i| {
            const code_point = cyrillic_block_start + i;
            mapped.* = transliterateCyrillicToLatin(code_point);
        }
        const final = buf;
        break :map final;
    };

    fn isWithinCyrillicBlock(c: u21) bool {
        return c >= cyrillic_block_start and c <= cyrillic_block_end;
    }

    fn get(c: u21) u21 {
        if (!isWithinCyrillicBlock(c)) return c;
        return map[c - cyrillic_block_start];
    }
};

which would get used like this:

const mapped_code_point = CyrillicTransliteration.get(code_point);

Ooh, I see! Am I forgetting what I’ve learned already? Yes ;')

And, holy… the code you provided… I’ll definitely try it all out. The whole reason I’m rewriting the already super-optimized C code is to see the difference in performance first, and then the ease and benefit of rewriting it in Zig.

Thank you all so much :face_holding_back_tears:


But to answer the original question, is it possible to handle unreachable, in a function that doesn’t return errors, syntactically? Or if this question is inappropriate and unrelated, which answer should I mark as a solution here?

Nope, using unreachable tells the compiler that you know that it will truly not ever be reached and that the compiler is allowed to use that information to optimize your code.
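What can be handled is an error or an optional, at the point where it is produced. As a small sketch: the call below is the one visible in the stack trace (utf8ByteSequenceLength), but wrapped in a hypothetical helper, sequenceLength, that maps the error to null instead of writing catch unreachable.

```zig
const std = @import("std");

// Hypothetical helper: makes the same std call that appears in the stack
// trace, but maps the error to null instead of using `catch unreachable`.
fn sequenceLength(first_byte: u8) ?u3 {
    return std.unicode.utf8ByteSequenceLength(first_byte) catch null;
}

pub fn main() void {
    // 0xD0 begins a two-byte sequence; 0xaa is a continuation byte.
    const bytes = [_]u8{ 0xD0, 0xaa };
    for (bytes) |b| {
        if (sequenceLength(b)) |len| {
            std.debug.print("0x{x}: sequence of {d} byte(s)\n", .{ b, len });
        } else {
            std.debug.print("0x{x}: invalid start byte\n", .{b});
        }
    }
}
```

Utf8Iterator does not expose that choice to its caller, so the only way to stay clear of its unreachable is to make sure the input is valid before iterating.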

Here’s an example that shows the generated assembly of a simple function in ReleaseSafe (which panics if it hits unreachable) and ReleaseFast (which assumes that unreachable is accurate and optimizes accordingly):
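A minimal sketch of such a function, reconstructed from the description in this reply rather than copied from the original example; the name foo and the u8 parameter type are assumptions.

```zig
const std = @import("std");

// Hypothetical reconstruction: `foo` and the u8 parameter are assumptions.
fn foo(bar: u8) bool {
    // Promise to the compiler: callers never pass anything but 50.
    if (bar != 50) unreachable;
    return bar == 50;
}

pub fn main() void {
    std.debug.print("{}\n", .{foo(50)});
}
```

Calling foo with any other value panics in Debug and ReleaseSafe builds, and is undefined behavior in ReleaseFast, so there is nothing for the caller to catch.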

You can see that, because we let the compiler know that bar not being 50 is unreachable, in ReleaseFast mode the function just returns true unconditionally.


Thank you again. I’ll study this closely when I finish my other assigned jobs. Ahh, it’s all super interesting ;')
