The Incredible Unicode Mess

It’s common for people coming from languages that provide some abstraction over the encoding to see Zig’s approach as a problem. However, once those high-level abstractions stop working for your particular use case, you’re in a much worse position: you have to figure out how to bypass the abstraction and then re-implement the one your application actually needs.

This gets even worse when you try to share code across external libraries. Most code will use the default abstraction, so if your application can’t, that sometimes means cutting yourself off from most of the other libraries that exist in the world.

An example of this is D’s “autodecoding”; see the “auto-decoding” thread on the D Programming Language Discussion Forum.

D made the arbitrary decision to always iterate over u8 slices by “Unicode codepoint”, assuming they are UTF-8 encoded. However, some applications may require iterating over them by grapheme, by grapheme cluster, or even by raw byte. The trouble is that this abstraction not only hides the details you might need to know to write your applications/libraries correctly, it misleads you into thinking everything is simpler than it is and that your code will work in more places than it will.
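
To make the distinction concrete, here is a minimal sketch (in Python rather than Zig, purely to keep it dependency-free and short) of how the same visible character decomposes at each of these levels:

```python
import unicodedata

# The visible character "é" written as two codepoints:
# 'e' followed by U+0301 COMBINING ACUTE ACCENT.
s = "e\u0301"

codepoints = list(s)            # iterating "by codepoint" (what D autodecodes to)
raw = s.encode("utf-8")         # iterating by raw byte

print(len(codepoints))  # 2 codepoints
print(len(raw))         # 3 UTF-8 bytes

# Yet a terminal renders s as a single grapheme cluster, and the same
# character can also be written as the single precomposed codepoint U+00E9:
print(unicodedata.normalize("NFC", s) == "\u00e9")  # True
# So none of these counts is "the" length of the string; which one you
# want depends entirely on the application.
```

This is why a language-level default of iterating by codepoint is a genuine design decision, not a neutral convenience.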

In the end I think UTF-8, codepoints, graphemes, etc. are pretty simple, but they are made to look more complicated than they are because most languages try to hide those details from you.

7 Likes

True, I get the point, and I think Zig made the right decision.
Now I am fighting with the VS Code terminal… and the Windows console…

Consoles are notorious for bad Unicode support, especially on Windows: legacy codepage defaults, dependence on the system code page, and limited fonts.

A lot to unpack there, especially on Windows, with its ambivalent but evolving support for UTF-8.

What helped me in that respect was this: Zig runtime should set console output mode to utf-8 by default on Windows · Issue #7600 · ziglang/zig · GitHub. However, as soon as you write software to be run on other people’s machines, that might put you back to square one…
Concerning VSCode, PowerShell (+ oh-my-posh) worked out well for me. But I don’t spend too much time on Windows, so there might be better options :wink:

1 Like

This helped me implement a proper solution to handle both Windows and non-Windows targets: windows codepage

1 Like

Almost correct. In a later Zig version this works. Got the crazy characters displaying now. Great!

const std = @import("std");

const UTF8ConsoleOutput = struct
{
    original: c_uint = undefined,

    fn init() UTF8ConsoleOutput
    {
        var self = UTF8ConsoleOutput{};
        if (std.builtin.subsystem == .Windows)
        {
            const kernel32 = std.os.windows.kernel32;
            self.original = kernel32.GetConsoleOutputCP();
            _ = kernel32.SetConsoleOutputCP(65001);
        }
        return self;
    }

    fn deinit(self: *UTF8ConsoleOutput) void
    {
        if (self.original != undefined)
        {
            _ = std.os.windows.kernel32.SetConsoleOutputCP(self.original);
        }
    }
};

Whoa there, cowboy: if (x != undefined) is always illegal behavior. You need optionals.

6 Likes

I blindly and brainlessly copied / repaired the code to get something working. I’m not the cowboy :slight_smile:
edit: I see the problem now…

2 Likes

I’m not sure this is doing what you think it’s doing. The .Windows subsystem is used to tell the linker that the program is a GUI program and therefore no console should be spawned. If this code is working for you, then it’s likely due to the subsystem stuff currently being buggy.

You probably want this instead:

const builtin = @import("builtin");

// ...

    if (builtin.os.tag == .windows) {

This will make it so the code only runs when targeting Windows.

3 Likes

OK, something like this then? Some things I just do not want to dive into… we cannot know everything.

const std = @import("std");
const builtin = @import("builtin");

const UTF8ConsoleOutput = struct
{
    original: ?c_uint = null, // for cowboy

    fn init() UTF8ConsoleOutput
    {
        var self = UTF8ConsoleOutput{};
        if (builtin.os.tag == .windows)
        {
            const kernel32 = std.os.windows.kernel32;
            self.original = kernel32.GetConsoleOutputCP();
            _ = kernel32.SetConsoleOutputCP(65001);
        }
        return self;
    }

    fn deinit(self: *UTF8ConsoleOutput) void
    {
        if (self.original) |org|
        {
            _ = std.os.windows.kernel32.SetConsoleOutputCP(org);
        }
    }
};

Made a comment on the issue with how I would fix it. Reproduced here:

const std = @import("std");
const builtin = @import("builtin");

const UTF8ConsoleOutput = struct {
    original: if (builtin.os.tag == .windows) c_uint else void,

    fn init() UTF8ConsoleOutput {
        if (builtin.os.tag == .windows) {
            const original = std.os.windows.kernel32.GetConsoleOutputCP();
            _ = std.os.windows.kernel32.SetConsoleOutputCP(65001);
            return .{ .original = original };
        }
        return .{ .original = {} };
    }

    fn deinit(self: UTF8ConsoleOutput) void {
        if (builtin.os.tag == .windows) {
            _ = std.os.windows.kernel32.SetConsoleOutputCP(self.original);
        }
    }
};

pub fn main() !void {
    const cp_out = UTF8ConsoleOutput.init();
    defer cp_out.deinit();

    std.debug.print("\u{00a9}", .{});
}

5 Likes