Unicode/UTF-8 decoding and JSON parsing issues

mnemnion · October 10, 2024, 10:11pm

You can see what’s going on at Unicode Plus, a search engine I find very helpful when dealing with Unicode (as I do quite often).

The UTF-8 encoding of Α (U+0391) is 0xce 0x91. If you write the Zig string "\xce\x91" you’ll get Α. You can also write it as "Α", "\u{391}", or "\u{0391}" if you would like (note the curly braces, they are required).

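const std = @import("std");
const expectEqualStrings = std.testing.expectEqualStrings;

test "ways to write Α" {
    // All three literals produce the same two bytes.
    try expectEqualStrings("Α", "\xce\x91");
    try expectEqualStrings("\u{391}", "\xce\x91");
    try expectEqualStrings("\u{0391}", "\xce\x91");
}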

In JSON, you can just put Α directly in your string, because, like Zig source code, JSON is natively UTF-8 encoded. If you want to use an escape, it has to be "\u0391", with exactly four hex digits and no braces. Unlike Zig strings, JSON strings do not allow arbitrary byte sequences.
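Here is a rough sketch of both JSON spellings going through the standard library parser. It assumes the `std.json.parseFromSlice` API as it exists in Zig 0.11 through 0.13 (it may differ in other versions), and the test name is just for illustration.

```zig
const std = @import("std");

test "JSON: literal and escaped alpha decode to the same bytes" {
    const allocator = std.testing.allocator;

    // The JSON document "Α" with the character written directly as UTF-8.
    const direct = try std.json.parseFromSlice([]const u8, allocator, "\"Α\"", .{});
    defer direct.deinit();

    // The JSON document "\u0391" (the backslash is doubled for the Zig literal).
    const escaped = try std.json.parseFromSlice([]const u8, allocator, "\"\\u0391\"", .{});
    defer escaped.deinit();

    // Both parse to the two UTF-8 bytes 0xce 0x91.
    try std.testing.expectEqualStrings("\xce\x91", direct.value);
    try std.testing.expectEqualStrings("\xce\x91", escaped.value);
}
```

Save it to a file and run it with `zig test`.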

If you want to escape an emoji in JSON, or any other code point outside the Basic Multilingual Plane, you have to use surrogate pairs, like a savage. It’s Microsoft’s fault. In Zig you just write the code point in hexadecimal. The curly braces mark where the escape ends, so you can write Α0 as "\u{391}0" and you won’t get 㤐 (U+3910).
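To make the surrogate pair business concrete, here is a minimal sketch using 😀 (U+1F600) as the example. `surrogatePair` is a hypothetical helper written for this post, not something from the standard library, and the JSON part again assumes the `std.json.parseFromSlice` API from Zig 0.11 through 0.13.

```zig
const std = @import("std");

// Hypothetical helper: split a code point above U+FFFF into the two
// UTF-16 surrogates that a JSON \u escape needs.
fn surrogatePair(cp: u21) [2]u16 {
    std.debug.assert(cp >= 0x10000);
    const v: u32 = @as(u32, cp) - 0x10000;
    const high: u16 = @intCast(0xD800 + (v >> 10)); // top 10 bits
    const low: u16 = @intCast(0xDC00 + (v & 0x3FF)); // bottom 10 bits
    return .{ high, low };
}

test "emoji: one escape in Zig, two in JSON" {
    const pair = surrogatePair(0x1F600);
    try std.testing.expectEqual(@as(u16, 0xD83D), pair[0]);
    try std.testing.expectEqual(@as(u16, 0xDE00), pair[1]);

    // JSON document "\ud83d\ude00" (backslashes doubled for the Zig literal).
    const parsed = try std.json.parseFromSlice(
        []const u8,
        std.testing.allocator,
        "\"\\ud83d\\ude00\"",
        .{},
    );
    defer parsed.deinit();

    // In Zig the same character is a single braced escape.
    try std.testing.expectEqualStrings("\u{1F600}", parsed.value);
}

test "braces end the escape" {
    // Α followed by the digit 0: two bytes for Α, then ASCII '0'.
    try std.testing.expectEqualStrings("\xce\x910", "\u{391}0");
    // U+3910 is a different, three-byte character.
    try std.testing.expectEqualStrings("\xe3\xa4\x90", "\u{3910}");
}
```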

Welcome to Ziggit! Hope this helps.