This is my first time working on a Zig project; I have really fallen in love with the language, and my project is almost complete. There is only one thing holding it back, and I have wasted way too many hours on it.
Assume we have this .json file:
{
"messages": "\u00ce\u0091"
}
\u00ce\u0091 is what CyberChef gives me as the Unicode representation of the Greek letter ‘Α’,
and \u0391 is the code point for the Greek letter ‘Α’.
JSON specifies that \u followed by four hex digits can be used to encode Unicode characters.
All Unicode code points in the Basic Multilingual Plane, between U+0000 and U+FFFF, need only one \u escape. Code points between U+10000 and U+10FFFF must be written as two \u escapes forming a UTF-16 surrogate pair (e.g. the G clef character, U+1D11E, may be represented as “\ud834\udd1e”).
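If you're curious, the surrogate math is mechanical. Here's a quick sketch of my own (not anything from the standard library) that reproduces the G clef example:

const std = @import("std");

// Sketch of the UTF-16 surrogate-pair math: subtract 0x10000 from the
// code point, then split the remaining 20 bits across a high and a low
// surrogate. Assumes cp >= 0x10000.
fn surrogatePair(cp: u21) [2]u16 {
    const v: u32 = @as(u32, cp) - 0x10000;
    return .{
        @as(u16, @intCast(0xD800 + (v >> 10))), // high surrogate
        @as(u16, @intCast(0xDC00 + (v & 0x3FF))), // low surrogate
    };
}

test "G clef U+1D11E becomes \\ud834\\udd1e" {
    const pair = surrogatePair(0x1D11E);
    try std.testing.expectEqual(@as(u16, 0xD834), pair[0]);
    try std.testing.expectEqual(@as(u16, 0xDD1E), pair[1]);
}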
"\u00ce" is the U+00CE code point (Î = Latin capital letter with circumflex). "\u0091" does not correspond to a defined code point (before that is U+007E the ~, and after that is U+00A0 the non breaking space). \u00ce\u0091 is not the Unicode representation of the Greek letter Α, as you already found the correct representation is \u0391 because the code point is U+0391.
Seems like that might be a bug in CyberChef. I don’t get that result from the JavaScript console in Firefox:
console.log("\u00ce\u0091");
�
The encoding of the file doesn’t matter when using escape sequences. The encoding of the result is determined by the parser, and for Zig it’ll be UTF-8.
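For example (a minimal sketch; std.json's API here is as of recent Zig versions), parsing a "\u0391" escape hands you the UTF-8 bytes 0xce 0x91:

const std = @import("std");

test "std.json decodes \\u0391 into UTF-8" {
    const doc =
        \\{ "messages": "\u0391" }
    ;
    const parsed = try std.json.parseFromSlice(
        struct { messages: []const u8 },
        std.testing.allocator,
        doc,
        .{},
    );
    defer parsed.deinit();
    // The escape comes out as the two UTF-8 bytes of Α.
    try std.testing.expectEqualStrings("\xce\x91", parsed.value.messages);
}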
You can see what’s going on on Unicode Plus, a search engine I find very helpful when dealing with Unicode (as I do quite often).
The UTF-8 encoding of Α is 0xce 0x91. If you write the Zig string "\xce\x91" you’ll get Α, or you can write it as "Α", "\u{391}", or "\u{0391}" if you like (note the curly braces; they’re important).
const std = @import("std");
const expectEqualStrings = std.testing.expectEqualStrings;

test "ways to write Α" {
    try expectEqualStrings("Α", "\xce\x91");
    try expectEqualStrings("\u{391}", "\xce\x91");
}
In JSON, you can just put Α directly in your string, because like Zig source code, JSON is natively UTF-8 encoded. If you want to use an escape, it does have to be "\u0391"; unlike Zig strings, JSON strings do not allow arbitrary byte sequences.
If you want to escape an emoji, or any other character outside the Basic Multilingual Plane, you have to use surrogate pairs, like a savage. It’s Microsoft’s fault. In Zig you just write the code point in hexadecimal, which is why the curly braces are used: you can write Α0 as "\u{391}0" and you won’t get 㤐 (U+3910).
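For instance (my own sketch), here's an emoji written three ways in Zig; in JSON the only escaped form is the surrogate pair "\ud83d\ude00":

const std = @import("std");

test "U+1F600 in Zig, no surrogates needed" {
    try std.testing.expectEqualStrings("😀", "\u{1F600}");
    // The same emoji with its UTF-8 bytes spelled out by hand.
    try std.testing.expectEqualStrings("\xf0\x9f\x98\x80", "\u{1F600}");
}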
Yeah, that’s never fun. If there’s any way for you to fix the bug upstream, I encourage you to do that.
The problem you face is undecidable in general. There is a valid interpretation of "\u00CE\u0091": it’s Î followed by a C1 control code. Both are perfectly legal Unicode scalar values, so you’ll have to use heuristics to figure out what the string was supposed to be.
If you luck out and it’s pure mangled UTF-8, then you can strip the leading zeros and use std.fmt.parseInt to turn the two load-bearing hex digits into a byte, and write that byte to the output.
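Something like this rough sketch of that heuristic (unmangle is a name I made up; it assumes the escapes still appear literally in the raw text, i.e. before any JSON decoding):

const std = @import("std");

/// Collapse every literal "\u00XX" escape into the single byte 0xXX,
/// assuming the escapes are really mangled UTF-8 bytes.
fn unmangle(allocator: std.mem.Allocator, raw: []const u8) ![]u8 {
    // The output is never longer than the input, so one allocation suffices.
    var out = try allocator.alloc(u8, raw.len);
    errdefer allocator.free(out);
    var len: usize = 0;
    var i: usize = 0;
    while (i < raw.len) {
        if (i + 6 <= raw.len and std.mem.startsWith(u8, raw[i..], "\\u00")) {
            // The two load-bearing hex digits become one byte.
            out[len] = try std.fmt.parseInt(u8, raw[i + 4 .. i + 6], 16);
            len += 1;
            i += 6;
        } else {
            out[len] = raw[i];
            len += 1;
            i += 1;
        }
    }
    return allocator.realloc(out, len);
}

test "mangled escapes collapse back to Α" {
    const fixed = try unmangle(std.testing.allocator, "\\u00ce\\u0091");
    defer std.testing.allocator.free(fixed);
    try std.testing.expectEqualStrings("Α", fixed);
}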