Unicode/UTF-8 decoding and JSON parsing issues

This is my first time working on a Zig project. I have really fallen in love with the language, and my project is almost complete. There is only one thing holding it back, and I have wasted way too many hours on it.

Assume we have this .json file:

{
    "messages": "\u00ce\u0091"
}

\u00ce\u0091 is the Unicode representation of the Greek letter ‘Α’
and
\u0391 is the code point for the Greek letter ‘Α’

I parse it:

const std = @import("std");
const msg = struct {
    messages: ?[]const u8 = null,
};

pub fn main() !void {
    const file = try std.fs.cwd().openFile("test.json", .{});
    defer file.close();
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const allocator = arena.allocator();

    const json_string = try file.readToEndAlloc(allocator, std.math.maxInt(u32));
    defer allocator.free(json_string);
    const parsed = try std.json.parseFromSlice(msg, allocator, json_string, .{});
    defer parsed.deinit();
    std.debug.print("{s}\n", .{parsed.value.messages.?});
}

But this yields Î.

I can’t seem to find a way to properly parse the Unicode characters (in a real scenario the message is longer).

If the json were:

{
    "messages": "\u0391"
}

the parsing would correctly yield ‘Α’.

Can anyone with a better understanding of Zig’s interpretation of Unicode shed some light?

Thank you.

Hello @jimangel2001
Welcome to Ziggit 🙂

JSON specifies that \u followed by 4 hex digits can be used to encode Unicode characters.
All code points in the Basic Multilingual Plane, between U+0000 and U+FFFF, need only a single \u escape. Code points between U+10000 and U+10FFFF must be written as two \u escapes forming a UTF-16 surrogate pair (e.g. the G clef character U+1D11E may be represented as “\ud834\udd1e”).
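
For reference, here is a minimal sketch (assuming a recent Zig where std.json.parseFromSlice can parse a JSON string straight into a []const u8) showing both forms decoding to UTF-8:

const std = @import("std");

test "JSON \\u escapes decode to UTF-8" {
    const allocator = std.testing.allocator;

    // A BMP code point needs only a single \u escape.
    const alpha = try std.json.parseFromSlice([]const u8, allocator, "\"\\u0391\"", .{});
    defer alpha.deinit();
    try std.testing.expectEqualStrings("\xce\x91", alpha.value); // UTF-8 for Α

    // A code point above U+FFFF needs a UTF-16 surrogate pair.
    const clef = try std.json.parseFromSlice([]const u8, allocator, "\"\\ud834\\udd1e\"", .{});
    defer clef.deinit();
    try std.testing.expectEqualStrings("\xf0\x9d\x84\x9e", clef.value); // UTF-8 for 𝄞 (U+1D11E)
}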

"\u00ce" is the U+00CE code point (Î = Latin capital letter with circumflex).
"\u0091" does not correspond to a defined code point (before that is U+007E the ~, and after that is U+00A0 the non breaking space).
\u00ce\u0091 is not the Unicode representation of the Greek letter Α, as you already found the correct representation is \u0391 because the code point is U+0391.
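
You can check what the parser actually produces for the escaped input; here is a quick sketch (same std.json assumption as above):

const std = @import("std");

test "what \\u00ce\\u0091 actually decodes to" {
    const parsed = try std.json.parseFromSlice([]const u8, std.testing.allocator, "\"\\u00ce\\u0091\"", .{});
    defer parsed.deinit();
    // U+00CE and U+0091, each UTF-8 encoded: Î followed by a C1 control code.
    try std.testing.expectEqualStrings("\xc3\x8e\xc2\x91", parsed.value);
}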

Thanks, glad to be here!

I think I get it, but not 100%.

Here are two CyberChef links:

\u00ce\u0091 yields Α

\u0391 also yields Α

So if I have understood correctly, the files I am given are UTF-16 encoded?

Seems like that might be a bug in CyberChef. I don’t get that result from the JavaScript console in Firefox:

console.log("\u00ce\u0091");
Î�

The encoding of the file doesn’t matter when using escape sequences. The encoding of the result is determined by the parser, and for Zig it’ll be UTF-8.

You can see what’s going on at Unicode Plus; it’s a search engine I find very helpful when dealing with Unicode (as I do quite often).

The UTF-8 encoding of Α is 0xce 0x91. If you write the Zig string "\xce\x91" you’ll get Α, or you can write it as "Α", "\u{391}", or "\u{0391}" if you like (note the curly braces; they’re important).

const std = @import("std");
const expectEqualStrings = std.testing.expectEqualStrings;

test "ways to write Α" {
    try expectEqualStrings("Α", "\xce\x91");
    try expectEqualStrings("\u{391}", "\xce\x91");
}

In JSON, you can just put Α directly in your string, because like Zig source code, JSON is natively UTF-8 encoded. If you want to use an escape, it has to be "\u0391"; unlike Zig strings, JSON strings do not allow arbitrary byte sequences.

If you want to escape an emoji, or any other code point outside the Basic Multilingual Plane, you have to use surrogate pairs, like a savage. It’s Microsoft’s fault. In Zig you just write the codepoint in hexadecimal, and the curly braces mark where it ends, so you can write Α0 as "\u{391}0" without it being read as the single code point U+3910.
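
Here is a small sketch tying that together (assuming std.json.parseFromSlice will parse a JSON string into a []const u8): a raw UTF-8 Α and the "\u0391" escape parse to the same bytes, and Zig’s braces keep "\u{391}0" from being read as one code point.

const std = @import("std");

test "raw UTF-8 and \\u0391 are the same string" {
    const allocator = std.testing.allocator;

    const raw = try std.json.parseFromSlice([]const u8, allocator, "\"Α\"", .{});
    defer raw.deinit();
    const escaped = try std.json.parseFromSlice([]const u8, allocator, "\"\\u0391\"", .{});
    defer escaped.deinit();
    try std.testing.expectEqualStrings(raw.value, escaped.value);

    // The closing brace ends the escape, so this is Α followed by the digit '0'.
    try std.testing.expectEqualStrings("\xce\x91" ++ "0", "\u{391}0");
}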

Welcome to Ziggit! Hope this helps.

Thanks a lot for the very detailed explanation.

Because the data I am given is encoded in the format shown above, I guess I have to find a way to parse it on my own and somehow convert it to UTF-8.

Yeah, that’s never fun. If there’s any way for you to get the bug fixed upstream, I encourage you to do that.

The problem you face is undecidable. There is a valid interpretation of "\u00CE\u0091": Î followed by a C1 control code. Those are both perfectly legal Unicode scalar values, so you’ll have to use heuristics to figure out what the string is supposed to be.

If you luck out and it’s pure mangled UTF-8, then you can strip the leading zeros and use std.fmt.parseInt to turn the two load-bearing hex digits of each escape into a byte, and write that byte to the output string.
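
Here is a minimal sketch of that idea, assuming the mangled escapes appear in the raw JSON text and every \u00XX escape stands for exactly one original UTF-8 byte (unmangle is just a hypothetical helper name):

const std = @import("std");

/// Hypothetical helper: turn a run of mangled \u00XX escapes back into the
/// raw UTF-8 bytes they were meant to be.
fn unmangle(allocator: std.mem.Allocator, escaped: []const u8) ![]u8 {
    // The result is never longer than the input, so over-allocate and shrink at the end.
    const out = try allocator.alloc(u8, escaped.len);
    errdefer allocator.free(out);

    var len: usize = 0;
    var i: usize = 0;
    while (i < escaped.len) {
        if (i + 6 <= escaped.len and std.mem.startsWith(u8, escaped[i..], "\\u00")) {
            // The two load-bearing hex digits are the original UTF-8 byte.
            out[len] = try std.fmt.parseInt(u8, escaped[i + 4 .. i + 6], 16);
            len += 1;
            i += 6;
        } else {
            // Anything else passes through untouched.
            out[len] = escaped[i];
            len += 1;
            i += 1;
        }
    }
    return allocator.realloc(out, len);
}

test "recover Α from mangled escapes" {
    const fixed = try unmangle(std.testing.allocator, "\\u00ce\\u0091");
    defer std.testing.allocator.free(fixed);
    try std.testing.expectEqualStrings("\xce\x91", fixed); // valid UTF-8 for Α
}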