How to parse JSON with std.json.Value and pinpoint the unexpected value?

I’m writing a language that interprets JSON as a direct representation of its AST, with the AST structured as a simple tree of expressions, as illustrated below:

{
    "expr": {
      "type": "arith-add",
      "deps": [
        {
          "expr": {
            "type": "num",
            "repr": 1
          }
        },
        {
          "expr": {
            "type": "num",
            "repr": null // mistake, would be nice to retrieve line:col of the `null`
          }
        }
      ]
    }
  }

As you can see, the user may make a mistake by writing JSON that is improper relative to the expected structure (not relative to the JSON syntax). My question, therefore, is how to use the std.json library to parse these dynamic structures and, when the structure is ill-formed, identify the location of the error so I can give the user a helpful hint.

What I currently know is how to traverse a JSON tree “dynamically” (as people often refer to it):

const tree = try std.json.parseFromSlice(std.json.Value, alloc, file, .{});
defer tree.deinit();
switch (tree.value) {
    // .array, .object, etc.
}
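
To make this concrete, here is a rough sketch of validating the tree above this way (checkExpr is just an illustrative name, not a std.json API). Note that a std.json.Value carries no source locations, which is exactly my problem:

const std = @import("std");

// Rough sketch only: validate the "expr" shape of a parsed std.json.Value.
fn checkExpr(value: std.json.Value) !void {
    const node = switch (value) {
        .object => |obj| obj,
        else => return error.ExpectedObject,
    };
    const expr = node.get("expr") orelse return error.MissingExpr;
    const body = switch (expr) {
        .object => |obj| obj,
        else => return error.ExpectedObject,
    };
    const ty = body.get("type") orelse return error.MissingType;
    if (ty != .string) return error.ExpectedString;
    // ... recurse into "deps", check "repr" against the type, etc.
}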

Additionally, I know how to print “diagnostics” when the JSON does not comply with the specified Zig type:

// Provided by @garrisonhh from Discord
const std = @import("std");
const alloc = std.heap.c_allocator;

const Type = []const []const u8;

pub fn main() !void {
    var scanner = std.json.Scanner.initCompleteInput(alloc,
        \\["a", "b", {"this": "breaks parsing"}]
    );
    defer scanner.deinit();

    var diag = std.json.Diagnostics{};
    scanner.enableDiagnostics(&diag);

    const parsed = std.json.parseFromTokenSource(Type, alloc, &scanner, .{}) catch {
        std.log.debug("parsing failed at {d}:{d}\n", .{ diag.getLine(), diag.getColumn() });
        std.process.exit(1);
    };
    defer parsed.deinit();

    std.log.debug("{any}", .{parsed.value});
}

The rest is unknown :slight_smile: but it feels like std.json is extensive enough to cover the desired functionality.

Okay, it seems the only way to go is to use std.json.Scanner and build a simple parser on top of it myself, since it is the only structure that exposes a cursor and carries the additional Diagnostics information. However, the API of std.json.Scanner is quite complex, so I could really use some help getting started on writing such a parser.

Here is a minimal example:

const std = @import("std");

pub fn main() !void {
    const gpa = std.heap.page_allocator;

    var scanner = std.json.Scanner.initCompleteInput(gpa,
        \\["a", "b", {"this": "breaks parsing"}]
    );
    defer scanner.deinit();

    var diag = std.json.Diagnostics{};
    scanner.enableDiagnostics(&diag);

    while (true) {
        switch (try scanner.peekNextTokenType()) {
            .end_of_document => break,
            .array_begin => {
                std.debug.print("Array began\n", .{});
                _ = try scanner.next(); // skip
            },
            .array_end => {
                std.debug.print("Array end\n", .{});
                _ = try scanner.next(); // skip
            },
            .string => switch (try scanner.next()) {
                .string, .partial_string => |payload| {
                    std.debug.print("String found: `{s}` (line: {}, col: {})\n", .{payload, diag.getLine(), diag.getColumn()});
                },
                else => return error.UnexpectedToken,
            },
            .object_begin => {
                _ = try scanner.next(); // skip object begin token
                const key = (try scanner.next()).string;
                const value = switch (try scanner.next()) {
                    .string, .partial_string => |payload| payload,
                    else => return error.NotCoveredToken,
                };
                _ = try scanner.next(); // skip object end token

                std.debug.print("Object pair found: key:`{s}`, value:`{s}` (line: {}, col: {})\n", .{key, value, diag.getLine(), diag.getColumn()});
            },
            else => return error.NotCoveredToken,
        }
    }
}

And I recommend using the std.json.Reader type when reading from files.
It refills its input buffer automatically and is memory-friendly.

pub fn main() !void {
    const gpa = std.heap.page_allocator;

    const file = try std.fs.openFileAbsolute(...);
    defer file.close();

    var reader = std.json.reader(gpa, file.reader());

    (snip)
}

Thank you @ktz_alias ! That is awesome!

Still have questions though…

  1. Is this the right use of Reader?
pub fn main() !void {
    const gpa = std.heap.page_allocator;

    const file = try std.fs.cwd().openFile("test.json", .{});
    defer file.close();

    var reader = std.json.Reader(4 * 1024, @TypeOf(file.reader())).init(gpa, file.reader());

    var diag = std.json.Diagnostics{};
    reader.enableDiagnostics(&diag);

    while (true) {
        // switch (try reader.peekNextTokenType()) {
        //     ...
    }
}
  2. How should I handle .partial_number if .next() returns one?

  3. Could you explain how getColumn computes the column with this wrap-around arithmetic and cursor-pointer dereference? I’ve been trying to understand it for the past two days, but I just can’t grasp the trick, especially the obfuscated line_start_cursor, defined as @as(usize, @bitCast(@as(isize, -1)))

pub const Diagnostics = struct {
    line_number: u64 = 1,
    line_start_cursor: usize = @as(usize, @bitCast(@as(isize, -1))), // Start just "before" the input buffer to get a 1-based column for line 1.
    total_bytes_before_current_input: u64 = 0,
    cursor_pointer: *const usize = undefined,

    /// Starts at 1.
    pub fn getLine(self: *const @This()) u64 {
        return self.line_number;
    }
    /// Starts at 1.
    pub fn getColumn(self: *const @This()) u64 {
        return self.cursor_pointer.* -% self.line_start_cursor;
    }
    /// Starts at 0. Measures the byte offset since the start of the input.
    pub fn getByteOffset(self: *const @This()) u64 {
        return self.total_bytes_before_current_input + self.cursor_pointer.*;
    }
};

If the files that you plan to read are not gigantic, you can use the slice API, which will never yield partial tokens. Partial tokens are only emitted when dealing with streaming data.
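
For example (a sketch; the path and size limit are placeholders):

pub fn main() !void {
    const gpa = std.heap.page_allocator;

    // Read the whole file up front (path and size limit are placeholders).
    const bytes = try std.fs.cwd().readFileAlloc(gpa, "test.json", 1024 * 1024);
    defer gpa.free(bytes);

    // Complete input: the scanner will never emit .partial_* tokens.
    var scanner = std.json.Scanner.initCompleteInput(gpa, bytes);
    defer scanner.deinit();

    // (snip) token loop as above
}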

If instead you want to support streaming, then my understanding is that you need to save those partial tokens and stitch them back together into a complete number. More concretely, you probably want to accumulate their content into an ArrayList(u8) and then read the full number once the parser stops producing partial tokens.
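
Here is a sketch of what I mean, assuming a streaming std.json.Reader; nextNumber is a made-up helper name, not part of std.json:

const std = @import("std");

// Sketch only: read one number token from a streaming scanner/reader,
// stitching .partial_number chunks back together.
// Returns f64 for simplicity; a real implementation would also handle ints.
fn nextNumber(gpa: std.mem.Allocator, scanner: anytype) !f64 {
    var buf = std.ArrayList(u8).init(gpa);
    defer buf.deinit();

    while (true) {
        switch (try scanner.next()) {
            // A number split across buffer refills arrives as one or more
            // .partial_number chunks...
            .partial_number => |chunk| try buf.appendSlice(chunk),
            // ...followed by a final .number chunk.
            .number => |chunk| {
                try buf.appendSlice(chunk);
                return std.fmt.parseFloat(f64, buf.items);
            },
            else => return error.UnexpectedToken,
        }
    }
}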

Ignore the bitcast stuff, that’s an artifact caused by some changes in the builtins that then got auto-corrected by zig fmt in a confusing way.

The parser contains a field named cursor that holds the byte offset into the JSON byte slice. Every time you call next, the parser finds the next token and updates the cursor to the end of the new token it’s about to return. Diagnostics have a pointer to that cursor (cursor_pointer) and also keep track of the last time a newline was found (line_start_cursor). Subtract line_start_cursor from the current value of cursor and you get the column number.


More concretely, you probably want to accumulate their content into an ArrayList(u8) and then read the full number once the parser stops producing partial tokens.

Got it. I assumed the same.

Diagnostics have a pointer to that cursor (cursor_pointer) and also keep track of the last time a newline was found (line_start_cursor). Subtract line_start_cursor from the current value of cursor and you get the column number.

Great! This explains something!

Ignore the bitcast stuff, that’s an artifact caused by some changes in the builtins that then got auto-corrected by zig fmt in a confusing way.

Actually, I believe this was written intentionally, and I also think I finally understand the trick! :slight_smile: The author found a way to avoid negative numbers while keeping the same “semantics”: wrapping arithmetic (-% instead of plain -). I’ll try to explain:

line_start_cursor: usize = @as(usize, @bitCast(@as(isize, -1)))
// Which is the same as:
line_start_cursor: isize = -1,

// Getting current column is defined as:
self.cursor_pointer.* -% self.line_start_cursor;
// or
self.cursor -% self.line_start_cursor;
// or (if the `line_start_cursor` was isize)
self.cursor - self.line_start_cursor;

// Which actually works, here is an example:
//
// Case 1: we just started parsing and want the column immediately
// 
// `[\n1,2\n]`
//  01 2345 6
//  ^
//  self.cursor(0)
// ^
// line_start_cursor(-1)
// 
// The column relative to the beginning of the line will be:
// self.cursor - line_start_cursor = 0 - -1 = 1 (correct)

// Case 2: We proceeded to the first "real" line
//
// `[\n1,2\n]`
//  01 2345 6
//     ^
//     self.cursor(2)
//   ^
//   line_start_cursor(1)
// 
// self.cursor - line_start_cursor = 2 - 1 = 1 (correct, again)
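
To convince myself, here is a standalone test replaying Cases 1 and 2 with the same wrapping math:

const std = @import("std");

test "1-based column via wrapping subtraction" {
    // Case 1: cursor on the very first byte, line_start_cursor "just before" it.
    const line_start: usize = @as(usize, @bitCast(@as(isize, -1)));
    const cursor: usize = 0;
    try std.testing.expectEqual(@as(usize, 1), cursor -% line_start);

    // Case 2: line 2 of "[\n1,2\n]" starts right after the '\n' at index 1.
    const line_start2: usize = 1;
    const cursor2: usize = 2; // pointing at the '1'
    try std.testing.expectEqual(@as(usize, 1), cursor2 -% line_start2);
}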

The second part that was difficult for me was understanding how the column arithmetic works across buffer boundaries. First, let’s look at what happens before we move on to a new buffer (the comments below are not mine):

pub fn feedInput(self: *@This(), input: []const u8) void {
    assert(self.cursor == self.input.len); // Not done with the last input slice.
    if (self.diagnostics) |diag| {
        diag.total_bytes_before_current_input += self.input.len;
        // This usually goes "negative" to measure how far before the beginning
        // of the new buffer the current line started.
        diag.line_start_cursor -%= self.cursor;
    }
    self.input = input;
    self.cursor = 0;
    self.value_start = 0;
}

Here is a visualized example matching the logic above:

// Case 3: Calculate column across buffer boundaries
//
// `[\n1,` `2\n]`
//  01 23   4 5 (implied indices)
//       ^  0 1 (actual indices in a new buffer)
//       self.cursor(4) -- the assert above that assumes we're done
//   ^
//   line_start_cursor(1)
// 
// Before proceeding to a new buffer, we "save" the relative position of the current line starting position:
// line_start_cursor - self.cursor = 1 - 4 = -3

// Or, in source code above:
diag.line_start_cursor -%= self.cursor;

// So, in case we would want to get the column of a line that spans two buffers:
//     12   3
// `[\n1,` `2\n]`
//  01 23   0 1
//          ^  
//          self.cursor(0) -- we are in a new buffer
//   ^
//   line_start_cursor(-3) -- the previous "saved" state
//
// Now, if we do the usual column calc, we get:
// self.cursor - line_start_cursor = 0 - -3 = 3 (correct)
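
And the same kind of standalone test for Case 3:

const std = @import("std");

test "column across buffer boundaries" {
    // First buffer "[\n1," ends with cursor == 4 (its length, per the assert).
    var line_start: usize = 1; // index of the '\n' in the first buffer
    line_start -%= 4; // feedInput: "goes negative", wraps to usize(-3)

    const cursor: usize = 0; // start of the second buffer "2\n]"
    try std.testing.expectEqual(@as(usize, 3), cursor -% line_start);
}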

Anyway, having said that, it is still difficult to wrap my head around wrapping arithmetic :smiley:

PS. I think it’s magical how the author was able to cover all the cases and achieve a 1-based column calculation with (essentially) just two simple lines of code.