Resources for parsing JSON strings

Howdy, everyone! I am trying to write a simple query language for getting values out of JSON data. I’m looking for any resources related to parsing strings using the standard library. I am familiar with just the basics like std.fmt.allocPrint, as well as accessing and iterating over arrays and slices using the examples from the documentation. So I’m looking for resources on more advanced techniques, or ways to build on the given examples. Also anything else from the standard library that might not be obvious would be helpful, as well as examples from other projects which don’t rely on external dependencies. Thank you!

Have you checked out the std.mem module? There is some amazing stuff in there and it’s a very informative read.

zig/lib/std/mem.zig at master · ziglang/zig · GitHub

I’d recommend starting with SplitIterator and the splitSequence function, because they process strings lazily: each call to next() hands back the next slice of the input instead of building a list up front. This is a very important concept in Zig because of how hyper-attentive the language is to allocations.
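To make the lazy part concrete, here is a minimal sketch. It assumes a Zig release where std.mem.splitSequence exists (roughly 0.11 or later), and the dotted path "newone.name" is only a hypothetical query format used for illustration, not something the thread has settled on.

```zig
const std = @import("std");

/// Counts the segments of a dotted path such as "newone.name".
/// Nothing is allocated: each segment is a slice into `path`.
fn countSegments(path: []const u8) usize {
    var it = std.mem.splitSequence(u8, path, ".");
    var count: usize = 0;
    while (it.next()) |_| count += 1;
    return count;
}

test "walk a dotted path one segment at a time" {
    var it = std.mem.splitSequence(u8, "newone.name", ".");
    try std.testing.expectEqualStrings("newone", it.next().?);
    try std.testing.expectEqualStrings("name", it.next().?);
    try std.testing.expect(it.next() == null);
}
```

The iterator only holds the buffer, the delimiter, and a position, so splitting costs nothing until you actually ask for the next piece.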

The tokenize variants are very interesting as well for simple tokenization jobs. If you want something like byte-pair encodings because you are interested in sub-word encodings, then you’ll still have to build something custom (I’m working on one of those right now, actually).

Anyhow, start with std.mem and I’d say go from there! If you haven’t checked out the fmt wing of the library yet, they build dedicated tokenizers for parsing format strings at compile time (an extremely useful perk of zig is compile time strings).

Thanks! Yeah I’ve been using mem.eql and fmt.allocPrint for some things. So I will definitely check out the SplitIterator. I don’t see splitSequence here but I see splitTokenizer. I will have to look more into the tokenization idea since I am not yet familiar with it, as well as lazy-evaluated string processing. Here is where I am at with this project…

const std = @import("std");
const allocator = std.testing.allocator;
const expect = std.testing.expect;

fn parse(object: anytype, query: []const u8) !void {
   const result = object.get(query).?;

   const f_result = try std.fmt.allocPrint(allocator, "{any}", .{result});
   defer allocator.free(f_result);

   const t_result = f_result[21];

   switch (t_result) {
      105 => {
         const output = object.get(query).?.integer;
         try expect(output == 74363456);
      },

      115 => {
         const output = object.get(query).?.string;
         try expect(std.mem.eql(u8, output, "okay"));
      },

      98 => {
         const output = object.get(query).?.bool;
         try expect(output == true);
      },

      102 => {
         const output = object.get(query).?.float;
         try expect(output == 2.33e+77);
      },

      110 => {
         const output = object.get(query).?.null;
         try expect(@TypeOf(output) == void);
      },

      97 => {
         const output = object.get(query).?.array;
         try expect(output.items[0].integer == 3);
      },

      111 => {
         const output = object.get(query).?.object;
         try expect(std.mem.eql(u8, output.get("name").?.string, "hello"));
      },

      else => {}
   }
}

test {
   const my_json =
      \\{
      \\   "hello": "okay",
      \\   "okay": 74363456,
      \\   "noway": true,
      \\   "yesway": 2.33e+77,
      \\   "maybe": [3, 4, 5],
      \\   "newone": {"name":"hello"},
      \\   "empty": null
      \\}
   ;

   const parsed = try std.json.parseFromSlice(std.json.Value, allocator, my_json, .{});
   defer parsed.deinit();

   const object = parsed.value.object;

   try parse(object, "hello");
   try parse(object, "okay");
   try parse(object, "noway");
   try parse(object, "yesway");
   try parse(object, "maybe");
   try parse(object, "newone");
   try parse(object, "empty");

   const keys = object.keys();
   try expect(@TypeOf(keys) == [][]const u8);

   const values = object.values();
   try expect(@TypeOf(values) == []std.json.Value);
}

The tricky part was getting the json library to parse a string without a comptime-known structure or type. The best you can do is call something like object.get(query).?.string or object.get(query).?.integer, which still requires that you know the type of the value you are retrieving when compiling. So to deal with this I took the output of object.get(query).?, which gives you a JSON value whose active field is only decided at runtime, depending on the value retrieved from the specified key…

result = json.dynamic.Value{ .string = { 111, 107, 97, 121 } }
-or-
result = json.dynamic.Value{ .integer = 74363456 }
etc…

From here I run this output through std.fmt.allocPrint like so…

try std.fmt.allocPrint(allocator, "{any}", .{result});

This converts the output value into a string, and from there I can access the 21st index of the string “json.dynamic.Value{ .integer = 74363456 }” which represents the first letter following the initial . in the opening bracket (in this case ‘i’, for string it is ‘s’, etc…)

And finally I can make a switch statement based on this index to run the get function with the correct type (object.get(query).?.bool, object.get(query).?.float, etc…)

This is the best way I have found to do this without knowing anything in advance about the contents of the json data. Right now the queries are just the key strings, but I want to expand them to be able to do something like

{ hello }
-or-
{ name, id }
for queries to nested values. And also to return the output in formatted JSON…

{ "result": "okay" }
{ "result": 74363456 }

etc…
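For the brace-and-comma query form sketched above, one possible starting point is std.mem.tokenizeAny, which skips any run of delimiter bytes and yields only the pieces in between. This is a rough sketch under assumptions: fieldAt is a hypothetical helper name, the delimiter set "{}, " is just a guess at what the query syntax needs, and it assumes a Zig release where tokenizeAny exists (roughly 0.11 or later).

```zig
const std = @import("std");

/// Returns the field name at position `index` in a query such as
/// "{ name, id }", or null if the query has fewer fields than that.
fn fieldAt(query: []const u8, index: usize) ?[]const u8 {
    // Braces, commas and spaces are all treated as separators.
    var it = std.mem.tokenizeAny(u8, query, "{}, ");
    var i: usize = 0;
    while (it.next()) |token| : (i += 1) {
        if (i == index) return token;
    }
    return null;
}

test "pull field names out of a query" {
    try std.testing.expectEqualStrings("name", fieldAt("{ name, id }", 0).?);
    try std.testing.expectEqualStrings("id", fieldAt("{ name, id }", 1).?);
}
```

Unlike the split functions, the tokenize functions never return empty slices, which is convenient when the amount of whitespace between fields is not fixed.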

I’m still not sure that there isn’t an easier way to do this with the standard library, but I suspect that with Zig’s strict comptime-known rules that I am pretty close. This type of app seems to fall more into the category of scripting or external library. In any case I will check out everything you mentioned and look more closely into std.mem and std.fmt, in particular the iterators, to see what else I can come up with while I work on expanding this concept. Thanks again!

I started playing around with the example you posted to see what options there were. There’s quite a bit you can do here.

First, I personally would not rely on that index being consistent across versions. That’s more of an implementation detail. Since the value returned is a union, you can just do a switch over the value itself and achieve the same thing you’re doing above.
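As a minimal sketch of that idea: std.json.Value is a tagged union, so the active field can be inspected directly at runtime, with no formatting and no magic index. The helper name kindOf is hypothetical, and the field names assume the std.json.Value layout used elsewhere in this thread.

```zig
const std = @import("std");

/// Names the kind of a parsed JSON value by switching on the union itself.
fn kindOf(value: std.json.Value) []const u8 {
    return switch (value) {
        .null => "null",
        .bool => "bool",
        .integer => "integer",
        .float => "float",
        .string => "string",
        .array => "array",
        .object => "object",
        else => "other", // any remaining variants, e.g. number_string
    };
}

test "inspect the active field at runtime" {
    const parsed = try std.json.parseFromSlice(
        std.json.Value,
        std.testing.allocator,
        "{\"okay\": 74363456, \"hello\": \"okay\"}",
        .{},
    );
    defer parsed.deinit();

    const object = parsed.value.object;
    try std.testing.expectEqualStrings("integer", kindOf(object.get("okay").?));
    // @tagName gives the same answer without a hand-written switch.
    try std.testing.expectEqualStrings("string", @tagName(object.get("hello").?));
}
```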

Meanwhile, I was playing around with an unpack function… Here’s the beginning of a rough sketch:

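const std = @import("std");

// Returns the value stored under `query` as T. Only integers and floats are handled so far.
pub fn unpack(comptime T: type, query: []const u8, object: anytype) T {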

    const I = @typeInfo(T);

    switch(I) {
        .Int => { 
            return object.get(query).?.integer;
        },
        .Float => {
            return object.get(query).?.float;
        },
        else => {
            @compileError("More types to come...");
        }
    }
}

pub fn main() !void {
    
    const allocator = std.heap.page_allocator;

    const my_json =
      \\{
      \\   "hello": "okay",
      \\   "okay": 74363456,
      \\   "noway": true,
      \\   "yesway": 2.33e+77,
      \\   "maybe": [3, 4, 5],
      \\   "newone": {"name":"hello"},
      \\   "empty": null
      \\}
    ;

    const parsed = try std.json.parseFromSlice(std.json.Value, allocator, my_json, .{});

    defer parsed.deinit();

    const object = parsed.value.object;

    const value = unpack(i64, "okay", &object);

    std.debug.print("\n{}\n", .{ value });
    
}

Currently, it will fail at compile time if the return types are not the same and at runtime if you’re accessing one of the non-active members in the union. More logic can be put in to handle that, but I’ll leave that up to you :slight_smile:

Yeah. I was looking into accessing the union type as well for the json value. However, I am still not sure this is the solution I am looking for, since it seems like I still need to have a comptime-known type. For example if I want to run the query “yesway” to get back 2.33e+77, I will have to call the unpack function specifying f64 as the type. And if I want to run “hello” to get back a string, I will have to pass []const u8 into the unpack function at compile time. The idea is to be able to accept json data from anywhere. So I need to be able to alter only the query itself without having to change anything else in the code, since the final version will not have the actual json string provided beforehand, so there is no way to know what type needs unpacking before runtime. If I could run @typeInfo on the json Value returned by object.get(query).? then I think I would have something, but then it’s giving me these errors:

fmt.zig:662:22: error: values of type '[]const builtin.Type.UnionField' must be comptime-known, but index value is runtime-known
                for (value, 0..) |elem, i| {
                     ^~~~~
builtin.zig:378:15: note: struct requires comptime because of this field
        type: type,
              ^~~~
builtin.zig:378:15: note: types are not available at runtime
        type: type,
              ^~~~
builtin.zig:379:20: note: struct requires comptime because of this field
        alignment: comptime_int,

Also for example if I had a truly random json data, I could at least call object.keys() to get all the keys from the top level, but when I go to run the unpack as key[0] it will fail, while key[1] will succeed in this case. So at least it’s possible to access the root keys from random json data. So this way we can see what queries are available, but still need a way to access the values without knowing the type of the value from the key-value pair. I can use object.values() to get the array of json.Value, but this will still give me the comptime info required error when trying to run @typeInfo on json Values, since this is all being calculated at runtime… :face_with_spiral_eyes:
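One way around the comptime error, sketched under assumptions: @typeInfo describes the union type as a whole, while which field is active in a particular value is ordinary runtime data that std.meta.activeTag or @tagName can read. The helper names describe and countOfKind are hypothetical, and the sketch assumes std.json.ObjectMap is the string-keyed array hash map it has been in recent releases.

```zig
const std = @import("std");

/// Prints every top-level key together with the kind of value it holds.
fn describe(object: std.json.ObjectMap) void {
    var it = object.iterator();
    while (it.next()) |entry| {
        std.debug.print("{s}: {s}\n", .{ entry.key_ptr.*, @tagName(entry.value_ptr.*) });
    }
}

/// Counts the top-level values whose active field is `kind`.
fn countOfKind(object: std.json.ObjectMap, kind: std.meta.Tag(std.json.Value)) usize {
    var n: usize = 0;
    var it = object.iterator();
    while (it.next()) |entry| {
        if (std.meta.activeTag(entry.value_ptr.*) == kind) n += 1;
    }
    return n;
}

test "walk unknown JSON without knowing its types up front" {
    const parsed = try std.json.parseFromSlice(
        std.json.Value,
        std.testing.allocator,
        "{\"hello\": \"okay\", \"okay\": 74363456, \"noway\": true}",
        .{},
    );
    defer parsed.deinit();

    describe(parsed.value.object);
    try std.testing.expectEqual(@as(usize, 1), countOfKind(parsed.value.object, .integer));
}
```

Nothing here depends on the JSON text being known at compile time; only the set of possible kinds is fixed, and that set is the same for every JSON document.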

I know you mentioned that more logic can be put in to handle this, so it’s possible that I’m still missing something here… Either way I will have to think about it a few days since I’ve been learning Zig at the same time I am writing this, so I’m a little bit overwhelmed at the moment. Thank you so much again for your help! :slightly_smiling_face:

Yup, no problem. I’m going to alter the title of this thread to include the word JSON.

I’m very familiar with the problem you are trying to solve here and yes, it’s quite challenging. The critical point here is to make sure we know what we’re asking for.

In any strongly-typed language, you’re going to come across this conundrum in some shape or form. In your instance, here’s the issue:

  1. I have a type T.
  2. I create a value t of type T.
  3. Now I write t to a file.
  4. Later, I read t from the file.
  5. What type was t?

Type Erasure. In essence, a type is how we assign an interpretation to state… and now we’ve lost that.

To be clear, languages like Python don’t avoid this problem either (bearing in mind that Python is written in C). If I read an int from a JSON and try to use it like a string, my program will crash (or it runs but doesn’t do what I think it does). The only real difference here is that in Python, you have the “appearance” of a well-defined program, whereas here… you don’t even get that far.

So what options do you have?

Most erased types will not allow you to work directly on them. How can we? We’ve lost the interpretation of what that thing even is… so what set of instructions should my computer apply to it? In my “unpack” example, I happen to know that the value is in fact an int, so I just cast those bits to an int and be on my way. I’ve just injected type information back in.

In Python, that’s analogous to just parsing the JSON and attempting to use whatever I got back as an int. If it works, we’ll call it successful. In our example, that’s akin to doing the following:

const object = parsed.value.object;

const x: i64 = object.get(query).?.integer; // just go ahead and use it.

Now, you might say that’s crazy - just go for it? Recall the fact that we’ve been ignoring the optional value with the “?” operator in each example we’ve done so far. Basically, we’re making all kinds of assumptions about the validity of our code, but we were comfortable with those assumptions… so what makes this one different? Essentially, it’s not, it’s just funny to think about.

So what else can we do? In your first case, you’re looking for a signifier… a breadcrumb trail if you will. Ultimately, the signifier is just a pre-step to getting back into type-land. Basically, what you’re looking for here is some kind of safety in what we’re doing.

I’m going to recommend that you handle this on a case-by-case basis. For instance, let’s say I have a JSON and one of the fields is my “ID”, and let’s say it’s an integer.

I could do something like the following:

// first, make some typed-state to store our interpretation in
var id: i64 = 0;

// second, let's see if our parse even worked.
const value = object.get(query);

if (value == null) {
    // hit the eject button
}

switch (value.?) {
    .integer => |i| {
         id = i;
    }, 
    else => { 
        // hit a different eject button
    }
}

You could do a lot of things from here. You can make “visitor” functions… functions that return an error if it’s not a specific type (those can be made generic)… etc…
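A generic "return an error if it's not the expected type" function could look roughly like the sketch below. The names getAs, QueryError, MissingKey and WrongType are all hypothetical, and only four target types are covered; treat it as one possible shape rather than the way to do it.

```zig
const std = @import("std");

const QueryError = error{ MissingKey, WrongType };

/// Fetches `key` from `object` as a T, returning an error instead of
/// crashing when the key is absent or holds a different kind of value.
fn getAs(comptime T: type, object: std.json.ObjectMap, key: []const u8) QueryError!T {
    const value = object.get(key) orelse return error.MissingKey;
    // T is comptime-known, so only the matching prong is compiled.
    switch (T) {
        i64 => switch (value) {
            .integer => |n| return n,
            else => return error.WrongType,
        },
        f64 => switch (value) {
            .float => |x| return x,
            else => return error.WrongType,
        },
        bool => switch (value) {
            .bool => |b| return b,
            else => return error.WrongType,
        },
        []const u8 => switch (value) {
            .string => |s| return s,
            else => return error.WrongType,
        },
        else => @compileError("getAs: unsupported type " ++ @typeName(T)),
    }
}

test "typed access with errors instead of panics" {
    const parsed = try std.json.parseFromSlice(
        std.json.Value,
        std.testing.allocator,
        "{\"okay\": 74363456, \"hello\": \"okay\"}",
        .{},
    );
    defer parsed.deinit();

    const id = try getAs(i64, parsed.value.object, "okay");
    try std.testing.expectEqual(@as(i64, 74363456), id);
    try std.testing.expectError(error.WrongType, getAs(i64, parsed.value.object, "hello"));
}
```

The caller still has to name a type, so this suits the case-by-case situation described above, where you know what a given field is supposed to be and want a clean failure when the data disagrees.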

Edited to put error messages in a code block.

Thaaaaaaank you :blush: :blush: :blush:

THIS is the solution I was looking for. So now I can finally replace my hopelessly MacGyver’ed switch statement… which awkwardly checks some random index value to guess which function to use… with a much more elegant and stable solution involving optional types. And also we have an error check right at the beginning to make sure the query is valid :+1: :+1:++

const std = @import("std");

test{
   const allocator = std.heap.page_allocator;

   const my_json =
      \\{
      \\   "hello": "okay",
      \\   "okay": 74363456,
      \\   "noway": true,
      \\   "yesway": 2.33e+77,
      \\   "maybe": [3, 4, 5],
      \\   "newone": {"name":"hello"},
      \\   "empty": null
      \\}
   ;

   const query = "newone";

   std.debug.print("\n", .{});

   const parsed = try std.json.parseFromSlice(std.json.Value, allocator, my_json, .{});
   defer parsed.deinit();

   const object = parsed.value.object;

   var t_integer: i64 = 0;
   var t_string: []const u8 = undefined;
   var t_bool: bool = undefined;
   var t_float: f64 = undefined;
   var t_null: void = undefined;

   const value = object.get(query);

   if (value == null) {
      std.debug.print("{s}\n", .{ "NULL VALUE" });
      return; // stop here: unwrapping a null with .? below would panic
   }

   switch (value.?) {
      .integer => |i| {
         t_integer = i;
         std.debug.print("{d}\n", .{ t_integer });
      },
      .string => |i| {
         t_string = i;
         std.debug.print("{s}\n", .{ t_string });
      },
      .bool => |i| {
         t_bool = i;
         std.debug.print("{}\n", .{ t_bool });
      },
      .float => |i| {
         t_float = i;
         std.debug.print("{e}\n", .{ t_float });
      },
      .null => |i| {
         t_null = i;
         std.debug.print("{}\n", .{ t_null });
      },
      .array => |i| {
         t_integer = i.items[0].integer;
         std.debug.print("{d}\n", .{ t_integer });
      },
      .object => |i| {
         t_string = i.get("name").?.string;
         std.debug.print("{s}\n", .{ t_string }); // {s}: t_string is text, not a number
      },
      else => {  }
   }
}

Some more work is still needed for dealing with arrays and nested objects. Also I still need to wrap my head around the concept of Optional types… ?.. so I can fully understand exactly why this works and how to take advantage of it in the future. But this will give me a solid foundation to build on. So my goals for the next post are to have a more concrete definition of my query syntax, as well as pre-formatted output. Also I will need to look at std.mem and std.fmt more in depth, so I can see about how to implement the concepts you mentioned earlier, splititerators with tokenize variants, lazy-evaluated string processing, etc… Hopefully I can get some of those concepts in there as well so I can give a more robust example. Thanks again for all your attention, and especially your explanation about the nature of strongly-typed languages. This is something that was sort of at the back of my head the whole time so it’s good to hear it explained clearly… See you next time! :v: :wave:
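On the optional question, a small sketch may help. object.get returns ?std.json.Value, meaning "a Value or null". Writing .? asserts that it is not null and crashes in safe builds if it is, while an if with a capture (or orelse) lets you handle the missing case yourself. The helper name integerOr is hypothetical.

```zig
const std = @import("std");

/// Returns the integer stored under `key`, or `fallback` when the key is
/// missing or holds some other kind of value.
fn integerOr(object: std.json.ObjectMap, key: []const u8, fallback: i64) i64 {
    if (object.get(key)) |value| {
        // Inside this block `value` is a plain std.json.Value, never null.
        return switch (value) {
            .integer => |n| n,
            else => fallback,
        };
    }
    // The key was not present, so there is nothing to unwrap.
    return fallback;
}

test "handle a missing key without .?" {
    const parsed = try std.json.parseFromSlice(
        std.json.Value,
        std.testing.allocator,
        "{\"okay\": 74363456}",
        .{},
    );
    defer parsed.deinit();

    try std.testing.expectEqual(@as(i64, 74363456), integerOr(parsed.value.object, "okay", 0));
    try std.testing.expectEqual(@as(i64, -1), integerOr(parsed.value.object, "missing", -1));
}
```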
