Techniques for parsing strings (and more JSON)

mscott9437 · August 23, 2023, 9:15pm

I just wanted to share this feature of Zig I came across while working on my last post… It’s not a completely new concept in programming, but it’s not something which is easily done in C.

In C, we have a simple method for parsing strings, which involves iterating over a character array:

#include <stdio.h>
 
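int main(void)
{
   const char *cmd = "hello.okay[0]";

   printf("%s\n", cmd);

   char out[13]; /* one slot per character of cmd; no terminator is stored */

   for(int i = 0; cmd[i] != '\0'; i++)
   {
      out[i] = cmd[i] + 1; /* shift each ASCII value up by one */
      printf("%c", out[i]);
   }

   return 0;
}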

This concept of character arrays is one of the defining features of C, and iterating over a character array is the basic way of parsing a string. We can modify this function to process the string for most situations, without using the standard library.

In Zig we can do the same:

const std = @import("std");

pub fn main() void {

   const cmd = "hello.okay[0]";

   std.debug.print("{s}\n", .{ cmd });

   var out: [13]u8 = undefined;
   var i: usize = 0;

   for (cmd) |c| {
      out[i] = c + 1;
      std.debug.print("{c}", .{ out[i] });
      i += 1;
   }

}

In both cases I am making a copy of the string, increasing the ASCII value of each character by 1, and printing the output.

Zig also has dedicated functions for working with strings and slices, found in std.mem. There is another option that can be useful in certain situations: the ArrayList type. I think this is an interesting use case, since it sits in the middle ground between primitive C-style arrays and the optimized functions in std.mem. With ArrayList we rely on only a small part of the standard library and still get to use a lot of Zig’s newer features:

const std = @import("std");

pub fn main() !void {

   const cmd = "hello.okay[0]";

   std.debug.print("{s}\n", .{ cmd });

   var out = std.ArrayList(u8).init(std.heap.page_allocator);
   defer out.deinit();

   var i: usize = 0;

   for (cmd) |c| {
      try out.append(c + 1);
      std.debug.print("{c}", .{ out.items[i] });
      i += 1;
   }

}

I was able to take this concept and come up with a simple solution for parsing strings for my JSON utility class:

const std = @import("std");

const T = struct {
   x: ?std.json.Value,

   pub fn init(self: std.json.Value) T {
      return T {
         .x = self
   };}

   pub fn get(self: T, query: []const u8) T {
      if (self.x.?.object.get(query)) |value| {
         return T.init(value);
      }

      else {
         std.debug.print("ERROR::{s}::", .{ "invalid query" });
         return T.init(self.x.?);
   }}

   pub fn unpackInto(self: *const T, buffer: *std.ArrayList(u8)) !void {
      switch (self.x.?) {
         .string => |i| {
            const P = struct { value: []const u8 };
            try std.json.stringify(P{ .value = i }, .{ }, buffer.writer());
         },

         .integer => |i| {
            const P = struct { value: i64 };
            try std.json.stringify(P{ .value = i }, .{ }, buffer.writer());
         },

         .bool => |i| {
            const P = struct { value: bool };
            try std.json.stringify(P{ .value = i }, .{ }, buffer.writer());
         },

         .float => |i| {
            const P = struct { value: f64 };
            try std.json.stringify(P{ .value = i }, .{ }, buffer.writer());
         },

         .null => {
            const P = struct { value: ?usize = null };
            try std.json.stringify(P{ }, .{ }, buffer.writer());
         },

         .array => |i| {
            const P = struct { value: []std.json.Value };
            try std.json.stringify(P{ .value = i.items }, .{ }, buffer.writer());
         },

         .object => {
            const i = self.x.?;
            const P = struct { value: ?std.json.Value };
            try std.json.stringify(P{ .value = i }, .{ }, buffer.writer());
         },

         else => {
            std.debug.print("ERROR::{s}::", .{ "unhandled type" });
      }}

      std.debug.print("{s}\n", .{ buffer.items });
   }

   pub fn pos(self: T, i: usize) T {
      switch (self.x.?) {
         .array => {
            if (i >= self.x.?.array.items.len) {
               std.debug.print("ERROR::{s}::", .{ "index out of bounds" });
               return T.init(self.x.?);
            }
            return T.init(self.x.?.array.items[i]);
         },

         else => {
            std.debug.print("ERROR::{s}::", .{ "not an array" });
            return T.init(self.x.?);
}}}};

pub fn main() !void {

   const my_json =
      \\{
      \\   "hello": { "name":"hello", "id": { "key": "okay" }, "hash": [null, 2.33e+77, false, -5] },
      \\   "okay": [0, 1, "maybe", { "name":"hello", "id": { "key": "okay" }, "hash": [true, 2.55e+99, null, -7] }],
      \\   "maybe": "1234567"
      \\}
   ;

   const parsed = try std.json.parseFromSlice(std.json.Value, std.heap.page_allocator, my_json, .{ });
   defer parsed.deinit();

   const json = T.init(parsed.value);

   try read(&json, "hello;");
   try read(&json, "okay;");
   try read(&json, "okay[2];");
   try read(&json, "okay[3];");
   try read(&json, "hello.id;");
   try read(&json, "hello.id.key;");
   try read(&json, "hello.hash;");
   try read(&json, "hello.hash[0];");
   try read(&json, "hello.hash[1];");

   std.debug.print("\n", .{ });

   try read(&json, "invalid;");
   try read(&json, "hello.hash[8];");
   try read(&json, "hello[2];");

}

pub fn read(json: *const T, cmd: []const u8) !void {
   var buffer = std.ArrayList(u8).init(std.heap.page_allocator);
   defer buffer.deinit();

   try buffer.ensureTotalCapacity(100);

   var pos = std.ArrayList(u8).init(std.heap.page_allocator);
   defer pos.deinit();

   var val: T = json.*;
   var i: usize = 0;
   var p: usize = 0; // last bracket seen; starts at 0 so the check below never reads an undefined value

   for (cmd) |c| {
      if (c == 59) {
         if (p == 93) {
            try val.unpackInto(&buffer);
         }

         else {
            val = val.get(pos.items);
            pos.clearRetainingCapacity();
            try val.unpackInto(&buffer);
      }}

      else if (c == 46) {
         val = val.get(pos.items);
         pos.clearRetainingCapacity();

         i += 1;
      }

      else if (c == 91) {
         val = val.get(pos.items);
         pos.clearRetainingCapacity();

         i += 1;

         for (cmd[i..]) |u| {
            if (u == 93) {
               const int = try std.fmt.parseInt(usize, pos.items, 10);

               val = val.pos(int);
               pos.clearRetainingCapacity();

               i += 1;
               p = 93;
            }

            else {
               try pos.append(u);

               i += 1;
      }}}

      else {
         try pos.append(c);

         i += 1;
      }

      buffer.clearRetainingCapacity();
   }}

This project, which I have been working on for the past couple of months, is now feature complete, but there is still room for more optimization and validation. I’m thinking of expanding it so that strings can be passed in from an external file and also processed over a network stream. There I will try to take full advantage of std.mem in particular, to see what else I can learn about Zig. I am open to any suggestions for improvement. Thanks for reading. Hope you enjoyed it!
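
As a rough idea of what the std.mem route could look like, here is a minimal sketch that splits the same query syntax with std.mem.tokenizeAny. It assumes Zig 0.11 or newer (where tokenizeAny exists under that name), and segment is a hypothetical helper made up for this illustration, not part of the utility above. Note that it drops the distinction between a key and an array index, so it is not a drop-in replacement for read.

```zig
const std = @import("std");

/// Hypothetical helper: returns the n-th path segment of a query such as
/// "hello.hash[1];", or null when the query has fewer segments.
fn segment(query: []const u8, n: usize) ?[]const u8 {
    // '.', '[', ']' and ';' all act as separators; empty pieces are skipped.
    var it = std.mem.tokenizeAny(u8, query, ".[];");
    var i: usize = 0;
    while (it.next()) |piece| : (i += 1) {
        if (i == n) return piece;
    }
    return null;
}

pub fn main() void {
    const query = "hello.hash[1];";
    var n: usize = 0;
    // Print one segment per line until the helper runs out of pieces.
    while (segment(query, n)) |piece| : (n += 1) {
        std.debug.print("{s}\n", .{piece});
    }
}
```

Running this should print hello, hash and 1 on separate lines.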

squeek502 · August 23, 2023, 10:11pm

I’m not quite sure what it is you’re trying to show here.

As a tangent, page_allocator should probably not be used even for examples unless you have a specific reason to use it. It only allocates in multiples of the page size (usually 4096) regardless of the size you give it. If you ask for 1 byte, it’ll allocate 4096.

For example,

   var out = std.ArrayList(u8).init(std.heap.page_allocator);
   defer out.deinit();

this will allocate 4096 bytes (the page size) on the first append call. This is not really what you want.

Hard coding allocators (in your read function) is also something you want to avoid. Take the allocator as a parameter instead:

pub fn read(allocator: std.mem.Allocator, json: *const T, cmd: []const u8) !void {

This will allow callers to decide the allocation strategy they want to use, and will make it very simple to use std.testing.allocator during tests to get leak checking/double free checking/etc for free.
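
A minimal sketch of that pattern, assuming Zig 0.11 or newer: shifted is a hypothetical stand-in for read that only does the "add 1 to each byte" step, so the example stays small. The caller picks the allocator (a FixedBufferAllocator in main, std.testing.allocator in the test).

```zig
const std = @import("std");

/// Hypothetical example function: returns a newly allocated copy of `cmd`
/// with every byte increased by one (wrapping on overflow).
/// The caller owns the result and frees it with the same allocator.
fn shifted(allocator: std.mem.Allocator, cmd: []const u8) ![]u8 {
    const out = try allocator.alloc(u8, cmd.len);
    for (cmd, out) |c, *slot| {
        slot.* = c +% 1;
    }
    return out;
}

pub fn main() !void {
    // The caller decides the strategy: here, a small stack buffer.
    var storage: [64]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&storage);
    const allocator = fba.allocator();

    const out = try shifted(allocator, "hello.okay[0]");
    defer allocator.free(out);
    std.debug.print("{s}\n", .{out});
}

test "shifted frees cleanly under std.testing.allocator" {
    // std.testing.allocator fails the test if anything is leaked.
    const out = try shifted(std.testing.allocator, "abc");
    defer std.testing.allocator.free(out);
    try std.testing.expectEqualStrings("bcd", out);
}
```

Running the file with zig test exercises the leak check; removing the defer line in the test should make it fail.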

mscott9437 · August 23, 2023, 10:23pm

I should probably add, I’m just trying to show how I parsed a string. It has basically the same syntax as JavaScript. It’s just a minimal example. I don’t know much about the allocators other than what I understood from reading the docs, so that’s why I mentioned that there was probably room for improvement in that area.

Edit: Actually, for the allocator I was thinking the page allocator was the most basic one outside of the testing allocator. So that’s why I was using it here.

Another thought I had for this was that ArrayList should be relatively simple to implement on an OS where std is not already supported. So this might be a good place to start looking more closely at the source code.

AndrewCodeDev · August 24, 2023, 5:23am

When do you think we’ll see the project that you’ve been building this for?

mscott9437 · August 24, 2023, 6:29am

Hopefully it will be ready by the time I make the next post. At the least I could probably write an introduction for the project and talk about what was motivating me. I am still pretty slow at writing actual code, so it’s hard to say how long it will take. But I will go ahead and mention now that my original idea was to copy some basic ideas of NodeJS. If you look here under the section “Putting it all Together”, you will find the example I was trying to implement in Zig. Other than that, I was really intrigued by the possibilities of WASI, especially after I read about how they were able to use it to optimize the Zig compiler. Assuming WASI will eventually support HTTP, you might have something similar to NodeJS but running as a compiled WASI container, as opposed to defining the server logic in a loose script file. And that WASI binary would be able to read external scripts, similar to how it was done with PHP but in a modern and much faster way. That’s basically what I was originally going for, anyway. I just want to be careful I don’t get ahead of myself, since a scripting language is a huge project compared to a command line utility. So that’s the main reason I have been hesitant to call this anything official. There is also the issue of me getting sidetracked if I find some other idea I want to work on, which is likely to happen either way considering everything you can do with Zig.

Durobot · August 24, 2023, 11:32am

A small improvement - instead of

   var i: usize = 0;

   for (cmd) |c| {
      out[i] = c + 1;
      std.debug.print("{c}", .{ out[i] });
      i += 1;
   }

You can do

   for (cmd, 0..) |c, i| {
      out[i] = c + 1;
      std.debug.print("{c}", .{ out[i] });
   }

At least if you’re using a fairly recent version of Zig - this for loop feature appeared in March, if I’m not mistaken.
It’s not much but it does save you a couple of lines, plus you can’t miss i += 1; if you don’t have to write it.

gonzo · August 24, 2023, 1:48pm

It also keeps i out of the outer scope – what’s not to love?

mscott9437 · August 24, 2023, 2:04pm

Thanks. Another nice thing I have noticed about Zig is how it encourages you to keep things in scope. The feedback from the compiler was also really helpful in guiding me along.

mscott9437 · August 24, 2023, 4:18pm

I think what I will go for is like a command line TCP request runner. Similar to cURL but geared towards making requests to REST APIs. With authentication, etc.

JPL · August 24, 2023, 6:41pm

Hello, I think that at some point you will have to consider how to manage dispatchers and memory allocation, thanks for your examples.

Validark · August 30, 2023, 8:30pm

It looks like you are learning a lot of Zig for the first time and I’m excited for you!

A few tips.

1. zig fmt is your friend. It’s relatively easy to make it run every time you save a file in your IDE. It makes it easier for you and other people to read your code.
2. You don’t have to write c == 59 when you actually mean c == ';'. This isn’t JavaScript.
3. Learn about switch statements. They are a nicer way to express these cases where you want to go to a different branch depending on a single variable, in this case c.
4. The .? operator is an assertion, not a check. You only use it when you know it is impossible for a value to be null. It is NOT like the ?. operator in JavaScript! If you know the value is never null, put that in the type.
5. For loops can iterate over multiple things of the same length at the same time.

for (cmd, &out) |c, *slot| {
      slot.* = c + 1;
      std.debug.print("{c}", .{ slot.* });
}

6. Take advantage of comptime. Instead of:

var out: [13]u8 = undefined;

You can write:

var out: [cmd.len]u8 = undefined;

Now, if you update cmd, the length of out will automatically be updated by the compiler.
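
The character-literal and switch tips above could be sketched like this. Kind and classify are hypothetical names chosen for the illustration; the sketch only classifies each byte of a query and does not replace the read function from the first post.

```zig
const std = @import("std");

/// Hypothetical classification of the delimiters used in the query syntax.
const Kind = enum { end, field, open, close, other };

fn classify(c: u8) Kind {
    // Character literals are plain integers, so ';' and 59 are the same value.
    return switch (c) {
        ';' => .end,
        '.' => .field,
        '[' => .open,
        ']' => .close,
        else => .other,
    };
}

pub fn main() void {
    for ("okay[2];") |c| {
        std.debug.print("{c} -> {s}\n", .{ c, @tagName(classify(c)) });
    }
}
```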

A bit of general advice: Zig is a lot more powerful than JavaScript and you can write extremely high performance code. Zig code should handle all the edge cases gracefully too. Always consider whether you should be using +, +|, +%, or try std.math.add (I submitted a proposal to turn that last one into +!). Don’t just think, “Well, it PROBABLY won’t overflow. YOLO”. Always think about the range your variables might have. If you have assumptions and implicit contracts in your code, put assertions (std.debug.assert) in to make them explicit. The compiler can also use this information to produce better optimizations.
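
To make the difference between those operators concrete, here is a small sketch applied to the c + 1 step from the earlier examples. The bump* function names are made up for this illustration; the operators and std.math.add are the ones named above.

```zig
const std = @import("std");

/// Wrapping add: 255 +% 1 wraps around to 0.
fn bumpWrapping(c: u8) u8 {
    return c +% 1;
}

/// Saturating add: 255 +| 1 stays at 255.
fn bumpSaturating(c: u8) u8 {
    return c +| 1;
}

/// Checked add: overflow is reported as error.Overflow.
fn bumpChecked(c: u8) error{Overflow}!u8 {
    return std.math.add(u8, c, 1);
}

pub fn main() void {
    const c: u8 = 255;
    std.debug.print("wrapping: {d}\n", .{bumpWrapping(c)});
    std.debug.print("saturating: {d}\n", .{bumpSaturating(c)});
    if (bumpChecked(c)) |v| {
        std.debug.print("checked: {d}\n", .{v});
    } else |err| {
        std.debug.print("checked: {s}\n", .{@errorName(err)});
    }
}
```

A plain c + 1 on a runtime value of 255 is safety-checked illegal behavior: it panics in Debug and ReleaseSafe builds and is undefined in ReleaseFast and ReleaseSmall, which is why picking one of the three explicit forms matters.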

mscott9437 · August 31, 2023, 3:07pm

Thanks for your response. I will go ahead and address your points, since some of these were brought up previously, and it should be worth it to take another look.

1. I do have an unconventional style for formatting my code. In particular I do indentation with 3 spaces. This is easy to set up in the SciTE editor (Notepad++ is based on this as well), where you can have the tab key insert 3 spaces, and have it use spaces instead of tabs. This gives me the convenience of the tab key along with the universality of using spaces as opposed to literal tabs for indentation. You can also highlight multiple lines and increase/decrease the indentation using TAB/Shift-TAB. I also have an issue with closing brackets spanning multiple lines, so I just compress them all onto a single line. You can trace the indentation of the bracket group up the page to the initial opening bracket to see where the grouping starts. For me this makes it easier to scan down the page to see how everything is laid out. Having said all of this, I do understand that people not familiar with this coding style can be thrown off, so I will go ahead and use zig fmt for my future posts.
2. I had to go back and check again after reading this point, because I could’ve sworn I initially tried using the actual character for comparison, which didn’t work. What I realized was that I was trying to do the comparison with double quotes as opposed to single quotes, which results in a comparison with a string literal as opposed to a comptime int, which is not allowed. So thankfully I was able to go back to using the literal character as opposed to having to look everything up in an ASCII table.
3. We did go into switch statements previously, when dealing with the type inference “unpack” function. It should be there in the last code block on this thread. For my string parsing function I used the nested if statements, because I would only be switching over 3 values (the 4th value comparison for the closing bracket would be contained within the operation for the opening bracket). So I wanted to go with the more primitive option, even though it’s more verbose. When I go back to refine this I will most likely convert it to a switch statement for clarity.
4. I think I implemented something like this (based on a suggestion) as a sort of error handler, in case the value of the query was null.

      if (self.x.?.object.get(query)) |value| {
         return T.init(value); //current value
      }
      else {
         std.debug.print("ERROR::{s}::", .{ "invalid query" });
         return T.init(self.x.?);
      }

This returns an optional type, so I am leaving off the .? at the end of get(query)… in case the value is null. I’m not sure if this is exactly what you are referring to, but it’s in the same area. I am not familiar with what an assertion is exactly, so some clarification here might be helpful.

5. This was also mentioned earlier, but it’s good to see how it can be used with pointers as well. I assume this would be faster.
6. Another very fine suggestion.

For the record, I’m not really trying to approach Zig from the perspective of JavaScript. I actually started learning Zig by trying to implement some existing C libraries into a basic Zig program. I am only focusing on one particular aspect of JavaScript (JavaScript objects/JSON), because it was so central to how the web technology developed over the past 10 years or so, with REST APIs and whatnot. I wanted to see how a systems programming language with builtin cross-platform HTTP capabilities can handle something which has typically been done with JavaScript until now. Personally I think JavaScript is great for the browser, but I always found it awkward on the server end, compared to how it was done previously with PHP/FastCGI. However we were forced to use JavaScript on the backend for a long time, since it addressed a lot of performance and security issues associated with PHP. Of course Golang/Python/Rust are also good alternatives, but I always found NodeJS to be a more obvious way to handle cloud APIs, whereas Go/Py/Rs were traditionally more systems-oriented (as is Zig), so that’s what I focused my research on initially.

Anyway, thanks again for mentioning all of that. It’s not easy to put these ideas into words, so it’s good to have an opportunity for dialog. I did finally come up with a name for my project within the last week or so, as well as a basic description of what it does. I will save most of that for the initial reveal, hopefully within the next couple weeks. What I will say is that it’s going to be a command-line utility for interacting with cloud APIs, similar to cURL but with a smaller scope, dealing specifically with handling JSON responses from different endpoints, as well as some basic authentication. Right now it’s mainly helping me to learn Zig, but I also hope others will find it useful. And I’m also trying to use it as part of the development for some larger projects which I hope to take on in the future. Thanks again!

JPL · August 31, 2023, 6:01pm

I don’t know whether this works with your editor.
I’m not a Microsoft aficionado, but I recognize that at times it gets things right.
VSCodium (a free, community-maintained build of VS Code) or VS Code itself is a very interesting product for writing programs, and very flexible; you could easily tell it that your tab is 3 spaces.
For example, with the “indent-rainbow” plugin you get very visible indentation, plus lots of other things, if you want to change IDE.

Come on, I’ll give you two more:
“Task manager” lets you have menus like Microsoft’s or any IDE’s, but very simply.
“Run Terminal Command” lets you run different batch files or programs.


mscott9437 · September 1, 2023, 1:07am

VSCodium is pretty good, but it’s more of an IDE. None of my projects are complex enough to take advantage of all the extra features, so I usually just use basic editors and batch scripts for compiling.

Validark · September 1, 2023, 5:40am

I was talking about how you have self.x.? everywhere. The .? asserts non null. That means in debug mode it does: if (self.x == null) @panic("You said it was supposed to be non-null!");. In ReleaseFast, it assumes it’s non-null, and if it is null, bad things might happen but you don’t know what that might be. An assertion is when you tell the compiler you know some information, a check is when you figure out that information by actually testing it.

AndrewCodeDev · September 1, 2023, 5:55am

It also gets tied into undefined behaviour: Why `?` and `orelse unreachable` behave inconsistently under ReleaseFast/ReleaseSmall

Either way, @Validark is right. In the general case (and in the case of parsing strings) checking an optional should be your default. So any time you have a something.? then consider making that an if (something) |thing| - you can also use orelse… check out the link to see a few examples of that.
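
A minimal sketch of the three forms discussed here: a check with if, a check with orelse, and the .? assertion. firstDot is a hypothetical helper built on std.mem.indexOfScalar, used only so there is an optional to work with.

```zig
const std = @import("std");

/// Hypothetical helper: position of the first '.' in a query, if any.
fn firstDot(query: []const u8) ?usize {
    return std.mem.indexOfScalar(u8, query, '.');
}

pub fn main() void {
    // Check: the null case is handled explicitly.
    if (firstDot("hello")) |idx| {
        std.debug.print("dot at {d}\n", .{idx});
    } else {
        std.debug.print("no dot\n", .{});
    }

    // Check with a fallback value.
    const idx = firstDot("hello") orelse 0;
    std.debug.print("fallback index {d}\n", .{idx});

    // Assertion: acceptable only because this literal visibly contains a '.'.
    const known = firstDot("hello.id").?;
    std.debug.print("known index {d}\n", .{known});
}
```

If the last call were made on a string without a dot, Debug and ReleaseSafe builds would panic, and ReleaseFast would have undefined behaviour, which is the distinction described above.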

gonzo · September 1, 2023, 6:51am

My evolution in zig went through these stages, and I ended up landing in what my current (and I think idiomatic) use is: most functions could return an error (so their return type is !blah), and therefore you always call them with a try.
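
A small sketch of that style, with made-up names (LookupError, lookup) and a plain list of keys standing in for the JSON object: the function reports a failed lookup as an error instead of printing a message and returning a placeholder, and the caller decides whether to propagate it with try or handle it with catch.

```zig
const std = @import("std");

/// Hypothetical error set for this illustration.
const LookupError = error{KeyNotFound};

/// Returns the position of `query` in `keys`, or error.KeyNotFound.
fn lookup(keys: []const []const u8, query: []const u8) LookupError!usize {
    for (keys, 0..) |key, i| {
        if (std.mem.eql(u8, key, query)) return i;
    }
    return error.KeyNotFound;
}

pub fn main() !void {
    const keys = [_][]const u8{ "hello", "okay", "maybe" };

    // try: on error, return it from main immediately.
    const i = try lookup(&keys, "okay");
    std.debug.print("okay is at {d}\n", .{i});

    // catch: handle the error locally instead of propagating it.
    const j = lookup(&keys, "invalid") catch |err| {
        std.debug.print("lookup failed: {s}\n", .{@errorName(err)});
        return;
    };
    std.debug.print("invalid is at {d}\n", .{j});
}
```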

mscott9437 · September 1, 2023, 3:30pm

Okay, nice. That does clarify the meaning of all those .?s. Much appreciated!