How I wrote an AST parser/validator in Zig

mscott9437 · July 20, 2024, 3:32am

So I got asked about one of my projects recently, and I was trying to describe what I was doing with it. The only thing was, I had not touched it in awhile, so I was not quite ready to defend it. However after thinking about it, I remembered what it was that I had implemented specifically, which I think qualifies this function to be a valid AST parser, in the sense that you can program in any variety of syntax to get validated, and modify it without too much effort. So here’s the code

fn scan(length: *usize, bytes: *std.ArrayList(u8), map: *std.ArrayList(Area), chars: *std.ArrayList([]const u8)) !void {
   var pos: usize = length.*;
   while (pos < bytes.items.len) : (pos += 1) {
      switch (bytes.items[pos]) {
         '\n' => {
            try map.append(.empty);
            try chars.append(bytes.items[pos..pos + 1]);
         },
         '$' => {
            try map.append(.variable);
            try chars.append(bytes.items[pos..pos + 1]);
            pos += 1;
            var start = pos;
            while (pos < bytes.items.len) : (pos += 1) {
               switch (bytes.items[pos]) {
                  '\n' => {
                     try map.append(.end);
                     try chars.append(bytes.items[pos..pos + 1]);
                  },
                  '=' => {
                     try map.append(.query);
                     try chars.append(bytes.items[start..pos]);
                     try map.append(.equals);
                     try chars.append(bytes.items[pos..pos + 1]);
                     pos += 1;
                     start = pos;
                     while (pos < bytes.items.len) : (pos += 1) {
                        switch (bytes.items[pos]) {
                           '\n' => {
                              try map.append(.value);
                              try chars.append(bytes.items[start..pos]);
                              try map.append(.end);
                              try chars.append(bytes.items[pos..pos + 1]);
                           },
                           else => { },
                        }
                     }
                  },
                  else => { },
               }
            }
         },
         else => {
            try map.append(.invalid);
            try chars.append(bytes.items[pos..pos + 1]);
            pos += 1;
            while (pos < bytes.items.len) : (pos += 1) {
               switch (bytes.items[pos]) {
                  '\n' => {
                     try map.append(.quit);
                     try chars.append(bytes.items[pos..pos + 1]);
                  },
                  else => {
                     try map.append(.dump);
                     try chars.append(bytes.items[pos..pos + 1]);
                  },
               }
            }
         },
      }
   }
}

This function is capable of validating the syntax $var=2 or $var=word, while also reconstructing the line to replace the multi-letter areas with a single token, so $var=word will become something like $0=1, which will later be parsed a second time through a for loop, with the instruction set stored as an enum. This second step is like the execution phase, where the variables will be stored in a hashmap in realtime and be valid for the life of the loop.

So, getting back to why I think this is a decent AST parser. Well, I spent a lot of time on this. I was specifically trying to find out the most simple way to validate any given syntax, so I was was focused on going byte by byte to gather as much information along the way and make sort of a map which can be referenced later. It basically came down to the choice between a for loop and a while loop. Finally what I realized, which helped me complete this idea, was that while loops in Zig can be easily nested, but for loops cannot be. This is due to how for loops store their parameters in the definition of the loop, as well as the position identifier. In a for loop the position for any given iteration will be constant, and is not directly modifiable. So if you try to nest a for loop inside a for loop, it becomes increasingly complex. However this is not the case with while loops. In a while loop you store the position in an external variable, and you can increase it or decrease it as you need to. This position will also be stored in the same variable at any level of the loop, and is easy to pass around. This makes while loops a lot more flexible and easy to adapt to different situations. So if you can consider this function as something that can be easily nested as many levels as you need to, it’s easy from here to visualize the branching aspects of a syntax tree. In this particular function I am going about 3 levels deep, in order to deal with the branching aspects related to the $ and = tokens. When you get to those characters, you perform certain checks, store your state in the enum, then go into a nested while loop to continue. After that you go as far as you need to, and it’s easy to break out of those loops any time into any other level, as you need to. And like I mentioned before, later on during the execution phase, I do use a for loop, since the state is already known at any given index, there is no reason to go back and forth. You can just make a straight shot through the loop, comparing each position with the map, where I’m using a second ArrayList which stores enums, with index positions correlating with the compressed line of code, which itself is an ArrayList of u8s.

   fn exec(self: State, writer: anytype, store: *std.StringHashMap(Value)) !void {
      var query: usize = undefined;
      for (self.map, 0..) |a, i| {
         switch (a) {
            .query => {
               query = i;
               try writer.print("{s}", .{ self.chars[i] });
            },
            .end => {
               try writer.print("{s}", .{ self.chars[i] });
            },
            .variable => {
               try writer.print("{s}", .{ self.chars[i] });
            },
            .invalid => {
               try writer.print("{s}", .{ self.chars[i] });
               //try writer.print("INVALID::{s}", .{ self.chars[i] });
            },
            .dump => {
               try writer.print("{s}", .{ self.chars[i] });
            },
            .quit => {
               try writer.print("{s}", .{ self.chars[i] });
               //return;
            },
            .empty => {
               try writer.print("{s}", .{ self.chars[i] });
            },
            .equals => {
               try writer.print("{s}", .{ self.chars[i] });
            },
            .value => {
               try store.put(self.chars[query], Value{ .string = self.chars[i] });
               try writer.print("{s}", .{ self.chars[i] });
            },
         }
      }
   }

I have some ideas on how to improve this, but at the time I wrote it I was satisfied with it, because I had enough understanding of it that I knew I could come back to it later to continue working on it. I also think it’s a pretty fast AST parser, and it might be interesting to see a faster way to do it, to see what else we can learn.

AndrewCodeDev · July 20, 2024, 4:08pm

Going only from this function (because I haven’t seen how we get to this point), it looks like most of the branches are doing the same thing:

try writer.print("{s}", .{ self.chars[i] });

My first instinct would be to compress those branches down because it looks you need to do something unique under the .query and .value branches. If possible, you could also try to print everything in one call. Just some initial thoughts.

mscott9437 · July 20, 2024, 4:34pm

What you are quoting is actually part of the for loop, which is the execution phase. There are no branches there. If you scroll up back to the first block of code i pasted (the function is named “scan”), that is the while loop with the branches i was referring to. The while loop is executed first in the main code in the “scan” function. After that the “exec” function re-parses the same data essentially, except using a for loop which should be faster. The difference is in the exec function, i have an ArrayList of enums to specify the state at each index, and also all occurences of words are replaced with slices, which all point to the original byte array, which is stored in RAM. So when the for loop gets to a string it can just treat it as basically a pointer with a length, which refers back to the original bytes, so in this way it acts as a kind of string lookup table. Note this only works because i used initCapacity() to initialize the byte array, which guarantees the slices will never be invalidated for the life of the program. So its like a protected memory area. The only tradeoff is you have to reserve some memory in advance, which is fine for now because i am still in the very early stages here, and it makes the code a lot less complex.

mnemnion · July 20, 2024, 4:50pm

Switch statements are a branch.

As far as the CPU is concerned, the end of a loop is also a branch, but that fact is less conventional in discussing programs. A switch, however, is definitely a conditional branch.

@AndrewCodeDev’s suggestion is that you might compress those branches like this:

  fn exec(self: State, writer: anytype, store: *std.StringHashMap(Value)) !void {
      var query: usize = undefined;
      for (self.map, 0..) |a, i| {
         switch (a) {
            .query => {
               query = i;
               try writer.print("{s}", .{ self.chars[i] });
            },
            .end, .variable, .dump, .empty, .equals => {
               try writer.print("{s}", .{ self.chars[i] });
            },
            .invalid => {
               try writer.print("{s}", .{ self.chars[i] });
               //try writer.print("INVALID::{s}", .{ self.chars[i] });
            },
            .quit => {
               try writer.print("{s}", .{ self.chars[i] });
               //return;
            },
            .value => {
               try store.put(self.chars[query], Value{ .string = self.chars[i] });
               try writer.print("{s}", .{ self.chars[i] });
            },
         }
      }
   }

I left out the ones with commented-out code, since you clearly haven’t decided if those prongs of the switch will be identical to the others or not.

AndrewCodeDev · July 20, 2024, 5:02pm

Exactly - my next thought is to move the print call out of the switch prongs and only enumerate the values that actually have special instructions and skip the rest in an else prong. I’d probably put one print statement underneath the switch statement itself.

Edit - I see that you are printing slices. You might be able to still condense/combine those.

mscott9437 · July 20, 2024, 5:13pm

Got it. So there are also branches in the for loop because of the switch statement. Well for this post i was mainly referring to the while loop. And all the print statements in the for loop are only there for testing purposes, just to make sure the bytes are getting read as expected by reprinting each character by default at each index. Because yeah i didnt totally decide what i want to do there yet. But anyway this is definitely helpful to get me started. Well If i combine them now i might have to separate them again, but it should be useful to help me stay organized. Also there is an added perspective there which is a huge bonus

mnemnion · July 20, 2024, 5:57pm

I’m wary of else prongs in switch statements, because they undermine exhaustive switching, which is a wonderful feature.

I can only justify an else prong if the answer to the question “is this the correct behavior for, not only every other enum I have now, but every other enum I might ever add” is yes. Sometimes that’s true, but often it turns out I just want to do a bit less typing.

There’s a proposal for enum ranges, which would help with the ‘too much typing’ thing.

I came by this opinion the hard way, btw: through bugs caused by the behavior of a new enum belonging on a branch which wasn’t the default.

One additional problem with else branches (it’s kind of the same problem) is that when a switch doesn’t have one, adding a new enum value to an enum produces a compile error for every switch which uses that enum. So one can then jump to each of those lines and add the new enum where it belongs. With an else statement, the enum goes into the else bucket automatically, so again, it’s best to be entirely sure that this is where it will belong. That assurance is hard to come by.

AndrewCodeDev · July 20, 2024, 6:11pm

Right - else makes sense if there is genuinely default behavior and you are enumerating the exceptions. By in large, I would also recommend trying to fully enumerate the branches.

mscott9437 · July 21, 2024, 7:58am

Alright, so going back to what we were discussing earlier, about bytecode interpreters. I think the for loop im using in the exec function will be a good start for this. Specifically for Java, I could basically just copy over all the labels representing each instruction in the Java instruction set into my enum definition, and also into the switch statement. From there i would need to understand what all of those instructions are doing, and implement functions corresponding to all of those, which also match the Java spec which defines what those instructions do. This will take awhile, since at this point i have almost no idea what those intructions are doing.

After that I could go back to the while loop, and begin to convert this to handle the Java syntax. And this would be the foundation for the compiler, which would go into a separate program. Well theres no telling how long that will take. I think i could begin to grasp the instruction set without too much effort, but its still not clear how to get from Java code to bytecode. But anyway still a decent start i think. Just taking small steps here.