So I got asked about one of my projects recently, and I was trying to describe what I was doing with it. The only thing was, I had not touched it in awhile, so I was not quite ready to defend it. However after thinking about it, I remembered what it was that I had implemented specifically, which I think qualifies this function to be a valid AST parser, in the sense that you can program in any variety of syntax to get validated, and modify it without too much effort. So here’s the code
fn scan(length: *usize, bytes: *std.ArrayList(u8), map: *std.ArrayList(Area), chars: *std.ArrayList([]const u8)) !void {
var pos: usize = length.*;
while (pos < bytes.items.len) : (pos += 1) {
switch (bytes.items[pos]) {
'\n' => {
try map.append(.empty);
try chars.append(bytes.items[pos..pos + 1]);
},
'$' => {
try map.append(.variable);
try chars.append(bytes.items[pos..pos + 1]);
pos += 1;
var start = pos;
while (pos < bytes.items.len) : (pos += 1) {
switch (bytes.items[pos]) {
'\n' => {
try map.append(.end);
try chars.append(bytes.items[pos..pos + 1]);
},
'=' => {
try map.append(.query);
try chars.append(bytes.items[start..pos]);
try map.append(.equals);
try chars.append(bytes.items[pos..pos + 1]);
pos += 1;
start = pos;
while (pos < bytes.items.len) : (pos += 1) {
switch (bytes.items[pos]) {
'\n' => {
try map.append(.value);
try chars.append(bytes.items[start..pos]);
try map.append(.end);
try chars.append(bytes.items[pos..pos + 1]);
},
else => { },
}
}
},
else => { },
}
}
},
else => {
try map.append(.invalid);
try chars.append(bytes.items[pos..pos + 1]);
pos += 1;
while (pos < bytes.items.len) : (pos += 1) {
switch (bytes.items[pos]) {
'\n' => {
try map.append(.quit);
try chars.append(bytes.items[pos..pos + 1]);
},
else => {
try map.append(.dump);
try chars.append(bytes.items[pos..pos + 1]);
},
}
}
},
}
}
}
This function is capable of validating the syntax $var=2 or $var=word, while also reconstructing the line to replace the multi-letter areas with a single token, so $var=word will become something like $0=1, which will later be parsed a second time through a for loop, with the instruction set stored as an enum. This second step is like the execution phase, where the variables will be stored in a hashmap in realtime and be valid for the life of the loop.
So, getting back to why I think this is a decent AST parser. Well, I spent a lot of time on this. I was specifically trying to find out the most simple way to validate any given syntax, so I was was focused on going byte by byte to gather as much information along the way and make sort of a map which can be referenced later. It basically came down to the choice between a for loop and a while loop. Finally what I realized, which helped me complete this idea, was that while loops in Zig can be easily nested, but for loops cannot be. This is due to how for loops store their parameters in the definition of the loop, as well as the position identifier. In a for loop the position for any given iteration will be constant, and is not directly modifiable. So if you try to nest a for loop inside a for loop, it becomes increasingly complex. However this is not the case with while loops. In a while loop you store the position in an external variable, and you can increase it or decrease it as you need to. This position will also be stored in the same variable at any level of the loop, and is easy to pass around. This makes while loops a lot more flexible and easy to adapt to different situations. So if you can consider this function as something that can be easily nested as many levels as you need to, it’s easy from here to visualize the branching aspects of a syntax tree. In this particular function I am going about 3 levels deep, in order to deal with the branching aspects related to the $ and = tokens. When you get to those characters, you perform certain checks, store your state in the enum, then go into a nested while loop to continue. After that you go as far as you need to, and it’s easy to break out of those loops any time into any other level, as you need to. And like I mentioned before, later on during the execution phase, I do use a for loop, since the state is already known at any given index, there is no reason to go back and forth. You can just make a straight shot through the loop, comparing each position with the map, where I’m using a second ArrayList which stores enums, with index positions correlating with the compressed line of code, which itself is an ArrayList of u8s.
fn exec(self: State, writer: anytype, store: *std.StringHashMap(Value)) !void {
var query: usize = undefined;
for (self.map, 0..) |a, i| {
switch (a) {
.query => {
query = i;
try writer.print("{s}", .{ self.chars[i] });
},
.end => {
try writer.print("{s}", .{ self.chars[i] });
},
.variable => {
try writer.print("{s}", .{ self.chars[i] });
},
.invalid => {
try writer.print("{s}", .{ self.chars[i] });
//try writer.print("INVALID::{s}", .{ self.chars[i] });
},
.dump => {
try writer.print("{s}", .{ self.chars[i] });
},
.quit => {
try writer.print("{s}", .{ self.chars[i] });
//return;
},
.empty => {
try writer.print("{s}", .{ self.chars[i] });
},
.equals => {
try writer.print("{s}", .{ self.chars[i] });
},
.value => {
try store.put(self.chars[query], Value{ .string = self.chars[i] });
try writer.print("{s}", .{ self.chars[i] });
},
}
}
}
I have some ideas on how to improve this, but at the time I wrote it I was satisfied with it, because I had enough understanding of it that I knew I could come back to it later to continue working on it. I also think it’s a pretty fast AST parser, and it might be interesting to see a faster way to do it, to see what else we can learn.