Learning Zig by creating a parser

eikooc · July 4, 2024, 4:54am

Hello fellow Ziguanas. I am entirely new to Zig and wanted to ask for some code critique. My program compiles and runs, but this is the first time I code in a manual memory-managed language and wanted to ask how you would maybe structure things differently. I am following the book Crafting Interpreters by Robert Nystrom to create an interpreter for his made up language Lox.

Here’s the code so far:

Things I am wondering:

Am I reading and doing “strings” correctly?
Am I using allocators correctly?
Is there a better “pattern matching” way of doing long if branches?
Could I use structs in a better way?

andrewrk · July 4, 2024, 5:43am

Hi @eikooc and welcome.

For starters, check out switch expressions.

prokoprandacek · July 4, 2024, 6:58am

Hi @eikooc

Looking at the scanner.zig file:

the report function never returns error so the return type should be void instead of !void
std.debug.print is used for only for printf debugging. Consider using std.log
you don’t need the Self = @This() line. See this post: Don't `Self` Simple Structs! - Zig NEWS
it is a convention to take the allocator as a function argument instead of grabbing a global one
maybe you could use functions from std.ascii instead of writing your own isAlpha, isAlphanumeric, …

It might be also possible to use a switch expression:

try addToken(
    self,
    switch (c) {
        '(' => tokens.TokenType.LEFT_PAREN,
        ')' => tokens.TokenType.RIGHT_PAREN,
        '{' => tokens.TokenType.LEFT_BRACE,
        '}' => tokens.TokenType.RIGHT_BRACE,
        ...
    },
    null);

But i’m not sure if it will fit the later complicated cases well.

n0s4 · July 4, 2024, 11:06am

Doing crafting interpreters is still on my todo list .

One tip: since files/namespaces are implicitly structs in Zig, you could define Token, Scanner and Lox at the top level of their respective file for convenience:

//! mv lox.zig -> Lox.zig

const Self = @This()

const std = @import("std") // ...

had_error: bool = false,

pub fn init() !void { // ...

//! main.zig

const Lox = @import("Lox.zig")

pub fn main() !void {
    try Lox.init();
}

You can do this for Token and Scanner also.

Also, judging by your Token definition, it seems like your doing “jlox” - the AST-walking interpreter? There is also a bytecode interpreter written in C in part II which might be easier to translate to zig, but it’s up to you.

ktz_alias · July 4, 2024, 12:15pm

You can initialize GeneralPurposeAllocator on main routine to pass Lox.init().

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer {
        std.debug.print("leak? {}\n", .gpa.deinit()});
    }
    try Lox.init(gpa.allocator());
}

Other methods also receives this allocator.

If you’ve use a allocator incorrectly, the gpa will report memory leak.

eikooc · July 4, 2024, 12:40pm

Hey @andrewrk thanks for taking the time to look through the code. I have updated it to use switch statements. There were two places where I updated it and ran into some difficulties in both places.

The first was in the scanTokens function:

Is it possible to describe a union in the matching part of the switch instead of doing each char manually? I’ve seen something like inline .A .B => |aorb| {} but that looks to only work for enums. Here is the relevant part:

...
' ' => {},
'\r' => {},
'\t' => {},
...

I had a isAlpha function for an if-branch that checked multiple conditions, is it possible to have ors or something like it on a switch case?

(Commit for the first part)

The second conversion to a switch was quite funky. I found some resources pointing to converting strings to enums to do that as a switch. Before I used std.mem.eql(u8, "and", str) in the if-branches. What I changed it to is, an enum with all the keywords like and and the switch statement uses std.meta.stringToEnum to converts it so it can be matched against. Is this a good way of approaching it or is there a better way? (Commit).

dude_the_builder · July 4, 2024, 12:51pm

Some observations / recommendations:

In the scanner, I don’t think you need to store the allocator since the ArrayList already stores it. So your init could just be:

pub fn init(allocator: mem.Allocator, source: []const u8) Scanner {
    return .{
            .token_list = std.ArrayList(Token).init(allocator),
            .source = source,
        };
    }

There are multiple calles where you pass self as the parameter. Although this is perfectly fine and allowed in Zig, you can also turn it into a method call form:

isAtEnd(self) 
self.isAtEnd()

Speaking of isAtEnd, I think subtracting 1 from the source len is a bug, since it would leave the last byte out.

fn isAtEnd(self: *Scanner) bool {
    return self.current >= (self.source.len - 1);
}

You can use enum literals to shorten your code:

// This 
'(' => try addToken(self, tokens.TokenType.LEFT_PAREN, null),
// can be this.
'(' => try self.addToken(.LEFT_PAREN, null),

You can comma separate multiple cases in a switch prong:

'a'...'z', 'A'...'Z', '_' => try self.identifier(),

You could use a StaticStringMap to handle the keywords. This is how I’m doing it in my take on Lox (I’m working through the book too.):

const keywords = std.StaticStringMap(Token.Tag).initComptime(.{
    .{ "and", .kw_and },
    .{ "class", .kw_class },
    .{ "else", .kw_else },
    .{ "false", .kw_false },
    .{ "for", .kw_for },
    .{ "fun", .kw_fun },
    .{ "if", .kw_if },
    .{ "nil", .kw_nil },
    .{ "or", .kw_or },
    .{ "print", .kw_print },
    .{ "return", .kw_return },
    .{ "super", .kw_super },
    .{ "this", .kw_this },
    .{ "true", .kw_true },
    .{ "var", .kw_var },
    .{ "while", .kw_while },
});

fn ident(self: *Scanner) !void {
    while (isIdent(self.peek())) _ = self.advance();
    const identifier = self.src[self.start..self.current];
    const tag = if (keywords.get(identifier)) |tag| tag else .ident;
    try self.addToken(tag, .nil);
}

When dealing with ArrayList, it’s recommended to pass in a pointer to a list and modify it in the function versus creating the list in the function and returning it.

fn addStuffToList(list: *ArrayList(Token)) !void {
    // modify list...
}

In my case, I just collected the tokens in Scanner.tokens (an ArrayList) and then directly geve the list’s items to the parser:

 // collect all tokens in scanner.tokens
try scanner.scan();
 // pass the slice to parser
try parser.parse(scanner.tokens.items);

eikooc · July 5, 2024, 4:19am

Hey @prokoprandacek and thank you for taking the time to look at the code and write some helpful notes .

It think I’ve addressed all your points in my latest commits. Thanks again for the hints!

It was a bit difficult to pass the allocators around because there were memory leaks that had to be fixed at the same time that I found from @ktz_alias coment, which was super helpful as well.

eikooc · July 5, 2024, 4:25am

Haha, I’m actually just working through it from the start. It didn’t occur to me, that it might be easier with the clox interpreter because I haven’t gotten to that chapter. Although trying to map the Java code to Zig code in this way makes it a bit more difficult which might help me understand things better (maybe, I hope).

Thanks for your comment, I’ve done it for the lox file like you suggested. That was a cool thing to know, but do you know of any way to do it when you pass in a pointer to the struct like I do in the scanner.zig file? Since it becomes implicit, the name is gone. Am I unable to “unnest” the struct in this case?

eikooc · July 5, 2024, 4:37am

Hey @dude_the_builder, awesome that you are also working through the book. You have some really good changes. I’ve been trying to figure out how to do multiple cases in a switch prong, so that’s great!

Your point regarding the ArrayList shouldn’t be created in the function, but should be passed. Does that hold for a function which job is to initialize a struct also? Is it because it makes it easier in regards to freeing the memory after use?

And with the StaticStringMap do you get any benefits or is it more idiomatic to do compared to the thing I’ve done already? Or does it give other improvements? Here’s the relevant code:

const Keywords = enum {
    @"and",
    class,
    @"else",
    false,
    @"for",
    fun,
    @"if",
    nil,
    @"or",
    print,
    @"return",
    super,
    this,
    true,
    @"var",
    @"while",
};

fn keywords(str: []const u8) tokens.TokenType {
    const enummed = std.meta.stringToEnum(Keywords, str);
    if (enummed == null) {
        return tokens.TokenType.IDENTIFIER;
    }
    const tokenType: tokens.TokenType = switch (enummed.?) {
        .@"and" => tokens.TokenType.AND,
        .class => tokens.TokenType.CLASS,
        .@"else" => tokens.TokenType.ELSE,
        .false => tokens.TokenType.FALSE,
        .@"for" => tokens.TokenType.FOR,
        .fun => tokens.TokenType.FUN,
        .@"if" => tokens.TokenType.IF,
        .nil => tokens.TokenType.NIL,
        .@"or" => tokens.TokenType.OR,
        .print => tokens.TokenType.PRINT,
        .@"return" => tokens.TokenType.RETURN,
        .super => tokens.TokenType.SUPER,
        .this => tokens.TokenType.THIS,
        .true => tokens.TokenType.TRUE,
        .@"var" => tokens.TokenType.VAR,
        .@"while" => tokens.TokenType.WHILE,
    };
    return tokenType;
}

squeek502 · July 5, 2024, 11:34am

stringToEnum actually uses StaticStringMap internally (when there are <= 100 fields in the enum):

github.com

ziglang/zig/blob/b3afba8a70af984423fc24e5b834df966693b50a/lib/std/meta.zig#L17-L43


      
          pub fn stringToEnum(comptime T: type, str: []const u8) ?T {
              // Using StaticStringMap here is more performant, but it will start to take too
              // long to compile if the enum is large enough, due to the current limits of comptime
              // performance when doing things like constructing lookup maps at comptime.
              // TODO The '100' here is arbitrary and should be increased when possible:
              // - https://github.com/ziglang/zig/issues/4055
              // - https://github.com/ziglang/zig/issues/3863
              if (@typeInfo(T).Enum.fields.len <= 100) {
                  const kvs = comptime build_kvs: {
                      const EnumKV = struct { []const u8, T };
                      var kvs_array: [@typeInfo(T).Enum.fields.len]EnumKV = undefined;
                      for (@typeInfo(T).Enum.fields, 0..) |enumField, i| {
                          kvs_array[i] = .{ enumField.name, @field(T, enumField.name) };
                      }
                      break :build_kvs kvs_array[0..];
                  };
                  const map = std.StaticStringMap(T).initComptime(kvs);
                  return map.get(str);
              } else {
                  inline for (@typeInfo(T).Enum.fields) |enumField| {

This file has been truncated. show original

dude_the_builder · July 5, 2024, 11:52am

It depends. I’d say that if you’re dealing with a data structure that holds pointers (or slices,) it’s probably better to pass in a pointer to the struct versus creating it in the function. Creating it in the function and returning it can open the door to having Pointers to Temporary Memory . Also, as you mention, receiving a pointer to the list simplifies the lifetime issues because that would be the list owner’s responsibility.

Your code is basically doing the same job as the StaticStringMap. In this case I’d say it would simplify your code because all the data needed for the lookup is right there in the map and the actual lookup is just a one-liner.

dude_the_builder · July 5, 2024, 11:59am

If you decide to use your version, you can simplify it a bit more like this:

fn keywords(str: []const u8) tokens.TokenType {
    const enummed = std.meta.stringToEnum(Keywords, str) orelse return .IDENTIFIER;

    return switch (enummed) {
        .@"and" => .AND,
        .class => .CLASS,
        .@"else" => .ELSE,
        .false => .FALSE,
        .@"for" => .FOR,
        .fun => .FUN,
        .@"if" => .IF,
        .nil => .NIL,
        .@"or" => .OR,
        .print => .PRINT,
        .@"return" => .RETURN,
        .super => .SUPER,
        .this => .THIS,
        .true => .TRUE,
        .@"var" => .VAR,
        .@"while" => .WHILE,
    };
}

n0s4 · July 6, 2024, 5:40am

I think you’re looking for @This(). You can create an alias at the top of the file like so:

const Scanner = @This();