A VM bytecode interpreter. Parsing

jean-michelet · November 17, 2024, 2:02pm

Hi!

I’m a beginner with Zig, learning the language by creating a bytecode VM.

Thanks for your feedback under this post: A VM bytecode interpreter for EduScript, an educational language

I started the development of the parser, and I wonder if I use pointers correctly, if I structure my nodes in an optimal way… Particularly the construction of binary expressions.

I’d be very happy to have some feedback before continuing

You can read the code and tests here.

Nodes:

pub const Stmt = union(enum) {
    expr: Expr,
    empty: Empty,
};

pub const Expr = union(enum) {
    binary: Binary,
    literal: Literal,
};

pub const Binary = struct {
    left: *Expr,
    operator: Token.Type,
    right: *Expr,
};

pub const Literal = union(enum) {
    number: f64,
    string: []const u8,
    boolean: bool,
    nullVal: void,
    undefinedVal: void,
};

pub const Empty = struct {};

pub const Program = struct {
    statements: std.ArrayList(Stmt),
};

pub fn createBinary(allocator: std.mem.Allocator, op: Token.Type, left: Expr, right: Expr) !Binary {
    const binary = Binary{
        .left = try allocator.create(Expr),
        .right = try allocator.create(Expr),
        .operator = op,
    };

    binary.left.* = left;
    binary.right.* = right;

    return binary;
}

Use-case:

fn parseExpr(self: *@This(), allocator: std.mem.Allocator, min_precedence: usize) !?Node.Expr {
    var left = try self.parsePrimaryExpr(allocator);

    // Try to create binary expression node (e.g. 1 + 2, true == false)
    while (self.current < self.tokens.len) {
        var op = self.peek();

        // if precedence == 0, not a binary expression operator
        const precedence = self.getPrecedence(op.token_type);
        if (precedence < min_precedence or precedence == 0) break;

        op = try self.consume(op.token_type);

        const right = try self.parseExpr(allocator, precedence + 1) orelse unreachable;

        left = Node.Expr{ .binary = try Node.createBinary(allocator, op.token_type, left.?, right) };
    }

    return left;
}

IntegratedQuantum · November 17, 2024, 3:02pm

Looks fine to me, apart from maybe the fact that you are never freeing the pointers you allocate, but that is probably fine since you probably only parse it once and then keeping it in memory for the entire lifetime of the program.

This part is more difficult to answer, since in reality there often isn’t the one optimal way.

One thing that sticks out to me here, is that you are doing two allocations, where really only one is needed. You could just combine the two expression pointers into one array pointer *[2]Expr. This not only reduces the number of allocation, but also ensures that the results are next to each other in memory.

Finally I also want to mention a few minor unrelated things:

empty: Empty,
We already have the builtin type void for that. Unless you need a distinct type or are planning to fill it later, there is no reason to reinvent the ~~wheel~~ void here.

pub fn createBinary(allocator: std.mem.Allocator, op: Token.Type, left: Expr, right: Expr) !Binary
This is not C, Zig has member functions which are more suited for this.
The idiomatic approach is to put an init and a deinit function into the struct itself, like this:

pub const Binary = struct {
    left: *Expr,
    operator: Token.Type,
    right: *Expr,

    pub fn init(allocator: std.mem.Allocator, op: Token.Type, left: Expr, right: Expr) !Binary {
        ...
    }

    pub fn deinit(self: Binary, allocator: std.mem.Allocator) void {
        allocator.destroy(self.left);
        allocator.destroy(self.right);
    }
};

mnemnion · November 17, 2024, 5:24pm

This might be the way to go, but I want to point out that struct {} is also a zero-size type like void, and comes with some advantages: you can add declarations to it to associate behaviors with Empty, and do comptime reflection on it as a distinct type. For a simple example, it could have a format member function which prints it in a distinct way.

I’m almost rephrasing what you just said, just wanted to provide more detail about when it might be better to use a no-field struct rather than choosing void.

jean-michelet · November 17, 2024, 5:57pm

Hi, thanks a lot for your feedback!

I had in mind to use only one ArenaAllocator allocator per compilation phase, for the moment I allocate one to generate the token array, which I free when the AST is generated from it.

Am I right in thinking that ArenaAllocator tracks all allocations and is able to free all of them via arena.deinit();? E.g. In my case token structs, plus strings if token literal with string value.

Interesting, I am not used to think at the hardware level. The only thing that bothers me (in a minor way) is that it’s a bit less readable, maybe I can leverage left, right getters?

Do this looks good to you?

pub const Binary = struct {
    operands: *[2]Expr,
    operator: Token.Type,

    pub fn init(allocator: std.mem.Allocator, op: Token.Type, left_: Expr, right_: Expr) !@This() {
        const binary = @This(){
            .operands = try allocator.create([2]Expr),
            .operator = op,
        };

        binary.operands[0] = left_;
        binary.operands[1] = right_;

        return binary;
    }

    pub fn left(self: *@This()) *Expr {
        return self.operands[0];
    }

    pub fn right(self: *@This()) *Expr {
        return self.operands[1];
    }
};

If I don’t plan to use this component outside of the context of ArenaAllocator, should I declare a deinit method?

IntegratedQuantum · November 17, 2024, 6:30pm

Yes, that is correct.

Yeah, that’s a good solution.
Another option would be to just make a struct {left: Expr, right: Expr} and use that instead of the array. Then you could access it with binary.operands.left

No, if you never use it, then you don’t need it. However I would recommend renaming the variable to arena: Allocator in the function signatures, to make clear that you expect to get an ArenaAllocator.