Memory management for a parser

Hello everyone!
I’ve started learning Zig about a week ago, rewriting my c++ interpreter to Zig as an exercise, and I wanted to ask for some feedback on how to do memory management for the parser.
(project link: — that’s all we know, continue to be the. this link shows up weird for me in Discourse… it’s a link to a codeberg repo)

In the old (c++) code, it looked like this
( The knowledge. You are not distinguishable at all. His whole appearance was startling. With just. ):

class Parser {
    Arena &arena;
    Dynarr<ErrMsg> &errarr;
//...
}

So when I started rewriting to Zig I kept the code pretty similar:

pub const Parser = struct {
    alloc: Allocator,
    scratch_alloc: Allocator,
    errs: std.ArrayList(ast.Error),
    err_alloc: Allocator,
//...
}

but something I don’t really “like” is that Allocator doesn’t have the same guarantees as ArenaAllocator. With an arena I know I can make a bunch of allocations, and then free them all at once, but making the field of the parser a generic allocator doesn’t make that clear. The only solution I can think of is to make alloc, scratch_alloc, etc. all *ArenaAllocator instead of Allocator but I’m not sure if this is the “proper” way of doing this (it would be a bit annoying to have to do something like alloc.allocator().create() each time but I can live with that).

Another idea that I just had while writing this was to have an *ArrayList for every possible kind of node (binary: *std.ArrayList, unary: *std.ArrayList, etc.), but I think that’s not a great approach just because I’ll just have to keep adding more fields to Parser as I (inevitably) realise I need more kinds of nodes. Maybe there’s something clever I can do here but I haven’t given it enough thought yet.

Also another idea I had would be to make the nodes of my AST a tagged union and just have a single node: *std.ArrayList (right now I have a base member in every node and I use @fieldParentPtr to cast a *Node into the specific node). I never tried this approach because I was worried it would be pretty wasteful of memory, if for some reason one of variants took up way more space than the rest, but maybe this is not a legitimate concern.

I would be curious to hear what others think/recommend :slight_smile:. Also, if there are other parts of my project that look sketchy, please point them out!

Thanks for reading!

The approach required will depend very much on the lifetime of the data in the life cycle of the program. Probably worthwhile to separate what data may still be needed at runtime from data that is only needed to parse and generate whatever your runtime representation is.

When I know that I’ll be using the parsed data immediately after parsing and it doesn’t live any longer, I tend to bind the lifetime of the data to the lifetime of the parser:

// Parser.zig
const Parser = @This();

/// For scratch memory that will certainly have to be created
arena: ArenaAllocator,
/// For writing errors
err_writer: Io.Writer.Allocating,

// other members...

/// Make sure to assign this to a `var` instance
pub fn init(gpa: Allocator) Parser {
    return .{ .arena = .init(gpa), .err_writer = .init(gpa) };
}

pub fn parse(
    self: *Parser,
    source: []const u8,
    // other args...
) !Result {
    // parse it!
}

pub fn deinit(self: *Parser) void {
    // free everything
    self.arena.deinit();
    self.err_writer.deinit();
}

const std = @import("std");
const Io = std.Io;
const Allocator = std.mem.Allocator;
const ArenaAllocator = std.heap.ArenaAllocator;

But yeah, as already said, it largely depends on the lifetime of the data. If you reuse the parser, you might wanna add a function to move the data was previously parsed and reset the arena/error-writer. Or if the data lives simply outlives the parser, you could write the final data to an argument passed to the parse() function (however that looks). In general, I found a good groove with arena + error writer when I have to use a lot of “disposable” memory.

2 Likes

Thanks for the responses, I should clarify that I’m not reusing the parser, and everything allocated during parsing (the AST nodes) is used by the later stages (resolving identifiers and emitting bytecode) but once we start interpreting the program we no longer need the AST.

Replying to MiahDrao97:

Or if the data lives simply outlives the parser, you could write the final data to an argument passed to the parse() function (however that looks). In general, I found a good groove with arena + error writer when I have to use a lot of “disposable” memory.

I think I poorly worded my original post, the allocations made during parsing aren’t “disposable” (we still need that stuff after parsing), so if I were to write the final data to an argument passed to the parse function I would be copying everything allocated by the parser into this parameter, which is why I thought to maybe pass a *ArenaAllocator as a an argument to the parse() function.

1 Like

Okay, yeah, I see. I perused your code a bit, and I’ll get to the lifetimes bit, but first I noticed there’s a lot of []*T as opposed to just []T. In general, try to avoid slices of pointers since that’s more allocations and more address indirection. More to juggle and less efficient.

For example, ast.Module should probably be:

pub const Module = struct {
    imports: []Import,
    fn_decls: []FnDecl,
    class_decls: []ClassDecl,
};

The only reason I can think of reaching for slices of pointers is if you’re dealing with recursive structures (or *anyopaque). But even in the recursion case, I think you can pull off a flat slice anyway and just reference other elements within the slice. I didn’t see any recursive structures in my look-through… But regardless, see if you can replace the []*T stuff with []T.

Anyway, if I were tasked with creating a parser, here’s the shell I’d start with:

// the following code is completely untested, so... might have mistakes :P

const Parser = struct {
    arena: std.heap.ArenaAllocator,
    // other members?
    // Even with 1 member, it might be worth keeping around just to organize the parser methods
    // Up to you, of course

    fn init(gpa: Allocator) Parser {
        return .{ .arena = .init(gpa) };
    }

    fn deinit(self: *Parser) void {
        self.arena.deinit();
    }

    // helper methods that may create disposable memory
};

const ModuleParts = struct {
    imports: ArrayList(ast.Import),
    fn_decls: ArrayList(ast.FnDecl),
    class_decls: ArrayList(ast.ClassDecl),

    const empty: ModuleParts = .{
        .imports = .empty,
        .fn_decls = .empty,
        .class_decls = .empty,
    };

    fn toModule(self: *ModuleParts, gpa: Allocator) Allocator.Error!ast.Module {
        const imports_slice: []ast.Import = try self.imports.toOwnedSlice(gpa);
        errdefer gpa.free(imports_slice);

        const fn_decls_slice: []ast.FnDecl = try self.fn_decls.toOwnedSlice(gpa);
        errdefer gpa.free(fn_decls_slice);

        return .{
             .imports = imports_slice,
             .fn_decls = fn_decls_slice,
             .class_decls = try self.class_decls.toOwnedSlice(gpa),
        };
    }

    fn deinit(self: *ModuleParts, gpa: Allocator) void {
         self.imports.deinit(gpa);
         self.fn_decls.deinit(gpa);
         self.class_decls.deinit(gpa);
    }
};

/// `gpa` will back the long-lived allocations returned in the following slice
pub fn parse(
    gpa: Allocator,
    source: [:0]const u8,
    errors: *ArrayList(ast.Error),
) !ast.Module {
    // parser is ephemeral
    var parser: Parser = .init(gpa);
    defer parser.deinit();

    // the module parts are also ephemeral in the sense that we're returning owned slices in the end
    var module_parts: ModuleParts = .empty;
    defer module_parts.deinit(gpa); // <-- again, we're returning owned slices, so this is fine

    // all the parsing (pass around a reference to the module parts)...

    return try mode_parts.toModule(gpa);
}

So here’s the rationale behind this shell. There are 2 units of memory that must be long-lived: 1) The resulting ast.Module (and obviously all those internals) and 2) parse errors.

  • The parse() function is all that’s public, because I’m assuming that the parser is an implementation detail that the calling code doesn’t care about, as it’s intended to be freed by the time the function returns.
  • gpa is used for all data that must outlive the parse() function.
  • The errors list comes from outside, so that solves that lifetime problem. Just make sure to append errors with gpa and not scratch space.
  • The ModuleParts struct holds the growing lists of components that ultimately compose a module. Again, I’d really scrutinize your data structures in ast.zig and eliminate those slices of pointers.
  • The Parser struct holds the arena and maybe some helper methods so that we can easily tap into scratch space and append items to the growing module parts. Note that gpa will need to be passed everywhere long-lived memory needs to be created.

Look into data-oriented design if you want to learn more about memory efficiencies. There’s still a lot that can be done. The ultimate challenge is to refactor your memory system to just a few large slices that AST structs index into instead of the AST structs requiring their own individually-allocated slices. I would imagine that would cause a nice performance boost and definitely less memory to worry about.

Obviously, there’s more than one way to skin a cat, but I hope this helps!

1 Like