What is the state of AST processing?

tiagoantao · January 6, 2025, 10:41pm

I wonder what the state of AST processing is. For example, when I try to process

pub fn fun_name(x: u32) bool {
    return x < 1;
}

I cannot find (in the node list created by the AST parser) a reference to fun_name (the token exists, but no node references it). Also, the root node doesn’t seem to point to the function declaration itself.

If I add another trivial statement before the function declaration (say, a const), I get neither a pointer to the const name nor a clear list of statements at the root.

If folks are interested, I can add my AST code to this thread, along with the outputs that I am getting for the example above. For reference, my entry point is something like

const tree = try Ast.parse(allocator, buffer_0, Ast.Mode.zig);

My goal is to understand whether this is a skill issue or whether the AST module is still under construction.

Thanks

mlugg · January 6, 2025, 11:25pm

std.zig.Ast definitely works, because the compiler uses it!

To save memory, we don’t unnecessarily duplicate data in the AST: for instance, there isn’t any string data stored in the AST. To get at the string fun_name, you find the node for the function declaration, and get to the token for the function name. We don’t directly store that token index on the node; you need to find it implicitly (try ast.fullFnProto(&buf, fn_node), where buf is a one-element Node.Index scratch buffer; the result is an optional with a name_token field). Then, use Ast.tokenSlice to pull the relevant slice out from the parser’s original input.

Not only does this decrease memory consumption, but it can also increase performance, since performance and memory consumption are often quite heavily correlated. This is Data Oriented Design in action; for more information, see this talk from Andrew.
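
For concreteness, here is a minimal sketch of the lookup described above: find the function node, get its name token, then slice it out of the source. It assumes a Zig version close to this thread's date. The helper name firstFnName is made up for illustration, and the sketch assumes fullFnProto takes a one-element scratch buffer next to the node and returns an optional.

```zig
const std = @import("std");
const Ast = std.zig.Ast;

/// Hypothetical helper (not part of std): returns the name of the first
/// top-level function declaration in `tree`, or null if there is none.
fn firstFnName(tree: Ast) ?[]const u8 {
    // rootDecls() yields the node indices of the top-level declarations.
    for (tree.rootDecls()) |decl| {
        // Scratch space that fullFnProto may use for small prototypes.
        var buf: [1]Ast.Node.Index = undefined;
        // Null means this declaration is not a function; skip it.
        const proto = tree.fullFnProto(&buf, decl) orelse continue;
        const name_token = proto.name_token orelse continue;
        // The AST stores no strings; slice the name out of the source.
        return tree.tokenSlice(name_token);
    }
    return null;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const source: [:0]const u8 =
        \\pub fn fun_name(x: u32) bool {
        \\    return x < 1;
        \\}
    ;
    var tree = try Ast.parse(allocator, source, .zig);
    defer tree.deinit(allocator);

    std.debug.print("{s}\n", .{firstFnName(tree) orelse "<no function found>"});
}
```

The returned slice points into the original source buffer, so it is only valid while that buffer is alive.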

tiagoantao · January 6, 2025, 11:47pm

I imagine this is a skill issue, but I cannot find the identifier. Let me simplify this:

const aaaa = 1;

I end up with these nodes:

0
Node{ .tag = Tag.root, .main_token = 0, .data = Node.Data{ .lhs = 0, .rhs = 1 } }
TokenList{ .tag = Tag.keyword_const, .start = 0 }
extra_data { 1 }
token const

1
Node{ .tag = Tag.simple_var_decl, .main_token = 0, .data = Node.Data{ .lhs = 0, .rhs = 2 } }
TokenList{ .tag = Tag.keyword_const, .start = 0 }
extra_data { 1 }
token const

2
Node{ .tag = Tag.number_literal, .main_token = 3, .data = Data{ .lhs = 0, .rhs = 0 } }
TokenList{ .tag = Tag.number_literal, .start = 13 }
extra_data { 1 }
token 1

token 1 is actually aaaa so I wonder if extra_data is the place?

Then there is the issue of more than one statement at the top level, but for now what might I be overlooking here?

Thanks

ianprime0509 · January 7, 2025, 3:46am

The Node.Tag enum values have doc comments briefly explaining the meaning (if any) of the main_token, lhs, and rhs of nodes with that tag (see zig/lib/std/zig/Ast.zig at fc28a71d9f019dc4ae65a363cad6330791ea928c in ziglang/zig on GitHub). If you’re looking to understand the meaning or structure of a specific node type, I recommend starting there.

If that doesn’t help for a node you’re looking at, the next thing you can try is to look for a function whose name starts with full, such as fullFnProto which @mlugg mentioned above. Those functions will take a node in the AST and derive other information about the node: you can reference their logic to see how they’re getting those details from the AST structure.

Another good reference, if the full functions don’t give you what you need, is to look at how other components of Zig are handling the AST. For example, if you want to know more about how simple_var_decl works, you can look at:

how it’s rendered when formatting the AST (zig fmt): zig/lib/std/zig/render.zig at fc28a71d9f019dc4ae65a363cad6330791ea928c · ziglang/zig · GitHub
how Autodoc (e.g. zig std, -femit-docs) is handling it: zig/lib/docs/wasm/Decl.zig at fc28a71d9f019dc4ae65a363cad6330791ea928c · ziglang/zig · GitHub

In this case, the name of the declaration is inferred from the structure. As the documentation in Node.Tag states, the main_token of a simple_var_decl is the var or const token. Since the declaration name is always the token right after var or const, you can add 1 to the main_token index to get the identifier. In your dump, the simple_var_decl has main_token 0, so the name is token 1, which is aaaa.
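
Here is a rough sketch of that "token after var/const" rule, which also covers the question about several top-level statements. It is only an illustration: varDeclName is a made-up helper, and it goes through fullVarDecl, whose ast.mut_token field I am assuming holds the same var or const token as the node's main_token. The loop relies on rootDecls(), which resolves the root node's lhs..rhs range into extra_data and returns the top-level declaration nodes.

```zig
const std = @import("std");
const Ast = std.zig.Ast;

/// Hypothetical helper (not part of std): name of a `var`/`const`
/// declaration node, or null if `decl` is not a variable declaration.
fn varDeclName(tree: Ast, decl: Ast.Node.Index) ?[]const u8 {
    const var_decl = tree.fullVarDecl(decl) orelse return null;
    // mut_token is the `var`/`const` token; the identifier is the
    // token immediately after it.
    return tree.tokenSlice(var_decl.ast.mut_token + 1);
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const source: [:0]const u8 =
        \\const aaaa = 1;
        \\var bbbb: u32 = 2;
    ;
    var tree = try Ast.parse(allocator, source, .zig);
    defer tree.deinit(allocator);

    // Walk every top-level declaration and print the names of the
    // variable declarations among them.
    for (tree.rootDecls()) |decl| {
        if (varDeclName(tree, decl)) |name| {
            std.debug.print("{s}\n", .{name});
        }
    }
}
```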