How does 'zig fmt' work?

tsdtas · January 24, 2025, 10:30am

This may be way too broad of a topic, but here goes something…

From briefly skimming the source of ‘fmt.zig’ and related ‘std.zig.render’, it seems like the formatter output is generated directly from the AST, meaning there is only one way to represent a given set of tokens. This lines up well with the language goal of having a canonical form that everyone uses.

How does the ‘reverse AST generation’ process work?

castholm · January 24, 2025, 1:29pm

I’m not sure there’s a lot to explain about how zig fmt works unless you have specific questions; most of it is laid out pretty clearly in the two files you mentioned.

src/fmt.zig is called when you invoke zig fmt and uses std.zig.Ast.parse to build the AST, which is then passed to lib/std/zig/render.zig to generate source code.

To make sense of how the code rendering works, you need to be aware of how std.zig.Ast represents different language constructs, because it might not be immediately obvious (Mitchell Hashimoto’s “Zig Parser” post is a good introduction). For example, keywords or tokens like align ( ) are not directly referenced by the AST because they are no longer relevant at that point, so the code rendering functions need to implicitly find these tokens by subtracting/adding known offsets from/to the token indices that are stored in the AST.

It is also important to understand that whitespace and (non-documentation) comments are not recognized as tokens and are ignored by the tokenizer. But we still need to account for these when rendering source code because they affect formatting, which is why we always have to render each source token even if it only has one valid lexeme. E.g. if we have the tokens align ( 4 ) we can’t just hard-code the rendering using something like "align(" ++ alignment_node ++ ")" because there might be comments after each individual token that need to be preserved.
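To make the point about comments concrete, here is a minimal sketch in Python rather than Zig. It is a toy, not how render.zig is actually written: `tokenize`, `render_token` and `render_align` are hypothetical names, and the toy does not distinguish doc comments from ordinary ones. What it tries to show is that tokens only record where they start, that the renderer reaches the neighbouring `align`, `(` and `)` tokens by fixed offsets from the expression token, and that it inspects the source text between tokens to recover comments.

```python
import re

# Hypothetical toy tokenizer: identifiers, integers and parentheses only.
# Whitespace and `//` comments match but produce no token.
TOKEN_RE = re.compile(r"\s+|//[^\n]*|(?P<tok>[A-Za-z_]\w*|\d+|[()])")


def tokenize(source):
    """Return a list of (start_offset, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.group("tok") is not None:
            tokens.append((m.start(), m.group("tok")))
        pos = m.end()
    return tokens


def render_token(source, tokens, index, out):
    """Emit one token, then any comments found between it and the next token."""
    start, lexeme = tokens[index]
    out.append(lexeme)
    gap_end = tokens[index + 1][0] if index + 1 < len(tokens) else len(source)
    # The gap holds only whitespace and comments, because the tokenizer
    # skipped exactly those.
    gap = source[start + len(lexeme):gap_end]
    for line in gap.split("\n"):
        line = line.strip()
        if line.startswith("//"):
            # A line comment runs to end of line, so a newline must follow it.
            out.append(" " + line + "\n")


def render_align(source, tokens, expr_index):
    """Render `align ( expr )` given only the index of the expression token.

    The surrounding tokens are found by fixed offsets from expr_index.
    """
    out = []
    for index in (expr_index - 2, expr_index - 1, expr_index, expr_index + 1):
        render_token(source, tokens, index, out)
    return "".join(out)


source = "align ( // why 4?\n 4 )"
print(render_align(source, tokenize(source), 2))
```

A hard-coded "align(" ++ expr ++ ")" would have dropped the `// why 4?` comment here; walking the tokens one by one keeps it.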

(The rendering function also has some support for omitting or replacing specific tokens or nodes, mostly to serve the zig reduce command. This functionality is a bit half-baked and only does exactly what the compiler needs it to do, so you can mostly ignore it. IMO render is the wrong place for this kind of logic: it introduces a lot of complexity and special cases and belongs elsewhere.)

zig fmt cannot be configured. Each unique sequence of tokens and whitespace/comments has (or at least should have) exactly one canonical form. zig fmt uses the whitespace/comments between tokens, or optional tokens like trailing commas, to determine whether to render syntactic constructs horizontally on one line or vertically across multiple lines.
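The trailing-comma rule can be illustrated with a small Python sketch. Everything here is hypothetical and simplified: `has_trailing_comma`, `render_call` and `format_call` are made-up names, arguments are assumed to be single tokens, and the real renderer works on the AST and token stream rather than a flat list of lexemes.

```python
def has_trailing_comma(tokens):
    """tokens is the lexeme list of one call, ending with ')'."""
    return len(tokens) >= 2 and tokens[-1] == ")" and tokens[-2] == ","


def render_call(name, args, trailing_comma, indent="    "):
    """Render horizontally by default, vertically if a trailing comma was present."""
    if not args or not trailing_comma:
        return f"{name}({', '.join(args)})"
    lines = [f"{name}("]
    lines += [f"{indent}{arg}," for arg in args]
    lines.append(")")
    return "\n".join(lines)


def format_call(tokens):
    """Format a call such as ['foo', '(', 'a', ',', 'b', ')']."""
    name = tokens[0]
    # Everything between '(' and ')' that is not a comma is an argument.
    args = [tok for tok in tokens[2:-1] if tok != ","]
    return render_call(name, args, has_trailing_comma(tokens))


print(format_call(["foo", "(", "a", ",", "b", ")"]))       # one line
print(format_call(["foo", "(", "a", ",", "b", ",", ")"]))  # one argument per line
```

The only input that changes the layout is the optional trailing comma, so the author picks the layout by adding or removing that one token instead of through a configuration file.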

tsdtas · January 24, 2025, 1:58pm

Thank you very much for the thorough answer to a vague question.

The blog post seems like solid gold.

I’ll probably come back to this with some more specific questions when I feel like I know what’s actually going on, so I’m leaving the topic open. May just have to spend the weekend with a textbook or two to dig in properly.

Side note: I love how eager people here are to be helpful. It’s very refreshing.

kristoff · January 24, 2025, 2:30pm

In my opinion, a fast track to fully understanding how this stuff works is to write your own parser for a language and also implement rendering for the AST that you generate.

My recommended example would be a small subset of JSON where you only have maps and strings (with the ability to nest maps into other maps). For strings, don’t even implement escapes.

{ 
  "kinda": { "like": "that" }
}

Write a tokenizer, then a parser, and then give a render() function to the parser.

In my opinion this strikes the best balance: it’s not so complex that it confuses you and slows you down, yet it’s not too simplistic either, so you get to experience something realistic.
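For reference, one possible shape of that exercise is sketched below in Python. It is only an illustration under assumptions: the names (`tokenize`, `Parser`, `parse`, `render`) and the AST representation (a map is a list of key/value pairs, a string is a plain `str`) are arbitrary choices, and the renderer always lays maps out vertically so that every input has exactly one output form.

```python
def tokenize(text):
    """Split the input into '{', '}', ':', ',' and ("str", value) tokens."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in "{}:,":
            tokens.append(ch)
            i += 1
        elif ch == '"':
            end = text.index('"', i + 1)  # no escapes: the next quote ends the string
            tokens.append(("str", text[i + 1:end]))
            i = end + 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def next(self):
        if self.pos >= len(self.tokens):
            raise ValueError("unexpected end of input")
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_value(self):
        tok = self.next()
        if tok == "{":
            return self.parse_map()
        if isinstance(tok, tuple):
            return tok[1]  # a string literal
        raise ValueError(f"unexpected token {tok!r}")

    def parse_map(self):
        """Parse the entries after '{' into a list of (key, value) pairs."""
        entries = []
        if self.tokens[self.pos:self.pos + 1] == ["}"]:
            self.pos += 1
            return entries
        while True:
            key = self.next()
            if not isinstance(key, tuple):
                raise ValueError(f"expected a string key, got {key!r}")
            if self.next() != ":":
                raise ValueError("expected ':' after key")
            entries.append((key[1], self.parse_value()))
            tok = self.next()
            if tok == "}":
                return entries
            if tok != ",":
                raise ValueError(f"expected ',' or '}}', got {tok!r}")


def parse(text):
    parser = Parser(tokenize(text))
    ast = parser.parse_value()
    if parser.pos != len(parser.tokens):
        raise ValueError("trailing tokens after top-level value")
    return ast


def render(node, depth=0):
    """Turn the AST back into text, one map entry per line."""
    if isinstance(node, str):
        return f'"{node}"'
    if not node:
        return "{}"
    pad = "  " * (depth + 1)
    body = ",\n".join(f'{pad}"{key}": {render(value, depth + 1)}' for key, value in node)
    return "{\n" + body + "\n" + "  " * depth + "}"


print(render(parse('{ "kinda": { "like": "that" } }')))
```

Because whitespace never reaches the AST, differently spaced inputs with the same tokens render identically, and rendering the output again changes nothing. Extending the toy to preserve comments or to honour trailing commas is where it starts to resemble the problems zig fmt has to solve.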