Where to start reading the Zig (compiler) source code?

samuel-fiedler · June 24, 2024, 11:03am

So, now that I have (in my opinion) learned quite a lot about Zig and its concepts, I’d like to understand the Zig compiler source code, and maybe improve things too.

But where should I start? I once looked into a file which imported the InternPool. In that InternPool, I wasn’t able to find any other module imports, so maybe it’s a good idea to start there (because everything you need to know about Zig internals used there is in that file)?

dimdin · June 24, 2024, 2:07pm

A good introduction to the internals of the Zig compiler frontend is Mitchell Hashimoto’s Zig Compiler Internals posts.

See also the responses to questions about compiler internals in the Explain category.

matklad · June 24, 2024, 2:15pm

In terms of the actual “meat” of the compiler, once you get past parsing and AstGen and want to learn how the actual analysis is done, I think fn analyzeBodyInner is the entry point:

https://github.com/ziglang/zig/blob/ab4c461b76ff7b1d10e6d2010370ea0984f97efe/src/Sema.zig#L970

          /// standard `break` at comptime. This error is pushed up the stack until the target block is
          /// reached, at which point the break operand will be fetched.
          ///
          /// It is rare to call this function directly. Usually, you want one of the following wrappers:
          /// * If the body is exited via a `break_inline`, or is being evaluated at comptime,
          ///   use `Sema.analyzeInlineBody` or `Sema.resolveInlineBody`.
          /// * If the body is behind a fresh runtime condition, use `Sema.analyzeBodyRuntimeBreak`.
          /// * If the body is an entire function body, use `Sema.analyzeFnBody`.
          /// * If the body is to be generated into an AIR `block`, use `Sema.resolveBlockBody`.
          /// * Otherwise, direct usage of `Sema.analyzeBodyInner` may be necessary.
          fn analyzeBodyInner(
              sema: *Sema,
              block: *Block,
              body: []const Zir.Inst.Index,
          ) CompileError!void {
              // No tracy calls here, to avoid interfering with the tail call mechanism.
          
              try sema.inst_map.ensureSpaceForInstructions(sema.gpa, body);
          
              const mod = sema.mod;
              const map = &sema.inst_map;

But it would be good to at least read Mitchell’s posts before that, to get the big picture.

mlugg · June 24, 2024, 2:35pm

Mitchell’s posts are a great starting point, although bear in mind that once you reach the one on Sema there’s some outdated information – in particular, the explanation of Type vs Value vs TypedValue is no longer accurate.

Here’s a summary. The Zig compiler pipeline looks vaguely like this:

Parse -> AstGen -> Sema -> CodeGen

Parse and AstGen are in the standard library as std.zig.{Parse,AstGen}. Together they produce a block of instructions for each file. These instructions are called ZIR (Zig Intermediate Representation). The code is not yet type-checked at this point: that happens in semantic analysis (Sema). Most error messages and comptime magic happen in Sema. The main notable things that AstGen handles while lowering the AST to ZIR are RLS (Result Location Semantics; see the langref if unfamiliar) and certain “global” error messages (those which do not require semantic analysis, e.g. “unused variable”; these are the errors that zig ast-check can pick up on).

Sema’s job is to take those ZIR instructions and convert them to AIR – or, in the case of comptime execution, interpret them. I find that it helps to think of Sema as an interpreter which in some cases emits AIR instructions to instead do the operation at runtime. Sema is definitely the most important part of the compiler, but it can also be quite hard to understand, largely due to its size.

Loosely, the idea is that a single “body” of ZIR instructions is interpreted by the main loop, Sema.analyzeBodyInner. This function is essentially a big ol’ switch inside of a loop over the instructions. The switch cases mostly dispatch to handler functions, e.g. zirCondbr to handle the condbr instruction.
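To make the "interpreter that sometimes emits instructions" idea concrete, here is a minimal toy sketch, assuming Zig 0.11 or later. Everything in it (ToySema, Inst, Air, Ref, the three instructions) is hypothetical and made up for illustration; the real Sema, ZIR and AIR are far larger and encoded very differently. It only shows the shape: a switch inside a loop, handlers that compute comptime-known results directly, and handlers that emit a runtime instruction otherwise.

```zig
const std = @import("std");

/// Hypothetical toy "ZIR"; operands are indices of earlier instructions in the body.
const Inst = union(enum) {
    int: i64, // comptime-known integer constant
    param: u32, // value only known at runtime
    add: struct { lhs: usize, rhs: usize },
};

/// Result of analyzing one instruction.
const Ref = union(enum) {
    comptime_int: i64, // value was computed during analysis
    runtime: usize, // index of an emitted toy "AIR" instruction
};

/// Hypothetical toy "AIR": only what must happen at runtime.
const Air = union(enum) {
    arg: u32,
    add: struct { lhs: Ref, rhs: Ref },
};

const ToySema = struct {
    results: [16]Ref = undefined,
    air: [16]Air = undefined,
    air_len: usize = 0,

    /// The main loop: a switch over each instruction in the body.
    fn analyzeBody(sema: *ToySema, body: []const Inst) void {
        std.debug.assert(body.len <= sema.results.len);
        for (body, 0..) |inst, i| {
            sema.results[i] = switch (inst) {
                .int => |value| Ref{ .comptime_int = value },
                .param => |index| sema.emit(.{ .arg = index }),
                .add => |bin| sema.zirAdd(sema.results[bin.lhs], sema.results[bin.rhs]),
            };
        }
    }

    /// Handler for `add`: interpret if both operands are comptime-known, else emit.
    fn zirAdd(sema: *ToySema, lhs: Ref, rhs: Ref) Ref {
        if (lhs == .comptime_int and rhs == .comptime_int) {
            return .{ .comptime_int = lhs.comptime_int + rhs.comptime_int };
        }
        return sema.emit(.{ .add = .{ .lhs = lhs, .rhs = rhs } });
    }

    fn emit(sema: *ToySema, inst: Air) Ref {
        std.debug.assert(sema.air_len < sema.air.len);
        sema.air[sema.air_len] = inst;
        sema.air_len += 1;
        return .{ .runtime = sema.air_len - 1 };
    }
};

test "comptime operands fold, runtime operands emit" {
    var sema = ToySema{};
    const body = [_]Inst{
        .{ .int = 2 },
        .{ .int = 3 },
        .{ .add = .{ .lhs = 0, .rhs = 1 } }, // both comptime-known: folded
        .{ .param = 0 },
        .{ .add = .{ .lhs = 2, .rhs = 3 } }, // one runtime operand: emitted
    };
    sema.analyzeBody(&body);
    try std.testing.expectEqual(@as(i64, 5), sema.results[2].comptime_int);
    try std.testing.expectEqual(@as(usize, 2), sema.air_len);
}
```

Running this with zig test shows the first add never reaching the toy AIR buffer, while the second one does.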

When analyzing a runtime function, Sema emits AIR instructions which are sent to the code generation backend. The default is the LLVM backend – this lives in src/codegen/llvm.zig. We also have several WIP self-hosted backends, for instance in src/arch/x86_64/CodeGen.zig.

You note the InternPool as a fairly isolated part of the compiler. The primary role of InternPool is to store immutable comptime-known values (including types) in an efficient manner, exposing a (relatively) type-safe API for accessing them. It’s a very important part of the compiler, but can be hard to grasp, because there are some subtle memory optimization strategies in play (including Andrew’s favourite pet data structure, std.AutoArrayHashMapUnmanaged(void, void)). You don’t need a deep understanding of the InternPool implementation for a basic understanding of Sema.
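If the word "interning" itself is unfamiliar, this rough sketch shows the core idea only: equal values are stored once and referred to by a small index. It assumes Zig 0.11 or later, and ToyPool, Key and Index here are hypothetical names for illustration. It deliberately swaps the real hashing strategy for a linear search over a fixed array, so it says nothing about how the actual InternPool lays out memory.

```zig
const std = @import("std");

/// Hypothetical toy key; the real key type covers every kind of type and value.
const Key = union(enum) {
    int_type: struct { signed: bool, bits: u16 },
    int_value: i64,
};

/// Interned values are passed around as a small index, not by their contents.
const Index = enum(u32) { _ };

const ToyPool = struct {
    items: [64]Key = undefined,
    len: u32 = 0,

    /// Returns the existing index for `key`, or stores it and returns a new index.
    /// Linear search keeps the sketch short; a real pool would hash the key.
    fn get(pool: *ToyPool, key: Key) Index {
        var i: u32 = 0;
        while (i < pool.len) : (i += 1) {
            if (std.meta.eql(pool.items[i], key)) return @enumFromInt(i);
        }
        std.debug.assert(pool.len < pool.items.len);
        pool.items[pool.len] = key;
        pool.len += 1;
        return @enumFromInt(pool.len - 1);
    }

    /// Looks the contents back up from an index.
    fn indexToKey(pool: *const ToyPool, index: Index) Key {
        return pool.items[@intFromEnum(index)];
    }
};

test "equal keys intern to the same index" {
    var pool = ToyPool{};
    const a = pool.get(.{ .int_type = .{ .signed = false, .bits = 8 } });
    const b = pool.get(.{ .int_value = 42 });
    const c = pool.get(.{ .int_type = .{ .signed = false, .bits = 8 } });
    try std.testing.expect(a == c); // same contents, same index
    try std.testing.expect(a != b);
    try std.testing.expectEqual(@as(i64, 42), pool.indexToKey(b).int_value);
}
```

The payoff of this design is that comparing two interned values becomes a comparison of two integers, and every value is stored only once.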

If you have a debug build of the compiler, you can dump the ZIR for any file using zig ast-check -t foo.zig. You can dump all AIR emitted by Sema for a compilation by passing --verbose-air to the build-{exe,lib,obj} command (be warned: there’ll probably be a lot, so you’ll want to pipe it to a file). If you have any specific questions, let me know and I’ll answer them as best as I can. Happy hacking!
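A small input file makes those dumps much easier to read than a real program. Something like the following foo.zig could work as a first experiment; the file contents and the names offset and addOffset are just a made-up example, and build-obj is picked arbitrarily from the build-{exe,lib,obj} commands mentioned above. The intent is to have one comptime-known computation and one runtime operation, so you can compare how each shows up in ZIR and in AIR.

```zig
// foo.zig: a small input for inspecting compiler output.
//
// Commands from the post above (the first needs a debug build of the compiler):
//   zig ast-check -t foo.zig
//   zig build-obj foo.zig --verbose-air
// Redirect the output of the second command to a file, since it can be long.
const std = @import("std");

// Comptime-known: the multiplication is evaluated during semantic analysis.
const offset = 2 * 21;

// `export` so the function is analyzed and code-generated even though nothing calls it.
// The parameter is only known at runtime, so the addition has to happen at runtime.
export fn addOffset(x: u32) u32 {
    return x +% offset;
}

test "addOffset adds the comptime-known offset" {
    try std.testing.expectEqual(@as(u32, 43), addOffset(1));
}
```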