Nice question! I can give an overview here – I’m also intending to do a proper write-up of the system at some point, as a docs file in the Zig repo, although that’ll probably assume a bit more technical background (since it’ll be targeted towards existing compiler devs).
First, we need to understand the compiler pipeline. After parsing (which is boring), there are three important stages:
- `AstGen`
- `Sema`
- `CodeGen`
`AstGen` consumes a parsed AST and generates an instruction-based intermediate representation called ZIR (Zig Intermediate Representation). This implementation detail was actually recently promoted to the standard library, in `std.zig.AstGen` and `std.zig.Zir`. This pass operates on whole files and has no context: it doesn’t deduce any type information, and it knows nothing of the contents of any other file in the project. It also doesn’t do anything compilation-specific; for instance, it doesn’t change behavior based on your target. For that reason, the results of `AstGen` are cached globally. You might have noticed that when you update your local Zig compiler, the first time you build something you’ll see `AST Lowering...` in the progress message for a moment – that’s `AstGen` running over the whole standard library on the new compiler version (the caches aren’t compatible across versions), since we eagerly run it over every file which could possibly be referenced.
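Since it lives in the standard library now, you can actually run this stage yourself. Here’s a minimal sketch (assuming a reasonably recent compiler – these std APIs are unstable and may drift between versions):

```zig
const std = @import("std");

pub fn main() !void {
    var gpa_state = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa_state.deinit();
    const gpa = gpa_state.allocator();

    // Any Zig source works; AstGen needs nothing but the bytes of one file.
    const source: [:0]const u8 = "pub fn main() void {}";

    // Parse the file into an AST...
    var tree = try std.zig.Ast.parse(gpa, source, .zig);
    defer tree.deinit(gpa);

    // ...then lower it to ZIR. No types, no target, no other files involved.
    var zir = try std.zig.AstGen.generate(gpa, tree);
    defer zir.deinit(gpa);

    std.debug.print("generated {d} ZIR instructions\n", .{zir.instructions.len});
}
```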
After we have generated all of our ZIR, we move on to `Sema`. This is the heart of the compiler – it’s the stage that performs semantic analysis, which includes type checking, comptime code execution, most error messages, etc. `Sema` interprets the ZIR which `AstGen` emitted and turns it into AIR (Analyzed Intermediate Representation), a much simpler and lower-level IR which is sent to the code generator. `CodeGen` is actually interleaved with `Sema`: after a function is semantically analyzed, it’s immediately sent to the code generator.
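To make “comptime code execution” concrete, here’s a toy example of mine (not compiler code). `Sema` fully evaluates the `comptime` block during analysis; only the runtime call in the test causes AIR for `double` to be produced and handed to `CodeGen`:

```zig
const std = @import("std");

fn double(x: u32) u32 {
    return x * 2;
}

// Sema evaluates this entire block while analyzing the file; it never
// becomes AIR, so CodeGen never sees it.
comptime {
    std.debug.assert(double(21) == 42);
}

test "runtime call" {
    // This runtime reference is what makes Sema analyze `double`'s body
    // as a runtime function and send the resulting AIR to CodeGen.
    try std.testing.expect(double(4) == 8);
}
```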
`Sema` is where the magic happens, so we’re going to have to zoom in on some implementation details now. When writing Zig, you might think of semantic analysis as something that happens to the whole program at once, but that’s not how the compiler sees it. Instead, semantic analysis occurs in discrete steps, analyzing one thing at a time. There are two kinds of thing we can analyze:
- A runtime function body. When we know that a function may be referenced at runtime, and we know all of its `comptime` args, we will analyze the function body and send the results to `CodeGen`.
- Anything else, at comptime. The thing we analyze here is currently called a `Decl`. Here, I have to explain a somewhat unfortunate set of terms we currently have – despite the name, a `Decl` does not necessarily correspond to a source-level declaration. Every source declaration – that’s a container-level `const`, `var`, `fn`, `comptime`, `usingnamespace`, or `test` (see the snippet after this list) – has a `Decl`, but not every `Decl` is a source declaration. I’m actively working on some internal reworks to make this less confusing, but for now, I’ll explain the system as it exists today.
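For reference, here are those source declaration forms in one file – each of these gets a `Decl` (my example; `usingnamespace` still exists in the system as described here, though its future is uncertain):

```zig
const std = @import("std");

const answer = 42; // container-level const
var counter: u32 = 0; // container-level var

fn helper() void {} // fn

comptime { // container-level comptime block
    _ = answer;
}

pub usingnamespace std.math; // usingnamespace

test "also a declaration" { // test
    helper();
    counter += 1;
}
```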
The idea behind incremental compilation is that for each of these things we can analyze – a runtime function or a `Decl` – we can register “dependencies” for everything the analysis used. The current (WIP) system has the following kinds of dependency:
- `src_hash` – the value of the 128-bit hash of a range of source bytes. The compiler relies on the uniqueness of 128-bit hashes in quite a few places!
- `decl_val` – the resolved value of a single declaration, e.g. a container-level `const` (see the example after this list).
- `namespace` – the full set of names in a namespace. This basically only exists because of `@typeInfo`.
- `namespace_name` – the (non-)existence of a single name in a namespace.
- `func_ies` – the resolved IES (inferred error set) of a runtime function.
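Here’s a toy file (mine, not from the compiler) showing the two dependency kinds that only become known during analysis itself; one caveat: on newer compilers the `@typeInfo` field is spelled `.@"struct"` rather than `.Struct`:

```zig
const std = @import("std");

const ns = struct {
    pub const limit: u32 = 100;
};

// Analyzing `check` uses the resolved value of `ns.limit`, so it registers
// a `decl_val` dependency: if an update changes that value, `check` becomes
// outdated and is re-analyzed.
fn check(x: u32) bool {
    return x < ns.limit;
}

// `@typeInfo` observes the full set of names in `ns`, so analyzing this
// block registers a `namespace` dependency: adding or removing *any*
// declaration in `ns` invalidates it.
comptime {
    std.debug.assert(@typeInfo(ns).Struct.decls.len == 1);
}

test check {
    try std.testing.expect(check(5));
}
```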
On an incremental update, we can invalidate changed dependencies as we analyze more things. We can invalidate `src_hash`, `namespace`, and `namespace_name` dependencies as soon as `AstGen` completes – the other two we learn about as we perform analysis. The idea is that in the main loop of analysis, we will find a `Decl` or runtime function which is known to be outdated and which (ideally) has all up-to-date dependencies, and re-analyze it. For a `Decl`, we can then mark `decl_val` dependencies as out-of-date or up-to-date depending on whether the value changed; similarly with `func_ies` dependencies for runtime functions.
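As a concrete walk-through of that flow (my toy example, just applying the rules above): suppose an incremental update edits a single initializer.

```zig
// Before the update this read: `const limit: u32 = 100;`
const limit: u32 = 200;

// When AstGen re-runs on this file, the `src_hash` covering `limit`'s source
// bytes changes, so `limit`'s Decl is immediately known to be outdated and is
// re-analyzed first. Its resolved value changed (100 -> 200), so every
// `decl_val` dependency on `limit` is marked out-of-date, and `clamp` is
// re-analyzed (and re-sent to CodeGen) in turn.
fn clamp(x: u32) u32 {
    return @min(x, limit);
}

test clamp {
    try @import("std").testing.expectEqual(@as(u32, 200), clamp(500));
}
```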
There’s a lot more that I’ve not covered here:
- ZIR instruction tracking
- Type recreation
- Locating unreferenced `Decl`s/functions
- All of the codegen/linking stuff; that’s not really my domain
…but this is a basic overview of some of the system.
One day I’ll probably write a blog post with more detail. For now, let me know if you have any questions – I’ll still be happy to answer!