Probably the most important data structure to understand in the Zig compiler is ZIR. It’s an instruction-based structured SSA IR which is generated for every Zig file in a compilation. If you have a debug build of Zig, you can use the command zig ast-check -t file.zig to dump the ZIR for file.zig to the terminal.
The semantic analysis phase of the Zig compiler – which lives in Sema.zig in the compiler source tree – is essentially a big ol’ ZIR interpreter. For every instruction, we look at its operands, and do something along these lines:
- Type check the operands. If the types are wrong, give up and emit a compile error.
- Check if the operands are comptime-known. If they are, perform this operation at comptime, and store that as the result (in case said result is used by another instruction).
- Otherwise, emit a runtime instruction to do the operation, and store a reference to that instruction as the result of this ZIR instruction. The runtime instructions we emit here are for another IR called AIR, which is superficially similar to ZIR.
Of course, the specifics vary hugely between different ZIR instructions, but that’s a good intuition at least. To summarise, Sema is an interpreter for ZIR which stores the result of each instruction in a big hashmap for future instructions to reference if desired. The result of an instruction might be a plain comptime-known value, or it might be a runtime instruction.
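If it helps, here’s that three-step pattern as a tiny self-contained sketch. To be clear, nothing in it is real compiler code: ToyType, ToyInst, Result and ToySema are names I’ve made up for illustration, results live in a small fixed-size array rather than a hashmap, and the only real operation is an integer add. It only aims to show the shape: type check, fold if comptime-known, otherwise emit a runtime instruction.

```zig
const std = @import("std");

// Toy model only: none of these types exist in the Zig compiler.
const ToyType = enum { int, boolean };

/// Result of one toy instruction: a comptime-known value, or a reference
/// to a runtime instruction that was emitted for it.
const Result = union(enum) {
    comptime_int: i64,
    comptime_bool: bool,
    runtime: struct { index: usize, ty: ToyType },

    fn typeOf(self: Result) ToyType {
        return switch (self) {
            .comptime_int => .int,
            .comptime_bool => .boolean,
            .runtime => |r| r.ty,
        };
    }
};

const ToyInst = union(enum) {
    int: i64, // integer literal
    bool_lit: bool, // boolean literal
    runtime_int: void, // stands in for a value only known at runtime
    add: struct { lhs: usize, rhs: usize }, // operands index earlier instructions
};

const ToySema = struct {
    // One result per toy instruction (a sketch-sized array, at most 16 instructions).
    results: [16]Result = undefined,
    // Number of runtime instructions emitted so far.
    air_len: usize = 0,

    fn emit(self: *ToySema, ty: ToyType) Result {
        const index = self.air_len;
        self.air_len += 1;
        return Result{ .runtime = .{ .index = index, .ty = ty } };
    }

    fn analyzeAdd(self: *ToySema, lhs_idx: usize, rhs_idx: usize) error{AnalysisFail}!Result {
        const lhs = self.results[lhs_idx];
        const rhs = self.results[rhs_idx];
        // 1. Type check the operands.
        if (lhs.typeOf() != .int or rhs.typeOf() != .int) return error.AnalysisFail;
        // 2. Both operands comptime-known: perform the operation right now.
        if (lhs == .comptime_int and rhs == .comptime_int) {
            return Result{ .comptime_int = lhs.comptime_int + rhs.comptime_int };
        }
        // 3. Otherwise, emit a runtime instruction and refer to it.
        return self.emit(.int);
    }

    fn analyze(self: *ToySema, code: []const ToyInst) error{AnalysisFail}!void {
        for (code, 0..) |inst, i| {
            self.results[i] = switch (inst) {
                .int => |v| Result{ .comptime_int = v },
                .bool_lit => |b| Result{ .comptime_bool = b },
                .runtime_int => self.emit(.int),
                .add => |bin| try self.analyzeAdd(bin.lhs, bin.rhs),
            };
        }
    }
};

pub fn main() !void {
    // %0 = int(2); %1 = int(3); %2 = add(%0, %1); %3 = runtime_int; %4 = add(%2, %3)
    const code = [_]ToyInst{
        .{ .int = 2 },
        .{ .int = 3 },
        .{ .add = .{ .lhs = 0, .rhs = 1 } },
        .{ .runtime_int = {} },
        .{ .add = .{ .lhs = 2, .rhs = 3 } },
    };
    var sema = ToySema{};
    try sema.analyze(&code);
    std.debug.print("%2 is comptime-known: {d}; runtime instructions emitted: {d}\n", .{
        sema.results[2].comptime_int,
        sema.air_len,
    });
}
```

Here %2 gets folded because both of its operands are comptime-known, while %4 has a runtime operand and so turns into an emitted instruction.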
The logic here is basically all in Sema.zig; there’s also some important stuff in e.g. Zcu.zig and InternPool.zig, but most of the interesting stuff is in Sema. The main loop is Sema.analyzeBodyInner, although I don’t think that’s actually a very helpful place to start, since it’s mostly a big switch statement over instruction types with each case dispatching to a handler function, and the few bits of other code in there are kind of subtle things about AIR block elision / post-hoc blocks / comptime control flow (don’t worry about what any of that means for now). All of the instruction handler functions are called zirInstructionName (e.g. zirCondbr for condbr). Let me walk you through a few basic instruction handler functions.
fn zirRetAddr(
    sema: *Sema,
    block: *Block,
    extended: Zir.Inst.Extended.InstData,
) CompileError!Air.Inst.Ref {
    _ = extended;
    if (block.is_comptime) {
        return sema.mod.intRef(Type.usize, 0);
    } else {
        return block.addNoOp(.ret_addr);
    }
}
zirRetAddr implements the ret_addr instruction corresponding to the @returnAddress builtin, and it’s about as simple as instructions get. The @returnAddress builtin takes no arguments, so the ret_addr instruction takes no operands; that means we have no type checking to do. extended would contain information about the instruction (e.g. its operands), but we don’t have any extra information here, so we just ignore it.
So, all we need to do is figure out the instruction’s result. To do this, we check whether we are currently in a comptime scope, using the block.is_comptime field. (Block is a data structure stored on the stack which essentially contains state relevant to the current body of ZIR we’re interpreting.) If we are, we just want to return the value @as(usize, 0). We call a helper function which constructs this value and turns it into an Air.Inst.Ref. This type is the “result” of a ZIR instruction; it is either an AIR instruction index, or an index into a structure called the InternPool which stores and deduplicates comptime-known values. The intRef function we call here adds the value @as(usize, 0) to the InternPool, and, for convenience, turns it into an Air.Inst.Ref which we can return. In the else case – where we are running at runtime – we instead want to emit a runtime instruction whose result will be the function’s return address. AIR instructions are emitted into the Block, and there are a bunch of helper methods for doing so, all named like Block.addXyz. We use addNoOp to add an instruction to the AIR block which takes no operands (that’s what NoOp refers to), and the helper method also turns this into an Air.Inst.Ref for us.
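To make the “either a pool index or an instruction index” idea a bit more tangible, here’s a rough standalone model of it. ToyRef, ToyPool, ToyBlock and toyRetAddr are all invented for this sketch; the real Air.Inst.Ref and InternPool are laid out quite differently (and far more efficiently), so treat this purely as an illustration of the idea, assuming a pool that only holds integers.

```zig
const std = @import("std");

/// Hypothetical stand-in for Air.Inst.Ref: a result is either an index into
/// a pool of comptime-known values, or the index of a runtime instruction.
const ToyRef = union(enum) {
    interned: usize,
    inst: usize,
};

/// Hypothetical stand-in for the InternPool: each distinct value is stored once.
const ToyPool = struct {
    values: [32]u64 = undefined, // sketch-sized; at most 32 distinct values
    len: usize = 0,

    /// Returns the index of `value`, adding it only if it is not already present.
    fn intern(self: *ToyPool, value: u64) usize {
        for (self.values[0..self.len], 0..) |existing, i| {
            if (existing == value) return i;
        }
        self.values[self.len] = value;
        self.len += 1;
        return self.len - 1;
    }
};

const ToyBlock = struct {
    is_comptime: bool,
    inst_count: usize = 0,

    /// Loosely modelled on the Block.addXyz helpers: append an instruction
    /// with no operands and hand back a ref to it.
    fn addNoOp(self: *ToyBlock) ToyRef {
        const index = self.inst_count;
        self.inst_count += 1;
        return ToyRef{ .inst = index };
    }
};

/// Same shape as zirRetAddr: a constant at comptime, an instruction at runtime.
fn toyRetAddr(pool: *ToyPool, block: *ToyBlock) ToyRef {
    if (block.is_comptime) {
        return ToyRef{ .interned = pool.intern(0) };
    }
    return block.addNoOp();
}

pub fn main() void {
    var pool = ToyPool{};
    var comptime_block = ToyBlock{ .is_comptime = true };
    var runtime_block = ToyBlock{ .is_comptime = false };

    const a = toyRetAddr(&pool, &comptime_block);
    const b = toyRetAddr(&pool, &comptime_block);
    const c = toyRetAddr(&pool, &runtime_block);
    std.debug.print("a and b share a pool slot: {}\n", .{a.interned == b.interned});
    std.debug.print("c is a runtime instruction: {}\n", .{c == .inst});
}
```

The two comptime calls end up referring to the same pool entry because the value 0 is only stored once, which is the deduplication part.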
Okay, now for a slightly more complex example!
fn zirPanic(sema: *Sema, block: *Block, inst: Zir.Inst.Index) CompileError!void {
    const inst_data = sema.code.instructions.items(.data)[@intFromEnum(inst)].un_node;
    const src = block.nodeOffset(inst_data.src_node);
    const msg_inst = try sema.resolveInst(inst_data.operand);
    const coerced_msg = try sema.coerce(block, Type.slice_const_u8, msg_inst, block.builtinCallArgSrc(inst_data.src_node, 0));
    if (block.is_comptime) {
        return sema.fail(block, src, "encountered @panic at comptime", .{});
    }
    try sema.panicWithMsg(block, src, coerced_msg, .@"@panic");
}
This function implements the panic instruction corresponding to the @panic builtin. This is a little more interesting because this builtin takes an argument – the panic message.
The first 2 lines are just boilerplate to do with the in-memory representation of ZIR – we’re extracting the data associated with the instruction, and constructing a LazySrcLoc (named src) which represents the source location. If we emit a compile error for any reason, it will be associated with this source location (given by the AST node in inst_data.src_node).
Next, we call Sema.resolveInst. This is a really important function: it converts a Zir.Inst.Ref to an Air.Inst.Ref using the mapping constructed by evaluation of previous ZIR instructions. So, we pass it the ZIR ref which is our operand (this will be a previously-evaluated ZIR instruction), and we get back the corresponding AIR ref, which will either be a comptime-known value in the InternPool or a runtime AIR instruction.
Our next job is type checking: the argument to @panic has to be a []const u8. To enforce this, we try to coerce the operand to that type. Sema.coerce is a monster of a function full of many specific cases, so I won’t show it here, but what matters is that it returns another Air.Inst.Ref corresponding to the coerced value (so it definitely has type []const u8). If the value cannot be coerced – i.e. the wrong type was passed to @panic – then Sema.coerce will emit a compile error. How this works, by the way, is that the error message and source location are stored in a hashmap associated with the piece of code we’re analyzing, and then error.AnalysisFail is returned up the stack (we try most things in Sema) to terminate analysis of this declaration early.
Assuming type checking succeeded, though, we get to actually perform the panic! As before, we check whether we’re in a comptime scope using block.is_comptime. If we are, then we just want to emit a compile error – in this case, we do that with Sema.fail, which, as discussed above, will construct and store the compile error, then return error.AnalysisFail. If we wanted to include any notes on the error, the code would be slightly more complex; fail is a handy wrapper for the simple case. If we’re not in a comptime scope, then we once again want to emit a runtime instruction. In this case, another function in Sema is dealing with all of this, called panicWithMsg. We won’t go into this function, because it’s a little complex, but the short version is that it’ll end up emitting an AIR call instruction to call the panic handler.
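The “store the message, then return error.AnalysisFail up the stack” mechanism used by coerce and fail is easier to see in isolation, so here’s a minimal sketch of it. Everything in it (ToySema, checkIsBool, analyzeDecl, plain line numbers instead of LazySrcLoc, a single error slot instead of a hashmap) is an assumption made for the example, not how Sema actually stores errors.

```zig
const std = @import("std");

const CompileError = error{AnalysisFail};

/// Hypothetical, heavily simplified stand-in for Sema's error handling.
const ToySema = struct {
    // One error slot is enough for this sketch.
    err_msg: ?[]const u8 = null,
    err_line: u32 = 0,

    /// Records the error details, then hands back the error to be returned.
    fn fail(self: *ToySema, line: u32, msg: []const u8) CompileError {
        self.err_msg = msg;
        self.err_line = line;
        return error.AnalysisFail;
    }

    /// A stand-in for a type check such as coercing an operand to bool.
    fn checkIsBool(self: *ToySema, line: u32, type_name: []const u8) CompileError!void {
        if (!std.mem.eql(u8, type_name, "bool")) {
            return self.fail(line, "expected type 'bool'");
        }
    }

    /// Each step is wrapped in `try`, so the first failure ends analysis early.
    fn analyzeDecl(self: *ToySema) CompileError!void {
        try self.checkIsBool(1, "bool");
        try self.checkIsBool(2, "u32"); // fails here
        try self.checkIsBool(3, "bool"); // never reached
    }
};

pub fn main() void {
    var sema = ToySema{};
    sema.analyzeDecl() catch {
        // The error value itself carries no details; they were stored on the side.
        std.debug.print("line {d}: error: {s}\n", .{ sema.err_line, sema.err_msg.? });
    };
}
```

The nice property of this design is that the handler functions never have to thread error details through their return types; whoever catches error.AnalysisFail knows where to find the message.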
Let’s look at one final example.
fn zirBoolNot(sema: *Sema, block: *Block, inst: Zir.Inst.Index) CompileError!Air.Inst.Ref {
    const mod = sema.mod;
    const inst_data = sema.code.instructions.items(.data)[@intFromEnum(inst)].un_node;
    const src = block.nodeOffset(inst_data.src_node);
    const operand_src = block.src(.{ .node_offset_un_op = inst_data.src_node });
    const uncasted_operand = try sema.resolveInst(inst_data.operand);
    const operand = try sema.coerce(block, Type.bool, uncasted_operand, operand_src);
    if (try sema.resolveValue(operand)) |val| {
        return if (val.isUndef(mod))
            mod.undefRef(Type.bool)
        else if (val.toBool())
            .bool_false
        else
            .bool_true;
    }
    try sema.requireRuntimeBlock(block, src, null);
    return block.addTyOp(.not, Type.bool, operand);
}
This implements the bool_not instruction, corresponding to the ! operator.
As before, some boilerplate at the top. I should note operand_src; here, we’re referencing the source node of the operand to !, i.e. in !foo, we’re referencing the expression foo. We don’t store in ZIR the AST node of the operand, because that would use a lot of bytes; instead, LazySrcLoc has some interesting mechanisms to refer to things like call arguments and operator operands indirectly, and these references are resolved to actual AST nodes only if an error actually happens.
Also as before, we use resolveInst to get the AIR ref corresponding to the operand, and coerce to ensure it’s a bool, emitting a compile error otherwise. Then we get onto the actual instruction logic.
The first thing we’re going to do here is check whether the operand is comptime-known, i.e. whether the Air.Inst.Ref called operand corresponds to a value in the InternPool (as opposed to the result of a previous AIR instruction). If it does, then we want the result of this operation to also be comptime-known. This pattern hasn’t come up in our previous two examples, but it’s the norm for most “computation-ey” ZIR instructions; most things are eagerly evaluated at comptime when their operands are comptime-known (for instance, 4 / 2 is comptime-known for this reason). If the operand is comptime-known, Sema.resolveValue will return a Value (which is a thin wrapper around a reference to a value in the InternPool); otherwise, it returns null.
If the value was comptime-known, we’ll do some checks on it. Since we already know it must be a bool, there are only 3 possibilities:
- Is it undefined? If so, return undefined.
- Is it true? If so, return false.
- Is it false? If so, return true.
Note that the Air.Inst.Ref values we return in the latter two cases don’t need us to call a function to construct them; we just return .bool_false or .bool_true. This is because, for efficiency (alongside some other reasons), there are some special values of Air.Inst.Ref for certain comptime-known values. I won’t go into detail here, but suffice to say, these correspond to the comptime-known values @as(bool, false) and @as(bool, true).
Otherwise, the operand is runtime-known. The first thing we do here is call Sema.requireRuntimeBlock. This is going to check block.is_comptime, and if it is true – i.e. we are in a comptime scope – emit a compile error. This is what stops comptime !runtime_value from working. Assuming that passed, we finally just emit a runtime instruction using Block.addTyOp. “TyOp” here means “type + operand”; these names relate to how we store AIR instructions in memory.
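If you’d like to poke at the overall shape of zirBoolNot outside the compiler, here’s a toy version that compiles on its own. It assumes a made-up encoding where a few reserved ToyRef values stand for well-known comptime constants and every other value stands for a runtime instruction; ToyRef, ToyBlock, requireRuntime, addInst and toyBoolNot are hypothetical names, and the real Air.Inst.Ref encoding is not what’s shown here.

```zig
const std = @import("std");

/// Hypothetical stand-in for Air.Inst.Ref. The first few values are reserved
/// for well-known comptime constants; in this toy encoding, any other value
/// refers to a runtime instruction.
const ToyRef = enum(u32) {
    bool_true,
    bool_false,
    undef_bool,
    _,
};

/// Values 0..2 are taken by the reserved constants above.
const first_inst: u32 = 3;

const ToyBlock = struct {
    is_comptime: bool,
    inst_count: u32 = 0,

    /// Toy counterpart of the runtime-block check: refuse runtime work in a comptime scope.
    fn requireRuntime(self: *const ToyBlock) error{AnalysisFail}!void {
        if (self.is_comptime) return error.AnalysisFail;
    }

    /// Appends a runtime instruction and returns a ref to it.
    fn addInst(self: *ToyBlock) ToyRef {
        const ref: ToyRef = @enumFromInt(first_inst + self.inst_count);
        self.inst_count += 1;
        return ref;
    }
};

fn toyBoolNot(block: *ToyBlock, operand: ToyRef) error{AnalysisFail}!ToyRef {
    // Comptime-known operand: the result is comptime-known too, and no
    // helper call is needed because the answers are reserved values.
    switch (operand) {
        .undef_bool => return .undef_bool,
        .bool_true => return .bool_false,
        .bool_false => return .bool_true,
        _ => {},
    }
    // Runtime operand: only allowed outside comptime scopes.
    try block.requireRuntime();
    return block.addInst();
}

pub fn main() !void {
    var block = ToyBlock{ .is_comptime = false };
    const folded = try toyBoolNot(&block, .bool_true);
    const runtime_operand = block.addInst();
    const emitted = try toyBoolNot(&block, runtime_operand);
    std.debug.print("folded to bool_false: {}\n", .{folded == .bool_false});
    std.debug.print("emitted ref value: {d}\n", .{@intFromEnum(emitted)});
}
```

Calling toyBoolNot with a runtime operand on a block whose is_comptime is true returns error.AnalysisFail, which is the toy equivalent of comptime !runtime_value being rejected.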
I hope that all made sense – let me know if you have any particular questions!