Zig’s comptime is really impressive. I would like to know if there are any tricks or optimizations I can use to increase comptime performance. For example:
const std = @import("std");

pub fn main() !void {
    @setEvalBranchQuota(1000000);
    comptime var vals: [100000]f32 = undefined;
    comptime {
        for (0..vals.len) |i| vals[i] = 23;
    }
    std.debug.print("{}\n", .{vals[8426]});
}
This code takes about 12 s to compile and run with the comptime var, versus just 500 ms with a regular runtime var.
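For this trivial constant fill, one workaround is to build the array in a single expression instead of mutating a comptime var element by element, using the ** array-repetition operator (my real use case isn't a constant fill, so this only goes so far):

const std = @import("std");

pub fn main() !void {
    // `**` repeats an array at comptime: no per-element comptime stores,
    // and no raised eval branch quota needed for a constant fill.
    const vals = [_]f32{23} ** 100000;
    std.debug.print("{}\n", .{vals[8426]});
}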
I know comptime might not be suitable for heavy workloads, but I think I have a valid use case where I need to create thousands of comptime structs.
It's my first time here; I don't know if I can say this in this thread or if it needs to go under the showcase tag. Here's a description of my project anyway.
I initially wanted to make a tensor lib after discovering Zig and comptime a few months ago, and started by making an IR to build the tensors on top of, like PyTorch's ATen. I am basically representing a graph/IR tree as comptime structs which carry info like the function, type, etc., so they can only live at comptime. An example of how it looks:
const AddExprOp = (blk: {
    const A = Op.ivar("A");
    const B = Op.ivar("B");
    const C = Op.ivar("C");
    const i = Op.rvar("i");
    const body = C.index(i).set(A.index(i).add(B.index(i)));
    break :blk body.range(
        Op.rref("i"),
        C.len(), // stop
        .c(0), // start
        .c(1), // step
    );
}).pack();
const AddExpr = AddExprOp.build(struct {
    A: Ptr(*const f32),
    B: Ptr(*const f32),
    C: Ptr(f32),
    i: usize,
}).cexport("adder_custom_exported", .c);

extern fn adder_custom_exported(*const anyopaque) void;
which is equivalent to:
export fn adder_zig(noalias inp: *const AddExpr.input_t) void {
    for (0..inp.C.len) |i|
        inp.C.ptr[i] = inp.A.ptr[i] + inp.B.ptr[i];
}
They produce identical assembly in release builds, thanks to Zig's compiler and explicit control of inlining. The same Op graph can be lowered into different types, including @Vector for SIMD, and can also be exported to CUDA etc. Being comptime values, the graph can be represented as JSON too. Everything is modeled as an Op node, including allocations, kernel launches, etc. The whole codebase so far is just under 2k lines, all because of how powerful comptime is.
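To give an idea of the SIMD case, below is roughly what the same graph lowers to when instantiated with @Vector(8, f32) instead of f32. This is a hand-written equivalent for illustration, not the generated code, and it assumes inp.X.ptr is a many-item pointer as in adder_zig above; the vector width is arbitrary:

export fn adder_simd(noalias inp: *const AddExpr.input_t) void {
    const V = @Vector(8, f32);
    var i: usize = 0;
    // Main vector loop: arrays of 8 f32 coerce to @Vector(8, f32) and back.
    while (i + 8 <= inp.C.len) : (i += 8) {
        const a: V = inp.A.ptr[i..][0..8].*;
        const b: V = inp.B.ptr[i..][0..8].*;
        inp.C.ptr[i..][0..8].* = a + b;
    }
    // Scalar tail for lengths that are not a multiple of 8.
    while (i < inp.C.len) : (i += 1)
        inp.C.ptr[i] = inp.A.ptr[i] + inp.B.ptr[i];
}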
Zig shines again with its build system and the ability to generate PTX directly. I managed to generate a portable C header + .so files that are just 30 kB and compile in under 4 s. I am able to do mixed CPU + GPU compute in a portable way, calling from C code and linking only the .so file (on Linux), not even requiring the CUDA toolkit for compilation or at runtime, just the driver API.
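The shared-library part of the build is nothing exotic; a minimal sketch of that piece (names and paths are placeholders, API as of roughly Zig 0.13; the PTX emission and header generation are separate steps I'm omitting here):

const std = @import("std");

pub fn build(b: *std.Build) void {
    // Build the comptime-generated kernels into a small shared library
    // that C code can link against.
    const lib = b.addSharedLibrary(.{
        .name = "tensor_kernels",
        .root_source_file = b.path("src/kernels.zig"),
        .target = b.standardTargetOptions(.{}),
        .optimize = .ReleaseFast,
    });
    b.installArtifact(lib);
}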
Recently I moved on to tensors and hit a comptime performance bottleneck. Tensors are built on top of Ops; they are compute-first and don't occupy memory, enabling aggressive fusion by default. A simple stress test with 1000 tensor adds took 13 s to compile, and adding more tensors increases the time exponentially. The bottleneck seems to be that Zig's comptime cannot create large numbers of structs quickly: most of the compilation time was spent on struct creation itself, before any processing even kicked in.
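One lead I'm experimenting with, since Zig memoizes comptime function calls with identical arguments: routing all node creation through a single type-returning function, so identical subgraphs resolve to the same struct instead of each creating a fresh one. A minimal sketch (OpTag and the field layout are placeholders, not my actual IR):

const OpTag = enum { add, mul, index, range };

// Zig caches generic instantiations: calling Node(.add, X, Y) again with
// the same arguments returns the same type, so duplicate subgraphs cost
// only one struct creation.
fn Node(comptime tag: OpTag, comptime Lhs: type, comptime Rhs: type) type {
    return struct {
        pub const op = tag;
        pub const lhs = Lhs;
        pub const rhs = Rhs;
    };
}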
Sorry if this is too verbose for this topic. I really would like to push further and see how far I can get with comptime, and I would appreciate any leads on getting maximum performance out of the comptime interpretation phase.