Comptime performance limits

Zig’s comptime is really impressive. I would like to know if there are any tricks or optimizations I can use to increase comptime performance. For example:

const std = @import("std");

pub fn main() !void {
    @setEvalBranchQuota(1000000);
    comptime var vals: [100000]f32 = undefined;
    comptime {
        for (0..vals.len) |i| vals[i] = 23;
    }
    std.debug.print("{}\n", .{vals[8426]});
}

This code takes about 12 s (compile + run) with the comptime var, versus just 500 ms with a regular runtime var.
I know comptime might not be suitable for heavy workloads, but I think I have a valid use case where I need to create thousands of comptime structs.
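
For reference, the runtime-var version I'm comparing against is essentially the same loop without comptime:

const std = @import("std");

pub fn main() !void {
    // Same work, done at runtime instead of at comptime.
    var vals: [100000]f32 = undefined;
    for (0..vals.len) |i| vals[i] = 23;
    std.debug.print("{}\n", .{vals[8426]});
}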

It's my first time here, and I don't know if I can say this in this thread or should post it under the Showcase tag. Here's a description of my project anyway.

I initially wanted to make a tensor lib after discovering Zig and comptime a few months ago. I started by making an IR to build the tensors on top of, similar to PyTorch's ATen. I am basically representing a graph/IR tree as comptime structs which carry info like the function, type, etc., so they can only live at comptime. An example of how it looks:

const AddExprOp = (blk: {
    const A = Op.ivar("A");
    const B = Op.ivar("B");
    const C = Op.ivar("C");
    const i = Op.rvar("i");
    const body = C.index(i).set(A.index(i).add(B.index(i)));
    break :blk body.range(
        Op.rref("i"),
        C.len(), // stop
        .c(0), // start
        .c(1), // step
    );
}).pack();

const AddExpr = AddExprOp.build(struct {
    A: Ptr(*const f32),
    B: Ptr(*const f32),
    C: Ptr(f32),
    i: usize,
}).cexport("adder_custom_exported", .c);

extern fn adder_custom_exported(*const anyopaque) void;

which is equivalent to:

export fn adder_zig(noalias inp: *const AddExpr.input_t) void {
    for (0..inp.C.len) |i|
        inp.C.ptr[i] = inp.A.ptr[i] + inp.B.ptr[i];
}

They produce identical assembly in a release build, thanks to Zig's compiler and explicit control of inlining. The same Op graph can be lowered into different types, including @Vector for SIMD, and can also be exported to CUDA etc. Being in comptime, the graph can be represented as JSON too. Everything is modeled as an Op node, including allocations, kernel launches, etc. The whole code so far is just under 2k lines, all because of how powerful comptime is.

Zig shines again with its build system and the ability to generate PTX directly. I managed to generate a portable C header + .so files that are just 30 KB and compile in under 4 s. I'm able to do mixed CPU + GPU compute in a portable way, calling from C code and linking only the .so file (on Linux), without even requiring the CUDA toolkit for compilation or at runtime, just the driver API.

Recently I moved on to tensors and ran into a comptime performance bottleneck. Tensors are built on top of ops, are compute-first, and don't occupy memory, enabling aggressive fusion by default. A simple stress test with 1000 tensor adds took 13 s to compile, and adding more tensors increases the time exponentially. It seems the bottleneck is that Zig's comptime cannot create large numbers of structs quickly; most of the compilation time was spent creating the structs themselves, before any processing even kicked in.

Sorry if this is too verbose for this topic. I really would like to push further and see how far I can get with comptime. I would appreciate any leads on getting maximum performance out of the comptime code interpretation phase.

The comptime interpreter is not optimized for speed. At some point doing code generation becomes a better option for generating data.


std.Build has the functionality to generate source files and include them in your modules/libraries. This can be automated, added to its own build step, or whatever you need. Typically they won't live directly in your src directory but in .zig-cache, yet they are still available and visible to your code and the LSP.

Here is a rough (untested) idea of how you might accomplish this with codegen:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    var allocating = std.Io.Writer.Allocating.init(b.allocator);
    const writer = &allocating.writer;
    for (0..100000) |i| {
        // Generate whatever code you need using Zig
        writer.print("pub const value_{d} = {d};\n", .{ i, i }) catch unreachable;
    }

    const file_path = b.addWriteFiles().add("generated.zig", writer.buffered());
    const mod = b.addModule("generated", .{
        .root_source_file = file_path,
        .target = target,
        .optimize = optimize,
    });

    // import mod normally wherever you need it
    // ...
}
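
To actually consume the generated module, the wiring could look something like this (continuing the build() above; the executable name "app" and src/main.zig are placeholders for your own project):

    const exe = b.addExecutable(.{
        .name = "app",
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/main.zig"),
            .target = target,
            .optimize = optimize,
        }),
    });
    exe.root_module.addImport("generated", mod);
    b.installArtifact(exe);

Then in src/main.zig you can @import("generated") and reference e.g. generated.value_42.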

Like you’ve already noted, using comptime to compute results or generate data should be done sparingly, so once your uses of comptime begin to increase your compile times noticeably it’s a good sign to begin considering build-time code generation instead.

As far as general advice goes, I would take into account similar considerations as one would in garbage-collected, scripting-like languages such as JavaScript or Python. For example, a loop that does array concatenation is going to be much slower than declaring an array with a known size upfront and iterating over each uninitialized element:

// slow
comptime var vals: []const f32 = &.{};
comptime {
    for (0..10000) |_| vals = vals ++ .{23};
}

// faster
comptime var vals: [10000]f32 = undefined;
comptime {
    for (&vals) |*val| val.* = 23;
}

In my experience, comptime tends to crap out more quickly when working with larger contiguous chunks of memory, such as std data structures or large arbitrary-bit-width integers like u1024.

Obviously, any specific performance pitfalls are subject to how comptime happens to be implemented by the compiler at any point in time. The long-term goal is for comptime to be roughly similar in performance to CPython.


I am actually using the build system to generate things like Zig code and a C header for the interface, and to embed generated CUDA PTX using @embedFile, etc. It's just that I saw comptime as something unique, unlike traditional codegen, since my nodes' final eval functions are either compiled in or can be inlined in the code itself.

For example, the constructed graph from my example above is directly callable, like AddExpr.eval(inputs).

I am also doing things like lazy type resolution using Zig types (a type as a field) and extensible ops (a function as a field), which can only exist at comptime. So I cannot easily move to runtime unless I reduce the scope a lot by representing types as a fixed set of enums.
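
A minimal sketch of what I mean (the names here are illustrative, not my actual Op type): a field of type `type` or of a function body type is comptime-only, so any struct carrying one can only exist at comptime.

const Node = struct {
    name: []const u8,
    output_t: type, // lazy type resolution: the result type stored as data
    call: fn (f32, f32) f32, // the eval function stored directly as a field
};

const add_node = Node{
    .name = "add",
    .output_t = f32,
    .call = struct {
        fn call(a: f32, b: f32) f32 {
            return a + b;
        }
    }.call,
};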

Since I am new to this, I am unable to wrap my head around not using comptime for my use case. I am heavily relying on compile-time-only types like type and fn types as fields. I am not doing a traditional enum- or registry-based IR at all, because of what comptime lets me do. I would frame it as using comptime as a function inliner rather than as a way to pre-compute results. For example:

pub fn add(a: *Op, b: *Op) *Op {
    return simple("add", &.{ a, b }, null, struct {
        pub inline fn call(eval: type, a_: anytype, b_: @TypeOf(a_)) eval.output_t {
            return a_ + b_;
        }
    });
}

The above is the only thing needed to integrate with all the other nodes, including serialization to/from JSON etc., but it can only exist at comptime because the Op struct holds the function I am passing as a field. And it's basically a single function for all backends, since Zig can do the lowering.

Comptime is also very much needed to make the types concrete (the eval.output_t) and to build the state structs etc. It would be difficult to explain without describing the core of the project, but in the end, I currently feel like I cannot do this without comptime.

I am mostly using a tree of node structs rather than contiguous arrays. And I am not using most of the std data structures, since most of them won't work at comptime.

For all the above reasons, I am not sure I can use build-time code generation.

Right now, when the build system decides it needs to run the compiler, it's done completely from scratch, with no caching. Comptime has the great property of being deterministic, so the compiler should be able to cache specific invocations of comptime functions once incremental compilation is here. I believe you can already try out incremental compilation on x86_64 Linux.

Unfortunately that does effectively mean "just wait and eventually it'll get fast". Others have commented on some optimization strategies in the meantime. I wrote a comptime assembler a while back and made extensive use of bounded arrays.
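
For what it's worth, the bounded-array idea is roughly a fixed-capacity buffer plus a length, which avoids the repeated reallocation that ++ concatenation causes. A minimal hand-rolled sketch (illustrative, not std.BoundedArray itself):

fn FixedList(comptime T: type, comptime capacity: usize) type {
    return struct {
        items: [capacity]T = undefined,
        len: usize = 0,

        pub fn append(self: *@This(), item: T) void {
            self.items[self.len] = item;
            self.len += 1;
        }

        pub fn constSlice(self: *const @This()) []const T {
            return self.items[0..self.len];
        }
    };
}

comptime {
    @setEvalBranchQuota(10000);
    var list: FixedList(f32, 1000) = .{};
    for (0..1000) |_| list.append(23);
    _ = list.constSlice();
}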