Comptime performance limits

Zig’s comptime is really impressive. I would like to know if there are any tricks or optimizations I can use to increase comptime performance. For example:

const std = @import("std");

pub fn main() !void {
    @setEvalBranchQuota(1000000);
    comptime var vals: [100000]f32 = undefined;
    comptime {
        for (0..vals.len) |i| vals[i] = 23;
    }
    std.debug.print("{}\n", .{vals[8426]});
}

This code takes about 12 s (compile + run) with the comptime var, versus just 500 ms with a regular runtime var.
I know comptime might not be suitable for heavy workloads, but I think I have a valid use case where I need to create thousands of comptime structs.
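
For reference, the runtime-var version I'm comparing against is essentially the same loop without comptime:

const std = @import("std");

pub fn main() !void {
    // Same work, done at runtime instead of at comptime.
    var vals: [100000]f32 = undefined;
    for (0..vals.len) |i| vals[i] = 23;
    std.debug.print("{}\n", .{vals[8426]});
}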

It's my first time here, and I don't know if I can say this in this thread or should post it under the Showcase tag. Here's a description of my project anyway.

I initially wanted to make a tensor lib after discovering Zig and comptime a few months ago. I started by making an IR to build the tensors on top of, similar to PyTorch's ATen. I am basically representing a graph/IR tree as comptime structs which carry info like the function, type, etc., so they can only live at comptime. An example of how it looks:

const AddExprOp = (blk: {
    const A = Op.ivar("A");
    const B = Op.ivar("B");
    const C = Op.ivar("C");
    const i = Op.rvar("i");
    const body = C.index(i).set(A.index(i).add(B.index(i)));
    break :blk body.range(
        Op.rref("i"),
        C.len(), // stop
        .c(0), // start
        .c(1), // step
    );
}).pack();

const AddExpr = AddExprOp.build(struct {
    A: Ptr(*const f32),
    B: Ptr(*const f32),
    C: Ptr(f32),
    i: usize,
}).cexport("adder_custom_exported", .c);

extern fn adder_custom_exported(*const anyopaque) void;

which is equivalent to:

export fn adder_zig(noalias inp: *const AddExpr.input_t) void {
    for (0..inp.C.len) |i|
        inp.C.ptr[i] = inp.A.ptr[i] + inp.B.ptr[i];
}

They produce identical assembly in a release build, thanks to Zig's compiler and explicit control of inlining. The same Op graph can be lowered into different types, including @Vector for SIMD, and can also be exported to CUDA etc. Being in comptime, the graph can be represented as JSON too. Everything is modeled as an Op node, including allocations, kernel launches, etc. The whole code so far is just under 2k lines, all because of how powerful comptime is.

Zig shines again with its build system and the ability to generate PTX directly. I managed to generate a portable C header + .so files that are just 30 KB and compile in under 4 s. I'm able to do mixed CPU + GPU compute in a portable way, calling from C code and linking only the .so file (on Linux), without even requiring the CUDA toolkit for compilation or at runtime, just the driver API.

Recently I moved on to tensors and ran into a comptime performance bottleneck. Tensors are built on top of ops, are compute-first, and don't occupy memory, enabling aggressive fusion by default. A simple stress test with 1000 tensor adds took 13 s to compile, and adding more tensors increases the time exponentially. It seems the bottleneck is that Zig's comptime cannot create large numbers of structs quickly; most of the compilation time was spent creating the structs themselves, before any processing even kicked in.

Sorry if this is too verbose for this topic. I really would like to push further and see how far I can get with comptime. I would appreciate any leads on getting maximum performance out of the comptime code interpretation phase.

The comptime interpreter is not optimized for speed. At some point doing code generation becomes a better option for generating data.


std.Build has the functionality to generate source files and include them in your modules/libraries. This can be automated, added to its own build step, or whatever you need. Typically they won't live directly in your src directory but in .zig-cache, yet they are still available and visible to your code and the LSP.

Here is a rough (untested) idea of how you might accomplish this with codegen:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    var allocating = std.Io.Writer.Allocating.init(b.allocator);
    const writer = &allocating.writer;
    for (0..100000) |i| {
        // Generate whatever code you need using Zig
        writer.print("pub const value_{d} = {d};\n", .{ i, i }) catch unreachable;
    }

    const file_path = b.addWriteFiles().add("generated.zig", writer.buffered());
    const mod = b.addModule("generated", .{
        .root_source_file = file_path,
        .target = target,
        .optimize = optimize,
    });

    // import mod normally wherever you need it
    // ...
}
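
To actually consume the generated module, the wiring could look something like this (continuing the build() above; the executable name "app" and src/main.zig are placeholders for your own project):

    const exe = b.addExecutable(.{
        .name = "app",
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/main.zig"),
            .target = target,
            .optimize = optimize,
        }),
    });
    exe.root_module.addImport("generated", mod);
    b.installArtifact(exe);

Then in src/main.zig you can @import("generated") and reference e.g. generated.value_42.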

Like you’ve already noted, using comptime to compute results or generate data should be done sparingly, so once your uses of comptime begin to increase your compile times noticeably it’s a good sign to begin considering build-time code generation instead.

As far as general advice goes, I would take into account similar considerations as one would in garbage-collected, scripting-like languages such as JavaScript or Python. For example, a loop that does array concatenation is going to be much slower than declaring an array with a known size upfront and iterating over each uninitialized element:

// slow
comptime var vals: []const f32 = &.{};
comptime {
    for (0..10000) |_| vals = vals ++ .{23};
}

// faster
comptime var vals: [10000]f32 = undefined;
comptime {
    for (&vals) |*val| val.* = 23;
}

In my experience, comptime tends to crap out more quickly when working with larger contiguous chunks of memory, such as std data structures or large arbitrary-bit-width integers like u1024.

Obviously, any specific performance pitfalls are subject to how comptime happens to be implemented by the compiler at any point in time. The long-term goal is for comptime to be roughly similar in performance to CPython.


I am actually using the build system to generate things like Zig code and a C header for the interface, and to embed generated CUDA PTX using @embedFile, etc. It's just that I saw comptime as something unique, unlike traditional codegen, since my nodes' final eval functions are either compiled in or can be inlined in the code itself.

For example, the constructed graph from my example above is directly callable, like AddExpr.eval(inputs).

I am also doing things like lazy type resolution using Zig types (a type as a field) and extensible ops (a function as a field), which can only exist at comptime. So I cannot easily move to runtime unless I reduce the scope a lot by representing types as a fixed set of enums.
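
A minimal sketch of what I mean (the names here are illustrative, not my actual Op type): a field of type `type` or of a function body type is comptime-only, so any struct carrying one can only exist at comptime.

const Node = struct {
    name: []const u8,
    output_t: type, // lazy type resolution: the result type stored as data
    call: fn (f32, f32) f32, // the eval function stored directly as a field
};

const add_node = Node{
    .name = "add",
    .output_t = f32,
    .call = struct {
        fn call(a: f32, b: f32) f32 {
            return a + b;
        }
    }.call,
};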

Since I am new to this, I am unable to wrap my head around not using comptime for my use case. I am heavily relying on compile-time-only types like type and fn types as fields. I am not doing a traditional enum- or registry-based IR at all, because of what comptime lets me do. I would frame it as using comptime as a function inliner rather than as a way to pre-compute results. For example:

pub fn add(a: *Op, b: *Op) *Op {
    return simple("add", &.{ a, b }, null, struct {
        pub inline fn call(eval: type, a_: anytype, b_: @TypeOf(a_)) eval.output_t {
            return a_ + b_;
        }
    });
}

The above is the only thing needed to integrate with all the other nodes, including serialization to/from JSON etc., but it can only exist at comptime because the Op struct holds the function I am passing as a field. And it's basically a single function for all backends, since Zig can do the lowering.

Comptime is also very much needed to make the types concrete (the eval.output_t) and to build the state structs etc. It would be difficult to explain without describing the core of the project, but in the end, I currently feel like I cannot do this without comptime.

I am mostly using a tree of node structs rather than contiguous arrays. And I am not using most of the std data structures, since most of them won't work at comptime.

For all the above reasons, I am not sure I can use build-time code generation.

Right now, when the build system decides it needs to run the compiler, it's done completely from scratch, with no caching. Comptime has the great property of being deterministic, so the compiler should be able to cache specific invocations of comptime functions once incremental compilation is here. I believe you can already try out incremental compilation on x86_64 Linux.

Unfortunately that does effectively mean "just wait and eventually it'll get fast". Others have commented on some optimization strategies in the meantime. I wrote a comptime assembler a while back and made extensive use of bounded arrays.
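
For what it's worth, the bounded-array idea is roughly a fixed-capacity buffer plus a length, which avoids the repeated reallocation that ++ concatenation causes. A minimal hand-rolled sketch (illustrative, not std.BoundedArray itself):

fn FixedList(comptime T: type, comptime capacity: usize) type {
    return struct {
        items: [capacity]T = undefined,
        len: usize = 0,

        pub fn append(self: *@This(), item: T) void {
            self.items[self.len] = item;
            self.len += 1;
        }

        pub fn constSlice(self: *const @This()) []const T {
            return self.items[0..self.len];
        }
    };
}

comptime {
    @setEvalBranchQuota(10000);
    var list: FixedList(f32, 1000) = .{};
    for (0..1000) |_| list.append(23);
    _ = list.constSlice();
}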