How does one deal with deeply nested comptime loops?

I’m currently working on a tensor math library for running neural networks, and Zig has been a joy to use so far. I especially appreciate the comptime and inline features; they’ve been super helpful for generating and specializing tensor operations. The idea is to emit Zig code for each operation and use these features to tailor it to specific tensor shapes and values. Ideally, this lets the compiler optimize each operation for the task at hand (or so I hope).
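To give a flavor of what I mean, here’s a stripped-down sketch (names made up for illustration): the shapes are comptime parameters, so every instantiation gets compiled for fixed dimensions.

// Simplified illustration: shapes are comptime parameters, so each
// call site gets a kernel specialized for its exact dimensions.
fn matmul(comptime M: usize, comptime K: usize, comptime N: usize,
    a: *const [M * K]f32, b: *const [K * N]f32, out: *[M * N]f32) void {
    for (0..M) |i| {
        for (0..N) |j| {
            var acc: f32 = 0;
            for (0..K) |k| acc += a[i * K + k] * b[k * N + j];
            out[i * N + j] = acc;
        }
    }
}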

It’s worked well overall, but I’ve hit some challenges when I push this concept too far. For example, deeply nested inline for loops for something like MatMul can cause long compile times, especially during the LLVM Emit Object phase. In some cases, scaling up simply crashes the compiler without any helpful output. This makes me hesitant to lean into inline and comptime as much as I’d like, since things can unexpectedly blow up.

So my questions are:

  1. Am I misusing Zig for this kind of thing? I imagine in a production setting, aggressive unrolling like this might lead to unacceptably large binaries. Admittedly, I’m often chasing curiosity more than practical design, like “Ooh, what if I inline this loop? Will my code be faster?!”… which might not qualify as “proper” software engineering (although I do really enjoy it).

  2. If this approach is valid, are there any tips for dealing with these issues? Any patterns, workarounds or tricks that might help?

EDIT:

Basically, to give a contrived example, this is a nice pattern that doesn’t scale well:

inline for (0..N) |i| {
    inline for (0..N) |j| {
        inline for (0..N) |k| {
            const index = comptime some_function(i, j, k);
            // do something with index
        }
    }
}

You are misusing it, in the sense that you shouldn’t use it for every loop you encounter.

Generally LLVM is pretty good at optimizing loops (e.g. for a simple for loop, LLVM will give you basically the same vectorized code with and without inline for: Compiler Explorer).
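Roughly something like this pair (untested sketch; compare the generated assembly for both):

export fn sum(a: *const [256]f32) f32 {
    var total: f32 = 0;
    for (a) |x| total += x; // plain loop: llvm vectorizes this on its own
    return total;
}

export fn sumUnrolled(a: *const [256]f32) f32 {
    var total: f32 = 0;
    inline for (0..256) |i| total += a[i]; // 256 unrolled adds for llvm to chew through
    return total;
}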

However, if you don’t force everything inline, LLVM knows when to stop unrolling the loop. With inline for you end up with thousands of copies of the same basic instructions. This increases your compile times, because LLVM now has to parse and optimize far more code. You might also run into problems with large binaries, and you end up flooding your instruction cache with the same repeating instructions.

This approach is valid; however, it should be used with caution:
First of all, you must measure. If inlining doesn’t make a measurable performance impact, why bother with the long compile times?
I would start by inlining only the innermost loops, and stop as soon as it no longer makes a performance difference. The outer loops execute rarely (→ low runtime cost), but unrolling them generates huge amounts of code (→ high compile-time cost).
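Applied to your contrived example, that would be something like this (note that i and j become runtime values, so some_function can no longer run at comptime on them):

for (0..N) |i| {                // runtime loop: executed rarely, cheap
    for (0..N) |j| {            // runtime loop
        inline for (0..N) |k| { // only the hot innermost loop is unrolled
            const index = some_function(i, j, k); // runtime now, since i and j are runtime
            _ = index; // do something with index
        }
    }
}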

4 Likes

Breaking up code into functions helps. They’ll be inlined anyway, so you’re not going to lose much in terms of performance. When there are lots of local variables in one function scope, compilation time and memory use can increase greatly, as the optimizer tries futilely to find ways of optimizing them.
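For example, a sketch along these lines (some_function borrowed from the example above):

// Each unrolled step becomes a small function scope of its own, so the
// optimizer looks at a handful of locals at a time instead of thousands.
fn bodyStep(comptime i: usize, comptime j: usize, comptime k: usize) void {
    const index = comptime some_function(i, j, k);
    _ = index; // do something with index
}

fn kernel(comptime N: usize) void {
    inline for (0..N) |i| {
        inline for (0..N) |j| {
            inline for (0..N) |k| {
                bodyStep(i, j, k); // small call, inlined by the optimizer anyway
            }
        }
    }
}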

1 Like

Thanks for the comments so far!

In the meantime, my own experimentation continues. In cases where LLVM chokes, I found that compiling with -Doptimize=ReleaseSmall reduces compile times too (roughly an 80% reduction, I guess because LLVM doesn’t have to do as much work)! In some cases, an otherwise uncompilable scenario became compilable.

Thanks for the tips! I agree, going inside-out with inline seems sensible. I’ve run into cases where the outer loops are hard not to inline, though; e.g. in a scenario like the following:

inline for (0..H) |h| {
    inline for (0..W) |w| {
        for (0..C) |c| {
            inline for (0..Kh) |kh| {
                inline for (0..Kw) |kw| {
                    // boundary check for [h + kh, w + kw]; I'd really
                    // rather not do this at runtime! Since h, w, kh and
                    // kw are all comptime-known here, the branch folds
                    // away at compile time.
                    if (h + kh < H and w + kw < W) {
                        // ...
                    }
                }
            }
        }
    }
}

In a case like this one, using inline has a huge performance benefit, since all of the bounds checking is resolved at compile time rather than at run time.

But from what I’m reading so far, the better solution is probably a smarter design of the algorithm that reduces how often the boundary check is hit, with some “lighter” use of comptime sprinkled here and there.
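Maybe something like this interior/border split (rough sketch; process is a hypothetical helper, and I’m assuming Kh <= H and Kw <= W):

// Interior: h + kh and w + kw can never overrun, so no checks at all.
const h_safe = H - Kh + 1;
const w_safe = W - Kw + 1;
for (0..h_safe) |h| {
    for (0..w_safe) |w| {
        for (0..C) |c| {
            inline for (0..Kh) |kh| {
                inline for (0..Kw) |kw| {
                    process(c, h + kh, w + kw); // in bounds by construction
                }
            }
        }
    }
}
// ...then handle the thin border strip separately, with the checks.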

This is a very nice tip indeed! Limiting the scope with function calls seems clever :)

1 Like