Brainstorming ideas for how to test incremental compilation

It’s finally starting to happen!

With this change, we can start to actually use the incremental compilation feature. The problem is, it introduces an entirely new dimension where bugs can hide. Different sequences of edits can affect the compiler's state in different ways.

For example, if I do this sequence:

const std = @import("std");
pub fn main() !void {
    std.debug.print("hello\n", .{});
}
const std = @import("std");
pub fn main() !void {
    std.debug.print("hello {s}\n", .{"world"});
}
pub fn main() !void {}

Versus this sequence:

const std = @import("std");
pub fn main() !void {
    std.debug.print("hello {s}\n", .{"world"});
}
const std = @import("std");
pub fn main() !void {
    std.debug.print("hello\n", .{});
}
pub fn main() !void {}

The same 3 source files are used in both cases, and the final result is the same, but the order is different. The compiler must end up with the same final result, but it will require handling different scenarios, and it’s possible that bugs could hide in these differences.

So, we need a new kind of test case to cover this kind of thing, and maybe a new mode of zig reduce to produce such test cases. And ideally, we could try to find all the incremental compilation bugs before our users do. Perhaps we can invest in fuzz testing combined with zig reduce in order to generate test cases. Finally, we need a way to easily understand the state of the compiler given a failing test case, in order to fix the bug easily. If we’re spending hours or days troubleshooting any one particular bug, that’s a sign we should have invested that time into tooling instead.

A related topic is testing in the face of the new parallelism being added to the compiler, but perhaps we can save that for another topic to keep things more focused.

So, the point of opening this thread is to ask the community, do you have any testing ideas? Let’s brainstorm together.

10 Likes

A simple place to start would be restoring some old testing functionality we had.

In test/cases/ live the compiler “test cases”, which generally consist of a single file from which we expect a certain result (perhaps a successful compilation with a certain output, or a safety panic, or a compile error). When the legacy implementation of incremental compilation was enabled, we also supported “incremental test cases” here. Using files of the form foo.0.zig, foo.1.zig, etc., you could simulate a series of incremental updates. At some point, we disabled these tests, as the old implementation of incremental compilation was only POC quality and was beginning to bit-rot (we eventually ripped it out entirely). Recently, I had to rip out the logic for running those cases too, because it relied on some legacy test logic which hooked into the compiler (rather than running it via std.process.Child, which is the preferred mechanism). Restoring this functionality would be a solid starting point for testing incremental compilation.

The starting point for this change is these lines:


A further enhancement here might be to change this file format to condense the data into a single file, perhaps using diffs. This would reduce the friction of manually writing these test cases. I’m just spitballing, but something like this:

pub fn main() void {
    while (true) break;
}
// run

@ 2
-    while (true) break;
+    break;
// error
//
// :2:5: error: break expression outside loop

I don’t actually think that’s a good syntax (I dislike how it awkwardly merges Zig and diff syntax), but that’s the kind of thing I’m thinking of.

2 Likes

For executing those test cases, perhaps it’s time to finally implement

Or at least the build runner multiplexer component (i.e. running zig build --watch as a child process). This would test the workflow and features used by a potential IDE.
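
As a very rough sketch of what that multiplexer might start as (everything here beyond invoking zig build --watch via std.process.Child is an assumption, not an existing API):

const std = @import("std");

/// Sketch: drive `zig build --watch` as a child process, the way an IDE
/// (or a test harness) might, reading diagnostics from its stdout.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    var child = std.process.Child.init(&.{ "zig", "build", "--watch" }, allocator);
    child.stdout_behavior = .Pipe;
    try child.spawn();

    // Placeholder loop: a real multiplexer would parse the compiler's
    // output and rewrite source files between rebuilds.
    var buf: [4096]u8 = undefined;
    while (true) {
        const n = try child.stdout.?.read(&buf);
        if (n == 0) break;
        std.debug.print("{s}", .{buf[0..n]});
    }
    _ = try child.wait();
}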

I do think it would be worth investing in some kind of compact test case format. Ideally, a single file that describes potentially multiple source files and, of course, multiple sequences of edits. Here’s a draft:

#update=initial version
#file=main.zig
const std = @import("std");
const other = @import("other.zig");
pub fn main() !void {
    try std.io.getStdOut().writer().print("{s}\n", .{other.string});
}
#file=other.zig
pub const string = "hello";
#expect_exit=0
#expect_stdout="hello\n"

#update=delete the other file
#delete=other.zig
#expect_error=file not found: other.zig

#update=fix the reference
#file=main.zig
const std = @import("std");
pub fn main() !void {
    try std.io.getStdOut().writer().print("{s}\n", .{"hello2"});
}
#expect_exit=0
#expect_stdout="hello2\n"

One thing that’s interesting to think about here is that the test case is represented as a sequence; however, the expected behavior is independent of the order in which the updates occur. So maybe the test harness could try out multiple orderings depending on a seed, and then, if a problem is observed, print the seed so that you can reproduce the failure.
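
Something like this sketch, where Update, applyUpdate, and runCase are all hypothetical names:

const std = @import("std");

/// Hypothetical representation of one parsed `#update=...` block.
const Update = struct { name: []const u8 };

/// Sketch: apply the updates of a test case in a seed-dependent order,
/// printing the seed on failure so the run can be reproduced.
fn runCase(updates: []Update, seed: u64) !void {
    var prng = std.Random.DefaultPrng.init(seed);
    prng.random().shuffle(Update, updates);
    for (updates) |update| {
        // applyUpdate stands in for "feed this update to the compiler
        // and check the #expect_* assertions".
        applyUpdate(update) catch |err| {
            std.debug.print("failure with seed {d}\n", .{seed});
            return err;
        };
    }
}

fn applyUpdate(update: Update) !void {
    _ = update; // stub
}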

Looking at the problem this way, it does start to become more clear how zig reduce could fit into this picture. It could try to reduce these test cases.

2 Likes

What if…

There was a tool similar to zig reduce but instead of mutating the code and checking if some property still exists, it mutates the code by taking small steps to transform A into B.

Then we pick two commits in a source tree, and then have this tool generate a series of in-between states of the source trees.

Then we use a fuzzing algorithm to pick arbitrary subsets of these source trees, and test that incremental compilation is able to get from point A to point B, and B to A, using the source tree’s passing test suite at point A and point B to verify correctness.

Several disjoint thoughts here!

I was there when incremental compilation was retrofitted into the rustc and Kotlin compilers, and in both cases some bad bugs were shipped: incremental compilation was enabled in version X, a critical bug was detected in X + m, and incremental compilation was disabled until X + m + n. That is to say, this is likely harder than it seems; you really need to invest in ways to prevent these kinds of bugs, and even if you are sure you are done (version X), your users might discover something nasty later.


Curiously, I think we had a grand total of one incremental compilation bug in rust-analyzer. That is because it was incremental from day one, and it was based on an incremental framework which guaranteed correct results by construction. (Debugging the framework was easy, because you only needed to debug the incremental update algorithm; the actual computations were fully abstract.) The one bug we had was due to non-determinism in a user-defined proc macro, which broke the core assumption of the framework.

My understanding, though, is that Zig doesn’t go for a “framework” approach (which probably makes sense, as rust-analyzer is about as anti-DOD as it gets), so this probably isn’t applicable.


A very useful implementation property would be path independence. That is, regardless of the path you take to get to a particular version of the source code, the result is byte-for-byte identical (as opposed to merely equivalent). That indeed opens up a path for fuzzing.


Yes, obviously this needs fuzzing / randomized generative testing. After working at TigerBeetle, I am convinced that, if you care about correctness, you should build the bulk of your test suite around an integrated randomized test. There are so many bugs which are next to impossible to find by just looking at the code or writing manual example-based tests.


For fuzzing compilers, the blueprint is wasm-smith:

That is a tool that can generate random, but syntactically and semantically valid, WASM code. It also gives minimization for free, and integrates for free with coverage-guided fuzzers like AFL.

The idea is very simple: write a function that takes an rng argument and produces a random valid code fragment. That part is easy: you basically pick a random expression kind, and then generate sub-expressions, pruning them by the known result type. Then comes the key trick: make the rng finite, such that it can generate only up to n numbers. That is, the rng is just a pre-filled array of random bytes. Then patch the generating function to short-circuit when the rng runs out of entropy (e.g., just generate an undefined expression). With this setup, you can shorten the byte array to make the resulting code shorter. And you can use AFL as the source of those random bytes, to plug into the coverage-guided aspect.
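
To make that concrete, here is a minimal sketch (Entropy and genExpr are invented names, and the expression grammar is deliberately tiny):

const std = @import("std");

/// A finite entropy source: just a pre-filled array of random bytes.
const Entropy = struct {
    bytes: []const u8,
    index: usize = 0,

    /// Returns null once the bytes run out.
    fn next(self: *Entropy) ?u8 {
        if (self.index >= self.bytes.len) return null;
        defer self.index += 1;
        return self.bytes[self.index];
    }
};

/// Generate a random expression, short-circuiting to a trivial leaf
/// when the entropy is exhausted, so a shorter byte array always
/// yields a shorter program.
fn genExpr(entropy: *Entropy, writer: anytype) !void {
    const byte = entropy.next() orelse {
        try writer.writeAll("undefined"); // out of entropy: emit a leaf
        return;
    };
    switch (byte % 3) {
        0 => try writer.print("{d}", .{byte}), // integer literal
        1 => { // sum of two sub-expressions
            try writer.writeAll("(");
            try genExpr(entropy, writer);
            try writer.writeAll(" + ");
            try genExpr(entropy, writer);
            try writer.writeAll(")");
        },
        else => { // negation of a sub-expression
            try writer.writeAll("-");
            try genExpr(entropy, writer);
        },
    }
}

Shortening the byte array shrinks the generated program, and AFL can be wired in as the source of the bytes.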


If you can generate random code, it should be easy to generate random equivalent histories. For example, you can start with A, then apply some operations to go from A to a different B, then repeat the same with a different seed to go from A to C, and then replay the edits backwards to get from B and C back to identical code.

More directly, you can build A incrementally by fixing the sequence of random bytes used as a seed for the generator, and gradually increasing its length.

Alternatively, you can try applying local edits to the seed sequence, which may result in local edits to the generated code.
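
A sketch of the growing-prefix variant, reusing the hypothetical Entropy and genExpr pieces from the sketch above:

/// Each iteration yields the "next" incremental version of the same
/// program: the seed prefix is fixed, only its length grows.
fn genHistory(allocator: std.mem.Allocator, seed_bytes: []const u8) !void {
    var len: usize = 1;
    while (len <= seed_bytes.len) : (len += 1) {
        var buf = std.ArrayList(u8).init(allocator);
        defer buf.deinit();
        var entropy = Entropy{ .bytes = seed_bytes[0..len] };
        try genExpr(&entropy, buf.writer());
        // Feed buf.items to the incremental compiler under test here.
    }
}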


Fuzzing is tricky to integrate with CI. While normal tests are edge-triggered (you run tests on commit, and they finish in finite time), fuzz tests are level-triggered (you run them for however long you can, a failing seed can be discovered at any point, and there’s a set of currently-failing seeds). I think we arrived at a good pattern at TigerBeetle:

  • we have a machine which, in a loop, fetches code from the main branch, runs the fuzzer, records failing seeds, and pushes (commit hash, seed) pairs to a json file in the repo
  • there’s a piece of html/js that displays the table of “hey, these seeds are currently failing on these commits”: TigerBeetle DevHub.
7 Likes

Thank you so much, @matklad! To be honest, I posted this brainstorming topic here specifically hoping that you would notice and have some useful advice to share, and you did not disappoint.

4 Likes

Thank you for this nice introduction. This will make it easier to follow the resources you linked, which perhaps already answer my question below, but I’ll ask it anyway:

This all makes sense and will help find crashes, but how would you test for behavioral correctness? I suppose we can provide a few “fixed points” along the way, where we know the code example is supposed to run with exit code 0, print a specific string to stdout, or fail with certain compile errors. The fuzz-generated test cases would then check that inserting arbitrary in-between points between those known fixed points causes no crashes, and also does not change the behavior of the fixed points.

Or perhaps this is where your determinism comment comes in:

I think I’m starting to understand this, because you could check that two different fuzz-generated paths resulted in identical binaries and thus generate test failures from seeds alone, without even needing any “fixed points” to start with.

I’m not sure I want to require this property for debug builds, however, because it’s quite handy to allocate and free machine code buffers in the output file similar to how alloc and free work, and the orderings there could affect the positions within the machine code.

However, perhaps we can still achieve the goal, by having a post-processing “makeDeterministic()” function that takes a nondeterministic binary, and produces a deterministic one. For example, by sorting the symbols and removing padding. When comparing binaries for the purposes of path independence, this post-processed output could be checked, rather than the direct output.
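
Something like this sketch, where Symbol is a hypothetical view of the output binary and the sort-then-hash is purely illustrative:

const std = @import("std");

/// Hypothetical view of one symbol in the output binary.
const Symbol = struct {
    name: []const u8,
    code: []const u8,
};

/// Canonicalize a nondeterministic binary by sorting symbols by name;
/// the padding between symbols is simply never hashed. Two
/// path-independent outputs should then produce the same hash.
fn canonicalHash(symbols: []Symbol) u64 {
    std.mem.sort(Symbol, symbols, {}, struct {
        fn lessThan(_: void, a: Symbol, b: Symbol) bool {
            return std.mem.lessThan(u8, a.name, b.name);
        }
    }.lessThan);
    var hasher = std.hash.Wyhash.init(0);
    for (symbols) |sym| {
        hasher.update(sym.name);
        hasher.update(sym.code);
    }
    return hasher.final();
}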

1 Like

Yeah, that’s exactly why I think that path independence is useful.

Another question here is what the output under test should be. Naively, it seems that testing all the way to machine code might not be necessary. Obviously you’ll have some live-patching infrastructure and whatnot, but that seems relatively easier to get right than the various sema bits.

So perhaps it makes sense to focus the bulk of the testing slightly higher than machine code. E.g., using WASM as the target backend might make sense, as it has a better textual representation and is also a bit higher-level.

Or maybe just ask sema to hash all reachable functions, or something like that.

2 Likes

I think that would be an interesting approach that could enable other, later benefits.

Since I watched “Performance Matters” by Emery Berger (https://www.youtube.com/watch?v=r-TLSBdHe1A), I have been thinking that incremental compilation support could later be extended with a feature where the compiler produces many different address-space layouts while the program is running (re-shuffling the binary as it runs). The talk explains that this can help with getting better performance metrics: the effects of specific address-space layouts are averaged out, so you can tell whether a code change actually improved performance (rather than the compiler just getting lucky with a particular layout). It may also later be useful for developing optimizations that search for layouts which increase performance.

But I will stop here, because this comment is a bit of a side-note.

3 Likes

I guess another option here is to just fuzz the resulting programs for equivalence? You don’t necessarily fix a desired output; instead you allow the program to print whatever it wants to stdout, based on input args. Then you run two versions of the program with random inputs and compare the output.

One obstacle here is that a program might contain a while (true), but it’s actually easy to make sure that a generated program always terminates. Add the following device to each generated program:

var fuel: u32 = 1_000_000;
fn consume_fuel() void {
    fuel -= 1;
    if (fuel == 0) @panic("out of fuel");
}

and then, whenever generating a looping construct or a function, also generate a call to consume_fuel.
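
For example, a generated function might come out looking like this (the name is invented):

fn generated_fn_42(x: u32) u32 {
    consume_fuel(); // function entry
    var n = x;
    while (true) {
        consume_fuel(); // loop body: even `while (true)` now terminates
        if (n == 0) break;
        n -= 1;
    }
    return n;
}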

5 Likes

I guess a more principled issue with running programs is that it’s just slow: the effectiveness of fuzzing is directly proportional to how fast you can get it going, and if you have “spawn a process” in your inner loop, that’s going to be a sizable fixed cost.

That’s actually another bit of unsolicited feedback re: testing compilers. The path of least resistance is to say that a single compiler test is a self-contained program which you just compile and run, but that can actually be quite slow, as you need to burn two processes to make it happen (the compiler process, and then the program under test). A faster way is usually to say that a single compiler test is a

test "compiler test №16492" {
    // code here
}

and then create just a handful of CUs by concatenating a bunch of such individual tests into one giant file which is compiled and run only once.
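
Concretely, the concatenated compilation unit might look like this (test names illustrative):

test "compiler test №16492" {
    // code here
}
test "compiler test №16493" {
    // code here
}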

2 Likes

It would be pretty great if there were an API hook for this kind of AST transform. Someone recently started a thread about writing a fuzzer for Zig, and hooking in at the AST level is the best way to do something like that.

Now that the build system is starting to be fleshed out, it might be a good time to add something like that as an experimental feature accessible through a build step.

That might exist already, for all I know.

I wonder if we can make use of the Collatz conjecture here. Have a script that performs a certain transform on a piece of code based on an integer. Starting from a random seed, each iteration would apply the transform for the current number, then move to the next number in the Collatz sequence until 1 is reached.
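
A sketch of how that could look (num_transforms and applyTransform are hypothetical):

const num_transforms: u64 = 8; // hypothetical count of available transforms

fn applyTransform(index: u64) void {
    _ = index; // stub: mutate the code according to the chosen transform
}

/// The seed picks a transform at each step, and the number of steps is
/// the length of the seed's Collatz orbit, which (conjecturally)
/// always reaches 1 and therefore terminates.
fn collatzTransforms(seed: u64) void {
    var n = seed;
    while (n != 1) {
        applyTransform(n % num_transforms);
        n = if (n % 2 == 0) n / 2 else 3 * n + 1;
    }
}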

It does: zig/lib/std/zig/render.zig at 20563d84577203891637477b2e475a9b279152ac · ziglang/zig · GitHub

It’s used by zig reduce.

2 Likes

“Properly Testing Concurrent Data Structures” seems relevant as well, for expanding this strategy into also testing multi-threading.

I’ve thought about that as well, but there’s a catch: that approach requires you to manually annotate all concurrency-relevant sections. That is easy in Rust, because only a few types are thread-safe, and the compiler protects you from accidentally sharing non-thread-safe stuff.

In Zig, I think the biggest concurrency problem we are going to face would be of the shape “we are accidentally sharing this int between threads, so it must be atomic”, and that’s not something you can instrument up-front.

In other words, the approach from the post works for logical race conditions, but not for physical data races.

Though: half of the Rust checking relies on essentially deeply checking the types passed to thread.spawn to make sure they don’t contain non-thread-safe stuff. This could also work in Zig: in spawn, we can comptime-reflect on the shape of the args tuple and error if, for example, it contains a pointer to a non-atomic int or some such. The other part, though, is that Rust also tracks aliasing, so it can also check that no one else points at the args array.
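
A minimal sketch of what that spawn-time check could look like (checkSpawnArgs is hypothetical; a real version would recurse through nested structs, slices, and so on):

const std = @import("std");

/// Reject an args tuple containing a mutable pointer to a plain
/// (non-atomic) integer. Called as `comptime checkSpawnArgs(@TypeOf(args))`
/// from a hypothetical spawn wrapper.
fn checkSpawnArgs(comptime Args: type) void {
    inline for (std.meta.fields(Args)) |field| {
        const info = @typeInfo(field.type);
        if (info == .pointer and !info.pointer.is_const) {
            if (@typeInfo(info.pointer.child) == .int) {
                @compileError("spawn arg '" ++ field.name ++
                    "' is a mutable pointer to a non-atomic integer");
            }
        }
    }
}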

1 Like

Hmmm, well, tools like Thread Sanitizer work fairly well for catching physical data races. I think logical race conditions are the main challenge left over.

1 Like

Reminds me of Go’s txtar. It seems to solve a very similar problem.

Worth considering something like Russ Cox’s bisect tool for Go, which is already similar to zig reduce in a few ways.

In fact, I needed to read the zig reduce source to be sure it wasn’t doing everything that gets covered in the article.

The part I wanted to draw attention to is taking two versions of a function, and distributing them through every place in the code where that function is called. This can work with every function which changed from A to B, and isolate a stack trace with the specific new function which triggers the bug, and the context in which that happens.

Or to take a pull quote from the article:

That is, we can run binary search on a different axis: over the program’s code, not its version history. We’ve implemented this search in a new tool unimaginatively named bisect. When applied to library function behavior like the timer change, bisect can search over all stack traces leading to the new code, enabling the new code for some stacks and disabling it for others. By repeated execution, it can narrow the failure down to enabling the code only for one specific stack

So, very similar to what zig reduce does already. The difference is that the input to zig reduce is a program and an ‘interestingness’ checker, while the input to this kind of bisection is two editions of a program which differ in some particulars, and the output is a specific stack trace, making use of some combination of the old and new code, where the interesting error happens.

So, to use this with incremental compilation: the old edition of the program is a whole-program compilation, which is not interesting, and the new one is an incrementally-compiled version of the same source code, which is interesting.

This can then be further narrowed and permuted to identify what subset of available incremental combinations will produce interestingness. As in, if you have A->B->C, that can be reduced to AB->C and A->BC.

This could be pretty powerful for diagnosing why and how an incremental compilation, which should arrive at the same state as a whole-program compilation, actually fails to.

This is tricky to do after the fact, but instrumentation can collect this data fairly straightforwardly, by making deltas from every version of a project which parses and then compiles, to the next version.

It would be possible to brute-force a large diff into a series of small ones, but there the difference is that the temporal information is lost, and can’t be retrieved, so you’ll end up with paths from A to B which were never taken by the program. Since the invariant is “incremental compilation applied in any order results in a program with the same behavior”, that might be either ok or good, but it’s also a case where it’s plausible to identify a bug, then find a different one, and need to hold onto it somewhere while the program finishes searching for the original one.

Done naively, this can be a very large search space as well. For example, it might need to find a transition in which fifteen instances of a token all change at once, in order to have a minimal delta which compiles. Memoization and structure-aware optimizations can help with that, at the cost of more complexity.

There’s room for both, I suspect.