Running benchmarks with the build system

Hi, I’m trying to run benchmarks (and then post-process the data they generate) using the build system. In theory I have the entire pipeline down; however, all my benchmarks run at the same time, which means they all interfere with each other.

So my concrete question is: is there a way to specify that a run step should run by itself (in parallel with nothing)?

2 Likes

Just make them all depend on each other in a big chain with std.Build.Step.dependOn().
If steps only depend on the top-level step, the build system takes that to mean that they can be run in parallel with one another.
If they depend on another step, they’ll instead wait for that other step to complete before starting.
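A minimal sketch of such a chain, with illustrative names and roughly Zig 0.13-era `std.Build` API (this is not the asker's actual build script):

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const bench_step = b.step("bench", "Run all benchmarks, one at a time");

    // Hypothetical benchmark names; substitute your own sources.
    var prev: ?*std.Build.Step = null;
    for ([_][]const u8{ "bench_a", "bench_b", "bench_c" }) |name| {
        const exe = b.addExecutable(.{
            .name = name,
            .root_source_file = b.path(b.fmt("src/{s}.zig", .{name})),
            .target = target,
            .optimize = .ReleaseFast,
        });
        const run = b.addRunArtifact(exe);
        // Depending on the previous run step (instead of only on the
        // top-level step) forces the runner to execute them sequentially.
        if (prev) |p| run.step.dependOn(p);
        prev = &run.step;
        bench_step.dependOn(&run.step);
    }
}
```

Each run step depends on the one before it, so `zig build bench` executes the benchmarks one at a time even when the rest of the build runs in parallel.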

7 Likes

To provide an example of what @tholmes mentioned, here is something I did a while back to (essentially) run a bunch of tests and output their results to stdout (so I can compare them visually).

Here is the code : paella/build.zig at 46d5bdfc37b5ef6d10cb8be625aeb5dee48ee73e · asibahi/paella · GitHub

It is from an old Zig version and it has been a while since I wrote it, but it worked fine for what I wanted at the time.

If you mark a Run step as has_side_effects = true then it gains these properties (see impl of spawnChildAndCollect):

  • Locks the global stderr mutex, causing the step to be run in isolation from anything else that prints to stderr.
  • Uses “inherit” on stdin, stdout, and stderr of the Run step, so that the user can interact with the child process with the terminal.
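For reference, a fragment showing how that flag is set (assuming a `bench_exe` compile step was created earlier with `b.addExecutable`):

```zig
// Fragment, not a full build.zig: `b` is the *std.Build and `bench_exe`
// is assumed to be defined earlier.
const run = b.addRunArtifact(bench_exe);
// Take the global stderr lock and inherit stdio while the child runs,
// per the behavior described for spawnChildAndCollect.
run.has_side_effects = true;
```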

At first I wrote this comment as if it were a “yes” answer to your question, but looking more closely I think it’s not exactly modeling the use case. I think it’s a nice use case that deserves to have some build system API dedicated to it. Maybe there could even be some kind of mechanism for conveniently recording data points for long term storage, intended for use in collecting data points in e.g. CI runs over time and viewing them in a graph somehow. Maybe we can brainstorm that together here?

8 Likes

I’ve implemented this and it works, but it’s very fragile. It depends on the fact that the steps running after my benchmarks happen to require information from all benchmarks, and that the steps running before my benchmarks are probably already cached or all being compiled at the same time. It’s functional but not great. Regardless, thank you.

My current implementation for future reference:

    const global = struct {
        var benchmark_chain_start: ?*Step = null;
        var benchmark_chain: ?*Step = null;
    };

    { // Timing run. These run one at a time
        // --- snip ---

        if (global.benchmark_chain_start) |s| {
        // Ensure all benchmarks are compiled before the first one runs
            s.dependOn(&exe.step);
        } else global.benchmark_chain_start = &harness_run.step;

        if (global.benchmark_chain) |s| {
            harness_run.step.dependOn(s);
        }
        global.benchmark_chain = &harness_run.step;
    }
    { // Logging run. These run in parallel, before the first timing run
        // --- snip ---

        global.benchmark_chain_start.?.dependOn(&harness_run.step);
    }

I think it would be great if the build system had some mechanism for benchmarks.

For my use case I would be content with a mechanism to specify that a run step runs in complete isolation (aside from the build process).
It would be nice if the results were also cached, as rerunning a benchmark should not give wildly different results for the same binaries.

As to what you’re suggesting, I think running a binary many times in a simple harness that does some warmup runs and then reports the average runtime over N runs would be fairly useful. I see a few open questions, though:

  • What datapoints do you collect?
    • Linux is very open in this sense, but I believe on Windows there’s no equivalent to perf counters
  • What data do you store?
    • If you’re doing development and just checking that you didn’t regress performance, the results aren’t very useful to store. Similarly, storing results for every unmerged PR isn’t very useful. So most likely there should be some --store-bench-results flag
  • Adding to this, should data be stored in version control, or be local to the system?
  • Should there be custom harnesses?
    • In my (fairly advanced) use case I have a ptrace harness that collects a lot more data than this particular project finds interesting. Should that be considered in this hypothetical benchmark API, or should it remain custom work?
    • This is basically asking whether there should be a standard, extensible format to store data points in, or a simple one
  • What should be the plotting backend?
    • In my recent search for utilities to plot things, I’ve found a distinct lack of tools that ‘just’ accept a format and produce a nice plot. (If anyone knows a decent program that does this, I would love to hear about it)
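The warmup-then-average harness mentioned above could be sketched roughly like this; the `workload` function and the run counts are stand-ins, not anything from an actual harness:

```zig
const std = @import("std");

// Placeholder for the code under test.
fn workload() void {
    var acc: u64 = 0;
    for (0..100_000) |i| acc +%= i;
    // Keep the optimizer from deleting the loop entirely.
    std.mem.doNotOptimizeAway(acc);
}

pub fn main() !void {
    const warmup = 3;
    const runs = 10;

    // Warmup runs: prime caches, branch predictors, etc.
    for (0..warmup) |_| workload();

    // Timed runs, averaged.
    var total_ns: u64 = 0;
    var timer = try std.time.Timer.start();
    for (0..runs) |_| {
        timer.reset();
        workload();
        total_ns += timer.read();
    }
    std.debug.print("avg over {d} runs: {d} ns\n", .{ runs, total_ns / runs });
}
```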
1 Like

Windows does have perf counters. QueryPerformanceCounter function - Win32 apps | Microsoft Learn

1 Like

+1 for this general idea. I’ll have to go back and re-familiarize myself, but I found it difficult to run kcov in ‘combining mode’: the tests were all randomized, and as a result they’d miss two lines here, four lines there. It added up to 100%, but only produced that result maybe one in six times.

I seem to recall the issue was that everything ran in a temp directory and was only exported to zig-out when finished, so kcov wouldn’t consolidate prior runs, since every run looked like the first one to it.

So yeah, some kind of concept of a cumulative operation in the build system could be pretty handy. Benchmarks and regression testing, stochastic coverage tests: that’s a couple of use cases, and there are probably more.