Optimize runtime checks away with double compilation

Hello :))

I am building a Discrete-Event simulation of a social network (cannot share the code to it yet..), and this idea came to me. To schedule the time between events, I use runtime dynamic dispatching reading from a json config file (see the distributions library, the json part of the README for more context) and to give the user the choice if the traces should be written to file or not. Let’s focus on the file example for a minute, my code is filled with:

if (simconf.trace_to_file) { // write the trace }

which cannot be optimized, as the variable trace_to_file is read from a json at runtime. Also, I cannot embed the config file into the build system, as I want this to be just a single executable, the end user should be able to just pass the desired configuration and the simulation to run —make them run zig build is not an option.

Now, an idea crossed in my head… Could I make a Zig program that received the configuration file, read it ({"trace_to_file": false}) and knowing that information, could compile the simulation removing the if (simconf.trace_to_file) checks?

A silly concrete example mimicking the situation I am at:

const std = @import("std");
const Io = std.Io;

// json equivalent struct
const Conf = struct {
    write_to_file: bool,
};

pub fn main(init: std.process.Init) !void {
    const arena: std.mem.Allocator = init.arena.allocator();

    const args = try init.minimal.args.toSlice(arena);

    if (args.len <= 4) {
        std.debug.print("Usage: n a b config.json\n", .{});
        std.process.exit(1);
    }

    const n = try std.fmt.parseInt(u64, args[1], 10);
    const a = try std.fmt.parseInt(u64, args[2], 10);
    const b = try std.fmt.parseInt(u64, args[3], 10);

    // load json
    const content = try std.Io.Dir.cwd().readFileAlloc(init.io, "conf.json", arena, .unlimited);
    defer arena.free(content);
    const options = std.json.ParseOptions{ .ignore_unknown_fields = true };
    const parsed_result = try std.json.parseFromSlice(Conf, arena, content, options);
    const conf = parsed_result.value;
    // In order to do I/O operations need an `Io` instance.
    const io = init.io;

    var stdout_buffer: [1024]u8 = undefined;
    var stdout_file_writer: Io.File.Writer = .init(.stdout(), io, &stdout_buffer);
    const stdout_writer = &stdout_file_writer.interface;

    try stdout_writer.print("Computing the {d}-th fibonacci number starting with ({d}, {d})\n", .{ n, a, b });
    try stdout_writer.flush();

    const nth = try fibonacci(init.io, a, b, n, conf);

    try stdout_writer.print("nth: {d}\n", .{nth});
    try stdout_writer.flush(); // Don't forget to flush!
}

/// Computes the n-th fibonacci number
fn fibonacci(io: std.Io, a: u64, b: u64, n: u64, conf: Conf) !u64 {
    var buffer: [64 * 1024]u8 = undefined;
    const file = try std.Io.Dir.cwd().createFile(io, "./fib.txt", .{ .read = false });
    defer file.close(io);
    var file_writer = file.writer(io, &buffer);
    const writer = &file_writer.interface;

    var f_i2: u64 = a;
    var f_i1: u64 = b;
// AIMING TO REMOVE THIS
    if (conf.write_to_file) { 
        try writer.print("{d}\n", .{f_i2});
        try writer.print("{d}\n", .{f_i1});
    }
    for (2..n) |_| {
        const f_i = f_i2 + f_i1;
        // AIMING TO REMOVE THIS
        if (conf.write_to_file) {
            try writer.print("{d}\n", .{f_i});
        }

        f_i2 = f_i1;
        f_i1 = f_i;
    }
    // AIMING TO REMOVE THIS
    if (conf.write_to_file) {
        try writer.flush();
    }

    return f_i1;// AIMING TO REMOVE THIS
}

The point would be that the if inside the fibonacci code could get optimized away by making the two programs setup.

Idk if I am just delulu, but I don’t know where even to stat to try to test this, and that’s why I am asking here :smiley:

1 Like

Have you actually benchmarked a difference, and is it enough to care?

To answer the actual question of how you would do this: compiling the program at runtime is a way to do this, but it would add considerable delay to the execution which probably defeats the entire point.

Instead, your functions can take a comptime log_enabled: bool, and higher up the call stack if (runtime_log_enabled) foo(true, ...) else foo(false, ...). However, this could greatly increase binary bloat, though perhaps the optimiser is smart enough to deduplicate much of it.

Alternative would be making the logging faster, one way to do that would be to only write to the file at key points where the delay is acceptable, or perhaps in another thread, or something else entirely.

6 Likes

No! I just thought this could be a cool way to do it, idrc if its worth it or not :smiley:
Probably for just the trace writing it’s not worth it, if you take into account all the sampling from distributions that involve a vtable dereference (which I hope this method would optimize away) maybe? This is my master thesis, but it’s in statistics and OR, so they don’t expect me (and I don’t have time to haha) do a very serious performance benchmarking. Once I have hand it in I will benchmark it for sure, I am very curious.

I did not give as much context as needed to accurately answer the “defeating the point”, but the simulation has to be ran thousands (or tens of thousands) of times and it can take form a few miliseconds (very simple topologies and short timespans) to several hours with a lot of users. In the latter case any gain would be absolutely welcomed. Again, this question aimed to this set up is possible and how (which no actuall idea even when to start, that’s my reasoning behind giving a very simplification of one of the optimize things whit the fibonacci)

Hmmm yes, for just one boolean would totally work. You mean that the binary should just contain two functions, one with the writes and another one without them, right?

Regarding this yeah, I already though about having one thread just writing to a file and the main thread just going and copying the trace of the simulation! But, again, the thing that I might be more able to gain from is the runtime dynamic dispatch from the distribution sampling.

Thank you for your answer tho! If you feel more context is needed I can upload the whole code, but I would need some time to get it somewhat more presentable hahah

4 Likes

Okay, if someone stumbles with this into the future, I made a proof of concept for this small example

Essentially:

  • Program A: normal zig project, reads the arguments and checks the file exists. then spawns a subprocess that calls zig build run -Dconfig=filename -- 10 0 1
  • Program B: the same as the example but now the cofiguration is compiletime!

The result binary has actually less size, so I’d say it works. My problems (and further help i am asking) with this are:

  1. Spawning a child process: meh, i feel it’s ugly. Can I call zig from zig? should I add the compiler itself as a dependency?
  2. Two actual programs: I am ABSOLUTELY sure this can be done in a single build.zig, but I am not seeing how xd
1 Like

This seems a bit like partial evaluation, moving things that are normally runtime towards comptime.

I imagine once the incremental hot-reload is implemented, it might be possible to add a feature that basically dynamically re-optimizes the code based on seen observed values (similar to some of the optimizations popularized via Javascript jit compilers), also seems similar to profile guided optimization automatic profile guided optimization in release modes based on test cases · Issue #237 · ziglang/zig · GitHub, just instead of using tests as a source for profiling data you could use the running program.

Once you have a compiler that can essentially be used as your application launcher that also is able to hot-reload recompiled code, then the zig build run process could be seen as the entrypoint, which through communicating with the compiler could basically communicate profiling information back to the compiler which then could use that info to generate better optimized code and inform the running application to swap in the performance improved code. If then some data that was assumed to be constant by the profile changes, the code would have to have guards around it that detect that and invalidate the current version going back to the un-optimized version, until new profiling data has been collected (based on the unexpected data changes).

So overall I would say, I think my preferred solution would be one that is partially supported and intended by features provided by the compiler and implemented in a way that is mostly like an automatic optimization that happens transparently to the user, without needing manual code changes. (At least not far beyond what is needed for applications that use the (future) hot-reloading and compiler-communication-protocol)

I also think once we have advanced hot-reloading, people will naturally ask for optimizations like this (using runtime-knowledge for dynamic optimizations). Basically I think something like this is better implemented mostly automatically and that it is a bit early for such a feature. My personal need for the feature also isn’t high enough to try the manual approach.


That said I wonder if you could implement the manual approach by having two different implementations for a single module, the first implementation would receive the commandline arguments and verify them, then generate the second implementation which reads the arguments and panics if they don’t match, then runs the second impl by passing the second impl as a build option that defaults to the first implementation when it isn’t provided?

3 Likes

I’ll answer the first part of your answer later, gotta reflect on it and understand it better haha

This is exactly what I am trying to do yes! I find the double program not very elegant, and I am sure the build system has more possibilities.

yep, this is essentially the same as the repo is doing with two projects. What I am struggling to wrap my head around is my self imposed limitation that the user of this program should not have to run the Zig compiler, and be able to run the binary on it’s own. So running the final binary will:

  1. Read the arguments and validate them.
  2. Invoke the compiler with the arguments to compile the logic.
  3. Compile the program with the arguments at runtime.

To do step two in a single build.zig should I add Zig as a dependency of my program as if it were any other dependency? Is that even a good idea?

Thank you for your answer! :smiley:

1 Like

https://codeberg.org/ziglang/zig/src/branch/master/build.zig.zon

// The Zig compiler is not intended to be consumed as a package.
// The sole purpose of this manifest file is to test the compiler.

So I don’t think it is a good idea, also as a user I wouldn’t want some program to start a bootstrapping build of a compiler without my knowledge (maybe once llvm is completely unneeded and Zig builds really fast).

My question is:
Is this something that your users really want, or is it something that you really want to do?
Does this really improve the user experience?

Only if it vastly improves the user experience it is worth doing this (in my opinion).
And in that case I would want the option to specify how the program obtains a working Zig compiler, for example you could have a config and if it doesn’t exist the program could ask whether the user wants to provide a Zig compiler or whether the program should auto-install one.

Have you measured that this optimization is even worth it in your case?

Why? I think this isn’t a particularly good goal.
If you just use zig build run this becomes way easier and it might even make people curious about Zig.

I guess I don’t like it because it hides stuff from the user and over complicates things.
Wanting to create something which is a single program, contains both a compiler and the source for the target program, recompiles that and then auto runs that, seems like it is doing too much.

It seems more to me like your program has to be partially seen like a development environment you may not expect your users to write code but other than that it seems similar to something like vscode which downloads a zig compiler runs it to compile the program and the runs the program.

So I think it makes sense to have 3 programs, one that serves as launcher, config manager and compiler (by invoking it, which is the 2nd program) and then the third program that is actually run by that launcher.
Seems clearer to me than putting all of that into the same executable. (Also less likely to get flagged as malware because of suspicious looking self-modification)

The launcher can contain the info where to get the other things and download them or the launcher plus a vendored Zig in an archive or installer (if you want to provide a fully contained download).

2 Likes

Thank you for showing me that, I am now absolutely convinced this is not the way to go then :))

I am my own (and only!!) user, so technically both are true hahaha. In more seriousness, I am just enjoying myself being curious and testing if this can be done.

I absolutely agree, no buts. Literally this is such a weird thing to do, I would not like to be the next developer of a project that does something like this without a very very good reason to do it.

That’s a good one, thank you! I am probably going to just run zig version in entrypoint.zig and ask the user to install it itself. If not, maybe I could embed the zig compiler as a binary inside my binary, install it in /tmp/... so I could make sure the program works everywhere, but I agree with you: to install a compiler and run it without users acknowledgment does not feel right.

I’d say it might be a yes in my case, that’s why I am taking a look at it with more seriousness and not leaving it just as a thought experiment. I’ll provide exact numbers and benchmarks between today and tomorrow as this is the main question, but with the exact simulation I have to run. The TL;DR of my program is that both writing to file and the VTable deference to implement polymorphism that I am using are either a very heavy operation (there are a lot of file writes) or super common (inside the hot loop).

Small example: to run 10K users (the dataset i can run on my laptop) there are around 600K calls to a distribution, which is characterized with a VTable to be able to be defined at runtime with a json, as I might want to try distinct metrics. The simulation in the server runs with 1M users, and grows in $O(n)$ time and $O(n^2)$ events processed, as the more users there are, the denser the network is. Approx then, we are skipping the dereference approximately 6.000.000.000 = 6·10^9, which is absurdely big number, so my instinct tells me that it will probably compound, like a 2-3% faster execution times! This is just for one run, and to have statistically significant results i need, well, between 100 and 1000.

When the performance benchmark is done, I’ll let you know if my instincts were right! Thank you for your answer :))

2 Likes

I am a bit puzzled why you need a vtable, it sounds to me like a tagged union would be enough, because you own all the code involved?

Basically use the tagged union to represent the different possibilities which are then selected via the json. A tagged union may result in better optimized code. (Once we get restricted function pointers I would expect them to be similar)

1 Like

Hmmm okay at this point let me share you the code we are talking about while I expand in this post:

The simulation uses a paradigm called Discrete-Event simulation, where events will be generated according to the descritpion of statistical quantities. As this is a tool to research, the program must give a way to its users to change those distributions without recompiling, and as I needed weirder distributions than the ones implemented on the std, I did the distributions repository.

In there, there are two types of polymorphism, the VTable Rust trait style, which has this VTable implemented by all the distribution types.

pub fn VTable(comptime Precision: type) type {
    return struct {
        sample: *const fn (dist: *const Distribution(Precision), rng: Random) Precision,
        format: *const fn (dist: *const Distribution(Precision), writer: *Io.Writer) std.Io.Writer.Error!void,
    };
}

and a UnionDistribution (src/UnionDist.zig), which implements the sample like this:

pub fn sample(self: *const Self, rng: Random) Precision {
    switch (self.*) {
        // generates this:
        // .constant => |*c| return c.sample(rng),
        // .exponential => |*exp| return exp.sample(rng),
        // .uniform => |*unif| return unif.sample(rng),
        // ...
        inline else => |*dist| return dist.sample(rng),
    }
}

And now, I must apologize as I did something wrong all this time. I implemented this on march, and I started working on the simulation. Apparently the VTable was never used for this… and I just realized now writing this post xd

Not only this makes more sense, but past me agrees with you xddd. I must have assumed it was the VTable, never recheck it again and gaslight myself into believing it was a VTable, I’m sorry xD

So the potential optimizations are to skip billions of switch cases per run, not dereferences to the VTable. Now that we are here, let’s just go into the details.

In des-ctic-dev/src/simulation.zig the configuration gets passed by parameter, and every call to simconf.quantity.sample() has to go through the switch, apart of the if (simconf.trace_to_file) clauses that check if the trace should be written or not. The argument —while less efective than with a VTable— is the same: if inlined, this small stuff could accoumulate into big savings! The code is halfway there to be benchmarked :slight_smile: Thank you!!

2 Likes

My next attempt at improving the performance wouldn’t be to optimize the performance by getting rid of the switch case / union completely, but instead I would try whether the api can be changed from sample returning a single sample to it calculating many samples.

So basically the question: can you make your hot loops more data oriented so that they operate on many elements at once, without asking repeatedly for each sample what method of sampling to use?

I think you might be able to flip it around that your code can call into a calculateSamples(iterator or slice-of-things-that-need-calculating) function where you switch on the method of sampling once and then calculate all or many samples at once. CppCon 2014: Mike Acton “Data-Oriented Design and C++”

Basically the grouping which things use which method should already be done.
Some of the older talks about data-oriented programming talked about not writing functions like drawPoint, only functions like drawPoints(slice-of-points).

So that you never create situations where you repeatedly choose the same choice over and over again and instead make the choice once for the entire group.

2 Likes

Here’s an interesting option:

-     const nth = try fibonacci(init.io, a, b, n, conf);
+     const nth = if(conf.write_to_file) 
+         try fibonacci(init.io, a, b, n, .{.write_to_file=true})
+     else
+         try fibonacci(init.io, a, b, n, .{.write_to_file=false});

- fn fibonacci(io: std.Io, a: u64, b: u64, n: u64, conf: Conf) !u64 {
+ fn fibonacci(io: std.Io, a: u64, b: u64, n: u64, comptime conf: Conf) !u64 {

Your binary will have two copies of fibonacci(), one for each version of conf that is passed in, and the .write_to_file=false version will optimize away the dead code.

2 Likes

You can wrap these things up in a struct:
I don’t know the exact structure of your project, so it might be a bit harder to do, but let’s say that I have one file with 10 functions that care about this:

fn wrapperStruct(comptime boundsCheck: bool) type {
    return struct{
    // Your functions go here, and they have access to the global
    };
}

Then at runtime you would have to put a switch at some point to decide which version of this struct to call.

I found wrapping things in structs generally reduces boilerplate of passing comptime stuff around.

First, the CppCon video is crazy!!! have already seen it but it’s so enlightening, i am going to rewatch it for sure.

Now,

Nice idea, thank you :)) Tho “techincally” possible, it breaks the simulation in two very subtle ways:

  1. Dependence on the pre-samples random number size for reproducibility: the whole point of DES is to see how the mechanics generate, and to pregenerate all the quantites per buffer will be possible and reproducible, if someone decided to change that could be a bit of a problem.
  2. User inner sampling: every user gets randomly samples from empirical pareto distributions, so I would need essentially an arbitrary number of list of random numbers, and I would need to check at runtime where those numbers at.

So I will look into this as it might work, but I don’t quite see being a easy and short api rewrite due to the nature of the simulation in itself.

Thank you for the idea, it’s totally plausible for the small example I provided if you are willing to let your binary a little bit more bloated, as well as the first answer to my post already considered this approach.

The thing is that the real program has several calls to a polymorfic sample() function driven by a switch, and there can be a whole lot of options there, so I discarded this option right away as unfeasible for my use case —and imo i don’t like it that much taste wise, if there is a solution that does not involve duplicating code I will take it first. Thank you for your answer :))

My knowledge of statistics is pretty shallow, so I don’t really know how feasible it is in that area.
But maybe you find a way to shuffle things around in a way so that you can group up operations more, not necessarily changing the algorithm but finding a way to have batches.

Another thing, have you tried comparing different random number generators?

TBH no, i just focused on the implementation of the distributions. I am using the default one!

yep, got the idea, but I suspect is colliding very heavy with the design part of the simulation. To have a good answer I would have to actually try to give it a go or deeply reflectionate for a while :))

As both @Sze and @vulpesx asked for them, I proceeded to code and run a benchmark of two different programs:

  • general: takes the json as runtime, every if is if(config.trace_to_file) is at runtime
  • specific: no json, inits the struct and calls a “calibrate” method which fills it with the adequate parameters.

The benchmark is structured in the following way: we test

  • generic, no write to file.
  • generic, write to file
  • specific, no write to file
  • specific, write to file

Each of those in ReleaseSafe, ReleaseFast and ReleaseSmall. I paralelize the workload with 4 workers within 500 total runs.

The results are the following tables (llm generated to look pretty)

General:

Compile Trace Mean (ms) Std (ms) CI 95%
ReleaseFast notrace 333.2 14.8 [331.9, 334.5]
ReleaseFast trace 348.4 14.4 [347.2, 349.7]
ReleaseSmall notrace 379.0 15.8 [377.6, 380.4]
ReleaseSmall trace 405.5 16.1 [404.1, 406.9]
ReleaseSafe notrace 743.0 18.2 [741.4, 744.5]
ReleaseSafe trace 751.4 21.2 [749.6, 753.3]

Specific:

Compile Trace Mean (ms) Std (ms) CI 95%
ReleaseFast notrace 338.4 14.7 [337.1, 339.7]
ReleaseFast trace 348.7 13.7 [347.5, 349.9]
ReleaseSmall notrace 375.0 19.7 [373.2, 376.7]
ReleaseSmall trace 392.1 16.6 [390.6, 393.5]
ReleaseSafe notrace 731.1 17.8 [729.6, 732.7]
ReleaseSafe trace 735.8 17.2 [734.3, 737.4]

Comparison (delta and improvement is general - specific)

Compile Trace Δ (ms) % impr. p-value Cohen’s d
ReleaseFast notrace -5.3 +1.55% <0.000001 -0.357
ReleaseFast trace -0.3 +0.08% 0.742567 -0.021
ReleaseSmall notrace +4.0 -1.07% 0.000392 +0.225
ReleaseSmall trace +13.5 -3.43% <0.000001 +0.823
ReleaseSafe notrace +11.8 -1.62% <0.000001 +0.657
ReleaseSafe trace +15.6 -2.12% <0.000001 +0.808

The 100K (which has much more loops) ill upload them when the benchmark finishes. The test run is a t-test with sample variances.

The highlight of this analysis should probably be that when compiling in ReleaseSmall or ReleaseSafe there is a statistical significant change, even tho the quantity of that change ranges just from 1.5 to 3.5% faster in the specific againts its generic counterpart. In ReleaseFast the compile time is not worth it, as the generic implementation is the same or better than the specific one.

The code can be found in this branch of the repo

I did not jump into the assembly to check why ReleaseFast does not care about the runtime performance, but if someone has some idea i would love to hear about it :))