Shock physics benchmarks for Zig (2771 SLOC and 234 SLOC)

Hi,

I’ve just finished coding up and debugging a 1D and 3D shock physics benchmark for Zig:

The larger 3D Zig benchmark runs about 15% faster than the C reference implementation compiled with gcc -O3 on my machine (which is faster than gcc -O2). I used the aarch64 tarball off the download page to do my run. I notice that there are a lot of unexploited “low hanging fruit” peephole optimization opportunities in the generated FP assembly, e.g. fmadd/fmsub/fnmsub. Zig can only get much faster from here for science apps, and it is already handily beating GCC.

The smaller 1D benchmark can be visualized by redirecting the output to a file, then issuing a “plot filename” command in gnuplot.

There are still some issues with exactly matching the Zig numerics to the C reference implementation numerics, which I will attempt to work through over the next few days. I will also likely post a parallel implementation some time over the next few weeks.

I will not accept pull requests for the next few days, as I have a list of cleanup activities and Zig language improvements I still need to implement. I was just so excited about the great performance vs GCC that I wanted to post early results.

Thanks to @squeek502 and @floooh for suggestions that made my implementation/debugging go faster.

In all such benchmarks I think you should keep in mind that you are not necessarily comparing Zig vs C, but rather LLVM vs (a potentially older version of) GCC.

To get accurate comparisons, you should always try to compile it with zig cc to get a C compiler with the same optimization backend.

Thanks. I thought that when I downloaded the aarch64 tarball from the Zig download page, that it would use the backend code generator that was not tied to LLVM. In fact, if it is LLVM doing the optimization, then it is doing a very bad job of it. The fmadd/fmsub/fnmsub FP opcodes are dirt simple peephole optimizations on an ARM architecture, and they are not being generated.

Here is how I am compiling the 3D benchmark:

zig build-exe -OReleaseFast lulesh.zig

Did you also use the -mtune=native option? I think most C compilers default to a “generic” version of the instruction set on your machine, whereas Zig optimizes for the exact architecture you’re compiling on.

Also, I think -march=native is important on vanilla clang (not quite sure how it overlaps with -mtune=native though). But IME most performance differences between building with Zig (with the LLVM backend) vs building with vanilla Clang (and no matter if compiling C or Zig code) come down to different default build options, since the most important optimizations all happen down in LLVM, not in the frontend.

It might be -march=native, I don’t use that flag very often.

I tried the following, and still no fmadd, fmsub, or fnmsub being generated:

zig build-exe -OReleaseFast -target native -mcpu native lulesh.zig

“zig build-exe” doesn’t seem to recognize the -march flag. These instructions are available on all AArch64 chips, so targeting native shouldn’t strictly be necessary.

I’m guessing that “zig build-exe” didn’t enable these optimizations due to potential numerical differences, but that is just a guess. The performance difference is large when fmadd, fmsub, and fnmsub are enabled. I can’t get it to work.
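The guess about numerical differences can be illustrated with a minimal sketch. It assumes the default float mode is strict (as the @setFloatMode documentation describes) and uses the @mulAdd builtin to request a fused multiply-add explicitly. The function names and input values are hypothetical, picked so that the product a*b lands exactly halfway between two f64 values.

```zig
const std = @import("std");

// Unfused: a*b is rounded to f64 first, then the add is rounded again.
fn unfused(a: f64, b: f64, c: f64) f64 {
    return a * b + c;
}

// Fused: @mulAdd computes a*b + c with a single rounding at the end,
// which is what an fmadd instruction does.
fn fused(a: f64, b: f64, c: f64) f64 {
    return @mulAdd(f64, a, b, c);
}

pub fn main() void {
    // (1 + 2^-27) * (1 - 2^-27) = 1 - 2^-54, which is not representable
    // in f64 and rounds to exactly 1.0 when the product is rounded alone.
    const a: f64 = 1.0 + 0x1.0p-27;
    const b: f64 = 1.0 - 0x1.0p-27;
    const c: f64 = -1.0;

    // unfused yields exactly 0.0; fused keeps the -2^-54 residual.
    std.debug.print("unfused = {e}\nfused   = {e}\n", .{
        unfused(a, b, c),
        fused(a, b, c),
    });
}
```

Because the two forms can differ in the last bits like this, a compiler in a strict mode has a reason not to fuse on its own, which would be consistent with the missing fmadd/fmsub/fnmsub instructions.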

Try using @setFloatMode(.optimized): https://godbolt.org/z/36vGEbGec.

Native is the default, regardless of whether you use zig build-exe or zig cc.

I added this as the first line of my file, and still no luck:

comptime { @setFloatMode(.optimized); }

Thanks for the suggestion.

UPDATE: I added it to every single function, and that seemed to work. Thanks. For 32-bit FP, it is 17% faster than GCC, and for 64-bit FP it is 42% faster!!

I think the implementation of @setFloatMode may be bugged, or the documentation is wrong. I haven’t been able to set the float mode using a comptime block.

You may just have to add it to the top of every applicable function.

I pushed the changes for ZigShock/3D/lulesh.zig back to the repo. As I said, I changed the datatype to f64 in Zig, and comparing that against the C reference implementation using doubles seems to indicate a 42% speedup over gcc on my machine. I feel like I must have done something wrong, but I don’t think so.

PS It would certainly be nice if I could use “comptime { @setFloatMode(.optimized); }” at global scope rather than change every function manually. That would make it easier to compare optimized numerics to strict IEEE 754 numerics.

You could use:

const std = @import("std");

// Single switch for the float mode used by every function below.
const mode: std.builtin.FloatMode = .optimized;

pub fn someOperation(a: f64, b: f64) f64 {
    @setFloatMode(mode);
    return a + b;
}

So that you have a single place where you can switch it, or you also could add a build-option for that to your build.zig.

The floating point mode is inherited by child scopes, and can be overridden in any scope. You can set the floating point mode in a struct or module scope by using a comptime block.

But the documentation states that it should apply to child scopes, so I wonder why it doesn’t seem to work for you.
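One variation on the single-constant idea, sketched here as an assumption rather than something taken from lulesh.zig, is to pass the float mode as a comptime parameter. That way the strict and optimized versions of a kernel can live in the same binary and be compared directly. The dot function and its inputs are hypothetical stand-ins for a real kernel.

```zig
const std = @import("std");

// Hypothetical kernel: a dot product whose float mode is chosen by the
// caller at compile time, so each mode gets its own instantiation.
fn dot(comptime mode: std.builtin.FloatMode, xs: []const f64, ys: []const f64) f64 {
    @setFloatMode(mode);
    var sum: f64 = 0.0;
    for (xs, ys) |x, y| {
        sum += x * y;
    }
    return sum;
}

pub fn main() void {
    const xs = [_]f64{ 0.1, 0.2, 0.3, 0.4 };
    const ys = [_]f64{ 1.0e8, -1.0e8, 3.0, 7.0 };

    const strict = dot(.strict, &xs, &ys);
    const fast = dot(.optimized, &xs, &ys);

    // Any difference depends on the target and build mode, so none is
    // claimed here; the point is only that both variants can be measured.
    std.debug.print("strict    = {d}\noptimized = {d}\ndiff      = {e}\n", .{
        strict,
        fast,
        fast - strict,
    });
}
```

This still needs one @setFloatMode call per function, so it does not replace a working module-scope setting, but it does make the strict-versus-optimized comparison a one-line change at the call site.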

Good idea. Thanks. Pushed back to repo.

The global comptime approach is definitely broken. I encourage you to grab ZigShock/3D/lulesh.zig from the HPCguy/ZigShock repo on GitHub, delete my tags, and try the global comptime. I would love to hear that I messed up!

I was talking about the C program; Zig does that automatically.

Thanks. I’ve already been using that for the C program, but with GCC.
