Can anyone provide compilation speed comparisons for their project, 0.10.1 vs 0.11.x (dev version)?

I’d like to provide some performance data points in the release notes for the upcoming 0.11.x release. Does anyone have a medium-to-large sized project that has a branch that compiles with zig 0.10.x and a branch that compiles with zig 0.11.x so we can see the difference?

Note that simply checking out an old version of your project won’t be very interesting, because the code changes made since then would make the comparison unfair.

I’d also like to know about peak memory usage.

Do you have a preferred way that people time their compilations? Are you looking for statistics provided directly by the compiler or some sort of outside source? I think it may be helpful to give us a standard way of doing this so we can provide good data points.

Consider using poop (andrewrk/poop: Performance Optimizer Observation Platform), which reports all of the requested info.
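
For anyone who has not used it, a minimal sketch of a poop comparison follows. The script names `./build_old.sh` and `./build_new.sh` are placeholders, not files from this thread; each quoted argument is one complete command line, and the first one acts as the baseline that the later ones are compared against.

```sh
#!/bin/sh
# Placeholder commands: substitute the scripts that build your project
# with the old and the new compiler.
baseline='/bin/sh ./build_old.sh'
candidate='/bin/sh ./build_new.sh'

if command -v poop >/dev/null 2>&1; then
    # The first argument is the baseline; the other commands get a delta
    # column relative to it.
    poop "$baseline" "$candidate"
else
    echo "poop not found on PATH; would run: poop '$baseline' '$candidate'"
fi
```

Each script should delete its cache directories first, otherwise every run after the first measures a cache hit rather than a compilation.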

I dug out my Advent of Code repo (xxxbxxx/advent-of-code: https://adventofcode.com/ solutions in Zig):

Benchmark 1 (5 runs): /bin/sh ./alldays.10_1_stage1.sh
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          9.22s  ± 66.6ms    9.16s  … 9.32s           0 ( 0%)        0%
  peak_rss            977MB ± 8.03MB     971MB …  988MB          0 ( 0%)        0%
  cpu_cycles         30.8G  ± 76.8M     30.7G  … 30.9G           0 ( 0%)        0%
  instructions       42.6G  ± 6.16M     42.6G  … 42.6G           0 ( 0%)        0%
  cache_references   1.57G  ± 16.5M     1.55G  … 1.59G           0 ( 0%)        0%
  cache_misses        122M  ±  787K      121M  …  123M           0 ( 0%)        0%
  branch_misses       203M  ±  497K      202M  …  203M           1 (20%)        0%
Benchmark 2 (6 runs): /bin/sh ./alldays.9_1.sh
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.75s  ± 9.27ms    7.74s  … 7.77s           0 ( 0%)        ⚡- 15.9% ±  0.7%
  peak_rss            933MB ±  102KB     933MB …  933MB          0 ( 0%)        ⚡-  4.5% ±  0.8%
  cpu_cycles         25.3G  ± 47.6M     25.3G  … 25.4G           0 ( 0%)        ⚡- 17.8% ±  0.3%
  instructions       36.5G  ± 6.47M     36.4G  … 36.5G           0 ( 0%)        ⚡- 14.5% ±  0.0%
  cache_references   1.11G  ± 3.44M     1.10G  … 1.11G           0 ( 0%)        ⚡- 29.5% ±  1.0%
  cache_misses        105M  ±  469K      105M  …  106M           0 ( 0%)        ⚡- 13.9% ±  0.7%
  branch_misses       159M  ±  544K      158M  …  160M           0 ( 0%)        ⚡- 21.9% ±  0.4%
Benchmark 3 (7 runs): /bin/sh ./alldays.10_1_stage2.sh (generated exe non-functional)
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.28s  ± 46.4ms    7.24s  … 7.38s           0 ( 0%)        ⚡- 21.0% ±  0.8%
  peak_rss            288MB ± 6.34MB     284MB …  301MB          0 ( 0%)        ⚡- 70.5% ±  0.9%
  cpu_cycles         26.2G  ± 56.0M     26.2G  … 26.3G           0 ( 0%)        ⚡- 14.9% ±  0.3%
  instructions       32.8G  ± 4.32M     32.8G  … 32.9G           0 ( 0%)        ⚡- 22.9% ±  0.0%
  cache_references   1.52G  ± 10.8M     1.50G  … 1.53G           0 ( 0%)        ⚡-  3.3% ±  1.1%
  cache_misses       98.7M  ±  705K     97.7M  … 99.7M           0 ( 0%)        ⚡- 19.3% ±  0.8%
  branch_misses       208M  ±  262K      208M  …  208M           0 ( 0%)        💩+  2.4% ±  0.2%
Benchmark 4 (6 runs): /bin/sh ./alldays.11_0.sh
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.74s  ± 29.0ms    7.71s  … 7.79s           0 ( 0%)        ⚡- 16.0% ±  0.7%
  peak_rss            313MB ± 2.48MB     312MB …  318MB          0 ( 0%)        ⚡- 67.9% ±  0.8%
  cpu_cycles         28.3G  ± 30.9M     28.3G  … 28.4G           0 ( 0%)        ⚡-  8.2% ±  0.2%
  instructions       36.3G  ± 4.89M     36.3G  … 36.3G           0 ( 0%)        ⚡- 14.9% ±  0.0%
  cache_references   1.57G  ± 9.90M     1.55G  … 1.58G           0 ( 0%)          +  0.2% ±  1.2%
  cache_misses        121M  ±  608K      121M  …  122M           0 ( 0%)          -  0.9% ±  0.8%
  branch_misses       202M  ±  226K      202M  …  203M           0 ( 0%)          -  0.3% ±  0.3%

Using Zig builds from the download page (Download ⚡ Zig Programming Language) and these scripts:

```sh
$ cat alldays.9_1.sh
cd aoc.9
rm -r zig-cache zig-out */zig-cache
zig-linux-x86_64-0.9.1/zig build-exe 2019/alldays.zig --pkg-begin "tools" "common/tools.zig" --pkg-end
zig-linux-x86_64-0.9.1/zig build-exe 2020/alldays.zig --pkg-begin "tools" "common/tools.zig" --pkg-end
zig-linux-x86_64-0.9.1/zig build-exe 2021/alldays.zig --pkg-begin "tools" "common/tools_v2.zig" --pkg-end
```

```sh
$ cat alldays.10_1_stage1.sh
cd aoc.10
rm -r zig-cache zig-out */zig-cache
zig-linux-x86_64-0.10.1/zig build-exe -fstage1 2019/alldays.zig --pkg-begin "tools" "common/tools.zig" --pkg-end
zig-linux-x86_64-0.10.1/zig build-exe -fstage1 2020/alldays.zig --pkg-begin "tools" "common/tools.zig" --pkg-end
zig-linux-x86_64-0.10.1/zig build-exe -fstage1 2021/alldays.zig --pkg-begin "tools" "common/tools_v2.zig" --pkg-end
```

```sh
$ cat alldays.11_0.sh
cd aoc.11
rm -r zig-cache zig-out */zig-cache
zig-linux-x86_64-0.11.0-dev.4238+abd960873/zig build-exe 2019/alldays.zig --mod "tools"::"common/tools.zig" --deps tools
zig-linux-x86_64-0.11.0-dev.4238+abd960873/zig build-exe 2020/alldays.zig --mod "tools"::"common/tools.zig" --deps tools
zig-linux-x86_64-0.11.0-dev.4238+abd960873/zig build-exe 2021/alldays.zig --mod "tools"::"common/tools_v2.zig" --deps tools
```
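
As a reading aid for the delta column in the poop tables above: the figures appear to be the relative change of each mean against the Benchmark 1 mean. The sketch below redoes that arithmetic for two rows of the 0.11.0-dev run; `delta_pct` is a hypothetical helper name, and because it starts from the rounded means printed in the table, its results can differ from poop's by about a tenth of a percentage point.

```sh
#!/bin/sh
# Relative change of a candidate mean against a baseline mean, in percent.
# Usage: delta_pct <baseline> <candidate>
delta_pct() {
    awk -v base="$1" -v cand="$2" \
        'BEGIN { printf "%+.1f%%\n", (cand - base) / base * 100 }'
}

delta_pct 9.22 7.74   # wall_time in seconds: 0.10.1 stage1 vs 0.11.0-dev
delta_pct 977 313     # peak_rss in MB:       0.10.1 stage1 vs 0.11.0-dev
```

So on this project 0.11.0-dev takes roughly 16% less wall time and roughly 68% less peak memory than 0.10.1 with stage1.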

Edited the code blocks to add sh as the language for highlighting.

No rigorous measurement, but at TigerBeetle it feels like 0.11 is substantially (1.5x–2x) slower than 0.10 for us. Haven’t dug into that yet.

That’s really unfortunate. I’ll be keeping an eye on the situation. Any insights you dig up would be greatly appreciated.

For what it’s worth, here is the performance roadmap:

  1. Ditch LLVM for debug builds
  2. Incremental compilation, including serialization of compiler state
  3. Eliminate the call graph cycle of codegen backends calling into Sema (ziglang/zig issue #15899) and follow-up changes
  4. Run linker/codegen on a different thread
  5. Introduce a thread pool to semantic analysis

There has been a lot of effort going into (1) and (2) lately. Both of those are big sub-projects (see recent progress).

Did a tiny bit of looking into this. In particular, here are the repro commits:

That’s manual compilation through build-exe, to eliminate build.zig changes as the probable cause.

What would be the next step for profiling this? I tried perf, but there are no symbols.

EDIT: FWIW, master behaves like 0.11

One comparison that could be quite handy would be using callgrind. The basic usage is like this:

```sh
valgrind --tool=callgrind zig build-exe ...
```

This will dump some profiling data into the cwd, which can be analyzed in a few different ways; I personally enjoy using kcachegrind.

If you do multiple runs, kcachegrind will open them all up and show comparisons. Mainly, it will be interesting to see where most of the CPU instructions are spent in each run, and how the runs compare.

In order for this to work, you will need unstripped release builds of Zig. The binaries provided on the website are stripped. I’m happy to help with that if you need any assistance obtaining such binaries. It should be only a matter of passing -Dstrip=false to zig build.
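
A rough sketch of that workflow, under some assumptions: `profile_zig` is a hypothetical helper, the compiler paths and `main.zig` are placeholders, and the exact spelling of the release-mode option for building Zig itself depends on the Zig version, so only `-Dstrip=false` is taken from the post above. `--callgrind-out-file` is a standard callgrind option that names the dump, which keeps runs of different compilers from overwriting each other.

```sh
#!/bin/sh
# Hypothetical helper: run one command under callgrind and name the dump.
# Usage: profile_zig <label> <command> [args...]
profile_zig() {
    label=$1
    shift
    out="callgrind.out.$label"
    valgrind --tool=callgrind --callgrind-out-file="$out" "$@" || return 1
    echo "$out"
}

# Example session (all paths are placeholders):
#   zig build -Dstrip=false        # in a ziglang/zig checkout; also pass a release mode
#   profile_zig 0.10 ./zig-0.10/zig build-exe main.zig
#   profile_zig 0.11 ./zig-0.11/zig build-exe main.zig
#   callgrind_annotate callgrind.out.0.11 | head -n 40   # plain-text summary
#   kcachegrind callgrind.out.0.11                       # interactive view
```

Expect the compiler to run many times slower under valgrind than it does natively, so a small reproduction is easier to work with than a full project build.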

Captured flamegraphs. Could codegen.llvm.FuncGen.fieldPtr be quadratic?

Profiles:

kcachegrind points to abiAlignmentAdvanced, and it seems like alignment and size recursively call each other:

So, yeah, it feels like something is quadratic and/or undercached in the compiler, but hard to say more without deeper knowledge of how this should work.

I got curious why perf and kcachegrind point to related but different functions… I think kcachegrind is just confused, as it thinks that abiSizeAdvanced takes 334.94% of total execution time. Without cycle detection, abiSizeAdvanced looks like this in kcachegrind:

Those fractal-like rectangles are abiAlignmentAdvanced.

I consider the case closed with respect to who’s the culprit, not sure how to fix that though.

Those flame charts are really helpful, thank you! Interesting find indeed…

Hmm, I think I see the problem. Structs do actually cache their field offsets; however, at some point the LLVM backend stopped using that information and started doing its own calculations, which repeat the work every time. So if the Zig source code initializes N fields, this adds up to O(N^2) calculations.
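
To make the O(N^2) claim concrete, here is a toy model in shell. It is not the compiler's code: the function names are made up, alignment and padding are ignored, and a field's offset is simply the sum of the sizes before it. The only point is to count how many fields get visited when every field's offset is needed once.

```sh
#!/bin/sh
# Toy model (hypothetical names, no alignment): arguments are field sizes in bytes.

# Recompute each offset from scratch: field i re-walks fields 0..i-1.
visits_recomputed() {
    visits=0
    i=0
    for unused in "$@"; do
        off=0
        j=0
        for size in "$@"; do
            [ "$j" -ge "$i" ] && break
            off=$((off + size))
            visits=$((visits + 1))
            j=$((j + 1))
        done
        i=$((i + 1))
    done
    echo "$visits"
}

# Keep a running total: each field is visited exactly once.
visits_cached() {
    visits=0
    off=0
    for size in "$@"; do
        off=$((off + size))
        visits=$((visits + 1))
    done
    echo "$visits"
}

echo "recomputed: $(visits_recomputed 4 8 1 2 4 8 1 2) visits"   # 28 for 8 fields
echo "cached:     $(visits_cached 4 8 1 2 4 8 1 2) visits"       # 8 for 8 fields
```

For N fields the first variant makes N(N-1)/2 visits and the second makes N, so the gap widens quickly for large structs.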

I’m working on migrating structs over to InternPool today, so it’s actually perfect timing for me to look into solving this perf regression as well.

If only the performance tracking site (Performance Tracking ⚡ Zig Programming Language) had not bitrotted… I would love to have a tool like this available. Alas, it requires recurring operational maintenance, and I lack the time to keep it up and running.

@matklad, would you be willing to try a new build of master branch Zig and see if the performance regressions you observed have been fixed? The changes in particular were done in "compiler: move struct types into InternPool proper" (ziglang/zig pull request #17172), which landed last week. This commit is already reflected in the CI builds if you want to use one of those.

There was a follow-up issue I looked into just now, "packedStructFieldByteOffset is implemented via O(N) linear search" (ziglang/zig issue #17178). However, according to my measurements, storing the bit offsets of packed structs actually regressed performance rather than improving it, so I backed out of that change.

Much bigger things are coming soon on the performance roadmap; this was just a little side quest along the way.

Yup, much faster, debug build goes from 16s to 9s!

Out of curiosity, I reran the same test to include the new Zig 0.12.0 executable:

Benchmark 1 (8 runs): /bin/bash ./alldays.10_1_stage2.sh
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.77s  ± 54.6ms    7.71s  … 7.88s           1 (13%)        0%
  peak_rss            285MB ±  187KB     284MB …  285MB          1 (13%)        0%
  cpu_cycles         26.9G  ±  244M     26.3G  … 27.1G           1 (13%)        0%
  instructions       32.8G  ± 8.70M     32.8G  … 32.9G           0 ( 0%)        0%
  cache_references   2.57G  ± 2.93M     2.57G  … 2.57G           0 ( 0%)        0%
  cache_misses        656M  ± 1.28M      655M  …  659M           0 ( 0%)        0%
  branch_misses       197M  ±  399K      196M  …  198M           0 ( 0%)        0%
Benchmark 2 (6 runs): /bin/bash ./alldays.10_1_stage1.sh
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          10.3s  ±  144ms    10.1s  … 10.5s           0 ( 0%)        💩+ 32.1% ±  1.5%
  peak_rss            972MB ±  188KB     971MB …  972MB          0 ( 0%)        💩+241.3% ±  0.1%
  cpu_cycles         31.5G  ±  186M     31.2G  … 31.6G           0 ( 0%)        💩+ 17.1% ±  1.0%
  instructions       42.6G  ± 3.16M     42.6G  … 42.6G           0 ( 0%)        💩+ 29.8% ±  0.0%
  cache_references   2.55G  ± 8.07M     2.54G  … 2.56G           0 ( 0%)          -  0.8% ±  0.3%
  cache_misses        614M  ± 3.85M      610M  …  620M           0 ( 0%)        ⚡-  6.4% ±  0.5%
  branch_misses       192M  ±  687K      192M  …  193M           0 ( 0%)        ⚡-  2.4% ±  0.3%
Benchmark 3 (8 runs): /bin/bash ./alldays.11_0.sh
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          8.00s  ±  103ms    7.86s  … 8.18s           0 ( 0%)        💩+  3.1% ±  1.1%
  peak_rss            298MB ±  266KB     298MB …  299MB          0 ( 0%)        💩+  4.8% ±  0.1%
  cpu_cycles         26.7G  ±  253M     26.2G  … 27.1G           1 (13%)          -  0.8% ±  1.0%
  instructions       33.3G  ± 6.33M     33.3G  … 33.3G           0 ( 0%)        💩+  1.4% ±  0.0%
  cache_references   2.55G  ± 5.64M     2.54G  … 2.56G           2 (25%)          -  1.0% ±  0.2%
  cache_misses        636M  ± 3.79M      630M  …  642M           0 ( 0%)        ⚡-  3.1% ±  0.5%
  branch_misses       183M  ±  477K      183M  …  184M           0 ( 0%)        ⚡-  6.9% ±  0.2%
Benchmark 4 (7 runs): /bin/bash ./alldays.12_0.sh
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          8.97s  ±  107ms    8.81s  … 9.08s           0 ( 0%)        💩+ 15.5% ±  1.2%
  peak_rss            291MB ±  251KB     291MB …  292MB          0 ( 0%)        💩+  2.3% ±  0.1%
  cpu_cycles         29.7G  ±  251M     29.4G  … 30.0G           0 ( 0%)        💩+ 10.7% ±  1.0%
  instructions       39.3G  ± 14.2M     39.3G  … 39.3G           0 ( 0%)        💩+ 19.7% ±  0.0%
  cache_references   2.79G  ± 8.49M     2.78G  … 2.80G           0 ( 0%)        💩+  8.4% ±  0.3%
  cache_misses        631M  ± 3.46M      626M  …  635M           0 ( 0%)        ⚡-  3.9% ±  0.4%
  branch_misses       202M  ±  554K      201M  …  203M           0 ( 0%)        💩+  2.5% ±  0.3%
