Now, when you target x86_64, by default, Zig will use its own x86 backend rather than using LLVM to lower a bitcode file to an object file. The default is not changed on Windows yet, because more COFF linker work needs to be done first.
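If the self-hosted backend causes problems for a particular project, you can still opt back into LLVM per compilation. A minimal build.zig sketch, assuming the use_llvm option exposed by std.Build.Step.Compile (the "my-app" name and paths are placeholders):

const exe = b.addExecutable(.{
    .name = "my-app",
    .root_source_file = b.path("src/main.zig"),
    .target = target,
    .optimize = optimize,
});
exe.use_llvm = true; // force the LLVM backend instead of the self-hosted x86 one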
ZSF is actively restoring my faith in humanity. Cathedral-levels of achievement! Yee-haw! WooHoo! LFG!
Fantastic work, congratulations to Andrew and the team!
Congrats!!
Super!
How is the roadmap looking for achieving the same level on arm64/macOS?
I'm assuming that's a whole new pile of effort, but curious if it's able to ride on the coattails of the x86 backend to help accelerate the work.
Would love to have a go (even though it sounds like a multi-year rabbit hole, lol)
Certainly a whole new pile of effort, but the Legalize pass can help out by doing things like:
- Expanding safety checks into a simpler instruction subset
- Scalarizing operations so the backend doesn't have to implement vectors
- Converting packed struct operations into simpler load, store, and bit shifting operations
More legalizations can be added as well. For instance, I can imagine a handy one would be converting arithmetic on integers wider than 64 bits into 64-bit operations.
A mature backend will likely want to disable these legalizations, because it can probably generate better machine code by handling the higher-level AIR instructions directly. However, they can help a work-in-progress backend support more of the language faster and get full behavior test coverage passing, before going back to disable the legalizations in search of better machine code quality.
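To make "scalarizing" concrete, here is a minimal sketch in source-level terms (the real Legalize pass rewrites AIR instructions, not Zig source; vecAddScalarized is a hypothetical illustration):

const std = @import("std");

// What the program expresses: a single vector add.
fn vecAdd(a: @Vector(4, u32), b: @Vector(4, u32)) @Vector(4, u32) {
    return a + b;
}

// Roughly what a backend without vector support would be handed after
// scalarization: one scalar add per lane.
fn vecAddScalarized(a: @Vector(4, u32), b: @Vector(4, u32)) @Vector(4, u32) {
    var result: @Vector(4, u32) = undefined;
    inline for (0..4) |i| {
        result[i] = a[i] + b[i];
    }
    return result;
}

test "both forms agree" {
    const a: @Vector(4, u32) = .{ 1, 2, 3, 4 };
    const b: @Vector(4, u32) = .{ 10, 20, 30, 40 };
    try std.testing.expect(@reduce(.And, vecAdd(a, b) == vecAddScalarized(a, b)));
}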
I just checked with my emulator code, and unfortunately debug performance has tanked so much that the new backend isn't really an option for that project yet.
E.g. try this:
git clone https://github.com/floooh/chipz
cd chipz
zig build run-kc854
In version 0.14.1 the emulator is comfortably able to run in realtime (on my Meteor Lake Linux laptop: about 2ms out of an 8.333ms frame).
On Zig nightly the performance is about 140ms for a 24ms frame (artificially throttled to prevent a death spiral). This means that running the emulator for 24ms of "real time" takes 140ms, i.e. it's about 6x slower than realtime.
The LLVM debug mode lets the emulator run about 4x faster than realtime, so this gives us an idea of how much slower the Zig backend is compared to the LLVM backend in debug mode: about 6*4 = 24x slower.
Hopefully this is just a glitch for the specific sort of code used in the emulator (mostly bit twiddling on integers, lots of small branches, and one massively big switch statement for the CPU decoder).
For comparison: release-fast perf is about 0.6ms out of an 8.33ms frame on that laptop.
It's notable that Zig got to "don't use LLVM for debug" faster than Rust!
PS: did some benchmarking by running the Z80 emulation alone unthrottled for 10 million ticks:
- LLVM --release=fast: 0.0326 seconds
- LLVM debug: 0.3725 seconds
- x86 backend: 10.03 seconds
(this is on a Meteor Lake laptop)
…the first notable thing is that there's already a huge difference between LLVM release-fast and debug: a roughly 10x performance gap between release and debug mode is quite unusual.
The native x86 backend is then another 27x slower in debug mode than the LLVM debug mode (and a whopping ~300x slower than release mode).
To try for yourself:
git clone https://github.com/floooh/chipz
cd chipz
git checkout z80zexbench
zig build
zig-out/bin/z80zex
This is the sort of code which causes this extreme behaviour (brace yourself): basically one switch branch per instruction tick, about 1.8k branches in total, the whole thing code-generated.
Well doh, stepping through the assembly code I guess I found the reason…
That switch-statement is essentially translated into a linear search:
if (self.step == 0) {
    ...
} else if (self.step == 1) {
    ...
} else if (self.step == 2) {
    ...
} ...
…and even this is spectacularly inefficient:
cmpw $0, %bx
je case_0
jmp check_1
case_0:
// case 0 payload
jmp done
check_1:
cmpw $1, %bx
je case_1
jmp check_2
case_1:
// case 1 payload
jmp done
check_2:
cmpw $2, %bx
je case_2
jmp check_3
case_2:
// etc pp
…i.e. it might take up to ~1800 comparisons until it finds the right branch, and there's lots of redundant jumping around (which probably kills any attempt at branch prediction by the CPU).
In the LLVM backend the switch uses a jump table even in debug mode. One thing that LLVM does really well is picking the best strategy for switch statements, even when there are gaps between case branches; but the emulator doesn't even need such a clever approach, since there are no gaps, at most empty case branches.
Does the LLVM-generated code use a jump table for the whole thing? Or does it first do one branch to check if the value is <= 0x669?
Excited for this, but I use Windows as my main environment, so I'll pass for now.
In LLVM debug mode there's a single range check (conditional branch), but when that passes it does an indirect jump through one big jump table for the entire switch. (I've also seen LLVM select between multiple smaller jump tables and find the right one with a binary search, but only if there are continuous ranges separated by large gaps.) I.e. as soon as the cases are continuous it's pretty much guaranteed that LLVM will use a single jump table, no matter how big the switch is.
Each switch branch ends with a direct jump to one of two "epilogues":
- chipz/src/chips/z80.zig at 5c04cb14264f346e1483ce3ce9b4b9c3599722a0 · floooh/chipz · GitHub
- chipz/src/chips/z80.zig at 5c04cb14264f346e1483ce3ce9b4b9c3599722a0 · floooh/chipz · GitHub
In release mode, the initial range check before the jump-table access is removed because of the unreachable here. I.e. apart from the range check (which is only removed in release mode because of the unreachable), it looks like the actual switch structure is as efficient as it gets in LLVM debug mode. (The actual case-branch payloads differ again: in debug mode there are function calls, while in release mode those are inlined.)
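For illustration, a minimal standalone sketch (not the actual chipz code) of how an unreachable else prunes the range check:

// In debug mode the compiler keeps a range check before the jump table;
// in release mode the `else => unreachable` promises that `step` always
// hits a real case, so the check can be dropped.
fn dispatch(step: u16) u32 {
    return switch (step) {
        0 => 10,
        1 => 20,
        2 => 30,
        else => unreachable,
    };
}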
We'll get there! Independence from LLD is a major priority right now.
…tbh I'm starting to wonder if Zig should have some sort of switch variant that enforces a jump table and fails with a compile error if the conditions to generate a single jump table are not met… the heuristics that LLVM uses to pick a jump table versus a binary search are kinda opaque (even though it generally does the right thing).
(With the condition being, of course: all case "keys" must be integer constants, continuous and in order. It should be allowed though to either have empty {} payloads, or gaps which the compiler fills with "empty" jump-table slots, i.e. the address in the jump table points directly to the code behind the switch - that's what LLVM does for "empty" case branches.)
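In the meantime, table dispatch can be forced by hand with an array of function pointers; a hedged sketch (hypothetical names, and an indirect call rather than a true jump table, so not identical in cost):

const std = @import("std");

const Handler = *const fn (state: *u32) void;

fn step0(state: *u32) void { state.* += 1; }
fn step1(state: *u32) void { state.* *= 2; }
fn stepNop(state: *u32) void { _ = state; } // an "empty {}" case gets an explicit no-op slot

// Dense, ordered table: empty cases or gaps cannot prevent table dispatch,
// because we fill the slots ourselves.
const table = [_]Handler{ step0, step1, stepNop };

pub fn main() void {
    var state: u32 = 1;
    for (0..table.len) |step| {
        table[step](&state); // always dispatched indirectly through the table
    }
    std.debug.print("state = {}\n", .{state});
}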
I think that's a little too low-level even for Zig. We'll be playing the heuristics game along with LLVM.
By the way, the x86 backend does have jump tables implemented, but it does not have the partial jump table optimization (the binary search strategy). In this case the gaps triggered the heuristic to not emit a jump table.
It's interesting that LLVM used a jump table despite the gaps. This will be an interesting case study to keep an eye on. Appreciate the real-world use case!
Another thought: I can't really get rid of the "dummy case branches" which have an empty {} payload (which I guess is equivalent to a non-existing case branch, i.e. a "gap"). But at the same time I don't want those empty branches to prevent creating a jump table… maybe some sort of hint to say "this case branch is a no-op, but please treat it like there would be some important code here" - basically the opposite of unreachable: a reachable.
I also notice significant performance degradations with the new backend. In my case I could trace some of it to rep movsb instructions which seem to be copying 300 KiB of data around many times. Sadly I cannot look into it further, because my profiler doesn't seem to understand the debug info generated by the x86 backend (it works fine with LLVM-generated code though). But I suspect that some of those "copying the entire array on every access" problems might have returned.
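For reference, a hedged guess at the kind of pattern that used to trigger such copies (hypothetical code, not traced to this particular regression; the 300 KiB figure is just mirrored from the profile above):

const State = struct {
    mem: [300 * 1024]u8, // ~300 KiB of emulator state

    // Pass-by-value: a naive backend may materialize a full copy of `mem`
    // for every call, which shows up as `rep movsb` in the profile.
    fn readByValue(self: State, addr: usize) u8 {
        return self.mem[addr];
    }

    // Pass-by-pointer: no copy regardless of backend smarts.
    fn readByPtr(self: *const State, addr: usize) u8 {
        return self.mem[addr];
    }
};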
We haven't forgotten about that bit of perf - we were just talking about it in a meeting the other day. Thanks for checking; let's keep working together on finding common patterns that can be optimized to help real-world projects.