Now, when you target x86_64, by default, Zig will use its own x86 backend rather than using LLVM to lower a bitcode file to an object file. The default is not changed on Windows yet, because more COFF linker work needs to be done first.
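If the self-hosted backend causes problems for a particular project, you can still opt back into LLVM per compilation. A minimal build.zig sketch, assuming the use_llvm option exposed by std.Build.Step.Compile (the "my-app" name and paths are placeholders):

const exe = b.addExecutable(.{
    .name = "my-app",
    .root_source_file = b.path("src/main.zig"),
    .target = target,
    .optimize = optimize,
});
exe.use_llvm = true; // force the LLVM backend instead of the self-hosted x86 one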
ZSF is actively restoring my faith in humanity. Cathedral-levels of achievement! Yee-haw! WooHoo! LFG!
Fantastic work, congratulations to Andrew and the team!
Congrats!!
Super!
How is the roadmap looking for achieving the same level on arm64/macOS?
I'm assuming that's a whole new pile of effort, but curious if it's able to ride on the coattails of the x86 backend to help accelerate the work.
Would love to have a go (even though it sounds like a multi-year rabbit hole, lol)
Certainly a whole new pile of effort, but the Legalize pass can help out by doing things like:
- Expanding safety checks into a simpler instruction subset
- Scalarizing operations so the backend doesn't have to implement vectors
- Converting packed struct operations into simpler load, store, and bit shifting operations
More legalizations can be added as well. For instance, I can imagine a handy one would be converting arithmetic on integers wider than 64 bits into 64-bit operations.
A mature backend will likely want to disable these legalizations, because it can probably generate better machine code by handling the higher-level AIR instructions directly. However, they can help a work-in-progress backend support more of the language faster and get full behavior test coverage passing, before going back to disable the legalizations in search of better machine code quality.
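To make "scalarizing" concrete, here is a minimal sketch in source-level terms (the real Legalize pass rewrites AIR instructions, not Zig source; vecAddScalarized is a hypothetical illustration):

const std = @import("std");

// What the program expresses: a single vector add.
fn vecAdd(a: @Vector(4, u32), b: @Vector(4, u32)) @Vector(4, u32) {
    return a + b;
}

// Roughly what a backend without vector support would be handed after
// scalarization: one scalar add per lane.
fn vecAddScalarized(a: @Vector(4, u32), b: @Vector(4, u32)) @Vector(4, u32) {
    var result: @Vector(4, u32) = undefined;
    inline for (0..4) |i| {
        result[i] = a[i] + b[i];
    }
    return result;
}

test "both forms agree" {
    const a: @Vector(4, u32) = .{ 1, 2, 3, 4 };
    const b: @Vector(4, u32) = .{ 10, 20, 30, 40 };
    try std.testing.expect(@reduce(.And, vecAdd(a, b) == vecAddScalarized(a, b)));
}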
I just checked with my emulator code, and unfortunately debug performance has tanked so much that the new backend isn't really an option for that project yet.
E.g. try this:
git clone https://github.com/floooh/chipz
cd chipz
zig build run-kc854
In version 0.14.1 the emulator is comfortably able to run in realtime (on my Meteor Lake Linux laptop: about 2ms out of an 8.333ms frame).
On Zig nightly the performance is about 140ms for a 24ms frame (artificially throttled to prevent a death spiral). This means that running the emulator for 24ms of "real time" takes 140ms, i.e. it's about 6x slower than realtime.
The LLVM debug mode lets the emulator run about 4x faster than realtime, so this gives us an idea of how much slower the Zig backend is compared to the LLVM backend in debug mode: about 6*4 = 24x slower.
Hopefully this is just a glitch for the specific sort of code used in the emulator (mostly bit twiddling on integers, lots of small branches, and one massively big switch statement for the CPU decoder).
For comparison: release-fast perf is about 0.6ms out of an 8.33ms frame on that laptop.
It's notable that Zig got to "don't use LLVM for debug" faster than Rust!
PS: did some benchmarking by running the Z80 emulation alone unthrottled for 10 million ticks:
- LLVM --release=fast: 0.0326 seconds
- LLVM debug: 0.3725 seconds
- x86 backend: 10.03 seconds
(this is on a Meteor Lake laptop)
…the first notable thing is that there's already a huge difference between LLVM release-fast and debug: a roughly 10x performance gap between release and debug mode is quite unusual.
The native x86 backend is then another 27x slower in debug mode than the LLVM debug mode (and a whopping ~300x slower than release mode).
To try for yourself:
git clone https://github.com/floooh/chipz
cd chipz
git checkout z80zexbench
zig build
zig-out/bin/z80zex
This is the sort of code which causes this extreme behaviour (brace yourself): basically one switch branch per instruction tick, about 1.8k branches in total, the whole thing code-generated.
Well doh, stepping through the assembly code I guess I found the reason…
That switch-statement is essentially translated into a linear search:
if (self.step == 0) {
    ...
} else if (self.step == 1) {
    ...
} else if (self.step == 2) {
    ...
} ...
…and even this is spectacularly inefficient:
cmpw $0, %bx
je case_0
jmp check_1
case_0:
// case 0 payload
jmp done
check_1:
cmpw $1, %bx
je case_1
jmp check_2
case_1:
// case 1 payload
jmp done
check_2:
cmpw $2, %bx
je case_2
jmp check_3
case_2:
// etc pp
…i.e. it might take up to ~1800 comparisons until it finds the right branch, and there's lots of redundant jumping around (which probably kills any attempt at branch prediction by the CPU).
In the LLVM backend the switch uses a jump table even in debug mode. One thing that LLVM does really well is picking the best strategy for switch statements, even when there are gaps between case branches; but the emulator doesn't even need such a clever approach, since there are no gaps, at most empty case branches.
Does the LLVM-generated code use a jump table for the whole thing? Or does it first do one branch to check if the value is <= 0x669?
Excited for this, but I use Windows as my main environment, so I'll pass for now.
In LLVM debug mode there's a single range check (conditional branch), but when that passes it does an indirect jump through one big jump table for the entire switch. (I've also seen LLVM select between multiple smaller jump tables and find the right one with a binary search, but only if there are continuous ranges separated by large gaps.) I.e. as soon as the cases are continuous it's pretty much guaranteed that LLVM will use a single jump table, no matter how big the switch is.
Each switch branch ends with a direct jump to one of two "epilogues":
- chipz/src/chips/z80.zig at 5c04cb14264f346e1483ce3ce9b4b9c3599722a0 · floooh/chipz · GitHub
- chipz/src/chips/z80.zig at 5c04cb14264f346e1483ce3ce9b4b9c3599722a0 · floooh/chipz · GitHub
In release mode, the initial range check before the jump-table access is removed because of the unreachable here. I.e. apart from the range check (which is only removed in release mode because of the unreachable), it looks like the actual switch structure is as efficient as it gets in LLVM debug mode. (The actual case-branch payloads differ again: in debug mode there are function calls, while in release mode those are inlined.)
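For illustration, a minimal standalone sketch (not the actual chipz code) of how an unreachable else prunes the range check:

// In debug mode the compiler keeps a range check before the jump table;
// in release mode the `else => unreachable` promises that `step` always
// hits a real case, so the check can be dropped.
fn dispatch(step: u16) u32 {
    return switch (step) {
        0 => 10,
        1 => 20,
        2 => 30,
        else => unreachable,
    };
}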
We'll get there! Independence from LLD is a major priority right now.
…tbh I'm starting to wonder if Zig should have some sort of switch variant that enforces a jump table and fails with a compile error if the conditions to generate a single jump table are not met… the heuristics that LLVM uses to pick a jump table versus a binary search are kinda opaque (even though it generally does the right thing).
(With the condition being, of course: all case "keys" must be integer constants, continuous and in order. It should be allowed though to either have empty {} payloads, or gaps which the compiler fills with "empty" jump-table slots, i.e. the address in the jump table points directly to the code behind the switch - that's what LLVM does for "empty" case branches.)
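In the meantime, table dispatch can be forced by hand with an array of function pointers; a hedged sketch (hypothetical names, and an indirect call rather than a true jump table, so not identical in cost):

const std = @import("std");

const Handler = *const fn (state: *u32) void;

fn step0(state: *u32) void { state.* += 1; }
fn step1(state: *u32) void { state.* *= 2; }
fn stepNop(state: *u32) void { _ = state; } // an "empty {}" case gets an explicit no-op slot

// Dense, ordered table: empty cases or gaps cannot prevent table dispatch,
// because we fill the slots ourselves.
const table = [_]Handler{ step0, step1, stepNop };

pub fn main() void {
    var state: u32 = 1;
    for (0..table.len) |step| {
        table[step](&state); // always dispatched indirectly through the table
    }
    std.debug.print("state = {}\n", .{state});
}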
I think that's a little too low-level even for Zig. We'll be playing the heuristics game along with LLVM.
By the way, the x86 backend does have jump tables implemented, but it does not have the partial jump table optimization (the binary search strategy). In this case the gaps triggered the heuristic to not emit a jump table.
It's interesting that LLVM used a jump table despite the gaps. This will be an interesting case study to keep an eye on. Appreciate the real-world use case!
Another thought: I can't really get rid of the "dummy case branches" which have an empty {} payload (which I guess is equivalent to a non-existing case branch, i.e. a "gap"). But at the same time I don't want those empty branches to prevent creating a jump table… maybe some sort of hint to say "this case branch is a no-op, but please treat it like there would be some important code here" - basically the opposite of unreachable: a reachable.
I also notice significant performance degradations with the new backend. In my case I could trace some of it to rep movsb instructions which seem to be copying 300 KiB of data around many times. Sadly I cannot look into it further, because my profiler doesn't seem to understand the debug info generated by the x86 backend (it works fine with LLVM-generated code though). But I suspect that some of those "copying the entire array on every access" problems might have returned.
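For reference, a hedged guess at the kind of pattern that used to trigger such copies (hypothetical code, not traced to this particular regression; the 300 KiB figure is just mirrored from the profile above):

const State = struct {
    mem: [300 * 1024]u8, // ~300 KiB of emulator state

    // Pass-by-value: a naive backend may materialize a full copy of `mem`
    // for every call, which shows up as `rep movsb` in the profile.
    fn readByValue(self: State, addr: usize) u8 {
        return self.mem[addr];
    }

    // Pass-by-pointer: no copy regardless of backend smarts.
    fn readByPtr(self: *const State, addr: usize) u8 {
        return self.mem[addr];
    }
};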
We haven't forgotten about that bit of perf - we were just talking about it in a meeting the other day. Thanks for checking; let's keep working together on finding common patterns that can be optimized to help real-world projects.