A little something about GCC and Zig optimisation approaches

Just a little thing I found playing with godbolt - their MaxArray C example:

void maxArray(double* x, double* y) {
    int i;

    for (i = 0; i < 65536; i++) {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}

This compiles to the following with x86-64 gcc 13.1 and the -O2 flag:

maxArray:
        xor     eax, eax
.L4:
        movsd   xmm0, QWORD PTR [rsi+rax]
        comisd  xmm0, QWORD PTR [rdi+rax]
        jbe     .L2
        movsd   QWORD PTR [rdi+rax], xmm0
.L2:
        add     rax, 8
        cmp     rax, 524288
        jne     .L4
        ret

Notice how rax, which is used to index the arrays, goes from 0 to 524288 in increments of 8. 524288 is 65536 * 8, and 8 is sizeof(double). This makes sense, since i is not used for any other purpose in the for loop.

I decided to try to reproduce this result in Zig as closely as possible. Here’s my code:

export fn maxArray(x: [*]f64, y: [*]f64) void
{
    for (0..65536) |i|
    {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}

Not very ā€˜ziggy’, I agree - it’s a verbatim reproduction of the C code. But look at the compilation results (zig trunk with -O ReleaseSmall; -O ReleaseFast produces a huge number of CPU instructions, probably doing some kind of partial loop unrolling?):

maxArray:
        xor     eax, eax
.LBB0_1:
        cmp     rax, 65536
        je      .LBB0_5
        vmovsd  xmm0, qword ptr [rsi + 8*rax]
        vucomisd        xmm0, qword ptr [rdi + 8*rax]
        jbe     .LBB0_4
        vmovsd  qword ptr [rdi + 8*rax], xmm0
.LBB0_4:
        inc     rax
        jmp     .LBB0_1
.LBB0_5:
        ret

As you can see, rax goes to 65536, just like i in the source code, and its value is multiplied by 8 in vmovsd and vucomisd instructions.

Probably not a big deal (if anything at all), since multiplying by a power of 2 is just a left shift, so perhaps this does not impact performance at all (?).

Still, the way GCC does it compared to Zig caught my attention.

Also, I’m tempted to benchmark ReleaseSmall vs ReleaseFast - is the long version really faster?
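
A minimal benchmark harness for that comparison might look like the sketch below (not from the thread; the repetition count and random data are arbitrary, and std API names are as I understand the 0.11-era trunk). Build it once with -O ReleaseSmall and once with -O ReleaseFast. Note that after the first repetition the stores stop firing, but the compare loop is what differs between the two modes:

const std = @import("std");

export fn maxArray(x: [*]f64, y: [*]f64) void {
    for (0..65536) |i| {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}

pub fn main() !void {
    var x: [65536]f64 = undefined;
    var y: [65536]f64 = undefined;
    var prng = std.rand.DefaultPrng.init(42);
    const random = prng.random();
    for (&x, &y) |*a, *b| {
        a.* = random.float(f64);
        b.* = random.float(f64);
    }

    var timer = try std.time.Timer.start();
    for (0..10_000) |_| maxArray(&x, &y);
    const elapsed = timer.read();

    // Print a checksum so the result is observably used and the
    // calls cannot be optimized away.
    var sum: f64 = 0;
    for (x) |v| sum += v;
    std.debug.print("total: {} ns (checksum {d})\n", .{ elapsed, sum });
}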

Note that this is not really a multiplication. Rather, x86 has an addressing mode where the index is scaled by 8:

https://renenyffenegger.ch/notes/development/languages/assembler/x86/addressing-modes

I don’t know what the perf difference would be here, though.

My gut feeling is that the primary job of a low-level compiler/programming language isn’t really to optimize anything you throw at it as hard as possible, but rather to provide tools and vocabulary for expressing optimizable code. In this sense, I’d be more interested to see if multi-for loops, alignment annotations, and vectors can force the compiler to emit proper SIMD here.
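
(For reference, a multi-object for version of that loop might look like the sketch below; whether it vectorizes is exactly the kind of thing to check.)

export fn maxArray(x: [*]f64, y: [*]f64) void {
    // Iterate both buffers in lockstep; xe is captured by pointer
    // so the element can be written through it.
    for (x[0..65536], y[0..65536]) |*xe, ye| {
        if (ye > xe.*)
            xe.* = ye;
    }
}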

One thing I like about Zig is that, rather than having to extract a well-aligned subslice for SIMD, it can just require the caller to pass in aligned data, where the alignment requirement naturally propagates recursively up to the point where you allocate stuff.
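
A sketch of that idea (names hypothetical, not from the thread): the alignment is part of the parameter’s pointer type, and alignedAlloc returns memory whose type already satisfies it, so nothing has to be checked or subsliced at the call site.

const std = @import("std");

// Callers must supply 32-byte-aligned buffers; the requirement lives
// in the pointer type and thus propagates to whoever allocates.
fn maxArrayAligned(x: [*]align(32) f64, y: [*]align(32) f64) void {
    for (0..65536) |i| {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // alignedAlloc returns []align(32) f64, so .ptr already has the
    // required pointer type.
    const x = try allocator.alignedAlloc(f64, 32, 65536);
    defer allocator.free(x);
    const y = try allocator.alignedAlloc(f64, 32, 65536);
    defer allocator.free(y);

    maxArrayAligned(x.ptr, y.ptr);
}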

This is multiplication, in the sense that ā€œmultiplication happensā€.
It’s not a separate, distinct opcode such as mul or shl, but the CPU still has to do it.

It could be ā€˜free’ from a performance standpoint - e.g. there’s dedicated circuitry to do this multiplication (and thus implement this addressing mode) at a certain pipeline stage anyway, so there’s no penalty. I don’t know that, just speculating. Maybe not in every CPU, though.

So if you can avoid it (just by incrementing rax by 8 instead of 1), I see no reason not to.
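
In source, one way to express that deliberately (function name hypothetical, not from the thread) is to strength-reduce the index away by hand and advance the pointers themselves, one element per iteration:

// The induction variables are the pointers, so no per-access
// scaling by 8 is left in the loop body.
export fn maxArrayPtr(x: [*]f64, y: [*]f64) void {
    var xp = x;
    var yp = y;
    const end = x + 65536; // one past the last element
    while (xp != end) : ({
        xp += 1;
        yp += 1;
    }) {
        if (yp[0] > xp[0])
            xp[0] = yp[0];
    }
}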

Update: funny thing - I played with the code a bit, and Zig gave me the exact same add rax, 8 / cmp rax, 524288 behavior.
I guess the reason is that the compiler now wanted to use that ā€œfree multiplicationā€ to implement j (which is just i * 2):

export fn maxArray(x: [*]f64, y: [*]f64) void
{
    for (0..65536) |i|
    {
        const j = i * 2;
        if (y[i] > x[j])
            x[j] = y[i];
    }
}
maxArray:
        xor     eax, eax
.LBB0_1:
        cmp     rax, 524288
        je      .LBB0_5
        vmovsd  xmm0, qword ptr [rsi + rax]
        vucomisd        xmm0, qword ptr [rdi + 2*rax]
        jbe     .LBB0_4
        vmovsd  qword ptr [rdi + 2*rax], xmm0
.LBB0_4:
        add     rax, 8
        jmp     .LBB0_1
.LBB0_5:
        ret

Just to clarify - I’m not trying to claim anything, just pointing out (very minor) things I see and trying to make sense of them.
And, maybe, enjoying myself a bit too much.

The big one here probably isn’t the data alignment. The big culprit, if the code gen gets really unlucky, is going to be the address the loop construct is on. The loop should be on a 32-byte boundary and the loop itself kept to 24 or fewer µops for full-speed LSD (Loop Stream Detector) magic (not joking).

If you jack up the data alignment, you will possibly pay a price on the first and last iterations, where it needs to mask the results, but all the middle iterations should be fine.

This is just another good place to put a PSA - if you plan on using AVX or similar on whatever, make big buffers and always allocate past the end of what you need, filling up to the next 256- or 512-byte boundary (i.e. your allocation should be a multiple of that size and start at an aligned address). This allows you not to have to generate complex masking code and just let the big dog eat.
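
A sketch of that PSA in code (helper name hypothetical; 64 bytes covers the 512-bit case): round the element count up to a whole number of blocks and allocate at matching alignment, so the main loop never needs a masked tail.

const std = @import("std");

// Pad the length to a multiple of the SIMD block and allocate the
// buffer at matching alignment, so full-width iterations can run all
// the way to the end without masking.
fn allocSimdF64(allocator: std.mem.Allocator, n: usize) ![]align(64) f64 {
    const block = 64 / @sizeOf(f64); // 8 f64s per 64-byte block
    const padded = (n + block - 1) / block * block;
    return allocator.alignedAlloc(f64, 64, padded);
}

The tail elements can then be initialized to a neutral value (e.g. -inf for a max loop) so the extra full-width iteration over them is harmless.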

Here you go:

export fn maxArray(x: [*]@Vector(4, f64), y: [*]@Vector(4, f64)) void
{
    for (0..65536/4) |i|
    {
        x[i] = @select(f64, y[i] > x[i], y[i], x[i]);
    }
}
maxArray:
        xor     eax, eax
.LBB0_1:
        vmovapd ymm0, ymmword ptr [rsi + rax]
        vmaxpd  ymm0, ymm0, ymmword ptr [rdi + rax]
        vmovapd ymmword ptr [rdi + rax], ymm0
        add     rax, 32
        cmp     rax, 524288
        jne     .LBB0_1
        vzeroupper
        ret

Interestingly, it now also counts in byte units, going from 0 to 524288.
And LLVM even figured out that we are just computing the element-wise max.
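
For comparison, the intent can also be written directly with @max, which operates element-wise on vectors (a sketch; Zig’s @max NaN semantics may differ from the @select form, so the codegen is not guaranteed to be identical):

export fn maxArray(x: [*]@Vector(4, f64), y: [*]@Vector(4, f64)) void {
    for (0..65536 / 4) |i| {
        x[i] = @max(x[i], y[i]);
    }
}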
