A little something about GCC and Zig optimisation approaches

Just a little thing I found while playing with Godbolt - their MaxArray C example:

void maxArray(double* x, double* y) {
    int i;

    for (i = 0; i < 65536; i++) {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}

Compiles to the following with x86-64 gcc 13.1 using -O2 flag:

maxArray:
        xor     eax, eax
.L4:
        movsd   xmm0, QWORD PTR [rsi+rax]
        comisd  xmm0, QWORD PTR [rdi+rax]
        jbe     .L2
        movsd   QWORD PTR [rdi+rax], xmm0
.L2:
        add     rax, 8
        cmp     rax, 524288
        jne     .L4
        ret

Notice how rax, which is used to index the arrays, goes from 0 to 524288 in increments of 8. 524288 is 65536 * 8, and 8 is sizeof(double): GCC has replaced the element index i with a byte offset, which it can do since i is not used for any other purpose in the loop.

I decided to try to reproduce this result in Zig as closely as possible. Here’s my code:

export fn maxArray(x: [*]f64, y: [*]f64) void
{
    for (0..65536) |i|
    {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}

Not very ‘ziggy’, I agree - it’s a verbatim reproduction of the C code. But look at the compilation result (Zig trunk with -O ReleaseSmall; -O ReleaseFast produces a huge number of CPU instructions, probably doing some kind of partial loop unrolling?):

maxArray:
        xor     eax, eax
.LBB0_1:
        cmp     rax, 65536
        je      .LBB0_5
        vmovsd  xmm0, qword ptr [rsi + 8*rax]
        vucomisd        xmm0, qword ptr [rdi + 8*rax]
        jbe     .LBB0_4
        vmovsd  qword ptr [rdi + 8*rax], xmm0
.LBB0_4:
        inc     rax
        jmp     .LBB0_1
.LBB0_5:
        ret

As you can see, rax goes from 0 to 65536, just like i in the source code, and its value is multiplied by 8 in the vmovsd and vucomisd instructions.

Probably not a big deal (if anything at all), since multiplication by a power of 2 is just a left shift, so perhaps this does not impact performance at all(?).

Still, the way GCC does it compared to Zig caught my attention.

Also, I’m tempted to benchmark ReleaseSmall vs ReleaseFast - is the long version really faster?


Note that this is not really a multiplication. Rather, x86 has addressing modes where the index register is scaled by 8:

https://renenyffenegger.ch/notes/development/languages/assembler/x86/addressing-modes

I don’t know what the perf difference would be here, though.

My gut feeling is that the primary job of a low-level compiler/programming language isn’t really to optimize anything you throw at it as hard as possible, but rather to provide tools and vocabulary for expressing optimizable code. In this sense, I’d be more interested to see if multi-for loops, alignment annotations, and vectors can force the compiler to emit proper SIMD here.

One thing I like about Zig is that, rather than having to extract a well-aligned subslice for SIMD, it can just require the caller to pass in aligned data, where the alignment requirement naturally propagates recursively up to the point where you allocate stuff.

This is multiplication, in the sense that “multiplication happens”.
It’s not a separate, distinct opcode like mul or shl, but the CPU still has to do it.

It could be ‘free’ from a performance standpoint - e.g. there’s dedicated circuitry to do this multiplication (and thus implement this addressing mode) at a certain pipeline stage anyway, so there’s no penalty. I don’t know that, just speculating. Maybe not in every CPU, though.

So if you can avoid it (just by incrementing rax by 8 instead of 1), I see no reason not to.

Update: funny thing - I have played with the code a bit, and Zig gave me the exact same add rax, 8 / cmp rax, 524288 behavior.
I guess the reason is that now the compiler wanted to use that “free multiplication” to implement j (which is just i * 2):

export fn maxArray(x: [*]f64, y: [*]f64) void
{
    for (0..65536) |i|
    {
        const j = i * 2;
        if (y[i] > x[j])
            x[j] = y[i];
    }
}
maxArray:
        xor     eax, eax
.LBB0_1:
        cmp     rax, 524288
        je      .LBB0_5
        vmovsd  xmm0, qword ptr [rsi + rax]
        vucomisd        xmm0, qword ptr [rdi + 2*rax]
        jbe     .LBB0_4
        vmovsd  qword ptr [rdi + 2*rax], xmm0
.LBB0_4:
        add     rax, 8
        jmp     .LBB0_1
.LBB0_5:
        ret

Just to clarify - I’m not trying to claim anything, just pointing out (very minor) things I see and trying to make sense of them.
And, maybe, enjoying myself a bit too much.