Just a little thing I found playing with godbolt - their MaxArray C example:
void maxArray(double* x, double* y) {
    int i;
    for (i = 0; i < 65536; i++) {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}
Compiles to the following with x86-64 gcc 13.1 using the -O2 flag:
maxArray:
        xor     eax, eax
.L4:
        movsd   xmm0, QWORD PTR [rsi+rax]
        comisd  xmm0, QWORD PTR [rdi+rax]
        jbe     .L2
        movsd   QWORD PTR [rdi+rax], xmm0
.L2:
        add     rax, 8
        cmp     rax, 524288
        jne     .L4
        ret
Notice how rax, which is used to index the arrays, goes from 0 to 524288 in increments of 8. 524288 is 65536 * 8, and 8 is sizeof(double). So this makes sense, since i is not used for any other purpose in the for loop.
I have decided to try to reproduce this result in Zig as closely as possible. Here's my code:
export fn maxArray(x: [*]f64, y: [*]f64) void {
    for (0..65536) |i| {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}
Not very "ziggy", I agree - it's a verbatim reproduction of the C code. But look at the compilation results (zig trunk with -O ReleaseSmall, since -O ReleaseFast produces a huge number of CPU instructions, probably doing some kind of partial loop unrolling?):
maxArray:
        xor     eax, eax
.LBB0_1:
        cmp     rax, 65536
        je      .LBB0_5
        vmovsd  xmm0, qword ptr [rsi + 8*rax]
        vucomisd xmm0, qword ptr [rdi + 8*rax]
        jbe     .LBB0_4
        vmovsd  qword ptr [rdi + 8*rax], xmm0
.LBB0_4:
        inc     rax
        jmp     .LBB0_1
.LBB0_5:
        ret
As you can see, rax goes to 65536, just like i in the source code, and its value is multiplied by 8 inside the vmovsd and vucomisd instructions via the scaled addressing mode ([rsi + 8*rax]).
Probably not a big deal (if anything at all): multiplication by a power of 2 is just a left shift, and here it is folded into the scaled-index addressing mode anyway, so perhaps this does not impact performance at all (?).
Still, the way GCC does it compared to Zig caught my attention.
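Out of curiosity, here is roughly what that GCC loop shape looks like written out by hand in Zig. This is just my own sketch (the name maxArrayByteOffset is made up), and I'm not claiming LLVM would compile it back into GCC's exact output - the point is only that a single byte offset replaces the element index, so the loads and stores need no scaling:

export fn maxArrayByteOffset(x: [*]f64, y: [*]f64) void {
    // Count in bytes (0, 8, 16, ..., 524288), like GCC's rax does,
    // instead of counting elements and scaling by 8 at each access.
    var off: usize = 0;
    while (off != 65536 * @sizeOf(f64)) : (off += @sizeOf(f64)) {
        const xp: *f64 = @ptrFromInt(@intFromPtr(x) + off);
        const yp: *const f64 = @ptrFromInt(@intFromPtr(y) + off);
        if (yp.* > xp.*) xp.* = yp.*;
    }
}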
Also, I'm tempted to benchmark ReleaseSmall vs ReleaseFast - is the long version really faster?
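If anyone wants to try, here's a minimal benchmark sketch (untested; the iteration count and the fill values are arbitrary choices of mine). Build it twice, e.g. zig build-exe -O ReleaseSmall bench.zig and zig build-exe -O ReleaseFast bench.zig, and compare the numbers:

const std = @import("std");

export fn maxArray(x: [*]f64, y: [*]f64) void {
    for (0..65536) |i| {
        if (y[i] > x[i])
            x[i] = y[i];
    }
}

// Global buffers so nothing large sits on the stack.
var xs: [65536]f64 = undefined;
var ys: [65536]f64 = undefined;

pub fn main() !void {
    // Deterministic fill; the exact values don't matter for a rough timing.
    for (&xs, &ys, 0..) |*a, *b, i| {
        a.* = @floatFromInt(i % 7);
        b.* = @floatFromInt((i * 3) % 11);
    }

    const iters = 10_000;
    var checksum: f64 = 0;
    var timer = try std.time.Timer.start();
    for (0..iters) |_| {
        maxArray(&xs, &ys);
        checksum += xs[0]; // keep the work observable so it can't be dropped
    }
    const ns = timer.read();
    std.debug.print("{d} calls, {d} ns/call (checksum {d})\n", .{ iters, ns / iters, checksum });
}

One caveat: after the first call xs already holds the element-wise maxima, so later iterations mostly hit the "no store" path; resetting xs between calls (outside the timed region) would give a fairer picture.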