No SIMD addition instructions in ReleaseSmall mode

Below is a simple program that asks the user for 8 numbers, fills two 4-element vectors with them, and then adds the two vectors:

const std = @import("std");

pub fn main() !void {

    const stdin = std.io.getStdIn().reader();
    const stdout = std.io.getStdOut().writer();

    var a = @Vector(4, u32){0,0,0,0};
    var b = @Vector(4, u32){0,0,0,0};
    var buf: [16]u8 = undefined;

    for (0..4) |k| {
        try stdout.print("a[{}] = ", .{k});
        if (try stdin.readUntilDelimiterOrEof(buf[0..], '\n')) |inp| {
            a[k] = try std.fmt.parseInt(u32, inp, 10);
        }
    }

    for (0..4) |k| {
        try stdout.print("b[{}] = ", .{k});
        if (try stdin.readUntilDelimiterOrEof(buf[0..], '\n')) |inp| {
            b[k] = try std.fmt.parseInt(u32, inp, 10);
        }
    }

    const c = a + b;
    std.debug.print("{}\n", .{c});
}

The program works correctly, but I have a question about the generated code.
I compiled it with

zig build-exe v2.zig -O ReleaseSmall -femit-asm -fsingle-threaded

Then I inspected the assembler output for SIMD instructions.
I see some, e.g. movups xmmword ptr [rdi], xmm0 and xorps xmm0, xmm0,
but I do not see any add instructions, only moves and a couple of xors.
Why did the compiler not generate SIMD add instructions? Does it mean that the vector addition is actually done without SIMD instructions in this particular example?

According to lscpu, the CPU on this machine has the sse, sse2, ssse3, sse4_1 and sse4_2 flags.

It’s there: vpaddd xmm0, xmm1, xmm0 (about 250 lines later).

But not in my case:

~/2-coding/zig-lang/@vector$ grep add v2.s 
	add	r9, 8
	add	rcx, -8
	add	rcx, 16
	add	r15, 56
	add	eax, dword ptr [rsp + rcx + 240]
	add	rcx, -2
	add	al, 48
	add	rsp, 56
	add	bl, -48
	add	edx, eax
	add	bl, -48
	add	edx, eax
	add	r12, rax
	add	r8, -2
	add	al, 48
	add	rsp, 72
	add	rdi, qword ptr [r14]
	add	qword ptr [r14 + 16], r12
	add	r13, r12
	add	rsp, 8

and

~/2-coding/zig-lang/@vector$ grep xmm v2.s 
	movups	xmm0, xmmword ptr [rax + .L__unnamed_1]
	movups	xmmword ptr [rdi], xmm0
	xorps	xmm0, xmm0
	movaps	xmmword ptr [rsp + 160], xmm0
	movaps	xmmword ptr [rsp + 176], xmm0
	movaps	xmm0, xmmword ptr [rsp + 160]
	movaps	xmm1, xmmword ptr [rsp + 176]
	movaps	xmmword ptr [rsp + 208], xmm0
	movaps	xmmword ptr [rsp + 192], xmm1
	movaps	xmm0, xmmword ptr [rsp + 208]
	movaps	xmmword ptr [rsp + 224], xmm0
	movaps	xmm0, xmmword ptr [rsp + 192]
	movaps	xmmword ptr [rsp + 240], xmm0
	xorps	xmm0, xmm0
	movups	xmmword ptr [r8], xmm0

I also tried it on a machine with avx and avx2 using the most recent Zig (0.12.0-dev.2811+3cafb9655): same picture, no SIMD add, only vmovups, vmovaps and vxorps.

You are right.
When using -O ReleaseSmall there is no vpaddd.
All other options (ReleaseFast, ReleaseSafe) produce the instruction.

Just thought about it.
I also checked this on a machine with avx512f, avx512dq, avx512cd, avx512bw and avx512vl: no vpaddXXX in the generated asm.

Aha, on a machine with SSE only (no AVX) with ReleaseFast I got

$ grep xmm v2.s | grep add
	paddd	xmm0, xmmword ptr [rsp + 144]
	paddq	xmm1, xmm0
	paddq	xmm0, xmmword ptr [rip + .LCPI45_1]

Is this intended? I mean, not using SIMD add instructions in ReleaseSmall mode.

I’m getting vpadd on ReleaseSmall: godbolt.
What code actually got generated in your case? It could be that the compiler found a way to do the addition in fewer bytes.

If I understand correctly, -Doptimize=ReleaseSmall is for zig build, not for zig build-exe…

My v2.s is 40_654 bytes, while the asm taken from Godbolt is 4_606_219 bytes.

v2.s.gz.txt (6.7 KB)

That is the only more or less reasonable explanation, but it’s a bit strange: a single SIMD instruction vs. several scalar instructions… can the latter really be shorter? (I just do not know.)

Code uploaded; remove the .txt suffix after downloading.

That’s right, -Doptimize=ReleaseSmall does not affect zig build-exe:

zig build-exe v2.zig -Doptimize=ReleaseSmall -femit-asm -fsingle-threaded -fstrip
grep xmm v2.s | grep add
	paddd	xmm0, xmm1

But

zig build-exe v2.zig -O ReleaseSmall -femit-asm -fsingle-threaded -fstrip
grep xmm v2.s | grep add

does not generate paddd xmm0, xmm1.
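
To double-check which optimization mode a given zig build-exe invocation actually used, a small sketch like the one below could be compiled with the same flags. As far as I know, zig build-exe reads -D as a C macro definition rather than as a build option, so the first command above would have been compiled in the default Debug mode; treat that as an assumption to verify. The file name mode.zig and the helper modeName are made up for illustration.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical helper: returns the name of the optimization mode this
// binary was compiled with (Debug, ReleaseSafe, ReleaseFast or ReleaseSmall).
fn modeName() []const u8 {
    return @tagName(builtin.mode);
}

pub fn main() void {
    // Prints the mode selected at compile time, not a runtime setting.
    std.debug.print("optimize mode: {s}\n", .{modeName()});
}
```

Building mode.zig once with -O ReleaseSmall and once with -Doptimize=ReleaseSmall, then running both binaries, should show whether the two flags really select the same mode.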

Right, my bad. After correcting this, the vpadd disappeared.
In any case, I removed a bunch of stuff from the code to make the generated assembly easier to understand, and now I’m getting the vpadd: godbolt.
Looking at the original code, I believe that, since the arguments for printing need to be passed on the stack, the compiler did the addition directly on the stack. This replaces the vector add with either a scalar add or a lea, which are smaller. Once I removed the call, the compiler used a vector add again.
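
A reduced test case along the following lines isolates the addition from the input and printing code. This is only a sketch and not the exact code behind the Godbolt link; the function name addVec and the file name add.zig are made up for illustration. Because the function is exported and takes its operands through pointers, the compiler has to emit the addition as a standalone piece of code, with no print call around it to merge with.

```zig
// Hypothetical reduced test case: add two 4-element u32 arrays as vectors.
// Pointers to arrays are used so the function has a plain C-compatible
// signature and can be exported from an object file.
export fn addVec(a: *const [4]u32, b: *const [4]u32, out: *[4]u32) void {
    // Arrays coerce to vectors of the same length and element type.
    const va: @Vector(4, u32) = a.*;
    const vb: @Vector(4, u32) = b.*;
    // Element-wise addition; the result vector coerces back to an array.
    out.* = va + vb;
}
```

Compiling it with zig build-obj add.zig -O ReleaseSmall -femit-asm and grepping add.s for padd should show whether a SIMD add is emitted once the print call is out of the picture. The result may well differ between Zig versions and targets.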

In other words, the compiler may in some situations “sacrifice” SIMD instructions, since the priority is smaller size rather than faster execution. Thanks for your explanations!

BTW, why is there no button similar to “Solution” (e.g. “Explanation”) in the “Explain” category?
Then the original post would show something like “Explained by X in post number N”.
Without such an explicit indication it is not clear whether a topic has been explained or not.
