SIMD generation for large arrays

insalt-glitch · March 25, 2025, 2:33pm

Hey y’all,

I’m working on a simulation where I know the lengths of my arrays at comptime. A minimalistic example looks something like this:

// The functions signatures are for comparison in Godbolt but
// would be simple slices in the actual program.
export fn manual_loop(input: [*]f64, result: [*]f64) void {
    const arr: *const [64]@Vector(8, f64) = @alignCast(@ptrCast(input));
    const vec_res: *[64]@Vector(8, f64) = @alignCast(@ptrCast(result));
    for (0..64) |idx| {
        vec_res.*[idx] = @sqrt(arr.*[idx]);
    }
}

export fn intuitive_solution(input: [*]f64, result: [*]f64) void {
    const arr: *const @Vector(512, f64) = @alignCast(@ptrCast(input));
    const vec_res: *@Vector(512, f64) = @alignCast(@ptrCast(result));
    vec_res.* = @sqrt(arr.*);
}

For many of the array-wide operations, the ZIG-vector would come in handy. However, the ZIG-compiler generates hundrets (or thousands) of instructions instead of using a loop that be optimized by LLVM. Why is this the case? Am I using it wrong? The workaround is annoying in particular reduce operations and forces me to do all the math myself (and potentially requires some guess on the available SIMD width).

I would love to understand a bit more about how this is supposed to be used .Thank you everyone!

lufe · March 25, 2025, 3:53pm

Why is it an issue? Are you getting the wrong results? Or is the binary size an issue? Or did you benchmark and get worse results?

You could pass the arrays as slices, in that case (maybe, I did not check) zig does not unroll the loop. Also you probably get vectorization without specificly casting to vectors. Did you try that?

insalt-glitch · March 25, 2025, 4:13pm

You definitely have a point. In order to be 100% sure, I should benchmark this for different array-sizes. However, even without the benchmark. My understanding is:

for an array with ~10^7 elements we would emit ~10^6 assembly instructions every time such an operation is called in program. This leads to extremely large binaries.
Additionally, from my understanding for CPU-architectures, this would considerably harm the performance of the program because the instruction cache would be absolutely flooded. However, I’m not 100% sure on this last point.
This is more of a convenience thing, but debugging would really annoying as well if the functions get this large.

There might be other reasons but these are the primary ones I considered when asking the question. My reasoning is simply that it should be possible to figure out a better way to generate the assembly or LLVM-IR in this case. It is not that the ZIG-compiler is doing the wrong thing, but its also not doing the best thing either…

Cheers

IntegratedQuantum · March 25, 2025, 5:59pm

I think the main problem is that it would require a fundamentally different computation model, where the compiler would figure out the entire code path and execute it in blocks of std.simd.suggestVectorLength(T), instead of instruction by instruction to avoid storing intermediate values on the stack (which is expensive, floods the data cache, and in the case of 10⁷ elements would even cause a stack overflow).

So far I’ve only seen this computation model for shader languages, and as nice as it would be, I don’t think we can expect Zig to achieve it any time soon.

Cloudef · March 26, 2025, 4:31am

The vector operations guarantee SIMD operations on supported CPUs, while arrays might or might not, but as you see in the godbolt, LLVM generates much better code for your for loop version, and I bet it will do even if you don’t use the @Vector at all. Sometimes letting the compiler to do the job is better.

So far I’ve only seen this computation model for shader languages, and as nice as it would be, I don’t think we can expect Zig to achieve it any time soon.

Zig has some experimental spir-v/shader compilation. Even with compute shaders you have to manually specify your workgroup size though. It would be interesting to have SIMD backend for this model however.

insalt-glitch · March 26, 2025, 9:44am

Thanks! You’re right, he following code:

export fn simple_loop(input: [*]align(64)f64, result: [*]align(64)f64) void {
    for (0..511) |idx| {
        result[idx] = @sqrt(input[idx]);
    }
}

Does indeed generate the intended code as well. This means the use of vectors is simply meant as a convenient feature but nothing more? Nevertheless, I think this is what vectors are meant for so it would be nice to see this feature in the compiler at some point in the future .

Cloudef · March 26, 2025, 9:58am

The heavy lifting for them is done mostly by LLVM, so what kind of code is generated is not much in control of zig. Zig’s own backends might be able to improve this.

Validark · March 26, 2025, 10:37am

Calling @sqrt on a @Vector(512, f64) is expected to generate 512 / 8 sqrt instructions on a machine that can do 8 at once. It will not automatically create a loop for you. If you want more control, you can try something like this:

export fn simple_loop(input: [*]align(64)f64, result: [*]align(64)f64, len: usize) void {
    const input_slice = input[0..len];
    const result_slice = result[0..len];

    var cur_input = input_slice;
    var cur_result = result_slice;

    const VLEN = 32;

    while (true) {
        cur_result[0..VLEN].* = @sqrt(@as(@Vector(VLEN, f64), cur_input[0..VLEN].*));
        cur_input = cur_input[VLEN..];
        cur_result = cur_result[VLEN..];
        if (cur_input.len == 0) break;
    }
}

Just make sure that for whatever VLEN you choose, your slices are big enough that you don’t run off the end of a page!

QuantumEspresso · March 29, 2025, 5:11am

I believe VLEN could be automatically chosen by std.simd.suggestVectorLength or std.simd.suggestVectorLengthForCpu.

Validark · March 29, 2025, 12:43pm

I’m talking about what happens when there are 5 elements left in a slice but you load in 8 at a time. You can guarantee memory safety by aligning the pointer to VLEN (prevents page faults but you might grab some unrelated data, which could be bad if you are doing an operation whose latency varies based on the input or if you overwrite the data) or by overallocating a few extra slots for the slice. Either way, one should be mindful of their chosen strategy.