Hey everyone, I was trying to understand SIMD a bit better and played around with vectors in Zig.
I have the following two examples and noticed that I don’t really get much of a performance improvement. It seems the compiler is already vectorising the non-SIMD example (because it’s a struct of arrays?). My knowledge of assembly is not that great though, so I have trouble validating it. Am I using SIMD the right way, or is this just a wrong use case?
Compilers are really good at vectorizing loops where:
- the control flow of the loop does not depend on the loop body (e.g. no if (condition()) break),
- the loop iterations are independent of each other (e.g. no x[i] = x[i - 1] * 10;),
- the loop body always takes the same control flow (e.g. no ifs in the body).
Your example ticks all the boxes, so it is well optimized.
Sometimes compilers can even overcome the problems mentioned above, but it’s not reliable.
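For illustration, here is a loop that ticks all three boxes (made-up names, not your code) and that LLVM will typically auto-vectorize in ReleaseFast:

fn scaleAdd(dst: []f32, src: []const f32, k: f32) void {
    // assumes dst.len == src.len;
    // no early exit, no dependency between iterations, no branches in the body
    for (dst, src) |*d, s| {
        d.* += s * k;
    }
}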
You may ask: why are these problems for the compiler?
If an early exit from the loop is possible, vectorizing can cause the code to access memory it shouldn’t have accessed. The classic example is searching for the 0 byte in a zero-terminated string: the compiler has no choice but to do dumb one-byte iterations, since reading past the 0 byte can cause problems. Compiler Explorer
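A simplified sketch of that kind of loop (not the exact code from the link):

fn strLen(s: [*:0]const u8) usize {
    var i: usize = 0;
    // must stop at the first 0 byte; loading a whole vector at a time
    // could read past it into memory the program isn't allowed to touch
    while (s[i] != 0) : (i += 1) {}
    return i;
}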
The compiler would need to know a closed-form formula to be able to overcome that. Here, for example, it doesn’t know the Fibonacci sequence (Compiler Explorer), but it does know about algebraic sequences (Compiler Explorer).
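A simplified sketch of the Fibonacci case (not the linked code) where each iteration depends on the previous ones, so the loop cannot be split into independent lanes without a closed form:

fn fillFib(buf: []u64) void {
    // assumes the caller has already set buf[0] and buf[1]
    var i: usize = 2;
    while (i < buf.len) : (i += 1) {
        // serial dependency: each element needs the one just written
        buf[i] = buf[i - 1] + buf[i - 2];
    }
}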
SIMD stands for Single Instruction, Multiple Data, so if the instructions differ between iterations you can’t really use it. There are some SIMD instructions that can be used to emulate divergent control flow, but not always.
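One common emulation is per-lane selection. A rough sketch in Zig (made-up function, just to show the idea) that uses @select instead of an if in the loop body:

fn reluVec(comptime len: usize, v: @Vector(len, f32)) @Vector(len, f32) {
    const zero: @Vector(len, f32) = @splat(0.0);
    // every lane computes both "branches"; @select keeps the right one per lane
    return @select(f32, v > zero, v, zero);
}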
Yes, the compiler will auto-vectorize any trivial loop like this, and it is rather difficult to do any better with manual vectorization.
Honestly, the biggest gains of SIMD here come from changes in memory layout, which you already laid out perfectly for the compiler to optimize in this case.
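To make the layout point concrete, here is a small sketch with made-up types (not your code): with a struct of arrays each field is contiguous, so the compiler can load a whole vector of x values at once, while the array-of-structs version has to deal with interleaved fields:

const Aos = struct { x: f32, y: f32 }; // array-of-structs element
const Soa = struct { xs: []f32, ys: []f32 }; // struct of arrays

fn moveSoa(p: Soa, dx: f32) void {
    // contiguous f32s: one vector load per chunk, trivially auto-vectorized
    for (p.xs) |*x| x.* += dx;
}

fn moveAos(ps: []Aos, dx: f32) void {
    // strided access: still vectorizable, but needs shuffles or gathers
    for (ps) |*p| p.x += dx;
}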
To give you an example where manual SIMD is more relevant, I’d suggest looking at std.mem.eql.
The original code looked something like this
for (a, b) |a_elem, b_elem| {
    if (a_elem != b_elem) return false;
}
Here the compiler isn’t allowed to access memory that wouldn’t be accessed by the unoptimized function, since that may in the worst case cause a page fault and crash the program.
That means, if you know that it is legal to access that memory, then you can write much faster code.
Here the manual SIMD code roughly looks like this (plus a ton of extra code to handle unaligned inputs and odd lengths)
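A minimal sketch of the idea (not the actual std.mem.eql source; it assumes both slices have the same length, that the length is a multiple of the vector size, and elides all the tail handling):

const std = @import("std");

fn eqlVec(a: []const u8, b: []const u8) bool {
    std.debug.assert(a.len == b.len);
    const vector_len = comptime (std.simd.suggestVectorLength(u8) orelse 16);
    const V = @Vector(vector_len, u8);
    std.debug.assert(a.len % vector_len == 0); // unaligned/odd-length handling elided
    var i: usize = 0;
    while (i < a.len) : (i += vector_len) {
        const va: V = a[i..][0..vector_len].*;
        const vb: V = b[i..][0..vector_len].*;
        // compare a whole chunk per iteration instead of one byte at a time
        if (!@reduce(.And, va == vb)) return false;
    }
    return true;
}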
Thanks for the replies! Very insightful. So from what I understand, I should probably just try to keep my loops tight and branchless if possible, and if I can’t, I should try to incorporate manual SIMD where possible? It kinda feels like an “it depends” topic and is obviously not as straightforward as I thought.
Also trying to learn SIMD. What I do is adapted from Using SIMD to Tell if the High Bit is Set, but with a build option to use SIMD so I can easily compare performance with and without it. Sometimes it’s faster, sometimes not.
var remaining = input;
if (build_options.use_simd) if (comptime std.simd.suggestVectorLength(u8)) |vector_len| {
    while (remaining.len > vector_len) {
        const chunk: @Vector(vector_len, u8) = remaining[0..vector_len].*;
        // do something
        remaining = remaining[vector_len..];
    }
};
for (0..remaining.len) |i| {
    remaining[i] = …;
}