@Vector best practices

Since there isn't much documentation to be found on @Vector(), I'll ask my question here. If anyone knows of a good source of info, feel free to let me know.

I need to do some comparisons on certain bytes for validation, which is a perfect use case for SIMD. The problem is that the number of bytes could range from just a few to millions.

I've seen multiple ways of doing this. Right now I have a working version where I just create my vectors at runtime, e.g. if I have 10,000 bytes then I use @Vector(10_000, u8).

I've seen people use std.simd.suggestVectorLength instead and do their calculations in a loop. I expect that @Vector does something similar under the hood anyway.

The docs mention:

Note that excessively long vector lengths (e.g. 2^20) may result in compiler crashes on current versions of Zig

I guess using suggestVectorLength helps avoid this issue. Is there another good reason for using std.simd.suggestVectorLength?

If you prefer text: SIMD with Zig
If you prefer video: Zig in Depth: Vectors and SIMD

EDIT:
With a single 10k-element vector there is no good reason to use suggestVectorLength, because that length is far above any suggestion that matches the CPU's vector registers.
The native vector size is determined by the CPU; suggestVectorLength is useful for getting that size so you can move the data through vectors of that size in a loop.
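
To make the loop approach concrete, here is a minimal sketch. It assumes a Zig version where std.simd.suggestVectorLength exists and returns an optional comptime length. The allAscii check and the fallback of 16 lanes are made up for illustration and are not from the thread.

```zig
const std = @import("std");

// Lane count for u8 on the build target, or 16 if the target suggests nothing.
// The fallback value is an assumption for this sketch.
const lanes = std.simd.suggestVectorLength(u8) orelse 16;
const Chunk = @Vector(lanes, u8);

/// Hypothetical validation rule: every byte must be ASCII (high bit clear).
fn allAscii(data: []const u8) bool {
    const limit: Chunk = @splat(0x80);
    var i: usize = 0;
    // Whole chunks: load `lanes` bytes at a time and test them in one go.
    while (i + lanes <= data.len) : (i += lanes) {
        const chunk: Chunk = data[i..][0..lanes].*;
        if (!@reduce(.And, chunk < limit)) return false;
    }
    // Leftover bytes that do not fill a whole chunk are checked one by one.
    for (data[i..]) |b| {
        if (b >= 0x80) return false;
    }
    return true;
}
```

The vector type stays small and fixed no matter how many bytes come in, so the same function handles a handful of bytes or millions.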


I'm still a noob at Zig, so here's an answer from a hardware perspective.

Look up the biggest SIMD register on the hardware you want to support, and use that. 512 bits (aka 64 bytes) is the biggest on general hardware (AVX-512). I guess Zig (or LLVM) will split it across multiple registers on hardware that, say, only supports 16 bytes, so all good.

If you run it in a tight loop, the vector size also shouldn't make that much of a difference. You can unroll the loop or whatever. I guess using a 128-byte vector compiles to the same thing as doing two 64-byte operations, and so on.
Benchmark to be sure.

Don’t forget to align your data (to 64 bytes to be sure).

As for how to support anything from 1 byte up to a lot, you can split the work into the part that is divisible by your vector length and the remainder.
For example:
Got 30 bytes.
30 / 16 is 1, so one iteration with @Vector(16, u8).
30 % 16 is 14, so 14 iterations of normal (scalar) code.
You can use shifts and masks instead of division, but I think the compiler will optimize it (if the length is a power of two, that is).
If you can't guarantee alignment, you can run normal code until the pointer is aligned, then vector code, then normal code again for the tail, but that is playing with pointers and might be scary to some.
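
The 30-byte example above can be sketched roughly like this. The function name bytesEqual and the choice to compare two slices are assumptions for illustration; the fixed 16 lanes match the numbers in the example.

```zig
const std = @import("std");

const lanes = 16;
const V = @Vector(lanes, u8);

/// Compares two byte slices: whole 16-byte chunks with vectors,
/// the remainder byte by byte.
fn bytesEqual(a: []const u8, b: []const u8) bool {
    if (a.len != b.len) return false;
    const full = a.len / lanes; // 30 / 16 == 1 vector iteration
    const rest = a.len % lanes; // 30 % 16 == 14 scalar iterations

    var chunk: usize = 0;
    while (chunk < full) : (chunk += 1) {
        const off = chunk * lanes;
        const va: V = a[off..][0..lanes].*;
        const vb: V = b[off..][0..lanes].*;
        if (!@reduce(.And, va == vb)) return false;
    }

    // Scalar loop over the bytes that did not fill a whole vector.
    const tail = full * lanes;
    for (a[tail..][0..rest], b[tail..][0..rest]) |x, y| {
        if (x != y) return false;
    }
    return true;
}
```

This version does not handle the unaligned-head case; it relies on the compiler emitting unaligned loads where needed.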


I believe @Vector automatically does the splitting depending on the architecture, but is it possible to define the vector's size at runtime? Isn't it part of the type, and hence defined at comptime?
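
A small sketch of why the length has to be known at comptime: it is an argument to the type, so a generic function has to take it as a comptime parameter. The helper name chunkEqual is hypothetical.

```zig
const std = @import("std");

/// `n` is part of the vector type, so it must be a comptime parameter.
fn chunkEqual(comptime n: usize, a: *const [n]u8, b: *const [n]u8) bool {
    const va: @Vector(n, u8) = a.*;
    const vb: @Vector(n, u8) = b.*;
    return @reduce(.And, va == vb);
}
```

A literal such as 10_000 is comptime-known, which is presumably why @Vector(10_000, u8) compiles. Passing a length that is only known at runtime, such as slice.len, should be rejected by the compiler.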

I think you can also "overallocate": work on 32 bytes and then ignore the last 2 bytes in later code. For example, if you are comparing two vectors, you could either make sure that the last 2 bytes are always the same, so they can be reduced along with the rest without influencing the result of the comparison, or you could set/mask them in some way before doing the reduce.
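
Here is one way the masking variant might look, as a sketch only. It assumes both inputs already live in 32-byte buffers and that len is at most 32; prefixEqual is a hypothetical name. It uses std.simd.iota to build the lane indices and @select to force the padding lanes to compare as equal.

```zig
const std = @import("std");

const lanes = 32;
const V = @Vector(lanes, u8);

/// Compares only the first `len` bytes (len <= 32) of two 32-byte buffers.
/// Whatever sits in the padding lanes is ignored.
fn prefixEqual(a: *const [lanes]u8, b: *const [lanes]u8, len: usize) bool {
    const va: V = a.*;
    const vb: V = b.*;

    // Lane indices 0, 1, ..., 31 compared against the number of valid bytes.
    const idx = std.simd.iota(u8, lanes);
    const len_u8: u8 = @intCast(len);
    const limit: V = @splat(len_u8);
    const in_range = idx < limit;

    // Lanes at or past `len` are replaced with `true` before the reduce,
    // so they cannot influence the result.
    const all_true: @Vector(lanes, bool) = @splat(true);
    const eq = @select(bool, in_range, va == vb, all_true);
    return @reduce(.And, eq);
}
```

The other option from the post, keeping the padding bytes identical (for example zeroed) in both buffers, needs no mask at all: a plain @reduce(.And, va == vb) is enough.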

Related talks about SIMD:


This is almost always a good idea for large data, for cache-line reasons.