# Suggested SIMD vector length is slower?

In the following code, if I comment out the call to `simd.suggestVectorLength(T)` and use the hard-coded value `32`, the code runs nearly 3x faster (60 ms vs 175 ms). Is this expected?
(Mac M1, Zig 0.14.0-dev.367+a57479afc, ReleaseFast)

```zig
const std = @import("std");
const heap = std.heap;
const mem = std.mem;
const print = std.debug.print;
const simd = std.simd;
const time = std.time;

fn unrollSimdMatrixMultiply(
    comptime T: type,
    c: anytype,
    a: anytype,
    b: anytype,
) void {
    const n = a.len;
    // const vec_len = simd.suggestVectorLength(T) orelse @panic("No SIMD?");
    const vec_len = 32;

    for (c, a) |*c_row, a_row| {
        var j: u32 = 0;

        while (j <= n - vec_len) : (j += vec_len) {
            for (0..n) |k| {
                const u: @Vector(vec_len, T) = b[k][j..][0..vec_len].*;
                const y: @Vector(vec_len, T) = c_row.*[j..][0..vec_len].*;
                const w: @Vector(vec_len, T) = @splat(a_row[k]);
                const slice: [vec_len]T = (u * w) + y;
                @memcpy(c_row.*[j .. j + vec_len], &slice);
            }
        }

        while (j < n) : (j += 1) {
            for (0..n) |k| {
                c_row.*[j] += a_row[k] * b[k][j];
            }
        }
    }
}

fn Matrix(comptime T: type) type {
    return struct {
        m: [][]T,

        const Self = @This();

        fn init(
            allocator: mem.Allocator,
            n: usize,
            v: T,
        ) !Self {
            var to_free: usize = 0;
            var m = try allocator.alloc([]T, n);
            errdefer {
                for (0..to_free) |i| allocator.free(m[i]);
                allocator.free(m);
            }

            for (0..n) |i| {
                m[i] = try allocator.alloc(T, n);
                to_free += 1;
                for (0..n) |j| m[i][j] = v;
            }

            return .{ .m = m };
        }

        fn deinit(self: *Self, allocator: mem.Allocator) void {
            for (self.m) |row| allocator.free(row);
            allocator.free(self.m);
        }
    };
}

pub fn main() !void {
    var gpa = heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const T = u32;
    const n = 500;

    var a = try Matrix(T).init(allocator, n, 1);
    defer a.deinit(allocator);
    var b = try Matrix(T).init(allocator, n, 1);
    defer b.deinit(allocator);
    var c = try Matrix(T).init(allocator, n, 0);
    defer c.deinit(allocator);

    var timer = try time.Timer.start();
    unrollSimdMatrixMultiply(T, c.m, a.m, b.m);
    print("{} {d} ms\n", .{ c.m[50][50], timer.lap() / time.ns_per_ms });
}
```

Hmm, I just printed out the suggested vec length and it's `4`. Looking up the info for the M1, it says it's Neon, 128-bit, so shouldn't the suggestion be `16`?

On my CPU I am getting `32 = @sizeOf(T=u32)=4 * suggestVectorLength(T=u32)=8`,
and it takes 99 ms.

```zig
const vec_len = @sizeOf(T) * (simd.suggestVectorLength(T) orelse @panic("No SIMD?"));
```

EDIT: but this is wrong; the correct type is `@Vector(4, u32)`.

```zig
const vec_len = simd.suggestVectorLength(T) orelse @panic("No SIMD?");
```

And this takes 155ms!!!

This works, thanks @dimdin! So if I understand correctly, the function returns the number of "slots" (lanes), each slot sized to the number of bytes the specific type uses.
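If it helps, here is a minimal sketch to confirm this on any target: since the return value is a lane count, halving the element size should double the suggestion (the expected values in the comment assume a 128-bit SIMD target such as Neon; on AVX2 they double).

```zig
const std = @import("std");
const simd = std.simd;

pub fn main() void {
    // suggestVectorLength returns a lane count, or null if the target has no SIMD.
    // On a 128-bit SIMD target (e.g. Neon) expect 4, 8, 16:
    const lanes_u32 = comptime simd.suggestVectorLength(u32) orelse @compileError("no SIMD");
    const lanes_u16 = comptime simd.suggestVectorLength(u16) orelse @compileError("no SIMD");
    const lanes_u8 = comptime simd.suggestVectorLength(u8) orelse @compileError("no SIMD");
    std.debug.print("{d} {d} {d}\n", .{ lanes_u32, lanes_u16, lanes_u8 });
}
```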

Now we're both confused! lol

That was Debug (CPU: i7-14700).
In ReleaseFast I am getting 14 ms (vec_len=32) / 22 ms (vec_len=4).


So the suggested vec length is correct, but using a larger vec length than the hardware actually has room for performs a lot better. Interesting… Seems like this parameter has to be fine-tuned with actual benchmarking then.

I am not sure that the suggested vector length is correct.

See the source of `simd.suggestVectorLengthForCpu`.
For aarch64 there is a comment:

```zig
// SVE allows up to 2048 bits in the specification, as of 2022 the most powerful machine has implemented 512-bit
// I think is safer to just be on 128 until is more common
// TODO: Check on this return when bigger values are more common
```

Yes, that is correct: `suggestVectorLength` computes the total bit width of the vector and divides it by the bit width of the element type.

```zig
return @divExact(vector_bit_size, element_bit_size);
```
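Working that formula through for the M1: Neon registers are 128 bits and `u32` is 32 bits, so the division yields the 4 lanes observed above. A quick check:

```zig
const std = @import("std");

test "lane count = vector bits / element bits" {
    // Neon registers are 128 bits wide; u32 elements are 32 bits.
    const vector_bit_size = 128;
    const element_bit_size = @bitSizeOf(u32);
    // 128 / 32 = 4 lanes, the value seen on the M1.
    try std.testing.expectEqual(4, @divExact(vector_bit_size, element_bit_size));
}
```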

As you increase the vector length, your algorithm becomes more cache-friendly.
`b[k][j..][0..vec_len]` will always need to load a new cache line as `k` increases.
But `k` is your inner loop, so you are taking a cache miss once per iteration.
If you increase the vector length, you do more work within the `k` loop, so the cache misses are less severe.

Now in this case you can just swap the loops to get rid of the inefficient memory accesses (another 3× speedup):

```zig
for (c, a) |*c_row, a_row| {
    for (0..n) |k| {
        var j: u32 = 0;
        while (j <= n - vec_len) : (j += vec_len) {
            const u: @Vector(vec_len, T) = b[k][j..][0..vec_len].*;
            const y: @Vector(vec_len, T) = c_row.*[j..][0..vec_len].*;
            const w: @Vector(vec_len, T) = @splat(a_row[k]);
            c_row.*[j..][0..vec_len].* = (u * w) + y;
        }
        while (j < n) : (j += 1) {
            c_row.*[j] += a_row[k] * b[k][j];
        }
    }
}
```

Also note that I removed your useless `@memcpy`.
Now it still seems that the larger vectors are faster (10 ms vs 9 ms), but this is not significant because it could also come from differences in code alignment.


Just a side note: if you want real "realtime" (pun intended) or mission-critical applications, just do not use CPUs with data/instruction caches; they are by their nature unpredictable, in the sense that it is extremely hard to estimate how long a given piece of code will take to execute.

This is excellent, thanks! I guess the fine-tuning then moves from fitting the SIMD register size to fitting the cache line size.
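As a rough starting point for that tuning, a sketch assuming the common 64-byte cache line (Apple Silicon reportedly uses 128-byte lines, so the numbers there would double): a `u32` vec_len of 16 spans exactly one 64-byte line per load, and 32 spans two, which fits the observation that 32 keeps winning.

```zig
const std = @import("std");

test "elements per cache line" {
    // 64-byte lines are typical on x86-64; adjust for your target.
    const cache_line_bytes = 64;
    // One 64-byte line holds 16 u32 elements.
    try std.testing.expectEqual(16, cache_line_bytes / @sizeOf(u32));
}
```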