In the following code, if I comment out the call to simd.suggestVectorLength(T) and use the hard-coded value 32 instead, the run time is nearly 3x faster (60 ms vs 175 ms). Is this expected?
(Mac M1, Zig 0.14.0-dev.367+a57479afc, ReleaseFast)
const std = @import("std");
const heap = std.heap;
const mem = std.mem;
const print = std.debug.print;
const simd = std.simd;
const time = std.time;

fn unrollSimdMatrixMultiply(
    comptime T: type,
    c: anytype,
    a: anytype,
    b: anytype,
) void {
    const n = a.len;
    // const vec_len = simd.suggestVectorLength(T) orelse @panic("No SIMD?");
    const vec_len = 32;
    for (c, a) |*c_row, a_row| {
        var j: u32 = 0;
        while (j <= n - vec_len) : (j += vec_len) {
            for (0..n) |k| {
                const u: @Vector(vec_len, T) = b[k][j..][0..vec_len].*;
                const y: @Vector(vec_len, T) = c_row.*[j..][0..vec_len].*;
                const w: @Vector(vec_len, T) = @splat(a_row[k]);
                const slice: [vec_len]T = (u * w) + y;
                @memcpy(c_row.*[j .. j + vec_len], &slice);
            }
        }
        while (j < n) : (j += 1) {
            for (0..n) |k| {
                c_row.*[j] += a_row[k] * b[k][j];
            }
        }
    }
}

fn Matrix(comptime T: type) type {
    return struct {
        m: [][]T,

        const Self = @This();

        fn init(
            allocator: mem.Allocator,
            n: usize,
            v: T,
        ) !Self {
            var to_free: usize = 0;
            var m = try allocator.alloc([]T, n);
            errdefer {
                for (0..to_free) |i| allocator.free(m[i]);
                allocator.free(m);
            }
            for (0..n) |i| {
                m[i] = try allocator.alloc(T, n);
                to_free += 1;
                for (0..n) |j| m[i][j] = v;
            }
            return .{ .m = m };
        }

        fn deinit(self: *Self, allocator: mem.Allocator) void {
            for (self.m) |row| allocator.free(row);
            allocator.free(self.m);
        }
    };
}

pub fn main() !void {
    var gpa = heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const T = u32;
    const n = 500;

    var a = try Matrix(T).init(allocator, n, 1);
    defer a.deinit(allocator);
    var b = try Matrix(T).init(allocator, n, 1);
    defer b.deinit(allocator);
    var c = try Matrix(T).init(allocator, n, 0);
    defer c.deinit(allocator);

    var timer = try time.Timer.start();
    unrollSimdMatrixMultiply(T, c.m, a.m, b.m);
    print("{} {d} ms\n", .{ c.m[50][50], timer.lap() / time.ns_per_ms });
}
Hmm, I just printed out the suggested vec length and it's 4. Looking up the specs for the M1, it has 128-bit NEON, so shouldn't the suggestion be 16?
This works, thanks @dimdin! So if I understand correctly, the function returns the number of element "slots" that fit in a vector register, where each slot is as wide as the given type: for u32 on 128-bit NEON that is 128 / 32 = 4 (it would be 16 only for a one-byte type like u8).
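A quick way to see this (a minimal sketch; the exact numbers depend on the build target, 128-bit NEON assumed here):

const std = @import("std");

pub fn main() void {
    // suggestVectorLength returns elements of T per vector, not bytes.
    std.debug.print("u8: {d}, u32: {d}, f64: {d}\n", .{
        std.simd.suggestVectorLength(u8) orelse 0, // 16 = 128 / 8
        std.simd.suggestVectorLength(u32) orelse 0, // 4 = 128 / 32
        std.simd.suggestVectorLength(f64) orelse 0, // 2 = 128 / 64
    });
}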
So the suggested vec length is correct, but using a larger vec length than the hardware registers can actually hold performs a lot better. Interesting… it seems this parameter has to be fine-tuned with actual benchmarking, then.
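One way to do that tuning (a sketch that drops into the main() above, not from the thread; multiplyWithLen is a hypothetical variant of the function above with vec_len lifted into a comptime parameter):

// Sweep a few comptime vector lengths over the same inputs; reset the
// accumulator matrix c between runs so each measurement does the same work.
inline for ([_]usize{ 4, 8, 16, 32, 64 }) |vl| {
    for (c.m) |row| @memset(row, 0);
    var timer = try time.Timer.start();
    multiplyWithLen(T, vl, c.m, a.m, b.m); // hypothetical helper
    print("vec_len={d}: {d} ms\n", .{ vl, timer.lap() / time.ns_per_ms });
}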
For context, this is the relevant comment in the std.simd source:

// SVE allows up to 2048 bits in the specification, as of 2022 the most powerful machine has implemented 512-bit
// I think is safer to just be on 128 until is more common
// TODO: Check on this return when bigger values are more common
As you increase the vector length, your algorithm becomes more cache friendly. Each b[k][j..][0..vec_len] load touches a new row of b, and therefore a new cache line, as k increases.
But k is your inner loop, so you pay a cache miss on every iteration.
If you increase the vector length, each freshly loaded line contributes more useful work to the k loop, so the cost of the miss is amortized.
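A back-of-envelope check (my assumptions, not from the thread: T = u32, n = 500, and 128-byte cache lines, which is what the M1 uses; many x86 cores use 64 bytes):

const std = @import("std");

test "miss cost arithmetic" {
    // Rows of b are 2000 bytes long and separately allocated, so
    // b[k][j..] and b[k + 1][j..] never share a cache line: every k
    // iteration starts with a miss.
    try std.testing.expectEqual(2000, 500 * @sizeOf(u32));
    // Useful bytes consumed from the fresh line(s) per miss:
    try std.testing.expectEqual(16, 4 * @sizeOf(u32)); // vec_len = 4
    try std.testing.expectEqual(128, 32 * @sizeOf(u32)); // vec_len = 32: 8x more work per miss
}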
Now in this case you can just swap the loops to get rid of the inefficient memory accesses (another 3× speedup):
for (c, a) |*c_row, a_row| {
    for (0..n) |k| {
        var j: u32 = 0;
        while (j <= n - vec_len) : (j += vec_len) {
            const u: @Vector(vec_len, T) = b[k][j..][0..vec_len].*;
            const y: @Vector(vec_len, T) = c_row.*[j..][0..vec_len].*;
            const w: @Vector(vec_len, T) = @splat(a_row[k]);
            c_row.*[j..][0..vec_len].* = (u * w) + y;
        }
        while (j < n) : (j += 1) {
            c_row.*[j] += a_row[k] * b[k][j];
        }
    }
}
Also note that I removed your redundant @memcpy: the vector result is stored directly into c_row.
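In isolation, that direct store looks like this (a minimal sketch with made-up values):

// A @Vector coerces to a fixed-size array, so it can be written
// straight through a [0..4].* view; no temporary array, no @memcpy.
const vec: @Vector(4, u32) = .{ 1, 2, 3, 4 };
var dest: [8]u32 = [_]u32{0} ** 8;
dest[0..4].* = vec;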
Now it still seems that the larger vectors are faster (10 ms vs 9 ms), but the difference is not significant; it could just as well come from differences in code alignment.
Just a side note: if you want real "realtime" (pun intended) behavior for mission-critical applications, simply do not use CPUs with data/instruction caches; they are by their nature unpredictable, in the sense that it is extremely hard to estimate how long a given piece of code will take to execute.