Suggested SIMD vector length is slower?

In the following code, if I comment out the call to simd.suggestVectorLength(T) and use the hard-coded value 32, the run time is nearly 3x faster (60ms vs 175ms). Is this expected?
(Mac M1, Zig 0.14.0-dev.367+a57479afc, ReleaseFast)

const std = @import("std");
const heap = std.heap;
const mem = std.mem;
const print = std.debug.print;
const simd = std.simd;
const time = std.time;

fn unrollSimdMatrixMultiply(
    comptime T: type,
    c: anytype,
    a: anytype,
    b: anytype,
) void {
    const n = a.len;
    // const vec_len = simd.suggestVectorLength(T) orelse @panic("No SIMD?");
    const vec_len = 32;

    for (c, a) |*c_row, a_row| {
        var j: u32 = 0;

        while (j <= n - vec_len) : (j += vec_len) {
            for (0..n) |k| {
                const u: @Vector(vec_len, T) = b[k][j..][0..vec_len].*;
                const y: @Vector(vec_len, T) = c_row.*[j..][0..vec_len].*;
                const w: @Vector(vec_len, T) = @splat(a_row[k]);
                const slice: [vec_len]T = (u * w) + y;
                @memcpy(c_row.*[j .. j + vec_len], &slice);
            }
        }

        while (j < n) : (j += 1) {
            for (0..n) |k| {
                c_row.*[j] += a_row[k] * b[k][j];
            }
        }
    }
}

fn Matrix(comptime T: type) type {
    return struct {
        m: [][]T,

        const Self = @This();

        fn init(
            allocator: mem.Allocator,
            n: usize,
            v: T,
        ) !Self {
            var to_free: usize = 0;
            var m = try allocator.alloc([]T, n);
            errdefer {
                for (0..to_free) |i| allocator.free(m[i]);
                allocator.free(m);
            }

            for (0..n) |i| {
                m[i] = try allocator.alloc(T, n);
                to_free += 1;
                for (0..n) |j| m[i][j] = v;
            }

            return .{ .m = m };
        }

        fn deinit(self: *Self, allocator: mem.Allocator) void {
            for (self.m) |row| allocator.free(row);
            allocator.free(self.m);
        }
    };
}

pub fn main() !void {
    var gpa = heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const T = u32;
    const n = 500;

    var a = try Matrix(T).init(allocator, n, 1);
    defer a.deinit(allocator);
    var b = try Matrix(T).init(allocator, n, 1);
    defer b.deinit(allocator);
    var c = try Matrix(T).init(allocator, n, 0);
    defer c.deinit(allocator);

    var timer = try time.Timer.start();
    unrollSimdMatrixMultiply(T, c.m, a.m, b.m);
    print("{} {d} ms\n", .{ c.m[50][50], timer.lap() / time.ns_per_ms });
}

Hmm, just printed out the suggested vec length and it’s 4. Looking up the info for the M1, it says it has 128-bit NEON, so shouldn’t the suggestion be 16?

On my CPU I am getting 32 = @sizeOf(T=u32)=4 * suggestVectorLength(T=u32)=8
And it takes 99ms.

const vec_len = @sizeOf(T) * (simd.suggestVectorLength(T) orelse @panic("No SIMD?"));

EDIT: but this is wrong; the correct one is @Vector(4, u32).

const vec_len = simd.suggestVectorLength(T) orelse @panic("No SIMD?");

And this takes 155ms!!!

This works, thanks @dimdin ! So if I understand correctly, the function returns the number of “slots” (lanes), each slot holding one value of the given type, i.e. as many bytes as that type uses.
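For example, a minimal sketch to check that, assuming a 128-bit NEON target like the M1 (the numbers will differ on other CPUs):

const std = @import("std");
const simd = std.simd;

pub fn main() void {
    // Lane count = vector register bits / element bits, so on 128-bit NEON
    // these should come out to 16, 4 and 2 respectively.
    const lanes_u8: usize = simd.suggestVectorLength(u8) orelse 0;
    const lanes_u32: usize = simd.suggestVectorLength(u32) orelse 0;
    const lanes_f64: usize = simd.suggestVectorLength(f64) orelse 0;
    std.debug.print("u8: {d}, u32: {d}, f64: {d}\n", .{ lanes_u8, lanes_u32, lanes_f64 });
}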

Now we’re both confused! lol

That was Debug (CPU: i7-14700).
In ReleaseFast I am getting 14ms (vec_len=32) / 22ms (vec_len=4).
:face_with_spiral_eyes:


So the suggested vec length is correct, but using a larger vec length than the hardware actually has space for performs a lot better. Interesting… Seems like this parameter has to be fine-tuned with actual benchmarking then.
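Something like this could serve as a rough benchmark harness. A sketch only: it assumes the multiply is reworked to take the vector length as a comptime parameter (multiplyWithVecLen is a hypothetical name) and that this runs inside main from the code above:

// Rough sketch; assumes a hypothetical variant of the multiply that takes
// the vector length as a comptime parameter, e.g.
//   fn multiplyWithVecLen(comptime T: type, comptime vec_len: usize,
//                         c: anytype, a: anytype, b: anytype) void { ... }
inline for (.{ 4, 8, 16, 32, 64 }) |vec_len| {
    // Reset c to zero so every candidate does the same amount of work.
    for (c.m) |row| @memset(row, 0);
    var timer = try time.Timer.start();
    multiplyWithVecLen(T, vec_len, c.m, a.m, b.m);
    print("vec_len={d}: {d} ms\n", .{ vec_len, timer.lap() / time.ns_per_ms });
}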

I am not sure that the suggested vector length is correct.

See the source of simd.suggestVectorLengthForCpu.
For aarch64 there is this comment:

// SVE allows up to 2048 bits in the specification, as of 2022 the most powerful machine has implemented 512-bit
// I think is safer to just be on 128 until is more common
// TODO: Check on this return when bigger values are more common

Yes, that is correct, because suggestVectorLength takes the total bit width of the vector register and divides it by the bit width of the element type:

return @divExact(vector_bit_size, element_bit_size);
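
So for 128-bit NEON and T = u32 that is @divExact(128, 32) == 4 lanes; the 16 from earlier would only be the answer for u8 (128 / 8).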

As you increase the vector length, your algorithm becomes more cache-friendly.
b[k][j..][0..vec_len] will always need to load a new cache line as k increases.
But k is your inner loop, so you are paying a cache miss on every iteration.
If you increase the vector length, more work is done within the k loop, so that cache miss is amortized and less severe.

Now in this case you can just swap the loops to get rid of the inefficient memory accesses (another 3× speedup):

    for (c, a) |*c_row, a_row| {
        for (0..n) |k| {
            var j: u32 = 0;
            while (j <= n - vec_len) : (j += vec_len) {
                const u: @Vector(vec_len, T) = b[k][j..][0..vec_len].*;
                const y: @Vector(vec_len, T) = c_row.*[j..][0..vec_len].*;
                const w: @Vector(vec_len, T) = @splat(a_row[k]);
                c_row.*[j..][0..vec_len].* = (u * w) + y;
            }
            while (j < n) : (j += 1) {
                c_row.*[j] += a_row[k] * b[k][j];
            }
        }
    }

Also note that I removed your useless @memcpy.
Now it still seems that the larger vectors are faster (10 ms vs 9 ms), but this is not significant because it could also come from differences in code alignment.


Just a side note: if you want truly “realtime” (pun intended) or mission-critical applications, just do not use CPUs with data/instruction caches; they are by their nature unpredictable, in the sense that it is extremely hard to estimate how long a given piece of code will take to execute.

This is excellent, thanks! I guess the fine-tuning then moves from fitting the SIMD register size to fitting the cache line size.
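
If it helps, std.atomic.cache_line should give a compile-time, per-target estimate of the cache line size (a conservative constant, not a runtime query), so a quick sketch along these lines shows how many elements fit per line:

const std = @import("std");

pub fn main() void {
    // std.atomic.cache_line is a comptime estimate of the cache line size
    // for the target (e.g. 64 bytes on typical x86_64).
    const line: usize = std.atomic.cache_line;
    std.debug.print("cache line: {d} bytes, u32 per line: {d}\n", .{
        line,
        line / @sizeOf(u32),
    });
}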