A couple of questions about how folks are approaching working with vector types for SIMD work.
Say I have some complex function that takes a number of f64 arguments. I want to avoid writing two implementations of the same logic, one for scalar f64 and one for @Vector(n, f64). I have a few questions / thoughts on this below.
One solution: use a signature that takes anytype, and use type reflection to detect vector types. This is doable, but stuff like scalar multiplication is scalar * x in scalar-land and the lovely @as(Vec, @splat(3.0)) * x in vector-land, so you have to do a lot of these comptime checks.
Another solution I’ve been playing with is writing everything as vector-only, and special-casing scalar calls as vectors of length 1. For example:
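To make the "lots of comptime checks" point concrete, here is a minimal sketch of one way to confine the check to a single helper. The names lift and scaleAndAdd are hypothetical, and the helper assumes T is either f64 or some @Vector(n, f64); it compares against f64 directly rather than inspecting @typeInfo, since the spelling of the @typeInfo tags has changed between Zig releases.

```zig
const std = @import("std");

// Hypothetical helper: turn an f64 constant into a value of type T.
// Assumes T is either f64 or @Vector(n, f64); anything else fails to compile.
fn lift(comptime T: type, value: f64) T {
    if (T == f64) {
        // Scalar case: the constant is already the right type.
        return value;
    } else {
        // Vector case: broadcast the constant to every lane.
        return @splat(value);
    }
}

// Hypothetical example function: computes 3 * x + y for scalar or vector T.
fn scaleAndAdd(comptime T: type, x: T, y: T) T {
    return lift(T, 3.0) * x + y;
}

test "scaleAndAdd works for scalars and vectors" {
    try std.testing.expectEqual(@as(f64, 7.0), scaleAndAdd(f64, 2.0, 1.0));

    const V4 = @Vector(4, f64);
    const r = scaleAndAdd(V4, .{ 1.0, 2.0, 3.0, 4.0 }, @splat(1.0));
    try std.testing.expectEqual(@as(f64, 4.0), r[0]);
    try std.testing.expectEqual(@as(f64, 13.0), r[3]);
}
```

The body of scaleAndAdd then reads the same for both cases, at the cost of wrapping every constant in lift.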
fn typedMultiply(T: type, x: T, y: T) T {
    return x * y;
}

export fn multiplyAsVec(x: f64, y: f64) f64 {
    return typedMultiply(@Vector(1, f64), @splat(x), @splat(y))[0];
}

export fn multiplyAsScalar(x: f64, y: f64) f64 {
    return typedMultiply(f64, x, y);
}
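As a quick sanity check on the pattern, a small test along these lines should confirm that the length-1 vector path and the scalar path give the same result, and that the same function also works on a wider vector. This is only a sketch: typedMultiply is repeated from above (with comptime written out explicitly), and the test values are arbitrary.

```zig
const std = @import("std");

// Same function as above, with the comptime keyword spelled out.
fn typedMultiply(comptime T: type, x: T, y: T) T {
    return x * y;
}

test "length-1 vector path matches the scalar path" {
    const x: f64 = 1.5;
    const y: f64 = -4.0;

    // Scalar call wrapped as a vector of length 1, then unwrapped.
    const via_vec = typedMultiply(@Vector(1, f64), @splat(x), @splat(y))[0];
    const via_scalar = typedMultiply(f64, x, y);
    try std.testing.expectEqual(via_scalar, via_vec);

    // The same function on a 4-lane vector multiplies element-wise.
    const V4 = @Vector(4, f64);
    const a: V4 = .{ 1.0, 2.0, 3.0, 4.0 };
    const b: V4 = @splat(2.0);
    const prod = typedMultiply(V4, a, b);
    try std.testing.expectEqual(@as(f64, 8.0), prod[3]);
}
```

Saved to a file, this can be run with zig test.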
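I was interested in whether there is any overhead in doing the vector-to-scalar conversion, so I compiled this and looked at the assembly. Interestingly, it’s almost exactly the same (to my very, very untrained eye): both typedMultiply instantiations use the same scalar vmulsd, and the only difference is that multiplyAsVec stores and reloads the result after the call.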
example.multiplyAsScalar:
    push rbp
    mov rbp, rsp
    sub rsp, 304
    vmovsd qword ptr [rbp - 296], xmm0
    vmovsd qword ptr [rbp - 288], xmm1
    lea rax, [rbp - 280]
    mov qword ptr [rbp - 16], rax
    mov qword ptr [rbp - 8], 32
    mov qword ptr [rbp - 24], 0
    lea rdi, [rbp - 24]
    call example.typedMultiply__anon_482
    add rsp, 304
    pop rbp
    ret

example.typedMultiply__anon_482:
    push rbp
    mov rbp, rsp
    sub rsp, 16
    vmovsd qword ptr [rbp - 16], xmm0
    vmovsd qword ptr [rbp - 8], xmm1
    vmulsd xmm0, xmm0, xmm1
    add rsp, 16
    pop rbp
    ret

example.multiplyAsVec:
    push rbp
    mov rbp, rsp
    sub rsp, 304
    vmovsd qword ptr [rbp - 304], xmm0
    vmovsd qword ptr [rbp - 296], xmm1
    lea rax, [rbp - 288]
    mov qword ptr [rbp - 24], rax
    mov qword ptr [rbp - 16], 32
    mov qword ptr [rbp - 32], 0
    lea rdi, [rbp - 32]
    call example.typedMultiply__anon_491
    vmovsd qword ptr [rbp - 8], xmm0
    vmovsd xmm0, qword ptr [rbp - 8]
    add rsp, 304
    pop rbp
    ret

example.typedMultiply__anon_491:
    push rbp
    mov rbp, rsp
    sub rsp, 16
    vmovsd qword ptr [rbp - 16], xmm0
    vmovsd qword ptr [rbp - 8], xmm1
    vmulsd xmm0, xmm0, xmm1
    add rsp, 16
    pop rbp
    ret
My question then is: is this a sensible pattern? Or am I going to run into gotchas where vector-based code ends up using slower instructions on scalars, or things like that?
Please do throw in any other tips for this kind of work as well.
Thanks all!