Using @Vector for SIMD computations does not improve speed

I’m experimenting with @Vector to see if I can make my code run faster. Here, I’m generating N sawtooth waveforms at once, using the vector type to do the same computation on N lanes in parallel.

Unfortunately, N=8 takes twice the time of N=4, so I guess my code is not leveraging SIMD at all. Here is the code:

const std = @import("std");

pub fn Saw(comptime srate: f32, comptime len: usize) type {
    const Vec = @Vector(len, f32);
    const step: Vec = @splat(1 / srate);

    return struct {
        const Self = @This();

        phase: Vec = @splat(0),

        pub fn eval(self: *Self, freq: Vec) Vec {
            const output = @mulAdd(Vec, self.phase, @splat(2), @splat(-1));
            self.phase = @mod(self.phase + step * freq, @as(Vec, @splat(1)));
            return output;
        }
    };
}

pub fn main() void {
    const srate: f32 = 48000;
    const len = 4;
    const Vec = @Vector(len, f32);

    var s = Saw(srate, len){};
    const freq: Vec = @splat(1);
    var out: Vec = undefined;

    for (0..srate * 100) |_| {
        out = s.eval(freq);
    }

    std.debug.print("{}\n", .{out});
}

The code above is compiled with -O ReleaseFast, and here are the results:

N = 4

Command        Mean [ms]       Min [ms]   Max [ms]   Relative
./simple_vec   157.5 ± 3.0     155.1      168.6      1.00

N = 8

Command        Mean [ms]       Min [ms]   Max [ms]   Relative
./simple_vec   329.1 ± 12.5    321.9      363.9      1.00

What am I missing? Is there a compile option I need to enable SIMD? Is there a part of my code that forces it back to sequential evaluation?

It looks like @mod does not get optimized well for float vectors: in the generated assembly it seems to go through the floats one at a time and do some fairly complex work for each of them.
I think instead of @mod(x, 1) you can use x - @floor(x). The compiler produces much better code for this.

Also, for the future, I can recommend godbolt as a tool: it lets you see the assembly output of the compiler. Here is a small example of the two variants in godbolt: Compiler Explorer
There you can see that @mod produces 100 lines of assembly code, whereas x - @floor(x) only produces 3 lines.
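For reference, here is a minimal sketch of the original Saw with only the phase wrap changed from @mod(x, 1) to x - @floor(x). It assumes the phase never goes negative (true here, since freq and step are positive), in which case both forms should give the same wrapped value. The n_samples constant and the {any} format specifier are my own choices for the sketch, not something from the original post, and I have not benchmarked it.

```zig
const std = @import("std");

pub fn Saw(comptime srate: f32, comptime len: usize) type {
    const Vec = @Vector(len, f32);
    const step: Vec = @splat(1 / srate);

    return struct {
        const Self = @This();

        phase: Vec = @splat(0),

        pub fn eval(self: *Self, freq: Vec) Vec {
            // Map the phase from [0, 1) to [-1, 1).
            const output = @mulAdd(Vec, self.phase, @splat(2), @splat(-1));
            // Advance the phase and keep only the fractional part.
            // This replaces @mod(next, 1); assumes next >= 0.
            const next = self.phase + step * freq;
            self.phase = next - @floor(next);
            return output;
        }
    };
}

pub fn main() void {
    const srate: f32 = 48000;
    const len = 8;
    const Vec = @Vector(len, f32);
    // Assumption for this sketch: 100 seconds of samples as a plain integer.
    const n_samples: usize = 48000 * 100;

    var s = Saw(srate, len){};
    const freq: Vec = @splat(1);
    var out: Vec = undefined;

    for (0..n_samples) |_| {
        out = s.eval(freq);
    }

    std.debug.print("{any}\n", .{out});
}
```

Changing len between 4 and 8 and rerunning the same benchmark is the quickest way to check whether the wider vector now actually pays off on your machine.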


Thanks a lot! You are right, the godbolt output is really explicit. I don’t use it much because it usually spits out a lot of things I can’t decipher, but in this case it is crystal clear.