SIMD and lookup tables

For my silly project, I need to compute sin(x) for an array of 65536 values of x.

To be more precise it is a sinusoidal function that takes an u8 and return an u8. So right now what I’m doing is initializing an array const sin: [256]u8, so I can just return the precomputed value when requested.

Now, I’m trying to use SIMD instructions to improve performance, but I’m not sure how I can approach that. My first try was to use @shuffle like this:

@shuffle(u8, sin, _, x);

Unfortunately the mask (x) has to be comptime known …

I was just sharing this yesterday :slight_smile:

1 Like

Just to clarify the question, given x = [x1, x2, x3, …, x65536], you want to compute [sin(x1), sin(x2), sin(x3), …, sin(x65536)]? And the “sin()” function is your lookup table? Or can it be the built-in sin function?

sorry I wasn’t clear, let me rephrase. I have a “sin-ish” function that takes the form sin(u8) -> u8. In order to speed things up, I did some memoization : I computed sin(x) for every possible value of x and stored the results in a sin_memo [256]u8 array, so now instead of calling sin(x) and recomputing it, I can just access sin_memo[x] to get the result.

Now, what if x is not one value but many ? I want to transform sin(u8) to sin([65536]u8) -> [65536]u8.

damn I missed it, I’ll go read this!

I see. Thanks for the clarification. In this case, @shuffle won’t work since it needs a comptime mask/indices for the 3rd argument while your data from the 65536 array is live at runtime. In defense of Zig, lookup via SIMD is not straight forward even using native asm. The best bet is using AVX 512’s vpermb for 64-byte lookup, where the 256-lookup table can be split into 4 lookups. However, AVX512 adoption is not wide spread yet. Or use ARM’s tbl and tbx with cascade lookup.

For using AVX2, you can look at simdJson/Highway/Folly/Lemire’s SIMD UTF-8 classifier for reference. The typical approach is splitting the lookup key into hi-nibble (4-bit) and lo-nibble. Split your lookup table into 16 chunks of smaller 16-element lookup table. Use vpshufb on the lo-nibble against all 16 chunks. Then blend the 16 results using the hi-nibble as the selector. You can ask ChatGpt/Gemini for the detail.

Actually you can rely on the compiter/LLVM/backend to do auto-vectorization and be done with it. LLVM is pretty smart to pick the correct SIMD instruction for you, and you don’t have to worry about platform support.

    // Populate lookup and x.
    var lookup: [256]u8 = undefined;
    var x: [65536]u8 = undefined;
    var result: [65536]u8 = undefined;

    const Vec64 = @Vector(64, u8);    // process 64 bytes at a time (aim for AVX-512/Neon)
    var i: usize = 0;
    while (i < x.len) : (i += 64) {
        // Load a chunk of x as lookup indices into SIMD vector(s)
        const indices: Vec64 = x[i..][0..64].*;

        // LLVM sees this and most likely optimizes it into SIMD/unrolled instructions.
        var sine_values: Vec64 = undefined;
        inline for (0..64) |j| {
            sine_values[j] = lookup[indices[j]];
        }

        result[i..][0..64].* = sine_values;
    }

Check the assembly output to verify.

2 Likes

tangentially related, so I’m putting it here instead of creating another post.

I am generating a [65536]u8 at comptime but it is too slow for my old hardware.

pub const x = init: {
    var arr: [size * size]u8 = undefined;
    var row: [size]u8 = undefined;
    for (&row, 0..) |*r, i| {
        r.* = @intCast(i);
    }

    var i: usize = 0;
    while (i < arr.len) : (i += size) {
        @memcpy(arr[i..][0..size], &row);
    }

    break :init arr;
};

Obviously I don’t want to wait for 10 seconds for the arrays to fill up every time I compile my program. How should I approach this? Should I add a helper program to generate a binary and embed it into my actual program? How can I add the production of this binary as a build step?

tl;dr: generate a file using a separate helper program (addExecutable, addRunArtifact), add the lazypath of the generated file as a module to your main module, then @embedFile it

4 Likes