Dispatching SIMD functions at runtime

I have been learning about vectors and SIMD, and one of the topics I’ve seen come up is runtime dispatch, at least for x86_64 devices. The idea is to compile multiple versions of your code, using larger vectors on the devices that support them, and then choose which function to run based on information about the CPU gathered at runtime. This lets you distribute a single binary for the architecture while still taking advantage of newer CPU features.

This got me wondering: how would I do this in zig? Some days later, I had a proof of concept that I think is worth sharing.

Problems

There are a couple of problems that need to be solved for this to work:

  1. Detect which features the current CPU supports
  2. Compile the vector functions once for each set of features we want to support

Runtime detection of CPU features

The zig standard library is capable of detecting the current cpu! In hindsight this seems obvious, since the compiler has to detect the host CPU whenever you build for a native target, but figuring out how to use the zig standard library for this took me some time.

First, import the builtin module with @import("builtin"); we need it to figure out what target the executable was built for. Then create a query with std.Target.Query.fromTarget(builtin.target) and set its cpu_model field to .native. Finally, call std.zig.system.resolveTargetQuery(query) to get the resolved runtime target information.

We aren’t quite done yet. To actually check features, we need to make use of the type std.Target.Cpu.Feature.Set. Also, while the previous code should work anywhere zig runs, the next part needs to switch on the current architecture. To keep things simple, we’ll assume the architecture is x86_64. The feature definitions for x86_64 live in std.Target.x86 (its all_features array), which we will use to look up what each of the feature flags we got from the native target actually means.

Here’s what that looks like all together:

Get cpu information at runtime
const std = @import("std");
const builtin = @import("builtin");

pub fn main() !void {
    var query = std.Target.Query.fromTarget(builtin.target);
    query.cpu_model = .native;

    const native_target = try std.zig.system.resolveTargetQuery(query);

    switch (builtin.target.cpu.arch) {
        .x86_64 => {
            const x86 = std.Target.x86;

            std.debug.print("CPU Supported Features\n", .{});
            // Feature.Set.isEnabled takes the feature's index, so iterate with one.
            for (x86.all_features, 0..) |feat, i| {
                if (!native_target.cpu.features.isEnabled(@intCast(i))) continue;
                std.debug.print("Feature: {s}\nDescription: {s}\n", .{ feat.name, feat.description });
            }
        },
        else => @compileError("Unsupported architecture"),
    }
}
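
If you only need to check one or two specific features rather than dump everything, each architecture namespace also generates a featureSetHas helper that does a single lookup. A small sketch (assuming zig 0.13, same resolution steps as above):

```zig
const std = @import("std");
const builtin = @import("builtin");

pub fn main() !void {
    // Resolve the native CPU the same way as before.
    var query = std.Target.Query.fromTarget(builtin.target);
    query.cpu_model = .native;
    const native_target = try std.zig.system.resolveTargetQuery(query);

    // featureSetHas checks one named feature against the resolved set.
    const has_avx2 = std.Target.x86.featureSetHas(native_target.cpu.features, .avx2);
    std.debug.print("avx2 supported: {}\n", .{has_avx2});
}
```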

Compiling the same function multiple times

We could technically re-implement the vector functions for each set of targets we want runtime dispatch for, but that would quickly become annoying. Instead, I defined an enum with a member for each CPU family I wanted to target. The names can be arbitrary, but I used the names that zig/llvm uses to group x86_64 cpus: x86_64, x86_64_v2, x86_64_v3, and x86_64_v4.

This is where it might get a little tricky to follow. We create a function that takes the runtime CPU as a comptime parameter and returns a struct type containing our vector functions. Inside the function, we compare the runtime CPU with the build target. If they don’t match, we return a struct with a declaration like const fn_name = @extern(FnPtr, .{ .name = @tagName(runtime_cpu) ++ "_fn_name" }); for each of our functions. If the runtime CPU value matches the build-time CPU value, we declare a function with the actual implementation. Before returning, we call @export(fn_struct.fn_name, .{ .name = @tagName(runtime_cpu) ++ "_fn_name" });.

Outside of the function, we switch on the runtime CPU value and use inline else so that each branch knows at compile time which runtime CPU to expect.

To recap: we use a comptime parameter to determine the function name, and we check it against the compile-time target to know whether the current compilation unit is the one that will contain that function. If it isn’t, we return an extern function instead, which will be resolved at link time.

Solution

Here is the final solution I came up with (I am using zig version 0.13.0):

main.zig
pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    var prng = std.Random.DefaultPrng.init(68326650);

    const stdout_file = std.io.getStdOut();
    var buf_writer = std.io.bufferedWriter(stdout_file.writer());
    const stdout = buf_writer.writer();

    try dispatch.dumpStatus(stdout);
    const cpu_type = try dispatch.getRuntimeCPU();
    const vec_fns = VectorFunctions.init(cpu_type);

    const length = prng.random().uintAtMost(usize, (1024) / @sizeOf(f32));
    try stdout.print("Elements[{d}] =", .{length});
    const numbers = try arena.allocator().alloc(f32, length);
    for (numbers) |*num| {
        num.* = prng.random().float(f32) * 100;
        try stdout.print(" {d}", .{num.*});
    }
    try stdout.print("\n", .{});

    const sum = vec_fns.sum(numbers);
    try stdout.print("The sum is {d}\n", .{sum});

    try buf_writer.flush();
}

const std = @import("std");
const builtin = @import("builtin");
const VectorFunctions = @import("vector.zig").VectorFunctions;
const dispatch = @import("runtime-dispatch.zig");
runtime-dispatch.zig
pub const CPUToken = enum {
    /// x86_64 CPU: sse, sse2
    x86_64,
    /// sse3-sse4.2
    x86_64_v2,
    /// avx + avx2
    x86_64_v3,
    /// avx512
    x86_64_v4,
};

pub fn dumpStatus(writer: anytype) !void {
    try writer.print(
        \\Build CPU:    {[build_cpu]}
        \\Run CPU:      {[run_cpu]}
        \\
    , .{
        .build_cpu = getBuildCPU(),
        .run_cpu = try getRuntimeCPU(),
    });
}

pub fn getRuntimeCPU() !CPUToken {
    var query = std.Target.Query.fromTarget(builtin.target);
    query.cpu_model = .native;

    const native_target = try std.zig.system.resolveTargetQuery(query);

    return getCPUToken(native_target);
}

pub fn getBuildCPU() CPUToken {
    return getCPUToken(builtin.target);
}

pub fn getCPUToken(target: std.Target) CPUToken {
    switch (builtin.target.cpu.arch) {
        .x86_64 => {
            const x86 = std.Target.x86;

            const x86_64_v2_feature_set = [_]x86.Feature{
                .sse3,
                .ssse3,
                .sse4_1,
                .sse4_2,
            };
            const x86_64_v3_feature_set = [_]x86.Feature{
                .avx,
                .avx2,
            };
            const x86_64_v4_feature_set = [_]x86.Feature{
                .avx512f,
            };

            if (!x86.featureSetHasAll(target.cpu.features, x86_64_v2_feature_set)) return .x86_64;
            if (!x86.featureSetHasAll(target.cpu.features, x86_64_v3_feature_set)) return .x86_64_v2;
            if (!x86.featureSetHasAll(target.cpu.features, x86_64_v4_feature_set)) return .x86_64_v3;
            return .x86_64_v4;
        },
        else => @compileError("Unsupported architecture"),
    }
}

const std = @import("std");
const builtin = @import("builtin");
const VectorFunctions = @import("vector.zig").VectorFunctions;
vector.zig
comptime {
    _ = get_vec_fns(dispatch.getBuildCPU());
}

pub const VectorFunctions = struct {
    selected_cpu: std.Target.Cpu.Feature.Set,
    sum_fn: *const fn ([*]const f32, usize) callconv(.C) f32,

    pub fn init(token: CPUToken) VectorFunctions {
        switch (token) {
            inline else => |tok| {
                const fns = get_vec_fns(tok);
                return .{
                    .selected_cpu = fns.feature_set,
                    .sum_fn = fns.sum,
                };
            },
        }
    }

    pub fn sum(this: *const @This(), elements: []const f32) f32 {
        return this.sum_fn(elements.ptr, elements.len);
    }
};

fn get_vec_fns(comptime run_cpu: CPUToken) type {
    const build_cpu = dispatch.getBuildCPU();
    if (run_cpu != build_cpu) {
        return struct {
            const feature_set = std.Target.Cpu.Feature.Set.empty;
            const sum = @extern(*const fn ([*]const f32, usize) callconv(.C) f32, .{ .name = @tagName(run_cpu) ++ "_sum" });
        };
    }

    const vec_f32_size = std.simd.suggestVectorLength(f32) orelse @compileError("No suggested vector length");

    // @compileLog("compiling for " ++ @tagName(run_cpu) ++ ". Vectors will be " ++ @typeName(@Vector(vec_f32_size, f32)));

    const fns = struct {
        const feature_set = builtin.target.cpu.features;
        fn sum(elements: [*]const f32, len: usize) callconv(.C) f32 {
            // @setFloatMode(.optimized);
            var num_left: usize = len;

            var s_accumulator = if (len > vec_f32_size) v_sum: {
                // Start from zero so the loop adds every chunk exactly once
                // (initializing with the first chunk would double-count it).
                var accumulator: @Vector(vec_f32_size, f32) = @splat(0);
                while (num_left > vec_f32_size) : (num_left -= vec_f32_size) {
                    accumulator += elements[len - num_left ..][0..vec_f32_size].*;
                }
                break :v_sum @reduce(.Add, accumulator);
            } else 0;

            while (num_left > 0) : (num_left -= 1) {
                s_accumulator += elements[len - num_left];
            }

            return s_accumulator;
        }
    };

    @export(fns.sum, .{ .name = @tagName(build_cpu) ++ "_sum" });

    return fns;
}

const std = @import("std");
const builtin = @import("builtin");
const dispatch = @import("runtime-dispatch.zig");
const CPUToken = dispatch.CPUToken;

I didn’t get around to creating a build script, but here are the (approximate) commands I ran to build the executable:

$ zig build-lib -target x86_64-linux-none -mcpu=x86_64_v2 --name v2 vector.zig
$ zig build-lib -target x86_64-linux-none -mcpu=x86_64_v3 --name v3 vector.zig
$ zig build-lib -target x86_64-linux-none -mcpu=x86_64_v4 --name v4 vector.zig
$ zig build-exe -target x86_64-linux-none -mcpu=x86_64 main.zig libv2.a libv3.a libv4.a
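
Since I never got to the build script, here is an untested sketch of what a build.zig for those commands might look like with zig 0.13’s std.Build API (the executable name simd-dispatch is my invention; the v2/v3/v4 library names match the commands above, and the exact API may differ slightly between zig versions):

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const optimize = b.standardOptimizeOption(.{});

    // Helper: resolve an x86_64-linux-none target pinned to a specific CPU model.
    const Helper = struct {
        fn target(bld: *std.Build, model: *const std.Target.Cpu.Model) std.Build.ResolvedTarget {
            return bld.resolveTargetQuery(.{
                .cpu_arch = .x86_64,
                .os_tag = .linux,
                .abi = .none,
                .cpu_model = .{ .explicit = model },
            });
        }
    };

    // The executable is built for the baseline x86_64 model.
    const exe = b.addExecutable(.{
        .name = "simd-dispatch",
        .root_source_file = b.path("main.zig"),
        .target = Helper.target(b, &std.Target.x86.cpu.x86_64),
        .optimize = optimize,
    });

    // One static library per CPU level, each compiling vector.zig with
    // that level's features enabled, then linked into the executable.
    const models = [_]*const std.Target.Cpu.Model{
        &std.Target.x86.cpu.x86_64_v2,
        &std.Target.x86.cpu.x86_64_v3,
        &std.Target.x86.cpu.x86_64_v4,
    };
    for (models, 0..) |model, i| {
        const lib = b.addStaticLibrary(.{
            .name = b.fmt("v{d}", .{i + 2}),
            .root_source_file = b.path("vector.zig"),
            .target = Helper.target(b, model),
            .optimize = optimize,
        });
        exe.linkLibrary(lib);
    }

    b.installArtifact(exe);
}
```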

If you’ve copied the files and compiled them, you should see output like this:

Build CPU:    runtime-dispatch.CPUToken.x86_64
Run CPU:      runtime-dispatch.CPUToken.x86_64_v4
Elements[101] = 52.292168 66.588684 30.61861 92.43083 15.7091465 4.285571 83.334045 49.30658 40.096645 72.2112 78.322296 30.43186 22.754877 99.54004 48.60313 83.482895 77.57028 84.17113 40.13544 29.82823 28.115356 13.428084 25.282606 4.5296035 45.37845 14.405889 64.48759 81.71015 20.85259 13.1780405 8.807455 12.590628 42.777763 49.24631 53.405243 25.987661 5.243837 47.531986 23.708027 66.29517 84.95026 17.422407 61.068886 33.04193 34.481113 2.455198 82.168236 28.434917 71.032616 12.235698 43.233074 43.377136 11.8576 92.39796 21.493795 75.504036 82.81199 12.264427 57.234383 37.26576 16.692627 2.8394244 40.08117 77.581276 5.3921075 5.1101165 22.17554 79.60823 91.6607 27.644339 40.32245 76.61647 92.85088 77.89375 17.027369 87.65586 48.65207 17.474144 62.31637 60.598415 55.00127 94.01178 60.353775 61.988564 0.22854076 40.048454 71.96352 52.381706 84.31811 61.956234 46.91669 7.517119 49.367817 23.86009 64.47792 51.68957 8.637657 42.452866 37.314655 52.866737 58.978127
The sum is 5024.497

Well, except for the value of Run CPU: of course :smile:

Afterword

Feel free to ask questions if you’ve got them! Alternatively, if you think there is a better way to do something, let me know! This is just a prototype, and I’d love to read what other people think.


Great writeup!

But I don’t understand why you need to go through the step of creating a separate library for each CPU model.

Could you not compile for the latest model (the default?) and just use the lower model version features according to the detected hardware support?

I didn’t dive into the details of this yet and just compiled with defaults expecting that the resulting binary will support the latest features and fall back as appropriate (in the zig libraries) and allow me to use any (old or new) features of the CPU (via asm, because I would expect the features supported by zig already dispatching to the best implementation available).

I just realized that I assumed code compiled for newer architectures would still run on older hardware. Is this not the case?

This would shatter my mental model of compiling “for intel”.

(Btw, sorry for being lazy and not reading up on that myself, but I’m just curious and you are into that already ;-))

Newer architectures support new instructions that were not present on old architectures.

If you create a program that uses newer instructions than the CPU supports, the CPU will not know how to execute them. One solution is runtime dispatch, which is what this write-up shows you how to do. Zig does not automatically do runtime dispatch for all “intel” targets.

Oftentimes, when you download a binary, it is compiled for some lowest common denominator.


There is an accepted proposal to solve this in the language: Proposal: Function multi-versioning · Issue #1018 · ziglang/zig · GitHub
