Data structures and strategies for runtime dispatch to C functions

I’ve been working on a project that requires me to collect C functions into common groups so they can be dispatched to. At this point, I’ve tried quite a few things and eventually had to move away from comptime methods because they brought in a whole list of detrimental restrictions.

First, assume that we must go through a C backend - we cannot rewrite all of that in Zig.

Here’s a basic example of what I’m doing right now:

// two functions that overload each other:
void foo_f32(void const* src, void* dst, size_t len);
void foo_f64(void const* src, void* dst, size_t len);

// the foo's cast to the correct pointer types internally.
// Gather them in a Zig array (pseudo-code)
const foos = make_array(.{ foo_f32, foo_f64 });

// using foos:

const x: []f32 = ...
const y: []f32 = ...
const id: usize = ... // get the id for the f32 overload

// id indexes to the correct function that we call
foos[id](x.ptr, y.ptr, x.len);
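
For concreteness, a minimal sketch of what make_array might expand to; the FooFn alias and the extern signatures are my guesses at the real ones:

const FooFn = *const fn (?*const anyopaque, ?*anyopaque, usize) callconv(.C) void;

extern fn foo_f32(src: ?*const anyopaque, dst: ?*anyopaque, len: usize) void;
extern fn foo_f64(src: ?*const anyopaque, dst: ?*anyopaque, len: usize) void;

// id indexes into this table; positions must match the type ids
const foos = [_]FooFn{ &foo_f32, &foo_f64 };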

Currently, this works, but I’ve quickly outgrown it, as I suspected would happen. There are a few limitations to this method that are somewhat obvious:

  1. Some versions of foo actually need extra parameters and this method will not support that because you’ll get a different signature in the array.

  2. Type indices need to be tightly coupled to array positions, which is only enforced by the build system (these foo arrays are generated, not hand-written).


One alternative that I don’t particularly enjoy is that foos could be an array of *const anyopaque that gets cast to the correct types before being called. I can generate signatures, so I can provide correct signatures to cast to. I foresee that being annoying, though.
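
Roughly, that version might look like this (reusing the hypothetical FooFn alias from above):

const foos_erased = [_]*const anyopaque{ @ptrCast(&foo_f32), @ptrCast(&foo_f64) };

fn call_foo(id: usize, src: ?*const anyopaque, dst: ?*anyopaque, len: usize) void {
    // recover the real signature before calling
    const f: FooFn = @ptrCast(@alignCast(foos_erased[id]));
    f(src, dst, len);
}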

Another option is that I need to separate those arrays out and under one common operation we can do something like:

switch (class) {
   .foo => foos[foo_id](x.ptr, y.ptr, x.len),
   .bar => bars[bar_id](x.ptr, y.ptr, x.len, ...),
}

I have not played with hash-map solutions but I am open to any options people would suggest here because I don’t have a lot of dependencies at the moment. After I get this figured out, it will become a big part of the project though so I want to make sure it’s a good mechanism.
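
For instance, rather than a full hash map, the index coupling could be removed by keying on an enum; a sketch with std.EnumArray (the Dtype enum and all names here are illustrative):

const std = @import("std");

const Dtype = enum { f32, f64 };

// keys are enum tags, so there is no raw-position coupling to enforce
const foos = std.EnumArray(Dtype, FooFn).init(.{
    .f32 = &foo_f32,
    .f64 = &foo_f64,
});

// usage: foos.get(.f32)(x.ptr, y.ptr, x.len);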

Finally, another option is to just abandon this idea entirely and bite the bullet. If this is really the most robust approach, then I’ll go with it:

switch (class) {
    .f32_id => foo_f32(...),
    .f64_id => foo_f64(...),
    .bar_id => foo_bar(...),
}

Alright - as always, thanks :slight_smile:


I’m leaning towards just dropping this idea and going with the hard-coded approach. Keep in mind… there’s going to be quite a lot of these going forward.

Without seeing examples for the optional extra parameters, I find it difficult to imagine how those would be handled.

I guess you could have some kind of data / *anyopaque parameter for the extra parameters so every function would be called with a signature like this:

switch (class) {
   .foo => foos[foo_id](x.ptr, y.ptr, x.len, extra),
   .bar => bars[bar_id](x.ptr, y.ptr, x.len, extra),
}

Where sometimes extra just happens to be garbage/unused, or nothing, or zero-sized. But this would then require you to create functions that adapt to the underlying specific functions.

At that point you might as well change all signatures to just *anyopaque pointing to a method-specific struct of the arguments, which isn’t really fun. (And you probably want more direct function calls.)
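
A sketch of what those adapters could look like, with a made-up bar_f32 that takes one extra parameter:

const Kernel = *const fn (?*const anyopaque, ?*anyopaque, usize, ?*const anyopaque) callconv(.C) void;

extern fn foo_f32(src: ?*const anyopaque, dst: ?*anyopaque, len: usize) void;
extern fn bar_f32(src: ?*const anyopaque, dst: ?*anyopaque, len: usize, scale: f32) void;

const BarExtra = extern struct { scale: f32 };

// adapter: unpack `extra` and forward to the real signature
fn bar_f32_adapt(src: ?*const anyopaque, dst: ?*anyopaque, len: usize, extra: ?*const anyopaque) callconv(.C) void {
    const e: *const BarExtra = @ptrCast(@alignCast(extra.?));
    bar_f32(src, dst, len, e.scale);
}

// kernels without extras just ignore the last argument
fn foo_f32_adapt(src: ?*const anyopaque, dst: ?*anyopaque, len: usize, extra: ?*const anyopaque) callconv(.C) void {
    _ = extra;
    foo_f32(src, dst, len);
}

const kernels = [_]Kernel{ &foo_f32_adapt, &bar_f32_adapt };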

I feel like doing the hardcoded thing for a while, to see whether it could be handled by a small number of patterns, might be the easiest way. Tedious, but at least you would collect specific examples that let you evaluate whether there is a pattern that can be made into something generic that generates the tedious stuff.

Also sounds a bit like you are getting into compiler territory with that project, meaning you might want to generate code via possibly multiple stages / optimization passes.


Maybe you can create a description of all the functions via some data structure of structs and then write a program that generates the glue code from there (at runtime), maybe even just as a normal program that takes an application-specific description and directly generates source code from it. Whether that gets wrapped in a build step later is an orthogonal issue.
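
A minimal sketch of that generator idea (all the names here are illustrative): a table of descriptions drives the printing of extern declarations and the dispatch array.

const std = @import("std");

const FnDesc = struct { group: []const u8, dtype: []const u8 };

const descs = [_]FnDesc{
    .{ .group = "foo", .dtype = "f32" },
    .{ .group = "foo", .dtype = "f64" },
};

pub fn main() !void {
    const out = std.io.getStdOut().writer();
    // emit the extern declarations
    for (descs) |d| {
        try out.print(
            "extern fn {s}_{s}(src: ?*const anyopaque, dst: ?*anyopaque, len: usize) void;\n",
            .{ d.group, d.dtype },
        );
    }
    // emit the dispatch array, in description order
    try out.writeAll("pub const foos = [_]FooFn{");
    for (descs) |d| {
        try out.print(" &{s}_{s},", .{ d.group, d.dtype });
    }
    try out.writeAll(" };\n");
}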


I think another question is: who is in control? Is the C code supposed to be used directly, or can it be inspected/extended/adapted by (half-)automatic means? Or do you want to describe the C signatures sufficiently from the Zig side and then group them up with generated glue code?

This was another choice I was planning on looking into. I agree - it’s no fun. The casting logic gets totally pushed to the backend and that inevitably requires a layer of its own boilerplate. I don’t think I want to give that much control to the backend.

This is what I’m starting to feel is correct. Here’s what I think happened…

When we worked on that OverloadSet, it was dispatching perfectly because the type system was handling all this stuff with the algorithm you and I worked on. The issue is that it required a lot of comptime work, and that makes Python bindings basically a nightmare (or impossible). Moving this to runtime dispatch was the only good choice, but I still have a lot of code generation that is legacy at this point and still in use.

Essentially, I need to take one step further back and strip out that layer of dispatch code because it’s probably inappropriate now and I need to think about the problem in a different way. Thanks for the confirmation :+1:


Passing too many arguments to a C function doesn’t cause any problem, since the caller is responsible for stack clean-up. What you can do is cast all function pointers to a function type with the maximum number of arguments and always call them with that many arguments. Callees with fewer arguments would end up receiving random values that happen to be on the stack, which will be ignored anyway, so that doesn’t matter.
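
A hedged sketch of the trick (this leans on a caller-cleans-up C ABI such as SysV cdecl, and is formally undefined behavior at the language level, so verify it per target):

fn add3(a: usize, b: usize, c: usize) callconv(.C) usize {
    return a + b + c;
}

fn add4(a: usize, b: usize, c: usize, d: usize) callconv(.C) usize {
    return a + b + c + d;
}

// every entry is cast to the maximum-arity type
const MaxFn = *const fn (usize, usize, usize, usize) callconv(.C) usize;

const table = [_]MaxFn{
    @ptrCast(&add3), // really takes 3; the 4th argument is never read
    &add4,
};

// callers always pass the maximum number of arguments:
// table[0](1, 2, 3, undefined);
// table[1](1, 2, 3, 4);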


How would the dispatch look in C for your case?

Could you do the actual dispatch in C in a (C) helper function that lends itself better to be called from Zig?

I think the easiest thing is to just post some code. For a bit of context, I’m reworking a computation graph for reversing operations. You can see where the kernels are called from core.kernels. This used to be all done at comptime:

// enable choice of graph
pub fn forward_impl(
    graph: *Graph,
    x: Tensor,
    y: Tensor,
) Tensor {
    std.debug.assert(x.dtype() == y.dtype());
    std.debug.assert(std.mem.eql(usize, x.sizes(), y.sizes()));

    const z = deduce_output(graph, x);
    const key = core.dkey(z);

    if (x.dtype() == .q8) {
        const x_q = x.qmap();
        const y_q = y.qmap();

        std.debug.assert(x_q.dtype == y_q.dtype);

        core.kernels.subtraction_quantized[key](
            x.data_ptr(), x_q.scale.data_ptr(), x_q.zpoint.data_ptr(),
            y.data_ptr(), y_q.scale.data_ptr(), y_q.zpoint.data_ptr(),
            z.data_ptr(), z.len(), z.stream(),
        );
    } else {
        core.kernels.subtraction[key](
            x.data_ptr(),
            y.data_ptr(),
            z.data_ptr(),
            z.len(),
            z.stream(),
        );
    }

    if (graph.mode == .train) {
        core.attach_op(@This(), z, &.{
            OpDatum{ .tensor = x },
            OpDatum{ .tensor = y },
            OpDatum{ .tensor = z },
        });
    }

    return z;
}

The issue here is that these don’t have consistent calling conditions because we also need to swap what the inputs are based on the context:

pub fn reverse(args: []const OpDatum) void {
    core.enable_gradient(args[0].tensor);
    core.enable_gradient(args[1].tensor);

    if (args[0].tensor.dtype() == .q8 or args[1].tensor.dtype() == .q8)
        @panic("Reversing to quantized values is unimplemented.");

    const key = core.dkey(args[2].tensor);

    core.kernels.addition[key](
        args[2].tensor.grad_ptr(),
        args[0].tensor.grad_ptr(),
        args[0].tensor.grad_ptr(),
        args[0].tensor.len(),
        args[0].tensor.stream(),
    );

    core.kernels.subtraction[key](
        args[2].tensor.grad_ptr(),
        args[1].tensor.grad_ptr(),
        args[1].tensor.grad_ptr(),
        args[1].tensor.len(),
        args[1].tensor.stream(),
    );
}

So the issue with dispatching to C to do the thinking is that we lose a lot of context by the time we’re at that level.

@chung-leong - that’s very interesting (and probably a tad evil), but I hadn’t thought of it that way.

Could you elaborate? Is there something wrong with OverloadSet?

I don’t know if it would help, but have you considered @call? Since the args are anytype it could handle different numbers of args, but I think they would have to be selected at comptime.
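
Something along these lines (just a sketch; the tuples can differ in arity, but function has to be resolved at comptime):

fn dispatch(comptime function: anytype, args: anytype) void {
    @call(.auto, function, args);
}

// dispatch(foo_f32, .{ x.ptr, y.ptr, x.len });
// dispatch(bar_f32, .{ x.ptr, y.ptr, x.len, extra });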

I’m doing really evil stuff because I’m working on a dispatch mechanism for variadic functions. In your case I think you just normalize the interface by creating a thunk function for each of the C functions. Instead of different argument sets, these thunks will all accept an opaque pointer:

const std = @import("std");

// wraps `function` behind a uniform call(*anyopaque) interface by packing
// its arguments and return value into a generated struct
fn Thunk(comptime function: anytype) type {
    const FT = @TypeOf(function);
    const f = @typeInfo(FT).Fn;

    return struct {
        pub const Struct = define: {
            var struct_fields: [1 + f.params.len]std.builtin.Type.StructField = undefined;
            struct_fields[0] = .{
                .name = "retval",
                .type = f.return_type.?,
                .is_comptime = false,
                .alignment = @alignOf(f.return_type.?),
                .default_value = null,
            };
            for (f.params, 0..) |param, index| {
                struct_fields[1 + index] = .{
                    .name = std.fmt.comptimePrint("{d}", .{index}),
                    .type = param.type.?,
                    .is_comptime = false,
                    .alignment = @alignOf(param.type.?),
                    .default_value = null,
                };
            }
            break :define @Type(.{
                .Struct = .{
                    .layout = .auto,
                    .decls = &.{},
                    .fields = &struct_fields,
                    .is_tuple = false,
                },
            });
        };
        pub const name = @typeName(@This());
        pub const arg_count = f.params.len;

        // cast the opaque pointer back to the generated argument struct,
        // rebuild the argument tuple, and store the return value
        pub fn call(opaque_ptr: *anyopaque) void {
            const arg_ptr: *Struct = @ptrCast(@alignCast(opaque_ptr));
            const Tuple = std.meta.ArgsTuple(FT);
            var arg_tuple: Tuple = undefined;
            inline for (0..arg_count) |index| {
                const key = std.fmt.comptimePrint("{d}", .{index});
                @field(arg_tuple, key) = @field(arg_ptr.*, key);
            }
            arg_ptr.retval = @call(.auto, function, arg_tuple);
        }

        pub fn init(tuple: anytype) Struct {
            var self: Struct = undefined;
            inline for (0..arg_count) |index| {
                const key = std.fmt.comptimePrint("{d}", .{index});
                @field(self, key) = @field(tuple, key);
            }
            return self;
        }
    };
}

fn hello(_: u32, _: u32, _: u32) u32 {
    return 1234;
}

fn world(_: f64, _: f64, _: f64, _: f64) f64 {
    return 3.14;
}

pub fn main() void {
    const HelloThunk = Thunk(hello);
    const WorldThunk = Thunk(world);

    std.debug.print("{s}\n", .{HelloThunk.name});
    std.debug.print("{s}\n", .{WorldThunk.name});
    var hello_arg = HelloThunk.init(.{ 1, 2, 3 });
    HelloThunk.call(@ptrCast(&hello_arg));
    std.debug.print("retval = {d}\n", .{hello_arg.retval});
}

This is the mechanism that Zigar uses to marshal calls from the JavaScript world.


Thanks for all the great replies - I appreciate all the considerations.

There’s nothing wrong with the OverloadSet - it works great and is an awesome utility. @Sze and I worked on another version that handles [*] -> [*c] conversions and also has a max-matching algorithm to figure out the best dispatch in case a later candidate matches better.

The issue is how I was using them - they were the wrong tool for my job. They need type information, which requires capturing the argument types at the call site. That leads to a lot of closures. The issue is that I need to iterate over those arguments and form future closures in tricky ways, because the computation graph can do Nth derivatives by reforming new graphs over existing ones. That got to be very cumbersome because it all happens at runtime. Bussing around type information was better handled through RTTI, and the whole interface was straightened out when I moved to polymorphic dispatching instead of closures.

@dude_the_builder I thought about it, but if you see the post I wrote to @Tosti, you’ll see that we’re in kind of a weird position. Once the call is set up, it’s easy to do, and that could be an answer (it technically was when I was using OverloadSet).

@chung-leong Since I need to capture the arguments and iterate over them to form new function units (again, see my reply to @Tosti above), can you tell me more about what this affords in that situation? I’m not sure if this moves any pieces on the board for me, but it’s an interesting mechanism. I’m trying to avoid building closures that require a lot of comptime-probing to get the information back out of them.


As an update, I’m considering a hybrid approach between what @Sze said and where I started, which may get me over the next set of hurdles. Like functions are grouped together, and that still helps cut down on the switch statements (there were going to be a lot of them).

@chung-leong I think I see your point. One benefit I can foresee here is that it makes the array types consistent and then I can keep them together in the same array. In that case, this is definitely an interesting option.

Who’d have… thunk it?



In my case, I’m starting to see value in keeping those branches separate but I may try Thunk’ing at some point.


Have you considered using a tagged union? Then the dispatch would look something like this:

switch (foos[id]) {
   .foo => |foo| foo(x.ptr, y.ptr, x.len),
   .bar => |bar| bar(x.ptr, y.ptr, x.len, ...),
}
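
where the union itself might be declared along these lines (signatures guessed from your earlier examples):

const Dispatch = union(enum) {
    foo: *const fn (?*const anyopaque, ?*anyopaque, usize) callconv(.C) void,
    bar: *const fn (?*const anyopaque, ?*anyopaque, usize, ?*const anyopaque) callconv(.C) void,
};

// const foos = [_]Dispatch{ .{ .foo = &foo_f32 }, .{ .bar = &bar_f32 } };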

For my case, that’s actually backwards: I have to switch over the x’s and y’s to find which foo to get. However, in the more general case (especially if we’re talking about loading up the array with Thunks), that would be a very viable solution.