Advice on linking CUDA to Zig

I haven't done the Zig+CUDA combo specifically, but for interfacing C++ libs from C (which the CUDA/Zig case should reduce to), the approach is:

Write an interface header that's used by your .cu source and @cImport'd into Zig:

// in lib.h

#if defined(__cplusplus)
    #define EXTERN_C extern "C"
#else
    #define EXTERN_C extern
#endif

EXTERN_C void vectorAddition(const int*, const int*, int*);

// in .cu file
#include "lib.h"

// this will use C linkage because you included lib.h above
EXTERN_C void vectorAddition(const int*, const int*, int*) {
    // implementation
}

// in .zig file

const c = @cImport(@cInclude("lib.h"));
c.vectorAddition(...)

In addition, you need to arrange for your .cu file to be compiled in your build.zig.

As an example of a C++ library that exposes a C API for its classes and can be linked into C apps, SoLoud might be a good one. The "darknet" library I mentioned earlier in the thread also has examples of a variant of this pattern, e.g. https://github.com/pjreddie/darknet/blob/master/src/avgpool_layer_kernels.cu:

extern "C" {
#include "avgpool_layer.h"
#include "cuda.h"
}
2 Likes

Ah, great stuff - thanks for the info. I’ll get back to hacking away at this tonight :beers:

1 Like

Alright, I’ve got it working from Zig - we’re calling kernels, folks… let the good times roll.

Here's the basic idea of how to compile CUDA kernels for Zig using NVCC and the build system. Note: I'm running on Linux. I've decided to go down the shared object route (I'm sure there are other ways, but most CUDA libraries are .so anyhow). So…

First, you have to build your library file - a couple of things to note:

  • I'm using @nurpax's suggestion of an EXTERN_C macro that expands to extern "C" when compiling as C++.
  • In the source file, keep everything marked as extern "C"; only use the macro on the header declarations.
  • Mark kernels as extern "C": extern "C" __global__ void foo...

Then, compiling it - here’s the command line argument for compiling that worked for me:

nvcc --shared -o libkernels.so kernels.cu --gpu-architecture=sm_89 --compiler-options '-fPIC' -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart

Note that we put the prefix “lib” on kernels - this helps the Zig build search for the library file. You can put this as a build step as well… but that’s trivial.
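For completeness, here's one way that build step might look - a rough, untested sketch reusing the flags above; the source and output paths are placeholders you'd adjust for your layout:

    // run nvcc as part of `zig build` so libkernels.so stays up to date
    const nvcc = b.addSystemCommand(&[_][]const u8{
        "nvcc",
        "--shared",
        "-o", "libkernels.so",
        "kernels.cu",
        "--gpu-architecture=sm_89",
        "--compiler-options", "-fPIC",
        "-I/usr/local/cuda/include",
        "-L/usr/local/cuda/lib",
        "-lcudart",
    });

    // make the executable step (created further down) wait on the nvcc step
    exe.step.dependOn(&nvcc.step);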

Then, import your header using @cImport and use your functions like so:

const cu = @cImport({
    @cInclude("/path/to/your/header/file.h");
});

// sometime later...

const dev_ptr = cu.deviceMalloc(i32, 42);

Now, for the build: you need to link your shared object, libc, and the CUDA libraries you'll be using. Here's a rough way to do that.

    const exe = b.addExecutable(.{
        .name = "Cuda",
        .root_source_file = .{ .path = "src/main.zig" },
        .target = target,
        .optimize = optimize,
    });

    // for our newly created file
    exe.addLibraryPath(.{ .path = "/path/to/shared_object" });

    // for cudart and nvrt libraries
    exe.addLibraryPath(.{ .path = "/usr/local/cuda/lib64" });

    exe.linkSystemLibrary("cuda");
    exe.linkSystemLibrary("cudart");
    exe.linkSystemLibrary("nvrt");
    exe.linkSystemLibrary("kernels");  // this is the one we made
    exe.linkLibC();

Again, to link against the libkernels.so we compiled above, you only need the string "kernels".

And that’s it! It builds for me - now back to the main show.

4 Likes

The idiomatic way of exposing C++ headers to C is a bit simpler and can work for CUDA as well.

// libfoo.h
#if defined(__cplusplus)
extern "C" {
#endif

... the body of libfoo.h

#if defined(__cplusplus)
}
#endif

This idiom wraps the entire header body in an extern "C" {...} block if (and only if) it's processed by a C++ compiler. Naked extern declarations are allowed inside extern "C" {...} blocks. If this trick works for you, you do not need the EXTERN_C macro at all.

2 Likes

Why link your CUDA code as a .so, though? Isn't static linking almost always the path of least problems?

1 Like

Sure - there are options, and I can probably get better performance down the road by trying other alternatives (such as static linking). At this stage I just wanted to get something working so I'd have a baseline to go from, so I flipped a coin. I'm sure a myriad of better ways will turn up in the end. I'm currently working on something that may remove an entire step in this process… but more on that to come later. All in good time :slight_smile:

For now, I want to turn my attention back to this issue of function definitions - once I have a proof of concept that compiles, I'll go back over all these assumptions and will probably go down the road you're suggesting. I'm just thrilled to be calling CUDA kernels from Zig source code without needing to write any additional shell scripts or CMake files.

2 Likes

Hi, after reading a bit of this thread I found it very fascinating, as it is closely related to the project I am working on - similar to what you're doing, but different in some respects.

I was wondering if there is any update on this project. Also, has the CUDA driver API been considered here, since it is language independent?

PS: I have been building my project in C++ because of the multiple ExecutionProviders I need to support (CUDA / TensorRT / HIP / OpenVINO / raw CPU); currently only the CPU EP has been partially built, so I wouldn't mind porting it to Zig, since most constexpr-time things work in Zig as well with the comptime support on the master branch. But it seems like building the different EPs would be a huge pain in raw Zig.

Hey Sid, welcome to the forum :slight_smile:

I just finished another huge chunk of this last night. I'll be making a post about it soon. I decided to drop the CPU versions of the functions and go straight for a CUDA-only backend. The file generator is really easy to use and hasn't been a problem at all (my version of it right now is a bit hacky but it gets the job done).

If you want to chat about it, I could post up what I have, but it's a work in progress. There's a lot that needs to be improved - especially around the build system - it just checks whether your files are out of sync and then recompiles most everything.

I’m also still linking it in as a shared object. I’ll probably do that differently at some point as well but since it works I want to get my foundation fixed up.

Otherwise, back propagation works and it looks quite a bit like pytorch :slight_smile:

2 Likes

@Sid, in terms of the CUDA driver API, I had a horrid time linking it up. I was consistently running into dead ends with "unknown declaration" type errors. For instance, the driver API distinguishes between void* and CUdeviceptr, where CUdeviceptr is actually just unsigned long long - the cuMemAlloc functions all worked fine, but the mem-copy functions couldn't find the CUdeviceptr declarations in the Zig C-translate via @cImport.
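For reference, the driver-API declarations in question look roughly like this (paraphrased from cuda.h - check your local headers):

// on 64-bit platforms a device pointer is an integer, not a pointer type
typedef unsigned long long CUdeviceptr;

CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize);
CUresult cuMemcpyHtoD(CUdeviceptr dstDevice, const void *srcHost, size_t ByteCount);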

I lost interest in trying to debug that at around 3:00am… it’s not that it can’t be done, but it’s kind of ugly right now.

I decided to go a different route (maybe I'll regret this later) and I'm not directly using @cImport on the CUDA headers… those headers are a mess. The number of constants they drag in was causing Helix to choke, and even cuda_runtime.h complains that it can't find its own utility files (autocomplete couldn't even find cudaDeviceSynchronize on Helix's latest stable build). You need your symlinks set up correctly or it drives the LSP nuts, and even then it still has difficulties.

Anyhow, the approach I decided to take is to make minimal headers that only use fundamental C types, a few user-defined structs, etc… then the implementation files are where the CUDA headers get brought in, to keep their code out of what gets exposed through the @cImport. So far, it's working well!
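For illustration, a minimal pair might look like this (the names here are made up for the example, not the actual functions from my project):

// device_utils.h - only fundamental C types, no CUDA headers
#if defined(__cplusplus)
    #define EXTERN_C extern "C"
#else
    #define EXTERN_C extern
#endif

EXTERN_C void* mpDeviceAlloc(unsigned long n_bytes);
EXTERN_C void  mpDeviceSynchronize(void);

// device_utils.cu - the CUDA headers only ever appear here
#include "device_utils.h"
#include <cuda_runtime.h>

EXTERN_C void* mpDeviceAlloc(unsigned long n_bytes) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, n_bytes);
    return ptr;
}

EXTERN_C void mpDeviceSynchronize(void) {
    cudaDeviceSynchronize();
}

The Zig side then only @cImport's device_utils.h and never sees the CUDA headers.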

A big goal of my project is to limit the exposure to NVIDIA’s implementation details and only expose a sane-subset. It’s proving to be a bit of boilerplate but I’m under the impression that it’s worth it.

2 Likes

Not using cImport on these complex headers sounds like a good pragmatic choice. I've had problems with cImport on much simpler headers, but exposing the API through your own .h tends to work.

Okay, update time!

I’m sticking with the file generation approach because so far, it’s working out great.

There is a three-step process for generating files to get convenient usage/linkage from Zig. Here's an overview of what I'm working with:


Step 1 - generate overloads from CUDA source

My source files are marked with stand-in replaceable types, which lets me write plain CUDA with full LSP support and then generate the overloads when compiling the library. Here's how that step happens…

In the file_gen.zig file, there’s a list of replacement types and their size precedence that’s used to create valid type combinations - the declaration looks like this:

// level relates to the validity of a cast
// higher levels cannot result in lower levels
const Replacer = struct {
    symbol: []const u8,
    level: usize,
};

pub const ReplacerSet = struct {
    indicator: []const u8,
    replacers: []const Replacer,
};

const replacer_sets = [_]ReplacerSet {
    ReplacerSet { // real number replacers 
        .indicator = "RScalar",
        .replacers = &.{
            Replacer{ .symbol = "r16", .level = MIN_LEVEL + 0 },
            Replacer{ .symbol = "r32", .level = MIN_LEVEL + 1 },
            Replacer{ .symbol = "r64", .level = MIN_LEVEL + 2 },
        }
    },
    ReplacerSet { // complex number replacers 
        .indicator = "CScalar",
        .replacers = &.{
            Replacer{ .symbol = "c16", .level = MIN_LEVEL + 0 },
            Replacer{ .symbol = "c32", .level = MIN_LEVEL + 1 },
            Replacer{ .symbol = "c64", .level = MIN_LEVEL + 2 },
        }
    },
    ReplacerSet { // real tensor replacers 
        .indicator = "RTensor",
        .replacers = &.{
            Replacer{ .symbol = "RTensor16", .level = MIN_LEVEL + 0 },
            Replacer{ .symbol = "RTensor32", .level = MIN_LEVEL + 1 },
            Replacer{ .symbol = "RTensor64", .level = MIN_LEVEL + 2 },
        }
    },
    ReplacerSet { // complex tensor replacers 
        .indicator = "CTensor",
        .replacers = &.{
            Replacer{ .symbol = "CTensor16", .level = MIN_LEVEL + 0 },
            Replacer{ .symbol = "CTensor32", .level = MIN_LEVEL + 1 },
            Replacer{ .symbol = "CTensor64", .level = MIN_LEVEL + 2 },
        }
    },
};

Here's an example of a CUDA kernel:

__global__ void __kernel_addition_RScalar(
  const RScalar *dev_a,
  const RScalar *dev_b,
  RScalar *dev_c,
  len_t N
) {
  const len_t tid = (blockIdx.x * blockDim.x) + threadIdx.x;
     
  if (tid < N)
    dev_c[tid] = dev_a[tid] + dev_b[tid];
}

extern "C" void launch_addition_RScalar(
  const RScalar* a,
  const RScalar* b, 
  RScalar* c, 
  len_t N
) {
  __kernel_addition_RScalar<<<GRID_1D(N), 32>>>(a, b, c, N);
}

In a header file, I have RScalar, CScalar, RTensor, and CTensor defined to their 32-bit types (aka RScalar is float). This enables me to write CUDA code with full assistance from the LSPs; the stand-ins are then replaced by the file generator with their final types.
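As a sketch (the real header in the project may differ), the editing-time definitions are as simple as:

// editing-time stand-ins: everything resolves to the 32-bit variants
typedef float RScalar;   // so the LSP sees ordinary CUDA code
// CScalar, RTensor, and CTensor get analogous 32-bit definitions

The file generator then rewrites RScalar into r16/r32/r64 (and likewise for the other indicators) when it emits the overloads.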


Step 2 - generate C-style declarations

This step is short, but important - we gather each of the extern "C" declarations and push them into a header during the generation process. The file looks like this:

/* GENERATED FILE */

#include "../tensor_types.h"

#if defined(__cplusplus)
    #define EXTERN_C extern "C"
#else
    #define EXTERN_C extern
#endif

EXTERN_C void launch_hadamard_reverse_r16(
  r16 *grads_a,
  const r16 *value_b,
  const r16 *grads_c,
  len_t N
);
EXTERN_C void launch_hadamard_reverse_c16(
  c16 *grads_a,
  const c16 *value_b,
  const c16 *grads_c,
  len_t N
);
EXTERN_C void launch_hadamard_reverse_r32(
  r32 *grads_a,
  const r32 *value_b,
  const r32 *grads_c,
  len_t N
);
EXTERN_C void launch_hadamard_reverse_c32(
  c32 *grads_a,
  const c32 *value_b,
  const c32 *grads_c,
  len_t N
);
EXTERN_C void launch_hadamard_reverse_r64(
  r64 *grads_a,
  const r64 *value_b,
  const r64 *grads_c,
  len_t N
);
EXTERN_C void launch_hadamard_reverse_c64(
  c64 *grads_a,
  const c64 *value_b,
  const c64 *grads_c,
  len_t N
);

Step 3 - generate overload sets

Here’s where this all comes together - a while ago I started a thread about OverloadSets: Making Overloaded Function Sets Using Comptime

I continued to workshop that idea, and with @Sze's input we were able to build an OverloadSet that does best matching on const pointer parameters. I can now automatically generate function overloads from the C/CUDA back end as part of the kernel generation process. The include path is also generated automatically:

const OverloadSet = @import("overloadset.zig").OverloadSet;

const decls = @cImport(
    @cInclude("/home/andrew/ZigCode/Metaphor/src/nvcc_target/kernel_decls.h"),
);

pub const kernel_hadamard_reverse = OverloadSet(.{
	decls.launch_hadamard_reverse_r16,
	decls.launch_hadamard_reverse_c16,
	decls.launch_hadamard_reverse_r32,
	decls.launch_hadamard_reverse_c32,
	decls.launch_hadamard_reverse_r64,
	decls.launch_hadamard_reverse_c64,
});

pub const kernel_subtraction = OverloadSet(.{
	decls.launch_subtraction_r16,
	decls.launch_subtraction_c16,
	decls.launch_subtraction_r32,
	decls.launch_subtraction_c32,
	decls.launch_subtraction_r64,
	decls.launch_subtraction_c64,
});

pub const kernel_fill = OverloadSet(.{
	decls.launch_fill_r16,
	decls.launch_fill_c16,
	decls.launch_fill_r32,
	decls.launch_fill_c32,
	decls.launch_fill_r64,
	decls.launch_fill_c64,
});

pub const kernel_permutate = OverloadSet(.{
	decls.launch_perumutate_r16,
	decls.launch_permutate_c16,
	decls.launch_perumutate_r32,
	decls.launch_permutate_c32,
	decls.launch_perumutate_r64,
	decls.launch_permutate_c64,
});

pub const kernel_addition = OverloadSet(.{
	decls.launch_addition_r16,
	decls.launch_addition_c16,
	decls.launch_addition_r32,
	decls.launch_addition_c32,
	decls.launch_addition_r64,
	decls.launch_addition_c64,
});

pub const kernel_addition_reverse = OverloadSet(.{
	decls.launch_addition_reverse_r16,
	decls.launch_addition_reverse_c16,
	decls.launch_addition_reverse_r32,
	decls.launch_addition_reverse_c32,
	decls.launch_addition_reverse_r64,
	decls.launch_addition_reverse_c64,
});

pub const kernel_subtraction_reverse = OverloadSet(.{
	decls.launch_subtraction_reverse_r16,
	decls.launch_subtraction_reverse_c16,
	decls.launch_subtraction_reverse_r32,
	decls.launch_subtraction_reverse_c32,
	decls.launch_subtraction_reverse_r64,
	decls.launch_subtraction_reverse_c64,
});

pub const kernel_hadamard = OverloadSet(.{
	decls.launch_hadamard_r16,
	decls.launch_hadamard_c16,
	decls.launch_hadamard_r32,
	decls.launch_hadamard_c32,
	decls.launch_hadamard_r64,
	decls.launch_hadamard_c64,
});

How it’s used

In my operations file for my torch-style library, I can now just do the following:

pub fn additionForward(x: anytype, y: anytype, z: anytype) void {
    const x_values = x.values();
    const y_values = y.values();
    const z_values = z.values();

    overloads.kernel_addition.call(.{
        x_values.ptr, y_values.ptr, z_values.ptr, z_values.len
    });
}

pub fn additionReverseArg0(X: anytype, _: anytype, Z: anytype) void {
    const x_grads = UT.assertGrads(X);
    const z_grads = UT.assertGrads(Z);

    overloads.kernel_addition_reverse.call(.{
        x_grads.ptr, z_grads.ptr, z_grads.len
    });
}

pub fn additionReverseArg1(_: anytype, Y: anytype, Z: anytype) void {
    const y_grads = UT.assertGrads(Y);
    const z_grads = UT.assertGrads(Z);

    overloads.kernel_addition_reverse.call(.{
        y_grads.ptr, z_grads.ptr, z_grads.len
    });
}

pub const AddImpl = CallbackBuilder(
    additionForward, .{
        .{ additionReverseArg0, 0 },
        .{ additionReverseArg1, 1 }
    }, NoCleanup
);

This creates a callback via CallbackBuilder that the computation graph can use to call the forward and reverse functions for gradient back-prop.

So far it all works - I'm not supporting mixed precision/category operations yet, but the generator can create those overloads. Basically, you just have to write kernels now; the rest gets automatically parsed and built for you, and Zig beautifully picks up all the generated declarations and overloads them. It's really simple, actually.

Anyhow, that’s what I’m going with… I’ll post a version of my library hopefully soon… getting there quickly now!

3 Likes

Hi @AndrewCodeDev, sorry about the late reply - somehow this thread flew under my radar (I don't know how it got past my email). Thanks for taking the time to describe your approach; I would love to see how your project is set up and what kind of backprop API you got working.

Also pleased to see you are using Helix - I have been trying it out as my primary editor for the last few months or so, and it's been pretty great for me even in large projects. I am pretty sure that issue has to do with your build system; I have never faced this kind of problem with my C++ project. You are probably using zig build, and unfortunately this has been sitting on the sidelines for a while now: Implemented the JSON compilation database support by jonathanmarvens · Pull Request #18391 · ziglang/zig · GitHub.

I haven't been able to build much since I last posted on this thread. Currently most of my work has gone into a compile-time AutoDiff implementation (stalled by the fact that I am trying to learn intermediate-to-advanced calculus… I simply don't know enough yet to implement AutoDiff for implicit layers like ODEs and DEQs, I wish I were more intelligent), plus compile-time compute graph generation and the ability to generate function calls at compile time in C++. All of this is obviously still possible with Zig comptime.

I will get to implementing the CUDA EP in about 2-3 weeks (if I don't get distracted again) - that's when the "should I move to Zig?" crisis will get real. Until then I will keep experimenting with Zig to see if I can mix and match C++ and Zig in this project.

Gotcha, if you're having issues with things like N-dimensional tensor derivatives, just send me a message and I'll help you. They're weird at first, but once you get it, the whole thing clicks. The issue with the typical calculus approach is that derivatives of activation functions are usually very inefficient to compute if you aren't aware of some of the tricks that go into them… like the sigmoid derivative is literally just s * (1 - s) if you treat it like a differential equation.
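Concretely, writing s = σ(x) = 1 / (1 + e^(-x)), the standard identity is

$$
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) = s\,(1 - s)
$$

so the backward pass can reuse the already-computed forward value s instead of re-evaluating the exponential.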

I have a few examples up in my library so far that you are more than welcome to look through - there's a lot of compile-time deduction that goes on… I use comptime string maps for that. So far I've only posted one 2D kernel, but have a bunch more going in for rank 1/2 tensors before I move on to 3D kernels: https://github.com/andrewCodeDev/Metaphor/blob/main/src/tensor_ops.zig

Again, if you want any help with this stuff, just send me a message and I’m happy to help.

Thanks, I will let you know if I hit any hurdles while implementing them. Right now I am more focused on graph optimization and the AutoDiff implementations; it's mostly been going smoothly as I learn more calculus and new tricks, and I have been incorporating them as I go.

This whole compute-graph business absolutely screams LLVM and language-backend projects and how they optimize; that's why I have also been creating a toy language at the same time (partially for a uni project and partially because it will help me understand and learn more). Hence more time is needed as I explore.

I will look into the links and your project - it's fascinating, as we are basically building the same thing with different approaches. I will update my project to be public in a few days and link it to you; maybe you will find it interesting.

Oh, I’m sure I will! Sounds like we’re taking a radically different approach because if I’m not mistaken, you’re doing your graph at comptime and I’m dynamically building mine during runtime (so you can change the graph with if statements, for example).

Open a topic when you do!

I will, though I am not sure if I should open it separately, as I currently don't use Zig - it's entirely C++ at the moment.

About my approach, if you would like to know: my graph sub-library (called manifold) supports full compile-time construction and manipulation of graph operations. But this essentially restricts my interface to arrays and structs - no pointers, no hashmaps, no heap-allocated data (I can obviously use them internally, just not in the API). This hinders my ability to do some kinds of API well, but it also means the user can choose to generate the compute graph at compile time or at runtime without any problems.

My main library, Scions, basically consumes the generated graph and runs it using the different EPs; it doesn't really care what the graph does, or whether it's an ML model like in your approach with ops.

Also, I envy your API. My previous versions did what you are doing and what most other libraries do, using what I call Expressions containing other Expressions; my current V0 API simply can't do .reverse(). It turns out to be very difficult, especially when my interfaces are only structs and arrays - one obvious problem among others being that a struct can't contain itself.

Sorry for dragging this out too long - anyway, I will link it once I make it public after improving it a bit (I can't seem to be able to send a DM).

1 Like