Advice on linking CUDA to Zig


I’ve just finished the first draft of my PyTorch-like clone for Zig, and it’s time to start wiring up the heavy-lifting compute functionality.

I’ve written quite a bit of CUDA and I’ve always abhorred the build systems that it entails but I think I’m ready to tackle this beast with Zig. I’m expecting a heavy learning experience but I’m excited to start.

According to this project, Zig can call function signatures written in C that are compiled by the NVCC compiler, using the CUDA C API. This suggests that treating it like typical C code linkage may be a good way to go: GitHub - gwenzek/cudaz: Toy Cuda wrapper for Zig

That said, I really want to do this right and not learn to slowly hate my project over time. I’m sure there’s an art to doing this, but I’m hoping some people here can point me in the right direction. Here are a couple of questions I have for everyone…

  • Are there any projects that you would recommend I read towards this goal?
  • Are there any common footguns I should be aware of?
  • Any general advice on good build system structures/patterns for this job?
  • Is there something I should absolutely avoid doing?

Cursory googling has revealed very little on the subject. I’ll keep digging to see if I can find more examples.

Thanks everyone, appreciate your time and attention :slight_smile:


I didn’t look into the cudaz project in detail, but forcing .cu code to be C seems like a potentially painful limitation. Wouldn’t you still be able to compile .cu normally, use it from C, and expose a C-only API to your Zig code? I think this means you won’t be able to “@cImport all the things”; rather, you’d have your own interface .h file that you use to call your CUDA kernels from Zig.

Here’s one example of a CUDA/C project that uses .cu files from .c: darknet/src at master · pjreddie/darknet · GitHub. This project is by Joseph Redmon (the first author of the YOLO object detection paper) who’s chosen to develop their own neural engine. I find this code pretty easy to follow.

I’ve also continued working on my “autograd” engine and it’s a lot of fun. I massively nerd-sniped myself into working on it - I don’t expect anyone to ever use it for anything, but it’s been a fun playground for some “relaxing” Zig coding. I recently added convolutional layers and did some matmul optimization to make my convolutions run faster. I can easily beat PyTorch CPU performance in a simple MLP model for MNIST classification, but my conv2d is not nearly as fast as PyTorch’s. To make mine go faster, I’d probably need a) a SIMD- and cache-optimized GEMM and b) multi-core execution.


Thank you for linking me to that project, and I appreciate your observations about @cImport. This is precisely why I’m playing around with different options before committing to something.

Torch is relatively clunky for a variety of reasons - not least of which is the number of different backends they are committed to supporting. Their tensor class is jam-packed with flags for whatever backend you happen to be running.

It’ll be really interesting to compare projects with you, @nurpax. I’m excited to see where we differ and converge. I’m going the dynamic route (the graph is built on every forward pass and can be conditionally built on the fly, so things like recurrence just naturally shake out, but at a cost).

Anyhow, back to the subject - I’ll read that link you sent and see how it’s set up. The cudaz implementation is very complete and it’s from the perspective of Zig, so it’s at least a good point of reference.


After two sessions of staring at the cuda wrapper project (which is broken; the Zig build system has moved on from it by quite a few large strides), going through C is starting to look like the only sane option.

At the very least, CUDA is actually built to be C-compatible - otherwise, it feels like I’m trying to break into a black box… not to mention the PTX assembly in the link I posted above is crashing, so hopefully the C route will help me avoid that. I was worried about that when I saw the raw assembly in the member functions.

Still working on it…

Update on this… I’m going down the road of file generation using C template files. The idea is to enable template-like behaviour in the C files so I can replace keywords in the function names and bodies. That way, I can spawn a bunch of different versions of the same function, append them to a file, and compile it, so I avoid hand-duplicating function definitions. This is particularly helpful because I’m supporting complex numbers and allowing mixed-precision operations, so the combinatorics would get nasty.

So far, here are the fundamentals of the string replacement algorithm: it iterates through a directory, replaces the strings in each file, and saves the new file to the output directory. These will probably be temporary files that get deleted at some point, existing just for the compiler to generate the output files. It’s all panics and no try because this runs at build time and I don’t have any real use for an error value.

const std = @import("std");
const heap = std.heap;
const mem = std.mem;

fn readFileToBuffer(filename: []const u8, allocator: mem.Allocator) []const u8 {
    const f = std.fs.cwd().openFile(filename, .{}) catch @panic("Cannot open file.");
    defer f.close();

    const f_len = f.getEndPos() catch @panic("Could not get end position.");
    const buf = allocator.alloc(u8, f_len) catch @panic("Out of memory.");

    _ = f.readAll(buf) catch @panic("Could not read file.");

    return buf;
}

fn writeStringToFile(path: []const u8, string: []const u8) void {
    var file = std.fs.cwd().createFile(path, .{}) catch @panic("Failed to create file.");
    defer file.close();

    var writer = file.writer();
    writer.writeAll(string) catch @panic("Failed to write file.");
}

pub fn replace(
    haystack: []const u8,
    needle: []const u8,
    replacement: []const u8,
    allocator: mem.Allocator,
) []u8 {
    const new_string: []u8 = allocator.alloc(
        u8, mem.replacementSize(u8, haystack, needle, replacement),
    ) catch @panic("Out of memory.");

    _ = mem.replace(u8, haystack, needle, replacement, new_string);

    return new_string;
}

pub fn joinPaths(
    buffer: []u8,
    head: []const u8,
    tail: []const u8,
) []const u8 {
    var i: usize = 0;
    while (i < buffer.len and i < head.len) : (i += 1) {
        buffer[i] = head[i];
    }

    buffer[i] = '/';
    i += 1;

    var j: usize = 0;
    while (i < buffer.len and j < tail.len) : ({ i += 1; j += 1; }) {
        buffer[i] = tail[j];
    }

    return buffer[0..i];
}

pub fn main() !void {
    var arena = heap.ArenaAllocator.init(heap.page_allocator);
    defer arena.deinit();
    const allocator = arena.allocator();

    var path_buffer: [512]u8 = undefined;

    var dir: std.fs.Dir = try std.fs.cwd().openDir("test_dir", .{
        .access_sub_paths = false, .iterate = true, .no_follow = true,
    });
    defer dir.close();

    var itr = dir.iterate();
    while (try |path| {
        const input = readFileToBuffer(
            joinPaths(&path_buffer, "test_dir",,
            allocator,
        );

        const new_string = replace(input, "XXXX", "test", allocator);

        std.debug.print("\nOutput: {s}\n", .{new_string});

        writeStringToFile(
            joinPaths(&path_buffer, "test_out",,
            new_string,
        );
    }
}

The only thing I’m worried about at this point is how to link to currently non-existent function definitions (I haven’t played around with extern in a serious way). I may be able to get away with just declaring the functions in Zig and leaving their bodies in C files.

Anyone have any thoughts on this so far?

I’m not sure I 100% follow what you’re trying to do. Is this about auto-generating implementations for different function overloads for different types (int/f32/f64/complex, etc.)?

If so, I guess I’d personally keep that separate from the build scripts, just run it whenever you add new functions or different type combinations, and check the results into git. This way the generator might as well be written in Python.

BTW: I think you could use std.fs.path.join in place of joinPaths?

Yup, you’ve got the idea :slight_smile:

An external generation script may be a good idea for the reason you mentioned. One reason to keep it in the build is that I’d always have up-to-date functions. If it’s a big hassle, then I’ll skip it, but I’m curious how far the idea can go in pure Zig. It’s pretty similar to how files are generated in numpy… they hash the names and call them (though I’d like to avoid the hashing bit).

Join is definitely an option, but I chose not to use it because it requires an allocator and I don’t really need one in this case… it’s a shallow path, so I can just reuse the same buffer. I’d get decent behaviour from a FixedBufferAllocator, but then I’d have to reset it a bunch. At some point (especially if I want to make this cross-platform), I’ll probably switch to std.fs.path.join.

This is honestly one of those places I wish C++ wasn’t such a nightmare to link to because I could write up the whole thing using templates in probably a week. Honestly, cuda itself is really awkward to work with, so this is as much of an experiment as anything.

Is it possible to use FixedBufferAllocator for your use case? Anyway, your ad-hoc solution should perform a little better because it avoids calls through function pointers, though I haven’t measured it.

On a general note, isn’t it a little unfortunate that allocators always work via an indirection? I imagine another option: take the allocator parameter as anytype and call alloc directly, avoiding a call through a pointer and enabling the compiler to inline the allocator’s functions. In that case, a call to std.fs.path.join could look like this

const fba = FixedBufferAllocator.init(buffer);
const result = std.fs.path.join(fba, ...);

Perhaps it was considered in the past and discarded in favor of the interface. It could significantly increase code bloat (because allocator parameters are ubiquitous in Zig), and any performance gain would have to outweigh that. But it would give more flexibility. If code bloat is an issue, the following would still be possible

var fba = FixedBufferAllocator.init(buffer);
const allocator = fba.allocator();
const result = std.fs.path.join(allocator, ...);

i.e. instantiating join with the Allocator interface type.

In general I don’t like the saying “premature optimization is the root of all evil”, but optimizing away allocations in build-script path joins may be a case of it? :thinking: A build script written in Zig avoids some tens of clock cycles by not calling through an Allocator, followed by, what, trillions of clock cycles invoking various compilers?

As an example, if this script was written in Python, probably every line of that Python code would cost more than the whole Zig program. :slight_smile:


In the case of a build script, yes, I agree. If you’re talking about taking the allocator as anytype, that was just a small rant about the general state of affairs, not particularly about this case.

I commented on the build script, yes.

I guess there’s some non-performance reason for Allocators being passed as an interface that’s called through virtual functions. Code size, maybe? So that the implementation doesn’t potentially get inlined everywhere? Or compilation speed? Or maybe it’s so that Allocators are safe to pass by value and easily copyable?


Like I mentioned, I’ll probably replace it at some point in the future.

Yes, it seems possible. Between calls I may have to use this member function if I’m doing multiple files: zig/lib/std/heap.zig at master · ziglang/zig · GitHub

    pub fn reset(self: *FixedBufferAllocator) void {
        self.end_index = 0;
    }

Otherwise, yes - I think you mean that the Allocator interface always works via pointer indirection, but we’re starting to get off topic here.

What I’m taking away from this thread so far is that there aren’t many people working on this problem. I’m genuinely surprised considering how little I could find on Google (and the GitHub examples that do exist are defunct).

TBH, I’m not sure I entirely understand the problem you’re facing. Is it that you have a large number of overloads of a single kernel and you want to expose them to Zig somehow? I guess the simplest approach is to treat it like making a C API for a C++ library - like cimgui, for example.

But I’m not sure if that’s the problem.


I’m definitely trying to get to the heart of the issue too and I appreciate your on-going interest - let’s take a step back and brainstorm for a bit.

Ideally, the best thing would be to link/call CUDA kernels in as few steps/dependencies as possible.

My problem is that I have a potential combinatoric expansion of function signatures that may be needed.

Since we can link to C functions very cleanly and CUDA already supports C interoperability, this intuitively seems like the missing link. There’s one nasty catch, though.

C does not have function overloading or generic parameters that carry type information at compile time.

Introducing C++ that then gets routed through C means I still need to write an interface with C in general - if I have variable argument/return types, this still seems like we’re in the same trap. Maybe I’m wrong and you have a counter example?

The reason I thought function generation would be handy is that it lets me stamp out whatever overloads I need from a minimal set of template files - any change I make to a template gets propagated to every function instantiated from it. This sounds ideal, but it’s tricky to do correctly, and it means working with partially formed files.

That said, there are examples of successful libraries that are doing this so it’s certainly a route that has been taken before.

In summary, what I’m fundamentally dealing with is how to generate multiple function definitions for a language (if I stick with the C approach) that doesn’t support this natively.

Introducing C++ that then gets routed through C means I still need to write an interface with C in general - if I have variable argument/return types, this still seems like we’re in the same trap. Maybe I’m wrong and you have a counter example?

I don’t think the C API itself will look very pretty, but I think I’ve seen some examples where this C API is auto-generated from the C++ type information using some combination of Python and Clang APIs. I thought cimgui would’ve been one such example, but I think they’re doing something simpler. Nevertheless, they have a systematic approach to generating C bindings for overloaded C++ functions. The way I’d imagine this would be some sort of offline processing step that’d be run when you change any APIs, add new overloads or write new kernels, and you’d check in the generated files into Git. Or at least I can’t imagine this being easy to integrate into build.zig.

I think being able to build the .cu kernel files (with C++ support in the .cu code) and having a C API for them sounds like something you’ll need no matter what your approach is, since most likely you must link C code with Zig.

But even if you have a good system for the .cu → C part, I think you’ll then want some other approach for C → Zig so that Zig code using the kernels doesn’t look too awful. TBH, I don’t know what that should look like. Maybe that’s where the largest struggle is now? Prototype with some real code integrating these things together and see if anything shakes out?

FWIW, I haven’t written any code for this either, just trying to summarize or rephrase what I’ve tried to say in this thread.

Interesting. Thanks for the lead, I’ll have to check this out.

Same, but I think we’re on the same page here. If anything, prototyping a few examples will at least give me the background to talk more directly to the problem itself.

I’ll update here once I’ve cooked up a few prototypes and we’ll see what the damage is.


Update on this (and notes for the future interested):

The NVCC compiler treats everything from the point of view of a C++ compiler - basically, it mangles names. To prevent that, you have to declare all your C-callable functions in the .cu file with the extern "C" qualifier.

So for instance…


const int N = 10; // some constant for demonstration's sake

// CUDA kernel for vector addition
__global__ void __vectorAddition(const int *dev_a, const int *dev_b, int *dev_c) {

  // Get the id of the thread within the block
  unsigned short tid = threadIdx.x;
  if (tid < N) // check the boundary condition for the threads
    dev_c[tid] = dev_a[tid] + dev_b[tid];
}

// callable function from C
extern "C" void vectorAddition(const int* a, const int* b, int* c) {
  __vectorAddition<<<1, N>>>(a, b, c);
}
From there, you can make function declarations in a C style like so…

// main.c
extern void vectorAddition(const int*, const int*, int*);

…and NVCC will successfully tie the two together.

At this point, I need to start working Zig into the picture. The problem at hand is that the C++-style compilation requires extern "C" to stop it from mangling names, but extern "C" is a C++ feature, not a C feature, so getting Zig to pick up on that could be weird. I now need to look into @cImport to see what requirements it has.


Can C/C++ macros help here? Like conditionally defining an EXTERN_C macro to be either nothing or extern "C" depending on #ifdef __cplusplus.


Yep, I think that’s the common approach: use an EXTERN_C macro in a .h file that you’d cImport in Zig. That’s how it’s commonly done when mixing C and C++; I don’t think it’s any different with CUDA.

@nurpax, do you have any examples to look at regarding this approach?