Metaphor: GPU machine learning library for Zig

Hey everyone! I’ve been slowly hacking away at a Torch-like library for Zig.

This is an ongoing passion project of mine that I’m excited to share. It’s a combination of CUDA, C, C++, and of course, Zig.

Metaphor is intended to be a Torch-like library for Zig. The goal is a simple syntax that feels Pythonic without sacrificing low-level control.


Mixing Zig with CUDA

Metaphor is entirely GPU-driven and focused on working with large data.

After a lot of tinkering, I believe I have found a balance that keeps the includes manageable while still exposing enough implementation detail to make writing kernels easy.

The library is inherently multi-stream. Similar to multi-threading, streams act like work queues that can be loaded and launched asynchronously.

Example:

Everything in Metaphor works with streams, so to get started, we initialize our GPU context, obtain a stream, build a graph, and allocate some tensors.

    const mp = @import("metaphor");

    // Initialize device and cuda context on device zero
    mp.device.init(0);

    const stream = mp.stream.init();

    defer mp.stream.deinit(stream);

    const G = mp.Graph.init(.{
        .stream = stream,
        .mode = .eval,
    });

    defer G.deinit();

    // CUDA tensors have analogous datatypes to the CPU, but with some
    // implementation differences for 16-bit floats to reduce bus traffic.
    // Freed memory is cached for reuse.

    // Tensors can be freed individually, but are also freed on G.deinit()
    const X1 = G.tensor(.inp, .r32, mp.Dims(2){ 2, 2 });
    const X2 = G.tensor(.wgt, .r32, mp.Dims(2){ 2, 2 });

The math operations are straightforward, as is the reversal process:

    // y = A.x
    const y = mp.ops.innerProduct(A, x, "ij,j->i");
    // y = A.x + b
    const y = mp.ops.linear(A, x, b, "ij,j->i");
    // B = A transpose
    const B = mp.ops.permutate(A, "ij->ji");
    // w = u + v
    const w = mp.ops.add(u, v);
    // operations can be composed, e = (a + b) * (c + d)
    const e = mp.ops.hadamard(mp.ops.add(a, b), mp.ops.add(c, d));
    // feed-forward block
    const y = mp.ops.selu(mp.ops.linear(x, A, b, "i,ij->j"));

    y.reverse();

    // inspect gradients
    if (A.grads()) |grd| {
        // use gradient...
    }

In the works:

More kernels!

I spent a lot of time working my way through Zig to find an architecture that I felt was a good starting place. At this point, I’m focusing on implementing custom kernels.

Static Library Linkage

– Edited - this was accomplished :slight_smile:

Configurable build

– Edited - this has been started :stuck_out_tongue:

Anyhow, I’ve genuinely learned a lot so far and I’m looking forward to learning more!

18 Likes

Totally awesome! I am in way over my head with this stuff, but it sounds like a great start for some optimal CPU / GPU compute synergy.

GitHub - matter-labs/era-boojum (“Boojum, the scariest SNARK implementation”) came to mind; it’s a cryptographic zero-knowledge proof prover, where they created a very similar-looking tool in Rust, specifically for their use case.

In a small presentation he goes into a lot of detail - https://youtu.be/zcGmDO6uisk?si=uUuhxYjMiivFD4Ir - and the takeaway I got was that you can get a lot more GPU work done by having larger, more specialized kernels, vs. sending out a lot of instructions for basic math. Do you see any future in this project where kernels can be synthesized at build-time from multiple simpler nodes? Zig seems well positioned for it, I think.
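
(As a toy, CPU-side illustration of the build-time-synthesis idea: Zig’s comptime can fuse a chain of simple elementwise nodes into one loop, which is the same trick a larger, specialized GPU kernel would be doing. The `fused`, `double`, and `addOne` names below are made up for this sketch and have nothing to do with Metaphor.)

    const std = @import("std");

    // Each "node" is just a simple elementwise function.
    fn double(x: f32) f32 {
        return x * 2.0;
    }

    fn addOne(x: f32) f32 {
        return x + 1.0;
    }

    // Fuse a comptime-known chain of nodes into a single loop over the data,
    // the CPU-side analogue of emitting one larger kernel from simple nodes.
    fn fused(comptime ops: anytype, xs: []const f32, out: []f32) void {
        for (xs, out) |x, *o| {
            var v = x;
            inline for (ops) |op| v = op(v);
            o.* = v;
        }
    }

    pub fn main() void {
        const in = [_]f32{ 1.0, 2.0, 3.0 };
        var out: [3]f32 = undefined;
        fused(.{ double, addOne }, &in, &out);
        std.debug.print("{any}\n", .{out});
    }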

1 Like

@Nathan-Franck, that is a great question and idea! I’ve got two thoughts on this.

There are some larger kernels on the way (some that are ternary and above), and I built the callbacks with a small-buffer optimization, but they can allocate if you need more arguments for a kernel. So in some sense, I’m planning on making larger kernels soon that combine operations.
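
(A rough sketch of what a small-buffer optimization for kernel arguments can look like, for readers unfamiliar with the idea; the `ArgPack` type below is hypothetical and not Metaphor’s actual callback code.)

    const std = @import("std");

    // Hypothetical sketch: a kernel-argument pack with a small inline buffer.
    // If the packed arguments fit in `buffer`, nothing is allocated; otherwise
    // we fall back to the heap.
    pub fn ArgPack(comptime small_size: usize) type {
        return struct {
            const Self = @This();

            buffer: [small_size]u8 = undefined,
            heap: ?[]u8 = null,
            len: usize = 0,

            pub fn set(self: *Self, allocator: std.mem.Allocator, bytes: []const u8) !void {
                if (bytes.len <= small_size) {
                    @memcpy(self.buffer[0..bytes.len], bytes);
                } else {
                    self.heap = try allocator.dupe(u8, bytes);
                }
                self.len = bytes.len;
            }

            pub fn slice(self: *const Self) []const u8 {
                return if (self.heap) |h| h[0..self.len] else self.buffer[0..self.len];
            }

            pub fn deinit(self: *Self, allocator: std.mem.Allocator) void {
                if (self.heap) |h| allocator.free(h);
            }
        };
    }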

My second thought is that Metaphor is set up to be very easy to build custom kernels for. I need to write a readme on this, but basically, when you run the build, it generates the header, overloads the kernel for whatever datatypes you want, and then hands you back a Zig file with a new OverloadSet in it. (You can see the result here: Metaphor/src/kernel_overloads.zig at main · andrewCodeDev/Metaphor · GitHub). I’m hoping this encourages people to write the custom kernels that they want.

Attaching a math ops function to it is easy (you just call the overload set, you can see that in “tensor_ops.zig”), and then in the main “metaphor.zig” file, you specify how you want the graph to behave - does it allocate a new node, build scratch memory… etc.
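
(To give a feel for the pattern without clicking through: below is a minimal hand-written sketch of dispatching on the element type at comptime. The `launch_add_r32`/`launch_add_r64` symbols and the `addDispatch` wrapper are invented for this sketch; the real generated file is the kernel_overloads.zig linked above.)

    // Hypothetical sketch of the overload-set idea, not the generated code:
    // the build step emits one extern symbol per datatype, and a comptime
    // check picks the right one based on the element type.
    extern fn launch_add_r32(x: [*]const f32, y: [*]const f32, z: [*]f32, n: usize) void;
    extern fn launch_add_r64(x: [*]const f64, y: [*]const f64, z: [*]f64, n: usize) void;

    pub fn addDispatch(comptime T: type, x: [*]const T, y: [*]const T, z: [*]T, n: usize) void {
        if (T == f32) {
            launch_add_r32(x, y, z, n);
        } else if (T == f64) {
            launch_add_r64(x, y, z, n);
        } else {
            @compileError("unsupported element type: " ++ @typeName(T));
        }
    }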

At this point, I’m very open to argument - I’ve decided to play around with it as is to see where my sticking points are and then do another large update. I’m primarily focused on the backend (like, how can we multithread the reversal process based on different streams and does that have a device memory cost… or adding support for CPU offloading on forwards, because you can fine-tune a 13-billion-parameter network on a single GPU if you’re clever about that sort of thing).

That said, I’m certainly happy to accept help writing kernels if anyone has any ideas :slight_smile:

1 Like

Also, I’m curious whether this holds for streams and CUDA graphs. One of the biggest overheads with CUDA kernels is “launch overhead”. Minimizing independent launches is a great strategy for this, but queuing up operations on a stream might reduce this overhead quite a bit. I need to test this out more, because Metaphor is built on streams rather than independent launches. Instead, it builds a queue of work for the GPU, and each queue does not require device synchronization to produce correct results.
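
(In the fragment style of the examples above, and using only ops already shown in the thread, here is a sketch of what “queue, don’t sync” looks like from the caller’s side. `x`, `A1`, `b1`, `A2`, and `b2` stand for tensors created the same way as in the setup block.)

    // Each call below queues work on the graph's stream and returns; ordering
    // within the stream means no per-op device synchronization is needed for
    // correct results.
    const h1 = mp.ops.selu(mp.ops.linear(x, A1, b1, "i,ij->j"));
    const h2 = mp.ops.selu(mp.ops.linear(h1, A2, b2, "j,jk->k"));

    // The reversal is queued on the same stream as well.
    h2.reverse();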

Again, totally out of my depth, but maybe just the context-switching of the GPU moving over to another kernel is most of the overhead, regardless of how well the CPU can queue up commands to it? Sounds like you are set up to do some exploration :slight_smile:

This is great! I was thinking about this a while ago, and here you are with a neat library. I will follow your work. Looking forward to the coming iterations of your project!

2 Likes

@Nathan-Franck yes, context switching is brutal. If you can overlap your calls, you can get dramatically better performance but you’re absolutely correct - there is much experimentation to be done.

Thank you, @jaime and welcome to the forum :slight_smile: I’m always open to contributors if you ever find the motivation!

2 Likes

Thanks, Andrew, for your awesome work. It would be great to contribute to your library, but I’m still learning Zig at the moment and I don’t feel I have enough skills to jump into your code without maybe messing it up :). But I will go through it and try to understand the idea anyway.

3 Likes

I’ve finished some more testing, and even with GPU queues, bookkeeping, and dimension inspections, I’m getting around a 100x speedup on large tensor operations. The queues hide the launches quite brilliantly, it seems. I’ve uploaded several examples to GitHub, and one of them covers multi-stream GPU operations.

I’m sure there’s still plenty of performance left on the table (I’m still filling out my BLAS library, and there are tons of optimizations to do there) but I’m happy with the results so far! Metaphor/src/examples/streams.zig at main · andrewCodeDev/Metaphor · GitHub
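
(For anyone skimming the thread, here is a rough sketch of what multi-stream usage might look like, using only calls shown earlier; binding one graph per stream is my assumption for this sketch, and the linked streams.zig is the authoritative example.)

    const mp = @import("metaphor");

    pub fn main() void {
        mp.device.init(0);

        // Two streams: work queued on s1 and s2 can overlap on the device.
        const s1 = mp.stream.init();
        defer mp.stream.deinit(s1);
        const s2 = mp.stream.init();
        defer mp.stream.deinit(s2);

        // Assumed usage: one graph per stream, so each graph's ops are
        // queued independently.
        const G1 = mp.Graph.init(.{ .stream = s1, .mode = .eval });
        defer G1.deinit();
        const G2 = mp.Graph.init(.{ .stream = s2, .mode = .eval });
        defer G2.deinit();

        const a = G1.tensor(.inp, .r32, mp.Dims(2){ 512, 512 });
        const b = G1.tensor(.inp, .r32, mp.Dims(2){ 512, 512 });
        const c = G2.tensor(.inp, .r32, mp.Dims(2){ 512, 512 });
        const d = G2.tensor(.inp, .r32, mp.Dims(2){ 512, 512 });

        // These two adds are queued on different streams and may overlap.
        const u = mp.ops.add(a, b);
        const v = mp.ops.add(c, d);
        _ = u;
        _ = v;
    }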

2 Likes