Metaphor: GPU machine learning library for Zig

Hey everyone! I’ve been slowly hacking away at a Torch-like library for Zig.

This is an ongoing passion project of mine that I’m excited to share. It’s a combination of CUDA, C, C++, and of course, Zig.

Metaphor is intended to be a Torch-like library for Zig. The goal is a simple syntax that feels Pythonic without sacrificing low-level control.


Mixing Zig with CUDA

Metaphor is entirely GPU-driven and focused on working with large data.

After a lot of tinkering, I believe I have found a balance that keeps the includes manageable while still exposing enough implementation detail to make writing kernels easy.

The library is inherently multi-stream. Similar to multi-threading, streams act like work queues that can be loaded and launched asynchronously.

Example:

Everything in Metaphor works with streams, so to get started, we initialize our GPU context, obtain a stream, build a graph, and allocate some tensors.

    const mp = @import("metaphor");

    // Initialize device and cuda context on device zero
    mp.device.init(0);

    const stream = mp.stream.init();

    defer mp.stream.deinit(stream);

    const G = mp.Graph.init(.{
        .stream = stream,
        .mode = .eval,
    });

    defer G.deinit();

    // CUDA tensors have analogous datatypes to the CPU, but with some
    // implementation differences for 16-bit floats to reduce bus traffic.
    // Freed memory is cached for reuse.

    // Tensors can be freed individually, but are also freed on G.deinit()
    const X1 = G.tensor(.inp, .r32, mp.Dims(2){ 2, 2 });
    const X2 = G.tensor(.wgt, .r32, mp.Dims(2){ 2, 2 });

The math operations are straightforward, as is the reversal process:

    // y = A.x
    const y = mp.ops.innerProduct(A, x, "ij,j->i");
    // y = A.x + b
    const y = mp.ops.linear(A, x, b, "ij,j->i");
    // B = A transpose
    const B = mp.ops.permutate(A, "ij->ji");
    // w = u + v
    const w = mp.ops.add(u, v);
    // operations can be composed, e = (a + b) * (c + d)
    const e = mp.ops.hadamard(mp.ops.add(a, b), mp.ops.add(c, d));
    // feed-forward block
    const y = mp.ops.selu(mp.ops.linear(x, A, b, "i,ij->j"));

    y.reverse();

    // inspect gradients
    if (A.grads()) |grd| {
        // use gradient...
    }

In the works:

More kernels!

I spent a lot of time working my way through Zig to find an architecture that I felt was a good starting place. At this point, I’m focusing on implementing custom kernels.

Static Library Linkage

– Edited - this was accomplished :slight_smile:

Configurable build

– Edited - this has been started :stuck_out_tongue:

Anyhow, I’ve genuinely learned a lot so far and I’m looking forward to learning more!

18 Likes

Totally awesome! I am in way over my head with this stuff, but it sounds like a great start for some optimal CPU / GPU compute synergy.

GitHub - matter-labs/era-boojum (“Boojum, the scariest SNARK implementation”) came to mind; it’s a cryptographic zero-knowledge proof prover, where they created a very similar-looking tool in Rust, specifically for their use case.

In a small presentation he goes into a lot of detail - https://youtu.be/zcGmDO6uisk?si=uUuhxYjMiivFD4Ir - and the takeaway I got was that you can get a lot more GPU work done by having larger, more specialized kernels, vs. sending out a lot of instructions for basic math. Do you see any future in this project where kernels can be synthesized at build-time from multiple simpler nodes? Zig seems well positioned for it, I think.
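
(As a toy, CPU-side illustration of the build-time-synthesis idea: Zig’s comptime can fuse a chain of simple elementwise nodes into one loop, which is the same trick a larger, specialized GPU kernel would be doing. The `fused`, `double`, and `addOne` names below are made up for this sketch and have nothing to do with Metaphor.)

    const std = @import("std");

    // Each "node" is just a simple elementwise function.
    fn double(x: f32) f32 {
        return x * 2.0;
    }

    fn addOne(x: f32) f32 {
        return x + 1.0;
    }

    // Fuse a comptime-known chain of nodes into a single loop over the data,
    // the CPU-side analogue of emitting one larger kernel from simple nodes.
    fn fused(comptime ops: anytype, xs: []const f32, out: []f32) void {
        for (xs, out) |x, *o| {
            var v = x;
            inline for (ops) |op| v = op(v);
            o.* = v;
        }
    }

    pub fn main() void {
        const in = [_]f32{ 1.0, 2.0, 3.0 };
        var out: [3]f32 = undefined;
        fused(.{ double, addOne }, &in, &out);
        std.debug.print("{any}\n", .{out});
    }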

1 Like

@Nathan-Franck, that is a great question and idea! I’ve got two thoughts on this.

There are some larger kernels on the way (some that are ternary and above), and I built the callbacks with a small-buffer optimization, but they can allocate if you need more arguments for a kernel. So in some sense, I’m planning on making larger kernels soon that combine operations.
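
(A rough sketch of what a small-buffer optimization for kernel arguments can look like, for readers unfamiliar with the idea; the `ArgPack` type below is hypothetical and not Metaphor’s actual callback code.)

    const std = @import("std");

    // Hypothetical sketch: a kernel-argument pack with a small inline buffer.
    // If the packed arguments fit in `buffer`, nothing is allocated; otherwise
    // we fall back to the heap.
    pub fn ArgPack(comptime small_size: usize) type {
        return struct {
            const Self = @This();

            buffer: [small_size]u8 = undefined,
            heap: ?[]u8 = null,
            len: usize = 0,

            pub fn set(self: *Self, allocator: std.mem.Allocator, bytes: []const u8) !void {
                if (bytes.len <= small_size) {
                    @memcpy(self.buffer[0..bytes.len], bytes);
                } else {
                    self.heap = try allocator.dupe(u8, bytes);
                }
                self.len = bytes.len;
            }

            pub fn slice(self: *const Self) []const u8 {
                return if (self.heap) |h| h[0..self.len] else self.buffer[0..self.len];
            }

            pub fn deinit(self: *Self, allocator: std.mem.Allocator) void {
                if (self.heap) |h| allocator.free(h);
            }
        };
    }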

My second thought is that Metaphor is set up to be very easy to build custom kernels for. I need to write a readme on this, but basically, when you run the build, it generates the header, overloads the kernel for whatever datatypes you want, and then hands you back a Zig file with a new OverloadSet in it. (You can see the result here: Metaphor/src/kernel_overloads.zig at main · andrewCodeDev/Metaphor · GitHub). I’m hoping this encourages people to write the custom kernels that they want.

Attaching a math ops function to it is easy (you just call the overload set, you can see that in “tensor_ops.zig”), and then in the main “metaphor.zig” file, you specify how you want the graph to behave - does it allocate a new node, build scratch memory… etc.
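
(To give a feel for the pattern without clicking through: below is a minimal hand-written sketch of dispatching on the element type at comptime. The `launch_add_r32`/`launch_add_r64` symbols and the `addDispatch` wrapper are invented for this sketch; the real generated file is the kernel_overloads.zig linked above.)

    // Hypothetical sketch of the overload-set idea, not the generated code:
    // the build step emits one extern symbol per datatype, and a comptime
    // check picks the right one based on the element type.
    extern fn launch_add_r32(x: [*]const f32, y: [*]const f32, z: [*]f32, n: usize) void;
    extern fn launch_add_r64(x: [*]const f64, y: [*]const f64, z: [*]f64, n: usize) void;

    pub fn addDispatch(comptime T: type, x: [*]const T, y: [*]const T, z: [*]T, n: usize) void {
        if (T == f32) {
            launch_add_r32(x, y, z, n);
        } else if (T == f64) {
            launch_add_r64(x, y, z, n);
        } else {
            @compileError("unsupported element type: " ++ @typeName(T));
        }
    }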

At this point, I’m very open to argument - I’ve decided to play around with it as is to see where my sticking points are and then do another large update. I’m primarily focused on the backend (like, how can we multithread the reversal process based on different streams and does that have a device memory cost… or adding support for CPU offloading on forwards, because you can fine-tune a 13-billion-parameter network on a single GPU if you’re clever about that sort of thing).

That said, I’m certainly happy to accept help writing kernels if anyone has any ideas :slight_smile:

1 Like

Also, I’m curious whether this holds for streams and CUDA graphs. One of the biggest overheads with CUDA kernels is “launch overhead”. Minimizing independent launches is a great strategy for this, but queuing up operations on a stream might reduce this overhead quite a bit. I need to test this out more, because Metaphor is built on streams rather than independent launches. Instead, it builds a queue of work for the GPU, and each queue does not require device synchronization to produce correct results.
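
(In the fragment style of the examples above, and using only ops already shown in the thread, here is a sketch of what “queue, don’t sync” looks like from the caller’s side. `x`, `A1`, `b1`, `A2`, and `b2` stand for tensors created the same way as in the setup block.)

    // Each call below queues work on the graph's stream and returns; ordering
    // within the stream means no per-op device synchronization is needed for
    // correct results.
    const h1 = mp.ops.selu(mp.ops.linear(x, A1, b1, "i,ij->j"));
    const h2 = mp.ops.selu(mp.ops.linear(h1, A2, b2, "j,jk->k"));

    // The reversal is queued on the same stream as well.
    h2.reverse();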

Again, totally out of my depth, but maybe just the context-switching of the GPU moving over to another kernel is most of the overhead, regardless of how well the CPU can queue up commands to it? Sounds like you are set up to do some exploration :slight_smile:

This is great! I was thinking about this a while ago, and here you are with a neat library. I will follow your work. Looking forward to the coming iterations of your project!

2 Likes

@Nathan-Franck yes, context switching is brutal. If you can overlap your calls, you can get dramatically better performance but you’re absolutely correct - there is much experimentation to be done.

Thank you, @jaime and welcome to the forum :slight_smile: I’m always open to contributors if you ever find the motivation!

2 Likes

Thanks, Andrew, for your awesome work. It would be great to contribute to your library, but I’m still learning Zig at the moment and I don’t feel I have enough skills to jump into your code without maybe messing it up :). But I will go through it and try to understand the idea anyway.

3 Likes

I’ve finished some more testing, and even with GPU queues, bookkeeping, and dimension inspections, I’m getting around a 100x speedup on large tensor operations. The queues hide the launches quite brilliantly, it seems. I’ve uploaded several examples to GitHub, and one of them covers multi-stream GPU operations.

I’m sure there’s still plenty of performance left on the table (I’m still filling out my BLAS library, and there are tons of optimizations to do there) but I’m happy with the results so far! Metaphor/src/examples/streams.zig at main · andrewCodeDev/Metaphor · GitHub
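
(For anyone skimming the thread, here is a rough sketch of what multi-stream usage might look like, using only calls shown earlier; binding one graph per stream is my assumption for this sketch, and the linked streams.zig is the authoritative example.)

    const mp = @import("metaphor");

    pub fn main() void {
        mp.device.init(0);

        // Two streams: work queued on s1 and s2 can overlap on the device.
        const s1 = mp.stream.init();
        defer mp.stream.deinit(s1);
        const s2 = mp.stream.init();
        defer mp.stream.deinit(s2);

        // Assumed usage: one graph per stream, so each graph's ops are
        // queued independently.
        const G1 = mp.Graph.init(.{ .stream = s1, .mode = .eval });
        defer G1.deinit();
        const G2 = mp.Graph.init(.{ .stream = s2, .mode = .eval });
        defer G2.deinit();

        const a = G1.tensor(.inp, .r32, mp.Dims(2){ 512, 512 });
        const b = G1.tensor(.inp, .r32, mp.Dims(2){ 512, 512 });
        const c = G2.tensor(.inp, .r32, mp.Dims(2){ 512, 512 });
        const d = G2.tensor(.inp, .r32, mp.Dims(2){ 512, 512 });

        // These two adds are queued on different streams and may overlap.
        const u = mp.ops.add(a, b);
        const v = mp.ops.add(c, d);
        _ = u;
        _ = v;
    }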

2 Likes