Zig-Torch – Writing a Custom Python Backend (and a lesson in humility vs BLAS)

Hey everyone,

I’d like to share a hobby project I’ve been working on for a while – zig-torch.

I started this project about a year ago with the idea of creating a PyTorch extension written in Zig. Back then, fueled by enthusiasm (and perhaps some favorable micro-benchmarks), I even posted on Reddit claiming I achieved a 94% performance improvement over the original implementation.

Reality, and a deeper dive into the subject (especially after my university exams), disproved those initial results. Currently, I treat this project as a testing ground to learn Zig, explore low-level optimization (SIMD), and handle FFI with Python.

What is zig-torch?

It’s an attempt to write tensor operations (primarily matrix multiplication, mm) in pure Zig and expose them to Python, so they can work independently of PyTorch or alongside it.

What’s currently working:

  • Build System: build.zig compiles the code into a shared library (.so / .dll), which Python loads via ctypes. This is probably the most enjoyable part of working with Zig—the tooling just works.

  • Matrices: The zig_mm implementation utilizes tiling (cache blocking) and SIMD vectorization (@Vector(8, f32)).

  • Integration: A simple C API (callconv(.c)) allows for painless pointer exchange with NumPy/PyTorch libraries.
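To make the FFI part concrete, here is a minimal sketch of what the Python side of such a bridge can look like with ctypes. The library path and the exact `zig_mm` signature below are assumptions (row-major f32 buffers plus three dimensions); adjust them to match the actual exported symbol.

```python
import ctypes

def as_c_f32(flat):
    """Copy a flat Python list into a contiguous C float buffer."""
    return (ctypes.c_float * len(flat))(*flat)

def load_zig_mm(path="./zig-out/lib/libzigtorch.so"):  # path is an assumption
    """Load the shared library and declare the assumed C-ABI signature:
    zig_mm(a, b, out, M, K, N) with row-major f32 buffers."""
    lib = ctypes.CDLL(path)
    lib.zig_mm.argtypes = [ctypes.POINTER(ctypes.c_float)] * 3 + [ctypes.c_size_t] * 3
    lib.zig_mm.restype = None
    return lib

def mm(lib, a, b, M, K, N):
    """Multiply an MxK by a KxN matrix via the loaded kernel; returns a flat MxN list."""
    A, B = as_c_f32(a), as_c_f32(b)
    out = (ctypes.c_float * (M * N))()
    lib.zig_mm(A, B, out, M, K, N)
    return list(out)
```

Because the Zig side uses a plain C calling convention, no binding generator is needed; the whole contract is a handful of pointers and sizes.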

Performance (The Elephant in the Room):

Right now, my implementation beats pure NumPy in certain scenarios, but going up against PyTorch’s optimized C++ backend (utilizing MKL/BLAS) is a tough battle.

For 1024x1024 matrices, Zig currently lags behind Torch, which really highlights just how complex numerical engineering is “under the hood.” Despite this, for smaller operations the results are promising, and the satisfaction of writing my own kernel is huge.
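For readers unfamiliar with cache blocking, the loop structure of a tiled matmul can be sketched in plain Python (the real kernel is Zig with `@Vector`; this only illustrates the blocking decomposition, and the tile size `T` here is arbitrary):

```python
def mm_naive(A, B, M, K, N):
    """Reference MxK @ KxN multiply over flat row-major lists."""
    C = [0.0] * (M * N)
    for i in range(M):
        for k in range(K):
            a = A[i * K + k]
            for j in range(N):
                C[i * N + j] += a * B[k * N + j]
    return C

def mm_tiled(A, B, M, K, N, T=4):
    """Same product, but iterating over T x T blocks so the working set
    of each inner loop nest stays small enough to live in cache."""
    C = [0.0] * (M * N)
    for i0 in range(0, M, T):
        for k0 in range(0, K, T):
            for j0 in range(0, N, T):
                # min() handles the "edge" blocks when T doesn't divide the dims.
                for i in range(i0, min(i0 + T, M)):
                    for k in range(k0, min(k0 + T, K)):
                        a = A[i * K + k]
                        for j in range(j0, min(j0 + T, N)):
                            C[i * N + j] += a * B[k * N + j]
    return C
```

The blocked version computes exactly the same result; the win in the native kernel comes from reusing each loaded tile many times before it is evicted.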

Roadmap:

  • Implement multithreading (currently running single-threaded).

  • Better handling of “edge” cases in memory blocking.

  • Refactoring comments (some are currently in Polish; I’m migrating everything to English).
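On the multithreading item: one common scheme is to partition the output rows across workers, since each row of C can be computed independently with no synchronization. A Python sketch of that partitioning (Python threads won't actually speed this up because of the GIL; in Zig the same split maps onto `std.Thread`):

```python
import threading

def mm_rows(A, B, C, M, K, N, r0, r1):
    """Compute rows [r0, r1) of C = A @ B (flat row-major lists)."""
    for i in range(r0, r1):
        for k in range(K):
            a = A[i * K + k]
            for j in range(N):
                C[i * N + j] += a * B[k * N + j]

def mm_parallel(A, B, M, K, N, workers=4):
    """Split output rows into contiguous chunks, one thread per chunk.
    Each thread writes a disjoint slice of C, so no locking is needed."""
    C = [0.0] * (M * N)
    step = (M + workers - 1) // workers
    threads = [
        threading.Thread(target=mm_rows, args=(A, B, C, M, K, N, r, min(r + step, M)))
        for r in range(0, M, step)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C
```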

The code is available here: https://github.com/kitajusSus/zig-torch

I’d love to get some feedback on the project structure and build.zig, as well as any tips on how to squeeze more out of @Vector for matrix multiplication.

Cheers!


Zig wins against NumPy alone

[!IMPORTANT]
Benchmarked without Torch installed, NumPy only

python tests/benchmark.py
Size M×K × K×N            Torch (ms)   NumPy (ms)   Zig (ms)   Zig vs Torch   Zig vs NumPy   Correct
----------------------------------------------------------------------------------------------------
32×32 × 32×32                    n/a        0.019      0.018            n/a          1.04x      True
64×64 × 64×64                    n/a        0.143      0.055            n/a          2.62x      True
128×128 × 128×128                n/a        1.178      0.358            n/a          3.29x      True
256×256 × 256×256                n/a        9.086      2.778            n/a          3.27x      True
512×512 × 512×512                n/a       69.996     23.019            n/a          3.04x      True
1024×1024 × 1024×1024            n/a      553.027    196.582            n/a          2.81x      True
1024×512 × 512×256               n/a       72.071     22.113            n/a          3.26x      True

4 Likes

A bold project to take on. There’s so much in this space that is non-obvious, especially with regard to speed. Memory layout and access patterns that best utilise caches are key. It’s why libraries like ATLAS have tuning runs before compilation.

One thing that struck me is the goal of making the library available as a plug-in to PyTorch. I know that’s where a lot of numerical code is, so it makes a lot of sense. However, I clicked on the thread hoping you were actually doing the opposite: making a Zig version of the PyTorch layer, so that workloads could be easily implemented in Zig. That is, I think, a huge task though.

3 Likes

Thanks for the feedback! You’re right that a native Zig PyTorch would be a massive undertaking.

For now, my main goal with this project is learning and ecosystem interoperability. I believe Zig’s strength lies in how well it can bridge the gap between high-level languages like Python and low-level performance. By building this as a ‘testing ground’ for FFI and SIMD optimization, I’m trying to understand the exact bottlenecks you mentioned—like memory layout and cache access patterns.

I’d rather contribute to the multi-language landscape first, where Zig acts as a high-performance engine for existing ecosystems. It’s a great way to learn the language’s nuances while solving real-world performance problems (before Zig, I never even thought about them). As the project matures and I get a better handle on these low-level optimizations, moving towards a more standalone Zig library is definitely on the horizon. But this journey doesn’t have a fixed endpoint; this is my sandbox.

BUT I’M HAPPY TO SEE SOME PULL REQUESTS!

I agree with you. I hope there will eventually be a Zig project similar to burn/candle in Rust.

1 Like

I had a (mandatory) university course which also went into matrix optimisations (for very large matrices which (with a naive implementation) wouldn’t even fit into one’s RAM).

Maybe look up different types of storage formats for matrices (there are a LOT; the ones I needed to understand during that course were Compressed Row Storage, Compressed Column Storage, Block Compressed Row Storage and Compressed Diagonal Storage, but there are way more).
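For anyone curious what Compressed Row Storage looks like in practice, here is a minimal sketch (plain Python for clarity; the same three-array layout translates directly to Zig slices). CRS stores only the nonzeros, their column indices, and per-row offsets, so a matrix-vector product skips all the zeros:

```python
def to_csr(dense, M, N):
    """Convert a flat row-major dense matrix to CSR arrays
    (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for i in range(M):
        for j in range(N):
            v = dense[i * N + j]
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end-of-row offset into values
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x, M):
    """y = A @ x touching only the stored nonzeros of each row."""
    y = [0.0] * M
    for i in range(M):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[p] * x[col_idx[p]]
    return y
```

Compressed Column Storage is the same idea transposed, and the blocked/diagonal variants trade generality for denser, more cache-friendly inner loops.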

1 Like

That’s amazing, thanks a lot. I’ll check it out after my midterms.