Integrating Zig in a multi-language project?

I’ve got a project at work with the following key development dependencies:

  1. linux environment (wsl (ubuntu) or linux (ubuntu))
  2. uv (for python)
  3. docker
  4. ansible (via uv)
  5. < 10 other python packages (ruff, ty, etc)

So I’ve got a shell script (bootstrap.sh) that installs uv and docker on developer machines, plus a top-level, shell-script-style entry point implemented in python called work.py.

Developers call various functions in the script:

  • uv run work.py deploy something is used by CI for deployment (ansible dependency provided by uv)
  • uv run work.py test runs on developer machines and CI (calls pytest)
  • uv run work.py format (just shells to ruff)
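For context, the shape of work.py is roughly this — a hypothetical sketch, the real subcommands, playbook paths, and function names differ:

```python
# Hypothetical work.py-style task runner (subcommands and paths are
# illustrative, not the actual project code).
import argparse
import subprocess

def deploy(args):
    # CI path: shell out to ansible (provided by uv) for the chosen env.
    subprocess.run(["ansible-playbook", f"deploy/{args.env}.yml"], check=True)

def run_tests(args):
    subprocess.run(["pytest"], check=True)

def run_format(args):
    subprocess.run(["ruff", "format", "."], check=True)

def build_parser():
    parser = argparse.ArgumentParser(prog="work.py")
    sub = parser.add_subparsers(dest="command", required=True)

    p_deploy = sub.add_parser("deploy")
    p_deploy.add_argument("env")  # e.g. `deploy staging`
    p_deploy.set_defaults(func=deploy)

    sub.add_parser("test").set_defaults(func=run_tests)
    sub.add_parser("format").set_defaults(func=run_format)
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    args.func(args)
```

Each subcommand just shells out, so the script stays a readable dispatcher rather than a build system.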

The target environments are low-tier linux with docker installed (< 8 GB ram, < 4 CPUs).

In this system, uv is my build system (caching my dependencies), but work.py is not a build system; it’s just something more readable to me than bash. It’s only about 1000 lines and can deploy to 10 different environments, etc.

So right now it’s a single-language project, everything works, and the development experience is quite pleasant. Onboarding new developers is just bootstrap.sh, and uv keeps everything else aligned (except the docker version), but we have hit some performance bottlenecks that can really only be solved with a compiled language in a few small key areas. My team’s highest proficiency is python, by far, and I’m considering integrating Zig for the truly performance-critical subsystems.

Heavy on my mind is the following talk from @matklad https://www.youtube.com/watch?v=jVC4DP-8xLM

The talk tells me to stick to the tools I have:

  • uv and docker

but that conflicts with the performance limitations I have extensively fought with python profilers etc. After a while of fighting the profilers, I end up writing python code that looks horrible and is just hard to understand and maintain (batching updates to dictionaries, little caching tricks, reducing function-call overhead, etc…).
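To illustrate the flavor of it, here is a toy version of the kind of trick I mean — hypothetical code, not from the real codebase:

```python
# Straightforward version: one dict update per record.
def tally_clear(records):
    counts = {}
    for key, value in records:
        counts[key] = counts.get(key, 0) + value
    return counts

# Profiler-tuned version: hoist the bound method out of the loop to cut
# per-iteration attribute-lookup overhead. Same result, less readable,
# and this is one of the milder tricks.
def tally_fast(records):
    counts = {}
    get = counts.get  # avoid re-resolving counts.get on every iteration
    for key, value in records:
        counts[key] = get(key, 0) + value
    return counts
```

Multiply that by every hot loop and the code stops reading like the problem it solves.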

My question is if and how I should integrate Zig?
My first project would wholly replace an existing python microservice with zig, reducing it from a 1 GB docker image to maybe 20 MB and reducing its dependencies to only Zig and a well-maintained c library.

  • Use zig as my build system (shelling out to system-installed uv and docker)?
  • Keep my existing fairly simple build script, add the zig python package, and write python modules in zig?
  • Don’t integrate zig at all, and just buy bigger machines (lots of room here to upgrade the targets)?

Completely off the table is using bazel. It’s just too hard to understand and teach to others (even though I’m comfortable using it myself).

Hi,

from your description I cannot really tell where the performance issues come from.

On one hand you mention the build system, which I picture as slow builds during development and deployment, and on the other you mention replacing a microservice and the target system’s capabilities, which sounds like a runtime problem.

Maybe you can draw a small diagram of the services with draw.io, PlantUML, or Graphviz; without one, my comments will be rather vague.

Also my experience with python (and also zig) is limited so take it with a grain of salt.

So for microservices, what you usually want is a statically typed, (mostly) compiled language that hits a sweet spot across:

  • compile time

  • run time

  • developer productivity (IDE / Extensions / simplicity, etc)

  • battle tested libraries (especially your domain and communication)

IMHO this is either go, c# or java, as all of them hit that sweet spot, whereas I would argue that ruby, python, rust, zig etc. are probably lacking on one point or another. This doesn’t mean that it will not work, it’s just not the sweet spot.

If you have performance issues with those languages, you may be doing something wrong, or you have a very special use case. If you are just handling a little data, LOB-style, a beefy machine should not be required; the cheapest VMs should be able to do it.

In my experience those microservices are best coordinated via a message broker like NATS, or more heavy-handed solutions like RabbitMQ. If you have perf issues due to communication, consider switching to a packed format, e.g. from REST/JSON to gRPC / MessagePack etc.
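To make the packed-format point concrete, here is a rough stdlib-only Python comparison, with struct standing in for MessagePack/protobuf and an invented record layout:

```python
import json
import struct

# An invented telemetry record: sensor id, unix timestamp, reading.
record = {"sensor": 42, "ts": 1700000000, "value": 21.5}

text = json.dumps(record).encode()  # self-describing, human-readable
packed = struct.pack(
    "<IQd",                         # u32 id + u64 timestamp + f64 value
    record["sensor"], record["ts"], record["value"],
)

# The fixed binary layout is 20 bytes; the JSON form is more than twice
# that, before even counting framing and parsing cost.
assert len(packed) == 20
assert len(packed) < len(text)
```

Real packed formats also carry schemas and versioning, but the size and parse-cost gap is the same idea.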

If a single microservice is the actual bottleneck, just rewrite that one and plug it in, so you can actually take advantage of the microservice architecture.

For deployments and configuration, I usually also go the ansible route. I would not do a custom bash script; I would spell it all out in ansible, and for convenience I stick that into semaphoreui so that anybody can trigger deployments with ease.

The downside to ansible is that it’s not parallelized and pretty slow, but installations are often exclusive anyway (no two installers running at the same time) and IO-bound, so that’s usually not an issue.

If you know you have a specific bottleneck due to a special cpu / memory intensive task, by all means rewrite that part in zig and integrate it into your solution.

just my 2 cents

Take care,

Martin


Tricky question, because from a purely technical point of view, it’s a no brainer to just take the worst offender and rewrite it in anything other than python, and steer a better path forward for the team to grow into.

just do one as a proof of concept to demonstrate

  • That it works just fine
  • That the numbers are so much better
  • That it didn’t take too long to completely re-implement
  • That the overall architecture remains the same
  • That the resultant code isn’t out of reach for existing team members to work with

Assuming of course that you want components that can be debugged and maintained by any team members, in a pinch, given a bit of learning time to play with new tools.

That last point about adapting the single-language team to the multi-language tools is quite important to consider. So doing the first stage as a port to Go or C# or whatever is probably the least radical option to prove the concept, and not appear too alien.

it’s not so much a technical problem as it is a political / cultural problem you are facing here, tbh. If the team is mostly immersed in python and considers 1 GB docker images to be normal, then the struggle you are facing isn’t entirely a technical issue.

The path of least political resistance would be to advise management to just triple the hosting cost for everything, and let the bloat grow to keep everyone happy and working in familiar territory.

But if you want to go the better engineering approach (totally understandable- I would too), then yeah … bite off 1 manageable piece as a PoC, and demonstrate the benefits to the techs, and cost savings to everyone else.

Be prepared to put on the flame proof suit though, as your best efforts are bound to put some people offside unfortunately .. comes with the territory. It’s all do-able though.

Questions: what exactly are these python services doing? Is it a web app, or some sort of processing pipeline? How do they communicate? Are they all running on individual 4c/8gb nodes under k8s, or does one of these little servers host the entire app collection?


They are industrial control applications used in manufacturing and testing. So I’ve got small industrial computers in some cabinets that serve a web GUI, do some data collection, write to a remote database, do some real-time controls.

The proof-of-concept compiled application is the data collection part. It subscribes to data from other applications running on the controller, repackages / batches it into a slightly different format, and uploads it to the remote database. It consumes a full CPU (1 of 4) that I could be using to let the GUI handle more than 2 clients, or to leave me more room to protect the real-time processes on dedicated CPUs.
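The core of that service is essentially a batching loop. A stripped-down Python sketch of the shape (class name and thresholds are invented, not the real code):

```python
import time

class Batcher:
    """Accumulate incoming records; flush in batches instead of one
    database write per message."""

    def __init__(self, flush, max_items=500, max_age_s=1.0):
        self.flush = flush            # callback that uploads one batch
        self.max_items = max_items
        self.max_age_s = max_age_s
        self._buf = []
        self._first_ts = None         # monotonic time of oldest buffered record

    def add(self, record, now=None):
        now = time.monotonic() if now is None else now
        if not self._buf:
            self._first_ts = now
        self._buf.append(record)
        # Flush on either a size threshold or an age threshold, so slow
        # periods don't hold data back indefinitely.
        if len(self._buf) >= self.max_items or now - self._first_ts >= self.max_age_s:
            self.flush(self._buf)
            self._buf = []
```

It is exactly this kind of tight, per-message loop that the compiled rewrite would replace.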

Sweet, that sounds pretty interesting.

Given what you have said there, it sounds like it’s mostly custom in-house code, and not so much your avg web app. It also suggests you have highly predictable loads, again unlike your avg public-facing web app.

Might be a good case for zig latest then, especially if there are some C libs you need to include. Go, for example, is great at prototyping networked apps, but interfacing with C libs is a genuine pita with go.

Maybe you could do it in Io.threaded mode (which is still super efficient), and shuffle over to Io.evented when the networking lands to gain even better cpu utilisation. Worst case with threaded - an acceptor thread to ingest data, which should be mostly idle (?), and a processing thread to crunch the numbers. Maybe.
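That two-thread shape is language-agnostic; a rough Python sketch of the same structure (names invented), just to show the idea:

```python
import queue
import threading

def run_pipeline(messages, process):
    """Acceptor/worker shape: the calling side plays the acceptor and
    enqueues raw messages; a worker thread crunches them. `messages`
    stands in for whatever the acceptor reads off the network."""
    q = queue.Queue()
    results = []

    def worker():
        while True:
            item = q.get()
            if item is None:        # sentinel: no more input
                break
            results.append(process(item))

    t = threading.Thread(target=worker)
    t.start()
    for m in messages:              # acceptor side: just enqueue
        q.put(m)
    q.put(None)
    t.join()                        # wait for the worker to drain the queue
    return results
```

In Zig the queue and threads would be explicit too, just without the interpreter overhead per message.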

Don’t know, but it might be feasible to do a PoC that covers 80% of the most common events, just to get it out quickly to show real numbers and cost saving estimates, and get further management approval to push it forward. You will need some allies along the way to help champion the idea for sure.

Should definitely be possible to write the code so that it reads well, and your avg non-zig dev can see what’s going on and follow the code. At least with zig 0.16, you can trim back a lot of the boilerplate setting up allocators and what not … and have a super clean-looking main.

Fun project - go for it !


Some ideas:

  • The docker image size should not be a big issue: the image (should) consist of multiple layers, and all of them except the last one (which contains your software) should stay the same and be cached. A docker pull then only retrieves that last layer. (If you really want, you can even reuse the same docker image with the python env and only volume-map the app code in.)

  • I would argue against the hardware upgrade, as these are industrial PCs that are probably expensive both for their performance and for following IPC standards, and you want them to be identical across your fleet, which makes maintenance easier. Also, the workload you mention does not sound out of the ordinary, and they should be able to handle it.

  • If I read you correctly, you are hosting a full-stack web UI in python inside a cabinet, so the browser rendering happens on a different machine? If so, this might be a bad fit and you are paying a performance premium for server-side rendering in python. In that case I would use svelte/vue/your favourite frontend framework to generate only static content (no server-side rendering) and put a go/c#/java (or maybe keep python here?) API on the machine, running in a dedicated process. nginx or similar serves the html/js/css and reverse-proxies to the app.

  • The real-time stuff sounds like a perfect fit for zig; if you are itching to introduce it, I would do it here. :slight_smile:
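On the layer-caching point, a minimal Dockerfile sketch — base image, paths, and commands are illustrative, not your exact setup:

```dockerfile
FROM python:3.12-slim

# Dependency layer: changes rarely, so pulls and rebuilds reuse the cache.
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen

# Application layer: the only layer that changes on a typical deploy.
COPY src/ ./src/
CMD ["uv", "run", "src/app.py"]
```

Ordering the rarely-changing dependency steps before the app code is what keeps the fat layers stable between deploys.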

Good luck!

Take care,
Martin

I think your question is very hard to answer, as it seems you already have a solution that you want to implement and are trying to find a problem where that solution fits. If you look at your system from a broader perspective, it might be easier to see what the best architecture for it would be.

  1. How do the microservices in your application interact? Are these glorified libraries that communicate via function calls and return values or are they communicating “online”?

That’s the most important question to answer, given the original post, because Zig → Python bindings are not a solution for a microservice architecture. In that type of architecture you would most likely have services that communicate over a language-agnostic protocol, in which case creating a server or client in Zig for that microservice should be trivial, especially since you are already using docker to run your services.

So I don’t think Python bindings are the correct approach for the architecture you presented us with, but in any case I recommend reading Packaging Zig as Python packages and Using the ziglang and setuptools-zig python extension to provide C extensions - #17 by Anthon

Since you own the build system of your users, you should be able to fix the issues that ruamel.yaml suffered from. I would still recommend building all the wheels, as Adria did in Packaging Zig as Python packages - #10 by adria.
