What is the best way to implement embarrassingly parallel loops with the upcoming IO system and a problem

minkatter · December 27, 2025, 9:00pm

On the topic of existing work in C++, part of the “evil” C++ developer in me wishes Zig had an answer for HPC. I am thinking of the likes of a Kokkos abstraction (e.g., execution policies, parallel algorithms, and views). But, I imagine that type of framework hides too much control flow and logic in the eyes of the Zig ethos.

IME with HPC compute, you really want to be able to express work not as fine-grained tasks but as logical pipeline of operators with some fairly uniform distribution of data. My feeling is that std.Io will either introduce too much overhead, or feel like fighting the abstraction to express things in that way.

I could totally be wrong though; I plan on trying out std.Io.Threaded in the context of compute-heavy code when it is more stable.

vulpesx · December 28, 2025, 2:02am

If performance matters enough for you, then you shouldn’t be using std.Io for concurrency, it is tailored for non-blocking/asynchronous io operations, not compute performance.

It is a convenient way to get more performance, at the behest of the caller. But you shouldn’t stick with it if it’s too limiting.

On the other hand, they are looking for feedback on it, so it could/will get better, but I think that will just move the bar, not eliminate it, for switching to something more tailored to your use case.

andrewrk · December 28, 2025, 3:21am

As stated this is a judgment call - I’m sure you’re right that std.Io falls short of being optimal in various cases in which case users could better achieve their goals today by avoiding it.

However, I want to clarify that std.Io is absolutely, 100% intended for concurrency, and it is intended for optimal compute performance. So, please, anyone who wants to help out with Zig development, do use it for this use case, and make sure there is a nice issue open to track any situation where it cannot produce optimal programs. Addressing the lack of a way to determine ideal work batch size (i.e. the original post in this thread) would be a great start.

iceghost · December 28, 2025, 6:21am

matklad:

Translating to Zig,
const CpuPool = struct {
    // Runs an instance of `function` on all available CPU cores in parallel.
    fn multiplexJob(
        pool: *@This(),
        function: anytype,
        args: std.meta.ArgsTuple(@TypeOf(function)),
     ) void { ... }
}
This works for any sufficiently coarse-grained problem, and is massively simpler to use or implement than cilk-style fork-join (if you need the latter, check out GitHub - judofyr/spice: Fine-grained parallelism with sub-nanosecond overhead in Zig)

Augmenting this is this post I have seen: Multi-Core By Default - by Ryan Fleury - Digital Grove

The idea is the same: functions/code should be written for multi-core by default. They called those functions “wide”. When you need stuff that must be run on a single core, go from wide to narrow with an if block.

The interesting part I found in that article is how to do synchronization across threads (or lanes, using the article words):

Each thread participating in doing the work gets assigned a lane index (0 → lane count). Calling them lane because there can be multiple lane 0 across multiple multiplex works.
Introduce a sync primitive: laneSync(value, lane_index)
- If current lane == lane_index, copy the value to a known shared buffer…
- … and if current lane != lane_index, ignore the passed in value, copy the value from the known shared buffer and return the value.
- Wait for all threads to reach this point.

For the OP function, it can be written as:

/// Pseudo code assuming those are present:
///
/// - io.laneIdx(): What index are we in the group?
/// - io.laneCount(): How many of us is in the group?
/// - io.laneSync(value, lane_idx): broadcast a value from
///   a specified lane to all lanes in the same group.
///
/// If one lane fails, those functions returns error.Cancelled?
/// I don't know!
///
fn run(io: Io, gpa: Allocator, items: []const Item) !void {
    const main_lane_idx: usize = 0;

    // main_lane is in charge of allocating the buffers
    const main_lane_buffers = if (io.laneIdx() == main_lane_idx)
        try allocBuffers(allocator, io.laneCount());
    else
        undefined;

    // main_lane own the buffers, so only it do the deinit.
    defer if (io.laneIdx() == main_lane)
        buffers.deinit(allocator);

    // copy the buffers slice from main_lane to this lane
    const buffers = try io.laneSync(main_lane_buffers, main_lane_idx);
    if (io.laneIdx() == main_lane) assert(buffers == main_lane_buffers);

    // all threads now has the same buffers pointer
    // use our lane index to get the correct buffer
    const our_buffer = buffers.get(io.laneIdx());

    var main_lane_counter: u64 = 0;
    const counter: *u64 = try io.laneSync(&main_lane_counter, main_lane_idx);

    comptime errdefer unreachable;

    // all threads now has the same counter ptr,
    // pointing to main_lane stack counter

    while (true)  {
        var index = @atomicRmw(u64, counter, .Add, 1, .monotonic);
        if(index >= items.len) return;
        doWork(items[index], our_buffer);
    }

    // wait for all lanes in the same group to reach here
    // maybe a dedicated laneWait()...
    try io.laneSync({}, main_lane_idx);

    if (io.laneIdx() == main_lane_idx)
        // main_lane might do reporting
        // or return the aggregated result somehow
    }
}

The funny thing is that, the function signature is now free of all the synchronization / multi-core stuff

fn run(io: Io, gpa: Allocator, items: []const Item) !void;

It looks exactly what you would write in single core code as well!

If you imagine io.laneCount() == 1 and io.laneIdx() == 0, which means this function runs on one core only, the function still works!

The part that article handwaved through is actually how to incorporate multiple of those wide functions. I think Zig I/O abstraction can help here, something like:

io.batch(wide_function, args);

That executes a wide function on all available parallelism units, and populate the io.laneIndex() and io.laneCount() for the group. Available here means scheduling is involved, it might be less than core counts if it is too busy, or there can be a params to request how much paralellism we need.

(and return value? maybe just take the return value from the main_lane == 0 or something)

I also found out that Zig already has some builtin that is very much similar: @workItemId, @workGroupSize

minkatter · December 30, 2025, 6:02am

Aren’t those builtins for the SPIR-V backend?

iceghost:

The part that article handwaved through is actually how to incorporate multiple of those wide functions. I think Zig I/O abstraction can help here, something like:
io.batch(wide_function, args);

My take is that a loose view could be provided by std.Io over the items for “wide” functions. I put together a rough sketch of a batch API that could provide configurable distributions of work: uniform split (buffer size / n workers), lanes (@Vector friendly), or some custom tile/block size (GPU?).

This removes the property of being able to have a normal parameter list for a work function, but my argument is that it is actually good to express the fact that it is parallel, since synchronization may be necessary.

vulpesx · December 30, 2025, 7:39am

This is my opinion as well.

You beat me to making a sketch (i forgor), but your code can be simplified considerably.

Generally it’s good to give the caller control over the batching, but the code may also have its own requirements, so it might be worth giving the worker some control to make its code more ergonomic and get free asserts.

minkatter · December 30, 2025, 8:06am

Thanks. I tend to type out the overly verbose thing and simplify later — yours is much simpler.

I also think there is an opportunity to expose a SoA Batch, or maybe even some advanced order like Morton ordering. There is sort of a parallel to be had with the ArrayList containers.

Not sure how much pull the standard library is trying to have for the niche tricks used in a lot of compute code, though.

kristoff · December 30, 2025, 10:51am

From my perspective the design question we’re starting at right now is about coordination vs lack of coordination, in the following sense: async I/O is primarily a system that allows you to make your program more “elastic” with regards to I/O variability at runtime.

Evented I/O and lightweight task switching helps your program handle in a better way situations where you have a ton of clients, or some clients are more chatty than others, or some db queries have bigger replies than others, etc.

In contrast, some processing jobs have very well defined pipelines with a dramatically smaller level of runtime variability. One example of this is Zine, my static site generator. Some variability is still there (you could have a site with lots of small sections or one with fewer but bigger sections, or a site with small sections except for a gigantic one), but you can observe the shape of your workload as you prepare the pipeline and make optimal choices at that point (something that you can’t do with a webserver).

In this context, I would define the earlier example (a server with many clients) a highly “asynchronous” program because it has to deal with clients, which are naturally asynchronous to one another and thus cause a lot of uncoordinated activity to take place in the server.

The static generator example instead is much more “synchronous” in the sense that you don’t get many surprises at runtime, and once an initial pipeline setup phase is done, then it’s just a matter of sending down tasks down the right pipe as fast as possible (i.e. you can coordinate very precisely the activity happening in your pipelines).

Async I/O can serve both cases, but obviously the more coordination is present in the system, the less async I/O becomes the right tool for the job.

As people have pointed out , if you want to do a for* (same function, multiple arguments, essentially software SIMD) then the APIs currently offered by async I/O start falling short.

This, to me, means that maybe there should be separate APIs for highly-coordinated operations and that we should understand that async I/O and squeezing performance out of a perfect pipeline are two concepts partially at odds.