What's the best way to split up work with Io?

So I am building a fzf/fzy clone called fz, mostly for learning purposes and because it's fun. A lot of the code is based on fzy (like the algorithm), but I am trying to learn more about asynchronous programming and am wondering what would be the best way to split up the work of updating the scoring using Zig's new async Io. Is async Io a good idea here, or should I use a thread pool instead?

Currently I tried using an Io.Group, but performance seems to degrade on small item sets; I haven't measured the difference for large ones yet.

diff --git a/src/Match.zig b/src/Match.zig
index a3c51f9d7c..84db8e5928 100644
--- a/src/Match.zig
+++ b/src/Match.zig
@@ -1,6 +1,7 @@
 const std = @import("std");
 const build_options = @import("build_options");
 const util = @import("util.zig");
+const Io = std.Io;
 
 const Allocator = std.mem.Allocator;
 const assert = std.debug.assert;
@@ -73,7 +74,7 @@
 // TODO: update concurrently
 /// Returns a slice into matches and update each match score
 /// window is not allocated
-pub fn updateMatches(gpa: Allocator, search_str: []const u8, matches: []Match) ![]const Match {
+pub fn updateMatches(io: Io, gpa: Allocator, search_str: []const u8, matches: []Match) ![]const Match {
     if (search_str.len == 0) {
         // restore to original
         Match.sortMatches(matches, Match.orderByIdx);
@@ -84,10 +85,21 @@
     var buf: [MAX_SEARCH_LEN]u8 = undefined;
     const needle = util.lowerString(&buf, search_str);
 
-    for (matches) |*match| {
+    var group: Io.Group = .init;
+    defer group.cancel(io);
+
+    const chunk_size = 1024;
+    var remaining = matches[0..];
+    while (remaining.len >= chunk_size) {
+        group.async(io, updateChunk, .{ gpa, remaining[0..chunk_size], needle });
+        remaining = remaining[chunk_size..];
+    }
+    for (remaining) |*match| {
         try match.updateScore(gpa, needle);
     }
 
+    group.wait(io);
+
     Match.sortMatches(matches, Match.orderByScore);
 
     var len: usize = 0;
@@ -98,6 +110,12 @@
     return matches[0..len];
 }
 
+pub fn updateChunk(gpa: Allocator, chunk: []Match, needle: []const u8) void {
+    for (chunk) |*match| {
+        match.updateScore(gpa, needle) catch {};
+    }
+}
+


I’m not familiar with the algorithm, but if it can be properly chunked then yes, you should see some performance improvement, specifically in cases where the data set is large enough that thread initialization and synchronization cost less than the time it would have taken to run single-threaded.

You could use Zig's Io interface to help you, or you could use a thread pool, or you could write your own.
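For comparison with the Io.Group diff above, here is a rough sketch of the thread-pool route using `std.Thread.Pool` and `std.Thread.WaitGroup`. This is only an illustration, assuming a recent Zig std (these APIs have shifted between versions), and `scoreChunk` is a hypothetical stand-in for the real scoring work:

```zig
const std = @import("std");

// Hypothetical stand-in for the real per-chunk scoring function.
fn scoreChunk(chunk: []u32) void {
    for (chunk) |*item| item.* *= 2;
}

pub fn main() !void {
    var items = [_]u32{ 1, 2, 3, 4, 5, 6, 7, 8 };

    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = std.heap.page_allocator });
    defer pool.deinit();

    // The WaitGroup plays the role that Io.Group's wait() plays
    // in the diff above: block until all spawned chunks finish.
    var wg: std.Thread.WaitGroup = .{};
    const chunk_size = 4;
    var rest: []u32 = items[0..];
    while (rest.len > 0) {
        const n = @min(chunk_size, rest.len);
        pool.spawnWg(&wg, scoreChunk, .{rest[0..n]});
        rest = rest[n..];
    }
    pool.waitAndWork(&wg);
}
```

The shape is the same as the Io.Group version: slice the work into chunks, hand each chunk off, then block on a single join point before sorting.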

If you don't use Io you won't have Group, which means you will need to implement your own barrier and other synchronization.

Really I would say try both. I've done a bit with the new Io stuff and it's good, but incomplete, and because of that you'll still be hand-rolling a lot of solutions, just without proper documentation while changes happen underneath you.


Basically the algorithm searches for a needle in a haystack, so a bunch of haystacks can be chunked together. Apparently there is a way to implement it with SIMD, but I haven't grasped that one yet.
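In case it helps with the SIMD part: the basic building block in Zig is `@Vector`. This is not the fzy algorithm itself, just a minimal sketch of the load/compare/reduce pattern a SIMD matcher would build on:

```zig
const std = @import("std");

/// Check whether byte `c` occurs anywhere in a 16-byte block,
/// comparing all 16 lanes at once instead of byte by byte.
fn blockContains(block: *const [16]u8, c: u8) bool {
    const V = @Vector(16, u8);
    const hay: V = block.*;
    const needle: V = @splat(c);
    // `hay == needle` yields a vector of bools; OR-reduce it to one bool.
    return @reduce(.Or, hay == needle);
}

pub fn main() void {
    const hay = "the quick brown!".*; // exactly 16 bytes
    std.debug.print("{}\n", .{blockContains(&hay, 'q')});
}
```

A real matcher would iterate over the haystack in aligned blocks and handle the tail separately, which ties into the alignment advice below.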

I have been playing around with the new Io interface while building a server, and the best approach was spawning worker threads waiting on an Io.Queue for handing off incoming requests.

I tried Io.Group, but it seems the implementation of groupAsync in Io.Threaded has a limit on the number of threads that can be spawned. I.e., if Io.Threaded can't launch a new thread, it will execute the given function immediately. Since I want a bunch of workers waiting for work on a queue, new workers got blocked from spawning. But it seems like group.concurrent, which just landed, will fix that.

I would approach this problem by chunking the data into as many equal chunks as there are threads available on your CPU (with the last thread possibly getting the remainder). Each thread should operate on its own contiguous region of memory, and if you want to use SIMD, make sure your memory regions have the proper alignment; push any extra data that would misalign them to the last thread, where you handle that edge case.
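The chunking arithmetic described above can be sketched like this (a sketch only; the loop body just prints where the real code would spawn a worker on `matches[start..end]`):

```zig
const std = @import("std");

pub fn main() !void {
    const n = try std.Thread.getCpuCount();
    const total: usize = 10_000; // stand-in for matches.len
    const base = total / n;

    var i: usize = 0;
    while (i < n) : (i += 1) {
        const start = i * base;
        // Last thread absorbs the remainder, as suggested above.
        const end = if (i == n - 1) total else start + base;
        std.debug.print("thread {d}: [{d}..{d})\n", .{ i, start, end });
        // spawn a worker on matches[start..end] here
    }
}
```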

You could use Io.Group for this; just spawn a function that takes the proper data for each memory region to do its work on.

It's my recommendation to always provide threads with contiguous regions of memory when possible, versus having a pool that picks up the next n bytes that need work done. You get some good optimizations out of contiguous, aligned memory.


Give event-driven state machines a try.

GitHub - dee0xeed/xjiss: Just Intonation Synthesizer (X11+ALSA)

The principle is very simple: use a state/stage for any kind of client-server message.

Thank you for the suggestion, will look into that!

Yeah, this is more or less the way I went with it. There is still some performance gain on the table, but it's at least not dead slow compared to fzf and fzy.