Making WasmAllocator thread-safe

This is related to an earlier post. A few days ago, a commit to the ziglang repo removed WasmPageAllocator. It also added an `if (!builtin.single_threaded) @compileError(...)` check to WasmAllocator.

I looked at the code a bit today and realized that enabling multi-threading is actually non-trivial. The main challenge is that in the web browser, we cannot block the main thread. We cannot use a mutex to protect the variables employed by the allocator. We only have atomic operations at our disposal.
Can you not just implement a spin-lock with atomics and call it a day?

In theory, this should be sufficient:

var spinlock = false;

fn acquireLock() void {
    // @cmpxchgWeak returns null on success, so spin until we flip false -> true.
    while (@cmpxchgWeak(bool, &spinlock, false, true, .acquire, .monotonic) != null) {}
}

fn releaseLock() void {
    @atomicStore(bool, &spinlock, false, .release);
}

Then:

fn alloc(ctx: *anyopaque, len: usize, log2_align: u8, return_address: usize) ?[*]u8 {
    _ = ctx;
    _ = return_address;
    acquireLock();
    defer releaseLock();

and

fn free(
    ctx: *anyopaque,
    buf: []u8,
    log2_buf_align: u8,
    return_address: usize,
) void {
    _ = ctx;
    _ = return_address;
    acquireLock();
    defer releaseLock();

I put together a little demo to test out the spinlock approach. It spawns 16 threads that continually allocate and free random amounts of memory. Meanwhile, the main thread does the same four times a second in a JavaScript setInterval() callback.

Here’s the Zig portion:

const std = @import("std");
const zigar = @import("zigar");

const Reporter = fn (usize, usize) void;
const Promise = zigar.function.Promise(void);
const Signal = zigar.function.AbortSignal;
const allocator = zigar.mem.getDefaultAllocator();

pub fn runTest(count: usize, iterations: usize) !usize {
    var prng = std.Random.Xoshiro256.init(count);
    var random = std.Random.init(&prng, std.Random.Xoshiro256.fill);
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        const random_amount = random.uintAtMost(usize, 8 * 1024);
        const slice = try allocator.alloc(u8, random_amount);
        allocator.free(slice);
    }
    return count + i;
}

pub fn startThreads(count: usize, iterations: usize, reporter: *const Reporter, signal: Signal, promise: Promise) !void {
    const multipart_promise = try promise.partition(allocator, count);
    for (0..count) |i| {
        const thread = try std.Thread.spawn(.{
            .allocator = allocator,
            .stack_size = 64 * 1024,
        }, threadFn, .{
            i + 1,
            iterations,
            reporter,
            signal,
            multipart_promise,
        });
        thread.detach();
    }
}

fn threadFn(id: usize, iterations: usize, reporter: *const Reporter, signal: Signal, promise: Promise) void {
    var count: usize = 0;
    while (signal.off()) {
        if (runTest(count, iterations)) |new_count| {
            count = new_count;
            reporter(id, count);
        } else |_| {
            break;
        }
    }
    promise.resolve({});
}

And here’s the React useEffect hook where these functions get called:

  useEffect(() => {
    const { signal } = controller;
    const report = (id, count) => {
      setCounts((prevCounts) => {
        const newCounts = prevCounts.slice();
        newCounts[id] = count;
        return newCounts;
      });
    };
    startThreads(threadCount, 5000, report, { signal }).then(() => setTerminated(true));
    var mainCount = 0n;
    const interval = setInterval(() => {
      const start = performance.now();
      mainCount = runTest(mainCount, 100);
      const end = performance.now();
      setDuration(Math.ceil(end - start));
      report(0n, mainCount);
    }, 250);
    signal.addEventListener('abort', () => clearInterval(interval));
  }, []);

The demo is here, on Cloudflare. No effort was put into the layout. It’s just a list showing the number of allocations done by each thread. The item for the main thread also shows the time it took to run the test. I added a couple of input controls at the bottom so you can test the responsiveness of the UI. If you open Chrome’s Task Manager, you’ll see the Web Workers representing the individual threads in action. The demo should push your CPU to 100% utilization, unless you have some ungodly system.

The spinlock approach seems to be perfectly reasonable. One concern I had initially was that the main thread would end up spinning for too long, leaving it unable to update the UI in a timely fashion. That doesn’t seem to happen. Even with the lock under constant contention, the main thread usually manages to do a hundred allocations and frees (which is kinda excessive) in less than 1 ms.

My laptop is only rocking 2C/4T, so the level of concurrency I can test is rather limited. I’d really like to get some feedback from those of you with beefier rigs.