Atomic order and futex

chung-leong · May 15, 2024, 8:58pm

I’m working on an example involving the creation of a simple, multi-threaded HTTP server. I want the function that spawns the thread to be able to return any error encountered by the thread when it tries to listen on the IP address and port. For this purpose I’m using std.Thread.Futex.wait(). I initialize an atomic u32 to 0 and wait for the thread to change it to something else. Here’s the code:

    pub fn spawn(self: *@This()) !void {
        var futex_val = std.atomic.Value(u32).init(0);
        self.thread = try std.Thread.spawn(.{}, run, .{ self, &futex_val });
        std.Thread.Futex.wait(&futex_val, 0);
        if (self.last_error) |err| {
            return err;
        }
    }

    fn run(self: *@This(), futex_ptr: *std.atomic.Value(u32)) void {
        var listen_result = self.address.listen(self.listen_options);
        if (listen_result) |*server| {
            self.server = server;
        } else |err| {
            self.last_error = err;
        }
        futex_ptr.store(1, .release);
        std.Thread.Futex.wake(futex_ptr, 1);
        const server = self.server orelse return;
        while (true) {
            var connection = server.accept() catch |err| {
                self.last_error = switch (err) {
                    std.net.Server.AcceptError.SocketNotListening => null,
                    else => err,
                };
                break;
            };
            self.handleConnection(&connection);
        }
        self.server = null;
    }

I haven’t done any multithreaded programming in a long, long time. Back then we all have single-core CPUs so memory ordering is something I’m largely unfamiliar with. I wonder if I’m using this stuff correctly. Would really appreciate it if someone can point out any mistake I’m making. In particular, the use of ‘.release’ to store the value is a total guess on my part. No idea whether it’s necessary.

dude_the_builder · May 15, 2024, 10:04pm

If I remember corectly, I think using .release is right. You would use .acquire in reading-like operations and .release in write-like operations, so in this case it looks OK. This would place a guarantee that before reading the atomic value, any .release operations have to be completed so they can be seen. This is all from my recollection of when I read the Rust Atomics and Locks book, which is an excellent resource on this, but you have to be somewhat familir with Rust.

Question: If each thread is going to listen and accept, then you would be listening on multiple IP:Port combinations right? From what I know up til’ now, you can only listen and accept on an IP:Port combination from a single thread given the socket is a file descriptor and interacting with it from multiple threads would require a mutex?

chung-leong · May 15, 2024, 10:56pm

Is the flag even necessary for code compiled for x86? I’ve always thought that cache-coherency means that if this core is seeing one value, then all other cores would see the same value. This stuff really feels beyond my pay grade

The reuse_address flag is supposed to let multiple threads listen on the same address and port. Not sure if there’s any downside.

dude_the_builder · May 16, 2024, 4:09pm

4 posts were split to a new topic: On SO_REUSEADDR and SO_REUSEPORT Socket Options

LucasSantos91 · May 16, 2024, 3:15am

x86 guarantees an ordering that is stronger than .acq_rel but weaker than .seq_cst, so any call using .acq_rel or weaker will compile to the same code.

From “C++ Concurrency in Action 2nd edition”:

For example, on x86 and x86-64 architectures, atomic load operations are always the same, whether tagged memory_order_relaxed or memory_order_seq_cst

Page 368.

CPUs that use the x86 or x8664 architectures (such as the Intel and AMD processors common in desktop PCs) don’t require any additional instructions for acquire-release ordering beyond those necessary for ensuring atomicity, and even sequentially-consistent ordering doesn’t require any special treatment for load operations, although there’s a small additional cost on stores

Page 147.

Still, since you already have to consider whether you want .seq_cst vs .acq_rel , you might as well choose the most appropriate memory ordering, in case you ever want to compile to another target.

chung-leong · May 16, 2024, 9:06am

Thanks for the info. That explains my total ignorance on the subject. I have never done any low level programming outside x86. With the rising popularity of ARM, I guess I will need to think more about memory ordering.

Let me see if I’m understand the issue correctly. Suppose I had used .unordered in my code instead:

        if (listen_result) |*server| {
            self.server = server;
        } else |err| {
            self.last_error = err;
        }
        futex_ptr.store(1, .unordered);
        std.Thread.Futex.wake(futex_ptr, 1);

When the main thread wakes up, there’s no guarantee that it would see a value in self.server or self.last_error. The value could still be sitting in a hidden register somewhere. I have to use .release to force prior store operations to be committed to the L1 cache. Cache coherency then guarantees that other cores will see the same value.

LucasSantos91 · May 16, 2024, 10:36am

Yes, I believe that is correct.

chung-leong · May 16, 2024, 12:39pm

Thanks for the help!

nyc · May 17, 2024, 10:58pm

There is still a small issue. Acquire and Release are only guaranteed to work when they are paired. (You have a release without an acquire).

Think of them like a local transaction where release is like a commit of any memory writes done by that thread and guaranteed to be visible to any other thread that does an acquire on the same memory address.

While the hardware might provide those guarantees above this, that doesn’t mean the compiler does (it is will force any memory writes optimized to registers to be written to memory. Memory orderings even acq and rel does reduce some optimization opportunities for the compiler (eg, it can’t write combine across a release)

SeqCst: If you have two threads doing writes to two different addresses, and you threads doing loads to each of those addresses. Acq and Rel allows each loading thread to see the releases in different orders. If you need a total global order (both loading threads to see the stores in the same order), that is when you use sequentially consistent. (You need 3 or more thrads and 2 or more memory addresses for seqcst to matter),

If you don’t need ordering of all writes, just that single address that’s monotonic. If you weren’t returning last_error to another thread, you would be able to use monotonic.

chung-leong · May 18, 2024, 11:25am

According to the doc, “the checking of ptr and expect, along with blocking the caller, is done atomically and totally ordered (sequentially consistent) with respect to other wait()/wake() calls on the same ptr.”