Store/load using shared memory for a multithreaded application

I have a question about storing and loading data in an array from different threads, that is, how I can store/load using shared memory.

I have ttArray: []u128, where I store variables of type scoreEntry, which is a packed struct that is 128 bits in size:

pub const scoreEntry = packed struct {
    hash_key: u64,
    move: Move, // 16-bits
    score: i32,
    bound: Bound, // 2-bits
    depth: u8,
    age: u6,
};
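Since the whole scheme relies on one entry mapping exactly onto a u128 element, it may be worth putting a compile-time guard next to the definition. A minimal sketch, assuming Move is 16 bits and Bound is 2 bits as the comments say (64 + 16 + 32 + 2 + 8 + 6 = 128):

// Sketch: refuse to compile if the entry ever stops fitting in a u128.
comptime {
    if (@bitSizeOf(scoreEntry) != 128) @compileError("scoreEntry must be exactly 128 bits");
}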

This structure stores data about chess positions during the search for the best move. I plan to use a multi-threaded search where I run multiple search routines in separate threads, and all threads will store or load data in the global ttArray, which will act as shared memory. This approach is called Lazy SMP (Lazy SMP - Chessprogramming wiki).
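A rough sketch of the shape I have in mind (searchWorker and the table size are placeholders, not the real engine code):

const std = @import("std");

// Hypothetical worker: each thread would run its own search here,
// probing and storing entries in the shared table.
fn searchWorker(tt: []u128, id: usize) void {
    _ = tt;
    _ = id;
}

pub fn main() !void {
    var table = [_]u128{0} ** 1024; // backing storage; the size is arbitrary here
    const tt: []u128 = &table; // shared ttArray that every worker sees
    var threads: [4]std.Thread = undefined;
    for (&threads, 0..) |*t, id| {
        t.* = try std.Thread.spawn(.{}, searchWorker, .{ tt, id });
    }
    for (threads) |t| t.join();
}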

As far as I have been reading and studying, I need to use @atomicRmw for storing and @atomicLoad for loading data, but I haven't found good examples and I am not sure if this is the correct approach.

I think std.atomic provides a friendlier interface for atomic operations. On Zig master it's std.atomic.Value and on 0.11 it's std.atomic.Atomic. You basically wrap your data in a Value or Atomic and then you have load and store methods.
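A minimal sketch of what that looks like, assuming the 0.11 std.atomic.Atomic API (on master the wrapper is Value, but the usage is the same):

const std = @import("std");

// A single shared counter wrapped in std.atomic.Atomic.
var counter = std.atomic.Atomic(u32).init(0);

pub fn main() void {
    counter.store(42, .Monotonic);
    _ = counter.fetchAdd(1, .Monotonic); // atomic read-modify-write
    std.debug.print("{}\n", .{counter.load(.Monotonic)});
}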

You may also want to test the non-atomic way using a std.Thread.Mutex. I’ve found that depending on the situation atomics may be faster and in other cases the mutex is faster. You always have to test and see with multithreading.

I don’t think using std.Atomic is the best solution in this case, since you need atomic access to an array element, not to the entire array.

It takes some effort to wrangle the compiler into doing what you want.

const std = @import("std");

pub const ScoreEntry = packed struct {
    hash_key: u64,
    move: u16,
    score: i32,
    bound: u2, 
    depth: u8,
    age: u6,
};

pub fn main() void {
    var a: ScoreEntry = undefined;
    const b: ScoreEntry = .{
        .hash_key = 0,
        .move = 1,
        .score = 2,
        .bound = 3,
        .depth = 4,
        .age = 5,
    };
    // Reinterpret the 128-bit packed structs as u128 so the atomic builtin accepts them.
    const dstPtr: *u128 = @ptrCast(&a);
    const srcPtr: *const u128 = @ptrCast(&b);

    @atomicStore(u128, dstPtr, srcPtr.*, .Monotonic);

    std.debug.print("{any}\n", .{a});
}

Godbolt

There is no such thing as “atomic access to the entire array”. Atomic is about operations, not types. You operate atomically on an element. There is no way to operate atomically on an array, unless you can pack the entire array into a register. std.atomic is just a very thin wrapper over the @atomic* builtins. In this case, the atomic approach doesn’t work because @atomic* doesn’t operate on structs, hence the casts I used above.

Yeah, at that point you’re crossing into mutex territory. Now, if you wanted an array of atomic values, that’s another issue. Or, if you wanted to build a kind of ticket-mutex that uses an atomic counter to track whose turn it is… then we’re getting into atomics being useful.
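For completeness, a rough sketch of that ticket idea, not anything from the code above, just two atomic counters tracking whose turn it is (0.11-style orderings):

const std = @import("std");

// Hypothetical ticket lock: take a ticket, then spin until it is being served.
const TicketLock = struct {
    next_ticket: u32 = 0,
    now_serving: u32 = 0,

    fn lock(self: *TicketLock) void {
        // Atomically grab a ticket; the returned old value is our place in line.
        const my_ticket = @atomicRmw(u32, &self.next_ticket, .Add, 1, .Monotonic);
        while (@atomicLoad(u32, &self.now_serving, .Acquire) != my_ticket) {
            std.atomic.spinLoopHint();
        }
    }

    fn unlock(self: *TicketLock) void {
        // Hand the lock to the next waiter.
        _ = @atomicRmw(u32, &self.now_serving, .Add, 1, .Release);
    }
};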

@jabolcni If you just want to lock down a resource, you probably want a mutex. I’d start with that first and see where your bottlenecks are. At the very least, it’s super easy to implement an example to get some baselines for further experiments.
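Something along these lines would be enough for a baseline; the Table type here is a hypothetical stand-in, just to show the pattern:

const std = @import("std");

// Hypothetical table guarded by a single std.Thread.Mutex.
const Table = struct {
    mutex: std.Thread.Mutex = .{},
    entries: []u128,

    fn set(self: *Table, indx: usize, value: u128) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.entries[indx] = value;
    }

    fn get(self: *Table, indx: usize) u128 {
        self.mutex.lock();
        defer self.mutex.unlock();
        return self.entries[indx];
    }
};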

So I arrived at something like this:

    inline fn store_atomic(self: *TranspositionTable, entry: scoreEntry, indx: u64) void {
        @atomicStore(u128, &self.ttArray[indx], @as(*const u128, @ptrCast(&entry)).*, .SeqCst);
    }
    inline fn load_atomic(self: *TranspositionTable, indx: u64) scoreEntry {
        var raw = @atomicLoad(u128, &self.ttArray[indx], .SeqCst);
        return @as(*scoreEntry, @ptrCast(&raw)).*;
    }
    inline fn load_simple(self: *TranspositionTable, indx: u64) scoreEntry {
        return @as(*scoreEntry, @ptrCast(&self.ttArray[indx])).*;
    }

The pointer casts were trial and error, but this seems to work. Using @atomicLoad is, however, slower than just using load_simple. I am not sure which memory ordering I should use; I read about them all, but this is the first time I am using anything atomic, so I don't have any experience yet. I also need to test this on Linux.

I will also try to figure out mutexes, so that I can test the differences.


Another question you should ask yourself is whether you really need sequential consistency. There are many weaker (and therefore faster) orderings that might be enough here.
You can check out the LLVM guide for help; it's a bit technical, but it describes what each of the atomic orderings does: https://llvm.org/docs/Atomics.html#id7

Actually I need the fastest one. I guess in this case I should use .Monotonic, or is there an even faster ordering?

There is also .Unordered. This doesn’t guarantee any order of reads/writes.
It basically only guarantees that you don’t read garbage values. What you read must have been written somewhere before.

You should also test the performance of doing the load with .Acquire and the store with .Release. This should be faster than SeqCst and slower than Monotonic or Unordered, but acquire/release is often recommended as the best practice for load/store pairs and offers sane guarantees on the ordering of operations.
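For example, keeping the same TranspositionTable and scoreEntry as above, the variants to benchmark might look like this (the function names are just placeholders):

    inline fn store_release(self: *TranspositionTable, entry: scoreEntry, indx: u64) void {
        @atomicStore(u128, &self.ttArray[indx], @as(*const u128, @ptrCast(&entry)).*, .Release);
    }
    inline fn load_acquire(self: *TranspositionTable, indx: u64) scoreEntry {
        var raw = @atomicLoad(u128, &self.ttArray[indx], .Acquire);
        return @as(*scoreEntry, @ptrCast(&raw)).*;
    }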

I am now testing mutexes using

var lock = std.Thread.Mutex{};

    inline fn setMutex(self: *TranspositionTable, entry: scoreEntry) void {
        lock.lock();
        self.ttArray[self.index(entry.hash_key)] = @as(*const u128, @ptrCast(&entry)).*;
        lock.unlock();
    }
    inline fn getMutex(self: *TranspositionTable, hash: u64) scoreEntry {
        lock.lock();
        // Copy the entry while still holding the lock; dereferencing after unlock would race with writers.
        const raw = @as(*scoreEntry, @ptrCast(&self.ttArray[self.index(hash)])).*;
        lock.unlock();
        return raw;
    }

Once I figured out how to do pointer casting, the rest is quite easy and understandable.

Using atomics for storing/loading is relatively fast, with almost no performance hit. I am not so sure about the mutex, but I will have to wait for the test results.

Just a tip for common cases of using a mutex in Zig: you can take advantage of defer to ensure the mutex is unlocked no matter what happens. In your case this isn't really necessary, because the code between lock and unlock (the critical section) is just one infallible operation. But for larger critical sections, and ones containing code that can fail, you can do:

{
    mutex.lock();
    defer mutex.unlock();
    // critical section
}

Using a mutex will block all access to the array, even when the worker threads access different array elements.