Seqlock memory barriers for concurrent hashmap

Hi everyone,

I had a lot of fun working on a fixed-capacity, open-addressing Hash Table using Robin Hood hashing. It uses sharded locking for writers and optimistic concurrency control (in particular seqlocks) for readers.

It works and passes all my tests on my x86_64. But I’m unsure if I’m missing a(or multiple) barriers in particular in the get/getBuffer function to make it truly correct.

Summarized it basically looks like this(kinda pseudo code)

retry: while(true) {
    // Read version
    version = ts.load(.acquire);

    // Hashmap lookup logic and getting the final value
    const k = slot.key;
    const v = slot.value;

    // Do I need a fence here?

    // Validate version
    const valid = version == ts.load(.acquire);
    if (!valid) continue :retry;
}

The real code is of course more involved (ConcurrentHashTable.zig (24.7 KB)). I’ve written a lot of explanatory comments which could help (they were mostly for myself because else I’m forgetting the details…).

I unfortunately don’t have a machine with a weak memory model available to test it. Could anyone with an ARM or something run the tests for me? If you want you could also run the stress test by changing this to false:

test "multithreaded stress" {
    if (true) { // change to false to run
        return;
    }
    try stressTest(
        1024 * 1024,
        1024,
        1_000_000, // maybe play with iterations
        1024 * 1024,
        8,
    );
}

Thanks

1 Like

Some things I’ve noticed:

218: assert(current.key == incoming.key);

This is going to give you problems when K can’t be compared with ==.

515: atomic.spinLoopHint();

Do a Fibonacci backoff here, it makes some space when there’s contention. Period-doubling is a little simpler, but it has harmonics.

Let’s look at lock:

fn lock(self: *ShardLock) void {
    while (true) {
        var current = self.version.load(.monotonic);
        // Wait for even
        while (current & 1 != 0) {
            atomic.spinLoopHint();
            current = self.version.load(.monotonic);
        }

        // CAS to switch to odd
        if (self.version.cmpxchgWeak(current, current + 1, .acquire, .monotonic) == null) {
            return; // Locked successfully
        }
    }
}

You want a back-off here as well, possibly even sleep before starting over if you get really unlucky. Also, nothing in this should be .monotonic: .acquire on reads, .release on writes. With .monotonic you’re not actually synchronizing state, you’re just setting up for more spurious exchange failures. It’s probably not a correctness issue, keeping in mind I am looking at all of this for the first time.

Also: cmpxchgWeak will give you the latest value if you miss your window, why not hold onto it for the next try?

So:

const BACKOFF_LIMIT = 19; // For instance..
fn lock(self: *ShardLock) void {
    var f_minus: usize = 1;
    var back_off: usize = 1;
    var log_backoff: usize = 0;
    var current: u64 = self.version.load(.acquire);
    spin: while (true) {
        // Wait for even
        if (current & 1 != 0) {
            for (0..back_off) |_| {
                atomic.spinLoopHint();
            }
            if (log_backoff <= BACKOFF_LIMIT) { // LTE because we start at fib(2)
                log_backoff += 1;
                // Fibonacci back-off (no phase lock, slower gain):
                const f_next = back_off;
                back_off = back_off + f_minus;
                f_minus = f_next;
            } else {
                // Here you could sleep and reset, if you want
            }
            current = self.version.load(.acquire);
            continue :spin;
        }

        // CAS to switch to odd
        if (self.version.cmpxchgWeak(current, current + 1, .release, .acquire)) |new_version| {
            current = new_version;
        } else {
            return; // Locked successfully
        }
    }
}

Note that I have not so much as compiled this, it’s a sketch. The idea is that your structure can stay locked for considerable periods of time, so hammering on the release bit like this is not efficient.

Anyway: ran it on ARM and the five tests pass. My guess is that if you pound on this a lot harder you’ll find a bug or two in the sharding-lock logic, but we could be pleasantly surprised. So pound on it a lot harder :slight_smile:

Cool project btw.

(Edited to get the proper release / acquire order for the compare exhange).

Thank you for your answer. I should’ve probably said it, but I’ve a pretty narrow usecase. I just need a fast u64 to u64 conversion for translating from page ids to indices/pointers for a storage engine. That’s also why I didn’t make it generic on the types and eql/hash.

That’s true, but okay for my usecase.

Never thought about the harmonics, but it does make sense that this could be bad. I will likely need to do some profiling to see if the contention is to high and if the backoff is needed.

Thanks :slight_smile:

Oh yeah, that’s true. I just thought, that the cas will synchronize anyway, so I don’t need to force a memory barrier there.

All in all thanks for your feedback. I will run the stress test on a big node on a cluster with some more threads and maybe fuzz a bit.

1 Like

Well I’ve profiled it and depending on the number of threads between 20% and 35% of the time was spent… contending on count or due to the false sharing induced by it. I’ve distributed it to
have one count per shard and now I’ve gotten this for 10 million operations on a 1m size. I’ve also compared it with std.AutoHashMap protected by a mutex.

threads std.HashMap(Mutex) one count distr count diff(one-distr)
1 462ms 590ms 589ms 0%
2 791ms 369ms 309ms -16%
6 1.16s 271ms 134ms -50%
12 1.92s 269ms 97ms -64%

40% put, 50% delete, 10% get

I’ve also looked into the backoff and changes from .monotonic to .acquire in the spinlock and this just makes it slower.

I hope I find other interesting things.

It completely slipped my attention that you had defined K and V, since they’re so routinely used for generic types. But that does make life easier if you need to expand things, so fair enough.

You’re welcome :slight_smile:

Your stress test segregates key values by thread. If that matches your use case, great, so much the better, but if different threads can clobber each other’s key/values, make sure you’re testing that also. It’s the part of the code which makes me think “hmm there’s a lot going on here”.

1 Like

Thank you, I completely neglected to think about this. I just did this with non-overlapping keys to easier keep track of the reference maps.

Based on this I’ve written three more tests for the main problems (torn reads, structural integrity, linearizability) and it seems to work fine (at least on my machine).

I’ve also done some cleanup and added a bit of prefetching which made it 1-2% faster.

1 Like

This is not the correct way to reason about memory orderings: in general, the correct method is to identify which memory operations (on the same memory object!) need to establish a synchronizes-with relationship.

In this case, the acquire-monotonic (weak) CAS in `lock` synchronizes with a release operation (e.g. a store or xchg) in `unlock` which has not been shown here. All other operations are non-essential and can indeed be monotonic. In a lock/unlock pattern, a simple acquire-release barrier is everything you need. In the lock path, you want to make sure that all memory operations that are part of the critical section requiring mutual exclusion are not reordered before the memory operation that observably acquires the lock. This is precisely achieved by an acquire-barrier. Likewise, on the `unlock` side you only need to make sure that all memory operations that belong to the locked section are not reordered after the operation releasing the lock. This is again achieved by a release-barrier.

Semantically, you might wan’t to ensure that, for instance, no operations that are not part of the locked region are reordered into it (which an acquire-barrier on lock permits for instance), but this is not a requirement for correctness. Also, monotonic ordering has nothing do to whatsoever with the amount of spurious failures of a weak CAS, neither on an abstract memory model level nor on a concrete hardware level of weakly ordered CPU architectures, it simply imposes no ordering constraints on the memory operations before or after it. The memory model still guarantees a total and global order of all stores to the *same* memory object, regardless of memory ordering (that’s why its called monotonic BTW).

2 Likes

What then is the different between .unordered and .monotonic? Also since you seem very knowledgeable about that, do you have any detailed references with which I can learn more?

From what I can tell, .unordered is essentially an LLVM primitive that zig exposes directly to its atomic API, which no other language I know does:

https://llvm.org/docs/Atomics.html#unordered

From what I can tell it basically means only that reads and writes are atomic and nothing else. I don’t know, what a proper use case for this would be. Zig’s .monotonic is basically identical to relaxed in C/C++/Rust.

Honestly, it took me a very long time to get a decent grasp on memory orderings and I can not remember there being a single reference that made it “click” for me.

From memory, Herb Sutter’s “atomic weapons” talk (you can find it on YouTube for instance) was fairly insightful.
Also, there is a good series of articles by Mara Bos*:

[*] Apparently also a book, which I wasn’t aware of until now.

4 Likes

There’s a great series of articles about lock-free programming on LWN:

1 Like

This was not an all-purpose guide to how to use atomics. It was a specific comment relating to actual code.

If you want to continue in that vein, I documented the changes I thought were necessary:

Is there something in here which you think can be productively weakened? If so, what.

My assertion is that using .monotonic for the load can only result in an occasional stale read. That would make the compare-exchange fail.

I took this is a general recommendation for assigning memory orderings to operations which is based on some common misconceptions and wanted to provide a generally useful guide for how to reason about memory orderings.

Yes, there are orderings that can be weakened and one that must be strengthened in order to be correct. None of the .acquire loads matter for correctness, they can be weakened to .monotonic. The failure ordering of the CAS can also be weakened to .monotonic. The success ordering of the CAS however must be acquire and not release. Its not a question of stronger and weaker, actually, .release is just wrong here. The CAS is the atomic instruction that “closes” the lock and it must synchronize-with a corresponding memory operation that unlocks it again. You have to make sure that no memory operations that occur (in program order) after the lock is acquired can be reordered before the locking instruction (the CAS) and .release does not do that. So you need .acquire when acquiring a lock and .release when releasing it again.

Using .release instead of .acquire here means that a memory operation that requires mutual exclusion (i.e. should only execute on a single thread) could be reordered before the successful CAS and therefore be executed by multiple threads in parallel, leading to a data race.

This is a common misconception, but memory orderings have nothing to do with the timing of the memory operations in general. Mara Bos’ article I’ve linked addresses this as well, I believe. In my view, a good way to think about memory orderings is that they only affect the possible reordering of atomic and non-atomic operations around them, but not themselves.

1 Like

I have no idea why I decided after writing that, that I had gotten the parameter order wrong and failure came before success, but, it happened. You are of course correct.

Yes

No.

In fact they do not affect the ordering of .monotonic operations either, except insofar as a single total order is guaranteed. Hence ‘monotonic’: if a bunch of threads +1 a shared variable 10,000 times, the value will increment by 10,000, and not less or more. But that’s all you get.

A .release affects a happens-before for an .acquire, and vice versa, but that’s it, and only for the variable in question (which is all this code cares about, but worth stating since we’re doing aggressive pedantry today).

By the spec, your all-monotonic-read version is allowed to never see action on another thread, although the compiler is not allowed to assume that such action will never happen. Meaning in practice, eventually, it will see the change.

But the semantics are obviously wrong. You cannot synchronize correctly without pairing .acquire with .release. The result, as I am saying for the third time, can only be stale reads which cause more retries than were strictly needed.

The only reason .monotonic isn’t a flat-out bug here, is that the code is acquiring the lock. If it were checking it, .monotonic would create a race. Either way it’s the wrong thing.

I’ve read Mara Bos btw.

To go a bit more on topic again.
Is there any way to easily insert an architecture specific fence instruction in zig? I expected there to be something like atomic.fence or similar but can’t find anything.

I don’t think something like this would be enough, no?

asm volatile ("" ::: .{ .memory = true });
// or if protecting a value I just read
asm volatile (""
    :
    : [value] "r,m" (value),
    : .{ .memory = true });

Does this, on x86_64, implicitly insert a mfence? But I would just need an lfence in get and sfence in put and remove.

It’s possible I am misunderstanding what you are saying here, but yes: memory orderings do in fact affect the ordering of atomic (including .monotonic) and non-atomic variables around them.
Only .monotonic is special since it permits any reordering, including the mutual re-ordering of relaxed operations on separate memory objects.

The synchronizes-with relationship is only established for a single variable with a release-acquire pair, but the happens-before and happens-after affects all memory operations around the acquire-release barrier. Otherwise, locks would not work at all. I don’t know, maybe that was what you were trying to say also…

This is technically speaking even possible with .acquire reads in the spin loop, the memory model simply has no bearing on the staleness or timeliness of reads and stores other than guaranteeing a monotonic order of stores. All that you ever get from an .acquire/.release pairing is the happens-before (and happens-after) relationship for memory operations around that particular barrier, nothing else. In other words, at least by my understanding of the memory model, this statement of yours is false.

I am curious to know what you mean by this, but I don’t get it. What do you mean by “checking the lock” and how would that create a data race?

There used to be @fence builtin for this, I don’t know why it was removed, though.

I don’t think inline assembly will ever implicitly emit an mfence instruction, only a compiler barrier.

1 Like

Then I remember this correctly that there was something like this. Here is a link to the relevant issue and section of the release notes.

I think I will use this (or a similar) solution they proposed there:

// old
thread-1:                    thread-2:
  queue.push()                e = signal.listen()
  fence(.seq_cst)             fence(.seq_cst)
  signal.notify()             if queue.empty(): e.wait()
// new
thread-1:                    thread-2:
  queue.push()                e = signal.listen()
  fetchAdd(0, .seq_cst)       fetchAdd(0, .seq_cst)
  signal.notify()             if queue.empty(): e.wait()

It’s curious that this would be recommendation.

I did a quick test and only v2 and v4 do compile to a mfence and neither version actually elides the loads (which might not actually be allowed). v1 and v3 emit lock; or which acts like a seq-cst barrier I believe. I tested this with v0.15.2 with -O ReleaseFast, both with and without -fllvm (makes no difference).

var global: i32 = 0;

export fn v1() void {
    _ = @atomicRmw(i32, &global, .Add, 0, .seq_cst);
}

export fn v2() i32 {
    const g = @atomicRmw(i32, &global, .Add, 0, .seq_cst);
    return g;
}

export fn v3() void {
    var local: i32 = 0;
    _ = @atomicRmw(i32, &local, .Add, 0, .seq_cst);
}

export fn v4() i32 {
    var local: i32 = 0;
    const l = @atomicRmw(i32, &local, .Add, 0, .seq_cst);
    return l;
}

My main gripe with this is, that neither LLVM nor zig has any business “optimizing” away a fetch-add like this, at all. Doing a deliberate lock xadd 0 on x86-64 instead of a regular load is a legit optimization technique in scalable lock-free code.

In C/C++ land, GCC correctly does not do this, btw.