Why `std.once`?

I see std.once exists. What is its purpose? There isn’t a single usage of it in the stdlib.

1 Like

I used it in a testing library for pseudorandom number generation initialization.

const std = @import("std");

var prng: std.rand.DefaultPrng = undefined;

var prng_once = std.once(prng_init);

fn prng_init() void {
    const seed = std.crypto.random.int(u64);
    prng = std.rand.DefaultPrng.init(seed);
}

pub fn check() void {
    // initialize prng, calling prng_init, but only the first time
    prng_once.call();
    // after initialization we can call prng.random()
    ...
}
5 Likes

It is used in a couple places in the Zig repo.

4 Likes

For those who are not familiar with this utility, I'd like to add a bit more context to this thread. Here's what's up with std.once.

Here’s the source: zig/lib/std/once.zig at master · ziglang/zig · GitHub

First, we'll look at the test in the once.zig file. This test creates a global counter and provides a function incr that adds one to the global counter. Then, they spawn 10 threads that each try to call incr. This would usually lead to problems due to a data race. To guard against this, they wrap incr in a std.once. Here's the test:

var global_number: i32 = 0;
var global_once = once(incr);

fn incr() void {
    global_number += 1;
}

test "Once executes its function just once" {
    if (builtin.single_threaded) {
        global_once.call();
        global_once.call();
    } else {
        var threads: [10]std.Thread = undefined;
        defer for (threads) |handle| handle.join();

        for (&threads) |*handle| {
            handle.* = try std.Thread.spawn(.{}, struct {
                fn thread_fn(x: u8) void {
                    _ = x;
                    global_once.call();
                }
            }.thread_fn, .{0});
        }
    }

    try testing.expectEqual(@as(i32, 1), global_number);
}

At the end, we can see that the test expects the global counter to equal 1, implying that the incr function has only been called once. The implementation of std.once also gives us a few more clues…

Here in the call function, we see an atomic load that checks if the call has already been made:

        pub fn call(self: *@This()) void {
            if (@atomicLoad(bool, &self.done, .acquire))
                return;

            return self.callSlow();
        }

Likewise, there is a callSlow function that uses a mutex.
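For reference, callSlow in the linked source looks roughly like this (paraphrased from memory, so check the repo for the exact code):

fn callSlow(self: *@This()) void {
    @setCold(true);

    self.mutex.lock();
    defer self.mutex.unlock();

    // The first thread to acquire the mutex runs f(); any thread that
    // raced with it blocks on the mutex and then sees done == true.
    if (!self.done) {
        f();
        @atomicStore(bool, &self.done, true, .release);
    }
}

The .release store here pairs with the .acquire load in call, which is what makes the fast path safe.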

I think we can safely assume that this utility was made with thread safety and mutable state in mind. The API offered by std.once is quite limited - you get two member functions that take no additional parameters and return void. It looks like the intention behind this is to guard against data races on global variables in multi-threaded scenarios (like initializing something only one time or coordinating calls). Odd little utility, but I'm sure it's handy where it makes sense.

8 Likes

This can be .monotonic or .relaxed or whatever it is called in Zig

This would be incorrect. The purpose of once is to synchronize some sort of side effect, so an acquire/release pair is required.

In other words, the following program would be buggy with relaxed:

const std = @import("std");
const assert = std.debug.assert;

var global_flag: bool = false;
var global_once = std.once(struct {
    fn set() void { global_flag = true; }
}.set);

pub fn main() !void {
    var threads: [2]std.Thread = undefined;
    for (&threads) |*t| {
        t.* = try std.Thread.spawn(.{}, struct {
            fn thread_fn() void {
                global_once.call();
                assert(global_flag); // could trip with .relaxed
            }
        }.thread_fn, .{});
    }
    for (threads) |t| t.join();
}

You're right, it can't be relaxed. I thought they were the same, just named differently (LLVM vs C++). ~~That's clearly wrong. Relaxed and unordered are the same?~~

edit: the LLVM docs say that monotonic and relaxed are the same, but the descriptions of them are totally different. C++ says:

there are no synchronization or ordering constraints imposed on other reads or writes, only this operation’s atomicity is guaranteed

and LLVM says:

It essentially guarantees that if you take all the operations affecting a specific address, a consistent ordering exists.

Those seem totally at odds with each other and I’m totally confused.

You essentially want this load to see a store from another thread.

These are two compatible definitions. In the first one, the key word is *other*. Relaxed/monotonic guarantees a modification order of a single particular location in memory, but doesn't say anything about operations involving any different location.

So, relaxed/monotonic can be used for counters. When you want to synchronize access to some different data, as is the case with once, you need an acquire/release pair.
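
For instance, a minimal sketch of the counter case in Zig (the names here are my own):

const std = @import("std");

// A relaxed (monotonic) counter: every increment is atomic, so no
// updates are lost, but nothing is promised about when other memory
// written by the incrementing threads becomes visible.
var hits = std.atomic.Value(u64).init(0);

fn recordHit() void {
    _ = hits.fetchAdd(1, .monotonic);
}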

1 Like

the key word is *other*.

Yes, I didn't pick up on that. The test code is writing to two variables: the done bool and the global int. Why wouldn't it need SeqCst to ensure ordering on loads and stores between unrelated addresses?

edit: I think I got it. For SeqCst you need two threads doing stores for another thread to read.

Acq and Rel only apply to the state of a specific thread. So as long as just one thread is doing the writes, Rel is all that is needed. Still not sure about Acq.

You might want to give Rust Atomics and Locks by Mara Bos a read, it's a good resource to wrap one's head around atomics (and, for me personally, Java Memory Model Pragmatics (transcript) was the resource that made me understand atomics). My current understanding:

  • unsynchronized access (e.g., just reading a non-atomic variable) is something you use when there's something else ensuring synchronization. E.g., if data is protected by a mutex, it's fine to use non-synchronized accesses for the data while you hold the mutex, because the mutex.unlock operation synchronizes with mutex.lock. That is, when thread A unlocks the mutex and thread B subsequently locks the mutex, there's a guarantee that thread B observes all writes by thread A, even if they are not atomic
  • Relaxed/monotonic access is what you use for counters. It makes access to a single variable atomic, but doesn’t say anything about unrelated operations.
  • Acquire/Release is more or less the default ordering, and the one you need to wrap your head around to get atomics. The key idea here is that accesses come in pairs. Thread A does a write, thread B does a read, and these two operations match and synchronize with each other. This is how a mutex works internally. When you unlock a mutex, you do a write with release. When you subsequently lock the mutex, you do a read with acquire. This acquire-release pair is what allows unsynchronized access to the interior data (see the spinlock sketch after this list).
  • SeqCst — I am not smart enough to understand this one :slight_smile: It is usually introduced first, as the simplest ordering, but that's actually incorrect. If you try to reason that a particular synchronization primitive (like Once) is correct, you almost always will reason in terms of release/acquire semantics, of particular reads synchronizing with particular writes. SeqCst doesn't make such proofs easier. There are some data structures that do rely on SeqCst's additional property of enforcing a global order, but they are tricky, and I never understood those.
  • Consume — the thing about Acquire/Release is that it synchronizes everything. With the mutex lock/unlock example, not only the data inside the mutex will be synchronized, but, actually, any other shared memory between the two threads will be synced up. This is allegedly wasteful, and it allegedly would be beneficial to say "this atomic (holding the state of the mutex) protects the data in the mutex, but only it". The mythical "consume" ordering is intended to solve this, but my understanding is that nobody managed to make a logically consistent model of Consume, so it got removed from C++.
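
To make the acquire/release pairing concrete, here is a minimal spinlock sketch in Zig (my own illustration; std.Thread.Mutex is more sophisticated, e.g. it parks waiters instead of spinning):

const std = @import("std");

const SpinLock = struct {
    locked: bool = false,

    fn lock(self: *SpinLock) void {
        // The .acquire on success pairs with the .release in unlock(),
        // so everything the previous holder wrote is visible to us.
        while (@cmpxchgWeak(bool, &self.locked, false, true, .acquire, .monotonic) != null) {
            std.atomic.spinLoopHint();
        }
    }

    fn unlock(self: *SpinLock) void {
        // The release store publishes all writes made while holding the lock.
        @atomicStore(bool, &self.locked, false, .release);
    }
};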
3 Likes

my understanding

For SeqCst you need more than 2 threads. Acq and Rel only work in pairs (I don't know why. Acq still confuses me). Rel is almost like a commit: any thread that then does an Acq on the same address will be guaranteed to see all writes done by the thread that did the Rel, as of the time of the Rel. There is an order to each memory location (if you write 1 to 10 you are guaranteed to see them in that order – but not necessarily all of them) but not between locations (if you did a write of 1 to 10 alternating between two locations, you could still see all the evens written to one location and then all the odds written to the other), but once the Rel is done you will see 9 and 10 in both locations. I think that holds as long as you do an Acq on the Rel'ed location. (I don't understand how/why at all.)

The often-cited SeqCst example:

thread 1 writes x
thread 2 writes y
thread 3 reads x then y
thread 4 reads y then x

If the writes are all Rel and the reads all Acq, they only apply to the individual threads. 3 and 4 can see the writes to x but not y, or the other way around. There is no GLOBAL ordering, just per thread and per memory location. Thread 3 could see it as write x then y, but 4 could see it as y then x.

SeqCst means both threads 3 and 4 are guaranteed to see the same ordering (both will either see write x then y, or the other).

I think that is correct at least
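
Sketched in Zig (names are mine; imagine each of the four functions running on its own thread):

const std = @import("std");

var x = std.atomic.Value(u32).init(0);
var y = std.atomic.Value(u32).init(0);

fn writerX() void {
    x.store(1, .seq_cst);
}

fn writerY() void {
    y.store(1, .seq_cst);
}

fn readerXY() void {
    const a = x.load(.seq_cst);
    const b = y.load(.seq_cst); // a == 1, b == 0 says "x came first"
    _ = a;
    _ = b;
}

fn readerYX() void {
    const c = y.load(.seq_cst);
    const d = x.load(.seq_cst); // c == 1, d == 0 says "y came first"
    _ = c;
    _ = d;
}

With .seq_cst, both readers must agree on a single order of the two stores, so the two "came first" outcomes can't both be observed in the same run; with only .acquire/.release, they could.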

Currently I'm reading Chapter 3 of Mark Batty's thesis, which gives a formal definition of the C/C++ memory model. It may be complicated for an unprepared reader, but it gives exhaustive, rigorous definitions which describe how all loads and stores are supposed to work. The description of C++'s std::memory_order is a very, very simplified version of it, but it's still very useful.

Each load and store is an action. A pair of actions may be in some relation (there are a lot of different relations). The main relation that drives this machinery is happens before. If some non-atomic location is accessed from 2 threads, and these accesses are not in a happens before relation, it's UB. So, simply speaking, the goal of synchronization mechanisms is to establish happens before relationships.

Relaxed actions indeed don’t affect happens before, but, as was mentioned in this thread, relaxed writes to the same location form a modification order, which is coherent with happens before. Coherence of modification order is defined via write-write, read-read, read-write, and write-read coherence. Simply speaking, if some relaxed action on a location happens before some other relaxed action on the same location, then the first action reads/writes a value that appears not later in the modification order of this location than a value of a read/write of the second action.

And of course, it’s not UB to concurrently load and load/store on the same location, if those loads and stores are relaxed, even if those actions are not in happens before relation.

Important note: a release store A synchronizes with (=> happens before) an acquire load B only if this load reads a value from a release sequence headed by A. In simple cases it is often enough to just check whether B reads the value written by A in a loop.

I'll use syntax from Batty's thesis, which looks like C++ (sorry for that) with special syntax for creating and joining threads. {{{ starts each statement delimited by ||| in a new thread, and the corresponding }}} joins all threads.

// main thread
int x = 0;
atomic_int y = 0;
{{{ { x = 1; y.store(1, release); } // thread 1
||| { while(y.load(acquire) != 1); assert(x == 1); } // thread 2
}}}

Note that loads of y in thread 2 may read only 0 or 1; they can't read garbage that was there before y's initialization, even though initialization of atomics is not atomic. That's because a thread's start establishes the relation additionally synchronizes with (=> synchronizes with => happens before) between the initialization y = 0 and the beginning of the threads' functions (x = 1 and the first y.load(acquire)).

Another note: this message-passing mechanism is not optimal; the loads may be relaxed if there is an acquire barrier after the loop, but barriers are a whole other story.

Not exactly. Release-consume ordering is built upon the carries dependency relation. Read A carries a dependency into evaluation B if (very simplified) the value of A is used as an operand of B. Release-consume ordering establishes a happens before relation only for those actions which the consume read carries a dependency into.

The use of this memory ordering is indeed discouraged since C++17; AFAIK, whenever you use consume, the compiler simply transforms it to acquire (which gives stronger guarantees).

A SeqCst write has the same guarantees as a release write, a SeqCst read has the same guarantees as an acquire read, and additionally all SeqCst actions on all locations form a single total order, which agrees with happens before and modification order. The example from @nyc above demonstrates it.

Frankly, I don't quite understand this example, so I can't say whether it's true. A code snippet may help.

Correct.

2 Likes

Thanks for the paper. I'll put it in my ever-growing queue.

LOL. And I still have no idea what acquire does or how it works (does it even generate a fence at the point it exists, or does it change what release does?). I think I'm doomed to never understand it, but just know how to do it (the Monty Hall Problem all over again – fucking nightmare).

According to the thesis, it depends on the target architecture.

| C/C++ | x86 | Power | ARM | Itanium |
| --- | --- | --- | --- | --- |
| Load Acquire | mov (from memory) | ld; cmp; bc; isync | ldr; teq; beq; isb | ld.acq |
| Store Release | mov (into memory) | lwsync; st | dmb; str | st.rel |

I don't know what it means exactly, but the main idea that I took from it is that x86 has stronger synchronization guarantees by default, hence it's unable to gain additional performance from relaxed behaviour. But it matters for other architectures.

No, they are completely separate instructions.

A lot of people like to think about releases and acquires in terms of the reordering that the compiler/hardware is allowed to do. The compiler/hardware may reorder writes after a release to be before the release, and reads before an acquire to be after the acquire.

x1 = 0;  // can't be reordered after the release
y1.store(1, release);
z1 = 2; // can be reordered before the release
a = x2; // can be reordered after the acquire
b = y2.load(acquire);
c = z2; // can't be reordered before the acquire

This reasoning isn’t as rigorous as the approach with happens before, but it gives an intuition. That said, if I want to prove something about my program, I would use happens before terminology.

1 Like

That's true. For example, x64 can't retire a store until it completes, unlike ARM (tangentially, this is also why perf can give weird results: it blames the oldest instruction in the ROB when the sample timer expires).

I haven't had time to read the thesis yet, but I'll try to make time this weekend.

I think I finally get it all (I just glanced at the thesis but will read more details later):

acquire(m) - all memory ops after here will not be moved in front of this (but earlier ones can still be pushed after it), and all prior writes to m done with release are visible (but not necessarily other locations). In your second example, changes to x2 and z2 on other threads are not guaranteed to be seen, but changes to y2 are.

release(m) - all memory ops before here will not be moved after this, but memory ops after this can still be pulled ahead. All other threads that use acquire on m will be guaranteed to see anything written to m, but not necessarily other locations (like x1 and z1).

acquire-release(m) - no memory ops will be moved ahead of or behind this, and it creates a total ordering for m for any other thread using acq and rel on m, but not necessarily others.

seqcst(m) - like acq-rel, but it is a total ordering for all memory locations across all threads, I think. Not sure why it needs an argument then. If one thread does seqcst(m) and another does seqcst(n), will they share the same ordering on all variables?

I find acquire semantics much harder to reason about. Rel is like a commit. Would it be intuitive to think of acq like a snapshot? Not really correct though.