Help with Io.Select and Cancelation on 0.16 latest with Io.Threaded

Got a bit of a mess here with thread condition variables and cancelation, using Io.Threaded

Used to have cond.timedWait(&mutex, timeout) … now trying to do the same thing using Io.Select, giving it a thread condition + a timer to select on

The following code sort of works

    const SelectTypes = union(enum) {
        condition_wait: error{Canceled}!void,
        timeout: error{Canceled}!void,
    };

    var buffer: [2]SelectTypes = undefined;
    var select = Io.Select(SelectTypes).init(self.io, &buffer);

    select.async(.condition_wait, Io.Condition.wait, .{
        &self.cond,
        self.io,
        &self.event_mutex,
    });
    // add a timeout to the select
    select.async(.timeout, Io.Clock.Duration.sleep, .{
        .{ .raw = timeout, .clock = .real },
        self.io,
    });

    // This will block until either the signal hits or the timer fires
    const result = try select.await();
    select.cancel(); // need this to cancel the non-winning task
    std.log.debug("Got result {}", .{result});

It successfully grabs the first result that occurs - whether that’s the thread cond being signalled, or the timer expiring

The problem I have now is when the timer wins, the select.cancel() is called - which picks up the thread cond wait and cancels it.

Now, deep inside Io.Threaded.Group.Task.start(), where the cond wait is started and then cancelled, I'm getting weird behaviour that crashes the program on an assert

Code for Threaded.Group.Task.start() is here

           
    // This is where select.async() starts the thread cond wait.
    // Note that it returns error{Canceled}!void.
    const result = task.func(task.contextPointer());
    const cancel_acknowledged = switch (thread.status.load(.monotonic).cancelation) {
        .none, .canceling => false,
        .canceled => true,
        .parked => unreachable,
        .blocked => unreachable,
        .blocked_alertable => unreachable,
        .blocked_alertable_canceling => unreachable,
        .blocked_canceling => unreachable,
    };

    // Debugging I added to see what's going on
    std.log.debug("task result {!} cancelled {}", .{ result, cancel_acknowledged });
    if (result) |res| {
        // ---> Getting here, where result = void, and no error after being cancelled
        std.log.debug("here with non-error result {}", .{res});
        // ---> so the next assert crashes the program - ouch
        assert(!cancel_acknowledged); // group task acknowledged cancelation but did not return `error.Canceled`
    } else |err| switch (err) {
        error.Canceled => assert(cancel_acknowledged), // group task returned `error.Canceled` but was never canceled
    }

So - not sure whether my app code is completely wrong in how it's using Io.Select, or there is a stdlib bug, or even a miscompilation

More info - if I build in ReleaseFast mode, it works fine

If I force the allocator in the test program to be std.heap.smp_allocator - then it works fine with ReleaseSmall

Crashes on cancellation of thread.cond.wait with ReleaseSafe (??)

Crashes on cancellation of thread.cond.wait with Debug / non-llvm build

Smells like stack getting clobbered somewhere. Not sure yet

It’s a bug. I think you are better off using regular group and queue the results yourself.


And btw, the internal futex API has timeout support so the missing Condition.timedWait can be fairly easily implemented, if anyone wants to do a PR.

Cool thanks.

What concerns me most is the bit in Threaded.Group.Task.start() where the spawned task is cancelled and the cancelation flag is set on the thread, but error{Canceled} is not returned from the task fn. So that's probably a bug or miscompilation in debug mode or something. Weird that it's fine in ReleaseFast mode though.

My solution for now is to disable the timed wait function in my lib until 0.16 gets released properly anyway … can live with that for now.

Yeah, and there is some chatter about adding timeouts to ALL currently cancelable functions too as a new feature

One more thing, using async like in your original example can lead to very surprising results. I see this all over Zig communities: people use async and hope it executes concurrently. If it does not, which it is allowed to do, you would first wait on the condition with no timeout, and only start the sleep after that wait returns.
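
To make the async point concrete, here is a rough sketch of the difference, using the plain io.async / io.concurrent pair rather than Io.Select. waitForEvent is a hypothetical stand-in for the blocking call, and the signatures are assumptions based on the in-development std.Io, so they may not match your exact build.

```zig
const std = @import("std");
const Io = std.Io;

// Hypothetical stand-in for a blocking, cancelable operation such as
// the condition wait in the original post.
fn waitForEvent(io: Io) error{Canceled}!void {
    _ = io;
}

fn example(io: Io) !void {
    // `async` only says the call *may* run independently. An
    // implementation is allowed to run waitForEvent to completion
    // right here, before `async` returns.
    var maybe_inline = io.async(waitForEvent, .{io});
    try maybe_inline.await(io);

    // `concurrent` asks for a separate unit of concurrency and fails
    // with error.ConcurrencyUnavailable rather than running inline
    // (assumed behaviour, check against your std version).
    var separate = try io.concurrent(waitForEvent, .{io});
    try separate.await(io);
}
```

The same reasoning applies to the select in the original post: if the condition wait ends up running inline, the timer task is not even started until the wait returns.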
