Deterministic std.Io

When 0.16.0 was released and I started to wrap my head around std.Io, one of the first things I thought about was how it could be leveraged for deterministic simulation testing.

I haven’t had time to really dive into it yet, so bear with me! I’m imagining something that would mock all IO operations, return random failures (based on a seed), and deterministically schedule async and concurrent work. Through various simulations, we’d hopefully uncover race conditions, weak response / error handling, panics as a result of data corruption, etc.
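To make the "random failures based on a seed" part concrete, here is a minimal language-agnostic sketch (in Python for brevity; all names are hypothetical, not std.Io API): every mocked operation draws from a PRNG seeded once at startup, so any failing run can be replayed exactly from its seed.

```python
import random

class SimulatedIo:
    """Hypothetical mock IO layer: every operation consults a seeded PRNG,
    so a given seed always reproduces the same sequence of injected faults."""

    def __init__(self, seed: int, failure_rate: float = 0.1):
        self.rng = random.Random(seed)
        self.failure_rate = failure_rate
        self.files: dict[str, bytes] = {}  # in-memory "filesystem"

    def write(self, path: str, data: bytes) -> None:
        if self.rng.random() < self.failure_rate:
            raise OSError("injected write failure")
        self.files[path] = data

    def read(self, path: str) -> bytes:
        if self.rng.random() < self.failure_rate:
            raise OSError("injected read failure")
        return self.files[path]

def run_simulation(seed: int) -> list[str]:
    io = SimulatedIo(seed)
    log = []
    for i in range(10):
        try:
            io.write(f"/tmp/f{i}", b"x")
            log.append(f"write {i} ok")
        except OSError:
            log.append(f"write {i} failed")
    return log

# Same seed -> identical run, so any bug found is replayable.
assert run_simulation(42) == run_simulation(42)
```

The key property is that the seed is the *only* source of nondeterminism, so "it failed on seed 42" is a complete bug report.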

What are some potentially tricky things I’m missing here? I don’t anticipate it being “easy”, of course, but are there some really hard problems I’m unaware of in doing something like this? One that comes to mind for me is how to manage preemptive thread scheduling. If I ask for some concurrent work to be done, I can’t actually start an OS thread (to be deterministic, the entire execution has to be single-threaded), so I would have to stop and restart tasks at (seed-based) random times, mimicking how an actual thread behaves as far as the OS has control over it. Are we getting into “stackless coroutine” territory here? Looking at std.Io.Evented, maybe we can take some inspiration from the backing Dispatch implementation? Is that specific to underlying OS primitives? Would be nice to be portable across at least Linux and macOS…
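To sketch what I mean by seed-driven scheduling (again a hypothetical Python illustration, not std.Io code: generators stand in for coroutines/fibers, and every yield is an explicit preemption point):

```python
import random

def deterministic_schedule(seed: int, tasks):
    """Run cooperative tasks (generators) to completion, picking which one
    advances next with a seeded PRNG. Each yield is a potential
    'preemption point', standing in for where a real thread could be
    descheduled by the OS."""
    rng = random.Random(seed)
    runnable = list(tasks)
    trace = []
    while runnable:
        task = rng.choice(runnable)   # seeded scheduling decision
        try:
            trace.append(next(task))  # advance one step
        except StopIteration:
            runnable.remove(task)     # task finished
    return trace

def worker(name: str, steps: int):
    for i in range(steps):
        yield f"{name}:{i}"  # yield = explicit preemption point

t1 = deterministic_schedule(1, [worker("a", 3), worker("b", 3)])
t2 = deterministic_schedule(1, [worker("a", 3), worker("b", 3)])
assert t1 == t2  # same seed -> same interleaving
```

Replaying a failing seed reproduces the exact interleaving, which is the whole point of DST; different seeds explore different interleavings.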

Potentially a massive topic, but very interested in getting your thoughts.


To be truly deterministic, you would have to not talk with the underlying OS at all.

For concurrency, coroutines are indeed your best option. There’s std.Io.fiber you can reuse, but it does not support all architectures. When and if Zig gets stackless coroutines, those can be used without having to worry about hardware support.

As for OS threads, it’s still possible to serialize their execution with locks/parking/unparking [1], but ideally you’d want to avoid any OS primitives here at all.

1: For example, here is a port of a game that is written using a bunch of busy loops. The platform I ported it to must not loop forever, so I ran the game on a thread that I could yield whenever it waits for events, thus “asyncifying” it: https://github.com/sorvi-platform/sorvi-elma-classic/blob/master/src/elma-classic.zig
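To illustrate the parking/unparking approach with real OS threads, here is a hypothetical sketch (Python, but the same handoff pattern works with any semaphore primitive): only one thread is ever unparked at a time, so even though they are genuine OS threads, the seeded controller fully owns the interleaving.

```python
import random
import threading

def serialize_threads(seed: int, bodies):
    """Run each body's steps on its own OS thread, but park every thread on
    a private semaphore so only the thread the controller unparks can run.
    A seeded PRNG drives the controller, making the interleaving reproducible."""
    rng = random.Random(seed)
    n = len(bodies)
    go = [threading.Semaphore(0) for _ in range(n)]
    done = threading.Semaphore(0)
    trace = []

    def worker(i):
        for step in bodies[i]:
            go[i].acquire()        # park until the controller picks us
            trace.append(step())   # exactly one thread runs at a time
            done.release()         # hand control back to the controller

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()

    remaining = {i: len(bodies[i]) for i in range(n)}
    while remaining:
        i = rng.choice(sorted(remaining))  # seeded scheduling decision
        go[i].release()                    # unpark thread i for one step
        done.acquire()                     # wait for it to park again
        remaining[i] -= 1
        if remaining[i] == 0:
            del remaining[i]
    for t in threads:
        t.join()
    return trace

# Two runs with the same seed produce the same interleaving.
bodies = [[lambda i=i, j=j: f"t{i}.s{j}" for j in range(2)] for i in range(2)]
assert serialize_threads(3, bodies) == serialize_threads(3, bodies)
```

As noted above, this still leans on OS primitives, so coroutines are the cleaner path; this just shows that serializing real threads is workable.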


I’m doing some similar stuff, but the only IO I need to simulate is async, sleep, clock, and a raw socket. There still needs to be something that “ticks” the simulator for me, perhaps advancing the clock 1us at a time. I haven’t gotten very far, though; I still need to wrap my head around how to use the networking part of Io.
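For the “ticking” part, a common alternative to fixed 1us ticks is a discrete-event queue: sleeps register a deadline, and a tick jumps the virtual clock straight to the next deadline instead of crawling through empty microseconds. A hypothetical sketch (Python, names invented):

```python
import heapq

class VirtualClock:
    """Discrete-event virtual clock: sleeps register wake-ups in a priority
    queue, and a tick advances straight to the next deadline."""

    def __init__(self):
        self.now_ns = 0
        self._seq = 0        # tie-breaker keeps same-deadline order stable
        self._pending = []   # heap of (deadline_ns, seq, callback)

    def sleep(self, duration_ns, callback):
        heapq.heappush(self._pending,
                       (self.now_ns + duration_ns, self._seq, callback))
        self._seq += 1

    def tick(self):
        """Jump to the next deadline and fire its callback."""
        deadline, _, cb = heapq.heappop(self._pending)
        self.now_ns = deadline
        cb()
        return deadline

clock = VirtualClock()
fired = []
clock.sleep(3_000, lambda: fired.append("c"))
clock.sleep(1_000, lambda: fired.append("a"))
clock.sleep(2_000, lambda: fired.append("b"))
while clock._pending:
    clock.tick()
assert fired == ["a", "b", "c"]
assert clock.now_ns == 3_000
```

The `seq` tie-breaker matters for determinism: without it, two sleeps with identical deadlines could fire in heap-internal (unstable) order.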

I am interested if anyone has resources (like textbooks) they can recommend about the design of deterministic simulators.


Another problem which, AFAIK, you can’t solve by making Io deterministic is atomics. You’d have to somehow patch both their ordering between threads and the results you’re getting. There are of course ways to do that, but it can’t be done in Io alone.
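To illustrate why this matters: even with perfectly deterministic Io, a plain non-atomic read-modify-write can still lose updates depending on the interleaving, so the simulator has to control scheduling at a level below Io. A hypothetical sketch (Python; the yield marks the spot where a real thread could be preempted between load and store):

```python
import random

counter = 0

def incrementer():
    """A non-atomic read-modify-write, split at the point where a thread
    could be preempted between the load and the store."""
    global counter
    tmp = counter      # load
    yield              # preemption point between load and store
    counter = tmp + 1  # store

def run(seed: int) -> int:
    """Interleave two increments under a seeded scheduler; returns the
    final counter value (2 if correct, 1 if an update was lost)."""
    global counter
    counter = 0
    rng = random.Random(seed)
    tasks = [incrementer(), incrementer()]
    while tasks:
        t = rng.choice(tasks)
        try:
            next(t)
        except StopIteration:
            tasks.remove(t)
    return counter

# Each seed deterministically picks an interleaving; some lose an update.
assert run(0) == run(0)
assert all(run(s) in (1, 2) for s in range(20))
```

Depending on the seed, the two loads can both observe 0, so the final value is 1 instead of 2, exactly the class of bug DST is meant to flush out.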

For managing threads, see:

One subtle issue is that IO might be the wrong boundary for DST.

If the goal is to DST an existing non-DST-aware application, you might as well go all the way to Antithesis and mock at the VM level.

If the goal is to co-design an application and a DST harness, you’ll want to DST at the level of the application’s domain objects.

TigerBeetle has IO, but we don’t mock IO. We mock storage (fixed sized sector aligned reads/writes) and network (message passing).

This allows us to inject domain-relevant faults directly: we can misdirect a message, rather than an individual network write.
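To make this concrete, here is a hypothetical sketch (Python, all names invented, not TigerBeetle’s actual code) of fault injection at the message-passing layer: the faults are whole-message operations (drop, misdirect, duplicate) rather than corruption of individual byte-level writes.

```python
import random

class SimNetwork:
    """Hypothetical mock at the message-passing layer, not the syscall
    layer: faults are injected on whole messages, which is the granularity
    the protocol on top actually cares about."""

    def __init__(self, seed: int, nodes):
        self.rng = random.Random(seed)
        self.inboxes = {n: [] for n in nodes}

    def send(self, src, dst, msg):
        roll = self.rng.random()
        if roll < 0.05:
            return                                        # fault: drop
        if roll < 0.10:
            dst = self.rng.choice(sorted(self.inboxes))   # fault: misdirect
        elif roll < 0.15:
            self.inboxes[dst].append((src, msg))          # fault: duplicate
        self.inboxes[dst].append((src, msg))              # normal delivery

def deliver_all(seed: int):
    net = SimNetwork(seed, ["a", "b", "c"])
    for i in range(20):
        net.send("a", "b", f"m{i}")
    return net.inboxes

# Same seed -> same fault pattern, so failures replay exactly.
assert deliver_all(9) == deliver_all(9)
```

Because the faults are expressed in the domain’s vocabulary (messages, not sockets), the same harness keeps working no matter what transport sits underneath.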


To expand on this, this interacts in an interesting way with polymorphism.

Right now, TigerBeetle implements message passing on top of TCP. In theory, we could mock the TCP in a way that yields interesting message-passing semantics. But the problem is that, one day, we might want to switch to UDP, at which point we’d be doing entirely different syscalls, and our simulation would no longer work.

By simulating the message-passing layer directly, we get a much stronger property: TB works with any networking implementation, as long as that implementation is not completely broken. We can swap out our networking and be sure that won’t bork our consensus.

We do have a “unit fuzzer” for our networking component in isolation, and that fuzzer mocks at the level of IO. But the whole-system one doesn’t.
