Any easy way to check if the ChildProcess has exited?

matklad · February 15, 2024, 10:45pm

There’s fn wait(ChildProcess) !Term, which blocks until the child has exited. Is there an easy way to get something like

fn wait_now(ChildProcess) !?Term

which doesn’t block, and returns a null if the process hasn’t exited yet?

In case this is an XY problem: I have a bunch of fuzzers, which I want to run in a loop for some time. On a multicore CPU, I also want to run one fuzzer process per CPU. So I’d love to write the code a-la

var fuzzers: [num_cpus]?ChildProcess = .{null} ** num_cpus;
for (0..60) |_| {
    for (&fuzzers) |*fuzzer| {
        if (fuzzer.* == null) fuzzer.* = spawn_fuzzer_process();
    }

    std.time.sleep(1 * std.time.ns_per_s);

    for (&fuzzers) |*fuzzer| {
        if (try fuzzer.*.?.wait_now()) |exit_code| {
            if (exit_code.is_error()) report_error();
            fuzzer.* = null;
        }
    }
}

dimdin · February 15, 2024, 10:58pm

“easy way”

On Unix, the SIGCHLD signal is sent to the parent when a child dies. By default it does nothing.
Windows has WaitForMultipleObjects.
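
A minimal C sketch of the Unix half of this, assuming POSIX sigaction and waitpid: the SIGCHLD handler only sets a flag, and the main loop does the actual non-blocking check. The names child_exited, install_sigchld_handler and child_has_exited are made up for illustration and are not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler, read and cleared by the main loop. */
volatile sig_atomic_t child_exited = 0;

static void on_sigchld(int signo)
{
        (void)signo;
        child_exited = 1; /* only async-signal-safe work belongs here */
}

/* Install the SIGCHLD handler. Returns 0 on success, -1 on error. */
int install_sigchld_handler(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigchld;
        sigemptyset(&sa.sa_mask);
        /* SA_NOCLDSTOP: no notification when a child is merely stopped. */
        sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
        return sigaction(SIGCHLD, &sa, NULL);
}

/*
 * Hypothetical helper: non-blocking check of one child.
 * Returns 1 and fills *status if it exited, 0 if it is still running,
 * -1 on error.
 */
int child_has_exited(pid_t pid, int *status)
{
        pid_t r;

        assert(pid > 0);
        r = waitpid(pid, status, WNOHANG);
        if (r == pid)
                return 1;
        return r == 0 ? 0 : -1;
}
```

The flag only says that at least one child changed state. Several exits can collapse into a single signal, so the main loop still has to check each child it cares about.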

Sze · February 16, 2024, 1:53am

I haven’t tried this, so there may be better alternatives, but I would imagine this could work:

Have 2n+1 processes: 1 main process that fills a queue, and n child processes that each pull from the queue and start a “grandchild” process to execute the command. Each child blocks until its grandchild completes, then pulls the next item from the queue, or exits if the queue is empty.

That way as long as there is work new processes get started, once the work queue is empty/closed, we (in the worst case) wait for the last “process starter” to complete until they get joined, but this is fine because the starters have done their work already.

The only thing I imagine that might be problematic about that, might be that you have a bunch more processes, but I would have to do some experiments to see whether that is actually a problem. I guess you also have more moving pieces and if your starters could crash in some way then you would have a similar problem again, but if the starter processes don’t fail in 99.9999% of cases then you at least aren’t blocking in the normal case.

Instead of the polling loop above, each starter could run in a while loop:

while(queue.popFront()) |item| { // popFront() returns null if the queue was closed by the write end?
    const fuzzer = spawn_fuzzer_process();
    const exit_code = fuzzer.wait(); // blocking
    if(exit_code.is_error()) results_queue.pushBack(.{.id=item.id, .code=exit_code});
}
// work queue empty -> exit

The parts I don’t currently know are:

whether such a queue could be implemented easily/efficiently via, for example, io_uring
can we reuse something the zig compiler uses, or do we want something with more / other features?
are there some details I am unaware of, because I haven’t done any multi-threaded/multi-process programming in zig yet

Until I have done some zig multi process programming, I may have some misconceptions, based on what other languages have hidden via their abstractions.

dee0xeed · February 16, 2024, 6:30am

But beware signal merging. This doc is specifically about waiting for child processes.

dee0xeed · February 16, 2024, 6:41am

Oh, yeah It’s not easy. I’ve dug up some (pretty old) C code, here is a snippet

void mon_wait_workers(struct monitor *mon)
{
        int status, pid;
        int k;
        /* account for signal merging */
        for (k = 0; k < mon->nworkers; k++) {
                struct worker *w = &mon->workers[k];
                if (!w->pid)
                        continue;
                pid = waitpid(w->pid, &status, WNOHANG);

You have to maintain a list of workers and, upon getting SIGCHLD, check each child in non-blocking mode to see whether it has terminated.
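
As a rough, self-contained illustration of that bookkeeping (not the original monitor code): one pass over a worker table with waitpid(..., WNOHANG). The struct layout, poll_workers and the demo helper spawn_exiting_child are assumptions made for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker record; pid == 0 marks a free slot. */
struct worker {
        pid_t pid;
        int status; /* raw wait status, valid once the worker was reaped */
};

/* Demo stand-in for a real worker: a child that exits at once with `code`. */
pid_t spawn_exiting_child(int code)
{
        pid_t pid = fork();

        if (pid == 0)
                _exit(code);
        return pid; /* -1 if fork() failed */
}

/*
 * Check every running worker once, without blocking.
 * A reaped worker gets its status stored and its pid cleared.
 * Returns the number of workers reaped in this pass.
 */
size_t poll_workers(struct worker *workers, size_t nworkers)
{
        size_t reaped = 0;

        for (size_t k = 0; k < nworkers; k++) {
                struct worker *w = &workers[k];
                int status;

                if (w->pid == 0)
                        continue;
                assert(w->pid > 0);
                /* 0 means still running, -1 means error: skip both */
                if (waitpid(w->pid, &status, WNOHANG) != w->pid)
                        continue;
                w->status = status;
                w->pid = 0;
                reaped++;
        }
        return reaped;
}
```

Calling poll_workers after every SIGCHLD, or simply once per loop iteration, covers merged signals because every live slot is checked on each pass.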

dimdin · February 16, 2024, 7:00am

It seems that there is an easy way: pid = waitpid(WAIT_ANY, &status, WNOHANG);
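
A sketch of how that call is usually wrapped, assuming plain POSIX: loop until waitpid reports nothing more to reap, so that merged SIGCHLDs are still accounted for. WAIT_ANY is glibc's name for -1, which is used directly here. The callback type and function name are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback, invoked once per reaped child. May be NULL. */
typedef void (*child_exit_fn)(pid_t pid, int status);

/*
 * Reap every child that has already exited, without blocking.
 * Returns how many children were reaped in this call.
 */
int reap_exited_children(child_exit_fn cb)
{
        int reaped = 0;

        for (;;) {
                int status;
                pid_t pid = waitpid(-1, &status, WNOHANG);

                if (pid > 0) {
                        if (cb)
                                cb(pid, status);
                        reaped++;
                        continue;
                }
                if (pid == -1 && errno == EINTR)
                        continue;
                /* 0: children exist but none has exited; ECHILD: no children */
                assert(pid == 0 || errno == ECHILD);
                return reaped;
        }
}
```

The trade-off is that waitpid(-1, ...) reaps any child of the process, including ones started by unrelated code such as a library, so it fits best when the caller owns all child processes.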

dee0xeed · February 16, 2024, 7:13am

If you mean “how to check if a child exited without going to sleep” then yes, just do it in a non-blocking manner. I meant the machinery with child processes in general.

dee0xeed · February 16, 2024, 9:00pm

And here is a somewhat more elaborate version, also ~10 years old:

int do_wait_workers(struct monitor *m)
{
        int err;
        int k, s;
        int n;

        /* account for signal merging */
        while (1) {

                n = 0;

                for (k = 0; k < m->nworkers; k++) {

                        struct worker *w = &m->workers[k];

                        if (!w->pid)
                                continue;

                        err = wait_worker(w);
                        if (err) /* this one has not exited yet */
                                continue;

                        if (WIFSTOPPED(w->status)) {
                                log_msg("WARN: '%s' stopped... being traced?\n", w->name);
                                continue;
                        }

                        if ((!(WIFEXITED(w->status))) && (!(WIFSIGNALED(w->status)))) {
                                log_msg("WARN: '%s': WTF? (status = %d)\n", w->name, w->status);
                                continue;
                        }

                        /* worker terminated */
                        n++;
                        w->pid = 0; /* mark as not running */
                        s = WEXITSTATUS(w->status);
                        log_msg("INFO: '%s' terminated with status %d (0x%.8X)\n", w->name, s, w->status);
                        edsm_put_event(monsm[k], EC_WORKER_EXITED);

                        if (WIFSIGNALED(w->status)) {
                                /* most likely worker crashed during initialization */
                                int sig = WTERMSIG(w->status);
                                log_msg("      by signal %d (most likely crashed)\n", sig);
                                continue;
                                // w->restart = 1; (?)
                        }

                        if (s & EXIT_WORK) {
                                log_msg("      during it's working stage\n");
                        } else {
                                log_msg("      during it's init stage\n");
                        }

                        if (s & EXIT_RESPAWN) {
                                w->restart = 1;
                                log_msg("      and WANTS to be restarted\n");
                        } else {
                                log_msg("      and DO NOT WANT to be restarted\n");
                        }
                }
                if (!n)
                        break;
        }
        return 0;
}

The wait_worker() function is quite simple:

int wait_worker(struct worker *w)
{
        pid_t pid;
        int status;

        pid = waitpid(w->pid, &status, WNOHANG);
        if (-1 == pid) {
                log_msg("OOPS: waitpid(%d): %s (worker '%s')\n", w->pid, strerror(errno), w->name);
                return 1;
        }
        if (!pid)
                return 1;

        w->status = status;
        return 0;
}

If I remember right, that mon (which is an event-driven state machine, btw) was designed not only to watch whether a worker process has exited, but also to monitor the workers’ behavior like this:

worker consumed no CPU during a period - kill it
worker performed no I/O operations during a period - kill it

It also took care of restarting them.

Something like that, I do not remember exact details.

After a year or so I realized that inventing a system inside a system is not the right way to go, so I split all those workers (in my case they were not copies of the same program; each had its own specific job) into separate services (in systemd terms) and let systemd manage them.

Cloudef · February 17, 2024, 2:11am

You can use epoll on linux, but generally there is no cross-platform way

github.com/Cloudef/pid-defer/blob/master/src/defer.zig#L64-L81

          const fds = .{ ppid_fd, child_fd };
          const pids = .{ ppid, chld.id };
          inline for (fds, pids) |fd, pid| {
              var ev: std.os.linux.epoll_event = .{ .events = std.os.linux.EPOLL.IN, .data = .{ .fd = pid } };
              try std.posix.epoll_ctl(efd, std.os.linux.EPOLL.CTL_ADD, fd, &ev);
          }
          
          var events: [2]std.os.linux.epoll_event = undefined;
          var nevents: usize = 0;
          while (true) {
              nevents = std.os.linux.epoll_pwait(efd, events[0..], events.len, -1, null);
              for (events[0..nevents]) |ev| {
                  switch (waitpid(ev.data.fd, false)) {
                      .noop, .alive => {},
                      .exited, .nopid => return,
                  }
              }
          }
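
For comparison, a Linux-only C sketch of the same idea with a single child: get a pidfd for the child, register it with epoll, and wait with a timeout. It assumes a kernel with pidfd_open (5.3 or later) and headers that define SYS_pidfd_open. The wrapper and function names are invented here, and this is not a translation of the Zig code above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Raw syscall wrapper: glibc before 2.36 ships no pidfd_open(). */
static int pidfd_open_raw(pid_t pid)
{
        return (int)syscall(SYS_pidfd_open, pid, 0);
}

/*
 * Wait up to timeout_ms for `pid` to exit.
 * Returns 1 if it exited (reaped, *status filled), 0 on timeout,
 * -1 on error (errno is ENOSYS if the kernel lacks pidfd_open).
 */
int wait_child_epoll(pid_t pid, int timeout_ms, int *status)
{
        struct epoll_event ev, out;
        int pfd, efd, n, result = -1;

        assert(pid > 0);
        pfd = pidfd_open_raw(pid);
        if (pfd == -1)
                return -1;
        efd = epoll_create1(EPOLL_CLOEXEC);
        if (efd == -1) {
                close(pfd);
                return -1;
        }
        ev.events = EPOLLIN; /* a pidfd turns readable when the process exits */
        ev.data.fd = pfd;
        if (epoll_ctl(efd, EPOLL_CTL_ADD, pfd, &ev) == 0) {
                do { /* note: a retry after EINTR restarts the full timeout */
                        n = epoll_wait(efd, &out, 1, timeout_ms);
                } while (n == -1 && errno == EINTR);
                if (n == 0)
                        result = 0;
                else if (n == 1 && waitpid(pid, status, WNOHANG) == pid)
                        result = 1;
        }
        close(efd);
        close(pfd);
        return result;
}
```

With timeout_ms set to 0 this behaves like a non-blocking check. A real program would keep one epoll instance and register many pidfds instead of creating one per call.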

dee0xeed · February 17, 2024, 7:17am

Sure, look for edsm on my gh page, same nick as at this forum.
There is a version in D where I’ve made some abstraction,
so it goes both for Linux and FreeBSD. Timers in particular are interesting,
because they are not fd in FreeBSD.

Cloudef · February 17, 2024, 7:49am

For generic event loop abstractions in Zig there’s mitchellh/libxev on GitHub: a cross-platform, high-performance event loop that provides abstractions for non-blocking IO, timers, events, and more. It works on Linux (io_uring or epoll), macOS (kqueue), and Wasm + WASI, and is available as both a Zig and a C API.

dee0xeed · February 17, 2024, 8:39am

That’s great, but it would be interesting to see a (single thread) application, that

handles multiple clients, sending data using various protocols
interacts with multiple instances of some DBMS
interacts with multiple instances of some KVS (REDIS or so)

not to mention such “little things” as signals and file system events.

BTW, about the file system: there is the IN_Q_OVERFLOW event, which has screwed with my brain a couple of times. Eventually I came to the conclusion that for very frequent events it is better to use a polling (opendir/readdir/closedir) approach, so now I use inotify only for things that I am sure happen rarely.

dee0xeed · February 17, 2024, 9:01am

Take a step forward, add a state machine layer on top of event loop and you’ll get:

automatically structured code (a function for each particular state/event combination)
100+ level of concurrency without fibers/goroutines compiler magic
true OOP (entities interacting with messages - this is exactly what Alan Kay meant, not that tired “encapsulation/inheritance/polymorphism” triad)

dee0xeed · February 17, 2024, 3:37pm

And there is a funny typo:

$ grep -r IN_Q_OVERFLOW /usr/include/
/usr/include/linux/inotify.h:#define IN_Q_OVERFLOW		0x00004000	/* Event queued overflowed */
                                                                                  ^
/usr/include/x86_64-linux-gnu/sys/inotify.h:#define IN_Q_OVERFLOW	 0x00004000	/* Event queued overflowed.  */

matklad · February 23, 2024, 7:06pm

Figured something sufficiently hacky — I’m just going to .Pipe child’s stdin and wait until I get a broken pipe when trying to write to it.

        var child = try shell.spawn_options(
            .{ .stdin_behavior = .Pipe, .stderr_behavior = .Inherit },
            "zig/zig build fuzz -- --seed={seed} canary",
            .{ .seed = "92" },
        );

        const stdin = child.stdin.?;
        _ = try std.os.fcntl(stdin.handle, std.os.F.SETFD, @as(u32, std.os.O.NONBLOCK));

        while (true) {
            _ = stdin.write(&.{0}) catch |err| switch (err) {
                error.WouldBlock => continue,
                error.BrokenPipe => break,
                else => return err,
            };
        }
        _ = try child.wait();

dee0xeed · February 23, 2024, 8:01pm

For a simple scenario like your example (launch a child, then wait indefinitely for a broken pipe) it might work, but your while loop is a CPU hog (at least). It is also not a solution for a server with a “serve one or more clients in each child” model, since such a waiting mechanism would significantly complicate the logic: you would have to poll periodically with this _ = stdin.write.
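
A C sketch of the same broken-pipe probe with a pause between writes, so the loop does not spin. It assumes POSIX pipes and that SIGPIPE is ignored. spawn_sleeping_child is a made-up stand-in for launching the fuzzer, and the other function names are hypothetical too.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Demo stand-in for the fuzzer: the child holds the read end of a pipe,
 * sleeps `secs` seconds and exits. The non-blocking write end is returned
 * through *wfd. Returns the child's pid, or -1 on error. */
pid_t spawn_sleeping_child(unsigned secs, int *wfd)
{
        int fds[2], flags;
        pid_t pid;

        if (pipe(fds) == -1)
                return -1;
        flags = fcntl(fds[1], F_GETFL);
        if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1 ||
            (pid = fork()) == -1) {
                close(fds[0]);
                close(fds[1]);
                return -1;
        }
        if (pid == 0) {
                close(fds[1]);
                sleep(secs);
                _exit(0);
        }
        close(fds[0]);            /* keep no read end, or EPIPE never comes */
        signal(SIGPIPE, SIG_IGN); /* get EPIPE from write() instead of dying */
        *wfd = fds[1];
        return pid;
}

/* Returns 1 if every reader is gone, 0 if a reader remains, -1 on error. */
int pipe_reader_gone(int wfd)
{
        char zero = 0;
        ssize_t n;

        do {
                n = write(wfd, &zero, 1);
        } while (n == -1 && errno == EINTR);
        if (n >= 0 || errno == EAGAIN || errno == EWOULDBLOCK)
                return 0; /* byte written, or pipe full: reader still there */
        return errno == EPIPE ? 1 : -1;
}

/* Probe every interval_ms, at most max_probes times, then reap the child.
 * Returns 0 once the child is reaped, -1 on error or when giving up. */
int wait_child_via_pipe(pid_t pid, int wfd, long interval_ms,
                        int max_probes, int *status)
{
        struct timespec delay = { interval_ms / 1000,
                                  (interval_ms % 1000) * 1000000L };

        assert(interval_ms >= 0 && max_probes > 0);
        for (int i = 0; i < max_probes; i++) {
                int gone = pipe_reader_gone(wfd);

                if (gone == -1)
                        return -1;
                if (gone == 1)
                        return waitpid(pid, status, 0) == pid ? 0 : -1;
                nanosleep(&delay, NULL);
        }
        return -1;
}
```

The probe only reports that all read ends are closed. A child that closes its stdin early, or a grandchild that inherits it, will make the result misleading, which is part of why this remains a hack.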

matklad · February 23, 2024, 8:47pm

I don’t have a server problem, I have an “I need to run 10 fuzzers” problem. For the server, of course I’d avoid spawning subprocesses in the first place, and, if I do need subprocesses, I’d shovel pidfd into io_uring or something. But this question is a different genre: I am looking for the simplest solution to solve the problem in the small.

dimdin · February 23, 2024, 9:05pm

You mention pipes and I remembered another way, the djb self-pipe trick.
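
For readers who have not met it, a rough C sketch of the self-pipe trick applied to SIGCHLD, assuming POSIX: the signal handler writes one byte into a non-blocking pipe, and the main loop polls the read end like any other file descriptor. All names are invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int selfpipe[2] = { -1, -1 };

static void on_sigchld(int signo)
{
        int saved_errno = errno;
        ssize_t n = write(selfpipe[1], "x", 1); /* async-signal-safe */

        (void)n; /* a full pipe is fine: a wakeup is already pending */
        (void)signo;
        errno = saved_errno;
}

/* Create the pipe and install the handler. Returns 0 or -1. */
int selfpipe_init(void)
{
        struct sigaction sa;

        if (pipe(selfpipe) == -1)
                return -1;
        for (int i = 0; i < 2; i++) {
                int fl = fcntl(selfpipe[i], F_GETFL);

                if (fl == -1 ||
                    fcntl(selfpipe[i], F_SETFL, fl | O_NONBLOCK) == -1 ||
                    fcntl(selfpipe[i], F_SETFD, FD_CLOEXEC) == -1)
                        return -1;
        }
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigchld;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
        return sigaction(SIGCHLD, &sa, NULL);
}

/*
 * Wait up to timeout_ms for a SIGCHLD wakeup and drain the pipe.
 * Returns 1 if some child changed state, 0 on timeout, -1 on error.
 */
int selfpipe_wait(int timeout_ms)
{
        struct pollfd pfd = { .fd = selfpipe[0], .events = POLLIN, .revents = 0 };
        char buf[64];
        int n;

        assert(selfpipe[0] >= 0);
        do { /* the signal itself interrupts poll(), so retry on EINTR */
                n = poll(&pfd, 1, timeout_ms);
        } while (n == -1 && errno == EINTR);
        if (n <= 0)
                return n;
        while (read(selfpipe[0], buf, sizeof(buf)) > 0)
                ; /* drain: one wakeup may stand for several exits */
        return 1;
}
```

After selfpipe_wait returns 1, the caller still has to find out which children exited, for example with waitpid and WNOHANG on each known pid.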

matklad · May 2, 2024, 10:50am

For posterity: I implemented this, it has been running for a couple of weeks, and no issues have been found so far:

matklad · May 8, 2024, 12:25pm

Important correction: should have been SETFL rather than SETFD:

github.com/tigerbeetle/tigerbeetle

cfo: actually make file non-blocking

tigerbeetle:main ← tigerbeetle:matklad/i-love-systems-progarmming-and-never-question-my-career-choice (opened 12:24PM - 08 May 24 UTC by matklad, +4 -1)

SETFD manipulates flags of the file _descriptor_ (a single file can have many descriptors pointing to it). SETFL manipulates flags of the file itself (they are shared between all descriptors pointing at the file). NonBlocking is sadly a property of the file, and not of a file descriptor, though this isn't a problem for our use-case. What is a problem is that I wrongly used SETFD instead of SETFL, which made this into a no-op (it worked in practice because the default pipe buffer was enough to accommodate one byte a second).
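
The difference is easy to check from C. The sketch below assumes POSIX fcntl; set_nonblocking and is_nonblocking are names made up for illustration. It reads the current file status flags with F_GETFL and ORs in O_NONBLOCK through F_SETFL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Put the open file behind `fd` into non-blocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
        int flags;

        assert(fd >= 0);
        flags = fcntl(fd, F_GETFL); /* file status flags, not descriptor flags */
        if (flags == -1)
                return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Returns 1 if O_NONBLOCK is set, 0 if it is not, -1 on error. */
int is_nonblocking(int fd)
{
        int flags = fcntl(fd, F_GETFL);

        if (flags == -1)
                return -1;
        return (flags & O_NONBLOCK) != 0;
}
```

Passing O_NONBLOCK to F_SETFD succeeds without complaint on Linux, because only the FD_CLOEXEC bit is looked at there, so is_nonblocking keeps returning 0 afterwards. After set_nonblocking, a descriptor made with dup() reports the flag as well, since both descriptors share one open file.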