Any easy way to check if the ChildProcess has exited?

There’s fn wait(ChildProcess) !Term, which blocks until the child has exited. Is there an easy way to get something like

fn wait_now(ChildProcess) !?Term

which doesn’t block, and returns a null if the process hasn’t exited yet?

In case this is an XY problem: I have a bunch of fuzzers, which I want to run in a loop for some time. On a multicore CPU, I also want to run one fuzzer process per CPU. So I’d love to write code along these lines:

var fuzzers: [num_cpus]?ChildProcess = .{null} ** num_cpus;
for (0..60) |_| {
    for (&fuzzers) |*fuzzer| {
        if (fuzzer.* == null) fuzzer.* = spawn_fuzzer_process();
    }

    std.time.sleep(1 * std.time.ns_per_s);

    for (&fuzzers) |*fuzzer| {
        // wait_now() is the hypothetical non-blocking wait described above.
        if (try fuzzer.*.?.wait_now()) |exit_code| {
            if (exit_code.is_error()) report_error();
            fuzzer.* = null;
        }
    }
}

“easy way” :smile:

On Unix, the SIGCHLD signal is sent to the parent when a child dies. By default it is ignored.
Windows has WaitForMultipleObjects.

I haven’t tried so I am not sure about maybe other better alternatives, but I would imagine this could work:

Have 2n+1 processes: one main process that fills a queue, and n child processes that pull from the queue. Each child starts a “grandchild” process to execute the command and blocks until the grandchild completes; when it’s done, it pulls the next item from the queue, or exits if the queue is empty.

That way, as long as there is work, new processes get started. Once the work queue is empty or closed, we (in the worst case) wait for the last “process starter” to finish before joining them all, but this is fine because by then the starters have done their work.

The only thing I imagine might be problematic is that you have a bunch more processes, but I would have to do some experiments to see whether that is actually a problem. You also have more moving pieces, and if your starters could crash in some way you would have a similar problem again. But if the starter processes don’t fail in 99.9999% of cases, then at least you aren’t blocking in the normal case.

Instead of the polling loop from the question, each starter could run in a while loop:

while(queue.popFront()) |item| { // popFront() returns null if the queue was closed by the write end?
    var fuzzer = spawn_fuzzer_process();
    const exit_code = fuzzer.wait(); // blocking, so plain wait() rather than wait_now()
    if(exit_code.is_error()) results_queue.pushBack(.{.id=item.id, .code=exit_code});
}
// work queue empty -> exit
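
To make the starter idea concrete, here is a minimal sketch in C rather than Zig. It is only an illustration under stated assumptions: the work queue and the results queue are plain pipes carrying int ids, run_job() is a hypothetical stand-in for one fuzzer run, and run_all() is a made-up name. Both pipes are assumed to stay well below the kernel pipe buffer size, so nobody blocks on a full pipe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical job standing in for one fuzzer run: fails for every third id. */
static int run_job(int id)
{
        return (id % 3 == 0) ? 1 : 0;
}

/* Starter loop: pull ids until the queue is closed, run each in a grandchild. */
static void starter(int queue_rd, int results_wr)
{
        int id;
        while (read(queue_rd, &id, sizeof id) == (ssize_t)sizeof id) {
                int status;
                pid_t pid = fork();
                if (pid == 0)
                        _exit(run_job(id));                   /* grandchild */
                if (pid < 0 || waitpid(pid, &status, 0) < 0)  /* blocking wait */
                        _exit(2);
                if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                        if (write(results_wr, &id, sizeof id) < 0) /* report failure */
                                _exit(2);
        }
        _exit(0); /* work queue closed and drained */
}

/* Run `nitems` jobs on `nstarters` starter processes.
 * Returns the number of failed jobs, or -1 on a setup error. */
static int run_all(int nstarters, int nitems)
{
        int queue[2], results[2];
        int failed = 0, id;

        assert(nstarters > 0 && nitems >= 0);
        if (pipe(queue) < 0 || pipe(results) < 0)
                return -1;

        for (int k = 0; k < nstarters; k++) {
                pid_t pid = fork();
                if (pid < 0)
                        return -1;
                if (pid == 0) {
                        close(queue[1]);   /* otherwise read() never sees EOF */
                        close(results[0]);
                        starter(queue[0], results[1]);
                }
        }
        close(queue[0]);
        close(results[1]);

        for (id = 1; id <= nitems; id++)
                if (write(queue[1], &id, sizeof id) != (ssize_t)sizeof id)
                        return -1;
        close(queue[1]);                   /* closing the queue lets starters exit */

        /* EOF on the results pipe means every starter has exited. */
        while (read(results[0], &id, sizeof id) == (ssize_t)sizeof id)
                failed++;
        close(results[0]);

        while (wait(NULL) > 0)             /* join the starters */
                ;
        return failed;
}
```

With these assumptions, run_all(3, 9) runs ids 1 to 9 on three starters and counts ids 3, 6 and 9 as failures. The main process never polls: it only blocks on the results pipe and on the final join.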

The parts I don’t currently know are:

  • whether such a queue could be implemented easily and efficiently via, for example, io_uring
  • whether we can reuse something the zig compiler uses, or want something with more / other features
  • whether there are details I am unaware of, because I haven’t done any multi-threaded or multi-process programming in zig yet

Until I have done some zig multi process programming, I may have some misconceptions, based on what other languages have hidden via their abstractions.

But beware of signal merging: several SIGCHLDs that arrive close together can be delivered as a single signal. This doc is specifically about waiting for child processes.

Oh, yeah :slight_smile: It’s not easy. I’ve dug up some (pretty old) C code, here is a snippet

void mon_wait_workers(struct monitor *mon)
{
        int status, pid;
        int k;
        /* account for signal merging */
        for (k = 0; k < mon->nworkers; k++) {
                struct worker *w = &mon->workers[k];
                if (!w->pid)
                        continue;
                pid = waitpid(w->pid, &status, WNOHANG);

You have to maintain a list of workers and, upon getting SIGCHLD, check each child in non-blocking mode to see whether it has terminated.

It seems that there is an easy way: pid = waitpid(WAIT_ANY, &status, WNOHANG);
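
A minimal sketch of that call used as a non-blocking "reap whatever has exited" helper. The names spawn_sleeper() and reap_exited() are hypothetical and exist only for this illustration; -1 is the portable spelling of WAIT_ANY. Looping until waitpid() returns 0 is what accounts for merged signals.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that sleeps `ms` milliseconds and then
 * exits with `code`. Returns the child's pid, or -1 on failure. */
static pid_t spawn_sleeper(int ms, int code)
{
        pid_t pid;

        assert(ms >= 0 && ms < 1000);
        pid = fork();
        if (pid == 0) {
                struct timespec ts = { 0, ms * 1000000L };
                nanosleep(&ts, NULL);
                _exit(code);
        }
        return pid;
}

/* Reap every child that has already exited, without blocking.
 * Returns the number of children reaped, or -1 on an unexpected error. */
static int reap_exited(void)
{
        int reaped = 0;

        for (;;) {
                int status;
                pid_t pid = waitpid(-1, &status, WNOHANG); /* -1 == any child */

                if (pid > 0) {
                        reaped++;       /* `status` holds this child's result */
                        continue;
                }
                if (pid == 0)
                        return reaped;  /* children exist, none has exited yet */
                if (errno == ECHILD)
                        return reaped;  /* no children left at all */
                if (errno == EINTR)
                        continue;
                return -1;
        }
}
```

Calling reap_exited() once per iteration of a sleep loop gives roughly the behaviour of the wait_now() asked about at the top of the thread, but for all children at once.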


If you mean “how to check if a child exited without going to sleep” then yes, just do it in non-blocking manner. I meant the machinery with child processes in general.

And here is a somewhat more elaborate version, also ~10 years old:

here it is
int do_wait_workers(struct monitor *m)
{
        int err;
        int k, s;
        int n;

        /* account for signal merging */
        while (1) {

                n = 0;

                for (k = 0; k < m->nworkers; k++) {

                        struct worker *w = &m->workers[k];

                        if (!w->pid)
                                continue;

                        err = wait_worker(w);
                        if (err) /* this one has not exited yet */
                                continue;

                        if (WIFSTOPPED(w->status)) {
                                log_msg("WARN: '%s' stopped... being traced?\n", w->name);
                                continue;
                        }

                        if ((!(WIFEXITED(w->status))) && (!(WIFSIGNALED(w->status)))) {
                                log_msg("WARN: '%s': WTF? (status = %d)\n", w->name, w->status);
                                continue;
                        }

                        /* worker terminated */
                        n++;
                        w->pid = 0; /* mark as not running */
                        s = WEXITSTATUS(w->status);
                        log_msg("INFO: '%s' terminated with status %d (0x%.8X)\n", w->name, s, w->status);
                        edsm_put_event(monsm[k], EC_WORKER_EXITED);

                        if (WIFSIGNALED(w->status)) {
                                /* most likely worker crashed during initialization */
                                int sig = WTERMSIG(w->status);
                                log_msg("      by signal %d (most likely crashed)\n", sig);
                                continue;
                                // w->restart = 1; (?)
                        }

                        if (s & EXIT_WORK) {
                                log_msg("      during its working stage\n");
                        } else {
                                log_msg("      during its init stage\n");
                        }

                        if (s & EXIT_RESPAWN) {
                                w->restart = 1;
                                log_msg("      and WANTS to be restarted\n");
                        } else {
                                log_msg("      and DO NOT WANT to be restarted\n");
                        }
                }
                if (!n)
                        break;
        }
        return 0;
}

The wait_worker() function is quite simple:

int wait_worker(struct worker *w)
{
        pid_t pid;
        int status;

        pid = waitpid(w->pid, &status, WNOHANG);
        if (-1 == pid) {
                log_msg("OOPS: waitpid(%d): %s (worker '%s')\n", w->pid, strerror(errno), w->name);
                return 1;
        }
        if (!pid)
                return 1;

        w->status = status;
        return 0;
}

If I remember right, that mon (which is an event-driven state machine, btw) was designed not only to watch whether a worker process has exited, but also to monitor the workers’ behavior in this manner:

  • worker consumed no CPU in a period - kill it
  • worker performed no I/O operations in a period - kill it

It also took care of restarting them.

Something like that, I do not remember exact details.

After a year or so I realized that inventing a system inside a system is not the right way to go, so I separated all those workers (in my case they were not copies of the same program; each had its own specific job) into separate services (in systemd terms) and let systemd manage them.

You can use epoll on linux, but generally there is no cross-platform way


Sure, look for edsm on my gh page, same nick as at this forum.
There is a version in D where I’ve made some abstraction,
so it goes both for Linux and FreeBSD. Timers in particular are interesting,
because they are not fd in FreeBSD.

For generic event loop abstractions in zig there’s mitchellh/libxev on GitHub: a cross-platform, high-performance event loop that provides abstractions for non-blocking IO, timers, events, and more, and works on Linux (io_uring or epoll), macOS (kqueue), and Wasm + WASI. It is available as both a Zig and C API.


That’s great, but it would be interesting to see a (single thread) application, that

  • handles multiple clients, sending data using various protocols
  • interacts with multiple instances of some DBMS
  • interacts with multiple instances of some KVS (REDIS or so)

not to mention such “little things” as signals and file system events.

BTW, about the file system: there is the IN_Q_OVERFLOW event, which screwed my brain a couple of times. Eventually I concluded that for very frequent events it is better to use a polling (opendir/readdir/closedir) approach, so now I use inotify only for things I am sure happen rarely.

Take a step forward, add a state machine layer on top of event loop and you’ll get:

  • automatically structured code (a function for each particular state/event combination)
  • 100+ level of concurrency without fibers/goroutines compiler magic
  • true OOP (entities interacting with messages - this is exactly what Alan Kay meant, not that “encapsulation/inheritance/polymorphism” triad set on the edge)

And there is a funny typo:

$ grep -r IN_Q_OVERFLOW /usr/include/
/usr/include/linux/inotify.h:#define IN_Q_OVERFLOW		0x00004000	/* Event queued overflowed */
                                                                                  ^
/usr/include/x86_64-linux-gnu/sys/inotify.h:#define IN_Q_OVERFLOW	 0x00004000	/* Event queued overflowed.  */

Figured something sufficiently hacky — I’m just going to .Pipe child’s stdin and wait until I get a broken pipe when trying to write to it.

        var child = try shell.spawn_options(
            .{ .stdin_behavior = .Pipe, .stderr_behavior = .Inherit },
            "zig/zig build fuzz -- --seed={seed} canary",
            .{ .seed = "92" },
        );

        const stdin = child.stdin.?;
        // F.SETFL, not F.SETFD: O.NONBLOCK is a file status flag, not a descriptor flag.
        _ = try std.os.fcntl(stdin.handle, std.os.F.SETFL, @as(u32, std.os.O.NONBLOCK));

        while (true) {
            _ = stdin.write(&.{0}) catch |err| switch (err) {
                error.WouldBlock => continue,
                error.BrokenPipe => break,
                else => return err,
            };
        }
        _ = try child.wait();
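
For comparison, a rough C sketch of the same broken-pipe trick, written so that it can be polled once per loop iteration instead of spinning. The names spawn_child() and child_gone() are hypothetical, and the child is assumed never to read its stdin. Note that ignoring SIGPIPE affects the whole parent process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical child: ignores its stdin, sleeps `ms` milliseconds, exits with
 * `code`. On success stores the write end of its stdin pipe in *stdin_wr. */
static pid_t spawn_child(int ms, int code, int *stdin_wr)
{
        int fds[2];
        pid_t pid;

        assert(ms >= 0 && ms < 1000);
        if (pipe(fds) < 0)
                return -1;
        pid = fork();
        if (pid < 0) {
                close(fds[0]);
                close(fds[1]);
                return -1;
        }
        if (pid == 0) {
                struct timespec ts = { 0, ms * 1000000L };
                close(fds[1]);
                if (fds[0] != STDIN_FILENO) {
                        dup2(fds[0], STDIN_FILENO);
                        close(fds[0]);
                }
                nanosleep(&ts, NULL);
                _exit(code);
        }
        close(fds[0]);                      /* the child must hold the only read end */
        fcntl(fds[1], F_SETFL, O_NONBLOCK); /* F_SETFL, not F_SETFD */
        signal(SIGPIPE, SIG_IGN);           /* get EPIPE from write() instead of dying */
        *stdin_wr = fds[1];
        return pid;
}

/* Returns 1 if the read end of the pipe is gone (the child has most likely
 * exited), 0 if it still exists, -1 on an unexpected error. */
static int child_gone(int stdin_wr)
{
        char zero = 0;

        if (write(stdin_wr, &zero, 1) >= 0)
                return 0;                   /* byte buffered: a reader still exists */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;                   /* pipe full: a reader still exists */
        return errno == EPIPE ? 1 : -1;
}
```

Once child_gone() returns 1, a blocking waitpid() on that pid should return promptly, because the child's descriptors are closed only when it exits (or if it closes stdin itself, which this sketch assumes it does not).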

For a simple scenario like your example (launch a child, then wait indefinitely for a broken pipe) it might work, but your while loop is a CPU hog (at least). It is also not a solution for a server with a “serve one or more clients in each child” model, since such a waiting mechanism would significantly complicate the logic: you would have to poll periodically with this _ = stdin.write.


I don’t have a server problem, I have an “I need to run 10 fuzzers” problem. For a server, of course I’d avoid spawning subprocesses in the first place, and, if I did need subprocesses, I’d shovel a pidfd into io_uring or something. But this question is a different genre: I am looking for the simplest solution to solve the problem in the small.

You mention pipes and I remembered another way, the djb self-pipe trick.
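
For readers who have not met it, a minimal sketch of the self-pipe trick applied to SIGCHLD. The helper names selfpipe_init() and wait_children() are made up for this illustration. The signal handler only writes one byte to a non-blocking pipe, and the main loop polls the read end, drains it, and then reaps with WNOHANG.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int selfpipe[2] = { -1, -1 };

/* Async-signal-safe handler: just poke the pipe and restore errno. */
static void on_sigchld(int sig)
{
        int saved = errno;
        char b = (char)sig;

        if (write(selfpipe[1], &b, 1) < 0) {
                /* pipe full: a wakeup is already pending, nothing to do */
        }
        errno = saved;
}

/* Create the pipe and install the SIGCHLD handler. Returns 0 on success. */
static int selfpipe_init(void)
{
        struct sigaction sa;

        if (pipe(selfpipe) < 0)
                return -1;
        fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
        fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
        sa.sa_handler = on_sigchld;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
        return sigaction(SIGCHLD, &sa, NULL);
}

/* Wait up to `timeout_ms` for a SIGCHLD wakeup, then reap without blocking.
 * Returns the number of children reaped (0 on timeout), or -1 on error. */
static int wait_children(int timeout_ms)
{
        struct pollfd pfd = { .fd = selfpipe[0], .events = POLLIN };
        char buf[64];
        int reaped = 0, rc;

        assert(selfpipe[0] >= 0);           /* selfpipe_init() must come first */
        do {
                rc = poll(&pfd, 1, timeout_ms);
        } while (rc < 0 && errno == EINTR); /* the signal itself interrupts poll */
        if (rc <= 0)
                return rc;
        while (read(selfpipe[0], buf, sizeof buf) > 0)
                ;                           /* drain before reaping */
        while (waitpid(-1, NULL, WNOHANG) > 0)
                reaped++;                   /* signals may have been merged */
        return reaped;
}
```

Draining before reaping matters: a SIGCHLD that arrives in between leaves a byte in the pipe, so at worst the next call wakes up and reaps nothing. The same read end can sit in a poll set next to other descriptors.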


For posterity: I implemented this. It has been running for a couple of weeks, and no issues have been found so far.


Important correction: the fcntl call must use SETFL rather than SETFD, because O.NONBLOCK is a file status flag, not a file descriptor flag.