I know some of the theory from talks and from hearing people discuss it, but I don’t have much practical experience analyzing this sort of thing. Maybe someone else has another idea for which tools to try.
Maybe you could get better performance with io_uring or some other asynchronous I/O library, if those could help coordinate the system calls better or reduce the number of them that are needed?
Also, the file you posted is only 8-something megabytes, so another approach might be to read the entire file with std.fs.Dir.readFileAlloc, as long as it is smaller than some threshold, and then process it in memory.
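Here is a rough sketch of what I mean by the threshold idea. It is untested against your setup and assumes the Zig 0.14-era std API, where readFileAlloc takes (allocator, path, max_bytes) and fails with error.FileTooBig above the limit; newer versions changed this signature. The helper name readSmallFile, the 64 MiB cutoff and the file name "input.txt" are made up for illustration.

```zig
const std = @import("std");

/// Hypothetical helper: returns the whole file if it fits in `max_bytes`,
/// or null so the caller can fall back to a streaming/chunked path.
fn readSmallFile(
    allocator: std.mem.Allocator,
    dir: std.fs.Dir,
    sub_path: []const u8,
    max_bytes: usize,
) !?[]u8 {
    const data = dir.readFileAlloc(allocator, sub_path, max_bytes) catch |err| {
        // Assumption: the 0.14-era std reports oversize files as FileTooBig.
        if (err == error.FileTooBig) return null;
        return err;
    };
    return data;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Assumed cutoff; pick whatever makes sense for your memory budget.
    const threshold: usize = 64 * 1024 * 1024;

    // "input.txt" is a placeholder path.
    const maybe_data = try readSmallFile(allocator, std.fs.cwd(), "input.txt", threshold);
    if (maybe_data) |data| {
        defer allocator.free(data);
        // Process the whole buffer in memory here.
        std.debug.print("read {d} bytes in one go\n", .{data.len});
    } else {
        std.debug.print("file is over the threshold, use the streaming path\n", .{});
    }
}
```

Whether this actually beats buffered or chunked reads is something you would have to measure; the main appeal is that the file contents are fetched in one large read up front, so the processing loop itself issues no system calls.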