Piping whole file through stdin

Hi I’m a new zig user. Loving the simplicity of zig. This is the way. :wink:

I’m building a Bitcoin blockchain parser as my first Zig project, and I want to pipe a whole dat file (128 MiB) in from the command line:

cat blk00000.dat | ./parce 

or

./parce < blk00000.dat

In my normal read loop I read one block at a time and stop when all the bytes in the file have been read.
So, I know where this goes wrong:

const file_size = (try file.stat()).size;

My read loop condition is:

    while (bytes_read < file_size) {

It seems that the file handle that std.io.getStdIn() returns does not give a regular number when I try to get the file size.
I have been looking through the std library code, and I was thinking it had something to do with:

pub fn getStdIn() File {
    return File{
        .handle = getStdInHandle(),
        .capable_io_mode = .blocking,
        .intended_io_mode = default_mode,
    };
}

So, I tried changing the io modes to evented. But was not having any luck with that. Perhaps that is not the reason, or perhaps I just did it wrong.

I was also thinking that my whole approach to detecting the end of file is unhelpful, and that there is a more straightforward way to do this in Zig. The dat files are pretty huge binary files, but there is probably a better way to detect EOF.

I normally don’t write help posts. But due to zig being a new language and all, I am having trouble finding the info which might put me on track.


If you’re trying to read all data into memory at once and are ok blocking until it has been loaded, you probably want something like std.io.Reader.readUntilDelimiterOrEofAlloc(…)

If you really want to stream bytes, then you’re going to need to rethink the whole idea of trying to get the file size… It’s a stream of bytes, that will be in the process of being actively populated by the time your program is being loaded

Edit: In the streaming case, you’ll likely want to catch error.EndOfStream for EOF, and catch error.BrokenPipe in case there’s some other reason to stop (for example piping into something like head)

I managed to get this working:

cat blk00000.dat | ./parce 

By changing my read loop condition to:

    while (try readBlock(&magic_bytes, &block_size, &raw_block, in_stream, allocator) > 0) {

However, piping with

./parce < blk00000.dat

Does not work. Is this handled differently? Hm?

Yes, I got the reading itself sorted in the general case. I only read one block at a time into memory (1 MiB).

It’s just the piping in from the command line which I’m trying to figure out.


I used

    var rf = std.io.getStdIn();

    if (rf.isTty()) {
        help(prog);
        return;
    }

and then when reading

            self.bcnt = @intCast(u32, try self.file.read(self.buff[0..]));
            if (0 == self.bcnt) return null;

When fs.File.read() returns zero it is EOF.

I used this approach in this program (file compressor)
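
For reference, here is a minimal sketch of the "read until read() returns zero" loop on its own. It assumes a Zig version contemporary with this thread, where std.io.getStdIn() exists; countBytes is a hypothetical helper name, not something from the compressor or from std. It never asks for the file size, so it should behave the same for a pipe and for a redirected file.

```zig
const std = @import("std");

/// Hypothetical helper: reads from `reader` until read() returns 0 (EOF)
/// and returns the total number of bytes seen.
fn countBytes(reader: anytype) !usize {
    var buf: [4096]u8 = undefined;
    var total: usize = 0;
    while (true) {
        const n = try reader.read(buf[0..]);
        if (n == 0) break; // zero bytes read means end of input
        total += n;
    }
    return total;
}

pub fn main() !void {
    // Assumption: std.io.getStdIn() is available in the Zig version in use.
    const stdin = std.io.getStdIn();
    const total = try countBytes(stdin);
    std.debug.print("read {d} bytes\n", .{total});
}
```

Running it as ./prog < file or as cat file | ./prog should report the same count, since the loop only depends on read() eventually returning zero.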


OK, thanks, I will look into this. Do you think this will work in the redirection case?

./parce < blk00000.dat

Yes, it works both for redirection and for pipelines:

$ ./srz c < ~/dc/calg/bib > bib.srz
$ cat ~/dc/calg/bib | ./srz c > bib.srz_
$ cmp bib.srz bib.srz_ // identical

Ok I found a solution.

    const in = std.io.getStdIn();
    const in_stat = try in.stat();
    if (in_stat.kind == .NamedPipe or in_stat.kind == .File) { // piped or redirected input
        try read(in);
    }

So, it seems in the first case in is of kind .NamedPipe

cat blk00000.dat | ./parce 

And in the latter case, in is a .File

./parce < blk00000.dat

I tried using .isTty(), but it did not trigger as true in my case. Perhaps I used it wrong.

Thank you all for your replies.

Edit: To keep everything in one place: I also changed my read loop condition so that it does not depend on the file size at all. Now it just reads until no bytes are read.

while (try readBlock(&magic_bytes, &block_size, &raw_block, in_stream, allocator) > 0) {

Unless I’m missing something, the stat is unnecessary. Is there any particular reason you need to check the kind?

To check that the input is indeed a pipe or a file that is being piped or redirected in on the command line, as far as I know.

I would turn this on its head: always read as from a stream of bytes (so you don’t ask for any file size at all) and split that into chunks based on whatever condition you know about the format. For typical Unix files this would be “splitting into lines using the \n terminator”; for your case, it might be “accumulating data until you have a 1 MB block, or until EOF”.
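
To make that concrete, here is a rough sketch of a format-driven loop. It assumes each record in a blk*.dat file is 4 magic bytes, then a 4-byte little-endian block size, then the block itself, and that blocks fit in the 1 MiB mentioned earlier in the thread. nextBlock, header_len and block_buf are hypothetical names, not the readBlock from the original program, and the magic bytes are not validated here. It also assumes a Zig version where std.io.getStdIn() and the reader's readAll/readNoEof exist.

```zig
const std = @import("std");

const header_len = 8; // 4 magic bytes + 4-byte little-endian block size

/// Hypothetical helper: reads one record into `buf` and returns the block
/// bytes, or null when the stream ends cleanly before a new header.
fn nextBlock(reader: anytype, buf: []u8) !?[]u8 {
    var header: [header_len]u8 = undefined;
    const got = try reader.readAll(header[0..]);
    if (got == 0) return null; // clean EOF between records
    if (got < header_len) return error.TruncatedHeader;

    // Decode the little-endian size by hand to avoid version-specific helpers.
    const size: usize = @as(usize, header[4]) |
        (@as(usize, header[5]) << 8) |
        (@as(usize, header[6]) << 16) |
        (@as(usize, header[7]) << 24);
    if (size > buf.len) return error.BlockTooLarge;

    // readNoEof fails with error.EndOfStream if the block is cut short.
    try reader.readNoEof(buf[0..size]);
    return buf[0..size];
}

// Assumption: 1 MiB is enough for every block in the input.
var block_buf: [1024 * 1024]u8 = undefined;

pub fn main() !void {
    const in_stream = std.io.getStdIn().reader();
    var count: usize = 0;
    while (try nextBlock(in_stream, block_buf[0..])) |block| {
        _ = block; // parse the block here
        count += 1;
    }
    std.debug.print("blocks: {d}\n", .{count});
}
```

The loop ends when the stream does, so no stat call or file size is involved, and a truncated record shows up as an error instead of a silent short read.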

Yes, you are absolutely right. I ended up doing exactly that when I changed my read loop condition to:

while (try readBlock(&magic_bytes, &block_size, &raw_block, in_stream, allocator) > 0) {