Help with iterating over tar content in the new IO interface

I have a loop over the files in a tar file; it skips some files and writes the rest to a target tar file:

…I managed to get to the point where the data should be written to the target file. But how do I get a reader which gives me the bytes of the current iterator item in the tar file? The old std.tar.Iterator.File had a method which directly gave me a reader object:

…the new std.tar.Iterator.File just gives me a file size:

…how do I go from such an iterator item to a reader over the bytes for that item in the tar file? Especially when I don’t have an offset to “seek” to the correct starting position?

…ok I think I found part of the answer:

…the reader that’s passed into the std.tar.Iterator keeps track of the current seek position, and apparently that’s updated by the iterator… together with the file size this lets me find the bytes in the raw tar data… but only by accessing both the original reader and the iterator item (instead of just the iterator item, like before)… and also only when I have direct random access to the tar file content.

But my situation is different because the data is coming directly from another tar file which is streamed in via a reader, not from bytes in memory…

So the question still stands… how do I get a reader for the current tar.Iterator item…

Tbh, this feels more like puzzle solving than programming…

PS: …it’s tempting to just pass in the original reader that was passed into the iterator, but of course as expected this completely messes up the iterator state…

I think the example you want is


Ah ok, I guess std.tar.Iterator.streamRemaining() is key:

…I guess that lets me read bytes out of the iterator’s reader without messing up the iterator state…
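Something like this, maybe? A minimal sketch of the pattern, reusing the Iterator/Allocating shapes from the code further down this thread; the file name, buffer sizes, and the `std.fs.File.reader(buffer)` call are my assumptions, not prescribed by the API:

```zig
const std = @import("std");

pub fn main() !void {
    // Assumption: the new std.fs.File.reader(buffer) shape.
    var file = try std.fs.cwd().openFile("sources.tar", .{});
    defer file.close();
    var read_buf: [4096]u8 = undefined;
    var file_reader = file.reader(&read_buf);

    var name_buf: [1024]u8 = undefined;
    var link_buf: [1024]u8 = undefined;
    var iter: std.tar.Iterator = .init(&file_reader.interface, .{
        .file_name_buffer = &name_buf,
        .link_name_buffer = &link_buf,
    });

    while (try iter.next()) |item| {
        if (item.kind != .file) continue;
        // streamRemaining() copies the current item's bytes into any
        // std.Io.Writer while the iterator keeps its position bookkeeping intact.
        var sink: std.Io.Writer.Allocating = .init(std.heap.page_allocator);
        defer sink.deinit();
        try iter.streamRemaining(item, &sink.writer);
        std.debug.print("{s}: {d} bytes\n", .{ item.name, sink.getWritten().len });
    }
}
```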

Thanks! I’ll try that.

I have the nagging feeling that Readers and Writers should be more freely ‘pipe-able’ / ‘pluggable’… e.g. here I need to call a special method on the iterator to stream the content of the current ‘iterator item’ into a writer, instead of ‘hey, give me a reader for the current item which I can then plug directly into a writer’.

I think the basic idea that I can ask some object to give me a reader or writer object which I can then plug somewhere else, or connect the reader directly to the writer (e.g. some sort of piping), is very intuitive.

Of course I haven’t thought that through, I guess there’s reasons 🙂
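For illustration, the kind of composition I mean does already exist at the plain Reader/Writer level; a tiny sketch, assuming `std.Io.Reader.streamRemaining()` has this shape on current master (the payload string is made up):

```zig
const std = @import("std");

pub fn main() !void {
    // A fixed reader over some in-memory bytes...
    var src: std.Io.Reader = .fixed("payload of the current item");
    // ...piped directly into an allocating writer.
    var dst: std.Io.Writer.Allocating = .init(std.heap.page_allocator);
    defer dst.deinit();
    // Assumption: Reader.streamRemaining(writer) drains the reader into the writer.
    _ = try src.streamRemaining(&dst.writer);
    std.debug.print("{s}\n", .{dst.getWritten()});
}
```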

…hmm I still don’t get it I guess…

I can use std.tar.Iterator.streamRemaining() to write the data of the current file in the tar file into a std.Io.Writer.

But how do I connect that to a std.tar.Writer? This doesn’t seem to offer a std.Io.Writer interface, and instead has methods to write new file items to the output tar file from various sources.

E.g. how do I connect std.tar.Iterator.streamRemaining(), which expects a std.Io.Writer, to std.tar.Writer.writeFileStream(), which expects a std.Io.Reader? Is there a universal WriterReader which I can plug between the two? Is that even the correct approach (e.g. having some sort of intermediate reader/writer between the two methods)?

Just to be clear this API was written in kind of a bad way (i.e. by requiring a Writer as you mentioned and not providing a Reader) and I did the bare minimum to get it working again with new I/O stuff. So please don’t take it as an example of how things are supposed to be.

It would be better if each file provided a Reader instead, then like you said the programmer can chain it or plug a Writer into it.

This API needs to be reworked.


Oki good to know 🙂

I will still see if I can cobble something together… worst case I guess would be reading the whole tar file into memory so that I have random access to the input data, which should help me create an ‘ad-hoc fixed-buffer reader’ on some portion of the input data.

…or probably better: just read the current file item into an intermediate buffer…

In any case, thanks for the help!

I got further by going through an intermediate writer/reader pair:

    var tar_writer: std.tar.Writer = .{ .underlying_writer = &file_writer.interface };
    var file_name_buffer: [1024]u8 = undefined;
    var link_name_buffer: [1024]u8 = undefined;
    var iter: std.tar.Iterator = .init(&file_reader.interface, .{
        .file_name_buffer = &file_name_buffer,
        .link_name_buffer = &link_name_buffer,
    });
    while (try iter.next()) |tar_item| {
        switch (tar_item.kind) {
            .file => {
                if (std.mem.startsWith(u8, tar_item.name, prefix)) {
                    // FIXME: currently it's not possible to directly plug iter.streamRemaining()
                    // into a std.tar.Writer, so let's go through an intermediate buffer
                    var imm_writer: std.Io.Writer.Allocating = .init(arena);
                    defer imm_writer.deinit();
                    // stream the current tar item into the intermediate writer
                    try iter.streamRemaining(tar_item, &imm_writer.writer);
                    // get an intermediate reader on the intermediate writer's buffer
                    var imm_reader = std.Io.Reader.fixed(imm_writer.getWritten());
                    // ... and write the file data into the tar-writer
                    try tar_writer.writeFileStream(tar_item.name, tar_item.size, &imm_reader, .{ .mode = tar_item.mode });
                }
            },
            else => continue,
        }
    }
    try tar_writer.finishPedantically();
    try tar_writer.underlying_writer.flush();

…this gives me a tar file which I can unpack with tar -xf sources.tar just fine on the macOS cmdline, but when opening the webpage this now shows an error about an unexpected EndOfStream, but I’ll leave it at that for now and try to debug later:

If anybody wants to have a look, or do a code review (I’d be thankful for any simplifications) here’s the complete source:

PS: huh, weird… the generated tar is actually totally fine, and when refreshing the doc web page (running in a local node.js http-server) it also works; just the first load is broken when serving it locally with the node.js http-server.

And here on a real web-server it also works: