Read an arbitrary number of bytes from a binary file into an arbitrary data object

Most of the ways to read from a binary file that I have found so far (File.read(), Reader.readStruct() and similar) read an inferred number of bytes (basically, up to the whole size of the type passed to them). Some allocate a specified size and return a []u8.

The nearest to my need, Reader.readStruct(), does not accept a place into which to read, but returns a ‘new’? struct, allocated? on the stack? with which scope? With a (for me) strange (and undocumented) error: NoEofError; no EOF? Reaching an unexpected EOF could be an error, but not reaching one? Am I missing something?

Is there a way to ‘simply’, ‘just’, read n bytes into a *mem, à la C’s fread(), where the pointer to the memory may be a []u8 buffer, a whole (or a part of a) (extern) struct or (more or less) anything else?

std.fs.File.read?

Thanks for the reply, but File.read() only reads into an array of bytes ([]u8); I need to read into structs (often a complete struct at each read, but occasionally just a part of it), into arrays of structs and so on, and also into unstructured arrays of bytes.

It also does not allow specifying how many bytes to read; it just reads (up to) the whole byte buffer.

Reader.readAtLeast

Thank you for the pointer. It seems I cannot use it because:

  1. It is unclear what it does. “At least” in which sense? reader.readAtLeast(buffer, 20); will read at least 20 bytes, but may read more, if more are available and maybe if the buffer is longer than 20 bytes? This would break the synchronism with the file offset. Or is it in the sense of at most 20 bytes? The documentation is unclear.

  2. It only reads into a []u8, but I need to read sometimes into a structure, sometimes into an array of structures and sometimes into an array of bytes. I tried both:

  • myStruct: MyStruct = @as(MyStruct, byteBuffer); after reading into the byte buffer,
  • or the other way around, reading into the struct seen as a byte buffer: reader.readAtLeast(@as([@sizeOf(MyStruct)]u8, @sizeOf(MyStruct)); but neither compiled.

  3. Even when reading into a byte buffer, all these solutions ultimately require a piece of memory of the precise length to be read. In my application, there can be hundreds of different read sizes, which means allocating and deallocating hundreds of buffers, just to read from a file?

Something like C’s fread() seems to me a rather simple concept, in fact the simplest read-from-file operation one can imagine; why should it be so counter-intuitive to achieve?

I’ll investigate how to use the actual C function.

The equivalent function of fread is File.read().

// ptr − This is the pointer to a block of memory with a minimum size of size*n bytes.
size_t fread(void *ptr, size_t size, size_t n, FILE *stream)

is

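// size and n must be comptime-known here; file is an already opened std.fs.File
var buf: [size * n]u8 = undefined;
const bytes_read = try file.read(&buf); // read() takes a slice, so pass &buf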

Sorry, but I disagree: C’s fread() and Zig’s File.read() achieve related tasks, but they are far from equivalent; simply put, a 4-parameter function and a 2-parameter function cannot be equivalent.

Not to mention (actually mentioning :upside_down_face:) that the memory pointed to in the parameters has a radically different meaning in the two contexts both in length and in nature:

  • in fread() it points to a piece of memory of whatever type at least (this time really at least!) n bytes long, but possibly extending more in both directions;
  • while in File.read() it can be (the address of) only a byte array and only exactly n bytes long.

This implies, for instance, that to read once 200 bytes and another time 100 bytes one needs two different byte arrays in Zig, but can — if useful — use the same memory block in C (perhaps the second time at the end of it, if convenient!) and, if the buffer is not an array of bytes, one can still use fread() but cannot use File.read().

So, I see a significant functional non-equivalence. So far, it seems what is described in the post topic cannot be achieved in Zig; as I said, I’ll investigate using the very C function from a Zig context…

Thanks to all for the suggestions, though!

The Quick Answer

I believe the following is what you’re after:

const std = @import("std");

const DataStruct = packed struct { a: u8 = 0, b: u8 = 0, c: u8 = 0, d: u8 = 0 };

pub fn main() !void {
    const binf = try std.fs.cwd().openFile("someData.bin", .{});
    defer binf.close();
    const binf_reader = binf.reader();
    const data_struct = try binf_reader.readStruct(DataStruct);
    std.debug.print("My struct: {any}\n", .{data_struct});
}

Where someData.bin is a binary file containing 4 bytes in this order:
0x1 0x2 0x3 0x4

You will see (depending on your system’s endianness!!) the following output:
My struct: main.DataStruct{ .a = 1, .b = 2, .c = 3, .d = 4 }

If you have weird endianness, check out the equivalent:

binf_reader.readStructEndian()

The Longer Answer

The reason what you’re after is “so easy” in C is because C doesn’t care in the slightest what the memory location you’re reading things into is, it will just do exactly as you ask. This can be problematic because:

  • Maybe you accidentally gave it the wrong pointer offset (off by 1 error!)
  • You forgot to specify the struct was packed/has specific alignment using __attribute__((packed, aligned(4))), so crucial data ends up in padding bytes and your actual struct members are garbage

The nice thing about Zig is it will still absolutely let you do this if that’s what you’re after, but if you’re using a standard library function it generally has some more guard rails around it. For instance, take away the packed specifier from the struct in this example and look at the compiler error:

/usr/local/bin/lib/std/debug.zig:412:14: error: reached unreachable code
    if (!ok) unreachable; // assertion failure
             ^~~~~~~~~~~
/usr/local/bin/lib/std/io/Reader.zig:329:20: note: called from here
    comptime assert(@typeInfo(T).Struct.layout != .auto);
             ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Amazing! Structs in Zig, by default, do not have guaranteed memory layout so you can’t know what order in memory the fields are going to be in. But, through the magic of comptime, the standard library guarded against this footgun for you!

Check out the source code for readStruct in the standard library for a good example of the low level calls required to read bytes directly into an arbitrary data object. A related function relevant to your use case you might want to check out is std.mem.asBytes().
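
To make the std.mem.asBytes() idea concrete, here is a minimal sketch, assuming the same standard library version as the rest of this thread (the one where readers have readStruct() and readNoEof()). Record and readInto are hypothetical names made up for the illustration, and an in-memory std.io.fixedBufferStream stands in for the file so the example runs on its own; a reader obtained from an opened file should be usable the same way.

```zig
const std = @import("std");

// Hypothetical record type, for illustration only.
// `extern` gives the struct a defined, C-compatible layout.
const Record = extern struct { a: u8 = 0, b: u8 = 0, c: u8 = 0, d: u8 = 0 };

// Fill the object behind `dest` completely from `reader`,
// roughly like fread(dest, sizeof *dest, 1, f) in C.
// `dest` must be a mutable pointer to a single value.
fn readInto(reader: anytype, dest: anytype) !void {
    try reader.readNoEof(std.mem.asBytes(dest));
}

pub fn main() !void {
    // In-memory stand-in for the binary file (an assumption of this sketch).
    const bytes = [_]u8{ 1, 2, 3, 4, 5, 6 };
    var stream = std.io.fixedBufferStream(&bytes);
    const reader = stream.reader();

    // Read a whole struct into existing storage.
    var whole: Record = .{};
    try readInto(reader, &whole);

    // Read only the first two bytes of another struct; c and d keep their defaults.
    var partial: Record = .{};
    try reader.readNoEof(std.mem.asBytes(&partial)[0..2]);

    std.debug.print("whole: {any}\npartial: {any}\n", .{ whole, partial });
}
```

The point of the design is that nothing is allocated: asBytes() only reinterprets memory you already own as a byte slice, and slicing that view picks the part of the object to fill.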

A General Word of Warning

Many, many people post on Stack Overflow, for a grab-bag of languages, asking what essentially comes down to a serialization/deserialization problem:
“How do I read/write bytes into/from a data structure in my language?”

This is a weirdly non-trivial problem as it depends on your system’s endianness, the endianness of the byte source (file, network?), how your particular language lays out data structures in memory, and what the alignment requirements are of your system.

So, you absolutely can do what you’re after, just make sure you know exactly how bytes are handled on both ends of the process.

…and just to follow up because I realized I didn’t fulfill the condition where you wanted to read less than the entire size of the struct worth of bytes + I was curious:

const std = @import("std");

const DataStruct = packed struct { a: u8 = 0, b: u8 = 0, c: u8 = 0, d: u8 = 0 };

pub fn main() !void {
    const binf = try std.fs.cwd().openFile("someData.bin", .{});
    defer binf.close();
    const binf_reader = binf.reader();

    // Example where reader stream may *not* return the entire size of struct in a single read
    var data_struct: DataStruct = undefined;
    const data_struct_slice = std.mem.asBytes(&data_struct);
    var bytes_read: usize = 0;
    while (bytes_read < data_struct_slice.len) {
        const n = try binf_reader.readAtLeast(data_struct_slice[bytes_read..], data_struct_slice[bytes_read..].len);
        if (n == 0) return error.EndOfStream; // EOF before the struct was filled; avoids looping forever
        bytes_read += n;
    }
    std.debug.print("My struct: {any}\n", .{data_struct});
}

Obviously for this dummy file example, it will always grab all 4 bytes in one go, but imagine the reader stream is instead a network socket, etc…
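
On the worry about needing one buffer per read size: slicing should cover that. Below is a minimal sketch, under the same assumptions as the code above (same standard library version, readNoEof() available on the reader). readExact is a hypothetical helper, and an in-memory stream stands in for the file so the example is self-contained.

```zig
const std = @import("std");

// Read exactly `n` bytes into the start of `buf` and return the filled part.
// Slicing panics in safe builds if n > buf.len.
fn readExact(reader: anytype, buf: []u8, n: usize) ![]u8 {
    const dest = buf[0..n];
    try reader.readNoEof(dest);
    return dest;
}

pub fn main() !void {
    // In-memory stand-in for the binary file (an assumption of this sketch).
    const bytes = [_]u8{ 10, 20, 30, 40, 50 };
    var stream = std.io.fixedBufferStream(&bytes);
    const reader = stream.reader();

    // One buffer, reused for reads of different sizes.
    var buf: [256]u8 = undefined;
    const first = try readExact(reader, &buf, 3); // 3 bytes at the start
    const second = try readExact(reader, buf[254..], 2); // 2 bytes at the very end

    std.debug.print("first: {any}, second: {any}\n", .{ first, second });
}
```

The slice carries both the destination address and the byte count, which is the information fread() takes as separate ptr, size and n arguments.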

Thanks a lot for your reply!

As I said in the top post, I arrived at a rather similar solution, but I was distracted by the NoEofError of Reader.readStruct() (still mysterious) and I am still unsure if the returned data are on the stack or have to be deallocated in some way (probably not: according to Zig philosophy, if you don’t see an explicit allocation, there is none; a bit more of function documentation would be helpful).

Thank you also for reminding me of passing through a slice to reuse only a part of a buffer: I am still rather unfamiliar with details of the language which are probably automatic to a more experienced Zig coder. This, together with std.mem.asBytes() will probably be useful to read a [n]SomeStruct where n is known at run time and may vary within the same run (in fact it is read from the previous DataStruct read).

Finally, a general comment: having coded in C for decades, I know (probably only a part of) the many things which can go wrong with C (and probably contracted some bad habits with it…) and I had several glimpses of the many ‘guard-rails’ Zig can provide.

Still, I am convinced (and I do not pretend to convince anyone else!) that it is my job as a programmer to evaluate when taking more risks is worthwhile and when the convenience of the safety nets prevails.

Thanks!

if you don’t see an explicit allocation, there is none;

Yep, where it gets a little tricky is you can absolutely have a std.mem.Allocator that is allocating to a fixed sized buffer that lives on the stack:
https://ziglang.org/documentation/master/std/#std.heap.FixedBufferAllocator

Despite living on the stack, you still need to make explicit allocation calls though. In this particular example, I’ve not explicitly allocated anything and so nothing needs to be freed.

Finally, a general comment: having coded in C for decades, I know (probably only a part of) the many things which can go wrong with C (and probably contracted some bad habits with it…) and I had several glimpses of the many ‘guard-rails’ Zig can provide.

Sweet, you’re well familiar with the trials and tribulations of C then :slight_smile:

will probably be useful to read a [n]SomeStruct where n is known at run time and may vary within the same run (in fact it is read from the previous DataStruct read).

Yep, in fact check out std.mem.sliceAsBytes() which I believe accomplishes exactly what you want. Instead of reading into a single struct of type T, you can read into a slice of them!
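
A minimal sketch of how that could look when n only becomes known at run time, assuming the same standard library version as the rest of the thread. Item, readItems and the one-byte count prefix are all made up for the illustration, and an in-memory stream stands in for the file.

```zig
const std = @import("std");

// Hypothetical element type with a defined (extern) layout.
const Item = extern struct { x: u8, y: u8 };

// Read a one-byte count, then that many Items, into caller-provided storage.
// Returns the filled part of `storage`.
fn readItems(reader: anytype, storage: []Item) ![]Item {
    const n = try reader.readByte();
    if (n > storage.len) return error.TooManyItems;
    const items = storage[0..n];
    // View the slice of structs as raw bytes and fill it in one call.
    try reader.readNoEof(std.mem.sliceAsBytes(items));
    return items;
}

pub fn main() !void {
    // In-memory stand-in for the file: count = 2, then two (x, y) pairs.
    const bytes = [_]u8{ 2, 1, 2, 3, 4 };
    var stream = std.io.fixedBufferStream(&bytes);

    var storage: [8]Item = undefined;
    const items = try readItems(stream.reader(), &storage);

    for (items) |item| {
        std.debug.print("({d}, {d})\n", .{ item.x, item.y });
    }
}
```

The fixed-size storage array is only one option; the same call should work on a slice obtained from an allocator when the upper bound is not known in advance.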

Sounds very useful, thanks!

Note that NoEofError is an error set, not an error, and it’s defined as:

pub const NoEofError = ReadError || error{
    EndOfStream,
};

So the actual error you’d get is error.EndOfStream (or possibly some other error depending on the ReadError [also an error set] of the particular reader you’re using).
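
In practice that means a short read can be told apart from other failures by checking for error.EndOfStream. A minimal sketch, assuming the same standard library version as above; readOrEof is a hypothetical helper and the in-memory stream stands in for a file:

```zig
const std = @import("std");

// Try to fill `dest` completely.
// Returns false if the stream ended first; any other error is passed on.
fn readOrEof(reader: anytype, dest: []u8) !bool {
    reader.readNoEof(dest) catch |err| {
        if (err == error.EndOfStream) return false;
        return err;
    };
    return true;
}

pub fn main() !void {
    // Three bytes available, read in chunks of two.
    const bytes = [_]u8{ 1, 2, 3 };
    var stream = std.io.fixedBufferStream(&bytes);
    const reader = stream.reader();

    var buf: [2]u8 = undefined;
    const first_ok = try readOrEof(reader, &buf); // two bytes available
    const second_ok = try readOrEof(reader, &buf); // only one byte left

    std.debug.print("first: {}, second: {}\n", .{ first_ok, second_ok });
}
```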
