Alternative ways to iterate through bytes while parsing

natecraddock · March 3, 2022, 6:09pm

Hey all! I’m working on a project where I need to read a large blob of binary data into slices of bytes. The data is in this format:

| HEADER | BYTES | HEADER | BYTES | ...

Each header holds some id information and the length of the bytes that follow. I have this data embedded directly in my source file with @embedFile(). I am porting some code I wrote in C. Here is a reduced example of my solution.

const std = @import("std");
const print = std.debug.print;

const Header = struct { id: u8, len: u8 };

pub fn main() !void {
    const data = "\x00\x05\x48\x65\x6c\x6c\x6f\x01\x03\x61\x6c\x6c";
    var ptr: [*c]const u8 = data;
    const end = ptr + data.len;

    while (ptr < end) {
        const header = @ptrCast(*const Header, ptr[0..@sizeOf(Header)]);
        ptr += @sizeOf(Header);

        print("{} data: {s}\n", .{ header, ptr[0..header.len] });

        ptr += header.len;
    }
}

I ended up using a C pointer to allow the pointer comparisons, which is very similar to my original C code. But this removes some of Zig’s safety features, and the language reference recommends avoiding C pointers. I’m wondering what other methods might exist.

After finding the correct offset for the slice of bytes, the header will be discarded, so in this case I could just read n bytes to find the id, and another n bytes for the length. I would like to find a way to do this with a struct though, since I’ll need to do more like this later on for other parts of this project.

natecraddock · March 3, 2022, 8:45pm

Here is another option I found that seems a bit safer.

const std = @import("std");
const print = std.debug.print;

const Header = struct { id: u8, len: u8 };

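// Returns the next payload and advances offset past it, or null at the end of data.
// Assumes well-formed data: nothing here explicitly checks that a full header and payload remain.
fn readBytes(data: []const u8, offset: *usize) ?[]const u8 {
    if (offset.* >= data.len) return null;
    const header = @ptrCast(*const Header, data[offset.* .. offset.* + @sizeOf(Header)]);
    offset.* += @sizeOf(Header);
    const bytes = data[offset.* .. offset.* + header.len];
    offset.* += header.len;
    return bytes;
}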

pub fn main() !void {
    const data = "\x00\x05\x48\x65\x6c\x6c\x6f\x01\x03\x61\x6c\x6c";

    var offset: usize = 0;
    while (readBytes(data, &offset)) |bytes| {
        print("{s}\n", .{bytes});
    }
}

I am interested to see other options though!

cnx · March 4, 2022, 2:16pm

“I have this data embedded directly in my source file with @embedFile().”

Since the data is available at compile time, you can strip out
any invalid data and be sure that whatever you do is safe.

As for the pointer cast, you’ll need a packed struct, since normal
structs don’t guarantee field order and may contain alignment padding.
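
To make the layout point concrete, here is a minimal sketch. It rests on assumptions that are not in the thread: it uses `extern struct` (C ABI layout, fields in declaration order) rather than `packed struct`, and it copies the header out with `std.mem.bytesToValue` instead of casting a pointer. `readHeader` is a hypothetical helper name. Builtin and stdlib signatures have changed between Zig releases, so treat this as illustrative rather than tied to a specific version.

```zig
const std = @import("std");

// Assumption: `extern struct` instead of `packed struct`. It follows the
// C ABI, so fields stay in declaration order.
const Header = extern struct { id: u8, len: u8 };

comptime {
    // Fail the build if the in-memory layout stops matching the 2-byte wire format.
    std.debug.assert(@sizeOf(Header) == 2);
    std.debug.assert(@offsetOf(Header, "id") == 0);
    std.debug.assert(@offsetOf(Header, "len") == 1);
}

// Hypothetical helper: copies the first @sizeOf(Header) bytes of `data` into a Header.
// The caller must ensure data.len >= @sizeOf(Header).
fn readHeader(data: []const u8) Header {
    return std.mem.bytesToValue(Header, data[0..@sizeOf(Header)]);
}

pub fn main() void {
    const data = "\x00\x05\x48\x65\x6c\x6c\x6f\x01\x03\x61\x6c\x6c";
    const header = readHeader(data);
    std.debug.print("id={d} len={d}\n", .{ header.id, header.len });
}
```

Copying two bytes by value sidesteps alignment questions entirely, at the cost of not pointing into the original buffer.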

const-void · March 4, 2022, 3:12pm

Is the header a fixed size? If so, can you read the header bytes from your input stream, pluck out the page size, read that many bytes, and so on?
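
A rough sketch of that field-by-field approach, assuming the two-byte header (id, then length) from the original post. `Record` and `parseRecord` are hypothetical names, not anything from the thread or the standard library.

```zig
const std = @import("std");

// Hypothetical result type: the id plus a slice into the original data.
const Record = struct { id: u8, bytes: []const u8 };

// Hypothetical helper: parses one record from the front of `data`.
// Returns null if `data` is too short for a header or for the declared payload.
fn parseRecord(data: []const u8) ?Record {
    if (data.len < 2) return null;
    const id = data[0];
    const len: usize = data[1];
    if (data.len - 2 < len) return null;
    return Record{ .id = id, .bytes = data[2 .. 2 + len] };
}

pub fn main() void {
    const data = "\x00\x05\x48\x65\x6c\x6c\x6f\x01\x03\x61\x6c\x6c";
    var rest: []const u8 = data;
    while (parseRecord(rest)) |record| {
        std.debug.print("id={d} bytes={s}\n", .{ record.id, record.bytes });
        // Drop the header and payload that were just consumed.
        rest = rest[2 + record.bytes.len ..];
    }
}
```

Note that this loop stops silently on a truncated trailing record; a real parser would probably want to report that as an error.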

natecraddock · March 4, 2022, 8:52pm

I hadn’t thought about parsing at comptime, that’s a good idea!

And I forgot about packed structs. I had some hacky @alignCast() calls that worked, but packed struct is definitely the correct solution. Thanks!

I can read the values one at a time, and I did consider this. I prefer using a struct to represent the header though, because it makes what is going on clearer to me. Later in this project I’ll be reading more complex data and storing it in a struct, so I wanted to figure this out sooner rather than later.

cnx · March 5, 2022, 7:30am

“I hadn’t thought about parsing at comptime, that’s a good idea!”

What I meant was that since the input is fixed, you can ensure
the validity of the data ahead of time, and thus the safety of parsing.
It’d be great to parse at compile time, but whether that is possible
may depend on the size of the data.
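
One way this could look, as a sketch only: an ordinary function walks the records, and a top-level `comptime` block refuses to build if the embedded data is malformed. `isWellFormed` and `blob` are hypothetical names, and the string literal stands in for the real `@embedFile()` result. For large inputs the compile-time evaluation limit may need raising with `@setEvalBranchQuota`, which is the size concern mentioned above.

```zig
const std = @import("std");

// Returns true if `bytes` is an exact sequence of (id, len, payload) records.
fn isWellFormed(bytes: []const u8) bool {
    var pos: usize = 0;
    while (pos < bytes.len) {
        if (bytes.len - pos < 2) return false; // truncated header
        const len: usize = bytes[pos + 1];
        pos += 2;
        if (bytes.len - pos < len) return false; // truncated payload
        pos += len;
    }
    return true;
}

// Stand-in for the embedded blob; the real project would use @embedFile here.
const blob = "\x00\x05\x48\x65\x6c\x6c\x6f\x01\x03\x61\x6c\x6c";

comptime {
    // Evaluated during compilation: a malformed blob becomes a build error.
    if (!isWellFormed(blob)) @compileError("embedded data is malformed");
}

pub fn main() void {
    std.debug.print("validated {d} bytes at compile time\n", .{blob.len});
}
```

Because `isWellFormed` is a plain function, the same check can also run at runtime or in a test.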

dude_the_builder · March 5, 2022, 2:52pm

@natecraddock I can’t offer a better solution because I haven’t had much experience with @ptrCast, but just wanted to let you know that your code examples helped me to finally really understand what @ptrCast does and can be used for. Thanks for that!

jmc · March 5, 2022, 4:07pm

The thing that throws a little bit of a wrench in the works is the variable-length bytes after the header, but even so with a bit of work it can be tamed.

The stdlib is your friend: by combining std.io.fixedBufferStream() and std.io.Reader.readStruct() you can make this relatively clean and type-safe:

$ cat parsedata.zig
const std = @import("std");
const print = std.debug.print;

const Header = packed struct { id: u8, len: u8 };

pub fn main() !void {
    const data = "\x00\x05\x48\x65\x6c\x6c\x6f\x01\x03\x61\x6c\x6c";
    const reader = std.io.fixedBufferStream(data).reader();

    var pos: usize = 0;
    while (true) {
        const header = reader.readStruct(Header) catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        pos += @sizeOf(Header);
        print("header: {}\n", .{header});

        const bytes = data[pos .. pos + header.len];
        try reader.skipBytes(bytes.len, .{});
        pos += bytes.len;
        print("bytes: {s}\n", .{bytes});
    }
}

$ zig run parsedata.zig
header: Header{ .id = 0, .len = 5 }
bytes: Hello
header: Header{ .id = 1, .len = 3 }
bytes: all

The only annoyance here is that nothing can return a slice of bytes from the stream without allocation (the Reader has no idea it is just reading from memory; it could be any streamable resource). So I have to update a separate pos variable in order to slice into the original byte buffer for the payload, and additionally call skipBytes() with the same length to keep the “file view” and the memory view in sync.
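
If the data is always in memory, one way to avoid keeping two positions in sync is a small cursor over the slice. This is a hedged sketch, not a stdlib facility: `Cursor`, `take`, and `readHeader` are hypothetical names, and the header fields are read byte by byte rather than through readStruct().

```zig
const std = @import("std");

const Header = struct { id: u8, len: u8 };

// Hypothetical cursor over an in-memory buffer: a single position, no separate stream.
const Cursor = struct {
    data: []const u8,
    pos: usize = 0,

    // Returns the next `n` bytes as a slice into `data`, or null if fewer remain.
    fn take(self: *Cursor, n: usize) ?[]const u8 {
        if (self.data.len - self.pos < n) return null;
        const bytes = self.data[self.pos .. self.pos + n];
        self.pos += n;
        return bytes;
    }

    // Reads the two header bytes (id, then len), or null at the end of the data.
    fn readHeader(self: *Cursor) ?Header {
        const raw = self.take(2) orelse return null;
        return Header{ .id = raw[0], .len = raw[1] };
    }
};

pub fn main() void {
    const data = "\x00\x05\x48\x65\x6c\x6c\x6f\x01\x03\x61\x6c\x6c";
    var cursor = Cursor{ .data = data };
    while (cursor.readHeader()) |header| {
        const bytes = cursor.take(header.len) orelse break;
        std.debug.print("id={d} bytes={s}\n", .{ header.id, bytes });
    }
}
```

The trade-off is that this only works for in-memory data, whereas the Reader-based version works for any stream.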

natecraddock · March 8, 2022, 12:59am

Thanks for sharing these stdlib functions! I hadn’t seen them yet, and for some use cases this seems like an optimal solution.

I looked through the implementation of readStruct(), and for my specific project it is a bit overkill. All I really need is the @ptrCast() because I know the exact contents of the data. I ended up using the second solution I shared above. I then have a test to verify that I can iterate through and find all 33 slices in my embedded data.

Thanks again for all the input everyone!