Reading/writing strings from binary data

I have a binary data format that I’m reading from and writing to. Strings are stored by writing the length as a u16 in little endian, followed by the bytes making up the string. I have this, and it’s passing my tests.
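(For reference, with that layout the three-byte string "abc" would be stored as the five bytes 03 00 61 62 63.)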

const std = @import("std");
const mem = std.mem;

pub const StringError = error {
    ReadError,
    WriteError,
};

pub fn load(reader: anytype, allocator: mem.Allocator) ![]const u8 {
    var bytes: [2]u8 = undefined;
    var res = try reader.readAll(&bytes);
    if (res != 2) return StringError.ReadError;
    const len = mem.readIntLittle(u16, &bytes);
    var list = try std.ArrayList(u8).initCapacity(allocator, @as(usize, len));
    list.expandToCapacity();
    list.shrinkAndFree(@as(usize, len));
    res = try reader.readAll(list.items);
    if (res != @as(usize, len)) return StringError.ReadError;
    return list.toOwnedSlice();
}

pub fn store(string: []const u8, writer: anytype) !void {
    var len: [2]u8 = undefined;
    mem.writeIntLittle(u16, &len, @as(u16, @intCast(string.len)));
    try writer.writeAll(&len);
    try writer.writeAll(string);
}

As I said, it’s working. Just want to know if there might be a better way to do the read than the way I’m initializing the array list, followed by expanding it to capacity and shrinking it to the exact size. That part seems clunky to me.

Note I am planning to add a check to make sure the length fits, although since all of the strings in question are going to be Unix path names I shouldn’t ever encounter anything that doesn’t fit in a u16.
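
Concretely, the check I have in mind is something along these lines (untested sketch reusing the StringError set above; std.math.maxInt is just the obvious way to get the u16 limit):

pub fn store(string: []const u8, writer: anytype) !void {
    if (string.len > std.math.maxInt(u16)) return StringError.WriteError;
    var len: [2]u8 = undefined;
    mem.writeIntLittle(u16, &len, @as(u16, @intCast(string.len)));
    try writer.writeAll(&len);
    try writer.writeAll(string);
}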

So help me understand something here - why are we calling shrinkAndFree? I’m walking through the ArrayList logic and maybe I’ve missed something?

So first you set the initial capacity, which makes an allocating call via ensureTotalCapacityPrecise:

        /// Initialize with capacity to hold at least `num` elements.
        /// The resulting capacity is likely to be equal to `num`.
        /// Deinitialize with `deinit` or use `toOwnedSlice`.
        pub fn initCapacity(allocator: Allocator, num: usize) Allocator.Error!Self {
            var self = Self.init(allocator);
            try self.ensureTotalCapacityPrecise(num);
            return self;
        }

Then you call expandToCapacity, a very simple function:

        /// Increases the array's length to match the full capacity that is already allocated.
        /// The new elements have `undefined` values. **Does not** invalidate pointers.
        pub fn expandToCapacity(self: *Self) void {
            self.items.len = self.capacity;
        }

But then you call shrinkAndFree, and this is where I lose ya:

        /// Reduce allocated capacity to `new_len`.
        /// May invalidate element pointers.
        pub fn shrinkAndFree(self: *Self, new_len: usize) void {
            var unmanaged = self.moveToUnmanaged();
            unmanaged.shrinkAndFree(self.allocator, new_len);
            self.* = unmanaged.toManaged(self.allocator);
        }

shrinkAndFree moves the list into a new ArrayListUnmanaged, whose shrinkAndFree then tries to call resize (an allocator vtable call) and more… here’s that implementation:

        /// Reduce allocated capacity to `new_len`.
        /// May invalidate element pointers.
        pub fn shrinkAndFree(self: *Self, allocator: Allocator, new_len: usize) void {
            assert(new_len <= self.items.len);

            if (@sizeOf(T) == 0) {
                self.items.len = new_len;
                return;
            }

            const old_memory = self.allocatedSlice();
            if (allocator.resize(old_memory, new_len)) {
                self.capacity = new_len;
                self.items.len = new_len;
                return;
            }

            const new_memory = allocator.alignedAlloc(T, alignment, new_len) catch |e| switch (e) {
                error.OutOfMemory => {
                    // No problem, capacity is still correct then.
                    self.items.len = new_len;
                    return;
                },
            };

            @memcpy(new_memory, self.items[0..new_len]);
            allocator.free(old_memory);
            self.items = new_memory;
            self.capacity = new_memory.len;
        }

I’m willing to bet you’ll hit the resize block - since you’re resizing with the same len, resize will probably return true, it’ll just set self.capacity and self.items.len to that same length, and then the unmanaged list gets assigned back over the managed one.

@AndrewCodeDev to be honest I didn’t look at the source for those functions, even though I know I probably should. My understanding, based on the API docs, was that initCapacity ensures that the capacity is at least num, but may in fact be more. Hence the call to shrinkAndFree, because I don’t want an array of at least num bytes, I want an array of exactly num bytes. I’ll have a look at the source though. You may be right that call isn’t needed.

I’m also probably carrying some assumptions over from Rust since that’s what I use the most. With Rust, I’d be calling Vec::with_capacity and then getting a handle on the exact number of bytes in the reader with reader.take(num). So far what I have is the closest Zig equivalent, because I haven’t found a function to read exactly n bytes except by providing an exact sized buffer.


So I made a small change.

    var list = try std.ArrayList(u8).initCapacity(allocator, @as(usize, len));
    list.expandToCapacity();
    if (list.items.len != @as(usize, len)) list.shrinkAndFree(@as(usize, len));

That at least prevents it from making the shrinkAndFree call if it doesn’t have to.

So you want to use the “capacity” data member and not the “items.len” member. The capacity is what may be larger, not the length of the items.

The capacity is determined by the allocator and the alignment, among other things… take a look at “allocBytesWithAlignment” in the following file to get a feel for it: https://github.com/ziglang/zig/blob/master/lib/std/mem/Allocator.zig

Basically, you don’t need that call at all. The resize call is going to come from the allocator again, which already determined the original capacity based on the kind of allocator that it is. For instance, if it’s a caching allocator, it may point you to a larger region of memory (larger capacity) but give you a small chunk of it (item length).
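
To make the capacity vs. items.len distinction concrete, here’s roughly what I mean (untested sketch using std.testing):

const std = @import("std");

test "capacity vs items.len" {
    var list = try std.ArrayList(u8).initCapacity(std.testing.allocator, 100);
    defer list.deinit();

    // Right after initCapacity: nothing stored yet, but room is reserved.
    try std.testing.expectEqual(@as(usize, 0), list.items.len);
    try std.testing.expect(list.capacity >= 100);

    // expandToCapacity just bumps items.len up to whatever capacity happens to be.
    list.expandToCapacity();
    try std.testing.expectEqual(list.capacity, list.items.len);
}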


Huh, I think I was totally overthinking this anyway.

pub fn load(reader: anytype, allocator: mem.Allocator) ![]const u8 {
    var bytes: [2]u8 = undefined;
    var res = try reader.readAll(&bytes);
    if (res != 2) return StringError.ReadError;
    const len = mem.readIntLittle(u16, &bytes);
    var s = try allocator.alloc(u8, @as(usize, len));
    res = try reader.readAll(s);
    if (res != @as(usize, len)) return StringError.ReadError;
    return s;
}

Tests passing.

This is how I would write it:

const std = @import("std");
const mem = std.mem;

/// Returned string is allocated by `allocator` and must be freed by the caller
pub fn load(reader: anytype, allocator: mem.Allocator) ![]const u8 {
    const len = try reader.readIntLittle(u16);
    var buf = try allocator.alloc(u8, len);
    // clean up the memory if the read fails
    errdefer allocator.free(buf);

    // fill the whole buffer; fails with error.EndOfStream on a short read
    try reader.readNoEof(buf);

    return buf;
}

/// Assumes that string.len is <= maxInt(u16)
pub fn store(string: []const u8, writer: anytype) !void {
    try writer.writeIntLittle(u16, @as(u16, @intCast(string.len)));
    try writer.writeAll(string);
}

EDIT: It looks like reallocAtLeast has been removed entirely, so ArrayList.initCapacity always gives you exactly the capacity you request now and the doc comments should be updated.

Outdated initCapacity 'at least' explanation

About ArrayList.initCapacity: it usually does give you exactly the capacity you ask for. The ‘at least’ part is there because the allocation itself passes the ‘.at_least’ option, which means the allocator is free to hand back a bigger allocation if that’s convenient for its implementation or the platform. In practice, this only starts happening once the allocation size gets near the page size. For example, with ArrayList(u8) on Linux, initCapacity gives you the exact capacity you ask for up to 2048, and from 2049 onward it looks like this:

// asked for capacity => actual capacity gotten
2049...4096 => 4096
4097...8192 => 8192
8193...12288 => 12288
12289...16384 => 16384
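
If you’re curious what your particular allocator does, a quick way to poke at it is something like this (untested sketch; it just prints whatever capacity you end up with):

const std = @import("std");

test "peek at initCapacity's actual capacity" {
    var list = try std.ArrayList(u8).initCapacity(std.testing.allocator, 3000);
    defer list.deinit();
    std.debug.print("asked for 3000, got capacity {d}\n", .{list.capacity});
}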

EDIT#2: PR to fix the ArrayList docs: “docs: Fix outdated doc comments about allocating 'at least' the requested size” by squeek502 · Pull Request #16391 · ziglang/zig · GitHub


Isn’t it errdefer allocator.free(buf)?


Yep; fixed. Also realized that the ‘at least’ behavior of ArrayList.initCapacity is not even true anymore.


@squeek502 thanks for the input. You’re definitely right about using errdefer to clean up if there’s a failure, should have been doing that anyway. Lots of good tips. I’m definitely overcomplicating things in a few different ways.


Note that sometimes it is necessary to do some complicated stuff. Just as something to have in your back pocket if you ever need it, here’s an example of using ArrayList.resize and then passing the uninitialized part of the ArrayList.items slice to a function that writes into it:
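
Roughly along these lines (untested sketch standing in for the linked example; readNoEof is just my stand-in for “a function that writes into the slice”):

const std = @import("std");

test "ArrayList.resize then fill the uninitialized tail" {
    var fbs = std.io.fixedBufferStream("hello");
    var list = std.ArrayList(u8).init(std.testing.allocator);
    defer list.deinit();

    const old_len = list.items.len;
    try list.resize(old_len + 5); // the new bytes are undefined at this point
    try fbs.reader().readNoEof(list.items[old_len..]); // something else writes into them

    try std.testing.expectEqualStrings("hello", list.items);
}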

Or ensureUnusedCapacity, unusedCapacitySlice, and setting the len manually can be done as well:
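
And roughly like this for the second approach (untested sketch):

const std = @import("std");

test "ensureUnusedCapacity + unusedCapacitySlice" {
    var fbs = std.io.fixedBufferStream("world");
    var list = std.ArrayList(u8).init(std.testing.allocator);
    defer list.deinit();

    try list.ensureUnusedCapacity(5);
    const dest = list.unusedCapacitySlice()[0..5];
    try fbs.reader().readNoEof(dest);
    list.items.len += 5; // commit the bytes that were just written

    try std.testing.expectEqualStrings("world", list.items);
}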