Serialization and Deserialization patterns

I’ve got three implementations of a serializer and deserializer for a struct. Which do you think is best and why?

Supports allocation or no allocation

const std = @import("std");

pub const Packet = struct {
    header: Header,
    data: []u8,

    pub const Header = packed struct(u16) {
        length: u11,
        reserved: u5 = 0,
    };

    /// asserts data.len < max u11
    /// data must have lifetime beyond returned memory
    pub fn init(data: []u8) Packet {
        return Packet{
            .header = .{ .length = @intCast(data.len) },
            .data = data,
        };
    }

    /// asserts data.len < max u11
    /// no lifetime requirement on data parameter
    /// free returned memory with deinit
    pub fn initAlloc(allocator: std.mem.Allocator, data: []u8) !Packet {
        return Packet{
            .header = .{ .length = @intCast(data.len) },
            .data = try allocator.dupe(u8, data),
        };
    }

    pub fn serialize(self: Packet, writer: *std.Io.Writer) !void {
        try writer.writeStruct(self.header, .little);
        try writer.writeAll(self.data);
    }

    /// buf must have lifetime past returned memory
    pub fn deserializeNoAlloc(buf: []const u8) !Packet {
        var reader = std.Io.Reader.fixed(buf);
        const header = try reader.takeStruct(Header, .little);
        const data: []u8 = try reader.take(header.length);
        return .{ .header = header, .data = data };
    }

    /// free using deinit
    pub fn deserializeAlloc(allocator: std.mem.Allocator, reader: *std.Io.Reader) !Packet {
        const header = try reader.takeStruct(Header, .little);
        const data = try reader.readAlloc(allocator, header.length);
        return Packet{ .header = header, .data = data };
    }

    /// only required if you use deserializeAlloc
    pub fn deinit(self: *Packet, allocator: std.mem.Allocator) void {
        allocator.free(self.data);
    }
};

Notes:

  1. most complex
  2. lifetimes are harder to explain / requires more documentation

Supports no allocation

pub const Packet = struct {
    header: Header,
    data: []u8,

    pub const Header = packed struct(u16) {
        length: u11,
        reserved: u5 = 0,
    };

    /// asserts data.len < max u11
    /// data must have lifetime beyond returned memory
    pub fn init(data: []u8) Packet {
        return Packet{
            .header = .{ .length = @intCast(data.len) },
            .data = data,
        };
    }

    pub fn serialize(self: Packet, writer: *std.Io.Writer) !void {
        try writer.writeStruct(self.header, .little);
        try writer.writeAll(self.data);
    }

    /// buf must have lifetime past returned memory
    pub fn deserializeNoAlloc(buf: []const u8) !Packet {
        var reader = std.Io.Reader.fixed(buf);
        const header = try reader.takeStruct(Header, .little);
        const data: []u8 = try reader.take(header.length);
        return .{ .header = header, .data = data };
    }
};

Notes:

  1. simpler, less code
  2. lifetimes might be a bit sketchy

Supports allocation

pub const Packet = struct {
    header: Header,
    data: []u8,

    pub const Header = packed struct(u16) {
        length: u11,
        reserved: u5 = 0,
    };

    /// asserts data.len < max u11
    /// no lifetime requirement on data parameter
    /// free returned memory with deinit
    pub fn initAlloc(allocator: std.mem.Allocator, data: []u8) !Packet {
        return Packet{
            .header = .{ .length = @intCast(data.len) },
            .data = try allocator.dupe(u8, data),
        };
    }

    pub fn serialize(self: Packet, writer: *std.Io.Writer) !void {
        try writer.writeStruct(self.header, .little);
        try writer.writeAll(self.data);
    }

    /// free using deinit
    pub fn deserializeAlloc(allocator: std.mem.Allocator, reader: *std.Io.Reader) !Packet {
        const header = try reader.takeStruct(Header, .little);
        const data = try reader.readAlloc(allocator, header.length);
        return Packet{ .header = header, .data = data };
    }

    /// only required if you use deserializeAlloc
    pub fn deinit(self: *Packet, allocator: std.mem.Allocator) void {
        allocator.free(self.data);
    }
};

Notes:

  1. requires allocator
  2. lifetimes are clearer (because allocator screams lifetime stuff at me)

What I hate about allocation is that it introduces the possibility of error.OutOfMemory. Generally, I prefer not to use any allocation if possible, so I typically use the “no allocation” approach, sometimes going to great lengths to do so. I feel this helps me better manage error explosion.

Some may say to “just use a FixedBufferAllocator”. I would respond that, practically, I want infallible deserialization, or at least extremely limited failures. The deserializeNoAlloc method can produce only one error: error.ReadFailed. It cannot produce error.OutOfMemory. If I attempt to use a FixedBufferAllocator, I need to guess the buffer size by reading the code in deserializeAlloc. Sometimes this is not practical; there can be many branches in the deserializer. You must also fully understand the behavior of the allocator under possible fragmentation and know how efficient your allocator is, which can be difficult. Does the API contract for FixedBufferAllocator include how efficient it is? IDK.

What do you think is the most maintainable, robust implementation? Which would you prefer to come back to and maintain ~5 years from now? Which provides the best balance of complexity / error proneness?

Edit: actually, the take I am using is unsafe under untrusted input, because it asserts that the buffer has at least header.length bytes remaining. That’s annoying, anyone have a fix for that?
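One possible fix, as a sketch (error.InvalidLength is an invented name, use whatever fits your error set): validate header.length against the bytes remaining in buf before calling take, so malformed input becomes an error instead of tripping the assertion.

```zig
/// buf must have lifetime past returned memory
/// safe on untrusted input: rejects length fields that exceed the buffer
pub fn deserializeNoAlloc(buf: []const u8) !Packet {
    var reader = std.Io.Reader.fixed(buf);
    const header = try reader.takeStruct(Header, .little);
    // takeStruct succeeded, so buf.len >= @sizeOf(Header); reject a
    // claimed length larger than what remains instead of asserting
    // inside take().
    if (header.length > buf.len - @sizeOf(Header))
        return error.InvalidLength;
    const data: []u8 = try reader.take(header.length);
    return .{ .header = header, .data = data };
}
```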

const std = @import("std");

pub const packet = struct {
    pub const Header = packed struct(u16) {
        length: u11,
        reserved: u5 = 0,
    };

    pub fn serialize(data: []const u8, writer: *std.Io.Writer) !void {
        const header: Header = .{ .length = @intCast(data.len) };
        try writer.writeStruct(header, .little);
        try writer.writeAll(data);
    }

    /// return a slice of the buf
    pub fn deserialize(buf: []const u8) ![]u8 {
        var reader = std.Io.Reader.fixed(buf);
        const header = try reader.takeStruct(Header, .little);
        return try reader.take(header.length);
    }
};

You removed data, which was the only thing that necessitated either allocating or a prolonged buffer lifetime.

Yes, I think the Packet concept here adds an entity whose lifecycle needs to be managed, but its existence doesn’t seem necessary for the serialize and deserialize behaviors themselves; it simply adds cognitive complexity.
So I changed it to a plain namespace rather than an entity type.

Your specific example is too trivial and doesn’t have certain interesting properties that would require you to consider different strategies for copying data. There’s barely any need for (de)serialization at all; you could instead just keep the buffer of serialized data around and read/reinterpret memory as needed:

const serialized: []const u8 = "\x05\x00abcdefg";

const Int = @typeInfo(Header).@"struct".backing_integer.?;
const header: Header = @bitCast(std.mem.readInt(Int, serialized[0..@sizeOf(Int)], .little));
try std.testing.expectEqual(5, header.length);

const data = serialized[@sizeOf(Header)..][0..header.length];
try std.testing.expectEqualStrings("abcde", data);

If your work with the struct requires it to have a longer lifetime than the incoming buffer of serialized data, you could Allocator.dupe() it, or @memcpy it into a different buffer that is under your control.

The question itself would become more interesting if we use some different examples, so I’ll throw in some simple ones:

1. The serialized format is exactly four decimal integers in the range 0-65535 inclusive, separated by spaces. The deserialized format is an array containing exactly four unsigned 16-bit integers.

const serialized: []const u8 = "1 51 765 65535";

const deserialized = ???;

const expected: [4]u16 = .{ 1, 51, 765, 65535 };
try std.testing.expectEqual(expected, deserialized);

2. Modification of the above: The serialized format is an unbounded count of decimal integers in the range 0-65535 inclusive, separated by spaces. The deserialized format is an array containing an unbounded count of unsigned 16-bit integers.

const serialized: []const u8 = "1 51 765 65535";

const deserialized = ???;

const expected: []u16 = &.{ 1, 51, 765, 65535 };
try std.testing.expectEqualSlices(u16, expected, deserialized);

3. The serialized format is an unbounded count of quoted string literals separated by spaces. Only printable ASCII characters are permitted. Literal quotes inside a string can be escaped by doubling the quote. The deserialized format is an array containing an unbounded count of strings.

const serialized: []const u8 =
    \\"abc" "one two three" "Dwayne ""The Rock"" Johnson"
;

const deserialized = ???;

const expected: []const []const u8 = &.{
    \\abc
    ,
    \\one two three
    ,
    \\Dwayne "The Rock" Johnson
};
for (expected, deserialized) |e, d| {
    try std.testing.expectEqualStrings(e, d);
}

These three examples each have different properties that favor slightly different API designs. How would you design a Deserializer API for each of them?

Ignore the specifics of the parsing/implementation, the interesting question is which deserializeXyz() functions you would expose to the user, which parameters they take, which results/errors they return and any preconditions the user must fulfill.

I have my own ideas but I think it would be interesting to hear what others have to say first and see if you can find the same things I see.

Let me suggest a variation on #1. I did one for diffz like this:

    /// Emit patch hunk in Unidiff format, as specified here:
    /// https://github.com/google/diff-match-patch/wiki/Unidiff
    /// This is similar to GNU Unidiff format, but not identical.
    /// Header: @@ -382,8 +481,9 @@
    /// Indices are printed as 1-based, not 0-based.
    /// @return The GNU diff string.
    pub fn asText(patch: Hunk, allocator: Allocator) ![]const u8 {
        var text_array = ArrayList(u8).init(allocator);
        defer text_array.deinit();
        const writer = text_array.writer();
        try patch.writeText(writer);
        return text_array.toOwnedSlice();
    }

    const format = std.fmt.format;

    /// Stream textual patch representation to Writer.  See `asText`
    /// for more information.
    pub fn writeText(patch: Hunk, writer: anytype) !void {
        // Write header.
        _ = try writer.write(PATCH_HEAD);
        // Stream coordinates
        if (patch.length1 == 0) {
            try format(writer, "{d},0", .{patch.start1});
        } else if (patch.length1 == 1) {
            try format(writer, "{d}", .{patch.start1 + 1});
        } else {
            try format(writer, "{d},{d}", .{ patch.start1 + 1, patch.length1 });
        }
        _ = try writer.write(" +");
        if (patch.length2 == 0) {
            try std.fmt.format(writer, "{d},0", .{patch.start2});
        } else if (patch.length2 == 1) {
            _ = try format(writer, "{d}", .{patch.start2 + 1});
        } else {
            try format(writer, "{d},{d}", .{ patch.start2 + 1, patch.length2 });
        }
        _ = try writer.write(PATCH_TAIL);
        // Escape the body of the patch with %xx notation.
        for (patch.diffs.items) |a_diff| {
            switch (a_diff.operation) {
                .insert => try writer.writeByte('+'),
                .delete => try writer.writeByte('-'),
                .equal => try writer.writeByte(' '),
            }
            _ = try writeUriEncoded(writer, a_diff.text);
            try writer.writeByte('\n');
        }
        return;
    }

This can be non-allocating, by providing a Writer which is backed by stack memory. If the serial form is just going to be written out anyway (common), it can be streamed directly, and if you need a slice, there’s a convenience function for that too.
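For instance, something like this (a sketch; the buffer size is a guess, and patch is assumed to be a Hunk as in the snippet above):

```zig
// Non-allocating use: back the anytype writer with stack memory.
var buf: [4096]u8 = undefined;
var stream = std.io.fixedBufferStream(&buf);
try patch.writeText(stream.writer());
const text = stream.getWritten(); // slice over the stack buffer
```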

I tried to illustrate the problem I have with scenario 2 (deserializing an unbounded list of integers).

The problem I have is that it is difficult to express a reserve-first API. It’s difficult for me to provide an API (to myself) that is not programmer-error-prone yet still allows me to express my assumptions about the size of the input.

For example, what if I know that there will be at most 5 integers to deserialize?

Let’s take a look at how I can encode this assumption into my program:

const std = @import("std");

// scenario 2: unbounded list of integers
/// caller must free returned memory
fn deserializeIntsAlloc(allocator: std.mem.Allocator, reader: *std.Io.Reader) ![]u16 {
    var list = std.ArrayList(u16).empty;
    errdefer list.deinit(allocator);
    while (true) {
        const bytes_containing_int: []const u8 = reader.takeDelimiterExclusive(' ') catch |err| switch (err) {
            error.EndOfStream => break,
            error.StreamTooLong => return error.StreamTooLong,
            error.ReadFailed => return error.ReadFailed,
        };
        const num = try std.fmt.parseInt(u16, bytes_containing_int, 10);
        try list.append(allocator, num);
    }
    return try list.toOwnedSlice(allocator);
}

test "alloc four ints" {
    const serialized: []const u8 = "1 51 765 65535";
    var reader = std.Io.Reader.fixed(serialized);

    const deserialized = try deserializeIntsAlloc(std.testing.allocator, &reader);
    defer std.testing.allocator.free(deserialized);

    const expected: []const u16 = &.{ 1, 51, 765, 65535 };
    try std.testing.expectEqualSlices(u16, expected, deserialized);
}

// scenario 2: how can we provide an errorless / reserve-first API?
// For example, the user knows there are only up to 5 ints ahead of time

// this sucks!
test {
    const serialized: []const u8 = "1 51 765 65535";
    var reader = std.Io.Reader.fixed(serialized);

    // Guessing sizes here! Depends on the efficiency of the allocator!
    // Look at how much work it is to do "reserve first", something
    // we should encourage!
    var stack_memory: [4096]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&stack_memory);

    const deserialized = deserializeIntsAlloc(fba.allocator(), &reader) catch |err| switch (err) {
        error.OutOfMemory => unreachable,
        error.ReadFailed => return error.ReadFailed,
        error.StreamTooLong => return error.StreamTooLong,
        error.Overflow => return error.Overflow,
        error.InvalidCharacter => return error.InvalidCharacter,
    };

    const expected: []const u16 = &.{ 1, 51, 765, 65535 };
    try std.testing.expectEqualSlices(u16, expected, deserialized);
}


And here is my bonus solution to scenario 1 (returning the deserialized values on the stack):

// scenario 1: 4 fixed-width integers

/// return 4 ints on the stack, because it's small and we prefer not to
/// allocate if we do not have to.
fn deserializeFourInts(reader: *std.Io.Reader) ![4]u16 {
    var result: [4]u16 = undefined;
    for (&result) |*num| {
        const bytes_containing_int: []const u8 = try reader.takeDelimiterExclusive(' ');
        num.* = try std.fmt.parseInt(u16, bytes_containing_int, 10);
    }
    return result;
}

test "four ints" {
    const serialized: []const u8 = "1 51 765 65535";
    var reader = std.Io.Reader.fixed(serialized);
    const deserialized = try deserializeFourInts(&reader);

    const expected: []const u16 = &.{ 1, 51, 765, 65535 };
    try std.testing.expectEqualSlices(u16, expected, &deserialized);
}

A “reserve-first” API wouldn’t normally take an allocator; it would take an output buffer like []T.

The way I would approach the problem is to try to think of the smallest and most primitive API that removes as many concerns from the implementation as possible, while still remaining useful. For the second example I would go with an iterator-like design:

pub const Deserializer = struct {
    source: []const u8,
    pos: usize,

    pub fn init(source: []const u8) Deserializer {
        return .{ .source = source, .pos = 0 };
    }

    pub fn next(d: *Deserializer) ?(error{SyntaxError}!u16) {
        // Implementation...
    }
};

test Deserializer {
    const serialized: []const u8 = "1 51 765 65535";

    var deserialized_buf: [100]u16 = undefined;
    var deserialized: std.ArrayList(u16) = .initBuffer(&deserialized_buf);

    var deserializer: Deserializer = .init(serialized);
    while (deserializer.next()) |value_or_err| {
        const value = try value_or_err;
        try deserialized.appendBounded(value);
    }

    const expected: []const u16 = &.{ 1, 51, 765, 65535 };
    try std.testing.expectEqualSlices(u16, expected, deserialized.items);
}

With something like this, the deserializer API doesn’t even have to concern itself with memory management at all; it’s entirely up to the user whether a fixed-size output buffer should be reserved in advance and how out-of-memory conditions should be handled.

For common use cases you could of course provide utility functions like the following: (Edit: Fixed the second one to free allocated memory in the event of failure.)

pub fn deserializeAll(
    dst: []u16,
    src: []const u8,
) error{ OutOfMemory, SyntaxError }![]u16 {
    var deserialized: std.ArrayList(u16) = .initBuffer(dst);
    var deserializer: Deserializer = .init(src);
    while (deserializer.next()) |value_or_err| {
        const value = try value_or_err;
        try deserialized.appendBounded(value);
    }
    return deserialized.items;
}

pub fn deserializeAllAlloc(
    allocator: std.mem.Allocator,
    src: []const u8,
) error{ OutOfMemory, SyntaxError }![]u16 {
    var deserialized: std.ArrayList(u16) = .empty;
    errdefer deserialized.deinit(allocator);

    var deserializer: Deserializer = .init(src);
    while (deserializer.next()) |value_or_err| {
        const value = try value_or_err;
        try deserialized.append(allocator, value);
    }
    return deserialized.toOwnedSlice(allocator);
}

But it’s the primitive iterator-like API that provides the most flexibility and power to the user.


The third example with the string literals is a little trickier to design around because now you have an unbounded sequence of items that are themselves unbounded. I might go with something like this, giving the user the choice between copying character data to a fixed-size buffer or appending it to an array list:

pub const Deserializer = struct {
    source: []const u8,
    pos: usize,

    pub fn init(source: []const u8) Deserializer {
        return .{ .source = source, .pos = 0 };
    }

    /// Writes character data to `buffer`.
    /// Returns a slice over the written portion of `buffer`.
    /// Does not consume the current token if `buffer` can't fit all character data.
    pub fn next(
        d: *Deserializer,
        buffer: []u8,
    ) ?(error{ OutOfMemory, SyntaxError }![]u8) {
        // Implementation...
    }

    /// Appends character data to `buffer`, growing it if needed.
    /// Returns a slice over the appended portion of `buffer`.
    /// Does not consume the current token if `error.OutOfMemory` is returned.
    pub fn nextAlloc(
        d: *Deserializer,
        buffer: *std.ArrayList(u8),
        allocator: std.mem.Allocator,
    ) ?(error{ OutOfMemory, SyntaxError }![]u8) {
        // Implementation...
    }
};

One neat observation you might make about the string literals is that the deserialized output will never be longer than the serialized input. This means instead of always copying the character data to a provided output buffer, you could instead modify the source buffer in-place (so long as it is mutable). And since most strings probably won’t contain the "" quote escape sequence there won’t be any copying at all most of the time!

pub const Deserializer = struct {
    source: []u8,
    pos: usize,

    pub fn init(source: []u8) Deserializer {
        return .{ .source = source, .pos = 0 };
    }

    /// Might modify `source` in-place.
    /// Returns a slice over `source`.
    pub fn next(d: *Deserializer) ?(error{SyntaxError}![]u8) {
        // Implementation...
    }
};