Notes on implementing a base64 writer

As an exercise to learn the new I/O interfaces in Zig 0.15, I wrote a base64 Writer. It was harder than expected (partly because I detoured into fuzz testing), but it was fun and I learned a lot. This is a summary of my implementation notes, along with observations and questions.

Repo: https://github.com/benburkert/base64.zig (base64 encoding Writer experiment)

Background

I chose a base64 Writer because I need an allocation-free encoder usable from a jsonStringify function, and base64 is straightforward and familiar.

The main requirement is that it only emit full base64 chunks (3 input bytes to 4 output characters) to avoid adding padding midstream. A single input stream should produce a single contiguous base64 sequence.
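
To make that concrete, here is a small illustration using std.base64 directly: encoding a partial chunk pads it immediately, so anything that emits partial chunks before the stream ends would scatter = characters through the output.

    const std = @import("std");

    test "padding appears as soon as a partial chunk is encoded" {
        const enc = std.base64.standard.Encoder;
        var out: [4]u8 = undefined;
        // A full 3-byte chunk maps cleanly to 4 characters, no padding:
        try std.testing.expectEqualStrings("Zm9v", enc.encode(&out, "foo"));
        // A 2-byte partial chunk is padded right away; emitting this anywhere
        // but at the very end of the stream would corrupt the output:
        try std.testing.expectEqualStrings("Zm8=", enc.encode(&out, "fo"));
    }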

Design & Implementation

I wanted to reuse std.base64 as much as possible. std.base64.Base64Encoder.encodeWriter had what I needed, except it consumes a single byte slice (source) rather than a byte stream. So my writer implementation focused on chunking input to that function, not on base64 encoding. I added an explicit ā€final flushā€ function, end, to output trailing padding.

  • Is a writer allowed to retain data in its buffer after flush? If not, a separate internal buffer may be required.
  • defer w.end() would be a nice pattern, but it can error, so it isn’t practical.

Next, I jumped straight into the implementation of the drain function. I’m used to writers being a single (and straightforward) write function that takes a single byte buffer or stream source. In those situations, a base64 writer implementation mostly involves chunking out writes 3 bytes at a time until the final write or end of the stream. Those implementations are simple enough to reason through an efficient version on the first try, so that’s what I did for this new writer.

  • I assumed that once I had drain implemented the other VTable functions would be for optimizing, not correctness, but I was wrong.

After spending 2+ hours on multiple drain implementations (all of them incorrect), I decided this was a dead-end approach. The combination of buffering built into the interface and the drain function accepting a splat-able byte vector really increases the amount of bookkeeping an implementation has to do. It was too much to keep in my head while trying to juggle both correctness and performance.

  • I also feel like my intuition from other languages didn’t carry over here, since the interface is such a departure from previous writers I’ve encountered.

New Approach

At this point it was clear to me that this new I/O interface required a more rigorous approach even for a simple encoder like this one. So I shifted to a more deliberate strategy: start with a naive but correct implementation, then build scaffolding to evolve it while mechanically checking correctness and performance.

The scaffolding began with a fuzz-test corpus that I could grow into an exhaustive suite. The same corpus would support profiling to quantify improvements. First step: get a naive writer implementation working.

  • It’s a lot of up-front work for a simple encoder, but I’m also hoping the scaffolding could be generic enough for any new writer implementation.

Naive Implementation

The naive implementation consists of drain, flush, and end functions that only work with a 3-byte buffer. drain encodes one full chunk if available; otherwise it buffers a single input byte. flush calls drain if a full chunk is buffered; otherwise it’s a no-op. end encodes any remaining buffered bytes and adds padding. A rough sketch follows the notes below.

  • It’s interesting that a minimal drain implementation only needs to perform one task per invocation: a) move bytes from buffer to sink, or b) move bytes from data to buffer.
  • By only consuming at most one byte from the data vector, the splat argument can be ignored entirely.
  • My hunch is that I got a better feel for the new I/O interface implementing the naive implementation vs if the first approach had worked.
  • The fixed-size buffer requirement avoids extra bookkeeping (and aliasing) when rebalancing the buffer.
  • Would a seek field (like in std.Io.Reader.VTable) help offload buffer rebalancing from writers?
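
To make the shape concrete, here is a minimal sketch of that naive writer. It is simplified: the actual code reuses encodeWriter, while this sketch encodes each chunk into a small stack buffer with Base64Encoder.encode, and it assumes the interface buffer is exactly one 3-byte chunk.

    const std = @import("std");

    /// Sketch of the naive writer: the interface buffer holds one 3-byte chunk.
    pub const Base64Writer = struct {
        writer: std.Io.Writer,
        sink: *std.Io.Writer, // underlying writer that receives the base64 text

        pub fn init(sink: *std.Io.Writer, buffer: []u8) Base64Writer {
            std.debug.assert(buffer.len == 3); // naive version: exactly one chunk
            return .{
                .sink = sink,
                .writer = .{
                    .buffer = buffer,
                    .vtable = &.{ .drain = drain, .flush = flush },
                },
            };
        }

        fn drain(w: *std.Io.Writer, data: []const []const u8, splat: usize) std.Io.Writer.Error!usize {
            _ = splat; // consuming at most one byte lets us ignore splat (see note above)
            const self: *Base64Writer = @fieldParentPtr("writer", w);
            // a) a full chunk is buffered: move it to the sink as base64
            if (w.end == w.buffer.len) {
                var out: [4]u8 = undefined; // 3 input bytes -> 4 output characters
                const encoded = std.base64.standard.Encoder.encode(&out, w.buffer[0..w.end]);
                try self.sink.writeAll(encoded);
                w.end = 0;
                return 0;
            }
            // b) otherwise, move a single byte from data into the buffer
            for (data) |bytes| {
                if (bytes.len == 0) continue;
                w.buffer[w.end] = bytes[0];
                w.end += 1;
                return 1;
            }
            return 0;
        }

        fn flush(w: *std.Io.Writer) std.Io.Writer.Error!void {
            // Only a full chunk may be emitted mid-stream; partial chunks wait for end().
            if (w.end == w.buffer.len) _ = try drain(w, &.{""}, 1);
        }

        /// Final flush: encode any buffered tail (1 or 2 bytes), which adds the '=' padding.
        pub fn end(self: *Base64Writer) std.Io.Writer.Error!void {
            const w = &self.writer;
            if (w.end > 0) {
                var out: [4]u8 = undefined;
                const encoded = std.base64.standard.Encoder.encode(&out, w.buffer[0..w.end]);
                try self.sink.writeAll(encoded);
                w.end = 0;
            }
        }
    };

The chunk buffer is supplied by the caller, so the type itself stays allocation-free.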

For correctness, I wrote unit tests that use the std.Io.Writer functions that in turn directly call the drain and flush functions. Reading through the std.Io.Writer implementation, I identified those as flush, writeAll, writeByte, writeVecAll, and writeSplatAll. Getting initial coverage on these different ā€operationsā€ gave me enough confidence in the implementation to move on to fuzz testing. One such test is sketched after the note below.

  • For a writer that provides more than just drain & flush, there will most likely be more function coverage required beyond those five functions.
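
For illustration, a writeAll-flavored test against the Base64Writer sketch above might look like this, with std.Io.Writer.fixed providing the in-memory sink:

    test "writeAll + flush + end emits one full chunk" {
        var sink_buf: [16]u8 = undefined;
        var sink: std.Io.Writer = .fixed(&sink_buf);

        var chunk: [3]u8 = undefined;
        var b64 = Base64Writer.init(&sink, &chunk);

        try b64.writer.writeAll("foo"); // fills the 3-byte chunk buffer
        try b64.writer.flush();         // full chunk, so "Zm9v" goes to the sink
        try b64.end();                  // nothing left buffered, so no padding

        try std.testing.expectEqualStrings("Zm9v", sink_buf[0..sink.end]);
    }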

Fuzz Testing

Building on the idea of unit testing the different writer ā€operationsā€, my goal with fuzzing here is to generate a large number of random operations on a writer, then check that the final output matches the expected output. The fuzz test would exercise all the variations I was unable to hold in my head earlier. Once I had the initial set of test cases, the fuzzer could expand the test cases and shake out all the bugs I added in the implementation.

  • I used afl (zig-afl-kit) because the built-in fuzzer didn’t work on my Mac laptop.

I started with a compact encoding of the test case because I learned from my previous experience with fuzzing that compact representations are the best for mutating. For example, "Zm9v:min:a:Zm9v" is a test case that creates a writer with a minimally sized buffer (min), calls writeAll (a) with "foo" (Zm9v), and expects the output "foo" (the first Zm9v).

  • This custom ASCII encoding may still be too verbose. I wasn’t very happy with the cases generated by the fuzz tester; I suspect there’s still too much entropy even in the compact form.
  • I included the buffer length in the test case because fuzzing & profiling could be a good way to determine optimal buffer size. (And possibly provide some numbers to support or refute Andrew’s statement about always using the smallest valid buffer.)

I then added ZON encoding for the test cases, and used an LLM to generate an initial set of 30 ZON tests to use as inputs to the fuzzer. I also wrote a command to convert a ZON-encoded test case to the compact encoding for the fuzzer.
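
The actual schema lives in the repo; a hypothetical case in that spirit (field names invented here) looks something like:

    .{
        .buffer = "min",            // writer buffer size, as in the compact form
        .ops = .{
            .{ .writeAll = "foo" }, // operations to apply, in order
        },
        .expected = "Zm9v",         // expected output of the writer (base64 of the input)
    }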

A roughly 8-hour run produced ~1,800 corpus files over ~850 cycles. Most were invalid cases; some legitimate crashes appeared and mostly traced to integer overflows in std.Io.Writer with absurdly large splats. Hangs were usually huge allocations. There’s more triage to do here.

  • I doubt fuzzing will really pay off until the implementation is optimized.

Profiling

I added two profiling commands that run all test cases 10x and exit. The first is a baseline that writes directly to the sink. The second writes through the base64 writer.

Benchmark 1 (159 runs): zig-out/bin/fuzz-bench-nop
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          50.8ms ± 13.5ms    21.8ms …  169ms         25 (16%)        0%
  peak_rss           2.97MB ± 10.8KB    2.97MB … 3.06MB          5 ( 3%)        0%
  cpu_cycles         5.85M  ± 2.06M     2.41M  … 27.9M          22 (14%)        0%
  instructions       3.87M  ± 1.26M     1.64M  … 17.2M          34 (21%)        0%
  cache_misses        122K  ± 35.7K     43.0K  …  472K          24 (15%)        0%
  branch_misses      8.13K  ± 3.25K     2.30K  … 42.4K          17 (11%)        0%
Benchmark 2 (90 runs): zig-out/bin/fuzz-bench-b64
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           348ms ± 34.9ms     302ms …  514ms         10 (11%)        šŸ’©+585.1% ± 12.0%
  peak_rss           3.31MB ± 3.45KB    3.31MB … 3.34MB          1 ( 1%)        šŸ’©+ 11.6% ±  0.1%
  cpu_cycles         44.6M  ± 5.53M     37.6M  … 70.5M          10 (11%)        šŸ’©+662.9% ± 16.4%
  instructions       27.9M  ± 4.13M     24.2M  … 48.1M           9 (10%)        šŸ’©+620.1% ± 17.9%
  cache_misses        919K  ± 78.7K      821K  … 1.36M           7 ( 8%)        šŸ’©+651.8% ± 11.7%
  branch_misses      39.4K  ± 13.7K     31.0K  …  106K          12 (13%)        šŸ’©+384.0% ± 27.4%

Conclusion

After building out the scaffolding, I haven’t yet optimized the writer itself. I hope to make a follow-up post once I do. Or maybe I can nerd-snipe someone into doing that work; the tooling is in place now!


Here’s an approach I think you might want to take for drain:

  • Process as many bytes as possible from buffer, while leaving >= chunk_len bytes buffered
  • Move the unprocessed bytes to the beginning of buffer
  • Refill the buffer as much as possible with data, and return the length refilled
Excerpt from a WIP Writer implementation adapted to your use case
    fn drain(w: *std.Io.Writer, data: []const []const u8, splat: usize) std.Io.Writer.Error!usize {
        const self: *Writer = @fieldParentPtr("writer", w);

        // Process, returns the number of bytes processed and does not modify w.end
        const num_processed = try processWithAtLeastNRemaining(self, chunk_len);

        // Move remaining buffered data to the start
        if (num_processed > 0) {
            const len = w.end - num_processed;
            @memmove(w.buffer[0..len], w.buffer[num_processed..][0..len]);
            w.end = len;
        }

        // Refill buffer using data
        {
            const end_before_fill = w.end;
            for (data[0 .. data.len - 1]) |bytes| {
                const dest = w.buffer[w.end..];
                const len = @min(bytes.len, dest.len);
                @memcpy(dest[0..len], bytes[0..len]);
                w.end += len;
            }
            const pattern = data[data.len - 1];
            switch (pattern.len) {
                0 => {},
                1 => {
                    // Clamp to the remaining buffer space, as in the general case below
                    const len = @min(splat, w.buffer.len - w.end);
                    @memset(w.buffer[w.end..][0..len], pattern[0]);
                    w.end += len;
                },
                else => {
                    const dest = w.buffer[w.end..];
                    for (0..splat) |i| {
                        const remaining = dest[i * pattern.len ..];
                        const len = @min(pattern.len, remaining.len);
                        @memcpy(remaining[0..len], pattern[0..len]);
                        w.end += len;
                        // Stop once the buffer is full; otherwise the next
                        // iteration would slice past the end of dest.
                        if (len < pattern.len) break;
                    }
                },
            }

            return w.end - end_before_fill;
        }
    }

In my implementation, flush then processes the buffer down to 0 length; unsure if that’d work for your use case or not.
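
Roughly, that flush looks like this (same caveats as the excerpt above; processWithAtLeastNRemaining is the same WIP helper):

    fn flush(w: *std.Io.Writer) std.Io.Writer.Error!void {
        const self: *Writer = @fieldParentPtr("writer", w);
        // Process every buffered byte (leave 0 remaining) and empty the buffer.
        _ = try processWithAtLeastNRemaining(self, 0);
        w.end = 0;
    }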
