Notes on implementing a base64 writer

As an exercise to learn the new I/O interfaces in Zig 0.15, I wrote a base64 Writer. It was harder than expected (partly because I detoured into fuzz testing), but it was fun and I learned a lot. This is a summary of my implementation notes, along with observations and questions.

Repo: https://github.com/benburkert/base64.zig (base64 encoding Writer experiment)

Background

I chose a base64 Writer because I need an allocation-free encoder usable from a jsonStringify function, and base64 is straightforward and familiar.

The main requirement is that it only emit full base64 chunks (3 input bytes to 4 output characters) to avoid adding padding midstream. A single input stream should produce a single contiguous base64 sequence.
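
To make that concrete, here is a small illustration using std.base64 directly: encoding a partial chunk pads it immediately, so anything that emits partial chunks before the stream ends would scatter = characters through the output.

    const std = @import("std");

    test "padding appears as soon as a partial chunk is encoded" {
        const enc = std.base64.standard.Encoder;
        var out: [4]u8 = undefined;
        // A full 3-byte chunk maps cleanly to 4 characters, no padding:
        try std.testing.expectEqualStrings("Zm9v", enc.encode(&out, "foo"));
        // A 2-byte partial chunk is padded right away; emitting this anywhere
        // but at the very end of the stream would corrupt the output:
        try std.testing.expectEqualStrings("Zm8=", enc.encode(&out, "fo"));
    }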

Design & Implementation

I wanted to reuse std.base64 as much as possible. std.base64.Base64Encoder.encodeWriter had what I needed, except it consumes a single byte slice (source) rather than a byte stream. So my writer implementation focused on chunking input to that function, not on base64 encoding. I added an explicit ā€final flushā€ function, end, to output trailing padding.

  • Is a writer allowed to retain data in its buffer after flush? If not, a separate internal buffer may be required.
  • defer w.end() would be a nice pattern, but it can error, so it isn’t practical.

Next, I jumped straight into the implementation of the drain function. I’m used to writers being a single (and straightforward) write function that takes a single byte buffer or stream source. In those situations, a base64 writer implementation mostly involves chunking out writes 3 bytes at a time until the final write or end of the stream. Those implementations are simple enough to reason through an efficient version on the first try, so that’s what I did for this new writer.

  • I assumed that once I had drain implemented the other VTable functions would be for optimizing, not correctness, but I was wrong.

After spending 2+ hours on multiple drain implementations (all of them incorrect), I decided this was a dead-end approach. The combination of buffering built into the interface and the drain function accepting a splat-able byte vector really increases the amount of bookkeeping an implementation has to do. It was too much to keep in my head while trying to juggle both correctness and performance.

  • I also feel like my intuition from other languages didn’t carry over here, since the interface is such a departure from previous writers I’ve encountered.

New Approach

At this point it was clear to me that this new I/O interface required a more rigorous approach even for a simple encoder like this one. So I shifted to a more deliberate strategy: start with a naive but correct implementation, then build scaffolding to evolve it while mechanically checking correctness and performance.

The scaffolding began with a fuzz-test corpus that I could grow into an exhaustive suite. The same corpus would support profiling to quantify improvements. First step: get a naive writer implementation working.

  • It’s a lot of up-front work for a simple encoder, but I’m also hoping the scaffolding could be generic enough for any new writer implementation.

Naive Implementation

The naive implementation consists of drain, flush, and end functions that only work with a 3-byte buffer. drain encodes one full chunk if available; otherwise it buffers a single input byte. flush calls drain if a full chunk is buffered; otherwise it’s a no-op. end encodes any remaining buffered bytes and adds padding. A rough sketch follows the notes below.

  • It’s interesting that a minimal drain implementation only needs to perform one task per invocation: a) move bytes from buffer to sink, or b) move bytes from data to buffer.
  • By only consuming at most one byte from the data vector, the splat argument can be ignored entirely.
  • My hunch is that I got a better feel for the new I/O interface implementing the naive implementation vs if the first approach had worked.
  • The fixed-size buffer requirement avoids extra bookkeeping (and aliasing) when rebalancing the buffer.
  • Would a seek field (like in std.Io.Reader.VTable) help offload buffer rebalancing from writers?
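
To make the shape concrete, here is a minimal sketch of that naive writer. It is simplified: the actual code reuses encodeWriter, while this sketch encodes each chunk into a small stack buffer with Base64Encoder.encode, and it assumes the interface buffer is exactly one 3-byte chunk.

    const std = @import("std");

    /// Sketch of the naive writer: the interface buffer holds one 3-byte chunk.
    pub const Base64Writer = struct {
        writer: std.Io.Writer,
        sink: *std.Io.Writer, // underlying writer that receives the base64 text

        pub fn init(sink: *std.Io.Writer, buffer: []u8) Base64Writer {
            std.debug.assert(buffer.len == 3); // naive version: exactly one chunk
            return .{
                .sink = sink,
                .writer = .{
                    .buffer = buffer,
                    .vtable = &.{ .drain = drain, .flush = flush },
                },
            };
        }

        fn drain(w: *std.Io.Writer, data: []const []const u8, splat: usize) std.Io.Writer.Error!usize {
            _ = splat; // consuming at most one byte lets us ignore splat (see note above)
            const self: *Base64Writer = @fieldParentPtr("writer", w);
            // a) a full chunk is buffered: move it to the sink as base64
            if (w.end == w.buffer.len) {
                var out: [4]u8 = undefined; // 3 input bytes -> 4 output characters
                const encoded = std.base64.standard.Encoder.encode(&out, w.buffer[0..w.end]);
                try self.sink.writeAll(encoded);
                w.end = 0;
                return 0;
            }
            // b) otherwise, move a single byte from data into the buffer
            for (data) |bytes| {
                if (bytes.len == 0) continue;
                w.buffer[w.end] = bytes[0];
                w.end += 1;
                return 1;
            }
            return 0;
        }

        fn flush(w: *std.Io.Writer) std.Io.Writer.Error!void {
            // Only a full chunk may be emitted mid-stream; partial chunks wait for end().
            if (w.end == w.buffer.len) _ = try drain(w, &.{""}, 1);
        }

        /// Final flush: encode any buffered tail (1 or 2 bytes), which adds the '=' padding.
        pub fn end(self: *Base64Writer) std.Io.Writer.Error!void {
            const w = &self.writer;
            if (w.end > 0) {
                var out: [4]u8 = undefined;
                const encoded = std.base64.standard.Encoder.encode(&out, w.buffer[0..w.end]);
                try self.sink.writeAll(encoded);
                w.end = 0;
            }
        }
    };

The chunk buffer is supplied by the caller, so the type itself stays allocation-free.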

For correctness, I wrote unit tests that use the std.Io.Writer functions that in turn directly call the drain and flush functions. Reading through the std.Io.Writer implementation, I identified those as flush, writeAll, writeByte, writeVecAll, and writeSplatAll. Getting initial coverage on these different ā€operationsā€ gave me enough confidence in the implementation to move on to fuzz testing. One such test is sketched after the note below.

  • For a writer that provides more than just drain & flush, there will most likely be more function coverage required beyond those five functions.
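
For illustration, a writeAll-flavored test against the Base64Writer sketch above might look like this, with std.Io.Writer.fixed providing the in-memory sink:

    test "writeAll + flush + end emits one full chunk" {
        var sink_buf: [16]u8 = undefined;
        var sink: std.Io.Writer = .fixed(&sink_buf);

        var chunk: [3]u8 = undefined;
        var b64 = Base64Writer.init(&sink, &chunk);

        try b64.writer.writeAll("foo"); // fills the 3-byte chunk buffer
        try b64.writer.flush();         // full chunk, so "Zm9v" goes to the sink
        try b64.end();                  // nothing left buffered, so no padding

        try std.testing.expectEqualStrings("Zm9v", sink_buf[0..sink.end]);
    }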

Fuzz Testing

Building on the idea of unit testing the different writer ā€operationsā€, my goal with fuzzing here is to generate a large number of random operations on a writer, then check that the final output matches the expected output. The fuzz test would exercise all the variations I was unable to hold in my head earlier. Once I had the initial set of test cases, the fuzzer could expand the test cases and shake out all the bugs I added in the implementation.

  • I used afl (zig-afl-kit) because the built-in fuzzer didn’t work on my Mac laptop.

I started with a compact encoding of the test case because I learned from my previous experience with fuzzing that compact representations are the best for mutating. For example, "Zm9v:min:a:Zm9v" is a test case that creates a writer with a minimally sized buffer (min), calls writeAll (a) with "foo" (Zm9v), and expects the output "foo" (the first Zm9v).

  • This custom ASCII encoding may still be too verbose. I wasn’t very happy with the cases generated by the fuzz tester; I suspect there’s still too much entropy even in the compact form.
  • I included the buffer length in the test case because fuzzing & profiling could be a good way to determine optimal buffer size. (And possibly provide some numbers to support or refute Andrew’s statement about always using the smallest valid buffer.)

I then added ZON encoding for the test cases, and used an LLM to generate an initial set of 30 ZON tests to use as inputs to the fuzzer. I also wrote a command to convert a ZON-encoded test case to the compact encoding for the fuzzer.
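
The actual schema lives in the repo; a hypothetical case in that spirit (field names invented here) looks something like:

    .{
        .buffer = "min",            // writer buffer size, as in the compact form
        .ops = .{
            .{ .writeAll = "foo" }, // operations to apply, in order
        },
        .expected = "Zm9v",         // expected output of the writer (base64 of the input)
    }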

A roughly 8-hour run produced ~1,800 corpus files over ~850 cycles. Most were invalid cases; some legitimate crashes appeared and mostly traced to integer overflows in std.Io.Writer with absurdly large splats. Hangs were usually huge allocations. There’s more triage to do here.

  • I doubt fuzzing will really pay off until the implementation is optimized.

Profiling

I added two profiling commands that run all test cases 10x and exit. The first is a baseline that writes directly to the sink. The second writes through the base64 writer.

Benchmark 1 (159 runs): zig-out/bin/fuzz-bench-nop
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          50.8ms ± 13.5ms    21.8ms …  169ms         25 (16%)        0%
  peak_rss           2.97MB ± 10.8KB    2.97MB … 3.06MB          5 ( 3%)        0%
  cpu_cycles         5.85M  ± 2.06M     2.41M  … 27.9M          22 (14%)        0%
  instructions       3.87M  ± 1.26M     1.64M  … 17.2M          34 (21%)        0%
  cache_misses        122K  ± 35.7K     43.0K  …  472K          24 (15%)        0%
  branch_misses      8.13K  ± 3.25K     2.30K  … 42.4K          17 (11%)        0%
Benchmark 2 (90 runs): zig-out/bin/fuzz-bench-b64
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           348ms ± 34.9ms     302ms …  514ms         10 (11%)        šŸ’©+585.1% ± 12.0%
  peak_rss           3.31MB ± 3.45KB    3.31MB … 3.34MB          1 ( 1%)        šŸ’©+ 11.6% ±  0.1%
  cpu_cycles         44.6M  ± 5.53M     37.6M  … 70.5M          10 (11%)        šŸ’©+662.9% ± 16.4%
  instructions       27.9M  ± 4.13M     24.2M  … 48.1M           9 (10%)        šŸ’©+620.1% ± 17.9%
  cache_misses        919K  ± 78.7K      821K  … 1.36M           7 ( 8%)        šŸ’©+651.8% ± 11.7%
  branch_misses      39.4K  ± 13.7K     31.0K  …  106K          12 (13%)        šŸ’©+384.0% ± 27.4%

Conclusion

After building out the scaffolding, I haven’t yet optimized the writer itself. I hope to make a follow-up post once I do. Or maybe I can nerd-snipe someone into doing that work; the tooling is in place now!


Here’s an approach I think you might want to take for drain:

  • Process as many bytes as possible from buffer, while leaving >= chunk_len bytes buffered
  • Move the unprocessed bytes to the beginning of buffer
  • Refill the buffer as much as possible with data, and return the length refilled
Excerpt from a WIP Writer implementation adapted to your use case
    fn drain(w: *std.Io.Writer, data: []const []const u8, splat: usize) std.Io.Writer.Error!usize {
        const self: *Writer = @fieldParentPtr("writer", w);

        // Process, returns the number of bytes processed and does not modify w.end
        const num_processed = try processWithAtLeastNRemaining(self, chunk_len);

        // Move remaining buffered data to the start
        if (num_processed > 0) {
            const len = w.end - num_processed;
            @memmove(w.buffer[0..len], w.buffer[num_processed..][0..len]);
            w.end = len;
        }

        // Refill buffer using data
        {
            const end_before_fill = w.end;
            for (data[0 .. data.len - 1]) |bytes| {
                const dest = w.buffer[w.end..];
                const len = @min(bytes.len, dest.len);
                @memcpy(dest[0..len], bytes[0..len]);
                w.end += len;
            }
            const pattern = data[data.len - 1];
            switch (pattern.len) {
                0 => {},
                1 => {
                    // Clamp to the remaining buffer space, as in the general case below
                    const len = @min(splat, w.buffer.len - w.end);
                    @memset(w.buffer[w.end..][0..len], pattern[0]);
                    w.end += len;
                },
                else => {
                    const dest = w.buffer[w.end..];
                    for (0..splat) |i| {
                        const remaining = dest[i * pattern.len ..];
                        const len = @min(pattern.len, remaining.len);
                        @memcpy(remaining[0..len], pattern[0..len]);
                        w.end += len;
                        // Stop once the buffer is full; otherwise the next
                        // iteration would slice past the end of dest.
                        if (len < pattern.len) break;
                    }
                },
            }

            return w.end - end_before_fill;
        }
    }

In my implementation, flush then processes the buffer down to 0 length; unsure if that’d work for your use case or not.
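
Roughly, that flush looks like this (same caveats as the excerpt above; processWithAtLeastNRemaining is the same WIP helper):

    fn flush(w: *std.Io.Writer) std.Io.Writer.Error!void {
        const self: *Writer = @fieldParentPtr("writer", w);
        // Process every buffered byte (leave 0 remaining) and empty the buffer.
        _ = try processWithAtLeastNRemaining(self, 0);
        w.end = 0;
    }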
