I've made a small Huffmann coder implementation in (raw) Zig. Feedback welcome!

Qfl3x · November 8, 2024, 10:55am

I’m working on a file synchronizer, so first step in my mind was to implement a compressor, I wanted to do arithmetic coding, but I figured I should start with the simpler Huffman coding instead. This is just a small project to learn Zig, please let me know what I can do to improve the code (It’s a bit of a mess right now, sorry):

I tested it by compressing a simple small text file then decompressing it, and then by compressing the compiled zig executable for the project itself and decompressing it. Both worked, though because the files weren’t too big the compression was 75% for the binary, and the compressed file was larger (header overhead) for the small text file.

dee0xeed · November 8, 2024, 11:21am

Here is an implementation of arithemetic/range codec with (0,1) alphabet (see bit-predictor.zig, bit-encoder.zig and bit-decoder.zig).

Also there are ANS entropy codecs which also are able to encode long sequence of bits with a single bit.

wrapitup · November 8, 2024, 3:35pm

This kind of project is also one of the first things that I did to learn Zig. This is a good start.

Here’s some feedback based on my own personal journey just to give some ideas:

Just off the bat, I think one thing you can try to do next is to replace many of your ArrayList(u8) with comptime known arrays. For example, when getting frequency statistics (that is, the counts_sum), you can just use a [256]u32. You can even make the intention more explicity with [std.math.maxInt(u8)]u32.

var counts: [std.math.maxInt(u8)]u32 = .{0} ** std.math.maxInt(u8); // initialize to zero
// or bytes.items if you read bytes into an ArrayList(u8)
for (bytes) |byte| {
     counts[byte] += 1;
}

You don’t necessarily need to divide each count by the total count sum to get a float probs array. You’re dividing by a constant, so change findmin to use integers rather than floats. That way, you can simplify from having to cast to f64.
As part of the next step, you could implement 2 priority queues (one for leaf nodes and one for branch nodes), and pop off the two nodes with the lowest priority, as described in the wikipedia page on huffman codes. And even after that you can sort by priority and then use this paper to do an in-place calculation of huffman tree.
Strive to reduce heap allocations. This is related to my first point where you can replace ArrayLists with an array in the stack. This requires some more thinking about the invariants in the algorithm and understanding your data. Each time you have code that is being allocated on the heap, take a moment to consider whether you can use an array on the stack.