Zstd performance benchmarking correctness

Hey! I’m currently working on the Zstandard decompression performance issue, whose goal is to benchmark the Zig std Zstd decompression implementation, compare it with the reference C implementation, and possibly propose improvements.

I’ve already done some work on this, but I’m not confident that all the decisions I made along the way were justified and the results I got were valid. So, I’d like to discuss a few things with the community here before reopening the issue on Codeberg. I’d be glad to hear any opinions or suggestions about the issue in general or about the results I present below.

First of all, the results I currently have:

[chart: decompression time vs. compression level for the C (libzstd) and Zig std implementations on each dataset]

Compression levels go from -7 to 19 because the current Zig Zstd implementation does not support the ultra compression (20-22) levels, as far as I understand.

As you can see, Zig performs significantly (~10 times) worse on every tested dataset, and I can’t tell whether this is expected. Also, the Zig curve looks strange, having a huge spike at compression level 1 and rapidly decreasing at higher compression levels.

  1. The datasets I used are the Silesia corpus, enwik9, and the Linux kernel v6.19-rc7 sources. The first is used in lzbench, which is referenced on the Zstd implementation GitHub page, while the other two just seemed like reasonable choices, although I don’t have a strong justification for them.
  2. The issue proposes either adding a benchmark utility (which I assume would look pretty much like lib/std/hash/benchmark.zig) or adding a benchmark to gotta-go-fast. Gotta-go-fast hasn’t even been ported to Codeberg yet and hasn’t received any updates in three years, so I’m leaning towards implementing a standalone benchmark utility similar to the std.hash one.
  3. I’m currently building the Zstd master branch with the standard make options (which default to -O3), then linking the produced library statically into the Zig executable and building it with -Doptimize=ReleaseFast. All benchmarking is performed in Zig using the following code snippets (czstd.decompress is a wrapper around ZSTD_decompress):
// Declarations assumed by both snippets below; `czstd` is the author's
// wrapper module around the libzstd C API (not shown).
const std = @import("std");
const io = std.io;
const time = std.time;
const Timer = std.time.Timer;
const Decompress = std.compress.zstd.Decompress;

/// Decompress Zstd-compressed bytes into `out` n times with libzstd.a and return the total elapsed time in seconds
fn benchC(compressed: []const u8, out: []u8, n: usize) !f64 {
    var timer = try Timer.start();

    for (0..n) |_| {
        _ = try czstd.decompress(out, out.len, compressed, compressed.len);
    }

    const elapsed_ns = timer.read();
    return @as(f64, @floatFromInt(elapsed_ns)) / time.ns_per_s;
}

/// Decompress ZSTD compressed bytes to out n times with zstd.zig and return total time elapsed in seconds
fn benchZig(compressed: []const u8, out: []u8, n: usize) !f64 {
    var timer = try Timer.start();

    for (0..n) |_| {
        var writer: io.Writer = .fixed(out);
        var comp_reader: io.Reader = .fixed(compressed);
        var zstd_stream: Decompress = .init(&comp_reader, &.{}, .{});

        _ = try zstd_stream.reader.streamRemaining(&writer);
    }

    const elapsed_ns = timer.read();
    return @as(f64, @floatFromInt(elapsed_ns)) / time.ns_per_s;
}

There are, of course, a lot of other options for building things for benchmarking. For example, I could build a C binary and a Zig binary and benchmark them with external tools, or benchmark the C implementation in C, and the Zig implementation in Zig. I’m not sure whether it matters.
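For reference, the static-linking setup described above can be sketched in a build.zig along these lines. This is a minimal sketch, not the actual build script: the paths (src/main.zig, zstd/lib) and the executable name are illustrative, and it assumes libzstd.a was already produced by make inside the zstd submodule:

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    // -Doptimize=ReleaseFast is consumed here:
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "bench",
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/main.zig"),
            .target = target,
            .optimize = optimize,
        }),
    });
    // Header path for the `czstd` wrapper and the prebuilt static library:
    exe.addIncludePath(b.path("zstd/lib"));
    exe.addObjectFile(b.path("zstd/lib/libzstd.a"));
    exe.linkLibC();
    b.installArtifact(exe);
}
```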

  4. As you can see, I include the io.Writer and io.Reader initialization inside the timed section for the Zig implementation. I’m not sure whether the compiler optimizes this away entirely, but the overhead should be very minor either way.

  5. I run each test case 10 times and verify correctness only once, at the end.
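The correctness check at the end of the runs can be sketched like this (a hypothetical helper, not the actual code from the repo; `raw` is assumed to hold the original uncompressed bytes and `out` the decompressed output):

```zig
const std = @import("std");

/// Compare the decompressed output against the original input byte-for-byte.
/// Run once after the timed loops so the comparison doesn't affect the timings.
fn verifyOutput(raw: []const u8, out: []const u8) error{OutputMismatch}!void {
    if (!std.mem.eql(u8, raw, out)) return error.OutputMismatch;
}
```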

Benchmarking host specs:

  • Gentoo Linux, x86_64, AMD Ryzen 5 5600X
  • Zig 0.15.2 binary
  • Zstandard v1.5.7

Git repository with the current state of the work. It contains a short README on how to reproduce the results.


Do you get sane results if you run each test case only one time (that is, 1/10 of the 10-run time)?

Just to be sure, are you going through build.zig for this? If you aren’t (i.e. you’re using zig build-exe directly), then you aren’t actually compiling the Zig code in ReleaseFast mode, as -D means something else entirely in those commands (-Doptimize=ReleaseFast defines the C macro optimize to have the value ReleaseFast).
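To illustrate the distinction (commands are a sketch; the file path is illustrative):

```shell
# With a build.zig, -Doptimize is a user option consumed by the build script:
zig build -Doptimize=ReleaseFast

# With zig build-exe, the optimization flag is -O;
# -Doptimize=ReleaseFast would instead define a C macro named `optimize`:
zig build-exe src/main.zig -O ReleaseFast
```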

Relatedly, it would be helpful to post the full source code of your benchmarks so others can reproduce your results.

Yeah, I’m using the build.zig file. I’m sure the project is built with ReleaseFast because sometimes I accidentally start benchmarking without -Doptimize set, and the difference is immediately noticeable (the C implementation runs as usual, while the Zig implementation is very slow, as expected).
I don’t want to post the full source code yet because it’s still WIP and very messy. I’m going to make the Git repo public a bit later. If you think it is important to do it right now, I can, of course.



More or less, yes. (Don’t mind the last point — I interrupted the run)

Are you using the native code generation, or the LLVM backend?

Perhaps a pointer to the code and build (at least for the tests) would help diagnose.

You can get more insight using a statistical profiler to run your benchmark for n=10 level=1. During the ~2 minutes of run time the profiler can capture stack traces, then you can visualize the results using a cpu flame graph.
See: Benchmarking

I’m using the native backend, which is the default for x86_64 in Zig 0.15. I’ve just tried setting .use_llvm = true and got pretty much the same results.

I’ve decided to make the Git repository public. I’m adding the link to the original post now.


If you’re including io in the parts you’re profiling, then you’re profiling io, not the compression algorithm. If io is included in one and not the other then your benchmark isn’t a fair one. Actual IO is always orders of magnitude slower than memory operations.

It’s a fixed reader and fixed writer. No syscalls are being called.


Here are the results I got (I used the zstd built from the zstd submodule to generate the compressed files):

Command used to generate the files:

./zstd/zstd data/raw/silesia.tar -1 -o data/compressed/1/silesia.tar.zst

etc

A lot of room for improvement for sure.

EDIT: Didn’t realize --fast=1 corresponds to compression level -1, --fast=2 is -2, etc. Updated results with negative compression levels included.
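Concretely, the negative-level files can be generated like this (paths mirror the example above and are illustrative):

```shell
# --fast=N selects negative compression level -N:
./zstd/zstd --fast=1 data/raw/silesia.tar -o data/compressed/-1/silesia.tar.zst
./zstd/zstd --fast=2 data/raw/silesia.tar -o data/compressed/-2/silesia.tar.zst
```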


In case it’s helpful for profiling, I’ve added a version of the benchmark program that avoids all filesystem IO and just does a single run of 1 implementation:

https://github.com/squeek502/zig-zstd-benchmark/commit/d57906c6e8f6078a16d4918bb1ab25ee2e10b04e

That way you can do stuff like this (using poop):

$ poop './zig-out/bin/zstd_single c' ./zig-out/bin/zstd_single
Benchmark 1 (17 runs): ./zig-out/bin/zstd_single c
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           302ms ± 4.10ms     296ms …  309ms          0 ( 0%)        0%
  peak_rss            287MB ± 81.0KB     287MB …  287MB          0 ( 0%)        0%
  cpu_cycles          744M  ± 4.43M      739M  …  755M           1 ( 6%)        0%
  instructions       2.20G  ± 29.5      2.20G  … 2.20G           0 ( 0%)        0%
  cache_references   32.4M  ±  436K     31.9M  … 33.4M           0 ( 0%)        0%
  cache_misses       1.11M  ± 39.6K     1.06M  … 1.22M           1 ( 6%)        0%
  branch_misses      4.80M  ± 13.8K     4.78M  … 4.83M           0 ( 0%)        0%
Benchmark 2 (3 runs): ./zig-out/bin/zstd_single
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          4.55s  ± 22.8ms    4.53s  … 4.57s           0 ( 0%)        💩+1403.6% ±  3.7%
  peak_rss            287MB ± 75.7KB     287MB …  287MB          0 ( 0%)          +  0.0% ±  0.0%
  cpu_cycles         18.5G  ±  109M     18.4G  … 18.6G           0 ( 0%)        💩+2385.7% ±  6.5%
  instructions       49.5G  ± 3.14K     49.5G  … 49.5G           0 ( 0%)        💩+2153.3% ±  0.0%
  cache_references   33.1M  ±  604K     32.4M  … 33.5M           0 ( 0%)          +  2.2% ±  1.9%
  cache_misses       1.52M  ±  114K     1.42M  … 1.65M           0 ( 0%)        💩+ 37.0% ±  6.3%
  branch_misses       134M  ±  160K      133M  …  134M           0 ( 0%)        💩+2678.8% ±  1.5%

or use it with stuff like hotspot, callgrind/kcachegrind, etc

Here’s the hotspot summary I got for what it’s worth (using the compression level 1 silesia.tar.zst):

(data recorded via perf record --call-graph dwarf -- ./zig-out/bin/zstd_single)
