Is 0.15 new io slower than 0.14?

After migrating my (only) zig project from 0.14 to 0.15. I find that the program execution speed becomes slower. For a test case, in Debug mode, the program execution duration becomes about 0.3s (from v0.14’s 0.2s), and in ReleaseFast mode, it becomes about 0.032s (from v0.14’s 0.028s).

Expected or unexpected?

The debug mode time likely comes from the fact that Zig switched from the llvm backend to its self-hosted x86 backend which produces slightly worse instructions.

The release time difference appears small enough that it could be a measuring error. Unless you can narrow this down to a specific IO operation then it is very bold to say that the IO interface got slower.

4 Likes

I’ll also add the obligatory note that benchmarking is hard. At the very least it’d be helpful to use a tool like poop or hyperfine to determine if the measured performance difference is meaningful, but there’s countless other ways that benchmarking can go wrong.

I’ll never miss a chance to recommend this talk:

8 Likes

Are you on x86? This might be a side effect of the new non-LLVM backend which is used in debug mode.

Doesn’t explain the slowdown in release mode though, but 4 millisecons for IO operations might also be noise.

The other thing that comes to mind is maybe buffering, e.g. accidentially doing unbuffered IO somewhere where the old IO system implicitly used buffered IO where the new IO system doesn’t.

1 Like

It looks the poop tool doesn’t build with any minor toolchain versions.

My machine is Linux/x86.

I made the test case data set larger and still found that the 0.15 version is about 10% slower than the 0.14 version (ReleaseFast, 0.22s vs 0.20s). I use the time command to benchmark program execution time.

The code differences between 0.14 and 0.15 versions are all about io. There are three parts related to io:

  • read files from disk.
  • write files to disk.
  • write buffered files (0.14’s FixedBufferStream vs 0.15’s fixed writer).

I benched the three parts solely in some small zig programs, and find that 0.15’s fixed writer is a little faster than 0.14’s FixedBufferStream but don’t find obvious performance differences for the read/write-files parts. So I think the cause should be io unrelated.

I have no directions to find the cause now.

BTW: I actually don’t care much about the performance difference. The program is already much faster than my expectation. It can handle about 1000 .tmd files (Markdown alike) in 0.2 second!

You can get it from github releases

Or build it form fork which upgraded to 0.15.1:

1 Like

poop has pre-compiled releases.

It would be helpful if you could provide the code you’re testing. Running the two versions with strace -c may also provide some clues about possible differences.

2 Likes

Thanks. The repo is just updated. I built/run it and get the following results.
I’m not an expert on the numbers. So just FYI. My machine is Linux/amd64.
CPU is old (Intel(R) Core™ i5-4200M CPU @ 2.50GHz).

Benchmark 1 (25 runs): tmd-14-fast build .
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           206ms ± 8.16ms     199ms …  239ms          2 ( 8%)        0%
  peak_rss           5.88MB ±  988KB    5.37MB … 9.15MB          3 (12%)        0%
  cpu_cycles          426M  ± 6.72M      420M  …  450M           2 ( 8%)        0%
  instructions        919M  ± 5.00K      919M  …  919M           0 ( 0%)        0%
  cache_references   1.97M  ± 38.3K     1.90M  … 2.04M           7 (28%)        0%
  cache_misses       62.8K  ± 14.0K     43.9K  … 93.9K           0 ( 0%)        0%
  branch_misses      3.61M  ± 27.5K     3.58M  … 3.68M           1 ( 4%)        0%
Benchmark 2 (21 runs): tmd-15-fast build .
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           240ms ± 4.21ms     233ms …  250ms          3 (14%)        💩+ 16.5% ±  1.9%
  peak_rss           6.01MB ± 1.10MB    5.37MB … 9.25MB          3 (14%)          +  2.2% ± 10.6%
  cpu_cycles          540M  ± 7.45M      528M  …  557M           2 (10%)        💩+ 26.9% ±  1.0%
  instructions        927M  ± 3.67K      927M  …  928M           0 ( 0%)          +  0.9% ±  0.0%
  cache_references   2.03M  ± 59.8K     1.95M  … 2.17M           2 (10%)        💩+  3.4% ±  1.5%
  cache_misses       65.7K  ± 14.7K     47.9K  …  103K           0 ( 0%)          +  4.6% ± 13.7%
  branch_misses      3.56M  ± 24.3K     3.53M  … 3.64M           1 ( 5%)        ⚡-  1.5% ±  0.4%

If this is the code you’re testing then note that you are writing the file contents twice here:

2 Likes

:flushed_face: such a silly mistake. Thanks for the finding!

Though the performance difference doesn’t change much after removing this line.

Post the results of running

strace -c tmd-14-fast build .

and

strace -c tmd-15-fast build .
strace -c tmd-14-fast build .
Done. Output directory: .../tapirmd/tmd/@debug/demo/@tmd-build/demo
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 37.46    0.020473          24       833         1 unlinkat
 18.02    0.009846           5      1732         1 openat
 11.98    0.006549           8       777           write
  7.04    0.003848           4       902           munmap
  6.38    0.003485           4       794       736 mkdirat
  5.27    0.002878           2      1232           read
  5.14    0.002809           1      1732           close
  4.47    0.002442           1      1615           statx
  2.08    0.001135           8       135           getdents64
  2.03    0.001107           2       461           mmap
  0.09    0.000049           2        19           mremap
  0.05    0.000029           4         6           readlink
  0.00    0.000000           0         1           open
  0.00    0.000000           0         1           rt_sigaction
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         2           prlimit64
------ ----------- ----------- --------- --------- ----------------


strace -c tmd-15-fast build .
Done. Output directory: .../tapirmd/tmd/@debug/demo/@tmd-build/demo
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 35.33    0.010238          12       833         1 unlinkat
 21.74    0.006298           3      1732         1 openat
 13.49    0.003909           5       774           pwritev
  8.43    0.002443           3       794       736 mkdirat
  6.15    0.001781           1      1732           close
  5.34    0.001546           0      1615           statx
  4.41    0.001277           1       886           preadv
  2.32    0.000673           4       135           getdents64
  2.07    0.000600           1       312           munmap
  0.62    0.000180           1       166           mmap
  0.09    0.000025           4         6           readlink
  0.02    0.000005           2         2           writev
  0.00    0.000000           0         4           read
  0.00    0.000000           0         1           open
  0.00    0.000000           0         1           rt_sigaction
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         2           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00    0.028975           3      8998       738 total

And poop results:

$ poop "tmd-14-fast build ." "tmd-15-fast build ."
Benchmark 1 (25 runs): tmd-14-fast build .
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           206ms ± 2.13ms     202ms …  211ms          1 ( 4%)        0%
  peak_rss           6.12MB ± 1.26MB    5.37MB … 9.17MB          5 (20%)        0%
  cpu_cycles          426M  ± 3.74M      421M  …  436M           0 ( 0%)        0%
  instructions        919M  ± 3.97K      919M  …  919M           4 (16%)        0%
  cache_references   1.98M  ± 28.1K     1.93M  … 2.06M           1 ( 4%)        0%
  cache_misses       68.8K  ± 7.47K     57.4K  … 93.2K           1 ( 4%)        0%
  branch_misses      3.62M  ± 19.5K     3.59M  … 3.67M           1 ( 4%)        0%
Benchmark 2 (21 runs): tmd-15-fast build .
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           241ms ± 5.86ms     237ms …  265ms          2 (10%)        💩+ 17.1% ±  1.2%
  peak_rss           6.29MB ± 1.24MB    5.39MB … 9.25MB          0 ( 0%)          +  2.6% ± 12.2%
  cpu_cycles          546M  ± 12.9M      537M  …  597M           1 ( 5%)        💩+ 28.0% ±  1.3%
  instructions        927M  ± 3.68K      927M  …  927M           0 ( 0%)          +  0.9% ±  0.0%
  cache_references   2.06M  ± 51.1K     2.01M  … 2.25M           1 ( 5%)        💩+  4.1% ±  1.2%
  cache_misses       81.0K  ± 19.0K     67.4K  …  158K           1 ( 5%)        💩+ 17.7% ± 12.1%
  branch_misses      3.58M  ± 50.5K     3.54M  … 3.76M           3 (14%)          -  1.1% ±  0.6%

$ poop "tmd-15-fast build ." "tmd-14-fast build ."
Benchmark 1 (21 runs): tmd-15-fast build .
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           239ms ± 3.06ms     235ms …  244ms          0 ( 0%)        0%
  peak_rss           5.88MB ±  962KB    5.32MB … 8.23MB          3 (14%)        0%
  cpu_cycles          541M  ± 5.52M      531M  …  551M           0 ( 0%)        0%
  instructions        927M  ± 3.61K      927M  …  927M           0 ( 0%)        0%
  cache_references   2.05M  ± 35.5K     2.00M  … 2.14M           1 ( 5%)        0%
  cache_misses       78.9K  ± 9.94K     58.4K  … 99.2K           0 ( 0%)        0%
  branch_misses      3.56M  ± 23.7K     3.53M  … 3.61M           0 ( 0%)        0%
Benchmark 2 (25 runs): tmd-14-fast build .
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           207ms ± 7.25ms     202ms …  241ms          1 ( 4%)        ⚡- 13.4% ±  1.4%
  peak_rss           5.92MB ± 1.11MB    5.27MB … 9.14MB          4 (16%)          +  0.7% ± 10.6%
  cpu_cycles          429M  ± 15.8M      420M  …  503M           1 ( 4%)        ⚡- 20.7% ±  1.4%
  instructions        919M  ± 4.19K      919M  …  919M           6 (24%)          -  0.9% ±  0.0%
  cache_references   1.98M  ± 62.4K     1.93M  … 2.25M           1 ( 4%)        ⚡-  3.4% ±  1.5%
  cache_misses       73.9K  ± 22.1K     61.6K  …  176K           2 ( 8%)          -  6.4% ± 13.4%
  branch_misses      3.63M  ± 54.0K     3.58M  … 3.85M           1 ( 4%)        💩+  1.9% ±  0.7%


Weird, I’m not sure how to square the minor difference in instruction count with the large difference in cpu cycles. Also strange that the strace of tmd-15-fast looks strictly better than tmd-14-fast (always <= the number of syscalls when compared syscall-to-syscall (e.g. mmap count vs mmap count), and more than 1,000 fewer syscalls overall).

Might be time to bust out a profiler if you want to investigate further.

3 Likes

This shows that tmd-15-fast is twice as fast as tmd-14-fast for all system calls. It is definitely not an I/O problem.

Are you sure that you use llvm as backend for both?

I don’t understand this. All are default.

They are build from the develop (for tmd-15-fast) and zig-0.14 (for tmd-14-fast) branches with the zig build -Doptimize=ReleaseFast command. Test data is in pages folder here: tmd/documentation at develop · tapirmd/tmd · GitHub. I just copied those .tmd files about sveral hundreds of times to make program execution to 0.2s.

Try using a different allocator for release builds.
See:

EDIT:
One idea is to try const gpaAllocator = std.heap.page_allocator; to see if the allocator is indeed the problem.

2 Likes

page_allocator doesn’t matter much.

Thanks for the info.

I’ve decided to stop dwelling on this matter and will focus on the following development of the project.

3 Likes