Make the Zig tokenizer faster using only one ensureTotalCapacity/malloc

Inspired by a Stack Overflow question about realloc:

should I spend one pass in identifying the number of records & do malloc only once

I don’t mean that Zig should allocate one token per source byte, since that would be a waste of space. Zig allocates capacity for source.len / 8 tokens in lib/std/zig/Ast.zig and then reallocates only when more space is needed, checking on every append whether the token count has exceeded the allocated capacity. I feel like realloc should not be used at all, since the allocator may need to move the token data, like C’s realloc, and make syscalls.
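
For reference, the relevant part of Ast.parse looks roughly like this (a simplified sketch of lib/std/zig/Ast.zig, not the verbatim source; the helper name tokenize is mine):

```zig
const std = @import("std");

// Simplified sketch of the current strategy: pre-allocate from a
// bytes-per-token estimate, then append in a loop, growing the list
// whenever the estimate turns out to be too low.
fn tokenize(gpa: std.mem.Allocator, source: [:0]const u8) !std.zig.Ast.TokenList {
    var tokens = std.zig.Ast.TokenList{};
    errdefer tokens.deinit(gpa);

    // The estimate discussed above: ~8 source bytes per token.
    try tokens.ensureTotalCapacity(gpa, source.len / 8);

    var tokenizer = std.zig.Tokenizer.init(source);
    while (true) {
        const token = tokenizer.next();
        try tokens.append(gpa, .{ // may realloc once capacity runs out
            .tag = token.tag,
            .start = @as(u32, @intCast(token.loc.start)),
        });
        if (token.tag == .eof) break;
    }
    return tokens;
}
```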

Instead, the tokenizer should run over the file twice: once ONLY to count the tokens, and a second time to store the actual contents after allocating exactly that many tokens. The first pass doesn’t need to know token.tag, token.start, or whether an identifier is a keyword. The second pass calls tokenizer.next() in a loop like Ast.parse() does, but without the gpa in try tokens.append(gpa, (i.e., using appendAssumeCapacity).
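
A minimal sketch of that idea, reusing the full tokenizer for the counting pass (the proposal imagines an even cheaper counting-only pass; the helper name is hypothetical):

```zig
// Hypothetical two-pass variant: count first, then allocate exactly once
// and fill with appendAssumeCapacity, which never checks or grows capacity.
fn tokenizeTwoPass(gpa: std.mem.Allocator, source: [:0]const u8) !std.zig.Ast.TokenList {
    // Pass 1: only count tokens (including the trailing eof token).
    var counter = std.zig.Tokenizer.init(source);
    var token_count: usize = 0;
    while (true) {
        token_count += 1;
        if (counter.next().tag == .eof) break;
    }

    // Pass 2: a single allocation, then appends without capacity checks.
    var tokens = std.zig.Ast.TokenList{};
    errdefer tokens.deinit(gpa);
    try tokens.ensureTotalCapacity(gpa, token_count);

    var tokenizer = std.zig.Tokenizer.init(source);
    while (true) {
        const token = tokenizer.next();
        tokens.appendAssumeCapacity(.{
            .tag = token.tag,
            .start = @as(u32, @intCast(token.loc.start)),
        });
        if (token.tag == .eof) break;
    }
    return tokens;
}
```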

Note that the first pass will be faster than the second, since it neither computes tags nor writes anything to memory.

1 Like

I’d be curious to know if this would actually improve performance. Yes, it may not be allocating multiple times, nor would it be checking whether it needs to allocate for every token… But it does need to tokenize everything twice…

1 Like

If you look at the MultiArrayList implementation that is used for the token list, note that it says append is (amortized) O(1). It does not allocate just one more element each time; capacity grows by larger and larger steps.
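
A quick way to see the amortized behavior (a standalone sketch; the Token fields are just placeholders):

```zig
const std = @import("std");

// Counts how often append actually has to allocate: capacity grows
// superlinearly, so 100k appends trigger only a handful of allocations.
pub fn main() !void {
    const gpa = std.heap.page_allocator;
    const Token = struct { tag: u8, start: u32 };

    var list = std.MultiArrayList(Token){};
    defer list.deinit(gpa);

    var allocs: usize = 0;
    var i: u32 = 0;
    while (i < 100_000) : (i += 1) {
        if (list.len == list.capacity) allocs += 1; // growth happens here
        try list.append(gpa, .{ .tag = 0, .start = i });
    }
    std.debug.print("{d} appends -> {d} allocations\n", .{ i, allocs });
}
```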

I’d suggest just trying it yourself before speculating about it.

Generally I’m doubtful that this will be faster. Copying is quite a cheap operation, and I’d think that iterating through all bytes of the file a second time is going to be more expensive, especially given that there already is a good first estimate, so it likely won’t realloc for most of the files.
But maybe you could prove me wrong.

4 Likes

After looking into this a bit, I think you’re on the right track.

Two important pieces of information:

  • MultiArrayList will never resize in place (that is, it always calls alloc, never resize/remap)
    • Right now, it allocates one big chunk of bytes and then uses parts of that chunk for each of the per-field arrays, e.g. AAABBBCCC... (see the layout demo after this list). While this makes resizing in place less useful, in theory it would still be possible: you could keep the A elements in place and move only the B/C/etc. elements. Would need to do some testing to understand if that would be worth it.
  • The source.len / 8 no longer seems to be empirically true. From my testing, the average denominator value among all files in lib/ and src/ is 5.39. In just lib/std it’s 5.57.
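
Here’s a small demo of that layout (my own sketch; note that internally the field arrays are ordered by alignment, so the u32 start array comes before the u8 tag array):

```zig
const std = @import("std");

// Shows that both field arrays live inside one allocation, back to back:
// the tag array begins exactly capacity * 4 bytes after the start array.
pub fn main() !void {
    const gpa = std.heap.page_allocator;
    const Token = struct { tag: u8, start: u32 };

    var list = std.MultiArrayList(Token){};
    defer list.deinit(gpa);
    try list.ensureTotalCapacity(gpa, 16);
    list.appendAssumeCapacity(.{ .tag = 1, .start = 10 });

    std.debug.print("starts at {*}, tags at {*} (capacity {d})\n", .{
        list.items(.start).ptr, list.items(.tag).ptr, list.capacity,
    });
}
```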

Together, these pieces of information mean that getting the denominator right is important, and that the current denominator is probably not optimal. The simplest fix would just be to update the denominator, but it’s also worth noting that the variance involved is actually quite large: for example, compiler_rt/udivmodti4_test.zig has a ton of really large integer literals, skewing the bytes-to-tokens ratio and therefore leading to a lot of initial overallocation when the denominator is smaller.

So, ultimately there’s a tradeoff between overallocation on outliers vs reducing allocations in the average case. If there’s a simple heuristic that could get a better estimate, that might be worth doing (I’m also doubtful that a full count-the-real-tokens pass would be worth it, but maybe a very simple ‘count the spaces’ pass or something like that?).
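
As a strawman for the ‘count the spaces’ idea, something like this (entirely hypothetical; the boundary-character set and any correction factor would need tuning against real data):

```zig
// Hypothetical cheap estimator: count likely token boundaries in one pass
// over the bytes, with none of the tokenizer's state machine or keyword
// lookups. Whether this beats a fixed source.len / N is untested.
fn estimateTokenCount(source: []const u8) usize {
    var estimate: usize = 1; // at least the eof token
    for (source) |c| {
        switch (c) {
            ' ', '\n', '\t', '(', ')', '{', '}', ';', ',' => estimate += 1,
            else => {},
        }
    }
    return estimate;
}
```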

To give more context, here’s some stats from a janky test program that shows some of the effects of different denominator values:

With the current denominator of 8:

> token_allocs.exe 8 ..\zig\lib ..\zig\src
225 out of 1148 files tokenized in one alloc
102.69MiB allocated (61.93MiB with frees subtracted) in 2244 allocs
total capacity overallocation from files tokenized in one alloc: 4.85MiB, average ± σ: 22.08KiB ± 227.90KiB, median: 1.45KiB
average ± σ ratio of bytes-to-tokens: 5.39 ± 1.98, min: 2.48, max: 28.08

With a denominator of 5:

> token_allocs.exe 5 ..\zig\lib ..\zig\src
919 out of 1148 files tokenized in one alloc
84.94MiB allocated (71.31MiB with frees subtracted) in 1379 allocs
total capacity overallocation from files tokenized in one alloc: 19.48MiB, average ± σ: 21.71KiB ± 247.67KiB, median: 2.54KiB
average ± σ ratio of bytes-to-tokens: 5.39 ± 1.98, min: 2.48, max: 28.08

With a denominator of 4:

> token_allocs.exe 4 ..\zig\lib ..\zig\src
1114 out of 1148 files tokenized in one alloc
85.75MiB allocated (85.47MiB with frees subtracted) in 1182 allocs
total capacity overallocation from files tokenized in one alloc: 38.46MiB, average ± σ: 35.36KiB ± 421.24KiB, median: 4.00KiB
average ± σ ratio of bytes-to-tokens: 5.39 ± 1.98, min: 2.48, max: 28.08

I’ll make an issue/PR once I’ve done some real benchmarking.

1 Like

Another idea could be a cache file that tracks the token length, AST length, and ZIR/AIR IR length for every Zig file, plus the git commit at which they were last updated (this should work as long as no changes to Zig’s IR are made between the tracked commit and the current one). Zig would then feed those values into the IR stages so we get far fewer reallocs, assuming the tracked commit is the current one. For the tokenizer, Zig would need to scan the file only once, without checking for realloc every time, bringing massive speedups. If the tracked commit is not current, Zig can update the stored lengths.
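
A rough sketch of what one entry in such a cache file might hold (entirely hypothetical; all names are made up for illustration, and nothing like this exists in the compiler):

```zig
// Hypothetical per-file entry for the proposed lengths cache.
const LengthCacheEntry = struct {
    /// Commit the lengths were recorded at; per the caveat above, the entry
    /// is only trusted if this matches the current commit.
    commit_hash: [40]u8,
    token_count: u32,
    ast_node_count: u32,
    zir_inst_count: u32,
};
```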

I believe something like what you’re describing is planned for incremental compilation (minus the dependency on git).

Note also that you might be overestimating the cost of parsing.

It would be interesting if your test programs also contained timing. I mean, does it actually make a difference if we allocate and move bytes in memory once, twice, or thrice?

Timing the test program itself yields this (on Windows):

> poop "token_allocs.exe 8 ..\zig\lib ..\zig\src" "token_allocs.exe 5 ..\zig\lib ..\zig\src" "token_allocs.exe 4 ..\zig\lib ..\zig\src"
Benchmark 1 (16 runs): token_allocs.exe 8 ..\zig\lib ..\zig\src
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           324ms ± 2.68ms     320ms …  332ms          1 ( 6%)        0%
  peak_rss           74.7MB ± 1.65KB    74.7MB … 74.8MB          3 (19%)        0%
Benchmark 2 (16 runs): token_allocs.exe 5 ..\zig\lib ..\zig\src
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           313ms ± 2.45ms     309ms …  317ms          0 ( 0%)        ⚡-  3.4% ±  0.6%
  peak_rss           74.8MB ± 5.69KB    74.7MB … 74.8MB          1 ( 6%)          +  0.0% ±  0.0%
Benchmark 3 (17 runs): token_allocs.exe 4 ..\zig\lib ..\zig\src
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           310ms ± 6.90ms     306ms …  336ms          1 ( 6%)        ⚡-  4.3% ±  1.2%
  peak_rss           64.4MB ± 5.11KB    64.3MB … 64.4MB          2 (12%)        ⚡- 13.9% ±  0.0%

But I’m not sure how relevant those results are, as I was not trying to make a representative benchmark at all. A more relevant benchmark will be something like the compiler building itself.

ZIR has already been fully cached for a long time now - it’s a cache system independent of the other caching, since it’s per-file. When you get a ZIR cache hit, the source code, tokens, and AST are not even loaded into memory unless an error needs to be reported. That’s the “z” directory of zig-cache.

6 Likes

Thanks for the correction; my vague recollection of the IR/caching stuff failed me.

Additionally, I’ve not been able to find any meaningful difference in my benchmarking attempts when changing the denominator to be 4 or 5 instead of 8. In hindsight, that probably should have been obvious: saving less than 1 mmap syscall per file (on average) is not going to move the needle much, and any decrease in mmap syscalls also depends on the allocator implementation (with glibc, strace was showing almost the same number of mmap syscalls for each version).

My benchmarking setup

Building the compilers

With the denominator at its default value of 8:

zig3 build -p stage4-release-8 -Dno-lib -Doptimize=ReleaseFast

After changing the denominator to 4:

zig3 build -p stage4-release-4 -Dno-lib -Doptimize=ReleaseFast

Benchmark: zig fmt

$ strace -e %memory -c -- ./stage4-release-8/bin/zig fmt ../lib
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.89    0.007223           8       856           munmap
 13.83    0.001191           1       841           mmap
  2.28    0.000196           1       119           mremap
------ ----------- ----------- --------- --------- ----------------
100.00    0.008610           4      1816           total
$ strace -e %memory -c -- ./stage4-release-4/bin/zig fmt ../lib
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 81.54    0.007657           9       814           munmap
 15.17    0.001425           1       799           mmap
  3.29    0.000309           2       119           mremap
------ ----------- ----------- --------- --------- ----------------
100.00    0.009391           5      1732           total
Benchmark 1 (7 runs): ./stage4-release-8/bin/zig fmt ../lib
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           745ms ± 10.0ms     729ms …  760ms          0 ( 0%)        0%
  peak_rss           42.2MB ± 37.8KB    42.1MB … 42.2MB          0 ( 0%)        0%
  cpu_cycles         2.69G  ± 40.5M     2.63G  … 2.75G           0 ( 0%)        0%
  instructions       6.89G  ± 5.55K     6.89G  … 6.89G           0 ( 0%)        0%
  cache_references   39.1M  ± 1.28M     37.1M  … 40.7M           0 ( 0%)        0%
  cache_misses       2.78M  ±  141K     2.57M  … 2.92M           0 ( 0%)        0%
  branch_misses      13.0M  ± 41.2K     12.9M  … 13.1M           0 ( 0%)        0%
Benchmark 2 (7 runs): ./stage4-release-4/bin/zig fmt ../lib
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           741ms ± 10.4ms     727ms …  755ms          0 ( 0%)          -  0.6% ±  1.6%
  peak_rss           42.0MB ± 34.6KB    42.0MB … 42.1MB          0 ( 0%)          -  0.3% ±  0.1%
  cpu_cycles         2.68G  ± 42.2M     2.63G  … 2.74G           0 ( 0%)          -  0.3% ±  1.8%
  instructions       6.89G  ± 5.01K     6.89G  … 6.89G           0 ( 0%)          -  0.0% ±  0.0%
  cache_references   39.3M  ± 1.47M     37.2M  … 41.0M           0 ( 0%)          +  0.5% ±  4.1%
  cache_misses       2.73M  ±  163K     2.53M  … 2.97M           0 ( 0%)          -  1.6% ±  6.4%
  branch_misses      13.0M  ± 63.4K     12.9M  … 13.1M           0 ( 0%)          -  0.1% ±  0.5%

Removing the ensureTotalCapacity call entirely only leads to wall_time increasing by 💩+ 2.2% ± 0.5%

I also tried benchmarking the compiler building itself but the results are the same, which makes sense because that should see much less of a difference than zig fmt.

Basically, I was wrong about this:

It’s not actually that important: source.len / 8 gets you enough of the way there that the difference (in practice) is barely noticeable.