After looking into this a bit, I think you’re on the right track.
Two important pieces of information:
MultiArrayList
will never resize in place (that is, it always calls alloc
, never resize
/remap
)
- Right now, it allocates one big chunk of bytes and then uses parts of that chunk for each of the per-field arrays, e.g.
AAABBBCCC...
. While this makes resizing-in-place less useful, in theory, resizing-in-place would still be possible where you could keep the A
elements in-place and move the B
/C
/etc elements. Would need to do some testing to understand if that would be worth it.
- The
source.len / 8
no longer seems to be empirically true. From my testing, the average denominator value among all files in lib/
and src/
is 5.39. In just lib/std
it’s 5.57.
Together, these pieces of information mean that getting the denominator right is important, and that the current denominator is probably not optimal. The simplest fix would just be to update the denominator, but it’s also worth noting that the variance involved is actually quite large: for example, compiler_rt/udivmodti4_test.zig, has a ton of really large integer literals, skewing the bytes-to-tokens ratio and therefore leading to a lot of initial overallocation when the denominator is smaller.
So, ultimately there’s a tradeoff between overallocation on outliers vs reducing allocations in the average case. If there’s a simple heuristic that could get a better estimate, that might be worth doing (I’m also doubtful that a full count-the-real-tokens pass would be worth it, but maybe a very simple ‘count the spaces’ pass or something like that?).
To give more context, here’s some stats from a janky test program that shows some of the effects of different denominator values:
With the current denominator of 8:
> token_allocs.exe 8 ..\zig\lib ..\zig\src
225 out of 1148 files tokenized in one alloc
102.69MiB allocated (61.93MiB with frees subtracted) in 2244 allocs
total capacity overallocation from files tokenized in one alloc: 4.85MiB, average ± σ: 22.08KiB ± 227.90KiB, median: 1.45KiB
average ± σ ratio of bytes-to-tokens: 5.39 ± 1.98, min: 2.48, max: 28.08
With a denominator of 5:
> token_allocs.exe 5 ..\zig\lib ..\zig\src
919 out of 1148 files tokenized in one alloc
84.94MiB allocated (71.31MiB with frees subtracted) in 1379 allocs
total capacity overallocation from files tokenized in one alloc: 19.48MiB, average ± σ: 21.71KiB ± 247.67KiB, median: 2.54KiB
average ± σ ratio of bytes-to-tokens: 5.39 ± 1.98, min: 2.48, max: 28.08
With a denominator of 4:
> token_allocs.exe 4 ..\zig\lib ..\zig\src
1114 out of 1148 files tokenized in one alloc
85.75MiB allocated (85.47MiB with frees subtracted) in 1182 allocs
total capacity overallocation from files tokenized in one alloc: 38.46MiB, average ± σ: 35.36KiB ± 421.24KiB, median: 4.00KiB
average ± σ ratio of bytes-to-tokens: 5.39 ± 1.98, min: 2.48, max: 28.08
I’ll make an issue/PR once I’ve done some real benchmarking.