Does using the Unmanaged variant of ArrayList have a negative performance impact?

Like quite a few here, I’m learning Zig by building the bytecode VM interpreter from Crafting Interpreters.

I’ve recently read that the Unmanaged variant of ArrayList and the like will be the default in the next Zig version, so I went and replaced all of my ArrayLists with ArrayListUnmanaged. Here’s the diff: main…ArrayListUnmanaged. Sorry, I couldn’t paste the code change here because it’s quite long (though it’s quite a simple change). The diff also shows how my ArrayLists are used in their containers, for more context.
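Roughly, the shape of the change is something like this (a simplified sketch with placeholder names, not the actual code):

const std = @import("std");

const Value = f64; // stand-in for the interpreter's real Value type

// Before: each list carries its own copy of the allocator.
const VmBefore = struct {
    stack: std.ArrayList(Value),
    constants: std.ArrayList(Value),

    fn init(allocator: std.mem.Allocator) VmBefore {
        return .{
            .stack = std.ArrayList(Value).init(allocator),
            .constants = std.ArrayList(Value).init(allocator),
        };
    }

    fn deinit(self: *VmBefore) void {
        self.stack.deinit();
        self.constants.deinit();
    }
};

// After: the VM keeps one allocator and hands it to every list call.
const VmAfter = struct {
    allocator: std.mem.Allocator,
    stack: std.ArrayListUnmanaged(Value) = .{},
    constants: std.ArrayListUnmanaged(Value) = .{},

    fn init(allocator: std.mem.Allocator) VmAfter {
        return .{ .allocator = allocator };
    }

    fn deinit(self: *VmAfter) void {
        self.stack.deinit(self.allocator);
        self.constants.deinit(self.allocator);
    }
};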

To my surprise, when I benchmarked the Unmanaged binary, there were a few places where it ran slower than before. Most notably:

You can see the actual scripts being interpreted for the benchmarking here in the benchmark dir.

I know this is not a comprehensive benchmark, but why would this be the case? What was the tradeoff? For comparison, NaN-boxing the Value struct yielded about a 2% increase in performance across all benchmark tests.

Sorry if this is too specific and a lot of context. Thanks in advance!

These numbers are suspiciously small. Differences on that scale can happen accidentally due to other factors such as how the assembly code is aligned in memory. I highly recommend watching this talk on the topic:


Thanks for the video!

Could you clarify “these numbers”? Did you mean the time taken for each test? Or the % of change?

The % change. I’d be somewhat suspicious of less than 10% changes.

Generally, in such a situation it is useful to try to narrow down what you are measuring to the relevant part instead of blindly measuring the entire application; this can help increase the % difference you measure to a more significant number.

It would also be useful to know the standard deviation: if you are very consistently 5% slower with a tight stddev, then while small, it’s statistically significant and worth investigating. Also check out Andrew’s Performance Optimizer Observation Platform; it will give you more insight into different aspects of performance, like cache misses:


Well, the problem is that these could be caused by systematic errors, and statistics cannot help you with systematic errors.
I do agree, though, that it is definitely worth investigating, and Poop might be helpful for figuring out what’s going on. But I think these “profile the whole program” tools are not the right tool for this, since they are more prone to systematic errors: they naturally measure more unrelated code parts, which could all be affected by code alignment changes.

The vast majority of the benchmarks had the exact same result. These two results had a significantly larger standard deviation than the rest. It looks like these two are flukes.
Also, I highly support looking at the generated assembly before benchmarking. A lot of the time the assembly is the same, so there’s no point in benchmarking.


Sadly I’m not on Linux and cannot use this tool :frowning:

Are you able to look at the diff linked above? I did include hyperfine benchmark output, which reports the stddev, and it seems quite small, but I’m not sure. If you’re talking about something else, I’m all ears!

Sorry, I’m super new. Would you be able to elaborate on what systematic errors are, and what the unrelated code parts could be in this case?

Here’s some more context: the code change I made was just to take out the allocator stored in each of the ArrayLists on the VM struct and pass the allocator from the VM whenever a list needs to allocate. One of the lists is the stack, and in this implementation there is quite a bit of stack manipulation in the main bytecode execution loop. Because I’m touching the main hot path, I thought profiling the whole interpreter was quite relevant.
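To illustrate (again a simplified sketch with a placeholder Value type, not my exact code), every push in the hot loop now passes the VM’s allocator explicitly:

const std = @import("std");

const Value = f64; // placeholder

const VM = struct {
    allocator: std.mem.Allocator,
    stack: std.ArrayListUnmanaged(Value) = .{},

    // Every append in the bytecode dispatch loop now passes the allocator
    // explicitly, where the managed list used the one it had stored.
    fn push(self: *VM, value: Value) !void {
        try self.stack.append(self.allocator, value);
    }
};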

Would you be able to suggest how to profile the main bytecode execution loop? Thanks!

Maybe the difference is in the argument passing? If you now pass the allocator explicitly, maybe you use more than 4 arguments, and that leads to more stack manipulation?

From your benchmark results I think only these two are (mildly) interesting:

❯ just benchmark-compare ./zlox-alm ./zlox-alu
Benchmarking: binary_trees
Benchmark 1: old
  Time (mean ± σ):      2.151 s ±  0.007 s    [User: 2.028 s, System: 0.117 s]
  Range (min … max):    2.141 s …  2.168 s    10 runs

Benchmark 2: new
  Time (mean ± σ):      2.346 s ±  0.013 s    [User: 2.226 s, System: 0.113 s]
  Range (min … max):    2.330 s …  2.377 s    10 runs

Summary
  old ran
    1.09 ± 0.01 times faster than new

Benchmarking: trees
Benchmark 1: old
  Time (mean ± σ):      3.867 s ±  0.036 s    [User: 3.850 s, System: 0.014 s]
  Range (min … max):    3.841 s …  3.930 s    10 runs

Benchmark 2: new
  Time (mean ± σ):      4.021 s ±  0.014 s    [User: 4.007 s, System: 0.012 s]
  Range (min … max):    4.008 s …  4.054 s    10 runs

Summary
  old ran
    1.04 ± 0.01 times faster than new

For all the other ones, I suspect they just boil down to noise / no significant change. And even for these two the change isn’t big; maybe you could experiment with making the trees bigger/more complex.

Overall I would say that these benchmark results wouldn’t stop me from just going forward with ArrayListUnmanaged, especially if this code was written without any in-depth thought about optimization/performance.

If you haven’t really tried to improve the performance so far, but have just now started to compare it and are interested in improving it, then my suggestion would be to use a profiler to get some insight into where the program spends the most time. Or, even better, write down some predictions about where you think it will spend the most time and then see whether your expectations match what the profiler shows.


Systematic errors can come from, e.g., how the code is laid out inside the binary. If you change one function, this can change how the code of completely different functions is aligned in memory. And the hardware that fetches and decodes instructions is quite sensitive to alignment. I’m also not quite sure about the details here, and the answer to why this is the case is likely vendor- and model-specific.

Either way, the point is that this can make your binary, as a whole, slower. However, this is not a direct consequence of the change you made, and it doesn’t tell you whether your change is actually better in the long run. It is likely that this will return to a neutral state a few versions down the line, as other unrelated changes accidentally align your functions favorably again.

Everything that is not directly influenced by the change. I’m not familiar with your code in detail, but I assume there are sections (maybe a for loop somewhere) where you frequently append to your lists and sections where you do other stuff. In order to reduce the impact of errors, you should try to measure only the sections that actually use the array lists’ append operations, since those are the only things that really changed.
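For example, you could time just a loop that hammers the append path with std.time.Timer instead of timing the whole interpreter run. Here is a rough sketch (the element type and iteration count are made up, not taken from your code):

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var list: std.ArrayListUnmanaged(u64) = .{};
    defer list.deinit(allocator);

    // Time only the part that exercises the append path, not the whole program.
    var timer = try std.time.Timer.start();
    var i: u64 = 0;
    while (i < 10_000_000) : (i += 1) {
        try list.append(allocator, i);
    }
    const elapsed_ns = timer.read();

    std.debug.print("append loop took {d} ms\n", .{elapsed_ns / std.time.ns_per_ms});
}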