I think I made a mistake by trying to tackle benchmarking together with possible improvements in one topic.
After pondering it a bit, I think it is better to keep the docs focused on things that are more definitely known, and to pull my comments about things I am not yet sure about into this brainstorming topic.
Putting my vague ideas that still need to be tested and researched next to the more concrete ideas is confusing.
So I will try to pull the ideas apart into separate comments here.
I think the presentation of the results could maybe be done better than just some numbers, if we can produce measurements that can be described by normal distributions.
So I don’t necessarily disagree that these simple metrics seem typical, but maybe instead of accepting this as how it is, we should ask whether we should/could apply the techniques of Stabilizer[1]?
Then, if we have normal distributions, showing them with graphs in some way (maybe over many different runs of the benchmarks) could convey not only the measurements but also some information about how precise those measurements are, and maybe we could even calculate more with that.
But things like Stabilizer give you a normal distribution of the measurements (see the Sound Statistics footnote).
I think taking 40 samples and then calculating specific properties from them can be potentially bad. We should first look at the samples to see whether they are actually good, or whether we instead need to do something like Stabilizer does (randomize the address/memory layout) to get better measurements (asking 40 different people, instead of asking 1 person out of the 40 the same question 40 times).
I think if in the future the hot-reloading support could also get a feature to move the code around and randomize the layout while the program is running, that could bring improvements for taking measurements that aren’t dependent on layout and have a normal distribution, which statisticians seem to be able to do neat mathy things with that can’t be done with distributions that aren’t normal.
With that we would have a distribution that is nice to work with, instead of specific values calculated from samples (which may only reflect part of the original distribution).
Also, the reduced values may lead people to think the data is normally distributed when it isn’t: maybe it alternates between 2 values, follows some other distribution, has multiple peaks, or shows no discernible pattern at all. I guess what I don’t want is that some library calculates these reduced properties and nobody ever looks at the samples themselves to see whether something more is hidden in them.
It would help to have more histogram-like processing of the measurements, even if it is just to check whether there is some strange pattern, before reducing the data down to something that is used for automation.
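As a rough illustration of what I mean by looking at the samples first, here is a minimal sketch; the bucket count and the assumption that the samples are nanosecond timings in a vector are just choices I made for the example:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Print a crude text histogram of raw timing samples before reducing them
// to mean/std dev, so shapes like two peaks or alternation stay visible.
void print_histogram(const std::vector<double>& samples_ns, int buckets = 20) {
    if (samples_ns.empty()) return;
    auto [lo_it, hi_it] = std::minmax_element(samples_ns.begin(), samples_ns.end());
    const double lo = *lo_it;
    const double hi = *hi_it;
    double width = (hi - lo) / buckets;
    if (width == 0.0) width = 1.0;  // all samples identical

    std::vector<int> counts(buckets, 0);
    for (double s : samples_ns) {
        int b = std::min(buckets - 1, static_cast<int>((s - lo) / width));
        counts[b]++;
    }
    for (int b = 0; b < buckets; ++b) {
        std::printf("%10.1f ns | %s\n", lo + b * width,
                    std::string(counts[b], '#').c_str());
    }
}
```

Even something this crude makes a bimodal or alternating pattern obvious before anything gets reduced to a single number.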
I think the ideas in Bret Victor - Ladder of Abstraction are potentially relevant here, for showing more of what is going on in ways that are easy to understand. When measurements just get reduced in specific ways, without showing more detailed data, you potentially gain less insight.
I wonder if benchmarking wouldn’t be better as something you set up through a profiler, so that it becomes repeatable: essentially taking a dynamic screenshot, extending into the future, of a piece of code you had under the microscope. Like saving a view into a specific area, which then just gets monitored continuously.
One thing I didn’t add to the post is that benchmarking and profiling really seem to belong together, possibly as two sides of one coin or two ends of a spectrum.
I think in an ideal world you could easily switch from interacting with a program to find where something is slow, to looking at how it behaves there, and then leave a probe/marker there (kind of like a permanent breakpoint for performance monitoring); that would essentially create a benchmark.
While I think that the distinction makes sense to be able to communicate what you are focusing on, I think better tools could blur the lines a lot.
Also, a lot of benchmarking seems to reinvent profiler things, just in a program-first kind of way: instead of being able to create benchmarks directly, you need to write them as code. This reminds me a bit of the talk Bret Victor - The Future of Programming.
Wikipedia articles on math are almost always a non-starter. Editors love to be exact, and mathematics gives them a lot of room for being exact without being clear. The central limit theorem says that if you take enough samples, the distribution of the sample means will approach a normal distribution. In other words: take 10 samples 10 times (100 tests), take the mean of each group of 10 runs, and those means will give you an idea of the average and fall along a normal distribution. So even if your underlying process doesn’t follow a normal distribution, you can get one.
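To make that concrete, here is a small sketch of the batch-means idea. The exponential distribution is just an arbitrary stand-in for a skewed, non-normal underlying process:

```cpp
#include <cstdio>
#include <random>
#include <vector>

int main() {
    // Draw from a clearly non-normal distribution (exponential, mean 1.0).
    std::mt19937 rng(42);
    std::exponential_distribution<double> dist(1.0);

    // Take 10 samples 10 times and keep only the mean of each batch.
    // By the central limit theorem the batch means cluster roughly
    // normally around the true mean, even though the raw draws do not.
    std::vector<double> batch_means;
    for (int batch = 0; batch < 10; ++batch) {
        double sum = 0.0;
        for (int i = 0; i < 10; ++i) sum += dist(rng);
        batch_means.push_back(sum / 10.0);
    }
    for (double m : batch_means) std::printf("%.3f\n", m);
}
```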
I personally like seeing variance/standard deviation reported with the stats, because it gives an idea of how good the benchmark test actually was.
But the impression I got from the Stabilizer talk is that, without randomizing the address layout, the distribution isn’t random enough to actually be sure that the changes in code are the cause of the change in speed: https://youtu.be/r-TLSBdHe1A?t=540
Basically, things that are unrelated to what you are trying to measure can impact your measurements, making you think something is faster when it isn’t. According to the talk (anecdotally), even changes in environment variable names or folder names can lead to that. With address layout randomization those weird effects get averaged out. So maybe after every x samples we should randomize the layout and then continue.
What I get is: running your optimization test 100 times with the same layout isn’t really randomly sampling your program. It’s like shuffling a deck of cards and then drawing the top card, seeing what you get, putting it back on top and redrawing the same card. You will get a random card, but you won’t get 10 or 30 or n random cards. So by randomizing the layout each round you are able to get a better random set of data points to calculate mean/std dev etc. from.
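The harness shape I have in mind would be something like the sketch below. `randomize_layout()` and `run_benchmark_once_ns()` are made-up names: the first stands in for Stabilizer-style re-randomization (or simply re-executing the binary with a perturbed environment or link order), the second for one timed run of the code under test.

```cpp
#include <vector>

// Assumed hooks, not real APIs: stand-ins for layout re-randomization
// and for one timed run of the code under test.
void randomize_layout();
double run_benchmark_once_ns();

// Collect samples across many layouts, so summary statistics are computed
// over runs that span layouts instead of one accidental layout.
std::vector<double> collect_samples(int layouts, int samples_per_layout) {
    std::vector<double> all;
    for (int l = 0; l < layouts; ++l) {
        randomize_layout();  // reshuffle the deck before drawing again
        for (int s = 0; s < samples_per_layout; ++s) {
            all.push_back(run_benchmark_once_ns());
        }
    }
    return all;
}
```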
My comment about the Central Limit Theorem probably doesn’t apply in this case. It’s more useful when you have a random variable that isn’t normally distributed.
The way I understood it is that the random variable needs to be random enough (in relation to what you want to measure?) so that you can apply the central limit theorem, which lets you treat the result as a normal distribution and have that resulting distribution mean what you wanted to measure.
I think the main problem with not randomizing the address layout is that the layout then still biases the results: the sampling has already filtered out other external effects that are sufficiently random, but because the layout isn’t random, it still influences the calculated result.
So I would say it does apply (it is also mentioned in the talk); the resulting distribution just doesn’t measure what you want to measure.
It basically measures the performance of a specific layout and you don’t know which one.
Measuring a specific one can be interesting, but most of the time we care more about performance independent of layout.
(Unless you want to hardcode or find a specific better layout)
There’s no hard line between the activities, agreed. I’m not convinced the wiki post about benchmarking needs to begin by contrasting it to profiling, actually. I figured it was a better first move to clean up the contrast, rather than just delete a bunch of paragraphs.
What’s true is that, as a matter of current practice, profiling and benchmarking are two distinct activities: we use different software for them, and they generate categorically different sorts of data.
Then again, hooking a benchmark up to a profiler is pretty normal! If you have a regression, and you want to find the hotspot, a profiler is your friend.
One can draw a pretty clean line between profiling and benchmarking, because profiling is fundamentally spatial as well as temporal. It’s measuring where in the program space things happen, and how much time and other resources those regions use.
It’s tricky to contrast this with benchmarking, because it’s easy to say something about benchmarking which is too restrictive. A lot of things are benchmarking. Example: varying a parameter across a collection of benchmarks and graphing the scaling complexity is benchmarking. So it’s easier to separate profiling from benchmarking than the other way around.
Basically, if you’re measuring properties of a program line-by-line, that’s profiling. Other measurements are benchmarks.
You could fold profiling into the benchmarking umbrella, but I wouldn’t personally do that. Benchmarking and profiling have a different focus, and imply a different mindset and behavior on the part of the programmer.
Thank you for sharing this talk, it was fun to watch. So the solution for all slow code is: use short usernames!
Seriously now, I think what “Luke” does in the story, namely sampling the same, non-randomized setup repeatedly, can still tell you something: how good your measurement method is (reproducibility). It made me think about the post over here about the no-op benchmark. If you sample the same thing over and over again and get results all over the place (large standard deviation), your precision is simply bad. That then makes it harder to compare different sets of samples, even if you made sure that you’re not sampling distinct populations (comparing apples with pears).
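A minimal sketch of that reproducibility check, assuming we just time an empty body with std::chrono to see how noisy the measurement itself is:

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Time the same (empty) body many times: the spread tells you how
    // precise the measurement method itself is, before any comparison.
    std::vector<double> samples_ns;
    for (int i = 0; i < 1000; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        // no-op body under the microscope
        auto t1 = std::chrono::steady_clock::now();
        samples_ns.push_back(
            std::chrono::duration<double, std::nano>(t1 - t0).count());
    }

    double mean = 0.0;
    for (double s : samples_ns) mean += s;
    mean /= samples_ns.size();

    double var = 0.0;
    for (double s : samples_ns) var += (s - mean) * (s - mean);
    var /= samples_ns.size();

    std::printf("mean = %.1f ns, std dev = %.1f ns\n", mean, std::sqrt(var));
}
```

A large standard deviation relative to the mean here means any comparison between two versions of real code is already on shaky ground.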
For me benchmarking is, well, just benchmarking. However, there are essentially two methods for how the results are collected (or a combination of them).
The first method takes snapshots at certain time intervals and does not require intervention in the source code.
The second method is profiling that involves intervention in the source code (instrumentation).
I may be wrong (I haven’t used it that way), but I think the Tracy profiler is a combination of the two methods. It can collect data based on frame rate, such as for games, and it can also collect data for only a zone of the code.
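For the zone side of that, Tracy’s instrumentation looks roughly like the sketch below. You need to build with Tracy’s client and TRACY_ENABLE defined; check the Tracy manual for the exact setup, as details differ between versions.

```cpp
#include <tracy/Tracy.hpp>

void update_world() {
    ZoneScopedN("update_world");  // marks this scope as a zone in the Tracy UI
    // ... the code under the microscope ...
}

int main() {
    for (int frame = 0; frame < 600; ++frame) {
        update_world();
        FrameMark;  // tells Tracy a frame ended, enabling per-frame statistics
    }
}
```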