Research about Benchmarking

But tools like Stabilizer give you normally distributed measurements (see the Sound Statistics footnote).

I think taking 40 samples and immediately calculating specific properties from them can be misleading. We should first look at the samples to see whether they are actually good, or whether we need to do something like Stabilizer does (randomize the address/memory layout) to get better measurements. Otherwise it is like asking 1 person out of 40 people the same question 40 times, instead of asking all 40.
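A minimal sketch of what "look at the samples first" could mean in practice, assuming the measurements are just a list of timings. It uses the Jarque-Bera statistic (skewness and kurtosis) as a rough normality check, simply because it only needs the standard library. This is my own pick for illustration, not necessarily what Stabilizer does, and with only 40 samples the chi-squared threshold is an approximation. The sample data and the names `jarque_bera` and `looks_normal` are made up.

```python
import math
from statistics import NormalDist

# 95% quantile of the chi-squared distribution with 2 degrees of freedom.
CHI2_2DF_95 = 5.991

def jarque_bera(samples):
    """Return the Jarque-Bera statistic for a list of measurements."""
    n = len(samples)
    mean = sum(samples) / n
    # Central moments (biased estimators, as in the usual JB definition).
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    if m2 == 0:
        raise ValueError("all samples are identical")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

def looks_normal(samples):
    """Rough check: True if normality is not rejected at the 5% level."""
    return jarque_bera(samples) < CHI2_2DF_95

# Hypothetical timings in nanoseconds.
# 40 evenly spaced quantiles of a normal distribution (an idealised case).
normal_like = [NormalDist(100.0, 5.0).inv_cdf((i + 0.5) / 40) for i in range(40)]
# 40 samples that alternate between two values, e.g. two memory layouts.
two_valued = [90.0, 110.0] * 20

for name, data in (("normal_like", normal_like), ("two_valued", two_valued)):
    print(name, round(jarque_bera(data), 3), looks_normal(data))
```

If a check like this fails, that would be a hint to inspect the raw samples (or change how they are collected) before trusting any summary value derived from them.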

If the hot-reloading support could in the future also rearrange the code and randomize its layout while the program is running, that could potentially give us measurements that don’t depend on layout and that are normally distributed. Statisticians seem to be able to do neat mathy things with normally distributed data that can’t be done with other distributions.
With that we would have a distribution that is nice to work with, instead of specific values calculated from samples (which may reflect only part of the original distribution).

Also, the reduced values may lead people to think the data is normally distributed when it isn’t. It might instead alternate between 2 values, follow some other distribution, have multiple peaks, or show no discernible pattern at all. I guess what I don’t want is for some library to calculate these reduced properties while nobody ever looks at the samples themselves to see whether something more is hidden in them.
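A small illustration of how reduced values can hide the shape, using made-up timings that alternate between two values. The helper names `summarize` and `fraction_near` are hypothetical; only the `statistics` module from the standard library is used.

```python
import statistics

def summarize(samples):
    """Reduce the samples to the usual handful of summary values."""
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.pstdev(samples),
    }

def fraction_near(samples, center, tolerance):
    """Fraction of samples that lie within `tolerance` of `center`."""
    hits = sum(1 for x in samples if abs(x - center) <= tolerance)
    return hits / len(samples)

# Hypothetical timings (ns) that alternate between two values.
alternating = [90.0, 110.0] * 20

summary = summarize(alternating)
print(summary)  # mean and median are both 100.0, stdev is 10.0
# Not a single measurement is within 5 ns of the reported mean.
print(fraction_near(alternating, summary["mean"], 5.0))  # 0.0
```

Reported as "100 ns ± 10 ns", this reads like a bell curve around 100, even though the code never actually took 100 ns in any run.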

I would like to be able to process the measurements in a more histogram-like way, even if it is just to check whether there is some strange pattern, before reducing the data down to something that is used for automation.
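As a rough sketch of such a histogram step, assuming plain lists of timings: the code below bins the samples, renders a text histogram for a human to look at, and counts clusters of non-empty bins as a very crude automated hint that there may be more than one peak. The function names and the cluster heuristic are my own assumptions, not taken from any existing benchmarking library.

```python
def histogram(samples, bins=10):
    """Return (lowest value, bin width, list of counts per bin)."""
    lo, hi = min(samples), max(samples)
    # Fall back to a width of 1.0 when all samples are identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for x in samples:
        # The maximum value would land in bin `bins`, so clamp it.
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return lo, width, counts

def render(samples, bins=10):
    """Render one text line per bin: lower edge and a bar of '#'."""
    lo, width, counts = histogram(samples, bins)
    return "\n".join(
        f"{lo + i * width:8.1f} | {'#' * count}"
        for i, count in enumerate(counts)
    )

def count_clusters(counts):
    """Count runs of non-empty bins separated by at least one empty bin."""
    clusters = 0
    previous_empty = True
    for count in counts:
        if count > 0 and previous_empty:
            clusters += 1
        previous_empty = count == 0
    return clusters

# Hypothetical timings (ns) with two separate groups.
timings = [90.0, 91.0, 90.0, 110.0, 111.0, 110.0, 90.0, 110.0]
print(render(timings))
_, _, counts = histogram(timings)
if count_clusters(counts) > 1:
    print("more than one cluster, look at the samples before reducing them")
```

The cluster count depends heavily on the number of bins and will miss overlapping peaks, so it could only serve as a flag that tells someone to look at the picture, not as a replacement for looking.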

I think the ideas in Bret Victor’s Ladder of Abstraction are potentially relevant here, as a way to show more of what is going on in ways that are easy to understand. When measurements just get reduced in specific ways, without the more detailed data being shown, you potentially gain less insight.

I wonder if benchmarking wouldn’t be better as something you set up through a profiler, so that it then becomes repeatable: essentially a dynamic screenshot, extending into the future, of a piece of code you had under the microscope. You would save a view into a specific area and it would then just be monitored continuously.