I never put any real energy into it, but the code below is called many millions of times per second.
Any ideas from performance freaks?
```zig
/// Get the next best scoring move, putting it at current_idx, before we move on.
fn extract_next(self: *MovePicker, current_idx: usize) *ExtMove {
    const ptr: [*]ExtMove = &self.extmoves; // extmoves is a fixed-size array of 224 in self.
    var best_idx: usize = current_idx;
    var max_score: Value = ptr[current_idx].score;
    for (current_idx + 1..self.count) |idx| {
        const e = ptr[idx];
        if (e.score > max_score) {
            max_score = e.score;
            best_idx = idx;
        }
    }
    if (best_idx != current_idx) {
        std.mem.swap(ExtMove, &ptr[best_idx], &ptr[current_idx]);
    }
    return &ptr[current_idx];
}
```
What comes to mind by the looks of it: since each call scans from `current_idx` to the end, calling this in a loop over all moves amounts to a lazy selection sort, so O(N²) in the worst case — though in practice a cutoff often means you only extract a few moves, which is the point of doing it lazily. Perhaps there’s an opportunity for SIMD somewhere. Since the backing array is a fixed 224 elements, a power-of-two vector width will always divide it evenly, so you never need an awkward remainder loop over the full array.
But the question I’d ask is: how’s the timing? Is it too bad or have you measured this to be the bottleneck?
I don’t know, but you showed only the implementation. Optimization could be possible if the call site were taken into account as well, e.g. if you always want only the top 10 best moves.
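To make that concrete, a hedged sketch (assuming the call site really does only ever consume the top `k` moves — `selectTopK` is a hypothetical name, not part of your code): do `k` selection passes up front and stop, instead of lazily selecting on every call.

```zig
const std = @import("std");

/// Partial selection sort: after this call, scores[0..k] holds the k
/// largest values in descending order; the remainder is left unordered.
fn selectTopK(scores: []i32, k: usize) void {
    for (0..@min(k, scores.len)) |i| {
        var best = i;
        for (i + 1..scores.len) |j| {
            if (scores[j] > scores[best]) best = j;
        }
        std.mem.swap(i32, &scores[i], &scores[best]);
    }
}
```

This is O(k·N) instead of O(N²), which only pays off if `k` is genuinely small and known ahead of time.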
This really depends on the size: for a larger array a heap would typically make the most sense, while for small arrays the linear search might actually be more efficient. But if you want the linear search to be optimal, you should store the data in columnar arrays, i.e. have one array of scores and another array for whatever else you need. Depending on how large the `ExtMove` struct is, this can help with memory caches. These kinds of micro-optimizations really need benchmarking, as it’s not intuitive what is faster and what is slower.
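A minimal sketch of the columnar layout (assuming `Value` and `Move` placeholder types here — substitute your real ones; the struct and method names are hypothetical):

```zig
const std = @import("std");

// Hypothetical placeholder types; substitute the engine's real ones.
const Value = i32;
const Move = u16;

const MovePickerSoA = struct {
    scores: [224]Value = undefined,
    moves: [224]Move = undefined,
    count: usize = 0,

    /// Same lazy selection as before, but the hot scan now walks a
    /// dense array of scores instead of whole ExtMove structs.
    fn extractNext(self: *MovePickerSoA, current_idx: usize) Move {
        var best_idx = current_idx;
        for (current_idx + 1..self.count) |idx| {
            if (self.scores[idx] > self.scores[best_idx]) best_idx = idx;
        }
        if (best_idx != current_idx) {
            std.mem.swap(Value, &self.scores[best_idx], &self.scores[current_idx]);
            std.mem.swap(Move, &self.moves[best_idx], &self.moves[current_idx]);
        }
        return self.moves[current_idx];
    }
};
```

The two swaps keep both arrays in lockstep, so a single index still identifies one move.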
I came up with something to find the index of the max value of an array using SIMD. It uses Zig’s vectors to operate on the whole array in one shot, taking advantage of loop unrolling, using as many SIMD registers as possible, and getting as much instruction-level parallelism as possible. It’s generic enough to work with all the integer/float types and on arbitrary array sizes (with some limit, because the array is passed on the stack).
maxIndex() takes the size of the array as N and the array element type as T, where N * @sizeOf(T) should be a multiple of the SIMD register byte size (e.g. 64-, 128-, or 256-bit registers). It takes MIN as the lowest possible value of the element type, plus begin and end limits, both inclusive. This allows the actual element count to be smaller than the array length. I believe your use case requires a moving begin index, with the last element not necessarily aligned to a 256-bit boundary.
Also, you need to arrange your score data in one contiguous array: keep two arrays, one for the scores and one for the `ExtMove`s, access them with the same index, and swap both when needed.
I haven’t run any benchmark. Look at the generated SIMD assembly in Godbolt for detail. Cheers.
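For reference, here is roughly the shape of such a routine using Zig’s built-in vectors — a sketch of the general technique, not the exact `maxIndex()` described above (the name `maxIndexSimd` is mine; it uses `std.simd` helpers and handles the tail with a scalar loop rather than requiring aligned bounds):

```zig
const std = @import("std");

/// Sketch: index of the maximum element in items[begin..end)
/// (end exclusive here). Vectorized main loop, scalar tail.
fn maxIndexSimd(comptime T: type, items: []const T, begin: usize, end: usize) usize {
    const W = std.simd.suggestVectorLength(T) orelse 8;
    var best: T = items[begin];
    var best_idx: usize = begin;
    var i = begin;
    // Process full vector-width chunks.
    while (i + W <= end) : (i += W) {
        const v: @Vector(W, T) = items[i..][0..W].*;
        const m = @reduce(.Max, v);
        if (m > best) {
            best = m;
            // Locate the first lane holding the chunk's maximum.
            const lane = std.simd.firstTrue(v == @as(@Vector(W, T), @splat(m))).?;
            best_idx = i + lane;
        }
    }
    // Scalar tail for the remaining elements.
    while (i < end) : (i += 1) {
        if (items[i] > best) {
            best = items[i];
            best_idx = i;
        }
    }
    return best_idx;
}
```

Note the branch on `m > best` keeps the expensive lane-search off the hot path when the current chunk doesn’t improve on the running maximum.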