Inplace sort optimizing

I never put any real energy in it, but the below code is called many million times per second.
Any ideas from performance freaks?

    /// Get the next best scoring move, putting it at current_idx, before we move on.
    fn extract_next(self: *MovePicker, current_idx: usize) *ExtMove {
        const ptr: [*]ExtMove = &self.extmoves; // this is a 224 fixed size array in self.
        var best_idx: usize = current_idx;
        var max_score: Value = ptr[current_idx].score;
        for (current_idx + 1..self.count) |idx| {
            const e = ptr[idx];
            if (e.score > max_score) {
                max_score = e.score;
                best_idx = idx;
            }
        }
        if (best_idx != current_idx) {
            std.mem.swap(ExtMove, &ptr[best_idx], &ptr[current_idx]);
        }
        return &ptr[current_idx];
    }
1 Like

What comes to mind by the looks of it is, since you pick arbitrary positions in the list to sort from that point onwards, is that this is called in a loop so O(N log N). Not too bad if that’s the case. Perhaps there’s an opportunity for SIMD somewhere. You’ll get an exact match always since 224 elements are fixed.

But the question I’d ask is: how’s the timing? Is it too bad or have you measured this to be the bottleneck?

There’s one in the standard library but you might want to write a custom version that isn’t allowed to grow dynamically

2 Likes

I have not measured this at all. Just from the look of it I just thought it could be faster somehow.
I’ll check the priority queue. Forgot that one.

I don’t know, but you showed only the implementation. Optimization could be possible if the call site was taken into account as well, eg if you always want the top 10 best moves.

The moveloop is not that spectacular. We always need all moves deciding here to skip or process it.

moveloop: for (0..movepicker.count) |move_idx| {
    const ex: *ExtMove = movepicker.extract_next(move_idx);
    // do something with move or skip this move
}

No break? That means, you are sorting the whole array?

This really depends on the size, but typically for a larger array, heap would make the most sense, for small arrays, the linear search might be actually more efficient. But if you want the linear search to be optimal, you need to store it in columnar arrays, i.e. have an array of scores and array of whatever else you need. Depending how how large the `ExtMove` struct is, it might help with memory caches. These kind of micro optimizations really need benchmarking, as it’s not really intuitive what is faster and what is slower.

1 Like

No there are continues and often a break. Thats why I don’t sort before.

moveloop: for (0..movepicker.count) |move_idx| {
    const ex: *ExtMove = movepicker.extract_next(move_idx);
    // if condition_a continue :moveloop
    // procesmove
    // if some_result break :moveloop
}

The size of the array = 224 and the size of the ExtMove = 64 bits.

What’s the type of score? Is it u8, u16, u32? There could be an opportunity to do batch compare for max using simd.

1 Like

score is 32 bits.
Yeah SIMD could be done. Never done that…

pub const ExtMove = packed struct {
    /// 16 bits. The generated move.
    move: Move,
    /// 32 bits. The score for move ordering.
    score: i32,
    /// 4 bits. Set during processing.
    piece: Piece,
    /// 4 bits. Set during processing.
    captured: Piece,
    /// 1 bit. Set during processing.
    is_bad_capture: bool,
    /// 1 bit. Set during processing.
    is_killer: bool,
    /// 1 bit. Updated during search.
    is_seen_by_search: bool,
}

I came up with something to find the index of the max value of an array using SIMD. It uses Zig’s vector to operate on the whole array in one shot, taking advantage of loop unrolling, utilizing as many as the SIMD registers possible, and doing as much instruction-level parallelism as possible. It’s generic enough to work with all the integer/float types and on arbitrary array size (with some limit because the array is passed on stack.)

maxIndex() takes in the size of the array as N and the array element type T, where N*sizeof(T)should be a multiple of the SIMD register byte size (e.g. x64, x128, x256, etc). It takes in the MIN as the lowest possible value of the array. It takes in a begin and an end limits, both inclusive. This allows actual data elements to have smaller count than the array length. I believe you have the requirement of walking the beginning index and having the last element not aligned to x256.

Also you need to arrange your score data to be in one contiguous array. Have two arrays - one for the scores and one for the ExtMove. Access them using the same index. Swap both of them when needed.

I haven’t run any benchmark. Look at the generated SIMD assembly in Godbolt for detail. Cheers.

3 Likes