Following my last reply, I’ve had a go at benchmarking several methods of implementing an iterator over a slice, with surprising results. I benchmarked:
- My original index-based iterator (at the top of this topic)
- A simplified slicing version (a cross between my original and @Sze’s)
- @Sze’s implementation (above)
- A “direct slicing” approach based on @kristoff’s suggestion
- A ‘C’ equivalent using pointers
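For readers who haven’t followed the earlier posts, here is a rough sketch of the first two flavours of iterator being compared. This is illustrative only — the names and details are mine, not the exact code from this thread:

```zig
const std = @import("std");

// Sketch of an index-based iterator: keeps the whole slice plus a
// running index into it.
pub fn IndexSliceIter(comptime T: type) type {
    return struct {
        items: []const T,
        index: usize = 0,

        const Self = @This();

        pub fn init(items: []const T) Self {
            return .{ .items = items };
        }

        // Return the next element, or null when the slice is exhausted.
        pub fn next(self: *Self) ?T {
            if (self.index >= self.items.len) return null;
            defer self.index += 1;
            return self.items[self.index];
        }
    };
}

// Sketch of a slicing-style iterator: instead of advancing an index,
// it repeatedly shrinks the slice from the front.
pub fn SlicingSliceIter(comptime T: type) type {
    return struct {
        items: []const T,

        const Self = @This();

        pub fn init(items: []const T) Self {
            return .{ .items = items };
        }

        pub fn next(self: *Self) ?T {
            if (self.items.len == 0) return null;
            const val = self.items[0];
            self.items = self.items[1..];
            return val;
        }
    };
}
```

Both expose the same `init`/`next` interface, which is what lets a single measurement routine accept any of them via `anytype`.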
The full code, with all iterators and the timings, is quite long, so I won’t post it all here unless requested. However, below is the core measurement routine I used for most of the tests. Items 4 and 5 above required different approaches, but I kept them as similar as possible:
fn timeIt(name: []const u8, iter: anytype) void {
    var total: u32 = 0;
    var zero_count: u32 = 0;
    const start = std.time.nanoTimestamp();
    while (true) {
        const val = iter.next() orelse break;
        if (val == 0) {
            zero_count += 1;
            _ = iter.next() orelse break; // Skip next value after a 0
        } else {
            total += val;
        }
    }
    const duration = std.time.nanoTimestamp() - start;
    std.debug.print(
        "{s:15}: Total: {:10} Zeros: {:10} Duration: {}\n",
        .{ name, total, zero_count, duration },
    );
}
The loop body is intended to simulate an application which uses all of the returned values, and in which the number of calls to next() per iteration varies, so that the optimiser can’t collapse the loop into something unrepresentative.
I compiled all Zig versions of the test code with -OReleaseFast, and the ‘C’ version with gcc -O3. Typical timings when running each test once on a 10,000,000-item slice of pseudo-random values were:
- My original index-based iterator: 50 ms
- The simplified slicing version: 50 ms
- @Sze’s implementation (above): 65 ms
- @kristoff’s “direct slicing” approach: 50 ms
- The ‘C’ equivalent using pointers: 50 ms
HOWEVER, I observed an interesting effect: if I tested several of the methods in succession in the same program, and repeated some of the same tests multiple times on the same data, the timings of all the runs changed, some by significant amounts. I can only assume the optimiser was making different inlining choices, as changes later in the test sequence were affecting the results of unchanged code earlier in the run. In particular, if version 3 was tested twice in a row it became a lot slower: 130 ms instead of 50 ms. Here’s an extract of the code to show what I mean:
var is_iter = IndexSliceIter(u8).init(data);
timeIt("Index", &is_iter);
var ss_iter = SlicingSliceIter(u8).init(data);
timeIt("Slicing", &ss_iter);
var s_iter = sliceIterator(data);
timeIt("Slicing2", &s_iter);
// s_iter = sliceIterator(data);
// timeIt("Slicing2", &s_iter);
If the two commented-out lines are reinstated, both runs of sliceIterator() become slower by a similar amount. Other combinations also caused surprising variations, though none varied as much as version 3.
In conclusion (and rather surprisingly), my naive index-based approach is as fast as any and faster than some, so I’ll probably stick with it. I hope this is of interest.