Is there an iterator for slices in the standard library?

I’d like to iterate over a slice using a next() method (i.e. not by using a for loop) so that I can take the next value as required at different points in the program. It seems an obvious thing, but I couldn’t see a way of creating such an iterator in the standard library. Am I missing something? I created a simple one myself (below) but I’m not sure if this is the best or most efficient method. Can anyone advise?

fn SliceIter(comptime T: type) type {
    return struct {
        const Self = @This();
        data: []const T,
        pos: usize,

        fn init(data: []const T) Self {
            return Self{ .data = data, .pos = 0 };
        }

        fn next(self: *Self) ?T {
            if (self.pos == self.data.len) {
                return null;
            } else {
                const retval = self.data[self.pos];
                self.pos += 1;
                return retval;
            }
        }
    };
}

Here’s a simple example of it in use:

const std = @import("std");

pub fn main() !void {
    const data_array = [_]i64{ 1, 3, 5, -1, 3 };
    const data = &data_array;
    var s_iter = SliceIter(i64).init(data);

    const v1 = s_iter.next();
    std.debug.print("{any}\n", .{v1});

    const v2 = s_iter.next();
    std.debug.print("{any}\n", .{v2});
}

So far I’ve only had a similar case once; in that instance I didn’t wrap it in an iterator, but instead overwrote the slice with a new slice of the remaining elements (a sketch of what I mean is below). However, I like the idea of using an iterator; it can be convenient to have the same interface.
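For illustration, here is a minimal sketch of that manual approach (this example is mine, written for this reply, not code from that project):

const std = @import("std");

pub fn main() void {
    // Start with the full slice, then repeatedly overwrite it with the
    // remainder after consuming the first element.
    var remaining: []const i64 = &.{ 1, 3, 5, -1, 3 };
    while (remaining.len > 0) {
        std.debug.print("{}\n", .{remaining[0]});
        remaining = remaining[1..]; // keep only the unconsumed elements
    }
}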

I created my own version based on your code and the slice replacement I had done manually before, but this time wrapped in an easy-to-use function that returns an iterator.

const std = @import("std");

fn SliceElement(comptime Slice: type) type {
    return switch (@typeInfo(Slice)) {
        .Pointer => |p| p.child,
        else => unreachable,
    };
}

fn SliceIterator(comptime Slice: type) type {
    std.debug.assert(std.meta.trait.isSlice(Slice));
    return struct {
        const Self = @This();
        const Element = SliceElement(Slice);
        data: Slice,

        fn next(self: *Self) ?Element {
            const len = self.data.len;
            if (len == 0) return null;

            const retval = self.data[0];
            self.data = if (len > 1) self.data[1..len] else self.data[0..0];
            return retval;
        }
    };
}

fn sliceIterator(slice: anytype) SliceIterator(@TypeOf(slice)) {
    return .{ .data = slice };
}

usage example:

pub fn main() !void {
    const data_array = [_]i64{ 1, 3, 5, -1, 3 };
    var s_iter = sliceIterator(@as([]const i64, &data_array));

    std.debug.print("{any}\n", .{s_iter.next()});
    std.debug.print("{any}\n", .{s_iter.next()});
    std.debug.print("{any}\n", .{s_iter});

    const data2: []const i64 = &.{ 1, 3, 5, -1, 3 };
    var s_iter2 = sliceIterator(data2);

    std.debug.print("{any}\n", .{s_iter2.next()});
    std.debug.print("{any}\n", .{s_iter2.next()});
    std.debug.print("{any}\n", .{s_iter2.next()});
    std.debug.print("{any}\n", .{s_iter2.next()});
    std.debug.print("{any}\n", .{s_iter2.next()});
    std.debug.print("{any}\n", .{s_iter2.next()});
    std.debug.print("{any}\n", .{s_iter2.next()});
    std.debug.print("{any}\n", .{s_iter2});
}

output:

1
3
main.SliceIterator([]const i64){ .data = { 5, -1, 3 } }
1
3
5
-1
3
null
null
main.SliceIterator([]const i64){ .data = {  } }

Your code is a perfectly legit iterator.

One trick that might (or might not) work for you is instead to re-slice the slice that you accept as input in your function to remove the consumed prefix. It’s less explicit than a regular iterator, but it’s much more efficient since it boils down to a simple for loop iteration.

const std = @import("std");
const assert = std.debug.assert;

fn processOneTask(tasks: []const i64) []const i64 {
    assert(tasks.len >= 1); // just to simplify this example code
    std.debug.print("processed: {}\n", .{tasks[0]});
    return tasks[1..];
}

pub fn main() !void {
    var tasks: []const i64 = &.{ 1, 2, 3 };

    tasks = processOneTask(tasks);
    tasks = processOneTask(tasks);
    tasks = processOneTask(tasks);
}

Or something along those lines.


Thanks, @Sze. I have used re-slicing elsewhere before, but didn’t think about using it in this context. @kristoff has also suggested re-slicing, though he suggested doing it directly, whilst you have provided a way of wrapping it in an iterator. Both of your answers are helpful, but since I can only mark one as the solution, I’ll mark this one.


Thanks, @kristoff. I have used a similar technique elsewhere in my code, but didn’t think about using it in this case. I’ve marked @Sze’s answer as the solution as they provided an iterator implementation, but the fact is both of you have pointed me in the right direction. I was concerned that the way I had proposed was not efficient, as the slice had to be re-indexed every time. I suspect that, as you say, re-slicing is more efficient, which is what I’m looking for. I’ll try to concoct some benchmarks to compare the methods.


Following my last reply, I’ve had a go at benchmarking several methods of implementing an iterator over a slice, with surprising results. I benchmarked:

  1. My original index-based iterator (at the top of this topic)
  2. A simplified slicing version (a cross between my original and @Sze’s)
  3. @Sze’s implementation (above)
  4. A “direct slicing” approach based on @kristoff’s suggestion
  5. A ‘C’ equivalent using pointers

The full code, with all iterators and the timings, is quite long, so I won’t post it all here unless requested to. However, below is the core measurement routine I used for most of the tests. Items 4 and 5 above required different approaches but I kept them as similar as possible:

fn timeIt(name: []const u8, iter: anytype) void {
    var total: u32 = 0;
    var zero_count: u32 = 0;
    const start = std.time.nanoTimestamp();
    while (true) {
        const val = iter.next() orelse break;
        if (val == 0) {
            zero_count += 1;
            _ = iter.next() orelse break; // Skip next value after a 0
        } else {
            total += val;
        }
    }
    const duration = std.time.nanoTimestamp() - start;
    std.debug.print(
        "{s:15}: Total: {:10}    Zeros: {:10}    Duration: {}\n",
        .{ name, total, zero_count, duration },
    );
}

The loop body is intended to simulate an application which uses all of the returned values, and where the number of calls to next() per loop varies, so that the optimiser doesn’t lead to an unrepresentative result.

I compiled all Zig versions of the test code with -O ReleaseFast, and the ‘C’ version with gcc -O3. Typical timings when running each test once with a 10,000,000 item slice of pseudo-random values were:

  1. My original index-based iterator: 50 ms
  2. The simplified slicing version: 50 ms
  3. @Sze’s implementation (above): 65 ms
  4. @kristoff’s “direct slicing” approach: 50 ms
  5. A ‘C’ equivalent using pointers: 50 ms

HOWEVER, I observed an interesting effect: If I tested several of the methods in succession in the same program, and repeated some of the same tests multiple times on the same data, the timings of all the runs changed - some by significant amounts. I can only assume the optimiser was making different inlining choices, as changes later on in the test sequence were affecting the results of unchanged code earlier in the test run. In particular, if version 3 was tested twice in a row it became a lot slower - 130ms instead of 50ms. Here’s an extract of the code to show what I mean:

    var is_iter = IndexSliceIter(u8).init(data);
    timeIt("Index", &is_iter);
    var ss_iter = SlicingSliceIter(u8).init(data);
    timeIt("Slicing", &ss_iter);
    var s_iter = sliceIterator(data);
    timeIt("Slicing2", &s_iter);
    // s_iter = sliceIterator(data);
    // timeIt("Slicing2", &s_iter);

If the two commented-out lines are introduced, both runs of sliceIterator() become slower by a similar amount. Other combinations also caused surprising variations, though none of the others varied as much as version 3.

So in conclusion (and rather surprisingly) my naive index-based approach is as fast as any and faster than some, so I’ll probably stick with it. I hope this is of interest.

I looked at the generated code with https://godbolt.org/ for my variant and saw more bounds checks than I would have liked, so I tweaked the code to make use of more operations that aren’t checked.

Can you try these two variations alongside your other ones?

fn sliceIteratorReslice(slice: anytype) SliceIteratorReslice(@TypeOf(slice)) {
    return .{ .data = slice };
}

fn SliceIteratorReslice(comptime Slice: type) type {
    std.debug.assert(std.meta.trait.isSlice(Slice));
    return struct {
        const Self = @This();
        const Element = SliceElement(Slice);
        data: Slice,

        fn next(self: *Self) ?Element {
            return if (self.data.len == 0) null else res: {
                const retval = self.data.ptr[0]; // avoiding redundant bounds checks
                self.data.ptr += 1;
                self.data.len -%= 1; // avoiding redundant bounds checks
                break :res retval;
            };
        }
    };
}

fn sliceIteratorIndex(slice: anytype) SliceIteratorIndex(@TypeOf(slice)) {
    return .{ .data = slice };
}

fn SliceIteratorIndex(comptime Slice: type) type {
    std.debug.assert(std.meta.trait.isSlice(Slice));
    return struct {
        const Self = @This();
        const Element = SliceElement(Slice);
        data: Slice,
        i: usize = 0,

        fn next(self: *Self) ?Element {
            return if (self.i == self.data.len) null else res: {
                const c = self.i;
                self.i +%= 1; // avoiding redundant bounds checks
                break :res self.data.ptr[c]; // avoiding redundant bounds checks
            };
        }
    };
}

The second variant is basically your first implementation, but using unchecked operations.
I like both variants. I think the indexed one could be useful in situations where you don’t just want a linear scan, but maybe also want to backtrack, or sometimes calculate a new value for the index.
Or you could reset the index to zero and reuse the iterator, as sketched below.
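For illustration, a hypothetical reset method (my own addition, not part of the code above) that could be placed inside the struct returned by SliceIteratorIndex:

fn reset(self: *Self) void {
    // Rewind to the first element so the same data can be scanned again.
    self.i = 0;
}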

A single return with an if expression seemed to generate slightly less code, but I only looked at it roughly, and that can probably change a lot based on compiler details; the early-return alternative is sketched below for comparison.
So far I haven’t done any timings myself, but I might some other time.
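For reference, this is roughly the early-return shape I compared against for the reslice variant (my reconstruction; the behavior should match the single-return next() above):

fn next(self: *Self) ?Element {
    if (self.data.len == 0) return null;
    const retval = self.data.ptr[0]; // avoiding redundant bounds checks
    self.data.ptr += 1;
    self.data.len -%= 1; // avoiding redundant bounds checks
    return retval;
}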

Hello @Sze. Thanks for following up on this. I did think about having a look at it via godbolt.org but hadn’t got that far. I’ve added your new variants to my test suite and benchmarked them in several ways. The results are below.

First I created a set of separate programs, each of which ran a single version of the iterator once over the data set. I then used a shell script to run each of these programs three times in succession. So each line in the results below is a separate program run. Here are the timings (in nanoseconds) from those runs:

          Index:  5043440
          Index:  5053813
          Index:  5020686

        Slicing:  5158572
        Slicing:  5243056
        Slicing:  5172921

          Sze 1:  6681714
          Sze 1:  6648648
          Sze 1:  6762502

          Sze 2:  5196653
          Sze 2:  5282166
          Sze 2:  5206039

          Sze 3:  5033126
          Sze 3:  5050702
          Sze 3:  5016642

 Direct Slicing:  4964597
 Direct Slicing:  4978511
 Direct Slicing:  4985862

Here @kristoff’s direct slicing approach comes out the fastest, whilst your last variant and my original are very close behind.

Secondly, I created a single program which ran all of the iterators one after the other over the same data set. Each iterator was run a single time:

          Index:  4985195
        Slicing:  4525113
          Sze 1:  6545125
          Sze 2:  5065851
          Sze 3:  5014205
 Direct Slicing:  4952904

No surprises here - the results are similar to the previous test.

However, it gets interesting if I run the same iterator multiple times in a row. Here are the results when I run each twice in a row, in the same program:

          Index:  4420894
          Index:  4366436

        Slicing:  5987781
        Slicing:  5953336

          Sze 1:  13486210
          Sze 1:  13564202

          Sze 2:  6007085
          Sze 2:  6036577

          Sze 3:  5060082
          Sze 3:  5056900

 Direct Slicing:  4978479
 Direct Slicing:  4949461

My original seems to have become faster (though possibly not by enough to be significant), but your first version has become a lot slower. This result was quite consistent over many test runs. I was assuming that the compiler was making different inlining choices. However, it gets even stranger if I run each one three times in a row in the same program:

          Index:  4829426
          Index:  4831482
          Index:  4852561

        Slicing:  6104155
        Slicing:  6075648
        Slicing:  6106256

          Sze 1:  5450357
          Sze 1:  5446743
          Sze 1:  5409334

          Sze 2:  5892943
          Sze 2:  5897173
          Sze 2:  5892991

          Sze 3:  4314107
          Sze 3:  4313575
          Sze 3:  4372871

 Direct Slicing:  4910835
 Direct Slicing:  4909785
 Direct Slicing:  4909465

The slowdown has disappeared again! The only difference in the code is how many of the following entries were commented out (and likewise for the other iterators):

    var s_iter = sliceIterator(data);
    timeIt("Sze 1", &s_iter);
    // s_iter = sliceIterator(data);
    // timeIt("Sze 1", &s_iter);
    // s_iter = sliceIterator(data);
    // timeIt("Sze 1", &s_iter);

If I get time, I might try to use godbolt.org to figure out what’s going on.


My first implementation had several branches: two that I wrote, and a few more for the bounds checks inserted by the compiler, from what I saw in godbolt.

So I think it kind of makes sense that the performance is less predictable. I would guess that, depending on what else is going on on the CPU, the branch predictor may sometimes guess right and actually prefetch the right memory (thus ending up with a speed similar to the other implementations), and other times it doesn’t. With every branch, the chances of a correct guess get lower.

This is my guess at what could be happening; that said, I don’t have that much practical experience, and this is more from what I’ve heard about hardware.

I think one thing that may make sense to try is to use shorter data but iterate over it many times, and to include the iterator construction in the timings.
So basically, instead of going once over 10_000_000 items, maybe iterate 100_000 times over 100 items.
Ideally one could create a matrix out of many different combinations of item count and repetitions; I think that may show differences between cache locality, the cost of setting up the iterator, and things like that. For very long sequences the cost of iterator initialization is probably always negligible, but for short ones it may not be. A rough sketch of what I mean follows.
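For example, something along these lines (untested, and the sizes are just placeholders; it assumes the sliceIterator and std import from earlier in this thread, plus a data slice at least as long as the largest size):

fn benchMatrix(data: []const u8) void {
    // Try several slice lengths, keeping the total element count roughly
    // constant so the timings stay comparable.
    const sizes = [_]usize{ 100, 10_000, 1_000_000 };
    for (sizes) |n| {
        const reps = 10_000_000 / n;
        var total: u64 = 0;
        const start = std.time.nanoTimestamp();
        var r: usize = 0;
        while (r < reps) : (r += 1) {
            // Construct the iterator inside the timed loop so that its
            // setup cost is included in the measurement.
            var iter = sliceIterator(data[0..n]);
            while (iter.next()) |v| total += v;
        }
        const duration = std.time.nanoTimestamp() - start;
        std.debug.print(
            "items: {:10}  reps: {:10}  total: {:12}  duration: {}\n",
            .{ n, reps, total, duration },
        );
    }
}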

Thank you for running the tests!

A friend recently gave me a pointer to this book: Algorithms for Modern Hardware - Algorithmica
I haven’t read it yet, but it is on my todo list; I think it may have some insights for other things to try.