Fastest filtered scanning of arraylist

Thank you for your explanations.

I saw that the switch was even a little bit faster than storing a function pointer.

Don’t believe the “Clean Code” dogma

I don’t believe in anything. I wrote chess programs, and there you lose all beliefs, because the only thing that counts is speed.

A big ‘global context’ is nice, but I had some problems there in the past as well.
The order of initializing / executing / changing things can be tricky.

1 Like

I was actually thinking… I will probably run into more things and do not want to ‘spam’ the help section with new questions again and again.
Would it be better to rename this thread to My Lemmings Questions?
Or what is the best way?

I’d say new posts for new questions will make it easier to find for people who have similar questions in the future. You can also use the brainstorming/explain categories for things that aren’t direct problems.

3 Likes

I think creating new topics is fine, not spam and actually helpful for people who want to help, but don’t have the time to read a really long topic.

So I think, in the best case, you are able to describe new questions without needing all of the context of what you are working on. Often this context is helpful for getting better answers, but not everything is relevant or needed, and when people think they could give better answers with more details, they can always ask for more information.

You also can link to some previous posts if that helps in describing your situation better. You can think of creating a new topic as providing a new direction of focus and a new entry point for people who may want to join the discussion.

1 Like

I implemented some stuff from Sze and I must say it looks a lot better already. The Lemming is 32 bytes now.

After that I experimented with skipping dead Lemmings.

  1. Using a StaticBitSet and looping through the bit indexes decreased performance by 500,000 ticks per second.

  2. Then I added prev and next pointers to the Lemming, which is quite tricky during processing, but this is definitely a very fast solution regarding my original question: combining a fixed-size arraylist with prev and next pointers.
    When a lemming dies, the prev and next pointers are adjusted (a minimal sketch of the unlink step follows the code below).
    Although it is fast, I do not like the tricky part.

    fn update_lemmings(self: *Self) void
    {
        var current: ?*Lemming = self.first_lemming;
        while(current) |lem|
        {
            const next = lem.next; // the tricky part; save this one, because the linked list changes.
            lem.update(self);
            current = next;
        }

        // for (self.lemmings.items) |*lem|
        // {
        //     if (lem.st.is_dead()) continue;
        //     lem.update(self);
        // }
    }
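
The unlink step when a lemming dies looks roughly like this (a sketch; the prev field is assumed to mirror the next field used above, and first_lemming is the list head from the snippet):

    fn unlink(self: *Self, lem: *Lemming) void
    {
        // Re-route the neighbours around the dying lemming.
        if (lem.prev) |p| p.next = lem.next else self.first_lemming = lem.next;
        if (lem.next) |n| n.prev = lem.prev;
        lem.prev = null;
        lem.next = null;
    }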
1 Like

Off topic but BTW: I am the greatest fan of Lemmings ever. I completed all levels in PC Lemmings and periodically listen to the soundtrack just for relaxation. :joy_cat: If you are re-implementing Lemmings, I will eagerly await the final product to put it to the test. lol

4 Likes

Nice! You can already use my Delphi version. It has a long history and 5000 user-made levels (somewhere down there I uploaded them).

3 Likes

You should definitely avoid linked lists. They use a lot of memory: on current PCs, 64 bits per pointer. Just use two arrays, as I mentioned in #13; that saves memory and is faster. And read the book; the author is a game developer with a lot of experience.
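
A rough sketch of what I mean (names are made up, just to illustrate the technique): one fixed array with all lemmings and a second dense array with the indices of the living ones. Killing is a swap-remove on the index array, so the update loop never touches dead lemmings.

const std = @import("std");

const MAX_LEMMINGS = 128;

const Lemming = struct {
    x: i32 = 0,

    fn update(self: *Lemming) void {
        self.x += 1; // placeholder work
    }
};

const World = struct {
    lemmings: [MAX_LEMMINGS]Lemming = [_]Lemming{.{}} ** MAX_LEMMINGS,
    living: [MAX_LEMMINGS]u8 = undefined, // indices into lemmings
    living_count: u8 = 0,

    fn kill(self: *World, living_pos: u8) void {
        // Swap-remove: overwrite the dead slot with the last living index.
        self.living_count -= 1;
        self.living[living_pos] = self.living[self.living_count];
    }

    fn updateAll(self: *World) void {
        // Only the dense index array is scanned.
        for (self.living[0..self.living_count]) |i| {
            self.lemmings[i].update();
        }
    }
};

test "swap-remove keeps the living list dense" {
    var w = World{};
    w.living[0] = 0;
    w.living[1] = 1;
    w.living[2] = 2;
    w.living_count = 3;
    w.kill(1); // lemming 1 dies
    try std.testing.expectEqual(@as(u8, 2), w.living_count);
    try std.testing.expectEqual(@as(u8, 2), w.living[1]); // last index moved in
}

Note that killing during iteration still needs some care (iterate backwards or defer the removals), so the bookkeeping is not entirely free either.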

Since there are only 128 lemmings in the array, it is possible to store the index of the next and prev lemming in just 14 bits (u7 + u7). So you can definitely do a linked list and still save on memory.
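
For example (a minimal sketch; since a u7 cannot be null, a reserved index such as the lemming’s own index would have to serve as the “no link” marker):

const std = @import("std");

// Two 7-bit indices instead of two 64-bit pointers.
const Links = packed struct {
    prev: u7,
    next: u7,
};

comptime {
    std.debug.assert(@bitSizeOf(Links) == 14);
    std.debug.assert(@sizeOf(Links) == 2); // rounded up to whole bytes
}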

1 Like

But that’s not the point. The point is that when you check whether a lemming is alive or not, you have to go through the whole array of 128 lemmings every time, instead of just the actually living ones. Of course, you might think that if a living lemming points to the next living lemming, you don’t have to go through the whole array, but, after you’ve found the first living one, just the living ones. But, and this has already been mentioned, you’re no longer dealing with simple pointer arithmetic, which the processor handles with ease, but with dependent loads the prefetcher cannot predict, which leads to cache misses. And those take up the most time. It’s as simple as that.

I know the original DOS game (1992) did it exactly like this:
loop through the lemmings (max 80 back then); if dead, skip.
And currently I am doing that too (as I always did in the early Delphi days).

The bookkeeping for dying lemmings, i.e. moving them out of the array, is maybe more expensive than just looping; it probably depends on the game state.
I will have to measure some examples…
For example, if lemming #127 is alive and the rest are gone :slight_smile:

1 Like

That was the time when CPUs had only one core and didn’t try to predict which data might be needed next and load it ahead of time. Today’s CPUs try exactly that and load structures into the cache one cacheline at a time. Depending on the size of the structure, one or more fit into a cacheline. It’s bad if they are “dead” lemmings, because then the CPU has to load again. But if you have an array that only stores the indices of the living ones, the CPU just has to make its prediction and load the corresponding structure into the cache. That’s basically the whole secret of modern CPUs. And that’s why an approach like in the 1990s doesn’t work so well anymore.

Interesting, I will certainly keep all this in mind

Isn’t the whole point of cache locality, though, that the actual memory for the alive Lemmings is next to each other in memory?

With your suggested array of indices to alive Lemmings, the processor also cannot prefetch the memory and will have cache misses, no?

So in my limited understanding, having two separate arrays of actual Lemming data should be the fastest, as you only touch contiguous memory when iterating one or the other.

The suggestion to use two separate arrays instead of one is based on optimizing the processor’s branch prediction and throughput. Iterating over a compact array is efficient for the branch predictor and keeps cache utilization high. When an array is compact, the processor’s cache loads entire blocks of data, making the iteration process faster. The more densely packed the data, the better the cacheline utilization.

Consider a scenario where you’re iterating over an array of 32-byte structures but only need to examine a single byte within each structure. This results in only 1/32 of each cacheline being actively used per load. In contrast, using a tightly packed array of 32 booleans yields 32/32 utilization, maximizing the cacheline efficiency. This structure improves throughput by reducing cache waste.
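
A small sketch to make that arithmetic concrete (the 32-byte struct and the 64-byte cacheline size are assumptions for illustration):

const std = @import("std");

// Hypothetical 32-byte lemming, used only to make the numbers concrete.
const Lemming = extern struct {
    alive: bool,
    payload: [31]u8,
};

test "cacheline utilization, assuming 64-byte lines" {
    try std.testing.expectEqual(32, @sizeOf(Lemming));
    // AoS: scanning 128 alive flags touches 128 * 32 = 4096 bytes,
    // i.e. 64 cachelines, with 1/32 of each line actually used.
    try std.testing.expectEqual(64, 128 * @sizeOf(Lemming) / 64);
    // Packed bools: 128 * 1 = 128 bytes, i.e. just 2 fully used lines.
    try std.testing.expectEqual(2, 128 * @sizeOf(bool) / 64);
}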

As @chrboesch noted, modern processors achieve peak performance by consistently supplying the pipeline with instructions. An L1 cache miss can be manageable, but an L2 miss introduces significant latency, and an L3 or memory/page miss can severely stall the pipeline. A Data-Oriented Design (DOD) approach with a Struct of Arrays (SoA) layout maximizes cacheline efficiency, feeding the processor’s speculative and predictive execution engines with data and instructions, helping to sustain maximum processing throughput.

This becomes even more important in the context of vectorization: it doesn’t matter that you can execute, say, 8 checks per load, because if your processor constantly has to wait for the cache to load new data, you won’t get the benefits of SIMD throughput.
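
As a sketch of what such a check could look like with @Vector (assuming the alive flags are stored densely as 0/1 bytes, as in the SoA layout discussed above):

const std = @import("std");

// Counts living lemmings 8 flags at a time. This only pays off if the
// flags are densely packed, so each 8-wide load stays within cachelines.
fn countAlive(alive: []const u8) usize {
    var total: usize = 0;
    var i: usize = 0;
    while (i + 8 <= alive.len) : (i += 8) {
        const v: @Vector(8, u8) = alive[i..][0..8].*;
        total += @reduce(.Add, v); // 8 checks per load
    }
    while (i < alive.len) : (i += 1) total += alive[i]; // scalar tail
    return total;
}

test "vectorized alive count" {
    const flags = [_]u8{ 1, 0, 1, 1, 0, 0, 0, 1, 1, 0 };
    try std.testing.expectEqual(@as(usize, 5), countAlive(&flags));
}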

1 Like

Would the MultiArrayList be a candidate for this DOD story?

2 Likes

Thanks for the explanation! This confirms my point that having two dense arrays should be better than one compact array plus an array of indices into that single array, right?
Of course, you should always measure your concrete use case at hand.

Just to help you see, I’ve made a little example:

const std = @import("std");

const BoundedArray = std.BoundedArray;
const Timer = std.time.Timer;

pub fn FooAos(comptime T: type) type {
    return struct {
        const Self = @This();
        alive: bool = true,
        padding: T = 0,

        pub fn init(comptime N: usize) !BoundedArray(@This(), N) {
            return try BoundedArray(Self, N).init(N);
        }
    };
}

pub fn main() !void {
    const stdout = std.io.getStdOut();
    const writer = stdout.writer();

    {
        const ITERATION = 10;
        const LEN = 128;
        const foo_aos = try FooAos(u31).init(LEN);
        const foo_slice = foo_aos.slice();

        try writer.print("foo_aos : ITERATION {d}, LEN {d}\n", .{ ITERATION, LEN });
        var timer = try Timer.start();
        for (0..ITERATION) |_| {
            for (foo_slice) |foo| {
                std.mem.doNotOptimizeAway(foo.alive);
            }
            const lap = timer.lap();
            try writer.print("foo_aos : time {d}us\n", .{(lap / std.time.ns_per_us)});
        }
    }

    {
        const ITERATION = 10;
        const LEN = 256;
        const foo_aos = try FooAos(u31).init(LEN);
        const foo_slice = foo_aos.slice();

        try writer.print("foo_aos : ITERATION {d}, LEN {d}\n", .{ ITERATION, LEN });
        var timer = try Timer.start();
        for (0..ITERATION) |_| {
            for (foo_slice) |foo| {
                std.mem.doNotOptimizeAway(foo.alive);
            }
            const lap = timer.lap();
            try writer.print("foo_aos : time {d}us\n", .{(lap / std.time.ns_per_us)});
        }
    }

    {
        const ITERATION = 10;
        const LEN = 1024;
        const foo_aos = try FooAos(u31).init(LEN);
        const foo_slice = foo_aos.slice();

        try writer.print("foo_aos : ITERATION {d}, LEN {d}\n", .{ ITERATION, LEN });
        var timer = try Timer.start();
        for (0..ITERATION) |_| {
            for (foo_slice) |foo| {
                std.mem.doNotOptimizeAway(foo.alive);
            }
            const lap = timer.lap();
            try writer.print("foo_aos : time {d}us\n", .{(lap / std.time.ns_per_us)});
        }
    }
}

And the same version in SoA:

const std = @import("std");
const BoundedArray = std.BoundedArray;
const Timer = std.time.Timer;

pub fn FooSoa(comptime T: type, comptime N: usize) type {
    return struct {
        alives: BoundedArray(bool, N),
        paddings: BoundedArray(T, N),

        pub fn init() !@This() {
            return .{
                .alives = try BoundedArray(bool, N).init(N),
                .paddings = try BoundedArray(T, N).init(N),
            };
        }
    };
}

pub fn main() !void {
    const stdout = std.io.getStdOut();
    const writer = stdout.writer();

    {
        const ITERATION = 10;
        const LEN = 128;
        const foo_soa = try FooSoa(u31, LEN).init();
        const alives = foo_soa.alives.slice();

        try writer.print("foo_soa : ITERATION {d}, LEN {d}\n", .{ ITERATION, LEN });
        var timer = try Timer.start();
        for (0..ITERATION) |_| {
            for (alives) |foo| {
                std.mem.doNotOptimizeAway(foo);
            }
            const lap = timer.lap();
            try writer.print("foo_soa : time {d}us\n", .{(lap / std.time.ns_per_us)});
        }
    }

    {
        const ITERATION = 10;
        const LEN = 256;
        const foo_soa = try FooSoa(u31, LEN).init();
        const alives = foo_soa.alives.slice();

        try writer.print("foo_soa : ITERATION {d}, LEN {d}\n", .{ ITERATION, LEN });
        var timer = try Timer.start();
        for (0..ITERATION) |_| {
            for (alives) |foo| {
                std.mem.doNotOptimizeAway(foo);
            }
            const lap = timer.lap();
            try writer.print("foo_soa : time {d}us\n", .{(lap / std.time.ns_per_us)});
        }
    }

    {
        const ITERATION = 10;
        const LEN = 1024;
        const foo_soa = try FooSoa(u31, LEN).init();
        const alives = foo_soa.alives.slice();

        try writer.print("foo_soa : ITERATION {d}, LEN {d}\n", .{ ITERATION, LEN });
        var timer = try Timer.start();
        for (0..ITERATION) |_| {
            for (alives) |foo| {
                std.mem.doNotOptimizeAway(foo);
            }
            const lap = timer.lap();
            try writer.print("foo_soa : time {d}us\n", .{(lap / std.time.ns_per_us)});
        }
    }
}

The results when using poop are:

❯ poop ./foo_soa ./foo_aos
Benchmark 1 (10000 runs): ./foo_soa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           329us ± 21.0us     302us …  982us        823 ( 8%)        0%
  peak_rss            526KB ±  180KB     184KB …  815KB          0 ( 0%)        0%
  cpu_cycles         36.7K  ± 1.02K     35.3K  … 61.6K         491 ( 5%)        0%
  instructions       69.7K  ± 5.06      69.7K  … 69.8K          24 ( 0%)        0%
  cache_references    499   ± 50.1       399   … 1.08K         392 ( 4%)        0%
  cache_misses        177   ± 18.1       129   …  388          236 ( 2%)        0%
  branch_misses       479   ± 15.5       453   …  561          786 ( 8%)        0%
Benchmark 2 (10000 runs): ./foo_aos
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           338us ± 19.1us     314us …  555us       1062 (11%)        💩+  2.8% ±  0.2%
  peak_rss            840KB ± 2.30KB     733KB …  840KB        102 ( 1%)        💩+ 59.6% ±  0.7%
  cpu_cycles         37.2K  ±  972      35.8K  … 61.0K         556 ( 6%)        💩+  1.6% ±  0.1%
  instructions       69.7K  ± 4.47      69.7K  … 69.8K          19 ( 0%)          -  0.0% ±  0.0%
  cache_references    815   ± 79.0       637   … 1.89K         511 ( 5%)        💩+ 63.3% ±  0.4%
  cache_misses        223   ± 23.8       164   …  449          179 ( 2%)        💩+ 25.8% ±  0.3%
  branch_misses       481   ± 15.9       459   …  558          878 ( 9%)          +  0.3% ±  0.1%

This is a very amateurish benchmark, and it doesn’t really tell you how it will perform within an entire piece of software, but it’s meant to show that this design does reduce cache misses.

And if I set LEN to 4096, which exceeds my L1 size, the results are dramatic this time:

❯ poop ./foo_soa ./foo_aos
Benchmark 1 (10000 runs): ./foo_soa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           317us ± 21.7us     288us …  987us        871 ( 9%)        0%
  peak_rss            513KB ±  168KB     184KB …  823KB          0 ( 0%)        0%
  cpu_cycles         52.5K  ± 2.10K     51.6K  …  109K         509 ( 5%)        0%
  instructions        169K  ± 1.66       169K  …  169K         893 ( 9%)        0%
  cache_references    575   ± 58.4       399   … 1.24K         327 ( 3%)        0%
  cache_misses        159   ± 17.9       113   …  320          287 ( 3%)        0%
  branch_misses       170   ± 11.5       153   …  235          865 ( 9%)        0%
Benchmark 2 (10000 runs): ./foo_aos
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           328us ± 19.4us     306us …  549us       1088 (11%)        💩+  3.7% ±  0.2%
  peak_rss            840KB ± 2.18KB     651KB …  840KB        103 ( 1%)        💩+ 63.8% ±  0.6%
  cpu_cycles         55.2K  ± 1.68K     54.3K  …  115K         423 ( 4%)        💩+  5.2% ±  0.1%
  instructions        169K  ± 2.20       169K  …  169K         933 ( 9%)          +  0.0% ±  0.0%
  cache_references   5.25K  ±  155      4.59K  … 6.31K        2119 (21%)        💩+812.5% ±  0.6%
  cache_misses        337   ±  129       207   … 1.40K        2024 (20%)        💩+112.2% ±  1.6%
  branch_misses       177   ± 10.4       157   …  232          721 ( 7%)        💩+  3.8% ±  0.2%

Yes, it’s made specifically for this use case, but you can also just use a structure of BoundedArrays. Otherwise, yes, you can use the MultiArrayList. Of course you should measure before and after to be sure that it actually benefits your program’s execution speed; we never know what could happen within the context of a complete program. Measure twice, cut once.
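
A minimal sketch of what that could look like with MultiArrayList (the Lemming fields here are hypothetical):

const std = @import("std");

const Lemming = struct {
    alive: bool,
    x: i32,
    y: i32,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Each field is stored in its own contiguous array internally.
    var lemmings = std.MultiArrayList(Lemming){};
    defer lemmings.deinit(allocator);

    try lemmings.append(allocator, .{ .alive = true, .x = 0, .y = 0 });

    // items(.alive) is the whole dense "alive" column as a []bool,
    // so this scan never touches the x/y data.
    for (lemmings.items(.alive)) |alive| {
        std.mem.doNotOptimizeAway(alive);
    }
}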

Unfortunately no. Here is another simple example with the assembler output. The approach with two arrays only requires one branch, which the processor can easily predict, because it knows all the elements of the second array (the one being traversed) and can fill the cache accordingly. In the second example, everything depends on the second branch, which the processor cannot predict; the cache is still loaded in simple linear order, but that then leads to cache misses.

const lemming = struct {
    val: u8,
    alive: bool,
};

export fn foo(lems: *[]lemming, living: *[]u8) void {
    for (living.*) |n| {
        lems.*[n].val += 3;
    }
}

export fn bar(lems: *[]lemming) void {
    for (lems.*) |*v| {
        if (v.alive) {
            v.*.val += 3;
        }
    }
}
foo:
        mov     rax, qword ptr [rsi]
        mov     rcx, qword ptr [rsi + 8]
        xor     edx, edx
.LBB0_1:
        cmp     rcx, rdx
        je      .LBB0_3
        movzx   esi, byte ptr [rax + rdx]
        mov     r8, qword ptr [rdi]
        add     byte ptr [r8 + 2*rsi], 3
        inc     rdx
        jmp     .LBB0_1
.LBB0_3:
        ret
----------------------------------------------------
bar:
        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 8]
        xor     edx, edx
.LBB1_1:
        cmp     rcx, rdx
        je      .LBB1_5
        cmp     byte ptr [rax + 2*rdx + 1], 0 // second compare!
        je      .LBB1_3
        add     byte ptr [rax + 2*rdx], 3
.LBB1_3:
        inc     rdx
        jmp     .LBB1_1
.LBB1_5:
        ret

Here is the Godbolt link: Compiler Explorer