Fastest filtered scanning of arraylist

LucasSantos91 · November 1, 2024, 11:24am

No. The MultiArrayList keeps various arrays, but they all have the same amount of elements. So it would only work if you always have the same number of alive lemmings as dead ones. You can think of it as a bunch of pointers (not slices), with a single count.

Validark · November 2, 2024, 6:53am

Most of what people said in this thread I’d agree with in spirit, although I believe there are some optimizations that have been discussed that are not beneficial for the reasons people think.

However, first I would point out the problem with benchmarking. In order for a benchmark to actually inform you of some performance characteristics of your algorithmic idea, you have to assume:

You know what you’re doing.
The compiler knows what it’s doing.
The micro-benchmark translates well to actual usage.

Often enough, none of those assumptions are true. Firstly, you should know what architecture or set of architectures you are targeting. Then you should think about the context in which the code will be used and what your expected usage patterns are. In reverse order:

Usage Patterns

I could imagine a scenario where branchy code would be as good as anything else you’d be able to come up with. If the lemmings are not changing from alive to dead to alive very often, then the branch predictor may well get pretty close to 100% accuracy. It sounds like you’re programming a game, and I saw you mention a performance difference of five hundred thousand ticks per second, if I’m interpreting correctly. Are lemmings dying or being resurrected at that rate (as @sze asked)? I’m guessing it’s extremely rare for those things to happen, relative to that many ticks per second. In that case, the branch predictor guessing incorrectly will practically never happen, just a few on startup and a few on change but not enough to notice. The branch predictor would likely hurt you less than a single L3 cache miss. That’s really important to realize. If you haven’t optimized your data, optimizing at the instruction level isn’t going to do you as much good. Eliminating jumps in the code is going to be the main benefit. You save a few cycles by not needing to jump around in the instruction stream. Don’t get me wrong, branchless code is a good idea, a lot of the time. But you have to understand why it’s good if you want to be able to predict performance accurately.

I think we would predict that a short-running benchmark would dramatically overrepresent the branch miss issue, since it would take a little time for the CPU to realize the periodicity is 128.

Also, when it comes to loop unrolling, the compiler has no information about real-world usage of your code. Maybe it’s super common to need to operate on 5 things and it’s almost impossible to do more and the compiler unrolls the loop such that it only branches every 4th iteration, meaning you always branch backward once and you do 3 unnecessary iterations every time even though you pretty much never need that. If you are trying to chase down the last cycle, you have to think for yourself how much to unroll loops (or just try different amounts).

Execution Context

Micro-benchmarks are good when they are carefully crafted to test a specific characteristic of a machine for a particular strategy. However, it’s exceptionally easy to mess this up. When comparing two strategies, the compiler might unroll one of the strategies 4 times versus 8 on the other one, it might vectorize one of them or both when in your real code maybe vectorization doesn’t make sense because you just need something to happen once. A decent enough remedy is to do a loop where you call a noinlined function, that way the compiler can be prevented from doing the tricks I mentioned.

But there are other problems. In trivial code, the compiler can sometimes change the strategy you picked into something it likes better. You might be testing a piece of code against itself! I’ve done that before by accident! That’s why you read the assembly if you actually care.

It’s also possible that getting higher performance on one piece of code comes at the expense of the rest of your code. The most obvious example of this is if you had a giant lookup table that occupied the whole L2 or L3 cache and observed that it was the fastest technique in your microbenchmark, only to realize that if you want to run literally anything else at the same time, the CPU isn’t big enough for the both of them. But also you have to weigh the cost of maintaining whatever data structure you use to accelerate one type of operation against the benefit.

And, as already mentioned, a micro-benchmark might mislead you about how big of a deal the branch misses are (or cache misses) if too much of it is start-up costs, which isn’t representative of a videogame that you expect people to play for longer than half a millisecond.

Know your hardware

In general, I’d prefer the bitmap approach over the array of indices approach, but you sort of need to know what you want in order to do that most efficiently. It also depends on your hardware. Which architectures are you most interested in?

If you are targeting ARM or Aarch64, you don’t have a count-trailing-zeroes, you have a bit-reverse and a count-leading-zeroes. If you are going for every available cycle, you’d want to store the bits in an order that corresponds to big-endian-ordered bytes. On x86 since Intel’s Haswell or Zen 3 on the AMD side, you have a pdep instruction which can be used to unset the lowest X one bits at once, with a latency of 3 cycles. That means if you have a chain of blsr instructions (unset lowest bit), you can jump ahead in the serial dependency chain and extract more ILP using pdep instead. The compiler doesn’t know about this trick, so I opened an LLVM issue here.

Also, if you operate on a u128 directly, more than likely the compiler won’t understand how it can just operate on each 64‐bit half, so just use [2]u64 and type it in yourself what it should do.

Branchless code and SIMD

Sometimes doing a little extra work has performance benefits. E.g. @chrboesch’s approach could be improved by doing v.*.val += if (v.alive) 3 else 0;. You may end up doing an unnecessary addition with 0, but it will be faster than taking a branch, even a correctly predicted one (on modern, powerful hardware).

Even better would be if you had a 128 bit bitstring, and with each VEC_SIZE piece of it you create a vector of the values you want to add together, then you add the vectors together and do an add reduction. How fast that is depends on the hardware. Horizontal operations are not the main thing SIMD is good for though, so be careful.

If you can’t efficiently run the dead lemmings through no-op code, you might try using a vector compression operation if your target hardware has good support. x86’s AVX-512 is the best, RISC-V has good facilities too, but some of them are hard to access in Zig since we don’t have vscale vectors (or maybe I just don’t know how to get the compiler to emit the RISC-V vector instructions I want, I haven’t tried that hard yet). If you have a fast pext instruction you can use that to get fast 16 byte compression as well (by operating on 16 nibbles and then expanding each nibble into a byte in a vector and doing a vector shuffle). Another option is “gather” operations. This is where you have a vector of pointers or indices into an array and tell the hardware to look all of them up at once and fill a vector with the results. I’ve actually seen the compiler figure out it can do this automatically in some cases! It’s pretty slow compared to staying in registers though, so obviously it shouldn’t be your first choice, and some hardware has a very slow implementation, like AMD machines before Zen 4.

In general, you should store the data you care about for the loop in a separate buffer or buffers, and try to operate on all of it simultaneously in SIMD if possible. Even if not, try to run the dead lemmings through no-op code if that’s inexpensive, otherwise use vector compression to isolate just the data you need. You could use a vector gather but for me that’s a last resort.

Oh, and by the way, if you do have vector compression, going from a bitstring to a vector of indices can be done as efficiently as an L1 or L2 hit, depending on whether you have to expand the results of the vector compression from e.g. a byte to 4 bytes. And the bitstring will be faster to query for other use cases too.

I’m a bit pressed for time but if you provide more details I can help you choose the best strategy. Also if you send me some code to look at I might read through it when I have some time and render my opinion on it.

And the most important thing is to optimize your data access pattern first! No amount of instruction level optimization is going to make up for the performance you can throw away on unused or bloated data.

ericlang · November 2, 2024, 7:44am

I was looking for that. (I think the guy who made that ‘hyper optimizing the compiler / parser’ on youtube used that. I thought it was an interesting idea. No clue how to actually do it. I’ve done very little simd.

Furthermore, I just started with Zig, but withing a week I got further than 2 months of Rust.
This is my language.

By the end of december - when the code is ready to simulate all possible actions in real levels - when the messy state it is now in is cleaned up - it will be on Github.

squeek502 · November 2, 2024, 8:29am

That guy is the guy you’re replying to

(context for those unfamiliar: part 1, part 2, part 3)

ericlang · November 2, 2024, 9:16am

nice to see part 2 en 3 mentioned there as well.

Validark · November 2, 2024, 12:55pm

I’m on my phone right now but this should work for 64 byte vectors on AVX-512 targets:

fn compress(compressable: @Vector(64, u8), mask: u64) @Vector(64, u8) {
        if (!@inComptime() and comptime std.Target.x86.featureSetHas(builtin.cpu.features, .avx512vbmi2)) {
            return struct {
                extern fn @"llvm.x86.avx512.mask.compress.b"(@Vector(64, u8), @Vector(64, u8), u64) @Vector(64, u8);
            }.@"llvm.x86.avx512.mask.compress.b"(compressable, @splat(0), mask);
        }
}

I’ll get more general purpose code for you later.

ericlang · November 4, 2024, 10:02am

That is a complicated one.
I didn’t do any vectors yet in Zig.

const x = @Vector(2, u64)  { 0, 0, 0, 0 };
const y: ??? = @splat(x???); // it seems there is no argument type-inference here?

I also was wondering why Zig’s StaticBitSet(128) produces an ArrayBitSet instead of some u128 or Vector .

Then regarding this whole optimization and L1 cache: The game accesses a chunk of 512.000 bytes inside lemming.update() so I have to let than one go I believe.
But skipping the bits is I believe a good option.

Edit: would a u128 not be easier? And use bitscanning?

Calder-Ty · November 4, 2024, 11:20am

Looking at the source, I see that StaticBitSet creates an array bit set for any size value greater than 64.

The docs say that integer bitsets with large backing integers will produce inefficient code. So either this choice is a hint at the ideal size for integer bit sets, or a bug.

I’m guessing the choice of 64 bits has to do with register sizes on common architectures, but have no real clue.

https://ziglang.org/documentation/master/std/#src/std/bit_set.zig

kcbanner · November 4, 2024, 1:56pm

Something to keep in mind when testing with std.bit_set is that you may be hitting this bug: sometimes there is an unwanted memcpy when passing large structs by-value · Issue #17580 · ziglang/zig · GitHub

Sze · November 4, 2024, 5:34pm

headless update (no render code)

I think with using SoA layout the animation should completely disappear from the update code, so that you can run simulations that don’t have to do any rendering (or work that is related to rendering).

Instead of counting up which frame in the animation cycle is currently playing you could safe the offset of when the animation cycle started and use that to calculate, alternatively drawing related things could have their own much smaller update embedded (for non gameplay related, animation behavior), or you have 2 update functions one for things that are only needed when actually drawing and the other for all the things that have an influence on the outcome of the gameplay.

think about dependency chains, figure out how to “cheat”

But that is because you have a “monolithic” update function.
I think the biggest problem is the serial dependency chain you have because the first lemming being updated can influence how the second lemming will be updated.

I assume that the way they can influence each other is mostly what type of lemming they are, whether they are a stopper, dead, exploded, etc.
So I would try to find out, which of these can be detected early and what is the fastest way to find that out in the code.

For example some depend on player actions, so maybe if there aren’t any player actions there can be a fast path that knows to skip checks for those.
Then others may only be possible if two lemmings are currently close to another, for that you could have some data structure that tracks close lemmings, to allow a quick check for that. (that fits within a few cache lines)

If that problem was solved and you would be able to store just the relevant information to know that lemming 7 represents a barrier / serial dependency to the following lemmings and lemming 78 does too.
Then you could write optimized routines that process multiple lemmings in parallel in-between these barriers, especially if you find lots of ways to make certain things that don’t apply to certain lemmings into no-op’s like @Validark said. (Basically ask yourself this question: how much of my update logic can be turned into a math formula? Where parts of it are 0 or 1 causing some term to not be added to the result)

two implementations (or more)

I think it would make sense to have 2 implementations one that is just as simple as possible and serves to be used for validation that the produced results are still the same and a second one that uses clever support data structures to track things that tell you when and where you can take shortcuts without producing different results. Then you can run these 2 implementations in parallel to check that the fast one doesn’t introduce any bugs in the results.

groups of lemmings and skipping frames

Just another example, when lemmings fall of a cliff and there aren’t any player interactions you probably can pre-compute where and when they land, so instead of just taking shortcuts or running more in parallel, you might be able to change things so that your code can find ways to skip several frames of execution (for example by ignoring animation or encoding its periodicity).
Basically separating lemmings that are able to have an effect from those that don’t. For example if there is a group of lemmings that runs back and forth between 2 stopper lemmings, your fast implementation should be able to recognize that as a closed system, group those lemmings, calculate the periodicity and then at any frame in the future where an outside lemming influences that group, you calculate their state the frame before and continue the non-predictive code.

So instead of saying a generic lemming in general can possibly influence the next lemming. Stop treating the lemmings generic, at any point in time it has a specific type and that type influences whether it can influence other lemmings.

That should allow you to chunk groups of lemmings that can be simulated as a group. For replays you also have the input data for the future that might help in finding ways to skip forward till the next input changes the cause of repeating outcomes. (When a group of lemmings is walking off a cliff that is high enough so they splat, then skipping forward in time means to calculate how many haven’t walked off yet and whether there are any left that haven’t hit the ground yet at the next input event)

other possible sources for inspiration

Also it may make sense to look into the predictive networking code of first person shooters and how they do client side prediction and rollback for inspiration. Another thing that might have useful ideas might be Hashlife - Wikipedia, which can simulate some patterns in Conway’s Game of Life more efficiently (if they contain patterns that get recognized by the algorithm), while being worth at other patterns.

Overall these suggestions are a bit for future work / research and fun optimization directions you could think about. But I think if you want to validate replays really fast recognizing groups of lemmings and their periodic nature could enable some great speedups, you probably then would need to create heuristics around when it is worth to switch between different implementations based on when they are fast, or maybe create a hybrid implementation that incorporates different strategies and switches between them automatically (or maybe even races them in parallel sometimes on different cores and picks the one that finishes the subtask faster, when it isn’t known before which one would be quicker).

These are mostly ideas for fun things you could try, instead of having just one implementation that burns through replay frames naively as quickly as possible, have other implementations that try other clever strategies. Then you can run these different strategies and for example find out that certain levels run faster on that implementation then that one, etc…

ericlang · November 4, 2024, 6:48pm

It would be quite a challenge to handle ‘distant’ lemmings in parallel. Or computing certain things like a formula. Certainly there are some interesting ideas here!

Indeed it is a monolothic, sequential update function, because of the sequential nature of the game. Note: I want the mechanics to be very close to the original, except the bugs (which I emulated in my Delphi version) and except the memory limitations there were in 1992, which forced the makers to do rounding / truncating of trigger area’s etc.

The “512000” bytes is only the interaction with terrain / traps. Bothering me there is boundary checks on each pixel.
Anyway: I think making the basic loop faster with SIMD or adjusting an array is not going to help.

By the way: there is zero graphics / messages / interaction going on in the game-engine right now. The animation info that is there is only the information which is part of the mechanics.

In a complicated scenario with 128 lemmings the FPS is around 1.5 million in releasefast mode.
My 2D masking routines could certainly be faster, but that is off-topic here.

Currently I am building a plethora of test-routines to check the validity of all mechanics.

Sze · November 4, 2024, 7:32pm

I could be wrong, but my suspicion is that you are overestimating the effect of serial things and how much they prevent applying parallelization, to confirm or deny that, this would be my test:

create 2 naive implementations, the first is the current one, the second is one that tries to do things parallel without considering things that changed in this frame, every time the second one produces a different outcome then the first, it gets reset to the state the first produced. You then run this on a lot of replays and count how often it produces wrong frames, based on that you can evaluate whether it would be a worthwhile optimization, without spending that much time on it. But maybe I am not thinking of something in my thought experiment.

My intuition tells me there should be a good percentage of frames that could completely work in parallel, but maybe it is off, or making that work requires more effort then I currently imagine.

LucasSantos91 · November 4, 2024, 8:13pm

I’ve never played Lemmings but this doesn’t sound right to me. If this were true, the outcome of the game could change depending on what lemming the game decided to process first. Usually, on games like this, each game tick is based on the previous game tick. That is, at the beggining of each tick, the game takes a snapshot of the state of the world, and every Lemming acts based on this snapshot. What other Lemmings are doing in the same tick doesn’t matter. This makes the game behave as if every Lemming was acting at the same time.
Consider if Lemming A decides to punch Lemming B and, at the same time, Lemming B decides do punch Lemming A. If the games process A first, A wins. If it process B first, B wins. That’s not how we would expect the game to behave. With the snapshot approach, both punch each other simultaneously, which is ideal.
With this approach, the behavior of the Lemmings is trivially parallelizable, as every one depends on the same immutable snapshot.

Sze · November 4, 2024, 8:19pm

Ideally that would be true, but @ericlang has already confirmed that there are intra frame dependencies, I think you are right that it would be much easier to parallelize if those could be avoided. But if the original game didn’t then it isn’t such an easy task to get closer towards that, without changing gameplay behavior.

The original games are old and likely ran on single cores, so it is also likely they didn’t care about parallelization and instead cared more about saving memory, for example by not having double buffering.

ericlang · November 4, 2024, 8:22pm

Yes that is most definitely the case. If lemming #1 is building a brick in frame #100 and lemming #2 is in that place it steps up that brick in frame #100.
Around 2006 I got all mechanics in pseudo code from a guy who disassembled the original executable.

Validark · November 5, 2024, 12:00am

Well, my tokenizer project has inherent dependency chains too (in a quote vs in a comment vs open code) and I still use SIMD and bit manipulation for everything. If there is any parallelism there you might be able to efficiently exploit it. E.g. if most lemmings are not modifying the environment for subsequently processed lemmings then maybe you can still process most of them in parallel.

Maybe I’ll look at your code around the end of the year and look for optimization opportunities.

mnemnion · November 5, 2024, 1:38am

I would consider trying this again with two 64 bit IntegerBitSets. Make a [2]IntegerBitSet(64), and unroll the iterations, with the second loop using a static +64 into the array:

while (alive[0].next()) |i| {
      lemming[i].doStuff();
} 
while (alive[1].next()) |i| {
    lemming[i+64].doStuff();
}

If you take a look at the source you’ll see that an IntegerBitSet Iterator is completely branchless, it’s all masks. If you used a single u128 you’re relying on some compiler magic which might not kick in. The StaticBitSet is more complex (therefore branchy) and for exactly 128 flags, you’re not getting the advantages which come with that.

ericlang · November 5, 2024, 7:20am

This calls for a video then: “hyper-optimizing the lemmings engine”

Validark · November 6, 2024, 4:39pm

Haha, maybe so, maybe so. That could be a fun one since video games are typically more relatable to people since the output is more fun.