Fastest filtered scanning of arraylist

I have a list of game-entities inside an arraylist.
The list has a capacity of (max) 128.
When an entity dies I flag it as removed.
When looping through the list what would be the fastest way to skip the removed entities?

  • I do not delete anything from the list
  • I do not want to swap removed entities to the end because of statistics / other internal reasons.

Currently it is:

for (entities.items) |*e|
{
    if (e.is_removed()) continue;
    // process e
}

which is suboptimal.

I was thinking about some crazy 128-bit SIMD instruction, but I don’t know how.
Other tricks welcome!

IMO using a separate std.bit_set for “is removed” status is pretty good. You can loop through “set bits” very quickly. I posted about using this technique to implement a simple ECS here: Simple ECS implementation inspired by Zig's MultiArrayList

2 Likes

That could be an idea. I can’t read your complete code (yet) because I am just starting.
But the main idea is probably this (I wrote a chess engine, so I have used this widely used trick):

  • Get the next set bit (LSB) index.
  • Grab the entity by that index inside the list.
  • Clear the LSB.

Is that right?
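A sketch of those three steps with a raw u128 mask (`Entity` and `update` are placeholder names, not from the thread):

```zig
// Hypothetical: `alive` has a set bit for every entity that is not removed.
fn updateAlive(entities: []Entity, alive: u128) void {
    var bits = alive;
    while (bits != 0) {
        const index = @ctz(bits); // index of the lowest set bit
        entities[index].update();
        bits &= bits - 1; // clear the lowest set bit
    }
}
```

This is the same bit-scan idiom chess engines use on bitboards, just with a 128-bit mask instead of a 64-bit one.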

You could store ?T and set the item to null to mark removal. Then it’s just:

if (maybe_e == null) continue;
// or
if (maybe_e) |e| {
    // process e
}
2 Likes

I don’t know if I can say it any better than what I already wrote in the linked post, but I think the idea is similar to what you had in mind: use fast bit scans to skip over “unset” elements. So for example, if you have 64 consecutive elements all “not removed”, the code looping over your entity list would be able to skip all 64 elements at once as all the “removed” bits in a 64-bit int would be zero. In my original post, I did also verify by looking at the generated assembly that the compiler can use the “count the number of trailing zero bits” (tzcnt) instruction on x86 to further speed up looping over the individual bits.
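A minimal sketch of the std.bit_set variant, assuming a set bit means “alive” and `Entity`/`update` stand in for the real types:

```zig
const std = @import("std");

// One bit per slot in the 128-entity list.
var alive = std.bit_set.IntegerBitSet(128).initEmpty();

fn processEntities(entities: []Entity) void {
    // The iterator yields only the indices of set bits, so whole
    // runs of zero bits are skipped a word at a time.
    var it = alive.iterator(.{});
    while (it.next()) |index| {
        entities[index].update();
    }
}
```

Marking an entity removed is then just `alive.unset(index)`, with no change to the entity list itself.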

Ok, understood.
The question is when it gets faster and when not.
Looping over 128 elements checking a bool is quite fast even if you skip 127 of them.
I will have to measure that…

Do you have many of those 128-element collections? Otherwise it seems unusual to optimize this. You would have to check, but just fitting the elements into cache lines and making the loop over all 128 elements as fast as possible might be enough (not necessarily, but it could be).

Basically I think you should look at your cache-line utilization: if you touch all the cache lines anyway, then I don’t think avoiding a few elements here and there will actually give you a speedup. But this also depends on how big your elements are and their memory layout.

This also reminds me a bit of the “Sparse Set” data-structure debate in the ECS world, but if you really are limited to a maximum of 128 elements, it seems likely that using a bitset is better. Still, it seems statistically unlikely that you would skip entire cache lines, which is why swap-removing elements seems better to me (it avoids entire cache lines and keeps the remaining ones densely packed).

If you are willing to share more details so that people can actually work on an equivalent example, you could also ask Validark to help you optimize this; he has expressed interest in helping with performance optimizations related to SIMD.

But currently I think there isn’t enough information about the actual memory and operations to come up with code that will match. For example what is @sizeOf(Entity)?

I think you should reconsider using swap-remove, are you sure you can’t change your other code so that you don’t have to keep order?

It would also be good to measure/analyze which parts matter most for performance: what is the code with the biggest impact, and is that what you are optimizing?

Ideally something like GitHub - plasma-umass/coz: Coz: Causal Profiling could tell you which parts of your program would be worth optimizing. I eventually want to try that out in practice but I haven’t yet (so I am not sure what would be required to use it with Zig).

1 Like

The main performance problem here will be branch misses because e.is_removed() will basically be randomly true or false for each element.

I think the best solution would be to remove the check entirely.
To do that we can apply some ideas from data oriented design:
Instead of storing a bool in each element, we can store two (dense) lists of indices to elements.
One list contains all the active entity indices, and one list contains all the removed entity indices.

This allows iterating only over the active entities without needing to ever branch on the value:

for(activeEntityIndices.items) |index| {
    const e = entities[index];
    // process e
}
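Maintaining those two index lists on death could look something like this (a sketch; `active`, `removed`, and `killEntity` are assumed names, and a managed std.ArrayList API is assumed). Note that swapRemove reorders only the index list, never the entity array itself:

```zig
const std = @import("std");

fn killEntity(
    active: *std.ArrayList(u8),
    removed: *std.ArrayList(u8),
    pos_in_active: usize,
) !void {
    // O(1): swap the last active index into this slot...
    const entity_index = active.swapRemove(pos_in_active);
    // ...and remember the dead entity's index for stats etc.
    try removed.append(entity_index);
}
```

The cost is that you need to know an entity’s position inside `active` (or search for it) when it dies, which is the bookkeeping Casey Muratori warns about further down in this thread.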

And always remember:
Don’t trust random people’s opinions on performance, always measure it yourself!

3 Likes

This reminded me of Andrew Kelley’s talk on DOD, specifically this part. Maybe you’ll get some ideas from there.

All these ideas come more or less from the book “Data-Oriented Design” by Richard Fabian.

Just found it online, thanks for sharing.

1 Like

I’ve seen Andrew’s DOD talk online, but I don’t yet have a real idea of the benefits in my case.

To sketch the situation a bit: as you can see from my picture, I am remaking (or trying to remake) my Lemmings clone. My finished one is written in Delphi and works quite well; it is on GitHub and called Lemmix.

Lemmings can die very quickly or not at all and just run around.
In Zig the lemming is (now) around 60 bytes and will not become much larger than that.
As mentioned I create an arraylist with a capacity of 128 which is the maximum.

A little test in release mode showed an update speed of around 3.2 million game ticks per second with this simple loop:

for (self.lemmings.items) |*lem|
{
    // no check for dead or alive
    lem.update();
}

where the internal update function changes with the job the Lemming is executing.
When the lemming dies the update function is set to this one:

fn handle_none(self: *Self) void
{
    _ = self;
}

A little test using a u128 with bit checking showed a performance decrease in the case where all 128 lemmings are alive.

The reason I need the best performance is because I want to be able to check things like replay files, do a multi-frame-skip in-game or feed stuff into some A.I. system.

For now I stick with the above code and will read some DOD.

The DOD idea in your case is the following: you have two arrays, the first with all the lemmings, the second (a [128]u8) with only the indices of the living lemmings. Every time you need to update the living lemmings, you simply iterate through the array of living indices, which links by index into the actual array. This is much faster because you don’t have any if statements; those are the most time-consuming part because the unpredictable branches cause branch mispredictions.

for(living_lems) |i| {
   doSomething(lemmings[i]);
}

I agree with everything you said here. However, I’d like to point out that Casey Muratori strongly suggested the approach that OP is taking in quite a few places. More recently, he did an interview with Primeagen, where Prime was asking tips for how to write a game (which he’s writing in Zig), and Casey specifically told him to not move entities between lists. He said that the complexity of an entity leaving a list is not worth any performance gain you might get. Here, with only 128 entities maximum, maybe the dumbest and most straightforward approach might actually provide the most performance, and be many times easier to implement.

2 Likes

That is what I was thinking! The dumbest approach could very well be the fastest.

With 60 bytes per lemming, assuming a bunch of 4-byte fields, that is 15 fields. It seems unlikely that all of those fields are accessed all the time, so I would suggest trying out a std.MultiArrayList and looping only over the required fields in parallel. Basically I think SoA, or some form of splitting hot and cold data, could be beneficial, but it may only help if it lets you skip larger parts of the cold data and thus pack more of the hot data into fewer cache lines.
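A minimal sketch of the MultiArrayList idea (the struct and field names here are placeholders, not the actual Lemming layout):

```zig
const std = @import("std");

const Lemming = struct {
    pos_x: i32,
    pos_y: i32,
    explosion_timer: u8,
};

// SoA storage: each field lives in its own contiguous array.
var lemmings = std.MultiArrayList(Lemming){};

fn tickTimers() void {
    // Touches only the explosion_timer bytes, not whole lemmings.
    for (lemmings.items(.explosion_timer)) |*timer| {
        if (timer.* > 0) timer.* -= 1;
    }
}
```

`items(.field)` hands back a dense slice of just that field, which is what packs the hot data into fewer cache lines.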

Further, I would find it interesting to know what those 60 bytes are. Are they already packed densely?

I think trying different approaches is needed to know for sure.

Considering that for many game ticks only a small fraction of the maximum number of lemmings may exist, an approach that only has to iterate over, say, the 20 lemmings that are alive may be faster and may be able to compute more ticks.

It could also be faster to have separate lists for each type of lemming, so that the update function can be type-specific (updating all lemmings of one profession at once, instead of the behavior being driven by a field in the lemming, if that is possible) rather than having to switch/dispatch on lemming type; certain fields may also be unnecessary for specific lemming types. From watching some Lemmings gameplay on YouTube, I think there are far more game ticks than actual occurrences of lemmings changing their profession or dying, so separate lists may accumulate performance wins over many ticks by never having to skip irrelevant data.

There are some subtle problems like update order that influence what kind of organization is possible, for example if the original game updates lemmings from first spawned to last spawned and demands that behavior for authentic gameplay, then there may be update dependencies that make it difficult or impossible to update them in a different order.

Yes, the multi array list is definitely something I want to have a look at!
Some of the ideas here are very interesting!
The current data-state of my lemming is like this.
Not optimized… just working on the port from Delphi. Starting phase.

(Btw, I am a total GitHub noob, but it will be there when the mechanics are working.)

The mechanics are very very specific in the old DOS game. And much of it should be kept intact. The order of things is strictly sequential for example.
Lemming 1 can change the terrain for lemming 2 during one frame.

pub const Lemming = struct // 56 bytes now
{
    game: *Game,
    list_index: u7,
    action: LemmingAction, // enum(u8)
    update_function: *const UpdateFunction,
    state: LemmingState,
    animation: LemmingAnimation,
    pos: Point, // (i32, i32)
    delta_x: i32, // this could be a i8 too
}

const LemmingState = struct
{
    const FLAG_CAN_CLIMB: u8 = 1 << 0;
    const FLAG_CAN_FLOAT: u8 = 1 << 1;
    const FLAG_IS_DYING: u8 = 1 << 2;
    const FLAG_IS_DEAD: u8 = 1 << 3;

    born: u32,
    death: Death, // enum(u8)
    flags: u8,
    fallen: i32 = 0,
    excavated_pixels: u8 = 0,
    explosion_timer: u8 = 0,
    remaining_bricks: u8 = 0,
}

pub const LemmingAnimation = struct
{
    current_frame: u8,
    frame_count: u8,
    cycles: u8, // we actually only need cycles for zero or one for checks. after 255 it resets to zero.
    float_param_index: u8,
    frame_width: u8,
    frame_height: u8,
    foot_x: u8,
    foot_y: u8,
}

const UpdateFunction = fn(*Lemming) void;

Why does the lemming have a pointer to *Game?

I would have expected the game to own all the lemmings so that there is no need for a pointer back to the game.

Instead of update_function, consider having just an enum and switching on it; that way you can likely use a u8 to back the enum.

pub const Lemming = struct // 33 bytes now
{
    list_index: u7,
    action: LemmingAction, // enum(u8)
    update_state: UpdateState, // enum(u8)
    state: LemmingState, // 13 bytes
    animation: LemmingAnimation, // 8 bytes
    pos: Point, // (i32, i32)   8 bytes
    delta_x: i32, // this could be a i8 too
}

I think this with MultiArrayList could mean that your update loop touches way less memory.

For flags I would use a packed struct of booleans, but that is just for not having to deal with manual bitshifts and easier syntax reading/writing those flags.

Maybe you could use the packed struct to pack the flags together with the Death enum. If there aren’t more than 16 different deaths (enum(u4)), that would spare another byte, making the lemming struct 32 bytes; powers of 2 are always nice to have.

const LemmingState = struct
{
    const Death = enum(u4) { splat, boom, drown, ... };
    const Data = packed struct {
        can_climb: bool,
        can_float: bool,
        is_dying: bool,
        is_dead: bool,  // is dead could be part of the death enum as one of the states, instead there would be an alive state in the enum, that would give us 31 kinds of death
        death: Death,
    }; // 1 byte

    born: u32,
    fallen: i32 = 0,
    data: Data,
    excavated_pixels: u8 = 0,
    explosion_timer: u8 = 0,
    remaining_bricks: u8 = 0,
} // 12 bytes
1 Like
const Data = packed struct {....}

1 byte. REALLY? Is this some trick I don’t know yet?

I am not yet 100% sure of the structure.
At least some backlink is needed. The game owns the lemmings, yes. Putting the mechanics code inside lemming is more of a convenience, to avoid game.zig getting too big.
Anyway, game holds the terrain (world) as well, and the lemming needs that one.
Additionally, I also need some ‘store’ to load stuff, and I want to avoid too many function arguments (as were needed in my abandoned, borrow-checked Rust version).

fn handle_walking(self: *Self) void
{
    const gw = &self.game.world; // backlink needed
    // check terrain etc.
}

I thought this comptime thingy would be very fast when changing jobs, avoiding a ‘switch on action’ lookup for the right function:

fn transition(self: *Self, comptime new_action: LemmingAction) void
{
    self.action = new_action;
    const index: u8 = @intFromEnum(new_action);
    self.update_function = LEMMING_UPDATE_FUNCTIONS[index]; // global fixed array
}

and here the update function

pub fn update(self: *Self) void
{
    const func = self.update_function;
    func(self);
}
1 Like

Langref - packed struct
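To illustrate why it is 1 byte: in a packed struct the fields share one backing integer, so four bools (1 bit each) plus an enum(u4) add up to exactly 8 bits. A quick comptime check (the Death members here are just examples):

```zig
const std = @import("std");

const Death = enum(u4) { none, splat, boom, drown };

const Data = packed struct {
    can_climb: bool,
    can_float: bool,
    is_dying: bool,
    is_dead: bool,
    death: Death,
};

comptime {
    std.debug.assert(@bitSizeOf(Data) == 8); // 4 x 1 bit + 4 bits
    std.debug.assert(@sizeOf(Data) == 1);
}
```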

With games I usually end up having a context object that is the root object and contains or points to everything that is needed, then I just have to pass the pointer to that context object around as function argument.

I think it is better to pass one or even a few function arguments around on the stack than to put redundant pointers into your instances where they aren’t really needed and mess up your tight memory layout.

Abandon individual-element thinking as much as possible; whenever possible, try to group instances and deal with them as groups. When you have an array of lemmings, it doesn’t make sense to store the identical pointer to game in every element.

A switch is simpler than function pointers. It is less flexible and syntactically more constrained, and this is good because the compiler has to do less analysis to figure out what can be optimized.

If you don’t need the flexibility of function pointers, writing a switch is simpler and more likely to be optimized well. You also get the additional benefit of the compiler telling you if you forget to handle a case, and fewer pieces to worry about and keep in sync.

I would always prefer going with a switch first, if it turns out you need some dynamic option later, you always have the option to add a prong that uses a function pointer indirection later.

“Code organization” is not a good reason to avoid a switch and turn it into a table of function pointers; it needlessly scatters the implementation and adds unnecessary levels of indirection. Let the compiler optimize the switch into jump tables or other constructs if it sees a reason to. Don’t believe the “Clean Code” dogma: big functions are fine and can sometimes be easier to understand, especially if they are mostly one big switch statement.

Using switch statements is the better default, because you aren’t starting out assuming (and paying for) flexibility you may not need.
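A sketch of what that switch-based update could look like (the action names and handler functions are placeholders for whatever LemmingAction actually contains):

```zig
pub fn update(self: *Lemming) void {
    switch (self.action) {
        .walking => self.handleWalking(),
        .falling => self.handleFalling(),
        .none => {}, // dead lemming: nothing to do, like handle_none
        // Adding a new LemmingAction value without a prong here
        // is a compile error, so cases can't silently go unhandled.
    }
}
```

Compared to the function-pointer table, a transition is then just `self.action = new_action;` with no second field to keep in sync.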

6 Likes