Slice indexing huge speed difference

Zonion · January 27, 2025, 6:27pm

Hi everyone,

I have another question for you. I’m currently building a VM and as I was profiling yesterday I ran into something I wasn’t expecting.

The VM’s stack uses a fixed size array and indexing the stack at runtime is done via a function call. When I profiled this, I saw a huge amount of time spent in a system call (I think?). I then tried different architectures for the stack and ended up trying an ArrayList instead. There were no more system calls resulting in a waay faster execution.

I did a little test in which I just do useless stack manipulation to show you what I’m talking about.

In the first program, I use a fixed size array for the values:

const Stack = struct {
    values: [256]usize,

    pub fn init() Stack {
        return Stack{ .values = .{1} ** 256 };
    }
};

fn get_idx(i: usize) usize {
    return i % 256;
}

pub fn main() !void {
    const stack = Stack.init();

    for (0..100000000) |i| {
        const a = stack.values[get_idx(i)];
        _ = a;
    }
}

When I profile this I have:

The vast majority of execution time is spent in “_plateform_memmove”.

Now, if I do this exact same thing but replacing fixed buffer by an ArrayList:

const Stack = struct {
    values: ArrayList(usize),

    pub fn new(allocator: Allocator) Stack {
        return .{ .values = ArrayList(usize).init(allocator) };
    }

    pub fn init(self: *Stack) !void {
        for (0..256) |i| {
            try self.values.append(i);
        }
    }

    pub fn deinit(self: *Stack) void {
        self.values.deinit();
    }
};

fn get_idx(i: usize) usize {
    return i % 256;
}

pub fn main() !void {
   var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var stack = Stack.new(allocator);
    try stack.init();
    defer stack.deinit();

    for (0..100000000) |i| {
        const a = stack.values.items[get_idx(i)];
        _ = a;
    }
}

I have this profiler result:

We go from a 1.9s of execution time to 269ms and it’s just because there is no more calls to “_plateform_memmove”.

I would like to understand why it behaves like this or what triggers those system calls to know exactly when or not to use an ArrayList (performance wise).

Edit: Just in case, I use zig version 0.14.0-dev.2577+271452d22 on MacOS.

Thanks!

IntegratedQuantum · January 27, 2025, 6:31pm

It’s a compiler bug. I’ve reported it while ago: Another array access performance issue. · Issue #13938 · ziglang/zig · GitHub
The issue also contains a simple workaround to the problem.

Basically the compiler needlessly copies large arrays onto the stack in some situations when accessing individual elements.

Zonion · January 27, 2025, 6:33pm

Oh wow nice, that’s exactly that. Thanks for pointing me out the issue!

Edit: In case anyone is interested, I tried the workaround described in the issue (using (&stack.values)[get_index()]) and I have slightly better performance over using the ArrayList version.

markus · January 27, 2025, 8:13pm

What profiler is this? I wanna try to use one some time soon and it seems there arent too many around for zig.

IntegratedQuantum · January 27, 2025, 8:15pm

it seems there arent too many around for zig.

You can use any profiler that supports C/C++.

Zonion · January 27, 2025, 8:18pm

“Instruments” from XCode (only MacOS)