ReleaseSafe slower than Debug? And performance degrading with larger buffers?

I have been experimenting with passing data over network sockets, and I’ve found behaviour that seems strange to me.

When I use 1k buffers:

  • ReleaseFast: 100% performance
  • ReleaseSafe: 90% performance
  • Debug: 70% performance

When I use 32k buffers:

  • ReleaseFast: 100% performance (same speed as with 1k buffers for small amounts of data)
  • ReleaseSafe: 30% performance
  • Debug: 40% performance

Can someone explain possible causes for this massive performance degradation? And why Debug became faster than ReleaseSafe?

Hello! Could you share a code sample to work with? It’s very hard to guess without seeing any details of the implementation.

I’ll just share a snippet, if that’s okay.

I don’t actually use two of the buffers from the Bufs struct. I found that just removing those speeds up my program dramatically. I thought it could be due to Zig setting the memory to undefined when runtime safety is on, so I tried turning off runtime safety for the create/destroy of that structure, which didn’t help at all.

I thought that having those buffers there should have near-zero impact because they live in a memory pool. But that seems to be what is causing my slowdowns.

fn runProxy(proxy_listener: *net.Server, counter: *Counter, gpa: std.mem.Allocator, srv_address: net.UnixAddress) void {
    var io_ev_rt = zio.Runtime.init(gpa, .{ .executors = .exact(1), .thread_pool = .{ .max_threads = 0 } }) catch @panic("OOM");
    defer io_ev_rt.deinit();

    const io_ev = io_ev_rt.io();
    var my_srv: MyProxy = .{ .counter = counter, .proxy_listener = proxy_listener, .io = io_ev, .tasks = .init, .srv_address = srv_address, .pool = .empty, .gpa = gpa };
    while (true) {
        my_srv.accept();
    }
}

const MyProxy = struct {
    const buf_size = 1024 * 32;
    const Bufs = struct {
        // cl_read and srv_read are never used below. If I delete these unused
        // buffers from this struct, I get a huge speedup in Debug and ReleaseSafe.
        cl_read: [buf_size]u8,
        srv_read: [buf_size]u8,
        cl_write: [buf_size]u8,
        srv_write: [buf_size]u8,
    };
    const BufsPool = std.heap.MemoryPool(Bufs);

    io: Io,
    proxy_listener: *net.Server,
    counter: *Counter,
    tasks: Io.Group,
    srv_address: net.UnixAddress,
    pool: BufsPool,
    gpa: std.mem.Allocator,

    fn accept(self: *MyProxy) void {
        const conn = self.proxy_listener.accept(self.io) catch |err| {
            log.warn("Accept err {t}", .{err});
            return;
        };
        self.tasks.async(self.io, handleConnection, .{ self, conn });
    }
    fn handleConnection(self: *MyProxy, cl_connection: net.Stream) void {
        defer cl_connection.close(self.io);

        var srv_connection = self.srv_address.connect(self.io) catch |err| {
            log.err("Connect error {t}", .{err});
            return;
        };
        defer srv_connection.close(self.io);

        var bufs = self.pool.create(self.gpa) catch @panic("OOM");
        defer self.pool.destroy(bufs);

        var cl_reader = cl_connection.reader(self.io, &.{});
        var cl_writer = cl_connection.writer(self.io, &bufs.cl_write);
        var srv_reader = srv_connection.reader(self.io, &.{});
        var srv_writer = srv_connection.writer(self.io, &bufs.srv_write);

        var fut = self.io.concurrent(streamUntilClose, .{ "srv->cl", &srv_reader.interface, &cl_writer.interface }) catch @panic("NO");
        streamUntilClose("cl->srv", &cl_reader.interface, &srv_writer.interface);
        cl_connection.shutdown(self.io, .recv) catch return;
        srv_connection.shutdown(self.io, .send) catch return;
        fut.await(self.io);
    }

    fn streamUntilClose(name: []const u8, reader: *std.Io.Reader, writer: *std.Io.Writer) void {
        while (true) {
            _ = reader.stream(writer, .unlimited) catch |err| switch (err) {
                error.EndOfStream => {
                    log.debug("Stream {s} done", .{name});
                    return;
                },
                else => {
                    log.err("Stream err {t}", .{err});
                    return;
                },
            };
            writer.flush() catch |err| {
                log.err("Flush err {t}", .{err});
                return;
            };
        }
    }
};

What happens if you don’t use the pool? Or if you call addCapacity and then call self.pool.free_list.popFirst().?

I’ve had code that uses the pool’s create method and for some reason it was incredibly slow. And I mean tens of milliseconds slow for creating a few hundred elements.

Also, try running it on io.threaded, it could be some zio assert in the hot path that leads to slow code with LLVM.

I increased the buffer size even further for testing to 64k per buffer.

When I put the Bufs on the stack, I got 90k requests per second in both ReleaseSafe and Debug. That is around 30% of the ReleaseFast speed, which is around 330k/s.

If I allocated it on gpa (the allocator from std.process.Init), Debug stopped working before the allocation with error.WouldBlock, which is very suspicious. ReleaseSafe was slightly faster, at 100k/s.

With the memory pool in this case, Debug runs at 130k/s while ReleaseSafe runs at 60k/s.

I am not sure I would get comparable results; I think the performance characteristics would be completely different.

I am suspicious of the buffers being arrays. Maybe there’s copying happening that’s slowing things down? Have you tried using heap-allocated buffers?

What do you mean? They are heap allocated, since they are in a MemoryPool backed by gpa, no?

Anyway, I tried bringing my own pool into the experiment, and that brought speeds back to the expected values.

ReleaseFast ~330k/s
ReleaseSafe ~300k/s
Debug ~220k/s
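
The hand-written pool itself is not shown in the thread. As a rough sketch of what such a pool might look like, here is a minimal free-list version. SimplePool, Slot and the two-buffer Bufs are hypothetical names chosen for illustration, and the sketch assumes single-threaded use. The point it illustrates is that a recycled slot is handed back untouched, with no assignment of undefined over the whole struct:

```zig
const std = @import("std");

const buf_size = 32 * 1024;

// Only the two buffers the snippet actually uses.
const Bufs = struct {
    cl_write: [buf_size]u8,
    srv_write: [buf_size]u8,
};

/// Hypothetical minimal pool, not thread safe.
/// Recycled slots are returned as-is; nothing here writes to the buffers.
const SimplePool = struct {
    const Slot = struct {
        next: ?*Slot,
        bufs: Bufs,
    };

    free_list: ?*Slot = null,

    fn create(self: *SimplePool, gpa: std.mem.Allocator) !*Slot {
        // Fast path: pop a previously used slot off the free list.
        if (self.free_list) |slot| {
            self.free_list = slot.next;
            slot.next = null;
            return slot;
        }
        // Slow path: the backing allocator is only touched for new slots.
        const slot = try gpa.create(Slot);
        slot.next = null;
        return slot;
    }

    fn destroy(self: *SimplePool, slot: *Slot) void {
        // Push the slot back; only the link field is written.
        slot.next = self.free_list;
        self.free_list = slot;
    }

    fn deinit(self: *SimplePool, gpa: std.mem.Allocator) void {
        while (self.free_list) |slot| {
            self.free_list = slot.next;
            gpa.destroy(slot);
        }
    }
};
```

In handleConnection this would be used as `const slot = try pool.create(gpa); defer pool.destroy(slot);`, with the writers pointed at `&slot.bufs.cl_write` and `&slot.bufs.srv_write`.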

I got it. std.heap.MemoryPool = bad.

It might also be because my pool is not thread safe.

What I meant is, the read/write buffers should have type []u8, and their values created by an allocator.
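
A minimal sketch of that suggestion, assuming a hypothetical SliceBufs type with init and deinit helpers (none of these names come from the original code). The struct then holds only two slices, so copying or overwriting the struct itself is cheap regardless of the buffer size:

```zig
const std = @import("std");

/// Hypothetical variant of Bufs: the struct stores slices, and the
/// buffer memory is allocated separately.
const SliceBufs = struct {
    cl_write: []u8,
    srv_write: []u8,

    fn init(gpa: std.mem.Allocator, size: usize) !SliceBufs {
        const cl_write = try gpa.alloc(u8, size);
        // Release the first buffer if the second allocation fails.
        errdefer gpa.free(cl_write);
        const srv_write = try gpa.alloc(u8, size);
        return .{ .cl_write = cl_write, .srv_write = srv_write };
    }

    fn deinit(self: SliceBufs, gpa: std.mem.Allocator) void {
        gpa.free(self.cl_write);
        gpa.free(self.srv_write);
    }
};
```

The slices can be passed straight to the writers, for example `cl_connection.writer(self.io, bufs.cl_write)`. Note that allocating and freeing per connection still goes through the allocator each time, so this changes where the cost lands rather than guaranteeing it disappears.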

What actually hurts MemoryPool for big structures is the ptr.* = undefined; assignment, which appears in both create and destroy. Just removing that fixed my performance, even when using MemoryPool.

Were you able to fix it using @setRuntimeSafety? I’ve always wondered if that would work to avoid initialization. I’m currently allocating memory with Allocator.rawAlloc to avoid initializing it in ReleaseSafe mode.

Edit: I wonder whether initialization to undefined should be Debug-mode only.
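
For reference, the rawAlloc approach mentioned above might look roughly like the sketch below. The helper names allocUninit and freeUninit are hypothetical, and the signatures assume a Zig version where rawAlloc and rawFree take a std.mem.Alignment (0.14 or later). These functions are documented as intended for allocator implementations, so treat this as a workaround rather than a recommended pattern:

```zig
const std = @import("std");

/// Hypothetical helper: get `len` bytes straight from the allocator's
/// raw entry point, bypassing the wrapper used by `alloc`/`create`.
/// The returned memory is uninitialized and must be written before it is read.
fn allocUninit(gpa: std.mem.Allocator, len: usize) error{OutOfMemory}![]u8 {
    const ptr = gpa.rawAlloc(len, std.mem.Alignment.of(u8), @returnAddress()) orelse
        return error.OutOfMemory;
    return ptr[0..len];
}

/// Hypothetical counterpart: memory obtained from allocUninit must be
/// released with the same length and alignment.
fn freeUninit(gpa: std.mem.Allocator, buf: []u8) void {
    gpa.rawFree(buf, std.mem.Alignment.of(u8), @returnAddress());
}
```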

No, I wasn’t. I just opened an issue and a PR to add a way to opt out of it.