Building an HTTP server with no per-request allocations in Zig

I’ve been building swerver, an HTTP server in Zig. One design goal from the start was zero heap allocations on the HTTP/1.1 request hot path. Not just for benchmarks, but so that the server’s memory behavior stays fully predictable under load.

This post covers the memory architecture: what gets pre-allocated, how requests are parsed without copying, and where the tradeoffs hit.

swerver is currently the fastest server on HttpArena for JSON processing and short-lived connections.

The shape of the problem

A typical HTTP server allocates memory in a few places per request:

  • A buffer for the incoming bytes
  • Parsed header structures (name/value pairs, method, path)
  • Whatever the handler needs (JSON encoding, template rendering)
  • The outgoing response buffer

If you’re allocating and freeing all of that per request, you’re at the mercy of your allocator’s fragmentation behavior under load. At 100K+ req/s, even a fast general-purpose allocator becomes a meaningful cost, both in CPU time and in the unpredictability it adds to tail latency.

The alternative: allocate everything up front.

Buffer pools: a contiguous slab with a free stack

All request and response buffers come from a pre-allocated pool. At server init, we allocate one contiguous slab. By default, 4096 buffers × 64 KB = 256 MB per worker process:

pub const BufferPool = struct {
    storage: []u8,          // one contiguous allocation
    free_stack: []u32,      // LIFO free list of buffer indices
    free_len: usize,
    acquired: []bool,       // double-release detection

    pub fn acquire(self: *BufferPool) ?BufferHandle {
        if (self.free_len == 0) return null;
        self.free_len -= 1;
        const index = self.free_stack[self.free_len];
        self.acquired[index] = true;
        return .{ .index = index, .bytes = self.bufferSlice(index) };
    }

    pub fn release(self: *BufferPool, handle: BufferHandle) void {
        self.acquired[handle.index] = false;
        self.free_stack[self.free_len] = handle.index;
        self.free_len += 1;
    }
};

A BufferHandle is an index + slice pair. Acquire pops from the stack, release pushes back. Both are O(1), with no locks (each worker is single-threaded) and no fragmentation. The acquired array (one flag per buffer) is there to catch double-release bugs at runtime rather than letting them silently corrupt the free list.

The pool never grows or shrinks. If it’s exhausted, the server stops accepting new connections until buffers are returned. Backpressure instead of OOM.
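
A minimal, self-contained sketch of the same free-stack idea, including the exhaustion and double-release paths. The name FixedPool, the tiny sizes, the reset function, and the error.DoubleRelease return value are assumptions made for illustration; they are not taken from swerver’s source:

```zig
const std = @import("std");

// Hypothetical sizes for illustration; the post's defaults are 4096 x 64 KB.
const buffer_size = 64;
const buffer_count = 4;

const BufferHandle = struct {
    index: u32,
    bytes: []u8,
};

const FixedPool = struct {
    storage: [buffer_size * buffer_count]u8 = undefined,
    free_stack: [buffer_count]u32 = undefined,
    free_len: usize = 0,
    acquired: [buffer_count]bool = [_]bool{false} ** buffer_count,

    // Pushes every buffer index onto the free stack; call once at startup.
    fn reset(self: *FixedPool) void {
        for (&self.free_stack, 0..) |*slot, i| slot.* = @intCast(i);
        self.free_len = buffer_count;
        @memset(&self.acquired, false);
    }

    // Pops one index. A null result means the pool is exhausted and the
    // caller should apply backpressure instead of allocating.
    fn acquire(self: *FixedPool) ?BufferHandle {
        if (self.free_len == 0) return null;
        self.free_len -= 1;
        const index = self.free_stack[self.free_len];
        self.acquired[index] = true;
        const start = @as(usize, index) * buffer_size;
        return .{ .index = index, .bytes = self.storage[start .. start + buffer_size] };
    }

    // Refuses to push an index that is not currently acquired, so a double
    // release cannot put the same buffer on the free stack twice.
    fn release(self: *FixedPool, handle: BufferHandle) error{DoubleRelease}!void {
        if (!self.acquired[handle.index]) return error.DoubleRelease;
        self.acquired[handle.index] = false;
        self.free_stack[self.free_len] = handle.index;
        self.free_len += 1;
    }
};
```

In this sketch the null from acquire is the backpressure hook: an accept loop would stop taking new connections until a release makes a buffer available again.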

Parsing without copying

The HTTP/1.1 parser takes a mutable byte slice (the read buffer) and a pre-allocated header array, and returns slices pointing back into that buffer:

pub fn parseHeaders(_bytes: []u8, _limits: Limits) HeaderParseResult {
    // ...
    const method_str = line[0..first_space];          // slice into _bytes
    const request_target = line[first_space + 1 .. second_space];  // slice into _bytes
    // ...
    _limits.headers_storage[header_count] = .{
        .name = name,    // slice into _bytes
        .value = value,  // slice into _bytes
    };
}

The parser never allocates. It never copies. The parsed method, path, and every header name/value are slices that borrow directly from the read buffer. Zig makes this natural: slices carry pointer + length, the borrow is explicit, and there’s no hidden reference counting or copy-on-write. But there’s no compile-time enforcement either. If you return a slice into a buffer that’s been released, you get a dangling pointer. The safety comes from the ownership discipline, not the language.
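
To make the borrowing concrete, here is a small sketch that parses only the request line and returns sub-slices of its input. It is not the swerver parser: RequestLine, findByte, and parseRequestLine are hypothetical names, and the validation is deliberately minimal.

```zig
const std = @import("std");

const RequestLine = struct {
    method: []const u8,
    target: []const u8,
    version: []const u8,
};

// Returns the index of the first `needle` at or after `from`, or null.
fn findByte(bytes: []const u8, from: usize, needle: u8) ?usize {
    var i = from;
    while (i < bytes.len) : (i += 1) {
        if (bytes[i] == needle) return i;
    }
    return null;
}

// Parses "METHOD SP target SP version CRLF". Every returned field is a
// sub-slice of `bytes`; nothing is copied or allocated.
fn parseRequestLine(bytes: []const u8) ?RequestLine {
    const cr = findByte(bytes, 0, '\r') orelse return null;
    if (cr + 1 >= bytes.len or bytes[cr + 1] != '\n') return null;
    const line = bytes[0..cr];
    const first_space = findByte(line, 0, ' ') orelse return null;
    const second_space = findByte(line, first_space + 1, ' ') orelse return null;
    return .{
        .method = line[0..first_space],
        .target = line[first_space + 1 .. second_space],
        .version = line[second_space + 1 ..],
    };
}
```

Because the fields are plain slices, comparing their pointers against the input buffer is enough to confirm that no copy was made.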

The header array itself ([128]Header) lives inline on the Connection struct, which is pre-allocated in a slab. The “allocation” for parsed headers happened once, at server startup.

The lifetime constraint this creates

The read buffer must stay alive and unmodified for as long as anyone holds a reference to parsed request data. In practice, this is fine: the buffer is owned by the connection, and the entire request/response cycle completes before the buffer is reused. But if you wanted to do background processing on a request after responding, you’d need to copy the parts you need.
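
One way to copy out the parts you need before the buffer is reused, sketched under assumptions: OwnedTarget and its 256-byte inline capacity are hypothetical, chosen so that the copy itself also avoids the heap.

```zig
const std = @import("std");

// Hypothetical owned copy of a request target with inline storage.
const OwnedTarget = struct {
    storage: [256]u8 = undefined,
    len: usize = 0,

    // Copies `borrowed` so the result no longer depends on the read buffer.
    fn init(borrowed: []const u8) error{TooLong}!OwnedTarget {
        var owned: OwnedTarget = .{};
        if (borrowed.len > owned.storage.len) return error.TooLong;
        @memcpy(owned.storage[0..borrowed.len], borrowed);
        owned.len = borrowed.len;
        return owned;
    }

    // View of the copied bytes, valid for as long as this value lives.
    fn slice(self: *const OwnedTarget) []const u8 {
        return self.storage[0..self.len];
    }
};
```

A background task would hold the OwnedTarget by value, while the borrowed slice is only valid until the connection reuses its read buffer.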

Connection structs: everything inline

Each connection is a single struct, pre-allocated in a slab of max_connections entries:

pub const Connection = struct {
    fd: ?std.posix.fd_t,
    state: State,
    read_buffer: ?buffer_pool.BufferHandle,
    read_offset: usize,
    read_buffered_bytes: usize,

    // Write queue: fixed circular buffer
    write_queue: [256]WriteEntry,
    write_head: u16,
    write_tail: u16,
    write_count: u16,

    // Parsed headers: inline, reused per request
    headers: [128]request.Header,

    // HTTP/2 pending bodies (inline)
    h2_pending: [32]PendingH2Body,

    // ... TLS state, timeouts, peer IP, etc.
};

On accept, we grab a Connection from the slab and acquire a read buffer from the pool. On close, release the buffer and return the struct. For HTTP/1.1, there’s no per-connection heap allocation. (HTTP/2 connections do heap-allocate an HPACK state machine on upgrade, since that state is too large and variable to inline.)

The write queue is a fixed 256-entry circular buffer of (BufferHandle, len, offset) tuples. Response bytes are written into pool buffers and enqueued here. The event loop drains them to the socket via writev. If the queue fills, we apply write-side backpressure.
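
A fixed circular queue with head, tail, and count fields can be sketched as follows. The WriteEntry layout here (a buffer index instead of a full BufferHandle) and the capacity of 4 are simplifications for illustration, not swerver’s definitions:

```zig
const std = @import("std");

// Hypothetical entry layout standing in for (BufferHandle, len, offset).
const WriteEntry = struct {
    buffer_index: u32,
    len: u32,
    offset: u32,
};

const WriteQueue = struct {
    const capacity = 4; // 256 in the post; kept small here

    entries: [capacity]WriteEntry = undefined,
    head: u16 = 0, // next entry to drain
    tail: u16 = 0, // next free slot
    count: u16 = 0,

    // Returns false when the queue is full so the caller can stop
    // producing responses for this connection (write-side backpressure).
    fn push(self: *WriteQueue, entry: WriteEntry) bool {
        if (self.count == capacity) return false;
        self.entries[self.tail] = entry;
        self.tail = (self.tail + 1) % capacity;
        self.count += 1;
        return true;
    }

    // Removes and returns the oldest entry, or null when empty.
    fn pop(self: *WriteQueue) ?WriteEntry {
        if (self.count == 0) return null;
        const entry = self.entries[self.head];
        self.head = (self.head + 1) % capacity;
        self.count -= 1;
        return entry;
    }
};
```

Keeping an explicit count alongside head and tail lets the full and empty states be told apart without sacrificing a slot.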

The handler arena: lazy and optional

Not every handler can avoid allocation. JSON serialization, query parameter parsing, and template rendering all need scratch memory. For these, we provide a FixedBufferAllocator backed by a pool buffer:

var response_buf: [8192]u8 = undefined;
var response_headers: [4]response.Header = undefined;

const needs_arena = (method != .GET and method != .HEAD and method != .DELETE);
const arena_handle = if (needs_arena) io.acquireBuffer() else null;
const arena_buf = if (arena_handle) |h| h.bytes else empty[0..];

var scratch = router.HandlerScratch{
    .response_buf = response_buf[0..],
    .response_headers = response_headers[0..],
    .arena_buf = arena_buf,
    .arena_handle = arena_handle,
};

const result = router.handle(request_view, &middleware_ctx, &scratch);
if (scratch.arena_handle) |h| io.releaseBuffer(h);

The key choices:

  1. GET, HEAD, and DELETE requests skip the arena entirely. Most GET handlers return static content or format into the 8 KB response scratch buffer. Skipping the pool round-trip matters on workloads where a significant fraction of requests are GETs that never touch the arena.

  2. The arena is a FixedBufferAllocator. 64 KB of linear bump allocation. No individual frees, just a full reset after the handler returns. If a handler exhausts it, the allocation fails with OutOfMemory, and that’s intentional. A handler that needs more than 64 KB of scratch is doing something the server shouldn’t be doing in the hot path.

  3. The arena is released synchronously. The handler returns, we release the buffer, done. No deferred cleanup, no reference counting.
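
The bump, fail, and reset behavior described in the list above can be seen with std.heap.FixedBufferAllocator directly. In this sketch upperCopy is a made-up handler, and a 16-byte array stands in for the 64 KB pool buffer:

```zig
const std = @import("std");

// Hypothetical handler that needs scratch memory: writes an upper-cased
// copy of `name` into a fresh arena allocation.
fn upperCopy(arena: std.mem.Allocator, name: []const u8) error{OutOfMemory}![]u8 {
    const out = try arena.alloc(u8, name.len);
    for (name, out) |c, *dst| dst.* = std.ascii.toUpper(c);
    return out;
}

test "bounded arena: bump, fail on exhaustion, reset" {
    var backing: [16]u8 = undefined; // stands in for a 64 KB pool buffer
    var fba = std.heap.FixedBufferAllocator.init(&backing);
    const arena = fba.allocator();

    const shout = try upperCopy(arena, "swerver");
    try std.testing.expectEqualStrings("SWERVER", shout);

    // 7 of 16 bytes are in use, so a 12-byte request cannot fit.
    try std.testing.expectError(error.OutOfMemory, upperCopy(arena, "hello, world"));

    // One reset reclaims everything, as after a handler returns.
    fba.reset();
    _ = try upperCopy(arena, "hello, world");
}
```

The handler only sees a std.mem.Allocator, so the same code would run unchanged against a general-purpose allocator in tests.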

Pre-encoded response cache

For the hottest endpoints (/plaintext, /health, /json) we go further. We skip the router, the middleware chain, and the response encoder entirely:

// At server init: encode full HTTP/1.1 wire bytes for each hot endpoint
// At request time: URL match → cache hit → memcpy to write queue → done

The Date header is refreshed once per second via a cached timestamp. Security headers are baked in at init time. The fast path for a cached GET is: parse headers, match URL, copy pre-encoded bytes, enqueue write.

This is protocol-specific. HTTP/1.1, HTTP/2, and HTTP/3 each have their own pre-encoded cache because the wire formats differ completely.
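
A rough sketch of what a pre-encoded HTTP/1.1 entry could look like. The post does not say how the Date header is refreshed, so patching a fixed-width placeholder in place is an assumption, as are the names CachedResponse, refreshDate, and writeTo and the response contents:

```zig
const std = @import("std");

// Full wire bytes assembled at comptime. The Date value is a fixed-width
// 29-byte placeholder in IMF-fixdate form that is overwritten in place.
const date_placeholder = "Thu, 01 Jan 1970 00:00:00 GMT";
const prefix = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 2\r\nDate: ";
const suffix = "\r\n\r\nOK";
const template = prefix ++ date_placeholder ++ suffix;

const CachedResponse = struct {
    bytes: [template.len]u8 = template.*,

    // Intended to run about once per second with an already formatted date.
    // The length never changes, so no other byte of the response moves.
    fn refreshDate(self: *CachedResponse, date: *const [date_placeholder.len]u8) void {
        @memcpy(self.bytes[prefix.len..][0..date_placeholder.len], date);
    }

    // Request-time fast path: one copy into an output buffer. Returns the
    // number of bytes written, or null if `out` is too small.
    fn writeTo(self: *const CachedResponse, out: []u8) ?usize {
        if (out.len < self.bytes.len) return null;
        @memcpy(out[0..self.bytes.len], &self.bytes);
        return self.bytes.len;
    }
};
```

Because the date has a fixed width, its offset is known at comptime and the per-second refresh is a single 29-byte copy.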

Large request bodies: a separate pool

A 64 KB buffer works for most requests, but POST uploads can be megabytes. Rather than oversizing the main pool, there’s a separate body buffer pool:

body_buffer_size: usize = 1024 * 1024,  // 1 MB each
body_buffer_count: usize = 32,          // 32 MB total per worker

When a request body exceeds the read buffer, the server accumulates into body pool buffers across multiple reads. A POST uploading a 10 MB file occupies 10 body buffers rather than roughly 160 of the 64 KB read buffers, so it doesn’t starve concurrent GETs.
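
The buffer counts follow from a ceiling division. A small sketch, where buffersNeeded is a hypothetical helper and the sizes are the defaults quoted in the post:

```zig
const std = @import("std");

const read_buffer_size: usize = 64 * 1024; // main pool buffer
const body_buffer_size: usize = 1024 * 1024; // body pool buffer

// Number of fixed-size buffers needed to hold `body_len` bytes,
// rounding up so a partial last buffer still counts.
fn buffersNeeded(body_len: usize, buffer_size: usize) usize {
    return (body_len + buffer_size - 1) / buffer_size;
}
```

For a 10 MiB upload this gives 10 body buffers out of the pool of 32, versus 160 buffers if the same bytes had to sit in the main pool.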

What a request actually looks like in memory

Here’s the lifecycle of an HTTP/1.1 GET on the fast path:

1. Accept → acquire Connection from slab, acquire 64 KB read buffer from pool
2. Read   → recv() into read buffer
3. Parse  → parseHeaders() with conn.headers[128], all slices into read buffer
4. Route  → pre-encoded cache hit, or handler with stack-allocated 8 KB scratch
5. Write  → enqueue to write_queue[256], writev() to kernel
6. Done   → release write buffer(s), keep connection + read buffer (keep-alive)

Total heap allocations on this path: zero.

Tradeoffs

Memory upfront. 4096 × 64 KB = 256 MB per worker, committed at startup. Four workers = 1 GB before a single request arrives. You’re trading memory efficiency for allocation predictability and speed.
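
The budget is easy to check in code. This sketch uses only the defaults quoted in the post and counts the two buffer pools; the Connection slab is left out because its entry size is not given, and the function names are made up:

```zig
const std = @import("std");

const KiB: usize = 1024;
const MiB: usize = 1024 * KiB;

// Defaults quoted in the post.
const buffer_count: usize = 4096;
const buffer_size: usize = 64 * KiB;
const body_buffer_count: usize = 32;
const body_buffer_size: usize = 1 * MiB;

// Bytes reserved at startup by the main pool and the body pool of one worker.
fn perWorkerPoolBytes() usize {
    return buffer_count * buffer_size + body_buffer_count * body_buffer_size;
}

// Pool bytes across all worker processes.
fn totalPoolBytes(workers: usize) usize {
    return workers * perWorkerPoolBytes();
}
```

With the body pool included, each worker reserves 288 MiB, so four workers come to 1152 MiB rather than exactly 1 GiB.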

Fixed sizes waste the tail. 64 KB per buffer is generous for typical headers (~500 bytes). A buddy allocator over the same slab could hand out 4 KB / 16 KB / 64 KB chunks: still effectively O(1) with only three size classes, and less waste. More bookkeeping, but if I were starting over I’d try it.

Synchronous ownership. Parsed headers are slices into the read buffer, so you can’t hand a request off to a background thread without copying. This works for a proxy/API server; it wouldn’t work for long-running computation.

Handler budget. The 64 KB arena is a hard cap. Handlers that need more either stream or fail. Unusual if you’re used to an unbounded allocator, but it keeps the hot path honest.

Numbers

On HttpArena’s HTTP/1.1 isolated benchmarks (64 CPU threads, best of 3 runs):

Test          Connections        Req/s
baseline              512    3,558,639
baseline             4096    3,749,014
pipelined             512   12,309,938
pipelined            4096   27,282,020
limited-conn          512    2,324,326
limited-conn         4096    2,589,588
json                 4096    2,348,213
static               4096    1,215,439

The pipelined-4096 number (27M req/s) is where the zero-allocation design pays off most directly. With 4096 persistent connections each sending pipelined requests, there’s no per-request memory churn to stall the event loop. Parse, route, memcpy pre-encoded response, advance the write queue. The allocator never runs.

On the limited-conn-4096 test (new TCP connection per request, 4096 concurrency) and the json-4096 test, swerver takes first place, at over 2x the next entry on json. These are the tests where per-request allocation cost compounds: every connection is fresh, there’s no amortization over a keep-alive stream. The fixed-buffer architecture means accept → parse → respond → close touches zero heap state regardless of connection lifetime.

Wrapping up

The core idea isn’t novel. nginx and HAProxy have used fixed-buffer approaches for years. What’s interesting is how natural this is in Zig. Slices make borrow-style parsing ergonomic. FixedBufferAllocator makes bounded allocation a first-class pattern. Inline arrays on structs give you cache-friendly layouts without fighting the language.

~40K lines of Zig, HTTP/1.1 + HTTP/2 + HTTP/3 (QUIC), kqueue/epoll/io_uring. The memory model described here is the foundation everything else is built on.

Supported Zig versions

zig-0-16

AI / LLM usage disclosure

This project is AI assisted.

“Parsing requests, unescaping them in place, without copying from request buffers” - an item on my ideas to investigate list.

Do you happen to have a comparison of this optimization in isolation?

I love this approach, and will be exploring it at some point in the next few months.

Yeah, I think it’s really cool that Zig makes it so straightforward to do stuff like this.

I don’t have an isolated test of it, but the cost without it is one memcpy per header value (typical request has 6-10 headers, ~500 bytes total), plus an allocator round-trip for the copies. At 27M req/s and 6 headers per request, that’s ~160M memcpys/sec and ~27M alloc/free pairs/sec.

Thanks. It’s not a new trick or anything, but I think it does highlight some of the coolest things about building with Zig. I think the method is applicable to more than just HTTP wire stuff too. I’m also working on a render pipeline that has a very similar layout.