Small Object Buffer Optimizations

AndrewCodeDev · May 28, 2024, 7:41pm

Alright, the topic is avoiding allocations using small buffer optimization. This is also helpful for storing generic data for polymorphic types.

This is commonly used with strings and the typical pattern is something like a union with an array the size of a slice (2 * sizeof(usize)) and if the number of characters goes over that, it allocates and sets a flag letting you know which one is in use.

I’m going to kick this off with a video from Jason Turner: https://www.youtube.com/watch?v=CIB_khrNPSU

I bet we could probably be more clever than that and do something more interesting/optimal.

Here’s a basic buffer that I hacked together a while ago. It’s very situationally dependent and uses an extern union so I wouldn’t say it’s a good “generic” buffer, but it suited the need I had at the time:

const ClosureBuffer = struct {
    const BufferSize = @sizeOf(usize) * 6; // or whatever size...

    items: extern union {
        inplace: [BufferSize]u8 align(@alignOf(usize)),
        any_ptr: *align(@alignOf(usize)) anyopaque,
    },

    state: enum { buf, ptr },

    pub fn init(x: anytype) ClosureBuffer {
        const T = @TypeOf(x);
        var self: ClosureBuffer = undefined;

        if (@sizeOf(T) <= BufferSize) {
            const tmp: *T align(@alignOf(usize)) = @ptrCast(@alignCast(&self.items.inplace));
            tmp.* = x;
            self.state = .buf;
        } else {
            const ptr = std.heap.c_allocator.create(T) catch @panic("Closure.init: Out of memory.");
            ptr.* = x;
            self.items.any_ptr = ptr;
            self.state = .ptr;
        }
        return self;
    }

    pub fn cast(self: *ClosureBuffer, comptime T: type) *T {
        return switch (self.state) {
            .buf => @ptrCast(@alignCast(&self.items.inplace)),
            .ptr => @ptrCast(@alignCast(self.items.any_ptr)),
        };
    }

    pub fn deinit(self: *ClosureBuffer, comptime T: type) void {
        if (self.state == .ptr) {
            std.heap.c_allocator.destroy(@as(*T, @ptrCast(@alignCast(self.items.any_ptr))));
        }
    }
};

There’s a few things that could make this more generic - instead of a *anyopaque, we could use []u8 and use something like rawFree instead of having to cast back to the type to destroy it as this is what happens in the destroy call for allocators anyway:

pub fn destroy(self: Allocator, ptr: anytype) void {
    const info = @typeInfo(@TypeOf(ptr)).Pointer;
    if (info.size != .One) @compileError("ptr must be a single item pointer");
    const T = info.child;
    if (@sizeOf(T) == 0) return;
    const non_const_ptr = @as([*]u8, @ptrCast(@constCast(ptr)));
    self.rawFree(non_const_ptr[0..@sizeOf(T)], log2a(info.alignment), @returnAddress());
}

But I want to hear more ideas about what else we can do to make this better. I think having a nice small buffer that can be used in place of direct pointer or slice would be a good addition to the ecosystem.

nyc · May 28, 2024, 7:53pm

I have a full small string optimization lib that spills at 23 bytes:

this is the top string (leaving out methods):

pub const String = extern union {
    lowbyte: u8,
    small: SmallString,
    large: LargeString,
};

here is small string

pub const SmallString = extern struct {
    const buf_size = @sizeOf(LargeString) - 1;
    len: u8,    // true length is `len >> 1`
    data: [buf_size]u8,
}

and here is the large that spills to:

pub const LargeString = extern struct {
    cap: u64,  // always allocates and even sized buffer
    len: u64,
    data: [*]u8,
}

If the bottom bit of the string struct is 1, it is small and the lenght is shifted by one bit. if the bottom bot is 0, then it is a large string and the capacity is by that nature always even. We just make sure wen always allocate a large string with an even number of bytes to keep that rule.

mntnmntn · May 29, 2024, 2:37am

Correct me if I’m wrong, but can’t you use the std.heap.stackFallback allocator in the standard library to do this? There isn’t that much overhead compared to using a tagged union and it’s a lot cleaner in my opinion.

AndrewCodeDev · May 29, 2024, 2:58am

Allocators definitely cover a lot of the use case here but there are a few differences.

The big difference is locality. The stack allocator may have a memory buffer anywhere (could be static, could be allocated via malloc). In this case, if we qualify for the small buffer, our data is built in-place.

I like the idea though, you could definitely get some interesting behavior in having a fall-through pattern that also included a stack allocator between the in-place buffer and a syscall.

mntnmntn · May 29, 2024, 5:40am

According to the standard library, the StackFallbackAllocator does have a stack-allocated backing array (see lib/std/heap.zig:522), and it prioritizes allocating within that buffer until it runs out of space, so unless I misunderstood how it works, the StackFallbackAllocator does a locally-allocated memory buffer.

AndrewCodeDev · May 29, 2024, 5:47am

Ah yes, I read that and mentally flipped it to FixedBufferAllocator. I see your point and I’ll have to experiment with that - good idea.

I will point out that we are still getting fundamentally different behaviour in the long run here. One is a buffer that fills up and then falls back on a aggregate level. The other is a case-by-case basis per object.

That said, I do see a lot of value in introducing more thought surrounding the allocator itself. I also like some of @nyc’s ideas for string allocation.

nyc · May 30, 2024, 12:18am

the biggest advantage to small buffer allocation is being able to keep the memory inline with a parent structures, think like small string keys in a hashtable. where keeping them inline and this in order of the hash table indexe is key to performance.

the optimization makes small objects lighter by using the space associated with things like a pointer to what memory it will occupy and overlaying the object directly into that struct with a union. Like my string variant above has a length, capacity, and pointer (all usize), and when it is in small mode, 1 bytes goes for length and the other 23 to hold the text (capacity is an implied 23) .

The small object isn’t even tied to an allocator until it spills to the heap, and nothing needs to be freed until then.

the stack stackfallback has much different characteristics. it is much better for something like a network buffer where most of the time you know the size is 1k or less but sometimes it might grow much larger.

It is really just a wrapper around two other allocators:

buffer: [size]u8,
fallback_allocator: Allocator,
fixed_buffer_allocator: FixedBufferAllocator,

Those are the only 3 fields buffer is the buffer that the fixed_buffer_allocator will be user over. when the fixed allocator fills up it will start using tthe fallback_allocator (pased in on type definition);

This is the entire allocation routine:

return FixedBufferAllocator.alloc(&self.fixed_buffer_allocator, len, log2_ptr_align, ra)
    orelse return self.fallback_allocator.rawAlloc(len, log2_ptr_align, ra);

(the double return really confuses me)

Try fixed, then try fallback.

If you had a million <20 char strings, they don’t need any external allocation, but with the stackfallbak you are still allocating just into a fixed buffer at first. You still need to free them or the buffer, there is still internal memory fragmentation (the fixed allocator can potentially free or resize the most recent allocation but nothing else so you need to basically use it and toss it - unlike small buffer optimization where there is none of those issues and can like in perpetuity without fragmenting anything – they are made to be invasive of a larger structure often).

If you made a fixed buffer allocator variant that contained its own memory buffer instead of being passed one (there might actually be one), you could make a OneTwoAllocator that literally just tried allocator one then tried allocator and it would be the exact same thing as the stackfallback. (that probably a better idea than what is in std now actually).

edit: thnking about the fixedbufferallocator right now, and I can’t really see a reason for it to take an external buffer. It just wastes space and loses locality by using a pointer to the region it is managing. you should just give it a size in its type creation function

fn FixedAlloc(S: type, bufsiz: S) type {
  return extern struct { // needs to be extern bc you  REALLY want  them
    top: S,            // before the buf to keep in page (ery impt)
    buf: [bufsiz]u8 = undefined;

    pub fn init() @This { return .{ .top = 0 }; }
 
    pub fn from_region(from: *[bufsiz]u8) @This {
      const this: *@This() = @ptrCast(from);
      this.top = 0;
      return this.*;
    }
};

I doubt keeping the most recent allocation there is even worth it, but with a type S of u16 and alignment requirement of 4 bytes, you can add it without costing any space.

This variant just takes over blocks (and this might actually eat into the uses cases of small buffer optiomization because my small string becomes const SmallString = FixedAlloc(u8, 23)

This is actually a much better idea now that I see it.

mnemnion · May 30, 2024, 5:43pm

The type you’ve sketched out is an interesting one to consider for some cases, but not a capital-A Allocator.

Making it one would be possible, but I doubt you’d save space. Because the size of the buffer is comptime-known, you’d need to monomorphize alloc, resize and free to the size of the buffer. Three functions for each buffer size is going to cost more than a fat pointer would.

What you’re describing could definitely be used within a function to dole out buffer space, but the FixedBufferAllocator needs to be able to be passed in to anything expecting an Allocator, so it isn’t a drop-in replacement.

The point about data locality is an interesting one.

Let’s take some representative code:

fn localizedFBA() !void {
    var buf: [64 * 8]u8 = undefined;
    var FBA = std.heap.FixedBufferAllocator.init(&buf);
    const allocator = FBA.allocator();
    var AList = std.ArrayList(u64).init(allocator);
    try AList.append(42);
    try std.testing.expectEqual(42, AList.items[0]);
}

test "local FBA" {
    try localizedFBA();
}

I’d say this is a typical use of a FixedBufferAllocator: reserve some stack space, dress it up as an allocator, and pass it to data structures like ArrayList which are dynamic. Obviously the specifics here are kind of useless, it’s illustrative.

So the data locality issue is that it looks like the FBA is stacked at the end of the buffer, so its slice pointer is pointing “back” to the beginning of that buffer.

But there’s nothing requiring the compiler to lay out the memory that way. I’m not Godbolting this, but it would be a reasonable optimization to move the FBA’s stack memory before the buffer.

Then again, relying on a compiler to get stuff right isn’t as reliable as insisting that it do so. So an allocator which is passed a byte count and uses result location semantics to put the whole thing on the stack in the correct order could be pretty useful. I don’t think it could take advantage of the buffer size efficiently, due to the aforementioned monomorphization issue, so it would be the same size as an FBA + buffer. But it might give better performance.

I would check the ASM before writing one though. LLVM is pretty clever.

If the compiler can’t get it right, we might be able to force the issue with some dirty monkey patching:

fn localizedFBA() !void {
    const buf: []u8 = undefined; // fake
    var FBA = std.heap.FixedBufferAllocator.init(buf);
    var data: [256]u8 = undefined;
    FBA.buffer = &data;
    const allocator = FBA.allocator();
    var AList = std.ArrayList(u64).init(allocator);
    try AList.append(42);
    try std.testing.expectEqual(42, AList.items[0]);
}

test "local FBA" {
    try localizedFBA();
}

mnemnion · May 30, 2024, 5:47pm

Actually there’s no need for the wasted buffer slice, even if, y’know, the compiler would probably drop it.

fn localizedFBA() !void {
    var FBA = std.heap.FixedBufferAllocator{ .end_index = 0, .buffer = undefined };
    var data: [256]u8 = undefined;
    FBA.buffer = &data;
    const allocator = FBA.allocator();
    var AList = std.ArrayList(u64).init(allocator);
    try AList.append(42);
    try std.testing.expectEqual(42, AList.items[0]);
}

nyc · May 30, 2024, 6:56pm

That’s a very bad thing to do. In base and base+offset (the two simple addressing modes) there is a fast and slow path. You take a 5-10 cycle penalty if the address is on a different page than the base (eg a typical 10 cycle load getting tlb hit turned into a 15 cycle hit. It offset was larger than 2k you took another 1-2 cycles hit since the ALU had to get involed (complex base+offset*scale or base+offset+disp take a hit too).

Having to go through a pointer to load the buffer is already adding a loading the pointer the base address of the region, so while the load is prob 5 cyles is to wait for the result of the load to be be available before it can get the base address.

Since the top index needs to be grabbed every time putting that first makes sense, and since the base of the memory region is just calculated from the base of the struct there is no dependency on the top load finishing and be started in parallel. And all those loads will get the fast path.

Unless I had a special allocator designed for it, I would never rely on my allocator for locality (that’s why I thin the small string/vec optimizations are better than some fix stack allocator - it is just never going to perform as well (special purpose invasive non-allocator beats even the fastest generic allocator – shocking lol).

I stopped using Allocator a while ago and all my current code uses… alloc: anytype (usually a pointer to some allocator) yeah ugly (and from what i understand this was how it was originally) so back to the future. So I’m understanding the issues with that. The purpose of this ultra slimmed down allocator is so performance drive that accessing it through an vtable sees not to have a large usecase. Could you just to the same {opaque*, table} of everything else?

While the OneTwoAllocator sounds interesting and is literally what they stack allocator does, you get much better semantics and fallback behavior by writing somethign that spills explicitly. And I can’t see any uses behind this stack kind of thing, so i wouldnt consider it worth pursing.

I have half a compacting slab allocator done right now that will only allocate a single size struct (like 29-32 bytes is one im writing the test code for). When you free it sets a bit so each region has a bitmask set, and it keeps track of how full each region is. when it drops before a threshold, it marks it as sparse and tries to merge it with other sparse regions by finding two bitsets that don’t overlap, copying the region with fewer allocations ontop of the one with more alloations (since they don’t have overlapping allocations this is fine), then sets the page table to point to the now merged region and frees the newly empties region. The rest of the program doesn’t have to know anything about the compaction (there are now two virtual addresses pointing to the same physical address). In VIPT caches (almost all modern computers, excluding the usuals) this is free. But I only have half of it done right now.

nyc · May 30, 2024, 7:03pm

It is still going through a load dependency chain having to get the buffer variables first. In this example you aren’t going to see if since it is trivial for the compiler to figure out where it points to ahead of time.

If the fixed region isnt on the stack (nothing magical about the stack after all) having FBA point to the region vs the invasive version i did (it would be the FBA pointing to the region vs a pointer to the invasive one, so indirection happens regardless), they look the exact same to me (as long as FBA orders the pointers properly).

I see very little benefit for FBA to have a pointer to the region (might be able to have something a little more generic, but that might not actually be true). At worst they appear the same, but there are definite situations where the version that includes the buffer wins out.

mnemnion · May 30, 2024, 8:51pm

nyc:

mnemnion:

But there’s nothing requiring the compiler to lay out the memory that way. I’m not Godbolting this, but it would be a reasonable optimization to move the FBA’s stack memory before the buffer.

That’s a very bad thing to do. In base and base+offset (the two simple addressing modes) there is a fast and slow path. You take a 5-10 cycle penalty if the address is on a different page than the base (eg a typical 10 cycle load getting tlb hit turned into a 15 cycle hit. It offset was larger than 2k you took another 1-2 cycles hit since the ALU had to get involed (complex base+offset*scale or base+offset+disp take a hit too).

Having to go through a pointer to load the buffer is already adding a loading the pointer the base address of the region, so while the load is prob 5 cyles is to wait for the result of the load to be be available before it can get the base address.

I’m not understanding the point you’re trying to make here. You’ll be handing around a pointer to something after all, or you won’t be passing an Allocator to anything.

If the buffer is intrusive to whatever struct you’re handing around, that would mean that a load of that struct’s pointer will pull in some of the buffer as well. That’s good.

My point was that, given that, and given everything allocated on the stack (I’ll return to this shortly), the compiler could decide to put the FBA before the buffer, instead of after it (as the order of operations in the first example would suggest). Then the load would pull in the cache line containing the start of the buffer anyhow.

An intrusive buffer would also let the codegen do some fixed arithmetic to remove the indirect from the struct pointer to the buffer. But this optimization is also available if it’s all laid out in the stack, a struct isn’t necessarily special here. It might raise the odds that you get the optimization: but this isn’t a matter of chance, the compiler will or it won’t.

If the buffer in question is wider than a page, then eventually you’ll cross the page boundary between accessing the struct and allocating from the buffer. The idea behind wanting the struct to be at the beginning of that buffer is to localize the base pointer near the beginning, so that you benefit from locality at first, rather than at some later point which may never arrive.

I guess I vaguely see the idea that it’s wasteful to have two words, one pointing to the buffer and one containing the capacity, when it could just be the capacity and then the buffer. So we’re talking about saving one word here.

It’s certainly possible to heap-allocate a buffer for an FBA, using, well, another allocator. But I dare say that allocating some stack space and wrapping it in an FBA is the more ordinary use case, because the other way is just using the original allocator with extra steps.

As I said, it might indeed make sense for an FBA to receive a size and allocate the buffer that way, at comptime. But it’s not obvious that it makes sense, if you want to use it as an Allocator. If you’re doing something else with it, all bets are off, you might be able to save that one (1) usize block of memory.

nyc · May 30, 2024, 9:53pm

There are two issues, neither related to L1d cache hit vs miss. Saying invasive was probably the wrong thing to say - I meant the buffer being invasive of the LocalFixed allocator (ie, the type I laid out above) vs the current FixedAllocator I’ll call it RemoteFixed). LFA is just a length an embedded buffer so generating an address is [base+offset] (ignore length since both RMA and LFA do the same and it doesn’t affect aything). RMA is a pointer to a region that we will assume is directly next to it (either before or after it). L1d cache hits we’ll take to be the same since that’s prob going to be true.

It would be a mistake for the the compiler to place the pointer to the buf far away from the base of the struct (in this case, ordering it after the buffer). This is why auto field reordering isn’t wanted behavior for low-level stuff sometimes.

It isn’t a cache line thing, it is a address generation and TLB thing. There is a fast and slow path for physical pointer address generation (the physical tag needs to be known before a successful L1d hit) and is usually dispatched in parallel with the L1 cache search, the result comes back and the cache tag has to be compared to the physical address to make sure you have the right piece of memory. Even with just a handful of TLB entries (something crazy small like 20-30), it is almost always a hit

When base + offset simple addressing is used, base and base+offset need to be on the same page and offset needs to be 2k or less (anything larger or complex addressing like base+offset*scale has to spend 1-2 cycles in the ALU first). The address generation assume the base+offset are on the same page. when that spans a page boundary, the physical tags don’t match up and the cpu has to go back and take the slow path. This about a 5 cycle hit. The L1d hit is already costing around 5-10 cycles so makes the L1d hit take almost twice as long.

Always put your important stuff early in the struct. For anoher pointer you want it to start that load as fast as possible since there is two loads going on (load the pointer from the struct, then load the destination of that pointer) and pointer isn’t on the same page as the base address, the dependency will sap performance.

If you going to miss L1 none of this matters at all of course.

remove the indirect from the struct pointer to the buffer.

The pointer load only has to happen once in the function (assuming don’t run out of registers and spill to the stack) thank full. Wost case is a bunch of accesses across function boundaries where you’ll feel the pain often. Best case is the pointer is loaded once and can keep it in a register. (This keep the pointer next to the base address matters most for things like linked list or tree travrsal where you are constantly loading the next element, and then you feel his really painfully esp if you get lucky and have a largish struct size and you’re stalling out every 2-3 node.)

I’m not sure how this fast path /slow path stall shows up in performance counters because technically it is not a TLB miss I think, but not sure on that. If it doesn’t show up in the perf counters, this very easy to test. If It doesn’t it might be easier to test in C to controll the optimizer aggression from folding everything away and just returning a giant middle finger.

mnemnion · May 31, 2024, 12:36am

We seem to agree then.

What I was illustrating: pretend the compiler is banned from any memory reordering. This:

mnemnion:

fn localizedFBA() !void {
    var buf: [64 * 8]u8 = undefined;
    var FBA = std.heap.FixedBufferAllocator.init(&buf);
    const allocator = FBA.allocator();
    var AList = std.ArrayList(u64).init(allocator);
    try AList.append(42);
    try std.testing.expectEqual(42, AList.items[0]);
}

Would put the pointer far away from the start of the buffer. Hence I was saying a good compiler might order it like this:

mnemnion:

fn localizedFBA() !void {
    var FBA = std.heap.FixedBufferAllocator{ .end_index = 0, .buffer = undefined };
    var data: [64 * 8]u8 = undefined;
    FBA.buffer = &data;
    const allocator = FBA.allocator();
    var AList = std.ArrayList(u64).init(allocator);
    try AList.append(42);
    try std.testing.expectEqual(42, AList.items[0]);
}

Where the buffer begins just after the struct itself.

I don’t expect it would represent much of an optimization for the buffer to be included in the allocator itself, as long as both the struct and the buffer are contiguous, and not laid out backward. The allocator is just a self-pointer and the vtable, the backing struct is three words, a slice and the end index.

But I could be wrong, maybe the extra indirection would cause meaningful loss of performance. It’s practical to write what you’re describing as an allocator and benchmark it, that’s the one way to be sure of these things.

nyc · May 31, 2024, 4:24am

micro benchmarking harnes incoming - i added gnuplot output today - but trying to get the package a manager to work