What is the best way to handle unrecoverable errors like OutOfMemory in an application?

Crashing globally is exactly my intention. Otherwise I’d need to handle the error, reverting all allocations and other state changes and coding a fallback behavior, like reducing render distance or whatever. This is not worth the extra effort.

I don’t exclusively use a single global allocator. I do have a couple of functions that use an ArenaAllocator and I also have a global stack-like allocator for each thread that gets used for temporary runtime-known fixed-sized allocations. And I do plan to use MemoryPools more often, which would probably include tasks.

The threadpool owns the task, but the task is responsible for cleaning up its own memory. At the end of the task.run() method it deallocates its memory. It also has a separate task.cleanup() method which gets called by the thread pool if task.isStillNeeded() returns false.

The task.run() method also gives its result to a list in mesh_storage to be sent to the gpu and stored.

Basically the task itself is responsible for managing its memory and the result of its execution.
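To illustrate, the ownership scheme described above could be sketched roughly like this (all names here are hypothetical reconstructions, not your actual code):

```zig
const std = @import("std");

// Hypothetical sketch: the thread pool owns the Task, but the Task
// frees its own memory, either at the end of run() or in cleanup().
const Task = struct {
    allocator: std.mem.Allocator,
    data: []u8,

    fn isStillNeeded(self: *Task) bool {
        _ = self;
        return true; // e.g. check whether the chunk is still in render distance
    }

    fn run(self: *Task) void {
        // ... build the mesh, hand the result over to mesh_storage ...
        self.deinit(); // the task deallocates its own memory at the end
    }

    fn cleanup(self: *Task) void {
        // called by the thread pool instead of run() when isStillNeeded() is false
        self.deinit();
    }

    fn deinit(self: *Task) void {
        self.allocator.free(self.data);
    }
};
```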

So the code bit you sent, schedule(allocator: ...), isn’t really applicable. It would create the illusion that the memory is somehow owned by the caller, or discarded at the end of the function. Instead the memory is supposed to be shared with other threads, and the caller will probably never see it again.

On preallocating and reusing memory

Overall you seem to be advocating for allocating as much memory as needed upfront and then using that memory for memory pools, circular buffers and arenas.

While I agree that this is generally a good measure to reduce the number of potentially failing allocations, it doesn’t work for everything. E.g. there is no good upper estimate for mesh memory. It depends not only on render distance (configurable at runtime), but also on where you are in the world. For example if you are in a large cave system, mesh memory can easily quadruple.

So my question remains: what should I do in these cases? I don’t want to implement fallback behavior for all cases. That would make my code more complicated, for the gain of having a game that gets uglier (missing/LODed meshes or wrong lighting data) instead of crashing.

On mixing allocators

You both seem to agree that using different allocators inside the same function is bad.
But I think often having multiple allocators is the best solution. Let’s say you have a function that takes some data, processes it (and processing needs extra memory), and returns the final result using an allocator that was passed in. Using the passed-in allocator for the temporary memory is not a good idea in my opinion.

fn process(allocator: std.mem.Allocator, data: anytype) !@TypeOf(data) {
    const internalData = try allocator.alloc(...);
    defer allocator.free(internalData);
    // processing...
    const returnData = try allocator.alloc(...);
    // more processing...
    return returnData;
}

By using the same allocator for both allocations we have introduced a potential performance bottleneck (if the allocator is slow, such as a GPA) or potential memory fragmentation (if the allocator is an arena).
Instead it would be much better to allocate the temporary memory with a stack-fallback allocator or another stack-based allocator. For example, in my game I’d do something like this:

...
    const internalData = try main.stackAllocator.alloc(...);
    defer main.stackAllocator.free(internalData);
...

Where main.stackAllocator is a thread-local stack-emulating allocator (which falls back to the globalAllocator when the allocation is too big). This makes it much faster than the GPA, while avoiding the fragmentation problem of the arena. Additionally, it basically cannot fail unless the system runs out of memory.
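For readers who want something similar from the standard library: std.heap.stackFallback provides a fixed stack buffer that falls back to another allocator for larger requests. A minimal sketch (buffer size and fallback choice are arbitrary here):

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // 4 KiB served from a stack buffer; bigger requests fall back to the GPA.
    var sfa = std.heap.stackFallback(4096, gpa.allocator());
    const allocator = sfa.get();

    // This small allocation comes out of the stack buffer, so it is fast
    // and cannot fail; only the fallback path can return OutOfMemory.
    const internalData = try allocator.alloc(u8, 256);
    defer allocator.free(internalData);
    // processing...
}
```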

2 Likes

Thank you for the additional details, it helps with getting more of a mental picture of what is going on.

Yes I like the strategy and yes it can only be used in some situations.

I can only come up with these two options: do a lot of work to potentially recover and release some less important memory, or focus on the happy path and exit if you don’t have enough memory. For the latter, maybe you can at least exit in a nice way so the player doesn’t lose anything; I don’t know, maybe the server or save system already ensures that…

I think using different allocators is good; I just think letting errors from different allocators bubble up into the same code path via try is bad, because the call site would have trouble recognizing whether the global allocator is out of memory or some smaller, more specialized one.

So if you want to handle some of the out-of-memory errors, it is useful to know whether the error came from an allocator where you want some kind of fallback. Catching and crashing on the ones that aren’t supposed to be handled is one way to avoid mixing that error with another one that might make sense to handle.

You can decide that every OOM crashes, but I feel like there are cases where you use allocators for certain things where it may make sense to sometimes do something different. For example, if your game has developer tools, maybe those should try not to crash the game, so that you can still gain insights into a program that is about to crash. I also think developer tools are a great candidate for preallocating a buffer, along with a build option to compile them in or not.
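The preallocated-buffer idea for developer tools can be sketched with std.heap.FixedBufferAllocator; the overlay function and buffer size below are made up for illustration:

```zig
const std = @import("std");

// Reserve 1 MiB up front; the dev tools never touch the global allocator.
var dev_tools_buffer: [1 << 20]u8 = undefined;
var dev_tools_fba = std.heap.FixedBufferAllocator.init(&dev_tools_buffer);

// Hypothetical debug overlay: if the reserved buffer is exhausted, it
// degrades (skips drawing) instead of crashing the game.
fn drawDebugLine(fps: u32) void {
    const allocator = dev_tools_fba.allocator();
    const line = std.fmt.allocPrint(allocator, "fps: {}", .{fps}) catch return;
    defer allocator.free(line);
    // ... render the line ...
}
```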

You could also pass in multiple allocators. I think this is mostly a difference in personal preference in how to structure code (I don’t care whether the allocator is passed in or managed in some data structure), mixed with me not knowing how your code actually looks (apart from snippets).

I agree that mixing your temporary working memory data into your results data in the same allocator isn’t good.

Maybe you set up/configure those allocators somewhere else, in some other way that is flexible.

1 Like

I don’t have a save system yet, but I would want it to be resistant against all kinds of crashes (that would allow shipping some versions of the game in ReleaseSafe for finding bugs/UB). At the very least I want to guarantee that the save is not corrupted. A few seconds of progress might still be lost though, but I’ll try to keep progress loss at a minimum without having to constantly write stuff to disk.

That is a good point. Especially logging is something that shouldn’t crash the game.
But at the same time, if some developer tools hits out of memory of the global allocator, then I’d argue that some other code of the game would probably hit it next, so it wouldn’t really make a difference in the end.

2 Likes

To clarify: this is specifically related to try. In the case where the global allocator handles its own issues via a free function, we avoid the issue of crossing streams and creating error ambiguity.

2 Likes

FWIW, allocation failure is not always unrecoverable. In many cases, the user is simply running too many things at once, and retrying later or prompting the user can alleviate the issue.

1 Like

It’s been a few days, and based on what @AndrewCodeDev suggested I started doing the following:
I made an ErrorHandlingAllocator, which works basically like most other allocators: it has a backing allocator, and if an allocation in the backing allocator fails, it calls an error-handling function, which in my case just crashes, but could easily be extended with some sort of garbage-collection-and-retry mechanism:

    fn alloc(ctx: *anyopaque, len: usize, ptr_align: u8, ret_addr: usize) ?[*]u8 {
        const self: *ErrorHandlingAllocator = @ptrCast(@alignCast(ctx));
        return self.backingAllocator.rawAlloc(len, ptr_align, ret_addr) orelse handleError();
    }
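For readers unfamiliar with the std.mem.Allocator interface, a wrapper like this plugs in via the vtable; the resize/free forwarding and handleError body below are a hypothetical reconstruction (interface as of Zig ~0.11, where alignment is passed as a log2 u8):

```zig
const std = @import("std");

const ErrorHandlingAllocator = struct {
    backingAllocator: std.mem.Allocator,

    // Expose this wrapper as a regular std.mem.Allocator.
    pub fn allocator(self: *ErrorHandlingAllocator) std.mem.Allocator {
        return .{
            .ptr = self,
            .vtable = &.{ .alloc = alloc, .resize = resize, .free = free },
        };
    }

    fn handleError() noreturn {
        @panic("Out of memory"); // could instead trigger GC + retry
    }

    fn alloc(ctx: *anyopaque, len: usize, ptr_align: u8, ret_addr: usize) ?[*]u8 {
        const self: *ErrorHandlingAllocator = @ptrCast(@alignCast(ctx));
        return self.backingAllocator.rawAlloc(len, ptr_align, ret_addr) orelse handleError();
    }

    fn resize(ctx: *anyopaque, buf: []u8, buf_align: u8, new_len: usize, ret_addr: usize) bool {
        const self: *ErrorHandlingAllocator = @ptrCast(@alignCast(ctx));
        return self.backingAllocator.rawResize(buf, buf_align, new_len, ret_addr);
    }

    fn free(ctx: *anyopaque, buf: []u8, buf_align: u8, ret_addr: usize) void {
        const self: *ErrorHandlingAllocator = @ptrCast(@alignCast(ctx));
        self.backingAllocator.rawFree(buf, buf_align, ret_addr);
    }
};
```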

Secondly, I made a wrapper over the Allocator interface which I called NeverFailingAllocator. It basically just calls all the Allocator functions with catch unreachable:

    pub fn create(self: NeverFailingAllocator, comptime T: type) *T {
        return self.allocator.create(T) catch unreachable;
    }

I think this is working quite well. I can still easily interface with library functions by unwrapping the NeverFailingAllocator, but most of my code can assume that allocations won’t fail, since failure is handled in one central location.

Overall I went from 1400 trys down to just 230, and I also cut around 40 unnecessary catches that caught OutOfMemory just to satisfy some arbitrary interface decisions.
I think this will make it a lot easier to get a picture of where I get real errors, and hopefully I’ll have an easy time adding errdefers and error handling code around those.

However, I sadly had to add around 170 catch unreachables around stdlib functions like ArrayList.append(). I’ll probably implement my own list, and maybe a wrapper for the HashMap, to reduce that number.

1 Like