What is the best way to handle unrecoverable errors like OutOfMemory in an application?

One shot programs

I definitely see the value of using catch @panic("Out of Memory") in short-running programs that have a simple, defined task, do that task, and then exit. For example: build scripts, code/asset generation, utility tools.

One of the questions you need to ask yourself is whether your application truly is such a short-running “one shot” program. If it is, crashing is probably a good strategy.

If it processes multiple data sets, allocating memory, freeing a lot and then allocating other stuff, then I don’t think crashing is a good choice.

Basically, if your memory is monotonically increasing until you hit the finish, then crashing (exiting) is the best choice, because it is faster than freeing everything in reverse allocation order, piece by piece, just so that you can exit. But technically in that situation you don’t need to crash; you can just call std.os.exit.
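A minimal sketch of that one-shot style, assuming a Zig standard library from roughly the 0.11 to 0.14 range. The generate function and its sizes are made up for illustration, and I use std.process.exit here:

```zig
const std = @import("std");

// Hypothetical one-shot task: fill a buffer of `n` bytes and report its length.
fn generate(allocator: std.mem.Allocator, n: usize) usize {
    // A one-shot tool has no sensible recovery path, so OOM becomes a panic.
    const buf = allocator.alloc(u8, n) catch @panic("Out of Memory");
    @memset(buf, 'x');
    // Deliberately no free: the OS reclaims everything when the process exits.
    return buf.len;
}

pub fn main() void {
    const len = generate(std.heap.page_allocator, 1024);
    std.debug.print("generated {d} bytes\n", .{len});
    // Exit right away instead of freeing allocations piece by piece.
    std.process.exit(0);
}
```

The point is only that memory grows monotonically and the process ends, so there is nothing to unwind.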

Multi phase programs / bundles of different tasks

If you have a bunch of different phases where your memory usage grows because you are working on something, and then shrinks because you are distilling the results, I think you have 2 options:

  1. Use arenas and think in terms of batches, failing in batches and recovering in batches (related: Enter The Arena - Ryan Fleury)

    This allows you to just say: “for these things, if anything fails we just abort the whole section, resetting the entire arena and pretending we never started”.

    However, it is important to make sure you don’t keep some database connection, or something like it, half opened, where you lose the handle and thus don’t know how to close it. Basically, don’t lose/leak external resources in the swept-away trash pile of memory.

  2. Do everything “one piece at a time”:

    • create one object
    • errdefer destroy it
    • add it to a list
    • errdefer pop it from the list
    • add its index in some other data structure
    • errdefer remove the index from the other data structure
    • do something else that can fail
    • everything worked return the object

    Using this you get methods that are able to rollback to the state before the function was called, because if an error happens the errdefers undo the partial successes of the function. Thus these functions hide partial success, by undoing it, giving you complete success or complete failure.

    These functions are nice because they allow your program to reverse back from failure, convert the fine granularity error to a bigger granularity error, leaving you with less complexity at the call site.

    You have “success or fail”, not: “success or it failed and I need to check if the half created object is still in some list” at the call site.


I still don’t have a lot of experience with strategy 1. I used it a tiny bit in some old GUI code, but not enough to really test its limits.
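Still, a rough sketch of what I mean by failing and recovering in batches, assuming a Zig standard library from roughly the 0.11 to 0.14 range. The runBatch function and the small fixed buffer (a stand-in for "limited memory") are made up for illustration:

```zig
const std = @import("std");

// Hypothetical batch: every allocation goes into the arena, nothing is freed individually.
fn runBatch(arena: *std.heap.ArenaAllocator, item_count: usize) !usize {
    const a = arena.allocator();
    var total: usize = 0;
    var i: usize = 0;
    while (i < item_count) : (i += 1) {
        const item = try a.alloc(u8, 64);
        total += item.len;
    }
    return total;
}

pub fn main() void {
    // A small fixed buffer makes it easy to provoke an out-of-memory error.
    var backing: [4096]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&backing);
    var arena = std.heap.ArenaAllocator.init(fba.allocator());
    defer arena.deinit();

    const counts = [_]usize{ 8, 1000, 8 };
    for (counts) |count| {
        if (runBatch(&arena, count)) |bytes| {
            std.debug.print("batch of {d} ok: {d} bytes\n", .{ count, bytes });
        } else |err| {
            std.debug.print("batch of {d} failed: {s}\n", .{ count, @errorName(err) });
        }
        // Success or failure, drop the whole batch at once and pretend it never started.
        _ = arena.reset(.free_all);
    }
}
```

The oversized middle batch fails as a unit, and the batch after it starts from a clean arena. Anything that is not plain memory (file handles, connections) would still need its own cleanup.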

With strategy 2. here is an example:

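pub fn createArchetype(self: *Self, atype: AType, new_id: u32) !*Archetype {
    // The pointer itself is never reassigned, so it can be const.
    const archetype = try self.allocator.create(Archetype);
    // Undo the allocation if any later step fails.
    errdefer self.allocator.destroy(archetype);

    archetype.* = try Archetype.init(self.allocator, &self.component_infos, new_id, atype);
    // From here on the archetype is initialized, so a failure must also deinit it.
    errdefer archetype.deinit();
    try archetype.calculateSizes();
    return archetype;
}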
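The list and index steps from the bullet points are not part of that function, so here is a separate sketch of the full chain. Registry, Object and the fail_late flag are made-up names for illustration, and it assumes a Zig standard library from roughly the 0.11 to 0.14 range:

```zig
const std = @import("std");

const Object = struct { id: u32 };

// Hypothetical registry; all names here are invented for this example.
const Registry = struct {
    allocator: std.mem.Allocator,
    objects: std.ArrayListUnmanaged(*Object) = .{},
    index_by_id: std.AutoHashMapUnmanaged(u32, usize) = .{},

    fn deinit(self: *Registry) void {
        for (self.objects.items) |obj| self.allocator.destroy(obj);
        self.objects.deinit(self.allocator);
        self.index_by_id.deinit(self.allocator);
    }

    // Either the object is fully registered, or the registry is unchanged.
    fn add(self: *Registry, id: u32, fail_late: bool) !*Object {
        const obj = try self.allocator.create(Object);
        errdefer self.allocator.destroy(obj);
        obj.* = .{ .id = id };

        try self.objects.append(self.allocator, obj);
        errdefer _ = self.objects.pop();

        try self.index_by_id.putNoClobber(self.allocator, id, self.objects.items.len - 1);
        errdefer _ = self.index_by_id.remove(id);

        // Stand-in for "do something else that can fail".
        if (fail_late) return error.LateFailure;
        return obj;
    }
};

pub fn main() !void {
    var reg = Registry{ .allocator = std.heap.page_allocator };
    defer reg.deinit();
    _ = try reg.add(1, false);
    if (reg.add(2, true)) |_| {} else |err| {
        std.debug.print("add failed with {s}, registry still holds {d} object(s)\n", .{ @errorName(err), reg.objects.items.len });
    }
}
```

The errdefers run in reverse order, so a late failure removes the index entry, pops the list, and destroys the object, and the caller only sees "it failed".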

This is from a work-in-progress ECS, which brings me to another point. By using such an ECS (which uses Archetypes, which are basically batches of memory containing similar objects), I am kind of using both strategies:

  • strategy 2. to manage the batches themselves
  • something similar to strategy 1. in how I use these batches,
    because they let me manage memory as batches of bigger granularity

Because I have the batches, there are code paths where it is already clear that the memory is already allocated, thus I don’t have to deal with it on the instance level anymore.

Instead I get an error when I try to add a new instance and the batch tries to grow and doesn’t have enough memory. (If I deal with an individual object I may have to deal with the error individually)

batching / assume capacity

But it is often possible to, for example, check whether the school bus is big enough for the whole class instead of requesting a seat for each pupil individually. That allows you to move (allocate) the whole bus, instead of one pupil at a time.
This is the assumeCapacity case @AndrewCodeDev mentioned.

The logging Andrew mentioned is an interesting case. It might be worth exploring the idea of a logging library that uses something like assumeCapacity to either log the entire message or none of it, by pre-computing and reserving a big enough chunk of memory, but I am not sure how that would turn out. I think it would have to be explored as an actual experiment, to see whether it has some benefit.
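The school bus idea as a small sketch, assuming a Zig standard library from roughly the 0.11 to 0.14 range. seatClass is a made-up name; ensureUnusedCapacity and appendAssumeCapacity are the std list methods I mean:

```zig
const std = @import("std");

// Seat the whole class or nobody: reserve the "bus" once, then filling it cannot fail.
fn seatClass(
    allocator: std.mem.Allocator,
    bus: *std.ArrayListUnmanaged(u32),
    pupils: []const u32,
) !void {
    // The only fallible step. If it fails, `bus` is left unchanged.
    try bus.ensureUnusedCapacity(allocator, pupils.len);
    for (pupils) |pupil| {
        // No `try` needed: the capacity was reserved above.
        bus.appendAssumeCapacity(pupil);
    }
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    var bus: std.ArrayListUnmanaged(u32) = .{};
    defer bus.deinit(allocator);
    try seatClass(allocator, &bus, &[_]u32{ 1, 2, 3 });
    std.debug.print("{d} pupils seated\n", .{bus.items.len});
}
```

There is a single point of failure per batch, and no half-filled list to clean up afterwards.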

asserts

I also agree that asserts are great, but they are more for cases where some invariant wasn’t upheld, so I think they are for situations where some programmer tries to use something in an unintended way, to prevent that from being possible.


Garbage collection

I also think there is a third strategy: you build something that in some way starts looking like (or is) a garbage collector.

You have some heuristic that triggers “memory is getting too full, we need to clean up”. Then you walk a data structure, figure out what is garbage, and remove it, possibly reorganizing the remaining things, and then you continue running the program.

Typically garbage collection gets triggered before you hit out of memory, but you also could use an out of memory error as the trigger.

  • It could back away from an error with strategy 1. or 2., garbage collect, retry.
  • If the error happens again, maybe even try to move things that could be done later to the disk and retry.
  • If it happens again finally crash.

It is just that most Zig programs can probably avoid the need to invent / use their own garbage collector.
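Using out of memory as the trigger could look roughly like this. It is only a sketch: Cache and allocWithRetry are made-up names, the "move things to disk" step is left out, and it assumes a Zig standard library from roughly the 0.11 to 0.14 range:

```zig
const std = @import("std");

// Hypothetical cache that can be dropped under memory pressure.
const Cache = struct {
    allocator: std.mem.Allocator,
    blob: ?[]u8 = null,

    // Returns true if anything was released.
    fn shrink(self: *Cache) bool {
        const blob = self.blob orelse return false;
        self.allocator.free(blob);
        self.blob = null;
        return true;
    }
};

// Try, clean up and retry once, then give up and crash.
fn allocWithRetry(allocator: std.mem.Allocator, cache: *Cache, n: usize) []u8 {
    return allocator.alloc(u8, n) catch {
        if (cache.shrink()) {
            if (allocator.alloc(u8, n)) |buf| return buf else |_| {}
        }
        @panic("Out of Memory, even after shrinking caches");
    };
}

pub fn main() !void {
    // A tiny fixed buffer stands in for a machine that is nearly out of memory.
    var memory: [256]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&memory);
    const allocator = fba.allocator();

    var cache = Cache{ .allocator = allocator, .blob = try allocator.alloc(u8, 200) };
    // Only 56 bytes are left, so this succeeds only after the cache is dropped.
    const buf = allocWithRetry(allocator, &cache, 100);
    std.debug.print("got {d} bytes after cleanup\n", .{buf.len});
}
```

A real version would need a smarter notion of "garbage" than one blob, but the control flow (fail, clean up, retry, crash) stays the same.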

I guess if the OS has swapping configured, that could delay the point where you hit out of memory. But because disk/storage is so much slower, it might be better for the program to hit out of memory than to be slowed down by OS-based swapping, because the program might have more information and could just decide to kill some less important sub task.

You also could do some kind of distributed programming swapping, move some part of your working memory / problem to another machine, but that is even slower and just isn’t practical if you want to stay on one machine.

I found a rather straightforward example of this while researching how different applications handle this. This article is old but good: Handling out-of-memory conditions in C - Eli Bendersky's website

Here’s an example from that article: Git’s xmalloc wrapper, a semi-GC method that attempts a recovery.

void *xmalloc(size_t size)
{
      void *ret = malloc(size);
      if (!ret && !size)
              ret = malloc(1);
      if (!ret) {
              release_pack_memory(size, -1);
              ret = malloc(size);
              if (!ret && !size)
                      ret = malloc(1);
              if (!ret)
                      die("Out of memory, malloc failed");
      }
#ifdef XMALLOC_POISON
      memset(ret, 0xA5, size);
#endif
      return ret;
}

We can see that the code tries to make an allocation. A zero-size request gets one retry as a 1-byte allocation, since malloc(0) is allowed to return NULL. If the allocation still failed, it releases its “pack memory” and tries again. If that fails too, it dies.

I probably should have specified that: It’s a multiplayer game.

I generally agree, but I think that needs a save system that works reliably under crash conditions. For example, on Linux the out of memory error isn’t even triggered reliably.

The problem is that my code doesn’t really look like this. I use a global allocator quite a lot. For example here is a random bit of code from my game:

pub fn schedule(mesh: *ChunkMesh) !void {
	const task = try main.globalAllocator.create(@This());
	task.* = .{
		.mesh = mesh,
	};
	try main.threadPool.addTask(task, &vtable); // Uses another allocator internally
}

Often there is not really a clean cut where it makes logical sense to place the @panic. I would need to place it on most alloc/append calls.

Yeah, I agree. Asserts are definitely the solution when it comes to user-defined unrecoverable errors, and I’m already using them in most places. But I cannot really apply that to unrecoverable library errors, like out of memory or errors while trying to open a window.

Yeah, a garbage collector is definitely something to look into. Though I would probably do it in the allocator instead of catching it at every site. For me the strategy could be as simple as just shrinking a bunch of caches I have.
But that doesn’t help with my problem. Eventually the garbage collector won’t have anything to clean, and then we are back at the point where I don’t know what to do with the (now truly unrecoverable) error.OutOfMemory that it returns.

Is it a network game? If so, on which side (client or server) have you had the most trouble handling the situations mentioned above?

I see - so let’s take a step back here.

We have two try statements here - one for creating a task, and one for adding a task. Since the try on the create statement uses the globalAllocator, that makes me wonder what can fail about the addTask call. Does it use the same allocator under the hood? Is there a different allocator? Or are there additional reasons why adding a task can fail?

The reason I’m asking is let’s pretend I’m the caller - if I call that function and I have multiple error sets coming at me, then I need to handle those independently. I could be getting an OOM or something else. Some may be reasonable, some may not be reasonable from my vantage point.

If I try to call that function and the globalAllocator runs dry… what can I reasonably do at that point? Going back to the xmalloc example, they free up some junk and then try again or die lol. But that happens every time and it’s handled in one place with the same reasoning.

If there’s really only one option, that should get moved into its own free function that uses the globalAllocator and makes a decision about what it means to globally run out of memory. If I as the caller have multiple options and I can be crafty about it, then sure, give me the error if I have more levers to pull.

Furthermore, different allocators returning OOM doesn’t mean the same thing universally. Maybe one can reasonably run out but the other absolutely shouldn’t - how would I tell on the outside if all I have is OOM?

IMO, I think you should make some decisions about what it means for the globalAllocator to run out and then codify those in a series of functions. Then, if you have additional errors, we can keep them in their own lanes.

Here’s a few toy examples…

// In one scenario I want something from the allocator...
const x = globalAllocator.create(T) catch {
     // free some stuff or call the police
};

// later...

const y = globalAllocator.create(T) catch {
     // freeing more stuff and calling the cops again
};

// and again...
const z = globalAllocator.create(T) catch {
     // this is getting old.
};

When we’re creating x, y, z… are we at the correct altitude to be making these decisions? Do we really have a different option each time? If not, I’d advocate for…

pub fn globalCreate(comptime T: type) *T {
    return globalAllocator.create(T) catch {
        // be as reasonable as possible...
    };
}

const x = globalCreate(T);
const y = globalCreate(U);
const z = globalCreate(V);

pub fn schedule(mesh: *ChunkMesh) !void {
	const task = globalCreate(@This());
	task.* = .{
		.mesh = mesh,
	};
	// now we are only getting errors from addTask
	try main.threadPool.addTask(task, &vtable); // Uses another allocator internally
}

This way, if I have unique errors from addTask, maybe I can handle them differently. This has a kind of “signal purity” that I like, especially if addTask has its own allocation strategy, so we can purely deal with its problems alone.

By returning the error, I believe we are implicitly saying “there are legitimately different ways to handle this error”. For adding tasks, that may be true, and it may be a different situation than running out of global memory. If there aren’t legitimately different ways to handle this problem, the function should handle that problem in one location itself.

Again, if you can imagine that there are legitimately unique things to do when global allocation fails, then you should handle those situations uniquely in each case - if not, centralize that concern and keep the error streams pure.

… rusty nail into my kicks!!!
I’ve just remembered the only one
multi-player game with two players using a single keyboard, and with amazing music via PC speaker using PCM (Pulse Code Modulation)

It is a network game using UDP holepunching to be precise.

That’s a valuable insight right there! And yeah, using globalCreate/Alloc/Destroy/Free functions feels right for this situation.
Sadly there seems to be no support for central error handling allocation in the standard library.
Data structures like ArrayLists only support std.mem.Allocator, which returns errors.
I guess I could write my own version of ArrayList, which is not too bad (I was forced to do that in Java as well, due to its poor generics), but I don’t want to end up rewriting half of the standard library.

After reconsidering in more detail, I don’t like this answer. In the context of a game client, I would argue strongly for getting away from the idea of relying on one single global allocator.

With this you are leaking or crashing globally:

The only reasonable thing globalCreate can do is fail with the error, instead of absorbing it. You could maybe argue for using one set of methods while you are trying to init the game: if you can’t even get to the main menu, you might as well crash / shut down.

But I think schedule or things that are used after an initial startup phase should either properly manage errors or have their memory pre-allocated in the startup phase and have no actual way to fail.

I think it is better to instead have smaller allocators/arenas that can fail and be reset individually. Better that a few branches break off than the whole tree collapsing.

With this your code can go back to before you tried to schedule:

pub fn schedule(allocator: std.mem.Allocator, mesh: *ChunkMesh) !void {
    const task = try allocator.create(@This());
    errdefer allocator.destroy(task);
    task.* = .{
        .mesh = mesh,
    };
    try main.threadPool.addTask(task, &vtable); // Uses another allocator internally
}

Then you have the option at the call site to let it bubble up to a reasonable granularity and make a decision. For example, you could switch to higher LOD levels (reduced geometry detail) and try again with lower resolution.
I also wonder a bit about who owns the task and cleans it up and who owns the results it produces, I don’t really know the details of how the thread pool is used.

Here is my reasoning:

  • when you have only one global allocator and you hit out of memory, there isn’t much sensible stuff you can do, except basically try to walk your program in reverse until you reach a point where your memory looks sensible again (which isn’t really a valid / easy thing to do in most languages)

  • instead (ideally) you want to split it so that you can have stacks of simple lifetimes:

    1. game boots up creates the root allocator and a bunch of things that will never be freed until program exit
    2. now you create an arena use it to load the splash screen, menu backgrounds, menu stuff, etc.
    3. now you basically push the arena onto the stack and create a nested new arena
      this new arena can grow further but resetting it gets you back to the beginning of step 3. you now fill it with stuff needed to communicate with the server and handle the local game state
  • let’s say the server sends very different amounts of variable length data because of different terrain, enemies etc.

    • so you basically need 2 buffers that grow and shrink a lot, the network incoming stuff can be one arena while the client working memory is another
    • you could use slab allocated arenas, or alternatively you could use memory mapping to virtually pre-allocate address space for each arena separately (just pre-reserving virtual memory without the physical mapping)
    • for some things you may be able to just pre allocate a pre calculated amount of memory and switch those in and out, replacing chunks with other chunks
    • for incoming messages (streaming/temporary data really) it may make sense to have something like a circular buffer, especially if you can pre calculate an upper bound
      • some people have used page mapping to create circular buffers that can be read and written as if they were contiguous memory by mapping 2x virtual address space to 1x physical space, that could be interesting because then you can use a circular buffer without the user having to know / notice
  • with good reuse of buffers you should hit a steady state where the program doesn’t grow beyond

  • now if you hit out of memory anyway you need to wonder why: are you holding on to things, are you leaking memory, did you try to load more than the machine can handle?

  • are you fragmenting memory within arenas?
    this means a lot of memory is no longer used but still not recovered, you could either try to reuse elements better for example via MemoryPool
    Or you could periodically scan through the arena compacting it.
    Or maybe even something that is like double buffering, where you swap between A and B buffer moving things from current to next buffer, if they continue to live, dropping/forgetting them if they aren’t relevant anymore.

  • let’s assume some other process steals all the memory from us while we were only using a little bit; theoretically we should be able to continue without using more memory than we currently have, by not showing more chunks, etc.

  • if that isn’t desirable because a half rendered world isn’t appealing, we could make some last attempts to send/save the player’s position and then pop the stack going back to 2., basically going back to the main menu and disconnecting

  • from there on the player can close their browser hogging memory, or configure a swap file so that the OS can page out the browser’s memory stealing ways

  • and the user can connect to the server again, without the game crashing
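The stack of lifetimes (steps 1. to 3.) could be sketched like this. playSession, Outcome and the fixed memory budget are made up for illustration, and it assumes a Zig standard library from roughly the 0.11 to 0.14 range:

```zig
const std = @import("std");

const Outcome = enum { played, back_to_menu };

// Step 3: everything for one server session lives in its own arena.
fn playSession(root: std.mem.Allocator, world_bytes: usize) Outcome {
    var session = std.heap.ArenaAllocator.init(root);
    // Leaving the session drops all of its memory in one go.
    defer session.deinit();

    _ = session.allocator().alloc(u8, world_bytes) catch {
        // Out of memory: unwind to the main menu instead of crashing.
        return .back_to_menu;
    };
    return .played;
}

pub fn main() !void {
    // Step 1: the root allocator. The fixed buffer is only a stand-in for a memory budget.
    var budget: [64 * 1024]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&budget);
    const root = fba.allocator();

    // Step 2: menu data lives for the whole run of the program.
    var menu = std.heap.ArenaAllocator.init(root);
    defer menu.deinit();
    const menu_assets = try menu.allocator().alloc(u8, 1024);

    for ([_]usize{ 4096, 1 << 20, 4096 }) |world_bytes| {
        const outcome = playSession(root, world_bytes);
        std.debug.print("session with {d} bytes: {s}, menu still holds {d} bytes\n", .{ world_bytes, @tagName(outcome), menu_assets.len });
    }
}
```

The session that asks for too much ends up back at the menu, the menu memory is untouched, and a later session can start again.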

I’m going to go one step further and suggest something different here because while I get what you’re saying, this still has the problem from before:

Imagine we have an error like so…

           alloc1
              |
         {   OOM, X, Y, Z  }
              |
           alloc2

You now have two allocators using the same error. So which one failed? If you have a long enough call stack, this can become impractical to solve.

So I’m going to make a suggestion here…

// somewhere up above...

const task = try allocator.create(T);
errdefer allocator.destroy(task);
task.* = .{
    .mesh = mesh,
};

// then later...
pub fn schedule(task: *Task) !void {
    try main.threadPool.addTask(task, &vtable); // Uses another allocator internally
}

I’ll meet you halfway - I think mixing allocators for the same function (if they both can fail) is not optimal, but I get your point about not using a global allocator for the current use case.

I agree that mixing different allocators in one code path is bad and your example is a possible mitigation.

I think another thing to think about is the whole who owns what, who creates the task vs who reads the task details and writes the result.

One of the mantras from Go that seems reasonable is to avoid communicating by sharing memory and instead share memory by communicating. That basically boils down to using queues/channels if possible, so that one end can write while the other reads.

That might be a way to get some distance between the scheduler and the worker thread. But it is kind of difficult to say without knowing how the results are used, and whether they need to be synced back up and recombined.

Also may depend on how much latency is acceptable and the needed throughput.

It is those “mobile applications” (a whole bunch of a modern internet browsers) that
are “dying breed”.

Everyone is installing some stupid “mobile applications” on their kinda “smart” phones.

Every programmer a-la python for so-called back-end and a-la js for so-called front-end
are the most valued programmers ever.

Why on earth it is happening this way?

:slight_smile:

Crashing globally is exactly my intention. Otherwise I’d need to handle the error, reverting all allocations and other state changes and coding a fallback behavior, like reducing render distance or whatever. This is not worth the extra effort.

I don’t exclusively use a single global allocator. I do have a couple of functions that use an ArenaAllocator and I also have a global stack-like allocator for each thread that gets used for temporary runtime-known fixed-sized allocations. And I do plan to use MemoryPools more often, which would probably include tasks.

The threadpool owns the task, but the task is responsible for cleaning up its own memory. At the end of the task.run() method it deallocates its memory. It also has a separate task.cleanup() method which gets called by the thread pool if task.isStillNeeded() returns false.

The task.run() method also gives its result to a list in mesh_storage to be sent to the gpu and stored.

Basically the task itself is responsible for managing its memory and the result of its execution.

So the code bit you sent, schedule(allocator: ...), isn’t really applicable. This would create the illusion that the memory is somehow owned by the caller, or discarded after the end of the function. Instead the memory is supposed to be shared with other threads, and the caller will probably never see it again.

On preallocating and reusing memory

Overall you seem to be advocating for allocating as much memory as needed upfront and then using that memory for memory pools, circular buffers and arenas.

While I agree that this is generally a good measure to reduce the number of potentially failing allocations, it doesn’t work for everything. E.g. there is no good upper estimate for mesh memory. It depends not only on render distance (configurable at runtime), but also on where you are in the world. For example if you are in a large cave system, mesh memory can easily quadruple.

So my question remains: What should I do in these cases? I don’t want to implement fallback behavior for all cases. That would make my code more complicated, for the gain of having a game that gets uglier (missing/LODed meshes or wrong lighting data) instead of crashing.

On mixing allocators

You both seem to agree that using different allocators inside the same function is bad.
But I think often having multiple allocators is the best solution. Let’s say you have a function that takes some data, processes it (processing needs extra memory), and returns the final result using an allocator that was passed in. Using the passed in allocator is not a good idea in my opinion.

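const std = @import("std");

// The u8 element type and the buffer sizes are placeholders for illustration.
fn process(allocator: std.mem.Allocator, data: []const u8) ![]u8 {
    // Temporary working memory, freed again before returning.
    const internalData = try allocator.alloc(u8, data.len);
    defer allocator.free(internalData);
    // processing...
    // The result comes from the same allocator and is owned by the caller.
    const returnData = try allocator.alloc(u8, data.len);
    errdefer allocator.free(returnData); // only matters if later steps can fail
    // more processing...
    return returnData;
}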

By using the same allocator for both allocations we have introduced a potential performance bottleneck (if the allocator is slow, such as a GPA) or potential memory fragmentation (if the allocator is an Arena).
Instead it would be much better to do the temporary allocation with a stack-fallback allocator or another stack-based allocator. For example in my game I’d do something like this:

...
    const internalData = try main.stackAllocator.alloc(...);
    defer main.stackAllocator.free(internalData);
...

Where the main.stackAllocator is a threadlocal stack emulating allocator (which also falls back to the globalAllocator when the allocation is too big). This makes it much faster than the GPA, while not having the fragmentation problem of the Arena. And additionally it basically cannot fail unless the system runs out of memory.
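For readers without such an allocator, the standard library’s std.heap.stackFallback gives a rough approximation of the idea. This is not the main.stackAllocator described above, only a sketch; sumOfSquares and the 256-byte size are made up, and it assumes a Zig standard library from roughly the 0.11 to 0.14 range:

```zig
const std = @import("std");

// Scratch memory comes from the stack when it fits, otherwise from `fallback`.
fn sumOfSquares(fallback: std.mem.Allocator, data: []const u32) !u64 {
    // 256 bytes of stack; larger requests go to `fallback`.
    var sfa = std.heap.stackFallback(256, fallback);
    const scratch_allocator = sfa.get();

    const scratch = try scratch_allocator.alloc(u64, data.len);
    defer scratch_allocator.free(scratch);

    for (data, scratch) |x, *slot| slot.* = @as(u64, x) * x;

    var total: u64 = 0;
    for (scratch) |v| total += v;
    return total;
}

pub fn main() !void {
    const total = try sumOfSquares(std.heap.page_allocator, &[_]u32{ 1, 2, 3 });
    std.debug.print("{d}\n", .{total});
}
```

Small inputs never touch the fallback allocator, so the temporary data neither slows down nor fragments the allocator that holds the results.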

Thank you for the additional details, it helps with getting more of a mental picture of what is going on.

Yes I like the strategy and yes it can only be used in some situations.

I only can come up with these 2 options, do a lot of work to potentially recover and release some less important memory, or focus on the happy path and exit if you don’t have enough memory. For the latter maybe you can at least exit in a nice way, so the player doesn’t lose anything, don’t know maybe the server or save system already ensures that…

I think using different allocators is good. I just think letting different allocators’ errors bubble up into the same code path via try is bad, because the call site would have trouble recognizing whether the global allocator is out of memory or some smaller, more specialized one.

So if you want to handle some of the out of memory errors, it is useful to know whether that error came from some allocator, where you want to use some kind of fallback. If you catch and crash the ones that aren’t supposed to be handled, that is one way to avoid mixing that error, with another one, that might make sense to handle.
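As a sketch of keeping those errors apart: crash on the allocator that is not supposed to fail, and give the other one its own error name. Task, globalCreate, schedule and error.TaskQueueFull are all made up for illustration, and it assumes a Zig standard library from roughly the 0.11 to 0.14 range:

```zig
const std = @import("std");

const Task = struct { id: u32 };

// Hypothetical: the global allocator is not expected to run dry, so that is fatal.
fn globalCreate(global: std.mem.Allocator, comptime T: type) *T {
    return global.create(T) catch @panic("global allocator out of memory");
}

// The queue's allocator may legitimately fill up; report that under its own name.
fn schedule(
    global: std.mem.Allocator,
    queue_allocator: std.mem.Allocator,
    queue: *std.ArrayListUnmanaged(*Task),
    id: u32,
) error{TaskQueueFull}!void {
    const task = globalCreate(global, Task);
    errdefer global.destroy(task);
    task.* = .{ .id = id };
    queue.append(queue_allocator, task) catch return error.TaskQueueFull;
}

pub fn main() void {
    const global = std.heap.page_allocator;
    // A one-byte buffer makes the queue allocator fail on purpose.
    var tiny: [1]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&tiny);
    var queue: std.ArrayListUnmanaged(*Task) = .{};

    schedule(global, fba.allocator(), &queue, 1) catch |err| {
        std.debug.print("could not schedule: {s}\n", .{@errorName(err)});
    };
}
```

The call site can now only ever see TaskQueueFull, so there is no question about which allocator ran out.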

You can decide that every OOM crashes, but I feel like there are cases where you use allocators for certain things, where it may make sense to sometimes do something different. For example if your game has developer tools, maybe those should try not to crash the game, so that you still can gain insights into a program that is about to crash. I also think developer tools are a good candidate for a pre-allocated buffer, plus a build option to compile them in or not.

You also could pass in multiple allocators. I think this is mostly a difference in personal preferences in how to structure code (I don’t care if the allocator is passed in or managed in some data structure), mixed with me not knowing what your code actually looks like (apart from snippets).

I agree that mixing your temporary working memory data into your results data in the same allocator isn’t good.

Maybe you setup/configure those allocators somewhere else, in some other way that is flexible.

I don’t have a save system yet, but I would want it to be resistant against all kinds of crashes (that would allow shipping some versions of the game in ReleaseSafe for finding bugs/UB). At the very least I want to guarantee that the save is not corrupted. A few seconds of progress might still be lost though, but I’ll try to keep progress loss at a minimum without having to constantly write stuff to disk.

That is a good point. Especially logging is something that shouldn’t crash the game.
But at the same time, if some developer tool hits out of memory on the global allocator, then I’d argue that some other code of the game would probably hit it next, so it wouldn’t really make a difference in the end.

To clarify - this is specifically related to try. In the case where the global allocator, via a free function, is handling its own issues, we avoid the issue of crossing streams and creating error ambiguity.

FWIW, allocation failure is not always unrecoverable. In many cases, the user is simply running too much stuff at the time, and retrying later or prompting the user can alleviate the issue.

It’s been a few days and based on what @AndrewCodeDev suggested I started doing the following:
I made an ErrorHandlingAllocator which works basically like most other Allocators: it has a backing Allocator, and if an allocation in the backing Allocator fails, it calls some error handling function, which in my case just crashes, but it could easily be extended with some sort of garbage collection + retry mechanism:

	fn alloc(ctx: *anyopaque, len: usize, ptr_align: u8, ret_addr: usize) ?[*]u8 {
		const self: *ErrorHandlingAllocator = @ptrCast(@alignCast(ctx));
		return self.backingAllocator.rawAlloc(len, ptr_align, ret_addr) orelse handleError();
	}

Secondly I made a wrapper over the Allocator interface which I called NeverFailingAllocator. It basically just calls all the Allocator functions with catch unreachable:

	pub fn create(self: NeverFailingAllocator, comptime T: type) *T {
		return self.allocator.create(T) catch unreachable;
	}

I think this is working quite well. I can still easily interface with library functions by unwrapping the NeverFailingAllocator, but most of my code can assume that allocations won’t fail, since failure is handled in one central location.

Overall I went from 1400 trys down to just 230, and I also cut around 40 unnecessary catches that caught OutOfMemory only to satisfy some arbitrary interface decisions.
I think this will make it a lot easier to get a picture of where I get real errors and hopefully I’ll have an easy time adding errdefers and error handling code around these.

However, I sadly had to add around 170 catch unreachable to stdlib functions like ArrayList.append(). I think I’ll probably implement my own list and maybe a wrapper for the HashMap to reduce that number.
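Such a list could start out as a thin wrapper like the following. This is only a sketch with made-up names (NeverFailingList and its methods), assuming a Zig standard library from roughly the 0.11 to 0.14 range. The catch unreachable is only sound if the allocator handles failure itself, as the ErrorHandlingAllocator does:

```zig
const std = @import("std");

// Hypothetical wrapper: keeps `catch unreachable` in one place instead of at every call site.
fn NeverFailingList(comptime T: type) type {
    return struct {
        const Self = @This();

        inner: std.ArrayListUnmanaged(T) = .{},
        allocator: std.mem.Allocator,

        pub fn deinit(self: *Self) void {
            self.inner.deinit(self.allocator);
        }

        // Assumes the allocator never returns an error (it crashes or recovers internally).
        pub fn append(self: *Self, item: T) void {
            self.inner.append(self.allocator, item) catch unreachable;
        }

        pub fn items(self: Self) []T {
            return self.inner.items;
        }
    };
}

pub fn main() void {
    var list = NeverFailingList(u32){ .allocator = std.heap.page_allocator };
    defer list.deinit();
    list.append(1);
    list.append(2);
    std.debug.print("{d} items\n", .{list.items().len});
}
```

Call sites then need neither try nor catch, and the one remaining unreachable documents the assumption.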