What is the best way to handle unrecoverable errors like OutOfMemory in an application?

From https://ziggit.dev/t/zig-code-smells/ where @AndrewCodeDev wrote

The program I’m working on is filled to the brim with trys, bubbling most errors up into main.
I agree that this is a code smell, and it has led to some bad habits. For example, I rarely handle errors (even though I probably should), and I don’t use errdefer at all since the program terminates anyway in most cases, which leads to lots of potential leaks in the few cases where I actually catch an error.

Now I think the biggest problem here is OutOfMemory errors. They happen absolutely everywhere (a quick grep found over 500 of them), and, since I normally use fallback allocators, they are not really recoverable.

So I wonder if I should get into the habit of using catch @panic("Out of Memory"), or maybe some helper function (for autocompletion) like catch main.panicOutOfMemory()?
Then it would be easier for me to track the real errors, which should be recovered at some stage. And I could hopefully rewire my brain to something like “error return type? Be extra careful, add those errdefers, and handle the error properly at the call site” instead of the current “error return type? It’ll probably crash, no need to put extra work in”.
On the other hand, it does seem quite tedious, especially considering that I’d need to change over 500 lines upfront.

What would you do? Is there maybe another simpler alternative?

(For libraries, or any function that takes an allocator, the answer is obviously to use try and bubble up the error. However, I’m asking this from the perspective of an application which mostly uses a global allocator)


I guess it depends on the kind of the program.

If you’re developing a desktop application (they seem to be a dying breed these days), I would say that preserving the user document / data / work would be of paramount importance.

If you’re writing a service / request handler of some kind… maybe crashing is an option, even though back when I was writing programs like that (in Python), I tried to keep them running at all costs - catching all exceptions to prevent them from crashing the process. Logging is very important, of course.


I was just looking at it today and I think I agree with a couple things that @Sze brought up.

Also, I agree that it tends to be around allocators where this stacks up.

So let’s say I have a function foo, and foo takes an allocator…

fn foo(allocator: std.mem.Allocator, ...) ![]const u8 {

    var x = try std.ArrayList([]const u8).initCapacity(allocator, 42);

    defer {
        x.deinit(); // and other stuff... maybe loop and free stuff...
    }

    for (...) |...| {
        const str = try allocator.alloc(u8, 42);

        // do some more stuff to str
        try x.append(str);
    }
    // and one more for good measure
    return try std.mem.join(allocator, "\n", x.items);
}

So judging by my atrocious pseudo code, I can see that there are 4 places where we’re trying something. In this case, if I use foo… I could just say:

const z = foo(allocator, ...) catch @panic("You must construct additional pylons.");

Because really… it’s the same error four times. If you really don’t intend on doing anything with it, then @panic makes sense. Due to that, a big part of keeping the tries down is grouping together functionality that can fail in similar ways and doing something about it in one place.

I also think that utilities like assumeCapacity are really well suited for cutting down on the noise if you can allocate an appropriate amount of space up front.
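
A minimal sketch of the assumeCapacity idea, assuming Zig 0.11 to 0.14 and std.ArrayListUnmanaged (collectSquares is a made-up name for illustration). The capacity is reserved once, so that is the only try; the appends inside the loop cannot fail:

```zig
const std = @import("std");

/// Hypothetical helper: collects the squares of 0..n into a list.
fn collectSquares(allocator: std.mem.Allocator, n: usize) !std.ArrayListUnmanaged(usize) {
    // The only fallible step: reserve room for every element up front.
    var list = try std.ArrayListUnmanaged(usize).initCapacity(allocator, n);
    for (0..n) |i| {
        // No `try` needed: the capacity was reserved above.
        list.appendAssumeCapacity(i * i);
    }
    return list;
}
```

The caller still owns the list and has to call deinit with the same allocator, but it only has one error to think about instead of one per element.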

Speaking of logging, if we take a look at one of the logging functions here: https://github.com/ziglang/zig/blob/master/lib/std/log.zig

/// The default implementation for the log function, custom log functions may
/// forward log messages to this function.
pub fn defaultLog(
    comptime message_level: Level,
    comptime scope: @Type(.EnumLiteral),
    comptime format: []const u8,
    args: anytype,
) void {
    const level_txt = comptime message_level.asText();
    const prefix2 = if (scope == .default) ": " else "(" ++ @tagName(scope) ++ "): ";
    const stderr = std.io.getStdErr().writer();
    std.debug.getStderrMutex().lock();
    defer std.debug.getStderrMutex().unlock();
    nosuspend stderr.print(level_txt ++ prefix2 ++ format ++ "\n", args) catch return;
}

On the last line, we see something interesting happening…

nosuspend ... catch return;

The writer’s error type is WriteError, which looks to me like it ultimately comes from the File.zig source:

pub const WriteError = posix.WriteError;

So when they’re logging there, it looks like they just toss the error aside and move on. It might be because there’s only so much that you can care about writing to stderr (I’m guessing at the author’s intent here). It’s really about value judgements at that level.

One more addendum here: consider the case of a matrix multiplication. For M1 * M2, we need the columns of M1 to line up with the rows of M2. I could make an error out of that, but IMO, why not just make it an assert? Asserts are underappreciated (even though they’re a dignified panic). Either way, a check like that probably belongs in debug builds.
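
Here is a small sketch of the matrix case, assuming Zig 0.11 or newer. The Matrix struct and multiply function are hypothetical, made up for illustration; the point is only that the shape check is an assert and the function has no error return at all:

```zig
const std = @import("std");

/// Hypothetical minimal matrix: row-major data with explicit dimensions.
const Matrix = struct {
    rows: usize,
    cols: usize,
    data: []const f32,
};

/// Writes a * b into `out` (row-major, a.rows x b.cols).
fn multiply(a: Matrix, b: Matrix, out: []f32) void {
    // A shape mismatch is a programmer error, so it is asserted instead of
    // being returned as an error the caller would have to handle.
    std.debug.assert(a.cols == b.rows);
    std.debug.assert(out.len == a.rows * b.cols);
    for (0..a.rows) |i| {
        for (0..b.cols) |j| {
            var sum: f32 = 0;
            for (0..a.cols) |k| {
                sum += a.data[i * a.cols + k] * b.data[k * b.cols + j];
            }
            out[i * b.cols + j] = sum;
        }
    }
}
```

std.debug.assert is checked in Debug and ReleaseSafe builds; in ReleaseFast and ReleaseSmall a failed assert is undefined behavior, so this only fits conditions that correct code never triggers.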


Better not to crash, but rather exit(some-non-zero-val) and then rely on some restarting system facility similar to systemd’s Restart=on-failure in a service file.


Well, yes, you want to auto-restart your service - that’s your last line of defense, unless you enjoy getting calls at 2 AM on weekends.

But if it’s not OOM (for example), but something you can recover from - say you got a bad JSON or a string that could not be converted to int, surely it’s best to handle it and not let it terminate your service.


Yes, definitely. I meant “out of memory” situation only.
But there is another thing: memory overcommit.
Or your service may crash under some unusual circumstances, after it’s been running without any problems for weeks, months…

To account for this, I usually catch SIGSEGV, SIGBUS, and SIGILL, print the call stack in the handler, and then exit(1). And, of course, I periodically check the logs so I don’t miss anything strange.


One shot programs

I definitely see the value of using catch @panic("Out of Memory") in short running programs that have a simple defined task, do that task and then exit. For example build scripts, code/asset generation, utility tools.

I think one of the questions you need to ask yourself is whether your application truly is such a short-running “one shot” program. If it is, this is probably a good strategy.

If it processes multiple data sets, allocating memory, freeing a lot and then allocating other stuff, then I don’t think crashing is a good choice.

Basically, if your memory is monotonically increasing and then you hit finish, crashing (exiting) is the best choice, because it is faster than freeing everything in reverse allocation order, piece by piece, just so that you can exit. Technically you don’t need to crash in that situation; you can just call std.os.exit (std.process.exit in current Zig).

Multi phase programs / bundles of different tasks

If you have a bunch of different phases where your memory usage grows because you work on something, and then shrinks because you are distilling the results, I think you have 2 options:

  1. Use arenas and think in terms of batches, failing in batches and recovering in batches (related: Enter The Arena - Ryan Fleury)

    This allows you to just say: “for these things, if anything fails we just abort the whole section, resetting the entire arena and pretending we never started”.

    However, it is important to make sure you don’t keep some database connection, or something like it, half opened, where you lose the handle and thus don’t know how to close it. Basically, don’t lose/leak external resources in the swept-away trash pile of memory.

  2. Do everything “one piece at a time”:

    • create one object
    • errdefer destroy it
    • add it to a list
    • errdefer pop it from the list
    • add its index in some other data structure
    • errdefer remove the index from the other data structure
    • do something else that can fail
    • everything worked: return the object

    Using this you get methods that are able to rollback to the state before the function was called, because if an error happens the errdefers undo the partial successes of the function. Thus these functions hide partial success, by undoing it, giving you complete success or complete failure.

    These functions are nice because they allow your program to reverse back from failure, convert the fine granularity error to a bigger granularity error, leaving you with less complexity at the call site.

    You have “success or fail”, not: “success or it failed and I need to check if the half created object is still in some list” at the call site.


With strategy 1. I still don’t have a lot of experience. I used it a tiny bit with some old GUI code, but not enough to really test it to its limits.
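
A rough sketch of what strategy 1. could look like, assuming Zig 0.11 or newer (loadBatch, runBatch and error.EmptyInput are hypothetical names, not from any real code base). Everything in the batch is allocated from one arena, and any failure throws the whole batch away with a single reset:

```zig
const std = @import("std");

/// Hypothetical batch step: copies every input into the arena and
/// rejects empty strings. Nothing here is freed individually.
fn loadBatch(arena: std.mem.Allocator, inputs: []const []const u8) ![][]u8 {
    const copies = try arena.alloc([]u8, inputs.len);
    for (inputs, copies) |input, *copy| {
        if (input.len == 0) return error.EmptyInput;
        copy.* = try arena.dupe(u8, input);
    }
    return copies;
}

/// Returns the number of loaded items, or null if the batch was abandoned.
fn runBatch(arena: *std.heap.ArenaAllocator, inputs: []const []const u8) ?usize {
    const copies = loadBatch(arena.allocator(), inputs) catch {
        // Abort the whole section: drop everything the batch allocated so far
        // and pretend it never started.
        _ = arena.reset(.free_all);
        return null;
    };
    return copies.len;
}
```

Note that the reset only covers memory. Handles to external resources (files, connections) that live inside the arena would still need to be closed before the reset.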

With strategy 2. here is an example:

pub fn createArchetype(self: *Self, atype: AType, new_id: u32) !*Archetype {
    var archetype = try self.allocator.create(Archetype);
    errdefer self.allocator.destroy(archetype);

    archetype.* = try Archetype.init(self.allocator, &self.component_infos, new_id, atype);
    errdefer archetype.deinit();
    try archetype.calculateSizes();
    return archetype;
}
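
The list and index steps from the “one piece at a time” recipe are not part of that example, so here is a separate minimal sketch of them, assuming Zig 0.11 to 0.14. World, Entity, spawn and error.InvalidId are hypothetical names, and the id check only stands in for “do something else that can fail”:

```zig
const std = @import("std");

const Entity = struct { id: u32 };

/// Hypothetical container that owns its entities.
const World = struct {
    allocator: std.mem.Allocator,
    entities: std.ArrayListUnmanaged(*Entity) = .{},

    /// Either the entity is fully registered, or the world is left
    /// exactly as it was before the call.
    fn spawn(self: *World, id: u32) !*Entity {
        const entity = try self.allocator.create(Entity);
        errdefer self.allocator.destroy(entity);
        entity.* = .{ .id = id };

        try self.entities.append(self.allocator, entity);
        errdefer _ = self.entities.pop();

        // Stand-in for a later step that can fail.
        if (id == 0) return error.InvalidId;
        return entity;
    }

    fn deinit(self: *World) void {
        for (self.entities.items) |e| self.allocator.destroy(e);
        self.entities.deinit(self.allocator);
    }
};
```

On failure the errdefers run in reverse order: the entity is popped from the list first and destroyed second, so the caller only ever sees complete success or complete failure.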

This is from a work-in-progress ECS, which brings me to another point: by using such an ECS (which uses Archetypes, basically batches of memory containing similar objects), I am kind of using both strategies:

  • strategy 2. to manage batches
  • strategy 1. in spirit, because the batches let me manage memory
    at a higher granularity

Because I have the batches, there are code paths where it is already clear that the memory is already allocated, thus I don’t have to deal with it on the instance level anymore.

Instead I get an error when I try to add a new instance and the batch tries to grow and doesn’t have enough memory. (If I deal with an individual object I may have to deal with the error individually)

batching / assume capacity

But it is often possible to, for example, check if the school bus is big enough for the whole class instead of requesting a seat for each pupil individually, which allows you to move (allocate) the whole bus instead of one pupil at a time.
This is the assumeCapacity case @AndrewCodeDev mentioned.

The logging Andrew mentioned is an interesting case. It might be worth exploring the idea of a logging library that uses something like assumeCapacity to either log the entire message or none of it, by pre-computing and reserving a big enough chunk of memory, but I am not sure how that would turn out. It would have to be explored as an actual experiment to see whether it has some benefit.

asserts

I also agree that asserts are great, but they are more for cases where some invariant wasn’t upheld, so I think they are for situations where some programmer tries to use something in an unintended way, to prevent that from being possible.


Garbage collection

I also think there is a 3rd strategy: you build something that in some way starts looking like (or is) a garbage collector.

You have some heuristic that triggers “memory is getting too full, we need to clean up”, then you walk a data structure, figure out what is garbage, and remove it, possibly reorganizing the remaining things. Then you continue running the program.

Typically garbage collection gets triggered before you hit out of memory, but you also could use an out of memory error as the trigger.

  • It could back away from an error with strategy 1. or 2., garbage collect, retry.
  • If the error happens again, maybe even try to move things that could be done later to the disk and retry.
  • If it happens again finally crash.

It is just that most Zig programs can probably avoid the need to invent / use their own garbage collector.

I guess if the OS has swapping configured, that could delay the point where you hit out of memory. But because the disk/storage is so much slower, it might be better for the program to hit out of memory instead of being slowed down by OS-based swapping, because the program might have more information and can just decide to kill some less important subtask.

You also could do some kind of distributed programming swapping, move some part of your working memory / problem to another machine, but that is even slower and just isn’t practical if you want to stay on one machine.


I found a rather straightforward example of this while researching how different applications handle this. This article is old but good: Handling out-of-memory conditions in C - Eli Bendersky's website

Here’s an example from that article: Git’s xmalloc wrapper, a semi-GC method that attempts a recovery.

void *xmalloc(size_t size)
{
      void *ret = malloc(size);
      if (!ret && !size)
              ret = malloc(1);
      if (!ret) {
              release_pack_memory(size, -1);
              ret = malloc(size);
              if (!ret && !size)
                      ret = malloc(1);
              if (!ret)
                      die("Out of memory, malloc failed");
      }
#ifdef XMALLOC_POISON
      memset(ret, 0xA5, size);
#endif
      return ret;
}

We can see that the code tries to make an allocation and probes if it doesn’t get the value it was looking for. If the requested size failed, it releases its “pack memory” and tries again. If that fails, it dies.


I probably should have specified that: It’s a multiplayer game.

I generally agree, but I think that needs a save system that works reliably under crash conditions. For example, on Linux the out-of-memory error isn’t even triggered reliably.

The problem is that my code doesn’t really look like this. I use a global allocator quite a lot. For example here is a random bit of code from my game:

pub fn schedule(mesh: *ChunkMesh) !void {
	const task = try main.globalAllocator.create(@This());
	task.* = .{
		.mesh = mesh,
	};
	try main.threadPool.addTask(task, &vtable); // Uses another allocator internally
}

Often there is not really a clean cut where it makes logical sense to place the @panic. I would need to place it on most alloc/append calls.

Yeah, I agree. Asserts are definitely the solution when it comes to user-defined unrecoverable errors, and I’m already using them in most places. But I cannot really apply that to unrecoverable library errors, like out of memory or errors while trying to open a window.

Yeah, a garbage collector is definitely something to look into. Though I would probably do it in the allocator instead of catching it at every site. For me the strategy could be as simple as just shrinking a bunch of caches I have.
But that doesn’t help with my problem. Eventually the garbage collector won’t have anything to clean, and then we are back at the point where I don’t know what to do with the (now truly unrecoverable) error.OutOfMemory that it returns.

Is it a network game? If so, on which side (client or server) have you had the most trouble handling the situations mentioned above?

I see - so let’s take a step back here.

We have two try statements here - one for creating a task, and one for adding a task. Since the try on the create statement uses the globalAllocator, that makes me wonder what can fail about the addTask call. Does it use the same allocator under the hood? Is there a different allocator? Or are there additional reasons why adding a task can fail?

The reason I’m asking is let’s pretend I’m the caller - if I call that function and I have multiple error sets coming at me, then I need to handle those independently. I could be getting an OOM or something else. Some may be reasonable, some may not be reasonable from my vantage point.

If I try to call that function and the globalAllocator runs dry… what can I reasonably do at that point? Going back to the xmalloc example, they free up some junk and then try again or die lol. But that happens every time and it’s handled in one place with the same reasoning.

If there’s really only one option, that should get moved into its own free function that uses the globalAllocator and makes a decision about what it means to globally run out of memory. If I as the caller have multiple options and I can be crafty about it, then sure, give me the error if I have more levers to pull.

Furthermore, different allocators returning OOM doesn’t mean the same thing universally. Maybe one can reasonably run out but the other absolutely shouldn’t - how would I tell on the outside if all I have is OOM?

IMO, I think you should make some decisions about what it means for the globalAllocator to run out and then codify those in a series of functions. Then, if you have additional errors, we can keep them in their own lanes.


Here’s a few toy examples…

// In one scenario I want something from the allocator...
const x = globalAllocator.create(T) catch {
     // free some stuff or call the police
};

// later...

const y = globalAllocator.create(T) catch {
     // freeing more stuff and calling the cops again
};

// and again...
const z = globalAllocator.create(T) catch {
     // this is getting old.
};

When we’re creating x, y, z… are we at the correct altitude to be making these decisions? Do we really have a different option each time? If not, I’d advocate for…

pub fn globalCreate(comptime T: type) *T {
    return globalAllocator.create(T) catch {
        // be as reasonable as possible...
        @panic("Out of Memory"); // last resort once nothing else helps
    };
}

const x = globalCreate(T);
const y = globalCreate(U);
const z = globalCreate(V);

pub fn schedule(mesh: *ChunkMesh) !void {
    const task = globalCreate(@This());
    task.* = .{
        .mesh = mesh,
    };
    // now we are only getting errors from addTask
    try main.threadPool.addTask(task, &vtable); // Uses another allocator internally
}

This way, if I have unique errors from addTask, maybe I can handle them differently. This has a kind of “signal purity” that I like, especially if addTask has its own allocation strategy, so we can deal purely with its problems alone.

By returning the error, I believe we are implicitly saying “there are legitimately different ways to handle this error”. For adding tasks, that may be true, and it may be a different situation than running out of global memory. If there aren’t legitimately different ways to handle the problem, the function should handle it itself, in one location.

Again, if you can imagine that there are legitimately unique things to do when global allocation fails, then you should handle those situations uniquely in each case - if not, centralize that concern and keep the error streams pure.
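
To make the “centralize that concern” idea concrete, here is a self-contained sketch assuming Zig 0.11 or newer. It borrows the retry shape of the xmalloc example: free something, try once more, then give up. global_allocator, shrinkCaches and the panic message are assumptions for illustration, not part of anyone’s actual code:

```zig
const std = @import("std");

// Assumption: one process-wide allocator. page_allocator keeps the sketch
// short; a real program would likely plug in its own allocator here.
const global_allocator = std.heap.page_allocator;

/// Hypothetical hook where an application would drop caches it can rebuild.
fn shrinkCaches() void {}

/// The single place that decides what "globally out of memory" means.
pub fn globalCreate(comptime T: type) *T {
    return global_allocator.create(T) catch {
        // First failure: release what can be released, then retry once.
        shrinkCaches();
        return global_allocator.create(T) catch @panic("out of memory");
    };
}

/// Counterpart so call sites never touch the allocator directly.
pub fn globalDestroy(ptr: anytype) void {
    global_allocator.destroy(ptr);
}
```

Call sites then read const task = globalCreate(Task); with no try, and any error that remains in a function’s signature comes from something other than the global allocator.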


… rusty nail into my kicks!!!
I’ve just remembered the only one
multi-player game with two players using a single keyboard, and with amazing music via PC speaker using PCM (Pulse Code Modulation)

It is a network game using UDP holepunching to be precise.

That’s a valuable insight right there! And yeah, using globalCreate/Alloc/Destroy/Free functions feels right for this situation.
Sadly, there seems to be no support for centrally handled allocation errors in the standard library.
Data structures like ArrayList only support std.mem.Allocator, which returns errors.
I guess I could write my own version of ArrayList, which is not too bad (I was forced to do that in Java as well, due to its poor generics), but I don’t want to end up rewriting half of the standard library.

After reconsidering in more detail, I don’t like this answer. In the context of a game client, I would argue strongly for getting away from the idea of relying on one single global allocator.

With this you are leaking or crashing globally:

globalCreate can’t really do anything reasonable except fail with the error instead of absorbing it. You could maybe argue for using one set of methods while you are trying to init the game: if you can’t even get to the main menu, you might as well crash / shut down.

But I think schedule or things that are used after an initial startup phase should either properly manage errors or have their memory pre-allocated in the startup phase and have no actual way to fail.

I think it is better to instead have smaller allocators/arenas that can fail and be reset individually. Better that a few branches break off than the whole tree collapsing.

With this your code can go back to before you tried to schedule:

pub fn schedule(allocator: std.mem.Allocator, mesh: *ChunkMesh) !void {
    const task = try allocator.create(@This());
    errdefer allocator.destroy(task);
    task.* = .{
        .mesh = mesh,
    };
    try main.threadPool.addTask(task, &vtable); // Uses another allocator internally
}

Then you have the option at the call site to let it bubble up to a reasonable granularity and make a decision. For example, you could switch to higher LOD levels (reduced geometry detail) and try again at lower resolution.
I also wonder a bit about who owns the task and cleans it up, and who owns the results it produces; I don’t really know the details of how the thread pool is used.
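
As an illustration of “try again with lower resolution”, here is a small sketch assuming Zig 0.11 or newer. Mesh, meshBytes and buildMesh are hypothetical, and the “each coarser level needs half the memory” cost model is made up purely for the example:

```zig
const std = @import("std");

/// Hypothetical mesh: just the level it was built at plus its storage.
const Mesh = struct {
    lod: u8,
    data: []u8,
};

/// Made-up cost model: every coarser LOD level needs half the bytes.
fn meshBytes(lod: u8) usize {
    var bytes: usize = 4096;
    var level: u8 = 0;
    while (level < lod) : (level += 1) bytes /= 2;
    return bytes;
}

/// Tries the most detailed level first and falls back to coarser ones.
/// OutOfMemory is only returned once max_lod has failed as well.
fn buildMesh(allocator: std.mem.Allocator, max_lod: u8) error{OutOfMemory}!Mesh {
    var lod: u8 = 0;
    while (true) : (lod += 1) {
        const data = allocator.alloc(u8, meshBytes(lod)) catch |err| {
            if (lod == max_lod) return err;
            continue; // retry with the next, coarser level
        };
        return Mesh{ .lod = lod, .data = data };
    }
}
```

The caller still gets an error when even the coarsest level does not fit, which is the point where a bigger decision (dropping chunks, going back to the menu) has to be made.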

Here is my reasoning:

  • when you have only one global allocator and you hit out of memory, there isn’t much sensible stuff you can do, except basically try to walk your program in reverse until you reach a point where your memory looks sensible again (which isn’t really a valid / easy thing to do in most languages)

  • instead (ideally) you want to split it so that you can have stacks of simple lifetimes:

    1. game boots up creates the root allocator and a bunch of things that will never be freed until program exit
    2. now you create an arena use it to load the splash screen, menu backgrounds, menu stuff, etc.
    3. now you basically push the arena onto the stack and create a nested new arena
      this new arena can grow further but resetting it gets you back to the beginning of step 3. you now fill it with stuff needed to communicate with the server and handle the local game state
  • let’s say the server sends very different amounts of variable-length data because of different terrain, enemies etc.

    • so you basically need 2 buffers that grow and shrink a lot, the network incoming stuff can be one arena while the client working memory is another
    • you could use slab allocated arenas, or alternatively you could use memory mapping to virtually pre-allocate address space for each arena separately (just pre-reserving virtual memory without the physical mapping)
    • for some things you may be able to just pre allocate a pre calculated amount of memory and switch those in and out, replacing chunks with other chunks
    • for incoming messages (streaming/temporary data really) it may make sense to have something like a circular buffer, especially if you can pre calculate an upper bound
      • some people have used page mapping to create circular buffers that can be read and written as if they were contiguous memory by mapping 2x virtual address space to 1x physical space, that could be interesting because then you can use a circular buffer without the user having to know / notice
  • with good reuse of buffers you should hit a steady state where the program doesn’t grow beyond

  • now if you hit out of memory anyway, you need to wonder why: are you holding on to things, are you leaking memory, did you try to load more than the machine can handle?

  • are you fragmenting memory within arenas?
    this means a lot of memory is no longer used but still not recovered, you could either try to reuse elements better for example via MemoryPool
    Or you could periodically scan through the arena compacting it.
    Or maybe even something that is like double buffering, where you swap between A and B buffer moving things from current to next buffer, if they continue to live, dropping/forgetting them if they aren’t relevant anymore.

  • let’s assume some other process steals all the memory from us while we were only using a little bit; theoretically we should be able to continue without using more memory than we currently have, by not showing more chunks, etc.

  • if that isn’t desirable because a half-rendered world isn’t appealing, we could make some last attempts to send/save the player’s position and then pop the stack going back to 2., basically going back to the main menu and disconnecting

  • from there on the player can close their browser hogging memory, or configure a swap file so that the OS can page out the browser’s memory-stealing ways

  • and the user can connect to the server again, without the game crashing
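
For the circular buffer mentioned in the list above, here is a minimal sketch assuming Zig 0.11 or newer and a capacity known at compile time. RingBuffer is a hypothetical type written for this example; it is the plain index-wrapping kind, not the page-mapping trick:

```zig
const std = @import("std");

/// Fixed-capacity ring buffer. The storage is part of the struct,
/// so pushing never allocates and can never return OutOfMemory.
fn RingBuffer(comptime T: type, comptime capacity: usize) type {
    return struct {
        const Self = @This();

        items: [capacity]T = undefined,
        head: usize = 0, // index of the oldest element
        len: usize = 0, // number of stored elements

        /// Returns false when the buffer is full; the caller decides
        /// whether to drop the message or apply back-pressure.
        pub fn push(self: *Self, value: T) bool {
            if (self.len == capacity) return false;
            self.items[(self.head + self.len) % capacity] = value;
            self.len += 1;
            return true;
        }

        /// Removes and returns the oldest element, or null when empty.
        pub fn pop(self: *Self) ?T {
            if (self.len == 0) return null;
            const value = self.items[self.head];
            self.head = (self.head + 1) % capacity;
            self.len -= 1;
            return value;
        }
    };
}
```

With a pre-calculated upper bound, “the buffer is full” becomes an ordinary return value on the network path instead of an allocation failure somewhere deep in the call stack.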

I’m going to go one step further and suggest something different here because while I get what you’re saying, this still has the problem from before:

Imagine we have an error like so…

           alloc1
              |
         {  OOM, X, Y, Z  }
              |
           alloc2

You now have two allocators using the same error. So which one failed? If you have a long enough call stack, this can become impractical to solve.

So I’m going to make a suggestion here…

// somewhere up above...

const task = try allocator.create(T);
errdefer allocator.destroy(task);
task.* = .{
    .mesh = mesh,
};

// then later...
pub fn schedule(task: *Task) !void {
    try main.threadPool.addTask(task, &vtable); // Uses another allocator internally
}

I’ll meet you halfway - I think mixing allocators for the same function (if they both can fail) is not optimal, but I get your point about not using a global allocator for the current use case.


I agree that mixing different allocators in one code path is bad and your example is a possible mitigation.

I think another thing to think about is the whole who owns what, who creates the task vs who reads the task details and writes the result.

One of the mantras from Go that seems reasonable is: don’t communicate by sharing memory; instead, share memory by communicating. That basically boils down to using queues/channels where possible, so that one end can write while the other reads.

That might be a way to get some distance between the scheduler and the worker thread. But it is kind of difficult to say without knowing how the results are used, and whether they need to be synced back up and recombined.

Also may depend on how much latency is acceptable and the needed throughput.


It is those “mobile applications” (a whole bunch of modern internet browsers) that
are the “dying breed”.

Everyone is installing some stupid “mobile applications” on their kinda “smart” phones.

Every programmer à la Python for the so-called back-end and à la JS for the so-called front-end
is among the most valued programmers ever.

Why on earth is it happening this way?

:slight_smile:
