Handling Out Of Memory errors

I’ll admit that I’ve never really thought about what the code I write should do if it runs out of memory. I figured it would crash hard and that was fine.

Working with Zig, though, I dislike just `try`ing allocator methods and letting the potential error bubble up.

I am working on a toy game, so it’s not a critical item, but I’d like to understand some good practices for handling running out of memory.

Presumably it’s good practice to try to do a save - but probably to a different location - and to check for it when loading up again, to see whether it can be recovered.

Is it also an idea to allocate a small amount of memory at the start which can be used for safer teardown? How would we set it aside for teardown use?

Are there other pointers (haha) or resources you could point me to?

Thanks in advance for your help :slight_smile:

2 Likes

This is perhaps a somewhat heretical opinion, but I do think that calling std.process.abort on most allocation errors is the appropriate strategy for most applications.
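
For example, a minimal sketch (the loadLevel helper is just illustrative):

const std = @import("std");

// Treat allocation failure as fatal at the call site instead of
// threading error.OutOfMemory up the stack.
fn loadLevel(allocator: std.mem.Allocator, size: usize) []u8 {
    return allocator.alloc(u8, size) catch std.process.abort();
}

Note the return type is plain []u8 - no error set, so callers don’t need try.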

I don’t think you should try to do an auto-save on allocation failure. Rather, you should auto-save frequently, such that there’s a save even in the event of power-loss or some other unforeseen circumstances.

10 Likes

Didn’t know about that function, I’ve used @panic("OOM") until now. I’m guessing std.process.abort quits more cleanly?

2 Likes

Didn’t know about that function, I’ve used @panic("OOM") until now. I’m guessing std.process.abort quits more cleanly?

Yeah, but @panic ends up calling std.process.abort. I was describing the thing in terms of what should happen physically, not in terms of the cleanest way to express it. For the latter, I’d advise defining

pub fn oom() noreturn {
    @panic("out of memory"); // placeholder body; adjust details as you go
}

so that you can adjust details as you go. For TigerBeetle, we have the following utility:

// Imports assumed to make this excerpt self-contained:
const std = @import("std");
const assert = std.debug.assert;
const log = std.log;

/// To ease investigation of accidents, assign a separate exit status for each fatal condition.
/// This is a process-global set.
const FatalReason = enum(u8) {
    cli = 1,
    no_space_left = 2,
    manifest_node_pool_exhausted = 3,
    storage_size_exceeds_limit = 4,
    storage_size_would_exceed_limit = 5,
    forest_tables_count_would_exceed_limit = 6,
    unknown_command = 7,

    fn exit_status(reason: FatalReason) u8 {
        return @intFromEnum(reason);
    }
};

/// Terminates the process with non-zero exit code.
///
/// Use fatal when encountering an environmental error where stopping is the intended end response.
/// For example, when running out of disk space, use `fatal` instead of threading error.NoSpaceLeft
/// up the stack. Propagating fatal errors up the stack needlessly increases dimensionality (unusual
/// defers might run), but doesn't improve experience --- the leaf of the call stack has the most
/// context for printing the error message.
///
/// Don't use fatal for situations which are necessarily bugs in some replica process (not
/// necessarily this process), use assert or panic instead.
pub fn fatal(reason: FatalReason, comptime fmt: []const u8, args: anytype) noreturn {
    log.err(fmt, args);
    const status = reason.exit_status();
    assert(status != 0);
    std.process.exit(status);
}

10 Likes

Tbh, about the only useful thing you can do if you actually run into an OOM is to “let it crash”.

Also, some operating systems (most notably Linux) let you overcommit allocations, so your malloc call will succeed even if there’s no physical memory to back it up, and your process will be killed later by the OS once you access allocated address ranges that can’t be mapped to physical memory. And AFAIK the infamous “Linux OOM Killer” may also kill your application when another application caused the out-of-memory condition.

AFAIK Android will also start killing random apps that are suspended in the background when it needs to make room for something else.

Personally I started to write code with the assumption that it may be terminated at any time without being able to do a cleanup or final saving operation (because most of my stuff also needs to run in browsers via WASM, and when the user closes a tab there’s no reliable way to run cleanup code).

TL;DR: your application may be killed by the OS for all sorts of reasons, even if you did nothing wrong, and there’s nothing you can do about that anyway.

11 Likes

Tbh, about the only useful thing you can do if you actually run into an OOM is to “let it crash”.

I disagree with this. It really depends on the context.

3 Likes

Well, it’s not perfect - but if some of your target platforms tell you an allocation succeeded and then kill your process when it actually tries to access that memory, there’s not a lot of useful things your code can do - and if you get a SIGKILL because the system is out of physical memory, you can’t react to it at all.

It’s the same with stack size (at least on Windows and WASM), there’s no way to gracefully react to a stack overflow. You either get memory corruption or the process is killed.

IMHO running out of (physical) memory is about the only error that’s not worth attempting to handle. It does make sense though to define an artificial upper memory limit at least in debug mode to easily catch and identify runaway memory leaks.
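
For example, a sketch of such a limit using std.heap.GeneralPurposeAllocator’s optional memory-limit config (exact field names may differ between Zig versions):

const std = @import("std");

var gpa = std.heap.GeneralPurposeAllocator(.{ .enable_memory_limit = true }){};

pub fn main() !void {
    gpa.requested_memory_limit = 64 * 1024 * 1024; // artificial 64 MiB budget
    const allocator = gpa.allocator();
    // A runaway leak now surfaces as a deterministic error.OutOfMemory
    // long before the OS gets involved.
    const data = try allocator.alloc(u8, 1024);
    defer allocator.free(data);
}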

6 Likes

I used to be one of those who always bubbled error.OutOfMemory up to the caller, which I (generally) now consider indicative of my own inexperience.

I can’t recall who it was or what the post was exactly, but one of the notable members here on Ziggit provided some great advice for creating better APIs and reducing try being splattered all over the place. There will always be exceptions, but to summarize: typically there is no saving an application that is out of memory, so simply aborting at the call site is the best solution.

const buffer = allocator.alloc(u8, num_bytes) catch @panic("out of memory");

This practice alone has dramatically reduced the amount of try polluting my codebases, and accomplishes the same result as letting the error bubble all the way back up to main, where I would still have no solution except to crash anyway.

As for your concern about saving, this would be very risky. I personally wouldn’t even presume that saving is possible when the application is in such a state - or worse yet, you might save a corrupted/invalid state. Better to just implement an autosave feature that runs periodically or on certain events, and let that be a sufficient solution for such an exceptional event.

3 Likes

Tbf, in a library I would never put a catch @panic on a failed allocation (or generally when you’re allocating through an allocator you get from somewhere else), instead always pass the error up to the library user (which then might decide to panic).

Because a user-provided allocator running full is very different from your system running out of physical memory (it could be a fixed-capacity arena allocator). And in this situation it is actually important to pass the error up to the caller.
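
To make that concrete, a small sketch of the library side (duplicate is a made-up example function), with std.heap.FixedBufferAllocator standing in for the caller’s fixed-capacity allocator:

const std = @import("std");

/// Library code: propagate OOM and let the caller decide what it means.
fn duplicate(allocator: std.mem.Allocator, s: []const u8) ![]u8 {
    return allocator.dupe(u8, s);
}

test "caller decides what OOM means" {
    var buf: [8]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buf);
    // The 8-byte budget is exhausted here, so the library reports the
    // error instead of panicking inside code it doesn't own.
    try std.testing.expectError(
        error.OutOfMemory,
        duplicate(fba.allocator(), "0123456789"),
    );
}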

But if you run into an “actual” out-of-memory (e.g. no more physical memory available), it’s quite unlikely that your allocator even returns with an error (depending on the system you’re running on).

8 Likes

I would agree - there are different desired behaviors when writing a library that uses an allocator passed into it, compared to an executable, etc.

For OP’s purpose of a game, I would abort the application at the failure. If writing a library to be consumed by others, I would still definitely return the error instead.

2 Likes

I must still be inexperienced, then. :upside_down_face:

4 Likes

I think doing an autosave on error is a good idea. Obviously, it shouldn’t replace a save from the user, nor a previous autosave that might be in a better position (because by the time the error happens, the game state might already be corrupted), but I personally hate losing even a minute of gameplay. It doesn’t replace a good autosave feature, like others have mentioned. Granted, the system may crash unexpectedly, but sometimes it will warn you, so in those situations you might as well make the best of it. As far as I know, Windows does return an out-of-memory error.
In a game, you typically control everything. You know your functions, your assets, and the possible ways the user can interact with the game. In a lot of games, you can calculate an upper bound on the amount of memory you’ll need. If this is your case, just make a global variable that will work as a buffer, and allocate from that. The OS will make space for it during loading and, if it can’t, it will handle the error for you (by warning the user and aborting), so your code can just assume it has everything it needs.
Casey Muratori does something similar in Handmade Hero; you can watch his videos. In his case, he did a single large allocation right at the beginning of the program.
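
A sketch of that approach in Zig (the 256 MiB figure is a placeholder for whatever upper bound you calculate):

const std = @import("std");

// One big static buffer, reserved up front; everything allocates out of it.
var game_memory: [256 * 1024 * 1024]u8 = undefined;

pub fn main() !void {
    var fba = std.heap.FixedBufferAllocator.init(&game_memory);
    const allocator = fba.allocator();
    // If this ever fails, the upper bound was mis-calculated - a bug to
    // fix, not a runtime condition to handle gracefully.
    const entities = try allocator.alloc(u32, 10_000);
    _ = entities;
}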

1 Like

It is context-dependent, and I worded that in a way which failed to convey as much. If you are developing a library to be consumed, likely with a user-supplied allocator, the error should definitely be returned.

OP stated this was for a game, which I would assume means it is the executable, and it determines the allocator being used. I simply can’t think of any helpful reason to return an OOM error all the way back to main, just so the application can be crashed there. I ask myself how many functions between the call site and main must implement error handling solely for a failed allocation, which I am not going to recover from anyway. I am assuming recovering from such a state is well beyond the scope of most applications, including hobby games.

8 Likes

One tangential thing that the friction of having to constantly deal with error.OutOfMemory guides you towards is avoiding heap allocation altogether whenever possible, which is generally a pretty good idea. So, one potential unintended consequence of sweeping OOM into the corner is that you might get too comfortable heap allocating and you might not realize that putting some effort into avoiding the possibility of OOM (using stack-allocated memory, passing around buffers instead of allocators, using smarter pre-allocation, using a different allocation strategy, etc) could lead to better overall code.
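
For instance, a caller-provided stack buffer removes the OOM path entirely; a tiny sketch with std.fmt.bufPrint:

const std = @import("std");

pub fn main() !void {
    var buf: [64]u8 = undefined;
    // No allocator involved: the only possible failure is error.NoSpaceLeft,
    // which is bounded and easy to reason about.
    const line = try std.fmt.bufPrint(&buf, "score: {d}", .{42});
    std.debug.print("{s}\n", .{line});
}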

7 Likes

It’s already so cumbersome to set up heap allocation that I’d say the language does disincentivize heap allocations.

1 Like

I recently learned about a funny pool allocator type (from Casey Muratori) that never returns an error. You keep a dummy instance (at index 0, perhaps) and when you run out of memory, instead of returning an error, you zero out the dummy instance and return it instead. That means it can happen that you give out this dummy instance multiple times, and that you zero out data that earlier callers wrote into it.

The callers need to know about this, of course, and I think it’s rare for this pattern to actually be useful, but I find it funny.

It can perhaps be used in codebases that do the zero-is-initialized style when you need to allocate but don’t really care if you fail. (maybe particles, enemies, grass, leaves?)

The benefit is that you never have to handle the error case at the cost of having to keep in mind that you might have a dummy instance in hand and writing code that works anyway.
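
A sketch of the idea (all names are illustrative, and T must be zero-initializable):

const std = @import("std");

const Particle = struct { x: f32 = 0, y: f32 = 0, alive: bool = false };

fn Pool(comptime T: type, comptime capacity: usize) type {
    return struct {
        items: [capacity]T = undefined,
        used: usize = 1, // slot 0 is reserved as the dummy

        const Self = @This();

        /// Never fails: on exhaustion, re-zero and hand out the dummy slot.
        /// Callers must tolerate the dummy being shared and clobbered.
        fn acquire(self: *Self) *T {
            if (self.used < capacity) {
                self.used += 1;
                return &self.items[self.used - 1];
            }
            self.items[0] = .{};
            return &self.items[0];
        }
    };
}

test "an exhausted pool hands out the dummy" {
    var pool = Pool(Particle, 2){};
    _ = pool.acquire(); // real slot
    const a = pool.acquire(); // pool full: the dummy
    const b = pool.acquire(); // the dummy again, re-zeroed
    try std.testing.expect(a == b);
}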

8 Likes

I don’t think this friction is really guiding you the right way. From my experience it’s easy to get into the habit of just bubbling all allocation errors up to main, and by doing so it gets significantly harder to handle actual errors, like missing files, that you don’t want to crash on, or at least want to print a nice error message for. Instead you just get error.FileNotFound on failure, which to be honest is not really more helpful than Segmentation fault (core dumped).

I don’t think there is an inherent advantage to avoiding heap allocation. It’s a fine tool, and in my opinion heap allocation is always more readable and more secure than lazily written buffer code with randomly chosen buffer sizes. (Trust me, I’ve been there.)
I think the most important part is just choosing the right allocator, and not giving up on allocation entirely. Nowadays I use a threadlocal, stack-like allocator (which falls back to a global allocator when its capacity is exhausted) for almost all my local allocations, and it’s fast, nicer to use, and less error-prone than trying to guess maximum sizes for user-provided inputs (e.g. config files, file paths, user-entered text).
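
Zig’s standard library has something in this spirit: std.heap.stackFallback serves small allocations from a fixed buffer and falls back to a wrapped allocator once that is exhausted (a sketch; my own allocator works a bit differently):

const std = @import("std");

pub fn main() !void {
    var sfa = std.heap.stackFallback(1024, std.heap.page_allocator);
    const allocator = sfa.get();

    const small = try allocator.alloc(u8, 100); // served from the 1 KiB buffer
    defer allocator.free(small);
    const large = try allocator.alloc(u8, 4096); // falls back to page_allocator
    defer allocator.free(large);
}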

6 Likes

This is quite similar to junk pages in (computer) emulators. E.g. instead of expensive code to catch and ignore write accesses to emulated read-only memory, you just map a common host memory area as a ‘junk page’ for write accesses to emulated ROM areas (and only map read accesses to the actual ROM dumps). That way writes to ROM areas don’t need to be handled differently from writes to regular RAM, which simplifies the code and is also faster.

1 Like

Thank you all for your inputs - it has been edifying, and I really appreciate the thoughts.

My key takeaways are:

  • Autosave regularly rather than saving on crash (there are many reasons for a crash, not just running out of memory - good point)
  • For applications, the most sensible course of action is to simply crash. Good points about the application possibly just getting killed on Linux, and that the actual OOM could happen on access rather than allocation (I did not know this)
  • I like the idea of having a centralised function you call that can be fleshed out later, perhaps to include some logging, but can start with a panic. Would logging even work after OOM?
  • It does make sense for libraries to bubble up the error.
  • I could probably calculate the maximum amount of memory required and allocate it at the start. Could be interesting, but it feels a bit greedy to hold on to memory that the game may only use later.

Do you mean services, or do you mean operating systems themselves? I’d also be curious how they handle the cases where OOM isn’t reported on allocation, but on access. I don’t think it applies to my current use case, but I’d love to learn more.

From what I remember, services (like HTTP servers) tend to favour fixed memory usage and limits to avoid OOMs in the first place.

I like this idea as well - I wonder if there is a way to do that for performance optimisation too - i.e. particles become dummy instances if the last X frames were too slow (or something) - kinda like a CPU-time allocator.

3 Likes

As I said, it depends on context. Handling OOM locally is very much possible. For example, if you have a cache that can be reset, you could free the cache and try allocating again.
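
For example, a sketch of that evict-and-retry pattern (evictCache is a hypothetical application-specific callback):

const std = @import("std");

fn allocWithRetry(
    allocator: std.mem.Allocator,
    evictCache: *const fn () void,
    n: usize,
) ![]u8 {
    return allocator.alloc(u8, n) catch |err| switch (err) {
        error.OutOfMemory => {
            evictCache(); // release whatever we can live without...
            return allocator.alloc(u8, n); // ...then try once more
        },
    };
}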

If you are trying to allocate for something that is not necessarily required, you can report the failure and continue with execution.

If you are some sort of editor / tool, and there’s not enough memory to load some asset, you can instead show a placeholder that indicates there isn’t enough memory available.

Panicking on OOM is not a bad strategy, but there are scenarios where it isn’t the best route.

It’s a pretty common tactic to have per-request arenas in HTTP servers.
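
For example (a minimal sketch, with handleRequest as a stand-in for real handler code):

const std = @import("std");

fn handleRequest(gpa: std.mem.Allocator) !void {
    // One arena per request: everything allocated while handling the
    // request is freed in a single deinit, and a failed request can't
    // leak memory into the long-running server.
    var arena = std.heap.ArenaAllocator.init(gpa);
    defer arena.deinit();
    const allocator = arena.allocator();

    const scratch = try allocator.alloc(u8, 512);
    _ = scratch;
}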

4 Likes