Handling Out Of Memory errors

Thank you for going into more detail. That makes a lot of sense :+1:

I agree with this, and my unsubstantiated hunch is that it’s partially a symptom of explicit error sets being underutilized. It’s very easy to lose track of what errors are possible when using inferred error sets, and it’s hard to do explicit error sets well, since the language/standard library doesn’t give you much help when working with them (I’m still a bit bummed that this PR got rejected, and unfortunately this proposal got un-accepted).

For resinator, one goal of mine is to be able to put an explicit error set on main with only a few errors in it, like OutOfMemory, but I haven’t put effort into that yet, and I know I’m currently bubbling up some errors that I shouldn’t be.

3 Likes

Crashing on OOM for server-side applications would be totally nuts.

One of the nice things about Zig allocators is that you can impose local limits on memory usage. People shouldn’t write code assuming an OOM error indicates unrecoverable memory exhaustion.
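To make that concrete, here is a minimal sketch using std.heap.FixedBufferAllocator as the locally limited allocator (any budgeted allocator behaves the same way): the OutOfMemory coming back only describes that local budget, and the caller can recover from it like any other error value.

```zig
const std = @import("std");

test "OutOfMemory can be a purely local condition" {
    // 16-byte budget: the machine has plenty of memory, only this budget is tiny.
    var small_budget: [16]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&small_budget);
    const allocator = fba.allocator();

    // The failed allocation is just an error value the caller can recover from;
    // it says nothing about system-wide memory exhaustion.
    try std.testing.expectError(error.OutOfMemory, allocator.alloc(u8, 1024));
}
```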

5 Likes

Not really, since you usually have a watchdog running which restarts the service on a crash. Crashes shouldn’t be frequent, but when they happen they shouldn’t be disruptive or even noticeable for the majority of users. For example, when I worked on the Drakensang MMO, even when a map server process crashed it was restarted automatically; users on that map instance would get a short reconnect screen (maybe 2 to 5 seconds) and then they were back in action, although on a ‘fresh’ map. And since each map instance ran in its own process, usually only a few users were affected (unless it was a hub map instance, which typically held around 50 users).

2 Likes

If a remote user can crash your server-side app once, he can crash it constantly. This isn’t an acceptable approach at all in this day and age.

This is a good discussion.

I’d offer that the only rule of thumb which is generally applicable is: libraries should propagate OutOfMemory, applications should favor handling the condition where it arises. That, and anything which does propagate OutOfMemory (or in fact, any error) should have a test demonstrating that it doesn’t leak resources when an error occurs.
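A hedged sketch of both halves of that rule (makePair and testMakePair are made-up names; std.testing.checkAllAllocationFailures exists in recent standard library versions, adjust if yours differs): the library function propagates OutOfMemory and uses errdefer so a partial failure can’t leak, and the test forces each allocation to fail in turn to prove it.

```zig
const std = @import("std");

/// Hypothetical library routine: propagates error.OutOfMemory to the caller
/// and uses errdefer so nothing leaks if a later allocation fails.
fn makePair(allocator: std.mem.Allocator, len: usize) !struct { a: []u8, b: []u8 } {
    const a = try allocator.alloc(u8, len);
    errdefer allocator.free(a); // runs only if the second alloc fails
    const b = try allocator.alloc(u8, len);
    return .{ .a = a, .b = b };
}

fn testMakePair(allocator: std.mem.Allocator) !void {
    const pair = try makePair(allocator, 128);
    allocator.free(pair.a);
    allocator.free(pair.b);
}

test "makePair doesn't leak when any allocation fails" {
    // Re-runs testMakePair, making each allocation fail in turn, and reports
    // leaks through std.testing.allocator.
    try std.testing.checkAllAllocationFailures(std.testing.allocator, testMakePair, .{});
}
```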

I’d even suggest that applications consider treating internal libraries as libraries for this heuristic, letting errors bubble up a level or two when called for. If code has no reason to know what an Allocator actually is, it’s good if that code doesn’t assume allocation errors are unrecoverable. That leaves room for refactoring the allocation strategy in the program, or spinning that internal library into something reusable.

I say “consider” because this won’t always be the right thing to do. Huge amount of “it depends” going into this question.

Last observation: “what is responsible for OutOfMemory” feeds into decisions about what is responsible for allocations, and that’s a central question for program correctness in a language with manual memory management. When you have a clear memory policy, it should let you answer the OOM question as well.

7 Likes

Server crashes will happen; you fix the bug and move on. When you let a crashing bug linger long enough that clients have time to figure out how to kill your server intentionally, that’s an organizational problem, not a technical one. And even if that happens, a client can only crash the map instance he’s currently on, which affects somewhere between 1 and 50 users.

For professional ransom groups, a traditional DDoS attack is much more effective.

If you write a program that panics on an OOM error, that’s not a bug. That’s failure by design.

1 Like

It is design, yes. A crash-only architecture is perfectly valid, and in fact has numerous advantages.

But it isn’t the only way to skin the cat, either.

You’re both projecting a number of assumptions onto the category “server” here, which is very broad; and since those assumptions clearly differ, that’s putting you at cross purposes.

I don’t think anything useful can be said about the proper memory policy for “a server”, without refining what kind of server we’re talking about.

5 Likes

Interesting thread. In at least 20 years I have never run out of memory in any program, mainly because I am a bit aware of what is going on.

It is because I am using Zig that I actually need to think about the flow.

My current program is full of `somelist.append(allocator, newthing) catch die();` to avoid polluting all the code with `try` and `!void`.

Of course it depends on what exe or lib or whatever we are creating.

Main question for me: propagating the error all the way up to main() will not solve anything, because we will not know where the error took place. Or do we??
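It does help, actually: in Debug and ReleaseSafe builds, an error that escapes from main is printed together with an error return trace listing each try/return point it passed through, so you do see where the OutOfMemory originated. A tiny sketch (the allocator is deliberately tiny just to force the failure):

```zig
const std = @import("std");

fn inner(allocator: std.mem.Allocator) ![]u8 {
    // Pretend this is buried deep inside the program.
    return try allocator.alloc(u8, 1024);
}

fn outer(allocator: std.mem.Allocator) ![]u8 {
    return try inner(allocator);
}

pub fn main() !void {
    // An 8-byte budget so the allocation is guaranteed to fail.
    var tiny: [8]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&tiny);

    // In Debug/ReleaseSafe builds this prints "error: OutOfMemory" followed by
    // an error return trace through inner() and outer(), pointing at the origin.
    _ = try outer(fba.allocator());
}
```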

I would maybe rather claim the maximum memory I need upfront and always be sure I stay within it.

3 Likes

A server-side app is always a multi-user app. Letting the action of one user negatively impact the experience of other users is clearly wrong.

Consider a typical request/response based server: would you rather panic and let the watchdog restart the server (potentially dropping all other active clients), or serve an Internal Server Error to the client that found a way to trigger OOM with its request?

When the server really runs into an out-of-memory condition, it’s unlikely that a response can be sent back to the client, since that would also require memory allocation (also, on Linux with overcommit you typically don’t get the opportunity to handle an OOM at all: the OOM killer just sends the process a SIGKILL, which can’t be caught by a signal handler).

Other crash bugs are generally unpredictable and thus also can’t have error handling until they actually happen and are fixed (in our case, ‘crash bugs’ were usually ‘can’t happen’ asserts, which then actually happened because of wrong assumptions).

Also, I’m in the same boat as @ericlang: I’ve been programming for decades now but never ran into an actual out-of-memory condition (plenty of slowly growing memory leaks, though), so while actual OOMs are in the category of ‘predictable’ bugs, they’re about as rare as UFO sightings.

1 Like

Please note that, as @floooh points out above, this will not work on Linux (and other platforms). Maybe you’re not working with these platforms and it does make sense to allocate up front.

2 Likes

??? You preallocate enough resources in your server so you can always send a fallback / error response.
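For example (a sketch with made-up names, built around std.fmt.allocPrint just for illustration): the fallback response is a comptime constant, so producing it allocates nothing, even when the request’s own budget is already exhausted.

```zig
const std = @import("std");

// Preallocated (comptime-constant) fallback: producing it requires no allocation.
const oom_response: []const u8 =
    "HTTP/1.1 500 Internal Server Error\r\n" ++
    "Content-Length: 0\r\n" ++
    "Connection: close\r\n\r\n";

// Hypothetical: builds a normal response using the per-request allocator.
fn buildResponse(allocator: std.mem.Allocator, request: []const u8) ![]u8 {
    return std.fmt.allocPrint(
        allocator,
        "HTTP/1.1 200 OK\r\nContent-Length: {d}\r\n\r\n",
        .{request.len},
    );
}

// Returns the bytes to send back. Never fails: if the request exceeds its
// memory budget, the caller gets the static fallback instead.
fn respond(allocator: std.mem.Allocator, request: []const u8) []const u8 {
    return buildResponse(allocator, request) catch |err| switch (err) {
        error.OutOfMemory => oom_response,
    };
}

test "a request that exceeds its budget still gets a response" {
    var tiny: [8]u8 = undefined; // deliberately too small
    var fba = std.heap.FixedBufferAllocator.init(&tiny);
    const reply = respond(fba.allocator(), "GET / HTTP/1.1\r\n\r\n");
    try std.testing.expect(std.mem.startsWith(u8, reply, "HTTP/1.1 500"));
}
```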

And let’s say somebody somehow found a way to send a request that always triggers OOM by allocating lots of memory (assuming no overcommit). Is your strategy for DDoS attacks just to take an arrow to the knee?

We programmers have a dumb habit of creating patterns and following them like a religion, when the correct solution always depends on the context.

> also, on Linux with overcommit you typically don’t get the opportunity to handle an OOM at all: the OOM killer just sends the process a SIGKILL, which can’t be caught by a signal handler

Unless you rely on a badly written program that depends on overcommit, I’d disable it. It’s unfortunate that Linux is still stuck with this default behavior because of fork.

> I’ve been programming for decades now but never ran into an actual out-of-memory condition (plenty of slowly growing memory leaks, though), so while actual OOMs are in the category of ‘predictable’ bugs, they’re about as rare as UFO sightings

I think this just means you mostly program for beefy hardware.

> Please note that, as @floooh points out above, this will not work on Linux (and other platforms). Maybe you’re not working with these platforms and it does make sense to allocate up front.

On Linux (with overcommit) it’s possible to use mlock/mlockall to claim the memory and make sure it won’t be swapped out; the syscalls report an error if the memory can’t be claimed (OOM).

This happens once, then the bug is fixed and life goes on. Such things should already be caught by the validation of incoming messages, though, and never make it into the guts of the server code.

In general, IMHO it is more important to react quickly to server crashes and fix them than to try to think up front of every single situation that can go wrong (at least after covering the basics, like proper validation of all data that enters the server). The problems will always be in the areas where you least expect them, and no amount of error handling can protect against error situations you don’t even expect to exist.

TL;DR: good debuggability is more important than correctness (since 100% correctness is not achievable anyway)

In a typical request/response based server, it’s trivial to use arenas to both limit and reuse memory. This is also a form of “validation”: that your requests/responses don’t go over the memory budget.
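A hedged sketch of that pattern (handleRequest is a made-up handler and the 256 KiB budget is arbitrary): a fixed budget per connection, an arena on top of it that each request allocates from, and a reset between requests that keeps the capacity around.

```zig
const std = @import("std");

// Hypothetical handler: allocates only from the per-request arena.
fn handleRequest(allocator: std.mem.Allocator, request: []const u8) ![]u8 {
    return std.fmt.allocPrint(allocator, "echo: {s}", .{request});
}

pub fn main() !void {
    // Fixed budget for this connection; the arena reuses it between requests.
    var budget: [256 * 1024]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&budget);
    var arena = std.heap.ArenaAllocator.init(fba.allocator());
    defer arena.deinit();

    const requests = [_][]const u8{ "one", "two", "three" };
    for (requests) |request| {
        if (handleRequest(arena.allocator(), request)) |response| {
            std.debug.print("{s}\n", .{response});
        } else |err| switch (err) {
            // Over budget: reject this request, keep serving the rest.
            error.OutOfMemory => std.debug.print("rejected: {s}\n", .{request}),
        }
        // Hand everything back in one go, keeping the capacity for the next request.
        _ = arena.reset(.retain_capacity);
    }
}
```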

3 Likes

That’s not a trivial operation though.

In doing so you’re inadvertently telling the kernel that your software is super duper important and must be kept in RAM at all times, and that you’d rather swap every other process out to disk than kill this one.

It also forces you to run with root privileges or operate under the configured ulimit, and either of those can be a big design challenge, depending on what you’re doing and what you’re running on.

It works in a sense: on startup you either get an ENOMEM or you don’t; after that, the memory is “yours”.

The OOM killer can still send a SIGKILL, but in fact the OS can do that for any reason at any time. So yes, a robust system has to have a strategy for dying at literally any cycle in which it’s executing; on any POSIX system, that’s just the nature of the game. Memory overcommit doesn’t change that, except insofar as page faulting is one of the things which can draw the baleful eye of the OOM killer.

But the program logic now has one place to deal with the possibility of asking for memory and being told “nope”. That does have advantages.
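A minimal sketch of that single choke point, with an arbitrary 512 MiB budget picked just for illustration (and ignoring mlock; with Linux overcommit the startup allocation only reserves address space, which is exactly the caveat discussed above): the one allocation that can fail against the OS happens at startup, and everything afterwards draws from the claimed budget.

```zig
const std = @import("std");

// Hypothetical application body: it only ever sees the pre-claimed budget.
fn run(allocator: std.mem.Allocator) void {
    _ = allocator;
}

pub fn main() !void {
    // The single place where the program can be told "nope".
    const budget = std.heap.page_allocator.alloc(u8, 512 * 1024 * 1024) catch {
        std.debug.print("need 512 MiB to run, refusing to start\n", .{});
        return error.OutOfMemory;
    };
    defer std.heap.page_allocator.free(budget);

    var fba = std.heap.FixedBufferAllocator.init(budget);
    run(fba.allocator());
}
```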

Session management is not an essential property of servers, and server-side apps do not have to be implemented as single-process applications, and commonly are not. Plenty of servers are stateless, or nearly so, or what state they have they get from a memcache anonymously, so that individual workers don’t get pinned to specific client state.

It’s also perfectly tenable to do session management on a per-process basis. It’s not even terribly uncommon; that’s how CGI works, for example. If a CGI script encounters any invalid condition, it can just abort with a non-zero code and the webserver will send some sort of 5xx in response. No problem.

This is the difference between saying that it’s always bad for a server process to crash on any error, and saying that for some kinds of server it’s bad to just crash on any error. The former is objectively false and the latter is objectively true; that’s quite a distance!

3 Likes

Your comment makes me think you might enjoy reading this paper:
https://www.usenix.org/conference/hotos-ix/crash-only-software

1 Like