Cloudflare and `.?`

I’m not sure if language comparison discussions are welcome here or not, but I do want to make this about Zig. If it’s not on-topic, I understand.

It looks like the recent Cloudflare outage was down to somebody assuming a rust Optional would always have a usable value in it, and so all that was needed was a quick .unwrap(). Either that, or they didn’t want to handle the error, or they’d only really understood that to get the value out of an Optional you have to .unwrap() it and didn’t realise crashing was an option. There now seems to be a growing consensus in rust-land that this method is badly named because it looks safe when it isn’t. Suggestions like .unwrap_or_panic() are making the rounds, with the intention of making the behaviour much more obvious.

In Zig, the equivalent is .? which seems to be suggesting that this is such a common operation that it needs to be really short. The language reference says that it’s has an alternative spelling orelse unreachable which personally I find to be much more explicit that something bad can happen. It’s also easier to turn into a proper error handler.

Is having .? a good thing?

8 Likes

having .? is a good thing.

1 Like

Can you maybe “explain” to me why it’s preferable over orelse?

2 Likes

afaik, fail fast is better than continuing with an invalid value. using “orelse” can hide bugs. We use .? to assert that the value is not going to be null, but if it is, it crashes. Because it goes against our assertion that it is guaranteed going to be a value. A crash is a proof that our assertion was wrong. But using an orelse will hide it.

It’s tricky. Maybe .? should exit the function with an error instead of injecting undefined behaviour (which is also worse than a proper panic), but IMHO at the very least .? shouldn’t be a shortcut for ‘orelse unreachable’ but ‘orelse panic’.

OTH failed range checks also terminate execution (or afaik are UB in non-safe modes) and nobody really questions this behaviour - but it’s the exact same problem.

A strict compiler would prevent an unwrap or .? if the code hasn’t checked before that the value is actually valid (eg similar to type narrowing in Typescript), or prevent indexing an array unless the user has done an explicit range check before. The problem is: do we really want such a “grammar nazi” compiler, and is that even feasible/possible in all situations? And not everything is a mission critical backend service which brings half the internet down when it crashes.

IMHO the Cloudflare problem is that they only implemented one half of the “let it crash” philosophy (they didn’t test that failure case and didn’t have a recovery strategy). Letting processes go into a controlled crash is totally fine as long as the process will be restarted and without going into an infinite crash loop.

5 Likes

One problem of .? s that ‘unreachable’ only crashes in debug and safe mode, otherwise it’s undefined behaviour.

yes, in rust .unwrap() crashes even in release mode.

I can understand that’s desirable behaviour for unreachable. It’s explicitly saying to the compiler “this isn’t going to happen so don’t worry about it”. It’s use in std.debug.assert is doing just that.

.? just feels so handy that it having UB seems dangerous.

I’d probably say it’s a bit different as optionals are meant to be a way of mitigating something failing. Having unsafe behaviour in your mitigation strategy seems like a footgun.

Bugs exist. It’s a fact of life. Cloudflare’s problem is that they don’t utilize a deployment in stage process in a live environment. Because all the development testing won’t prepare you for the live data.

Typically you roll out the changes to a few servers to take live traffic. Run experiments on them, comparing them against a control group of servers with the old version. When the statistics look good, you expand the deployment to 1~5%. Then to 20%. Then fully. With automation the deployment and testing can be run in minutes, in days, or weeks if needed. In a fail fast scenario like this case, it would have been caught in the first batch of the rollout.

2 Likes

I think .? is fine, if people use it in a context where they shouldn’t then your process for ensuring software quality and getting developers to write quality software has failed, if there is a context where it absolutely shouldn’t be used than you should have some sort of test that ensures that it isn’t used.

But having said that, I think this isn’t an entirely technical problem, a lot of times when people try to solve problems through technical means they create more and more complex technical solutions (like longer more explicit names), which will still fail to actually solve the problem (for example by people copy&pasting code without understanding it).

If the problem is that irresponsible programmers are writing your software, you can’t fix that through technical means, but instead you need to actually spend the time to create a process and culture that leads to people acting responsibly.
And technical solutions can help in that, but they aren’t enough.

Also different projects have differing tolerance for accidental or even malicious foot-shootings, so it is on the project to define the constraints for what is acceptable and what isn’t.

So I think defining your way of doing things, similar to how tigerbeetle has done it with their tigerstyle, seems way better than creating generic disconnected safety rules which may completely miss the actual issues some project is facing.

Don’t over engineer some kind of imagined feel-good safety that just results in pointless hoop jumping and checkbox ticking, but actually develop your own real safety protocols and methods that make sense for your project, ideally even allowing you to reason about it and measure its effectiveness.


Having said all that, I think we can make it easier to use techniques like fuzzing, validation, etc. by creating better/easier to use libraries for those.

11 Likes

I think in moments like this it’s important to step back and consider: Is this really such a big issue, or are we just overreacting to it because it had such a big impact? Mistakes happen, and we cannot prevent all kinds of mistakes. There is no guarantee that unwrap_or_panic would have changed anything.

In my opinion .? is actually making things safer, since it gives us a convenient way to insert assertions about the program’s state. Furthermore they make optionals as a whole more accessible, and I’m sure we can all agree that the alternatives, a boolean flag or a magic pointer value, are much less secure, but that’s what people (and I’ve seen this first-hand in C++) will use if optionals become too cumbersome.

4 Likes

The reason I brought this up was because I’ve just lost several hours to a piece of code doing a .unwrap() on a null value. It was a piece of code running in the windowing system, and it crashing was just taking out random applications.

Once I found it I grepped the code base for other uses of unwrap and it was everywhere. This author just seemed to be using it as if that was their normal pattern. At that point we’re just back to a NullPointerException situation. That’s one of the exact things Optionals were meant to solve.

…in debug mode.

Please don’t think I’m questioning the value of Optionals, because I’m not. Just this one operator.

1 Like

Assertions are for invariants you can ensure through testing / fuzzing in debug/safe modes to be impossible. I strongly disagree with people misusing the term for general checks.

Sure you can create release time runtime checks for stuff, but I don’t call those assertions, assertions should be impossible to trigger (once you have found all your errors in thinking), if they aren’t then they aren’t assertions. And the safe-mode testing needs to be rigorous enough to make sure that all assertions truly are assertions, instead of things that actually can happen.

3 Likes

The call to unwrap was to extract the success value from a Result, which is the Rust equivalent of a Zig error union, not of an optional.

The crash was because the Result in question was actually in the Err case, causing the operation to panic.

Zig doesn’t have a direct equivalent of unwrap for error unions, we have to concatenate two constructs, either catch unreachable or catch @panic() depending on how you want to look at it.

I do think that a shortcut assert that something that can return an error does not infact return an error, is kind of a iffy language feature (much more so than .? for optionals), but regardless, this outage is the result of sloppy programming not of anything particularly specific to Rust, in my opinion.

15 Likes

Did you know that you can disable specific sources of panics like .?, turning uses of them into compile errors, simply by declaring a panic handler in your root source file?

const std = @import("std");

pub const panic = struct {
    const default = std.debug.FullPanic(std.debug.defaultPanic);

    pub const call = default.call;
    pub const sentinelMismatch = default.sentinelMismatch;
    pub const unwrapError = default.unwrapError;
    pub const outOfBounds = default.outOfBounds;
    pub const startGreaterThanEnd = default.startGreaterThanEnd;
    pub const inactiveUnionField = default.inactiveUnionField;
    pub const sliceCastLenRemainder = default.sliceCastLenRemainder;
    pub const reachedUnreachable = default.reachedUnreachable;
    //pub const unwrapNull = default.unwrapNull;
    pub const castToNull = default.castToNull;
    pub const incorrectAlignment = default.incorrectAlignment;
    pub const invalidErrorCode = default.invalidErrorCode;
    pub const integerOutOfBounds = default.integerOutOfBounds;
    //pub const integerOverflow = default.integerOverflow;
    pub const shlOverflow = default.shlOverflow;
    pub const shrOverflow = default.shrOverflow;
    pub const divideByZero = default.divideByZero;
    pub const exactDivisionRemainder = default.exactDivisionRemainder;
    pub const integerPartOutOfBounds = default.integerPartOutOfBounds;
    pub const corruptSwitch = default.corruptSwitch;
    pub const shiftRhsTooBig = default.shiftRhsTooBig;
    pub const invalidEnumValue = default.invalidEnumValue;
    pub const forLenMismatch = default.forLenMismatch;
    pub const copyLenMismatch = default.copyLenMismatch;
    pub const memcpyAlias = default.memcpyAlias;
    pub const noreturnReturned = default.noreturnReturned;

    pub fn unwrapNull() noreturn {
        @compileError("unwrapping optionals is forbidden");
    }

    pub fn integerOverflow() noreturn {
        @compileError("operations that can result in integer overflow are forbidden");
    }
};

pub fn main() void {
    var x: ?u32 = 123;
    var y = x.?;

    var a: u32 = 1;
    var b: u32 = 2;
    var c: u32 = a + b;

    _ = .{ &x, &y, &a, &b, &c };
}
repro.zig:34:9: error: unwrapping optionals is forbidden
        @compileError("unwrapping optionals is forbidden");
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
repro.zig:38:9: error: operations that can result in integer overflow are forbidden
        @compileError("operations that can result in integer overflow are forbidden");
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unfortunately this isn’t currently very useful in practice in most cases, because panic handler functions can also be called from std and the startup code that runs before and calls your main() function uses several std APIs. But for safety-critical software it could theoretically be a useful tool.

21 Likes

orelse does not “hide” it, but unreachable may (depending on the build mode). Consider:

fn orPanic(wrapped: anytype) std.meta.Child(@TypeOf(wrapped)) {
    return wrapped orelse std.debug.panic("Attempted to unwrap: {?}", .{wrapped});
}

This will panic even in optimized release modes.

1 Like

As someone who primarily works with “old-school” languages, Im surprised to hear that some rustaceans wouldn’t understand the mechanics of unwrap.

I guess when you’re working in a language where exceptions can come from anywhere and anything could be null, you’re more on the lookout for thse kinds of bugs.

Now I’m not a rust professional, I did enough rust to learn that Zig is more my speed, but as you mentioned, when you make a call to unwrap, one of two things has happened:

  1. you’ve taken a potentially dangerous shortcut to save yourself brain cycles / stay in the flow of working on the happy path.

  2. You are intentionally making an assertion that the value cannot be an error / null and the program cannot reasonably continue if that assertion holds false.

I swear that rust also has a .? that returns or unwraps the error like Zig’s try, but maybe I’m misremembering.

I’m any case, unwrapping an Option/Maybe monad without first validating the contents is as dangerous as dereferencing a pointer before checking that it is not null or submitting form data before it is validated. You can rename the function anything you like, but it’s up to the programmer to have the knowledge necessary to know which parts of a language are scary and which ones are safe.


Now, IIUC, in cloudflare’s case, unwrap wasn’t the real issue, the real issue was either in communication or remebering program requirements. What I heard was that they were pre-allocating room for 60 rules, and attempting to allocate 200. So in that case the two options were either crash, or run with only 60 of the rules, the latter of which would’ve been (arguably) less catastrophic, but likely wouldve been more insidious to debug. If instead of unwrapping they had logged the error, then perhaps the crash wouldn’t have happened, but some nasty requests might have gotten through.

But hey, we’ve all programmed at 5 o’clock on a Friday, and perhaps that’s all this really boils down to. :man_shrugging:

2 Likes

The problem is that they wrote code that didn’t properly handle error conditions.

So, at least with Rust, they got a useful error message and a crash, instead of the program just continuing onward with bad data or undefined behavior. So, at the very least Rust is crashing instead of doing undefined behavior. That’s good.

thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

Unfortunately, I don’t believe this is really something that a language, by itself can fix. This requires good architecture, good code review, and proper resources to fully understand a problem and how it can go wrong. A language can absolutely make it harder to do it right, but I don’t think any language can make it right on its own.

I think that the Rust community has created a set of expectations that they cannot meet and perhaps this has lead to people being a lot less careful about unwrap calls and not being careful about their logic.

Overall, I don’t think that .? is the problem.

2 Likes