I agree with everything you said, but I want to add, that article seems LLM-written. It feels awkward to read and has logic that doesn’t follow if you think about it:
Wrong. As the Cloudflare incident demonstrated, you can have a catastrophic security incident (a global CDN outage that could have been exploited by attackers) without a single memory safety violation . The unwrap() that crashed Cloudflare was memory-safe . It was also a critical security failure that took down major internet services.
I didn’t read the news about the Cloudflare incident, but it seems to have been an outage, not a security incident. An outage is not a “critical security failure”. Security failure is when an attacker gains access to something they don’t have permission for.
I think the meat of the article was written prior to Cloudflare’s public RCA and they then rewrote it to try and keep it relevant after the RCA was published. I had read it prior to the RCA and the takeaway from me was about the problem of marketing your language as “safe” in the context of “thread safety” (which the article discusses as maybe not even being accurate) and memory safety and how that can create perception problems, where developers become less careful because they think the language is protecting them (mistakenly believing that because Rust is safe, there is no problem calling unwrap in my mission critical service), hence linking it. I would have to see if there’s a way to fetch the original version, since I only pulled it out of my browser history since this discussion made me think of it.
It may have been generated or updated via LLM. That would not be surprising, but would also be disappointing.
In my opinion completely blowing apart and knowing exactly where you blew apart was WAY better than having a partial or corrupt set of rules handling your security dealing with hostile bots. Or, even worse, having to run down corrupted memory because extra rules overwrote a bunch of stuff.
In addition, everybody has an opinion about what should have been done, but the only useful one seems to be “roll back to previous known good”. And, to do that you either need access to the original file (not guaranteed to be present) or to cache all the previous rules or file (more resource consumption). The fact that this was statically allocated indicates that resources were constrained, so rollback required buy-in and process to guarantee the old file would still be in place on upgrade and plumb the access through and …
Or you can write .unwrap() and let it die.
Best case–nobody ever hits it. Okay case–you get a logged crash and you now get to have a real discussion about what the correct solution is. Worst case–everything blows apart but you finally get the resources to resolve this properly now that you have the rapt attention of management.
I’ll write that .unwrap() every time with only the barest second thought.
I’ll slag on Rust all day long, but the Cloudflare outage is not a black eye for Rust, in my opinion. The continuing idiocies with coreutils are a far stronger case against Rust … and even there the case is more against Rewrite It In Rust for something as battle-tested as coreutils than using Rust as a greenfield language.
Okay, this is really really cool and I think it should be a lot more prominent. The only other language that I know of that lets you do this kind of “language configuration” is Ada, and although some might be against it I honestly think that having the ability to configure the language (to the extent of “I don’t need these features”) is a good thing, and more languages should adopt it.
expect() (and unwrap()) are fine, albeit I wish they had better names. But the expect() here wasn’t someone ignorantly adding because they’re bad at error handling, the program was in a state that they didn’t expect (too many columns being returned) and therefore their program had no path to continue.
This is my sentiment towards .? in Zig. orelse undefined is much more explicit, and ? in many languages (including JS, Rust, Kotlin, and pretty much any language with a similar operator) is a safe operator for handling nulls. But in Zig, it’s Illegal Behavior (and therefore UB in safety-less builds!)
I’d personally much rather .? be named .! if possible. The ! symbol is what many languages use to describe “just assume it’s not null” (i.e. TypeScript and Kotlin)
I feel like in practice .? is very rare operation and should raise eyebrows whenever you see it. It’s not equal to unwrap because it’s similar to orelse unreachable and will be UB in unsafe release modes.
There’s nothing wrong with panicing and crashing program down on wrong invariants and I feel like .unwrap() is being made scapegoat when the real issue is how cloudflare did their deployments and testing… they had similar issue in past where they blamed ragel but instead their own code was wrong.
if (optional) |value| if (std.mem.eql(u8, value, "some string") {
//...
};
That said I wouldn’t necessarily call this better. If you don’t use the unwrapped value of optional anywhere else, then local use of .? right after the check is fine IMO.
See also here orelse is handy for early exits or do something alternative.
Having unwrap() crash in rust’s release mode is not good. That’s why I like the dangerous .?.
What is bugging me is that we cannot use and in combination with nullable syntax. Thats why I use .? in longer functions and need the filled nullable variable multiple times.
if ((optional) |value| and std.mem.eql(u8, value, "some string")) {}
This incident really made me start reflecting on the idea of ‘fail fast.’ In the past, I always believed that failing fast was not a bad approach, but this CF incident made me reconsider. Many people think that using unwrap in this case wasn’t a bad choice; I guess it might be the inertia of the ‘fail fast’ mindset, but in this incident, I completely disagree.
In fact, the problem here was caused by just one bot module of CF, a non-critical module, which had no reason to affect the entire project because of its own fast failure. For a non-critical module in a program, when it exhibits unexpected behavior, the expected behavior is clear: log the issue, report its own error and failure upwards. The higher-level components can completely choose to ignore this error; although it caused the submodule to fail, the failure of the submodule doesn’t have to lead to the failure of the entire program.
When such problems occur in the core module, the upper layers can respond and choose to crash, but the module itself should not overstep its bounds and act on behalf of the program core without permission.
The subtle differences here may occur between debug and releaseSafe. While releaseFast actively embraces undefined behavior and debug actively embraces fast failure, the requirements for releaseSafe may be more nuanced; it is neither about fast failure nor undefined behavior, but about throwing errors upward. However, the difference in these behavior requirements can lead to discrepancies in the API between different modes, which is really tricky.
Are you saying you prefer UB to a panic in unsafe release mode? If so, why? To diagnose the problem, and prevent unpredictable side effects from the UB, a panic is better.
It depends how the lower and higher level modules are isolated from each other, and how well the lower module can clean up and undo the partial operation before returning. Because this is very difficult to get right, and very rarely tested, it is usually best to panic.
Note that when comparing to Rust it really isn’t a one-to-one comparison. The reason Rust’s unwrap always panics is because Rust doesn’t have a non-safe Release mode (since safety is the point of Rust). It only has one Release mode, which is roughly like Zig’s ReleaseSafe.
It’s surprising to me that the incident was ongoing for 3 hours when the failing service was logging:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
It’s also somewhat surprising that the method that failed returns a Result so they could have simply return an error here no matter how unexpected it was. It’s interesting that the non-Rust version of the code that was running in parallel simply returned incorrect results instead of failing, which is potentially much worse.
In Zig I think you would not get a message like above in Release builds so it’s even more critical to avoid using Optional unwarp unsafely.