From @matklad
@kristoff proposed this in the lobsters thread:
```zig
const result = try for (0..retry_count) |n| {
    break action() catch |err| {
        if (!transient(err)) break err;
        sleep(exp(n));
    };
} else action();
```
Though, as a testament to the trickiness of this, your solution is missing a `continue` after the sleep. (And the sleep is missing jitter.)
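For the jitter point, a common scheme is "full jitter": sleep a uniform random duration up to the exponential cap, so retrying clients de-synchronize instead of stampeding in lockstep. A hedged Zig sketch (recent Zig, `std.Random`); the function name, `base_ms`, and `cap_ms` are illustrative, not from this thread:

```zig
const std = @import("std");

/// Hypothetical helper: exponential backoff with full jitter.
/// The sleep call above could then become `sleep(backoffMs(rng, n, 100, 10_000))`.
fn backoffMs(rng: std.Random, attempt: u6, base_ms: u64, cap_ms: u64) u64 {
    // Exponential growth, capped. (A real version should also guard the
    // shift against overflow for large attempt counts.)
    const ceiling = @min(base_ms << attempt, cap_ms);
    // Full jitter: uniform in [0, ceiling] rather than sleeping exactly
    // the exponential value.
    return rng.intRangeAtMost(u64, 0, ceiling);
}
```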
Here’s my attempt:
```zig
var retries_left = retry_count;
while (true) {
    return action() catch |e| {
        if (retries_left > 0 and is_transient_error(e)) {
            retries_left -= 1;
            sleep();
            continue;
        }
        return e;
    };
}
```
(Second possible solution removed from this spot right before I posted, because it was subtly failing the “no sleep after last retry” requirement, this really is tricky.)
However, assuming this is for a service that's designed to be well tested, trusted, and perform optimally, there is a different issue here: this is all putting too much logic alongside pipeline/glue code. That is, if you're using automated testing to gain confidence in code correctness, there's no good testing methodology:
- Unit testing doesn’t work, because `action` is not a unit. It’s (probably) a real service, possibly external.
- Unit testing with a stub for `action` doesn’t work, because a stub won’t fail like a real service.
- Unit testing with mocks is a joke 100% of the time.
- Integration testing doesn’t naturally encounter failures, so coverage of the logic is poor. Or there are failures, and your integration test is also really flaky, possibly hiding logic bugs behind “oh, that test just flakes a lot”.
- Stress testing, possibly with fault injection, might uncover issues. However, it is harder to gain confidence that injected faults behave like real-world failures, issues might fall through the cracks if corrected for by other systems, and identified failures are more costly to track down to their source.
So, if this were a more serious project, I’d split it in two: a logic piece that can be unit tested, and a pipeline/glue piece which is likely to either always work or never work.
```zig
const Retry = struct {
    gas: usize,

    /// Returns null when the caller should retry; otherwise passes
    /// the (possibly failed) result through.
    pub fn check(r: *Retry, v: anytype) ?@TypeOf(v) {
        return v catch |e| {
            if (is_transient_error(e) and r.gas > 0) {
                r.gas -= 1;
                sleep();
                return null; // ask the caller to retry
            }
            return e; // permanent failure, or out of gas
        };
    }
};

// ...

while (true) {
    return retry.check(action()) orelse continue;
}
```
This solution has an additional benefit that you’d likely want in your system anyway: a `Retry` could be created at the root of handling a request and propagated to all other calls. Then the total number of retries across the whole system can be limited. A tree of branching retries can lead to an exponential number of calls, as the retry count is multiplied at each layer. DOSing your own system when it’s already failing under load is not good!
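A sketch of what that propagation could look like; the handler and service names here are hypothetical, just to show the shape:

```zig
// Hypothetical request root: one Retry budget is created here and
// threaded through every downstream call, so nested retries all draw
// from the same pool instead of multiplying at each layer.
fn handleRequest(req: Request) !Response {
    var retry = Retry{ .gas = 3 }; // total retry budget for the whole request
    const a = try callServiceA(&retry, req);
    const b = try callServiceB(&retry, req);
    return combine(a, b);
}
```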
There’s lots of trickiness to doing retrying correctly, and my example is incomplete in other areas. Though it solves the specific question posed in the blog post, it’s missing things like cancellation and timeouts. Instead of recounting it all here, I’ll just point to Google’s SRE book (regardless of the many poor opinions I have about Google, I do think there are many good ideas in that book): Google SRE - Cascading Failures: Reducing System Outage
I’ve specifically used client-side load shedding to prevent a type of failure which regularly set off my team’s pagers: Google SRE - Load Balancing to Handle System Overload