Retry Loop Retry

From @matklad

@kristoff proposed this in the lobsters thread:

    const result = try for (0..retry_count) |n| {
        break action() catch |err| {
            if (!transient(err)) break err;
            sleep(exp(n));
        };
    } else action();

Though, in testament to the trickiness of this, your solution is missing a continue after the sleep. (and the sleep is missing jitter)
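For the jitter point, here is a minimal sketch of a capped exponential backoff with a randomized delay (sometimes called "full jitter"). The name `backoffWithJitterNs` and its parameters are hypothetical and not from the thread. It only computes the delay, so it stays independent of whichever sleep call you use.

```zig
const std = @import("std");

/// Hypothetical helper: delay in nanoseconds before retry number `attempt`
/// (0-based). Picks a uniformly random value in
/// [0, min(cap_ns, base_ns * 2^attempt)].
fn backoffWithJitterNs(random: std.Random, attempt: u32, base_ns: u64, cap_ns: u64) u64 {
    // Clamp the shift and saturate the multiply so large attempt
    // counts cannot overflow.
    const shift: u6 = @intCast(@min(attempt, 63));
    const factor: u64 = @as(u64, 1) << shift;
    const ceiling = @min(cap_ns, base_ns *| factor);
    return random.intRangeAtMost(u64, 0, ceiling);
}
```

The `random` argument could come from something like `std.Random.DefaultPrng`. Passing it in, rather than reaching for a global, keeps the function deterministic under a fixed seed.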

Here’s my attempt:

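var retries_left = retry_count;
while (true) {
	return action() catch |e| {
		// Sleep only when another attempt will actually follow.
		if (retries_left > 0 and is_transient_error(e)) {
			retries_left -= 1;
			sleep();
			continue;
		}
		// Out of retries, or the error is not transient.
		return e;
	};
}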

(Second possible solution removed from this spot right before I posted, because it was subtly failing the “no sleep after last retry” requirement, this really is tricky.)

However, assuming this is for a service that’s designed to be well tested, trusted, and perform optimally, there is a different issue here. This is all putting too much logic alongside pipeline/glue code. That is, if you’re using automated testing to gain confidence in code correctness, there’s no good way to test it:

  • Unit testing doesn’t work, because action is not a unit. It’s (probably) a real service, possibly external.
    • Unit testing with a stub for action doesn’t work, because a stub won’t fail like a real service.
    • Unit testing with mocks is a joke 100% of the time.
  • Integration testing doesn’t naturally get failures, so coverage of the logic is poor. Or there are failures and your integration test is also really flaky, possibly hiding logic bugs behind “oh that test just flakes a lot”.
  • Stress testing, possibly with fault injection, might uncover any issues. However it is more difficult to gain confidence it behaves like real world failures, issues might fall through the cracks if corrected for by other systems, and identified failures are more costly to track down to their source.

So, if this were a more serious project, I’d split it into two: a logic piece that can be unit tested, and pipeline/glue code that is likely to either always work or never work.

const Retry = struct {
	gas: usize,

	pub fn check(r: *Retry, v: anytype) ?@TypeOf(v) {
		return v catch |e| {
			if (is_transient_error(e) and r.gas > 0) {
				r.gas -= 1;
				sleep();
				return null;
			}
			return e;
		};
	}
};

// ...

while (true) {
	return retry.check(action()) orelse continue;
}
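To show how the logic piece can be exercised without a real service, here is a minimal sketch of a unit test. It makes several assumptions that are not in the post: a concrete `Error` set, an `isTransient` predicate, an injectable `sleep_fn` field so the test can count sleeps, and a return type of `Error!?u32` in place of `?@TypeOf(v)`, chosen only to keep the sketch concretely typed.

```zig
const std = @import("std");

// Hypothetical error set and predicate, for illustration only.
const Error = error{ Timeout, NotFound };

fn isTransient(e: Error) bool {
    return e == error.Timeout;
}

// Test double: counts sleeps instead of blocking.
var sleeps: usize = 0;
fn fakeSleep() void {
    sleeps += 1;
}

const Retry = struct {
    gas: usize,
    sleep_fn: *const fn () void,

    /// Returns the value on success, null when the caller should retry,
    /// or the error when it is permanent or the budget is spent.
    fn check(r: *Retry, v: Error!u32) Error!?u32 {
        return v catch |e| {
            if (isTransient(e) and r.gas > 0) {
                r.gas -= 1;
                r.sleep_fn();
                return null;
            }
            return e;
        };
    }
};

test "no sleep after the last retry" {
    sleeps = 0;
    var r = Retry{ .gas = 2, .sleep_fn = &fakeSleep };
    try std.testing.expectEqual(@as(?u32, null), try r.check(error.Timeout));
    try std.testing.expectEqual(@as(?u32, null), try r.check(error.Timeout));
    // Budget is spent: the error surfaces and no further sleep happens.
    try std.testing.expectError(error.Timeout, r.check(error.Timeout));
    try std.testing.expectEqual(@as(usize, 2), sleeps);
}
```

With this return type the call site would read `return try retry.check(action()) orelse continue;`.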

This solution has an additional benefit, one you’d likely want in your system anyway: a Retry could be created at the root of handling a request and propagated to all other calls. Then the total number of retries across the whole system can be limited. A tree of branching retries can lead to an exponential number of calls, because the retry count is multiplied at each layer. DoSing a system that’s already failing under load is not good!
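The effect of a shared budget can be sketched with two nested layers that draw from the same counter. Everything here (`Budget`, `leaf`, `inner`, `outer`) is hypothetical and sleeping is left out for brevity. With independent counters of 3 per layer, a permanently failing leaf would be called (1 + 3) × (1 + 3) = 16 times; with one shared counter of 3 it is called 4 times.

```zig
const std = @import("std");

const Error = error{Overloaded};

// Hypothetical shared retry budget, created once per request.
const Budget = struct {
    gas: usize,

    /// Returns true and consumes one unit if another retry is allowed.
    fn spend(b: *Budget) bool {
        if (b.gas == 0) return false;
        b.gas -= 1;
        return true;
    }
};

var leaf_calls: usize = 0;

// Stand-in for an overloaded downstream service that always fails.
fn leaf() Error!void {
    leaf_calls += 1;
    return error.Overloaded;
}

fn inner(b: *Budget) Error!void {
    while (true) {
        return leaf() catch |e| {
            if (b.spend()) continue;
            return e;
        };
    }
}

fn outer(b: *Budget) Error!void {
    while (true) {
        return inner(b) catch |e| {
            // By now the inner layer has already drained the budget.
            if (b.spend()) continue;
            return e;
        };
    }
}

test "shared budget bounds total leaf calls" {
    leaf_calls = 0;
    var b = Budget{ .gas = 3 };
    try std.testing.expectError(error.Overloaded, outer(&b));
    try std.testing.expectEqual(@as(usize, 4), leaf_calls);
}
```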

There’s lots of trickiness to doing retries correctly. My example is incomplete in other areas. Though it solves the specific question posed in the blog post, it’s missing things like cancellation and timeouts. Instead of recounting it all here, I’ll just point to Google’s SRE book (regardless of the many poor opinions I have about Google, I do think there are many good ideas in that book): Google SRE - Cascading Failures: Reducing System Outage

I’ve specifically used client side load shedding to prevent a type of failure which regularly set off my team’s pagers: Google SRE - Load Balancing to Handle System Overload


related topic with different criteria, but might be adaptable towards that:

yeah I forgot the continue :^)

Retry is a nice abstraction I’m going to steal it


Obviously the solution is to abstract it away into something like this.

fn action_with_retries(
    action: anytype,
    action_args: anytype,
    is_transient: fn (
        @typeInfo(
            @typeInfo(
                @TypeOf(action),
            ).@"fn".return_type.?,
        ).error_union.error_set,
    ) bool,
    sleep: fn () void,
    retry_count: u32,
) @typeInfo(@TypeOf(action)).@"fn".return_type.? {
    var i = retry_count;
    while (true) : (i -%= 1) {
        const act = @call(.auto, action, action_args);

        if (act) |ok|
            return ok
        else |err| if (!is_transient(err) or i == 0)
            return err;

        sleep();
    } else unreachable;
}

You can improve type safety a little by declaring action_args as std.meta.ArgsTuple(@TypeOf(action)). This also helps ergonomics, because it allows RLS and type coercion to work correctly on the argument tuple.
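A minimal sketch of that suggestion, stripped down to just the argument forwarding. The names `callWith` and `scale` are hypothetical, and the `.@"fn"` field name assumes the same Zig version as the snippet above.

```zig
const std = @import("std");

// Hypothetical wrapper: the tuple type for `args` is derived from the
// parameter list of `action` itself.
fn callWith(
    action: anytype,
    args: std.meta.ArgsTuple(@TypeOf(action)),
) @typeInfo(@TypeOf(action)).@"fn".return_type.? {
    return @call(.auto, action, args);
}

fn scale(x: u64, factor: u8) u64 {
    return x * factor;
}

test "tuple literal coerces to the declared parameter types" {
    // `.{ 21, 2 }` is coerced to a tuple of (u64, u8).
    try std.testing.expectEqual(@as(u64, 42), callWith(scale, .{ 21, 2 }));
}
```

With `action_args: anytype`, the literals would instead stay `comptime_int` until `@call` tries to coerce them, and a mismatch in argument count would only surface there.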


We might need clarification on the meaning of ‘syntactically obvious that the amount of retries is bounded’. But I think that it rules out loops that use a var that could be mutated in the body…