
Resilience in Distributed Systems: Retry Patterns That Actually Work

Your service calls fail. Accept it. Plan for it. Retry wisely.

In a distributed system, failure is not a bug—it’s a feature. Your payment gateway times out. Your truck tracking API returns 503. A message queue is briefly unavailable. These aren’t anomalies; they’re daily occurrences.

Our transportation platform handles 10,000+ shipments daily. Every single one involves multiple HTTP calls, queue messages, and database operations. We’ve learned that blindly retrying is as bad as not retrying at all.

The Naive Approach (and Why It Fails)

// DON'T DO THIS
public async Task<Shipment> GetShipmentWithRetryAsync(string id)
{
    for (int i = 0; i < 3; i++)
    {
        try
        {
            return await _httpClient.GetFromJsonAsync<Shipment>($"/api/shipments/{id}");
        }
        catch
        {
            if (i == 2) throw;
            await Task.Delay(1000); // Fixed 1-second delay
        }
    }

    throw new InvalidOperationException("Unreachable"); // satisfies the compiler
}

Problems:

  1. Fixed delays: If the service needs several seconds to recover, three retries spaced a fixed second apart all land while it’s still down.
  2. Thundering herd: All your instances retry at the same time, overwhelming the recovering service.
  3. No differentiation: You retry a 401 Unauthorized the same way as a 503 Service Unavailable. (You shouldn’t.)
  4. No circuit breaking: You keep hammering a dead service.

Pattern 1: Exponential Backoff

Space out your retries with exponential delays:

public async Task<T> CallWithExponentialBackoffAsync<T>(
    Func<Task<T>> operation,
    int maxRetries = 3)
{
    for (int attempt = 0; attempt <= maxRetries; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (HttpRequestException) when (attempt < maxRetries)
        {
            // Delay: 2^attempt * 100ms, plus random jitter to spread out competing clients
            int delayMs = (int)Math.Pow(2, attempt) * 100;
            int jitter = Random.Shared.Next(0, delayMs / 2); // .NET 6+; avoids a new Random() per retry

            await Task.Delay(delayMs + jitter);
        }
    }

    // Unreachable: the final attempt either returns or rethrows,
    // but the compiler still needs a terminating statement
    throw new InvalidOperationException("Retry loop exited unexpectedly");
}

// Usage
var shipment = await CallWithExponentialBackoffAsync(
    () => _shipmentService.GetByIdAsync("SHP-001"));

Attempt delays: 100-150ms → 200-300ms → 400-600ms

This spreads the load and gives the service time to recover.
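The jitter above is the "equal jitter" style: a fixed base delay plus a random fraction on top. A commonly cited alternative is "full jitter", where the entire delay is drawn uniformly between zero and the exponential cap. A minimal sketch (the `Backoff` class and method name are illustrative, not from our codebase):

```csharp
using System;

public static class Backoff
{
    // Full jitter: the delay is uniform in [0, 2^attempt * baseMs],
    // so competing clients desynchronize as much as possible.
    public static int FullJitterDelayMs(int attempt, int baseMs, Random rng)
    {
        int capMs = (int)Math.Pow(2, attempt) * baseMs;
        return rng.Next(0, capMs + 1);
    }
}
```

Full jitter spreads retries more aggressively than equal jitter, at the cost of some retries firing almost immediately.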

Pattern 2: Using Polly (The Professional’s Choice)

Polly is a .NET resilience library. It’s battle-tested and covers most scenarios.

// In Program.cs or your DI container
services.AddHttpClient<IShipmentService>()
    .AddTransientHttpErrorPolicy(p => p
        .OrResult(r => r.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt =>
                TimeSpan.FromMilliseconds(
                    Math.Pow(2, retryAttempt) * 100 +
                    Random.Shared.Next(0, 100 * retryAttempt)),
            onRetry: (outcome, timespan, retryCount, context) =>
            {
                _logger.LogWarning(
                    $"Retry {retryCount} after {timespan.TotalMilliseconds}ms");
            }))
    // The circuit breaker also goes through AddTransientHttpErrorPolicy;
    // IHttpClientBuilder has no AddCircuitBreakerAsync of its own
    .AddTransientHttpErrorPolicy(p => p
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 3,
            durationOfBreak: TimeSpan.FromSeconds(30),
            onBreak: (outcome, duration) =>
            {
                _logger.LogError(
                    $"Circuit breaker opened for {duration.TotalSeconds}s");
            },
            onReset: () => _logger.LogInformation("Circuit breaker reset")));

Pattern 3: Circuit Breaker (The Sentinel)

A circuit breaker stops making calls to a failing service, giving it time to recover.

public class CircuitBreakerExample
{
    private readonly HttpClient _httpClient;
    private readonly IAsyncPolicy<HttpResponseMessage> _policy;
    
    public CircuitBreakerExample(HttpClient httpClient)
    {
        _httpClient = httpClient;

        // Trip after 3 handled failures,
        // stay open for 30 seconds,
        // then allow one test request (half-open state)
        _policy = Policy
            .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .OrTransientHttpError()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 3,
                durationOfBreak: TimeSpan.FromSeconds(30),
                onBreak: (outcome, duration) =>
                {
                    Console.WriteLine(
                        $"Circuit broken. Service down. Waiting {duration.TotalSeconds}s");
                },
                onReset: () =>
                {
                    Console.WriteLine("Circuit reset. Service recovered!");
                }
            );
    }
    
    public async Task<HttpResponseMessage> CallWithCircuitBreakerAsync(
        HttpRequestMessage request)
    {
        try
        {
            return await _policy.ExecuteAsync(() => _httpClient.SendAsync(request));
        }
        catch (BrokenCircuitException)
        {
            // Fail fast while the circuit is open
            return new HttpResponseMessage(System.Net.HttpStatusCode.ServiceUnavailable)
            {
                Content = new StringContent("Service temporarily unavailable")
            };
        }
    }
}

Circuit States:

  • Closed: Normal operation, requests pass through
  • Open: Service is failing, requests fail fast without calling the backend
  • Half-Open: Service might be recovering, allow one test request
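To make those transitions concrete, here is a hand-rolled sketch of the state machine. This is purely illustrative (no thread safety, hypothetical type names), not a replacement for Polly:

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Minimal single-threaded circuit breaker for illustration only
public class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private int _failureCount;
    private DateTime _openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan breakDuration)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
    }

    // Open -> HalfOpen once the break has elapsed; reject while Open
    public bool AllowRequest(DateTime now)
    {
        if (State == CircuitState.Open && now - _openedAt >= _breakDuration)
            State = CircuitState.HalfOpen; // let one test request through

        return State != CircuitState.Open;
    }

    public void RecordSuccess()
    {
        _failureCount = 0;
        State = CircuitState.Closed;
    }

    public void RecordFailure(DateTime now)
    {
        _failureCount++;
        // A failed half-open probe re-opens immediately
        if (State == CircuitState.HalfOpen || _failureCount >= _failureThreshold)
        {
            State = CircuitState.Open;
            _openedAt = now;
        }
    }
}
```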

Pattern 4: Differentiate Between Retryable and Non-Retryable Failures

Not all errors deserve retries:

public static bool IsRetryable(Exception ex)
{
    if (ex is HttpRequestException httpEx)
    {
        // Retryable HTTP errors (HttpRequestException.StatusCode is available from .NET 5)
        var statusCodes = new[] { 408, 429, 500, 502, 503, 504 };
        if (httpEx.StatusCode.HasValue &&
            statusCodes.Contains((int)httpEx.StatusCode))
        {
            return true;
        }
    }
    
    // Non-retryable
    if (ex is ArgumentException) return false;      // Caller's fault
    if (ex is UnauthorizedAccessException) return false; // Auth issue
    if (ex is ValidationException) return false;    // Bad data
    
    // Timeout might be transient
    if (ex is TimeoutException) return true;
    
    // Default to no retry to avoid infinite loops
    return false;
}

public async Task<Shipment> GetShipmentAsync(string id)
{
    var policy = Policy
        .Handle<Exception>(IsRetryable)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: attempt =>
                TimeSpan.FromMilliseconds(Math.Pow(2, attempt) * 100));
    
    return await policy.ExecuteAsync(async () =>
        await _shipmentService.GetByIdAsync(id));
}

Pattern 5: Bulkhead Isolation

Prevent one failing service from dragging down everything else:

// Limit concurrent calls to the payment service
var bulkheadPolicy = Policy.BulkheadAsync<PaymentResult>(
    maxParallelization: 10,
    maxQueuingActions: 50,
    onBulkheadRejectedAsync: context =>
    {
        _logger.LogWarning("Payment service bulkhead exceeded");
        return Task.CompletedTask;
    });

// Outermost policy first: the bulkhead wraps the retry, which wraps the circuit breaker
// (retryPolicy and circuitBreakerPolicy are built as in the earlier patterns)
var combinedPolicy = Policy.WrapAsync<PaymentResult>(
    bulkheadPolicy,
    retryPolicy,
    circuitBreakerPolicy);

Real-World: Delivery Status Sync from Carrier APIs

Here’s how we retry shipment status updates from our carrier API:

private IAsyncPolicy<CarrierStatusResponse> BuildCarrierSyncPolicy()
{
    var retryPolicy = Policy
        .HandleResult<CarrierStatusResponse>(r => r.StatusCode >= 500)
        .Or<TimeoutException>()
        .WaitAndRetryAsync(
            retryCount: 4,
            sleepDurationProvider: attempt =>
            {
                // 100ms, 200ms, 400ms, 800ms
                return TimeSpan.FromMilliseconds(
                    Math.Pow(2, attempt - 1) * 100);
            },
            onRetry: (outcome, delay, retryCount, context) =>
            {
                _logger.LogWarning(
                    $"Carrier API retry {retryCount} after {delay.TotalMilliseconds}ms");
            });
    
    var circuitBreaker = Policy
        .HandleResult<CarrierStatusResponse>(r => r.StatusCode >= 500)
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 5,
            durationOfBreak: TimeSpan.FromSeconds(60),
            onBreak: (outcome, duration) =>
            {
                _logger.LogError(
                    $"Carrier API circuit broken for {duration.TotalSeconds}s");
                // onBreak is synchronous, so the alert is fire-and-forget
                _ = _alertService.NotifyAsync("Carrier API down");
            },
            onReset: () => _logger.LogInformation("Carrier API circuit reset"));
    
    return Policy.WrapAsync(retryPolicy, circuitBreaker);
}

// Build the policy once (e.g. _carrierSyncPolicy = BuildCarrierSyncPolicy() in the
// constructor) and reuse it: a fresh circuit breaker per call never accumulates
// failures, so it never trips
private readonly IAsyncPolicy<CarrierStatusResponse> _carrierSyncPolicy;

public async Task<CarrierStatusResponse> GetCarrierStatusAsync(string trackingId)
{
    return await _carrierSyncPolicy.ExecuteAsync(async () =>
    {
        // HttpClient.GetAsync has no timeout parameter; use a per-call cancellation token
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
        var response = await _httpClient.GetAsync(
            $"https://carrier-api.com/track/{trackingId}", cts.Token);
        
        // Map the raw HTTP response to our carrier DTO (deserialization helper omitted)
        return await ParseCarrierResponseAsync(response);
    });
}

Key Lessons

  1. Retry != Resilience: Retries are one tool. You also need circuit breakers, bulkheads, and timeouts.

  2. Exponential backoff + jitter: Linear retries cause thundering herds. Exponential spacing with randomness is better.

  3. Know what to retry: 401s and 404s don’t deserve retries. 503s do.

  4. Monitor your retries: Log every retry. Chart them. If you’re retrying more than expected, something is degrading.

  5. Set appropriate timeouts: A 30-second timeout defeats the purpose of a 1-second retry. Your total time budget matters.

  6. Use libraries: Don’t roll your own. Polly is mature, well-tested, and handles edge cases you haven’t thought of.
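Lesson 5 is easy to check with arithmetic: the worst-case wall-clock cost of a retry policy is every attempt running to its full timeout, plus every backoff delay paid in full. A small helper to compute that budget (the `RetryBudget` class is illustrative, not part of our codebase):

```csharp
using System;
using System.Linq;

public static class RetryBudget
{
    // Worst case: (maxRetries + 1) attempts each hitting the per-attempt timeout,
    // plus the full exponential backoff sequence (baseMs, 2*baseMs, 4*baseMs, ...)
    public static TimeSpan WorstCase(TimeSpan perAttemptTimeout, int maxRetries, int baseDelayMs)
    {
        int attempts = maxRetries + 1;
        double delaysMs = Enumerable.Range(0, maxRetries)
            .Sum(i => Math.Pow(2, i) * baseDelayMs);
        return TimeSpan.FromMilliseconds(
            attempts * perAttemptTimeout.TotalMilliseconds + delaysMs);
    }
}
```

With a 10-second per-attempt timeout, 3 retries, and a 100ms base delay, the worst case is about 40.7 seconds. A caller with a tighter latency budget needs shorter timeouts or fewer retries, not both defaults.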

The Rule of Thumb

For our platform:

  • 3 retries for most operations
  • Exponential backoff with jitter (100ms → 200ms → 400ms)
  • Circuit breaker after 5 consecutive failures
  • Differentiate errors: Only retry transient failures
  • Monitor ruthlessly: Every retry is a signal that something is struggling

Remember: The best retry is the one you never need to make. Design your systems for resilience, not just recovery.
