Your service calls fail. Accept it. Plan for it. Retry wisely.
In a distributed system, failure is not a bug—it’s a feature. Your payment gateway times out. Your truck tracking API returns 503. A message queue is briefly unavailable. These aren’t anomalies; they’re daily occurrences.
Our transportation platform handles 10,000+ shipments daily. Every single one involves multiple HTTP calls, queue messages, and database operations. We’ve learned that blindly retrying is as bad as not retrying at all.
The Naive Approach (and Why It Fails)
```csharp
// DON'T DO THIS
public async Task<Shipment> GetShipmentWithRetryAsync(string id)
{
    for (int i = 0; i < 3; i++)
    {
        try
        {
            return await _httpClient.GetFromJsonAsync<Shipment>($"/api/shipments/{id}");
        }
        catch
        {
            if (i == 2) throw;
            await Task.Delay(1000); // Fixed 1-second delay, every time
        }
    }
    throw new InvalidOperationException("Unreachable"); // keeps the compiler happy
}
```
Problems:
- Fixed delays: If the service needs ten seconds to recover, three retries spaced one second apart all fail before it comes back.
- Thundering herd: All your instances retry at the same time, overwhelming the recovering service.
- No differentiation: You retry a 401 Unauthorized the same way as a 503 Service Unavailable. (You shouldn’t.)
- No circuit breaking: You keep hammering a dead service.
Pattern 1: Exponential Backoff
Space out your retries with exponential delays:
```csharp
public async Task<T> CallWithExponentialBackoffAsync<T>(
    Func<Task<T>> operation,
    int maxRetries = 3)
{
    for (int attempt = 0; attempt <= maxRetries; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (HttpRequestException) when (attempt < maxRetries)
        {
            // Delay: 2^attempt * 100ms, plus random jitter
            int delayMs = (int)Math.Pow(2, attempt) * 100;
            int jitter = Random.Shared.Next(0, delayMs / 2);
            await Task.Delay(delayMs + jitter);
        }
    }
    throw new InvalidOperationException("Unreachable: the last attempt either returns or rethrows");
}

// Usage
var shipment = await CallWithExponentialBackoffAsync(
    () => _shipmentService.GetByIdAsync("SHP-001"));
```
Attempt delays: 100-150ms → 200-300ms → 400-600ms (the jitter adds up to half the base delay).
This spreads the load and gives the service time to recover.
Pattern 2: Using Polly (The Professional’s Choice)
Polly is a .NET resilience library. It’s battle-tested and covers most scenarios.
```csharp
// In Program.cs or your DI container
// (requires Microsoft.Extensions.Http.Polly; _logger is assumed to be in scope here)
services.AddHttpClient<IShipmentService>()
    .AddTransientHttpErrorPolicy(p => p
        .OrResult(r => r.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt =>
                TimeSpan.FromMilliseconds(
                    Math.Pow(2, retryAttempt) * 100 +
                    Random.Shared.Next(0, 100 * retryAttempt)),
            onRetry: (outcome, timespan, retryCount, context) =>
            {
                _logger.LogWarning(
                    $"Retry {retryCount} after {timespan.TotalMilliseconds}ms");
            }))
    .AddTransientHttpErrorPolicy(p => p
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 3,
            durationOfBreak: TimeSpan.FromSeconds(30),
            onBreak: (outcome, duration) =>
            {
                _logger.LogError(
                    $"Circuit breaker opened for {duration.TotalSeconds}s");
            },
            onReset: () => _logger.LogInformation("Circuit breaker reset")));
```
Pattern 3: Circuit Breaker (The Sentinel)
A circuit breaker stops making calls to a failing service, giving it time to recover.
```csharp
public class CircuitBreakerExample
{
    private readonly HttpClient _httpClient = new HttpClient();
    private readonly IAsyncPolicy<HttpResponseMessage> _policy;

    public CircuitBreakerExample()
    {
        // Trip after 3 consecutive handled failures,
        // stay open for 30 seconds,
        // then allow a trial call (half-open state)
        _policy = Policy
            .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
            .OrTransientHttpError()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 3,
                durationOfBreak: TimeSpan.FromSeconds(30),
                onBreak: (outcome, duration) =>
                {
                    Console.WriteLine(
                        $"Circuit broken. Service down. Waiting {duration.TotalSeconds}s");
                },
                onReset: () =>
                {
                    Console.WriteLine("Circuit reset. Service recovered!");
                });
    }

    public async Task<HttpResponseMessage> CallWithCircuitBreakerAsync(
        HttpRequestMessage request)
    {
        try
        {
            return await _policy.ExecuteAsync(() => _httpClient.SendAsync(request));
        }
        catch (BrokenCircuitException)
        {
            // Fail fast while the circuit is open
            return new HttpResponseMessage(System.Net.HttpStatusCode.ServiceUnavailable)
            {
                Content = new StringContent("Service temporarily unavailable")
            };
        }
    }
}
```
Circuit States:
- Closed: Normal operation, requests pass through
- Open: Service is failing, requests fail fast without calling the backend
- Half-Open: Service might be recovering, allow one test request
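Polly's circuit-breaker policies also expose which of these states they are currently in, so you can fail fast yourself instead of even attempting the call. A minimal sketch, assuming a circuit-breaker field `_breaker` (an `AsyncCircuitBreakerPolicy<ShipmentStatus>`) plus hypothetical `_statusCache` and `_shipmentApi` helpers:

```csharp
public async Task<ShipmentStatus> GetStatusOrCachedAsync(string shipmentId)
{
    // Fail fast while the circuit is open: skip the remote call entirely.
    // CircuitState lives in the Polly.CircuitBreaker namespace.
    if (_breaker.CircuitState == CircuitState.Open)
    {
        return _statusCache.GetLastKnownStatus(shipmentId); // hypothetical cached fallback
    }

    return await _breaker.ExecuteAsync(() =>
        _shipmentApi.GetStatusAsync(shipmentId)); // hypothetical typed client call
}
```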
Pattern 4: Differentiate Between Retryable and Non-Retryable Failures
Not all errors deserve retries:
```csharp
public static bool IsRetryable(Exception ex)
{
    if (ex is HttpRequestException httpEx)
    {
        // Retryable HTTP errors
        var statusCodes = new[] { 408, 429, 500, 502, 503, 504 };
        if (httpEx.StatusCode.HasValue &&
            statusCodes.Contains((int)httpEx.StatusCode))
        {
            return true;
        }
    }

    // Non-retryable
    if (ex is ArgumentException) return false;           // Caller's fault
    if (ex is UnauthorizedAccessException) return false; // Auth issue
    if (ex is ValidationException) return false;         // Bad data

    // Timeout might be transient
    if (ex is TimeoutException) return true;

    // Default to no retry to avoid infinite loops
    return false;
}

public async Task<Shipment> GetShipmentAsync(string id)
{
    var policy = Policy
        .Handle<Exception>(IsRetryable)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: attempt =>
                TimeSpan.FromMilliseconds(Math.Pow(2, attempt) * 100));

    return await policy.ExecuteAsync(() => _shipmentService.GetByIdAsync(id));
}
```
Pattern 5: Bulkhead Isolation
Prevent one failing service from dragging down everything else:
```csharp
// Limit concurrent calls to the payment service
var bulkheadPolicy = Policy.BulkheadAsync<PaymentResult>(
    maxParallelization: 10,
    maxQueuingActions: 50,
    onBulkheadRejectedAsync: context =>
    {
        _logger.LogWarning("Payment service bulkhead exceeded");
        return Task.CompletedTask;
    });

// retryPolicy and circuitBreakerPolicy are IAsyncPolicy<PaymentResult> instances
// built as in the earlier patterns; the outermost policy is listed first
var combinedPolicy = Policy.WrapAsync<PaymentResult>(
    bulkheadPolicy,
    retryPolicy,
    circuitBreakerPolicy);
```
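Executing through the wrapped policy looks the same as executing through any single policy. A quick usage sketch, where `_paymentGateway.ChargeAsync` is a placeholder for however you actually call the payment service:

```csharp
try
{
    var result = await combinedPolicy.ExecuteAsync(() =>
        _paymentGateway.ChargeAsync(order)); // placeholder: returns Task<PaymentResult>
}
catch (BulkheadRejectedException)
{
    // Too many payment calls already in flight or queued; shed the load gracefully
}
```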
Real-World: Delivery Status Sync from Carrier APIs
Here’s how we retry shipment status updates from our carrier API:
```csharp
private IAsyncPolicy<CarrierStatusResponse> BuildCarrierSyncPolicy()
{
    var retryPolicy = Policy
        .HandleResult<CarrierStatusResponse>(r => r.StatusCode >= 500)
        .Or<TimeoutException>()
        .WaitAndRetryAsync(
            retryCount: 4,
            sleepDurationProvider: attempt =>
            {
                // 100ms, 200ms, 400ms, 800ms
                return TimeSpan.FromMilliseconds(
                    Math.Pow(2, attempt - 1) * 100);
            },
            onRetry: (outcome, delay, retryCount, context) =>
            {
                _logger.LogWarning(
                    $"Carrier API retry {retryCount} after {delay.TotalMilliseconds}ms");
            });

    var circuitBreaker = Policy
        .HandleResult<CarrierStatusResponse>(r => r.StatusCode >= 500)
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 5,
            durationOfBreak: TimeSpan.FromSeconds(60),
            onBreak: (outcome, duration) =>
            {
                _logger.LogError(
                    $"Carrier API circuit broken for {duration.TotalSeconds}s");
                // Fire-and-forget alert
                _ = _alertService.NotifyAsync("Carrier API down");
            },
            onReset: () => _logger.LogInformation("Carrier API circuit reset"));

    return Policy.WrapAsync(retryPolicy, circuitBreaker);
}

public async Task<CarrierStatusResponse> GetCarrierStatusAsync(string trackingId)
{
    var policy = BuildCarrierSyncPolicy();
    return await policy.ExecuteAsync(async () =>
    {
        // Per-attempt timeout: HttpClient.GetAsync has no TimeSpan overload,
        // so use a CancellationTokenSource instead
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
        var response = await _httpClient.GetAsync(
            $"https://carrier-api.com/track/{trackingId}", cts.Token);

        // MapToCarrierStatusAsync: our own mapping helper (not shown) that copies
        // the HTTP status code and body into the carrier DTO
        return await MapToCarrierStatusAsync(response);
    });
}
```
Key Lessons
- Retry != Resilience: Retries are one tool. You also need circuit breakers, bulkheads, and timeouts.
- Exponential backoff + jitter: Linear retries cause thundering herds. Exponential spacing with randomness is better.
- Know what to retry: 401s and 404s don’t deserve retries. 503s do.
- Monitor your retries: Log every retry. Chart them. If you’re retrying more than expected, something is degrading.
- Set appropriate timeouts: A 30-second timeout defeats the purpose of a 1-second retry. Your total time budget matters (see the sketch after this list).
- Use libraries: Don’t roll your own. Polly is mature, well-tested, and handles edge cases you haven’t thought of.
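To make the timeout lesson concrete, here is a minimal sketch (not our production policy) of a per-attempt timeout wrapped inside a retry, so one slow call can't eat the whole time budget:

```csharp
// Each individual attempt is cut off after 2 seconds...
var perAttemptTimeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(2));

// ...and the retry policy treats that timeout like any other transient failure
var retry = Policy
    .HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .Or<TimeoutRejectedException>() // thrown by the timeout policy (Polly.Timeout)
    .WaitAndRetryAsync(3, attempt =>
        TimeSpan.FromMilliseconds(Math.Pow(2, attempt) * 100));

// Retry is the outer policy, so every attempt gets its own 2-second timeout.
var resilient = Policy.WrapAsync(retry, perAttemptTimeout);

// Note: Polly's default (optimistic) timeout only takes effect if the executed
// delegate honors the CancellationToken Polly passes in, e.g.:
// await resilient.ExecuteAsync(ct => _httpClient.GetAsync(url, ct), CancellationToken.None);
```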
The Rule of Thumb
For our platform:
- 3 retries for most operations
- Exponential backoff with jitter (100ms → 200ms → 400ms)
- Circuit breaker after 5 consecutive failures
- Differentiate errors: Only retry transient failures
- Monitor ruthlessly: Every retry is a signal that something is struggling
Remember: The best retry is the one you never need to make. Design your systems for resilience, not just recovery.