Audience
This guide is for developers who need to build reliable integrations with APIs:
- Backend developers implementing API clients that need to handle failures gracefully
- Frontend developers building applications that remain responsive when APIs fail
- System architects designing resilient distributed systems
- DevOps engineers understanding failure patterns to improve monitoring and alerting
- Anyone who has been frustrated by APIs that timeout, return errors, or become unavailable
You should be familiar with HTTP basics. If not, start with HTTP for REST APIs.
Goal
After reading this guide, you’ll understand:
- Why APIs fail and the different categories of failures
- What timeouts are and why they’re critical for system health
- How retry strategies work, including exponential backoff and jitter
- Why idempotency matters and how idempotency keys prevent duplicates
- How the circuit breaker pattern protects your system from cascading failures
- What graceful degradation means and when to apply it
You won’t be configuring production-ready resilience patterns yet, but you’ll have a solid mental model of how resilient systems handle failure.
1. Why APIs Fail
APIs fail. Not occasionally—constantly. Understanding failure modes is the first step to building resilient systems.
Categories of Failures
API failures fall into distinct categories, each requiring different handling:
graph TD
F[API Failure] --> N[Network Failures]
F --> T[Timeout Failures]
F --> O[Overload Failures]
F --> B[Bug Failures]
F --> D[Dependency Failures]
N --> N1[Connection refused]
N --> N2[DNS resolution failed]
N --> N3[TLS handshake failed]
T --> T1[Read timeout]
T --> T2[Connect timeout]
T --> T3[Idle timeout]
O --> O1[Rate limiting - 429]
O --> O2[Server overloaded - 503]
O --> O3[Queue full]
B --> B1[Server error - 500]
B --> B2[Unexpected response]
B --> B3[Data corruption]
D --> D1[Database down]
D --> D2[External API failed]
D --> D3[Message queue unavailable]
style F fill:#ffccbc
style N fill:#fff9c4
style T fill:#fff9c4
style O fill:#fff9c4
style B fill:#fff9c4
style D fill:#fff9c4
Network Failures
The network between your client and the API can fail in many ways:
- Connection refused: The server isn’t accepting connections
- DNS failure: Can’t resolve the hostname to an IP address
- TLS errors: Certificate expired, hostname mismatch, protocol incompatibility
- Connection reset: The connection was forcibly closed
- Packet loss: Data never arrives or is corrupted in transit
Network failures are usually transient—they resolve themselves when the network recovers.
Timeout Failures
Timeouts occur when something takes too long:
- Connection timeout: Establishing the TCP connection takes too long
- Read timeout: Waiting for response data takes too long
- Idle timeout: The connection sat unused for too long
Timeouts are ambiguous: Did the request succeed? You sent the request, but you don’t know if the server received it, processed it, or sent a response that got lost.
Overload Failures
Servers can become overwhelmed:
- Rate limiting (429): You’re sending requests too fast
- Server overload (503): The server can’t handle current load
- Queue full: Request queues are saturated
Overload failures signal: “I’m too busy, try again later.”
Bug Failures
Sometimes the server itself is broken:
- 500 Internal Server Error: Unhandled exception, null pointer, etc.
- Unexpected response format: API changed without notice
- Logic errors: Server processed the request incorrectly
Bug failures often require human intervention to fix.
Dependency Failures
Modern APIs depend on other services:
- Database unavailable: Can’t store or retrieve data
- External API failed: Third-party service is down
- Cache failure: Redis/Memcached unavailable
Dependency failures cascade—one failing service can bring down many others.
The Reality of Failure
In a distributed system with multiple services, failures are inevitable. If each service has 99.9% uptime, a request touching 10 services succeeds only 0.999^10 ≈ 99.0% of the time, so roughly 1% of requests fail even when every individual service is “reliable.”
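To see where that number comes from, here's the arithmetic as a quick sketch (the per-service uptime and service count are just the illustrative figures above):
// Success rate of a request that must traverse N services in sequence
const perServiceUptime = 0.999;  // 99.9% availability per service
const servicesTouched = 10;
const successRate = Math.pow(perServiceUptime, servicesTouched);
console.log(successRate.toFixed(4)); // ≈ 0.9900, so roughly 1% of requests fail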
Building resilient systems means accepting that failures happen and designing for them.
2. Timeouts: Your First Line of Defense
A timeout is a limit on how long you’ll wait for something to complete. Without timeouts, a slow or unresponsive API can hang your entire application.
Why Timeouts Matter
Imagine a payment API that becomes slow. Without timeouts:
sequenceDiagram
participant C as Your App
participant P as Payment API
C->>P: Process payment
Note over P: API is slow...
Note over C: Thread blocked
waiting forever
Note over P: Still processing...
Note over C: More requests arrive
All threads blocked
Note over P: More time passes...
Note over C: Thread pool exhausted
Application hangs
style C fill:#ffccbc
One slow dependency can consume all your resources, making your entire application unresponsive. This is called resource exhaustion.
Types of Timeouts
Connection timeout: How long to wait when establishing a connection.
- Too short: Fails before connection completes on slow networks
- Too long: Resources tied up waiting for unreachable servers
- Typical values: 1-10 seconds
Read timeout (response timeout): How long to wait for response data after connecting.
- Too short: Fails on legitimate slow operations
- Too long: Resources tied up waiting for stuck servers
- Typical values: 5-60 seconds (depends on operation)
Total timeout: Maximum time for the entire request-response cycle.
- Ensures bounded wait time regardless of retries
- Typical values: 30-120 seconds
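As a minimal sketch, here's what a total timeout might look like in JavaScript using fetch with an AbortController (the URL handling and the 5-second default are placeholders, not recommendations):
async function getWithTimeout(url, timeoutMs = 5000) {
  const controller = new AbortController();
  // Abort the request if it hasn't completed within the total timeout
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await response.json();
  } finally {
    clearTimeout(timer); // Always clear the timer, success or failure
  }
}
Note that this bounds the whole request-response cycle; connect and read timeouts are typically configured separately on the HTTP client, when it exposes them.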
The Timeout Dilemma
Setting timeouts involves tradeoffs:
| Timeout Too Short | Timeout Too Long |
|---|---|
| Fails legitimate requests | Slow failures |
| High error rate | Resource exhaustion |
| Poor user experience | Cascading slowdowns |
The key insight: A fast failure is often better than a slow success. Your users would rather see “try again” than wait forever.
Timeout as a Contract
Think of timeouts as a contract with your users: “I promise to respond within X seconds, even if the answer is ‘I don’t know.’”
# Request with timeout context
POST /payments HTTP/1.1
X-Request-Timeout: 30
# Server should abort if it can't respond in time
Without this contract, your application’s response time is determined by your slowest dependency—which might be infinitely slow.
3. Retry Strategies: When and How to Try Again
Retries can turn transient failures into successful requests. But naive retries can make things worse.
When to Retry
Not all failures should be retried:
flowchart TD
E[Error Occurred] --> S{Status Code?}
S -->|4xx| C4[Client Error]
S -->|5xx| C5[Server Error]
S -->|Network Error| CN[Network Error]
S -->|Timeout| CT[Timeout]
C4 --> D4{Which 4xx?}
D4 -->|400, 401, 403, 404| N1[Don't Retry
Fix the request]
D4 -->|429| R1[Retry
After delay]
D4 -->|408| R2[Retry
Immediately possible]
C5 --> D5{Which 5xx?}
D5 -->|500| M1[Maybe Retry
Server bug?]
D5 -->|502, 503, 504| R3[Retry
Transient failure]
CN --> R4[Retry
Network recovered?]
CT --> R5[Retry
But carefully!]
style N1 fill:#ffccbc
style R1 fill:#c8e6c9
style R2 fill:#c8e6c9
style R3 fill:#c8e6c9
style R4 fill:#c8e6c9
style R5 fill:#fff9c4
style M1 fill:#fff9c4
Safe to retry:
- 429 Too Many Requests (after waiting)
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
- Network errors (connection refused, DNS failure)
- Timeouts (with caution)
Don’t retry:
- 400 Bad Request (your request is malformed)
- 401 Unauthorized (your credentials are wrong)
- 403 Forbidden (you don’t have permission)
- 404 Not Found (resource doesn’t exist)
- 409 Conflict (state conflict, needs resolution)
Maybe retry:
- 500 Internal Server Error (might be a transient bug)
- Timeouts on non-idempotent operations (see idempotency section)
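Expressed as code, the decision above might look like this sketch (the error.status and error.isIdempotent fields are assumptions about your client's error shape):
function isRetryable(error) {
  // No HTTP status at all: network failure or timeout, usually transient
  if (error.status === undefined) return true;
  // Overload, gateway, and request-timeout errors: retry after a delay
  if ([408, 429, 502, 503, 504].includes(error.status)) return true;
  // 500 is a "maybe": retry only if the operation is safe to repeat
  if (error.status === 500) return error.isIdempotent === true;
  // Everything else (400, 401, 403, 404, 409, ...) needs a fix, not a retry
  return false;
}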
The Problem with Naive Retries
Simple immediate retries can cause retry storms:
sequenceDiagram
participant C1 as Client 1
participant C2 as Client 2
participant C3 as Client 3
participant S as Overloaded Server
Note over S: Server at 100% capacity
C1->>S: Request 1
C2->>S: Request 2
C3->>S: Request 3
S--xC1: 503 Service Unavailable
S--xC2: 503 Service Unavailable
S--xC3: 503 Service Unavailable
Note over C1,C3: All clients retry immediately
C1->>S: Retry 1
C2->>S: Retry 2
C3->>S: Retry 3
Note over S: Now handling 6 requests
instead of 3!
style S fill:#ffccbc
When all clients retry immediately, they amplify the load on an already struggling server.
Exponential Backoff
The solution is exponential backoff: Wait longer between each retry.
Attempt 1: Wait 1 second
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds
Attempt 5: Wait 16 seconds
...
The delay grows exponentially: delay = base * 2^attempt
graph LR
A1[Attempt 1] -->|1s wait| A2[Attempt 2]
A2 -->|2s wait| A3[Attempt 3]
A3 -->|4s wait| A4[Attempt 4]
A4 -->|8s wait| A5[Attempt 5]
style A1 fill:#e3f2fd
style A2 fill:#bbdefb
style A3 fill:#90caf9
style A4 fill:#64b5f6
style A5 fill:#42a5f5
This gives the server time to recover while preventing retry storms.
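In code, the delay calculation is a one-liner; here's a sketch with attempts counted from zero (the base delay and cap are illustrative):
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  // delay = base * 2^attempt, capped so late attempts don't wait forever
  return Math.min(capMs, baseMs * 2 ** attempt);
}
// backoffDelay(0) → 1000ms, backoffDelay(1) → 2000ms, backoffDelay(4) → 16000ms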
Jitter: Randomizing Delays
Even with exponential backoff, if 1000 clients start retrying at the same time, they’ll hit the server in synchronized waves.
Jitter adds randomness to the delay:
# Full jitter (recommended)
delay = random(0, base * 2^attempt)
# Decorrelated jitter
delay = random(base, previous_delay * 3)
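The same two formulas in JavaScript (a sketch; Math.random stands in for random() above):
// Full jitter: anywhere between 0 and the full exponential delay
function fullJitter(attempt, baseMs = 1000) {
  return Math.random() * baseMs * 2 ** attempt;
}
// Decorrelated jitter: based on the previous delay rather than the attempt count
function decorrelatedJitter(previousDelayMs, baseMs = 1000) {
  return baseMs + Math.random() * (previousDelayMs * 3 - baseMs);
}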
Instead of all clients waiting exactly 4 seconds, they wait randomly between 0-4 seconds, spreading the load:
graph LR
subgraph Without Jitter
W1[Client A: 4s]
W2[Client B: 4s]
W3[Client C: 4s]
end
subgraph With Jitter
J1[Client A: 1.2s]
J2[Client B: 3.7s]
J3[Client C: 2.4s]
end
style W1 fill:#ffccbc
style W2 fill:#ffccbc
style W3 fill:#ffccbc
style J1 fill:#c8e6c9
style J2 fill:#c8e6c9
style J3 fill:#c8e6c9
When NOT to Retry
Never retry when:
- The request succeeded (obviously, but check idempotency)
- The error is permanent: 400, 401, 403, 404
- You’ve exceeded max retries: Know when to give up
- The timeout has expired: Don’t retry into infinity
- The operation isn’t safe to repeat: Non-idempotent operations without idempotency keys
Be careful when:
- The operation has side effects: Might cause duplicates
- The timeout was ambiguous: Did it succeed or not?
- You’re propagating user latency: They’re waiting
Retry Budget
Set limits on retries to prevent endless retry loops:
- Max attempts: Limit total attempts (e.g., 3-5)
- Max total time: Stop retrying after N seconds
- Retry budget: Only retry X% of requests in a time window
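Putting the retry rules together, a bounded retry loop might look like this sketch (fetchOnce, isRetryable, and the limits are placeholders for your own client logic):
async function requestWithRetries(fetchOnce, isRetryable, maxAttempts = 4, maxTotalMs = 30000) {
  const start = Date.now();
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fetchOnce();
    } catch (error) {
      const lastAttempt = attempt === maxAttempts - 1;
      const outOfTime = Date.now() - start > maxTotalMs;
      if (lastAttempt || outOfTime || !isRetryable(error)) throw error;
      // Exponential backoff with full jitter before the next attempt
      const delayMs = Math.random() * 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}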
4. Idempotency: Making Retries Safe
A timeout happens. Did your payment go through? You don’t know. If you retry and it already succeeded, you might charge the customer twice.
Idempotency makes operations safe to retry.
What Is Idempotency?
An operation is idempotent if calling it multiple times produces the same result as calling it once.
graph LR
subgraph Idempotent
I1[x = 5] -->|Call once| IR1[x = 5]
I2[x = 5] -->|Call twice| IR2[x = 5]
I3[x = 5] -->|Call N times| IR3[x = 5]
end
subgraph Not Idempotent
N1[x = 0] -->|Call once| NR1[x = 1]
N2[x = 0] -->|Call twice| NR2[x = 2]
N3[x = 0] -->|Call N times| NR3[x = N]
end
style IR1 fill:#c8e6c9
style IR2 fill:#c8e6c9
style IR3 fill:#c8e6c9
style NR1 fill:#fff9c4
style NR2 fill:#fff9c4
style NR3 fill:#ffccbc
Naturally idempotent operations:
- GET /users/123 — Reading doesn't change anything
- PUT /users/123 {name: "Alice"} — Sets to same state each time
- DELETE /users/123 — Resource is gone after first call
NOT naturally idempotent:
- POST /payments — Each call might create a new payment
- POST /emails/send — Each call might send another email
- x = x + 1 — Each call increments
Idempotency Keys
For non-idempotent operations, use idempotency keys—unique identifiers that let the server detect duplicate requests.
POST /payments HTTP/1.1
Content-Type: application/json
Idempotency-Key: abc123-unique-request-id
{
"amount": 100,
"currency": "USD",
"recipient": "user_456"
}
How it works:
sequenceDiagram
participant C as Client
participant S as Server
participant DB as Database
C->>S: POST /payments
Idempotency-Key: abc123
S->>DB: Check: seen abc123?
DB-->>S: No
S->>DB: Process payment
S->>DB: Store: abc123 -> result
S-->>C: 200 OK, payment_id: 789
Note over C: Network fails, client retries
C->>S: POST /payments
Idempotency-Key: abc123
S->>DB: Check: seen abc123?
DB-->>S: Yes, result exists
S-->>C: 200 OK, payment_id: 789
Note over C: Same result, no duplicate payment!
Key properties of idempotency keys:
- Client-generated: The client creates a unique ID before the first attempt
- Stored by server: Server remembers key → result mapping
- Returns cached result: On retry, server returns the original response
- Expires eventually: Keys don’t need to live forever (hours to days)
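On the server side, the check-then-store flow from the diagram might look roughly like this (store, processPayment, and the 24-hour TTL are assumptions for illustration):
async function handleCreatePayment(req, store, processPayment) {
  const key = req.headers['idempotency-key'];
  if (!key) return { status: 400, body: { error: 'Idempotency-Key header required' } };
  // If this key was already processed, return the original result
  const cached = await store.get(key);
  if (cached) return { status: 200, body: cached };
  // Otherwise process once, then remember the result before responding
  const result = await processPayment(req.body);
  await store.set(key, result, { ttlSeconds: 24 * 60 * 60 });
  return { status: 201, body: result };
}
A real implementation also has to handle two concurrent requests arriving with the same key (for example with an atomic set-if-absent), which this sketch glosses over.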
The Idempotency Contract
When you use an idempotency key:
- Same key + same request = same result
- Same key + different request = error (some APIs allow this, some don’t)
- Server must complete storage before responding (or use transactions)
Common mistake: Generating a new idempotency key for each retry. That defeats the purpose—each retry looks like a new request!
// WRONG: New key for each attempt
async function processPaymentWrong(amount) {
for (let attempt = 0; attempt < 3; attempt++) {
try {
const idempotencyKey = generateUUID(); // New key each time!
return await api.createPayment({ amount, idempotencyKey });
} catch (error) {
if (attempt === 2) throw error;
}
}
}
// RIGHT: Same key for all attempts
async function processPaymentRight(amount) {
const idempotencyKey = generateUUID(); // Generate once
for (let attempt = 0; attempt < 3; attempt++) {
try {
return await api.createPayment({ amount, idempotencyKey });
} catch (error) {
if (attempt === 2) throw error;
}
}
}
5. Circuit Breaker: Protecting Your System
When a dependency is failing, continuing to call it wastes resources and can cause cascading failures. The circuit breaker pattern stops this.
The Circuit Breaker Analogy
Think of an electrical circuit breaker in your house:
- Normal operation: Electricity flows freely
- Overload detected: Breaker trips, cuts power
- Manual reset: After fixing the problem, you reset the breaker
An API circuit breaker works the same way:
- Closed state: Requests flow through normally
- Open state: Requests fail immediately without calling the failing service
- Half-open state: Allow a few test requests to check if service recovered
Circuit Breaker States
stateDiagram-v2
[*] --> Closed
Closed --> Open: Failure threshold exceeded
Open --> HalfOpen: Timeout expires
HalfOpen --> Closed: Test request succeeds
HalfOpen --> Open: Test request fails
note right of Closed
Normal operation
Counting failures
end note
note right of Open
Fail fast
No requests sent
end note
note right of HalfOpen
Testing recovery
Limited requests
end note
How It Works
Closed State (normal):
sequenceDiagram
participant C as Client
participant CB as Circuit Breaker
participant S as Service
C->>CB: Request 1
CB->>S: Forward request
S-->>CB: Success
CB-->>C: Success
C->>CB: Request 2
CB->>S: Forward request
S--xCB: Failure (1)
CB-->>C: Failure
C->>CB: Request 3
CB->>S: Forward request
S--xCB: Failure (2)
CB-->>C: Failure
Note over CB: Failure count: 2/5
Requests pass through and failures are counted. If failures exceed the threshold (e.g., 5 failures in 30 seconds), the circuit opens.
Open State (failing fast):
sequenceDiagram
participant C as Client
participant CB as Circuit Breaker
participant S as Service
Note over CB: Circuit OPEN
Service assumed down
C->>CB: Request 1
CB--xC: Fail immediately
Note over CB: No request sent!
C->>CB: Request 2
CB--xC: Fail immediately
C->>CB: Request 3
CB--xC: Fail immediately
Note over CB: Wait for timeout...
No requests reach the failing service. Clients get fast failures instead of slow timeouts. The system saves resources and avoids amplifying the problem.
Half-Open State (testing recovery):
sequenceDiagram
participant C as Client
participant CB as Circuit Breaker
participant S as Service
Note over CB: Timeout expired
Try half-open
C->>CB: Request 1
CB->>S: Test request
S-->>CB: Success!
CB-->>C: Success
Note over CB: Service recovered!
Circuit CLOSED
After a timeout (e.g., 30 seconds), the circuit breaker allows a test request. If it succeeds, the circuit closes. If it fails, the circuit opens again.
Why Circuit Breakers Matter
Without a circuit breaker:
graph LR
subgraph Without Circuit Breaker
A[Your Service] -->|100 req/s| B[Failing Service]
B -->|100 timeouts/s| A
A -->|Threads exhausted| C[Your Service Down]
end
style B fill:#ffccbc
style C fill:#ffccbc
With a circuit breaker:
graph LR
subgraph With Circuit Breaker
A[Your Service] -->|Circuit Open| CB[Circuit Breaker]
CB -->|Fast Fail| A
A -->|Degrades gracefully| D[Your Service Stays Up]
end
style D fill:#c8e6c9
Benefits:
- Fail fast: Don’t waste time on doomed requests
- Protect resources: Don’t exhaust thread pools/connections
- Allow recovery: Give the failing service breathing room
- Prevent cascading failures: Your failure doesn’t become everyone’s failure
Circuit Breaker Configuration
Key parameters:
| Parameter | Description | Typical Value |
|---|---|---|
| Failure threshold | Failures to open circuit | 5-10 failures |
| Time window | Period for counting failures | 30-60 seconds |
| Open timeout | How long circuit stays open | 30-60 seconds |
| Half-open requests | Test requests allowed | 1-3 |
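To make the state machine concrete, here's a stripped-down circuit breaker sketch (it counts consecutive failures rather than failures per time window, and allows a single half-open test request; the defaults echo the table above):
class CircuitBreaker {
  constructor({ failureThreshold = 5, openTimeoutMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.openTimeoutMs = openTimeoutMs;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }
  async call(fn) {
    if (this.state === 'OPEN') {
      // Fail fast until the open timeout expires, then allow one test request
      if (Date.now() - this.openedAt < this.openTimeoutMs) {
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'CLOSED'; // Success closes (or keeps) the circuit closed
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw error;
    }
  }
}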
6. Graceful Degradation: Failing Well
When a dependency fails, you have choices beyond “error” or “success.” Graceful degradation means providing reduced functionality instead of complete failure.
Degradation Strategies
Return cached data:
When the product API is down, return stale cached data with a warning.
{
"products": [...],
"metadata": {
"cached": true,
"cachedAt": "2026-01-12T10:00:00Z",
"warning": "Data may be outdated"
}
}
Return default values:
When personalization service is down, show generic recommendations.
{
"recommendations": ["popular-item-1", "popular-item-2"],
"personalized": false
}
Reduce functionality:
When payment processing is slow, disable checkout but keep browsing working.
Queue for later:
When email service is down, queue emails and send when service recovers.
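As a sketch, the cached-data and default-value strategies can be wrapped around a single call (getRecommendations and the cache object are stand-ins for your own services):
async function recommendationsWithFallback(userId, getRecommendations, cache) {
  try {
    const items = await getRecommendations(userId);
    await cache.set(`recs:${userId}`, items); // Keep the fallback copy fresh
    return { recommendations: items, personalized: true };
  } catch (error) {
    // Dependency failed: prefer stale data, then a generic default, over an error
    const cached = await cache.get(`recs:${userId}`);
    if (cached) {
      return { recommendations: cached, personalized: true, warning: 'Data may be outdated' };
    }
    return { recommendations: ['popular-item-1', 'popular-item-2'], personalized: false };
  }
}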
Degradation Decision Tree
flowchart TD
F[Dependency Failed] --> Q1{Critical for
this operation?}
Q1 -->|No| S1[Skip it
Continue without]
Q1 -->|Yes| Q2{Have cached data?}
Q2 -->|Yes| S2[Return cached
with warning]
Q2 -->|No| Q3{Have default value?}
Q3 -->|Yes| S3[Return default
with warning]
Q3 -->|No| Q4{Can queue for later?}
Q4 -->|Yes| S4[Queue and
acknowledge]
Q4 -->|No| S5[Return error
gracefully]
style S1 fill:#c8e6c9
style S2 fill:#c8e6c9
style S3 fill:#fff9c4
style S4 fill:#fff9c4
style S5 fill:#ffccbc
Examples of Graceful Degradation
E-commerce site:
| Feature | Degraded State | User Experience |
|---|---|---|
| Product search | Show cached results | “Results may be outdated” |
| Recommendations | Show popular items | Less personalized but functional |
| Reviews | Hide reviews section | Product page still works |
| Payment | Show “try again later” | Can’t buy, but can browse |
Social media:
| Feature | Degraded State | User Experience |
|---|---|---|
| Timeline | Show cached posts | “Showing recent posts” |
| Like count | Hide counts | Can still like |
| Comments | Disable new comments | Can read existing |
| Notifications | Batch and delay | Slight delay acceptable |
The Key Principle
Always ask: “What’s the best experience I can provide when this fails?”
The answer is rarely “show an error page.” Usually there’s a partial experience that’s better than nothing.
7. Putting It All Together
Here’s how these concepts work together in a resilient system:
flowchart TD
R[Request] --> T{Timeout
configured?}
T -->|No| WARN[Add timeout!]
T -->|Yes| CB{Circuit
breaker state?}
CB -->|Open| FF[Fast fail]
CB -->|Closed/Half-Open| S[Send request]
S --> RESULT{Result?}
RESULT -->|Success| REC[Record success]
RESULT -->|Failure| CHK{Retryable?}
CHK -->|No| FINAL[Return error]
CHK -->|Yes| IK{Idempotent/
has key?}
IK -->|No| DANGER[Dangerous to retry!]
IK -->|Yes| LIMIT{Under
retry limit?}
LIMIT -->|No| GIVEUP[Give up]
LIMIT -->|Yes| WAIT[Backoff + jitter]
WAIT --> CB
REC --> CLOSE[Circuit stays/becomes closed]
FINAL --> COUNT[Count failure]
COUNT --> THRESH{Threshold
exceeded?}
THRESH -->|Yes| OPEN[Open circuit]
THRESH -->|No| DONE[Done]
FF --> DEGRADE{Can degrade
gracefully?}
DEGRADE -->|Yes| CACHED[Return cached/default]
DEGRADE -->|No| ERROR[Return error]
style WARN fill:#ffccbc
style DANGER fill:#ffccbc
style FF fill:#fff9c4
style CACHED fill:#c8e6c9
Summary of Resilience Patterns
| Problem | Solution | Key Concept |
|---|---|---|
| Slow responses | Timeouts | Bounded wait time |
| Transient failures | Retries | Try again |
| Retry storms | Exponential backoff | Wait longer each time |
| Synchronized retries | Jitter | Randomize delays |
| Duplicate operations | Idempotency keys | Same input = same output |
| Cascading failures | Circuit breaker | Fail fast when dependency is down |
| Total failure | Graceful degradation | Best possible experience |
What’s Next
This guide covered the concepts of resilience—understanding why things fail and the patterns that help.
For implementation details including:
- Configuring timeouts for different scenarios
- Tuning retry parameters for your use case
- Implementing idempotency key storage
- Setting circuit breaker thresholds
- Metrics, alerting, and observability for failures
See the upcoming course: Resiliencia y tolerancia a fallos en APIs REST.
Related Vocabulary Terms
Deepen your understanding with these related concepts:
- Circuit Breaker - Pattern for handling failing dependencies
- Idempotency - Making operations safe to retry
- Retry - Strategies for trying again
- Timeout - Bounding wait time for operations
- Error Handling - Patterns for handling API errors
- Backoff - Delaying retries appropriately
- Rate Limiting - Understanding 429 responses
- 5xx Status Codes - Server error responses
- 429 Too Many Requests - Rate limit exceeded