Audience
This guide is for developers who need to build reliable integrations with APIs:
- Backend developers implementing API clients that need to handle failures gracefully
- Frontend developers building applications that remain responsive when APIs fail
- System architects designing resilient distributed systems
- DevOps engineers understanding failure patterns to improve monitoring and alerting
- Anyone who has been frustrated by APIs that timeout, return errors, or become unavailable
You should be familiar with HTTP basics. If not, start with HTTP for REST APIs.
Goal
After reading this guide, you’ll understand:
- Why APIs fail and the different categories of failures
- What timeouts are and why they’re critical for system health
- How retry strategies work, including exponential backoff and jitter
- Why idempotency matters and how idempotency keys prevent duplicates
- How the circuit breaker pattern protects your system from cascading failures
- What graceful degradation means and when to apply it
You won’t be configuring production-ready resilience patterns yet, but you’ll have a solid mental model of how resilient systems handle failure.
1. Why APIs Fail
APIs fail. Not occasionally—constantly. Understanding failure modes is the first step to building resilient systems.
Categories of Failures
API failures fall into distinct categories, each requiring different handling:
graph TD
F[API Failure] --> N[Network Failures]
F --> T[Timeout Failures]
F --> O[Overload Failures]
F --> B[Bug Failures]
F --> D[Dependency Failures]
N --> N1[Connection refused]
N --> N2[DNS resolution failed]
N --> N3[TLS handshake failed]
T --> T1[Read timeout]
T --> T2[Connect timeout]
T --> T3[Idle timeout]
O --> O1[Rate limiting - 429]
O --> O2[Server overloaded - 503]
O --> O3[Queue full]
B --> B1[Server error - 500]
B --> B2[Unexpected response]
B --> B3[Data corruption]
D --> D1[Database down]
D --> D2[External API failed]
D --> D3[Message queue unavailable]
style F fill:#ffccbc
style N fill:#fff9c4
style T fill:#fff9c4
style O fill:#fff9c4
style B fill:#fff9c4
style D fill:#fff9c4
Network Failures
The network between your client and the API can fail in many ways:
- Connection refused: The server isn’t accepting connections
- DNS failure: Can’t resolve the hostname to an IP address
- TLS errors: Certificate expired, hostname mismatch, protocol incompatibility
- Connection reset: The connection was forcibly closed
- Packet loss: Data never arrives or is corrupted in transit
Network failures are usually transient—they resolve themselves when the network recovers.
Timeout Failures
Timeouts occur when something takes too long:
- Connection timeout: Establishing the TCP connection takes too long
- Read timeout: Waiting for response data takes too long
- Idle timeout: The connection sat unused for too long
Timeouts are ambiguous: Did the request succeed? You sent the request, but you don’t know if the server received it, processed it, or sent a response that got lost.
Overload Failures
Servers can become overwhelmed:
- Rate limiting (429): You’re sending requests too fast
- Server overload (503): The server can’t handle current load
- Queue full: Request queues are saturated
Overload failures signal: “I’m too busy, try again later.”
Bug Failures
Sometimes the server itself is broken:
- 500 Internal Server Error: Unhandled exception, null pointer, etc.
- Unexpected response format: API changed without notice
- Logic errors: Server processed the request incorrectly
Bug failures often require human intervention to fix.
Dependency Failures
Modern APIs depend on other services:
- Database unavailable: Can’t store or retrieve data
- External API failed: Third-party service is down
- Cache failure: Redis/Memcached unavailable
Dependency failures cascade—one failing service can bring down many others.
The Reality of Failure
In a distributed system with multiple services, failures are inevitable. If each service has 99.9% uptime, a request touching 10 services succeeds only 0.999^10 ≈ 99.0% of the time, so roughly 1% of requests fail even when every individual service is “reliable.”
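To see where that number comes from, here's the arithmetic as a quick sketch (the per-service uptime and service count are just the illustrative figures above):
// Success rate of a request that must traverse N services in sequence
const perServiceUptime = 0.999;  // 99.9% availability per service
const servicesTouched = 10;
const successRate = Math.pow(perServiceUptime, servicesTouched);
console.log(successRate.toFixed(4)); // ≈ 0.9900, so roughly 1% of requests fail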
Building resilient systems means accepting that failures happen and designing for them.
2. Timeouts: Your First Line of Defense
A timeout is a limit on how long you’ll wait for something to complete. Without timeouts, a slow or unresponsive API can hang your entire application.
Why Timeouts Matter
Imagine a payment API that becomes slow. Without timeouts:
sequenceDiagram
participant C as Your App
participant P as Payment API
C->>P: Process payment
Note over P: API is slow...
Note over C: Thread blocked
waiting forever
Note over P: Still processing...
Note over C: More requests arrive
All threads blocked
Note over P: More time passes...
Note over C: Thread pool exhausted
Application hangs
style C fill:#ffccbc
One slow dependency can consume all your resources, making your entire application unresponsive. This is called resource exhaustion.
Types of Timeouts
Connection timeout: How long to wait when establishing a connection.
- Too short: Fails before connection completes on slow networks
- Too long: Resources tied up waiting for unreachable servers
- Typical values: 1-10 seconds
Read timeout (response timeout): How long to wait for response data after connecting.
- Too short: Fails on legitimate slow operations
- Too long: Resources tied up waiting for stuck servers
- Typical values: 5-60 seconds (depends on operation)
Total timeout: Maximum time for the entire request-response cycle.
- Ensures bounded wait time regardless of retries
- Typical values: 30-120 seconds
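As a minimal sketch, here's what a total timeout might look like in JavaScript using fetch with an AbortController (the URL handling and the 5-second default are placeholders, not recommendations):
async function getWithTimeout(url, timeoutMs = 5000) {
  const controller = new AbortController();
  // Abort the request if it hasn't completed within the total timeout
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await response.json();
  } finally {
    clearTimeout(timer); // Always clear the timer, success or failure
  }
}
Note that this bounds the whole request-response cycle; connect and read timeouts are typically configured separately on the HTTP client, when it exposes them.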
The Timeout Dilemma
Setting timeouts involves tradeoffs:
| Timeout Too Short | Timeout Too Long |
|---|---|
| Fails legitimate requests | Slow failures |
| High error rate | Resource exhaustion |
| Poor user experience | Cascading slowdowns |
The key insight: A fast failure is often better than a slow success. Your users would rather see “try again” than wait forever.
Timeout as a Contract
Think of timeouts as a contract with your users: “I promise to respond within X seconds, even if the answer is ‘I don’t know.’”
# Request with timeout context
POST /payments HTTP/1.1
X-Request-Timeout: 30
# Server should abort if it can't respond in time
Without this contract, your application’s response time is determined by your slowest dependency—which might be infinitely slow.
3. Retry Strategies: When and How to Try Again
Retries can turn transient failures into successful requests. But naive retries can make things worse.
When to Retry
Not all failures should be retried:
flowchart TD
E[Error Occurred] --> S{Status Code?}
S -->|4xx| C4[Client Error]
S -->|5xx| C5[Server Error]
S -->|Network Error| CN[Network Error]
S -->|Timeout| CT[Timeout]
C4 --> D4{Which 4xx?}
D4 -->|400, 401, 403, 404| N1[Don't Retry
Fix the request]
D4 -->|429| R1[Retry
After delay]
D4 -->|408| R2[Retry
Immediately possible]
C5 --> D5{Which 5xx?}
D5 -->|500| M1[Maybe Retry
Server bug?]
D5 -->|502, 503, 504| R3[Retry
Transient failure]
CN --> R4[Retry
Network recovered?]
CT --> R5[Retry
But carefully!]
style N1 fill:#ffccbc
style R1 fill:#c8e6c9
style R2 fill:#c8e6c9
style R3 fill:#c8e6c9
style R4 fill:#c8e6c9
style R5 fill:#fff9c4
style M1 fill:#fff9c4
Safe to retry:
- 429 Too Many Requests (after waiting)
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
- Network errors (connection refused, DNS failure)
- Timeouts (with caution)
Don’t retry:
- 400 Bad Request (your request is malformed)
- 401 Unauthorized (your credentials are wrong)
- 403 Forbidden (you don’t have permission)
- 404 Not Found (resource doesn’t exist)
- 409 Conflict (state conflict, needs resolution)
Maybe retry:
- 500 Internal Server Error (might be a transient bug)
- Timeouts on non-idempotent operations (see idempotency section)
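Expressed as code, the decision above might look like this sketch (the error.status and error.isIdempotent fields are assumptions about your client's error shape):
function isRetryable(error) {
  // No HTTP status at all: network failure or timeout, usually transient
  if (error.status === undefined) return true;
  // Overload, gateway, and request-timeout errors: retry after a delay
  if ([408, 429, 502, 503, 504].includes(error.status)) return true;
  // 500 is a "maybe": retry only if the operation is safe to repeat
  if (error.status === 500) return error.isIdempotent === true;
  // Everything else (400, 401, 403, 404, 409, ...) needs a fix, not a retry
  return false;
}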
The Problem with Naive Retries
Simple immediate retries can cause retry storms:
sequenceDiagram
participant C1 as Client 1
participant C2 as Client 2
participant C3 as Client 3
participant S as Overloaded Server
Note over S: Server at 100% capacity
C1->>S: Request 1
C2->>S: Request 2
C3->>S: Request 3
S--xC1: 503 Service Unavailable
S--xC2: 503 Service Unavailable
S--xC3: 503 Service Unavailable
Note over C1,C3: All clients retry immediately
C1->>S: Retry 1
C2->>S: Retry 2
C3->>S: Retry 3
Note over S: Now handling 6 requests
instead of 3!
style S fill:#ffccbc
When all clients retry immediately, they amplify the load on an already struggling server.
Exponential Backoff
The solution is exponential backoff: Wait longer between each retry.
Attempt 1: Wait 1 second
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds
Attempt 5: Wait 16 seconds
...
The delay grows exponentially: delay = base * 2^attempt
graph LR
A1[Attempt 1] -->|1s wait| A2[Attempt 2]
A2 -->|2s wait| A3[Attempt 3]
A3 -->|4s wait| A4[Attempt 4]
A4 -->|8s wait| A5[Attempt 5]
style A1 fill:#e3f2fd
style A2 fill:#bbdefb
style A3 fill:#90caf9
style A4 fill:#64b5f6
style A5 fill:#42a5f5
This gives the server time to recover while preventing retry storms.
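In code, the delay calculation is a one-liner; here's a sketch with attempts counted from zero (the base delay and cap are illustrative):
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  // delay = base * 2^attempt, capped so late attempts don't wait forever
  return Math.min(capMs, baseMs * 2 ** attempt);
}
// backoffDelay(0) → 1000ms, backoffDelay(1) → 2000ms, backoffDelay(4) → 16000ms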
Jitter: Randomizing Delays
Even with exponential backoff, if 1000 clients start retrying at the same time, they’ll hit the server in synchronized waves.
Jitter adds randomness to the delay:
# Full jitter (recommended)
delay = random(0, base * 2^attempt)
# Decorrelated jitter
delay = random(base, previous_delay * 3)
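The same two formulas in JavaScript (a sketch; Math.random stands in for random() above):
// Full jitter: anywhere between 0 and the full exponential delay
function fullJitter(attempt, baseMs = 1000) {
  return Math.random() * baseMs * 2 ** attempt;
}
// Decorrelated jitter: based on the previous delay rather than the attempt count
function decorrelatedJitter(previousDelayMs, baseMs = 1000) {
  return baseMs + Math.random() * (previousDelayMs * 3 - baseMs);
}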
Instead of all clients waiting exactly 4 seconds, they wait randomly between 0-4 seconds, spreading the load:
graph LR
subgraph Without Jitter
W1[Client A: 4s]
W2[Client B: 4s]
W3[Client C: 4s]
end
subgraph With Jitter
J1[Client A: 1.2s]
J2[Client B: 3.7s]
J3[Client C: 2.4s]
end
style W1 fill:#ffccbc
style W2 fill:#ffccbc
style W3 fill:#ffccbc
style J1 fill:#c8e6c9
style J2 fill:#c8e6c9
style J3 fill:#c8e6c9
When NOT to Retry
Never retry when:
- The request succeeded (obviously, but check idempotency)
- The error is permanent: 400, 401, 403, 404
- You’ve exceeded max retries: Know when to give up
- The timeout has expired: Don’t retry into infinity
- The operation isn’t safe to repeat: Non-idempotent operations without idempotency keys
Be careful when:
- The operation has side effects: Might cause duplicates
- The timeout was ambiguous: Did it succeed or not?
- You’re propagating user latency: They’re waiting
Retry Budget
Set limits on retries to prevent endless retry loops:
- Max attempts: Limit total attempts (e.g., 3-5)
- Max total time: Stop retrying after N seconds
- Retry budget: Only retry X% of requests in a time window
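Putting the retry rules together, a bounded retry loop might look like this sketch (fetchOnce, isRetryable, and the limits are placeholders for your own client logic):
async function requestWithRetries(fetchOnce, isRetryable, maxAttempts = 4, maxTotalMs = 30000) {
  const start = Date.now();
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fetchOnce();
    } catch (error) {
      const lastAttempt = attempt === maxAttempts - 1;
      const outOfTime = Date.now() - start > maxTotalMs;
      if (lastAttempt || outOfTime || !isRetryable(error)) throw error;
      // Exponential backoff with full jitter before the next attempt
      const delayMs = Math.random() * 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}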
4. Idempotency: Making Retries Safe
A timeout happens. Did your payment go through? You don’t know. If you retry and it already succeeded, you might charge the customer twice.
Idempotency makes operations safe to retry.
What Is Idempotency?
An operation is idempotent if calling it multiple times produces the same result as calling it once.
graph LR
subgraph Idempotent
I1[x = 5] -->|Call once| IR1[x = 5]
I2[x = 5] -->|Call twice| IR2[x = 5]
I3[x = 5] -->|Call N times| IR3[x = 5]
end
subgraph Not Idempotent
N1[x = 0] -->|Call once| NR1[x = 1]
N2[x = 0] -->|Call twice| NR2[x = 2]
N3[x = 0] -->|Call N times| NR3[x = N]
end
style IR1 fill:#c8e6c9
style IR2 fill:#c8e6c9
style IR3 fill:#c8e6c9
style NR1 fill:#fff9c4
style NR2 fill:#fff9c4
style NR3 fill:#ffccbc
Naturally idempotent operations:
- GET /users/123 — Reading doesn't change anything
- PUT /users/123 {name: "Alice"} — Sets to same state each time
- DELETE /users/123 — Resource is gone after first call
NOT naturally idempotent:
- POST /payments — Each call might create a new payment
- POST /emails/send — Each call might send another email
- x = x + 1 — Each call increments
Idempotency Keys
For non-idempotent operations, use idempotency keys—unique identifiers that let the server detect duplicate requests.
POST /payments HTTP/1.1
Content-Type: application/json
Idempotency-Key: abc123-unique-request-id
{
"amount": 100,
"currency": "USD",
"recipient": "user_456"
}
How it works:
sequenceDiagram
participant C as Client
participant S as Server
participant DB as Database
C->>S: POST /payments
Idempotency-Key: abc123
S->>DB: Check: seen abc123?
DB-->>S: No
S->>DB: Process payment
S->>DB: Store: abc123 -> result
S-->>C: 200 OK, payment_id: 789
Note over C: Network fails, client retries
C->>S: POST /payments
Idempotency-Key: abc123
S->>DB: Check: seen abc123?
DB-->>S: Yes, result exists
S-->>C: 200 OK, payment_id: 789
Note over C: Same result, no duplicate payment!
Key properties of idempotency keys:
- Client-generated: The client creates a unique ID before the first attempt
- Stored by server: Server remembers key → result mapping
- Returns cached result: On retry, server returns the original response
- Expires eventually: Keys don’t need to live forever (hours to days)
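On the server side, the check-then-store flow from the diagram might look roughly like this (store, processPayment, and the 24-hour TTL are assumptions for illustration):
async function handleCreatePayment(req, store, processPayment) {
  const key = req.headers['idempotency-key'];
  if (!key) return { status: 400, body: { error: 'Idempotency-Key header required' } };
  // If this key was already processed, return the original result
  const cached = await store.get(key);
  if (cached) return { status: 200, body: cached };
  // Otherwise process once, then remember the result before responding
  const result = await processPayment(req.body);
  await store.set(key, result, { ttlSeconds: 24 * 60 * 60 });
  return { status: 201, body: result };
}
A real implementation also has to handle two concurrent requests arriving with the same key (for example with an atomic set-if-absent), which this sketch glosses over.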
The Idempotency Contract
When you use an idempotency key:
- Same key + same request = same result
- Same key + different request = error (some APIs allow this, some don’t)
- Server must complete storage before responding (or use transactions)
Common mistake: Generating a new idempotency key for each retry. That defeats the purpose—each retry looks like a new request!
// WRONG: New key for each attempt
async function processPaymentWrong(amount) {
for (let attempt = 0; attempt < 3; attempt++) {
try {
const idempotencyKey = generateUUID(); // New key each time!
return await api.createPayment({ amount, idempotencyKey });
} catch (error) {
if (attempt === 2) throw error;
}
}
}
// RIGHT: Same key for all attempts
async function processPaymentRight(amount) {
const idempotencyKey = generateUUID(); // Generate once
for (let attempt = 0; attempt < 3; attempt++) {
try {
return await api.createPayment({ amount, idempotencyKey });
} catch (error) {
if (attempt === 2) throw error;
}
}
}
5. Circuit Breaker: Protecting Your System
When a dependency is failing, continuing to call it wastes resources and can cause cascading failures. The circuit breaker pattern stops this.
The Circuit Breaker Analogy
Think of an electrical circuit breaker in your house:
- Normal operation: Electricity flows freely
- Overload detected: Breaker trips, cuts power
- Manual reset: After fixing the problem, you reset the breaker
An API circuit breaker works the same way:
- Closed state: Requests flow through normally
- Open state: Requests fail immediately without calling the failing service
- Half-open state: Allow a few test requests to check if service recovered
Circuit Breaker States
stateDiagram-v2
[*] --> Closed
Closed --> Open: Failure threshold exceeded
Open --> HalfOpen: Timeout expires
HalfOpen --> Closed: Test request succeeds
HalfOpen --> Open: Test request fails
note right of Closed
Normal operation
Counting failures
end note
note right of Open
Fail fast
No requests sent
end note
note right of HalfOpen
Testing recovery
Limited requests
end note
How It Works
Closed State (normal):
sequenceDiagram
participant C as Client
participant CB as Circuit Breaker
participant S as Service
C->>CB: Request 1
CB->>S: Forward request
S-->>CB: Success
CB-->>C: Success
C->>CB: Request 2
CB->>S: Forward request
S--xCB: Failure (1)
CB-->>C: Failure
C->>CB: Request 3
CB->>S: Forward request
S--xCB: Failure (2)
CB-->>C: Failure
Note over CB: Failure count: 2/5
Requests pass through and failures are counted. If failures exceed the threshold (e.g., 5 failures in 30 seconds), the circuit opens.
Open State (failing fast):
sequenceDiagram
participant C as Client
participant CB as Circuit Breaker
participant S as Service
Note over CB: Circuit OPEN
Service assumed down
C->>CB: Request 1
CB--xC: Fail immediately
Note over CB: No request sent!
C->>CB: Request 2
CB--xC: Fail immediately
C->>CB: Request 3
CB--xC: Fail immediately
Note over CB: Wait for timeout...
No requests reach the failing service. Clients get fast failures instead of slow timeouts. The system saves resources and avoids amplifying the problem.
Half-Open State (testing recovery):
sequenceDiagram
participant C as Client
participant CB as Circuit Breaker
participant S as Service
Note over CB: Timeout expired
Try half-open
C->>CB: Request 1
CB->>S: Test request
S-->>CB: Success!
CB-->>C: Success
Note over CB: Service recovered!
Circuit CLOSED
After a timeout (e.g., 30 seconds), the circuit breaker allows a test request. If it succeeds, the circuit closes. If it fails, the circuit opens again.
Why Circuit Breakers Matter
Without a circuit breaker:
graph LR
subgraph Without Circuit Breaker
A[Your Service] -->|100 req/s| B[Failing Service]
B -->|100 timeouts/s| A
A -->|Threads exhausted| C[Your Service Down]
end
style B fill:#ffccbc
style C fill:#ffccbc
With a circuit breaker:
graph LR
subgraph With Circuit Breaker
A[Your Service] -->|Circuit Open| CB[Circuit Breaker]
CB -->|Fast Fail| A
A -->|Degrades gracefully| D[Your Service Stays Up]
end
style D fill:#c8e6c9
Benefits:
- Fail fast: Don’t waste time on doomed requests
- Protect resources: Don’t exhaust thread pools/connections
- Allow recovery: Give the failing service breathing room
- Prevent cascading failures: Your failure doesn’t become everyone’s failure
Circuit Breaker Configuration
Key parameters:
| Parameter | Description | Typical Value |
|---|---|---|
| Failure threshold | Failures to open circuit | 5-10 failures |
| Time window | Period for counting failures | 30-60 seconds |
| Open timeout | How long circuit stays open | 30-60 seconds |
| Half-open requests | Test requests allowed | 1-3 |
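To make the state machine concrete, here's a stripped-down circuit breaker sketch (it counts consecutive failures rather than failures per time window, and allows a single half-open test request; the defaults echo the table above):
class CircuitBreaker {
  constructor({ failureThreshold = 5, openTimeoutMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.openTimeoutMs = openTimeoutMs;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }
  async call(fn) {
    if (this.state === 'OPEN') {
      // Fail fast until the open timeout expires, then allow one test request
      if (Date.now() - this.openedAt < this.openTimeoutMs) {
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'CLOSED'; // Success closes (or keeps) the circuit closed
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw error;
    }
  }
}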
6. Graceful Degradation: Failing Well
When a dependency fails, you have choices beyond “error” or “success.” Graceful degradation means providing reduced functionality instead of complete failure.
Degradation Strategies
Return cached data:
When the product API is down, return stale cached data with a warning.
{
"products": [...],
"metadata": {
"cached": true,
"cachedAt": "2026-01-12T10:00:00Z",
"warning": "Data may be outdated"
}
}
Return default values:
When personalization service is down, show generic recommendations.
{
"recommendations": ["popular-item-1", "popular-item-2"],
"personalized": false
}
Reduce functionality:
When payment processing is slow, disable checkout but keep browsing working.
Queue for later:
When email service is down, queue emails and send when service recovers.
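As a sketch, the cached-data and default-value strategies can be wrapped around a single call (getRecommendations and the cache object are stand-ins for your own services):
async function recommendationsWithFallback(userId, getRecommendations, cache) {
  try {
    const items = await getRecommendations(userId);
    await cache.set(`recs:${userId}`, items); // Keep the fallback copy fresh
    return { recommendations: items, personalized: true };
  } catch (error) {
    // Dependency failed: prefer stale data, then a generic default, over an error
    const cached = await cache.get(`recs:${userId}`);
    if (cached) {
      return { recommendations: cached, personalized: true, warning: 'Data may be outdated' };
    }
    return { recommendations: ['popular-item-1', 'popular-item-2'], personalized: false };
  }
}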
Degradation Decision Tree
flowchart TD
F[Dependency Failed] --> Q1{Critical for
this operation?}
Q1 -->|No| S1[Skip it
Continue without]
Q1 -->|Yes| Q2{Have cached data?}
Q2 -->|Yes| S2[Return cached
with warning]
Q2 -->|No| Q3{Have default value?}
Q3 -->|Yes| S3[Return default
with warning]
Q3 -->|No| Q4{Can queue for later?}
Q4 -->|Yes| S4[Queue and
acknowledge]
Q4 -->|No| S5[Return error
gracefully]
style S1 fill:#c8e6c9
style S2 fill:#c8e6c9
style S3 fill:#fff9c4
style S4 fill:#fff9c4
style S5 fill:#ffccbc
Examples of Graceful Degradation
E-commerce site:
| Feature | Degraded State | User Experience |
|---|---|---|
| Product search | Show cached results | “Results may be outdated” |
| Recommendations | Show popular items | Less personalized but functional |
| Reviews | Hide reviews section | Product page still works |
| Payment | Show “try again later” | Can’t buy, but can browse |
Social media:
| Feature | Degraded State | User Experience |
|---|---|---|
| Timeline | Show cached posts | “Showing recent posts” |
| Like count | Hide counts | Can still like |
| Comments | Disable new comments | Can read existing |
| Notifications | Batch and delay | Slight delay acceptable |
The Key Principle
Always ask: “What’s the best experience I can provide when this fails?”
The answer is rarely “show an error page.” Usually there’s a partial experience that’s better than nothing.
7. Putting It All Together
Here’s how these concepts work together in a resilient system:
flowchart TD
R[Request] --> T{Timeout
configured?}
T -->|No| WARN[Add timeout!]
T -->|Yes| CB{Circuit
breaker state?}
CB -->|Open| FF[Fast fail]
CB -->|Closed/Half-Open| S[Send request]
S --> RESULT{Result?}
RESULT -->|Success| REC[Record success]
RESULT -->|Failure| CHK{Retryable?}
CHK -->|No| FINAL[Return error]
CHK -->|Yes| IK{Idempotent/
has key?}
IK -->|No| DANGER[Dangerous to retry!]
IK -->|Yes| LIMIT{Under
retry limit?}
LIMIT -->|No| GIVEUP[Give up]
LIMIT -->|Yes| WAIT[Backoff + jitter]
WAIT --> CB
REC --> CLOSE[Circuit stays/becomes closed]
FINAL --> COUNT[Count failure]
COUNT --> THRESH{Threshold
exceeded?}
THRESH -->|Yes| OPEN[Open circuit]
THRESH -->|No| DONE[Done]
FF --> DEGRADE{Can degrade
gracefully?}
DEGRADE -->|Yes| CACHED[Return cached/default]
DEGRADE -->|No| ERROR[Return error]
style WARN fill:#ffccbc
style DANGER fill:#ffccbc
style FF fill:#fff9c4
style CACHED fill:#c8e6c9
Summary of Resilience Patterns
| Problem | Solution | Key Concept |
|---|---|---|
| Slow responses | Timeouts | Bounded wait time |
| Transient failures | Retries | Try again |
| Retry storms | Exponential backoff | Wait longer each time |
| Synchronized retries | Jitter | Randomize delays |
| Duplicate operations | Idempotency keys | Same input = same output |
| Cascading failures | Circuit breaker | Fail fast when dependency is down |
| Total failure | Graceful degradation | Best possible experience |
What’s Next
This guide covered the concepts of resilience—understanding why things fail and the patterns that help.
For implementation details including:
- Configuring timeouts for different scenarios
- Tuning retry parameters for your use case
- Implementing idempotency key storage
- Setting circuit breaker thresholds
- Metrics, alerting, and observability for failures
See the upcoming course: Resiliencia y tolerancia a fallos en APIs REST.
Related Vocabulary Terms
Deepen your understanding with these related concepts:
- Circuit Breaker - Pattern for handling failing dependencies
- Idempotency - Making operations safe to retry
- Retry - Strategies for trying again
- Timeout - Bounding wait time for operations
- Error Handling - Patterns for handling API errors
- Backoff - Delaying retries appropriately
- Rate Limiting - Understanding 429 responses
- 5xx Status Codes - Server error responses
- 429 Too Many Requests - Rate limit exceeded