Errors, Retries, and Resilience

Patterns · Intermediate · 20 min · Jan 12, 2026

Audience

This guide is for developers who need to build reliable integrations with APIs:

  • Backend developers implementing API clients that need to handle failures gracefully
  • Frontend developers building applications that remain responsive when APIs fail
  • System architects designing resilient distributed systems
  • DevOps engineers understanding failure patterns to improve monitoring and alerting
  • Anyone who has been frustrated by APIs that time out, return errors, or become unavailable

You should be familiar with HTTP basics. If not, start with HTTP for REST APIs.

Goal

After reading this guide, you’ll understand:

  • Why APIs fail and the different categories of failures
  • What timeouts are and why they’re critical for system health
  • How retry strategies work, including exponential backoff and jitter
  • Why idempotency matters and how idempotency keys prevent duplicates
  • How the circuit breaker pattern protects your system from cascading failures
  • What graceful degradation means and when to apply it

You won’t be configuring production-ready resilience patterns yet, but you’ll have a solid mental model of how resilient systems handle failure.

1. Why APIs Fail

APIs fail. Not occasionally—constantly. Understanding failure modes is the first step to building resilient systems.

Categories of Failures

API failures fall into distinct categories, each requiring different handling:

graph TD
    F[API Failure] --> N[Network Failures]
    F --> T[Timeout Failures]
    F --> O[Overload Failures]
    F --> B[Bug Failures]
    F --> D[Dependency Failures]

    N --> N1[Connection refused]
    N --> N2[DNS resolution failed]
    N --> N3[TLS handshake failed]

    T --> T1[Read timeout]
    T --> T2[Connect timeout]
    T --> T3[Idle timeout]

    O --> O1[Rate limiting - 429]
    O --> O2[Server overloaded - 503]
    O --> O3[Queue full]

    B --> B1[Server error - 500]
    B --> B2[Unexpected response]
    B --> B3[Data corruption]

    D --> D1[Database down]
    D --> D2[External API failed]
    D --> D3[Message queue unavailable]

    style F fill:#ffccbc
    style N fill:#fff9c4
    style T fill:#fff9c4
    style O fill:#fff9c4
    style B fill:#fff9c4
    style D fill:#fff9c4

Network Failures

The network between your client and the API can fail in many ways:

  • Connection refused: The server isn’t accepting connections
  • DNS failure: Can’t resolve the hostname to an IP address
  • TLS errors: Certificate expired, hostname mismatch, protocol incompatibility
  • Connection reset: The connection was forcibly closed
  • Packet loss: Data never arrives or is corrupted in transit

Network failures are usually transient—they resolve themselves when the network recovers.

Timeout Failures

Timeouts occur when something takes too long:

  • Connection timeout: Establishing the TCP connection takes too long
  • Read timeout: Waiting for response data takes too long
  • Idle timeout: The connection sat unused for too long

Timeouts are ambiguous: Did the request succeed? You sent the request, but you don’t know if the server received it, processed it, or sent a response that got lost.

Overload Failures

Servers can become overwhelmed:

  • Rate limiting (429): You’re sending requests too fast
  • Server overload (503): The server can’t handle current load
  • Queue full: Request queues are saturated

Overload failures signal: “I’m too busy, try again later.”

Bug Failures

Sometimes the server itself is broken:

  • 500 Internal Server Error: Unhandled exception, null pointer, etc.
  • Unexpected response format: API changed without notice
  • Logic errors: Server processed the request incorrectly

Bug failures often require human intervention to fix.

Dependency Failures

Modern APIs depend on other services:

  • Database unavailable: Can’t store or retrieve data
  • External API failed: Third-party service is down
  • Cache failure: Redis/Memcached unavailable

Dependency failures cascade—one failing service can bring down many others.

The Reality of Failure

In a distributed system with multiple services, failures are inevitable. If each service has 99.9% uptime, a request that touches 10 services succeeds only 0.999^10 ≈ 99.0% of the time, so roughly 1% of requests fail even when every individual service is “reliable.”

Building resilient systems means accepting that failures happen and designing for them.

2. Timeouts: Your First Line of Defense

A timeout is a limit on how long you’ll wait for something to complete. Without timeouts, a slow or unresponsive API can hang your entire application.

Why Timeouts Matter

Imagine a payment API that becomes slow. Without timeouts:

sequenceDiagram
    participant C as Your App
    participant P as Payment API

    C->>P: Process payment
    Note over P: API is slow...
    Note over C: Thread blocked<br/>waiting forever
    Note over P: Still processing...
    Note over C: More requests arrive<br/>All threads blocked
    Note over P: More time passes...
    Note over C: Thread pool exhausted<br/>Application hangs

One slow dependency can consume all your resources, making your entire application unresponsive. This is called resource exhaustion.

Types of Timeouts

Connection timeout: How long to wait when establishing a connection.

  • Too short: Fails before connection completes on slow networks
  • Too long: Resources tied up waiting for unreachable servers
  • Typical values: 1-10 seconds

Read timeout (response timeout): How long to wait for response data after connecting.

  • Too short: Fails on legitimate slow operations
  • Too long: Resources tied up waiting for stuck servers
  • Typical values: 5-60 seconds (depends on operation)

Total timeout: Maximum time for the entire request-response cycle.

  • Ensures bounded wait time regardless of retries
  • Typical values: 30-120 seconds
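
As a rough illustration, a total timeout can be set with the standard fetch API and AbortSignal.timeout (available in modern browsers and recent Node versions). This is a sketch: the https://api.example.com URL and the 5-second budget are arbitrary examples, connection and read timeouts are configured separately in most HTTP clients, and the exact error surfaced on timeout can vary by runtime.

// A total (end-to-end) timeout using the standard fetch API.
async function getOrderStatus(orderId) {
  try {
    const response = await fetch(`https://api.example.com/orders/${orderId}`, {
      signal: AbortSignal.timeout(5000), // give up after 5 seconds total
    });
    return await response.json();
  } catch (error) {
    if (error.name === 'TimeoutError') {
      // The request exceeded its time budget; the outcome on the server is unknown.
      throw new Error('Order status request timed out');
    }
    throw error;
  }
}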

The Timeout Dilemma

Setting timeouts involves tradeoffs:

| Timeout Too Short | Timeout Too Long |
| --- | --- |
| Fails legitimate requests | Slow failures |
| High error rate | Resource exhaustion |
| Poor user experience | Cascading slowdowns |

The key insight: A fast failure is often better than a slow success. Your users would rather see “try again” than wait forever.

Timeout as a Contract

Think of timeouts as a contract with your users: “I promise to respond within X seconds, even if the answer is ‘I don’t know.’”

# Request with timeout context
POST /payments HTTP/1.1
X-Request-Timeout: 30

# Server should abort if it can't respond in time

Without this contract, your application’s response time is determined by your slowest dependency—which might be infinitely slow.

3. Retry Strategies: When and How to Try Again

Retries can turn transient failures into successful requests. But naive retries can make things worse.

When to Retry

Not all failures should be retried:

flowchart TD
    E[Error Occurred] --> S{Status Code?}

    S -->|4xx| C4[Client Error]
    S -->|5xx| C5[Server Error]
    S -->|Network Error| CN[Network Error]
    S -->|Timeout| CT[Timeout]

    C4 --> D4{Which 4xx?}
    D4 -->|400, 401, 403, 404| N1[Don't Retry<br/>Fix the request]
    D4 -->|429| R1[Retry<br/>After delay]
    D4 -->|408| R2[Retry<br/>Immediately possible]

    C5 --> D5{Which 5xx?}
    D5 -->|500| M1[Maybe Retry<br/>Server bug?]
    D5 -->|502, 503, 504| R3[Retry<br/>Transient failure]

    CN --> R4[Retry<br/>Network recovered?]
    CT --> R5[Retry<br/>But carefully!]

    style N1 fill:#ffccbc
    style R1 fill:#c8e6c9
    style R2 fill:#c8e6c9
    style R3 fill:#c8e6c9
    style R4 fill:#c8e6c9
    style R5 fill:#fff9c4
    style M1 fill:#fff9c4

Safe to retry:

  • 429 Too Many Requests (after waiting)
  • 502 Bad Gateway
  • 503 Service Unavailable
  • 504 Gateway Timeout
  • Network errors (connection refused, DNS failure)
  • Timeouts (with caution)

Don’t retry:

  • 400 Bad Request (your request is malformed)
  • 401 Unauthorized (your credentials are wrong)
  • 403 Forbidden (you don’t have permission)
  • 404 Not Found (resource doesn’t exist)
  • 409 Conflict (state conflict, needs resolution)

Maybe retry:

  • 500 Internal Server Error (might be a transient bug)
  • Timeouts on non-idempotent operations (see idempotency section)
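
A minimal sketch of this classification as a helper function. The decision to treat 500 and network errors as retryable is a judgment call, and the exact list should be tuned for the API you are calling:

// Decide whether a failed request is worth retrying.
// Pass null for statusCode when the failure was a network error or timeout.
function isRetryable(statusCode) {
  if (statusCode === null) return true;                        // network error / timeout: retry with care
  if ([408, 429, 502, 503, 504].includes(statusCode)) return true;
  if (statusCode === 500) return true;                         // "maybe": treated as transient here
  return false;                                                // 400, 401, 403, 404, 409, ...: fix the request instead
}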

The Problem with Naive Retries

Simple immediate retries can cause retry storms:

sequenceDiagram
    participant C1 as Client 1
    participant C2 as Client 2
    participant C3 as Client 3
    participant S as Overloaded Server

    Note over S: Server at 100% capacity

    C1->>S: Request 1
    C2->>S: Request 2
    C3->>S: Request 3

    S--xC1: 503 Service Unavailable
    S--xC2: 503 Service Unavailable
    S--xC3: 503 Service Unavailable

    Note over C1,C3: All clients retry immediately

    C1->>S: Retry 1
    C2->>S: Retry 2
    C3->>S: Retry 3

    Note over S: Now handling 6 requests<br/>instead of 3!

When all clients retry immediately, they amplify the load on an already struggling server.

Exponential Backoff

The solution is exponential backoff: Wait longer between each retry.

Attempt 1: Wait 1 second
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds
Attempt 5: Wait 16 seconds
...

The delay grows exponentially: delay = base * 2^attempt

graph LR
    A1[Attempt 1] -->|1s wait| A2[Attempt 2]
    A2 -->|2s wait| A3[Attempt 3]
    A3 -->|4s wait| A4[Attempt 4]
    A4 -->|8s wait| A5[Attempt 5]

    style A1 fill:#e3f2fd
    style A2 fill:#bbdefb
    style A3 fill:#90caf9
    style A4 fill:#64b5f6
    style A5 fill:#42a5f5

This gives the server time to recover while preventing retry storms.

Jitter: Randomizing Delays

Even with exponential backoff, if 1000 clients start retrying at the same time, they’ll hit the server in synchronized waves.

Jitter adds randomness to the delay:

# Full jitter (recommended)
delay = random(0, base * 2^attempt)

# Decorrelated jitter
delay = random(base, previous_delay * 3)

Instead of all clients waiting exactly 4 seconds, they wait randomly between 0-4 seconds, spreading the load:

graph LR
    subgraph Without Jitter
        W1[Client A: 4s]
        W2[Client B: 4s]
        W3[Client C: 4s]
    end

    subgraph With Jitter
        J1[Client A: 1.2s]
        J2[Client B: 3.7s]
        J3[Client C: 2.4s]
    end

    style W1 fill:#ffccbc
    style W2 fill:#ffccbc
    style W3 fill:#ffccbc
    style J1 fill:#c8e6c9
    style J2 fill:#c8e6c9
    style J3 fill:#c8e6c9
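
Combining both ideas, a retry loop might look like the sketch below. Here callApi stands in for whatever request function you are wrapping, and the base delay, cap, and attempt count are illustrative values, not recommendations:

// Retry with exponential backoff and full jitter.
async function retryWithBackoff(callApi, { maxAttempts = 5, baseMs = 1000, capMs = 30000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callApi();
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error;            // out of attempts: give up
      const backoff = Math.min(capMs, baseMs * 2 ** attempt);   // exponential backoff, capped
      const delay = Math.random() * backoff;                    // full jitter: random(0, backoff)
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

In a real client you would also check whether the error is retryable (and the idempotency rules below) before looping again.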

When NOT to Retry

Never retry when:

  1. The request succeeded (obviously, but check idempotency)
  2. The error is permanent: 400, 401, 403, 404
  3. You’ve exceeded max retries: Know when to give up
  4. The timeout has expired: Don’t retry into infinity
  5. The operation isn’t safe to repeat: Non-idempotent operations without idempotency keys

Be careful when:

  1. The operation has side effects: Might cause duplicates
  2. The timeout was ambiguous: Did it succeed or not?
  3. You’re propagating user latency: They’re waiting

Retry Budget

Set limits on retries to prevent endless retry loops:

  • Max attempts: Limit total attempts (e.g., 3-5)
  • Max total time: Stop retrying after N seconds
  • Retry budget: Only retry X% of requests in a time window
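
A retry budget can be as simple as comparing recent retries against recent first attempts. The sketch below uses a fixed 10-second window and a 10% budget purely as example numbers:

// Allow retries only while they stay under a fraction of recent traffic.
class RetryBudget {
  constructor({ windowMs = 10000, maxRetryRatio = 0.1 } = {}) {
    this.windowMs = windowMs;
    this.maxRetryRatio = maxRetryRatio;
    this.requests = []; // timestamps of first attempts
    this.retries = [];  // timestamps of retries
  }

  _prune(list, now) {
    return list.filter((t) => t > now - this.windowMs);
  }

  recordRequest() { this.requests.push(Date.now()); }
  recordRetry()   { this.retries.push(Date.now()); }

  canRetry() {
    const now = Date.now();
    this.requests = this._prune(this.requests, now);
    this.retries = this._prune(this.retries, now);
    return this.retries.length < this.requests.length * this.maxRetryRatio;
  }
}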

4. Idempotency: Making Retries Safe

A timeout happens. Did your payment go through? You don’t know. If you retry and it already succeeded, you might charge the customer twice.

Idempotency makes operations safe to retry.

What Is Idempotency?

An operation is idempotent if calling it multiple times produces the same result as calling it once.

graph LR
    subgraph Idempotent
        I1[x = 5] -->|Call once| IR1[x = 5]
        I2[x = 5] -->|Call twice| IR2[x = 5]
        I3[x = 5] -->|Call N times| IR3[x = 5]
    end

    subgraph Not Idempotent
        N1[x = 0] -->|Call once| NR1[x = 1]
        N2[x = 0] -->|Call twice| NR2[x = 2]
        N3[x = 0] -->|Call N times| NR3[x = N]
    end

    style IR1 fill:#c8e6c9
    style IR2 fill:#c8e6c9
    style IR3 fill:#c8e6c9
    style NR1 fill:#fff9c4
    style NR2 fill:#fff9c4
    style NR3 fill:#ffccbc

Naturally idempotent operations:

  • GET /users/123 — Reading doesn’t change anything
  • PUT /users/123 {name: "Alice"} — Sets to same state each time
  • DELETE /users/123 — Resource is gone after first call

NOT naturally idempotent:

  • POST /payments — Each call might create a new payment
  • POST /emails/send — Each call might send another email
  • x = x + 1 — Each call increments

Idempotency Keys

For non-idempotent operations, use idempotency keys—unique identifiers that let the server detect duplicate requests.

POST /payments HTTP/1.1
Content-Type: application/json
Idempotency-Key: abc123-unique-request-id

{
  "amount": 100,
  "currency": "USD",
  "recipient": "user_456"
}

How it works:

sequenceDiagram
    participant C as Client
    participant S as Server
    participant DB as Database

    C->>S: POST /payments<br/>Idempotency-Key: abc123
    S->>DB: Check: seen abc123?
    DB-->>S: No
    S->>DB: Process payment
    S->>DB: Store: abc123 -> result
    S-->>C: 200 OK, payment_id: 789

    Note over C: Network fails, client retries

    C->>S: POST /payments<br/>Idempotency-Key: abc123
    S->>DB: Check: seen abc123?
    DB-->>S: Yes, result exists
    S-->>C: 200 OK, payment_id: 789

    Note over C: Same result, no duplicate payment!

Key properties of idempotency keys:

  1. Client-generated: The client creates a unique ID before the first attempt
  2. Stored by server: Server remembers key → result mapping
  3. Returns cached result: On retry, server returns the original response
  4. Expires eventually: Keys don’t need to live forever (hours to days)
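
On the server, the check-then-store flow from the diagram might look roughly like this. An in-memory Map stands in for a real datastore with a unique constraint and expiry, and chargeCustomer is a hypothetical payment function:

// Look up the idempotency key before doing the work; store the result afterwards.
const processedRequests = new Map(); // idempotencyKey -> stored response

async function handleCreatePayment(idempotencyKey, paymentRequest) {
  if (processedRequests.has(idempotencyKey)) {
    return processedRequests.get(idempotencyKey); // replay the original response
  }
  const result = await chargeCustomer(paymentRequest); // hypothetical payment logic
  processedRequests.set(idempotencyKey, result);       // a real store would also set a TTL
  return result;
}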

The Idempotency Contract

When you use an idempotency key:

  • Same key + same request = same result
  • Same key + different request = error (some APIs allow this, some don’t)
  • Server must complete storage before responding (or use transactions)

Common mistake: Generating a new idempotency key for each retry. That defeats the purpose—each retry looks like a new request!

// WRONG: New key for each attempt
async function processPaymentWrong(amount) {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      const idempotencyKey = generateUUID(); // New key each time!
      return await api.createPayment({ amount, idempotencyKey });
    } catch (error) {
      if (attempt === 2) throw error;
    }
  }
}

// RIGHT: Same key for all attempts
async function processPaymentRight(amount) {
  const idempotencyKey = generateUUID(); // Generate once
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await api.createPayment({ amount, idempotencyKey });
    } catch (error) {
      if (attempt === 2) throw error;
    }
  }
}

5. Circuit Breaker: Protecting Your System

When a dependency is failing, continuing to call it wastes resources and can cause cascading failures. The circuit breaker pattern stops this.

The Circuit Breaker Analogy

Think of an electrical circuit breaker in your house:

  • Normal operation: Electricity flows freely
  • Overload detected: Breaker trips, cuts power
  • Manual reset: After fixing the problem, you reset the breaker

An API circuit breaker works the same way:

  • Closed state: Requests flow through normally
  • Open state: Requests fail immediately without calling the failing service
  • Half-open state: Allow a few test requests to check if service recovered

Circuit Breaker States

stateDiagram-v2
    [*] --> Closed

    Closed --> Open: Failure threshold exceeded
    Open --> HalfOpen: Timeout expires
    HalfOpen --> Closed: Test request succeeds
    HalfOpen --> Open: Test request fails

    note right of Closed
        Normal operation
        Counting failures
    end note

    note right of Open
        Fail fast
        No requests sent
    end note

    note right of HalfOpen
        Testing recovery
        Limited requests
    end note

How It Works

Closed State (normal):

sequenceDiagram
    participant C as Client
    participant CB as Circuit Breaker
    participant S as Service

    C->>CB: Request 1
    CB->>S: Forward request
    S-->>CB: Success
    CB-->>C: Success

    C->>CB: Request 2
    CB->>S: Forward request
    S--xCB: Failure (1)
    CB-->>C: Failure

    C->>CB: Request 3
    CB->>S: Forward request
    S--xCB: Failure (2)
    CB-->>C: Failure

    Note over CB: Failure count: 2/5

Requests pass through. Failures are counted. If failures exceed the threshold (e.g., 5 failures in 30 seconds), the circuit opens.

Open State (failing fast):

sequenceDiagram
    participant C as Client
    participant CB as Circuit Breaker
    participant S as Service

    Note over CB: Circuit OPEN<br/>Service assumed down

    C->>CB: Request 1
    CB--xC: Fail immediately
    Note over CB: No request sent!

    C->>CB: Request 2
    CB--xC: Fail immediately

    C->>CB: Request 3
    CB--xC: Fail immediately

    Note over CB: Wait for timeout...

No requests reach the failing service. Clients get fast failures instead of slow timeouts. The system saves resources and avoids amplifying the problem.

Half-Open State (testing recovery):

sequenceDiagram
    participant C as Client
    participant CB as Circuit Breaker
    participant S as Service

    Note over CB: Timeout expired<br/>Try half-open

    C->>CB: Request 1
    CB->>S: Test request
    S-->>CB: Success!
    CB-->>C: Success

    Note over CB: Service recovered!<br/>Circuit CLOSED

After a timeout (e.g., 30 seconds), the circuit breaker allows a test request. If it succeeds, the circuit closes. If it fails, the circuit opens again.

Why Circuit Breakers Matter

Without a circuit breaker:

graph LR
    subgraph Without Circuit Breaker
        A[Your Service] -->|100 req/s| B[Failing Service]
        B -->|100 timeouts/s| A
        A -->|Threads exhausted| C[Your Service Down]
    end

    style B fill:#ffccbc
    style C fill:#ffccbc

With a circuit breaker:

graph LR
    subgraph With Circuit Breaker
        A[Your Service] -->|Circuit Open| CB[Circuit Breaker]
        CB -->|Fast Fail| A
        A -->|Degrades gracefully| D[Your Service Stays Up]
    end

    style D fill:#c8e6c9

Benefits:

  • Fail fast: Don’t waste time on doomed requests
  • Protect resources: Don’t exhaust thread pools/connections
  • Allow recovery: Give the failing service breathing room
  • Prevent cascading failures: Your failure doesn’t become everyone’s failure

Circuit Breaker Configuration

Key parameters:

| Parameter | Description | Typical Value |
| --- | --- | --- |
| Failure threshold | Failures to open circuit | 5-10 failures |
| Time window | Period for counting failures | 30-60 seconds |
| Open timeout | How long circuit stays open | 30-60 seconds |
| Half-open requests | Test requests allowed | 1-3 |
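
As a conceptual sketch (not a production implementation), the three states and the parameters above might map onto something like the class below. Failure counting here is a plain counter rather than a time-windowed one, and the default numbers simply echo the table:

// Minimal circuit breaker sketch: counts failures, fails fast while open,
// and lets one test request through after the open timeout.
class CircuitBreaker {
  constructor({ failureThreshold = 5, openTimeoutMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.openTimeoutMs = openTimeoutMs;
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.openedAt = null;
  }

  async call(requestFn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.openTimeoutMs) {
        throw new Error('Circuit open: failing fast'); // no request sent
      }
      this.state = 'HALF_OPEN'; // timeout expired: allow a test request
    }
    try {
      const result = await requestFn();
      this.state = 'CLOSED';    // success closes the circuit
      this.failureCount = 0;
      return result;
    } catch (error) {
      this.failureCount++;
      if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw error;
    }
  }
}

You would typically keep one instance per downstream dependency and wrap every call to that dependency in its call() method.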

6. Graceful Degradation: Failing Well

When a dependency fails, you have choices beyond “error” or “success.” Graceful degradation means providing reduced functionality instead of complete failure.

Degradation Strategies

Return cached data:

When the product API is down, return stale cached data with a warning.

{
  "products": [...],
  "metadata": {
    "cached": true,
    "cachedAt": "2026-01-12T10:00:00Z",
    "warning": "Data may be outdated"
  }
}

Return default values:

When personalization service is down, show generic recommendations.

{
  "recommendations": ["popular-item-1", "popular-item-2"],
  "personalized": false
}

Reduce functionality:

When payment processing is slow, disable checkout but keep browsing working.

Queue for later:

When email service is down, queue emails and send when service recovers.
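
A common shape for the cached-data strategy is “try live, fall back to cache.” In this sketch, fetchProductsLive and getCachedProducts are placeholders for your own API call and cache layer:

// Try the live product API; on failure, fall back to cached data with a warning.
async function getProducts(fetchProductsLive, getCachedProducts) {
  try {
    const products = await fetchProductsLive();
    return { products, metadata: { cached: false } };
  } catch (error) {
    const cached = await getCachedProducts(); // placeholder for your cache layer
    if (cached) {
      return {
        products: cached.products,
        metadata: { cached: true, cachedAt: cached.storedAt, warning: 'Data may be outdated' },
      };
    }
    throw error; // nothing to degrade to: surface the failure
  }
}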

Degradation Decision Tree

flowchart TD
    F[Dependency Failed] --> Q1{Critical for<br/>this operation?}
    Q1 -->|No| S1[Skip it<br/>Continue without]
    Q1 -->|Yes| Q2{Have cached data?}
    Q2 -->|Yes| S2[Return cached<br/>with warning]
    Q2 -->|No| Q3{Have default value?}
    Q3 -->|Yes| S3[Return default<br/>with warning]
    Q3 -->|No| Q4{Can queue for later?}
    Q4 -->|Yes| S4[Queue and<br/>acknowledge]
    Q4 -->|No| S5[Return error<br/>gracefully]

    style S1 fill:#c8e6c9
    style S2 fill:#c8e6c9
    style S3 fill:#fff9c4
    style S4 fill:#fff9c4
    style S5 fill:#ffccbc

Examples of Graceful Degradation

E-commerce site:

| Feature | Degraded State | User Experience |
| --- | --- | --- |
| Product search | Show cached results | “Results may be outdated” |
| Recommendations | Show popular items | Less personalized but functional |
| Reviews | Hide reviews section | Product page still works |
| Payment | Show “try again later” | Can’t buy, but can browse |

Social media:

| Feature | Degraded State | User Experience |
| --- | --- | --- |
| Timeline | Show cached posts | “Showing recent posts” |
| Like count | Hide counts | Can still like |
| Comments | Disable new comments | Can read existing |
| Notifications | Batch and delay | Slight delay acceptable |

The Key Principle

Always ask: “What’s the best experience I can provide when this fails?”

The answer is rarely “show an error page.” Usually there’s a partial experience that’s better than nothing.

7. Putting It All Together

Here’s how these concepts work together in a resilient system:

flowchart TD
    R[Request] --> T{Timeout<br/>configured?}
    T -->|No| WARN[Add timeout!]
    T -->|Yes| CB{Circuit<br/>breaker state?}
    CB -->|Open| FF[Fast fail]
    CB -->|Closed/Half-Open| S[Send request]
    S --> RESULT{Result?}
    RESULT -->|Success| REC[Record success]
    RESULT -->|Failure| CHK{Retryable?}
    CHK -->|No| FINAL[Return error]
    CHK -->|Yes| IK{Idempotent/<br/>has key?}
    IK -->|No| DANGER[Dangerous to retry!]
    IK -->|Yes| LIMIT{Under<br/>retry limit?}
    LIMIT -->|No| GIVEUP[Give up]
    LIMIT -->|Yes| WAIT[Backoff + jitter]
    WAIT --> CB
    REC --> CLOSE[Circuit stays/becomes closed]
    FINAL --> COUNT[Count failure]
    COUNT --> THRESH{Threshold<br/>exceeded?}
    THRESH -->|Yes| OPEN[Open circuit]
    THRESH -->|No| DONE[Done]
    FF --> DEGRADE{Can degrade<br/>gracefully?}
    DEGRADE -->|Yes| CACHED[Return cached/default]
    DEGRADE -->|No| ERROR[Return error]

    style WARN fill:#ffccbc
    style DANGER fill:#ffccbc
    style FF fill:#fff9c4
    style CACHED fill:#c8e6c9
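
Tying the earlier sketches together, one reasonable composition (of several) puts the timeout inside each attempt, retries with backoff around it, the circuit breaker around the retries, and a cached fallback on the outside. Here CircuitBreaker and retryWithBackoff are the illustrative helpers sketched above, getCachedProducts is again a placeholder for your cache layer, and the URL is hypothetical:

// One possible composition of the earlier sketches.
const productBreaker = new CircuitBreaker();

async function getProductsResilient() {
  try {
    return await productBreaker.call(() =>
      retryWithBackoff(async () => {
        const res = await fetch('https://api.example.com/products', {
          signal: AbortSignal.timeout(5000), // per-attempt timeout
        });
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
      })
    );
  } catch (error) {
    return getCachedProducts(); // graceful degradation: serve the best data we have
  }
}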

Summary of Resilience Patterns

| Problem | Solution | Key Concept |
| --- | --- | --- |
| Slow responses | Timeouts | Bounded wait time |
| Transient failures | Retries | Try again |
| Retry storms | Exponential backoff | Wait longer each time |
| Synchronized retries | Jitter | Randomize delays |
| Duplicate operations | Idempotency keys | Same input = same output |
| Cascading failures | Circuit breaker | Fail fast when dependency is down |
| Total failure | Graceful degradation | Best possible experience |

What’s Next

This guide covered the concepts of resilience—understanding why things fail and the patterns that help.

For implementation details including:

  • Configuring timeouts for different scenarios
  • Tuning retry parameters for your use case
  • Implementing idempotency key storage
  • Setting circuit breaker thresholds
  • Metrics, alerting, and observability for failures

See the upcoming course: Resiliencia y tolerancia a fallos en APIs REST.

