Backoff

Infrastructure & Governance · Jan 8, 2026 · TypeScript
resilience retry distributed-systems reliability algorithms

Definition

When a service you’re calling fails, your first instinct might be to retry immediately - and then retry again, and again. But imagine thousands of clients all doing this simultaneously: a server that was just starting to recover suddenly gets slammed with a massive spike of retry traffic, pushing it right back into failure. This is why backoff exists: it’s a strategy of waiting progressively longer between retry attempts, giving systems time to recover.

Backoff algorithms calculate the delay between retry attempts, typically increasing the delay with each failure. The most common approach is exponential backoff: wait 1 second, then 2 seconds, then 4, then 8, and so on. But pure exponential backoff has a problem - if many clients start at the same time and use the same algorithm, they’ll all retry at the same time, creating coordinated waves of traffic called “thundering herd.” The solution is jitter: adding randomness to the delay so clients naturally spread out their retries.

The combination of exponential backoff with jitter is the gold standard for retry logic in distributed systems. It balances persistence (keep trying) with patience (wait longer each time) and fairness (don’t all retry at once). AWS, Google Cloud, and virtually every major API provider recommend this approach in their SDKs and documentation.

Example

AWS SDK Default Behavior: When you call AWS services and get throttled, the SDK automatically applies exponential backoff with jitter. First retry after ~100ms, then ~200ms, then ~400ms, with random variation. This is why AWS SDKs “just work” even when services are under load - the backoff is built in.
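
If you are on the AWS SDK for JavaScript v3, this behavior is tunable on the client itself. A minimal sketch, assuming the @aws-sdk/client-dynamodb package; the region, table name, and key are placeholders:

// Sketch: tuning the retry/backoff behavior built into an AWS SDK v3 client.
import { DynamoDBClient, GetItemCommand } from '@aws-sdk/client-dynamodb';

const dynamo = new DynamoDBClient({
  region: 'us-east-1',
  maxAttempts: 5,        // total attempts: the first call plus up to 4 retries
  retryMode: 'adaptive', // adds client-side rate limiting on top of the standard backoff
});

// Throttling errors are retried with exponential backoff and jitter automatically.
const item = await dynamo.send(new GetItemCommand({
  TableName: 'example-table',
  Key: { pk: { S: 'item-1' } },
}));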

Ethernet CSMA/CD: The original exponential backoff algorithm comes from Ethernet networking. When two devices transmit simultaneously and collide, each picks a random delay (from an exponentially growing window) before retrying. This foundational algorithm from the 1970s is still used in software today.
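
The scheme is “truncated binary exponential backoff”: after the nth collision, a device waits a random number of slot times drawn from a window that doubles with each collision. A minimal sketch using the classic 10 Mb/s Ethernet parameters:

// Truncated binary exponential backoff as used by classic Ethernet (CSMA/CD).
// After the nth collision, wait a random number of slot times in [0, 2^k - 1],
// where k = min(n, 10); give up entirely after 16 failed attempts.
const SLOT_TIME_US = 51.2; // one slot = 512 bit times at 10 Mb/s

function csmaCdBackoffMicros(collisionCount: number): number | null {
  if (collisionCount > 16) return null; // too many collisions: abort the frame
  const k = Math.min(collisionCount, 10);
  const slots = Math.floor(Math.random() * Math.pow(2, k)); // 0 .. 2^k - 1
  return slots * SLOT_TIME_US;
}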

API Rate Limiting Recovery: When Twitter’s API returns a 429 (rate limited), clients should back off exponentially. Immediately retrying just burns through more rate limit capacity. Backing off lets your quota recover and lets other clients get their fair share.
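
A sketch of a client that honors both the 429 status and the Retry-After header when one is present, falling back to exponential backoff with full jitter otherwise (the URL and limits are placeholders):

// Sketch: retrying a rate-limited call, preferring the server's Retry-After
// hint over our own backoff calculation.
async function fetchWithRateLimitRetry(url: string, maxRetries: number = 5): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429 || attempt === maxRetries) return res;

    // Retry-After is usually seconds (it can also be an HTTP-date, ignored here).
    const retryAfterSec = Number(res.headers.get('retry-after'));
    const delayMs = Number.isFinite(retryAfterSec) && retryAfterSec > 0
      ? retryAfterSec * 1000
      : Math.random() * Math.min(100 * Math.pow(2, attempt), 30000);

    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  throw new Error('unreachable');
}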

Email Delivery Retry Queues: SMTP servers implement backoff when delivery fails. First retry after 1 minute, then 5 minutes, then 15, then 30, then hourly, then every few hours. This gives receiving servers time to come back online without overwhelming them with retry attempts.
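
Mail servers typically encode this as a fixed escalating schedule rather than a formula. A small sketch; the intervals mirror the ones above and are illustrative, not any particular MTA's defaults:

// Sketch: a fixed escalating retry schedule of the kind mail servers use.
const DELIVERY_RETRY_MINUTES = [1, 5, 15, 30, 60, 120, 240];

function nextDeliveryAttemptMs(attempt: number): number {
  const idx = Math.min(attempt, DELIVERY_RETRY_MINUTES.length - 1);
  return DELIVERY_RETRY_MINUTES[idx] * 60 * 1000; // repeat the last interval until the message expires
}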

Kubernetes Pod Restart Backoff: When a container crashes repeatedly, Kubernetes applies exponential backoff to restarts: 10s, 20s, 40s, up to 5 minutes. This prevents a crash loop from consuming all cluster resources while still attempting recovery.
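
The schedule itself is easy to state in code. A sketch of the delay progression described above; this mirrors kubelet's crash-loop behavior and is not a Kubernetes API you call:

// Sketch: restart delay that doubles from 10s and caps at 5 minutes.
function crashLoopDelaySeconds(restartCount: number): number {
  return Math.min(10 * Math.pow(2, restartCount), 300);
}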

Analogy

The Polite Knock: You knock on someone’s door and they don’t answer. Do you knock again immediately? No - you wait a few seconds. Still no answer? You wait a bit longer. Each time, you give them more time to reach the door. Eventually, if they’re clearly not home, you stop and try later. That’s exponential backoff.

The Crowded Restaurant: A popular restaurant is full. You could stand at the host stand asking “is there a table yet?” every 30 seconds, annoying everyone. Or you could check back in 5 minutes, then 10 minutes, then 15 minutes. The second approach is backoff - you’re persistent but not annoying.

The Traffic Light System: When there’s a traffic jam, you don’t floor the gas every time traffic stops. You wait a moment, inch forward, wait longer, inch again. Everyone doing this naturally creates flow. Everyone flooring it simultaneously creates gridlock. Jitter works the same way - randomizing when you “inch forward” prevents everyone from moving at once.

The Busy Signal Callback: In the pre-voicemail era, when you got a busy signal, you wouldn’t immediately redial. You’d wait a minute, try again, wait 5 minutes, try again. Each failure meant a longer wait, giving the other person time to finish their call.

Code Example

// Various backoff algorithms with implementations

// Simple exponential backoff
function exponentialBackoff(attempt: number, baseMs: number = 100): number {
  return baseMs * Math.pow(2, attempt);
  // Attempt 0: 100ms, Attempt 1: 200ms, Attempt 2: 400ms, etc.
}

// Exponential backoff with jitter (recommended)
function exponentialBackoffWithJitter(
  attempt: number,
  baseMs: number = 100,
  maxMs: number = 30000
): number {
  const exponential = Math.min(baseMs * Math.pow(2, attempt), maxMs);
  // Full jitter: random value between 0 and calculated delay
  return Math.random() * exponential;
}

// Equal jitter: half of the exponential delay is fixed, half is random
function equalJitterBackoff(
  attempt: number,
  baseMs: number = 100,
  maxMs: number = 30000
): number {
  const exponential = Math.min(baseMs * Math.pow(2, attempt), maxMs);
  // Half exponential + half random
  return exponential / 2 + Math.random() * (exponential / 2);
}

// Decorrelated jitter (recommended by AWS): each delay is derived from the previous delay, not the attempt count
function decorrelatedJitter(
  previousDelayMs: number,
  baseMs: number = 100,
  maxMs: number = 30000
): number {
  return Math.min(maxMs, Math.random() * (previousDelayMs * 3 - baseMs) + baseMs);
}

// Complete retry with backoff implementation
interface BackoffConfig {
  baseDelayMs: number;
  maxDelayMs: number;
  maxRetries: number;
  jitterType: 'none' | 'full' | 'equal' | 'decorrelated';
}

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  config: BackoffConfig = {
    baseDelayMs: 100,
    maxDelayMs: 30000,
    maxRetries: 5,
    jitterType: 'full'
  }
): Promise<T> {
  let lastError: Error | undefined;
  let previousDelay = config.baseDelayMs;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      if (attempt === config.maxRetries) break;

      // Calculate delay based on jitter type
      let delay: number;
      switch (config.jitterType) {
        case 'none':
          delay = exponentialBackoff(attempt, config.baseDelayMs);
          break;
        case 'full':
          delay = exponentialBackoffWithJitter(
            attempt,
            config.baseDelayMs,
            config.maxDelayMs
          );
          break;
        case 'equal':
          delay = equalJitterBackoff(
            attempt,
            config.baseDelayMs,
            config.maxDelayMs
          );
          break;
        case 'decorrelated':
          delay = decorrelatedJitter(
            previousDelay,
            config.baseDelayMs,
            config.maxDelayMs
          );
          previousDelay = delay;
          break;
      }

      console.log(`Attempt ${attempt + 1} failed, backing off for ${delay}ms`);
      await sleep(delay);
    }
  }

  throw lastError!;
}

// Linear backoff (not recommended but sometimes needed)
function linearBackoff(attempt: number, baseMs: number = 1000): number {
  return baseMs * (attempt + 1);
  // Attempt 0: 1000ms, Attempt 1: 2000ms, Attempt 2: 3000ms, etc.
}

// Fibonacci backoff (interesting alternative)
function fibonacciBackoff(attempt: number, baseMs: number = 100): number {
  const fib = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55];
  return baseMs * (fib[Math.min(attempt, fib.length - 1)]);
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
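
// Usage sketch: a hypothetical call site. The URL and config values are
// placeholders, not defaults from any particular service.
async function loadUser(): Promise<unknown> {
  return retryWithBackoff(
    async () => {
      const res = await fetch('https://api.example.com/users/42');
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res.json();
    },
    { baseDelayMs: 200, maxDelayMs: 10000, maxRetries: 4, jitterType: 'decorrelated' }
  );
}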

Diagram

flowchart TD
    subgraph NoBackoff["Without Backoff"]
        A1[Failure] --> A2[Immediate Retry]
        A2 --> A3[Failure]
        A3 --> A4[Immediate Retry]
        A4 --> A5[Server Overload!]
    end

    subgraph WithBackoff["With Exponential Backoff + Jitter"]
        B1[Failure] --> B2[Wait 100ms + jitter]
        B2 --> B3[Retry - Failure]
        B3 --> B4[Wait 200ms + jitter]
        B4 --> B5[Retry - Failure]
        B5 --> B6[Wait 400ms + jitter]
        B6 --> B7[Retry - Success!]
    end

    subgraph ThunderingHerd["Thundering Herd Prevention"]
        C1[1000 Clients Fail]
        C2[With Jitter: Spread over 10s]
        C3[Without Jitter: All retry at once]
        C1 --> C2
        C1 --> C3
        C2 --> C4[Gradual Recovery]
        C3 --> C5[Immediate Overload]
    end

    style A5 fill:#f87171
    style B7 fill:#86efac
    style C4 fill:#86efac
    style C5 fill:#f87171

Security Notes

WARNING - Predictable backoff patterns can be exploited. If an attacker knows your exact retry timing, they can time their attacks to coincide with your retry waves, maximizing damage to recovering systems.

Always use jitter to add unpredictability to your backoff. This prevents attackers from predicting when retries will occur and makes it harder to coordinate attacks with your retry patterns.

Be careful with backoff in authentication scenarios. Exponential backoff on login failures can be used for denial of service - an attacker can lock out legitimate users by intentionally failing their login attempts, triggering long backoff periods.

Consider implementing per-client backoff rather than global backoff. This prevents one abusive client’s behavior from affecting backoff timing for all other clients.
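
One way to sketch this is to key the backoff state by client identifier, so one client's failures never inflate another client's delays. The Map-based tracker and function names below are illustrative:

// Sketch: per-client backoff state, keyed by API key or client ID.
const failureCounts = new Map<string, number>();

function nextDelayForClient(clientId: string, baseMs: number = 100, maxMs: number = 30000): number {
  const failures = failureCounts.get(clientId) ?? 0;
  failureCounts.set(clientId, failures + 1);
  const exponential = Math.min(baseMs * Math.pow(2, failures), maxMs);
  return Math.random() * exponential; // full jitter
}

function resetClient(clientId: string): void {
  failureCounts.delete(clientId); // call on success (or decay gradually instead)
}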

Monitor for clients that aren’t respecting backoff. If a client ignores 429 responses and Retry-After headers, they may be malicious or misconfigured and should be rate-limited or blocked.

Best Practices

  1. Always use jitter - Full or decorrelated jitter prevents thundering herd; never use pure exponential backoff in production
  2. Set a maximum delay cap - Delays shouldn’t grow forever; 30-60 seconds is usually plenty
  3. Respect Retry-After headers - When a server tells you when to retry, use that instead of your own calculation
  4. Log backoff events - Track when backoff happens to identify problematic dependencies
  5. Make backoff configurable - Different operations may need different backoff parameters
  6. Start with small base delays - 100-200ms is usually a good starting point
  7. Use backoff with circuit breakers - Circuit breakers prevent retries entirely when a service is down (see the sketch after this list)
  8. Consider request priority - High-priority requests might use less aggressive backoff
  9. Test backoff behavior - Simulate failures to verify backoff works as expected
  10. Document your backoff strategy - Future maintainers need to understand the logic
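
For point 7, a minimal sketch of how the two techniques compose: the breaker decides whether an attempt is allowed at all, and backoff spaces out the attempts that are allowed. The class shape and thresholds are assumptions for illustration, not a specific library's API.

// Sketch: a simple circuit breaker that fails fast while open, then lets a
// probe request through after a cool-down. Combine with retryWithBackoff by
// wrapping the operation you pass to callWithBreaker.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold: number = 5, private coolDownMs: number = 30000) {}

  canAttempt(): boolean {
    if (this.failures < this.threshold) return true;      // closed: attempts allowed
    return Date.now() - this.openedAt >= this.coolDownMs; // open: allow a probe after cool-down
  }

  recordSuccess(): void { this.failures = 0; }

  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = Date.now();
  }
}

async function callWithBreaker<T>(breaker: CircuitBreaker, op: () => Promise<T>): Promise<T> {
  if (!breaker.canAttempt()) throw new Error('circuit open: failing fast, no retry');
  try {
    const result = await op(); // e.g. () => retryWithBackoff(operation)
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    throw err;
  }
}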

Common Mistakes

No jitter: Pure exponential backoff causes synchronized retry waves, potentially worse than no backoff at all.

Base delay too short: 1ms base delay with exponential backoff still produces bursts of retries before meaningful wait times accumulate.

No maximum cap: Exponential growth without a cap leads to absurd delays (2^20 × a 100ms base is roughly 29 hours).

Ignoring Retry-After: When a server provides explicit retry timing, your backoff algorithm is overriding expert knowledge.

Backoff scoped globally instead of per endpoint: If one endpoint is failing, backing off on all endpoints penalizes healthy services.

Resetting backoff too aggressively: One success shouldn’t reset to base delay if the service is still struggling. Consider decay instead.
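
A sketch of the decay idea: on success, halve the current delay instead of snapping back to the base, so some protection lingers while the dependency stabilizes.

// Sketch: decay the delay on success rather than resetting it to the base.
let currentDelayMs = 100; // base delay

function onFailure(maxMs: number = 30000): number {
  currentDelayMs = Math.min(currentDelayMs * 2, maxMs);
  return Math.random() * currentDelayMs; // full jitter
}

function onSuccess(baseMs: number = 100): void {
  currentDelayMs = Math.max(baseMs, currentDelayMs / 2); // gentle decay, not a reset
}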

Not backing off on 429: Rate limit responses always mean “slow down” - immediate retry makes the problem worse.

Standards & RFCs

RFC 6585 - Additional HTTP Status Codes: defines 429 Too Many Requests, the status most APIs use to tell clients to back off.

RFC 9110 - HTTP Semantics: defines the Retry-After header servers can attach to 429 and 503 responses.

RFC 5321 - Simple Mail Transfer Protocol: section 4.5.4 describes retry strategies for mail delivery, including minimum retry intervals.

IEEE 802.3 - Ethernet: specifies the truncated binary exponential backoff used by CSMA/CD.