API Observability

Infrastructure · Intermediate · 18 min · Jan 12, 2026

Audience

This guide is for developers and operators who want to understand what’s happening inside their APIs:

  • Backend developers who build APIs and need to know when things break
  • DevOps engineers responsible for keeping APIs healthy
  • SREs who need to respond to incidents and prevent future ones
  • Team leads who want to understand why observability matters
  • Anyone who has been paged at 3am and couldn’t figure out what went wrong

You should understand how REST APIs work. If not, start with How a REST API Works.

Goal

After reading this guide, you’ll understand:

  • Why observability is different from monitoring
  • The three pillars: logs, metrics, and traces
  • What to measure in your APIs
  • How to detect silent failures (200 OK that actually failed)
  • Key metrics that matter: p50, p95, p99, error budgets
  • The SRE perspective on API health

This guide builds judgment: the ability to know what to observe and why.

1. Why Observe APIs

You can’t fix what you can’t see.

The Difference Between Monitoring and Observability

Monitoring answers: “Is my system working?”

Observability answers: “Why isn’t my system working?”

graph TD
    A[System State] --> B{Monitoring}
    B -->|Green| C[Everything OK]
    B -->|Red| D[Something is wrong]
    D --> E{Observability}
    E --> F[Logs: What happened?]
    E --> G[Metrics: How bad is it?]
    E --> H[Traces: Where did it fail?]
    style C fill:#c8e6c9
    style D fill:#ffccbc

Monitoring tells you there’s a fire. Observability helps you find where it started and how to put it out.

What You Lose Without Observability

Without proper observability, you’re flying blind:

  • Slow debugging: “Check the logs” becomes a multi-hour search
  • Missed issues: Problems that don’t trigger alerts go unnoticed
  • Finger-pointing: Teams blame each other because no one has the full picture
  • Slow recovery: Mean Time to Recovery (MTTR) increases dramatically
  • User trust erosion: Users discover problems before you do

The Cost of Outages

API downtime is expensive:

| Impact | Consequence |
|---|---|
| Lost revenue | Every minute of downtime = lost transactions |
| Customer churn | Users switch to competitors |
| SLA violations | Financial penalties |
| Reputation damage | Hard to recover trust |
| Engineering time | Teams stop feature work to fight fires |

Observability is not a luxury—it’s the cost of running production systems.

2. The Three Pillars of Observability

Observability rests on three complementary pillars: logs, metrics, and traces.

graph LR
    subgraph "Three Pillars"
        L[Logs<br/>What happened?]
        M[Metrics<br/>How much?]
        T[Traces<br/>Where?]
    end
    L --> O[Observability]
    M --> O
    T --> O
    O --> I[Understanding]
    style L fill:#e3f2fd
    style M fill:#fff3e0
    style T fill:#e8f5e9
    style O fill:#f3e5f5

Each pillar answers different questions:

| Pillar | Question | Example |
|---|---|---|
| Logs | What happened? | "User 123 failed authentication at 14:32:05" |
| Metrics | How much/many? | "Error rate is 5.2%, p99 latency is 340ms" |
| Traces | Where did it go? | "Request spent 80% of time in database query" |

You need all three. Logs without metrics lack context. Metrics without traces lack specificity. Traces without logs lack detail.

How They Work Together

Imagine a user reports: “The API is slow.”

  1. Metrics show: latency spiked at 14:30
  2. Traces reveal: the slow endpoint is /users/search
  3. Logs explain: database connection pool was exhausted

Without all three, you’d still be guessing.

3. Logs: The Story of What Happened

Logs are timestamped records of discrete events.

What to Log

Always log:

  • Request received (method, path, client IP)
  • Authentication events (success, failure, token type)
  • Business actions (user created, order placed, payment processed)
  • Errors and exceptions (with stack traces in development)
  • Request completed (status code, duration)

Never log:

  • Passwords or credentials
  • Full credit card numbers
  • Personal data (GDPR/CCPA compliance)
  • API secrets or tokens

Structured Logging

Don’t log plain strings. Use structured formats:

{
  "timestamp": "2026-01-12T14:32:05.123Z",
  "level": "ERROR",
  "service": "user-api",
  "requestId": "abc-123-def",
  "userId": "user_789",
  "action": "authentication",
  "status": "failed",
  "reason": "invalid_token",
  "duration_ms": 45
}

Why structured?

  • Searchable: Query level:ERROR AND service:user-api
  • Parseable: Tools can aggregate and visualize
  • Consistent: Same format across all services
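
The JSON above is the target format. As a minimal sketch, the standard library alone can produce it; the service name and field names mirror the example and are otherwise illustrative:

import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "user-api",
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "authentication failed",
    extra={"fields": {"userId": "user_789", "reason": "invalid_token", "duration_ms": 45}},
)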

Log Levels

Use levels consistently:

| Level | When to Use |
|---|---|
| ERROR | Something failed that shouldn't have |
| WARN | Something concerning but not critical |
| INFO | Normal operations worth recording |
| DEBUG | Detailed info for troubleshooting |

Don’t overuse ERROR. If everything is an error, nothing is.

The Request ID Pattern

Every request should have a unique identifier:

GET /users/123 HTTP/1.1
X-Request-Id: abc-123-def

This ID should appear in every log for that request:

{"requestId": "abc-123-def", "message": "Request received"}
{"requestId": "abc-123-def", "message": "Fetching user from database"}
{"requestId": "abc-123-def", "message": "Response sent", "status": 200}

Now you can search for abc-123-def and see the entire request lifecycle.
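
A minimal sketch of wiring this up, assuming a Flask app (the framework choice, handler names, and logger setup are illustrative, not prescribed by this guide):

import json
import logging
import uuid

from flask import Flask, g, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("user-api")


def log_json(**fields):
    # Every log line for this request carries the same requestId.
    logger.info(json.dumps({"requestId": g.request_id, **fields}))


@app.before_request
def assign_request_id():
    # Reuse the caller's ID when provided, otherwise generate one.
    g.request_id = request.headers.get("X-Request-Id", str(uuid.uuid4()))
    log_json(message="Request received", method=request.method, path=request.path)


@app.after_request
def log_response(response):
    log_json(message="Response sent", status=response.status_code)
    # Echo the ID back so clients can report it when something goes wrong.
    response.headers["X-Request-Id"] = g.request_id
    return response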

4. Metrics: Numbers That Tell the Truth

Metrics are numeric measurements collected over time.

What to Measure

The Four Golden Signals (from Google SRE):

graph TD
    subgraph "Four Golden Signals"
        L[Latency<br/>How fast?]
        T[Traffic<br/>How much?]
        E[Errors<br/>How broken?]
        S[Saturation<br/>How full?]
    end
    L --> H[API Health]
    T --> H
    E --> H
    S --> H
    style L fill:#e3f2fd
    style T fill:#fff3e0
    style E fill:#ffccbc
    style S fill:#e8f5e9

1. Latency — How long requests take.

  • Measure successful and failed requests separately
  • Failed requests that fail fast can skew averages

2. Traffic — Demand on your system.

  • Requests per second
  • Concurrent connections
  • Data transferred

3. Errors — Failed requests.

  • HTTP 5xx rate (server errors)
  • HTTP 4xx rate (client errors, but track spikes)
  • Custom error types (business logic failures)

4. Saturation — How “full” your service is.

  • CPU utilization
  • Memory usage
  • Database connection pool usage
  • Queue depth

Understanding Percentiles

Averages lie. Use percentiles.

| Metric | What It Means |
|---|---|
| p50 (median) | Half of requests are faster than this |
| p90 | 90% of requests are faster than this |
| p95 | 95% of requests are faster than this |
| p99 | 99% of requests are faster than this |

Example: If p50 = 100ms and p99 = 2000ms, most users have a good experience, but 1 in 100 waits 20x longer.

Why p99 matters:

  • A service handling 1M requests/day
  • p99 = 2000ms means 10,000 users/day have terrible experience
  • That’s enough to lose customers
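
A small sketch of the gap between the mean and the tail, with invented latency numbers:

import random
import statistics

random.seed(42)
# 990 fast requests plus 10 slow outliers, durations in milliseconds
latencies_ms = [random.uniform(80, 120) for _ in range(990)]
latencies_ms += [random.uniform(1800, 2200) for _ in range(10)]

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
print(f"mean: {statistics.mean(latencies_ms):.0f} ms")  # ~120 ms, looks fine
print(f"p50:  {cuts[49]:.0f} ms")  # ~100 ms
print(f"p95:  {cuts[94]:.0f} ms")  # still around 120 ms
print(f"p99:  {cuts[98]:.0f} ms")  # jumps into the seconds: the hidden tail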

Rate, Errors, Duration (RED)

An alternative to Golden Signals, simpler for APIs:

| Metric | Description |
|---|---|
| Rate | Requests per second |
| Errors | Failed requests per second |
| Duration | Time to process requests (histogram) |

Both frameworks work. Pick one and be consistent.
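
As one possible implementation, here is a sketch of RED instrumentation with the Python prometheus_client library (the library choice, metric names, and simulated handler are assumptions, not part of this guide):

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests", ["method", "path", "status"])
ERRORS = Counter("http_errors_total", "Failed requests", ["method", "path"])
DURATION = Histogram("http_request_duration_seconds", "Request duration", ["method", "path"])


def handle_request(method: str, path: str) -> int:
    """Hypothetical handler that records Rate, Errors, and Duration per call."""
    start = time.perf_counter()
    status = 500 if random.random() < 0.01 else 200  # simulate occasional failures
    DURATION.labels(method, path).observe(time.perf_counter() - start)
    REQUESTS.labels(method, path, str(status)).inc()
    if status >= 500:
        ERRORS.labels(method, path).inc()
    return status


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper to collect
    while True:
        handle_request("GET", "/users/123")
        time.sleep(0.1)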

5. Traces: Following the Request Path

Traces show the journey of a request through your system.

What Is Distributed Tracing?

When a request touches multiple services, logs and metrics show each service in isolation. Traces connect them.

graph LR
    subgraph "Single Request Journey"
        A[API Gateway<br/>5ms] --> B[Auth Service<br/>15ms]
        B --> C[User Service<br/>120ms]
        C --> D[Database<br/>95ms]
    end
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#ffccbc
    style D fill:#e8f5e9

A trace shows:

  • Total request time: 235ms
  • Time in each service
  • Where the bottleneck is (User Service + Database = 215ms)

Trace Structure

  • Trace: The entire journey (one request)
  • Span: A single operation within the trace
  • Trace ID: Unique identifier connecting all spans

Trace ID: trace-xyz-789
├── Span: API Gateway (5ms)
├── Span: Auth Service (15ms)
├── Span: User Service (120ms)
│   └── Span: Database Query (95ms)
└── Total: 235ms
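
A sketch of producing such spans with the OpenTelemetry Python SDK (the SDK choice, span names, and console exporter are assumptions for illustration):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console so the example is self-contained.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("user-service")


def get_user(user_id: int) -> dict:
    # Parent span for the whole operation, child span for the database call;
    # both share the same trace ID automatically.
    with tracer.start_as_current_span("GET /users/{id}"):
        with tracer.start_as_current_span("db.query") as span:
            span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
            return {"id": user_id, "name": "Alice"}


get_user(123)  # prints both spans, linked by the same trace ID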

When Traces Save the Day

Without traces, debugging “slow request” means:

  1. Check API Gateway logs
  2. Check Auth Service logs
  3. Check User Service logs
  4. Try to correlate by timestamp
  5. Guess what happened

With traces:

  1. Look up trace ID
  2. See the entire flow
  3. Identify the bottleneck immediately

The Correlation ID Connection

The correlation ID pattern is foundational to tracing:

# Initial request
GET /api/users/123 HTTP/1.1
X-Request-Id: abc-123

# Propagated to downstream services
GET /auth/validate HTTP/1.1
X-Request-Id: abc-123

# Appears in database queries
-- request_id: abc-123
SELECT * FROM users WHERE id = 123

Every service, every log, every database query includes the same ID.
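
In code, propagation is just forwarding the header on every outbound call. A sketch using the requests library and a hypothetical auth service URL:

import requests


def validate_token(token: str, request_id: str) -> bool:
    # Forward the same X-Request-Id we received so the auth service's logs
    # and traces join ours under one identifier.
    response = requests.get(
        "http://auth-service/auth/validate",  # hypothetical downstream service
        headers={
            "Authorization": f"Bearer {token}",
            "X-Request-Id": request_id,
        },
        timeout=2,
    )
    return response.status_code == 200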

6. Silent Failures: The 200 OK That Lied

The most dangerous failures don’t look like failures.

What Is a Silent Failure?

HTTP/1.1 200 OK
Content-Type: application/json

{
  "success": false,
  "error": "Payment processor unavailable"
}

The HTTP layer says “success.” The business layer says “failure.”

Your monitoring sees: 200 OK, all good! Your users see: “My payment didn’t work.”

Why Silent Failures Happen

1. Overly optimistic error handling:

# BAD: Catching everything and returning success
def handle_payment():
    try:
        process_payment()
        return {"status": "success"}
    except Exception:
        return {"status": "success", "note": "will retry later"}  # WRONG!

2. Partial failures in batch operations:

{
  "status": 200,
  "processed": 95,
  "failed": 5,
  "errors": ["item 3 not found", "item 7 invalid"]
}

Is this success? The HTTP status says yes. Five items say no.

3. Degraded responses:

{
  "user": {
    "id": 123,
    "name": "Alice",
    "preferences": null,
    "recommendations": []
  }
}

User exists, but preferences service was down. Is this an error?

Detecting Silent Failures

1. Business metrics, not just HTTP metrics:

# HTTP metrics show
http_requests_total{status="200"} = 10000

# Business metrics reveal the truth
payment_success_total = 8500
payment_failed_total = 1500

2. Error response body parsing:

Monitor for "success": false or "error" in 200 responses.

3. Downstream dependency monitoring:

If the payment service is down, 200s from your API are probably lies.
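
One way to implement the second check is to inspect successful responses on the way out. A sketch assuming a Flask app (the hook, logger, and field names are illustrative):

import json
import logging

from flask import Flask, request

app = Flask(__name__)
logger = logging.getLogger("user-api")


@app.after_request
def flag_silent_failures(response):
    # A 200 whose body admits failure is a silent failure: record it so it
    # surfaces on dashboards even though the HTTP status looks healthy.
    if response.status_code == 200 and response.is_json:
        body = response.get_json(silent=True)
        if isinstance(body, dict) and (body.get("success") is False or "error" in body):
            logger.warning(json.dumps({
                "event": "silent_failure",
                "path": request.path,
                "error": body.get("error"),
            }))
    return response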

The Solution: Be Honest

# DON'T: Silent failure
HTTP/1.1 200 OK
{"success": false, "error": "Payment failed"}

# DO: Honest failure
HTTP/1.1 502 Bad Gateway
{"error": "Payment processor unavailable"}

If something failed, the status code should reflect it.
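
A sketch of the honest version, assuming Flask and a hypothetical payment client that raises when the processor is down:

from flask import Flask, jsonify

app = Flask(__name__)


class PaymentProcessorUnavailable(Exception):
    """Raised by the (hypothetical) payment client when the processor is down."""


def charge(order_id: str) -> None:
    raise PaymentProcessorUnavailable()  # stand-in for a real client call


@app.route("/orders/<order_id>/pay", methods=["POST"])
def pay(order_id):
    try:
        charge(order_id)
    except PaymentProcessorUnavailable:
        # Let the status code tell the truth so monitoring sees the failure.
        return jsonify(error="Payment processor unavailable"), 502
    return jsonify(status="paid"), 200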

7. Key Metrics for API Health

What should you actually track?

SLIs, SLOs, and SLAs

graph TD
    SLI[SLI: Service Level Indicator<br/>What you measure] --> SLO[SLO: Service Level Objective<br/>Target you aim for]
    SLO --> SLA[SLA: Service Level Agreement<br/>Promise to customers]
    style SLI fill:#e3f2fd
    style SLO fill:#fff3e0
    style SLA fill:#ffccbc

| Term | Definition | Example |
|---|---|---|
| SLI | A metric that indicates service health | p99 latency, error rate |
| SLO | Internal target for an SLI | p99 < 500ms, errors < 1% |
| SLA | External commitment with consequences | 99.9% uptime or refund |

Essential API Metrics

1. Availability

Availability = (Successful requests) / (Total requests)

Target: 99.9% = 8.7 hours downtime/year
Target: 99.99% = 52 minutes downtime/year

2. Error Rate

Error Rate = (5xx responses) / (Total responses)

Healthy: < 0.1%
Concerning: > 1%
Critical: > 5%

3. Latency Distribution

| Percentile | Healthy | Degraded | Critical |
|---|---|---|---|
| p50 | < 100ms | < 500ms | > 1s |
| p95 | < 300ms | < 1s | > 2s |
| p99 | < 500ms | < 2s | > 5s |

4. Throughput

Requests per second (RPS)

Know your baseline: 1000 RPS normal
Alert when: < 500 RPS (unusual drop) or > 2000 RPS (spike)

Error Budgets

An error budget is the inverse of availability:

If SLO = 99.9% availability
Error budget = 0.1% = 43.2 minutes/month of allowed downtime

How to use it:

  • Budget remaining: Deploy new features
  • Budget exhausted: Focus on reliability
  • Budget overspent: All hands on stability

graph LR
    A[Error Budget<br/>100%] -->|Deployment| B[Budget: 80%]
    B -->|Incident| C[Budget: 30%]
    C -->|Another incident| D[Budget: 0%]
    D --> E[FREEZE:<br/>No new deploys<br/>Fix reliability]
    style A fill:#c8e6c9
    style B fill:#fff9c4
    style C fill:#ffccbc
    style D fill:#ef9a9a
    style E fill:#ef5350
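
A tiny worked example of the budget math, with invented request counts and a 99.9% monthly SLO:

slo = 0.999
total_requests = 10_000_000   # served this month
failed_requests = 7_200       # observed failures so far

allowed_failures = total_requests * (1 - slo)  # 10,000 failures allowed
budget_remaining = 1 - failed_requests / allowed_failures

print(f"Allowed failures this month: {allowed_failures:,.0f}")
print(f"Error budget remaining: {budget_remaining:.0%}")  # 28% left before the freeze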

Alerting Strategy

Not every metric needs an alert. Alert on:

| Alert Type | Example | Priority |
|---|---|---|
| Symptom | Error rate > 5% | High |
| Cause | Database connection pool exhausted | High |
| Predictive | Disk 90% full | Medium |

Don’t alert on:

  • Metrics within normal variance
  • Things that auto-recover
  • Duplicates of the same issue

The goal: Every alert should require human action.

What’s Next

This guide covered the fundamentals of API observability: why it matters and what to observe.

For deeper topics like:

  • Instrumenting your code for observability
  • Correlating events across services
  • Building effective dashboards
  • Runbooks for incident response
  • Advanced diagnosis techniques

See the upcoming course: Observabilidad efectiva de APIs REST
