Audience
This guide is for developers and operators who want to understand what’s happening inside their APIs:
- Backend developers who build APIs and need to know when things break
- DevOps engineers responsible for keeping APIs healthy
- SREs who need to respond to incidents and prevent future ones
- Team leads who want to understand why observability matters
- Anyone who has been paged at 3am and couldn’t figure out what went wrong
You should understand how REST APIs work. If not, start with How a REST API Works.
Goal
After reading this guide, you’ll understand:
- Why observability is different from monitoring
- The three pillars: logs, metrics, and traces
- What to measure in your APIs
- How to detect silent failures (200 OK that actually failed)
- Key metrics that matter: p50, p95, p99, error budgets
- The SRE perspective on API health
This guide builds judgment: the ability to know what to observe and why.
1. Why Observe APIs
You can’t fix what you can’t see.
The Difference Between Monitoring and Observability
Monitoring answers: “Is my system working?”
Observability answers: “Why isn’t my system working?”
graph TD
A[System State] --> B{Monitoring}
B -->|Green| C[Everything OK]
B -->|Red| D[Something is wrong]
D --> E{Observability}
E --> F[Logs: What happened?]
E --> G[Metrics: How bad is it?]
E --> H[Traces: Where did it fail?]
style C fill:#c8e6c9
style D fill:#ffccbc
Monitoring tells you there’s a fire. Observability helps you find where it started and how to put it out.
What You Lose Without Observability
Without proper observability, you’re flying blind:
- Slow debugging: “Check the logs” becomes a multi-hour search
- Missed issues: Problems that don’t trigger alerts go unnoticed
- Finger-pointing: Teams blame each other because no one has the full picture
- Slow recovery: Mean Time to Recovery (MTTR) increases dramatically
- User trust erosion: Users discover problems before you do
The Cost of Outages
API downtime is expensive:
| Impact | Consequence |
|---|---|
| Lost revenue | Every minute of downtime = lost transactions |
| Customer churn | Users switch to competitors |
| SLA violations | Financial penalties |
| Reputation damage | Hard to recover trust |
| Engineering time | Teams stop feature work to fight fires |
Observability is not a luxury—it’s the cost of running production systems.
2. The Three Pillars of Observability
Observability rests on three complementary pillars: logs, metrics, and traces.
graph LR
subgraph "Three Pillars"
L[Logs
What happened?]
M[Metrics
How much?]
T[Traces
Where?]
end
L --> O[Observability]
M --> O
T --> O
O --> I[Understanding]
style L fill:#e3f2fd
style M fill:#fff3e0
style T fill:#e8f5e9
style O fill:#f3e5f5
Each pillar answers different questions:
| Pillar | Question | Example |
|---|---|---|
| Logs | What happened? | “User 123 failed authentication at 14:32:05” |
| Metrics | How much/many? | “Error rate is 5.2%, p99 latency is 340ms” |
| Traces | Where did it go? | “Request spent 80% of time in database query” |
You need all three. Logs without metrics lack context. Metrics without traces lack specificity. Traces without logs lack detail.
How They Work Together
Imagine a user reports: “The API is slow.”
- Metrics show: latency spiked at 14:30
- Traces reveal: the slow endpoint is /users/search
- Logs explain: database connection pool was exhausted
Without all three, you’d still be guessing.
3. Logs: The Story of What Happened
Logs are timestamped records of discrete events.
What to Log
Always log:
- Request received (method, path, client IP)
- Authentication events (success, failure, token type)
- Business actions (user created, order placed, payment processed)
- Errors and exceptions (with stack traces in development)
- Request completed (status code, duration)
Never log:
- Passwords or credentials
- Full credit card numbers
- Personal data (GDPR/CCPA compliance)
- API secrets or tokens
Structured Logging
Don’t log plain strings. Use structured formats:
{
"timestamp": "2026-01-12T14:32:05.123Z",
"level": "ERROR",
"service": "user-api",
"requestId": "abc-123-def",
"userId": "user_789",
"action": "authentication",
"status": "failed",
"reason": "invalid_token",
"duration_ms": 45
}
Why structured?
- Searchable: Query level:ERROR AND service:user-api
- Parseable: Tools can aggregate and visualize
- Consistent: Same format across all services
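Here’s a minimal sketch of what that looks like in code, using Python’s standard logging module with a hand-rolled JSON formatter (the service name and field list are illustrative, not a prescription):

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render every log record as a single JSON object
    EXTRA_FIELDS = ("requestId", "userId", "action", "status", "reason", "duration_ms")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "user-api",  # illustrative service name
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("user-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("authentication failed", extra={
    "requestId": "abc-123-def", "userId": "user_789",
    "action": "authentication", "status": "failed",
    "reason": "invalid_token", "duration_ms": 45,
})

In practice you’d likely reach for a library (structlog, python-json-logger, or your framework’s equivalent) rather than writing the formatter yourself.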
Log Levels
Use levels consistently:
| Level | When to Use |
|---|---|
| ERROR | Something failed that shouldn’t have |
| WARN | Something concerning but not critical |
| INFO | Normal operations worth recording |
| DEBUG | Detailed info for troubleshooting |
Don’t overuse ERROR. If everything is an error, nothing is.
The Request ID Pattern
Every request should have a unique identifier:
GET /users/123 HTTP/1.1
X-Request-Id: abc-123-def
This ID should appear in every log for that request:
{"requestId": "abc-123-def", "message": "Request received"}
{"requestId": "abc-123-def", "message": "Fetching user from database"}
{"requestId": "abc-123-def", "message": "Response sent", "status": 200}
Now you can search for abc-123-def and see the entire request lifecycle.
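If your framework has middleware hooks, generating and echoing the ID takes only a few lines. A sketch with Flask (the framework choice is an assumption; any middleware layer works the same way):

import uuid
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_request_id():
    # Reuse the caller's ID if they sent one, otherwise mint a new one
    g.request_id = request.headers.get("X-Request-Id") or str(uuid.uuid4())

@app.after_request
def echo_request_id(response):
    # Return the ID so clients can report it and logs can be correlated
    response.headers["X-Request-Id"] = g.request_id
    return response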
4. Metrics: Numbers That Tell the Truth
Metrics are numeric measurements collected over time.
What to Measure
The Four Golden Signals (from Google SRE):
graph TD
subgraph "Four Golden Signals"
L[Latency
How fast?]
T[Traffic
How much?]
E[Errors
How broken?]
S[Saturation
How full?]
end
L --> H[API Health]
T --> H
E --> H
S --> H
style L fill:#e3f2fd
style T fill:#fff3e0
style E fill:#ffccbc
style S fill:#e8f5e9
1. Latency — How long requests take.
- Measure successful and failed requests separately
- Failed requests that fail fast can skew averages
2. Traffic — Demand on your system.
- Requests per second
- Concurrent connections
- Data transferred
3. Errors — Failed requests.
- HTTP 5xx rate (server errors)
- HTTP 4xx rate (client errors, but track spikes)
- Custom error types (business logic failures)
4. Saturation — How “full” your service is.
- CPU utilization
- Memory usage
- Database connection pool usage
- Queue depth
Understanding Percentiles
Averages lie. Use percentiles.
| Metric | What It Means |
|---|---|
| p50 (median) | Half of requests are faster than this |
| p90 | 90% of requests are faster than this |
| p95 | 95% of requests are faster than this |
| p99 | 99% of requests are faster than this |
Example: If p50 = 100ms and p99 = 2000ms, most users have a good experience, but 1 in 100 waits 20x longer.
Why p99 matters:
- A service handling 1M requests/day
- p99 = 2000ms means 10,000 users a day get a terrible experience
- That’s enough to lose customers
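You can compute these yourself from raw latency samples; a quick sketch with Python’s statistics module (the sample data is simulated):

import random
from statistics import quantiles

# 990 ordinary requests around 100ms plus 10 slow outliers (simulated data)
latencies = [random.gauss(100, 20) for _ in range(990)]
latencies += [random.uniform(500, 2000) for _ in range(10)]

# quantiles(..., n=100) returns the 99 cut points p1..p99
cuts = quantiles(latencies, n=100)
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]
print(f"p50={p50:.0f}ms p90={p90:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")

In production you rarely compute percentiles by hand; metrics systems derive them from histograms. The point is that the average of this data would hide the ten slow requests entirely.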
Rate, Errors, Duration (RED)
An alternative to Golden Signals, simpler for APIs:
| Metric | Description |
|---|---|
| Rate | Requests per second |
| Errors | Failed requests per second |
| Duration | Time to process requests (histogram) |
Both frameworks work. Pick one and be consistent.
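As an illustration of what instrumenting RED looks like, here is a sketch using the prometheus_client library (the library and metric names are assumptions, not requirements of either framework):

import time
from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors: one counter, labelled by endpoint and outcome
REQUESTS = Counter("api_requests_total", "Total requests", ["endpoint", "status"])
# Duration: a histogram, so percentiles can be derived at query time
DURATION = Histogram("api_request_duration_seconds", "Request duration", ["endpoint"])

def instrumented(endpoint, handler):
    start = time.perf_counter()
    try:
        result = handler()
        REQUESTS.labels(endpoint=endpoint, status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(endpoint=endpoint, status="error").inc()
        raise
    finally:
        DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)

start_http_server(8000)  # exposes /metrics for a scraper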
5. Traces: Following the Request Path
Traces show the journey of a request through your system.
What Is Distributed Tracing?
When a request touches multiple services, logs and metrics show each service in isolation. Traces connect them.
graph LR
subgraph "Single Request Journey"
A[API Gateway
5ms] --> B[Auth Service
15ms]
B --> C[User Service
120ms]
C --> D[Database
95ms]
end
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#ffccbc
style D fill:#e8f5e9
A trace shows:
- Total request time: 235ms
- Time in each service
- Where the bottleneck is (User Service + Database = 215ms)
Trace Structure
- Trace: The entire journey (one request)
- Span: A single operation within the trace
- Trace ID: Unique identifier connecting all spans
Trace ID: trace-xyz-789
├── Span: API Gateway (5ms)
├── Span: Auth Service (15ms)
├── Span: User Service (120ms)
│ └── Span: Database Query (95ms)
└── Total: 235ms
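Most tracing libraries express this structure directly as nested spans. A minimal sketch with the OpenTelemetry Python SDK (the console exporter and the run_query helper are placeholders):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout; a real setup would export to a collector instead
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("user-service")

def search_users(query):
    with tracer.start_as_current_span("user-service.search") as span:
        span.set_attribute("query", query)
        # The nested span becomes a child of the outer one,
        # reproducing the span hierarchy shown above
        with tracer.start_as_current_span("database.query"):
            return run_query(query)  # hypothetical database helper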
When Traces Save the Day
Without traces, debugging “slow request” means:
- Check API Gateway logs
- Check Auth Service logs
- Check User Service logs
- Try to correlate by timestamp
- Guess what happened
With traces:
- Look up trace ID
- See the entire flow
- Identify the bottleneck immediately
The Correlation ID Connection
The correlation ID pattern is foundational to tracing:
# Initial request
GET /api/users/123 HTTP/1.1
X-Request-Id: abc-123
# Propagated to downstream services
GET /auth/validate HTTP/1.1
X-Request-Id: abc-123
# Appears in database queries
-- request_id: abc-123
SELECT * FROM users WHERE id = 123
Every service, every log, every database query includes the same ID.
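The propagation step itself is trivial; the discipline is doing it everywhere. A sketch of forwarding the ID on an outbound call with the requests library (the downstream URL is illustrative):

import requests

def call_downstream(url, request_id):
    # Forward the same X-Request-Id so the downstream service logs it too
    return requests.get(url, headers={"X-Request-Id": request_id}, timeout=5)

# e.g. call_downstream("http://auth-service/auth/validate", "abc-123")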
6. Silent Failures: The 200 OK That Lied
The most dangerous failures don’t look like failures.
What Is a Silent Failure?
HTTP/1.1 200 OK
Content-Type: application/json
{
"success": false,
"error": "Payment processor unavailable"
}
The HTTP layer says “success.” The business layer says “failure.”
Your monitoring sees: 200 OK, all good! Your users see: “My payment didn’t work.”
Why Silent Failures Happen
1. Overly optimistic error handling:
# BAD: Catching everything and returning success
try:
    process_payment()
    return {"status": "success"}
except Exception:
    return {"status": "success", "note": "will retry later"}  # WRONG!
2. Partial failures in batch operations:
{
"status": 200,
"processed": 95,
"failed": 5,
"errors": ["item 3 not found", "item 7 invalid"]
}
Is this success? The HTTP status says yes. Five items say no.
3. Degraded responses:
{
"user": {
"id": 123,
"name": "Alice",
"preferences": null,
"recommendations": []
}
}
User exists, but preferences service was down. Is this an error?
Detecting Silent Failures
1. Business metrics, not just HTTP metrics:
# HTTP metrics show
http_requests_total{status="200"} = 10000
# Business metrics reveal the truth
payment_success_total = 8500
payment_failed_total = 1500
2. Error response body parsing:
Monitor for "success": false or "error" in 200 responses.
3. Downstream dependency monitoring:
If the payment service is down, 200s from your API are probably lies.
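A sketch of the business-metrics approach: count business outcomes as their own metric, regardless of the HTTP status (again assuming prometheus_client; the success flag mirrors the response body shown above):

from prometheus_client import Counter

PAYMENTS = Counter("payment_outcomes_total", "Payment attempts by outcome", ["outcome"])

def record_payment_result(http_status, body):
    # Trust the business payload, not the transport status
    if http_status == 200 and body.get("success") is True:
        PAYMENTS.labels(outcome="success").inc()
    else:
        PAYMENTS.labels(outcome="failure").inc()

Now an alert on payment failures fires even while every HTTP response is a 200.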
The Solution: Be Honest
# DON'T: Silent failure
HTTP/1.1 200 OK
{"success": false, "error": "Payment failed"}
# DO: Honest failure
HTTP/1.1 502 Bad Gateway
{"error": "Payment processor unavailable"}
If something failed, the status code should reflect it.
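In code, honesty means letting upstream failures map to failure status codes instead of being swallowed. A sketch with Flask (the exception type and payment call are placeholders):

from flask import Flask, jsonify

app = Flask(__name__)

class PaymentProcessorUnavailable(Exception):
    # Raised by the (hypothetical) payment client when its upstream is down
    pass

def process_payment():
    # Placeholder for the real payment-processor call
    raise PaymentProcessorUnavailable()

@app.route("/payments", methods=["POST"])
def create_payment():
    try:
        process_payment()
    except PaymentProcessorUnavailable:
        # Surface the failure in the status code, not just the body
        return jsonify({"error": "Payment processor unavailable"}), 502
    return jsonify({"status": "success"}), 201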
7. Key Metrics for API Health
What should you actually track?
SLIs, SLOs, and SLAs
graph TD
SLI[SLI: Service Level Indicator
What you measure] --> SLO[SLO: Service Level Objective
Target you aim for]
SLO --> SLA[SLA: Service Level Agreement
Promise to customers]
style SLI fill:#e3f2fd
style SLO fill:#fff3e0
style SLA fill:#ffccbc
| Term | Definition | Example |
|---|---|---|
| SLI | A metric that indicates service health | p99 latency, error rate |
| SLO | Internal target for an SLI | p99 < 500ms, errors < 1% |
| SLA | External commitment with consequences | 99.9% uptime or refund |
Essential API Metrics
1. Availability
Availability = (Successful requests) / (Total requests)
Target: 99.9% = 8.7 hours downtime/year
Target: 99.99% = 52 minutes downtime/year
2. Error Rate
Error Rate = (5xx responses) / (Total responses)
Healthy: < 0.1%
Concerning: > 1%
Critical: > 5%
3. Latency Distribution
| Percentile | Healthy | Degraded | Critical |
|---|---|---|---|
| p50 | < 100ms | < 500ms | > 1s |
| p95 | < 300ms | < 1s | > 2s |
| p99 | < 500ms | < 2s | > 5s |
4. Throughput
Requests per second (RPS)
Know your baseline: 1000 RPS normal
Alert when: < 500 RPS (unusual drop) or > 2000 RPS (spike)
Error Budgets
An error budget is the inverse of availability:
If SLO = 99.9% availability
Error budget = 0.1% = 43.2 minutes/month of allowed downtime
How to use it:
- Budget remaining: Deploy new features
- Budget exhausted: Focus on reliability
- Budget overspent: All hands on stability
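The arithmetic behind the budget is easy to check, as in this small helper (the 30-day window is an assumption; use whatever window your SLO covers):

def error_budget_minutes(slo_percent, window_days=30):
    # Allowed downtime inside the window for a given availability SLO
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

print(round(error_budget_minutes(99.9), 1))   # 43.2 minutes per 30-day month
print(round(error_budget_minutes(99.99), 1))  # 4.3 minutes per 30-day month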
graph LR
A[Error Budget
100%] -->|Deployment| B[Budget: 80%]
B -->|Incident| C[Budget: 30%]
C -->|Another incident| D[Budget: 0%]
D --> E[FREEZE:
No new deploys
Fix reliability]
style A fill:#c8e6c9
style B fill:#fff9c4
style C fill:#ffccbc
style D fill:#ef9a9a
style E fill:#ef5350
Alerting Strategy
Not every metric needs an alert. Alert on:
| Alert Type | Example | Priority |
|---|---|---|
| Symptom | Error rate > 5% | High |
| Cause | Database connection pool exhausted | High |
| Predictive | Disk 90% full | Medium |
Don’t alert on:
- Metrics within normal variance
- Things that auto-recover
- Duplicates of the same issue
The goal: Every alert should require human action.
What’s Next
This guide covered the fundamentals of API observability—the why and what to observe.
For deeper topics like:
- Instrumenting your code for observability
- Correlating events across services
- Building effective dashboards
- Runbooks for incident response
- Advanced diagnosis techniques
See the upcoming course: Observabilidad efectiva de APIs REST
Related Vocabulary Terms
Deepen your understanding:
- Observability - The ability to understand system state
- Logging - Recording events for debugging
- Metrics - Numeric measurements over time
- Tracing - Following requests through systems
- Correlation ID - Linking related events across services