Audience
This guide is for developers and operators who want to understand what’s happening inside their APIs:
- Backend developers who build APIs and need to know when things break
- DevOps engineers responsible for keeping APIs healthy
- SREs who need to respond to incidents and prevent future ones
- Team leads who want to understand why observability matters
- Anyone who has been paged at 3am and couldn’t figure out what went wrong
You should understand how REST APIs work. If not, start with How a REST API Works.
Goal
After reading this guide, you’ll understand:
- Why observability is different from monitoring
- The three pillars: logs, metrics, and traces
- What to measure in your APIs
- How to detect silent failures (200 OK that actually failed)
- Key metrics that matter: p50, p95, p99, error budgets
- The SRE perspective on API health
This guide builds judgment: the ability to know what to observe and why.
1. Why Observe APIs
You can’t fix what you can’t see.
The Difference Between Monitoring and Observability
Monitoring answers: “Is my system working?”
Observability answers: “Why isn’t my system working?”
graph TD
A[System State] --> B{Monitoring}
B -->|Green| C[Everything OK]
B -->|Red| D[Something is wrong]
D --> E{Observability}
E --> F[Logs: What happened?]
E --> G[Metrics: How bad is it?]
E --> H[Traces: Where did it fail?]
style C fill:#c8e6c9
style D fill:#ffccbc
Monitoring tells you there’s a fire. Observability helps you find where it started and how to put it out.
What You Lose Without Observability
Without proper observability, you’re flying blind:
- Slow debugging: “Check the logs” becomes a multi-hour search
- Missed issues: Problems that don’t trigger alerts go unnoticed
- Finger-pointing: Teams blame each other because no one has the full picture
- Slow recovery: Mean Time to Recovery (MTTR) increases dramatically
- User trust erosion: Users discover problems before you do
The Cost of Outages
API downtime is expensive:
| Impact | Consequence |
|---|---|
| Lost revenue | Every minute of downtime = lost transactions |
| Customer churn | Users switch to competitors |
| SLA violations | Financial penalties |
| Reputation damage | Hard to recover trust |
| Engineering time | Teams stop feature work to fight fires |
Observability is not a luxury—it’s the cost of running production systems.
2. The Three Pillars of Observability
Observability rests on three complementary pillars: logs, metrics, and traces.
graph LR
subgraph "Three Pillars"
L[Logs
What happened?]
M[Metrics
How much?]
T[Traces
Where?]
end
L --> O[Observability]
M --> O
T --> O
O --> I[Understanding]
style L fill:#e3f2fd
style M fill:#fff3e0
style T fill:#e8f5e9
style O fill:#f3e5f5
Each pillar answers different questions:
| Pillar | Question | Example |
|---|---|---|
| Logs | What happened? | “User 123 failed authentication at 14:32:05” |
| Metrics | How much/many? | “Error rate is 5.2%, p99 latency is 340ms” |
| Traces | Where did it go? | “Request spent 80% of time in database query” |
You need all three. Logs without metrics lack context. Metrics without traces lack specificity. Traces without logs lack detail.
How They Work Together
Imagine a user reports: “The API is slow.”
- Metrics show: latency spiked at 14:30
- Traces reveal: the slow endpoint is /users/search
- Logs explain: database connection pool was exhausted
Without all three, you’d still be guessing.
3. Logs: The Story of What Happened
Logs are timestamped records of discrete events.
What to Log
Always log:
- Request received (method, path, client IP)
- Authentication events (success, failure, token type)
- Business actions (user created, order placed, payment processed)
- Errors and exceptions (with stack traces in development)
- Request completed (status code, duration)
Never log:
- Passwords or credentials
- Full credit card numbers
- Personal data (GDPR/CCPA compliance)
- API secrets or tokens
Structured Logging
Don’t log plain strings. Use structured formats:
{
"timestamp": "2026-01-12T14:32:05.123Z",
"level": "ERROR",
"service": "user-api",
"requestId": "abc-123-def",
"userId": "user_789",
"action": "authentication",
"status": "failed",
"reason": "invalid_token",
"duration_ms": 45
}
Why structured?
- Searchable: Query level:ERROR AND service:user-api
- Parseable: Tools can aggregate and visualize
- Consistent: Same format across all services
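Here’s a minimal sketch of what that looks like in code, using Python’s standard logging module with a hand-rolled JSON formatter (the service name and field list are illustrative, not a prescription):

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render every log record as a single JSON object
    EXTRA_FIELDS = ("requestId", "userId", "action", "status", "reason", "duration_ms")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "user-api",  # illustrative service name
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("user-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("authentication failed", extra={
    "requestId": "abc-123-def", "userId": "user_789",
    "action": "authentication", "status": "failed",
    "reason": "invalid_token", "duration_ms": 45,
})

In practice you’d likely reach for a library (structlog, python-json-logger, or your framework’s equivalent) rather than writing the formatter yourself.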
Log Levels
Use levels consistently:
| Level | When to Use |
|---|---|
| ERROR | Something failed that shouldn’t have |
| WARN | Something concerning but not critical |
| INFO | Normal operations worth recording |
| DEBUG | Detailed info for troubleshooting |
Don’t overuse ERROR. If everything is an error, nothing is.
The Request ID Pattern
Every request should have a unique identifier:
GET /users/123 HTTP/1.1
X-Request-Id: abc-123-def
This ID should appear in every log for that request:
{"requestId": "abc-123-def", "message": "Request received"}
{"requestId": "abc-123-def", "message": "Fetching user from database"}
{"requestId": "abc-123-def", "message": "Response sent", "status": 200}
Now you can search for abc-123-def and see the entire request lifecycle.
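If your framework has middleware hooks, generating and echoing the ID takes only a few lines. A sketch with Flask (the framework choice is an assumption; any middleware layer works the same way):

import uuid
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def assign_request_id():
    # Reuse the caller's ID if they sent one, otherwise mint a new one
    g.request_id = request.headers.get("X-Request-Id") or str(uuid.uuid4())

@app.after_request
def echo_request_id(response):
    # Return the ID so clients can report it and logs can be correlated
    response.headers["X-Request-Id"] = g.request_id
    return response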
4. Metrics: Numbers That Tell the Truth
Metrics are numeric measurements collected over time.
What to Measure
The Four Golden Signals (from Google SRE):
graph TD
subgraph "Four Golden Signals"
L[Latency
How fast?]
T[Traffic
How much?]
E[Errors
How broken?]
S[Saturation
How full?]
end
L --> H[API Health]
T --> H
E --> H
S --> H
style L fill:#e3f2fd
style T fill:#fff3e0
style E fill:#ffccbc
style S fill:#e8f5e9
1. Latency — How long requests take.
- Measure successful and failed requests separately
- Failed requests that fail fast can skew averages
2. Traffic — Demand on your system.
- Requests per second
- Concurrent connections
- Data transferred
3. Errors — Failed requests.
- HTTP 5xx rate (server errors)
- HTTP 4xx rate (client errors, but track spikes)
- Custom error types (business logic failures)
4. Saturation — How “full” your service is.
- CPU utilization
- Memory usage
- Database connection pool usage
- Queue depth
Understanding Percentiles
Averages lie. Use percentiles.
| Metric | What It Means |
|---|---|
| p50 (median) | Half of requests are faster than this |
| p90 | 90% of requests are faster than this |
| p95 | 95% of requests are faster than this |
| p99 | 99% of requests are faster than this |
Example: If p50 = 100ms and p99 = 2000ms, most users have a good experience, but 1 in 100 waits 20x longer.
Why p99 matters:
- A service handling 1M requests/day
- p99 = 2000ms means 10,000 users a day get a terrible experience
- That’s enough to lose customers
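You can compute these yourself from raw latency samples; a quick sketch with Python’s statistics module (the sample data is simulated):

import random
from statistics import quantiles

# 990 ordinary requests around 100ms plus 10 slow outliers (simulated data)
latencies = [random.gauss(100, 20) for _ in range(990)]
latencies += [random.uniform(500, 2000) for _ in range(10)]

# quantiles(..., n=100) returns the 99 cut points p1..p99
cuts = quantiles(latencies, n=100)
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]
print(f"p50={p50:.0f}ms p90={p90:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")

In production you rarely compute percentiles by hand; metrics systems derive them from histograms. The point is that the average of this data would hide the ten slow requests entirely.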
Rate, Errors, Duration (RED)
An alternative to Golden Signals, simpler for APIs:
| Metric | Description |
|---|---|
| Rate | Requests per second |
| Errors | Failed requests per second |
| Duration | Time to process requests (histogram) |
Both frameworks work. Pick one and be consistent.
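As an illustration of what instrumenting RED looks like, here is a sketch using the prometheus_client library (the library and metric names are assumptions, not requirements of either framework):

import time
from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors: one counter, labelled by endpoint and outcome
REQUESTS = Counter("api_requests_total", "Total requests", ["endpoint", "status"])
# Duration: a histogram, so percentiles can be derived at query time
DURATION = Histogram("api_request_duration_seconds", "Request duration", ["endpoint"])

def instrumented(endpoint, handler):
    start = time.perf_counter()
    try:
        result = handler()
        REQUESTS.labels(endpoint=endpoint, status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(endpoint=endpoint, status="error").inc()
        raise
    finally:
        DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)

start_http_server(8000)  # exposes /metrics for a scraper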
5. Traces: Following the Request Path
Traces show the journey of a request through your system.
What Is Distributed Tracing?
When a request touches multiple services, logs and metrics show each service in isolation. Traces connect them.
graph LR
subgraph "Single Request Journey"
A[API Gateway
5ms] --> B[Auth Service
15ms]
B --> C[User Service
120ms]
C --> D[Database
95ms]
end
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#ffccbc
style D fill:#e8f5e9
A trace shows:
- Total request time: 235ms
- Time in each service
- Where the bottleneck is (User Service + Database = 215ms)
Trace Structure
- Trace: The entire journey (one request)
- Span: A single operation within the trace
- Trace ID: Unique identifier connecting all spans
Trace ID: trace-xyz-789
├── Span: API Gateway (5ms)
├── Span: Auth Service (15ms)
├── Span: User Service (120ms)
│ └── Span: Database Query (95ms)
└── Total: 235ms
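Most tracing libraries express this structure directly as nested spans. A minimal sketch with the OpenTelemetry Python SDK (the console exporter and the run_query helper are placeholders):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout; a real setup would export to a collector instead
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("user-service")

def search_users(query):
    with tracer.start_as_current_span("user-service.search") as span:
        span.set_attribute("query", query)
        # The nested span becomes a child of the outer one,
        # reproducing the span hierarchy shown above
        with tracer.start_as_current_span("database.query"):
            return run_query(query)  # hypothetical database helper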
When Traces Save the Day
Without traces, debugging “slow request” means:
- Check API Gateway logs
- Check Auth Service logs
- Check User Service logs
- Try to correlate by timestamp
- Guess what happened
With traces:
- Look up trace ID
- See the entire flow
- Identify the bottleneck immediately
The Correlation ID Connection
The correlation ID pattern is foundational to tracing:
# Initial request
GET /api/users/123 HTTP/1.1
X-Request-Id: abc-123
# Propagated to downstream services
GET /auth/validate HTTP/1.1
X-Request-Id: abc-123
# Appears in database queries
-- request_id: abc-123
SELECT * FROM users WHERE id = 123
Every service, every log, every database query includes the same ID.
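The propagation step itself is trivial; the discipline is doing it everywhere. A sketch of forwarding the ID on an outbound call with the requests library (the downstream URL is illustrative):

import requests

def call_downstream(url, request_id):
    # Forward the same X-Request-Id so the downstream service logs it too
    return requests.get(url, headers={"X-Request-Id": request_id}, timeout=5)

# e.g. call_downstream("http://auth-service/auth/validate", "abc-123")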
6. Silent Failures: The 200 OK That Lied
The most dangerous failures don’t look like failures.
What Is a Silent Failure?
HTTP/1.1 200 OK
Content-Type: application/json
{
"success": false,
"error": "Payment processor unavailable"
}
The HTTP layer says “success.” The business layer says “failure.”
Your monitoring sees: 200 OK, all good! Your users see: “My payment didn’t work.”
Why Silent Failures Happen
1. Overly optimistic error handling:
# BAD: Catching everything and returning success
try:
    process_payment()
    return {"status": "success"}
except Exception:
    return {"status": "success", "note": "will retry later"}  # WRONG!
2. Partial failures in batch operations:
{
"status": 200,
"processed": 95,
"failed": 5,
"errors": ["item 3 not found", "item 7 invalid"]
}
Is this success? The HTTP status says yes. Five items say no.
3. Degraded responses:
{
"user": {
"id": 123,
"name": "Alice",
"preferences": null,
"recommendations": []
}
}
User exists, but preferences service was down. Is this an error?
Detecting Silent Failures
1. Business metrics, not just HTTP metrics:
# HTTP metrics show
http_requests_total{status="200"} = 10000
# Business metrics reveal the truth
payment_success_total = 8500
payment_failed_total = 1500
2. Error response body parsing:
Monitor for "success": false or "error" in 200 responses.
3. Downstream dependency monitoring:
If the payment service is down, 200s from your API are probably lies.
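A sketch of the business-metrics approach: count business outcomes as their own metric, regardless of the HTTP status (again assuming prometheus_client; the success flag mirrors the response body shown above):

from prometheus_client import Counter

PAYMENTS = Counter("payment_outcomes_total", "Payment attempts by outcome", ["outcome"])

def record_payment_result(http_status, body):
    # Trust the business payload, not the transport status
    if http_status == 200 and body.get("success") is True:
        PAYMENTS.labels(outcome="success").inc()
    else:
        PAYMENTS.labels(outcome="failure").inc()

Now an alert on payment failures fires even while every HTTP response is a 200.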
The Solution: Be Honest
# DON'T: Silent failure
HTTP/1.1 200 OK
{"success": false, "error": "Payment failed"}
# DO: Honest failure
HTTP/1.1 502 Bad Gateway
{"error": "Payment processor unavailable"}
If something failed, the status code should reflect it.
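In code, honesty means letting upstream failures map to failure status codes instead of being swallowed. A sketch with Flask (the exception type and payment call are placeholders):

from flask import Flask, jsonify

app = Flask(__name__)

class PaymentProcessorUnavailable(Exception):
    # Raised by the (hypothetical) payment client when its upstream is down
    pass

def process_payment():
    # Placeholder for the real payment-processor call
    raise PaymentProcessorUnavailable()

@app.route("/payments", methods=["POST"])
def create_payment():
    try:
        process_payment()
    except PaymentProcessorUnavailable:
        # Surface the failure in the status code, not just the body
        return jsonify({"error": "Payment processor unavailable"}), 502
    return jsonify({"status": "success"}), 201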
7. Key Metrics for API Health
What should you actually track?
SLIs, SLOs, and SLAs
graph TD
SLI[SLI: Service Level Indicator
What you measure] --> SLO[SLO: Service Level Objective
Target you aim for]
SLO --> SLA[SLA: Service Level Agreement
Promise to customers]
style SLI fill:#e3f2fd
style SLO fill:#fff3e0
style SLA fill:#ffccbc
| Term | Definition | Example |
|---|---|---|
| SLI | A metric that indicates service health | p99 latency, error rate |
| SLO | Internal target for an SLI | p99 < 500ms, errors < 1% |
| SLA | External commitment with consequences | 99.9% uptime or refund |
Essential API Metrics
1. Availability
Availability = (Successful requests) / (Total requests)
Target: 99.9% = 8.7 hours downtime/year
Target: 99.99% = 52 minutes downtime/year
2. Error Rate
Error Rate = (5xx responses) / (Total responses)
Healthy: < 0.1%
Concerning: > 1%
Critical: > 5%
3. Latency Distribution
| Percentile | Healthy | Degraded | Critical |
|---|---|---|---|
| p50 | < 100ms | < 500ms | > 1s |
| p95 | < 300ms | < 1s | > 2s |
| p99 | < 500ms | < 2s | > 5s |
4. Throughput
Requests per second (RPS)
Know your baseline: 1000 RPS normal
Alert when: < 500 RPS (unusual drop) or > 2000 RPS (spike)
Error Budgets
An error budget is the inverse of availability:
If SLO = 99.9% availability
Error budget = 0.1% = 43.2 minutes/month of allowed downtime
How to use it:
- Budget remaining: Deploy new features
- Budget exhausted: Focus on reliability
- Budget overspent: All hands on stability
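The arithmetic behind the budget is easy to check, as in this small helper (the 30-day window is an assumption; use whatever window your SLO covers):

def error_budget_minutes(slo_percent, window_days=30):
    # Allowed downtime inside the window for a given availability SLO
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

print(round(error_budget_minutes(99.9), 1))   # 43.2 minutes per 30-day month
print(round(error_budget_minutes(99.99), 1))  # 4.3 minutes per 30-day month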
graph LR
A[Error Budget
100%] -->|Deployment| B[Budget: 80%]
B -->|Incident| C[Budget: 30%]
C -->|Another incident| D[Budget: 0%]
D --> E[FREEZE:
No new deploys
Fix reliability]
style A fill:#c8e6c9
style B fill:#fff9c4
style C fill:#ffccbc
style D fill:#ef9a9a
style E fill:#ef5350
Alerting Strategy
Not every metric needs an alert. Alert on:
| Alert Type | Example | Priority |
|---|---|---|
| Symptom | Error rate > 5% | High |
| Cause | Database connection pool exhausted | High |
| Predictive | Disk 90% full | Medium |
Don’t alert on:
- Metrics within normal variance
- Things that auto-recover
- Duplicates of the same issue
The goal: Every alert should require human action.
What’s Next
This guide covered the fundamentals of API observability—the why and what to observe.
For deeper topics like:
- Instrumenting your code for observability
- Correlating events across services
- Building effective dashboards
- Runbooks for incident response
- Advanced diagnosis techniques
See the upcoming course: Observabilidad efectiva de APIs REST
Related Vocabulary Terms
Deepen your understanding:
- Observability - The ability to understand system state
- Logging - Recording events for debugging
- Metrics - Numeric measurements over time
- Tracing - Following requests through systems
- Correlation ID - Linking related events across services