Observability

Infrastructure & Governance Jan 6, 2025 TYPESCRIPT

Definition

Picture this: it’s 2 AM and your e-commerce site is running slow. Customers are complaining, orders are failing, and you have no idea why. Is it the database? The payment service? A network issue? Without observability, you’re essentially debugging in the dark, making random guesses and hoping something works.

Observability is the ability to understand what’s happening inside your system by looking at what it outputs. It’s not just about knowing that something is wrong; it’s about having the tools and data to figure out exactly what went wrong, where, and why. A truly observable system lets you ask questions you’ve never asked before and get answers, even for problems you’ve never encountered.

The concept rests on three pillars. Metrics tell you what is happening in aggregate (CPU usage is at 90%, response time jumped to 3 seconds). Logs tell you about specific events (user X tried to log in at 2:03 AM and failed). Traces show you the journey of a single request through your system (this API call took 2 seconds because it waited 1.8 seconds for the database). Each pillar answers a different question, and correlating all three gives you a far more complete picture of your system’s health and behavior, letting you diagnose problems quickly instead of guessing.
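To make the three signal types concrete, here is a minimal, dependency-free sketch. The `Telemetry` class, its method names, and the metric name `http.response_time_ms` are illustrative assumptions, not the API of any real telemetry library. It simulates one login request that emits a span for the slow database call, a log for the failed login, and a response-time metric, all tied together by a shared trace ID.

```typescript
// Toy model of the three signals. These types are illustrative
// assumptions, not the API of any real telemetry library.
type Attributes = Record<string, string | number>;

interface MetricPoint { name: string; value: number; attributes: Attributes; }
interface LogRecord { timestamp: number; level: 'info' | 'error'; message: string; traceId: string; }
interface Span { traceId: string; name: string; startMs: number; endMs: number; }

class Telemetry {
  readonly metrics: MetricPoint[] = [];
  readonly logs: LogRecord[] = [];
  readonly spans: Span[] = [];

  // Metric: an aggregatable number describing what is happening.
  recordMetric(name: string, value: number, attributes: Attributes = {}): void {
    this.metrics.push({ name, value, attributes });
  }

  // Log: a discrete event, tagged with the trace it belongs to.
  log(level: LogRecord['level'], message: string, traceId: string, timestamp: number): void {
    this.logs.push({ timestamp, level, message, traceId });
  }

  // Span: one timed step in a request's journey.
  recordSpan(span: Span): void {
    this.spans.push(span);
  }
}

// One simulated request emits all three signals, linked by a shared trace ID.
function handleLogin(t: Telemetry, traceId: string, startMs: number): void {
  const dbStart = startMs + 200;
  const dbEnd = dbStart + 1800; // hypothetical slow database call (1.8 s)
  const endMs = dbEnd;
  t.recordSpan({ traceId, name: 'db.query', startMs: dbStart, endMs: dbEnd });
  t.recordSpan({ traceId, name: 'POST /login', startMs, endMs });
  t.log('error', 'login failed for user X', traceId, endMs);
  t.recordMetric('http.response_time_ms', endMs - startMs, { route: '/login' });
}

const telemetry = new Telemetry();
handleLogin(telemetry, 'trace-001', 0);
```

The shared `traceId` is the design choice that matters here: it is what lets you jump from a metric spike to the traces behind it, and from a slow span to the logs written during that request.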

Example

Debugging a Slow Checkout: A user reports that checkout is slow. With observability: Metrics show latency spikes at 2 PM. Traces reveal that requests are waiting 5 seconds at the inventory service. Logs from that service show “connection pool exhausted” errors. In 5 minutes, you’ve identified the root cause: the inventory service is running out of database connections, so its connection pool needs to be enlarged or a connection leak needs to be fixed.
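The trace-reading step in the checkout scenario can be sketched as a small calculation: for each span, subtract the time spent in its direct children to get its "self time", then pick the span with the largest value. The span shape, IDs, and durations below are made up for illustration, and the subtraction assumes child spans run one after another, which real traces with parallel calls may not satisfy.

```typescript
// Hypothetical, simplified span shape; real tracing systems carry more fields.
interface TraceSpan { id: string; parentId: string | null; service: string; durationMs: number; }

// Self time = a span's duration minus the durations of its direct children.
// Assumes children run sequentially, which is a simplification.
function selfTimes(spans: TraceSpan[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const s of spans) result.set(s.id, s.durationMs);
  for (const s of spans) {
    if (s.parentId !== null && result.has(s.parentId)) {
      result.set(s.parentId, result.get(s.parentId)! - s.durationMs);
    }
  }
  return result;
}

// Returns the span where the request spent the most time of its own.
function findBottleneck(spans: TraceSpan[]): TraceSpan | undefined {
  const self = selfTimes(spans);
  let worst: TraceSpan | undefined;
  for (const s of spans) {
    if (worst === undefined || self.get(s.id)! > self.get(worst.id)!) worst = s;
  }
  return worst;
}

// Made-up trace: the inventory service holds the request for 5 s,
// but its actual database query is fast once it gets a connection.
const checkoutTrace: TraceSpan[] = [
  { id: 'a', parentId: null, service: 'checkout', durationMs: 5400 },
  { id: 'b', parentId: 'a', service: 'inventory', durationMs: 5000 },
  { id: 'c', parentId: 'b', service: 'inventory-db', durationMs: 40 },
  { id: 'd', parentId: 'a', service: 'payment', durationMs: 300 },
];
const bottleneck = findBottleneck(checkoutTrace);
```

In this sample data the query itself takes only 40 ms, so nearly all of the inventory span is time spent waiting inside the service. That pattern is consistent with requests queuing for a connection, which is what the “connection pool exhausted” logs then confirm.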

Finding a Memory Leak: Your app crashes every few days. Metrics show memory steadily increasing over time. Traces, lined up against those metrics, point to the endpoint whose requests coincide with the growth. Logs from that endpoint suggest that a particular query result is being retained and never released for garbage collection. Without observability, you might spend weeks trying to reproduce the issue.
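One way to turn "memory steadily increasing" into a number is to fit a straight line through the memory samples and look at its slope. The sketch below uses an ordinary least-squares fit; the sample shape, the function names, and the 512 MB limit are assumptions chosen for illustration, and a real leak detector would also need to account for garbage-collection sawtooth patterns.

```typescript
// Hypothetical sample shape: heap size observed at a point in time.
interface MemorySample { hoursSinceStart: number; heapMb: number; }

// Least-squares slope of heap size over time, in MB per hour.
function growthRateMbPerHour(samples: MemorySample[]): number {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = samples.reduce((acc, s) => acc + s.hoursSinceStart, 0) / n;
  const meanY = samples.reduce((acc, s) => acc + s.heapMb, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const s of samples) {
    const dx = s.hoursSinceStart - meanX;
    numerator += dx * (s.heapMb - meanY);
    denominator += dx * dx;
  }
  return denominator === 0 ? 0 : numerator / denominator;
}

// Rough estimate of hours left before the heap reaches a given limit.
function hoursUntilLimit(samples: MemorySample[], limitMb: number): number {
  const rate = growthRateMbPerHour(samples);
  if (rate <= 0 || samples.length === 0) return Infinity;
  const latest = samples[samples.length - 1].heapMb;
  return Math.max(0, (limitMb - latest) / rate);
}

// Made-up samples: heap grows by 10 MB every hour.
const heapSamples: MemorySample[] = [
  { hoursSinceStart: 0, heapMb: 100 },
  { hoursSinceStart: 1, heapMb: 110 },
  { hoursSinceStart: 2, heapMb: 120 },
  { hoursSinceStart: 3, heapMb: 130 },
];
const leakRate = growthRateMbPerHour(heapSamples);
const hoursLeft = hoursUntilLimit(heapSamples, 512); // 512 MB is an assumed limit
```

A persistently positive slope across restarts is the signal worth alerting on; the time-to-limit estimate is only as good as the assumption that growth stays linear.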

Understanding User Behavior: You notice error rates climbing. Traces show that 40% of failed requests come from a specific mobile app version. Logs reveal these users are sending malformed dates. Metrics confirm the problem started when version 2.3 was released. You can now roll back or push a fix to that specific version.
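Attributing failures to a client version, as in the scenario above, amounts to grouping failed requests by an attribute and computing each group's share. Here is a minimal sketch; the `RequestRecord` shape and the sample data are hypothetical, and in practice the attribute would come from span or log attributes in your telemetry backend.

```typescript
// Hypothetical record distilled from trace or log attributes.
interface RequestRecord { appVersion: string; failed: boolean; }

// Returns, for each app version, its fraction of all failed requests.
function failureShareByVersion(records: RequestRecord[]): Map<string, number> {
  const failures = records.filter(r => r.failed);
  const counts = new Map<string, number>();
  for (const r of failures) {
    counts.set(r.appVersion, (counts.get(r.appVersion) ?? 0) + 1);
  }
  const shares = new Map<string, number>();
  counts.forEach((count, version) => shares.set(version, count / failures.length));
  return shares;
}

// Made-up data: 2 of the 5 failures come from version 2.3.
const requestRecords: RequestRecord[] = [
  { appVersion: '2.3', failed: true },
  { appVersion: '2.3', failed: true },
  { appVersion: '2.2', failed: true },
  { appVersion: '2.1', failed: true },
  { appVersion: '2.0', failed: true },
  { appVersion: '2.3', failed: false },
  { appVersion: '2.2', failed: false },
  { appVersion: '2.2', failed: false },
];
const failureShares = failureShareByVersion(requestRecords);
```

A share on its own can mislead if one version also carries most of the traffic, so it is worth comparing each version's share of failures with its share of total requests before concluding that the version is at fault.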

Capacity Planning: Before Black Friday, you analyze metrics from last year’s traffic patterns. Traces show which services become bottlenecks under load. Logs reveal warning signs that appeared before previous outages. You can now scale the right services proactively rather than reacting to crashes.

Analogy

The Doctor’s Examination: When you feel sick, a doctor doesn’t just guess what’s wrong. They check your vital signs (metrics: temperature, blood pressure, heart rate), ask about your symptoms and history (logs: discrete events and their context), and might order tests that trace how things flow through your body (traces: following blood flow, nerve signals, or food through your digestive system). Together, these let the doctor diagnose problems even if they’ve never seen your exact condition before.

The Car Dashboard and Service Records: Modern cars are highly observable. The dashboard shows real-time metrics (speed, fuel, engine temperature). The check engine light with error codes is like logs (discrete events that tell you something specific happened). And when the mechanic plugs in a diagnostic tool that shows how signals travel through different systems? That’s tracing. A car without any of these would be terrifying to drive and nearly impossible to fix.

The Air Traffic Control Tower: Air traffic controllers have complete observability over their airspace. Radar shows real-time positions and speeds (metrics). Radio communications log specific events (“Flight 123 requesting runway 4L”). And flight tracking shows the complete journey of each aircraft from origin to destination (traces). Without this level of observability, they couldn’t safely manage hundreds of flights simultaneously.

The Glass-Walled Kitchen: Some restaurants have open kitchens where you can see everything happening. You can watch the chef’s movements (traces), see the timers and thermometers (metrics), and hear the orders being called out (logs). If your food is late, you can see exactly where the bottleneck is: maybe the grill station is backed up, or the expeditor is overwhelmed. Compare this to a closed kitchen where you just wait and wonder.

Code Example


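// OpenTelemetry observability
// Assumes an Express app and a hypothetical `db` data-access module.
// An OpenTelemetry SDK must be registered elsewhere for data to be exported.
import express from 'express';
import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';
import { db } from './db'; // hypothetical data-access layer

const app = express();
const tracer = trace.getTracer('api-service');
const meter = metrics.getMeter('api-service');

// Metric: counts user lookups, labeled by outcome (metric name is illustrative)
const lookups = meter.createCounter('user_lookups_total');

app.get('/api/users/:id', async (req, res) => {
  // Trace: one span covering the whole lookup
  const span = tracer.startSpan('get_user');

  try {
    const user = await db.users.findById(req.params.id);
    span.setStatus({ code: SpanStatusCode.OK });
    lookups.add(1, { outcome: 'ok' });
    res.json(user);
  } catch (error) {
    // Attach the failure to the span so it shows up in the trace
    span.setStatus({ code: SpanStatusCode.ERROR });
    span.recordException(error as Error);
    lookups.add(1, { outcome: 'error' });
    res.status(500).json({ error: 'Internal error' });
  } finally {
    // Always end the span, or it is never exported
    span.end();
  }
});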

Standards & RFCs
