Tracing

Infrastructure & Governance Security Notes Jan 8, 2026 TYPESCRIPT
observability distributed-systems debugging performance microservices

Definition

When you click “Buy Now” on an e-commerce site, that single click might trigger calls to dozens of services: shopping cart, inventory, pricing, fraud detection, payment processing, shipping calculation, email notification. If the checkout takes 10 seconds instead of 2, which service is the bottleneck? If it fails, which service caused the failure? Without distributed tracing, answering these questions requires heroic detective work across multiple log systems.

Distributed tracing solves this by following a request’s complete journey through your system. It creates a trace - a tree of spans where each span represents one operation (an API call, a database query, a message queue operation). Each span records when it started, when it ended, what service handled it, and any relevant metadata. The magic is that all spans in a request share a common trace ID, so you can reconstruct the entire journey from start to finish.
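
To make that concrete, here is a minimal sketch of the data each span carries and how a backend reassembles the tree; the field names are illustrative, not any particular vendor's schema:

// Minimal span shape (illustrative field names, not a vendor schema)
interface Span {
  traceId: string;        // shared by every span in one request
  spanId: string;         // unique to this operation
  parentSpanId?: string;  // absent on the root span
  name: string;           // e.g. "db.query" or "POST /api/orders"
  startTime: number;      // timestamps give per-operation duration
  endTime: number;
  attributes: Record<string, string | number | boolean>;
}

// A backend reconstructs the journey by grouping spans on traceId
// and linking each span to its parent
function children(spans: Span[], parentId?: string): Span[] {
  return spans.filter(s => s.parentSpanId === parentId);
}
// children(spans) with no parentId returns the root span(s)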

Think of it like a package tracking system that shows not just “your package arrived” but every warehouse it passed through, how long it spent at each one, and which truck carried it between locations. Tracing answers “why is this slow?” and “what called what?” in ways that logs and metrics alone cannot. When combined with correlation IDs propagated through headers, traces become an essential debugging tool for microservices.
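
Concretely, the W3C Trace Context standard carries that correlation in a single traceparent header; the values below are the spec's own example:

// traceparent format: {version}-{trace-id}-{parent-span-id}-{trace-flags}
const traceparent = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';

const [version, traceId, parentId, flags] = traceparent.split('-');
// traceId:  32 hex chars, shared by every span in the request
// parentId: 16 hex chars, the span ID of the immediate caller
// flags:    '01' means the upstream service sampled this trace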

Example

Debugging Slow Checkout at Shopify: A customer reports checkout taking 8 seconds. Tracing shows: API Gateway (50ms) → Cart Service (100ms) → Inventory Service (6000ms!) → Payment Service (500ms) → Email Service (200ms). The trace immediately identifies inventory as the bottleneck - it’s waiting on a slow database query visible as a child span.

Netflix Request Flow: When you press play, Netflix’s trace might show: Edge Service → API Gateway → User Service → Entitlement Service → Steering Service → CDN Selection → Playback Service. Each span shows timing, and the visualization reveals that CDN selection took 2 seconds in a specific region due to a failing health check.

Uber Ride Request: Tracing a ride request shows: App → API Gateway → Driver Matching → ETA Calculation → Pricing → Payment Authorization → Driver Notification. If matching is slow, the trace shows exactly which matching algorithm stage is the problem and how many driver candidates were evaluated.

AWS Lambda Cold Start Detection: Traces reveal that a Lambda function’s first invocation takes 3 seconds (cold start) while subsequent calls take 100ms. The trace shows initialization time separately from execution time, helping engineers optimize cold start performance.

Database N+1 Query Detection: A trace for an API endpoint shows 100 child spans, each a database query taking 10ms. The parent span totals 1 second. The trace visualization immediately shows the N+1 problem - 100 sequential queries that should be one batch query.
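
A sketch of the before/after that such a trace points to, using a hypothetical db.query helper:

// Hypothetical query helper, declared only to keep the sketch self-contained
declare const db: { query(sql: string, params: unknown[]): Promise<unknown[]> };

// N+1 shape: 100 sequential child spans of ~10ms each
async function loadAuthorsSlow(postIds: string[]) {
  const authors = [];
  for (const id of postIds) {
    authors.push(await db.query('SELECT author FROM posts WHERE id = $1', [id]));
  }
  return authors;
}

// Batched shape: one child span, one round trip
async function loadAuthorsBatched(postIds: string[]) {
  return db.query('SELECT author FROM posts WHERE id = ANY($1)', [postIds]);
}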

Analogy

The Package Tracking System: When you order from Amazon, you can track your package through every facility: “Picked up at Los Angeles warehouse at 10 AM, arrived at Phoenix hub at 2 PM, out for delivery at 8 AM.” Distributed tracing does this for your request - you can see every “stop” it made and how long it spent there.

The Airport Journey: Your flight involves many steps: check-in counter (5 min), security screening (15 min), walking to gate (10 min), boarding (20 min), flight (2 hours), deplaning (10 min), baggage claim (15 min). If someone asks “why did the trip take 4 hours?”, you can point to exactly which step was slow. That’s what traces show for requests.

The Relay Race: In a relay race, each runner’s split time is recorded. You can see that runner 2 was 3 seconds slower than expected, immediately identifying where time was lost. Each runner is a span in the trace.

The Crime Scene Investigation: Detectives trace a suspect’s movements through the city: “Left home at 8 AM (camera footage), arrived at bank at 8:30 (transaction record), gas station at 9 AM (receipt).” They’re reconstructing a journey from disparate evidence. Distributed tracing automatically collects this evidence for every request.

Code Example

// OpenTelemetry distributed tracing setup
import { trace, context, propagation, SpanKind, SpanStatusCode } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

// Initialize tracer
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(new JaegerExporter({ endpoint: 'http://jaeger:14268/api/traces' }))
);
provider.register({ propagator: new W3CTraceContextPropagator() });

const tracer = trace.getTracer('order-service', '1.0.0');

// Trace an HTTP request handler
app.post('/api/orders', async (req, res) => {
  // Start a new span (or continue from incoming trace context)
  const span = tracer.startSpan('create_order', {
    kind: SpanKind.SERVER,
    attributes: {
      'http.method': 'POST',
      'http.url': req.url,
      'order.customer_id': req.body.customerId
    }
  });

  try {
    // Wrap all operations in the span's context
    await context.with(trace.setSpan(context.active(), span), async () => {
      // Each service call creates a child span
      const items = await validateInventory(req.body.items);
      const payment = await processPayment(req.body.payment);
      const order = await createOrder(req.body.customerId, items, payment);

      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total', order.total);

      res.json(order);
    });

    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    span.recordException(error);
    res.status(500).json({ error: 'Order creation failed' });
  } finally {
    span.end();
  }
});

// Child span for downstream service call
async function validateInventory(items: OrderItem[]): Promise<ValidatedItem[]> {
  return tracer.startActiveSpan('validate_inventory', {
    kind: SpanKind.CLIENT,
    attributes: { 'inventory.item_count': items.length }
  }, async (span) => {
    try {
      // Propagate trace context to the downstream service; the registered
      // propagator writes the traceparent header into the carrier object
      const headers: Record<string, string> = {};
      propagation.inject(context.active(), headers);

      const response = await fetch('http://inventory-service/validate', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          ...headers  // Contains traceparent header
        },
        body: JSON.stringify(items)
      });

      span.setAttribute('inventory.validated', true);
      // return await so the span covers body parsing and a parse failure hits the catch
      return await response.json();
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Database operation with span
async function createOrder(customerId: string, items: ValidatedItem[], payment: PaymentResult) {
  return tracer.startActiveSpan('db.create_order', {
    kind: SpanKind.INTERNAL,
    attributes: {
      'db.system': 'postgresql',
      'db.operation': 'INSERT',
      'db.table': 'orders'
    }
  }, async (span) => {
    try {
      const order = await db.orders.create({
        customerId,
        items,
        paymentId: payment.id,
        status: 'confirmed'
      });

      span.setAttribute('db.rows_affected', 1);
      return order;
    } finally {
      // End the span even if the insert throws, so the span is not leaked
      span.end();
    }
  });
}

// Example trace output (conceptual):
// Trace ID: abc123
// └── create_order (server, 500ms)
//     ├── validate_inventory (client, 150ms)
//     │   └── [inventory-service] check_stock (server, 100ms)
//     │       └── db.query (internal, 50ms)
//     ├── process_payment (client, 300ms)
//     │   └── [payment-service] charge_card (server, 250ms)
//     │       ├── fraud_check (internal, 50ms)
//     │       └── gateway_call (client, 150ms)
//     └── db.create_order (internal, 50ms)

Diagram

flowchart TB
    subgraph TraceStructure["Trace Structure (Trace ID: abc-123)"]
        A["Root Span<br/>API Gateway<br/>0-500ms"]
        B["Child Span<br/>Order Service<br/>10-400ms"]
        C["Child Span<br/>Inventory Check<br/>50-150ms"]
        D["Child Span<br/>Payment Service<br/>150-350ms"]
        E["Child Span<br/>DB Write<br/>350-380ms"]
        F["Grandchild<br/>Fraud Check<br/>160-200ms"]
        G["Grandchild<br/>Card Charge<br/>200-340ms"]
    end

    A --> B
    B --> C
    B --> D
    B --> E
    D --> F
    D --> G

    subgraph Propagation["Context Propagation"]
        H[traceparent header]
        I[W3C Trace Context]
        J["Service A → Service B"]
    end

    H --> I --> J

    style A fill:#93c5fd
    style D fill:#fcd34d
    style G fill:#f87171

Security Notes

CRITICAL: Distributed tracing follows requests across every service boundary, making it essential for debugging and monitoring - and making the trace data itself a sensitive asset that must be protected.

Tracing Components:

  • Trace ID: Unique ID for entire request flow
  • Span ID: Unique ID for operation within trace
  • Parent ID: Link to parent span
  • Timestamps: When span started/ended
  • Tags: Key-value metadata

Tracing Standards & Platforms:

  • OpenTelemetry: Vendor-neutral instrumentation standard (APIs, SDKs, OTLP)
  • Jaeger: Open-source distributed tracing platform
  • Zipkin: Open-source distributed tracing system
  • AWS X-Ray: Managed tracing service on AWS
  • Google Cloud Trace: Managed tracing service on GCP

Security Considerations (see the redaction sketch after this list):

  • PII leakage: Don’t log PII in traces
  • Sensitive data: Don’t include passwords/keys in traces
  • Trace access: Restrict access to trace data
  • Retention: Define trace data retention policy
  • Sampling: Sample to reduce data volume
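
One way to enforce the PII and secrets rules above is to scrub attribute values before they ever reach a span. A minimal sketch; the denylist pattern is only an example to extend for your domain:

// Denylist-based scrubber applied before attributes reach a span
const SENSITIVE_KEYS = /password|secret|token|authorization|card|ssn|email/i;

function safeAttributes(
  attrs: Record<string, string | number | boolean>
): Record<string, string | number | boolean> {
  const out: Record<string, string | number | boolean> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] = SENSITIVE_KEYS.test(key) ? '[REDACTED]' : value;
  }
  return out;
}

// Usage: span.setAttributes(safeAttributes({ 'user.email': 'a@b.com', 'order.id': '42' }));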

Implementation:

  • Propagate trace ID: Pass trace context through every service hop
  • Instrumentation: Instrument service boundaries and significant operations
  • Correlation: Tie related operations together under one trace ID
  • Performance: Keep tracing overhead minimal (batching, async export)
  • Storage: Centralize trace storage so traces can be queried across services

Benefits:

  • Debugging: Trace request through system
  • Performance: Identify bottlenecks
  • Errors: Correlate errors across services
  • Latency: Measure latency per service
  • Dependencies: Visualize service dependencies

Best Practices:

  • Unique IDs: Generate globally unique, random trace IDs
  • Propagation: Propagate IDs through all layers
  • Sampling: Sample high-volume traces
  • Privacy: Sanitize sensitive data
  • Monitoring: Alert on anomalies

Best Practices

  1. Instrument at service boundaries - Every incoming request and outgoing call should create a span
  2. Propagate context through headers - Use W3C Trace Context (traceparent) for interoperability
  3. Add meaningful span names - “GET /api/users/:id” is better than “http_request”
  4. Include relevant attributes - Add context like user_id, order_id, but avoid high-cardinality PII
  5. Use sampling wisely - 100% sampling is expensive; sample strategically based on errors, latency, or a fixed percentage (see the sampler sketch after this list)
  6. Set span status correctly - Mark spans as error when they fail, include exception details
  7. Create child spans for significant operations - Database queries, cache lookups, and external calls each deserve spans
  8. Connect traces to logs - Include trace_id in log entries to correlate detailed logs with trace context
  9. Use span events for milestones - Events mark significant moments within a span without creating child spans
  10. Monitor trace completeness - Missing spans indicate instrumentation gaps or dropped context
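
For practice 5, OpenTelemetry ships ready-made samplers, and for practice 8 the active trace ID is one lookup away. A sketch (the 10% ratio is an arbitrary example):

import { context, trace } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Practice 5: sample 10% of new traces at the root, but always honor the
// decision an upstream service already made, so traces stay complete
const sampledProvider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});

// Practice 8: stamp the active trace ID onto structured log lines
const traceId = trace.getSpan(context.active())?.spanContext().traceId;
console.log(JSON.stringify({ trace_id: traceId, msg: 'order created' }));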

Common Mistakes

Not propagating trace context: If one service doesn’t forward the traceparent header, the trace breaks and you lose visibility.
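
The receiving side of that propagation, which the earlier handler elides with “or continue from incoming trace context”, looks roughly like this with the OpenTelemetry API (handleOrder is a hypothetical request handler, and HTTP auto-instrumentation normally does the extraction for you):

import { context, propagation, trace, SpanKind } from '@opentelemetry/api';

// Extract the caller's context from incoming headers, then start the server
// span inside it so it becomes a child of the caller's span
app.post('/api/orders', async (req, res) => {
  const parentCtx = propagation.extract(context.active(), req.headers);
  const span = tracer.startSpan('create_order', { kind: SpanKind.SERVER }, parentCtx);
  await context.with(trace.setSpan(parentCtx, span), () => handleOrder(req, res)); // handleOrder is hypothetical
  span.end();
});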

Too many spans (over-instrumentation): Creating spans for every function call creates noise and performance overhead. Focus on service boundaries and significant operations.

Too few spans (under-instrumentation): Only instrumenting the entry point leaves you blind to internal bottlenecks.

Ignoring async operations: Message queues, background jobs, and async processing need trace context propagation too - the trace shouldn’t end when you publish to a queue.
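
A sketch of carrying context across a queue boundary, assuming a hypothetical broker client with a headers field on each message:

import { context, propagation } from '@opentelemetry/api';

// Hypothetical broker client and handler, declared to keep the sketch self-contained
declare const queue: { publish(topic: string, msg: object): Promise<void> };
declare function processOrder(order: unknown): Promise<void>;

// Producer: inject the current trace context into the message headers
async function publishOrderEvent(payload: object) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // writes the traceparent header
  await queue.publish('orders', { headers, body: JSON.stringify(payload) });
}

// Consumer: extract the context so the processing span joins the same trace
async function onOrderMessage(msg: { headers: Record<string, string>; body: string }) {
  const parentCtx = propagation.extract(context.active(), msg.headers);
  await context.with(parentCtx, () => processOrder(JSON.parse(msg.body)));
}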

High-cardinality attributes: Using dynamic values like timestamps or UUIDs as span names or attributes explodes storage costs.

Not sampling in production: Tracing 100% of requests in high-traffic systems is expensive. Use intelligent sampling (error-based, latency-based, or probabilistic).

Treating traces as logs: Traces show request flow and timing, not detailed event logs. Use spans for structure, logs for detail.

Standards & RFCs