Tracing

Infrastructure & Governance Security Notes Jan 8, 2026 TYPESCRIPT
observability distributed-systems debugging performance microservices

Definition

When you click “Buy Now” on an e-commerce site, that single click might trigger calls to dozens of services: shopping cart, inventory, pricing, fraud detection, payment processing, shipping calculation, email notification. If the checkout takes 10 seconds instead of 2, which service is the bottleneck? If it fails, which service caused the failure? Without distributed tracing, answering these questions requires heroic detective work across multiple log systems.

Distributed tracing solves this by following a request’s complete journey through your system. It creates a trace - a tree of spans where each span represents one operation (an API call, a database query, a message queue operation). Each span records when it started, when it ended, what service handled it, and any relevant metadata. The magic is that all spans in a request share a common trace ID, so you can reconstruct the entire journey from start to finish.
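
To make that concrete, here is a minimal sketch of the data each span carries and how a backend reassembles the tree; the field names are illustrative, not any particular vendor's schema:

// Minimal span shape (illustrative field names, not a vendor schema)
interface Span {
  traceId: string;        // shared by every span in one request
  spanId: string;         // unique to this operation
  parentSpanId?: string;  // absent on the root span
  name: string;           // e.g. "db.query" or "POST /api/orders"
  startTime: number;      // timestamps give per-operation duration
  endTime: number;
  attributes: Record<string, string | number | boolean>;
}

// A backend reconstructs the journey by grouping spans on traceId
// and linking each span to its parent
function children(spans: Span[], parentId?: string): Span[] {
  return spans.filter(s => s.parentSpanId === parentId);
}
// children(spans) with no parentId returns the root span(s)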

Think of it like a package tracking system that shows not just “your package arrived” but every warehouse it passed through, how long it spent at each one, and which truck carried it between locations. Tracing answers “why is this slow?” and “what called what?” in ways that logs and metrics alone cannot. When combined with correlation IDs propagated through headers, traces become an essential debugging tool for microservices.
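
Concretely, the W3C Trace Context standard carries that correlation in a single traceparent header; the values below are the spec's own example:

// traceparent format: {version}-{trace-id}-{parent-span-id}-{trace-flags}
const traceparent = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';

const [version, traceId, parentId, flags] = traceparent.split('-');
// traceId:  32 hex chars, shared by every span in the request
// parentId: 16 hex chars, the span ID of the immediate caller
// flags:    '01' means the upstream service sampled this trace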

Example

Debugging Slow Checkout at Shopify: A customer reports checkout taking 8 seconds. Tracing shows: API Gateway (50ms) → Cart Service (100ms) → Inventory Service (6000ms!) → Payment Service (500ms) → Email Service (200ms). The trace immediately identifies inventory as the bottleneck - it’s waiting on a slow database query visible as a child span.

Netflix Request Flow: When you press play, Netflix’s trace might show: Edge Service → API Gateway → User Service → Entitlement Service → Steering Service → CDN Selection → Playback Service. Each span shows timing, and the visualization reveals that CDN selection took 2 seconds in a specific region due to a failing health check.

Uber Ride Request: Tracing a ride request shows: App → API Gateway → Driver Matching → ETA Calculation → Pricing → Payment Authorization → Driver Notification. If matching is slow, the trace shows exactly which matching algorithm stage is the problem and how many driver candidates were evaluated.

AWS Lambda Cold Start Detection: Traces reveal that a Lambda function’s first invocation takes 3 seconds (cold start) while subsequent calls take 100ms. The trace shows initialization time separately from execution time, helping engineers optimize cold start performance.

Database N+1 Query Detection: A trace for an API endpoint shows 100 child spans, each a database query taking 10ms. The parent span totals 1 second. The trace visualization immediately shows the N+1 problem - 100 sequential queries that should be one batch query.
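
A sketch of the before/after that such a trace points to, using a hypothetical db.query helper:

// Hypothetical query helper, declared only to keep the sketch self-contained
declare const db: { query(sql: string, params: unknown[]): Promise<unknown[]> };

// N+1 shape: 100 sequential child spans of ~10ms each
async function loadAuthorsSlow(postIds: string[]) {
  const authors = [];
  for (const id of postIds) {
    authors.push(await db.query('SELECT author FROM posts WHERE id = $1', [id]));
  }
  return authors;
}

// Batched shape: one child span, one round trip
async function loadAuthorsBatched(postIds: string[]) {
  return db.query('SELECT author FROM posts WHERE id = ANY($1)', [postIds]);
}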

Analogy

The Package Tracking System: When you order from Amazon, you can track your package through every facility: “Picked up at Los Angeles warehouse at 10 AM, arrived at Phoenix hub at 2 PM, out for delivery at 8 AM.” Distributed tracing does this for your request - you can see every “stop” it made and how long it spent there.

The Airport Journey: Your flight involves many steps: check-in counter (5 min), security screening (15 min), walking to gate (10 min), boarding (20 min), flight (2 hours), deplaning (10 min), baggage claim (15 min). If someone asks “why did the trip take 4 hours?”, you can point to exactly which step was slow. That’s what traces show for requests.

The Relay Race: In a relay race, each runner’s split time is recorded. You can see that runner 2 was 3 seconds slower than expected, immediately identifying where time was lost. Each runner is a span in the trace.

The Crime Scene Investigation: Detectives trace a suspect’s movements through the city: “Left home at 8 AM (camera footage), arrived at bank at 8:30 (transaction record), gas station at 9 AM (receipt).” They’re reconstructing a journey from disparate evidence. Distributed tracing automatically collects this evidence for every request.

Code Example

// OpenTelemetry distributed tracing setup
import { trace, context, propagation, SpanKind, SpanStatusCode } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

// Initialize tracer
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(new JaegerExporter({ endpoint: 'http://jaeger:14268/api/traces' }))
);
provider.register({ propagator: new W3CTraceContextPropagator() });

const tracer = trace.getTracer('order-service', '1.0.0');

// Trace an HTTP request handler
app.post('/api/orders', async (req, res) => {
  // Start a new span (or continue from incoming trace context)
  const span = tracer.startSpan('create_order', {
    kind: SpanKind.SERVER,
    attributes: {
      'http.method': 'POST',
      'http.url': req.url,
      'order.customer_id': req.body.customerId
    }
  });

  try {
    // Wrap all operations in the span's context
    await context.with(trace.setSpan(context.active(), span), async () => {
      // Each service call creates a child span
      const items = await validateInventory(req.body.items);
      const payment = await processPayment(req.body.payment);
      const order = await createOrder(req.body.customerId, items, payment);

      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total', order.total);

      res.json(order);
    });

    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    span.recordException(error);
    res.status(500).json({ error: 'Order creation failed' });
  } finally {
    span.end();
  }
});

// Child span for downstream service call
async function validateInventory(items: OrderItem[]): Promise<ValidatedItem[]> {
  return tracer.startActiveSpan('validate_inventory', {
    kind: SpanKind.CLIENT,
    attributes: { 'inventory.item_count': items.length }
  }, async (span) => {
    try {
      // Propagate trace context to the downstream service; the registered
      // propagator writes the traceparent header into the carrier object
      const headers: Record<string, string> = {};
      propagation.inject(context.active(), headers);

      const response = await fetch('http://inventory-service/validate', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          ...headers  // Contains traceparent header
        },
        body: JSON.stringify(items)
      });

      span.setAttribute('inventory.validated', true);
      // return await so the span covers body parsing and a parse failure hits the catch
      return await response.json();
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Database operation with span
async function createOrder(customerId: string, items: ValidatedItem[], payment: PaymentResult) {
  return tracer.startActiveSpan('db.create_order', {
    kind: SpanKind.INTERNAL,
    attributes: {
      'db.system': 'postgresql',
      'db.operation': 'INSERT',
      'db.table': 'orders'
    }
  }, async (span) => {
    try {
      const order = await db.orders.create({
        customerId,
        items,
        paymentId: payment.id,
        status: 'confirmed'
      });

      span.setAttribute('db.rows_affected', 1);
      return order;
    } finally {
      // End the span even if the insert throws, so the span is not leaked
      span.end();
    }
  });
}

// Example trace output (conceptual):
// Trace ID: abc123
// └── create_order (server, 500ms)
//     ├── validate_inventory (client, 150ms)
//     │   └── [inventory-service] check_stock (server, 100ms)
//     │       └── db.query (internal, 50ms)
//     ├── process_payment (client, 300ms)
//     │   └── [payment-service] charge_card (server, 250ms)
//     │       ├── fraud_check (internal, 50ms)
//     │       └── gateway_call (client, 150ms)
//     └── db.create_order (internal, 50ms)

Diagram

flowchart TB
    subgraph TraceStructure["Trace Structure (Trace ID: abc-123)"]
        A["Root Span<br/>API Gateway<br/>0-500ms"]
        B["Child Span<br/>Order Service<br/>10-400ms"]
        C["Child Span<br/>Inventory Check<br/>50-150ms"]
        D["Child Span<br/>Payment Service<br/>150-350ms"]
        E["Child Span<br/>DB Write<br/>350-380ms"]
        F["Grandchild<br/>Fraud Check<br/>160-200ms"]
        G["Grandchild<br/>Card Charge<br/>200-340ms"]
    end

    A --> B
    B --> C
    B --> D
    B --> E
    D --> F
    D --> G

    subgraph Propagation["Context Propagation"]
        H[traceparent header]
        I[W3C Trace Context]
        J["Service A → Service B"]
    end

    H --> I --> J

    style A fill:#93c5fd
    style D fill:#fcd34d
    style G fill:#f87171

Security Notes

CRITICAL: Distributed tracing follows requests across every service boundary, making it essential for debugging and monitoring - and making the trace data itself a sensitive asset that must be protected.

Tracing Components:

  • Trace ID: Unique ID for entire request flow
  • Span ID: Unique ID for operation within trace
  • Parent ID: Link to parent span
  • Timestamps: When span started/ended
  • Tags: Key-value metadata

Tracing Standards & Platforms:

  • OpenTelemetry: Vendor-neutral instrumentation standard (APIs, SDKs, OTLP)
  • Jaeger: Open-source distributed tracing platform
  • Zipkin: Open-source distributed tracing system
  • AWS X-Ray: Managed tracing service on AWS
  • Google Cloud Trace: Managed tracing service on GCP

Security Considerations (see the redaction sketch after this list):

  • PII leakage: Don’t log PII in traces
  • Sensitive data: Don’t include passwords/keys in traces
  • Trace access: Restrict access to trace data
  • Retention: Define trace data retention policy
  • Sampling: Sample to reduce data volume
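
One way to enforce the PII and secrets rules above is to scrub attribute values before they ever reach a span. A minimal sketch; the denylist pattern is only an example to extend for your domain:

// Denylist-based scrubber applied before attributes reach a span
const SENSITIVE_KEYS = /password|secret|token|authorization|card|ssn|email/i;

function safeAttributes(
  attrs: Record<string, string | number | boolean>
): Record<string, string | number | boolean> {
  const out: Record<string, string | number | boolean> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] = SENSITIVE_KEYS.test(key) ? '[REDACTED]' : value;
  }
  return out;
}

// Usage: span.setAttributes(safeAttributes({ 'user.email': 'a@b.com', 'order.id': '42' }));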

Implementation:

  • Propagate trace ID: Pass trace context through every service hop
  • Instrumentation: Instrument service boundaries and significant operations
  • Correlation: Tie related operations together under one trace ID
  • Performance: Keep tracing overhead minimal (batching, async export)
  • Storage: Centralize trace storage so traces can be queried across services

Benefits:

  • Debugging: Trace request through system
  • Performance: Identify bottlenecks
  • Errors: Correlate errors across services
  • Latency: Measure latency per service
  • Dependencies: Visualize service dependencies

Best Practices:

  • Unique IDs: Generate globally unique, random trace IDs
  • Propagation: Propagate IDs through all layers
  • Sampling: Sample high-volume traces
  • Privacy: Sanitize sensitive data
  • Monitoring: Alert on anomalies

Best Practices

  1. Instrument at service boundaries - Every incoming request and outgoing call should create a span
  2. Propagate context through headers - Use W3C Trace Context (traceparent) for interoperability
  3. Add meaningful span names - “GET /api/users/:id” is better than “http_request”
  4. Include relevant attributes - Add context like user_id, order_id, but avoid high-cardinality PII
  5. Use sampling wisely - 100% sampling is expensive; sample strategically based on errors, latency, or a fixed percentage (see the sampler sketch after this list)
  6. Set span status correctly - Mark spans as error when they fail, include exception details
  7. Create child spans for significant operations - Database queries, cache lookups, and external calls each deserve spans
  8. Connect traces to logs - Include trace_id in log entries to correlate detailed logs with trace context
  9. Use span events for milestones - Events mark significant moments within a span without creating child spans
  10. Monitor trace completeness - Missing spans indicate instrumentation gaps or dropped context
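
For practice 5, OpenTelemetry ships ready-made samplers, and for practice 8 the active trace ID is one lookup away. A sketch (the 10% ratio is an arbitrary example):

import { context, trace } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Practice 5: sample 10% of new traces at the root, but always honor the
// decision an upstream service already made, so traces stay complete
const sampledProvider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});

// Practice 8: stamp the active trace ID onto structured log lines
const traceId = trace.getSpan(context.active())?.spanContext().traceId;
console.log(JSON.stringify({ trace_id: traceId, msg: 'order created' }));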

Common Mistakes

Not propagating trace context: If one service doesn’t forward the traceparent header, the trace breaks and you lose visibility.
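
The receiving side of that propagation, which the earlier handler elides with “or continue from incoming trace context”, looks roughly like this with the OpenTelemetry API (handleOrder is a hypothetical request handler, and HTTP auto-instrumentation normally does the extraction for you):

import { context, propagation, trace, SpanKind } from '@opentelemetry/api';

// Extract the caller's context from incoming headers, then start the server
// span inside it so it becomes a child of the caller's span
app.post('/api/orders', async (req, res) => {
  const parentCtx = propagation.extract(context.active(), req.headers);
  const span = tracer.startSpan('create_order', { kind: SpanKind.SERVER }, parentCtx);
  await context.with(trace.setSpan(parentCtx, span), () => handleOrder(req, res)); // handleOrder is hypothetical
  span.end();
});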

Too many spans (over-instrumentation): Creating spans for every function call creates noise and performance overhead. Focus on service boundaries and significant operations.

Too few spans (under-instrumentation): Only instrumenting the entry point leaves you blind to internal bottlenecks.

Ignoring async operations: Message queues, background jobs, and async processing need trace context propagation too - the trace shouldn’t end when you publish to a queue.
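
A sketch of carrying context across a queue boundary, assuming a hypothetical broker client with a headers field on each message:

import { context, propagation } from '@opentelemetry/api';

// Hypothetical broker client and handler, declared to keep the sketch self-contained
declare const queue: { publish(topic: string, msg: object): Promise<void> };
declare function processOrder(order: unknown): Promise<void>;

// Producer: inject the current trace context into the message headers
async function publishOrderEvent(payload: object) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // writes the traceparent header
  await queue.publish('orders', { headers, body: JSON.stringify(payload) });
}

// Consumer: extract the context so the processing span joins the same trace
async function onOrderMessage(msg: { headers: Record<string, string>; body: string }) {
  const parentCtx = propagation.extract(context.active(), msg.headers);
  await context.with(parentCtx, () => processOrder(JSON.parse(msg.body)));
}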

High-cardinality attributes: Using dynamic values like timestamps or UUIDs as span names or attributes explodes storage costs.

Not sampling in production: Tracing 100% of requests in high-traffic systems is expensive. Use intelligent sampling (error-based, latency-based, or probabilistic).

Treating traces as logs: Traces show request flow and timing, not detailed event logs. Use spans for structure, logs for detail.

Standards & RFCs