Metrics

Infrastructure & Governance Jan 8, 2026 TYPESCRIPT
observability monitoring performance sla operations

Definition

Imagine trying to understand your car’s health without a dashboard - no speedometer, no fuel gauge, no engine temperature warning light. You’d be driving blind, only discovering problems when something catastrophic happened. Metrics are your application’s dashboard: numerical measurements that tell you how fast you’re going, how much fuel you have left, and whether the engine is overheating.

Metrics are time-series data that capture quantitative measurements about your system’s behavior. Unlike logs (which capture individual events) or traces (which follow request paths), metrics aggregate information into numbers: request rate (requests per second), latency (P50, P95, P99), error rate (percentage of failed requests), and resource utilization (CPU, memory, connections). These numbers tell you the “what” of system health at a glance.

The power of metrics lies in their efficiency and comparability. While storing every request’s full log might be prohibitive, storing “we processed 10,000 requests/second with 99th percentile latency of 200ms” is cheap and immediately useful. Metrics enable dashboards, alerting, capacity planning, and trend analysis. They’re the foundation of SLIs (Service Level Indicators) that define whether you’re meeting your promises to users.
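
That compression is easy to see in code. The sketch below (the RequestRecord shape and summarize helper are invented for illustration, not part of any library) reduces a window of raw request records to the three numbers a dashboard actually needs:

// Illustrative only: collapse raw request records into a handful of metrics
interface RequestRecord {
  durationMs: number;
  failed: boolean;
}

function summarize(records: RequestRecord[], windowSeconds: number) {
  const sorted = records.map(r => r.durationMs).sort((a, b) => a - b);
  const p95 = sorted[Math.max(0, Math.ceil(sorted.length * 0.95) - 1)];
  return {
    requestsPerSecond: records.length / windowSeconds,
    errorRate: records.filter(r => r.failed).length / records.length,
    p95LatencyMs: p95
  };
}

// 10,000 raw records become three numbers that are cheap to store and compare over time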

Example

The Four Golden Signals at Google: Google’s SRE team monitors four key metrics for every service: Latency (how long requests take), Traffic (how many requests you’re getting), Errors (what percentage fail), and Saturation (how “full” your resources are). These four metrics capture the essence of service health.

E-commerce Black Friday Dashboard: Amazon monitors thousands of metrics during peak shopping. Metrics show requests per second climbing from 10K to 100K, checkout latency staying under 500ms, error rate holding at 0.1%, and database connections approaching limits. One glance at the dashboard tells operators if they’re surviving the load.

API Rate Limiting Enforcement: Cloudflare tracks request rates per API key as metrics. When a key exceeds 1000 requests/minute (visible as a spike on the metrics graph), rate limiting kicks in. Historical metrics show which customers consistently hit limits and might need higher quotas.
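
A rough sketch of the enforcement side, assuming a hypothetical in-memory counter keyed by API key (a real edge network would use distributed counters, but the idea is the same):

// Illustrative per-key rate limiter; LIMIT_PER_MINUTE and the reset interval are assumptions
const LIMIT_PER_MINUTE = 1000;
const keyCounts = new Map<string, number>();

setInterval(() => keyCounts.clear(), 60_000); // start a fresh counting window each minute

function allowRequest(apiKey: string): boolean {
  const count = (keyCounts.get(apiKey) ?? 0) + 1;
  keyCounts.set(apiKey, count);
  return count <= LIMIT_PER_MINUTE; // reject once the key exceeds its quota
}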

Netflix Streaming Quality: Netflix monitors metrics like buffering events per viewer-hour, video start time, and bitrate stability. A metric showing “buffering events increased 50% in the last hour in EU region” immediately triggers investigation, often catching issues before users complain.

Kubernetes Cluster Autoscaling: Cloud platforms use metrics (CPU utilization, memory pressure, request queue depth) to automatically scale resources. When average CPU across pods exceeds 80% for 5 minutes, metrics trigger horizontal pod autoscaling to add more instances.
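
The “for 5 minutes” part is what keeps a single CPU spike from adding pods. A minimal sketch of that decision logic (thresholds and names are illustrative, not the actual Kubernetes autoscaler):

// Scale up only when average CPU stays above the threshold for the whole window
const CPU_THRESHOLD = 0.8;  // 80% average utilization
const WINDOW_SAMPLES = 5;   // e.g. five one-minute samples

function shouldScaleUp(recentAvgCpu: number[]): boolean {
  if (recentAvgCpu.length < WINDOW_SAMPLES) return false;
  return recentAvgCpu.slice(-WINDOW_SAMPLES).every(cpu => cpu > CPU_THRESHOLD);
}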

Analogy

The Car Dashboard: Your car’s dashboard shows speed, RPM, fuel level, and engine temperature - all metrics. You don’t need to understand every engine component; the gauges tell you if everything’s normal or if you should pull over immediately. Application metrics serve the same purpose.

The Hospital Vital Signs Monitor: In a hospital, patients are connected to monitors showing heart rate, blood pressure, oxygen saturation, and temperature. Doctors glance at these numbers to assess patient health without running full diagnostic tests every minute. Metrics are your application’s vital signs.

The Stock Market Ticker: Stock prices are metrics - numbers that change over time, immediately visible, comparable across time and across stocks. Just as investors use stock metrics for decisions, operators use application metrics for capacity and reliability decisions.

The Fitness Tracker: Your fitness tracker records steps, heart rate, sleep quality, and calories as numbers over time. Weekly trends show if you’re getting healthier or need to change habits. Application metrics similarly show system health trends.

Code Example

// Prometheus-style metrics implementation (assumes an Express app and prom-client)
import express, { Request, Response, NextFunction } from 'express';
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const app = express();
const registry = new Registry();

// Counter: only goes up (requests, errors, bytes processed)
const requestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [registry]
});

// Histogram: measures distributions (latency, request size)
const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry]
});

// Gauge: can go up or down (connections, queue size, temperature)
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [registry]
});

const queueSize = new Gauge({
  name: 'job_queue_size',
  help: 'Number of jobs waiting in queue',
  labelNames: ['queue_name'],
  registers: [registry]
});

// Middleware to collect metrics
function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const startTime = Date.now();

  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - startTime) / 1000;
    const labels = {
      method: req.method,
      path: req.route?.path || req.path,
      status: res.statusCode.toString()
    };

    requestsTotal.inc(labels);
    requestDuration.observe(
      { method: req.method, path: req.route?.path || req.path },
      duration
    );
    activeConnections.dec();
  });

  next();
}

// Register the middleware so every request is measured
app.use(metricsMiddleware);

// Business metrics
const paymentsProcessed = new Counter({
  name: 'payments_processed_total',
  help: 'Total payments processed',
  labelNames: ['currency', 'status'],
  registers: [registry]
});

const paymentAmount = new Histogram({
  name: 'payment_amount_dollars',
  help: 'Payment amounts in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000, 10000],
  registers: [registry]
});

// Usage in application code (the Payment shape and paymentService are assumed to exist elsewhere)
interface Payment {
  amount: number;
  currency: string;
}

declare const paymentService: { process(payment: Payment): Promise<unknown> };
async function processPayment(payment: Payment) {
  const timer = requestDuration.startTimer({ method: 'POST', path: '/payments' });

  try {
    const result = await paymentService.process(payment);

    paymentsProcessed.inc({
      currency: payment.currency,
      status: 'success'
    });
    paymentAmount.observe(payment.amount);

    return result;
  } catch (error) {
    paymentsProcessed.inc({
      currency: payment.currency,
      status: 'failed'
    });
    throw error;
  } finally {
    timer();
  }
}

// Expose metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});

// Example metrics output:
// # HELP http_requests_total Total number of HTTP requests
// # TYPE http_requests_total counter
// http_requests_total{method="GET",path="/api/users",status="200"} 15420
// http_requests_total{method="POST",path="/api/payments",status="201"} 3241
// http_requests_total{method="POST",path="/api/payments",status="400"} 127
//
// # HELP http_request_duration_seconds HTTP request duration in seconds
// # TYPE http_request_duration_seconds histogram
// http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.1"} 14200
// http_request_duration_seconds_bucket{method="GET",path="/api/users",le="0.5"} 15100
// http_request_duration_seconds_sum{method="GET",path="/api/users"} 892.5
// http_request_duration_seconds_count{method="GET",path="/api/users"} 15420

Diagram

flowchart TB
    subgraph MetricTypes["Metric Types"]
        C[Counter<br>Only increases]
        G[Gauge<br>Up and down]
        H[Histogram<br>Distribution buckets]
        S[Summary<br>Quantiles]
    end

    subgraph Examples["Examples"]
        C1[Requests total<br>Errors count<br>Bytes processed]
        G1[Active connections<br>Queue size<br>Temperature]
        H1[Request latency<br>Response size<br>Payment amounts]
        S1[P50, P95, P99<br>percentiles]
    end

    subgraph Pipeline["Metrics Pipeline"]
        A[Application] --> P[Prometheus<br>Scraper]
        P --> T[(Time Series DB)]
        T --> D[Grafana<br>Dashboard]
        T --> L[Alert Manager]
    end

    C --> C1
    G --> G1
    H --> H1
    S --> S1

    style C fill:#93c5fd
    style G fill:#86efac
    style H fill:#fcd34d
    style S fill:#f9a8d4

Best Practices

  1. Use the RED method for services - Rate (requests/sec), Errors (error rate), Duration (latency) cover most monitoring needs
  2. Use the USE method for resources - Utilization, Saturation, Errors for CPU, memory, disk, network
  3. Choose appropriate metric types - Counters for things that only increase, Gauges for things that go up/down, Histograms for distributions
  4. Use meaningful labels - But not too many; high cardinality labels (like user_id) can explode storage costs
  5. Define SLIs based on metrics - “99% of requests complete in under 200ms” is measurable from latency histograms (see the sketch after this list)
  6. Set up alerts on metrics - Error rate > 1%, latency P99 > 500ms, CPU > 80% for 5 minutes
  7. Create dashboards for different audiences - Executive summary, on-call engineers, and detailed debugging each need different views
  8. Measure business metrics - Orders processed, revenue, user signups - not just technical metrics
  9. Keep metric names consistent - Use naming conventions like service_operation_unit (e.g., http_requests_total)
  10. Monitor your monitoring - Ensure your metrics pipeline itself is healthy and not dropping data
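
As referenced in item 5, an SLI like “99% of requests complete in under 200ms” falls straight out of a latency histogram: divide the cumulative count of requests in buckets at or below 200ms by the total count. A minimal sketch, with invented bucket counts:

// Fraction of requests faster than a target, from cumulative histogram buckets (invented data)
const buckets: Array<{ le: number; count: number }> = [
  { le: 0.1, count: 14200 },
  { le: 0.2, count: 15250 },
  { le: 0.5, count: 15400 },
  { le: Infinity, count: 15420 }  // the +Inf bucket holds the total request count
];

function fractionUnder(thresholdSeconds: number): number {
  const total = buckets[buckets.length - 1].count;
  const under = buckets
    .filter(b => b.le <= thresholdSeconds)
    .reduce((max, b) => Math.max(max, b.count), 0);
  return under / total;
}

console.log(fractionUnder(0.2)); // ~0.989, so a 99% target is just missed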

Common Mistakes

Too many labels (high cardinality): Using user_id as a label creates millions of time series, exploding storage costs and query times.
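
The blow-up is plain multiplication: a metric produces roughly one time series per combination of label values. A quick back-of-the-envelope (the label counts are made up):

// Each label multiplies the number of time series a metric produces
const methods = 10, paths = 100, statuses = 5;
const withoutUserId = methods * paths * statuses;   // 5,000 series - manageable
const withUserId = withoutUserId * 1_000_000;       // 5,000,000,000 series - not manageable
console.log(withoutUserId, withUserId);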

Measuring everything: Collecting every possible metric is expensive and noisy. Focus on metrics that drive decisions.

Ignoring percentiles: Averages hide problems. P99 latency of 10 seconds means 1% of users have terrible experience, even if average is 100ms.
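
A tiny worked example (the latency numbers are invented): 98 requests at 100ms plus two at 10 seconds produce a comfortable-looking average while the P99 exposes the tail:

// The mean looks healthy; the P99 does not
const latenciesMs = [...Array(98).fill(100), 10_000, 10_000];
const mean = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;  // 298ms
const sorted = [...latenciesMs].sort((a, b) => a - b);
const p99 = sorted[Math.ceil(sorted.length * 0.99) - 1];                   // 10,000ms (nearest-rank)
console.log({ mean, p99 });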

No baselines: Without knowing what “normal” looks like, you can’t detect anomalies. Establish baselines before alerting.

Alerting on wrong things: Alerting on CPU utilization is less useful than alerting on error rate or latency - focus on symptoms users experience.

Metric naming chaos: Inconsistent naming (request_count vs requestsTotal vs http_reqs) makes dashboards and queries painful.

Not measuring business outcomes: Technical metrics matter, but ultimately you care about orders placed, users retained, revenue generated.

Gaps in metric collection: Missing metrics from some services creates blind spots. Ensure consistent coverage.

Standards & RFCs

  1. OpenMetrics Specification - CNCF standard for metrics exposition
  2. Prometheus Exposition Format - De facto standard for metrics
  3. StatsD Protocol - UDP-based metrics aggregation