Monitoring, Observability & Incident Response
Metrics and Monitoring Fundamentals
4 min read
Every SRE interview will test your monitoring knowledge. Let's master the concepts and tools.
The Four Golden Signals
Google's SRE book defines four critical metrics:
| Signal | What It Measures | Example |
|---|---|---|
| Latency | Time to serve requests | p50, p95, p99 response time |
| Traffic | Demand on system | Requests per second (RPS) |
| Errors | Rate of failed requests | 5xx errors, failed jobs |
| Saturation | Resource utilization | CPU, memory, disk usage |
Interview tip: When asked "How would you monitor X?", start with these four signals.
USE and RED Methods
USE Method (Infrastructure)
For every resource, check:
- Utilization: How much is used (0-100%)
- Saturation: How much work is queued
- Errors: Error counts
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | load average, %CPU | run queue length | machine check exceptions |
| Memory | used vs total | swap usage, OOM events | allocation failures |
| Disk | %used, I/O bandwidth | I/O wait time | read/write errors |
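The CPU row of a USE check can be sketched in a few lines of Python, assuming a POSIX host where `os.getloadavg()` is available (the saturation test here is the usual rule of thumb, not a hard standard):

```python
import os

# USE check for the CPU resource on a POSIX host.
cores = os.cpu_count()
load1, load5, load15 = os.getloadavg()

# Utilization proxy: 1-minute load average normalized by core count.
utilization = load1 / cores

# Saturation proxy: load above the core count means runnable work is queuing.
saturated = load1 > cores

print(f"cores={cores} load1={load1:.2f} util={utilization:.2f} saturated={saturated}")
```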
RED Method (Services)
For every service:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Time per request (latency)
Prometheus Fundamentals
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Only increases | http_requests_total |
| Gauge | Can go up/down | temperature_celsius |
| Histogram | Samples in buckets | request_duration_seconds |
| Summary | Quantiles over time | request_latency_seconds |
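The semantics of the first three types are easy to internalize with toy classes. This is a sketch for intuition, not the `prometheus_client` API; in particular, real Prometheus histogram buckets are cumulative, while these store per-bucket counts:

```python
import bisect

class Counter:
    """Monotonically increasing value (e.g. http_requests_total)."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Value that can go up or down (e.g. temperature_celsius)."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets keyed by upper bound."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)            # upper bounds; +Inf is implicit
        self.buckets = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0
    def observe(self, value):
        # Place the observation in the first bucket whose upper bound >= value.
        self.buckets[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value

h = Histogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.3, 0.7, 2.0):
    h.observe(latency)
print(h.buckets)  # one observation landed in each bucket: [1, 1, 1, 1]
```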
PromQL Basics
```promql
# Request rate over 5 minutes
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage per pod
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100
```
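Interviewers often ask how `histogram_quantile` actually works. A simplified Python reimplementation makes the mechanics concrete: find the cumulative bucket the target rank falls into, then linearly interpolate inside it (real Prometheus adds edge-case handling this sketch omits):

```python
def histogram_quantile(q, buckets):
    """Simplified sketch of PromQL's histogram_quantile.

    buckets: sorted list of (upper_bound, cumulative_count); last bound is +Inf.
    """
    total = buckets[-1][1]
    rank = q * total
    lower, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return lower  # cannot interpolate into the +Inf bucket
            # Linear interpolation within the bucket, as Prometheus does.
            frac = (rank - prev_count) / (count - prev_count)
            return lower + (bound - lower) * frac
        lower, prev_count = bound, count

# 100 observations: 60 took <=0.1s, 90 took <=0.5s, all 100 took <=1s.
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The interpolation is why PromQL quantiles are estimates: the answer depends on the bucket boundaries you chose when instrumenting.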
Grafana Dashboard Design
Best Practices
```
┌──────────────────────────────────────────────────┐
│             Service Health Overview              │
├──────────────────────────────────────────────────┤
│ [RPS] [Error Rate] [Latency p99] [Saturation]    │  ← golden signals
├─────────────────────────┬────────────────────────┤
│ Request Rate            │ Error Rate             │
│ ████████████████        │ ███░░░░░░░░░           │  ← time series
├─────────────────────────┼────────────────────────┤
│ Latency Distribution    │ Resource Usage         │
│ [p50] [p95] [p99]       │ CPU | Memory | Disk    │
└─────────────────────────┴────────────────────────┘
```
Dashboard Variables
```
# Define variables for filtering
$environment = production, staging, development
$service     = api, web, worker
$instance    = all instances of selected service

# Use in queries
rate(http_requests_total{env="$environment", service="$service"}[5m])
```
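Under the hood this is plain string templating: Grafana substitutes the selected variable values into the query before sending it to Prometheus. The mechanism can be sketched with Python's `string.Template` (the query string is the one above; the rendering step is illustrative, not Grafana's actual code):

```python
from string import Template

# Grafana-style variable substitution, sketched with stdlib templating.
query = Template(
    'rate(http_requests_total{env="$environment", service="$service"}[5m])'
)
rendered = query.substitute(environment="production", service="api")
print(rendered)
# rate(http_requests_total{env="production", service="api"}[5m])
```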
Alerting Best Practices
Alert Structure
```yaml
# Prometheus alerting rule
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
          runbook: "https://wiki/runbooks/high-error-rate"
```
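The `for: 5m` clause is what keeps this rule from flapping: the alert sits in a pending state and only fires once the expression has been continuously true for the hold duration. A sketch of that state machine, with hypothetical per-minute error-rate samples:

```python
def evaluate(samples, threshold=0.05, hold=5):
    """Return the minute the alert fires, or None if it never does.

    samples: error-rate value observed each minute.
    """
    breach_start = None
    for minute, value in enumerate(samples):
        if value > threshold:
            if breach_start is None:
                breach_start = minute          # alert enters "pending"
            if minute - breach_start + 1 >= hold:
                return minute                  # pending -> firing
        else:
            breach_start = None                # condition cleared; reset
    return None

# A 2-minute spike never fires; a sustained breach fires at minute 8.
spike     = [0.01, 0.08, 0.09, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
sustained = [0.01, 0.01, 0.01, 0.01, 0.06, 0.07, 0.08, 0.09, 0.10]
print(evaluate(spike), evaluate(sustained))  # None 8
```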
Alert Fatigue Prevention
| Practice | Why |
|---|---|
| Alert on symptoms, not causes | Users care about errors, not CPU |
| Include runbooks | Reduce mean time to mitigate |
| Set appropriate thresholds | Avoid false positives |
| Use `for` duration | Prevent flapping alerts |
| Route to right team | Don't wake wrong people |
Interview Questions
Q: "How do you distinguish between latency and availability issues?"
| Metric | High Latency | Low Availability |
|---|---|---|
| Requests | Completing slowly | Failing/timing out |
| Error rate | Low (requests succeed) | High (requests fail) |
| User experience | Slow but works | Broken |
| Root cause | Backend slow, resource contention | Service down, network issue |
Q: "Your error rate alert fired. Walk me through investigation."
```promql
# 1. Quantify the impact
#    Current error rate and trend -- what percentage of users is affected?

# 2. Identify the scope -- which endpoints are failing?
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))

# 3. Check recent changes
#    Deployments in the last hour? Config changes?

# 4. Look at dependencies
#    Database healthy? External APIs responding?

# 5. Check resources
#    CPU/memory/disk saturated? Connection pools exhausted?
```
Next, we'll cover the three pillars of observability: metrics, logs, and traces.