Monitoring, Observability & Incident Response

Metrics and Monitoring Fundamentals


Every SRE interview will test your monitoring knowledge. Let's master the concepts and tools.

The Four Golden Signals

Google's SRE book defines four critical metrics:

Signal       What It Measures          Example
Latency      Time to serve requests    p50, p95, p99 response time
Traffic      Demand on the system      Requests per second (RPS)
Errors       Rate of failed requests   5xx errors, failed jobs
Saturation   Resource utilization      CPU, memory, disk usage

Interview tip: When asked "How would you monitor X?", start with these four signals.
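
As a concrete starting point, here is one PromQL query per signal — a sketch assuming a service that exposes the conventional http_requests_total and http_request_duration_seconds metrics, plus node_exporter for saturation:

# Latency: p99 response time
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests failing
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: average CPU busy fraction per node
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))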

USE and RED Methods

USE Method (Infrastructure)

For every resource, check:

  • Utilization: How much is used (0-100%)
  • Saturation: How much work is queued
  • Errors: Error counts

Applied to common host resources:
CPU:     Utilization → load average, %CPU
         Saturation  → run queue length
         Errors      → machine check exceptions

Memory:  Utilization → used vs total
         Saturation  → swap usage, OOM events
         Errors      → allocation failures

Disk:    Utilization → %used, I/O bandwidth
         Saturation  → I/O wait time
         Errors      → read/write errors
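
A sketch of those checks as PromQL, assuming standard node_exporter metric names:

# CPU utilization: fraction of time not idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# CPU saturation: 1-minute load average divided by core count
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})

# Memory saturation: swap in use
node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes

# Disk saturation: fraction of time each device was busy with I/O
rate(node_disk_io_time_seconds_total[5m])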

RED Method (Services)

For every service:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Time per request (latency)

Prometheus Fundamentals

Metric Types

Type        Description           Example
Counter     Only increases        http_requests_total
Gauge       Can go up/down        temperature_celsius
Histogram   Samples in buckets    request_duration_seconds
Summary     Quantiles over time   request_latency_seconds
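
To make the histogram type concrete, this is roughly how one renders on a /metrics page: each le bucket is a cumulative counter, alongside a running sum and count (values here are illustrative):

# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 240
request_duration_seconds_bucket{le="0.5"} 310
request_duration_seconds_bucket{le="1"} 318
request_duration_seconds_bucket{le="+Inf"} 320
request_duration_seconds_sum 47.2
request_duration_seconds_count 320

histogram_quantile() interpolates inside these buckets, which is why bucket boundaries limit quantile accuracy.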

PromQL Basics

# Request rate over 5 minutes
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# 95th percentile latency (aggregate buckets across series by "le" first)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# CPU usage per pod
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

Grafana Dashboard Design

Best Practices

┌─────────────────────────────────────────────────┐
│  Service Health Overview                        │
├─────────────────────────────────────────────────┤
│  [RPS] [Error Rate] [Latency p99] [Saturation]  │  ← Golden signals
├─────────────────────────────────────────────────┤
│  Request Rate         │  Error Rate             │
│  ████████████████    │  ███░░░░░░░░░           │  ← Time series
├──────────────────────┼──────────────────────────┤
│  Latency Distribution │  Resource Usage         │
│  [p50] [p95] [p99]   │  CPU | Memory | Disk    │
└─────────────────────────────────────────────────┘

Dashboard Variables

# Define variables for filtering
$environment = production, staging, development
$service = api, web, worker
$instance = all instances of selected service

# Use in queries
rate(http_requests_total{env="$environment", service="$service"}[5m])
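
In Grafana itself, these variables are typically populated with label_values() queries against the Prometheus data source — the label names here (env, service) are assumptions; match them to your own metrics:

# $environment: every value of the "env" label
label_values(http_requests_total, env)

# $service: chained to the selected environment
label_values(http_requests_total{env="$environment"}, service)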

Alerting Best Practices

Alert Structure

# Prometheus alerting rule
groups:
- name: service-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
      runbook: "https://wiki/runbooks/high-error-rate"
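
The same structure extends to the other golden signals. A sketch of a latency rule, assuming the same histogram metric used earlier:

  - alert: HighP99Latency
    expr: |
      histogram_quantile(0.99,
        sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "p99 latency above 1s"
      description: "p99 latency is {{ $value }}s over the last 10 minutes"
      runbook: "https://wiki/runbooks/high-latency"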

Alert Fatigue Prevention

Practice                         Why
Alert on symptoms, not causes    Users care about errors, not CPU
Include runbooks                 Reduce mean time to mitigate
Set appropriate thresholds       Avoid false positives
Use the "for:" duration          Prevent flapping alerts
Route to the right team          Don't wake the wrong people

Interview Questions

Q: "How do you distinguish between latency and availability issues?"

Metric            High Latency                         Low Availability
Requests          Completing slowly                    Failing/timing out
Error rate        Low (requests succeed)               High (requests fail)
User experience   Slow but works                       Broken
Root cause        Backend slow, resource contention    Service down, network issue

Q: "Your error rate alert fired. Walk me through investigation."

# 1. Quantify the impact
# Check current error rate and trend
# What percentage of users affected?

# 2. Identify the scope
# Which endpoints are failing?
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))

# 3. Check recent changes
# Deployments in last hour?
# Config changes?

# 4. Look at dependencies
# Database healthy?
# External APIs responding?

# 5. Check resources
# CPU/memory/disk saturated?
# Connection pools exhausted?
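
Where the service is instrumented, steps 4 and 5 translate to queries like these (upstream_requests_total is a hypothetical client-side metric — substitute whatever your app actually exposes):

# 4. Dependency health: error rate per upstream
sum by (upstream) (rate(upstream_requests_total{status=~"5.."}[5m]))

# 5. Resource saturation on the serving nodes
avg by (instance) (node_load1)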

Next, we'll cover the three pillars of observability: metrics, logs, and traces.
