Monitoring, Observability & Incident Response

Metrics and Monitoring Fundamentals

Every SRE interview will test your monitoring knowledge. Let's master the concepts and tools.

The Four Golden Signals

Google's SRE book defines four critical metrics:

Signal      What It Measures          Example
Latency     Time to serve requests    p50, p95, p99 response time
Traffic     Demand on the system      Requests per second (RPS)
Errors      Rate of failed requests   5xx responses, failed jobs
Saturation  Resource utilization      CPU, memory, disk usage

Interview tip: When asked "How would you monitor X?", start with these four signals.
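The latency signal is usually reported as percentiles rather than averages, because a mean hides tail latency. As a minimal sketch (the `percentile` helper is hypothetical; real systems derive percentiles from histograms), here is the underlying nearest-rank calculation:

```python
# Sketch: latency percentiles (p50/p95/p99) from raw samples.
# Production systems compute these from histogram buckets; this shows the idea.
def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    # ceil(p/100 * n) as a rank, converted to a 0-based index and clamped
    k = max(0, min(len(ordered) - 1, -(-p * len(ordered) // 100) - 1))
    return ordered[k]

latencies_ms = [12, 15, 11, 300, 14, 13, 12, 16, 18, 1200]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

Note how two slow outliers barely move p50 (14 ms) but dominate p99 (1200 ms); this is why tail percentiles are the signal to alert on.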

USE and RED Methods

USE Method (Infrastructure)

For every resource, check:

  • Utilization: How much is used (0-100%)
  • Saturation: How much work is queued
  • Errors: Error counts
CPU:     Utilization → load average, %CPU
         Saturation  → run queue length
         Errors      → machine check exceptions

Memory:  Utilization → used vs total
         Saturation  → swap usage, OOM events
         Errors      → allocation failures

Disk:    Utilization → %used, I/O bandwidth
         Saturation  → I/O wait time
         Errors      → read/write errors
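In practice an agent like node_exporter collects these for you, but a minimal USE snapshot can be taken with the standard library alone. This sketch (the `use_snapshot` helper is hypothetical, and `os.getloadavg` is Unix-only) covers CPU saturation and disk utilization:

```python
# Sketch: a minimal USE snapshot for one host using only the stdlib.
import os
import shutil

def use_snapshot(path="/"):
    load1, _, _ = os.getloadavg()            # saturation proxy: run queue length
    cpu_count = os.cpu_count() or 1
    disk = shutil.disk_usage(path)           # utilization: fraction of disk used
    return {
        "cpu_saturation": load1 / cpu_count, # > 1.0 means work is queuing per core
        "disk_utilization_pct": 100 * disk.used / disk.total,
    }

print(use_snapshot())
```

Normalizing load average by core count is the key trick: a load of 8 is healthy on 16 cores and saturated on 2.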

RED Method (Services)

For every service:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Time per request (latency)
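A service would normally export these through a client library such as prometheus_client; as a dependency-free sketch, this hypothetical `ServiceStats` class shows what RED tracking amounts to in-process:

```python
# Sketch: tracking RED metrics in-process.
class ServiceStats:
    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.durations = []        # seconds; a histogram in real systems

    def observe(self, duration_s, ok=True):
        self.requests += 1
        if not ok:
            self.errors += 1
        self.durations.append(duration_s)

    def summary(self, window_s):
        return {
            "rate_rps": self.requests / window_s,        # Rate
            "error_rps": self.errors / window_s,         # Errors
            "avg_duration_s": sum(self.durations) / len(self.durations),  # Duration
        }

stats = ServiceStats()
stats.observe(0.120)
stats.observe(0.450, ok=False)
print(stats.summary(window_s=1.0))
```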

Prometheus Fundamentals

Metric Types

Type       Description                        Example
Counter    Only increases (resets on restart) http_requests_total
Gauge      Can go up or down                  temperature_celsius
Histogram  Samples counted in buckets         request_duration_seconds
Summary    Client-side streaming quantiles    request_latency_seconds

PromQL Basics

# Request rate over 5 minutes
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage per pod
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
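It helps in interviews to explain what rate() actually computes. A simplified sketch of the idea, using a hypothetical `counter_rate` function (real rate() also extrapolates to the window boundaries):

```python
# Sketch: what rate() conceptually does with counter samples,
# including the counter-reset handling Prometheus performs.
def counter_rate(samples):
    """samples: list of (timestamp_s, counter_value), oldest first."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter reset to ~0,
        # so the entire new value counts as increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

samples = [(0, 100), (30, 160), (60, 20)]   # reset between t=30 and t=60
print(counter_rate(samples))                # 80 units of increase over 60 s
```

The reset handling is why you apply rate() to raw counters rather than subtracting values yourself: a deploy mid-window would otherwise produce a huge negative spike.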

Grafana Dashboard Design

Best Practices

┌─────────────────────────────────────────────────┐
│  Service Health Overview                        │
├─────────────────────────────────────────────────┤
│  [RPS] [Error Rate] [Latency p99] [Saturation]  │  ← Golden signals
├─────────────────────────────────────────────────┤
│  Request Rate         │  Error Rate             │
│  ████████████████    │  ███░░░░░░░░░           │  ← Time series
├──────────────────────┼──────────────────────────┤
│  Latency Distribution │  Resource Usage         │
│  [p50] [p95] [p99]   │  CPU | Memory | Disk    │
└─────────────────────────────────────────────────┘

Dashboard Variables

# Define variables for filtering
$environment = production, staging, development
$service = api, web, worker
$instance = all instances of selected service

# Use in queries
rate(http_requests_total{env="$environment", service="$service"}[5m])

Alerting Best Practices

Alert Structure

# Prometheus alerting rule
groups:
- name: service-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
      runbook: "https://wiki/runbooks/high-error-rate"

Alert Fatigue Prevention

Practice                       Why
Alert on symptoms, not causes  Users care about errors, not CPU
Include runbook links          Reduce mean time to mitigate
Set appropriate thresholds     Avoid false positives
Use a for: duration            Prevent flapping alerts
Route to the right team        Don't wake the wrong people

Interview Questions

Q: "How do you distinguish between latency and availability issues?"

Metric           High Latency                       Low Availability
Requests         Completing slowly                  Failing or timing out
Error rate       Low (requests succeed)             High (requests fail)
User experience  Slow but works                     Broken
Root cause       Slow backend, resource contention  Service down, network issue

Q: "Your error rate alert fired. Walk me through investigation."

# 1. Quantify the impact
# Check current error rate and trend
# What percentage of users affected?

# 2. Identify the scope
# Which endpoints are failing?
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))

# 3. Check recent changes
# Deployments in last hour?
# Config changes?

# 4. Look at dependencies
# Database healthy?
# External APIs responding?

# 5. Check resources
# CPU/memory/disk saturated?
# Connection pools exhausted?
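Step 2, scoping by endpoint, is simple enough to sketch in plain Python. The request data here is invented for illustration; it mirrors what sum by (endpoint) does in PromQL:

```python
# Sketch: grouping 5xx counts by endpoint to scope an incident.
from collections import Counter

requests = [                       # (endpoint, status) pairs; sample data
    ("/checkout", 500), ("/checkout", 502), ("/search", 200),
    ("/checkout", 500), ("/search", 200), ("/login", 200),
]
errors_by_endpoint = Counter(ep for ep, status in requests if status >= 500)
print(errors_by_endpoint.most_common())   # [('/checkout', 3)]
```

Seeing all errors concentrated on one endpoint immediately narrows the search: a single failing dependency or a bad deploy of one service, rather than a platform-wide outage.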

Next, we'll cover the three pillars of observability: metrics, logs, and traces.
