Monitoring, Observability & Incident Response
Metrics and Monitoring Fundamentals
4 min read
Every SRE interview will test your monitoring knowledge. Let's master the concepts and tools.
The Four Golden Signals
Google's SRE book defines four critical metrics:
| Signal | What It Measures | Example |
|---|---|---|
| Latency | Time to serve requests | p50, p95, p99 response time |
| Traffic | Demand on system | Requests per second (RPS) |
| Errors | Rate of failed requests | 5xx errors, failed jobs |
| Saturation | Resource utilization | CPU, memory, disk usage |
Interview tip: When asked "How would you monitor X?", start with these four signals.
USE and RED Methods
USE Method (Infrastructure)
For every resource, check:
- Utilization: How much is used (0-100%)
- Saturation: How much work is queued
- Errors: Error counts
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | load average, %CPU | run queue length | machine check exceptions |
| Memory | used vs total | swap usage, OOM events | allocation failures |
| Disk | %used, I/O bandwidth | I/O wait time | read/write errors |
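The CPU row of a USE check can be sketched in a few lines of Python, assuming a POSIX host where `os.getloadavg()` is available (the saturation test here is the usual rule of thumb, not a hard standard):

```python
import os

# USE check for the CPU resource on a POSIX host.
cores = os.cpu_count()
load1, load5, load15 = os.getloadavg()

# Utilization proxy: 1-minute load average normalized by core count.
utilization = load1 / cores

# Saturation proxy: load above the core count means runnable work is queuing.
saturated = load1 > cores

print(f"cores={cores} load1={load1:.2f} util={utilization:.2f} saturated={saturated}")
```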
RED Method (Services)
For every service:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Time per request (latency)
Prometheus Fundamentals
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Only increases | http_requests_total |
| Gauge | Can go up/down | temperature_celsius |
| Histogram | Samples in buckets | request_duration_seconds |
| Summary | Quantiles over time | request_latency_seconds |
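The semantics of the first three types are easy to internalize with toy classes. This is a sketch for intuition, not the `prometheus_client` API; in particular, real Prometheus histogram buckets are cumulative, while these store per-bucket counts:

```python
import bisect

class Counter:
    """Monotonically increasing value (e.g. http_requests_total)."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Value that can go up or down (e.g. temperature_celsius)."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets keyed by upper bound."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)            # upper bounds; +Inf is implicit
        self.buckets = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0
    def observe(self, value):
        # Place the observation in the first bucket whose upper bound >= value.
        self.buckets[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value

h = Histogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.3, 0.7, 2.0):
    h.observe(latency)
print(h.buckets)  # one observation landed in each bucket: [1, 1, 1, 1]
```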
PromQL Basics
```promql
# Request rate over 5 minutes
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage per pod
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100
```
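Interviewers often ask how `histogram_quantile` actually works. A simplified Python reimplementation makes the mechanics concrete: find the cumulative bucket the target rank falls into, then linearly interpolate inside it (real Prometheus adds edge-case handling this sketch omits):

```python
def histogram_quantile(q, buckets):
    """Simplified sketch of PromQL's histogram_quantile.

    buckets: sorted list of (upper_bound, cumulative_count); last bound is +Inf.
    """
    total = buckets[-1][1]
    rank = q * total
    lower, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return lower  # cannot interpolate into the +Inf bucket
            # Linear interpolation within the bucket, as Prometheus does.
            frac = (rank - prev_count) / (count - prev_count)
            return lower + (bound - lower) * frac
        lower, prev_count = bound, count

# 100 observations: 60 took <=0.1s, 90 took <=0.5s, all 100 took <=1s.
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The interpolation is why PromQL quantiles are estimates: the answer depends on the bucket boundaries you chose when instrumenting.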
Grafana Dashboard Design
Best Practices
```
┌──────────────────────────────────────────────────┐
│             Service Health Overview              │
├──────────────────────────────────────────────────┤
│ [RPS] [Error Rate] [Latency p99] [Saturation]    │  ← golden signals
├─────────────────────────┬────────────────────────┤
│ Request Rate            │ Error Rate             │
│ ████████████████        │ ███░░░░░░░░░           │  ← time series
├─────────────────────────┼────────────────────────┤
│ Latency Distribution    │ Resource Usage         │
│ [p50] [p95] [p99]       │ CPU | Memory | Disk    │
└─────────────────────────┴────────────────────────┘
```
Dashboard Variables
```
# Define variables for filtering
$environment = production, staging, development
$service     = api, web, worker
$instance    = all instances of selected service

# Use in queries
rate(http_requests_total{env="$environment", service="$service"}[5m])
```
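Under the hood this is plain string templating: Grafana substitutes the selected variable values into the query before sending it to Prometheus. The mechanism can be sketched with Python's `string.Template` (the query string is the one above; the rendering step is illustrative, not Grafana's actual code):

```python
from string import Template

# Grafana-style variable substitution, sketched with stdlib templating.
query = Template(
    'rate(http_requests_total{env="$environment", service="$service"}[5m])'
)
rendered = query.substitute(environment="production", service="api")
print(rendered)
# rate(http_requests_total{env="production", service="api"}[5m])
```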
Alerting Best Practices
Alert Structure
```yaml
# Prometheus alerting rule
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
          runbook: "https://wiki/runbooks/high-error-rate"
```
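The `for: 5m` clause is what keeps this rule from flapping: the alert sits in a pending state and only fires once the expression has been continuously true for the hold duration. A sketch of that state machine, with hypothetical per-minute error-rate samples:

```python
def evaluate(samples, threshold=0.05, hold=5):
    """Return the minute the alert fires, or None if it never does.

    samples: error-rate value observed each minute.
    """
    breach_start = None
    for minute, value in enumerate(samples):
        if value > threshold:
            if breach_start is None:
                breach_start = minute          # alert enters "pending"
            if minute - breach_start + 1 >= hold:
                return minute                  # pending -> firing
        else:
            breach_start = None                # condition cleared; reset
    return None

# A 2-minute spike never fires; a sustained breach fires at minute 8.
spike     = [0.01, 0.08, 0.09, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
sustained = [0.01, 0.01, 0.01, 0.01, 0.06, 0.07, 0.08, 0.09, 0.10]
print(evaluate(spike), evaluate(sustained))  # None 8
```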
Alert Fatigue Prevention
| Practice | Why |
|---|---|
| Alert on symptoms, not causes | Users care about errors, not CPU |
| Include runbooks | Reduce mean time to mitigate |
| Set appropriate thresholds | Avoid false positives |
| Use `for` duration | Prevent flapping alerts |
| Route to right team | Don't wake wrong people |
Interview Questions
Q: "How do you distinguish between latency and availability issues?"
| Metric | High Latency | Low Availability |
|---|---|---|
| Requests | Completing slowly | Failing/timing out |
| Error rate | Low (requests succeed) | High (requests fail) |
| User experience | Slow but works | Broken |
| Root cause | Backend slow, resource contention | Service down, network issue |
Q: "Your error rate alert fired. Walk me through investigation."
```promql
# 1. Quantify the impact
#    Current error rate and trend -- what percentage of users is affected?

# 2. Identify the scope -- which endpoints are failing?
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))

# 3. Check recent changes
#    Deployments in the last hour? Config changes?

# 4. Look at dependencies
#    Database healthy? External APIs responding?

# 5. Check resources
#    CPU/memory/disk saturated? Connection pools exhausted?
```
Next, we'll cover the three pillars of observability: metrics, logs, and traces.