Monitoring, Observability & Incident Response
The Three Pillars of Observability
Modern observability rests on three pillars: metrics, logs, and traces. Understanding how they complement each other is essential.
Observability vs Monitoring
| Monitoring | Observability |
|---|---|
| Tells you WHEN something is wrong | Helps you understand WHY |
| Predefined questions and dashboards | Explore unknown unknowns |
| Alert when threshold crossed | Investigate novel issues |
| "Is the system healthy?" | "Why is this request slow?" |
The Three Pillars
┌─────────────────────────────────────────────────────────┐
│ OBSERVABILITY │
├─────────────────┬─────────────────┬─────────────────────┤
│ METRICS │ LOGS │ TRACES │
├─────────────────┼─────────────────┼─────────────────────┤
│ What's happening│ What happened │ How it happened │
│ (aggregated) │ (detailed) │ (distributed flow) │
├─────────────────┼─────────────────┼─────────────────────┤
│ Prometheus │ ELK/Loki │ Jaeger/Zipkin │
│ Datadog │ Splunk │ OpenTelemetry │
│ Grafana │ CloudWatch │ AWS X-Ray │
└─────────────────┴─────────────────┴─────────────────────┘
Metrics: The "What"
Aggregated numerical data over time:
# High cardinality metrics (be careful)
http_requests_total{user_id="123"} # BAD - too many values
# Better approach
http_requests_total{endpoint="/api/users", method="GET"} # GOOD
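To see why per-user labels are dangerous, here is a hypothetical pure-Python sketch of how a metrics backend stores one time series per unique label combination (the `Counter` class here is illustrative, not the real Prometheus client API):

```python
from collections import defaultdict

class Counter:
    """Toy counter: one stored series per unique label combination."""
    def __init__(self, name):
        self.name = name
        self.series = defaultdict(int)  # label tuple -> value

    def inc(self, **labels):
        self.series[tuple(sorted(labels.items()))] += 1

requests = Counter("http_requests_total")

# Bounded labels: cardinality = endpoints x methods (small, fixed)
for endpoint in ("/api/users", "/api/orders"):
    requests.inc(endpoint=endpoint, method="GET")

# Unbounded label: every new user_id creates a brand-new series
for user_id in range(10_000):
    requests.inc(user_id=str(user_id))

print(len(requests.series))  # → 10002
```

Two bounded labels produced 2 series; one unbounded `user_id` label produced 10,000. Real backends pay memory and query cost per series, which is exactly why the first example above is marked BAD.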
When to Use Metrics
| Use Case | Example |
|---|---|
| Alerting | Error rate > 5% |
| Dashboards | Traffic trends |
| Capacity planning | CPU trending up |
| SLO tracking | Availability percentage |
Logs: The "What Happened"
Structured events with context:
{
"timestamp": "2025-01-04T10:30:00Z",
"level": "ERROR",
"service": "payment-api",
"trace_id": "abc123",
"span_id": "def456",
"user_id": "user-789",
"message": "Payment processing failed",
"error": "Card declined: insufficient funds",
"duration_ms": 234,
"metadata": {
"card_type": "visa",
"amount": 99.99
}
}
Logging Best Practices
| Practice | Why |
|---|---|
| Structured logging (JSON) | Machine parseable |
| Include trace/span IDs | Correlate with traces |
| Log levels (DEBUG → ERROR) | Filter by severity |
| Avoid PII in logs | Compliance, security |
| Log context, not secrets | Request IDs yes; tokens, passwords, card numbers never |
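The practices above need nothing beyond the standard library. A minimal sketch of a JSON formatter, assuming field names that mirror the earlier log example (the `JsonFormatter` class is our own, not a stdlib one):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-api",
            "message": record.getMessage(),
        }
        # Attach correlation IDs when the caller supplies them via `extra=`
        for field in ("trace_id", "span_id"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123", "span_id": "def456"})
```

Because every line is valid JSON, LogQL's `| json` stage (shown below) can parse fields out and filter on them directly.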
Log Queries (Loki LogQL)
# Find errors in payment service
{service="payment-api"} |= "ERROR"
# Parse JSON and filter
{service="payment-api"} | json | level="ERROR" | duration_ms > 1000
# Count errors by endpoint
sum by (endpoint) (count_over_time({service="api"} |= "ERROR" [5m]))
Traces: The "How"
Distributed request flow across services:
[User Request]
│
▼
┌───────────┐ 250ms total
│ API Gateway│───────────────────────────────┐
└─────┬─────┘ │
│ │
▼ │
┌───────────┐ 150ms │
│ Auth │─────────────┐ │
└─────┬─────┘ │ │
│ │ │
▼ ▼ │
┌───────────┐ ┌───────────┐ │
│ Service A│ │ Redis │ │
│ 80ms │ │ 20ms │ │
└─────┬─────┘ └───────────┘ │
│ │
▼ │
┌───────────┐ │
│ Database │ │
│ 50ms │ │
└───────────┘ │
Trace Structure
Trace (end-to-end request)
├── Span: API Gateway (root span)
│ ├── Span: Auth Service
│ │ └── Span: Redis lookup
│ └── Span: Service A
│ └── Span: Database query
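The tree above can be modeled in a few lines of Python. This is a toy data model for building intuition, not the OpenTelemetry API:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span itself, excluding direct children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# The trace from the diagram: one root span with nested child spans
root = Span("API Gateway", 250, [
    Span("Auth Service", 150, [Span("Redis lookup", 20)]),
    Span("Service A", 80, [Span("Database query", 50)]),
])

print(root.self_time_ms())  # → 20
```

Subtracting child durations from each span is the basic trick trace UIs use to highlight where time actually went: here only 20ms of the 250ms total is the gateway itself.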
OpenTelemetry (Standard)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Initialize tracer (console exporter for local dev; swap for OTLP in production)
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.amount", 99.99)
    span.set_attribute("user.id", "user-123")

    # Nested span: process_payment becomes its parent automatically
    with tracer.start_as_current_span("validate_card"):
        # validation logic
        pass
Correlating the Three Pillars
The key is connecting metrics, logs, and traces:
1. ALERT fires (metrics)
"Error rate > 5% for payment-api"
2. INVESTIGATE with logs
Search: {service="payment-api"} |= "ERROR" over the last 5m (the time range is a query parameter, not part of LogQL)
3. TRACE specific request
Find trace_id from log, view in Jaeger
See exactly which service/query failed
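Step 2 to step 3 is just pulling the `trace_id` out of a structured log line. A sketch, assuming logs shaped like the earlier JSON example (the Jaeger URL pattern is a placeholder for your own deployment):

```python
import json

# One line from the log search in step 2
log_line = ('{"level": "ERROR", "service": "payment-api", '
            '"trace_id": "abc123", "message": "Payment processing failed"}')

entry = json.loads(log_line)
trace_id = entry["trace_id"]

# Jump straight from the log to the distributed trace
print(f"http://jaeger.internal/trace/{trace_id}")  # hypothetical Jaeger UI URL
```

This pivot only works if every service writes the active trace ID into its logs, which is why the logging best practices above insist on it.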
Exemplars: Connecting Metrics to Traces
# Prometheus histogram with exemplar
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
) # Click data point → see trace_id → view trace
Interview Questions
Q: "A user complains about slow checkout. How do you investigate?"
- Metrics: Check p99 latency for checkout endpoint
- Logs: Search for that user's requests in timeframe
- Traces: Find the slow trace, see which span took longest
- Root cause: Database query? External API? Resource contention?
Q: "How would you implement observability for a new microservice?"
# Minimum viable observability:
1. Metrics:
- Request rate, error rate, latency (RED)
- Resource utilization
- Business metrics (orders/min)
2. Logs:
- Structured JSON format
- Include trace_id and span_id
- Log levels: INFO for normal, ERROR for failures
3. Traces:
- OpenTelemetry SDK integration
- Auto-instrument HTTP clients
- Custom spans for business logic
4. Dashboards:
- Service overview (golden signals)
- Dependency map
- Error breakdown
5. Alerts:
- Error rate threshold
- Latency degradation
- Dependency failures
Next, we'll dive into SLOs and error budgets, the foundation of SRE reliability management.