Monitoring, Observability & Incident Response

The Three Pillars of Observability

Modern observability rests on three pillars: metrics, logs, and traces. Understanding how they complement each other is essential.

Observability vs Monitoring

Monitoring                             Observability
Tells you WHEN something is wrong      Helps you understand WHY
Predefined questions and dashboards    Explore unknown unknowns
Alert when threshold crossed           Investigate novel issues
"Is the system healthy?"               "Why is this request slow?"

The Three Pillars

┌─────────────────────────────────────────────────────────┐
│                    OBSERVABILITY                         │
├─────────────────┬─────────────────┬─────────────────────┤
│     METRICS     │      LOGS       │       TRACES        │
├─────────────────┼─────────────────┼─────────────────────┤
│ What's happening│ What happened   │ How it happened     │
│ (aggregated)    │ (detailed)      │ (distributed flow)  │
├─────────────────┼─────────────────┼─────────────────────┤
│ Prometheus      │ ELK/Loki        │ Jaeger/Zipkin       │
│ Datadog         │ Splunk          │ OpenTelemetry       │
│ Grafana         │ CloudWatch      │ AWS X-Ray           │
└─────────────────┴─────────────────┴─────────────────────┘

Metrics: The "What"

Aggregated numerical data over time:

# High-cardinality metrics (be careful)
http_requests_total{user_id="123"}  # BAD - one time series per user

# Better approach
http_requests_total{endpoint="/api/users", method="GET"}  # GOOD - bounded label values

When to Use Metrics

Use Case            Example
Alerting            Error rate > 5%
Dashboards          Traffic trends
Capacity planning   CPU trending up
SLO tracking        Availability percentage
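
A minimal instrumentation sketch using the prometheus_client Python library shows these label choices in practice. The status label, handler function, and port below are illustrative additions, not part of the examples above:

from prometheus_client import Counter, Histogram, start_http_server

# Low-cardinality labels only: endpoint and method, never user_id
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["endpoint", "method", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["endpoint"],
)

def handle_get_users():
    # .time() records the block's duration into the histogram
    with LATENCY.labels(endpoint="/api/users").time():
        # ... handler logic ...
        REQUESTS.labels(endpoint="/api/users", method="GET", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape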

Logs: The "What Happened"

Structured events with context:

{
  "timestamp": "2025-01-04T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user-789",
  "message": "Payment processing failed",
  "error": "Card declined: insufficient funds",
  "duration_ms": 234,
  "metadata": {
    "card_type": "visa",
    "amount": 99.99
  }
}

Logging Best Practices

Practice                      Why
Structured logging (JSON)     Machine parseable
Include trace/span IDs        Correlate with traces
Log levels (DEBUG → ERROR)    Filter by severity
Avoid PII in logs             Compliance, security
Log context, not secrets      Include request ID, user ID
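
Here is a minimal sketch of these practices using only Python's standard logging module (a library such as structlog is a common production choice); the field names mirror the example entry above:

import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    converter = time.gmtime  # emit UTC timestamps

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-api",
            "message": record.getMessage(),
        }
        # Attach context passed via `extra=` (trace/span IDs for correlation)
        for key in ("trace_id", "span_id", "user_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123", "span_id": "def456", "user_id": "user-789"})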

Log Queries (Loki LogQL)

# Find errors in payment service
{service="payment-api"} |= "ERROR"

# Parse JSON and filter
{service="payment-api"} | json | level="ERROR" | duration_ms > 1000

# Count errors by endpoint (endpoint parsed from the JSON body)
sum by (endpoint) (count_over_time({service="api"} |= "ERROR" | json [5m]))

Traces: The "How"

Distributed request flow across services:

[User Request]
┌─────────────┐   250ms total
│ API Gateway │──────────────────────────────┐
└──────┬──────┘                              │
       │                                     │
       ▼                                     │
┌─────────────┐   150ms                      │
│    Auth     │────────────┐                 │
└──────┬──────┘            │                 │
       │                   │                 │
       ▼                   ▼                 │
┌─────────────┐    ┌─────────────┐           │
│  Service A  │    │    Redis    │           │
│    80ms     │    │    20ms     │           │
└──────┬──────┘    └─────────────┘           │
       │                                     │
       ▼                                     │
┌─────────────┐                              │
│  Database   │                              │
│    50ms     │                              │
└─────────────┘                              │

Trace Structure

Trace (end-to-end request)
└── Span: API Gateway (root span)
    ├── Span: Auth Service
    │   └── Span: Redis lookup
    └── Span: Service A
        └── Span: Database query

OpenTelemetry (Standard)

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Initialize tracer (console exporter for demos; use an OTLP exporter in production)
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.amount", 99.99)
    span.set_attribute("user.id", "user-123")

    # Nested span (becomes a child of process_payment)
    with tracer.start_as_current_span("validate_card"):
        # validation logic
        pass

Correlating the Three Pillars

The key is connecting metrics, logs, and traces:

1. ALERT fires (metrics)
   "Error rate > 5% for payment-api"

2. INVESTIGATE with logs
   Search: {service="payment-api"} |= "ERROR" over the last 5m

3. TRACE specific request
   Find trace_id from log, view in Jaeger
   See exactly which service/query failed
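
The pivot in step 3 is the trace_id itself: copy it from the log entry and open it in your tracing UI. A tiny sketch that builds a Jaeger deep link (the base URL is hypothetical; /trace/{id} is the Jaeger UI's standard route):

# Hypothetical internal Jaeger address; adjust for your deployment
JAEGER_UI = "https://jaeger.example.com"

def trace_url(trace_id: str) -> str:
    """Build a deep link from a trace_id found in a log line."""
    return f"{JAEGER_UI}/trace/{trace_id}"

# trace_id "abc123" comes from the ERROR log entry shown earlier
print(trace_url("abc123"))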

Exemplars: Connecting Metrics to Traces

# Prometheus histogram with exemplar
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
) # Click data point → see trace_id → view trace
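
On the instrumentation side, an exemplar attaches a trace_id to an individual observation. A sketch with prometheus_client, assuming version 0.11+ (which added exemplar support); note that exemplars are only exposed via the OpenMetrics format:

from prometheus_client import Histogram

LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["endpoint"],
)

# Attach the current trace_id as an exemplar alongside the observation;
# Grafana can then render it as a clickable dot on the latency panel
LATENCY.labels(endpoint="/checkout").observe(
    0.234,
    exemplar={"trace_id": "abc123"},
)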

Interview Questions

Q: "A user complains about slow checkout. How do you investigate?"

  1. Metrics: Check p99 latency for checkout endpoint
  2. Logs: Search for that user's requests in timeframe
  3. Traces: Find the slow trace, see which span took longest
  4. Root cause: Database query? External API? Resource contention?

Q: "How would you implement observability for a new microservice?"

# Minimum viable observability:
1. Metrics:
   - Request rate, error rate, latency (RED)
   - Resource utilization
   - Business metrics (orders/min)

2. Logs:
   - Structured JSON format
   - Include trace_id and span_id
   - Log levels: INFO for normal, ERROR for failures

3. Traces:
   - OpenTelemetry SDK integration
   - Auto-instrument HTTP clients (see the sketch after this list)
   - Custom spans for business logic

4. Dashboards:
   - Service overview (golden signals)
   - Dependency map
   - Error breakdown

5. Alerts:
   - Error rate threshold
   - Latency degradation
   - Dependency failures
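
For the auto-instrumentation step in item 3, OpenTelemetry ships per-library instrumentors. A minimal sketch, assuming the opentelemetry-instrumentation-requests package is installed:

import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# After this call, every outgoing requests.* call creates a client span
# and propagates trace context headers to downstream services
RequestsInstrumentor().instrument()

requests.get("https://example.com/api/users")  # traced automatically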

Next, we'll dive into SLOs and Error Budgets, the foundation of SRE reliability management.
