Monitoring, Observability & Incident Response

The Three Pillars of Observability

Modern observability rests on three pillars: metrics, logs, and traces. Each answers a different question about a system, so it pays to understand how they complement each other.

Observability vs Monitoring

| Monitoring | Observability |
|---|---|
| Tells you WHEN something is wrong | Helps you understand WHY |
| Predefined questions and dashboards | Explore unknown unknowns |
| Alert when threshold crossed | Investigate novel issues |
| "Is the system healthy?" | "Why is this request slow?" |

The Three Pillars

┌─────────────────────────────────────────────────────────┐
│                    OBSERVABILITY                         │
├─────────────────┬─────────────────┬─────────────────────┤
│     METRICS     │      LOGS       │       TRACES        │
├─────────────────┼─────────────────┼─────────────────────┤
│ What's happening│ What happened   │ How it happened     │
│ (aggregated)    │ (detailed)      │ (distributed flow)  │
├─────────────────┼─────────────────┼─────────────────────┤
│ Prometheus      │ ELK/Loki        │ Jaeger/Zipkin       │
│ Datadog         │ Splunk          │ OpenTelemetry       │
│ Grafana         │ CloudWatch      │ AWS X-Ray           │
└─────────────────┴─────────────────┴─────────────────────┘

Metrics: The "What"

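Aggregated numerical data over time. Every unique combination of label values creates a separate time series, so keep label cardinality bounded: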

# High cardinality metrics (be careful)
http_requests_total{user_id="123"}  # BAD - too many values

# Better approach
http_requests_total{endpoint="/api/users", method="GET"}  # GOOD
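
To make the cardinality concern concrete, the sketch below counts how many distinct time series each labeling scheme would produce. It uses a plain dictionary as a stand-in for a metrics backend; `SeriesCounter` and the synthetic traffic are assumptions made for illustration, not a real Prometheus client API.

```python
from collections import Counter
from itertools import product


class SeriesCounter:
    """Toy stand-in for a labeled counter: one entry per unique label set."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.series: Counter = Counter()

    def inc(self, **labels: str) -> None:
        # Each distinct combination of label values is a separate time series.
        key = tuple(sorted(labels.items()))
        self.series[key] += 1

    def cardinality(self) -> int:
        return len(self.series)


# Synthetic traffic (assumed): 1,000 users hitting 3 endpoints with 2 methods.
users = [f"user-{i}" for i in range(1000)]
endpoints = ["/api/users", "/api/orders", "/api/payments"]
methods = ["GET", "POST"]

by_user = SeriesCounter("http_requests_total")
by_route = SeriesCounter("http_requests_total")

for user, endpoint, method in product(users, endpoints, methods):
    by_user.inc(user_id=user)                       # grows with the user base
    by_route.inc(endpoint=endpoint, method=method)  # bounded set of values

print(by_user.cardinality())   # one series per user
print(by_route.cardinality())  # one series per endpoint/method pair
```

The per-user scheme grows with the number of users, while the per-route scheme stays fixed at the number of endpoint and method combinations.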

When to Use Metrics

| Use Case | Example |
|---|---|
| Alerting | Error rate > 5% |
| Dashboards | Traffic trends |
| Capacity planning | CPU trending up |
| SLO tracking | Availability percentage |
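
The alerting and SLO rows can be sketched as simple arithmetic over request counts. This is a minimal illustration, assuming a hypothetical `WindowCounts` shape for one evaluation window; real alerting rules usually run inside the metrics system rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts for one evaluation window (hypothetical shape)."""
    total: int
    errors: int


def error_rate(window: WindowCounts) -> float:
    # Guard against division by zero when there was no traffic.
    return window.errors / window.total if window.total else 0.0


def should_alert(window: WindowCounts, threshold: float = 0.05) -> bool:
    # Fires only when the error rate is strictly above the threshold.
    return error_rate(window) > threshold


def availability_percent(window: WindowCounts) -> float:
    # Availability here means the share of non-error requests.
    return 100.0 * (1.0 - error_rate(window))


# Synthetic windows (assumed numbers).
healthy = WindowCounts(total=10_000, errors=120)   # 1.2% errors
degraded = WindowCounts(total=10_000, errors=730)  # 7.3% errors

print(should_alert(healthy), should_alert(degraded))
print(round(availability_percent(degraded), 2))
```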

Logs: The "What Happened"

Structured events with context:

{
  "timestamp": "2025-01-04T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user-789",
  "message": "Payment processing failed",
  "error": "Card declined: insufficient funds",
  "duration_ms": 234,
  "metadata": {
    "card_type": "visa",
    "amount": 99.99
  }
}

Logging Best Practices

| Practice | Why |
|---|---|
| Structured logging (JSON) | Machine parseable |
| Include trace/span IDs | Correlate with traces |
| Log levels (DEBUG → ERROR) | Filter by severity |
| Avoid PII in logs | Compliance, security |
| Log context, not secrets | Include request ID, user ID |
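
One way to follow the first two practices is a JSON formatter on Python's standard `logging` module. The sketch below is a minimal example, not a recommended production setup; the `JsonFormatter` class and its `CONTEXT_FIELDS` list are assumptions chosen to mirror the sample event above.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Extra fields copied from the record when present (assumed names).
    CONTEXT_FIELDS = ("service", "trace_id", "span_id", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                event[name] = getattr(record, name)
        return json.dumps(event)


# Write to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-api")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Context travels through `extra`, so it lands as structured fields.
logger.error(
    "Payment processing failed",
    extra={
        "service": "payment-api",
        "trace_id": "abc123",
        "span_id": "def456",
        "duration_ms": 234,
    },
)

parsed = json.loads(buffer.getvalue())
print(parsed["level"], parsed["trace_id"])
```

Passing context through `extra` keeps the message text stable, which makes log lines easier to group and query.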

Log Queries (Loki LogQL)

# Find errors in payment service
{service="payment-api"} |= "ERROR"

# Parse JSON and filter
{service="payment-api"} | json | level="ERROR" | duration_ms > 1000

# Count errors by endpoint
sum by (endpoint) (count_over_time({service="api"} |= "ERROR" [5m]))

Traces: The "How"

Distributed request flow across services:

[User Request]
┌───────────┐    250ms total
│ API Gateway│───────────────────────────────┐
└─────┬─────┘                                │
      │                                      │
      ▼                                      │
┌───────────┐    150ms                       │
│   Auth    │─────────────┐                  │
└─────┬─────┘             │                  │
      │                   │                  │
      ▼                   ▼                  │
┌───────────┐    ┌───────────┐              │
│  Service A│    │   Redis   │              │
│   80ms    │    │   20ms    │              │
└─────┬─────┘    └───────────┘              │
      │                                      │
      ▼                                      │
┌───────────┐                               │
│ Database  │                               │
│   50ms    │                               │
└───────────┘                               │

Trace Structure

Trace (end-to-end request)
├── Span: API Gateway (root span)
│   ├── Span: Auth Service
│   │   └── Span: Redis lookup
│   └── Span: Service A
│       └── Span: Database query
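
The span tree above can be modeled as a small recursive data structure. The sketch below pairs the tree with the durations shown in the earlier diagram, which is an assumption, and computes each span's self time under the further assumption that child spans run one after another. The `Span` class here is illustrative and unrelated to any tracing SDK.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    name: str
    duration_ms: int
    children: List["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        # Time spent in this span itself, assuming sequential children.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span) -> Iterator[Span]:
    # Depth-first traversal yielding every span in the trace.
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_by_self_time(root: Span) -> Span:
    # The span with the most time not explained by its children.
    return max(walk(root), key=lambda s: s.self_time_ms())


trace_root = Span("API Gateway", 250, [
    Span("Auth Service", 150, [Span("Redis lookup", 20)]),
    Span("Service A", 80, [Span("Database query", 50)]),
])

for s in walk(trace_root):
    print(f"{s.name}: total={s.duration_ms}ms self={s.self_time_ms()}ms")

print(slowest_by_self_time(trace_root).name)
```

Ranking by self time rather than total duration avoids always blaming the root span, whose total includes every child.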

OpenTelemetry (Standard)

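from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Initialize tracer; without a span processor and exporter,
# spans are created but never leave the process
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)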

# Create spans
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.amount", 99.99)
    span.set_attribute("user.id", "user-123")

    # Nested span
    with tracer.start_as_current_span("validate_card"):
        # validation logic
        pass

Correlating the Three Pillars

The key is connecting metrics, logs, and traces:

1. ALERT fires (metrics)
   "Error rate > 5% for payment-api"

2. INVESTIGATE with logs
   Search: {service="payment-api"} |= "ERROR"   (time range: last 5m)

3. TRACE specific request
   Find trace_id from log, view in Jaeger
   See exactly which service/query failed
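
The investigate and trace steps can be sketched with in-memory data. `LOGS` and `TRACES` below are hypothetical stand-ins for a log store and a trace backend, and the filter only roughly mirrors the LogQL search (it matches on the `level` field rather than on line contents).

```python
from typing import Dict, List

# Hypothetical log events; in practice these would come from a log store.
LOGS: List[Dict[str, str]] = [
    {"service": "payment-api", "level": "INFO", "trace_id": "t-001"},
    {"service": "payment-api", "level": "ERROR", "trace_id": "t-002"},
    {"service": "payment-api", "level": "ERROR", "trace_id": "t-003"},
    {"service": "auth", "level": "ERROR", "trace_id": "t-004"},
]

# Hypothetical traces keyed by trace_id; each span carries a status.
TRACES: Dict[str, List[Dict[str, str]]] = {
    "t-002": [
        {"span": "process_payment", "status": "ERROR"},
        {"span": "validate_card", "status": "OK"},
        {"span": "charge_card", "status": "ERROR"},
    ],
    "t-003": [
        {"span": "process_payment", "status": "ERROR"},
        {"span": "charge_card", "status": "ERROR"},
    ],
}


def error_trace_ids(logs: List[Dict[str, str]], service: str) -> List[str]:
    # Step 2: find error events for the service named in the alert.
    return [
        e["trace_id"]
        for e in logs
        if e["service"] == service and e["level"] == "ERROR"
    ]


def failing_spans(trace_id: str) -> List[str]:
    # Step 3: open the trace and keep only spans that reported an error.
    return [s["span"] for s in TRACES.get(trace_id, []) if s["status"] == "ERROR"]


def investigate(service: str) -> Dict[str, List[str]]:
    return {tid: failing_spans(tid) for tid in error_trace_ids(LOGS, service)}


report = investigate("payment-api")
for trace_id, spans in report.items():
    print(trace_id, spans)
```

The `trace_id` field is what makes the jump from logs to traces possible, which is why it belongs in every log event.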

Exemplars: Connecting Metrics to Traces

# p99 latency from a Prometheus histogram (exemplars must be enabled on the metric)
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
) # In the dashboard: click an exemplar data point → see trace_id → view trace
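
Conceptually, an exemplar is a sample observation stored next to a histogram bucket together with the trace ID that produced it. The toy class below illustrates that idea only; `ExemplarHistogram` is a hypothetical name, and unlike Prometheus it keeps per-bucket rather than cumulative counts.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


class ExemplarHistogram:
    """Toy latency histogram that remembers one trace_id per bucket."""

    def __init__(self, upper_bounds: List[float]) -> None:
        # A final +Inf bucket catches everything above the largest bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.counts = [0] * len(self.upper_bounds)
        # One exemplar slot per bucket: (observed value, trace_id).
        self.exemplars: List[Optional[Tuple[float, str]]] = (
            [None] * len(self.upper_bounds)
        )

    def _bucket_index(self, seconds: float) -> int:
        # First bucket whose upper bound is >= the observation.
        return bisect_left(self.upper_bounds, seconds)

    def observe(self, seconds: float, trace_id: str) -> None:
        i = self._bucket_index(seconds)
        self.counts[i] += 1
        self.exemplars[i] = (seconds, trace_id)  # keep the most recent sample

    def exemplar_for(self, seconds: float) -> Optional[str]:
        # Return a trace_id from the bucket that would contain this latency.
        sample = self.exemplars[self._bucket_index(seconds)]
        return sample[1] if sample else None


hist = ExemplarHistogram([0.1, 0.5, 1.0])
hist.observe(0.08, "trace-a")
hist.observe(0.30, "trace-b")
hist.observe(2.40, "trace-c")  # slow outlier lands in the +Inf bucket

print(hist.counts)
print(hist.exemplar_for(2.0))
```

Looking up the bucket that holds a slow latency yields a trace ID to open, which is the jump from a metric to a trace that exemplars provide.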

Interview Questions

Q: "A user complains about slow checkout. How do you investigate?"

  1. Metrics: Check p99 latency for checkout endpoint
  2. Logs: Search for that user's requests in timeframe
  3. Traces: Find the slow trace, see which span took longest
  4. Root cause: Database query? External API? Resource contention?
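
Step 1 relies on p99 rather than the average because a few slow requests barely move the mean. The sketch below shows this with a nearest-rank percentile over synthetic latencies; the numbers are invented for illustration, and metrics backends typically estimate percentiles from histogram buckets instead of raw samples.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty sequence")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic checkout latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [120.0] * 98 + [2500.0, 3100.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(percentile(latencies_ms, 50))
print(percentile(latencies_ms, 99))
print(mean_ms)
```

Here the median and mean both look healthy, while p99 exposes the slow requests a complaining user is likely experiencing.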

Q: "How would you implement observability for a new microservice?"

# Minimum viable observability:
1. Metrics:
   - Request rate, error rate, latency (RED)
   - Resource utilization
   - Business metrics (orders/min)

2. Logs:
   - Structured JSON format
   - Include trace_id and span_id
   - Log levels: INFO for normal, ERROR for failures

3. Traces:
   - OpenTelemetry SDK integration
   - Auto-instrument HTTP clients
   - Custom spans for business logic

4. Dashboards:
   - Service overview (golden signals)
   - Dependency map
   - Error breakdown

5. Alerts:
   - Error rate threshold
   - Latency degradation
   - Dependency failures
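
The RED metrics in item 1 can be collected with a small decorator around request handlers. This is a minimal in-memory sketch; `RedMetrics` and the `checkout` handler are hypothetical, and a real service would send these values to a metrics client instead of keeping them in dictionaries.

```python
import time
from collections import defaultdict
from functools import wraps
from typing import Callable, Dict, List


class RedMetrics:
    """In-memory Rate/Errors/Duration recorder (stand-in for a metrics client)."""

    def __init__(self) -> None:
        self.requests: Dict[str, int] = defaultdict(int)
        self.errors: Dict[str, int] = defaultdict(int)
        self.durations: Dict[str, List[float]] = defaultdict(list)

    def instrument(self, endpoint: str) -> Callable:
        def decorator(func: Callable) -> Callable:
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                self.requests[endpoint] += 1
                try:
                    return func(*args, **kwargs)
                except Exception:
                    self.errors[endpoint] += 1
                    raise
                finally:
                    # Duration is recorded for successes and failures alike.
                    elapsed = time.perf_counter() - start
                    self.durations[endpoint].append(elapsed)
            return wrapper
        return decorator

    def error_rate(self, endpoint: str) -> float:
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0


metrics = RedMetrics()


@metrics.instrument("/checkout")
def checkout(amount: float) -> str:
    if amount <= 0:
        raise ValueError("amount must be positive")
    return "ok"


for amount in (10.0, 25.0, -1.0, 40.0):
    try:
        checkout(amount)
    except ValueError:
        pass  # the failure is already counted by the decorator

print(metrics.requests["/checkout"], metrics.errors["/checkout"])
print(metrics.error_rate("/checkout"))
```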

Next, we'll dive into SLOs and Error Budgets, the foundation of SRE reliability management.
