Monitoring, Observability & Incident Response
The Three Pillars of Observability
Modern observability rests on three pillars: metrics, logs, and traces. Understanding how they complement each other is essential.
Observability vs Monitoring
| Monitoring | Observability |
|---|---|
| Tells you WHEN something is wrong | Helps you understand WHY |
| Predefined questions and dashboards | Explore unknown unknowns |
| Alert when threshold crossed | Investigate novel issues |
| "Is the system healthy?" | "Why is this request slow?" |
The Three Pillars
┌─────────────────────────────────────────────────────────┐
│ OBSERVABILITY │
├─────────────────┬─────────────────┬─────────────────────┤
│ METRICS │ LOGS │ TRACES │
├─────────────────┼─────────────────┼─────────────────────┤
│ What's happening│ What happened │ How it happened │
│ (aggregated) │ (detailed) │ (distributed flow) │
├─────────────────┼─────────────────┼─────────────────────┤
│ Prometheus │ ELK/Loki │ Jaeger/Zipkin │
│ Datadog │ Splunk │ OpenTelemetry │
│ Grafana │ CloudWatch │ AWS X-Ray │
└─────────────────┴─────────────────┴─────────────────────┘
Metrics: The "What"
Aggregated numerical data over time:
# High cardinality metrics (be careful)
http_requests_total{user_id="123"} # BAD - too many values
# Better approach
http_requests_total{endpoint="/api/users", method="GET"} # GOOD
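To see why per-user labels are dangerous, here is a hypothetical pure-Python sketch of how a metrics backend stores one time series per unique label combination (the `Counter` class here is illustrative, not the real Prometheus client API):

```python
from collections import defaultdict

class Counter:
    """Toy counter: one stored series per unique label combination."""
    def __init__(self, name):
        self.name = name
        self.series = defaultdict(int)  # label tuple -> value

    def inc(self, **labels):
        self.series[tuple(sorted(labels.items()))] += 1

requests = Counter("http_requests_total")

# Bounded labels: cardinality = endpoints x methods (small, fixed)
for endpoint in ("/api/users", "/api/orders"):
    requests.inc(endpoint=endpoint, method="GET")

# Unbounded label: every new user_id creates a brand-new series
for user_id in range(10_000):
    requests.inc(user_id=str(user_id))

print(len(requests.series))  # → 10002
```

Two bounded labels produced 2 series; one unbounded `user_id` label produced 10,000. Real backends pay memory and query cost per series, which is exactly why the first example above is marked BAD.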
When to Use Metrics
| Use Case | Example |
|---|---|
| Alerting | Error rate > 5% |
| Dashboards | Traffic trends |
| Capacity planning | CPU trending up |
| SLO tracking | Availability percentage |
Logs: The "What Happened"
Structured events with context:
{
"timestamp": "2025-01-04T10:30:00Z",
"level": "ERROR",
"service": "payment-api",
"trace_id": "abc123",
"span_id": "def456",
"user_id": "user-789",
"message": "Payment processing failed",
"error": "Card declined: insufficient funds",
"duration_ms": 234,
"metadata": {
"card_type": "visa",
"amount": 99.99
}
}
Logging Best Practices
| Practice | Why |
|---|---|
| Structured logging (JSON) | Machine parseable |
| Include trace/span IDs | Correlate with traces |
| Log levels (DEBUG → ERROR) | Filter by severity |
| Avoid PII in logs | Compliance, security |
| Log context, not secrets | Request IDs yes; tokens, passwords, card numbers never |
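The practices above need nothing beyond the standard library. A minimal sketch of a JSON formatter, assuming field names that mirror the earlier log example (the `JsonFormatter` class is our own, not a stdlib one):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-api",
            "message": record.getMessage(),
        }
        # Attach correlation IDs when the caller supplies them via `extra=`
        for field in ("trace_id", "span_id"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123", "span_id": "def456"})
```

Because every line is valid JSON, LogQL's `| json` stage (shown below) can parse fields out and filter on them directly.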
Log Queries (Loki LogQL)
# Find errors in payment service
{service="payment-api"} |= "ERROR"
# Parse JSON and filter
{service="payment-api"} | json | level="ERROR" | duration_ms > 1000
# Count errors by endpoint
sum by (endpoint) (count_over_time({service="api"} |= "ERROR" [5m]))
Traces: The "How"
Distributed request flow across services:
[User Request]
│
▼
┌───────────┐ 250ms total
│ API Gateway│───────────────────────────────┐
└─────┬─────┘ │
│ │
▼ │
┌───────────┐ 150ms │
│ Auth │─────────────┐ │
└─────┬─────┘ │ │
│ │ │
▼ ▼ │
┌───────────┐ ┌───────────┐ │
│ Service A│ │ Redis │ │
│ 80ms │ │ 20ms │ │
└─────┬─────┘ └───────────┘ │
│ │
▼ │
┌───────────┐ │
│ Database │ │
│ 50ms │ │
└───────────┘ │
Trace Structure
Trace (end-to-end request)
├── Span: API Gateway (root span)
│ ├── Span: Auth Service
│ │ └── Span: Redis lookup
│ └── Span: Service A
│ └── Span: Database query
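The tree above can be modeled in a few lines of Python. This is a toy data model for building intuition, not the OpenTelemetry API:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span itself, excluding direct children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# The trace from the diagram: one root span with nested child spans
root = Span("API Gateway", 250, [
    Span("Auth Service", 150, [Span("Redis lookup", 20)]),
    Span("Service A", 80, [Span("Database query", 50)]),
])

print(root.self_time_ms())  # → 20
```

Subtracting child durations from each span is the basic trick trace UIs use to highlight where time actually went: here only 20ms of the 250ms total is the gateway itself.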
OpenTelemetry (Standard)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Initialize tracer (console exporter for local dev; swap for OTLP in production)
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.amount", 99.99)
    span.set_attribute("user.id", "user-123")

    # Nested span: process_payment becomes its parent automatically
    with tracer.start_as_current_span("validate_card"):
        # validation logic
        pass
Correlating the Three Pillars
The key is connecting metrics, logs, and traces:
1. ALERT fires (metrics)
"Error rate > 5% for payment-api"
2. INVESTIGATE with logs
Search: {service="payment-api"} |= "ERROR" over the last 5m (the time range is a query parameter, not part of LogQL)
3. TRACE specific request
Find trace_id from log, view in Jaeger
See exactly which service/query failed
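Step 2 to step 3 is just pulling the `trace_id` out of a structured log line. A sketch, assuming logs shaped like the earlier JSON example (the Jaeger URL pattern is a placeholder for your own deployment):

```python
import json

# One line from the log search in step 2
log_line = ('{"level": "ERROR", "service": "payment-api", '
            '"trace_id": "abc123", "message": "Payment processing failed"}')

entry = json.loads(log_line)
trace_id = entry["trace_id"]

# Jump straight from the log to the distributed trace
print(f"http://jaeger.internal/trace/{trace_id}")  # hypothetical Jaeger UI URL
```

This pivot only works if every service writes the active trace ID into its logs, which is why the logging best practices above insist on it.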
Exemplars: Connecting Metrics to Traces
# Prometheus histogram with exemplar
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
) # Click data point → see trace_id → view trace
Interview Questions
Q: "A user complains about slow checkout. How do you investigate?"
- Metrics: Check p99 latency for checkout endpoint
- Logs: Search for that user's requests in timeframe
- Traces: Find the slow trace, see which span took longest
- Root cause: Database query? External API? Resource contention?
Q: "How would you implement observability for a new microservice?"
# Minimum viable observability:
1. Metrics:
- Request rate, error rate, latency (RED)
- Resource utilization
- Business metrics (orders/min)
2. Logs:
- Structured JSON format
- Include trace_id and span_id
- Log levels: INFO for normal, ERROR for failures
3. Traces:
- OpenTelemetry SDK integration
- Auto-instrument HTTP clients
- Custom spans for business logic
4. Dashboards:
- Service overview (golden signals)
- Dependency map
- Error breakdown
5. Alerts:
- Error rate threshold
- Latency degradation
- Dependency failures
Next, we'll dive into SLOs and error budgets, the foundation of SRE reliability management.