Platform Observability & Reliability

Platform Metrics & SLOs

Platform teams need to measure their success beyond traditional infrastructure metrics. This lesson covers DORA metrics, platform-specific KPIs, and how to define SLOs that reflect platform health and developer productivity.

Why Platform Metrics Matter

Traditional Ops Metrics:
┌─────────────────────────────────────────────────────────┐
│  CPU: 45%  │  Memory: 60%  │  Uptime: 99.9%           │
│                                                         │
│  "Everything looks healthy... but developers are        │
│   still complaining about slow deployments"            │
└─────────────────────────────────────────────────────────┘

Platform Metrics:
┌─────────────────────────────────────────────────────────┐
│  Deployment Frequency: 50/day                          │
│  Lead Time for Changes: 2 hours                        │
│  Change Failure Rate: 5%                               │
│  Time to Recovery: 30 minutes                          │
│                                                         │
│  "We can now see the actual developer experience"      │
└─────────────────────────────────────────────────────────┘

DORA Metrics

The four key metrics that predict software delivery performance:

# DORA Metrics Definition
dora_metrics:

  deployment_frequency:
    definition: "How often code is deployed to production"
    elite: "Multiple times per day"
    high: "Weekly to monthly"
    medium: "Monthly to every 6 months"
    low: "Less than once per 6 months"
    measurement: |
      count(deployments) / time_period

  lead_time_for_changes:
    definition: "Time from commit to production"
    elite: "Less than 1 hour"
    high: "1 day to 1 week"
    medium: "1 week to 1 month"
    low: "1 month to 6 months"
    measurement: |
      deployment_time - commit_time

  change_failure_rate:
    definition: "Percentage of deployments causing failures"
    elite: "0-15%"
    high: "16-30%"
    medium: "31-45%"
    low: "46-60%"
    measurement: |
      (failed_deployments / total_deployments) * 100

  time_to_restore:
    definition: "Time to recover from production failure"
    elite: "Less than 1 hour"
    high: "Less than 1 day"
    medium: "1 day to 1 week"
    low: "More than 1 week"
    measurement: |
      recovery_time - incident_start_time

Platform-Specific Metrics

Beyond DORA, track platform-specific indicators:

# Platform Health Metrics
platform_metrics:

  self_service:
    - name: "Self-service adoption rate"
      query: |
        (resources_created_via_backstage / total_resources_created) * 100
      target: ">80%"

    - name: "Template usage rate"
      query: |
        count(projects_from_templates) / count(total_new_projects)
      target: ">90%"

  infrastructure:
    - name: "Infrastructure provisioning time"
      query: |
        avg(time_to_ready) where resource_type="database"
      target: "<10 minutes"

    - name: "Crossplane sync success rate"
      query: |
        count(synced_resources) / count(total_resources)
      target: ">99%"

  gitops:
    - name: "GitOps sync status"
      query: |
        count(apps where sync_status="Synced") / count(total_apps)
      target: ">99%"

    - name: "Deployment rollback rate"
      query: |
        count(rollbacks) / count(deployments)
      target: "<5%"

  developer_experience:
    - name: "Time to first deployment"
      description: "Time for new developer to deploy"
      target: "<1 day"

    - name: "Support ticket volume"
      query: |
        count(platform_support_tickets) / count(developers)
      target: "<0.5 tickets/developer/month"

Prometheus Queries for DORA

Implement DORA metrics in Prometheus:

# prometheus-rules-dora.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dora-metrics
  namespace: monitoring
spec:
  groups:
    - name: dora.metrics
      interval: 1m
      rules:
        # Deployment Frequency (per day)
        - record: dora:deployment_frequency:daily
          expr: |
            sum(increase(argocd_app_sync_total{
              phase="Succeeded"
            }[24h])) by (dest_namespace)

        # Lead Time for Changes (average in hours)
        # NOTE: ArgoCD does not export commit and deploy timestamps as
        # Prometheus metrics out of the box; the two series below assume a
        # custom exporter that publishes them as Unix timestamps.
        - record: dora:lead_time_hours:avg
          expr: |
            avg(
              argocd_app_sync_timestamp - on(app) argocd_app_source_revision_timestamp
            ) / 3600

        # Change Failure Rate (percentage)
        - record: dora:change_failure_rate:percent
          expr: |
            (
              sum(increase(argocd_app_sync_total{phase="Failed"}[7d]))
              /
              sum(increase(argocd_app_sync_total[7d]))
            ) * 100

        # Mean Time to Recovery (in minutes)
        # NOTE: Alertmanager's built-in metrics do not record per-incident
        # start and resolution times, so accurate MTTR usually comes from an
        # incident-management tool. The hypothetical series below assume
        # custom instrumentation exposing those times as Unix timestamps.
        - record: dora:mttr_minutes:avg
          expr: |
            avg(
              incident_resolved_timestamp_seconds
              - on(incident_id) incident_started_timestamp_seconds
            ) / 60

Defining SLOs

Service Level Objectives for your platform:

# slo-definitions.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: platform-slos
  namespace: monitoring
spec:
  service: "internal-developer-platform"
  labels:
    team: platform

  slos:
    # Platform API Availability
    - name: "platform-api-availability"
      objective: 99.9
      description: "Backstage and platform APIs available"
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{
              service=~"backstage|crossplane|argocd",
              status=~"5.."
            }[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{
              service=~"backstage|crossplane|argocd"
            }[{{.window}}]))
      alerting:
        name: "PlatformAPIAvailability"
        labels:
          severity: critical
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket

    # Infrastructure Provisioning Success
    - name: "infra-provisioning-success"
      objective: 99.5
      description: "Crossplane resource provisioning succeeds"
      sli:
        events:
          errorQuery: |
            sum(rate(crossplane_managed_resource_sync_total{
              status="failure"
            }[{{.window}}]))
          totalQuery: |
            sum(rate(crossplane_managed_resource_sync_total[{{.window}}]))

    # GitOps Sync Success
    - name: "gitops-sync-success"
      objective: 99.9
      description: "ArgoCD applications sync successfully"
      sli:
        events:
          errorQuery: |
            sum(rate(argocd_app_sync_total{
              phase="Failed"
            }[{{.window}}]))
          totalQuery: |
            sum(rate(argocd_app_sync_total[{{.window}}]))
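
Sloth turns these objectives into multi-window burn-rate alerts (the `pageAlert`/`ticketAlert` labels above feed that routing). The arithmetic behind a burn rate is simple; a minimal sketch, where the 14.4x example follows the commonly cited Google SRE multi-window thresholds:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is being consumed.

    1.0 means errors arrive exactly at the budgeted rate; a burn rate of
    14.4 sustained for 1 hour consumes ~2% of a 30-day budget (14.4h of
    budget spent out of 720h in the window).
    """
    return error_rate / (1 - slo)

# A 1.44% error rate against a 99.9% SLO burns budget 14.4x too fast
assert round(burn_rate(0.0144, 0.999), 1) == 14.4
```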

Error Budgets

Calculate and track error budgets:

# Error Budget Calculation
error_budget:

  formula: |
    error_budget = (1 - SLO) * time_period

    Example:
    SLO = 99.9%
    Time period = 30 days
    Error budget = 0.1% * 30 days = 43.2 minutes

  tracking: |
    # Prometheus query for the fraction of error budget remaining
    # (1 = budget untouched, 0 = exhausted, negative = overspent)
    1 - (
      sum(increase(http_errors_total[30d]))
      /
      sum(increase(http_requests_total[30d]))
    ) / (1 - 0.999)

  policies:
    budget_remaining_high: "Continue normal development"
    budget_remaining_low: "Prioritize reliability work"
    budget_exhausted: "Feature freeze, fix issues"
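
The arithmetic above is easy to verify in code. A quick sketch of both the budget size and the remaining-budget calculation (function names are illustrative):

```python
def error_budget_minutes(slo: float, period_days: int) -> float:
    """Allowed error time for a given SLO over a period, in minutes."""
    return (1 - slo) * period_days * 24 * 60

def budget_remaining(errors: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent.

    Mirrors the Prometheus query above: 1 minus the observed error rate
    divided by the allowed error rate (1 - SLO).
    """
    error_rate = errors / total
    return 1 - error_rate / (1 - slo)

assert round(error_budget_minutes(0.999, 30), 1) == 43.2  # matches the example
# 0.05% observed error rate against a 99.9% SLO: half the budget spent
assert round(budget_remaining(5, 10_000, 0.999), 2) == 0.5
```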

Grafana Dashboard

{
  "dashboard": {
    "title": "Platform Engineering Dashboard",
    "panels": [
      {
        "title": "DORA: Deployment Frequency",
        "type": "stat",
        "targets": [{
          "expr": "sum(increase(argocd_app_sync_total{phase=\"Succeeded\"}[24h]))"
        }]
      },
      {
        "title": "DORA: Lead Time (hours)",
        "type": "gauge",
        "targets": [{
          "expr": "dora:lead_time_hours:avg"
        }]
      },
      {
        "title": "DORA: Change Failure Rate",
        "type": "gauge",
        "targets": [{
          "expr": "dora:change_failure_rate:percent"
        }]
      },
      {
        "title": "Platform SLO: Availability",
        "type": "gauge",
        "targets": [{
          "expr": "1 - (sum(rate(http_requests_total{status=~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))"
        }]
      },
      {
        "title": "Error Budget Remaining",
        "type": "stat",
        "targets": [{
          "expr": "platform_error_budget_remaining_percent"
        }]
      }
    ]
  }
}

In the next lesson, we'll explore developer experience metrics and how to measure platform adoption.
