Platform Observability & Reliability
Platform Metrics & SLOs
3 min read
Platform teams need to measure their success beyond traditional infrastructure metrics. This lesson covers DORA metrics, platform-specific KPIs, and how to define SLOs that reflect platform health and developer productivity.
Why Platform Metrics Matter
Traditional Ops Metrics:
┌─────────────────────────────────────────────────────────┐
│ CPU: 45% │ Memory: 60% │ Uptime: 99.9%                  │
│                                                         │
│ "Everything looks healthy... but developers are         │
│  still complaining about slow deployments"              │
└─────────────────────────────────────────────────────────┘
Platform Metrics:
┌─────────────────────────────────────────────────────────┐
│ Deployment Frequency: 50/day                            │
│ Lead Time for Changes: 2 hours                          │
│ Change Failure Rate: 5%                                 │
│ Time to Recovery: 30 minutes                            │
│                                                         │
│ "We can now see the actual developer experience"        │
└─────────────────────────────────────────────────────────┘
DORA Metrics
The four key metrics that predict software delivery performance:
# DORA Metrics Definition
dora_metrics:
  deployment_frequency:
    definition: "How often code is deployed to production"
    elite: "Multiple times per day"
    high: "Weekly to monthly"
    medium: "Monthly to every 6 months"
    low: "Less than once per 6 months"
    measurement: |
      count(deployments) / time_period

  lead_time_for_changes:
    definition: "Time from commit to production"
    elite: "Less than 1 hour"
    high: "1 day to 1 week"
    medium: "1 week to 1 month"
    low: "1 month to 6 months"
    measurement: |
      deployment_time - commit_time

  change_failure_rate:
    definition: "Percentage of deployments causing failures"
    elite: "0-15%"
    high: "16-30%"
    medium: "31-45%"
    low: "46-60%"
    measurement: |
      (failed_deployments / total_deployments) * 100

  time_to_restore:
    definition: "Time to recover from production failure"
    elite: "Less than 1 hour"
    high: "Less than 1 day"
    medium: "1 day to 1 week"
    low: "More than 1 week"
    measurement: |
      recovery_time - incident_start_time
Platform-Specific Metrics
Beyond DORA, track platform-specific indicators:
# Platform Health Metrics
platform_metrics:
  self_service:
    - name: "Self-service adoption rate"
      query: |
        (resources_created_via_backstage / total_resources_created) * 100
      target: ">80%"
    - name: "Template usage rate"
      query: |
        count(projects_from_templates) / count(total_new_projects)
      target: ">90%"

  infrastructure:
    - name: "Infrastructure provisioning time"
      query: |
        avg(time_to_ready) where resource_type="database"
      target: "<10 minutes"
    - name: "Crossplane sync success rate"
      query: |
        count(synced_resources) / count(total_resources)
      target: ">99%"

  gitops:
    - name: "GitOps sync status"
      query: |
        count(apps where sync_status="Synced") / count(total_apps)
      target: ">99%"
    - name: "Deployment rollback rate"
      query: |
        count(rollbacks) / count(deployments)
      target: "<5%"

  developer_experience:
    - name: "Time to first deployment"
      description: "Time for a new developer to ship their first deploy"
      target: "<1 day"
    - name: "Support ticket volume"
      query: |
        count(platform_support_tickets) / count(developers)
      target: "<0.5 tickets/developer/month"
Prometheus Queries for DORA
Implement the DORA metrics as Prometheus recording rules:
# prometheus-rules-dora.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dora-metrics
  namespace: monitoring
spec:
  groups:
    - name: dora.metrics
      interval: 1m
      rules:
        # Deployment Frequency (successful syncs per day).
        # Group by whichever labels your argocd_app_sync_total series
        # actually carries; label names vary across Argo CD versions.
        - record: dora:deployment_frequency:daily
          expr: |
            sum(increase(argocd_app_sync_total{
              phase="Succeeded"
            }[24h])) by (dest_namespace)

        # Lead Time for Changes (average, in hours).
        # Note: these timestamp metrics are not emitted by Argo CD out
        # of the box; they assume a custom exporter that records commit
        # and sync timestamps per app.
        - record: dora:lead_time_hours:avg
          expr: |
            avg(
              argocd_app_sync_timestamp - on(app) argocd_app_source_revision_timestamp
            ) / 3600

        # Change Failure Rate (percentage, over 7 days)
        - record: dora:change_failure_rate:percent
          expr: |
            (
              sum(increase(argocd_app_sync_total{phase="Failed"}[7d]))
              /
              sum(increase(argocd_app_sync_total[7d]))
            ) * 100

        # Mean Time to Recovery (in minutes).
        # Note: illustrative only. Alertmanager does not expose these
        # series natively; accurate MTTR usually comes from your
        # incident-management tooling.
        - record: dora:mttr_minutes:avg
          expr: |
            avg(
              alertmanager_alerts_resolved_total
              - on(alertname) alertmanager_alerts_firing_total
            ) / 60
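Recording rules only pay off once something watches them. Here is a minimal alerting sketch that could be appended to the rules list above; the 20% threshold and one-hour hold time are arbitrary starting points, not DORA guidance, so tune them against your own baseline:
# Alert when the 7-day change failure rate crosses 20%
- alert: HighChangeFailureRate
  expr: dora:change_failure_rate:percent > 20
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: 'Change failure rate is {{ $value | printf "%.1f" }}% over the last 7 days'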
Defining SLOs
Service Level Objectives for your platform:
# slo-definitions.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: platform-slos
  namespace: monitoring
spec:
  service: "internal-developer-platform"
  labels:
    team: platform
  slos:
    # Platform API Availability
    - name: "platform-api-availability"
      objective: 99.9
      description: "Backstage and platform APIs available"
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{
              service=~"backstage|crossplane|argocd",
              status=~"5.."
            }[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{
              service=~"backstage|crossplane|argocd"
            }[{{.window}}]))
      alerting:
        name: "PlatformAPIAvailability"
        labels:
          severity: critical
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket

    # Infrastructure Provisioning Success
    # (the metric name below is illustrative; adjust to the metrics
    # your Crossplane version actually exports)
    - name: "infra-provisioning-success"
      objective: 99.5
      description: "Crossplane resource provisioning succeeds"
      sli:
        events:
          errorQuery: |
            sum(rate(crossplane_managed_resource_sync_total{
              status="failure"
            }[{{.window}}]))
          totalQuery: |
            sum(rate(crossplane_managed_resource_sync_total[{{.window}}]))

    # GitOps Sync Success
    - name: "gitops-sync-success"
      objective: 99.9
      description: "ArgoCD applications sync successfully"
      sli:
        events:
          errorQuery: |
            sum(rate(argocd_app_sync_total{
              phase="Failed"
            }[{{.window}}]))
          totalQuery: |
            sum(rate(argocd_app_sync_total[{{.window}}]))
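Sloth expands each SLO into multiwindow, multi-burn-rate recording and alerting rules, either as an operator watching PrometheusServiceLevel resources or via its CLI. To make the mechanics less magical, here is a simplified sketch of the kind of paging condition it generates for the availability SLO, using the 14.4x burn-rate factor from the Google SRE Workbook; the actual generated rules differ in naming, factors, and window choices:
# Simplified multiwindow burn-rate check: page when errors burn the
# 99.9% budget at >14.4x over both a long (1h) and short (5m) window
(
  sum(rate(http_requests_total{service=~"backstage|crossplane|argocd",status=~"5.."}[1h]))
    /
  sum(rate(http_requests_total{service=~"backstage|crossplane|argocd"}[1h]))
) > (14.4 * (1 - 0.999))
and
(
  sum(rate(http_requests_total{service=~"backstage|crossplane|argocd",status=~"5.."}[5m]))
    /
  sum(rate(http_requests_total{service=~"backstage|crossplane|argocd"}[5m]))
) > (14.4 * (1 - 0.999))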
Error Budgets
Calculate and track error budgets:
# Error Budget Calculation
error_budget:
  formula: |
    error_budget = (1 - SLO) * time_period

    Example:
      SLO          = 99.9%
      Time period  = 30 days
      Error budget = 0.1% * 30 days = 43.2 minutes
  tracking: |
    # Prometheus query for the fraction of budget remaining
    1 - (
      sum(increase(http_errors_total[30d]))
      /
      sum(increase(http_requests_total[30d]))
    ) / (1 - 0.999)
  policies:
    budget_remaining_high: "Continue normal development"
    budget_remaining_low: "Prioritize reliability work"
    budget_exhausted: "Feature freeze, fix issues"
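The dashboard in the next section reads platform_error_budget_remaining_percent, which no exporter provides out of the box. Here is a sketch of a recording rule that derives it from the same request metrics, assuming the 99.9% objective over 30 days used above:
# Percent of the 30-day error budget remaining (negative = overspent)
- record: platform_error_budget_remaining_percent
  expr: |
    100 * (
      1 - (
        sum(increase(http_requests_total{status=~"5.."}[30d]))
        /
        sum(increase(http_requests_total[30d]))
      ) / (1 - 0.999)
    )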
Grafana Dashboard
Bring the DORA and SLO metrics together in a single dashboard:
{
  "dashboard": {
    "title": "Platform Engineering Dashboard",
    "panels": [
      {
        "title": "DORA: Deployment Frequency",
        "type": "stat",
        "targets": [{
          "expr": "sum(increase(argocd_app_sync_total{phase=\"Succeeded\"}[24h]))"
        }]
      },
      {
        "title": "DORA: Lead Time (hours)",
        "type": "gauge",
        "targets": [{
          "expr": "dora:lead_time_hours:avg"
        }]
      },
      {
        "title": "DORA: Change Failure Rate",
        "type": "gauge",
        "targets": [{
          "expr": "dora:change_failure_rate:percent"
        }]
      },
      {
        "title": "Platform SLO: Availability",
        "type": "gauge",
        "targets": [{
          "expr": "1 - (sum(rate(http_requests_total{status=~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))"
        }]
      },
      {
        "title": "Error Budget Remaining",
        "type": "stat",
        "targets": [{
          "expr": "platform_error_budget_remaining_percent"
        }]
      }
    ]
  }
}
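To keep the dashboard in Git rather than hand-edited in the UI, provision it declaratively. A sketch assuming the Grafana sidecar shipped with kube-prometheus-stack, which loads dashboards from ConfigMaps carrying the grafana_dashboard label:
# grafana-dashboard-platform.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: platform-engineering-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"  # discovered by the Grafana dashboard sidecar
data:
  # Paste the full dashboard model from above as the file contents;
  # the stub below is just a valid placeholder
  platform-dashboard.json: |
    {"title": "Platform Engineering Dashboard", "panels": []}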
In the next lesson, we'll explore developer experience metrics and how to measure platform adoption.