Cost Management & FinOps
Platform teams are responsible for infrastructure costs. FinOps (Financial Operations) practices help optimize spending while maintaining platform capabilities. This lesson covers cost visibility, allocation, and optimization strategies.
Why FinOps for Platform Teams
Without FinOps:
┌─────────────────────────────────────────────────────────┐
│ Monthly Cloud Bill: $500,000                            │
│                                                         │
│ "Where is this money going?"                            │
│ "Which teams are spending the most?"                    │
│ "Are we over-provisioned?"                              │
│ "How do we budget for next year?"                       │
└─────────────────────────────────────────────────────────┘
With FinOps:
┌─────────────────────────────────────────────────────────┐
│ Monthly Cloud Bill: $500,000                            │
│                                                         │
│ Team Orders:     $150,000 (30%)  ↓ 5% vs last month     │
│ Team Users:      $120,000 (24%)  ↑ 8% - new feature     │
│ Team Platform:   $ 80,000 (16%)  = stable               │
│ Idle Resources:  $ 50,000 (10%)  ← Action needed        │
│ Shared Services: $100,000 (20%)                         │
└─────────────────────────────────────────────────────────┘
Installing Kubecost
Kubecost provides Kubernetes cost visibility:
# Install Kubecost with Helm
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --set kubecostToken="YOUR_TOKEN" \
  --set prometheus.server.retention="15d"

# Access Kubecost UI
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090
# Open http://localhost:9090
Cost Allocation
Tag resources for accurate cost attribution:
# Kubernetes labels for cost allocation
cost_labels:
  required:
    - team: "orders|users|platform|data"
    - environment: "production|staging|development"
    - product: "checkout|search|recommendations"
  optional:
    - cost-center: "CC-12345"
    - project: "project-name"
# Example deployment with cost labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  labels:
    team: orders
    environment: production
    product: checkout
    cost-center: CC-12345
spec:
  # selector and pod spec (containers) omitted for brevity
  template:
    metadata:
      labels:
        team: orders
        environment: production
        product: checkout
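
The label scheme above can be checked before a manifest ever reaches the cluster. The following is a minimal sketch, assuming the allowed values listed under `cost_labels`; the helper name `missing_or_invalid_labels` is hypothetical and not part of any tool mentioned in this lesson.

```python
import re

# Allowed values copied from the cost_labels scheme above; adjust to your own teams.
REQUIRED_LABELS = {
    "team": r"orders|users|platform|data",
    "environment": r"production|staging|development",
    "product": r"checkout|search|recommendations",
}


def missing_or_invalid_labels(labels: dict[str, str]) -> list[str]:
    """Return the required label keys that are absent or carry a disallowed value."""
    problems = []
    for key, pattern in REQUIRED_LABELS.items():
        value = labels.get(key)
        if value is None or not re.fullmatch(pattern, value):
            problems.append(key)
    return sorted(problems)


# Labels from the order-service Deployment above.
order_service = {
    "team": "orders",
    "environment": "production",
    "product": "checkout",
    "cost-center": "CC-12345",
}

print(missing_or_invalid_labels(order_service))
print(missing_or_invalid_labels({"team": "payments", "environment": "production"}))
```

The first call reports no problems; the second flags `product` (missing) and `team` (value not in the allowed set). A check like this could run in CI as a fast complement to cluster-side policy.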
Crossplane Cost Tagging
Automatically tag provisioned cloud resources so their cost can be attributed:
# composition-with-cost-tags.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: database-aws-with-tags
spec:
  compositeTypeRef:
    apiVersion: platform.acme.com/v1alpha1
    kind: XDatabase
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            tags:
              ManagedBy: crossplane
              Platform: internal-developer-platform
      patches:
        # Add team tag from claim
        - type: FromCompositeFieldPath
          fromFieldPath: metadata.labels[team]
          toFieldPath: spec.forProvider.tags.Team
        # Add environment tag
        - type: FromCompositeFieldPath
          fromFieldPath: metadata.labels[environment]
          toFieldPath: spec.forProvider.tags.Environment
        # Add cost center
        - type: FromCompositeFieldPath
          fromFieldPath: spec.costCenter
          toFieldPath: spec.forProvider.tags.CostCenter
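
To make the effect of the three patches concrete, here is a small simulation of how the final tag map is assembled from a composite resource. It is only an illustration, not Crossplane's implementation: `get_path` and `render_tags` are hypothetical helpers, the path parser handles just the two path shapes used above, and it assumes a patch is skipped when its source field is missing.

```python
def get_path(obj: dict, path: str):
    """Resolve a simplified field path such as 'metadata.labels[team]' or 'spec.costCenter'."""
    current = obj
    for part in path.replace("[", ".").replace("]", "").split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # source field missing: treat the patch as skipped (assumption)
        current = current[part]
    return current


# (source path on the composite, tag key on the RDS instance), taken from the patches above.
TAG_PATCHES = [
    ("metadata.labels[team]", "Team"),
    ("metadata.labels[environment]", "Environment"),
    ("spec.costCenter", "CostCenter"),
]


def render_tags(composite: dict) -> dict:
    """Start from the static tags in `base` and apply each patch in order."""
    tags = {"ManagedBy": "crossplane", "Platform": "internal-developer-platform"}
    for source, tag_key in TAG_PATCHES:
        value = get_path(composite, source)
        if value is not None:
            tags[tag_key] = value
    return tags


example_composite = {
    "metadata": {"labels": {"team": "orders", "environment": "production"}},
    "spec": {"costCenter": "CC-12345"},
}
print(render_tags(example_composite))
```

For the example composite, the resulting map carries the two static tags plus `Team`, `Environment`, and `CostCenter`, which is what the cloud provider's billing export would then group by.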
Cost Monitoring Dashboard
# Prometheus rules for cost metrics
# NOTE: the kubecost_* metric names below are illustrative. Check which
# metrics your Kubecost version actually exports before using these rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-metrics
  namespace: monitoring
spec:
  groups:
    - name: cost.metrics
      rules:
        # Cost per team (from Kubecost); assumes the metric is an hourly rate
        - record: cost:team:monthly
          expr: |
            sum by (team) (
              kubecost_cluster_cost_total{aggregation="team"}
            ) * 720  # hours in a 30-day month
        # Cost per namespace
        - record: cost:namespace:monthly
          expr: |
            sum by (namespace) (
              kubecost_namespace_cost_total
            ) * 720
        # Idle resource cost (hourly rate * 24 hours)
        - record: cost:idle:daily
          expr: |
            sum(kubecost_idle_cost_total) * 24
        # Cost efficiency ratio: share of spend that is not idle
        - record: cost:efficiency:ratio
          expr: |
            1 - (sum(kubecost_idle_cost_total) /
                 sum(kubecost_total_cost))
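
The arithmetic behind these recording rules is simple enough to check by hand. The sketch below assumes the inputs are hourly cost rates and uses the same 720-hour month as the rules; the function names are illustrative only.

```python
HOURS_PER_DAY = 24
HOURS_PER_MONTH = 720  # 30 days * 24 hours, the approximation used in the rules above


def daily_cost(hourly_rate: float) -> float:
    """Project an hourly cost rate to one day."""
    return hourly_rate * HOURS_PER_DAY


def monthly_cost(hourly_rate: float) -> float:
    """Project an hourly cost rate to a 30-day month."""
    return hourly_rate * HOURS_PER_MONTH


def efficiency_ratio(idle_cost: float, total_cost: float) -> float:
    """Share of spend that is not idle: 1 - idle / total."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return 1 - idle_cost / total_cost


# Figures from the "With FinOps" breakdown: $50,000 idle out of $500,000 total.
print(round(efficiency_ratio(50_000, 500_000), 2))  # 0.9
print(monthly_cost(200.0))                          # 144000.0
```

With the numbers from the earlier breakdown, 10% idle spend corresponds to an efficiency ratio of 0.9.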
Resource Optimization
Identify and fix wasteful resources:
# Resource optimization strategies
optimization:
  right_sizing:
    description: "Match resource requests to actual usage"
    tools:
      - Kubecost recommendations
      - VPA (Vertical Pod Autoscaler)
    prometheus_query: |
      # CPU over-provisioned (request > 2x actual)
      # container!="" excludes the pod-level cgroup series so usage is not double-counted
      (
        sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
        /
        sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1h]))
      ) > 2
  idle_resources:
    description: "Resources running but not used"
    checks:
      - "Deployments with 0 traffic"
      - "PVCs not mounted"
      - "Load balancers with no connections"
  spot_instances:
    description: "Use spot/preemptible for non-critical workloads"
    candidates:
      - Development environments
      - CI/CD runners
      - Batch processing
  reserved_capacity:
    description: "Commit to savings for predictable workloads"
    targets:
      - Production databases
      - Core platform services
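
The right-sizing rule (request more than twice the actual usage) can be sketched outside Prometheus as well. This example assumes you already have per-pod CPU requests and average usage in cores; the `PodCpu` type and the sample data are hypothetical. Unlike the PromQL query, it skips pods with zero usage rather than reporting an infinite ratio, on the assumption that those are handled by the idle-resource checks.

```python
from dataclasses import dataclass


@dataclass
class PodCpu:
    namespace: str
    pod: str
    requested_cores: float
    used_cores: float  # average usage over the observation window, e.g. 1h


def over_provisioned(pods: list[PodCpu], threshold: float = 2.0) -> list[tuple[str, str, float]]:
    """Return (namespace, pod, request/usage ratio) for pods above the threshold."""
    flagged = []
    for p in pods:
        if p.used_cores <= 0:
            continue  # no usage at all: treat as an idle resource, not a right-sizing case
        ratio = p.requested_cores / p.used_cores
        if ratio > threshold:
            flagged.append((p.namespace, p.pod, round(ratio, 2)))
    return flagged


sample = [
    PodCpu("orders", "api", requested_cores=1.0, used_cores=0.2),
    PodCpu("orders", "worker", requested_cores=0.5, used_cores=0.4),
    PodCpu("users", "cron", requested_cores=1.0, used_cores=0.0),
]
print(over_provisioned(sample))
```

Only `orders/api` is flagged here, since it requests five times what it uses.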
FinOps Policies
Implement cost guardrails:
# Kyverno policy for cost controls
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "Team label is required for cost allocation"
        pattern:
          metadata:
            labels:
              team: "?*"
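
The first policy's `"?*"` pattern requires a non-empty value for both limits on every container. The sketch below mirrors that check in plain Python so it could run against rendered manifests in CI; it is a simplified stand-in rather than Kyverno's matching logic, and `containers_missing_limits` is a hypothetical helper that, like the policy pattern, only inspects `spec.containers`.

```python
def containers_missing_limits(pod_spec: dict) -> list[str]:
    """Return names of containers that lack a non-empty CPU or memory limit."""
    missing = []
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if not limits.get("cpu") or not limits.get("memory"):
            missing.append(container.get("name", "<unnamed>"))
    return missing


example_pod_spec = {
    "containers": [
        {"name": "app", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
        {"name": "sidecar", "resources": {"limits": {"memory": "64Mi"}}},
        {"name": "debug"},
    ]
}
print(containers_missing_limits(example_pod_spec))
```

For the example spec, `sidecar` (no CPU limit) and `debug` (no limits at all) are reported.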
Cost Reports
Generate team cost reports:
# Cost reporting configuration
cost_reports:
  weekly:
    recipients: "team-leads@acme.com"
    content:
      - "Cost by team (week-over-week change)"
      - "Top 10 expensive resources"
      - "Optimization recommendations"
  monthly:
    recipients: "finance@acme.com, engineering-vp@acme.com"
    content:
      - "Cost by team and product"
      - "Trend analysis"
      - "Budget vs actual"
      - "Forecast for next month"
  alerts:
    - name: "Budget exceeded"
      condition: "team_cost > budget * 1.1"
      action: "notify team lead"
    - name: "Anomaly detected"
      condition: "daily_cost > avg_daily_cost * 2"
      action: "alert platform team"
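
The two alert conditions translate directly into code. This is a minimal sketch of the comparisons only, assuming costs are plain numbers in one currency; the function names and the choice of averaging over a list of previous days are assumptions, since the configuration above does not say how `avg_daily_cost` is computed.

```python
from statistics import mean


def budget_exceeded(team_cost: float, budget: float, tolerance: float = 1.1) -> bool:
    """'Budget exceeded': spend is more than 10% over budget."""
    return team_cost > budget * tolerance


def is_anomaly(daily_cost: float, previous_daily_costs: list[float], factor: float = 2.0) -> bool:
    """'Anomaly detected': today's cost is more than twice the recent daily average."""
    if not previous_daily_costs:
        return False  # no history yet, so nothing to compare against
    return daily_cost > mean(previous_daily_costs) * factor


print(budget_exceeded(team_cost=166_000, budget=150_000))  # True
print(budget_exceeded(team_cost=160_000, budget=150_000))  # False
print(is_anomaly(9_500, [4_000, 5_000, 4_500]))            # True
```

Both checks use strict comparisons, matching the `>` in the configured conditions, so a team sitting exactly at the threshold does not trigger an alert.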
Showback vs Chargeback
# Cost allocation models
allocation_models:
  showback:
    description: "Show teams their costs without charging"
    use_case: "Building cost awareness culture"
    implementation:
      - Monthly cost reports per team
      - Dashboard visibility
      - No budget enforcement
  chargeback:
    description: "Charge teams for actual usage"
    use_case: "Mature organizations with budget ownership"
    implementation:
      - Internal billing system
      - Budget alerts and enforcement
      - Approval workflows for large requests
  recommendation: |
    Start with showback to build awareness,
    then move to chargeback when teams are ready
In the next lesson, we'll explore platform reliability patterns including multi-tenancy and isolation.