Platform Observability & Reliability

Cost Management & FinOps

Platform teams are responsible for infrastructure costs. FinOps (Financial Operations) practices help optimize spending while maintaining platform capabilities. This lesson covers cost visibility, allocation, and optimization strategies.

Why FinOps for Platform Teams

Without FinOps:
┌─────────────────────────────────────────────────────────┐
│  Monthly Cloud Bill: $500,000                           │
│                                                          │
│  "Where is this money going?"                           │
│  "Which teams are spending the most?"                   │
│  "Are we over-provisioned?"                             │
│  "How do we budget for next year?"                      │
└─────────────────────────────────────────────────────────┘

With FinOps:
┌─────────────────────────────────────────────────────────┐
│  Monthly Cloud Bill: $500,000                           │
│                                                          │
│  Team Orders:     $150,000 (30%)  ↓ 5% vs last month   │
│  Team Users:      $120,000 (24%)  ↑ 8% - new feature   │
│  Team Platform:   $ 80,000 (16%)  = stable             │
│  Idle Resources:  $ 50,000 (10%)  ← Action needed      │
│  Shared Services: $100,000 (20%)                       │
└─────────────────────────────────────────────────────────┘
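The breakdown above is simple arithmetic once costs are attributed to an owner. A minimal sketch, assuming per-owner totals are already available; the `CostLine` type and the previous-month figures used in the example are hypothetical, chosen only to reproduce the percentages in the diagram:

```python
from dataclasses import dataclass


@dataclass
class CostLine:
    name: str
    current: float   # this month's cost in USD
    previous: float  # last month's cost in USD (hypothetical input)


def breakdown(lines: list[CostLine]) -> list[dict]:
    """Return each line's share of the bill and its month-over-month change."""
    total = sum(line.current for line in lines)
    report = []
    for line in lines:
        share = line.current / total * 100 if total else 0.0
        if line.previous:
            change = (line.current - line.previous) / line.previous * 100
        else:
            change = 0.0  # no baseline to compare against
        report.append(
            {
                "name": line.name,
                "share_pct": round(share, 1),
                "change_pct": round(change, 1),
            }
        )
    return report


if __name__ == "__main__":
    bill = [
        CostLine("Team Orders", 150_000, 157_895),
        CostLine("Team Users", 120_000, 111_111),
        CostLine("Team Platform", 80_000, 80_000),
        CostLine("Idle Resources", 50_000, 50_000),
        CostLine("Shared Services", 100_000, 100_000),
    ]
    for row in breakdown(bill):
        print(f"{row['name']:<16} {row['share_pct']:>5}%  {row['change_pct']:+}%")
```

The hard part in practice is producing the per-owner totals, which is what the labeling and tooling in the rest of this lesson are for.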

Installing Kubecost

Kubecost provides Kubernetes cost visibility:

# Install Kubecost with Helm
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update

helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --set kubecostToken="YOUR_TOKEN" \
  --set prometheus.server.retention="15d"

# Access Kubecost UI
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090
# Open http://localhost:9090
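Once the port-forward is running, cost data can also be pulled programmatically. The sketch below assumes Kubecost's Allocation API is served at `/model/allocation` and returns a `data` list of maps keyed by allocation name, each with a `totalCost` field; verify both against the API documentation for your Kubecost version. The helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def allocation_url(base_url: str, window: str = "7d", aggregate: str = "namespace") -> str:
    """Build the query URL (endpoint path is an assumption, see lead-in)."""
    query = urlencode({"window": window, "aggregate": aggregate})
    return f"{base_url.rstrip('/')}/model/allocation?{query}"


def total_cost_by_name(payload: dict) -> dict[str, float]:
    """Sum totalCost per allocation name across all returned time steps."""
    totals: dict[str, float] = {}
    for step in payload.get("data", []):
        for name, alloc in (step or {}).items():
            totals[name] = totals.get(name, 0.0) + alloc.get("totalCost", 0.0)
    return totals


def fetch_costs(base_url: str = "http://localhost:9090",
                window: str = "7d",
                aggregate: str = "namespace") -> dict[str, float]:
    """Query the port-forwarded Kubecost instance and return cost per name."""
    with urlopen(allocation_url(base_url, window, aggregate), timeout=10) as resp:
        return total_cost_by_name(json.load(resp))
```

Calling `fetch_costs(aggregate="namespace")` while the port-forward is active should return a mapping of namespace to cost for the last seven days, provided the assumptions above hold.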

Cost Allocation

Tag resources for accurate cost attribution:

# Kubernetes labels for cost allocation
cost_labels:

  required:
    - team: "orders|users|platform|data"
    - environment: "production|staging|development"
    - product: "checkout|search|recommendations"

  optional:
    - cost-center: "CC-12345"
    - project: "project-name"

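# Example deployment with cost labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  labels:
    team: orders
    environment: production
    product: checkout
    cost-center: CC-12345
spec:
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      # Cost is allocated per pod, so the pod template must carry the labels too
      labels:
        app: order-service
        team: orders
        environment: production
        product: checkout
        cost-center: CC-12345
    spec:
      containers:
        - name: order-service
          image: registry.example.com/order-service:1.0.0  # placeholder image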
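The required-label scheme can be checked before a manifest ever reaches the cluster, for example in CI. A minimal sketch, assuming the allowed values listed in `cost_labels` above; the function name is illustrative:

```python
# Allowed values copied from the required section of cost_labels above.
REQUIRED_LABELS = {
    "team": {"orders", "users", "platform", "data"},
    "environment": {"production", "staging", "development"},
    "product": {"checkout", "search", "recommendations"},
}


def missing_or_invalid_labels(labels: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the labels are valid."""
    problems = []
    for key, allowed in REQUIRED_LABELS.items():
        value = labels.get(key)
        if value is None:
            problems.append(f"missing label: {key}")
        elif value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems


if __name__ == "__main__":
    pod_labels = {"team": "orders", "environment": "production", "product": "checkout"}
    print(missing_or_invalid_labels(pod_labels))        # no problems
    print(missing_or_invalid_labels({"team": "sales"}))  # three problems
```

Optional labels such as `cost-center` are deliberately not checked here.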

Crossplane Cost Tagging

Automatically tag the infrastructure that Crossplane provisions so its cost can be attributed:

# composition-with-cost-tags.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: database-aws-with-tags
spec:
  compositeTypeRef:
    apiVersion: platform.acme.com/v1alpha1
    kind: XDatabase

  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            tags:
              ManagedBy: crossplane
              Platform: internal-developer-platform
      patches:
        # Add team tag from claim
        - type: FromCompositeFieldPath
          fromFieldPath: metadata.labels[team]
          toFieldPath: spec.forProvider.tags.Team

        # Add environment tag
        - type: FromCompositeFieldPath
          fromFieldPath: metadata.labels[environment]
          toFieldPath: spec.forProvider.tags.Environment

        # Add cost center
        - type: FromCompositeFieldPath
          fromFieldPath: spec.costCenter
          toFieldPath: spec.forProvider.tags.CostCenter
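To make the effect of these patches concrete, the following sketch simulates them in plain Python. It is a simplified model, not Crossplane's implementation: it only handles dotted paths with `[key]` segments and skips a patch when the source field is unset, which mirrors the default optional behaviour of `FromCompositeFieldPath`. All function names are illustrative.

```python
import re

# Matches either a plain segment ("metadata") or a bracketed key ("[team]").
_SEGMENT = re.compile(r"([^.\[\]]+)|\[([^\]]+)\]")

BASE_TAGS = {"ManagedBy": "crossplane", "Platform": "internal-developer-platform"}

PATCHES = [
    {"fromFieldPath": "metadata.labels[team]", "toFieldPath": "spec.forProvider.tags.Team"},
    {"fromFieldPath": "metadata.labels[environment]", "toFieldPath": "spec.forProvider.tags.Environment"},
    {"fromFieldPath": "spec.costCenter", "toFieldPath": "spec.forProvider.tags.CostCenter"},
]


def split_path(path: str) -> list[str]:
    """Split 'metadata.labels[team]' into ['metadata', 'labels', 'team']."""
    return [m.group(1) or m.group(2) for m in _SEGMENT.finditer(path)]


def apply_patches(composite: dict, base_tags: dict, patches: list[dict]) -> dict:
    """Return the tag map the managed resource would end up with."""
    tags = dict(base_tags)
    for patch in patches:
        value = composite
        for key in split_path(patch["fromFieldPath"]):
            value = value.get(key) if isinstance(value, dict) else None
        if value is None:
            continue  # source field unset: skip the patch
        tags[split_path(patch["toFieldPath"])[-1]] = value
    return tags


if __name__ == "__main__":
    xr = {
        "metadata": {"labels": {"team": "orders", "environment": "production"}},
        "spec": {"costCenter": "CC-12345"},
    }
    print(apply_patches(xr, BASE_TAGS, PATCHES))
```

With the example composite, the resulting tags contain the two static entries plus `Team`, `Environment`, and `CostCenter`.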

Cost Monitoring Dashboard

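Precompute cost metrics with Prometheus recording rules so dashboards can chart them directly:

# Prometheus recording rules for cost metrics
# Metric names below are illustrative; confirm which cost metrics your Kubecost version exports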
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-metrics
  namespace: monitoring
spec:
  groups:
    - name: cost.metrics
      rules:
        # Cost per team (from Kubecost)
        - record: cost:team:monthly
          expr: |
            sum by (team) (
              kubecost_cluster_cost_total{aggregation="team"}
            ) * 720  # 24 h x 30 days, an approximate month

        # Cost per namespace
        - record: cost:namespace:monthly
          expr: |
            sum by (namespace) (
              kubecost_namespace_cost_total
            ) * 720

        # Idle resource cost
        - record: cost:idle:daily
          expr: |
            sum(kubecost_idle_cost_total) * 24

        # Cost efficiency ratio
        - record: cost:efficiency:ratio
          expr: |
            1 - (sum(kubecost_idle_cost_total) /
                 sum(kubecost_total_cost))
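The rules above rely on two conversions and one ratio. A minimal sketch of the same arithmetic, assuming the source metrics are hourly cost rates in USD (which is what multiplying by 720 or 24 implies); the function names are illustrative:

```python
HOURS_PER_DAY = 24
HOURS_PER_MONTH = 720  # 24 hours x 30 days, an approximation


def monthly_cost(hourly_cost: float) -> float:
    """Project an hourly cost rate to a 30-day month."""
    return hourly_cost * HOURS_PER_MONTH


def daily_cost(hourly_cost: float) -> float:
    """Project an hourly cost rate to one day."""
    return hourly_cost * HOURS_PER_DAY


def efficiency_ratio(idle_cost: float, total_cost: float) -> float:
    """Fraction of spend that is not idle: 1 - idle / total."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return 1 - idle_cost / total_cost


if __name__ == "__main__":
    # Figures from the example bill: $50,000 idle out of $500,000 total.
    print(efficiency_ratio(50_000, 500_000))
    print(monthly_cost(10.0), daily_cost(10.0))
```

Because 720 hours is only exact for 30-day months, projections for 28- or 31-day months will be off by a few percent.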

Resource Optimization

Identify and fix wasteful resources:

# Resource optimization strategies
optimization:

  right_sizing:
    description: "Match resource requests to actual usage"
    tools:
      - Kubecost recommendations
      - VPA (Vertical Pod Autoscaler)
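    prometheus_query: |
      # CPU over-provisioned (request > 2x actual usage over the last hour)
      # container!="" skips the pod-level aggregate series to avoid double counting
      (
        sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
        /
        sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1h]))
      ) > 2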

  idle_resources:
    description: "Resources running but not used"
    checks:
      - "Deployments with 0 traffic"
      - "PVCs not mounted"
      - "Load balancers with no connections"

  spot_instances:
    description: "Use spot/preemptible for non-critical workloads"
    candidates:
      - Development environments
      - CI/CD runners
      - Batch processing

  reserved_capacity:
    description: "Commit to reserved capacity or savings plans for predictable workloads"
    targets:
      - Production databases
      - Core platform services
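The right-sizing rule (request more than twice the actual usage) is easy to apply to exported usage data as well. A minimal sketch, assuming each pod is described by a dict with hypothetical keys `name`, `cpu_request_cores`, and `cpu_usage_cores`:

```python
def over_provisioned(pods: list[dict], threshold: float = 2.0) -> list[tuple[str, float]]:
    """Return (pod name, request/usage ratio) for pods above the threshold."""
    flagged = []
    for pod in pods:
        requested = pod["cpu_request_cores"]
        used = pod["cpu_usage_cores"]
        if used <= 0:
            # No measurable usage: any non-zero request is fully wasted.
            if requested > 0:
                flagged.append((pod["name"], float("inf")))
            continue
        ratio = requested / used
        if ratio > threshold:
            flagged.append((pod["name"], round(ratio, 2)))
    return flagged


if __name__ == "__main__":
    sample = [
        {"name": "order-service", "cpu_request_cores": 1.0, "cpu_usage_cores": 0.2},
        {"name": "user-service", "cpu_request_cores": 0.5, "cpu_usage_cores": 0.4},
        {"name": "old-batch-job", "cpu_request_cores": 0.25, "cpu_usage_cores": 0.0},
    ]
    for name, ratio in over_provisioned(sample):
        print(name, ratio)
```

A one-hour usage window can miss daily peaks, so it is worth checking a longer window before lowering requests.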

FinOps Policies

Implement cost guardrails:

# Kyverno policy for cost controls
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "Team label is required for cost allocation"
        pattern:
          metadata:
            labels:
              team: "?*"
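The same two checks can be run locally against a manifest before it is applied, which gives faster feedback than an admission rejection. The sketch below only approximates the policies' intent in plain Python; it does not use Kyverno and is not a substitute for it. Function names are illustrative.

```python
def pod_limit_violations(pod: dict) -> list[str]:
    """List containers that lack a CPU or memory limit."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        for resource in ("cpu", "memory"):
            if not limits.get(resource):
                violations.append(f"{container.get('name', '<unnamed>')}: missing {resource} limit")
    return violations


def has_team_label(workload: dict) -> bool:
    """True when metadata.labels.team is present and non-empty."""
    labels = workload.get("metadata", {}).get("labels") or {}
    return bool(labels.get("team"))


if __name__ == "__main__":
    pod = {
        "metadata": {"labels": {"team": "orders"}},
        "spec": {
            "containers": [
                {"name": "app", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
                {"name": "sidecar", "resources": {"limits": {"memory": "64Mi"}}},
            ]
        },
    }
    print(pod_limit_violations(pod))
    print(has_team_label(pod))
```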

Cost Reports

Generate team cost reports:

# Cost reporting configuration
cost_reports:

  weekly:
    recipients: "team-leads@acme.com"
    content:
      - "Cost by team (week-over-week change)"
      - "Top 10 most expensive resources"
      - "Optimization recommendations"

  monthly:
    recipients: "finance@acme.com, engineering-vp@acme.com"
    content:
      - "Cost by team and product"
      - "Trend analysis"
      - "Budget vs actual"
      - "Forecast for next month"

  alerts:
    - name: "Budget exceeded"
      condition: "team_cost > budget * 1.1"
      action: "notify team lead"

    - name: "Anomaly detected"
      condition: "daily_cost > avg_daily_cost * 2"
      action: "alert platform team"
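The two alert conditions translate directly into code. A minimal sketch, assuming cost figures are plain numbers in a single currency and that the daily average is taken over a caller-supplied history; the function names are illustrative:

```python
from statistics import mean


def budget_exceeded(team_cost: float, budget: float, tolerance: float = 1.1) -> bool:
    """'Budget exceeded': team_cost > budget * 1.1 (10% over budget)."""
    return team_cost > budget * tolerance


def is_anomaly(daily_cost: float, history: list[float], factor: float = 2.0) -> bool:
    """'Anomaly detected': today's cost is more than 2x the average daily cost."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return daily_cost > mean(history) * factor


if __name__ == "__main__":
    print(budget_exceeded(team_cost=111_000, budget=100_000))
    print(is_anomaly(daily_cost=2_500, history=[1_000, 1_200, 800]))
```

A fixed 2x multiplier is simple to reason about but will miss gradual cost creep, so it works best alongside the weekly trend report.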

Showback vs Chargeback

# Cost allocation models
allocation_models:

  showback:
    description: "Show teams their costs without charging"
    use_case: "Building cost awareness culture"
    implementation:
      - Monthly cost reports per team
      - Dashboard visibility
      - No budget enforcement

  chargeback:
    description: "Charge teams for actual usage"
    use_case: "Mature organizations with budget ownership"
    implementation:
      - Internal billing system
      - Budget alerts and enforcement
      - Approval workflows for large requests

  recommendation: |
    Start with showback to build awareness,
    then move to chargeback when teams are ready

In the next lesson, we'll explore platform reliability patterns including multi-tenancy and isolation.
