Platform Observability & Reliability

Platform Reliability Patterns


Platform reliability ensures consistent service delivery to development teams. This lesson covers multi-tenancy models, namespace isolation with network policies, resource quotas, policy enforcement, and virtual clusters for platform resilience.

Multi-Tenancy Challenges

Platform teams serve multiple development teams with varying requirements. Multi-tenancy introduces challenges around resource isolation, security boundaries, and fair resource allocation.

Tenancy Models

┌─────────────────────────────────────────────────────────────────────┐
│                     Multi-Tenancy Models                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Soft Multi-Tenancy          │  Hard Multi-Tenancy                  │
│  ─────────────────           │  ──────────────────                  │
│                              │                                      │
│  ┌─────────────────────┐     │  ┌─────────┐ ┌─────────┐ ┌─────────┐│
│  │   Shared Cluster    │     │  │Cluster A│ │Cluster B│ │Cluster C││
│  │  ┌─────┐ ┌─────┐   │     │  │ Team A  │ │ Team B  │ │ Team C  ││
│  │  │NS-A │ │NS-B │   │     │  └─────────┘ └─────────┘ └─────────┘│
│  │  └─────┘ └─────┘   │     │                                      │
│  │  ┌─────┐ ┌─────┐   │     │  • Complete isolation                │
│  │  │NS-C │ │NS-D │   │     │  • Higher resource cost              │
│  │  └─────┘ └─────┘   │     │  • Independent upgrades              │
│  └─────────────────────┘     │                                      │
│                              │                                      │
│  • Namespace isolation       │                                      │
│  • Resource efficient        │                                      │
│  • Shared control plane      │                                      │
│                              │                                      │
└─────────────────────────────────────────────────────────────────────┘

Namespace Isolation with Network Policies

Network policies control traffic flow between namespaces and pods.

Default Deny Policy

# network-policy-default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Allow Intra-Namespace Traffic

# network-policy-allow-same-namespace.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}
  egress:
    - to:
        - podSelector: {}
    # Allow DNS resolution
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        # DNS falls back to TCP for large responses
        - protocol: TCP
          port: 53

Allow Traffic from Ingress Controller

# network-policy-allow-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller
  namespace: team-alpha
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
          podSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
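
Allow Prometheus Scraping

With a default-deny posture, metrics scraping also needs an explicit allow. A sketch, assuming Prometheus runs in a monitoring namespace and application pods expose metrics on port 9090 (adjust the namespace label and port to your setup):

```yaml
# network-policy-allow-prometheus.yaml
# Illustrative: the "monitoring" namespace and port 9090 are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9090
```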

Resource Quotas and Limits

Resource quotas prevent any single team from consuming excessive cluster resources.

Namespace Resource Quota

# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    # Compute resources
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi

    # Storage resources
    requests.storage: 100Gi
    persistentvolumeclaims: "10"

    # Object counts
    pods: "50"
    services: "20"
    secrets: "50"
    configmaps: "50"

    # Services by type
    services.loadbalancers: "2"
    services.nodeports: "5"
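
Quotas can also be scoped so that different limits apply to different classes of workload. As a sketch, a second quota that caps only pods in an assumed low-priority PriorityClass, so batch work cannot crowd out a team's interactive services:

```yaml
# resource-quota-low-priority.yaml
# Sketch: the "low-priority" PriorityClass name and the numbers are assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota-low-priority
  namespace: team-alpha
spec:
  hard:
    pods: "20"
    requests.cpu: "5"
    requests.memory: 10Gi
  # Only pods referencing the matching PriorityClass count against this quota
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values:
          - low-priority
```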

Limit Ranges for Default Limits

# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-alpha
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "100m"
        memory: 128Mi
      min:
        cpu: "50m"
        memory: 64Mi
      max:
        cpu: "4"
        memory: 8Gi

    - type: PersistentVolumeClaim
      min:
        storage: 1Gi
      max:
        storage: 50Gi

Virtual Clusters with vCluster

vCluster creates fully functional virtual Kubernetes clusters inside namespaces. Each team gets an isolated control plane while sharing underlying infrastructure.

vCluster Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    Host Cluster                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                    Namespace: vcluster-team-a                  │ │
│  │  ┌─────────────────────────────────────────────────────────┐  │ │
│  │  │              vCluster Control Plane                      │  │ │
│  │  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐  │  │ │
│  │  │  │ API      │  │ Syncer   │  │ CoreDNS              │  │  │ │
│  │  │  │ Server   │  │          │  │                      │  │  │ │
│  │  │  └──────────┘  └──────────┘  └──────────────────────┘  │  │ │
│  │  │  ┌──────────────────────────────────────────────────┐  │  │ │
│  │  │  │ etcd (or SQLite)                                 │  │  │ │
│  │  │  └──────────────────────────────────────────────────┘  │  │ │
│  │  └─────────────────────────────────────────────────────────┘  │ │
│  │                                                               │ │
│  │  Virtual Resources → Synced to Host → Scheduled on Nodes      │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                    Namespace: vcluster-team-b                  │ │
│  │  ┌─────────────────────────────────────────────────────────┐  │ │
│  │  │              vCluster Control Plane                      │  │ │
│  │  │                    (same structure)                      │  │ │
│  │  └─────────────────────────────────────────────────────────┘  │ │
│  └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Installing vCluster CLI

# macOS
brew install loft-sh/tap/vcluster

# Linux
curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-amd64"
chmod +x vcluster
sudo mv vcluster /usr/local/bin/

# Verify installation
vcluster --version

Creating a vCluster

# Create vCluster for team-alpha
vcluster create team-alpha \
  --namespace vcluster-team-alpha \
  --connect=false

# Connect to vCluster
vcluster connect team-alpha \
  --namespace vcluster-team-alpha \
  --update-current=false \
  -- kubectl get namespaces

vCluster Configuration

# vcluster-values.yaml
vcluster:
  image: rancher/k3s:v1.28.4-k3s1

  resources:
    limits:
      cpu: "2"
      memory: 4Gi
    requests:
      cpu: "200m"
      memory: 256Mi

syncer:
  extraArgs:
    - --sync-all-nodes

sync:
  # Sync services to host cluster
  services:
    enabled: true

  # Sync ingresses
  ingresses:
    enabled: true

  # Sync persistent volume claims
  persistentvolumeclaims:
    enabled: true

  # Don't sync network policies (use host policies)
  networkpolicies:
    enabled: false

# Isolation settings
isolation:
  enabled: true

  resourceQuota:
    enabled: true
    quota:
      requests.cpu: "10"
      requests.memory: 20Gi
      limits.cpu: "20"
      limits.memory: 40Gi
      pods: "100"

  limitRange:
    enabled: true
    default:
      cpu: "500m"
      memory: 512Mi
    defaultRequest:
      cpu: "100m"
      memory: 128Mi

# Persist vCluster state (the k3s-based distro stores it in SQLite by default,
# avoiding a dedicated etcd)
storage:
  persistence: true
  size: 5Gi

Deploying vCluster with Helm

# Add Loft Helm repository
helm repo add loft https://charts.loft.sh
helm repo update

# Create namespace
kubectl create namespace vcluster-team-alpha

# Install vCluster
helm upgrade --install team-alpha loft/vcluster \
  --namespace vcluster-team-alpha \
  --values vcluster-values.yaml \
  --wait

# Generate kubeconfig for team
vcluster connect team-alpha \
  --namespace vcluster-team-alpha \
  --update-current=false \
  --kube-config ./team-alpha-kubeconfig.yaml

Policy Enforcement with Kyverno

Kyverno enforces policies declared as Kubernetes resources, validating and mutating objects at admission time and generating resources automatically.

Installing Kyverno

# Add Kyverno Helm repository
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update

# Install Kyverno
helm upgrade --install kyverno kyverno/kyverno \
  --namespace kyverno \
  --create-namespace \
  --set replicaCount=3

Require Resource Limits Policy

# policy-require-limits.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
  annotations:
    policies.kyverno.io/title: Require Resource Limits
    policies.kyverno.io/description: >-
      All containers must specify CPU and memory limits.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required for all containers."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

Require Labels Policy

# policy-require-labels.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
                - DaemonSet
      validate:
        message: "The label 'team' is required."
        pattern:
          metadata:
            labels:
              team: "?*"

    - name: require-app-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label 'app.kubernetes.io/name' is required."
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: "?*"

Auto-Generate Network Policies

# policy-generate-network-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-network-policy
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
              selector:
                matchLabels:
                  platform.company.com/managed: "true"
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
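
The generate rule fires for namespaces carrying the platform label, so onboarding a team reduces to creating a labeled namespace. For example:

```yaml
# namespace-team-beta.yaml
# Creating this namespace triggers generation of the default-deny NetworkPolicy.
apiVersion: v1
kind: Namespace
metadata:
  name: team-beta
  labels:
    platform.company.com/managed: "true"
```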

Platform Reliability Dashboard

Monitor platform health and tenant resource usage.

Prometheus Recording Rules

# platform-reliability-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-reliability
  namespace: monitoring
spec:
  groups:
    - name: platform.reliability
      interval: 30s
      rules:
        # Tenant resource utilization
        - record: platform:tenant_cpu_utilization:ratio
          expr: |
            sum by (namespace) (
              rate(container_cpu_usage_seconds_total{namespace=~"team-.*"}[5m])
            )
            /
            sum by (namespace) (
              kube_resourcequota{resource="limits.cpu", type="hard", namespace=~"team-.*"}
            )

        - record: platform:tenant_memory_utilization:ratio
          expr: |
            sum by (namespace) (
              container_memory_working_set_bytes{namespace=~"team-.*"}
            )
            /
            sum by (namespace) (
              kube_resourcequota{resource="limits.memory", type="hard", namespace=~"team-.*"}
            )

        # vCluster health
        - record: platform:vcluster_api_availability:ratio
          expr: |
            avg_over_time(up{job=~"vcluster-.*"}[5m])

        # Policy violations
        - record: platform:policy_violations:count
          expr: |
            sum by (policy) (
              increase(kyverno_policy_results_total{rule_result="fail"}[1h])
            )

Grafana Dashboard for Platform Health

{
  "dashboard": {
    "title": "Platform Reliability",
    "panels": [
      {
        "title": "Tenant CPU Utilization",
        "type": "gauge",
        "targets": [
          {
            "expr": "platform:tenant_cpu_utilization:ratio * 100",
            "legendFormat": "{{namespace}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            },
            "unit": "percent",
            "max": 100
          }
        }
      },
      {
        "title": "vCluster Availability",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(platform:vcluster_api_availability:ratio) * 100",
            "legendFormat": "Availability"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 99},
                {"color": "green", "value": 99.9}
              ]
            }
          }
        }
      },
      {
        "title": "Policy Violations (Last Hour)",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, platform:policy_violations:count)",
            "format": "table"
          }
        ]
      }
    ]
  }
}

Summary

Platform reliability patterns ensure consistent service delivery:

  Pattern               Tool                        Purpose
  ───────────────────   ─────────────────────────   ─────────────────────────────
  Namespace Isolation   Network Policies            Control pod-to-pod traffic
  Resource Limits       ResourceQuota, LimitRange   Prevent resource exhaustion
  Virtual Clusters      vCluster                    Full API isolation per tenant
  Policy Enforcement    Kyverno                     Automated compliance

Choose the right isolation level based on your requirements:

  • Namespace isolation: Sufficient for trusted internal teams
  • vCluster: Needed for untrusted workloads or compliance requirements
  • Dedicated clusters: Required for strict regulatory separation

Next Lesson: Module 6 covers Platform Team Operations and Maturity, including building a platform team and adoption strategies.
