Module 5: Platform Observability & Reliability

Platform Reliability Patterns

12 min read

Platform reliability means the platform delivers consistent, predictable service to every development team it hosts. This lesson covers multi-tenancy models, namespace isolation, resource governance, and virtual clusters as building blocks of platform resilience.

Multi-Tenancy Challenges

Platform teams serve multiple development teams with varying requirements. Multi-tenancy introduces challenges around resource isolation, security boundaries, and fair resource allocation.

Tenancy Models

┌─────────────────────────────────────────────────────────────────────┐
│                     Multi-Tenancy Models                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Soft Multi-Tenancy          │  Hard Multi-Tenancy                  │
│  ─────────────────           │  ──────────────────                  │
│                              │                                      │
│  ┌─────────────────────┐     │  ┌─────────┐ ┌─────────┐ ┌─────────┐│
│  │   Shared Cluster    │     │  │Cluster A│ │Cluster B│ │Cluster C││
│  │  ┌─────┐ ┌─────┐   │     │  │ Team A  │ │ Team B  │ │ Team C  ││
│  │  │NS-A │ │NS-B │   │     │  └─────────┘ └─────────┘ └─────────┘│
│  │  └─────┘ └─────┘   │     │                                      │
│  │  ┌─────┐ ┌─────┐   │     │  • Complete isolation                │
│  │  │NS-C │ │NS-D │   │     │  • Higher resource cost              │
│  │  └─────┘ └─────┘   │     │  • Independent upgrades              │
│  └─────────────────────┘     │                                      │
│                              │                                      │
│  • Namespace isolation       │                                      │
│  • Resource efficient        │                                      │
│  • Shared control plane      │                                      │
│                              │                                      │
└─────────────────────────────────────────────────────────────────────┘

Namespace Isolation with Network Policies

Kubernetes NetworkPolicies control which traffic is allowed between pods and across namespaces. Note that they are only enforced when the cluster's CNI plugin supports them (for example, Calico or Cilium); on a CNI without policy support they are silently ignored.

Default Deny Policy

# network-policy-default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Allow Intra-Namespace Traffic

# network-policy-allow-same-namespace.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}
  egress:
    - to:
        - podSelector: {}
    # Allow DNS resolution
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53

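With default deny in place, platform-owned tooling also needs explicit allowances. A sketch of one common case, letting Prometheus scrape workloads in the namespace (the `monitoring` namespace label and port 9090 are illustrative assumptions; adjust to your setup):

```yaml
# network-policy-allow-monitoring.yaml (illustrative; namespace name and port are assumptions)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-scrape
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    # Allow Prometheus in the monitoring namespace to reach metrics endpoints
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9090
```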
Allow Traffic from Ingress Controller

# network-policy-allow-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller
  namespace: team-alpha
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
          podSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
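Teams usually also need controlled egress to the internet (package registries, external APIs). One way to sketch this under default deny is an ipBlock rule that permits HTTPS everywhere except the private ranges (the CIDRs below are the standard RFC 1918 ranges; tailor them to your network):

```yaml
# network-policy-allow-https-egress.yaml (illustrative)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-https
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    # Allow HTTPS to any non-private address; in-cluster and private
    # destinations remain blocked unless another policy allows them
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443
```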

Resource Quotas and Limits

Resource quotas prevent any single team from consuming excessive cluster resources.

Namespace Resource Quota

# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    # Compute resources
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi

    # Storage resources
    requests.storage: 100Gi
    persistentvolumeclaims: "10"

    # Object counts
    pods: "50"
    services: "20"
    secrets: "50"
    configmaps: "50"

    # Services by type
    services.loadbalancers: "2"
    services.nodeports: "5"
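Quotas can also be scoped to a subset of pods. As a sketch, a second quota that caps only BestEffort pods (those declaring no requests or limits), so unbounded workloads cannot crowd out guaranteed ones; the name and count here are illustrative:

```yaml
# resource-quota-besteffort.yaml (illustrative)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort-quota
  namespace: team-alpha
spec:
  hard:
    pods: "5"      # at most 5 BestEffort pods in the namespace
  scopes:
    - BestEffort   # only pods with no requests/limits count against this quota
```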

Limit Ranges for Default Limits

# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-alpha
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "100m"
        memory: 128Mi
      min:
        cpu: "50m"
        memory: 64Mi
      max:
        cpu: "4"
        memory: 8Gi

    - type: PersistentVolumeClaim
      min:
        storage: 1Gi
      max:
        storage: 50Gi

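To see the LimitRange in action, deploy a container that declares no resources at all; the admission controller fills in the defaults above. A minimal example (pod name and image are arbitrary):

```yaml
# pod-no-resources.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: defaults-demo
  namespace: team-alpha
spec:
  containers:
    - name: app
      image: nginx:1.25
      # No resources block: the LimitRange injects requests of
      # 100m CPU / 128Mi and limits of 500m CPU / 512Mi at admission
```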
Virtual Clusters with vCluster

vCluster creates fully functional virtual Kubernetes clusters inside namespaces. Each team gets an isolated control plane while sharing underlying infrastructure.

vCluster Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    Host Cluster                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                    Namespace: vcluster-team-a                  │ │
│  │  ┌─────────────────────────────────────────────────────────┐  │ │
│  │  │              vCluster Control Plane                      │  │ │
│  │  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐  │  │ │
│  │  │  │ API      │  │ Syncer   │  │ CoreDNS              │  │  │ │
│  │  │  │ Server   │  │          │  │                      │  │  │ │
│  │  │  └──────────┘  └──────────┘  └──────────────────────┘  │  │ │
│  │  │  ┌──────────────────────────────────────────────────┐  │  │ │
│  │  │  │ etcd (or SQLite)                                 │  │  │ │
│  │  │  └──────────────────────────────────────────────────┘  │  │ │
│  │  └─────────────────────────────────────────────────────────┘  │ │
│  │                                                               │ │
│  │  Virtual Resources → Synced to Host → Scheduled on Nodes      │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                    Namespace: vcluster-team-b                  │ │
│  │  ┌─────────────────────────────────────────────────────────┐  │ │
│  │  │              vCluster Control Plane                      │  │ │
│  │  │                    (same structure)                      │  │ │
│  │  └─────────────────────────────────────────────────────────┘  │ │
│  └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Installing vCluster CLI

# macOS
brew install loft-sh/tap/vcluster

# Linux
curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-amd64"
chmod +x vcluster
sudo mv vcluster /usr/local/bin/

# Verify installation
vcluster --version

Creating a vCluster

# Create vCluster for team-alpha
vcluster create team-alpha \
  --namespace vcluster-team-alpha \
  --connect=false

# Connect to vCluster
vcluster connect team-alpha \
  --namespace vcluster-team-alpha \
  --update-current=false \
  -- kubectl get namespaces

vCluster Configuration

# vcluster-values.yaml
vcluster:
  image: rancher/k3s:v1.28.4-k3s1

  resources:
    limits:
      cpu: "2"
      memory: 4Gi
    requests:
      cpu: "200m"
      memory: 256Mi

syncer:
  extraArgs:
    - --sync-all-nodes

sync:
  # Sync services to host cluster
  services:
    enabled: true

  # Sync ingresses
  ingresses:
    enabled: true

  # Sync persistent volume claims
  persistentvolumeclaims:
    enabled: true

  # Don't sync network policies (use host policies)
  networkpolicies:
    enabled: false

# Isolation settings
isolation:
  enabled: true

  resourceQuota:
    enabled: true
    quota:
      requests.cpu: "10"
      requests.memory: 20Gi
      limits.cpu: "20"
      limits.memory: 40Gi
      pods: "100"

  limitRange:
    enabled: true
    default:
      cpu: "500m"
      memory: 512Mi
    defaultRequest:
      cpu: "100m"
      memory: 128Mi

# Persist the datastore on a PVC (the k3s distro uses SQLite by
# default, avoiding a full etcd for a smaller footprint)
storage:
  persistence: true
  size: 5Gi

Deploying vCluster with Helm

# Add Loft Helm repository
helm repo add loft https://charts.loft.sh
helm repo update

# Create namespace
kubectl create namespace vcluster-team-alpha

# Install vCluster
helm upgrade --install team-alpha loft/vcluster \
  --namespace vcluster-team-alpha \
  --values vcluster-values.yaml \
  --wait

# Generate kubeconfig for team
vcluster connect team-alpha \
  --namespace vcluster-team-alpha \
  --update-current=false \
  --kube-config ./team-alpha-kubeconfig.yaml

Policy Enforcement with Kyverno

Kyverno validates, mutates, and generates Kubernetes resources using policies written as ordinary Kubernetes manifests, so platform teams can enforce standards cluster-wide without learning a separate policy language.

Installing Kyverno

# Add Kyverno Helm repository
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update

# Install Kyverno
helm upgrade --install kyverno kyverno/kyverno \
  --namespace kyverno \
  --create-namespace \
  --set replicaCount=3

Require Resource Limits Policy

# policy-require-limits.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
  annotations:
    policies.kyverno.io/title: Require Resource Limits
    policies.kyverno.io/description: >-
      All containers must specify CPU and memory limits.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required for all containers."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

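Under Enforce, a Pod is admitted only if every container declares both limits. For example, this spec passes validation, while the same spec without the resources block would be rejected at admission (names and values are illustrative):

```yaml
# pod-with-limits.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: compliant-pod
  namespace: team-alpha
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        limits:
          cpu: "250m"    # satisfies the "?*" pattern for limits.cpu
          memory: 256Mi  # satisfies the "?*" pattern for limits.memory
```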
Require Labels Policy

# policy-require-labels.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
                - DaemonSet
      validate:
        message: "The label 'team' is required."
        pattern:
          metadata:
            labels:
              team: "?*"

    - name: require-app-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label 'app.kubernetes.io/name' is required."
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: "?*"

Auto-Generate Network Policies

# policy-generate-network-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-network-policy
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
              selector:
                matchLabels:
                  platform.company.com/managed: "true"
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
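The generate rule fires when a matching Namespace is created: any namespace carrying the platform label receives the deny-all policy automatically, and with synchronize: true Kyverno re-creates it if someone deletes it. For example (namespace name is illustrative):

```yaml
# namespace-team-beta.yaml (illustrative)
apiVersion: v1
kind: Namespace
metadata:
  name: team-beta
  labels:
    platform.company.com/managed: "true"   # triggers generation of default-deny
```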

Platform Reliability Dashboard

Monitor platform health and tenant resource usage.

Prometheus Recording Rules

# platform-reliability-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-reliability
  namespace: monitoring
spec:
  groups:
    - name: platform.reliability
      interval: 30s
      rules:
        # Tenant resource utilization
        - record: platform:tenant_cpu_utilization:ratio
          expr: |
            sum by (namespace) (
              rate(container_cpu_usage_seconds_total{namespace=~"team-.*"}[5m])
            )
            /
            sum by (namespace) (
              kube_resourcequota{resource="limits.cpu", type="hard", namespace=~"team-.*"}
            )

        - record: platform:tenant_memory_utilization:ratio
          expr: |
            sum by (namespace) (
              container_memory_working_set_bytes{namespace=~"team-.*"}
            )
            /
            sum by (namespace) (
              kube_resourcequota{resource="limits.memory", type="hard", namespace=~"team-.*"}
            )

        # vCluster health
        - record: platform:vcluster_api_availability:ratio
          expr: |
            avg_over_time(up{job=~"vcluster-.*"}[5m])

        # Policy violations
        - record: platform:policy_violations:count
          expr: |
            sum by (policy) (
              increase(kyverno_policy_results_total{rule_result="fail"}[1h])
            )
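The recording rules above can also drive alerting. A hedged sketch of an alert that fires when a tenant sits above 90% of its CPU quota for 15 minutes (the threshold, duration, and severity label are assumptions to tune for your environment):

```yaml
# platform-reliability-alerts.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-reliability-alerts
  namespace: monitoring
spec:
  groups:
    - name: platform.reliability.alerts
      rules:
        - alert: TenantCPUQuotaNearLimit
          expr: platform:tenant_cpu_utilization:ratio > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Tenant {{ $labels.namespace }} is above 90% of its CPU quota"
```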

Grafana Dashboard for Platform Health

{
  "dashboard": {
    "title": "Platform Reliability",
    "panels": [
      {
        "title": "Tenant CPU Utilization",
        "type": "gauge",
        "targets": [
          {
            "expr": "platform:tenant_cpu_utilization:ratio * 100",
            "legendFormat": "{{namespace}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            },
            "unit": "percent",
            "max": 100
          }
        }
      },
      {
        "title": "vCluster Availability",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(platform:vcluster_api_availability:ratio) * 100",
            "legendFormat": "Availability"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 99},
                {"color": "green", "value": 99.9}
              ]
            }
          }
        }
      },
      {
        "title": "Policy Violations (Last Hour)",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, platform:policy_violations:count)",
            "format": "table"
          }
        ]
      }
    ]
  }
}

Summary

Platform reliability patterns ensure consistent service delivery:

Pattern              Tool                        Purpose
───────────────────  ──────────────────────────  ─────────────────────────────
Namespace Isolation  Network Policies            Control pod-to-pod traffic
Resource Limits      ResourceQuota, LimitRange   Prevent resource exhaustion
Virtual Clusters     vCluster                    Full API isolation per tenant
Policy Enforcement   Kyverno                     Automated compliance

Choose the right isolation level based on your requirements:

  • Namespace isolation: Sufficient for trusted internal teams
  • vCluster: Needed for untrusted workloads or compliance requirements
  • Dedicated clusters: Required for strict regulatory separation

Next Lesson: Module 6 covers Platform Team Operations and Maturity, including building a platform team and adoption strategies.
