Platform Observability & Reliability
Platform Reliability Patterns
Platform reliability ensures consistent service delivery to development teams. This lesson covers multi-tenancy models, namespace isolation with network policies and resource quotas, virtual clusters, and policy enforcement for platform resilience.
Multi-Tenancy Challenges
Platform teams serve multiple development teams with varying requirements. Multi-tenancy introduces challenges around resource isolation, security boundaries, and fair resource allocation.
Tenancy Models
┌────────────────────────────────────────────────────────────────────────┐
│                          Multi-Tenancy Models                          │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  Soft Multi-Tenancy             Hard Multi-Tenancy                     │
│  ──────────────────             ──────────────────                     │
│                                                                        │
│  ┌───────────────────┐          ┌─────────┐ ┌─────────┐ ┌─────────┐    │
│  │  Shared Cluster   │          │Cluster A│ │Cluster B│ │Cluster C│    │
│  │  ┌─────┐ ┌─────┐  │          │ Team A  │ │ Team B  │ │ Team C  │    │
│  │  │NS-A │ │NS-B │  │          └─────────┘ └─────────┘ └─────────┘    │
│  │  └─────┘ └─────┘  │                                                 │
│  │  ┌─────┐ ┌─────┐  │          • Complete isolation                   │
│  │  │NS-C │ │NS-D │  │          • Higher resource cost                 │
│  │  └─────┘ └─────┘  │          • Independent upgrades                 │
│  └───────────────────┘                                                 │
│                                                                        │
│  • Namespace isolation                                                 │
│  • Resource efficient                                                  │
│  • Shared control plane                                                │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
Namespace Isolation with Network Policies
Network policies control traffic flow between namespaces and pods.
Default Deny Policy
# network-policy-default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
Allow Intra-Namespace Traffic
# network-policy-allow-same-namespace.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}
  egress:
  - to:
    - podSelector: {}
  # Allow DNS resolution
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
Allow Traffic from Ingress Controller
# network-policy-allow-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller
  namespace: team-alpha
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
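Once default-deny is in place, metrics scraping also needs an explicit allow. A minimal sketch, assuming Prometheus runs in a namespace labeled `monitoring` and tenant pods expose metrics on TCP port 9090 (both values are assumptions, adjust to your setup):

```yaml
# network-policy-allow-monitoring.yaml (illustrative)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Assumed namespace label; verify with kubectl get ns --show-labels
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
    ports:
    # Assumed metrics port
    - protocol: TCP
      port: 9090
```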
Resource Quotas and Limits
Resource quotas prevent any single team from consuming excessive cluster resources.
Namespace Resource Quota
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    # Compute resources
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    # Storage resources
    requests.storage: 100Gi
    persistentvolumeclaims: "10"
    # Object counts
    pods: "50"
    services: "20"
    secrets: "50"
    configmaps: "50"
    # Services by type
    services.loadbalancers: "2"
    services.nodeports: "5"
Limit Ranges for Default Limits
# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-alpha
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: 512Mi
    defaultRequest:
      cpu: "100m"
      memory: 128Mi
    min:
      cpu: "50m"
      memory: 64Mi
    max:
      cpu: "4"
      memory: 8Gi
  - type: PersistentVolumeClaim
    min:
      storage: 1Gi
    max:
      storage: 50Gi
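To see the LimitRange in action, submit a Pod that omits its `resources` stanza; the admission controller injects the namespace defaults. The Pod name and image below are illustrative placeholders:

```yaml
# pod-no-resources.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: limits-demo
  namespace: team-alpha
spec:
  containers:
  - name: app
    image: nginx:1.25
    # No resources specified: after admission, this container
    # receives requests of 100m CPU / 128Mi and limits of
    # 500m CPU / 512Mi from the LimitRange defaults.
```

Inspecting the created Pod with `kubectl get pod limits-demo -n team-alpha -o yaml` shows the injected values.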
Virtual Clusters with vCluster
vCluster creates fully functional virtual Kubernetes clusters inside namespaces. Each team gets an isolated control plane while sharing underlying infrastructure.
vCluster Architecture
┌────────────────────────────────────────────────────────────────┐
│                          Host Cluster                          │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Namespace: vcluster-team-a                               │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │               vCluster Control Plane               │  │  │
│  │  │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │  │  │
│  │  │ │   API    │ │  Syncer  │ │       CoreDNS        │ │  │  │
│  │  │ │  Server  │ │          │ │                      │ │  │  │
│  │  │ └──────────┘ └──────────┘ └──────────────────────┘ │  │  │
│  │  │ ┌────────────────────────────────────────────────┐ │  │  │
│  │  │ │                etcd (or SQLite)                │ │  │  │
│  │  │ └────────────────────────────────────────────────┘ │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  │                                                          │  │
│  │ Virtual Resources → Synced to Host → Scheduled on Nodes  │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Namespace: vcluster-team-b                               │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │               vCluster Control Plane               │  │  │
│  │  │                  (same structure)                  │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
Installing vCluster CLI
# macOS
brew install loft-sh/tap/vcluster
# Linux
curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-amd64"
chmod +x vcluster
sudo mv vcluster /usr/local/bin/
# Verify installation
vcluster --version
Creating a vCluster
# Create vCluster for team-alpha
vcluster create team-alpha \
--namespace vcluster-team-alpha \
--connect=false
# Connect to vCluster
vcluster connect team-alpha \
--namespace vcluster-team-alpha \
--update-current=false \
-- kubectl get namespaces
vCluster Configuration
# vcluster-values.yaml
vcluster:
  image: rancher/k3s:v1.28.4-k3s1
  resources:
    limits:
      cpu: "2"
      memory: 4Gi
    requests:
      cpu: "200m"
      memory: 256Mi

syncer:
  extraArgs:
  - --sync-all-nodes

sync:
  # Sync services to host cluster
  services:
    enabled: true
  # Sync ingresses
  ingresses:
    enabled: true
  # Sync persistent volume claims
  persistentvolumeclaims:
    enabled: true
  # Don't sync network policies (enforce the host cluster's policies instead)
  networkpolicies:
    enabled: false

# Isolation settings
isolation:
  enabled: true
  resourceQuota:
    enabled: true
    quota:
      requests.cpu: "10"
      requests.memory: 20Gi
      limits.cpu: "20"
      limits.memory: 40Gi
      pods: "100"
  limitRange:
    enabled: true
    default:
      cpu: "500m"
      memory: 512Mi
    defaultRequest:
      cpu: "100m"
      memory: 128Mi

# Persist the embedded datastore (the k3s distro uses SQLite rather than etcd)
storage:
  persistence: true
  size: 5Gi
Deploying vCluster with Helm
# Add Loft Helm repository
helm repo add loft https://charts.loft.sh
helm repo update
# Create namespace
kubectl create namespace vcluster-team-alpha
# Install vCluster
helm upgrade --install team-alpha loft/vcluster \
--namespace vcluster-team-alpha \
--values vcluster-values.yaml \
--wait
# Generate kubeconfig for team
vcluster connect team-alpha \
--namespace vcluster-team-alpha \
--update-current=false \
--kube-config ./team-alpha-kubeconfig.yaml
Policy Enforcement with Kyverno
Kyverno enforces policies across clusters using Kubernetes-native patterns.
Installing Kyverno
# Add Kyverno Helm repository
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
# Install Kyverno
helm upgrade --install kyverno kyverno/kyverno \
--namespace kyverno \
--create-namespace \
--set replicaCount=3
Require Resource Limits Policy
# policy-require-limits.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
  annotations:
    policies.kyverno.io/title: Require Resource Limits
    policies.kyverno.io/description: >-
      All containers must specify CPU and memory limits.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: validate-resources
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "CPU and memory limits are required for all containers."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                memory: "?*"
                cpu: "?*"
Require Labels Policy
# policy-require-labels.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-team-label
    match:
      any:
      - resources:
          kinds:
          - Deployment
          - StatefulSet
          - DaemonSet
    validate:
      message: "The label 'team' is required."
      pattern:
        metadata:
          labels:
            team: "?*"
  - name: require-app-label
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "The label 'app.kubernetes.io/name' is required."
      pattern:
        metadata:
          labels:
            app.kubernetes.io/name: "?*"
Auto-Generate Network Policies
# policy-generate-network-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-network-policy
spec:
  rules:
  - name: generate-default-deny
    match:
      any:
      - resources:
          kinds:
          - Namespace
          selector:
            matchLabels:
              platform.company.com/managed: "true"
    generate:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      name: default-deny
      namespace: "{{request.object.metadata.name}}"
      synchronize: true
      data:
        spec:
          podSelector: {}
          policyTypes:
          - Ingress
          - Egress
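A namespace opts into the generated policy simply by carrying the matching label. The namespace name below is a placeholder:

```yaml
# namespace-team-beta.yaml (illustrative)
apiVersion: v1
kind: Namespace
metadata:
  name: team-beta
  labels:
    # Matches the ClusterPolicy selector, so Kyverno creates a
    # default-deny NetworkPolicy in this namespace on creation.
    platform.company.com/managed: "true"
```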
Platform Reliability Dashboard
Monitor platform health and tenant resource usage.
Prometheus Recording Rules
# platform-reliability-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-reliability
  namespace: monitoring
spec:
  groups:
  - name: platform.reliability
    interval: 30s
    rules:
    # Tenant resource utilization
    - record: platform:tenant_cpu_utilization:ratio
      expr: |
        sum by (namespace) (
          rate(container_cpu_usage_seconds_total{namespace=~"team-.*"}[5m])
        )
        /
        sum by (namespace) (
          kube_resourcequota{resource="limits.cpu", type="hard", namespace=~"team-.*"}
        )
    - record: platform:tenant_memory_utilization:ratio
      expr: |
        sum by (namespace) (
          container_memory_working_set_bytes{namespace=~"team-.*"}
        )
        /
        sum by (namespace) (
          kube_resourcequota{resource="limits.memory", type="hard", namespace=~"team-.*"}
        )
    # vCluster health
    - record: platform:vcluster_api_availability:ratio
      expr: |
        avg_over_time(up{job=~"vcluster-.*"}[5m])
    # Policy violations
    - record: platform:policy_violations:count
      expr: |
        sum by (policy) (
          increase(kyverno_policy_results_total{rule_result="fail"}[1h])
        )
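The recording rules pair naturally with alerting rules that page the platform team before tenants hit their quotas. A sketch under assumed thresholds; the rule names, severities, and threshold values are placeholders to tune against your SLOs:

```yaml
# platform-reliability-alerts.yaml (illustrative thresholds)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-reliability-alerts
  namespace: monitoring
spec:
  groups:
  - name: platform.reliability.alerts
    rules:
    # Warn when a tenant sits above 90% of its CPU quota for 15 minutes
    - alert: TenantCPUQuotaNearlyExhausted
      expr: platform:tenant_cpu_utilization:ratio > 0.9
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Namespace {{ $labels.namespace }} is above 90% of its CPU quota"
    # Page when a vCluster API server drops below 99% availability
    - alert: VClusterAPIDegraded
      expr: platform:vcluster_api_availability:ratio < 0.99
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "A vCluster API server has been below 99% availability for 5 minutes"
```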
Grafana Dashboard for Platform Health
{
  "dashboard": {
    "title": "Platform Reliability",
    "panels": [
      {
        "title": "Tenant CPU Utilization",
        "type": "gauge",
        "targets": [
          {
            "expr": "platform:tenant_cpu_utilization:ratio * 100",
            "legendFormat": "{{namespace}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            },
            "unit": "percent",
            "max": 100
          }
        }
      },
      {
        "title": "vCluster Availability",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(platform:vcluster_api_availability:ratio) * 100",
            "legendFormat": "Availability"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 99},
                {"color": "green", "value": 99.9}
              ]
            }
          }
        }
      },
      {
        "title": "Policy Violations (Last Hour)",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, platform:policy_violations:count)",
            "format": "table"
          }
        ]
      }
    ]
  }
}
Summary
Platform reliability patterns ensure consistent service delivery:
| Pattern | Tool | Purpose |
|---|---|---|
| Namespace Isolation | Network Policies | Control pod-to-pod traffic |
| Resource Limits | ResourceQuota, LimitRange | Prevent resource exhaustion |
| Virtual Clusters | vCluster | Full API isolation per tenant |
| Policy Enforcement | Kyverno | Automated compliance |
Choose the right isolation level based on your requirements:
- Namespace isolation: Sufficient for trusted internal teams
- vCluster: Needed for untrusted workloads or compliance requirements
- Dedicated clusters: Required for strict regulatory separation
Next Lesson: Module 6 covers Platform Team Operations and Maturity, including building a platform team and adoption strategies.
:::