Kubernetes Troubleshooting Mastery

Interviewers often use troubleshooting questions to tell senior candidates from junior ones. This lesson walks through a systematic approach, one layer at a time.

The Troubleshooting Framework

1. Gather Information
   kubectl describe, logs, events

2. Identify the Layer
   Application → Pod → Node → Cluster

3. Form Hypothesis
   Based on error messages and symptoms

4. Test and Verify
   Fix → Test → Document

Pod Troubleshooting

Pod States and What They Mean

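| Status | Common Causes | First Steps |
|---|---|---|
| Pending | Insufficient node resources, unsatisfied affinity or taints, unbound PVC | `kubectl describe pod` (read the Events section) |
| ImagePullBackOff | Wrong image name or tag, missing registry credentials, registry unreachable | Check image name and `imagePullSecrets` |
| CrashLoopBackOff | Container exits repeatedly shortly after starting | `kubectl logs`, `kubectl logs --previous` |
| OOMKilled | Container exceeded its memory limit | Raise the memory limit or fix the leak |
| Evicted | Node resource pressure (memory or disk) | Check node conditions and resource usage |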

Debug Commands

# Basic status
kubectl get pod <name> -o wide

# Detailed information
kubectl describe pod <name>

# Current logs
kubectl logs <name> -c <container>

# Previous container logs (after crash)
kubectl logs <name> --previous

# Follow logs
kubectl logs -f <name>

# All containers in pod
kubectl logs <name> --all-containers

# Execute into container
kubectl exec -it <name> -- /bin/sh

# For distroless images (no shell): attach an ephemeral debug container
kubectl debug -it <name> --image=busybox --target=<container>
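
The commands above are usually run together when a pod first misbehaves. A minimal sketch of a wrapper that collects them in one pass follows; the function name `pod_snapshot` and the 50-line tail are assumptions made for illustration, not part of kubectl.

```shell
# Hypothetical helper: print the usual first-look data for one pod.
# Usage: pod_snapshot <pod> [namespace]
pod_snapshot() {
  local pod="$1"
  local ns="${2:-default}"

  echo "== status =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== describe =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== current logs (last 50 lines per container) =="
  kubectl logs "$pod" -n "$ns" --all-containers --tail=50

  echo "== previous logs =="
  # --previous fails when the container has never restarted; treat that as "nothing to show".
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null \
    || echo "(no previous container instance)"
}
```

Redirecting the output to a file (for example `pod_snapshot web-1 prod > web-1.txt`, with hypothetical names) keeps the evidence around after the pod is replaced.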

CrashLoopBackOff Deep Dive

# Step 1: Check current status
kubectl describe pod <name> | grep -A10 "State:"

# Step 2: Check exit code
kubectl describe pod <name> | grep "Exit Code"
# Exit 1: Application error
# Exit 137: SIGKILL (128 + 9), most often the OOM killer; confirm with Reason: OOMKilled
# Exit 143: SIGTERM (128 + 15)

# Step 3: Check previous logs
kubectl logs <name> --previous

# Step 4: Check events
kubectl get events --field-selector involvedObject.name=<name>

# Step 5: Test dependencies (DNS, network) from a separate throwaway pod
kubectl run debug --rm -it --image=busybox -- /bin/sh
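
The exit-code rule in Step 2 (codes above 128 mean "killed by signal code minus 128") can be sketched as a small function. The name `explain_exit_code` and the wording of its messages are illustrative assumptions.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
# Usage: explain_exit_code <code>
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed successfully"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 mean the process was killed by signal (code - 128).
    local sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: SIGKILL (signal 9), often the OOM killer" ;;
      15) echo "exit $code: SIGTERM (signal 15), graceful stop requested" ;;
      *)  echo "exit $code: killed by signal $sig" ;;
    esac
  else
    echo "exit $code: application error, check the logs"
  fi
}
```

One way to feed it is the last terminated state of the first container: `kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`.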

Node Troubleshooting

Node Conditions

# Check node status
kubectl get nodes

# Detailed node info
kubectl describe node <name>

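# Key conditions to check (healthy value in parentheses):
# Ready              (True)  - kubelet is healthy and can accept pods
# MemoryPressure     (False) - node is running low on memory
# DiskPressure       (False) - node is running low on disk
# PIDPressure        (False) - too many processes on the node
# NetworkUnavailable (False) - node network is not correctly configured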
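
On a healthy node, Ready is True and every other condition is False. As a rough sketch, the filter below flags anything that deviates from that; the function name `abnormal_conditions` and the `Type=Status` line format are assumptions chosen for this example.

```shell
# Hypothetical filter: reads "Type=Status" lines on stdin and prints the
# conditions that are not in their healthy state.
# Usage:
#   kubectl get node <name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | abnormal_conditions
abnormal_conditions() {
  awk -F= '
    NF < 2 { next }                                   # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print $1 " is " $2; next }
    $1 != "Ready" && $2 != "False" { print $1 " is " $2 }
  '
}
```

No output means all reported conditions look healthy; a status of Unknown usually indicates the kubelet has stopped reporting.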

Node Not Ready

# Step 1: Check kubelet
ssh node1 "systemctl status kubelet"
ssh node1 "journalctl -u kubelet --since '5 minutes ago'"

# Step 2: Check container runtime
ssh node1 "systemctl status containerd"
ssh node1 "crictl ps"

# Step 3: Check resources
ssh node1 "free -h && df -h"

# Step 4: Check network
ssh node1 "ping -c3 <control-plane-ip>"

Service and Networking Troubleshooting

Service Not Working

# Step 1: Verify service exists and has endpoints
kubectl get svc <name>
kubectl get endpoints <name>

# Step 2: Check if pods match selector
kubectl get pods -l <selector-labels>

# Step 3: Test from within cluster
kubectl run test --rm -it --image=busybox -- wget -qO- http://<service>:<port>

# Step 4: Check kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy

# Step 5: Check DNS (busybox:1.28 is pinned because nslookup in some newer busybox tags is unreliable)
kubectl run test --rm -it --image=busybox:1.28 -- nslookup <service>
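
Step 1 is the check that most often explains a dead Service: no endpoints means no traffic. A minimal sketch that turns it into a pass/fail helper follows; the function name `service_has_endpoints` and its messages are hypothetical.

```shell
# Hypothetical helper: report whether a Service currently has ready backends.
# Usage: service_has_endpoints <service> [namespace]
# Returns 0 if endpoints exist, 1 if none, 2 if the lookup itself failed.
service_has_endpoints() {
  local svc="$1"
  local ns="${2:-default}"
  local ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')" || return 2
  if [ -z "$ips" ]; then
    echo "$svc: no ready endpoints; compare the Service selector with pod labels"
    return 1
  fi
  echo "$svc: endpoints $ips"
}
```

When it reports no endpoints, continue with Step 2: either the selector matches no pods, or the matching pods are failing their readiness probes.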

DNS Issues

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test DNS resolution (busybox:1.28 is pinned because nslookup in some newer busybox tags is unreliable)
kubectl run test --rm -it --image=busybox:1.28 -- nslookup kubernetes.default

# Check /etc/resolv.conf in pod
kubectl exec <pod> -- cat /etc/resolv.conf
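
If a short name such as `kubernetes.default` fails but the fully qualified name resolves, the search domains in `/etc/resolv.conf` are the likely culprit. The sketch below builds the fully qualified name to test with; `service_fqdn` is a hypothetical helper, and `cluster.local` is only the common default cluster domain, so treat it as an assumption.

```shell
# Hypothetical helper: build the fully qualified DNS name of a Service.
# Usage: service_fqdn <service> [namespace] [cluster-domain]
# The cluster domain defaults to cluster.local, which may differ in your cluster.
service_fqdn() {
  local svc="$1"
  local ns="${2:-default}"
  local domain="${3:-cluster.local}"
  echo "$svc.$ns.svc.$domain"
}
```

For example, `nslookup "$(service_fqdn web prod)"` (with hypothetical names) queries `web.prod.svc.cluster.local` and bypasses the search list entirely.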

Storage Troubleshooting

PVC Pending

# Check PVC status
kubectl get pvc
kubectl describe pvc <name>

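# Common issues:
# - StorageClass missing or misspelled (and no default class set)
# - No PV with enough capacity, or the provisioner cannot create one
# - Access mode mismatch
# - Volume node affinity conflicts with where the pod can be scheduled
# Note: with volumeBindingMode WaitForFirstConsumer, Pending is normal until a pod uses the PVC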

# Check storage class
kubectl get sc

# Check PV availability
kubectl get pv
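
In a busy cluster it helps to list only the claims that are stuck. A minimal sketch, assuming the default column order of `kubectl get pvc --all-namespaces --no-headers` (NAMESPACE, NAME, STATUS first); the function name `pending_pvcs` is hypothetical.

```shell
# Hypothetical filter: reads `kubectl get pvc --all-namespaces --no-headers`
# on stdin and prints namespace/name for every claim whose STATUS is Pending.
# Usage: kubectl get pvc --all-namespaces --no-headers | pending_pvcs
pending_pvcs() {
  awk '$3 == "Pending" { print $1 "/" $2 }'
}
```

Each result can then be passed to `kubectl describe pvc <name> -n <namespace>` to read the provisioning events.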

Interview Scenarios

Q: "Pods are running but users report 502 errors. How do you investigate?"

# 1. Check if pods are ready
kubectl get pods -l app=web
# Look for READY column: should be n/n (all containers ready), for example 1/1

# 2. Check readiness probe
kubectl describe pod <name> | grep -A5 "Readiness"

# 3. Check service endpoints
kubectl get endpoints web-service
# Empty endpoints = no ready pods, or the Service selector matches no pods

# 4. Check ingress
kubectl describe ingress web-ingress
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# 5. Test directly, bypassing the Service and Ingress
kubectl port-forward pod/<name> 8080:8080
# In a second terminal (port-forward blocks while it runs):
curl localhost:8080
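
Step 1 relies on reading the READY column by eye. With many replicas, a filter is less error-prone. The sketch below assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS); `unready_pods` is a hypothetical name.

```shell
# Hypothetical filter: reads `kubectl get pods --no-headers` on stdin and
# prints pods whose READY column is not n/n (for example 0/1 or 1/2).
# Usage: kubectl get pods -l app=web --no-headers | unready_pods
unready_pods() {
  awk '{
    split($2, ready, "/")          # "1/2" -> ready[1]=1, ready[2]=2
    if (ready[1] != ready[2]) print $1 " " $2 " " $3
  }'
}
```

Pods of finished Jobs show up too (for example `0/1 Completed`), so combine it with a label selector for the workload under investigation.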

Q: "A deployment rollout is stuck. What do you do?"

# Check rollout status
kubectl rollout status deployment/<name>

# See deployment events
kubectl describe deployment <name>

# Check replicasets
kubectl get rs -l app=<name>

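# Common issues:
# - New pods failing readiness probes
# - Resource quota exceeded
# - New pods stuck in Pending or ImagePullBackOff
# (A PodDisruptionBudget limits evictions such as node drains; it does not block a rolling update)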

# Rollback if needed
kubectl rollout undo deployment/<name>

# Or to specific revision
kubectl rollout undo deployment/<name> --to-revision=2
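
In scripts, `kubectl rollout status` is more useful with a deadline, because by default it waits until the rollout finishes. A minimal sketch follows; the function name `check_rollout` and the 120-second default are assumptions, and the rollback is deliberately left as a manual decision.

```shell
# Hypothetical helper: wait for a rollout with a deadline, then show why it stalled.
# Usage: check_rollout <deployment> [timeout]
check_rollout() {
  local deploy="$1"
  local timeout="${2:-120s}"
  if kubectl rollout status "deployment/$deploy" --timeout="$timeout"; then
    echo "rollout of $deploy completed"
    return 0
  fi
  echo "rollout of $deploy did not finish within $timeout"
  # Conditions and events explain the stall; history lists revisions to roll back to.
  kubectl describe deployment "$deploy"
  kubectl rollout history "deployment/$deploy"
  return 1
}
```

A non-zero return is the signal to inspect the output and decide whether `kubectl rollout undo` is appropriate.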

Q: "Cluster is running slow. How do you identify the bottleneck?"

# 1. Check control plane
kubectl get componentstatuses  # Deprecated since v1.19; output may be stale or misleading
kubectl get --raw='/readyz?verbose'  # Preferred API server health check
kubectl get pods -n kube-system

# 2. Check API server latency
kubectl get --raw /metrics | grep apiserver_request_duration

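# 3. Check etcd (static pod is named etcd-<node-name>; cert paths are kubeadm defaults, adjust as needed)
kubectl exec -n kube-system etcd-master -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table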

# 4. Check node resources
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu | head

# 5. Check for resource pressure
kubectl get nodes -o custom-columns='NAME:.metadata.name,MEMORY:.status.conditions[?(@.type=="MemoryPressure")].status'
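
To narrow step 4 down to the nodes that matter, the usage percentages from `kubectl top nodes` can be filtered against a threshold. The sketch below assumes the default column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); the function name `hot_nodes` and the default threshold of 80 are illustrative choices.

```shell
# Hypothetical filter: reads `kubectl top nodes --no-headers` on stdin and
# prints nodes whose CPU% or MEMORY% is at or above the threshold.
# Usage: kubectl top nodes --no-headers | hot_nodes 80
hot_nodes() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # "95%" -> "95"
    if (cpu + 0 >= t + 0 || mem + 0 >= t + 0) print $1 " cpu=" $3 " mem=" $5
  }'
}
```

Nodes without metrics are treated as 0% and never flagged, so an empty result does not prove the metrics pipeline is healthy.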

You've mastered Kubernetes! Next module: Monitoring, Observability & Incident Response, the production skills that define SRE.