Kubernetes Foundations for ML
kubectl & Debugging for ML Workloads
3 min read
Debugging ML workloads requires understanding GPU scheduling, OOM kills, and distributed training failures. This lesson covers essential kubectl commands and debugging techniques.
Essential kubectl Commands
GPU Resource Inspection
# List all GPU nodes and their capacity
kubectl get nodes -l nvidia.com/gpu.present=true \
-o custom-columns=NAME:.metadata.name,\
GPU_COUNT:.status.allocatable.nvidia\\.com/gpu,\
GPU_TYPE:.metadata.labels.nvidia\\.com/gpu\\.product
# Check GPU allocation across cluster
kubectl describe nodes | grep -A5 "Allocated resources"
# View GPU requests by namespace
kubectl get pods --all-namespaces \
-o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}' \
| grep -v "^$"
Training Job Inspection
# List all training jobs
kubectl get jobs -n ml-research
# Watch job progress
kubectl get jobs -w -n ml-research
# Get job details with events
kubectl describe job training-job-123 -n ml-research
# View job pod status
kubectl get pods -l job-name=training-job-123 -n ml-research
Pod Debugging
# Get pod logs (current container)
kubectl logs training-pod-abc -n ml-research
# Follow logs in real-time
kubectl logs -f training-pod-abc -n ml-research
# Get logs from previous crashed container
kubectl logs training-pod-abc -n ml-research --previous
# Get logs from specific container in multi-container pod
kubectl logs training-pod-abc -c trainer -n ml-research
# Get logs with timestamps
kubectl logs training-pod-abc -n ml-research --timestamps=true
Debugging Common ML Issues
GPU Not Available
# Check if GPU device plugin is running
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Check node GPU capacity
kubectl describe node gpu-node-1 | grep -A10 "Capacity:"
# Check if pod is pending due to GPU
kubectl describe pod training-pod-abc -n ml-research | grep -A10 "Events:"
# Common output when GPU unavailable:
# Warning FailedScheduling Insufficient nvidia.com/gpu
Solution checklist:
- Verify NVIDIA device plugin is running
- Check node labels match pod requirements
- Verify resource quota allows GPU allocation
- Check if other pods are consuming GPUs
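The quota and label items on this checklist map directly to commands; a minimal sketch, reusing the pod and node names from earlier in this lesson:
# Verify any ResourceQuota in the namespace leaves room for GPUs
kubectl get resourcequota -n ml-research
# Confirm node labels match the pod's nodeSelector/affinity
kubectl get nodes --show-labels | grep nvidia
kubectl get pod training-pod-abc -n ml-research -o jsonpath='{.spec.nodeSelector}'
# See which pods currently hold GPUs on a candidate node
kubectl describe node gpu-node-1 | grep -A15 "Non-terminated Pods"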
OOM (Out of Memory) Kills
# Check for OOMKilled status
kubectl get pod training-pod-abc -n ml-research \
-o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# Get detailed termination info
kubectl describe pod training-pod-abc -n ml-research | grep -A5 "Last State:"
# Check memory usage before OOM
kubectl top pod training-pod-abc -n ml-research
OOM debugging flow:
Pod OOMKilled
│
├── Check requested vs used memory
│     kubectl top pod <pod>
│
├── Increase memory limits
│     resources.limits.memory: "64Gi"
│
└── Check for memory leaks
      - GPU memory not released
      - DataLoader workers accumulating
      - Gradient accumulation holding on to computation graphs
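Raising the limit on a running Job is not a one-liner, because a Job's pod template is immutable once created; a sketch of the usual loop, assuming the job manifest lives in a hypothetical job.yaml:
# Inspect the limits the OOMKilled pod was running with
kubectl get job training-job-123 -n ml-research -o jsonpath='{.spec.template.spec.containers[0].resources}'
# Delete the Job, raise resources.limits.memory in the manifest, resubmit
kubectl delete job training-job-123 -n ml-research
kubectl apply -f job.yaml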
Distributed Training Failures
# Check all pods in distributed training
kubectl get pods -l training-job=distributed-123 -n ml-research
# Check pod-to-pod communication
kubectl exec -it worker-0 -n ml-research -- ping worker-1
# Check NCCL environment variables
kubectl exec -it worker-0 -n ml-research -- env | grep NCCL
# View logs from master/worker-0
kubectl logs -f worker-0 -n ml-research
# Check if all workers are ready
kubectl get pods -l training-job=distributed-123 -n ml-research \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
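The ping test above assumes worker names resolve in DNS, which normally requires a headless Service selecting the worker pods; a quick check, assuming a hypothetical Service named distributed-123-workers:
# The headless Service should list one endpoint per worker pod
kubectl get endpoints distributed-123-workers -n ml-research
# Resolve a peer worker through the headless Service from inside worker-0
kubectl exec -it worker-0 -n ml-research -- getent hosts worker-1.distributed-123-workers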
Interactive Debugging
Exec into Running Pods
# Start shell in training container
kubectl exec -it training-pod-abc -n ml-research -- /bin/bash
# Run nvidia-smi inside pod
kubectl exec training-pod-abc -n ml-research -- nvidia-smi
# Check GPU memory usage
kubectl exec training-pod-abc -n ml-research -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Test Python environment
kubectl exec training-pod-abc -n ml-research -- python -c "import torch; print(torch.cuda.is_available())"
Debug Pods for Failed Containers
# Create debug pod on same node as failed pod
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
  namespace: ml-research
spec:
  nodeName: gpu-node-1  # Same node as failed pod
  containers:
  - name: debug
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: shared-data
      mountPath: /data
  volumes:
  - name: shared-data
    persistentVolumeClaim:
      claimName: training-data
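One way to use this manifest, plus an ephemeral-container alternative on clusters where kubectl debug is available (debug-pod.yaml is a hypothetical filename):
# Apply, inspect the GPU and mounted data, then clean up
kubectl apply -f debug-pod.yaml
kubectl exec -it debug-pod -n ml-research -- nvidia-smi
kubectl delete pod debug-pod -n ml-research
# Alternative: attach an ephemeral debug container to the failed pod itself
kubectl debug -it training-pod-abc -n ml-research --image=busybox --target=trainer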
Port Forwarding for ML Services
# Forward TensorBoard
kubectl port-forward svc/tensorboard 6006:6006 -n ml-research
# Forward MLflow UI
kubectl port-forward svc/mlflow 5000:5000 -n ml-research
# Forward Jupyter notebook
kubectl port-forward pod/notebook-abc 8888:8888 -n ml-research
# Forward model inference endpoint for testing
kubectl port-forward svc/model-server 8080:80 -n ml-production
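With a forward running, the service can be exercised from the workstation; a sketch, assuming the model server exposes a /healthz endpoint (an assumption about the server, not part of kubectl):
# In a second terminal, hit the forwarded inference endpoint
curl -s http://localhost:8080/healthz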
Resource Monitoring
# Watch resource usage across namespace
watch -n 2 'kubectl top pods -n ml-research'
# Get detailed node resource usage
kubectl top nodes
# Check pod resource requests vs actual usage
kubectl get pod training-pod-abc -n ml-research \
-o jsonpath='{.spec.containers[*].resources}' | jq
# Export metrics to file
kubectl top pods -n ml-research --no-headers \
| awk '{print $1","$2","$3}' > pod_metrics.csv
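kubectl top reports a single point-in-time sample, so catching a slow memory climb before an OOM kill usually means sampling in a loop; a minimal sketch that appends timestamped rows to the CSV above:
# Sample pod metrics every 30 seconds, one Unix timestamp per row
while true; do
  kubectl top pods -n ml-research --no-headers | awk -v ts="$(date +%s)" '{print ts","$1","$2","$3}' >> pod_metrics.csv
  sleep 30
done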
Quick Reference Table
| Issue | Command | What to Check |
|---|---|---|
| GPU unavailable | kubectl describe pod | Events section |
| OOM killed | kubectl top pod | Memory vs limits |
| Training stuck | kubectl logs -f | Last log output |
| Network issue | kubectl exec -- ping | Pod connectivity |
| Storage full | kubectl exec -- df -h | Disk usage |
| Image pull error | kubectl describe pod | Image pull status |
Next module: GPU Scheduling and Resource Management with NVIDIA tools, Kueue, and advanced scheduling patterns.