Kubernetes Foundations for ML
kubectl & Debugging for ML Workloads
3 min read
Debugging ML workloads requires understanding GPU scheduling, OOM kills, and distributed training failures. This lesson covers essential kubectl commands and debugging techniques.
Essential kubectl Commands
GPU Resource Inspection
# List all GPU nodes and their capacity
kubectl get nodes -l nvidia.com/gpu.present=true \
-o custom-columns=NAME:.metadata.name,\
GPU_COUNT:.status.allocatable.nvidia\\.com/gpu,\
GPU_TYPE:.metadata.labels.nvidia\\.com/gpu\\.product
# Check GPU allocation across cluster
kubectl describe nodes | grep -A5 "Allocated resources"
# View GPU requests by namespace
kubectl get pods --all-namespaces \
-o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}' \
| grep -v "^$"
Training Job Inspection
# List all training jobs
kubectl get jobs -n ml-research
# Watch job progress
kubectl get jobs -w -n ml-research
# Get job details with events
kubectl describe job training-job-123 -n ml-research
# View job pod status
kubectl get pods -l job-name=training-job-123 -n ml-research
Pod Debugging
# Get pod logs (current container)
kubectl logs training-pod-abc -n ml-research
# Follow logs in real-time
kubectl logs -f training-pod-abc -n ml-research
# Get logs from previous crashed container
kubectl logs training-pod-abc -n ml-research --previous
# Get logs from specific container in multi-container pod
kubectl logs training-pod-abc -c trainer -n ml-research
# Get logs with timestamps
kubectl logs training-pod-abc -n ml-research --timestamps=true
Debugging Common ML Issues
GPU Not Available
# Check if GPU device plugin is running
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Check node GPU capacity
kubectl describe node gpu-node-1 | grep -A10 "Capacity:"
# Check if pod is pending due to GPU
kubectl describe pod training-pod-abc -n ml-research | grep -A10 "Events:"
# Common output when GPU unavailable:
# Warning FailedScheduling Insufficient nvidia.com/gpu
Solution checklist:
- Verify NVIDIA device plugin is running
- Check node labels match pod requirements
- Verify resource quota allows GPU allocation
- Check if other pods are consuming GPUs
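The quota and label items on this checklist map directly to commands; a minimal sketch, reusing the pod and node names from earlier in this lesson:
# Verify any ResourceQuota in the namespace leaves room for GPUs
kubectl get resourcequota -n ml-research
# Confirm node labels match the pod's nodeSelector/affinity
kubectl get nodes --show-labels | grep nvidia
kubectl get pod training-pod-abc -n ml-research -o jsonpath='{.spec.nodeSelector}'
# See which pods currently hold GPUs on a candidate node
kubectl describe node gpu-node-1 | grep -A15 "Non-terminated Pods"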
OOM (Out of Memory) Kills
# Check for OOMKilled status
kubectl get pod training-pod-abc -n ml-research \
-o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# Get detailed termination info
kubectl describe pod training-pod-abc -n ml-research | grep -A5 "Last State:"
# Check memory usage before OOM
kubectl top pod training-pod-abc -n ml-research
OOM debugging flow:
Pod OOMKilled
│
├── Check requested vs used memory
│     kubectl top pod <pod>
│
├── Increase memory limits
│     resources.limits.memory: "64Gi"
│
└── Check for memory leaks
      - GPU memory not released
      - DataLoader workers accumulating
      - Gradient accumulation holding on to computation graphs
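Raising the limit on a running Job is not a one-liner, because a Job's pod template is immutable once created; a sketch of the usual loop, assuming the job manifest lives in a hypothetical job.yaml:
# Inspect the limits the OOMKilled pod was running with
kubectl get job training-job-123 -n ml-research -o jsonpath='{.spec.template.spec.containers[0].resources}'
# Delete the Job, raise resources.limits.memory in the manifest, resubmit
kubectl delete job training-job-123 -n ml-research
kubectl apply -f job.yaml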
Distributed Training Failures
# Check all pods in distributed training
kubectl get pods -l training-job=distributed-123 -n ml-research
# Check pod-to-pod communication
kubectl exec -it worker-0 -n ml-research -- ping worker-1
# Check NCCL environment variables
kubectl exec -it worker-0 -n ml-research -- env | grep NCCL
# View logs from master/worker-0
kubectl logs -f worker-0 -n ml-research
# Check if all workers are ready
kubectl get pods -l training-job=distributed-123 -n ml-research \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
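The ping test above assumes worker names resolve in DNS, which normally requires a headless Service selecting the worker pods; a quick check, assuming a hypothetical Service named distributed-123-workers:
# The headless Service should list one endpoint per worker pod
kubectl get endpoints distributed-123-workers -n ml-research
# Resolve a peer worker through the headless Service from inside worker-0
kubectl exec -it worker-0 -n ml-research -- getent hosts worker-1.distributed-123-workers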
Interactive Debugging
Exec into Running Pods
# Start shell in training container
kubectl exec -it training-pod-abc -n ml-research -- /bin/bash
# Run nvidia-smi inside pod
kubectl exec training-pod-abc -n ml-research -- nvidia-smi
# Check GPU memory usage
kubectl exec training-pod-abc -n ml-research -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Test Python environment
kubectl exec training-pod-abc -n ml-research -- python -c "import torch; print(torch.cuda.is_available())"
Debug Pods for Failed Containers
# Create debug pod on same node as failed pod
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
  namespace: ml-research
spec:
  nodeName: gpu-node-1  # Same node as failed pod
  containers:
  - name: debug
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: shared-data
      mountPath: /data
  volumes:
  - name: shared-data
    persistentVolumeClaim:
      claimName: training-data
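One way to use this manifest, plus an ephemeral-container alternative on clusters where kubectl debug is available (debug-pod.yaml is a hypothetical filename):
# Apply, inspect the GPU and mounted data, then clean up
kubectl apply -f debug-pod.yaml
kubectl exec -it debug-pod -n ml-research -- nvidia-smi
kubectl delete pod debug-pod -n ml-research
# Alternative: attach an ephemeral debug container to the failed pod itself
kubectl debug -it training-pod-abc -n ml-research --image=busybox --target=trainer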
Port Forwarding for ML Services
# Forward TensorBoard
kubectl port-forward svc/tensorboard 6006:6006 -n ml-research
# Forward MLflow UI
kubectl port-forward svc/mlflow 5000:5000 -n ml-research
# Forward Jupyter notebook
kubectl port-forward pod/notebook-abc 8888:8888 -n ml-research
# Forward model inference endpoint for testing
kubectl port-forward svc/model-server 8080:80 -n ml-production
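With a forward running, the service can be exercised from the workstation; a sketch, assuming the model server exposes a /healthz endpoint (an assumption about the server, not part of kubectl):
# In a second terminal, hit the forwarded inference endpoint
curl -s http://localhost:8080/healthz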
Resource Monitoring
# Watch resource usage across namespace
watch -n 2 'kubectl top pods -n ml-research'
# Get detailed node resource usage
kubectl top nodes
# Check pod resource requests vs actual usage
kubectl get pod training-pod-abc -n ml-research \
-o jsonpath='{.spec.containers[*].resources}' | jq
# Export metrics to file
kubectl top pods -n ml-research --no-headers \
| awk '{print $1","$2","$3}' > pod_metrics.csv
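kubectl top reports a single point-in-time sample, so catching a slow memory climb before an OOM kill usually means sampling in a loop; a minimal sketch that appends timestamped rows to the CSV above:
# Sample pod metrics every 30 seconds, one Unix timestamp per row
while true; do
  kubectl top pods -n ml-research --no-headers | awk -v ts="$(date +%s)" '{print ts","$1","$2","$3}' >> pod_metrics.csv
  sleep 30
done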
Quick Reference Table
| Issue | Command | What to Check |
|---|---|---|
| GPU unavailable | kubectl describe pod | Events section |
| OOM killed | kubectl top pod | Memory vs limits |
| Training stuck | kubectl logs -f | Last log output |
| Network issue | kubectl exec -- ping | Pod connectivity |
| Storage full | kubectl exec -- df -h | Disk usage |
| Image pull error | kubectl describe pod | Image pull status |
Next module: GPU Scheduling and Resource Management with NVIDIA tools, Kueue, and advanced scheduling patterns.