Cost Optimization for GPU Workloads

GPU costs often dominate ML infrastructure budgets. Effective cost optimization combines right-sizing, spot instances, GPU sharing, and smart scheduling, which together can cut costs by roughly 40-70% while maintaining performance.

Cost Optimization Strategies

┌─────────────────────────────────────────────────────────────────────┐
│                    GPU Cost Optimization Layers                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Layer 1: Right-sizing                                      │    │
│  │  - Match GPU type to workload requirements                  │    │
│  │  - Avoid over-provisioning memory/compute                   │    │
│  │  - Use profiling to determine actual needs                  │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Layer 2: Spot/preemptible instances (50-90% savings)       │    │
│  │  - Run training jobs on spot instances                      │    │
│  │  - Checkpoint for fault tolerance                           │    │
│  │  - Fall back to on-demand when needed                       │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Layer 3: GPU sharing and time-slicing                      │    │
│  │  - MIG to isolate workloads on A100/H100                    │    │
│  │  - Time-slicing for smaller workloads                       │    │
│  │  - Multi-instance serving                                   │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Layer 4: Smart scheduling                                  │    │
│  │  - Bin-packing optimization                                 │    │
│  │  - Preemption policies                                      │    │
│  │  - Queue-based admission (Kueue)                            │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
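
To make the arithmetic behind these layers concrete, here is a minimal Python sketch. It assumes that each layer's saving is an independent fraction of the remaining bill, which is a simplification, and the function names and all rates are hypothetical illustrations rather than real prices.

```python
def blended_hourly_cost(on_demand_rate, spot_discount, spot_fraction):
    """Average hourly rate when `spot_fraction` of GPU-hours run on spot.

    `spot_discount` is the fractional discount versus on-demand (0.6 = 60% off).
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    return spot_fraction * spot_rate + (1.0 - spot_fraction) * on_demand_rate


def combined_savings(layer_savings):
    """Compound fractional savings from several layers: 1 - prod(1 - s_i)."""
    remaining = 1.0
    for saving in layer_savings:
        remaining *= 1.0 - saving
    return 1.0 - remaining


# Illustrative numbers only: a 20% right-sizing saving followed by a 50%
# saving from spot leaves 0.8 * 0.5 = 40% of the original bill.
total = combined_savings([0.2, 0.5])
```

Because the savings compound on what remains, the combined figure is smaller than the sum of the individual percentages.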

Spot Instance Configuration

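# GKE Spot node pool
# Note: treat this manifest as schematic. GKE node pools are normally created
# with gcloud, Terraform, or Config Connector rather than a built-in Kubernetes kind.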
apiVersion: container.gke.io/v1
kind: NodePool
metadata:
  name: gpu-spot-pool
spec:
  config:
    machineType: a2-highgpu-4g  # 4x A100 40GB
    accelerators:
    - acceleratorType: nvidia-tesla-a100
      acceleratorCount: 4
    spot: true
    taints:
    - key: cloud.google.com/gke-spot
      value: "true"
      effect: NoSchedule
---
# Training Job that tolerates the spot taint
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  template:
    spec:
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: training:v1
        resources:
          limits:
            nvidia.com/gpu: 4
        env:
        - name: CHECKPOINT_DIR
          value: "s3://checkpoints/training-job"
        - name: CHECKPOINT_INTERVAL
          value: "300"  # save every 5 minutes
      restartPolicy: OnFailure
---
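# AWS Karpenter provisioning for spot GPUs
# Note: karpenter.sh/v1alpha5 Provisioner is the legacy API; newer Karpenter
# releases replace it with the NodePool resource.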
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["p4d.24xlarge", "p3.16xlarge", "g5.48xlarge"]
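  # Note: g5.48xlarge carries A10G GPUs, which the product list below does not
  # include, so that instance type would not satisfy this requirement as written.
  - key: nvidia.com/gpu.product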
    operator: In
    values: ["NVIDIA-A100-SXM4-40GB", "NVIDIA-V100-SXM2-16GB"]
  limits:
    resources:
      nvidia.com/gpu: 100
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 86400
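
Spot capacity only pays off if the trainer can resume after a preemption. The following is a minimal Python sketch of a time-based checkpoint loop that honors the `CHECKPOINT_DIR` and `CHECKPOINT_INTERVAL` variables set in the Job above. It assumes a local or mounted directory rather than an `s3://` URL, stores only a step counter instead of real model state, and the function names (`load_step`, `save_step`, `train`, `config_from_env`) are hypothetical.

```python
import json
import os
import tempfile
import time


def load_step(checkpoint_dir):
    """Return the last checkpointed step, or 0 if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, "state.json")
    try:
        with open(path) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0


def save_step(checkpoint_dir, step):
    """Write the checkpoint atomically so a preemption never leaves a partial file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, os.path.join(checkpoint_dir, "state.json"))


def train(total_steps, checkpoint_dir, interval_s, run_step):
    """Resume from the last checkpoint and save every `interval_s` seconds."""
    step = load_step(checkpoint_dir)
    last_save = time.monotonic()
    while step < total_steps:
        run_step(step)  # one unit of training work (placeholder)
        step += 1
        if time.monotonic() - last_save >= interval_s:
            save_step(checkpoint_dir, step)
            last_save = time.monotonic()
    save_step(checkpoint_dir, step)  # final checkpoint on completion
    return step


def config_from_env():
    """Read the settings that the Job manifest passes as environment variables."""
    directory = os.environ.get("CHECKPOINT_DIR", "/tmp/checkpoints")
    interval = float(os.environ.get("CHECKPOINT_INTERVAL", "300"))
    return directory, interval
```

With `restartPolicy: OnFailure`, a restarted container calls `train` again and continues from the last saved step, so at most one checkpoint interval of work is lost per preemption.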

Monitoring GPU Utilization

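# OpenCost for GPU cost tracking
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opencost
  namespace: opencost
spec:
  # apps/v1 Deployments require a selector that matches the pod template labels
  # (the app label here is chosen for illustration)
  selector:
    matchLabels:
      app: opencost
  template:
    metadata:
      labels:
        app: opencost
    spec:
      containers:
      - name: opencost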
        image: ghcr.io/opencost/opencost:1.110
        env:
        - name: CLUSTER_ID
          value: "ml-production"
        - name: PROMETHEUS_SERVER_ENDPOINT
          value: "http://prometheus:9090"
        - name: GPU_ENABLED
          value: "true"
        - name: CLOUD_PROVIDER_API_KEY
          valueFrom:
            secretKeyRef:
              name: cloud-provider-key
              key: api-key
---
# Grafana dashboard for GPU cost analysis
# Note: gpu_cost_total and gpu_cost_hourly are assumed to be cost metrics or
# recording rules available in Prometheus; DCGM_FI_DEV_GPU_UTIL comes from
# NVIDIA's DCGM exporter.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-cost-dashboard
data:
  dashboard.json: |
    {
      "panels": [
        {
          "title": "GPU cost by namespace",
          "targets": [{
            "expr": "sum(gpu_cost_total) by (namespace)"
          }]
        },
        {
          "title": "GPU utilization vs. cost efficiency",
          "targets": [{
            "expr": "avg(DCGM_FI_DEV_GPU_UTIL) by (pod) / sum(gpu_cost_hourly) by (pod)"
          }]
        },
        {
          "title": "Idle GPU cost (waste)",
          "targets": [{
            "expr": "sum(gpu_cost_hourly * (1 - DCGM_FI_DEV_GPU_UTIL/100))"
          }]
        }
      ]
    }
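
The idle-cost panel multiplies each GPU's hourly cost by the fraction of time it sits unused. A minimal Python sketch of the same formula, with hypothetical function names and illustrative rates, shows what the expression computes:

```python
def idle_gpu_cost(hourly_cost, utilization_pct):
    """Hourly cost attributable to idle capacity: cost * (1 - util / 100)."""
    if not 0.0 <= utilization_pct <= 100.0:
        raise ValueError("utilization_pct must be between 0 and 100")
    return hourly_cost * (1.0 - utilization_pct / 100.0)


def total_idle_cost(gpus):
    """Sum idle cost over (hourly_cost, utilization_pct) pairs, one per GPU."""
    return sum(idle_gpu_cost(cost, util) for cost, util in gpus)


# Illustrative rates only: a 4.00/h GPU at 25% utilization wastes 3.00/h,
# and a 2.00/h GPU at 50% utilization wastes 1.00/h.
waste = total_idle_cost([(4.0, 25.0), (2.0, 50.0)])
```

Note that utilization as reported by DCGM measures how often the GPU is busy, not how efficiently kernels use it, so this figure is best read as a lower bound on waste.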

Right-sizing Recommendations

# Vertical Pod Autoscaler for GPU workloads
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: inference-vpa
  namespace: ml-serving
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  updatePolicy:
    updateMode: "Off"  # recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: inference
      controlledResources: ["cpu", "memory"]
      # GPU resources are managed separately
---
# Custom GPU right-sizing based on utilization
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gpu-rightsizing-report
spec:
  schedule: "0 8 * * 1"  # weekly, Mondays at 08:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: analyzer
            image: gpu-analyzer:v1
            command:
            - /bin/sh
            - -c
            - |
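              # Query average GPU utilization over the past week.
              # -G with --data-urlencode encodes the PromQL expression; a raw
              # "[7d]" in the URL would be treated by curl as a glob range.
              UTIL=$(curl -sG "http://prometheus:9090/api/v1/query" \
                --data-urlencode 'query=avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d])')

              # Generate right-sizing recommendations
              python3 /scripts/analyze.py --utilization "$UTIL" \
                --output-format markdown \
                --send-to slack:#ml-cost-alerts
          restartPolicy: OnFailure  # Job pod templates must set OnFailure or Never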
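
The contents of `/scripts/analyze.py` are not shown in this lesson. As a rough idea of what such a script might do, the Python sketch below parses a Prometheus instant-query response and maps average utilization to a recommendation. The function names, the `pod` label, and the 30%/90% thresholds are all assumptions made for illustration.

```python
import json
from collections import defaultdict


def mean_utilization_by_pod(payload):
    """Average GPU utilization per pod from a Prometheus instant-query response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    samples = defaultdict(list)
    for series in body["data"]["result"]:
        # Assumes the exporter attaches a "pod" label to each GPU series.
        pod = series["metric"].get("pod", "unknown")
        # Instant vectors carry [timestamp, "value-as-string"].
        samples[pod].append(float(series["value"][1]))
    return {pod: sum(vals) / len(vals) for pod, vals in samples.items()}


def recommend(utilization_pct, low=30.0, high=90.0):
    """Map utilization to a coarse action; thresholds are illustrative."""
    if utilization_pct < low:
        return "share-or-downsize"  # candidate for MIG, time-slicing, or a smaller GPU
    if utilization_pct > high:
        return "consider-scaling-up"
    return "keep"
```

A pod holding several GPUs yields several series, which is why the sketch averages per pod before applying the thresholds.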

Kueue for Batch Job Scheduling

# ResourceFlavors for different GPU types
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot-a100
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
    cloud.google.com/gke-spot: "true"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: ondemand-a100
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
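    # Assumes the on-demand node pool has been labeled explicitly with this
    # value; GKE sets the gke-spot label on Spot nodes only.
    cloud.google.com/gke-spot: "false"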
---
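# ClusterQueue with cost optimization
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-training
spec:
  # Hypothetical cohort name: borrowingLimit and reclaimWithinCohort only
  # take effect when the queue belongs to a cohort.
  cohort: ml-cohort
  namespaceSelector: {}  # admit workloads from all namespaces
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
  resourceGroups:
  # Every flavor must define quota for each covered resource, so only GPUs
  # are covered here.
  - coveredResources: ["nvidia.com/gpu"]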
    flavors:
    - name: spot-a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 32
        borrowingLimit: 16
    - name: ondemand-a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8
        borrowingLimit: 0
---
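# Workload that uses spot by default and falls back to on-demand
# Note: queueName refers to a LocalQueue, so a LocalQueue named ml-training
# that points at the ClusterQueue above is assumed. Kueue normally creates
# Workload objects itself from Jobs labeled kueue.x-k8s.io/queue-name.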
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: training-job
spec:
  queueName: ml-training
  priority: 100
  podSets:
  - name: main
    count: 4
    template:
      spec:
        containers:
        - name: trainer
          resources:
            requests:
              nvidia.com/gpu: 4
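
Kueue tries the flavors of a resource group in the order they are listed, which is what makes spot the default and on-demand the fallback. The Python sketch below is a deliberately simplified model of that ordering: it checks nominal quota only and ignores borrowing, cohorts, and preemption. The function name and data layout are hypothetical.

```python
def assign_flavor(requested_gpus, flavors, usage):
    """Return the first flavor, in list order, with enough free nominal quota.

    flavors: list of (name, nominal_quota) pairs, cheapest first.
    usage:   dict mapping flavor name to GPUs already admitted.
    Returns None when no flavor fits, meaning the workload stays queued.
    """
    for name, quota in flavors:
        if usage.get(name, 0) + requested_gpus <= quota:
            return name
    return None


# Quotas taken from the ClusterQueue above.
FLAVORS = [("spot-a100", 32), ("ondemand-a100", 8)]

# 8 GPUs requested while spot already has 28 of 32 in use: falls back to on-demand.
choice = assign_flavor(8, FLAVORS, {"spot-a100": 28})
```

Under this model, the Workload above (4 pods with 4 GPUs each, 16 in total) fits the spot flavor when at least 16 of its 32 GPUs are free, but it can never fit the on-demand flavor, whose nominal quota is 8.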

Cost Allocation Labels

# Labels for cost attribution
apiVersion: v1
kind: Namespace
metadata:
  name: ml-team-a
  labels:
    cost-center: "ml-research"
    team: "team-a"
    project: "llm-development"
---
# Pod with cost allocation labels
apiVersion: v1
kind: Pod
metadata:
  labels:
    cost-center: "ml-research"
    workload-type: "training"
    priority: "batch"
    spot-eligible: "true"
spec:
  containers:
  - name: training
    resources:
      limits:
        nvidia.com/gpu: 1
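
Once workloads carry labels like these, a chargeback report reduces to grouping GPU-hours by label value. The Python sketch below illustrates the idea on in-memory records; the record layout (`labels`, `gpus`, `hours`, `rate`), the function name, and the rates are assumptions for illustration and not the output format of any particular tool.

```python
from collections import defaultdict


def cost_by_label(pods, label):
    """Total GPU cost grouped by the value of one label.

    Each pod record holds its labels, GPU count, runtime in hours, and the
    hourly rate per GPU. Pods missing the label are grouped as "unallocated".
    """
    totals = defaultdict(float)
    for pod in pods:
        key = pod["labels"].get(label, "unallocated")
        totals[key] += pod["gpus"] * pod["hours"] * pod["rate"]
    return dict(totals)


# Illustrative records and rates only.
report = cost_by_label(
    [
        {"labels": {"cost-center": "ml-research"}, "gpus": 1, "hours": 10, "rate": 2.0},
        {"labels": {"cost-center": "ml-research"}, "gpus": 4, "hours": 5, "rate": 1.0},
        {"labels": {}, "gpus": 2, "hours": 1, "rate": 3.0},
    ],
    "cost-center",
)
```

Keeping an explicit "unallocated" bucket makes unlabeled spend visible instead of silently dropping it from the report.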

Next lesson: CI/CD pipelines for ML model deployment.