Serving LLMs with vLLM and TGI

Serving large language models requires specialized inference engines that are optimized for transformer architectures. vLLM and Text Generation Inference (TGI) are two of the leading options for production LLM deployment on Kubernetes.

LLM Serving Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          LLM Serving Stack                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                 API Gateway / Load Balancer                 │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    LLM Inference Engine                     │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │    │
│  │  │    vLLM     │  │     TGI     │  │  TensorRT   │          │    │
│  │  │ (PagedAttn) │  │(HuggingFace)│  │     LLM     │          │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘          │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                        Optimizations                        │    │
│  │  - Continuous batching       - KV cache management          │    │
│  │  - Tensor parallelism        - Speculative decoding         │    │
│  │  - Flash Attention           - Quantization (AWQ/GPTQ)      │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Multi-GPU / Multi-Node                    │    │
│  │  [GPU 0] [GPU 1] [GPU 2] [GPU 3] [GPU 4] [GPU 5] [GPU 6]    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
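
KV cache management appears in the optimization layer because the cache grows with every token of every active sequence. The sketch below estimates that growth from first principles. The layer count, KV head count, and head dimension are assumptions for a Llama-3-70B-class model with grouped-query attention, and the helper names are illustrative rather than part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Attention dimensions that determine KV cache size."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16 / bf16


def kv_bytes_per_token(shape: AttentionShape) -> int:
    # Every layer stores one key vector and one value vector per KV head per token.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.bytes_per_value


def kv_gib_for_tokens(shape: AttentionShape, tokens: int) -> float:
    # Total cache size in GiB for the given number of cached tokens.
    return kv_bytes_per_token(shape) * tokens / 2**30


# Assumed shape for a Llama-3-70B-class model (not read from a real checkpoint).
LLAMA3_70B = AttentionShape(num_layers=80, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    print("bytes per token:", kv_bytes_per_token(LLAMA3_70B))
    print("GiB for one 8192-token sequence:", kv_gib_for_tokens(LLAMA3_70B, 8192))
```

Multiplying the per-sequence figure by the number of concurrent sequences shows why an engine that wastes cache space through fragmentation runs out of GPU memory long before it runs out of compute.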

Deploying vLLM

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.0
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
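        args:
        - "--model=/models/llama-3-70b"
        # Expose a short model name so API clients can send "model": "llama-3-70b"
        - "--served-model-name=llama-3-70b"
        - "--tensor-parallel-size=4"
        - "--max-model-len=8192"
        - "--gpu-memory-utilization=0.9"
        - "--enable-prefix-caching"
        - "--enable-chunked-prefill"
        - "--max-num-seqs=256"
        # AWQ requires the checkpoint under /models to be AWQ-quantized already
        - "--quantization=awq"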
        ports:
        - containerPort: 8000
          name: http
        resources:
          requests:
            nvidia.com/gpu: 4
            memory: "128Gi"
            cpu: "32"
          limits:
            nvidia.com/gpu: 4
            memory: "256Gi"
            cpu: "64"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"
        - name: NCCL_DEBUG
          value: "INFO"
        volumeMounts:
        - name: models
          mountPath: /models
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-storage
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "32Gi"
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-serving
spec:
  selector:
    app: vllm-llama
  ports:
  - port: 8000
    targetPort: 8000
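
The flags in this Deployment interact: `--gpu-memory-utilization` caps what vLLM may claim on each GPU, `--tensor-parallel-size` shards the weights, and whatever remains is available for the KV cache. The following is a rough back-of-the-envelope sketch of that budget. It assumes a 70-billion-parameter model at 4 bits per weight, treats the A100's nominal 80 GB as 80 GiB, and ignores activations and runtime overhead, so the result is an upper bound rather than a measurement. All function names are hypothetical.

```python
def per_gpu_budget_gib(gpu_mem_gib: float, utilization: float) -> float:
    """Memory vLLM may claim on each GPU, per --gpu-memory-utilization."""
    return gpu_mem_gib * utilization


def per_gpu_weight_gib(params_billion: float, bits_per_weight: int, tp_size: int) -> float:
    """Approximate weight memory per GPU when weights are sharded across tp_size GPUs."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / tp_size / 2**30


def kv_headroom_gib(gpu_mem_gib: float, utilization: float,
                    params_billion: float, bits_per_weight: int, tp_size: int) -> float:
    """Upper bound on per-GPU memory left for KV cache (ignores activations and overhead)."""
    return (per_gpu_budget_gib(gpu_mem_gib, utilization)
            - per_gpu_weight_gib(params_billion, bits_per_weight, tp_size))


if __name__ == "__main__":
    # Values mirror the manifest above: 80 GB GPUs, 0.9 utilization, AWQ (4-bit), TP=4.
    print("budget per GPU (GiB):", per_gpu_budget_gib(80, 0.9))
    print("weights per GPU (GiB):", round(per_gpu_weight_gib(70, 4, 4), 2))
    print("KV headroom per GPU (GiB):", round(kv_headroom_gib(80, 0.9, 70, 4, 4), 2))
```

Running the same calculation with 16 bits per weight is a quick way to see how much KV cache space quantization frees on a fixed set of GPUs.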

Deploying TGI (Text Generation Inference)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-mistral
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tgi-mistral
  template:
    metadata:
      labels:
        app: tgi-mistral
    spec:
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:2.0
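        args:
        - "--model-id=mistralai/Mistral-7B-Instruct-v0.3"
        - "--num-shard=2"
        - "--max-batch-size=64"
        - "--max-input-length=4096"
        - "--max-total-tokens=8192"
        # AWQ expects pre-quantized weights; point --model-id at an AWQ checkpoint
        # or drop this flag when serving the unquantized upstream model
        - "--quantize=awq"
        - "--trust-remote-code"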
        ports:
        - containerPort: 80
          name: http
        resources:
          requests:
            nvidia.com/gpu: 2
            memory: "48Gi"
          limits:
            nvidia.com/gpu: 2
            memory: "64Gi"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: MAX_CONCURRENT_REQUESTS
          value: "128"
        volumeMounts:
        - name: cache
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: cache
        persistentVolumeClaim:
          claimName: tgi-cache
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "16Gi"
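
The two token limits in this Deployment define a per-request budget: the prompt may not exceed `--max-input-length`, and prompt plus generated tokens may not exceed `--max-total-tokens`. The sketch below applies that rule on the client side and builds a request for TGI's `/generate` endpoint. The base URL passed in is hypothetical (the manifest above defines no Service for TGI), and the helper names are illustrative.

```python
import json
from urllib import request


def max_new_tokens_allowed(prompt_tokens: int,
                           max_input_length: int = 4096,
                           max_total_tokens: int = 8192) -> int:
    """Largest max_new_tokens a prompt of this size can request under the limits above."""
    if prompt_tokens > max_input_length:
        raise ValueError(f"prompt has {prompt_tokens} tokens, limit is {max_input_length}")
    return max_total_tokens - prompt_tokens


def build_generate_request(base_url: str, prompt: str, max_new_tokens: int) -> request.Request:
    """Build (but do not send) a POST request for TGI's /generate endpoint."""
    body = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode("utf-8")
    return request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "http://tgi-mistral" is an assumed in-cluster address, not defined in the manifest.
    req = build_generate_request("http://tgi-mistral", "Explain Kubernetes.",
                                 max_new_tokens_allowed(1000))
    print(req.full_url, req.data.decode("utf-8"))
```

Sending the request is a matter of passing it to `urllib.request.urlopen`; checking the budget first avoids a round trip that the server would reject with a validation error.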

KServe LLM Runtime

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: vllm-runtime
spec:
  annotations:
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
  supportedModelFormats:
  - name: vllm
    version: "1"
    autoSelect: true
  containers:
  - name: vllm
    image: vllm/vllm-openai:v0.6.0
    command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
    args:
    - "--model=/mnt/models"
    - "--port=8080"
    ports:
    - containerPort: 8080
      protocol: TCP
    env:
    - name: STORAGE_URI
      value: "{{.StorageUri}}"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-vllm
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: "s3://models/llama-3-8b"
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: "32Gi"
        limits:
          nvidia.com/gpu: 1
          memory: "48Gi"

Multi-Node LLM Serving (70B+ Models)

# Ray cluster for distributed inference
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: llm-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.9.0-gpu
          resources:
            requests:
              nvidia.com/gpu: 8
              memory: "512Gi"
            limits:
              nvidia.com/gpu: 8
              memory: "768Gi"
  workerGroupSpecs:
  - replicas: 3
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray-ml:2.9.0-gpu
          resources:
            requests:
              nvidia.com/gpu: 8
              memory: "512Gi"
            limits:
              nvidia.com/gpu: 8
              memory: "768Gi"
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB
---
# vLLM on Ray for a 70B model across 4 nodes (32 GPUs)
# (rayClusterConfig is omitted here; a complete RayService manifest also needs it)
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-70b-service
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path: vllm_ray:deployment
      deployments:
      - name: VLLMDeployment
        num_replicas: 1
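        # NOTE: Ray places each actor on a single node, so a 32-GPU request
        # cannot be scheduled on 8-GPU nodes as written; multi-node vLLM
        # typically reserves one GPU per worker through a placement group.
        ray_actor_options:
          num_gpus: 32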
        user_config:
          model: meta-llama/Meta-Llama-3-70B-Instruct
          tensor_parallel_size: 8
          pipeline_parallel_size: 4
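
The parallelism settings above determine how many GPUs and nodes the model needs: the total worker count is the tensor-parallel size multiplied by the pipeline-parallel size. A common layout keeps each tensor-parallel group inside one node, where GPU interconnect is fastest, and uses pipeline parallelism across nodes. The sketch below checks a plan against that layout. The class and its fields are illustrative and are not part of Ray or vLLM.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPlan:
    """A tensor-parallel x pipeline-parallel layout on homogeneous nodes."""
    tensor_parallel_size: int
    pipeline_parallel_size: int
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        # One worker (one GPU) per tensor-parallel rank in every pipeline stage.
        return self.tensor_parallel_size * self.pipeline_parallel_size

    @property
    def nodes_needed(self) -> int:
        return math.ceil(self.world_size / self.gpus_per_node)

    def tensor_parallel_stays_on_node(self) -> bool:
        # True when every tensor-parallel group fits inside a single node.
        return (self.tensor_parallel_size <= self.gpus_per_node
                and self.gpus_per_node % self.tensor_parallel_size == 0)


if __name__ == "__main__":
    plan = ParallelPlan(tensor_parallel_size=8, pipeline_parallel_size=4, gpus_per_node=8)
    print(plan.world_size, plan.nodes_needed, plan.tensor_parallel_stays_on_node())
```

With the values from the manifest, the plan uses 32 GPUs on 4 nodes, which matches the one head plus three worker nodes of 8 GPUs each in the RayCluster.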

Using the OpenAI-Compatible API

# vLLM exposes OpenAI-compatible endpoints
curl http://vllm-service:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Kubernetes in simple terms."}
    ],
    "max_tokens": 500,
    "temperature": 0.7,
    "stream": true
  }'

# Streaming text completion
curl http://vllm-service:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "prompt": "The benefits of containers are:",
    "max_tokens": 200,
    "stream": true
  }'
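
With `"stream": true`, the response arrives as server-sent events: each `data:` line carries a JSON chunk, and the stream ends with `data: [DONE]`. The following is a minimal client sketch for the chat endpoint using only the standard library. It assumes the OpenAI-style chunk shape (`choices[0].delta.content`), and the function names are illustrative.

```python
import json
from typing import Iterable, Iterator
from urllib import request


def iter_chat_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style chat completion event stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def stream_chat(base_url: str, model: str, user_message: str) -> Iterator[str]:
    """Send a streaming chat request and yield the reply fragment by fragment."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,
    }).encode("utf-8")
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        # An HTTP response object can be iterated line by line as bytes.
        yield from iter_chat_deltas(resp)
```

From inside the cluster, `for piece in stream_chat("http://vllm-service:8000", "llama-3-70b", "Hello"): print(piece, end="")` would print the reply as it is generated. The `/v1/completions` endpoint streams the same way but places the text in `choices[0].text` instead of a `delta` object.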

Performance Optimization

Technique	Throughput gain (approx.)	Memory reduction (approx.)
PagedAttention	2-4x	50-70%
AWQ quantization	1.2x	75%
Continuous batching	10-20x	-
Flash Attention 2	1.5-2x	40%
Speculative decoding	2-3x	-
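
To make the continuous batching row concrete, the toy simulation below counts decode steps under two scheduling policies. Static batching holds a batch until its longest sequence finishes. Continuous batching refills a slot as soon as any sequence finishes. The model is deliberately simplified: it ignores prefill cost and memory limits and assumes one token per sequence per step, so it illustrates the mechanism only and does not reproduce the figures in the table.

```python
import heapq
from typing import Sequence


def static_batching_steps(output_lens: Sequence[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence has finished."""
    return sum(
        max(output_lens[i:i + batch_size])
        for i in range(0, len(output_lens), batch_size)
    )


def continuous_batching_steps(output_lens: Sequence[int], batch_size: int) -> int:
    """A freed slot is refilled immediately by the next queued request."""
    slots: list[int] = []  # min-heap holding the finish step of each running sequence
    finish = 0
    for n in output_lens:
        # Start at step 0 while slots are free, otherwise when the earliest slot frees up.
        start = heapq.heappop(slots) if len(slots) == batch_size else 0
        end = start + n
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    lens = [10, 2, 2, 2]  # output lengths in tokens, in arrival order
    print("static:", static_batching_steps(lens, batch_size=2))
    print("continuous:", continuous_batching_steps(lens, batch_size=2))
```

The gap between the two policies widens as output lengths become more uneven, which is typical of chat workloads where some replies are a sentence and others run to hundreds of tokens.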

Next module: Service mesh and networking for ML workloads with Istio.