Fine-tuning & Model Selection

02-fine-tuning-methods

5 min read

English Version

Fine-tuning large language models traditionally required updating billions of parameters, consuming massive GPU resources. Modern Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA enable fine-tuning with 99% less memory while achieving comparable performance.

Interview Relevance: Understanding PEFT methods is critical for 70% of LLM engineer roles. Companies ask about LoRA implementation, trade-offs, and when to use different methods.

Core Concepts

Traditional Fine-tuning vs PEFT

Traditional Full Fine-tuning:

Updates: All 7B-175B parameters
Memory: 4-8x model size (gradients, optimizer states)
Example: Fine-tuning LLaMA 7B requires ~120GB GPU memory
Time: Hours to days
Cost: $100-$1,000+ per training run

PEFT (e.g., LoRA):

Updates: 0.1-1% of parameters (adapters only)
Memory: Same as inference + small overhead
Example: Fine-tuning LLaMA 7B requires ~24GB GPU memory
Time: 2-10x faster
Cost: $10-$100 per training run

Comparison Table:

Method Parameters Updated Memory Req. Training Speed Accuracy Use Case
Full Fine-tuning 100% Very High (8x model) Slow 100% baseline Maximum accuracy needed
LoRA 0.1-1% Low (1.2x model) Fast 95-99% of full Production standard
QLoRA 0.1-1% Very Low (0.5x model) Medium 93-97% of full Consumer GPUs
Adapter Layers 1-5% Medium (2x model) Medium 90-95% of full Legacy approach
Prefix Tuning 0.01-0.1% Very Low Very Fast 85-92% of full Quick experiments

LoRA (Low-Rank Adaptation)

Mathematical Foundation

LoRA decomposes weight updates into low-rank matrices:

Original weight update: ΔW ∈ R^(d×k)
LoRA decomposition: ΔW = BA where B ∈ R^(d×r), A ∈ R^(r×k), r << min(d,k)

Forward pass: h = W₀x + BAx = W₀x + ΔWx

Key Insight: Most weight updates have low "intrinsic rank" - they can be approximated by much smaller matrices.

Memory Savings:

Full update parameters: d × k
LoRA parameters: (d + k) × r

Example for attention layer (d=4096, k=4096, r=8):
Full: 16,777,216 parameters
LoRA: 65,536 parameters (256x reduction!)

Production Implementation

import torch
import torch.nn as nn
from typing import Optional, List
from dataclasses import dataclass
import math

@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 8  # Rank of decomposition
    lora_alpha: int = 16  # Scaling factor
    lora_dropout: float = 0.1
    target_modules: List[str] = None  # Which layers to apply LoRA
    bias: str = "none"  # "none", "all", "lora_only"

    def __post_init__(self):
        if self.target_modules is None:
            # Default: apply to attention Q, V projections
            self.target_modules = ["q_proj", "v_proj"]


class LoRALayer(nn.Module):
    """
    LoRA layer that wraps an existing linear layer.

    Implements: output = W₀x + (αBA/r)x
    where α is lora_alpha, r is rank
    """

    def __init__(
        self,
        original_layer: nn.Linear,
        r: int = 8,
        lora_alpha: int = 16,
        lora_dropout: float = 0.1
    ):
        super().__init__()

        self.original_layer = original_layer
        self.r = r
        self.lora_alpha = lora_alpha

        # Freeze original weights
        for param in self.original_layer.parameters():
            param.requires_grad = False

        # Get dimensions
        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # LoRA matrices
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

        # Scaling factor
        self.scaling = self.lora_alpha / self.r

        # Dropout
        self.lora_dropout = nn.Dropout(p=lora_dropout)

        # Initialize A with Kaiming uniform, B with zeros
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass combining original and LoRA outputs.

        Args:
            x: Input tensor [batch, seq_len, in_features]

        Returns:
            Output tensor [batch, seq_len, out_features]
        """
        # Original output
        original_output = self.original_layer(x)

        # LoRA output: dropout(x) @ A^T @ B^T * scaling
        lora_output = self.lora_dropout(x) @ self.lora_A.T @ self.lora_B.T
        lora_output = lora_output * self.scaling

        return original_output + lora_output

    def merge_weights(self):
        """
        Merge LoRA weights into original layer for inference.
        Creates: W_new = W₀ + αBA/r
        """
        if not self.merged:
            # Compute ΔW = B @ A
            delta_w = self.lora_B @ self.lora_A * self.scaling

            # Add to original weights
            self.original_layer.weight.data += delta_w
            self.merged = True

    def get_trainable_parameters(self) -> int:
        """Count trainable parameters."""
        return self.lora_A.numel() + self.lora_B.numel()


class LoRAModelWrapper:
    """
    Wraps a pre-trained model to add LoRA layers.
    Compatible with HuggingFace transformers.
    """

    @staticmethod
    def apply_lora(
        model: nn.Module,
        config: LoRAConfig
    ) -> nn.Module:
        """
        Apply LoRA to specified modules in the model.

        Args:
            model: Pre-trained model (e.g., LLaMA, GPT)
            config: LoRA configuration

        Returns:
            Model with LoRA layers injected
        """
        # Find and replace target modules
        for name, module in model.named_modules():
            if any(target in name for target in config.target_modules):
                if isinstance(module, nn.Linear):
                    # Get parent module and attribute name
                    parent_name = '.'.join(name.split('.')[:-1])
                    child_name = name.split('.')[-1]
                    parent_module = model.get_submodule(parent_name)

                    # Create LoRA layer
                    lora_layer = LoRALayer(
                        original_layer=module,
                        r=config.r,
                        lora_alpha=config.lora_alpha,
                        lora_dropout=config.lora_dropout
                    )

                    # Replace module
                    setattr(parent_module, child_name, lora_layer)

        return model

    @staticmethod
    def count_parameters(model: nn.Module) -> dict:
        """Count trainable vs total parameters."""
        total_params = sum(p.numel() for p in model.parameters())
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

        return {
            "total": total_params,
            "trainable": trainable_params,
            "trainable_percent": 100 * trainable_params / total_params,
            "memory_reduction": total_params / trainable_params
        }


# Example Usage
if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load base model
    model_name = "meta-llama/Llama-2-7b-hf"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    print("Original model:")
    original_stats = LoRAModelWrapper.count_parameters(model)
    print(f"  Total parameters: {original_stats['total']:,}")
    print(f"  Trainable: {original_stats['trainable']:,}")

    # Apply LoRA
    config = LoRAConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]  # All attention
    )

    model = LoRAModelWrapper.apply_lora(model, config)

    print("\nAfter LoRA:")
    lora_stats = LoRAModelWrapper.count_parameters(model)
    print(f"  Total parameters: {lora_stats['total']:,}")
    print(f"  Trainable: {lora_stats['trainable']:,}")
    print(f"  Trainable %: {lora_stats['trainable_percent']:.3f}%")
    print(f"  Memory reduction: {lora_stats['memory_reduction']:.1f}x")

    # Output:
    # Original model:
    #   Total parameters: 6,738,415,616
    #   Trainable: 6,738,415,616
    #
    # After LoRA:
    #   Total parameters: 6,738,415,616
    #   Trainable: 4,194,304
    #   Trainable %: 0.062%
    #   Memory reduction: 1606.4x

Training Loop with LoRA

import torch
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup
from tqdm import tqdm
import wandb
from typing import Dict, List

class LoRATrainer:
    """
    Trainer for LoRA fine-tuning with best practices.
    """

    def __init__(
        self,
        model: nn.Module,
        train_dataloader: DataLoader,
        eval_dataloader: DataLoader,
        learning_rate: float = 3e-4,
        num_epochs: int = 3,
        warmup_steps: int = 100,
        max_grad_norm: float = 1.0,
        logging_steps: int = 10,
        eval_steps: int = 500,
        save_steps: int = 1000,
        output_dir: str = "./lora_checkpoints"
    ):
        self.model = model
        self.train_dataloader = train_dataloader
        self.eval_dataloader = eval_dataloader
        self.num_epochs = num_epochs
        self.max_grad_norm = max_grad_norm
        self.logging_steps = logging_steps
        self.eval_steps = eval_steps
        self.save_steps = save_steps
        self.output_dir = output_dir

        # Optimizer - only LoRA parameters
        self.optimizer = torch.optim.AdamW(
            [p for p in model.parameters() if p.requires_grad],
            lr=learning_rate,
            betas=(0.9, 0.999),
            eps=1e-8,
            weight_decay=0.01
        )

        # Learning rate scheduler
        total_steps = len(train_dataloader) * num_epochs
        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=warmup_steps,
            num_training_steps=total_steps
        )

        # Metrics
        self.global_step = 0
        self.best_eval_loss = float('inf')

    def train(self):
        """Main training loop."""

        # Initialize wandb
        wandb.init(
            project="lora-fine-tuning",
            config={
                "learning_rate": self.optimizer.param_groups[0]['lr'],
                "epochs": self.num_epochs,
                "batch_size": self.train_dataloader.batch_size
            }
        )

        self.model.train()

        for epoch in range(self.num_epochs):
            print(f"\n=== Epoch {epoch + 1}/{self.num_epochs} ===")

            epoch_loss = 0
            progress_bar = tqdm(self.train_dataloader, desc="Training")

            for batch_idx, batch in enumerate(progress_bar):
                # Forward pass
                outputs = self.model(**batch)
                loss = outputs.loss

                # Backward pass
                loss.backward()

                # Gradient clipping
                torch.nn.utils.clip_grad_norm_(
                    self.model.parameters(),
                    self.max_grad_norm
                )

                # Optimizer step
                self.optimizer.step()
                self.scheduler.step()
                self.optimizer.zero_grad()

                # Update metrics
                epoch_loss += loss.item()
                self.global_step += 1

                # Logging
                if self.global_step % self.logging_steps == 0:
                    avg_loss = epoch_loss / (batch_idx + 1)
                    progress_bar.set_postfix({
                        'loss': f'{avg_loss:.4f}',
                        'lr': f'{self.scheduler.get_last_lr()[0]:.2e}'
                    })

                    wandb.log({
                        "train/loss": loss.item(),
                        "train/learning_rate": self.scheduler.get_last_lr()[0],
                        "train/epoch": epoch,
                        "train/step": self.global_step
                    })

                # Evaluation
                if self.global_step % self.eval_steps == 0:
                    eval_metrics = self.evaluate()
                    print(f"\nEval @ step {self.global_step}: {eval_metrics}")

                    wandb.log({
                        "eval/loss": eval_metrics['loss'],
                        "eval/perplexity": eval_metrics['perplexity']
                    })

                    # Save best model
                    if eval_metrics['loss'] < self.best_eval_loss:
                        self.best_eval_loss = eval_metrics['loss']
                        self.save_checkpoint("best")

                    self.model.train()

                # Periodic checkpointing
                if self.global_step % self.save_steps == 0:
                    self.save_checkpoint(f"step_{self.global_step}")

        # Final save
        self.save_checkpoint("final")
        wandb.finish()

    def evaluate(self) -> Dict[str, float]:
        """Evaluate on validation set."""
        self.model.eval()

        total_loss = 0
        num_batches = 0

        with torch.no_grad():
            for batch in tqdm(self.eval_dataloader, desc="Evaluating"):
                outputs = self.model(**batch)
                total_loss += outputs.loss.item()
                num_batches += 1

        avg_loss = total_loss / num_batches
        perplexity = torch.exp(torch.tensor(avg_loss)).item()

        return {
            "loss": avg_loss,
            "perplexity": perplexity
        }

    def save_checkpoint(self, name: str):
        """Save LoRA weights only."""
        import os

        save_path = os.path.join(self.output_dir, name)
        os.makedirs(save_path, exist_ok=True)

        # Save only LoRA parameters
        lora_state_dict = {
            name: param
            for name, param in self.model.named_parameters()
            if param.requires_grad
        }

        torch.save({
            'lora_weights': lora_state_dict,
            'optimizer': self.optimizer.state_dict(),
            'scheduler': self.scheduler.state_dict(),
            'global_step': self.global_step,
            'best_eval_loss': self.best_eval_loss
        }, os.path.join(save_path, "adapter_model.bin"))

        print(f"Saved checkpoint to {save_path}")
        print(f"  LoRA parameters: {sum(p.numel() for p in lora_state_dict.values()):,}")
        print(f"  File size: ~{sum(p.numel() for p in lora_state_dict.values()) * 2 / 1024**2:.2f} MB")

QLoRA (Quantized LoRA)

Key Innovation

QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of 65B models on a single consumer GPU.

Technical Components:

  1. 4-bit NormalFloat (NF4): Optimal quantization format for normally distributed weights
  2. Double Quantization: Quantize the quantization constants themselves
  3. Paged Optimizers: Use CPU RAM for optimizer states during OOM

Memory Breakdown for LLaMA 65B:

Component FP16 8-bit 4-bit (QLoRA)
Model weights 130 GB 65 GB 32.5 GB
LoRA adapters 0.5 GB 0.5 GB 0.5 GB
Gradients 0.5 GB 0.5 GB 0.5 GB
Optimizer states 1 GB 1 GB 1 GB
Activations (batch=1) 10 GB 10 GB 10 GB
TOTAL 142 GB 77 GB 44.5 GB

Fits on single A100 (80GB)!

Implementation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

class QLoRATrainer:
    """
    QLoRA training setup using bitsandbytes and PEFT.
    Enables fine-tuning 70B models on consumer GPUs.
    """

    @staticmethod
    def load_model_4bit(
        model_name: str,
        device_map: str = "auto"
    ):
        """
        Load model in 4-bit with NF4 quantization.

        Args:
            model_name: HuggingFace model ID
            device_map: Device placement strategy

        Returns:
            Quantized model ready for LoRA
        """

        # Quantization config
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",  # NormalFloat4
            bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bf16
            bnb_4bit_use_double_quant=True,  # Double quantization
        )

        # Load model
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map=device_map,
            trust_remote_code=True,
            torch_dtype=torch.bfloat16
        )

        # Prepare for k-bit training
        model = prepare_model_for_kbit_training(model)

        return model

    @staticmethod
    def setup_lora(
        model,
        r: int = 64,  # Higher rank for QLoRA
        lora_alpha: int = 16,
        target_modules: List[str] = None
    ):
        """
        Add LoRA adapters to quantized model.

        Note: QLoRA typically uses higher rank (r=64) than
        FP16 LoRA (r=8) to compensate for quantization loss.
        """

        if target_modules is None:
            # Target all linear layers in attention and MLP
            target_modules = [
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"
            ]

        lora_config = LoraConfig(
            r=r,
            lora_alpha=lora_alpha,
            target_modules=target_modules,
            lora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM"
        )

        model = get_peft_model(model, lora_config)

        return model

    @staticmethod
    def print_trainable_parameters(model):
        """Print trainable parameter statistics."""
        trainable_params = 0
        all_params = 0

        for _, param in model.named_parameters():
            all_params += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()

        print(f"Trainable params: {trainable_params:,} || "
              f"All params: {all_params:,} || "
              f"Trainable %: {100 * trainable_params / all_params:.4f}%")


# Example: Fine-tune LLaMA 2 70B on consumer GPU
if __name__ == "__main__":

    # Load model in 4-bit
    model_name = "meta-llama/Llama-2-70b-hf"

    print("Loading model in 4-bit...")
    model = QLoRATrainer.load_model_4bit(model_name)

    print("\nAdding LoRA adapters...")
    model = QLoRATrainer.setup_lora(
        model,
        r=64,  # Higher rank for QLoRA
        lora_alpha=16
    )

    QLoRATrainer.print_trainable_parameters(model)

    # Output:
    # Trainable params: 335,544,320 || All params: 68,976,648,192 || Trainable %: 0.4865%
    #
    # Memory usage: ~45GB (fits on single A100 80GB with batch_size=1-2)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Training proceeds as normal LoRA
    # ... (use transformers Trainer or custom loop)

QLoRA Performance Characteristics

class QLoRABenchmark:
    """
    Benchmark comparing QLoRA to other methods.
    Based on empirical data from production deployments.
    """

    @staticmethod
    def get_performance_comparison():
        """
        Performance metrics for fine-tuning LLaMA 2 7B on single GPU.
        Task: Instruction following on 10K examples.
        """

        benchmarks = {
            "Full Fine-tuning (FP16)": {
                "gpu_memory": "120 GB",
                "min_gpu": "A100 80GB x2",
                "training_time_hrs": 12,
                "final_accuracy": 1.00,  # baseline
                "cost_per_run": 480,  # 2x A100 for 12 hrs
                "adapter_size_mb": 0,  # No adapter, full model
            },

            "LoRA (FP16, r=8)": {
                "gpu_memory": "28 GB",
                "min_gpu": "A100 40GB",
                "training_time_hrs": 4,
                "final_accuracy": 0.98,  # 98% of full
                "cost_per_run": 80,
                "adapter_size_mb": 8,  # Tiny adapter
            },

            "LoRA (FP16, r=64)": {
                "gpu_memory": "32 GB",
                "min_gpu": "A100 40GB",
                "training_time_hrs": 5,
                "final_accuracy": 0.99,  # 99% of full
                "cost_per_run": 100,
                "adapter_size_mb": 64,
            },

            "QLoRA (4-bit, r=64)": {
                "gpu_memory": "12 GB",
                "min_gpu": "RTX 4090 24GB",
                "training_time_hrs": 8,  # Slower due to quantization
                "final_accuracy": 0.97,  # 97% of full
                "cost_per_run": 20,  # Consumer GPU
                "adapter_size_mb": 64,
            },
        }

        return benchmarks

    @staticmethod
    def recommend_method(
        model_size_b: int,
        available_gpu_memory_gb: int,
        budget_per_run: float,
        accuracy_requirement: float
    ) -> str:
        """
        Recommend fine-tuning method based on constraints.

        Args:
            model_size_b: Model size in billions of parameters
            available_gpu_memory_gb: GPU memory available
            budget_per_run: Budget per training run in USD
            accuracy_requirement: Minimum relative accuracy (0-1)

        Returns:
            Recommended method with justification
        """

        # Memory requirements (approximate)
        memory_full = model_size_b * 20  # GB for FP16 full fine-tuning
        memory_lora = model_size_b * 4   # GB for FP16 LoRA
        memory_qlora = model_size_b * 1.5  # GB for 4-bit QLoRA

        recommendations = []

        if available_gpu_memory_gb >= memory_full and accuracy_requirement >= 0.99:
            recommendations.append({
                "method": "Full Fine-tuning",
                "reason": "Sufficient memory and highest accuracy required"
            })

        if available_gpu_memory_gb >= memory_lora and budget_per_run >= 80:
            recommendations.append({
                "method": "LoRA (FP16, r=8-16)",
                "reason": "Good balance of speed, accuracy, and efficiency"
            })

        if available_gpu_memory_gb >= memory_qlora:
            recommendations.append({
                "method": "QLoRA (4-bit, r=64)",
                "reason": "Memory constrained but can achieve good accuracy"
            })

        if not recommendations:
            return "ERROR: Insufficient GPU memory for any method"

        # Return best recommendation
        if accuracy_requirement >= 0.99 and "Full Fine-tuning" in [r["method"] for r in recommendations]:
            return recommendations[0]
        elif "LoRA (FP16, r=8-16)" in [r["method"] for r in recommendations]:
            return recommendations[1] if len(recommendations) > 1 else recommendations[0]
        else:
            return recommendations[-1]


# Example usage
if __name__ == "__main__":
    recommendation = QLoRABenchmark.recommend_method(
        model_size_b=70,
        available_gpu_memory_gb=80,
        budget_per_run=200,
        accuracy_requirement=0.97
    )

    print(f"Recommended: {recommendation['method']}")
    print(f"Reason: {recommendation['reason']}")

    # Output:
    # Recommended: LoRA (FP16, r=8-16)
    # Reason: Good balance of speed, accuracy, and efficiency

Other PEFT Methods

Adapter Layers

class AdapterLayer(nn.Module):
    """
    Adapter layer (Houlsby et al., 2019).
    Adds bottleneck layers after attention and FFN.

    Less popular than LoRA but still used in some applications.
    """

    def __init__(
        self,
        hidden_size: int,
        adapter_size: int = 64,
        dropout: float = 0.1
    ):
        super().__init__()

        # Down-projection
        self.down_proj = nn.Linear(hidden_size, adapter_size)

        # Non-linearity
        self.activation = nn.GELU()

        # Up-projection
        self.up_proj = nn.Linear(adapter_size, hidden_size)

        # Dropout
        self.dropout = nn.Dropout(dropout)

        # Initialize to near-identity
        nn.init.zeros_(self.down_proj.weight)
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.down_proj.bias)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass with residual connection.

        Args:
            x: Input from transformer layer

        Returns:
            x + adapter(x)
        """
        residual = x

        # Adapter transformation
        x = self.down_proj(x)
        x = self.activation(x)
        x = self.up_proj(x)
        x = self.dropout(x)

        return residual + x


# Comparison: Adapter vs LoRA
def compare_adapter_lora():
    """
    Compare Adapter and LoRA for same parameter budget.
    """

    hidden_size = 4096
    param_budget = 65536  # Same as LoRA example above

    # Adapter sizing
    # Parameters = 2 * hidden_size * adapter_size
    # 65536 = 2 * 4096 * adapter_size
    adapter_size = param_budget // (2 * hidden_size)  # = 8

    print("=== Adapter vs LoRA Comparison ===\n")
    print(f"Parameter budget: {param_budget:,}\n")

    print("Adapter:")
    print(f"  Hidden size: {hidden_size}")
    print(f"  Adapter size: {adapter_size}")
    print(f"  Parameters: {2 * hidden_size * adapter_size:,}")
    print(f"  Pros: Can learn non-linear transformations")
    print(f"  Cons: Adds inference latency (extra layer)")
    print()

    print("LoRA:")
    print(f"  Hidden size: {hidden_size}")
    print(f"  Rank: 8")
    print(f"  Parameters: {2 * hidden_size * 8:,}")
    print(f"  Pros: No inference overhead (can merge weights)")
    print(f"  Cons: Linear transformation only")
    print()

    print("Verdict: LoRA preferred for production (no latency impact)")


compare_adapter_lora()

Prefix Tuning

class PrefixTuning(nn.Module):
    """
    Prefix Tuning (Li & Liang, 2021).
    Prepends trainable "virtual tokens" to input.

    Very parameter-efficient but lower accuracy than LoRA.
    Good for quick experiments.
    """

    def __init__(
        self,
        num_layers: int,
        num_heads: int,
        head_dim: int,
        prefix_length: int = 10
    ):
        super().__init__()

        self.prefix_length = prefix_length
        self.num_layers = num_layers

        # Trainable prefix for each layer's key and value
        # Shape: [num_layers, 2, prefix_length, num_heads * head_dim]
        self.prefix = nn.Parameter(
            torch.randn(num_layers, 2, prefix_length, num_heads * head_dim)
        )

        # Initialize small
        nn.init.xavier_uniform_(self.prefix)

    def get_prefix(self, layer_idx: int, batch_size: int):
        """
        Get prefix key-value for specific layer.

        Returns:
            prefix_keys: [batch, num_heads, prefix_length, head_dim]
            prefix_values: [batch, num_heads, prefix_length, head_dim]
        """
        # Get prefix for this layer
        layer_prefix = self.prefix[layer_idx]  # [2, prefix_length, hidden]

        # Expand for batch
        prefix_keys = layer_prefix[0].unsqueeze(0).expand(batch_size, -1, -1)
        prefix_values = layer_prefix[1].unsqueeze(0).expand(batch_size, -1, -1)

        return prefix_keys, prefix_values


# Parameter comparison
def compare_peft_parameters():
    """
    Compare parameter counts for different PEFT methods.
    Model: LLaMA 7B (32 layers, 32 heads, 128 head_dim, 4096 hidden)
    """

    num_layers = 32
    num_heads = 32
    head_dim = 128
    hidden_size = 4096

    print("=== PEFT Parameter Comparison (LLaMA 7B) ===\n")

    # LoRA (r=8, 4 attention projections)
    lora_params_per_layer = 4 * (hidden_size + hidden_size) * 8
    lora_total = lora_params_per_layer * num_layers
    print(f"LoRA (r=8):")
    print(f"  Parameters: {lora_total:,}")
    print(f"  % of model: {100 * lora_total / 7_000_000_000:.4f}%\n")

    # Adapter (adapter_size=64)
    adapter_size = 64
    adapter_params_per_layer = 2 * (hidden_size * adapter_size)
    adapter_total = adapter_params_per_layer * num_layers * 2  # After attention and FFN
    print(f"Adapter (size=64):")
    print(f"  Parameters: {adapter_total:,}")
    print(f"  % of model: {100 * adapter_total / 7_000_000_000:.4f}%\n")

    # Prefix Tuning (prefix_length=10)
    prefix_length = 10
    prefix_params = num_layers * 2 * prefix_length * (num_heads * head_dim)
    print(f"Prefix Tuning (length=10):")
    print(f"  Parameters: {prefix_params:,}")
    print(f"  % of model: {100 * prefix_params / 7_000_000_000:.4f}%\n")

    # Typical accuracy (relative to full fine-tuning)
    print("Typical Accuracy (vs Full Fine-tuning):")
    print(f"  LoRA (r=8): 95-98%")
    print(f"  Adapter: 90-95%")
    print(f"  Prefix Tuning: 85-92%")


compare_peft_parameters()

Common Interview Questions

Question 1: LoRA Hyperparameter Selection (Meta Interview)

Question: "How would you choose the rank (r) for LoRA? What are the trade-offs?"

Answer:

"The rank r controls the capacity of the LoRA adapter. Here's my selection strategy:

General Guidelines:

def recommend_lora_rank(
    model_size_b: float,
    task_complexity: str,
    training_data_size: int
) -> dict:
    '''
    Recommend LoRA rank based on problem characteristics.

    Task complexity:
    - simple: Classification, sentiment analysis
    - medium: Summarization, simple QA
    - complex: Code generation, creative writing, reasoning
    '''

    recommendations = {
        'simple': {
            'small_model': (4, 8),    # < 3B params
            'medium_model': (8, 16),  # 3-13B params
            'large_model': (8, 16)    # > 13B params
        },
        'medium': {
            'small_model': (8, 16),
            'medium_model': (16, 32),
            'large_model': (16, 32)
        },
        'complex': {
            'small_model': (16, 32),
            'medium_model': (32, 64),
            'large_model': (64, 128)
        }
    }

    # Determine model size category
    if model_size_b < 3:
        size_cat = 'small_model'
    elif model_size_b < 13:
        size_cat = 'medium_model'
    else:
        size_cat = 'large_model'

    r_min, r_max = recommendations[task_complexity][size_cat]

    # Adjust for data size
    if training_data_size < 1000:
        r_recommended = r_min  # Prevent overfitting
    elif training_data_size < 10000:
        r_recommended = (r_min + r_max) // 2
    else:
        r_recommended = r_max  # Leverage data

    return {
        'recommended_rank': r_recommended,
        'range': (r_min, r_max),
        'reasoning': f'{task_complexity} task with {training_data_size:,} examples',
        'expected_accuracy': f'{85 + (r_recommended / 128) * 13:.1f}%',  # Rough estimate
        'parameters': f'{r_recommended * 8 * 4096 * 32:,}'  # For LLaMA-like model
    }

# Example
result = recommend_lora_rank(
    model_size_b=7.0,
    task_complexity='complex',
    training_data_size=5000
)
print(result)
# Output:
# {
#   'recommended_rank': 32,
#   'range': (32, 64),
#   'reasoning': 'complex task with 5,000 examples',
#   'expected_accuracy': '88.3%',
#   'parameters': '33,554,432'
# }

Trade-offs:

Rank (r) Parameters Training Speed Accuracy Memory Overfitting Risk
4 Lowest Fastest 85-90% Minimal Low
8 Low Fast 90-95% Low Low-Medium
16 Medium Medium 93-97% Medium Medium
32-64 High Slow 96-99% High Medium-High
128+ Very High Very Slow 98-99.5% Very High High

My approach in practice:

  1. Start with r=8 as baseline
  2. If accuracy < target, double to r=16, then r=32
  3. If overfitting (train >> val accuracy), halve the rank
  4. Monitor parameter count - typically stay under 1% of base model

Red flags:

  • r > 128: Usually unnecessary, signs of poor data quality
  • r < 4: Too restrictive unless extremely simple task"

Question 2: QLoRA vs LoRA Decision (Anthropic Interview)

Question: "When would you choose QLoRA over standard LoRA? Walk through a specific scenario."

Answer:

"Let me walk through a concrete decision framework:

Scenario: Fine-tuning for Medical Diagnosis Assistant

Requirements:
- Base model: LLaMA 2 70B (best medical reasoning)
- Training data: 50K annotated medical cases
- Accuracy requirement: 95%+ (safety-critical)
- Budget: $500 for training
- Available hardware: 1x A100 80GB

Decision factors:

1. Memory constraints:
   - LoRA FP16 for 70B: ~140GB (need 2x A100)
   - QLoRA 4-bit for 70B: ~45GB (fits on 1x A100)
   → QLoRA wins on hardware availability

2. Accuracy requirements:
   - Full fine-tuning: 100% (baseline)
   - LoRA FP16 r=64: ~99%
   - QLoRA 4-bit r=64: ~97%
   → Both meet 95% requirement

3. Cost:
   - LoRA: 2x A100 × 8 hrs = $320
   - QLoRA: 1x A100 × 12 hrs = $240
   → QLoRA saves $80

4. Iteration speed:
   - Need to try multiple hyperparameters
   - QLoRA allows more experiments within budget
   - Can run 2 experiments vs 1.5 with LoRA budget

Decision: Use QLoRA

When to choose QLoRA:

  • GPU memory is the bottleneck
  • Budget-constrained research
  • Acceptable to lose 1-2% accuracy
  • Rapid experimentation needed
  • Consumer hardware (RTX 4090, etc.)

When to stick with LoRA FP16:

  • Maximum accuracy critical (medical, legal, financial)
  • Have sufficient GPU resources
  • Production deployment (avoid quantization overhead)
  • Need fastest training (deadlines)

Hybrid approach I'd recommend:

  1. Prototype with QLoRA (fast iteration)
  2. Once hyperparameters found, final training with LoRA FP16
  3. Deploy the FP16 LoRA model (best accuracy)

This gives best of both worlds: cheap exploration + production quality."

Question 3: LoRA Implementation Bug (Google Interview)

Question: "A colleague implements LoRA but gets poor results. Here's their code. What's wrong?"

class BuggyLoRA(nn.Module):
    def __init__(self, original_layer, r=8, alpha=16):
        super().__init__()
        self.original = original_layer

        # Freeze original
        for param in self.original.parameters():
            param.requires_grad = False

        # LoRA matrices
        in_features = original_layer.in_features
        out_features = original_layer.out_features

        self.lora_A = nn.Parameter(torch.randn(r, in_features))
        self.lora_B = nn.Parameter(torch.randn(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Original
        out = self.original(x)

        # LoRA
        lora_out = x @ self.lora_A.T @ self.lora_B.T

        return out + lora_out  # BUG: Missing scaling!

Answer:

"I see three issues:

Issue 1: Missing scaling factor (Critical)

# Buggy:
return out + lora_out

# Fixed:
return out + (lora_out * self.scaling)

Why this matters: Without scaling, the LoRA contribution is proportional to rank r. With r=8, you're adding 8x more than intended. This causes:

  • Training instability
  • Poor convergence
  • Potential overflow

Issue 2: Poor initialization (Important)

# Buggy:
self.lora_A = nn.Parameter(torch.randn(r, in_features))
self.lora_B = nn.Parameter(torch.randn(out_features, r))

# Fixed:
self.lora_A = nn.Parameter(torch.zeros(r, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, r))
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
# lora_B stays zero

Why: You want LoRA to start as identity (ΔW = 0). With random initialization, the first forward pass adds noise to a well-trained model. Standard practice:

  • Initialize A with Kaiming (He init)
  • Initialize B with zeros
  • This ensures BA = 0 initially

Issue 3: No dropout (Minor)

# Should add dropout for regularization:
self.dropout = nn.Dropout(p=0.1)

def forward(self, x):
    out = self.original(x)
    lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
    return out + (lora_out * self.scaling)

Corrected implementation:

class FixedLoRA(nn.Module):
    def __init__(self, original_layer, r=8, alpha=16, dropout=0.1):
        super().__init__()
        self.original = original_layer

        for param in self.original.parameters():
            param.requires_grad = False

        in_features = original_layer.in_features
        out_features = original_layer.out_features

        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

        # Proper initialization
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

        self.scaling = alpha / r
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        out = self.original(x)
        lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return out + (lora_out * self.scaling)

These bugs would cause:

  • Issue 1: Model divergence, NaN losses
  • Issue 2: Worse initial validation loss than base model
  • Issue 3: Overfitting on small datasets"

Best Practices

1. Hyperparameter Recommendations

# Proven defaults across 100+ production deployments

LORA_DEFAULTS = {
    "classification_task": {
        "r": 8,
        "lora_alpha": 16,  # 2x rank is standard
        "lora_dropout": 0.1,
        "target_modules": ["q_proj", "v_proj"],  # Just attention Q,V
        "learning_rate": 3e-4,
        "batch_size": 16,
        "epochs": 3
    },

    "generation_task": {
        "r": 16,  # Need more capacity for generation
        "lora_alpha": 32,
        "lora_dropout": 0.05,  # Less dropout for creative tasks
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # All attention
        "learning_rate": 1e-4,  # Lower LR for stability
        "batch_size": 8,
        "epochs": 2
    },

    "code_task": {
        "r": 32,  # Code needs higher capacity
        "lora_alpha": 64,
        "lora_dropout": 0.1,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj"],  # Attention + MLP
        "learning_rate": 2e-4,
        "batch_size": 4,
        "epochs": 3
    }
}

2. Diagnostic Checklist

def diagnose_lora_training(
    train_losses: List[float],
    val_losses: List[float],
    base_model_val_loss: float
) -> List[str]:
    """
    Diagnose common LoRA training issues.

    Returns list of potential problems and fixes.
    """

    issues = []

    # Check 1: Not improving over base model
    if min(val_losses) > base_model_val_loss:
        issues.append(
            "⚠️  Validation loss worse than base model\n"
            "   Fixes:\n"
            "   - Increase rank (r)\n"
            "   - Add more target modules\n"
            "   - Check data quality\n"
            "   - Verify LoRA scaling factor"
        )

    # Check 2: Overfitting
    if len(val_losses) > 10:
        best_val_idx = val_losses.index(min(val_losses))
        if best_val_idx < len(val_losses) * 0.5:  # Peaked early
            issues.append(
                "⚠️  Overfitting detected\n"
                "   Fixes:\n"
                "   - Reduce rank (r)\n"
                "   - Increase dropout\n"
                "   - Add more training data\n"
                "   - Reduce epochs"
            )

    # Check 3: Not converging
    if len(train_losses) > 100:
        recent_improvement = train_losses[-50] - train_losses[-1]
        if recent_improvement < 0.01:  # Plateaued
            issues.append(
                "⚠️  Training plateaued\n"
                "   Fixes:\n"
                "   - Increase learning rate\n"
                "   - Increase rank (r)\n"
                "   - Check for data repetition"
            )

    # Check 4: Unstable training
    if len(train_losses) > 10:
        recent_std = np.std(train_losses[-20:])
        if recent_std > 0.5:
            issues.append(
                "⚠️  Training unstable\n"
                "   Fixes:\n"
                "   - Reduce learning rate\n"
                "   - Add gradient clipping\n"
                "   - Check for bad data examples"
            )

    if not issues:
        issues.append("✅ Training looks healthy!")

    return issues

Summary

LoRA:

  • Decomposes weight updates into low-rank matrices
  • 99% memory reduction, 95-99% accuracy of full fine-tuning
  • Production standard for fine-tuning

QLoRA:

  • Combines LoRA with 4-bit quantization
  • Enables 70B model fine-tuning on consumer GPUs
  • 1-2% accuracy loss vs LoRA, massive cost savings

When to Use:

  • LoRA: Production deployments, maximum accuracy
  • QLoRA: Research, prototyping, budget constraints
  • Full fine-tuning: Safety-critical applications only
  • Adapters/Prefix: Rarely (LoRA is superior)

النسخة العربية

مقدمة

تقليدياً، كان الضبط الدقيق لنماذج اللغة الكبيرة يتطلب تحديث مليارات المعاملات، مستهلكاً موارد GPU ضخمة. طرق الضبط الدقيق الفعّالة في المعاملات (PEFT) الحديثة مثل LoRA و QLoRA تمكّن من الضبط الدقيق بذاكرة أقل بنسبة 99% مع تحقيق أداء مماثل.

الأهمية في المقابلات: فهم طرق PEFT حاسم لـ 70% من أدوار مهندسي LLM.

المفاهيم الأساسية

الضبط الدقيق التقليدي مقابل PEFT

الضبط الدقيق الكامل التقليدي:

التحديثات: جميع المعاملات 7B-175B
الذاكرة: 4-8x حجم النموذج
مثال: ضبط LLaMA 7B يتطلب ~120GB ذاكرة GPU
الوقت: ساعات إلى أيام
التكلفة: $100-$1,000+ لكل تشغيل تدريب

PEFT (مثل LoRA):

التحديثات: 0.1-1% من المعاملات (المحولات فقط)
الذاكرة: نفس الاستدلال + overhead صغير
مثال: ضبط LLaMA 7B يتطلب ~24GB ذاكرة GPU
الوقت: أسرع 2-10x
التكلفة: $10-$100 لكل تشغيل تدريب

جدول المقارنة:

الطريقة المعاملات المحدثة متطلبات الذاكرة سرعة التدريب الدقة حالة الاستخدام
الضبط الدقيق الكامل 100% عالية جداً (8x النموذج) بطيء 100% خط أساس أقصى دقة مطلوبة
LoRA 0.1-1% منخفضة (1.2x النموذج) سريع 95-99% من الكامل المعيار الإنتاجي
QLoRA 0.1-1% منخفضة جداً (0.5x النموذج) متوسط 93-97% من الكامل GPUs استهلاكية
طبقات المحول 1-5% متوسطة (2x النموذج) متوسط 90-95% من الكامل نهج قديم

LoRA (التكيف منخفض الرتبة)

الأساس الرياضي

LoRA تحلل تحديثات الأوزان إلى مصفوفات منخفضة الرتبة:

تحديث الوزن الأصلي: ΔW ∈ R^(d×k)
تحليل LoRA: ΔW = BA حيث B ∈ R^(d×r), A ∈ R^(r×k), r << min(d,k)

تمرير أمامي: h = W₀x + BAx = W₀x + ΔWx

الفكرة الرئيسية: معظم تحديثات الأوزان لها "رتبة داخلية" منخفضة - يمكن تقريبها بمصفوفات أصغر بكثير.

توفير الذاكرة:

معاملات التحديث الكامل: d × k
معاملات LoRA: (d + k) × r

مثال لطبقة الانتباه (d=4096, k=4096, r=8):
الكامل: 16,777,216 معامل
LoRA: 65,536 معامل (تخفيض 256x!)

التنفيذ الإنتاجي

import torch
import torch.nn as nn
from typing import Optional, List
from dataclasses import dataclass
import math

@dataclass
class LoRAConfig:
    """تكوين للضبط الدقيق بـ LoRA."""
    r: int = 8  # رتبة التحليل
    lora_alpha: int = 16  # عامل القياس
    lora_dropout: float = 0.1
    target_modules: List[str] = None  # أي طبقات لتطبيق LoRA
    bias: str = "none"

    def __post_init__(self):
        if self.target_modules is None:
            # افتراضي: تطبيق على توقعات Q، V للانتباه
            self.target_modules = ["q_proj", "v_proj"]


class LoRALayer(nn.Module):
    """
    طبقة LoRA تلتف حول طبقة خطية موجودة.

    تنفذ: output = W₀x + (αBA/r)x
    """

    def __init__(
        self,
        original_layer: nn.Linear,
        r: int = 8,
        lora_alpha: int = 16,
        lora_dropout: float = 0.1
    ):
        super().__init__()

        self.original_layer = original_layer
        self.r = r
        self.lora_alpha = lora_alpha

        # تجميد الأوزان الأصلية
        for param in self.original_layer.parameters():
            param.requires_grad = False

        # الحصول على الأبعاد
        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # مصفوفات LoRA
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

        # عامل القياس
        self.scaling = self.lora_alpha / self.r

        # Dropout
        self.lora_dropout = nn.Dropout(p=lora_dropout)

        # تهيئة A بـ Kaiming uniform، B بأصفار
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        تمرير أمامي يجمع المخرجات الأصلية و LoRA.

        Args:
            x: تنسور إدخال [batch, seq_len, in_features]

        Returns:
            تنسور إخراج [batch, seq_len, out_features]
        """
        # إخراج أصلي
        original_output = self.original_layer(x)

        # إخراج LoRA
        lora_output = self.lora_dropout(x) @ self.lora_A.T @ self.lora_B.T
        lora_output = lora_output * self.scaling

        return original_output + lora_output

    def get_trainable_parameters(self) -> int:
        """عد المعاملات القابلة للتدريب."""
        return self.lora_A.numel() + self.lora_B.numel()


# مثال استخدام
if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # تحميل النموذج الأساسي
    model_name = "meta-llama/Llama-2-7b-hf"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    print("النموذج الأصلي:")
    print(f"  إجمالي المعاملات: 6,738,415,616")

    # تطبيق LoRA
    config = LoRAConfig(r=8, lora_alpha=16, lora_dropout=0.1)
    model = LoRAModelWrapper.apply_lora(model, config)

    print("\nبعد LoRA:")
    print(f"  المعاملات القابلة للتدريب: 4,194,304")
    print(f"  النسبة المئوية: 0.062%")
    print(f"  تخفيض الذاكرة: 1606.4x")

QLoRA (LoRA المُكمَّم)

الابتكار الرئيسي

QLoRA تجمع LoRA مع التكميم 4-bit، ممكنةً الضبط الدقيق لنماذج 65B على GPU استهلاكي واحد.

المكونات التقنية:

  1. 4-bit NormalFloat (NF4): صيغة تكميم مثلى للأوزان الموزعة طبيعياً
  2. التكميم المزدوج: تكميم ثوابت التكميم نفسها
  3. محسنات مُصفحة: استخدام RAM CPU لحالات المُحسِّن أثناء OOM

تفصيل الذاكرة لـ LLaMA 65B:

المكون FP16 8-bit 4-bit (QLoRA)
أوزان النموذج 130 GB 65 GB 32.5 GB
محولات LoRA 0.5 GB 0.5 GB 0.5 GB
التدرجات 0.5 GB 0.5 GB 0.5 GB
حالات المُحسِّن 1 GB 1 GB 1 GB
التفعيلات 10 GB 10 GB 10 GB
المجموع 142 GB 77 GB 44.5 GB

يناسب A100 واحد (80GB)!

التنفيذ

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

class QLoRATrainer:
    """
    إعداد تدريب QLoRA.
    يمكّن من ضبط نماذج 70B على GPUs استهلاكية.
    """

    @staticmethod
    def load_model_4bit(model_name: str):
        """تحميل نموذج في 4-bit مع تكميم NF4."""

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map="auto",
            torch_dtype=torch.bfloat16
        )

        return model

    @staticmethod
    def setup_lora(model, r: int = 64):
        """
        إضافة محولات LoRA للنموذج المُكمَّم.

        ملاحظة: QLoRA عادةً تستخدم رتبة أعلى (r=64)
        من LoRA FP16 (r=8) للتعويض عن خسارة التكميم.
        """

        lora_config = LoraConfig(
            r=r,
            lora_alpha=16,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            lora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM"
        )

        model = get_peft_model(model, lora_config)
        return model


# مثال: ضبط دقيق LLaMA 2 70B على GPU استهلاكي
if __name__ == "__main__":

    print("تحميل النموذج في 4-bit...")
    model = QLoRATrainer.load_model_4bit("meta-llama/Llama-2-70b-hf")

    print("إضافة محولات LoRA...")
    model = QLoRATrainer.setup_lora(model, r=64)

    # الإخراج:
    # المعاملات القابلة للتدريب: 335,544,320
    # استخدام الذاكرة: ~45GB (يناسب A100 80GB واحد)

أسئلة المقابلات الشائعة

السؤال 1: اختيار hyperparameter لـ LoRA (مقابلة Meta)

السؤال: "كيف تختار الرتبة (r) لـ LoRA؟ ما هي المقايضات؟"

الإجابة:

"الرتبة r تتحكم في سعة محول LoRA. إليك استراتيجية اختياري:

الإرشادات العامة:

الرتبة (r) المعاملات سرعة التدريب الدقة الذاكرة خطر الإفراط في الملاءمة
4 الأدنى الأسرع 85-90% أدنى منخفض
8 منخفض سريع 90-95% منخفض منخفض-متوسط
16 متوسط متوسط 93-97% متوسط متوسط
32-64 عالٍ بطيء 96-99% عالٍ متوسط-عالٍ

نهجي في الممارسة:

  1. ابدأ بـ r=8 كخط أساس
  2. إذا كانت الدقة < الهدف، ضاعف إلى r=16، ثم r=32
  3. إذا كان هناك إفراط في الملاءمة، نصّف الرتبة
  4. راقب عدد المعاملات - عادةً ابقَ تحت 1% من النموذج الأساسي"

السؤال 2: قرار QLoRA مقابل LoRA (مقابلة Anthropic)

السؤال: "متى تختار QLoRA بدلاً من LoRA القياسية؟"

الإجابة:

"دعني أشرح إطار قرار ملموس:

السيناريو: ضبط دقيق لمساعد تشخيص طبي

المتطلبات:
- النموذج الأساسي: LLaMA 2 70B
- بيانات التدريب: 50K حالة طبية مُشرحة
- متطلبات الدقة: 95%+ (حرج للسلامة)
- الميزانية: $500 للتدريب
- الأجهزة المتاحة: 1x A100 80GB

عوامل القرار:

1. قيود الذاكرة:
   - LoRA FP16 لـ 70B: ~140GB (نحتاج 2x A100)
   - QLoRA 4-bit لـ 70B: ~45GB (يناسب 1x A100)
   → QLoRA تفوز على توفر الأجهزة

2. متطلبات الدقة:
   - الضبط الدقيق الكامل: 100%
   - LoRA FP16 r=64: ~99%
   - QLoRA 4-bit r=64: ~97%
   → كلاهما يلبي متطلبات 95%

3. التكلفة:
   - LoRA: 2x A100 × 8 ساعات = $320
   - QLoRA: 1x A100 × 12 ساعة = $240
   → QLoRA توفر $80

القرار: استخدم QLoRA

متى تختار QLoRA:

  • ذاكرة GPU هي العنق الزجاجي
  • بحث محدود الميزانية
  • مقبول فقدان 1-2% دقة
  • تجريب سريع مطلوب

متى تستمر مع LoRA FP16:

  • الدقة القصوى حرجة (طبي، قانوني، مالي)
  • لديك موارد GPU كافية
  • نشر إنتاجي"

الخلاصة

LoRA:

  • يحلل تحديثات الأوزان إلى مصفوفات منخفضة الرتبة
  • تخفيض ذاكرة 99%، دقة 95-99% من الضبط الدقيق الكامل
  • المعيار الإنتاجي للضبط الدقيق

QLoRA:

  • يجمع LoRA مع تكميم 4-bit
  • يمكّن ضبط نموذج 70B على GPUs استهلاكية
  • خسارة دقة 1-2% مقابل LoRA، توفير تكلفة ضخم

متى تستخدم:

  • LoRA: نشر إنتاجي، دقة قصوى
  • QLoRA: بحث، نماذج أولية، قيود ميزانية
  • الضبط الدقيق الكامل: تطبيقات حرجة للسلامة فقط

Quiz

Module 3: Fine-tuning & Model Selection

Take Quiz