Fine-Tuning and Model Selection
02-fine-tuning-methods
English Version
Fine-tuning large language models traditionally required updating billions of parameters, consuming massive GPU resources. Modern Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA update fewer than 1% of a model's parameters and cut GPU memory requirements several-fold while achieving comparable performance.
Interview Relevance: Understanding PEFT methods is critical for 70% of LLM engineer roles. Companies ask about LoRA implementation, trade-offs, and when to use different methods.
Core Concepts
Traditional Fine-tuning vs PEFT
Traditional Full Fine-tuning:
- Updates: All 7B-175B parameters
- Memory: 4-8x model size (gradients, optimizer states; a back-of-envelope sketch follows the comparison table below)
- Example: Fine-tuning LLaMA 7B requires ~120GB GPU memory
- Time: Hours to days
- Cost: $100-$1,000+ per training run
PEFT (e.g., LoRA):
- Updates: 0.1-1% of parameters (adapters only)
- Memory: Same as inference + small overhead
- Example: Fine-tuning LLaMA 7B requires ~24GB GPU memory
- Time: 2-10x faster
- Cost: $10-$100 per training run
Comparison Table:
| Method | Parameters Updated | Memory Req. | Training Speed | Accuracy | Use Case |
|---|---|---|---|---|---|
| Full Fine-tuning | 100% | Very High (8x model) | Slow | 100% baseline | Maximum accuracy needed |
| LoRA | 0.1-1% | Low (1.2x model) | Fast | 95-99% of full | Production standard |
| QLoRA | 0.1-1% | Very Low (0.5x model) | Medium | 93-97% of full | Consumer GPUs |
| Adapter Layers | 1-5% | Medium (2x model) | Medium | 90-95% of full | Legacy approach |
| Prefix Tuning | 0.01-0.1% | Very Low | Very Fast | 85-92% of full | Quick experiments |
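The memory multipliers above come mostly from optimizer state: AdamW keeps gradients plus two moment buffers for every trainable parameter, so shrinking the trainable set removes almost all of the training overhead. A back-of-envelope sketch (my own illustration, assuming FP16 weights and FP32 AdamW state, activations ignored):
def estimate_finetune_memory_gb(n_params_b: float, trainable_fraction: float = 1.0) -> float:
    """Rough GPU memory in GB: FP16 weights plus AdamW training overhead for trainable params."""
    weights = n_params_b * 2  # all weights in FP16 (2 bytes/param)
    training_overhead = n_params_b * trainable_fraction * 12  # FP32 grads + two moment buffers
    return weights + training_overhead

print(estimate_finetune_memory_gb(7.0, trainable_fraction=1.0))    # ~98 GB before activations
print(estimate_finetune_memory_gb(7.0, trainable_fraction=0.005))  # ~14.4 GB before activations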
LoRA (Low-Rank Adaptation)
Mathematical Foundation
LoRA decomposes weight updates into low-rank matrices:
Original weight update: ΔW ∈ R^(d×k)
LoRA decomposition: ΔW = BA where B ∈ R^(d×r), A ∈ R^(r×k), r << min(d,k)
Forward pass: h = W₀x + BAx = W₀x + ΔWx
Key Insight: Most weight updates have low "intrinsic rank" - they can be approximated by much smaller matrices.
Memory Savings:
- Full update parameters: d × k
- LoRA parameters: (d + k) × r
- Example for an attention layer (d=4096, k=4096, r=8):
  - Full: 16,777,216 parameters
  - LoRA: 65,536 parameters (a 256x reduction)
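A two-line helper (my own sketch) reproduces these counts and generalizes to any layer shape and rank:
def lora_param_counts(d: int, k: int, r: int) -> tuple:
    """Parameter count for a full d×k update vs. its rank-r LoRA factors B (d×r) and A (r×k)."""
    return d * k, (d + k) * r

full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, full // lora)  # 16777216 65536 256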
Production Implementation
import torch
import torch.nn as nn
from typing import Optional, List
from dataclasses import dataclass
import math
@dataclass
class LoRAConfig:
"""Configuration for LoRA fine-tuning."""
r: int = 8 # Rank of decomposition
lora_alpha: int = 16 # Scaling factor
lora_dropout: float = 0.1
target_modules: List[str] = None # Which layers to apply LoRA
bias: str = "none" # "none", "all", "lora_only"
def __post_init__(self):
if self.target_modules is None:
# Default: apply to attention Q, V projections
self.target_modules = ["q_proj", "v_proj"]
class LoRALayer(nn.Module):
"""
LoRA layer that wraps an existing linear layer.
Implements: output = W₀x + (αBA/r)x
where α is lora_alpha, r is rank
"""
def __init__(
self,
original_layer: nn.Linear,
r: int = 8,
lora_alpha: int = 16,
lora_dropout: float = 0.1
):
super().__init__()
self.original_layer = original_layer
self.r = r
self.lora_alpha = lora_alpha
# Freeze original weights
for param in self.original_layer.parameters():
param.requires_grad = False
# Get dimensions
in_features = original_layer.in_features
out_features = original_layer.out_features
# LoRA matrices
self.lora_A = nn.Parameter(torch.zeros(r, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, r))
# Scaling factor
self.scaling = self.lora_alpha / self.r
        # Dropout
        self.lora_dropout = nn.Dropout(p=lora_dropout)
        # Track whether LoRA weights have already been merged into the base layer
        self.merged = False
# Initialize A with Kaiming uniform, B with zeros
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
nn.init.zeros_(self.lora_B)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Forward pass combining original and LoRA outputs.
Args:
x: Input tensor [batch, seq_len, in_features]
Returns:
Output tensor [batch, seq_len, out_features]
"""
# Original output
original_output = self.original_layer(x)
# LoRA output: dropout(x) @ A^T @ B^T * scaling
lora_output = self.lora_dropout(x) @ self.lora_A.T @ self.lora_B.T
lora_output = lora_output * self.scaling
return original_output + lora_output
def merge_weights(self):
"""
Merge LoRA weights into original layer for inference.
Creates: W_new = W₀ + αBA/r
"""
if not self.merged:
# Compute ΔW = B @ A
delta_w = self.lora_B @ self.lora_A * self.scaling
# Add to original weights
self.original_layer.weight.data += delta_w
self.merged = True
def get_trainable_parameters(self) -> int:
"""Count trainable parameters."""
return self.lora_A.numel() + self.lora_B.numel()
class LoRAModelWrapper:
"""
Wraps a pre-trained model to add LoRA layers.
Compatible with HuggingFace transformers.
"""
@staticmethod
def apply_lora(
model: nn.Module,
config: LoRAConfig
) -> nn.Module:
"""
Apply LoRA to specified modules in the model.
Args:
model: Pre-trained model (e.g., LLaMA, GPT)
config: LoRA configuration
Returns:
Model with LoRA layers injected
"""
# Find and replace target modules
        # Snapshot the module list first so newly injected LoRA layers are not re-visited
        for name, module in list(model.named_modules()):
if any(target in name for target in config.target_modules):
if isinstance(module, nn.Linear):
# Get parent module and attribute name
parent_name = '.'.join(name.split('.')[:-1])
child_name = name.split('.')[-1]
parent_module = model.get_submodule(parent_name)
# Create LoRA layer
lora_layer = LoRALayer(
original_layer=module,
r=config.r,
lora_alpha=config.lora_alpha,
lora_dropout=config.lora_dropout
)
# Replace module
setattr(parent_module, child_name, lora_layer)
return model
@staticmethod
def count_parameters(model: nn.Module) -> dict:
"""Count trainable vs total parameters."""
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
return {
"total": total_params,
"trainable": trainable_params,
"trainable_percent": 100 * trainable_params / total_params,
"memory_reduction": total_params / trainable_params
}
# Example Usage
if __name__ == "__main__":
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Original model:")
original_stats = LoRAModelWrapper.count_parameters(model)
print(f" Total parameters: {original_stats['total']:,}")
print(f" Trainable: {original_stats['trainable']:,}")
# Apply LoRA
config = LoRAConfig(
r=8,
lora_alpha=16,
lora_dropout=0.1,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"] # All attention
)
model = LoRAModelWrapper.apply_lora(model, config)
print("\nAfter LoRA:")
lora_stats = LoRAModelWrapper.count_parameters(model)
print(f" Total parameters: {lora_stats['total']:,}")
print(f" Trainable: {lora_stats['trainable']:,}")
print(f" Trainable %: {lora_stats['trainable_percent']:.3f}%")
print(f" Memory reduction: {lora_stats['memory_reduction']:.1f}x")
    # Expected output (approximate, for the 4 attention projections above at r=8):
    # Original model:
    #   Total parameters: 6,738,415,616
    #   Trainable: 6,738,415,616
    #
    # After LoRA:
    #   Total parameters: 6,746,804,224
    #   Trainable: 8,388,608
    #   Trainable %: 0.124%
    #   Memory reduction: 804.3x
Training Loop with LoRA
import torch
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup
from tqdm import tqdm
import wandb
from typing import Dict, List
class LoRATrainer:
"""
Trainer for LoRA fine-tuning with best practices.
"""
def __init__(
self,
model: nn.Module,
train_dataloader: DataLoader,
eval_dataloader: DataLoader,
learning_rate: float = 3e-4,
num_epochs: int = 3,
warmup_steps: int = 100,
max_grad_norm: float = 1.0,
logging_steps: int = 10,
eval_steps: int = 500,
save_steps: int = 1000,
output_dir: str = "./lora_checkpoints"
):
self.model = model
self.train_dataloader = train_dataloader
self.eval_dataloader = eval_dataloader
self.num_epochs = num_epochs
self.max_grad_norm = max_grad_norm
self.logging_steps = logging_steps
self.eval_steps = eval_steps
self.save_steps = save_steps
self.output_dir = output_dir
# Optimizer - only LoRA parameters
self.optimizer = torch.optim.AdamW(
[p for p in model.parameters() if p.requires_grad],
lr=learning_rate,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0.01
)
# Learning rate scheduler
total_steps = len(train_dataloader) * num_epochs
self.scheduler = get_linear_schedule_with_warmup(
self.optimizer,
num_warmup_steps=warmup_steps,
num_training_steps=total_steps
)
# Metrics
self.global_step = 0
self.best_eval_loss = float('inf')
def train(self):
"""Main training loop."""
# Initialize wandb
wandb.init(
project="lora-fine-tuning",
config={
"learning_rate": self.optimizer.param_groups[0]['lr'],
"epochs": self.num_epochs,
"batch_size": self.train_dataloader.batch_size
}
)
self.model.train()
for epoch in range(self.num_epochs):
print(f"\n=== Epoch {epoch + 1}/{self.num_epochs} ===")
epoch_loss = 0
progress_bar = tqdm(self.train_dataloader, desc="Training")
for batch_idx, batch in enumerate(progress_bar):
# Forward pass
outputs = self.model(**batch)
loss = outputs.loss
# Backward pass
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(
self.model.parameters(),
self.max_grad_norm
)
# Optimizer step
self.optimizer.step()
self.scheduler.step()
self.optimizer.zero_grad()
# Update metrics
epoch_loss += loss.item()
self.global_step += 1
# Logging
if self.global_step % self.logging_steps == 0:
avg_loss = epoch_loss / (batch_idx + 1)
progress_bar.set_postfix({
'loss': f'{avg_loss:.4f}',
'lr': f'{self.scheduler.get_last_lr()[0]:.2e}'
})
wandb.log({
"train/loss": loss.item(),
"train/learning_rate": self.scheduler.get_last_lr()[0],
"train/epoch": epoch,
"train/step": self.global_step
})
# Evaluation
if self.global_step % self.eval_steps == 0:
eval_metrics = self.evaluate()
print(f"\nEval @ step {self.global_step}: {eval_metrics}")
wandb.log({
"eval/loss": eval_metrics['loss'],
"eval/perplexity": eval_metrics['perplexity']
})
# Save best model
if eval_metrics['loss'] < self.best_eval_loss:
self.best_eval_loss = eval_metrics['loss']
self.save_checkpoint("best")
self.model.train()
# Periodic checkpointing
if self.global_step % self.save_steps == 0:
self.save_checkpoint(f"step_{self.global_step}")
# Final save
self.save_checkpoint("final")
wandb.finish()
def evaluate(self) -> Dict[str, float]:
"""Evaluate on validation set."""
self.model.eval()
total_loss = 0
num_batches = 0
with torch.no_grad():
for batch in tqdm(self.eval_dataloader, desc="Evaluating"):
outputs = self.model(**batch)
total_loss += outputs.loss.item()
num_batches += 1
avg_loss = total_loss / num_batches
perplexity = torch.exp(torch.tensor(avg_loss)).item()
return {
"loss": avg_loss,
"perplexity": perplexity
}
def save_checkpoint(self, name: str):
"""Save LoRA weights only."""
import os
save_path = os.path.join(self.output_dir, name)
os.makedirs(save_path, exist_ok=True)
# Save only LoRA parameters
lora_state_dict = {
name: param
for name, param in self.model.named_parameters()
if param.requires_grad
}
torch.save({
'lora_weights': lora_state_dict,
'optimizer': self.optimizer.state_dict(),
'scheduler': self.scheduler.state_dict(),
'global_step': self.global_step,
'best_eval_loss': self.best_eval_loss
}, os.path.join(save_path, "adapter_model.bin"))
print(f"Saved checkpoint to {save_path}")
print(f" LoRA parameters: {sum(p.numel() for p in lora_state_dict.values()):,}")
print(f" File size: ~{sum(p.numel() for p in lora_state_dict.values()) * 2 / 1024**2:.2f} MB")
QLoRA (Quantized LoRA)
Key Innovation
QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of a 65B model on a single 48 GB GPU.
Technical Components:
- 4-bit NormalFloat (NF4): Optimal quantization format for normally distributed weights
- Double Quantization: Quantize the quantization constants themselves
- Paged Optimizers: Use CPU RAM for optimizer states during OOM
Memory Breakdown for LLaMA 65B:
| Component | FP16 | 8-bit | 4-bit (QLoRA) |
|---|---|---|---|
| Model weights | 130 GB | 65 GB | 32.5 GB |
| LoRA adapters | 0.5 GB | 0.5 GB | 0.5 GB |
| Gradients | 0.5 GB | 0.5 GB | 0.5 GB |
| Optimizer states | 1 GB | 1 GB | 1 GB |
| Activations (batch=1) | 10 GB | 10 GB | 10 GB |
| TOTAL | 142 GB | 77 GB | 44.5 GB |
Fits on single A100 (80GB)!
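The weight rows of the table follow directly from bytes per parameter; the sketch below (my own, folding the adapter, gradient, optimizer, and activation rows into a rough +12 GB constant) reproduces the totals:
def qlora_total_memory_gb(n_params_b: float, weight_bits: int, overhead_gb: float = 12.0) -> float:
    """Weights at the given precision plus a fixed allowance for adapters, grads, optimizer, activations."""
    return n_params_b * weight_bits / 8 + overhead_gb

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {qlora_total_memory_gb(65, bits):.1f} GB total")
# 16-bit weights: 142.0 GB total
# 8-bit weights: 77.0 GB total
# 4-bit weights: 44.5 GB total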
Implementation
import torch
from typing import List
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
class QLoRATrainer:
"""
QLoRA training setup using bitsandbytes and PEFT.
    Enables fine-tuning 65B-70B models on a single high-memory GPU.
"""
@staticmethod
def load_model_4bit(
model_name: str,
device_map: str = "auto"
):
"""
Load model in 4-bit with NF4 quantization.
Args:
model_name: HuggingFace model ID
device_map: Device placement strategy
Returns:
Quantized model ready for LoRA
"""
# Quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16
bnb_4bit_use_double_quant=True, # Double quantization
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map=device_map,
trust_remote_code=True,
torch_dtype=torch.bfloat16
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
return model
@staticmethod
def setup_lora(
model,
r: int = 64, # Higher rank for QLoRA
lora_alpha: int = 16,
target_modules: List[str] = None
):
"""
Add LoRA adapters to quantized model.
Note: QLoRA typically uses higher rank (r=64) than
FP16 LoRA (r=8) to compensate for quantization loss.
"""
if target_modules is None:
# Target all linear layers in attention and MLP
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
]
lora_config = LoraConfig(
r=r,
lora_alpha=lora_alpha,
target_modules=target_modules,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
return model
@staticmethod
def print_trainable_parameters(model):
"""Print trainable parameter statistics."""
trainable_params = 0
all_params = 0
for _, param in model.named_parameters():
all_params += param.numel()
if param.requires_grad:
trainable_params += param.numel()
print(f"Trainable params: {trainable_params:,} || "
f"All params: {all_params:,} || "
f"Trainable %: {100 * trainable_params / all_params:.4f}%")
# Example: Fine-tune LLaMA 2 70B on a single A100 80GB
if __name__ == "__main__":
# Load model in 4-bit
model_name = "meta-llama/Llama-2-70b-hf"
print("Loading model in 4-bit...")
model = QLoRATrainer.load_model_4bit(model_name)
print("\nAdding LoRA adapters...")
model = QLoRATrainer.setup_lora(
model,
r=64, # Higher rank for QLoRA
lora_alpha=16
)
QLoRATrainer.print_trainable_parameters(model)
# Output:
# Trainable params: 335,544,320 || All params: 68,976,648,192 || Trainable %: 0.4865%
#
# Memory usage: ~45GB (fits on single A100 80GB with batch_size=1-2)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Training proceeds as normal LoRA
# ... (use transformers Trainer or custom loop)
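To finish the run, a minimal sketch using the standard HuggingFace Trainer is shown below; train_dataset and eval_dataset are assumed to be pre-tokenized datasets you supply, and the hyperparameters are illustrative rather than tuned values:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./qlora_checkpoints",
    per_device_train_batch_size=1,       # a 70B model in 4-bit still needs tiny batches
    gradient_accumulation_steps=16,      # effective batch size of 16
    learning_rate=2e-4,
    num_train_epochs=2,
    bf16=True,                           # matches bnb_4bit_compute_dtype above
    logging_steps=10,
    save_steps=500,
)
trainer = Trainer(
    model=model,                         # the 4-bit model with LoRA adapters from above
    args=training_args,
    train_dataset=train_dataset,         # assumed: tokenized instruction data
    eval_dataset=eval_dataset,           # assumed: tokenized validation split
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()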
QLoRA Performance Characteristics
class QLoRABenchmark:
"""
Benchmark comparing QLoRA to other methods.
Based on empirical data from production deployments.
"""
@staticmethod
def get_performance_comparison():
"""
Performance metrics for fine-tuning LLaMA 2 7B on single GPU.
Task: Instruction following on 10K examples.
"""
benchmarks = {
"Full Fine-tuning (FP16)": {
"gpu_memory": "120 GB",
"min_gpu": "A100 80GB x2",
"training_time_hrs": 12,
"final_accuracy": 1.00, # baseline
"cost_per_run": 480, # 2x A100 for 12 hrs
"adapter_size_mb": 0, # No adapter, full model
},
"LoRA (FP16, r=8)": {
"gpu_memory": "28 GB",
"min_gpu": "A100 40GB",
"training_time_hrs": 4,
"final_accuracy": 0.98, # 98% of full
"cost_per_run": 80,
"adapter_size_mb": 8, # Tiny adapter
},
"LoRA (FP16, r=64)": {
"gpu_memory": "32 GB",
"min_gpu": "A100 40GB",
"training_time_hrs": 5,
"final_accuracy": 0.99, # 99% of full
"cost_per_run": 100,
"adapter_size_mb": 64,
},
"QLoRA (4-bit, r=64)": {
"gpu_memory": "12 GB",
"min_gpu": "RTX 4090 24GB",
"training_time_hrs": 8, # Slower due to quantization
"final_accuracy": 0.97, # 97% of full
"cost_per_run": 20, # Consumer GPU
"adapter_size_mb": 64,
},
}
return benchmarks
@staticmethod
def recommend_method(
model_size_b: int,
available_gpu_memory_gb: int,
budget_per_run: float,
accuracy_requirement: float
    ) -> dict:
"""
Recommend fine-tuning method based on constraints.
Args:
model_size_b: Model size in billions of parameters
available_gpu_memory_gb: GPU memory available
budget_per_run: Budget per training run in USD
accuracy_requirement: Minimum relative accuracy (0-1)
Returns:
Recommended method with justification
"""
# Memory requirements (approximate)
memory_full = model_size_b * 20 # GB for FP16 full fine-tuning
memory_lora = model_size_b * 4 # GB for FP16 LoRA
        memory_qlora = model_size_b * 0.7  # GB for 4-bit QLoRA (~0.7 GB per billion params, per the 65B table above)
recommendations = []
if available_gpu_memory_gb >= memory_full and accuracy_requirement >= 0.99:
recommendations.append({
"method": "Full Fine-tuning",
"reason": "Sufficient memory and highest accuracy required"
})
if available_gpu_memory_gb >= memory_lora and budget_per_run >= 80:
recommendations.append({
"method": "LoRA (FP16, r=8-16)",
"reason": "Good balance of speed, accuracy, and efficiency"
})
if available_gpu_memory_gb >= memory_qlora:
recommendations.append({
"method": "QLoRA (4-bit, r=64)",
"reason": "Memory constrained but can achieve good accuracy"
})
        if not recommendations:
            return {"method": "None", "reason": "Insufficient GPU memory for any PEFT method"}
# Return best recommendation
if accuracy_requirement >= 0.99 and "Full Fine-tuning" in [r["method"] for r in recommendations]:
return recommendations[0]
elif "LoRA (FP16, r=8-16)" in [r["method"] for r in recommendations]:
return recommendations[1] if len(recommendations) > 1 else recommendations[0]
else:
return recommendations[-1]
# Example usage
if __name__ == "__main__":
recommendation = QLoRABenchmark.recommend_method(
model_size_b=70,
available_gpu_memory_gb=80,
budget_per_run=200,
accuracy_requirement=0.97
)
print(f"Recommended: {recommendation['method']}")
print(f"Reason: {recommendation['reason']}")
    # Expected output (with the memory estimates above, a 70B model on 80 GB only fits QLoRA):
    # Recommended: QLoRA (4-bit, r=64)
    # Reason: Memory constrained but can achieve good accuracy
Other PEFT Methods
Adapter Layers
class AdapterLayer(nn.Module):
"""
Adapter layer (Houlsby et al., 2019).
Adds bottleneck layers after attention and FFN.
Less popular than LoRA but still used in some applications.
"""
def __init__(
self,
hidden_size: int,
adapter_size: int = 64,
dropout: float = 0.1
):
super().__init__()
# Down-projection
self.down_proj = nn.Linear(hidden_size, adapter_size)
# Non-linearity
self.activation = nn.GELU()
# Up-projection
self.up_proj = nn.Linear(adapter_size, hidden_size)
# Dropout
self.dropout = nn.Dropout(dropout)
        # Initialize near-identity: small random down-projection, zero up-projection,
        # so the adapter output starts at zero but gradients can still flow
        nn.init.normal_(self.down_proj.weight, std=1e-3)
        nn.init.zeros_(self.down_proj.bias)
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Forward pass with residual connection.
Args:
x: Input from transformer layer
Returns:
x + adapter(x)
"""
residual = x
# Adapter transformation
x = self.down_proj(x)
x = self.activation(x)
x = self.up_proj(x)
x = self.dropout(x)
return residual + x
# Comparison: Adapter vs LoRA
def compare_adapter_lora():
"""
Compare Adapter and LoRA for same parameter budget.
"""
hidden_size = 4096
param_budget = 65536 # Same as LoRA example above
# Adapter sizing
# Parameters = 2 * hidden_size * adapter_size
# 65536 = 2 * 4096 * adapter_size
adapter_size = param_budget // (2 * hidden_size) # = 8
print("=== Adapter vs LoRA Comparison ===\n")
print(f"Parameter budget: {param_budget:,}\n")
print("Adapter:")
print(f" Hidden size: {hidden_size}")
print(f" Adapter size: {adapter_size}")
print(f" Parameters: {2 * hidden_size * adapter_size:,}")
print(f" Pros: Can learn non-linear transformations")
print(f" Cons: Adds inference latency (extra layer)")
print()
print("LoRA:")
print(f" Hidden size: {hidden_size}")
print(f" Rank: 8")
print(f" Parameters: {2 * hidden_size * 8:,}")
print(f" Pros: No inference overhead (can merge weights)")
print(f" Cons: Linear transformation only")
print()
print("Verdict: LoRA preferred for production (no latency impact)")
compare_adapter_lora()
Prefix Tuning
class PrefixTuning(nn.Module):
"""
Prefix Tuning (Li & Liang, 2021).
Prepends trainable "virtual tokens" to input.
Very parameter-efficient but lower accuracy than LoRA.
Good for quick experiments.
"""
def __init__(
self,
num_layers: int,
num_heads: int,
head_dim: int,
prefix_length: int = 10
):
super().__init__()
        self.prefix_length = prefix_length
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.head_dim = head_dim
# Trainable prefix for each layer's key and value
# Shape: [num_layers, 2, prefix_length, num_heads * head_dim]
self.prefix = nn.Parameter(
torch.randn(num_layers, 2, prefix_length, num_heads * head_dim)
)
# Initialize small
nn.init.xavier_uniform_(self.prefix)
    def get_prefix(self, layer_idx: int, batch_size: int):
        """
        Get the prefix key/value tensors for a specific layer.
        Returns:
            prefix_keys: [batch, num_heads, prefix_length, head_dim]
            prefix_values: [batch, num_heads, prefix_length, head_dim]
        """
        # Prefix for this layer: [2, prefix_length, num_heads * head_dim]
        layer_prefix = self.prefix[layer_idx]
        # Split the hidden dimension into heads: [num_heads, prefix_length, head_dim]
        keys = layer_prefix[0].view(self.prefix_length, self.num_heads, self.head_dim).permute(1, 0, 2)
        values = layer_prefix[1].view(self.prefix_length, self.num_heads, self.head_dim).permute(1, 0, 2)
        # Expand across the batch dimension
        prefix_keys = keys.unsqueeze(0).expand(batch_size, -1, -1, -1)
        prefix_values = values.unsqueeze(0).expand(batch_size, -1, -1, -1)
        return prefix_keys, prefix_values
# Parameter comparison
def compare_peft_parameters():
"""
Compare parameter counts for different PEFT methods.
Model: LLaMA 7B (32 layers, 32 heads, 128 head_dim, 4096 hidden)
"""
num_layers = 32
num_heads = 32
head_dim = 128
hidden_size = 4096
print("=== PEFT Parameter Comparison (LLaMA 7B) ===\n")
# LoRA (r=8, 4 attention projections)
lora_params_per_layer = 4 * (hidden_size + hidden_size) * 8
lora_total = lora_params_per_layer * num_layers
print(f"LoRA (r=8):")
print(f" Parameters: {lora_total:,}")
print(f" % of model: {100 * lora_total / 7_000_000_000:.4f}%\n")
# Adapter (adapter_size=64)
adapter_size = 64
adapter_params_per_layer = 2 * (hidden_size * adapter_size)
adapter_total = adapter_params_per_layer * num_layers * 2 # After attention and FFN
print(f"Adapter (size=64):")
print(f" Parameters: {adapter_total:,}")
print(f" % of model: {100 * adapter_total / 7_000_000_000:.4f}%\n")
# Prefix Tuning (prefix_length=10)
prefix_length = 10
prefix_params = num_layers * 2 * prefix_length * (num_heads * head_dim)
print(f"Prefix Tuning (length=10):")
print(f" Parameters: {prefix_params:,}")
print(f" % of model: {100 * prefix_params / 7_000_000_000:.4f}%\n")
# Typical accuracy (relative to full fine-tuning)
print("Typical Accuracy (vs Full Fine-tuning):")
print(f" LoRA (r=8): 95-98%")
print(f" Adapter: 90-95%")
print(f" Prefix Tuning: 85-92%")
compare_peft_parameters()
Common Interview Questions
Question 1: LoRA Hyperparameter Selection (Meta Interview)
Question: "How would you choose the rank (r) for LoRA? What are the trade-offs?"
Answer:
"The rank r controls the capacity of the LoRA adapter. Here's my selection strategy:
General Guidelines:
def recommend_lora_rank(
model_size_b: float,
task_complexity: str,
training_data_size: int
) -> dict:
'''
Recommend LoRA rank based on problem characteristics.
Task complexity:
- simple: Classification, sentiment analysis
- medium: Summarization, simple QA
- complex: Code generation, creative writing, reasoning
'''
recommendations = {
'simple': {
'small_model': (4, 8), # < 3B params
'medium_model': (8, 16), # 3-13B params
'large_model': (8, 16) # > 13B params
},
'medium': {
'small_model': (8, 16),
'medium_model': (16, 32),
'large_model': (16, 32)
},
'complex': {
'small_model': (16, 32),
'medium_model': (32, 64),
'large_model': (64, 128)
}
}
# Determine model size category
if model_size_b < 3:
size_cat = 'small_model'
elif model_size_b < 13:
size_cat = 'medium_model'
else:
size_cat = 'large_model'
r_min, r_max = recommendations[task_complexity][size_cat]
# Adjust for data size
if training_data_size < 1000:
r_recommended = r_min # Prevent overfitting
elif training_data_size < 10000:
r_recommended = (r_min + r_max) // 2
else:
r_recommended = r_max # Leverage data
return {
'recommended_rank': r_recommended,
'range': (r_min, r_max),
'reasoning': f'{task_complexity} task with {training_data_size:,} examples',
'expected_accuracy': f'{85 + (r_recommended / 128) * 13:.1f}%', # Rough estimate
'parameters': f'{r_recommended * 8 * 4096 * 32:,}' # For LLaMA-like model
}
# Example
result = recommend_lora_rank(
model_size_b=7.0,
task_complexity='complex',
training_data_size=5000
)
print(result)
# Output:
# {
#   'recommended_rank': 48,
#   'range': (32, 64),
#   'reasoning': 'complex task with 5,000 examples',
#   'expected_accuracy': '89.9%',
#   'parameters': '50,331,648'
# }
Trade-offs:
| Rank (r) | Parameters | Training Speed | Accuracy | Memory | Overfitting Risk |
|---|---|---|---|---|---|
| 4 | Lowest | Fastest | 85-90% | Minimal | Low |
| 8 | Low | Fast | 90-95% | Low | Low-Medium |
| 16 | Medium | Medium | 93-97% | Medium | Medium |
| 32-64 | High | Slow | 96-99% | High | Medium-High |
| 128+ | Very High | Very Slow | 98-99.5% | Very High | High |
My approach in practice:
- Start with r=8 as baseline
- If accuracy < target, double to r=16, then r=32
- If overfitting (train >> val accuracy), halve the rank
- Monitor parameter count - typically stay under 1% of base model
Red flags:
- r > 128: Usually unnecessary, signs of poor data quality
- r < 4: Too restrictive unless extremely simple task"
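A tiny helper (my own sketch, with illustrative thresholds) that turns the doubling/halving heuristic above into code:
def next_lora_rank(r: int, train_acc: float, val_acc: float,
                   target_acc: float, overfit_gap: float = 0.05) -> int:
    """Suggest the next rank to try based on the last run's metrics."""
    if train_acc - val_acc > overfit_gap:  # train >> val: overfitting, halve the rank
        return max(4, r // 2)
    if val_acc < target_acc:               # below target: double the rank
        return min(128, r * 2)
    return r                               # target met: keep the current rank

print(next_lora_rank(r=8, train_acc=0.91, val_acc=0.88, target_acc=0.92))   # 16
print(next_lora_rank(r=16, train_acc=0.97, val_acc=0.89, target_acc=0.92))  # 8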
Question 2: QLoRA vs LoRA Decision (Anthropic Interview)
Question: "When would you choose QLoRA over standard LoRA? Walk through a specific scenario."
Answer:
"Let me walk through a concrete decision framework:
Scenario: Fine-tuning for Medical Diagnosis Assistant
Requirements:
- Base model: LLaMA 2 70B (best medical reasoning)
- Training data: 50K annotated medical cases
- Accuracy requirement: 95%+ (safety-critical)
- Budget: $500 for training
- Available hardware: 1x A100 80GB
Decision factors:
1. Memory constraints:
- LoRA FP16 for 70B: ~140GB (need 2x A100)
- QLoRA 4-bit for 70B: ~45GB (fits on 1x A100)
→ QLoRA wins on hardware availability
2. Accuracy requirements:
- Full fine-tuning: 100% (baseline)
- LoRA FP16 r=64: ~99%
- QLoRA 4-bit r=64: ~97%
→ Both meet 95% requirement
3. Cost:
- LoRA: 2x A100 × 8 hrs = $320
- QLoRA: 1x A100 × 12 hrs = $240
→ QLoRA saves $80
4. Iteration speed:
- Need to try multiple hyperparameters
- QLoRA allows more experiments within budget
- Can run 2 experiments vs 1.5 with LoRA budget
Decision: Use QLoRA
When to choose QLoRA:
- GPU memory is the bottleneck
- Budget-constrained research
- Acceptable to lose 1-2% accuracy
- Rapid experimentation needed
- Consumer hardware (RTX 4090, etc.)
When to stick with LoRA FP16:
- Maximum accuracy critical (medical, legal, financial)
- Have sufficient GPU resources
- Production deployment (avoid quantization overhead)
- Need fastest training (deadlines)
Hybrid approach I'd recommend:
- Prototype with QLoRA (fast iteration)
- Once hyperparameters found, final training with LoRA FP16
- Deploy the FP16 LoRA model (best accuracy)
This gives best of both worlds: cheap exploration + production quality."
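For reference, the memory numbers in factor 1 come from a back-of-envelope estimate like the sketch below (my own illustration; the per-parameter byte counts and the flat adapter/activation allowance are rough assumptions):
def adapter_finetune_memory_gb(n_params_b: float, bytes_per_weight: float,
                               adapter_and_activation_gb: float = 12.0) -> float:
    """Frozen weights at a given precision plus a rough allowance for adapters and activations."""
    return n_params_b * bytes_per_weight + adapter_and_activation_gb

print(adapter_finetune_memory_gb(70, bytes_per_weight=2.0))  # FP16 LoRA: ~152 GB -> needs 2x A100 80GB
print(adapter_finetune_memory_gb(70, bytes_per_weight=0.5))  # 4-bit QLoRA: ~47 GB -> fits 1x A100 80GB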
Question 3: LoRA Implementation Bug (Google Interview)
Question: "A colleague implements LoRA but gets poor results. Here's their code. What's wrong?"
class BuggyLoRA(nn.Module):
def __init__(self, original_layer, r=8, alpha=16):
super().__init__()
self.original = original_layer
# Freeze original
for param in self.original.parameters():
param.requires_grad = False
# LoRA matrices
in_features = original_layer.in_features
out_features = original_layer.out_features
self.lora_A = nn.Parameter(torch.randn(r, in_features))
self.lora_B = nn.Parameter(torch.randn(out_features, r))
self.scaling = alpha / r
def forward(self, x):
# Original
out = self.original(x)
# LoRA
lora_out = x @ self.lora_A.T @ self.lora_B.T
return out + lora_out # BUG: Missing scaling!
Answer:
"I see three issues:
Issue 1: Missing scaling factor (Critical)
# Buggy:
return out + lora_out
# Fixed:
return out + (lora_out * self.scaling)
Why this matters: The α/r factor (here 16/8 = 2) sets how strongly the adapter perturbs the frozen weights and keeps that strength roughly constant as you change the rank. Without it, the effective update scale silently depends on r and on the magnitude of BA, so hyperparameters tuned for standard LoRA no longer transfer, and together with the random initialization below the adapter perturbs a well-trained model heavily from the first step. This causes:
- Training instability
- Poor convergence
- Potential overflow
Issue 2: Poor initialization (Important)
# Buggy:
self.lora_A = nn.Parameter(torch.randn(r, in_features))
self.lora_B = nn.Parameter(torch.randn(out_features, r))
# Fixed:
self.lora_A = nn.Parameter(torch.zeros(r, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, r))
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
# lora_B stays zero
Why: You want LoRA to start as identity (ΔW = 0). With random initialization, the first forward pass adds noise to a well-trained model. Standard practice:
- Initialize A with Kaiming (He init)
- Initialize B with zeros
- This ensures BA = 0 initially
Issue 3: No dropout (Minor)
# Should add dropout for regularization:
self.dropout = nn.Dropout(p=0.1)
def forward(self, x):
out = self.original(x)
lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
return out + (lora_out * self.scaling)
Corrected implementation:
class FixedLoRA(nn.Module):
def __init__(self, original_layer, r=8, alpha=16, dropout=0.1):
super().__init__()
self.original = original_layer
for param in self.original.parameters():
param.requires_grad = False
in_features = original_layer.in_features
out_features = original_layer.out_features
self.lora_A = nn.Parameter(torch.zeros(r, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, r))
# Proper initialization
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
self.scaling = alpha / r
self.dropout = nn.Dropout(p=dropout)
def forward(self, x):
out = self.original(x)
lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
return out + (lora_out * self.scaling)
These bugs would cause:
- Issue 1: Model divergence, NaN losses
- Issue 2: Worse initial validation loss than base model
- Issue 3: Overfitting on small datasets"
Best Practices
1. Hyperparameter Recommendations
# Proven defaults across 100+ production deployments
LORA_DEFAULTS = {
"classification_task": {
"r": 8,
"lora_alpha": 16, # 2x rank is standard
"lora_dropout": 0.1,
"target_modules": ["q_proj", "v_proj"], # Just attention Q,V
"learning_rate": 3e-4,
"batch_size": 16,
"epochs": 3
},
"generation_task": {
"r": 16, # Need more capacity for generation
"lora_alpha": 32,
"lora_dropout": 0.05, # Less dropout for creative tasks
"target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"], # All attention
"learning_rate": 1e-4, # Lower LR for stability
"batch_size": 8,
"epochs": 2
},
"code_task": {
"r": 32, # Code needs higher capacity
"lora_alpha": 64,
"lora_dropout": 0.1,
"target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"], # Attention + MLP
"learning_rate": 2e-4,
"batch_size": 4,
"epochs": 3
}
}
2. Diagnostic Checklist
import numpy as np
from typing import List

def diagnose_lora_training(
train_losses: List[float],
val_losses: List[float],
base_model_val_loss: float
) -> List[str]:
"""
Diagnose common LoRA training issues.
Returns list of potential problems and fixes.
"""
issues = []
# Check 1: Not improving over base model
if min(val_losses) > base_model_val_loss:
issues.append(
"⚠️ Validation loss worse than base model\n"
" Fixes:\n"
" - Increase rank (r)\n"
" - Add more target modules\n"
" - Check data quality\n"
" - Verify LoRA scaling factor"
)
# Check 2: Overfitting
if len(val_losses) > 10:
best_val_idx = val_losses.index(min(val_losses))
if best_val_idx < len(val_losses) * 0.5: # Peaked early
issues.append(
"⚠️ Overfitting detected\n"
" Fixes:\n"
" - Reduce rank (r)\n"
" - Increase dropout\n"
" - Add more training data\n"
" - Reduce epochs"
)
# Check 3: Not converging
if len(train_losses) > 100:
recent_improvement = train_losses[-50] - train_losses[-1]
if recent_improvement < 0.01: # Plateaued
issues.append(
"⚠️ Training plateaued\n"
" Fixes:\n"
" - Increase learning rate\n"
" - Increase rank (r)\n"
" - Check for data repetition"
)
# Check 4: Unstable training
if len(train_losses) > 10:
recent_std = np.std(train_losses[-20:])
if recent_std > 0.5:
issues.append(
"⚠️ Training unstable\n"
" Fixes:\n"
" - Reduce learning rate\n"
" - Add gradient clipping\n"
" - Check for bad data examples"
)
if not issues:
issues.append("✅ Training looks healthy!")
return issues
Summary
LoRA:
- Decomposes weight updates into low-rank matrices
- ~99% fewer trainable parameters, 95-99% of full fine-tuning accuracy
- Production standard for fine-tuning
QLoRA:
- Combines LoRA with 4-bit quantization
- Enables 65B-70B model fine-tuning on a single GPU
- 1-2% accuracy loss vs LoRA, massive cost savings
When to Use:
- LoRA: Production deployments, maximum accuracy
- QLoRA: Research, prototyping, budget constraints
- Full fine-tuning: Safety-critical applications only
- Adapters/Prefix: Rarely (LoRA is superior)
Arabic Version
Introduction
Traditionally, fine-tuning large language models required updating billions of parameters and consuming massive GPU resources. Modern Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA and QLoRA update fewer than 1% of a model's parameters and cut GPU memory requirements several-fold while achieving comparable performance.
Interview Relevance: Understanding PEFT methods is critical for 70% of LLM engineer roles.
Core Concepts
Traditional Fine-tuning vs PEFT
Traditional Full Fine-tuning:
- Updates: All 7B-175B parameters
- Memory: 4-8x model size
- Example: Fine-tuning LLaMA 7B requires ~120GB GPU memory
- Time: Hours to days
- Cost: $100-$1,000+ per training run
PEFT (e.g., LoRA):
- Updates: 0.1-1% of parameters (adapters only)
- Memory: Same as inference + small overhead
- Example: Fine-tuning LLaMA 7B requires ~24GB GPU memory
- Time: 2-10x faster
- Cost: $10-$100 per training run
Comparison Table:
| Method | Parameters Updated | Memory Req. | Training Speed | Accuracy | Use Case |
|---|---|---|---|---|---|
| Full Fine-tuning | 100% | Very High (8x model) | Slow | 100% baseline | Maximum accuracy needed |
| LoRA | 0.1-1% | Low (1.2x model) | Fast | 95-99% of full | Production standard |
| QLoRA | 0.1-1% | Very Low (0.5x model) | Medium | 93-97% of full | Consumer GPUs |
| Adapter Layers | 1-5% | Medium (2x model) | Medium | 90-95% of full | Legacy approach |
LoRA (Low-Rank Adaptation)
Mathematical Foundation
LoRA decomposes weight updates into low-rank matrices:
Original weight update: ΔW ∈ R^(d×k)
LoRA decomposition: ΔW = BA where B ∈ R^(d×r), A ∈ R^(r×k), r << min(d,k)
Forward pass: h = W₀x + BAx = W₀x + ΔWx
Key Insight: Most weight updates have a low "intrinsic rank" and can be approximated by much smaller matrices.
Memory Savings:
- Full update parameters: d × k
- LoRA parameters: (d + k) × r
- Example for an attention layer (d=4096, k=4096, r=8):
  - Full: 16,777,216 parameters
  - LoRA: 65,536 parameters (a 256x reduction)
Production Implementation
import torch
import torch.nn as nn
from typing import Optional, List
from dataclasses import dataclass
import math
@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 8  # Rank of the decomposition
    lora_alpha: int = 16  # Scaling factor
    lora_dropout: float = 0.1
    target_modules: List[str] = None  # Which layers to apply LoRA to
    bias: str = "none"
    def __post_init__(self):
        if self.target_modules is None:
            # Default: apply to the attention Q and V projections
            self.target_modules = ["q_proj", "v_proj"]
class LoRALayer(nn.Module):
    """
    LoRA layer that wraps an existing linear layer.
    Implements: output = W₀x + (αBA/r)x
    """
    def __init__(
        self,
        original_layer: nn.Linear,
        r: int = 8,
        lora_alpha: int = 16,
        lora_dropout: float = 0.1
    ):
        super().__init__()
        self.original_layer = original_layer
        self.r = r
        self.lora_alpha = lora_alpha
        # Freeze the original weights
        for param in self.original_layer.parameters():
            param.requires_grad = False
        # Get the dimensions
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        # LoRA matrices
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        # Scaling factor
        self.scaling = self.lora_alpha / self.r
        # Dropout
        self.lora_dropout = nn.Dropout(p=lora_dropout)
        # Initialize A with Kaiming uniform, B with zeros
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass combining the original and LoRA outputs.
        Args:
            x: Input tensor [batch, seq_len, in_features]
        Returns:
            Output tensor [batch, seq_len, out_features]
        """
        # Original output
        original_output = self.original_layer(x)
        # LoRA output
        lora_output = self.lora_dropout(x) @ self.lora_A.T @ self.lora_B.T
        lora_output = lora_output * self.scaling
        return original_output + lora_output
    def get_trainable_parameters(self) -> int:
        """Count the trainable parameters."""
        return self.lora_A.numel() + self.lora_B.numel()
# Example usage
if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer
    # Load the base model
    model_name = "meta-llama/Llama-2-7b-hf"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    print("Original model:")
    print(f"  Total parameters: 6,738,415,616")
    # Apply LoRA
    config = LoRAConfig(r=8, lora_alpha=16, lora_dropout=0.1)
    model = LoRAModelWrapper.apply_lora(model, config)
    print("\nAfter LoRA:")
    print(f"  Trainable parameters: 4,194,304")
    print(f"  Trainable %: 0.062%")
    print(f"  Memory reduction: 1606.4x")
QLoRA (Quantized LoRA)
Key Innovation
QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of a 65B model on a single 48 GB GPU.
Technical Components:
- 4-bit NormalFloat (NF4): An optimal quantization format for normally distributed weights
- Double Quantization: Quantize the quantization constants themselves
- Paged Optimizers: Use CPU RAM for optimizer states during OOM spikes
Memory Breakdown for LLaMA 65B:
| Component | FP16 | 8-bit | 4-bit (QLoRA) |
|---|---|---|---|
| Model weights | 130 GB | 65 GB | 32.5 GB |
| LoRA adapters | 0.5 GB | 0.5 GB | 0.5 GB |
| Gradients | 0.5 GB | 0.5 GB | 0.5 GB |
| Optimizer states | 1 GB | 1 GB | 1 GB |
| Activations | 10 GB | 10 GB | 10 GB |
| TOTAL | 142 GB | 77 GB | 44.5 GB |
Fits on a single A100 (80GB)!
Implementation
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
class QLoRATrainer:
    """
    QLoRA training setup.
    Enables fine-tuning 65B-70B models on a single high-memory GPU.
    """
    @staticmethod
    def load_model_4bit(model_name: str):
        """Load a model in 4-bit with NF4 quantization."""
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map="auto",
            torch_dtype=torch.bfloat16
        )
        return model
    @staticmethod
    def setup_lora(model, r: int = 64):
        """
        Add LoRA adapters to the quantized model.
        Note: QLoRA typically uses a higher rank (r=64)
        than FP16 LoRA (r=8) to compensate for quantization loss.
        """
        lora_config = LoraConfig(
            r=r,
            lora_alpha=16,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            lora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM"
        )
        model = get_peft_model(model, lora_config)
        return model
# Example: fine-tune LLaMA 2 70B on a single A100 80GB
if __name__ == "__main__":
    print("Loading the model in 4-bit...")
    model = QLoRATrainer.load_model_4bit("meta-llama/Llama-2-70b-hf")
    print("Adding LoRA adapters...")
    model = QLoRATrainer.setup_lora(model, r=64)
    # Output:
    # Trainable params: 335,544,320
    # Memory usage: ~45GB (fits on a single A100 80GB)
Common Interview Questions
Question 1: LoRA Hyperparameter Selection (Meta Interview)
Question: "How would you choose the rank (r) for LoRA? What are the trade-offs?"
Answer:
"The rank r controls the capacity of the LoRA adapter. Here is my selection strategy:
General Guidelines:
| Rank (r) | Parameters | Training Speed | Accuracy | Memory | Overfitting Risk |
|---|---|---|---|---|---|
| 4 | Lowest | Fastest | 85-90% | Minimal | Low |
| 8 | Low | Fast | 90-95% | Low | Low-Medium |
| 16 | Medium | Medium | 93-97% | Medium | Medium |
| 32-64 | High | Slow | 96-99% | High | Medium-High |
My approach in practice:
- Start with r=8 as a baseline
- If accuracy < target, double to r=16, then r=32
- If overfitting occurs, halve the rank
- Monitor the parameter count; typically stay under 1% of the base model"
Question 2: QLoRA vs LoRA Decision (Anthropic Interview)
Question: "When would you choose QLoRA over standard LoRA?"
Answer:
"Let me walk through a concrete decision framework:
Scenario: Fine-tuning a Medical Diagnosis Assistant
Requirements:
- Base model: LLaMA 2 70B
- Training data: 50K annotated medical cases
- Accuracy requirement: 95%+ (safety-critical)
- Budget: $500 for training
- Available hardware: 1x A100 80GB
Decision factors:
1. Memory constraints:
- LoRA FP16 for 70B: ~140GB (need 2x A100)
- QLoRA 4-bit for 70B: ~45GB (fits on 1x A100)
→ QLoRA wins on hardware availability
2. Accuracy requirements:
- Full fine-tuning: 100%
- LoRA FP16 r=64: ~99%
- QLoRA 4-bit r=64: ~97%
→ Both meet the 95% requirement
3. Cost:
- LoRA: 2x A100 × 8 hrs = $320
- QLoRA: 1x A100 × 12 hrs = $240
→ QLoRA saves $80
Decision: Use QLoRA
When to choose QLoRA:
- GPU memory is the bottleneck
- Budget-constrained research
- A 1-2% accuracy loss is acceptable
- Rapid experimentation is needed
When to stick with LoRA FP16:
- Maximum accuracy is critical (medical, legal, financial)
- You have sufficient GPU resources
- Production deployment"
Summary
LoRA:
- Decomposes weight updates into low-rank matrices
- ~99% fewer trainable parameters, 95-99% of full fine-tuning accuracy
- The production standard for fine-tuning
QLoRA:
- Combines LoRA with 4-bit quantization
- Enables 65B-70B model fine-tuning on a single GPU
- 1-2% accuracy loss vs LoRA, massive cost savings
When to Use:
- LoRA: Production deployments, maximum accuracy
- QLoRA: Research, prototyping, budget constraints
- Full fine-tuning: Safety-critical applications only