# PEFT Methods Explained
Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques that dramatically reduce the resources needed for fine-tuning. Let's explore the main methods.
## LoRA: Low-Rank Adaptation
LoRA is the most widely used PEFT method and a sensible default for most fine-tuning projects.
### How LoRA Works

Instead of updating the full weight matrix W, LoRA freezes W and trains two small matrices, A and B, whose product forms a low-rank update ΔW = A × B that is added to W:
```text
Original: W (4096 × 4096 ≈ 16.8M parameters)
LoRA:     A (4096 × 16) + B (16 × 4096) ≈ 131K trainable parameters
Savings:  ~99.2% fewer trainable parameters
```
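To make the numbers concrete, here is a minimal sketch of a LoRA-style linear layer in plain PyTorch. It is illustrative only: `LoRALinear` and its initialization are simplified stand-ins for what libraries like `peft` do internally.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: output = base(x) + (alpha / r) * x @ A @ B."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights

        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)  # (in, r)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))        # (r, out); zero init so ΔW starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A @ self.B)

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{trainable:,} trainable parameters")  # 131,072 ≈ 131K
```

Only `A` and `B` receive gradients, which is where the ~99% reduction in trainable parameters comes from.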
### LoRA Configuration
```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                    # Rank: inner dimension of the LoRA matrices
    lora_alpha=32,           # Scaling factor for the LoRA update
    target_modules=[         # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,       # Regularization
    bias="none",             # Don't train bias terms
    task_type="CAUSAL_LM",   # For decoder-only (causal) language models
)
```
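To apply this configuration, the usual pattern is to wrap a Hugging Face model with `get_peft_model`. A short sketch, where the checkpoint name is just a placeholder:

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model

# Example checkpoint; substitute whichever model you are fine-tuning.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = get_peft_model(model, config)   # `config` is the LoraConfig above
model.print_trainable_parameters()      # prints trainable vs. total parameter counts
```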
### Key Parameters
| Parameter | Typical Value | Effect |
|---|---|---|
| `r` (rank) | 8-64 | Higher rank = more capacity and more trainable parameters |
| `lora_alpha` | 16-64 | Scales the LoRA update (effective scale is `lora_alpha / r`) |
| `target_modules` | `"all-linear"` | Which layers receive adapters |
| `lora_dropout` | 0.0-0.1 | Dropout on the LoRA path; helps prevent overfitting |
## QLoRA: Quantized LoRA

QLoRA combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of 65-70B-parameter models on a single 48GB GPU.
### How QLoRA Works

1. Quantize the base model to 4-bit NormalFloat (NF4)
2. Add LoRA adapters on top of the frozen, quantized weights
3. Train in mixed precision (the adapters stay in fp16/bf16)
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bf16 while weights are stored in 4-bit
    bnb_4bit_use_double_quant=True,         # Also quantize the quantization constants
)
```
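Putting the pieces together, a common pattern (sketched here with a placeholder checkpoint name) is to load the base model with this quantization config, prepare it for k-bit training, and then attach the LoRA adapters:

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit weights (example checkpoint).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # typically enables gradient checkpointing and casts norms
model = get_peft_model(model, config)           # reuse the LoraConfig from earlier
```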
### LoRA vs QLoRA Memory Comparison
| Model | Full Fine-tune | LoRA | QLoRA |
|---|---|---|---|
| Llama 3.1 8B | 64GB | 18GB | 6GB |
| Mistral 7B | 56GB | 16GB | 5GB |
| Llama 3.3 70B | 560GB | 160GB | 48GB |
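These figures are rough estimates that also depend on sequence length, batch size, and optimizer choice. The dominant term is base-weight storage, which you can sanity-check with quick arithmetic:

```python
# Rough sanity check of the table: base-weight memory only
# (excludes activations, gradients, optimizer state, and runtime overhead).
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

print(weight_gb(70, 2.0))   # ~140 GB in fp16/bf16 -> LoRA on 70B needs multiple GPUs
print(weight_gb(70, 0.5))   # ~35 GB in 4-bit      -> QLoRA on 70B fits on one large GPU
print(weight_gb(8, 0.5))    # ~4 GB in 4-bit       -> QLoRA on 8B fits on consumer GPUs
```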
## DoRA: Weight-Decomposed LoRA
DoRA (2024) improves on LoRA by decomposing weights into magnitude and direction.
```text
LoRA:  W' = W + ΔW
DoRA:  W' = m × (W + ΔW) / ||W + ΔW||
```

where `m` is a learned magnitude vector and the (column-wise) normalization keeps only the direction of `W + ΔW`.

Benefit: better accuracy than LoRA at the same rank, especially on harder tasks, at the cost of a modest training-time overhead.
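Recent versions of the `peft` library expose DoRA as a flag on `LoraConfig`, so switching from LoRA is a one-line change (sketch; confirm your installed version supports `use_dora`):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    use_dora=True,           # decompose the update into magnitude and direction (DoRA)
    task_type="CAUSAL_LM",
)
```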
## Other PEFT Methods

### Adapters

Small neural-network modules inserted inside each transformer layer, typically after the attention and feed-forward sublayers.
```text
# Adapter architecture
input → LayerNorm → Down-project → ReLU → Up-project → + input
```
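For intuition, such a bottleneck adapter could be sketched in PyTorch as below (illustrative; real adapter implementations vary in where the LayerNorm and residual sit):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative bottleneck adapter: LayerNorm → down-project → ReLU → up-project → residual."""

    def __init__(self, hidden_size: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(self.norm(x))))
```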
### Prefix Tuning

Prepends learnable key/value vectors to the attention computation at every transformer layer.
### Prompt Tuning

Prepends learnable "soft prompt" token embeddings to the input sequence only, with no per-layer parameters.
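Both methods are available in the `peft` library; a minimal sketch of their configs, with arbitrary example values:

```python
from peft import PrefixTuningConfig, PromptTuningConfig

# Prefix tuning: learnable key/value prefixes at every attention layer.
prefix_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)

# Prompt tuning: learnable soft-prompt embeddings prepended to the input only.
prompt_config = PromptTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
```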
## Comparison Summary
| Method | Memory | Speed | Performance | Best For |
|---|---|---|---|---|
| LoRA | Medium | Fast | Very Good | Default choice |
| QLoRA | Low | Medium | Very Good | Limited VRAM |
| DoRA | Medium | Medium | Excellent | Best quality needed |
| Full FT | Very High | Slow | Maximum | Unlimited resources |
## Modern Best Practice (2025)
```python
# Recommended starting configuration
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",  # All linear layers
    lora_dropout=0.0,
    bias="none",
)
```
:::tip
Start with `r=16` and `target_modules="all-linear"`. Only reduce the rank or the target modules if you run out of memory.
:::

Next, we'll help you choose the right approach for your specific use case.