
Understanding Fine-tuning

PEFT Methods Explained


Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques that dramatically reduce the resources needed for fine-tuning. Let's explore the main methods.

LoRA: Low-Rank Adaptation

LoRA is by far the most widely used PEFT method and the default starting point for most fine-tuning projects.

How LoRA Works

Instead of updating the full weight matrix W, LoRA freezes W and trains two small matrices whose product is added to it:

Original: W (4096 x 4096 ≈ 16.8M parameters)
LoRA: A (4096 x 16) × B (16 x 4096) = 131K trainable parameters

Savings: 99.2% fewer trainable parameters for this layer!
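
To make the arithmetic concrete, here is a minimal sketch of the LoRA update in PyTorch, using the shapes above (the zero initialization and the alpha/r scaling mirror the usual LoRA setup; variable names are illustrative):

import torch

d, r, alpha = 4096, 16, 32

W = torch.randn(d, d)          # frozen pretrained weight: 4096 x 4096 = 16,777,216 params
A = torch.zeros(d, r)          # trainable, zero-initialized: 4096 x 16 = 65,536 params
B = torch.randn(r, d) * 0.01   # trainable: 16 x 4096 = 65,536 params

# With A zero-initialized, A @ B is zero at the start, so training begins
# from the pretrained behavior. Only A and B receive gradients.
W_prime = W + (alpha / r) * (A @ B)

print(W_prime.shape)            # torch.Size([4096, 4096])
print(A.numel() + B.numel())    # 131,072 trainable vs. 16,777,216 frozen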

LoRA Configuration

from peft import LoraConfig

config = LoraConfig(
    r=16,                    # Rank - size of LoRA matrices
    lora_alpha=32,           # Scaling factor
    target_modules=[         # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,       # Regularization
    bias="none",             # Don't train biases
    task_type="CAUSAL_LM"    # For decoder models
)
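
To attach these adapters, the config is passed to get_peft_model together with a base model. A minimal sketch, assuming a Hugging Face causal LM (the model name is only an example):

from transformers import AutoModelForCausalLM
from peft import get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example model
model = get_peft_model(model, config)   # freezes base weights and injects LoRA layers
model.print_trainable_parameters()      # prints trainable vs. total parameter counts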

Key Parameters

| Parameter | Typical Value | Effect |
|---|---|---|
| r (rank) | 8-64 | Higher = more capacity, more trainable params |
| lora_alpha | 16-64 | Scales the LoRA update |
| target_modules | "all-linear" | Which layers to modify |
| lora_dropout | 0.0-0.1 | Prevents overfitting |
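
In the peft implementation, r and lora_alpha interact through a single scaling factor, lora_alpha / r, applied to the LoRA update, so changing the rank without adjusting alpha also changes the update's magnitude. A quick illustration:

# The LoRA update (the product of the two small matrices) is multiplied by lora_alpha / r.
for r, alpha in [(8, 16), (16, 32), (64, 32)]:
    print(f"r={r:2d}, lora_alpha={alpha:2d} -> scaling={alpha / r}")
# r= 8, lora_alpha=16 -> scaling=2.0
# r=16, lora_alpha=32 -> scaling=2.0
# r=64, lora_alpha=32 -> scaling=0.5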

QLoRA: Quantized LoRA

QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of 7-8B models on a single consumer GPU and 70B models on a single 48GB GPU.

How QLoRA Works

  1. Quantize base model to 4-bit NormalFloat (NF4)
  2. Add LoRA adapters on top of quantized weights
  3. Train in mixed precision (adapters in fp16/bf16)

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4 - designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True    # Quantize the quantization constants for extra savings
)
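
Putting the three steps together: load the base model with this quantization config, prepare it for k-bit training, then add LoRA adapters. A minimal sketch using transformers and peft (the model name is only an example):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: load the base model with 4-bit (NF4) weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # example model
    quantization_config=bnb_config,   # the config defined above
    device_map="auto",
)

# Step 2: prepare for k-bit training (casts norms/embeddings to full precision,
# enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Step 3: add LoRA adapters on top of the frozen 4-bit weights
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=16, target_modules="all-linear"))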

LoRA vs QLoRA Memory Comparison

| Model | Full Fine-tune | LoRA | QLoRA |
|---|---|---|---|
| Llama 3.1 8B | ~64GB | ~18GB | ~6GB |
| Mistral 7B | ~56GB | ~16GB | ~5GB |
| Llama 3.3 70B | ~560GB | ~160GB | ~40GB |

DoRA: Weight-Decomposed LoRA

DoRA (2024) improves on LoRA by decomposing each pretrained weight matrix into a magnitude component and a direction component, applying the LoRA update to the direction.

LoRA:  W' = W + ΔW
DoRA:  W' = m × (W + ΔW) / ||W + ΔW||
       where m is a learned magnitude vector and ||·|| is the column-wise norm

Benefit: Typically matches or beats LoRA at the same rank, especially at low ranks, at the cost of slightly more training overhead.
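
In the peft library, DoRA is not a separate adapter class; it is switched on through the LoRA config. A minimal sketch (requires a peft version that supports the use_dora flag):

from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    use_dora=True,    # decompose the update into magnitude and direction (DoRA)
)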

Other PEFT Methods

Adapters

Small bottleneck modules inserted inside each transformer layer, typically after the attention and feed-forward sublayers.

# Adapter architecture
input → LayerNorm → Down-project → ReLU → Up-project → + input
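
That bottleneck block is only a few lines of PyTorch. A minimal sketch of the architecture above (hidden and bottleneck sizes are illustrative):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=4096, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, bottleneck)   # down-project
        self.up = nn.Linear(bottleneck, hidden_size)     # up-project

    def forward(self, x):
        # input -> LayerNorm -> down-project -> ReLU -> up-project -> + input (residual)
        return x + self.up(torch.relu(self.down(self.norm(x))))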

Prefix Tuning

Prepends learnable vectors to the keys and values of every attention layer.

Prompt Tuning

Prepends learnable "soft prompt" tokens to the input.

Comparison Summary

| Method | Memory | Speed | Performance | Best For |
|---|---|---|---|---|
| LoRA | Medium | Fast | Very good | Default choice |
| QLoRA | Low | Medium | Very good | Limited VRAM |
| DoRA | Medium | Medium | Excellent | Best quality needed |
| Full FT | Very high | Slow | Maximum | Unlimited resources |

Modern Best Practice (2025)

# Recommended starting configuration
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",  # adapt all linear layers
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)

Tip: Start with r=16 and target_modules="all-linear". Only reduce rank or targets if you're running out of memory.

Next, we'll help you choose the right approach for your specific use case.
