
Understanding Fine-tuning

PEFT Methods Explained


Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques that dramatically reduce the resources needed for fine-tuning. Let's explore the main methods.

LoRA: Low-Rank Adaptation

LoRA is by far the most widely used PEFT method and the default starting point for most fine-tuning projects.

How LoRA Works

Instead of updating the full weight matrix W, LoRA freezes W and trains two small matrices whose product is added to it:

Original: W (4096 x 4096 ≈ 16.8M parameters)
LoRA: A (4096 x 16) × B (16 x 4096) = 131K trainable parameters

Savings: 99.2% fewer trainable parameters for this layer!
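
To make the arithmetic concrete, here is a minimal sketch of the LoRA update in PyTorch, using the shapes above (the zero initialization and the alpha/r scaling mirror the usual LoRA setup; variable names are illustrative):

import torch

d, r, alpha = 4096, 16, 32

W = torch.randn(d, d)          # frozen pretrained weight: 4096 x 4096 = 16,777,216 params
A = torch.zeros(d, r)          # trainable, zero-initialized: 4096 x 16 = 65,536 params
B = torch.randn(r, d) * 0.01   # trainable: 16 x 4096 = 65,536 params

# With A zero-initialized, A @ B is zero at the start, so training begins
# from the pretrained behavior. Only A and B receive gradients.
W_prime = W + (alpha / r) * (A @ B)

print(W_prime.shape)            # torch.Size([4096, 4096])
print(A.numel() + B.numel())    # 131,072 trainable vs. 16,777,216 frozen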

LoRA Configuration

from peft import LoraConfig

config = LoraConfig(
    r=16,                    # Rank - size of LoRA matrices
    lora_alpha=32,           # Scaling factor
    target_modules=[         # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,       # Regularization
    bias="none",             # Don't train biases
    task_type="CAUSAL_LM"    # For decoder models
)
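
To attach these adapters, the config is passed to get_peft_model together with a base model. A minimal sketch, assuming a Hugging Face causal LM (the model name is only an example):

from transformers import AutoModelForCausalLM
from peft import get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example model
model = get_peft_model(model, config)   # freezes base weights and injects LoRA layers
model.print_trainable_parameters()      # prints trainable vs. total parameter counts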

Key Parameters

| Parameter | Typical Value | Effect |
|---|---|---|
| r (rank) | 8-64 | Higher = more capacity, more trainable params |
| lora_alpha | 16-64 | Scales the LoRA update |
| target_modules | "all-linear" | Which layers to modify |
| lora_dropout | 0.0-0.1 | Prevents overfitting |
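
In the peft implementation, r and lora_alpha interact through a single scaling factor, lora_alpha / r, applied to the LoRA update, so changing the rank without adjusting alpha also changes the update's magnitude. A quick illustration:

# The LoRA update (the product of the two small matrices) is multiplied by lora_alpha / r.
for r, alpha in [(8, 16), (16, 32), (64, 32)]:
    print(f"r={r:2d}, lora_alpha={alpha:2d} -> scaling={alpha / r}")
# r= 8, lora_alpha=16 -> scaling=2.0
# r=16, lora_alpha=32 -> scaling=2.0
# r=64, lora_alpha=32 -> scaling=0.5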

QLoRA: Quantized LoRA

QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of 7-8B models on a single consumer GPU and 70B models on a single 48GB GPU.

How QLoRA Works

  1. Quantize base model to 4-bit NormalFloat (NF4)
  2. Add LoRA adapters on top of quantized weights
  3. Train in mixed precision (adapters in fp16/bf16)

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4 - designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True    # Quantize the quantization constants for extra savings
)
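
Putting the three steps together: load the base model with this quantization config, prepare it for k-bit training, then add LoRA adapters. A minimal sketch using transformers and peft (the model name is only an example):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: load the base model with 4-bit (NF4) weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # example model
    quantization_config=bnb_config,   # the config defined above
    device_map="auto",
)

# Step 2: prepare for k-bit training (casts norms/embeddings to full precision,
# enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Step 3: add LoRA adapters on top of the frozen 4-bit weights
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=16, target_modules="all-linear"))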

LoRA vs QLoRA Memory Comparison

| Model | Full Fine-tune | LoRA | QLoRA |
|---|---|---|---|
| Llama 3.1 8B | ~64GB | ~18GB | ~6GB |
| Mistral 7B | ~56GB | ~16GB | ~5GB |
| Llama 3.3 70B | ~560GB | ~160GB | ~40GB |

DoRA: Weight-Decomposed LoRA

DoRA (2024) improves on LoRA by decomposing each pretrained weight matrix into a magnitude component and a direction component, applying the LoRA update to the direction.

LoRA:  W' = W + ΔW
DoRA:  W' = m × (W + ΔW) / ||W + ΔW||
       where m is a learned magnitude vector and ||·|| is the column-wise norm

Benefit: Typically matches or beats LoRA at the same rank, especially at low ranks, at the cost of slightly more training overhead.
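
In the peft library, DoRA is not a separate adapter class; it is switched on through the LoRA config. A minimal sketch (requires a peft version that supports the use_dora flag):

from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    use_dora=True,    # decompose the update into magnitude and direction (DoRA)
)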

Other PEFT Methods

Adapters

Small bottleneck modules inserted inside each transformer layer, typically after the attention and feed-forward sublayers.

# Adapter architecture
input → LayerNorm → Down-project → ReLU → Up-project → + input
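
That bottleneck block is only a few lines of PyTorch. A minimal sketch of the architecture above (hidden and bottleneck sizes are illustrative):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=4096, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, bottleneck)   # down-project
        self.up = nn.Linear(bottleneck, hidden_size)     # up-project

    def forward(self, x):
        # input -> LayerNorm -> down-project -> ReLU -> up-project -> + input (residual)
        return x + self.up(torch.relu(self.down(self.norm(x))))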

Prefix Tuning

Prepends learnable vectors to the keys and values of every attention layer.

Prompt Tuning

Prepends learnable "soft prompt" tokens to the input.

Comparison Summary

| Method | Memory | Speed | Performance | Best For |
|---|---|---|---|---|
| LoRA | Medium | Fast | Very good | Default choice |
| QLoRA | Low | Medium | Very good | Limited VRAM |
| DoRA | Medium | Medium | Excellent | Best quality needed |
| Full FT | Very high | Slow | Maximum | Unlimited resources |

Modern Best Practice (2025)

# Recommended starting configuration
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",  # adapt all linear layers
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)

Tip: Start with r=16 and target_modules="all-linear". Only reduce rank or targets if you're running out of memory.

Next, we'll help you choose the right approach for your specific use case.
