LoRA & QLoRA in Practice
QLoRA: 4-bit Fine-tuning
QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of large models on consumer GPUs. Let's understand how it works and how to set it up.
How QLoRA Works
QLoRA has three key innovations:
- 4-bit NormalFloat (NF4) - Quantization format optimized for neural network weights
- Double Quantization - Quantizes the quantization constants for additional memory savings
- Paged Optimizers - Uses CPU memory for optimizer states when GPU runs out
Standard 7B Model: ~14GB VRAM (fp16)
With QLoRA: ~4-6GB VRAM
Savings: 60-70% less VRAM
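The headline numbers above follow from simple arithmetic on weight storage. A minimal back-of-envelope sketch (it ignores activations, the LoRA adapters, and optimizer state, which account for the rest of the 4-6GB figure):
# Rough weight-storage arithmetic for a 7B-parameter model
params = 7e9

fp16_weights_gb = params * 2 / 1e9    # fp16: 2 bytes per parameter
nf4_weights_gb = params * 0.5 / 1e9   # 4-bit: ~0.5 bytes per parameter

print(f"fp16 weights:  ~{fp16_weights_gb:.0f} GB")   # ~14 GB
print(f"4-bit weights: ~{nf4_weights_gb:.1f} GB")    # ~3.5 GB; adapters, activations, and
                                                     # optimizer state make up the rest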
BitsAndBytesConfig
The core configuration for 4-bit quantization:
from transformers import BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # Computation precision
    bnb_4bit_use_double_quant=True           # Double quantization
)
Parameter Deep Dive
load_in_4bit
Enables 4-bit quantization: model weights are stored in 4 bits and dequantized to the compute dtype for the forward and backward passes.
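A quick way to confirm that 4-bit storage actually took effect is to count the replaced layers and check the memory footprint. A minimal sketch, assuming bitsandbytes is installed and using a small checkpoint (facebook/opt-125m) purely as a stand-in:
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                                  # small stand-in model for a quick test
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Linear layers are replaced by bitsandbytes Linear4bit modules
n_4bit = sum(isinstance(m, bnb.nn.Linear4bit) for m in model.modules())
print(f"{n_4bit} Linear4bit layers, footprint ~{model.get_memory_footprint() / 1e6:.0f} MB")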
bnb_4bit_quant_type
| Type | Description | Best For |
|---|---|---|
| "nf4" | NormalFloat4 - optimized for normally distributed weights | Most cases |
| "fp4" | Standard 4-bit float | Compatibility |
# NF4 (recommended)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
bnb_4bit_compute_dtype
The data type used for computation during the forward and backward passes:
# BFloat16 (recommended for modern GPUs)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
# Float16 (for older GPUs or compatibility)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
bnb_4bit_use_double_quant
Quantizes the quantization constants, saving ~0.4 bits per parameter:
# Enable double quantization (recommended)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
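For scale, here is the rough arithmetic behind what ~0.4 bits per parameter means on a 7B model:
# ~0.4 bits per parameter saved by double quantization on a 7B model
params = 7e9
savings_gb = params * 0.4 / 8 / 1e9   # bits -> bytes -> GB
print(f"Double quantization saves roughly {savings_gb:.2f} GB")   # ~0.35 GB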
Loading a Model with QLoRA
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 1. Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
# 2. Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
# 3. Prepare for training
model = prepare_model_for_kbit_training(model)
# 4. Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# 5. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
Memory Comparison
| Model | Full FT | LoRA (fp16) | QLoRA (4-bit) |
|---|---|---|---|
| Llama 3.2 1B | 8GB | 4GB | 2GB |
| Llama 3.2 3B | 24GB | 12GB | 4GB |
| Llama 3.1 8B | 64GB | 18GB | 6GB |
| Mistral 7B | 56GB | 16GB | 5GB |
| Llama 3.3 70B | 560GB | 160GB | 24GB |
Gradient Checkpointing
Further reduce memory by recomputing activations during the backward pass instead of storing them:
# Enable gradient checkpointing
model.gradient_checkpointing_enable()
model.enable_input_require_grads()
# Or via training arguments
training_args = TrainingArguments(
    gradient_checkpointing=True,
    ...
)
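With PEFT models, the non-reentrant checkpointing implementation is generally more robust. On recent transformers releases it can be selected via gradient_checkpointing_kwargs; a sketch, where output_dir is a placeholder path:
training_args = TrainingArguments(
    output_dir="qlora-output",                                 # placeholder output path
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},    # non-reentrant variant
)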
Complete QLoRA Setup
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
# Model configuration
model_name = "meta-llama/Llama-3.2-3B-Instruct"
# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()
# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
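Before launching training, it is worth a quick sanity check that the quantized base model plus adapters fits the expected budget. get_memory_footprint() reports parameter memory only; activations and optimizer state come on top:
# Parameter memory of the quantized model + LoRA adapters, in GB
print(f"Parameter memory: {model.get_memory_footprint() / 1e9:.2f} GB")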
Common Issues and Fixes
NaN Loss
# Use gradient clipping
training_args = TrainingArguments(
    max_grad_norm=0.3,   # Lower than default 1.0
    ...
)
Slow Training
# Enable Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    ...
)
Out of Memory
# Reduce batch size and use gradient accumulation
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    ...
)
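These fixes compose. A combined configuration applying all three might look like the sketch below (output_dir, learning rate, and epoch count are placeholder values):
training_args = TrainingArguments(
    output_dir="qlora-output",           # placeholder path
    per_device_train_batch_size=1,       # OOM fix: smallest per-device batch
    gradient_accumulation_steps=8,       # keep the effective batch size at 8
    gradient_checkpointing=True,         # trade compute for activation memory
    max_grad_norm=0.3,                   # NaN-loss fix: tighter gradient clipping
    bf16=True,                           # match the bfloat16 compute dtype
    learning_rate=2e-4,                  # placeholder; a common QLoRA starting point
    num_train_epochs=1,                  # placeholder
    logging_steps=10
)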
Tip: QLoRA is the go-to choice for consumer GPUs. Start here, and only move to full LoRA if you have the VRAM and need the slight quality improvement.
Next, we'll put everything together and run a complete fine-tuning job.