LoRA & QLoRA in Practice
QLoRA: 4-bit Fine-tuning
QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of large models on consumer GPUs. Let's understand how it works and how to set it up.
How QLoRA Works
QLoRA has three key innovations:
- 4-bit NormalFloat (NF4) - Quantization format optimized for neural network weights
- Double Quantization - Quantizes the quantization constants for additional memory savings
- Paged Optimizers - Uses CPU memory for optimizer states when GPU runs out
Standard 7B Model: ~14GB VRAM (fp16)
With QLoRA: ~4-6GB VRAM
Savings: 60-70% less VRAM
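The headline numbers above follow from simple arithmetic on weight storage. A minimal back-of-envelope sketch (it ignores activations, the LoRA adapters, and optimizer state, which account for the rest of the 4-6GB figure):
# Rough weight-storage arithmetic for a 7B-parameter model
params = 7e9

fp16_weights_gb = params * 2 / 1e9    # fp16: 2 bytes per parameter
nf4_weights_gb = params * 0.5 / 1e9   # 4-bit: ~0.5 bytes per parameter

print(f"fp16 weights:  ~{fp16_weights_gb:.0f} GB")   # ~14 GB
print(f"4-bit weights: ~{nf4_weights_gb:.1f} GB")    # ~3.5 GB; adapters, activations, and
                                                     # optimizer state make up the rest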
BitsAndBytesConfig
The core configuration for 4-bit quantization:
from transformers import BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # Computation precision
    bnb_4bit_use_double_quant=True           # Double quantization
)
Parameter Deep Dive
load_in_4bit
Enables 4-bit quantization: model weights are stored in 4 bits and dequantized to the compute dtype for the forward and backward passes.
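A quick way to confirm that 4-bit storage actually took effect is to count the replaced layers and check the memory footprint. A minimal sketch, assuming bitsandbytes is installed and using a small checkpoint (facebook/opt-125m) purely as a stand-in:
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                                  # small stand-in model for a quick test
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Linear layers are replaced by bitsandbytes Linear4bit modules
n_4bit = sum(isinstance(m, bnb.nn.Linear4bit) for m in model.modules())
print(f"{n_4bit} Linear4bit layers, footprint ~{model.get_memory_footprint() / 1e6:.0f} MB")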
bnb_4bit_quant_type
| Type | Description | Best For |
|---|---|---|
| "nf4" | NormalFloat4 - optimized for normally distributed weights | Most cases |
| "fp4" | Standard 4-bit float | Compatibility |
# NF4 (recommended)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
bnb_4bit_compute_dtype
The data type used for computation during the forward and backward passes:
# BFloat16 (recommended for modern GPUs)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
# Float16 (for older GPUs or compatibility)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
bnb_4bit_use_double_quant
Quantizes the quantization constants, saving ~0.4 bits per parameter:
# Enable double quantization (recommended)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
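For scale, here is the rough arithmetic behind what ~0.4 bits per parameter means on a 7B model:
# ~0.4 bits per parameter saved by double quantization on a 7B model
params = 7e9
savings_gb = params * 0.4 / 8 / 1e9   # bits -> bytes -> GB
print(f"Double quantization saves roughly {savings_gb:.2f} GB")   # ~0.35 GB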
Loading a Model with QLoRA
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 1. Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
# 2. Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
# 3. Prepare for training
model = prepare_model_for_kbit_training(model)
# 4. Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# 5. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
Memory Comparison
| Model | Full FT | LoRA (fp16) | QLoRA (4-bit) |
|---|---|---|---|
| Llama 3.2 1B | 8GB | 4GB | 2GB |
| Llama 3.2 3B | 24GB | 12GB | 4GB |
| Llama 3.1 8B | 64GB | 18GB | 6GB |
| Mistral 7B | 56GB | 16GB | 5GB |
| Llama 3.3 70B | 560GB | 160GB | 24GB |
Gradient Checkpointing
Further reduce memory by recomputing activations during the backward pass instead of storing them:
# Enable gradient checkpointing
model.gradient_checkpointing_enable()
model.enable_input_require_grads()
# Or via training arguments
training_args = TrainingArguments(
    gradient_checkpointing=True,
    ...
)
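With PEFT models, the non-reentrant checkpointing implementation is generally more robust. On recent transformers releases it can be selected via gradient_checkpointing_kwargs; a sketch, where output_dir is a placeholder path:
training_args = TrainingArguments(
    output_dir="qlora-output",                                 # placeholder output path
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},    # non-reentrant variant
)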
Complete QLoRA Setup
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
# Model configuration
model_name = "meta-llama/Llama-3.2-3B-Instruct"
# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()
# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
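Before launching training, it is worth a quick sanity check that the quantized base model plus adapters fits the expected budget. get_memory_footprint() reports parameter memory only; activations and optimizer state come on top:
# Parameter memory of the quantized model + LoRA adapters, in GB
print(f"Parameter memory: {model.get_memory_footprint() / 1e9:.2f} GB")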
Common Issues and Fixes
NaN Loss
# Use gradient clipping
training_args = TrainingArguments(
    max_grad_norm=0.3,   # Lower than default 1.0
    ...
)
Slow Training
# Enable Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    ...
)
Out of Memory
# Reduce batch size and use gradient accumulation
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    ...
)
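These fixes compose. A combined configuration applying all three might look like the sketch below (output_dir, learning rate, and epoch count are placeholder values):
training_args = TrainingArguments(
    output_dir="qlora-output",           # placeholder path
    per_device_train_batch_size=1,       # OOM fix: smallest per-device batch
    gradient_accumulation_steps=8,       # keep the effective batch size at 8
    gradient_checkpointing=True,         # trade compute for activation memory
    max_grad_norm=0.3,                   # NaN-loss fix: tighter gradient clipping
    bf16=True,                           # match the bfloat16 compute dtype
    learning_rate=2e-4,                  # placeholder; a common QLoRA starting point
    num_train_epochs=1,                  # placeholder
    logging_steps=10
)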
Tip: QLoRA is the go-to choice for consumer GPUs. Start here, and only move to full LoRA if you have the VRAM and need the slight quality improvement.
Next, we'll put everything together and run a complete fine-tuning job.