Lesson 11 of 24

LoRA & QLoRA in Practice

QLoRA: 4-bit Fine-tuning

3 min read

QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of large models on consumer GPUs. Let's understand how it works and how to set it up.

How QLoRA Works

QLoRA has three key innovations:

  1. 4-bit NormalFloat (NF4) - Quantization format optimized for neural network weights
  2. Double Quantization - Quantizes the quantization constants for additional memory savings
  3. Paged Optimizers - Uses CPU memory for optimizer states when GPU runs out

The memory impact on a 7B model:

Standard 7B Model:  ~14GB VRAM (fp16)
With QLoRA:         ~4-6GB VRAM
Savings:            60-70% less VRAM
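
As a rough sanity check, here is the arithmetic behind those numbers (weights only; activations, LoRA adapters, and CUDA overhead add a few GB on top during training):

params = 7e9                     # 7B parameters

fp16_gb = params * 2 / 1e9       # 2 bytes per weight -> ~14 GB
nf4_gb = params * 0.5 / 1e9      # 4 bits per weight  -> ~3.5 GB

print(f"fp16: ~{fp16_gb:.0f} GB, NF4: ~{nf4_gb:.1f} GB")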

BitsAndBytesConfig

The core configuration for 4-bit quantization:

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # Computation precision
    bnb_4bit_use_double_quant=True          # Double quantization
)

Parameter Deep Dive

load_in_4bit

Enables 4-bit quantization. The model weights are stored in 4-bit but computed in higher precision.
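
The minimal form, mirroring the snippets below:

# Store weights in 4-bit (compute dtype defaults to float32 unless set)
bnb_config = BitsAndBytesConfig(load_in_4bit=True)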

bnb_4bit_quant_type

Type    Description                                                 Best For
"nf4"   NormalFloat4 - optimized for normally distributed weights   Most cases
"fp4"   Standard 4-bit float                                        Compatibility
# NF4 (recommended; load_in_4bit=True is required for this option to take effect)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

bnb_4bit_compute_dtype

The data type used for computation during forward/backward pass:

# BFloat16 (recommended for modern GPUs, e.g. Ampere and newer)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

# Float16 (for older GPUs or compatibility)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

bnb_4bit_use_double_quant

Quantizes the quantization constants, saving ~0.4 bits per parameter:

# Enable double quantization (recommended)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
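
In concrete terms, a back-of-the-envelope sketch of what ~0.4 bits/parameter buys you:

params = 7e9
saved_gb = params * 0.4 / 8 / 1e9   # bits -> bytes -> GB
print(f"~{saved_gb:.2f} GB saved on a 7B model")  # ~0.35 GB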

Loading a Model with QLoRA

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 1. Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# 2. Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# 3. Prepare for training
model = prepare_model_for_kbit_training(model)

# 4. Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# 5. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

Memory Comparison

Model            Full FT   LoRA (fp16)   QLoRA (4-bit)
Llama 3.2 1B     8GB       4GB           2GB
Llama 3.2 3B     24GB      12GB          4GB
Llama 3.1 8B     64GB      18GB          6GB
Mistral 7B       56GB      16GB          5GB
Llama 3.3 70B    560GB     160GB         24GB
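
Exact numbers vary with sequence length and batch size. A quick way to see what the loaded model actually occupies (CUDA only):

import torch

# Check VRAM allocated by PyTorch after loading the model
if torch.cuda.is_available():
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")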

Gradient Checkpointing

Further reduce memory by recomputing activations during backward pass:

# Enable gradient checkpointing (recompute activations instead of storing them)
model.gradient_checkpointing_enable()
# Needed with PEFT: ensures gradients flow back to the LoRA adapters
# even though the quantized base weights are frozen
model.enable_input_require_grads()

# Or via training arguments
training_args = TrainingArguments(
    gradient_checkpointing=True,
    ...
)
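
If newer transformers versions warn about reentrant checkpointing, you can select the non-reentrant implementation explicitly (a sketch; gradient_checkpointing_kwargs requires transformers >= 4.35):

training_args = TrainingArguments(
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    ...
)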

Complete QLoRA Setup

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer         # used for training in the next lesson
from datasets import load_dataset  # used for loading data in the next lesson

# Model configuration
model_name = "meta-llama/Llama-3.2-3B-Instruct"

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for k-bit training; this also enables gradient
# checkpointing by default (use_gradient_checkpointing=True)
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
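
A quick smoke test (a minimal sketch) to confirm the 4-bit model loads and generates before you start a training run:

# One short generation to verify the setup end to end
inputs = tokenizer("Hello, QLoRA!", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))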

Common Issues and Fixes

NaN Loss

# Use gradient clipping
training_args = TrainingArguments(
    max_grad_norm=0.3,  # Lower than default 1.0
    ...
)
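
If NaN loss persists with float16 compute, switching bnb_4bit_compute_dtype to torch.bfloat16 (on GPUs that support it) often helps, since bf16 has a much wider dynamic range:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16  # wider dynamic range than fp16
)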

Slow Training

# Enable Flash Attention 2 (requires `pip install flash-attn`
# and an Ampere-or-newer GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    ...
)

Out of Memory

# Reduce batch size and use gradient accumulation
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    ...
)
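
The paged optimizer (QLoRA's third innovation) also helps here; it's exposed through the optim argument:

training_args = TrainingArguments(
    optim="paged_adamw_32bit",  # pages optimizer states to CPU RAM under memory pressure
    ...
)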

Tip: QLoRA is the go-to choice for consumer GPUs. Start here, and only move to full LoRA if you have the VRAM and need the slight quality improvement.

Next, we'll put everything together and run a complete fine-tuning job.
