Lesson 11 of 24

LoRA & QLoRA in Practice

QLoRA: 4-bit Fine-tuning

QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of large models on consumer GPUs. Let's understand how it works and how to set it up.

How QLoRA Works

QLoRA has three key innovations:

  1. 4-bit NormalFloat (NF4) - Quantization format optimized for neural network weights
  2. Double Quantization - Quantizes the quantization constants for additional memory savings
  3. Paged Optimizers - Uses CPU memory for optimizer states when the GPU runs out of memory (see the sketch below for how to enable them)

For a typical 7B model, the memory math looks roughly like this:

Standard 7B Model:  ~14GB VRAM (fp16)
With QLoRA:         ~4-6GB VRAM
Savings:            60-70% less VRAM
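
Paged optimizers aren't configured in BitsAndBytesConfig; they're selected by the optimizer name in the training arguments. A minimal sketch (output_dir is just a placeholder):

from transformers import TrainingArguments

# Paged AdamW lets optimizer states spill to CPU RAM during VRAM spikes
# instead of crashing with an out-of-memory error
training_args = TrainingArguments(
    output_dir="qlora-checkpoints",   # placeholder
    optim="paged_adamw_8bit",         # or "paged_adamw_32bit"
)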

BitsAndBytesConfig

The core configuration for 4-bit quantization:

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # Computation precision
    bnb_4bit_use_double_quant=True          # Double quantization
)

Parameter Deep Dive

load_in_4bit

Enables 4-bit quantization. The model weights are stored in 4 bits but dequantized to the compute dtype on the fly during the forward and backward passes.
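
To see the effect directly, you can print the footprint of a 4-bit load with get_memory_footprint(), which Hugging Face models expose. A quick sketch using the same model as the examples below:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Approximate size of the loaded weights in GB
print(f"{model_4bit.get_memory_footprint() / 1e9:.1f} GB")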

bnb_4bit_quant_type

| Type | Description | Best For |
|------|-------------|----------|
| "nf4" | NormalFloat4 - optimized for normally distributed weights | Most cases |
| "fp4" | Standard 4-bit float | Compatibility |

# NF4 (recommended)
bnb_config = BitsAndBytesConfig(bnb_4bit_quant_type="nf4")

bnb_4bit_compute_dtype

The data type used for computation during forward/backward pass:

# BFloat16 (recommended for modern GPUs)
bnb_config = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16)

# Float16 (for older GPUs or compatibility)
bnb_config = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.float16)
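
If you're not sure what your GPU supports, PyTorch can check for native bfloat16 support so you can fall back to float16 otherwise:

import torch
from transformers import BitsAndBytesConfig

# Ampere (RTX 30xx / A100) and newer GPUs support bfloat16 natively
compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True
)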

bnb_4bit_use_double_quant

Quantizes the quantization constants, saving ~0.4 bits per parameter:

# Enable double quantization (recommended)
bnb_config = BitsAndBytesConfig(bnb_4bit_use_double_quant=True)
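
As back-of-the-envelope arithmetic, ~0.4 bits per parameter works out to roughly a third of a gigabyte on a 7B model:

params = 7e9                          # 7B parameters
saved_bits = 0.4 * params             # ~0.4 bits saved per parameter
saved_gb = saved_bits / 8 / 1e9       # bits -> bytes -> GB
print(f"~{saved_gb:.2f} GB saved")    # ~0.35 GB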

Loading a Model with QLoRA

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 1. Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# 2. Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# 3. Prepare for training
model = prepare_model_for_kbit_training(model)

# 4. Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# 5. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
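
As a sanity check, the base model's linear layers should now be bitsandbytes Linear4bit modules, and only the LoRA adapters should be trainable:

import bitsandbytes as bnb

# Count layers that were replaced by 4-bit linear modules
num_4bit = sum(isinstance(m, bnb.nn.Linear4bit) for m in model.modules())
print(f"4-bit linear layers: {num_4bit}")

# Should report a trainable fraction well under a few percent
model.print_trainable_parameters()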

Memory Comparison

| Model | Full Fine-tuning | LoRA (fp16) | QLoRA (4-bit) |
|-------|------------------|-------------|---------------|
| Llama 3.2 1B | 8GB | 4GB | 2GB |
| Llama 3.2 3B | 24GB | 12GB | 4GB |
| Llama 3.1 8B | 64GB | 18GB | 6GB |
| Mistral 7B | 56GB | 16GB | 5GB |
| Llama 3.3 70B | 560GB | 160GB | 24GB |

Gradient Checkpointing

Further reduce memory by recomputing activations during the backward pass instead of storing them all:

# Enable gradient checkpointing
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# Or via training arguments
training_args = TrainingArguments(
    gradient_checkpointing=True,
    ...
)
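
When checkpointing is combined with a frozen quantized base model, the non-reentrant implementation is generally the safer choice; newer transformers versions accept it in both places (a sketch, version-dependent):

# Non-reentrant checkpointing avoids issues with frozen 4-bit base weights
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)

# Or the equivalent TrainingArguments option
training_args = TrainingArguments(
    output_dir="qlora-checkpoints",   # placeholder
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)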

Complete QLoRA Setup

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset

# Model configuration
model_name = "meta-llama/Llama-3.2-3B-Instruct"

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
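
The next lesson runs the full job; as a preview, here is roughly how these pieces plug into SFTTrainer, using a tiny in-memory toy dataset. The exact keyword for passing the tokenizer varies across trl versions (older releases use tokenizer=, newer ones processing_class=), so treat this as a sketch:

from datasets import Dataset

# Toy dataset with a single "text" field, just to illustrate the wiring
train_dataset = Dataset.from_dict({
    "text": ["### Instruction: Say hello.\n### Response: Hello!"]
})

training_args = TrainingArguments(
    output_dir="qlora-checkpoints",   # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=1,
    bf16=True
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer        # tokenizer=tokenizer on older trl
)
trainer.train()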

Common Issues and Fixes

NaN Loss

# Use gradient clipping
training_args = TrainingArguments(
    max_grad_norm=0.3,  # Lower than default 1.0
    ...
)

Slow Training

# Enable Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    ...
)
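
Flash Attention 2 needs the separate flash-attn package and an Ampere-or-newer GPU; transformers provides a helper to check availability so you can fall back gracefully:

from transformers.utils import is_flash_attn_2_available

attn_impl = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation=attn_impl,    # falls back to PyTorch SDPA if flash-attn is missing
    device_map="auto"
)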

Out of Memory

# Reduce batch size and use gradient accumulation
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    ...
)
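
This keeps the effective batch size the same, because gradients are accumulated across steps before each optimizer update:

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 1

# Effective batch size per optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 8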

Tip: QLoRA is the go-to choice for consumer GPUs. Start here, and only move to LoRA on a 16-bit base model if you have the VRAM and need the slight quality improvement.

Next, we'll put everything together and run a complete fine-tuning job.
