LoRA & QLoRA in Practice
Setting Up the Environment
Before we start fine-tuning, let's set up a proper environment with all the required libraries.
Required Libraries
Here are the core libraries for fine-tuning in 2025 (a quick version check follows the table):
| Library | Purpose | Version |
|---|---|---|
| transformers | Model loading and tokenization | ≥4.46.0 |
| peft | LoRA and adapter methods | ≥0.13.0 |
| trl | Training (SFTTrainer, DPOTrainer) | ≥0.12.0 |
| bitsandbytes | 4-bit quantization | ≥0.44.0 |
| datasets | Dataset loading and processing | ≥3.0.0 |
| accelerate | Distributed training | ≥1.0.0 |
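If you want to confirm that an existing environment meets these minimums before running the fuller verification script below, a quick check like this works (MINIMUMS is just an illustrative constant mirroring the table):
from importlib.metadata import version, PackageNotFoundError
MINIMUMS = {
    "transformers": "4.46.0",
    "peft": "0.13.0",
    "trl": "0.12.0",
    "bitsandbytes": "0.44.0",
    "datasets": "3.0.0",
    "accelerate": "1.0.0",
}
for package, minimum in MINIMUMS.items():
    try:
        print(f"{package}: installed {version(package)}, need >= {minimum}")
    except PackageNotFoundError:
        print(f"{package}: NOT INSTALLED")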
Installation
Basic Installation
pip install transformers peft trl datasets accelerate bitsandbytes
With CUDA Support (Recommended)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install transformers peft trl datasets accelerate bitsandbytes
Full Installation with Extras
pip install "transformers[torch]" peft trl datasets accelerate bitsandbytes
pip install wandb # For experiment tracking
pip install flash-attn --no-build-isolation # Optional: faster attention
Verify Installation
Run this script to verify everything is working:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
import transformers
print(f"Transformers version: {transformers.__version__}")
import peft
print(f"PEFT version: {peft.__version__}")
import trl
print(f"TRL version: {trl.__version__}")
import bitsandbytes
print(f"bitsandbytes version: {bitsandbytes.__version__}")
Expected output:
PyTorch version: 2.5.1+cu121
CUDA available: True
CUDA version: 12.1
GPU: NVIDIA GeForce RTX 4090
VRAM: 24.0 GB
Transformers version: 4.46.0
PEFT version: 0.13.0
TRL version: 0.12.0
bitsandbytes version: 0.44.0
Environment Options
Local GPU
Pros: Full control, no time limits.
Cons: Hardware investment required.
Minimum specs for QLoRA (a rough VRAM estimate follows the list):
- GPU: 8GB VRAM (RTX 3070 or better)
- RAM: 32GB
- Storage: 100GB SSD
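A rough, back-of-the-envelope way to sanity-check whether a model fits in that budget: 4-bit weights take about 0.5 bytes per parameter, plus overhead for LoRA adapters, optimizer states, activations, and CUDA buffers. The 4 GB overhead below is an assumption that varies with sequence length and batch size:
def estimate_qlora_vram_gb(num_params_billion, overhead_gb=4.0):
    # 4-bit weights: ~0.5 bytes per parameter; overhead_gb is a rough
    # allowance for LoRA adapters, optimizer states, activations, buffers
    weights_gb = num_params_billion * 0.5
    return weights_gb + overhead_gb

for size in (3, 8, 13):
    print(f"{size}B model: ~{estimate_qlora_vram_gb(size):.1f} GB VRAM (QLoRA, rough)")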
Google Colab
Pros: Free tier available, easy to start.
Cons: Session time limits, variable GPU availability.
# Check Colab GPU
!nvidia-smi
Cloud Providers
| Provider | GPU Options | Cost |
|---|---|---|
| RunPod | A100, 4090, A6000 | $0.40-2.00/hr |
| Lambda Labs | A100, H100 | $1.10-3.00/hr |
| Vast.ai | Various | $0.20-1.50/hr |
| AWS SageMaker | P4d, G5 | $1.50-5.00/hr |
Project Structure
Organize your fine-tuning project:
my-fine-tuning-project/
├── data/
│ ├── train.json
│ └── validation.json
├── configs/
│ └── lora_config.yaml
├── scripts/
│ ├── train.py
│ └── evaluate.py
├── outputs/
│ ├── checkpoints/
│ └── final_model/
├── requirements.txt
└── README.md
Configuration File
Create a config file for reproducibility:
# configs/lora_config.yaml
model:
  name: "meta-llama/Llama-3.2-3B-Instruct"
  max_seq_length: 2048

lora:
  r: 16
  alpha: 16
  dropout: 0.0
  target_modules: "all-linear"

training:
  batch_size: 4
  gradient_accumulation_steps: 4
  learning_rate: 2e-4
  num_epochs: 3
  warmup_ratio: 0.03

quantization:
  load_in_4bit: true
  bnb_4bit_quant_type: "nf4"
  bnb_4bit_compute_dtype: "bfloat16"
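A minimal sketch of how scripts/train.py might consume this file, assuming PyYAML is installed (pip install pyyaml) and mapping the config sections onto PEFT and bitsandbytes config objects:
import torch
import yaml
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# Load the YAML config written above
with open("configs/lora_config.yaml") as f:
    cfg = yaml.safe_load(f)

lora_config = LoraConfig(
    r=cfg["lora"]["r"],
    lora_alpha=cfg["lora"]["alpha"],
    lora_dropout=cfg["lora"]["dropout"],
    target_modules=cfg["lora"]["target_modules"],
    task_type="CAUSAL_LM",
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=cfg["quantization"]["load_in_4bit"],
    bnb_4bit_quant_type=cfg["quantization"]["bnb_4bit_quant_type"],
    bnb_4bit_compute_dtype=getattr(torch, cfg["quantization"]["bnb_4bit_compute_dtype"]),
)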
Hugging Face Setup
Gated models such as Llama require authentication with a Hugging Face account:
# Install CLI
pip install huggingface_hub
# Login (get token from huggingface.co/settings/tokens)
huggingface-cli login
Or in Python:
from huggingface_hub import login
login(token="your_token_here")
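To keep the token out of your source code, read it from an environment variable instead; huggingface_hub recognizes HF_TOKEN, and login() without a token falls back to an interactive prompt:
import os
from huggingface_hub import login
login(token=os.environ.get("HF_TOKEN"))  # prompts interactively if HF_TOKEN is unset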
Common Setup Issues
CUDA Out of Memory
# Reduce batch size
batch_size = 2
# Enable gradient checkpointing
model.gradient_checkpointing_enable()
# Use smaller max_seq_length
max_seq_length = 1024
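Where these settings actually go depends on your training setup; with trl's SFTConfig (argument names as of trl ≥0.12, a sketch rather than a full script) the same fixes look like:
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="outputs/checkpoints",
    per_device_train_batch_size=2,   # reduced batch size
    gradient_accumulation_steps=8,   # keep the effective batch size up
    gradient_checkpointing=True,     # trade compute for memory
    max_seq_length=1024,             # shorter sequences
)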
bitsandbytes Issues on Windows
# bitsandbytes >=0.43.0 ships official Windows wheels, so upgrade the official
# package rather than installing the outdated third-party bitsandbytes-windows fork
pip install -U bitsandbytes
Flash Attention Not Available
# Fall back to standard attention
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",  # or "sdpa"
)
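If you prefer the script to pick the best available backend automatically, one option (an illustrative pattern, not the only way) is to check whether the flash_attn package is importable:
import importlib.util
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-3.2-3B-Instruct"  # example; substitute your model
# Use FlashAttention 2 if the package is installed, otherwise fall back to SDPA
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation=attn_impl,
)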
Tip: Start with a small model (1B-3B) to verify your setup before moving to larger models.
Next, we'll dive into LoRA configuration and understand each parameter.