Lesson 16 of 24

Training with Unsloth

Exporting Your Model

After training, you need to export your model for deployment. Unsloth provides several export options, including GGUF for Ollama and llama.cpp.

Export Options

| Format | Use Case | Tools |
| --- | --- | --- |
| LoRA adapters | Inference with base model | PEFT, vLLM |
| Merged model | HuggingFace Hub, standalone deployment | transformers |
| GGUF | Local deployment | Ollama, llama.cpp |
| ONNX | Edge deployment | ONNX Runtime |

Saving LoRA Adapters Only

The smallest export - just the trained weights:

from unsloth import FastLanguageModel

# After training, save just the LoRA adapter
model.save_pretrained("./outputs/lora-adapter")
tokenizer.save_pretrained("./outputs/lora-adapter")

# Size: ~50-200 MB depending on rank

Loading LoRA Adapters

from unsloth import FastLanguageModel

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Load adapter
model.load_adapter("./outputs/lora-adapter")
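
To sanity-check the loaded adapter, you can switch Unsloth into inference mode and generate a short completion. A minimal sketch; the prompt is only an example and it assumes the tokenizer ships with a chat template:

# Enable Unsloth's optimized inference path
FastLanguageModel.for_inference(model)

# Example prompt; adapt to your own task
messages = [{"role": "user", "content": "Summarize what a LoRA adapter is."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))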

Merging and Saving Full Model

Merge adapters into base model for standalone deployment:

from unsloth import FastLanguageModel

# Method 1: Save merged 16-bit model
model.save_pretrained_merged(
    "outputs/merged-model-16bit",
    tokenizer,
    save_method="merged_16bit",
)

# Method 2: Save merged 4-bit model (smaller)
model.save_pretrained_merged(
    "outputs/merged-model-4bit",
    tokenizer,
    save_method="merged_4bit",
)
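
Because the LoRA weights are folded into the base weights, the merged 16-bit export can be loaded with plain transformers, with no Unsloth or PEFT dependency at inference time. A minimal sketch (device_map="auto" assumes accelerate is installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

# The merged checkpoint behaves like any standard HuggingFace model
model = AutoModelForCausalLM.from_pretrained(
    "outputs/merged-model-16bit",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("outputs/merged-model-16bit")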

Exporting to GGUF for Ollama

GGUF is the format used by Ollama and llama.cpp:

from unsloth import FastLanguageModel

# Export to GGUF with various quantization levels
model.save_pretrained_gguf(
    "outputs/model-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # 4-bit quantization
)
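
The exact filename depends on the quantization method (for example unsloth.Q4_K_M.gguf, which the Modelfile below points at). A quick way to check what was written to the export directory, using only the standard library:

from pathlib import Path

# List the GGUF files Unsloth produced, with their sizes
for gguf in sorted(Path("outputs/model-gguf").glob("*.gguf")):
    print(f"{gguf.name}: {gguf.stat().st_size / 1e9:.2f} GB")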

Quantization Options

| Method | Size | Quality | Use Case |
| --- | --- | --- | --- |
| q8_0 | Largest | Best | When quality matters |
| q6_k | Large | Very good | Balance of size and quality |
| q5_k_m | Medium | Good | General use |
| q4_k_m | Small | Good | Default choice |
| q4_0 | Smaller | Decent | Memory constrained |
| q2_k | Smallest | Lower | Extreme constraints |

For example:

# High quality
model.save_pretrained_gguf("outputs/q8", tokenizer, quantization_method="q8_0")

# Balanced (recommended)
model.save_pretrained_gguf("outputs/q4km", tokenizer, quantization_method="q4_k_m")

# Small
model.save_pretrained_gguf("outputs/q4", tokenizer, quantization_method="q4_0")

Using with Ollama

After exporting to GGUF:

1. Create Modelfile

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./outputs/model-gguf/unsloth.Q4_K_M.gguf

TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
EOF

2. Create Ollama Model

ollama create my-finetuned-model -f Modelfile

3. Run Your Model

ollama run my-finetuned-model "Hello, how are you?"
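
Besides the CLI, Ollama exposes a local REST API (port 11434 by default), which is handy for scripted testing of the fine-tuned model. A minimal sketch using the requests library:

import requests

# Single non-streaming generation request to the local Ollama server
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-finetuned-model",
        "prompt": "Hello, how are you?",
        "stream": False,
    },
)
print(response.json()["response"])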

Pushing to HuggingFace Hub

Share your model with the community:

# Login first
from huggingface_hub import login
login(token="your_token")

# Push LoRA adapter
model.push_to_hub("your-username/my-lora-adapter")
tokenizer.push_to_hub("your-username/my-lora-adapter")

# Or push merged model
model.push_to_hub_merged(
    "your-username/my-merged-model",
    tokenizer,
    save_method="merged_16bit",
)

# Push GGUF
model.push_to_hub_gguf(
    "your-username/my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
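
Once pushed, the adapter can be pulled back down by repository name; Unsloth detects the LoRA config and loads it on top of its base model. A sketch assuming the placeholder repository IDs above:

from unsloth import FastLanguageModel

# Loading a LoRA repository applies the adapter to its base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-username/my-lora-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)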

Complete Export Script

from unsloth import FastLanguageModel
from huggingface_hub import login

# Load trained model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./outputs/unsloth-finetune/final",
    max_seq_length=2048,
    load_in_4bit=True,
)

# ============================================
# Export Options
# ============================================

# 1. Save LoRA adapter only (smallest)
model.save_pretrained("exports/lora-only")
tokenizer.save_pretrained("exports/lora-only")
print("LoRA adapter saved!")

# 2. Save merged model (for HuggingFace deployment)
model.save_pretrained_merged(
    "exports/merged-16bit",
    tokenizer,
    save_method="merged_16bit",
)
print("Merged model saved!")

# 3. Save GGUF (for Ollama)
model.save_pretrained_gguf(
    "exports/gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
print("GGUF model saved!")

# 4. Push to HuggingFace (optional)
# login(token="your_token")
# model.push_to_hub_gguf("username/model-name", tokenizer, quantization_method="q4_k_m")

Export Size Comparison

For a fine-tuned Llama 3.2 3B:

| Export Type | Size |
| --- | --- |
| LoRA adapter only | ~100 MB |
| Merged fp16 | ~6 GB |
| GGUF q8_0 | ~3 GB |
| GGUF q4_k_m | ~2 GB |
| GGUF q4_0 | ~1.7 GB |
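
These figures vary with LoRA rank and quantization, so it is worth measuring your own exports. A small helper using only the standard library, with paths matching the export script above:

from pathlib import Path

def dir_size_gb(path: str) -> float:
    """Total size of all files under a directory, in GB."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e9

for name in ["exports/lora-only", "exports/merged-16bit", "exports/gguf"]:
    print(f"{name}: {dir_size_gb(name):.2f} GB")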

Best Practices

  1. Always save LoRA first - It's small and allows experimentation
  2. Use q4_k_m for Ollama - Best balance of size and quality
  3. Test before pushing - Verify outputs match expectations
  4. Include chat template - Essential for instruction-tuned models

Tip: Export to GGUF with q4_k_m for Ollama deployment - it's the sweet spot between quality and size for most use cases.

In the next module, we'll learn how to further improve our model with DPO alignment.
