Training with Unsloth
Exporting Your Model
After training, you need to export your model for deployment. Unsloth provides several export options, including LoRA adapters, merged checkpoints, and GGUF for Ollama.
Export Options
| Format | Use Case | Tools |
|---|---|---|
| LoRA adapters | Inference with base model | PEFT, vLLM |
| Merged model | Standalone deployment, HuggingFace Hub | transformers |
| GGUF | Local deployment | Ollama, llama.cpp |
| ONNX | Edge deployment | ONNX Runtime |
Saving LoRA Adapters Only
The smallest export: just the trained LoRA weights.
from unsloth import FastLanguageModel
# After training, save just the LoRA adapter
model.save_pretrained("./outputs/lora-adapter")
tokenizer.save_pretrained("./outputs/lora-adapter")
# Size: ~50-200 MB depending on rank
Loading LoRA Adapters
from unsloth import FastLanguageModel
# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# Load adapter
model.load_adapter("./outputs/lora-adapter")
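To confirm the adapter loaded correctly, it helps to generate a few tokens before moving on. This is a minimal sketch assuming the model and tokenizer variables from the snippet above; the prompt is just a placeholder.
from unsloth import FastLanguageModel
# Switch Unsloth into its faster inference mode
FastLanguageModel.for_inference(model)
# Placeholder prompt - replace with something from your training domain
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))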
Merging and Saving Full Model
Merge the LoRA adapters into the base model for standalone deployment:
from unsloth import FastLanguageModel
# Method 1: Save merged 16-bit model
model.save_pretrained_merged(
"outputs/merged-model-16bit",
tokenizer,
save_method="merged_16bit",
)
# Method 2: Save merged 4-bit model (smaller)
model.save_pretrained_merged(
"outputs/merged-model-4bit",
tokenizer,
save_method="merged_4bit",
)
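The merged 16-bit folder is a regular HuggingFace checkpoint, so it can be loaded without Unsloth or PEFT. A minimal sketch, assuming the output path from the example above and that accelerate is installed for device_map="auto":
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the merged model like any other HuggingFace checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "outputs/merged-model-16bit",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("outputs/merged-model-16bit")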
Exporting to GGUF for Ollama
GGUF is the format used by Ollama and llama.cpp:
from unsloth import FastLanguageModel
# Export to GGUF with various quantization levels
model.save_pretrained_gguf(
"outputs/model-gguf",
tokenizer,
quantization_method="q4_k_m", # 4-bit quantization
)
Quantization Options
| Method | Size | Quality | Use Case |
|---|---|---|---|
| q8_0 | Largest | Best | When quality matters |
| q6_k | Large | Very Good | Balance |
| q5_k_m | Medium | Good | General use |
| q4_k_m | Small | Good | Default choice |
| q4_0 | Smaller | Decent | Memory constrained |
| q2_k | Smallest | Lower | Extreme constraints |
# High quality
model.save_pretrained_gguf("outputs/q8", tokenizer, quantization_method="q8_0")
# Balanced (recommended)
model.save_pretrained_gguf("outputs/q4km", tokenizer, quantization_method="q4_k_m")
# Small
model.save_pretrained_gguf("outputs/q4", tokenizer, quantization_method="q4_0")
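Before wiring a GGUF file into Ollama, you can sanity-check it with llama-cpp-python (pip install llama-cpp-python). A rough sketch; the exact .gguf filename inside the export directory is an assumption, so check what Unsloth actually wrote:
from llama_cpp import Llama
# Load the quantized GGUF file directly with the llama.cpp bindings
llm = Llama(model_path="outputs/q4km/unsloth.Q4_K_M.gguf", n_ctx=2048)  # filename is an assumption
out = llm("Hello, how are you?", max_tokens=64)
print(out["choices"][0]["text"])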
Using with Ollama
After exporting to GGUF:
1. Create Modelfile
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./outputs/model-gguf/unsloth.Q4_K_M.gguf
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
EOF
2. Create Ollama Model
ollama create my-finetuned-model -f Modelfile
3. Run Your Model
ollama run my-finetuned-model "Hello, how are you?"
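Once the model is registered with Ollama, you can also call it programmatically through Ollama's local REST API. A minimal sketch, assuming Ollama is running on its default port 11434 and the requests package is installed:
import requests
# Non-streaming generation request against the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-finetuned-model",
        "prompt": "Hello, how are you?",
        "stream": False,
    },
)
print(resp.json()["response"])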
Pushing to HuggingFace Hub
Share your model with the community:
# Login first
from huggingface_hub import login
login(token="your_token")
# Push LoRA adapter
model.push_to_hub("your-username/my-lora-adapter")
tokenizer.push_to_hub("your-username/my-lora-adapter")
# Or push merged model
model.push_to_hub_merged(
"your-username/my-merged-model",
tokenizer,
save_method="merged_16bit",
)
# Push GGUF
model.push_to_hub_gguf(
"your-username/my-model-gguf",
tokenizer,
quantization_method="q4_k_m",
)
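To verify the upload, you can pull the GGUF file back down with huggingface_hub. A sketch with the placeholder repo id from above; the exact filename in the repo is an assumption, so check it on the Hub:
from huggingface_hub import hf_hub_download
# Download a single GGUF file from the uploaded repo
local_path = hf_hub_download(
    repo_id="your-username/my-model-gguf",
    filename="unsloth.Q4_K_M.gguf",  # assumption - check the actual filename on the Hub
)
print(local_path)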
Complete Export Script
from unsloth import FastLanguageModel
from huggingface_hub import login
# Load trained model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./outputs/unsloth-finetune/final",
max_seq_length=2048,
load_in_4bit=True,
)
# ============================================
# Export Options
# ============================================
# 1. Save LoRA adapter only (smallest)
model.save_pretrained("exports/lora-only")
tokenizer.save_pretrained("exports/lora-only")
print("LoRA adapter saved!")
# 2. Save merged model (for HuggingFace deployment)
model.save_pretrained_merged(
"exports/merged-16bit",
tokenizer,
save_method="merged_16bit",
)
print("Merged model saved!")
# 3. Save GGUF (for Ollama)
model.save_pretrained_gguf(
"exports/gguf",
tokenizer,
quantization_method="q4_k_m",
)
print("GGUF model saved!")
# 4. Push to HuggingFace (optional)
# login(token="your_token")
# model.push_to_hub_gguf("username/model-name", tokenizer, quantization_method="q4_k_m")
Export Size Comparison
For a fine-tuned Llama 3.2 3B:
| Export Type | Size |
|---|---|
| LoRA adapter only | ~100 MB |
| Merged fp16 | ~6 GB |
| GGUF q8_0 | ~3 GB |
| GGUF q4_k_m | ~2 GB |
| GGUF q4_0 | ~1.7 GB |
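To check the actual sizes your exports produced, a quick directory walk works; this sketch assumes the exports/ layout from the complete script above:
from pathlib import Path
# Print the on-disk size of every exported file in MB
for path in sorted(Path("exports").rglob("*")):
    if path.is_file():
        print(f"{path}  {path.stat().st_size / 1e6:.1f} MB")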
Best Practices
- Always save the LoRA adapter first: it's small and allows further experimentation
- Use q4_k_m for Ollama: the best balance of size and quality
- Test before pushing: verify outputs match expectations
- Include the chat template: essential for instruction-tuned models
Tip: Export to GGUF with q4_k_m for Ollama deployment; it's the sweet spot between quality and size for most use cases.
In the next module, we'll learn how to further improve our model with DPO alignment.