Evaluation & Deployment
Deploying to Ollama
After fine-tuning, deploy your model locally with Ollama for fast, easy inference.
Export Pipeline Overview
Fine-tuned Model (HF format)
↓
Merge LoRA weights
↓
Convert to GGUF
↓
Create Ollama Modelfile
↓
Import to Ollama
↓
Run locally!
Step 1: Merge LoRA Weights
First, merge the LoRA adapters with the base model:
from unsloth import FastLanguageModel

# Load your fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    "./outputs/fine-tuned",
    max_seq_length=2048,
    load_in_4bit=False,  # Load in full precision for merging
)

# Merge LoRA weights into base model
model.save_pretrained_merged(
    "./outputs/merged",
    tokenizer,
    save_method="merged_16bit",  # Full precision merged weights
)

print("Merged model saved to ./outputs/merged")
Step 2: Convert to GGUF
GGUF is the format Ollama runs models in (it comes from llama.cpp), and Unsloth can export to it directly:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "./outputs/fine-tuned",
    max_seq_length=2048,
    load_in_4bit=False,
)

# Export to GGUF with quantization
model.save_pretrained_gguf(
    "./outputs/gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of quality/size
)
Quantization Options
| Method | Size | Quality | Speed | Use Case |
|---|---|---|---|---|
| q4_k_m | Small | Good | Fast | General use |
| q5_k_m | Medium | Better | Medium | Quality focus |
| q8_0 | Large | Best | Slower | Maximum quality |
| f16 | Largest | Lossless | Slowest | Reference |
# For different quantization levels
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
    model.save_pretrained_gguf(
        f"./outputs/gguf-{quant}",
        tokenizer,
        quantization_method=quant,
    )
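To see what each level actually costs on disk, compare the resulting file sizes. A small sketch, assuming the output directories created by the loop above:

import os

# Compare GGUF file sizes across quantization levels
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
    gguf_dir = f"./outputs/gguf-{quant}"
    for fname in os.listdir(gguf_dir):
        if fname.endswith(".gguf"):
            size_gb = os.path.getsize(os.path.join(gguf_dir, fname)) / 1e9
            print(f"{quant}: {fname} = {size_gb:.2f} GB")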
Step 3: Create Modelfile
Create an Ollama Modelfile to configure your model:
# Create Modelfile
cat > Modelfile << 'EOF'
# Base model from GGUF file
FROM ./outputs/gguf/unsloth.Q4_K_M.gguf
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 2048
# System prompt (customize for your use case)
SYSTEM """You are a helpful assistant specialized in [your domain].
You provide accurate, concise, and helpful responses."""
# Chat template (match your training format)
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
# Stop tokens
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
EOF
Step 4: Import to Ollama
# Create the model in Ollama
ollama create my-fine-tuned -f Modelfile
# Verify it was created
ollama list
# Test the model
ollama run my-fine-tuned "Your test prompt here"
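You can also verify the import programmatically via Ollama's local HTTP API. A quick check against the /api/tags endpoint, assuming Ollama is running on its default port (11434):

import requests

# /api/tags lists every locally installed model
models = requests.get("http://localhost:11434/api/tags").json().get("models", [])
names = [m["name"] for m in models]  # e.g. "my-fine-tuned:latest"
print("my-fine-tuned registered:", any(n.startswith("my-fine-tuned") for n in names))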
Complete Deployment Script
Here's a complete script combining all steps:
from unsloth import FastLanguageModel
import subprocess
import os

def deploy_to_ollama(
    model_path: str,
    model_name: str,
    quantization: str = "q4_k_m",
    system_prompt: str = "You are a helpful assistant.",
):
    """Deploy a fine-tuned model to Ollama"""
    print(f"Loading model from {model_path}...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_path,
        max_seq_length=2048,
        load_in_4bit=False,
    )

    # Export to GGUF
    gguf_dir = f"./outputs/{model_name}-gguf"
    print(f"Exporting to GGUF ({quantization})...")
    model.save_pretrained_gguf(
        gguf_dir,
        tokenizer,
        quantization_method=quantization,
    )

    # Find the GGUF file
    gguf_file = None
    for f in os.listdir(gguf_dir):
        if f.endswith(".gguf"):
            gguf_file = os.path.join(gguf_dir, f)
            break
    if not gguf_file:
        raise FileNotFoundError("GGUF file not created")

    # Create Modelfile
    modelfile_content = f'''FROM {gguf_file}
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 2048
SYSTEM """{system_prompt}"""
TEMPLATE """{{{{ if .System }}}}<|start_header_id|>system<|end_header_id|>
{{{{ .System }}}}<|eot_id|>{{{{ end }}}}{{{{ if .Prompt }}}}<|start_header_id|>user<|end_header_id|>
{{{{ .Prompt }}}}<|eot_id|>{{{{ end }}}}<|start_header_id|>assistant<|end_header_id|>
{{{{ .Response }}}}<|eot_id|>"""
PARAMETER stop "<|eot_id|>"
'''
    modelfile_path = os.path.join(gguf_dir, "Modelfile")
    with open(modelfile_path, "w") as f:
        f.write(modelfile_content)

    # Create Ollama model
    print(f"Creating Ollama model '{model_name}'...")
    subprocess.run(["ollama", "create", model_name, "-f", modelfile_path], check=True)

    print(f"Model '{model_name}' ready! Run with: ollama run {model_name}")
    return model_name

# Usage
deploy_to_ollama(
    model_path="./outputs/fine-tuned",
    model_name="my-domain-expert",
    quantization="q4_k_m",
    system_prompt="You are an expert in [your domain]. Provide detailed, accurate responses.",
)
Testing Your Deployed Model
Interactive Testing
# Start interactive chat
ollama run my-fine-tuned
# Test specific prompts
echo "Your test question" | ollama run my-fine-tuned
API Testing
import requests

def query_ollama(prompt: str, model: str = "my-fine-tuned"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
    )
    return response.json()["response"]

# Test
response = query_ollama("What is your specialty?")
print(response)
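The example above waits for the full completion. Ollama can also stream tokens as newline-delimited JSON objects; a streaming sketch against the same /api/generate endpoint:

import json
import requests

def stream_ollama(prompt: str, model: str = "my-fine-tuned"):
    """Stream tokens as they are generated instead of waiting for the full reply."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as r:
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # one JSON object per line
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                print()
                break

stream_ollama("Summarize your specialty in one sentence.")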
Benchmarking
import time

def benchmark_model(model: str, prompts: list, num_runs: int = 3):
    """Benchmark inference speed"""
    times = []
    for prompt in prompts:
        for _ in range(num_runs):
            start = time.time()
            query_ollama(prompt, model)
            times.append(time.time() - start)
    avg_time = sum(times) / len(times)
    print(f"Average response time: {avg_time:.2f}s")
    print(f"Throughput: {1/avg_time:.2f} responses/second")

test_prompts = [
    "Short question?",
    "A medium length question that requires more thought and detail in the response.",
    "A longer, more complex question that tests the model's ability to handle detailed queries with multiple parts.",
]

benchmark_model("my-fine-tuned", test_prompts)
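Wall-clock time depends heavily on response length, so tokens per second is often a fairer number. Ollama's non-streaming responses include eval_count (generated tokens) and eval_duration (nanoseconds); a sketch that reads those fields:

import requests

def tokens_per_second(prompt: str, model: str = "my-fine-tuned"):
    """Report generation speed from Ollama's own timing fields."""
    data = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    tokens = data["eval_count"]
    seconds = data["eval_duration"] / 1e9  # reported in nanoseconds
    print(f"{tokens} tokens in {seconds:.2f}s -> {tokens / seconds:.1f} tok/s")

tokens_per_second("Explain your specialty in a short paragraph.")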
Managing Ollama Models
# List all models
ollama list
# Show model info
ollama show my-fine-tuned
# Remove a model
ollama rm my-fine-tuned
# Copy/rename a model
ollama cp my-fine-tuned my-fine-tuned-backup
# Push to Ollama registry (requires account)
ollama push username/my-fine-tuned
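The same operations are exposed through the local HTTP API if you prefer to script them. A minimal sketch, assuming the /api/copy and /api/delete endpoints of a current Ollama release:

import requests

BASE = "http://localhost:11434"

# Back up the model under a new name, then remove the copy again
requests.post(f"{BASE}/api/copy", json={"source": "my-fine-tuned", "destination": "my-fine-tuned-backup"})
# Note: older Ollama releases expect {"name": ...} instead of {"model": ...}
requests.delete(f"{BASE}/api/delete", json={"model": "my-fine-tuned-backup"})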
Troubleshooting Deployment
| Issue | Solution |
|---|---|
| GGUF conversion fails | Ensure llama.cpp is installed, try different quantization |
| Wrong chat format | Update TEMPLATE in Modelfile to match training format |
| Model too slow | Use smaller quantization (q4_k_m) or reduce context |
| Model too large | Use more aggressive quantization |
| Ollama can't find model | Check Modelfile FROM path is correct |
Tip: Test your GGUF with llama.cpp directly before creating the Ollama model. This helps isolate whether issues are with the export or the Modelfile configuration.
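One way to do that from Python, assuming a llama-cli binary built from llama.cpp is on your PATH (the CLI name and flags can vary between llama.cpp versions):

import subprocess

# Run the GGUF directly with llama.cpp to rule out Modelfile issues
subprocess.run(
    [
        "llama-cli",
        "-m", "./outputs/gguf/unsloth.Q4_K_M.gguf",
        "-p", "Your test prompt here",
        "-n", "64",  # limit the number of generated tokens
    ],
    check=True,
)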
Next, let's explore where to go from here on your learning journey.