Lesson 23 of 24

Evaluation & Deployment

Deploying to Ollama

3 min read

After fine-tuning, deploy your model locally with Ollama for fast, easy inference.

Export Pipeline Overview

Fine-tuned Model (HF format)
    → Merge LoRA weights
    → Convert to GGUF
    → Create Ollama Modelfile
    → Import to Ollama
    → Run locally!

Step 1: Merge LoRA Weights

First, merge the LoRA adapters with the base model:

from unsloth import FastLanguageModel

# Load your fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    "./outputs/fine-tuned",
    max_seq_length=2048,
    load_in_4bit=False,  # Load in 16-bit (not 4-bit) so the adapters merge cleanly
)

# Merge LoRA weights into base model
model.save_pretrained_merged(
    "./outputs/merged",
    tokenizer,
    save_method="merged_16bit",  # 16-bit merged weights
)

print("Merged model saved to ./outputs/merged")

Step 2: Convert to GGUF

GGUF is the quantized model format used by llama.cpp and Ollama. Unsloth can export to it directly:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "./outputs/fine-tuned",
    max_seq_length=2048,
    load_in_4bit=False,
)

# Export to GGUF with quantization
model.save_pretrained_gguf(
    "./outputs/gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of quality/size
)

Quantization Options

| Method | Size | Quality | Speed | Use Case |
| --- | --- | --- | --- | --- |
| q4_k_m | Small | Good | Fast | General use |
| q5_k_m | Medium | Better | Medium | Quality focus |
| q8_0 | Large | Best | Slower | Maximum quality |
| f16 | Largest | Perfect | Slowest | Reference |

# For different quantization levels
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
    model.save_pretrained_gguf(
        f"./outputs/gguf-{quant}",
        tokenizer,
        quantization_method=quant,
    )
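
A quick way to weigh these trade-offs for your own model is to compare the exported file sizes on disk. A small sketch, assuming the gguf-* output directories created by the loop above:

import os

# Print the size of each exported GGUF file in GB
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
    out_dir = f"./outputs/gguf-{quant}"
    for name in os.listdir(out_dir):
        if name.endswith(".gguf"):
            size_gb = os.path.getsize(os.path.join(out_dir, name)) / 1024**3
            print(f"{quant}: {name} -> {size_gb:.2f} GB")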

Step 3: Create Modelfile

Create an Ollama Modelfile to configure your model:

# Create Modelfile
cat > Modelfile << 'EOF'
# Base model from GGUF file
FROM ./outputs/gguf/unsloth.Q4_K_M.gguf

# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 2048

# System prompt (customize for your use case)
SYSTEM """You are a helpful assistant specialized in [your domain].
You provide accurate, concise, and helpful responses."""

# Chat template (match your training format)
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

# Stop tokens
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
EOF

Step 4: Import to Ollama

# Create the model in Ollama
ollama create my-fine-tuned -f Modelfile

# Verify it was created
ollama list

# Test the model
ollama run my-fine-tuned "Your test prompt here"
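
If ollama create fails with a connection error, the server may not be running (start it with ollama serve, or launch the desktop app). A quick programmatic check, assuming the default local address, queries the /api/tags endpoint, which lists installed models:

import requests

# Default Ollama server address; adjust if you set OLLAMA_HOST
try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is running. Installed models:", models)
except requests.ConnectionError:
    print("Ollama is not reachable. Start it with: ollama serve")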

Complete Deployment Script

Here's a complete script combining all steps:

from unsloth import FastLanguageModel
import subprocess
import os

def deploy_to_ollama(
    model_path: str,
    model_name: str,
    quantization: str = "q4_k_m",
    system_prompt: str = "You are a helpful assistant."
):
    """Deploy a fine-tuned model to Ollama"""

    print(f"Loading model from {model_path}...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_path,
        max_seq_length=2048,
        load_in_4bit=False,
    )

    # Export to GGUF
    gguf_dir = f"./outputs/{model_name}-gguf"
    print(f"Exporting to GGUF ({quantization})...")
    model.save_pretrained_gguf(
        gguf_dir,
        tokenizer,
        quantization_method=quantization,
    )

    # Find the GGUF file; use an absolute path so the FROM line in the
    # Modelfile resolves correctly no matter where ollama create is run
    gguf_file = None
    for f in os.listdir(gguf_dir):
        if f.endswith(".gguf"):
            gguf_file = os.path.abspath(os.path.join(gguf_dir, f))
            break

    if not gguf_file:
        raise FileNotFoundError("GGUF file not created")

    # Create Modelfile
    modelfile_content = f'''FROM {gguf_file}

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 2048

SYSTEM """{system_prompt}"""

TEMPLATE """{{{{ if .System }}}}<|start_header_id|>system<|end_header_id|>

{{{{ .System }}}}<|eot_id|>{{{{ end }}}}{{{{ if .Prompt }}}}<|start_header_id|>user<|end_header_id|>

{{{{ .Prompt }}}}<|eot_id|>{{{{ end }}}}<|start_header_id|>assistant<|end_header_id|>

{{{{ .Response }}}}<|eot_id|>"""

PARAMETER stop "<|eot_id|>"
'''

    modelfile_path = os.path.join(gguf_dir, "Modelfile")
    with open(modelfile_path, "w") as f:
        f.write(modelfile_content)

    # Create Ollama model
    print(f"Creating Ollama model '{model_name}'...")
    subprocess.run(["ollama", "create", model_name, "-f", modelfile_path], check=True)

    print(f"Model '{model_name}' ready! Run with: ollama run {model_name}")
    return model_name

# Usage
deploy_to_ollama(
    model_path="./outputs/fine-tuned",
    model_name="my-domain-expert",
    quantization="q4_k_m",
    system_prompt="You are an expert in [your domain]. Provide detailed, accurate responses."
)

Testing Your Deployed Model

Interactive Testing

# Start interactive chat
ollama run my-fine-tuned

# Test specific prompts
echo "Your test question" | ollama run my-fine-tuned

API Testing

import requests

def query_ollama(prompt: str, model: str = "my-fine-tuned"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Test
response = query_ollama("What is your specialty?")
print(response)
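
If you fine-tuned on multi-turn conversations, the /api/chat endpoint is often a better fit than /api/generate: it takes a list of messages and applies the Modelfile's chat template for you. A sketch against the same local server:

import requests

def chat_ollama(messages: list, model: str = "my-fine-tuned"):
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False
        }
    )
    return response.json()["message"]["content"]

# Test with an explicit conversation
reply = chat_ollama([
    {"role": "user", "content": "What is your specialty?"}
])
print(reply)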

Benchmarking

import time

def benchmark_model(model: str, prompts: list, num_runs: int = 3):
    """Benchmark inference speed"""
    times = []

    for prompt in prompts:
        for _ in range(num_runs):
            start = time.time()
            query_ollama(prompt, model)
            times.append(time.time() - start)

    avg_time = sum(times) / len(times)
    print(f"Average response time: {avg_time:.2f}s")
    print(f"Throughput: {1/avg_time:.2f} responses/second")

test_prompts = [
    "Short question?",
    "A medium length question that requires more thought and detail in the response.",
    "A longer, more complex question that tests the model's ability to handle detailed queries with multiple parts."
]

benchmark_model("my-fine-tuned", test_prompts)
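
Wall-clock time depends heavily on how long each response is. For a length-independent number, the non-streaming /api/generate response also includes token counts and timings (eval_count and eval_duration, the latter in nanoseconds), which give tokens per second. A small sketch, assuming the same local endpoint:

import requests

def tokens_per_second(prompt: str, model: str = "my-fine-tuned"):
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    # eval_count = generated tokens, eval_duration = generation time in ns
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

print(f"Generation speed: {tokens_per_second('Explain GGUF in one paragraph.'):.1f} tokens/second")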

Managing Ollama Models

# List all models
ollama list

# Show model info
ollama show my-fine-tuned

# Remove a model
ollama rm my-fine-tuned

# Copy/rename a model
ollama cp my-fine-tuned my-fine-tuned-backup

# Push to Ollama registry (requires account)
ollama push username/my-fine-tuned

Troubleshooting Deployment

| Issue | Solution |
| --- | --- |
| GGUF conversion fails | Ensure llama.cpp is installed; try a different quantization |
| Wrong chat format | Update TEMPLATE in the Modelfile to match your training format |
| Model too slow | Use smaller quantization (q4_k_m) or reduce context |
| Model too large | Use more aggressive quantization |
| Ollama can't find model | Check that the Modelfile FROM path is correct |

Tip: Test your GGUF with llama.cpp directly before creating the Ollama model. This helps isolate whether issues are with the export or the Modelfile configuration.
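
A minimal sketch of that check, assuming llama.cpp is installed and its llama-cli binary is on your PATH (older builds name the binary main), using the file exported in Step 2:

import subprocess

# Run a short completion directly through llama.cpp to validate the GGUF export
subprocess.run([
    "llama-cli",
    "-m", "./outputs/gguf/unsloth.Q4_K_M.gguf",  # GGUF file from Step 2
    "-p", "Hello, who are you?",                 # test prompt
    "-n", "64",                                  # max tokens to generate
], check=True)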

Next, let's explore where to take your learning journey from here.

Quiz

Module 6: Evaluation & Deployment

Take Quiz