Evaluation & Deployment
Evaluating Fine-tuned Models
After fine-tuning, you need to measure whether your model actually improved. Let's explore evaluation strategies and metrics.
Why Evaluation Matters
Fine-tuning can go wrong in many ways:
| Issue | Result |
|---|---|
| Overfitting | Great on training data, bad on new data |
| Underfitting | No improvement at all |
| Catastrophic forgetting | Lost general knowledge |
| Style collapse | Repetitive or generic outputs |
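A quick first check for the first two failure modes is to compare training and validation loss from the trainer's logs. Here is a minimal sketch, assuming a Hugging Face `Trainer`-style run (such as TRL's `SFTTrainer`) with an `eval_dataset` configured, so that both training loss and `eval_loss` appear in `trainer.state.log_history`:

```python
# Rough heuristics for overfitting/underfitting, read from a finished
# Trainer/SFTTrainer run (assumes `trainer` exists and eval logging was on).
train_losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
eval_losses = [log["eval_loss"] for log in trainer.state.log_history if "eval_loss" in log]

if eval_losses and eval_losses[-1] > 1.1 * min(eval_losses):
    print("Validation loss is climbing again -> possible overfitting")
if train_losses and train_losses[-1] > 0.95 * train_losses[0]:
    print("Training loss barely moved -> possible underfitting")
```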
Automatic Metrics
Loss and Perplexity
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("./outputs/fine-tuned")
tokenizer = AutoTokenizer.from_pretrained("./outputs/fine-tuned")

def calculate_perplexity(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss
    perplexity = torch.exp(loss)
    return perplexity.item()

# Lower perplexity = better
test_texts = [
    "Your domain-specific test text here",
    "Another test example",
]

for text in test_texts:
    ppl = calculate_perplexity(text)
    print(f"Perplexity: {ppl:.2f}")
```
Task-Specific Metrics
```python
import evaluate  # datasets.load_metric is deprecated; use the evaluate library

# For classification tasks
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# For generation tasks
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Example: ROUGE for summarization
predictions = ["The model generated this summary"]
references = ["The reference summary"]

rouge_scores = rouge.compute(
    predictions=predictions,
    references=references,
)
print(rouge_scores)
```
Comparing Base vs Fine-tuned
Always compare against the base model:
```python
from unsloth import FastLanguageModel

# Load both models
base_model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

fine_tuned_model, _ = FastLanguageModel.from_pretrained(
    "./outputs/fine-tuned",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Prepare for inference
FastLanguageModel.for_inference(base_model)
FastLanguageModel.for_inference(fine_tuned_model)

# Test prompt
test_prompt = "Your domain-specific question here"

def generate(model, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,  # required for temperature to have an effect
        temperature=0.7,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print("=== Base Model ===")
print(generate(base_model, test_prompt))
print("\n=== Fine-tuned Model ===")
print(generate(fine_tuned_model, test_prompt))
```
Evaluation Framework
Create a systematic evaluation:
```python
import json
from datetime import datetime

class ModelEvaluator:
    def __init__(self, model, tokenizer, model_name):
        self.model = model
        self.tokenizer = tokenizer
        self.model_name = model_name
        self.results = []

    def evaluate_prompt(self, prompt, expected_keywords=None):
        """Evaluate a single prompt"""
        response = self.generate(prompt)
        result = {
            "prompt": prompt,
            "response": response,
            "response_length": len(response),
            "timestamp": datetime.now().isoformat(),
        }
        # Check for expected keywords
        if expected_keywords:
            found = sum(1 for kw in expected_keywords if kw.lower() in response.lower())
            result["keyword_match"] = found / len(expected_keywords)
        self.results.append(result)
        return result

    def generate(self, prompt):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=200)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

    def save_results(self, filename):
        with open(filename, "w") as f:
            json.dump(self.results, f, indent=2)

# Usage
evaluator = ModelEvaluator(fine_tuned_model, tokenizer, "my-fine-tuned")

test_cases = [
    {"prompt": "Question 1", "expected_keywords": ["keyword1", "keyword2"]},
    {"prompt": "Question 2", "expected_keywords": ["keyword3"]},
]

for test in test_cases:
    evaluator.evaluate_prompt(test["prompt"], test["expected_keywords"])

evaluator.save_results("evaluation_results.json")
```
Human Evaluation
For subjective quality, human evaluation is essential:
Evaluation Criteria
| Criterion | Description | Score (1-5) |
|---|---|---|
| Relevance | Does it answer the question? | |
| Accuracy | Is the information correct? | |
| Fluency | Is it well-written? | |
| Helpfulness | Is it useful? | |
| Safety | Is it appropriate? | |
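If several people rate the same outputs, it helps to record scores in a consistent structure so they can be averaged later. Below is a minimal sketch; the field names and helper function are illustrative, not a standard schema:

```python
import json

# Sketch: record one human rating per (prompt, response) pair.
# Criterion names and fields are illustrative, not a standard schema.
CRITERIA = ["relevance", "accuracy", "fluency", "helpfulness", "safety"]

def record_rating(prompt, response, scores, rater_id):
    assert set(scores) == set(CRITERIA), "score every criterion from 1 to 5"
    return {
        "prompt": prompt,
        "response": response,
        "rater": rater_id,
        "scores": scores,
        "mean_score": sum(scores.values()) / len(scores),
    }

rating = record_rating(
    "Question 1",
    "Model response here",
    {"relevance": 4, "accuracy": 5, "fluency": 4, "helpfulness": 4, "safety": 5},
    rater_id="rater-01",
)
print(json.dumps(rating, indent=2))
```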
A/B Testing
```python
import random

def blind_comparison(prompt, model_a, model_b):
    """Generate responses from both models in random order.

    model_a is assumed to be the fine-tuned model, model_b the base model.
    Reuses the generate() helper defined in the comparison snippet above.
    """
    response_a = generate(model_a, prompt)
    response_b = generate(model_b, prompt)

    # Randomize order so the evaluator can't tell which model is which
    if random.random() > 0.5:
        return [
            {"label": "A", "response": response_a, "model": "fine-tuned"},
            {"label": "B", "response": response_b, "model": "base"},
        ]
    else:
        return [
            {"label": "A", "response": response_b, "model": "base"},
            {"label": "B", "response": response_a, "model": "fine-tuned"},
        ]

# Human evaluators see responses labeled A and B and record which they prefer,
# without knowing which model produced each.
```
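Once you have collected blinded judgments, the number to report is the fine-tuned model's win rate. A small sketch, assuming each judgment stores the entry (as returned by `blind_comparison`) that the rater preferred:

```python
# Sketch: fine-tuned win rate from blinded judgments. Each judgment is the
# dict the rater preferred, as returned by blind_comparison() above.
def win_rate(judgments):
    wins = sum(1 for j in judgments if j["model"] == "fine-tuned")
    return wins / len(judgments)

# Hypothetical judgments from three raters
judgments = [
    {"label": "A", "model": "fine-tuned"},
    {"label": "B", "model": "fine-tuned"},
    {"label": "A", "model": "base"},
]
print(f"Fine-tuned win rate: {win_rate(judgments):.0%}")
```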
Evaluation Checklist
Before deploying, verify:
- Loss decreased during training
- Validation loss didn't diverge (no overfitting)
- Model responds correctly to domain-specific prompts
- Model still handles general prompts (no catastrophic forgetting)
- Response style matches expectations
- No harmful or inappropriate outputs
- Inference speed is acceptable
Tip: Create a standardized test set of 20-50 prompts covering all your use cases. Run this evaluation after every training run to track improvements.
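One way to operationalize that tip is to reuse the `ModelEvaluator` class above and archive results per run. A minimal sketch; the `test_set.json` file and the run-naming scheme are assumptions, not fixed conventions:

```python
import json
from datetime import datetime

# Sketch: run a fixed test set after every training run and archive the results.
# Reuses the ModelEvaluator class above; "test_set.json" is an assumed file
# holding a list of {"prompt": ..., "expected_keywords": [...]} entries you maintain.
with open("test_set.json") as f:
    test_set = json.load(f)

run_id = datetime.now().strftime("%Y%m%d-%H%M%S")
evaluator = ModelEvaluator(fine_tuned_model, tokenizer, f"fine-tuned-{run_id}")

for case in test_set:
    evaluator.evaluate_prompt(case["prompt"], case.get("expected_keywords"))

evaluator.save_results(f"evaluation-{run_id}.json")
```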
Next, let's learn how to diagnose and fix common training issues.