Lesson 21 of 24

Evaluation & Deployment

Evaluating Fine-tuned Models

3 min read

After fine-tuning, you need to measure whether your model actually improved. Let's explore evaluation strategies and metrics.

Why Evaluation Matters

Fine-tuning can go wrong in many ways:

  • Overfitting: great on training data, bad on new data
  • Underfitting: no improvement at all
  • Catastrophic forgetting: lost general knowledge
  • Style collapse: repetitive or generic outputs
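
A quick first check for overfitting or underfitting is the gap between the training and validation loss that your trainer logged. Below is a minimal sketch, assuming you trained with a Hugging Face Trainer (or TRL's SFTTrainer) with an eval split and that a trainer_state.json was saved, for example inside a checkpoint folder under ./outputs/fine-tuned; the exact path is an assumption you should adjust.

import json

# Assumed path: point this at wherever your trainer wrote trainer_state.json,
# e.g. an output or checkpoint directory
with open("./outputs/fine-tuned/checkpoint-500/trainer_state.json") as f:
    state = json.load(f)

# log_history has one entry per logging step: training entries carry "loss",
# evaluation entries carry "eval_loss"
train_losses = [e["loss"] for e in state["log_history"] if "loss" in e]
eval_losses = [e["eval_loss"] for e in state["log_history"] if "eval_loss" in e]

if train_losses and eval_losses:
    print(f"Final train loss: {train_losses[-1]:.4f}")
    print(f"Final eval loss:  {eval_losses[-1]:.4f}")
    # Rough heuristic: a large gap hints at overfitting,
    # little movement in either curve hints at underfitting
    if eval_losses[-1] - train_losses[-1] > 0.5:
        print("Warning: large train/eval gap, possible overfitting")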

Automatic Metrics

Loss and Perplexity

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("./outputs/fine-tuned")
tokenizer = AutoTokenizer.from_pretrained("./outputs/fine-tuned")

def calculate_perplexity(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
        perplexity = torch.exp(loss)
    return perplexity.item()

# Lower perplexity = better
test_texts = [
    "Your domain-specific test text here",
    "Another test example"
]

for text in test_texts:
    ppl = calculate_perplexity(text)
    print(f"Perplexity: {ppl:.2f}")

Task-Specific Metrics

# The datasets.load_metric API is deprecated; metrics now live in the
# evaluate library (pip install evaluate rouge_score)
import evaluate

# For classification tasks
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# For generation tasks
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Example: Rouge for summarization
predictions = ["The model generated this summary"]
references = ["The reference summary"]

rouge_scores = rouge.compute(
    predictions=predictions,
    references=references
)
print(rouge_scores)
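
The classification metrics work the same way once you have predictions and gold labels as label ids. A quick sketch with toy labels (the 0/1 values below are purely illustrative, not from a real model):

# Toy labels for illustration only
preds = [0, 1, 1, 0, 1]
labels = [0, 1, 0, 0, 1]

print(accuracy.compute(predictions=preds, references=labels))  # {'accuracy': 0.8}
print(f1.compute(predictions=preds, references=labels))        # {'f1': 0.8}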

Comparing Base vs Fine-tuned

Always compare against the base model:

from unsloth import FastLanguageModel

# Load both models
base_model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

fine_tuned_model, _ = FastLanguageModel.from_pretrained(
    "./outputs/fine-tuned",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Prepare for inference
FastLanguageModel.for_inference(base_model)
FastLanguageModel.for_inference(fine_tuned_model)

# Test prompt
test_prompt = "Your domain-specific question here"

def generate(model, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,   # needed for temperature to take effect
        temperature=0.7,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print("=== Base Model ===")
print(generate(base_model, test_prompt))

print("\n=== Fine-tuned Model ===")
print(generate(fine_tuned_model, test_prompt))
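
Perplexity gives a complementary, fully automatic comparison between the two models. Here is a sketch that reuses the test_texts list from the perplexity example above; note this helper takes the model as an argument, unlike the earlier one:

import torch

def perplexity(model, text):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

for text in test_texts:
    base_ppl = perplexity(base_model, text)
    ft_ppl = perplexity(fine_tuned_model, text)
    print(f"base: {base_ppl:.2f} | fine-tuned: {ft_ppl:.2f}")

On domain-specific text you would expect the fine-tuned model to score lower; if it scores much higher than the base model on general text, that is a hint of catastrophic forgetting.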

Evaluation Framework

Create a systematic evaluation:

import json
from datetime import datetime

class ModelEvaluator:
    def __init__(self, model, tokenizer, model_name):
        self.model = model
        self.tokenizer = tokenizer
        self.model_name = model_name
        self.results = []

    def evaluate_prompt(self, prompt, expected_keywords=None):
        """Evaluate a single prompt"""
        response = self.generate(prompt)

        result = {
            "model": self.model_name,
            "prompt": prompt,
            "response": response,
            "response_length": len(response),
            "timestamp": datetime.now().isoformat()
        }

        # Check for expected keywords
        if expected_keywords:
            found = sum(1 for kw in expected_keywords if kw.lower() in response.lower())
            result["keyword_match"] = found / len(expected_keywords)

        self.results.append(result)
        return result

    def generate(self, prompt):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=200)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

    def save_results(self, filename):
        with open(filename, "w") as f:
            json.dump(self.results, f, indent=2)

# Usage
evaluator = ModelEvaluator(fine_tuned_model, tokenizer, "my-fine-tuned")

test_cases = [
    {"prompt": "Question 1", "expected_keywords": ["keyword1", "keyword2"]},
    {"prompt": "Question 2", "expected_keywords": ["keyword3"]},
]

for test in test_cases:
    evaluator.evaluate_prompt(test["prompt"], test["expected_keywords"])

evaluator.save_results("evaluation_results.json")
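
To turn this into a number you can track across runs, run the same test cases through an evaluator for each model and average the keyword_match scores. A sketch, assuming you build a second evaluator (base_evaluator below is a new name) around the base model loaded earlier:

# Evaluate the base model on the same test cases for comparison
base_evaluator = ModelEvaluator(base_model, tokenizer, "base")
for test in test_cases:
    base_evaluator.evaluate_prompt(test["prompt"], test["expected_keywords"])

def mean_keyword_match(results):
    scores = [r["keyword_match"] for r in results if "keyword_match" in r]
    return sum(scores) / len(scores) if scores else 0.0

print(f"Base keyword match:       {mean_keyword_match(base_evaluator.results):.2f}")
print(f"Fine-tuned keyword match: {mean_keyword_match(evaluator.results):.2f}")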

Human Evaluation

For subjective quality, human evaluation is essential:

Evaluation Criteria

Rate each response from 1 to 5 on each criterion:

  • Relevance: does it answer the question?
  • Accuracy: is the information correct?
  • Fluency: is it well-written?
  • Helpfulness: is it useful?
  • Safety: is it appropriate?
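
A lightweight way to collect these scores is a small rating loop that shows each response and writes one row per criterion to a CSV for later analysis. A minimal sketch, assuming the responses come from the evaluation_results.json file saved above (human_ratings.csv is just an example filename):

import csv
import json

CRITERIA = ["relevance", "accuracy", "fluency", "helpfulness", "safety"]

with open("evaluation_results.json") as f:
    results = json.load(f)

with open("human_ratings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "criterion", "score"])
    for r in results:
        print(f"\nPrompt: {r['prompt']}\nResponse: {r['response']}")
        for criterion in CRITERIA:
            score = input(f"{criterion} (1-5): ")  # rater types a score
            writer.writerow([r["prompt"], criterion, score])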

A/B Testing

import random

def blind_comparison(prompt, model_a, model_b):
    """Generate responses from both models, then present them in a random order"""
    # model_a is expected to be the fine-tuned model, model_b the base model
    response_a = generate(model_a, prompt)
    response_b = generate(model_b, prompt)

    # Randomize order
    if random.random() > 0.5:
        return [
            {"label": "A", "response": response_a, "model": "fine-tuned"},
            {"label": "B", "response": response_b, "model": "base"}
        ]
    else:
        return [
            {"label": "A", "response": response_b, "model": "base"},
            {"label": "B", "response": response_a, "model": "fine-tuned"}
        ]

# Human evaluator sees responses labeled A and B
# Records which they prefer without knowing which model produced each
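
After raters record which label they preferred, map the labels back to the models and compute a win rate for the fine-tuned model. A sketch, where each vote is assumed to be the pair returned by blind_comparison plus the label the rater chose:

# Assumed vote format: (pair, preferred_label), with pair from blind_comparison
# and preferred_label either "A" or "B"
def win_rate(votes):
    wins = 0
    for pair, preferred_label in votes:
        chosen = next(item for item in pair if item["label"] == preferred_label)
        if chosen["model"] == "fine-tuned":
            wins += 1
    return wins / len(votes)

# After a review session:
# print(f"Fine-tuned win rate: {win_rate(votes):.0%}")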

Evaluation Checklist

Before deploying, verify:

  • Loss decreased during training
  • Validation loss didn't diverge (no overfitting)
  • Model responds correctly to domain-specific prompts
  • Model still handles general prompts (no catastrophic forgetting)
  • Response style matches expectations
  • No harmful or inappropriate outputs
  • Inference speed is acceptable
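
For the last item, a rough latency check is easy to script: time one generation call and divide by the number of new tokens. A minimal sketch (the 50-token budget and reuse of test_prompt are just examples):

import time

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.time()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.time() - start
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

print(f"{tokens_per_second(fine_tuned_model, tokenizer, test_prompt):.1f} tokens/sec")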

Tip: Create a standardized test set of 20-50 prompts covering all your use cases. Run this evaluation after every training run to track improvements.

Next, let's learn how to diagnose and fix common training issues.
