MLflow for LLM Evaluation

Built-in LLM Scorers


MLflow provides ready-to-use scorers for common LLM evaluation tasks. These scorers use LLM-as-Judge patterns internally to assess response quality.

Available Built-in Scorers

Scorer               What It Measures
answer_correctness   Factual accuracy against the expected answer
answer_relevance     How relevant the answer is to the question
faithfulness         Whether the answer is grounded in the provided context
answer_similarity    Semantic similarity to the expected answer
toxicity             Harmful or offensive content

Using Built-in Scorers

Basic Evaluation

from mlflow.genai.scorers import (
    answer_correctness,
    answer_relevance,
    faithfulness
)
from mlflow.genai import evaluate

# Prepare evaluation data
eval_data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": {"answer": "Paris is the capital of France."},
        "expectations": {"expected_answer": "Paris"}
    },
    {
        "inputs": {"question": "Who wrote Hamlet?"},
        "outputs": {"answer": "Shakespeare wrote Hamlet."},
        "expectations": {"expected_answer": "William Shakespeare"}
    }
]

# Run evaluation with built-in scorers
results = evaluate(
    data=eval_data,
    scorers=[
        answer_correctness(),
        answer_relevance(),
        faithfulness()
    ]
)

Correctness Scorer

Checks if the response is factually correct:

from mlflow.genai.scorers import answer_correctness

# The scorer compares outputs.answer against expectations.expected_answer
scorer = answer_correctness()

# Returns scores like:
# {
#     "answer_correctness": 0.95,
#     "reasoning": "The answer correctly identifies Paris as the capital..."
# }
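To build intuition, the sketch below feeds the scorer an intentionally wrong answer; the judge should return a low score with a rationale explaining the mismatch (exact values and wording depend on the judge model):

from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_correctness

# Illustrative row: the answer contradicts the expected answer
wrong_row = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": {"answer": "Lyon is the capital of France."},
        "expectations": {"expected_answer": "Paris"}
    }
]

# The judge should assign a low correctness score to this row
results = evaluate(data=wrong_row, scorers=[answer_correctness()])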

Relevance Scorer

Measures how relevant the answer is to the question:

from mlflow.genai.scorers import answer_relevance

scorer = answer_relevance()

# High score: Answer directly addresses the question
# Low score: Answer is off-topic or tangential
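A quick way to see this in practice is to score one on-topic and one off-topic answer to the same question; the second row should come back noticeably lower (a sketch using the evaluation API shown above; exact scores depend on the judge model):

from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_relevance

relevance_data = [
    {
        # Directly addresses the question -> expect a high relevance score
        "inputs": {"question": "How do I reset my password?"},
        "outputs": {"answer": "Click 'Forgot password' on the login page and follow the emailed link."}
    },
    {
        # Tangential answer -> expect a low relevance score
        "inputs": {"question": "How do I reset my password?"},
        "outputs": {"answer": "Our support team is based in Austin and works across three time zones."}
    }
]

results = evaluate(data=relevance_data, scorers=[answer_relevance()])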

Faithfulness Scorer

Designed for RAG systems, this scorer checks whether the answer is grounded in the provided context:

from mlflow.genai.scorers import faithfulness

eval_data = [
    {
        "inputs": {
            "question": "What are the store hours?",
            "context": "Our store is open Monday-Friday, 9 AM to 6 PM."
        },
        "outputs": {
            "answer": "The store is open weekdays from 9 AM to 6 PM."
        }
    }
]

results = evaluate(
    data=eval_data,
    scorers=[faithfulness()]  # Checks answer against context
)
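The scorer is most valuable for catching answers that go beyond the retrieved context. In the illustrative row below, the answer invents weekend hours that the context never mentions, so faithfulness should flag it (exact scoring depends on the judge model):

ungrounded_data = [
    {
        "inputs": {
            "question": "What are the store hours?",
            "context": "Our store is open Monday-Friday, 9 AM to 6 PM."
        },
        "outputs": {
            # Claims weekend hours that the context does not support
            "answer": "The store is open every day, including Saturdays until 8 PM."
        }
    }
]

results = evaluate(
    data=ungrounded_data,
    scorers=[faithfulness()]  # Should score this row low: the answer is not grounded
)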

Configuring Scorers

Customize scorer behavior:

from mlflow.genai.scorers import answer_correctness

# Specify which model to use as judge
scorer = answer_correctness(
    model="openai:/gpt-4o",  # Judge model
    examples=[  # Few-shot examples for better judging
        {
            "question": "What is 2+2?",
            "answer": "The answer is 4.",
            "expected": "4",
            "score": 1.0,
            "reasoning": "Correct answer"
        }
    ]
)
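The configured scorer is passed to evaluate() like any other; the judge model and few-shot examples above then apply to every row:

from mlflow.genai import evaluate

results = evaluate(
    data=eval_data,
    scorers=[scorer]  # Uses the gpt-4o judge and the few-shot examples configured above
)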

Combining Multiple Scorers

Run several scorers in a single evaluation:

from mlflow.genai.scorers import (
    answer_correctness,
    answer_relevance,
    answer_similarity
)
from mlflow.genai import evaluate

results = evaluate(
    data=eval_data,
    scorers=[
        answer_correctness(),
        answer_relevance(),
        answer_similarity()
    ]
)

# Access individual scores
print(results.tables["eval_results"])
# Shows columns: answer_correctness, answer_relevance, answer_similarity

Viewing Results

Results integrate with MLflow tracking:

import mlflow

with mlflow.start_run():
    results = evaluate(
        data=eval_data,
        scorers=[answer_correctness()]
    )

    # Metrics automatically logged
    # View in MLflow UI or access programmatically
    print(f"Mean correctness: {results.metrics['answer_correctness/mean']}")

When to Use Built-in Scorers

Use Case             Recommended Scorers
Q&A systems          answer_correctness, answer_relevance
RAG applications     faithfulness, answer_correctness (see the sketch below)
Summarization        answer_similarity, faithfulness
Content moderation   toxicity
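As a concrete version of the RAG row above, the sketch below combines a grounding check with a correctness check in a single pass by supplying both the retrieved context and an expected answer (following the data layout used earlier):

from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_correctness, faithfulness

rag_eval_data = [
    {
        "inputs": {
            "question": "What are the store hours?",
            "context": "Our store is open Monday-Friday, 9 AM to 6 PM."
        },
        "outputs": {"answer": "The store is open weekdays from 9 AM to 6 PM."},
        "expectations": {"expected_answer": "Monday-Friday, 9 AM to 6 PM"}
    }
]

results = evaluate(
    data=rag_eval_data,
    scorers=[faithfulness(), answer_correctness()]  # Grounding plus factual accuracy
)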

Tip: Start with built-in scorers. Only create custom judges when you need domain-specific evaluation criteria.

Next, we'll learn how to create custom judges for specialized evaluation needs.
