MLflow for LLM Evaluation
Built-in LLM Scorers
MLflow provides ready-to-use scorers for common LLM evaluation tasks. These scorers use LLM-as-Judge patterns internally to assess response quality.
Available Built-in Scorers
| Scorer | What It Measures |
|---|---|
| answer_correctness | Factual accuracy against expected answer |
| answer_relevance | How relevant the answer is to the question |
| faithfulness | Whether the answer is grounded in context |
| answer_similarity | Semantic similarity to expected answer |
| toxicity | Harmful or offensive content |
Using Built-in Scorers
Basic Evaluation
from mlflow.genai.scorers import (
    answer_correctness,
    answer_relevance,
    faithfulness
)
from mlflow.genai import evaluate

# Prepare evaluation data
eval_data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": {"answer": "Paris is the capital of France."},
        "expectations": {"expected_answer": "Paris"}
    },
    {
        "inputs": {"question": "Who wrote Hamlet?"},
        "outputs": {"answer": "Shakespeare wrote Hamlet."},
        "expectations": {"expected_answer": "William Shakespeare"}
    }
]

# Run evaluation with built-in scorers
results = evaluate(
    data=eval_data,
    scorers=[
        answer_correctness(),
        answer_relevance(),
        faithfulness()
    ]
)
Correctness Scorer
Checks if the response is factually correct:
from mlflow.genai.scorers import answer_correctness
# The scorer compares outputs.answer against expectations.expected_answer
scorer = answer_correctness()
# Returns scores like:
# {
# "answer_correctness": 0.95,
# "reasoning": "The answer correctly identifies Paris as the capital..."
# }
Relevance Scorer
Measures how relevant the answer is to the question:
from mlflow.genai.scorers import answer_relevance
scorer = answer_relevance()
# High score: Answer directly addresses the question
# Low score: Answer is off-topic or tangential
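For example, a tangential answer should receive a noticeably lower relevance score than one that addresses the question directly. A minimal sketch, reusing the evaluate() and scorer calls shown above; the question and answers are illustrative, and the actual scores depend on the judge model:

from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_relevance

relevance_data = [
    {
        # Directly addresses the question, so it should score high
        "inputs": {"question": "How do I reset my password?"},
        "outputs": {"answer": "Click 'Forgot password' on the login page and follow the emailed link."}
    },
    {
        # Tangential to the question, so it should score low
        "inputs": {"question": "How do I reset my password?"},
        "outputs": {"answer": "Our company was founded in 2015 and has offices in three countries."}
    }
]

results = evaluate(data=relevance_data, scorers=[answer_relevance()])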
Faithfulness Scorer
Designed for RAG systems, this scorer checks whether the answer is grounded in the provided context:
from mlflow.genai.scorers import faithfulness

eval_data = [
    {
        "inputs": {
            "question": "What are the store hours?",
            "context": "Our store is open Monday-Friday, 9 AM to 6 PM."
        },
        "outputs": {
            "answer": "The store is open weekdays from 9 AM to 6 PM."
        }
    }
]

results = evaluate(
    data=eval_data,
    scorers=[faithfulness()]  # Checks answer against context
)
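Conversely, an answer that introduces claims the context does not support should score poorly. A sketch of such a case, reusing the data layout and scorer call from above; the exact score depends on the judge model:

unfaithful_data = [
    {
        "inputs": {
            "question": "What are the store hours?",
            "context": "Our store is open Monday-Friday, 9 AM to 6 PM."
        },
        "outputs": {
            # Claims weekend hours that the context does not mention,
            # so the faithfulness scorer should penalize this answer
            "answer": "The store is open every day, including weekends, from 9 AM to 6 PM."
        }
    }
]

results = evaluate(data=unfaithful_data, scorers=[faithfulness()])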
Configuring Scorers
Customize scorer behavior:
from mlflow.genai.scorers import answer_correctness

# Specify which model to use as judge
scorer = answer_correctness(
    model="openai:/gpt-4o",  # Judge model
    examples=[  # Few-shot examples for better judging
        {
            "question": "What is 2+2?",
            "answer": "The answer is 4.",
            "expected": "4",
            "score": 1.0,
            "reasoning": "Correct answer"
        }
    ]
)
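A configured scorer is passed to evaluate() exactly like a default one. A short sketch, assuming the eval_data and scorer defined above (the judge model URI format follows the example and may differ across MLflow versions):

from mlflow.genai import evaluate

# Use the configured scorer in place of the default answer_correctness()
results = evaluate(
    data=eval_data,
    scorers=[scorer]
)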
Combining Multiple Scorers
Run several scorers in a single evaluation:
from mlflow.genai.scorers import (
    answer_correctness,
    answer_relevance,
    answer_similarity
)
from mlflow.genai import evaluate

results = evaluate(
    data=eval_data,
    scorers=[
        answer_correctness(),
        answer_relevance(),
        answer_similarity()
    ]
)

# Access individual scores
print(results.tables["eval_results"])
# Shows columns: answer_correctness, answer_relevance, answer_similarity
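Because the results table can be sliced like a regular DataFrame, you can pull out the weakest rows for review. A sketch, assuming the table is a pandas DataFrame and its score columns are named after the scorers as in the comment above:

# Inspect rows where the judged correctness is low
df = results.tables["eval_results"]
low_scores = df[df["answer_correctness"] < 0.5]
print(low_scores)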
Viewing Results
Results integrate with MLflow tracking:
import mlflow

with mlflow.start_run():
    results = evaluate(
        data=eval_data,
        scorers=[answer_correctness()]
    )

# Metrics automatically logged
# View in MLflow UI or access programmatically
print(f"Mean correctness: {results.metrics['answer_correctness/mean']}")
When to Use Built-in Scorers
| Use Case | Recommended Scorers |
|---|---|
| Q&A systems | answer_correctness, answer_relevance |
| RAG applications | faithfulness, answer_correctness |
| Summarization | answer_similarity, faithfulness |
| Content moderation | toxicity |
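For content moderation, the toxicity scorer listed in the table follows the same pattern as the others. A minimal sketch, assuming toxicity() is imported and constructed like the scorers above (an assumption, since this module's exact exports may vary by MLflow version):

from mlflow.genai import evaluate
from mlflow.genai.scorers import toxicity  # assumed import path, matching the table above

moderation_data = [
    {
        "inputs": {"question": "Summarize the customer review."},
        "outputs": {"answer": "The reviewer found the product sturdy but overpriced."}
    }
]

results = evaluate(data=moderation_data, scorers=[toxicity()])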
Tip: Start with built-in scorers. Only create custom judges when you need domain-specific evaluation criteria.
Next, we'll learn how to create custom judges for specialized evaluation needs.