MLflow for LLM Evaluation
DeepEval & RAGAS Integration
MLflow integrates with popular external evaluation frameworks through its get_judge API, letting you use specialized metrics from DeepEval and RAGAS within MLflow experiments.
Why Use External Frameworks?
| Framework | Specialization |
|---|---|
| DeepEval | LLM unit testing, G-Eval metrics |
| RAGAS | RAG-specific metrics (retrieval + generation) |
| MLflow | Experiment tracking, model registry |
Using the get_judge API
The get_judge function creates MLflow-compatible scorers from external frameworks:
```python
from mlflow.genai.judges import get_judge

# Get judges from external frameworks
ragas_faithfulness = get_judge(
    judge_type="ragas/faithfulness"
)

deepeval_coherence = get_judge(
    judge_type="deepeval/coherence"
)
```
RAGAS Integration
RAGAS provides metrics specifically designed for RAG systems:
```python
from mlflow.genai.judges import get_judge
from mlflow.genai import evaluate

# RAGAS metrics for RAG evaluation
ragas_scorers = [
    get_judge("ragas/faithfulness"),
    get_judge("ragas/answer_relevancy"),
    get_judge("ragas/context_precision"),
    get_judge("ragas/context_recall"),
]

# RAG evaluation data
eval_data = [
    {
        "inputs": {
            "question": "What is the return policy?",
            "contexts": [
                "Returns accepted within 30 days.",
                "Refunds processed in 5-7 business days.",
            ],
        },
        "outputs": {
            "answer": "You can return items within 30 days for a full refund."
        },
        "expectations": {
            "ground_truth": "30-day return policy with full refund."
        },
    }
]

results = evaluate(
    data=eval_data,
    scorers=ragas_scorers
)
```
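The example above scores pre-computed outputs. If you want MLflow to call your RAG app for each row instead, `mlflow.genai.evaluate` also accepts a prediction function. Here is a minimal sketch, assuming a hypothetical `answer_question` function and that your MLflow version supports the `predict_fn` argument:

```python
def answer_question(question: str, contexts: list[str]) -> str:
    # Hypothetical stand-in for your retrieval + generation pipeline
    return "You can return items within 30 days for a full refund."

# Each row's "inputs" dict is passed to predict_fn as keyword arguments,
# and the returned answer is scored by the RAGAS judges
results = evaluate(
    data=[{"inputs": {"question": "What is the return policy?",
                      "contexts": ["Returns accepted within 30 days."]}}],
    predict_fn=answer_question,
    scorers=ragas_scorers
)
```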
RAGAS Metrics Explained
| Metric | What It Measures |
|---|---|
| Faithfulness | Is the answer grounded in retrieved contexts? |
| Answer Relevancy | Does the answer address the question? |
| Context Precision | Are retrieved contexts relevant? |
| Context Recall | Do contexts contain needed information? |
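To build intuition for the first row of this table: faithfulness is roughly the fraction of claims in the answer that are supported by the retrieved contexts. The toy illustration below shows that ratio; RAGAS itself uses an LLM to extract and verify the claims, so the hand-written labels here are purely illustrative.

```python
# Toy faithfulness score: supported claims / total claims.
# The supported/unsupported labels are hand-labeled for illustration;
# RAGAS derives them with an LLM judge.
claims = [
    ("Items can be returned within 30 days.", True),  # backed by context 1
    ("Refunds take 5-7 business days.", True),        # backed by context 2
    ("Shipping is always free.", False),              # not in any context
]

supported = sum(1 for _, is_supported in claims if is_supported)
faithfulness = supported / len(claims)
print(f"faithfulness = {faithfulness:.2f}")  # 0.67
```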
DeepEval Integration
DeepEval provides unit testing patterns for LLMs:
```python
from mlflow.genai.judges import get_judge
from mlflow.genai import evaluate

# DeepEval metrics
deepeval_scorers = [
    get_judge("deepeval/coherence"),
    get_judge("deepeval/hallucination"),
    get_judge("deepeval/toxicity"),
]

eval_data = [
    {
        "inputs": {"question": "Explain quantum computing"},
        "outputs": {"answer": "Quantum computing uses qubits..."},
    }
]

results = evaluate(
    data=eval_data,
    scorers=deepeval_scorers
)
```
DeepEval Metrics Explained
| Metric | What It Measures |
|---|---|
| Coherence | Logical flow and clarity |
| Hallucination | Unsupported claims |
| Toxicity | Harmful content |
| Bias | Unfair stereotypes |
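For comparison, this is roughly what DeepEval's native unit-testing pattern looks like outside MLflow. The sketch below is based on DeepEval's pytest-style API; verify class names and thresholds against the current DeepEval docs.

```python
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, ToxicityMetric
from deepeval.test_case import LLMTestCase

def test_quantum_answer():
    test_case = LLMTestCase(
        input="Explain quantum computing",
        actual_output="Quantum computing uses qubits...",
        # HallucinationMetric compares the output against this context
        context=["Quantum computers use qubits, which can exist in superposition."],
    )
    # Fails the pytest test if any metric misses its threshold
    assert_test(test_case, [
        HallucinationMetric(threshold=0.5),
        ToxicityMetric(threshold=0.5),
    ])
```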
Combining All Frameworks
Use MLflow, RAGAS, and DeepEval together:
```python
from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_correctness
from mlflow.genai.judges import get_judge, make_judge

# MLflow built-in
mlflow_scorers = [answer_correctness()]

# RAGAS for RAG quality
ragas_scorers = [
    get_judge("ragas/faithfulness"),
    get_judge("ragas/context_precision"),
]

# DeepEval for safety
deepeval_scorers = [
    get_judge("deepeval/toxicity"),
]

# Custom for domain-specific criteria
custom_scorers = [
    make_judge(
        name="brand_alignment",
        judge_prompt="Does this match our brand voice? {{ outputs }}",
        output_type="numeric",
        output_range=(1, 5),
    )
]

# Run comprehensive evaluation
results = evaluate(
    data=eval_data,
    scorers=mlflow_scorers + ragas_scorers + deepeval_scorers + custom_scorers
)
```
Installation Requirements
```bash
# For RAGAS integration
pip install ragas

# For DeepEval integration
pip install deepeval

# MLflow with GenAI support (quoted so the shell doesn't treat >= as a redirect)
pip install "mlflow>=3.4.0"
```
Tracking Combined Results
Metrics from every framework are tracked together in a single MLflow run:
```python
import mlflow

# Combine the scorer lists defined above
all_scorers = mlflow_scorers + ragas_scorers + deepeval_scorers + custom_scorers

with mlflow.start_run(run_name="comprehensive-eval"):
    results = evaluate(
        data=eval_data,
        scorers=all_scorers
    )

# Metrics from all frameworks are logged to the run
print(results.metrics)
# {
#     'answer_correctness/mean': 0.85,
#     'faithfulness/mean': 0.92,
#     'context_precision/mean': 0.78,
#     'toxicity/mean': 0.02,
#     'brand_alignment/mean': 4.2
# }
```
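Because every framework's scores land in one `results.metrics` dictionary, you can gate a CI job or deployment on them. Below is a small sketch using illustrative thresholds and the metric keys shown above; actual key names may vary by framework and MLflow version.

```python
# Minimum acceptable scores (higher is better)
minimums = {
    "faithfulness/mean": 0.8,
    "answer_correctness/mean": 0.7,
}
# Maximum acceptable toxicity (lower is better)
max_toxicity = 0.05

failures = [
    name for name, floor in minimums.items()
    if results.metrics.get(name, 0.0) < floor
]
if results.metrics.get("toxicity/mean", 0.0) > max_toxicity:
    failures.append("toxicity/mean")

if failures:
    raise RuntimeError(f"Evaluation gate failed for: {failures}")
```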
When to Use Each Framework
| Need | Use |
|---|---|
| General LLM quality | MLflow built-in scorers |
| RAG system evaluation | RAGAS metrics |
| Safety and compliance | DeepEval metrics |
| Domain-specific criteria | MLflow make_judge |
Tip: Start with one framework, then add others as your evaluation needs grow.
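One way to do that incrementally is to keep the scorer lists behind simple toggles, so enabling another framework is a one-line change. A sketch reusing the scorer lists defined earlier:

```python
# Flip a toggle to bring another framework into the evaluation
enabled = {"mlflow": True, "ragas": True, "deepeval": False, "custom": False}

scorer_groups = {
    "mlflow": mlflow_scorers,
    "ragas": ragas_scorers,
    "deepeval": deepeval_scorers,
    "custom": custom_scorers,
}

active_scorers = [
    scorer
    for name, group in scorer_groups.items()
    if enabled[name]
    for scorer in group
]

results = evaluate(data=eval_data, scorers=active_scorers)
```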
With MLflow mastered, let's explore W&B Weave, another powerful evaluation platform.