MLflow for LLM Evaluation

DeepEval & RAGAS Integration


MLflow integrates with popular external evaluation frameworks through its get_judge API, letting you use specialized metrics from DeepEval and RAGAS within MLflow experiments.

Why Use External Frameworks?

Framework | Specialization
DeepEval | LLM unit testing, G-Eval metrics
RAGAS | RAG-specific metrics (retrieval + generation)
MLflow | Experiment tracking, model registry

Using get_judge API

The get_judge function creates MLflow-compatible scorers from external frameworks:

from mlflow.genai.judges import get_judge

# Get a judge from an external framework
ragas_faithfulness = get_judge(
    judge_type="ragas/faithfulness"
)

deepeval_coherence = get_judge(
    judge_type="deepeval/coherence"
)

RAGAS Integration

RAGAS provides metrics specifically designed for RAG systems:

from mlflow.genai.judges import get_judge
from mlflow.genai import evaluate

# RAGAS metrics for RAG evaluation
ragas_scorers = [
    get_judge("ragas/faithfulness"),
    get_judge("ragas/answer_relevancy"),
    get_judge("ragas/context_precision"),
    get_judge("ragas/context_recall")
]

# RAG evaluation data
eval_data = [
    {
        "inputs": {
            "question": "What is the return policy?",
            "contexts": [
                "Returns accepted within 30 days.",
                "Refunds processed in 5-7 business days."
            ]
        },
        "outputs": {
            "answer": "You can return items within 30 days for a full refund."
        },
        "expectations": {
            "ground_truth": "30-day return policy with full refund."
        }
    }
]

results = evaluate(
    data=eval_data,
    scorers=ragas_scorers
)

RAGAS Metrics Explained

Metric | What It Measures
Faithfulness | Is the answer grounded in the retrieved contexts?
Answer Relevancy | Does the answer address the question?
Context Precision | Are the retrieved contexts relevant?
Context Recall | Do the contexts contain the needed information?
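
For intuition, faithfulness is roughly the share of claims in the answer that the retrieved contexts support. RAGAS computes this with an LLM judge that extracts claims and checks entailment; the toy sketch below only illustrates the ratio with naive substring matching and is not RAGAS's actual implementation.

def toy_faithfulness(answer_claims, contexts):
    """Illustrative only: fraction of claims found in any retrieved context.
    RAGAS uses an LLM to extract claims from the answer and verify them."""
    if not answer_claims:
        return 0.0
    supported = sum(
        any(claim.lower() in ctx.lower() for ctx in contexts)
        for claim in answer_claims
    )
    return supported / len(answer_claims)

# Using the return-policy sample from the RAGAS example above
print(toy_faithfulness(
    ["returns accepted within 30 days"],
    ["Returns accepted within 30 days.", "Refunds processed in 5-7 business days."]
))  # 1.0 -> the claim is grounded in a retrieved context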

DeepEval Integration

DeepEval provides unit testing patterns for LLMs:

from mlflow.genai.judges import get_judge
from mlflow.genai import evaluate

# DeepEval metrics
deepeval_scorers = [
    get_judge("deepeval/coherence"),
    get_judge("deepeval/hallucination"),
    get_judge("deepeval/toxicity")
]

eval_data = [
    {
        "inputs": {"question": "Explain quantum computing"},
        "outputs": {"answer": "Quantum computing uses qubits..."}
    }
]

results = evaluate(
    data=eval_data,
    scorers=deepeval_scorers
)

DeepEval Metrics Explained

Metric | What It Measures
Coherence | Logical flow and clarity
Hallucination | Unsupported claims
Toxicity | Harmful content
Bias | Unfair stereotypes
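
Because evaluate returns aggregated scores through results.metrics (shown later in this module), you can turn these safety metrics into a pass/fail gate in the spirit of DeepEval's unit-testing style. A minimal sketch, assuming the 'toxicity/mean' metric key and a 0.1 cutoff, both of which you should verify against your own run output:

from mlflow.genai import evaluate
from mlflow.genai.judges import get_judge

# eval_data as defined in the DeepEval example above
safety_scorers = [
    get_judge("deepeval/toxicity"),
    get_judge("deepeval/hallucination")
]

results = evaluate(data=eval_data, scorers=safety_scorers)

# Fail fast (e.g. in CI) when aggregate toxicity crosses the threshold.
# The metric key and 0.1 cutoff are assumptions; print results.metrics to see
# the exact names your scorers produce.
assert results.metrics["toxicity/mean"] < 0.1, "Toxicity above acceptable threshold"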

Combining All Frameworks

Use MLflow, RAGAS, and DeepEval together:

from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_correctness
from mlflow.genai.judges import get_judge, make_judge

# MLflow built-in
mlflow_scorers = [answer_correctness()]

# RAGAS for RAG quality
ragas_scorers = [
    get_judge("ragas/faithfulness"),
    get_judge("ragas/context_precision")
]

# DeepEval for safety
deepeval_scorers = [
    get_judge("deepeval/toxicity")
]

# Custom for domain-specific
custom_scorers = [
    make_judge(
        name="brand_alignment",
        judge_prompt="Does this match our brand voice? {{ outputs }}",
        output_type="numeric",
        output_range=(1, 5)
    )
]

# Run comprehensive evaluation
results = evaluate(
    data=eval_data,
    scorers=mlflow_scorers + ragas_scorers + deepeval_scorers + custom_scorers
)

Installation Requirements

# For RAGAS integration
pip install ragas

# For DeepEval integration
pip install deepeval

# MLflow with GenAI
pip install "mlflow>=3.4.0"

Tracking Combined Results

Metrics from every framework are logged to the same MLflow run:

import mlflow
from mlflow.genai import evaluate

# Combine the scorer lists defined in the previous section
all_scorers = mlflow_scorers + ragas_scorers + deepeval_scorers + custom_scorers

with mlflow.start_run(run_name="comprehensive-eval"):
    results = evaluate(
        data=eval_data,
        scorers=all_scorers
    )

    # All metrics from all frameworks logged
    print(results.metrics)
    # {
    #     'answer_correctness/mean': 0.85,
    #     'faithfulness/mean': 0.92,
    #     'context_precision/mean': 0.78,
    #     'toxicity/mean': 0.02,
    #     'brand_alignment/mean': 4.2
    # }
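
Because each evaluation lands in an ordinary MLflow run, you can also pull past runs back out for side-by-side comparison. A minimal sketch using the standard mlflow.search_runs API; the metric column names depend on which scorers you ran:

import mlflow

# Returns a pandas DataFrame; aggregated scores show up as "metrics.<name>"
# columns, e.g. "metrics.faithfulness/mean".
runs = mlflow.search_runs(order_by=["start_time DESC"], max_results=10)
print(runs.filter(like="metrics.").head())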

When to Use Each Framework

Need | Use
General LLM quality | MLflow built-in scorers
RAG system evaluation | RAGAS metrics
Safety and compliance | DeepEval metrics
Domain-specific criteria | MLflow make_judge

Tip: Start with one framework, then add others as your evaluation needs grow.

With MLflow mastered, let's explore W&B Weave, another powerful evaluation platform.
