MLflow for LLM Evaluation

DeepEval & RAGAS Integration


MLflow integrates with popular external evaluation frameworks through its get_judge API, letting you use specialized metrics from DeepEval and RAGAS within MLflow experiments.

Why Use External Frameworks?

| Framework | Specialization |
| --- | --- |
| DeepEval | LLM unit testing, G-Eval metrics |
| RAGAS | RAG-specific metrics (retrieval + generation) |
| MLflow | Experiment tracking, model registry |

Using get_judge API

The get_judge function creates MLflow-compatible scorers from external frameworks:

from mlflow.genai.judges import get_judge

# Get a judge from an external framework
ragas_faithfulness = get_judge(
    judge_type="ragas/faithfulness"
)

deepeval_coherence = get_judge(
    judge_type="deepeval/coherence"
)

RAGAS Integration

RAGAS provides metrics specifically designed for RAG systems:

from mlflow.genai.judges import get_judge
from mlflow.genai import evaluate

# RAGAS metrics for RAG evaluation
ragas_scorers = [
    get_judge("ragas/faithfulness"),
    get_judge("ragas/answer_relevancy"),
    get_judge("ragas/context_precision"),
    get_judge("ragas/context_recall")
]

# RAG evaluation data
eval_data = [
    {
        "inputs": {
            "question": "What is the return policy?",
            "contexts": [
                "Returns accepted within 30 days.",
                "Refunds processed in 5-7 business days."
            ]
        },
        "outputs": {
            "answer": "You can return items within 30 days for a full refund."
        },
        "expectations": {
            "ground_truth": "30-day return policy with full refund."
        }
    }
]

results = evaluate(
    data=eval_data,
    scorers=ragas_scorers
)

RAGAS Metrics Explained

| Metric | What It Measures |
| --- | --- |
| Faithfulness | Is the answer grounded in the retrieved contexts? |
| Answer Relevancy | Does the answer address the question? |
| Context Precision | Are the retrieved contexts relevant? |
| Context Recall | Do the contexts contain the needed information? |
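Faithfulness, for example, is conceptually a ratio: the fraction of claims in the answer that are supported by the retrieved contexts. RAGAS extracts and verifies claims with an LLM judge; the toy sketch below substitutes a case-insensitive substring match just to make the arithmetic concrete (the claim list and matching rule are illustrative, not RAGAS internals):

```python
def toy_faithfulness(claims, contexts):
    """Fraction of claims that appear (verbatim) in any retrieved context.

    RAGAS verifies each claim with an LLM judge; here 'supported' is
    approximated by a case-insensitive substring check.
    """
    if not claims:
        return 0.0
    corpus = " ".join(contexts).lower()
    supported = sum(1 for claim in claims if claim.lower() in corpus)
    return supported / len(claims)

claims = ["returns accepted within 30 days", "refunds take 5-7 business days"]
contexts = [
    "Returns accepted within 30 days.",
    "Refunds processed in 5-7 business days.",
]
print(toy_faithfulness(claims, contexts))  # 0.5: only the first claim matches verbatim
```

The real metric is more forgiving (an LLM judge recognizes paraphrases such as "refunds take" vs. "refunds processed in"), but the score's shape is the same: 1.0 means fully grounded, 0.0 means nothing in the answer is supported.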

DeepEval Integration

DeepEval provides unit testing patterns for LLMs:

from mlflow.genai.judges import get_judge
from mlflow.genai import evaluate

# DeepEval metrics
deepeval_scorers = [
    get_judge("deepeval/coherence"),
    get_judge("deepeval/hallucination"),
    get_judge("deepeval/toxicity")
]

eval_data = [
    {
        "inputs": {"question": "Explain quantum computing"},
        "outputs": {"answer": "Quantum computing uses qubits..."}
    }
]

results = evaluate(
    data=eval_data,
    scorers=deepeval_scorers
)

DeepEval Metrics Explained

| Metric | What It Measures |
| --- | --- |
| Coherence | Logical flow and clarity |
| Hallucination | Unsupported claims |
| Toxicity | Harmful content |
| Bias | Unfair stereotypes |
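DeepEval's metrics are LLM-judged, so they need a model backend at evaluation time. To make the scoring contract concrete, here is a toy scorer with the same 0-1 output shape (for toxicity, lower is better); the denylist is purely illustrative and is not how DeepEval detects toxicity:

```python
def toy_toxicity(text, denylist=("idiot", "stupid", "hate")):
    """Toy 0-1 toxicity score: fraction of denylist terms present in text.

    DeepEval's toxicity metric uses an LLM judge; this lexical check only
    illustrates the score's shape (0.0 = clean, 1.0 = every term present).
    """
    lowered = text.lower()
    hits = sum(1 for term in denylist if term in lowered)
    return hits / len(denylist)

print(toy_toxicity("Quantum computing uses qubits to represent state."))  # 0.0
```

Whatever produces the number, a scorer that returns a bounded float per row is what MLflow aggregates into the `*/mean` metrics shown later.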

Combining All Frameworks

Use MLflow, RAGAS, and DeepEval together:

from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_correctness
from mlflow.genai.judges import get_judge, make_judge

# MLflow built-in
mlflow_scorers = [answer_correctness()]

# RAGAS for RAG quality
ragas_scorers = [
    get_judge("ragas/faithfulness"),
    get_judge("ragas/context_precision")
]

# DeepEval for safety
deepeval_scorers = [
    get_judge("deepeval/toxicity")
]

# Custom for domain-specific
custom_scorers = [
    make_judge(
        name="brand_alignment",
        judge_prompt="Does this match our brand voice? {{ outputs }}",
        output_type="numeric",
        output_range=(1, 5)
    )
]

# Run comprehensive evaluation (eval_data as defined in the RAGAS example above)
results = evaluate(
    data=eval_data,
    scorers=mlflow_scorers + ragas_scorers + deepeval_scorers + custom_scorers
)

Installation Requirements

# For RAGAS integration
pip install ragas

# For DeepEval integration
pip install deepeval

# MLflow with GenAI
pip install "mlflow>=3.4.0"

Tracking Combined Results

Metrics from every framework land in a single MLflow run:

import mlflow

with mlflow.start_run(run_name="comprehensive-eval"):
    results = evaluate(
        data=eval_data,
        scorers=all_scorers
    )

    # All metrics from all frameworks logged
    print(results.metrics)
    # {
    #     'answer_correctness/mean': 0.85,
    #     'faithfulness/mean': 0.92,
    #     'context_precision/mean': 0.78,
    #     'toxicity/mean': 0.02,
    #     'brand_alignment/mean': 4.2
    # }
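Because `results.metrics` is a plain dict of aggregate scores, it is easy to gate a CI pipeline on it. A minimal sketch (the threshold values and the floors/ceilings split are assumptions to adapt to your own quality bar):

```python
def check_thresholds(metrics, floors, ceilings):
    """Return a list of human-readable violations, empty if all gates pass.

    floors:   metrics that must be >= the given value (quality scores).
    ceilings: metrics that must be <= the given value (e.g. toxicity).
    """
    failures = []
    for name, floor in floors.items():
        if metrics.get(name, 0.0) < floor:
            failures.append(f"{name}={metrics.get(name)} < {floor}")
    for name, ceiling in ceilings.items():
        if metrics.get(name, 1.0) > ceiling:
            failures.append(f"{name}={metrics.get(name)} > {ceiling}")
    return failures

metrics = {
    "answer_correctness/mean": 0.85,
    "faithfulness/mean": 0.92,
    "toxicity/mean": 0.02,
}
failures = check_thresholds(
    metrics,
    floors={"answer_correctness/mean": 0.8, "faithfulness/mean": 0.9},
    ceilings={"toxicity/mean": 0.05},
)
assert not failures  # all gates pass for this run
```

In CI, a non-empty `failures` list would fail the build, turning the evaluation run into a regression test for model quality.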

When to Use Each Framework

| Need | Use |
| --- | --- |
| General LLM quality | MLflow built-in scorers |
| RAG system evaluation | RAGAS metrics |
| Safety and compliance | DeepEval metrics |
| Domain-specific criteria | MLflow make_judge |

Tip: Start with one framework, then add others as your evaluation needs grow.

With MLflow mastered, let's explore W&B Weave, another powerful evaluation platform.
