MLflow for LLM Evaluation

DeepEval & RAGAS Integration


MLflow integrates with popular external evaluation frameworks through its get_judge API, letting you use specialized metrics from DeepEval and RAGAS within MLflow experiments.

Why Use External Frameworks?

Framework | Specialization
DeepEval | LLM unit testing, G-Eval metrics
RAGAS | RAG-specific metrics (retrieval + generation)
MLflow | Experiment tracking, model registry

Using get_judge API

The get_judge function creates MLflow-compatible scorers from external frameworks:

from mlflow.genai.judges import get_judge

# Get a judge from an external framework
ragas_faithfulness = get_judge(
    judge_type="ragas/faithfulness"
)

deepeval_coherence = get_judge(
    judge_type="deepeval/coherence"
)

RAGAS Integration

RAGAS provides metrics specifically designed for RAG systems:

from mlflow.genai.judges import get_judge
from mlflow.genai import evaluate

# RAGAS metrics for RAG evaluation
ragas_scorers = [
    get_judge("ragas/faithfulness"),
    get_judge("ragas/answer_relevancy"),
    get_judge("ragas/context_precision"),
    get_judge("ragas/context_recall")
]

# RAG evaluation data
eval_data = [
    {
        "inputs": {
            "question": "What is the return policy?",
            "contexts": [
                "Returns accepted within 30 days.",
                "Refunds processed in 5-7 business days."
            ]
        },
        "outputs": {
            "answer": "You can return items within 30 days for a full refund."
        },
        "expectations": {
            "ground_truth": "30-day return policy with full refund."
        }
    }
]

results = evaluate(
    data=eval_data,
    scorers=ragas_scorers
)

RAGAS Metrics Explained

Metric | What It Measures
Faithfulness | Is the answer grounded in the retrieved contexts?
Answer Relevancy | Does the answer address the question?
Context Precision | Are the retrieved contexts relevant?
Context Recall | Do the contexts contain the needed information?
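
For intuition, faithfulness is roughly the share of claims in the answer that the retrieved contexts support. RAGAS computes this with an LLM judge that extracts claims and checks entailment; the toy sketch below only illustrates the ratio with naive substring matching and is not RAGAS's actual implementation.

def toy_faithfulness(answer_claims, contexts):
    """Illustrative only: fraction of claims found in any retrieved context.
    RAGAS uses an LLM to extract claims from the answer and verify them."""
    if not answer_claims:
        return 0.0
    supported = sum(
        any(claim.lower() in ctx.lower() for ctx in contexts)
        for claim in answer_claims
    )
    return supported / len(answer_claims)

# Using the return-policy sample from the RAGAS example above
print(toy_faithfulness(
    ["returns accepted within 30 days"],
    ["Returns accepted within 30 days.", "Refunds processed in 5-7 business days."]
))  # 1.0 -> the claim is grounded in a retrieved context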

DeepEval Integration

DeepEval provides unit testing patterns for LLMs:

from mlflow.genai.judges import get_judge
from mlflow.genai import evaluate

# DeepEval metrics
deepeval_scorers = [
    get_judge("deepeval/coherence"),
    get_judge("deepeval/hallucination"),
    get_judge("deepeval/toxicity")
]

eval_data = [
    {
        "inputs": {"question": "Explain quantum computing"},
        "outputs": {"answer": "Quantum computing uses qubits..."}
    }
]

results = evaluate(
    data=eval_data,
    scorers=deepeval_scorers
)

DeepEval Metrics Explained

Metric | What It Measures
Coherence | Logical flow and clarity
Hallucination | Unsupported claims
Toxicity | Harmful content
Bias | Unfair stereotypes
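
Because evaluate returns aggregated scores through results.metrics (shown later in this module), you can turn these safety metrics into a pass/fail gate in the spirit of DeepEval's unit-testing style. A minimal sketch, assuming the 'toxicity/mean' metric key and a 0.1 cutoff, both of which you should verify against your own run output:

from mlflow.genai import evaluate
from mlflow.genai.judges import get_judge

# eval_data as defined in the DeepEval example above
safety_scorers = [
    get_judge("deepeval/toxicity"),
    get_judge("deepeval/hallucination")
]

results = evaluate(data=eval_data, scorers=safety_scorers)

# Fail fast (e.g. in CI) when aggregate toxicity crosses the threshold.
# The metric key and 0.1 cutoff are assumptions; print results.metrics to see
# the exact names your scorers produce.
assert results.metrics["toxicity/mean"] < 0.1, "Toxicity above acceptable threshold"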

Combining All Frameworks

Use MLflow, RAGAS, and DeepEval together:

from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_correctness
from mlflow.genai.judges import get_judge, make_judge

# MLflow built-in
mlflow_scorers = [answer_correctness()]

# RAGAS for RAG quality
ragas_scorers = [
    get_judge("ragas/faithfulness"),
    get_judge("ragas/context_precision")
]

# DeepEval for safety
deepeval_scorers = [
    get_judge("deepeval/toxicity")
]

# Custom for domain-specific
custom_scorers = [
    make_judge(
        name="brand_alignment",
        judge_prompt="Does this match our brand voice? {{ outputs }}",
        output_type="numeric",
        output_range=(1, 5)
    )
]

# Run comprehensive evaluation
results = evaluate(
    data=eval_data,
    scorers=mlflow_scorers + ragas_scorers + deepeval_scorers + custom_scorers
)

Installation Requirements

# For RAGAS integration
pip install ragas

# For DeepEval integration
pip install deepeval

# MLflow with GenAI
pip install "mlflow>=3.4.0"

Tracking Combined Results

Metrics from every framework are logged to the same MLflow run:

import mlflow
from mlflow.genai import evaluate

# Combine the scorer lists defined in the previous section
all_scorers = mlflow_scorers + ragas_scorers + deepeval_scorers + custom_scorers

with mlflow.start_run(run_name="comprehensive-eval"):
    results = evaluate(
        data=eval_data,
        scorers=all_scorers
    )

    # All metrics from all frameworks logged
    print(results.metrics)
    # {
    #     'answer_correctness/mean': 0.85,
    #     'faithfulness/mean': 0.92,
    #     'context_precision/mean': 0.78,
    #     'toxicity/mean': 0.02,
    #     'brand_alignment/mean': 4.2
    # }
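
Because each evaluation lands in an ordinary MLflow run, you can also pull past runs back out for side-by-side comparison. A minimal sketch using the standard mlflow.search_runs API; the metric column names depend on which scorers you ran:

import mlflow

# Returns a pandas DataFrame; aggregated scores show up as "metrics.<name>"
# columns, e.g. "metrics.faithfulness/mean".
runs = mlflow.search_runs(order_by=["start_time DESC"], max_results=10)
print(runs.filter(like="metrics.").head())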

When to Use Each Framework

Need | Use
General LLM quality | MLflow built-in scorers
RAG system evaluation | RAGAS metrics
Safety and compliance | DeepEval metrics
Domain-specific criteria | MLflow make_judge

Tip: Start with one framework, then add others as your evaluation needs grow.

With MLflow mastered, let's explore W&B Weave, another powerful evaluation platform.
