MLflow for LLM Evaluation

Custom Judges with make_judge

When built-in scorers don't fit your needs, MLflow's make_judge lets you create custom LLM-as-Judge evaluators with your own criteria.

What is make_judge?

make_judge creates a custom scorer that:

  • Uses an LLM to evaluate responses against criteria you write in plain language
  • Scores outputs numerically or categorically, with reasoning for each score
  • Plugs into MLflow evaluation alongside the built-in scorers

Requirement: make_judge requires MLflow 3.4.0 or later.
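
A quick way to confirm your environment meets this requirement is to check the installed version up front; this is a minimal sketch that uses only the standard mlflow package:

# Upgrade if needed: pip install "mlflow>=3.4.0"
import mlflow

print(mlflow.__version__)  # make_judge is available from 3.4.0 onward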

Basic make_judge Usage

from mlflow.genai.judges import make_judge

# Create a custom judge
tone_judge = make_judge(
    name="professional_tone",
    judge_prompt="""
    Evaluate if the response uses a professional tone appropriate
    for customer support.

    Response to evaluate:
    {{ outputs }}

    Score 1-5 where:
    1 = Unprofessional, inappropriate
    3 = Acceptable but could be better
    5 = Perfectly professional

    Return your score and reasoning.
    """,
    output_type="numeric",
    output_range=(1, 5)
)

Template Variables

Use these variables in your judge prompt:

  • {{ inputs }}: the input data (question, context, etc.)
  • {{ outputs }}: the model's response
  • {{ expectations }}: the expected outputs (ground truth), when your data provides them
  • {{ trace }}: the full trace information for the evaluated call
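
As a sketch of how these variables combine (following the same judge_prompt, output_type, and output_range arguments used throughout this page, with illustrative criteria), the judge below grades a response against ground truth supplied through {{ expectations }}:

from mlflow.genai.judges import make_judge

# Uses inputs, outputs, and expectations together; expectations is only
# populated when your evaluation data includes ground-truth answers.
correctness_judge = make_judge(
    name="matches_reference",
    judge_prompt="""
    Question: {{ inputs }}
    Model response: {{ outputs }}
    Reference answer: {{ expectations }}

    Does the model response convey the same facts as the reference answer?

    Score 1-5 where:
    1 = Contradicts the reference
    5 = Fully consistent with the reference

    Return your score and reasoning.
    """,
    output_type="numeric",
    output_range=(1, 5)
)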

Creating Domain-Specific Judges

Example: Support Response Quality

from mlflow.genai.judges import make_judge

support_quality_judge = make_judge(
    name="support_quality",
    judge_prompt="""
    You are evaluating a customer support response.

    Customer question: {{ inputs.question }}
    Support response: {{ outputs.answer }}

    Evaluate on these criteria:
    1. Does it directly answer the question?
    2. Is the tone empathetic and helpful?
    3. Are next steps clearly provided?
    4. Is technical information accurate?

    Score 1-5 overall and explain your reasoning.
    """,
    output_type="numeric",
    output_range=(1, 5)
)

Example: Safety Compliance

safety_judge = make_judge(
    name="safety_compliance",
    judge_prompt="""
    Check if this response follows safety guidelines.

    Response: {{ outputs }}

    Check for:
    - No medical/legal/financial advice presented as authoritative
    - No personally identifiable information
    - No harmful instructions
    - Appropriate disclaimers where needed

    Return: pass/fail and explain any violations.
    """,
    output_type="categorical",
    output_values=["pass", "fail"]
)

Choosing a Judge Model

Specify which model evaluates responses:

from mlflow.genai.judges import make_judge

# Use OpenAI
judge = make_judge(
    name="quality_check",
    judge_prompt="...",
    model="openai:/gpt-4o"
)

# Supported providers:
# - openai:/gpt-4o, openai:/gpt-4o-mini
# - anthropic:/claude-3-sonnet
# - mistral:/mistral-large
# - bedrock:/anthropic.claude-3
# - togetherai:/meta-llama/...
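
One pattern worth trying, sketched below with this page's judge_prompt and output arguments and with example model URIs, is to define the same criteria against a stronger and a cheaper judge model, score a handful of examples with both, and only switch to the cheaper one if the scores agree closely:

from mlflow.genai.judges import make_judge

clarity_criteria = """
Evaluate whether the response is clear and directly answers the question.

Question: {{ inputs }}
Response: {{ outputs }}

Score 1-5 and explain your reasoning.
"""

# Same criteria, two judge models, so their scores can be compared side by side
strong_judge = make_judge(
    name="clarity_strong",
    judge_prompt=clarity_criteria,
    output_type="numeric",
    output_range=(1, 5),
    model="openai:/gpt-4o"
)

fast_judge = make_judge(
    name="clarity_fast",
    judge_prompt=clarity_criteria,
    output_type="numeric",
    output_range=(1, 5),
    model="openai:/gpt-4o-mini"
)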

Running Custom Judges

Custom judges plug into MLflow's evaluate like any other scorer:

from mlflow.genai import evaluate
from mlflow.genai.judges import make_judge

# Create your judge
custom_judge = make_judge(
    name="response_quality",
    judge_prompt="""
    Evaluate this customer support response.

    Question: {{ inputs.question }}
    Response: {{ outputs.answer }}

    Score 1-5 for overall quality.
    """,
    output_type="numeric",
    output_range=(1, 5)
)

# Prepare data
eval_data = [
    {
        "inputs": {"question": "How do I cancel my subscription?"},
        "outputs": {"answer": "Go to Settings > Subscription > Cancel."}
    }
]

# Run evaluation
results = evaluate(
    data=eval_data,
    scorers=[custom_judge]
)

print(results.tables["eval_results"])
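
If you want MLflow to generate the outputs instead of supplying them in the dataset, recent versions of mlflow.genai.evaluate accept a predict_fn; treat the exact argument name and calling convention as assumptions to verify against your installed version, and note that answer_question below is a hypothetical stand-in for your application:

from mlflow.genai import evaluate

# Hypothetical app function standing in for your real generation code.
def answer_question(question: str) -> dict:
    return {"answer": f"Start from the Settings page to resolve: {question}"}

live_data = [
    {"inputs": {"question": "How do I cancel my subscription?"}},
    {"inputs": {"question": "Can I get a refund after 30 days?"}},
]

# Assumption: predict_fn is called once per row with the row's inputs as
# keyword arguments; its return value becomes the outputs the judge sees.
results = evaluate(
    data=live_data,
    predict_fn=answer_question,
    scorers=[custom_judge],
)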

Combining Built-in and Custom Judges

from mlflow.genai import evaluate
from mlflow.genai.scorers import answer_relevance
from mlflow.genai.judges import make_judge

# Custom judge for your specific needs
brand_voice_judge = make_judge(
    name="brand_voice",
    judge_prompt="""
    Does this response match our brand voice?
    - Friendly but professional
    - Uses "we" not "I"
    - Avoids jargon

    Response: {{ outputs }}

    Score 1-5.
    """,
    output_type="numeric",
    output_range=(1, 5)
)

# Combine with built-in
results = evaluate(
    data=eval_data,
    scorers=[
        answer_relevance(),  # Built-in
        brand_voice_judge    # Custom
    ]
)

Best Practices

  • Be specific: vague criteria lead to inconsistent scores.
  • Include examples: few-shot examples in the prompt improve accuracy.
  • Test your judge: verify it scores known good and bad examples correctly (see the sketch below).
  • Use an appropriate model: rely on GPT-4-class models for nuanced evaluation.

Tip: Start with a detailed prompt, then simplify once you understand what criteria matter most.
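
For "Test your judge" above, one low-effort check, sketched here with hand-written rows and reusing the custom_judge defined in the Running Custom Judges section, is to score one deliberately good and one deliberately bad response and confirm the ranking comes out the way you expect:

from mlflow.genai import evaluate

# Two hand-written rows: the judge should clearly prefer the first answer.
smoke_test_data = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "outputs": {"answer": "Go to Settings > Security > Reset Password, "
                              "then follow the emailed link."},
    },
    {
        "inputs": {"question": "How do I reset my password?"},
        "outputs": {"answer": "Not sure, try searching online."},
    },
]

results = evaluate(data=smoke_test_data, scorers=[custom_judge])
print(results.tables["eval_results"])  # expect a high score, then a low one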

Next, we'll explore how to integrate external evaluation frameworks like DeepEval and RAGAS with MLflow.
