MLflow for LLM Evaluation

MLflow GenAI Setup

MLflow is an open-source platform for managing the ML lifecycle. Its GenAI module adds LLM-specific evaluation tooling, including built-in scorers and custom judges, on top of MLflow's experiment tracking.

Why MLflow for LLM Evaluation?

Feature | Benefit
Open Source | No vendor lock-in, self-hostable
Experiment Tracking | Compare runs, parameters, metrics
Built-in Scorers | Ready-to-use quality metrics
Custom Judges | Create domain-specific evaluators
Model Registry | Version and deploy models

Installation

Install MLflow with GenAI support:

pip install "mlflow>=2.10.0" openai

Note: For make_judge functionality, use MLflow 3.4.0 or later.
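
To confirm which version you ended up with, a quick check from Python:

import mlflow

print(mlflow.__version__)  # expect 2.10.0 or later; 3.4.0+ if you plan to use make_judge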

Setting Up the Tracking Server

Option 1: Local Tracking (Development)

import mlflow

# Use local file storage
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("llm-evaluation")

Option 2: Remote Tracking Server

Start the tracking server from a terminal:

mlflow server --host 0.0.0.0 --port 5000

Then point your code at it:

import mlflow

# Connect to the remote server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("llm-evaluation")

Basic Experiment Logging

Log your LLM runs with parameters and metrics:

import mlflow

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.7)
    mlflow.log_param("prompt_version", "v2.1")

    # Your LLM logic here
    response = call_llm(prompt, model="gpt-4o-mini")

    # Log metrics
    mlflow.log_metric("latency_ms", response.latency)
    mlflow.log_metric("tokens_used", response.token_count)
    mlflow.log_metric("quality_score", evaluate(response))

Understanding MLflow Concepts

Experiments

Group related runs together:

Experiment: "customer-support-bot"
├── Run 1: gpt-4o, prompt v1
├── Run 2: gpt-4o, prompt v2
├── Run 3: gpt-4o-mini, prompt v2
└── Run 4: claude-3, prompt v2
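
A sketch of how such an experiment might be populated; the model names and prompt versions simply mirror the tree above, and the evaluation call is left as a placeholder:

import mlflow

mlflow.set_experiment("customer-support-bot")

configs = [
    {"model": "gpt-4o", "prompt_version": "v1"},
    {"model": "gpt-4o", "prompt_version": "v2"},
    {"model": "gpt-4o-mini", "prompt_version": "v2"},
    {"model": "claude-3", "prompt_version": "v2"},
]

for cfg in configs:
    with mlflow.start_run(run_name=f"{cfg['model']}-{cfg['prompt_version']}"):
        mlflow.log_params(cfg)
        # ... run the evaluation for this configuration and log its metrics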

Runs

Each run captures:

  • Parameters: Configuration (model, temperature, etc.)
  • Metrics: Numeric results (score, latency, cost)
  • Artifacts: Files (prompts, outputs, logs)
  • Tags: Metadata (author, version, environment)
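
A single run can hold all four; a minimal sketch with placeholder values:

import mlflow

with mlflow.start_run():
    mlflow.log_param("model", "gpt-4o-mini")                          # parameter
    mlflow.log_metric("quality_score", 0.87)                          # metric
    mlflow.log_text("You are a support assistant...", "prompt.txt")   # artifact
    mlflow.set_tag("author", "eval-team")                             # tag
    mlflow.set_tag("environment", "staging")                          # tag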

Artifacts

Store evaluation artifacts:

with mlflow.start_run():
    # Log text files
    mlflow.log_text(prompt, "prompt.txt")
    mlflow.log_text(response, "response.txt")

    # Log structured data
    mlflow.log_dict(evaluation_results, "eval_results.json")
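
Files that already exist on disk can be attached with log_artifact, or log_artifacts for a whole directory; the paths below are hypothetical:

import mlflow

with mlflow.start_run():
    # Attach a single existing file (hypothetical path)
    mlflow.log_artifact("eval_cases.csv")
    # Attach an entire directory of outputs (hypothetical path)
    mlflow.log_artifacts("outputs/")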

Viewing Results

For the local file-based setup, launch the MLflow UI with:

mlflow ui --port 5000

If you started a remote tracking server (Option 2), the UI is already served at the server's address (http://localhost:5000 in the example above).

The UI shows:

  • Experiment list with all runs
  • Parameter and metric comparisons
  • Visualizations and charts
  • Artifact browser
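
Runs can also be compared programmatically with mlflow.search_runs, which returns a pandas DataFrame; this sketch assumes the "llm-evaluation" experiment and the parameters and metrics logged earlier:

import mlflow

# One row per run; logged values appear as params.* and metrics.* columns
runs = mlflow.search_runs(experiment_names=["llm-evaluation"])
print(runs[["run_id", "params.model", "metrics.quality_score"]])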

Environment Variables

Configure MLflow with environment variables:

export MLFLOW_TRACKING_URI=http://your-server:5000
export OPENAI_API_KEY=your-openai-key

Tip: For production, use a remote tracking server with a database backend (PostgreSQL, MySQL) for better performance and team collaboration.

Next, we'll explore MLflow's built-in LLM scorers for common evaluation tasks.
