MLflow for LLM Evaluation

MLflow GenAI Setup

MLflow is an open-source platform for managing the ML lifecycle. Its GenAI module adds LLM-specific evaluation tooling, including built-in scorers and custom judges, on top of MLflow's experiment tracking.

Why MLflow for LLM Evaluation?

Feature | Benefit
Open Source | No vendor lock-in, self-hostable
Experiment Tracking | Compare runs, parameters, metrics
Built-in Scorers | Ready-to-use quality metrics
Custom Judges | Create domain-specific evaluators
Model Registry | Version and deploy models

Installation

Install MLflow with GenAI support:

pip install "mlflow>=2.10.0" openai

Note: For make_judge functionality, use MLflow 3.4.0 or later.
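
To confirm which version you ended up with, a quick check from Python:

import mlflow

print(mlflow.__version__)  # expect 2.10.0 or later; 3.4.0+ if you plan to use make_judge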

Setting Up the Tracking Server

Option 1: Local Tracking (Development)

import mlflow

# Use local file storage
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("llm-evaluation")

Option 2: Remote Tracking Server

Start the tracking server from a terminal:

mlflow server --host 0.0.0.0 --port 5000

Then point your code at it:

import mlflow

# Connect to the remote server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("llm-evaluation")

Basic Experiment Logging

Log your LLM runs with parameters and metrics:

import mlflow

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.7)
    mlflow.log_param("prompt_version", "v2.1")

    # Your LLM logic here
    response = call_llm(prompt, model="gpt-4o-mini")

    # Log metrics
    mlflow.log_metric("latency_ms", response.latency)
    mlflow.log_metric("tokens_used", response.token_count)
    mlflow.log_metric("quality_score", evaluate(response))

Understanding MLflow Concepts

Experiments

Group related runs together:

Experiment: "customer-support-bot"
├── Run 1: gpt-4o, prompt v1
├── Run 2: gpt-4o, prompt v2
├── Run 3: gpt-4o-mini, prompt v2
└── Run 4: claude-3, prompt v2
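
A sketch of how such an experiment might be populated; the model names and prompt versions simply mirror the tree above, and the evaluation call is left as a placeholder:

import mlflow

mlflow.set_experiment("customer-support-bot")

configs = [
    {"model": "gpt-4o", "prompt_version": "v1"},
    {"model": "gpt-4o", "prompt_version": "v2"},
    {"model": "gpt-4o-mini", "prompt_version": "v2"},
    {"model": "claude-3", "prompt_version": "v2"},
]

for cfg in configs:
    with mlflow.start_run(run_name=f"{cfg['model']}-{cfg['prompt_version']}"):
        mlflow.log_params(cfg)
        # ... run the evaluation for this configuration and log its metrics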

Runs

Each run captures:

  • Parameters: Configuration (model, temperature, etc.)
  • Metrics: Numeric results (score, latency, cost)
  • Artifacts: Files (prompts, outputs, logs)
  • Tags: Metadata (author, version, environment)
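
A single run can hold all four; a minimal sketch with placeholder values:

import mlflow

with mlflow.start_run():
    mlflow.log_param("model", "gpt-4o-mini")                          # parameter
    mlflow.log_metric("quality_score", 0.87)                          # metric
    mlflow.log_text("You are a support assistant...", "prompt.txt")   # artifact
    mlflow.set_tag("author", "eval-team")                             # tag
    mlflow.set_tag("environment", "staging")                          # tag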

Artifacts

Store evaluation artifacts:

with mlflow.start_run():
    # Log text files
    mlflow.log_text(prompt, "prompt.txt")
    mlflow.log_text(response, "response.txt")

    # Log structured data
    mlflow.log_dict(evaluation_results, "eval_results.json")
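
Files that already exist on disk can be attached with log_artifact, or log_artifacts for a whole directory; the paths below are hypothetical:

import mlflow

with mlflow.start_run():
    # Attach a single existing file (hypothetical path)
    mlflow.log_artifact("eval_cases.csv")
    # Attach an entire directory of outputs (hypothetical path)
    mlflow.log_artifacts("outputs/")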

Viewing Results

For the local file-based setup, launch the MLflow UI with:

mlflow ui --port 5000

If you started a remote tracking server (Option 2), the UI is already served at the server's address (http://localhost:5000 in the example above).

The UI shows:

  • Experiment list with all runs
  • Parameter and metric comparisons
  • Visualizations and charts
  • Artifact browser
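
Runs can also be compared programmatically with mlflow.search_runs, which returns a pandas DataFrame; this sketch assumes the "llm-evaluation" experiment and the parameters and metrics logged earlier:

import mlflow

# One row per run; logged values appear as params.* and metrics.* columns
runs = mlflow.search_runs(experiment_names=["llm-evaluation"])
print(runs[["run_id", "params.model", "metrics.quality_score"]])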

Environment Variables

Configure MLflow with environment variables:

export MLFLOW_TRACKING_URI=http://your-server:5000
export OPENAI_API_KEY=your-openai-key

Tip: For production, use a remote tracking server with a database backend (PostgreSQL, MySQL) for better performance and team collaboration.

Next, we'll explore MLflow's built-in LLM scorers for common evaluation tasks.
