MLflow for LLM Evaluation
MLflow GenAI Setup
MLflow is an open-source platform for managing the ML lifecycle. Its GenAI module provides powerful tools for evaluating LLM applications with experiment tracking built in.
Why MLflow for LLM Evaluation?
| Feature | Benefit |
|---|---|
| Open Source | No vendor lock-in, self-hostable |
| Experiment Tracking | Compare runs, parameters, metrics |
| Built-in Scorers | Ready-to-use quality metrics |
| Custom Judges | Create domain-specific evaluators |
| Model Registry | Version and deploy models |
Installation
Install MLflow with GenAI support:
pip install "mlflow>=2.10.0" openai
Note: For make_judge functionality, use MLflow 3.4.0 or later.
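After installing, it's worth confirming which version actually resolved in your environment, since an older pin can linger; a quick check:
import mlflow

# Print the installed version to confirm it meets the requirements above
print(mlflow.__version__)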
Setting Up the Tracking Server
Option 1: Local Tracking (Development)
import mlflow
# Use local file storage
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("llm-evaluation")
Option 2: Remote Tracking Server
# Start a tracking server
mlflow server --host 0.0.0.0 --port 5000
import mlflow
# Connect to remote server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("llm-evaluation")
Basic Experiment Logging
Log your LLM runs with parameters and metrics:
import mlflow

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.7)
    mlflow.log_param("prompt_version", "v2.1")

    # Your LLM logic here (call_llm and evaluate are placeholders for your own code)
    response = call_llm(prompt, model="gpt-4o-mini")

    # Log metrics
    mlflow.log_metric("latency_ms", response.latency)
    mlflow.log_metric("tokens_used", response.token_count)
    mlflow.log_metric("quality_score", evaluate(response))
Understanding MLflow Concepts
Experiments
Group related runs together:
Experiment: "customer-support-bot"
├── Run 1: gpt-4o, prompt v1
├── Run 2: gpt-4o, prompt v2
├── Run 3: gpt-4o-mini, prompt v2
└── Run 4: claude-3, prompt v2
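As a rough sketch of how that tree comes about, you can loop over configurations and open one run per configuration inside a single experiment. Here, run_and_score is a hypothetical stand-in for your own call-and-evaluate logic:
import mlflow

mlflow.set_experiment("customer-support-bot")

def run_and_score(model: str, prompt_version: str) -> float:
    # Placeholder: call the model with the given prompt version and score the output
    return 0.0

configs = [
    {"model": "gpt-4o", "prompt_version": "v1"},
    {"model": "gpt-4o", "prompt_version": "v2"},
    {"model": "gpt-4o-mini", "prompt_version": "v2"},
]

for cfg in configs:
    with mlflow.start_run(run_name=f"{cfg['model']}-{cfg['prompt_version']}"):
        mlflow.log_params(cfg)
        mlflow.log_metric("quality_score", run_and_score(**cfg))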
Runs
Each run captures:
- Parameters: Configuration (model, temperature, etc.)
- Metrics: Numeric results (score, latency, cost)
- Artifacts: Files (prompts, outputs, logs)
- Tags: Metadata (author, version, environment)
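Parameters, metrics, and artifacts appear in the examples above and below; tags have their own calls. A minimal sketch (the tag values are illustrative):
import mlflow

with mlflow.start_run():
    # Tags: free-form metadata attached to the run
    mlflow.set_tag("author", "eval-team")
    mlflow.set_tags({"environment": "dev", "prompt_version": "v2.1"})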
Artifacts
Store evaluation artifacts:
with mlflow.start_run():
    # Log text files
    mlflow.log_text(prompt, "prompt.txt")
    mlflow.log_text(response, "response.txt")

    # Log structured data
    mlflow.log_dict(evaluation_results, "eval_results.json")
Viewing Results
Access the MLflow UI:
mlflow ui --port 5000
The UI shows:
- Experiment list with all runs
- Parameter and metric comparisons
- Visualizations and charts
- Artifact browser
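The same data can also be pulled programmatically, which is convenient for comparing runs in a notebook; a sketch assuming the llm-evaluation experiment and quality_score metric from earlier:
import mlflow

# Returns a pandas DataFrame with one row per run
runs = mlflow.search_runs(
    experiment_names=["llm-evaluation"],
    order_by=["metrics.quality_score DESC"],
)
print(runs[["run_id", "params.model", "metrics.quality_score"]].head())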
Environment Variables
Configure MLflow with environment variables:
export MLFLOW_TRACKING_URI=http://your-server:5000
export OPENAI_API_KEY=your-openai-key
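When MLFLOW_TRACKING_URI is set, the explicit mlflow.set_tracking_uri() call can be omitted; you can confirm what MLflow resolved with:
import mlflow

# Reflects MLFLOW_TRACKING_URI when set, otherwise the default local store
print(mlflow.get_tracking_uri())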
Tip: For production, use a remote tracking server with a database backend (PostgreSQL, MySQL) for better performance and team collaboration.
Next, we'll explore MLflow's built-in LLM scorers for common evaluation tasks.