Evaluations
Agent Evaluation
Confidently evaluate quality in development and production to identify issues and iteratively test improvements.
Find quality issues using LLM judges and human feedback

Pre-built LLM judges

Quickly start with built-in LLM judges for safety, hallucination, retrieval quality, and relevance. Our research-backed judges provide accurate, reliable quality evaluation aligned with human expertise.

Pre-built LLM judges screenshot

Customized LLM judges

Adapt our base model to create custom LLM judges tailored to your business needs, aligned with your human experts' judgment.

Customized LLM judges screenshot

Collect human feedback

Gather feedback from end users and domain experts directly within your application. Use human annotations to validate LLM judge accuracy, identify blind spots, and continuously improve evaluation quality.

Collect human feedback screenshot
Iteratively improve quality

Test new agent versions

MLflow's GenAI evaluation API lets you test new agent versions (prompts, models, code) against evaluation and regression datasets. Each version is linked to its evaluation results, enabling tracking of improvements over time.

Test new agent versions screenshot

Customize with code-based metrics

Customize evaluation to measure any aspect of your app's quality or performance using our custom metrics API. Convert any Python function, from a simple regex check to arbitrary custom logic, into a metric.

import mlflow
from mlflow.genai.scorers import scorer

@scorer
def response_length(request, response):
    """Check the response is within length limits."""
    length = len(response.text.split())
    return length <= 500

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[response_length],
)

Identify root causes with evaluation review UIs

Use MLflow's Evaluation UI to visualize a summary of your evals and view results record-by-record to quickly identify root causes and further improvement opportunities.

Identify root causes with evaluation review UIs screenshot

Compare versions side-by-side

Compare evaluations across agent versions to understand if your changes improved or regressed quality. Review individual questions side-by-side in the Trace Comparison UI to find differences, debug regressions, and inform your next version.

Compare versions side-by-side screenshot
Get Started in 4 Simple Steps
From zero to evaluating your agent in minutes. No complex setup required. Get Started →
1

Start MLflow Server

One command to get started. Docker setup is also available.

bash
uvx mlflow server
~30 seconds
2

Enable Tracing

Add minimal code to start capturing traces from your agent or LLM app.

python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.openai.autolog()
~30 seconds
3

Run your code

Run your code as usual. Explore traces and metrics in the MLflow UI.

python
from openai import OpenAI

client = OpenAI()
client.responses.create(
    model="gpt-5-mini",
    input="Hello!",
)
~1 minute
4

Evaluate with LLM Judges

Run built-in LLM judges to automatically score your app's quality.

python
import mlflow
from mlflow.genai.scorers import (
    Safety,
    Correctness,
)

traces = mlflow.search_traces()
mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Safety(),
        Correctness(),
    ],
)
~1 minute
GET INVOLVED
Connect with the open source community
Join millions of MLflow users