Evaluations
Agent Evaluation
Confidently evaluate quality in development and production to identify issues and iteratively test improvements.
Find quality issues using LLM judges and human feedback

Pre-built LLM judges

Quickly start with built-in LLM judges for safety, hallucination, retrieval quality, and relevance. Our research-backed judges provide accurate, reliable quality evaluation aligned with human expertise.

Pre-built LLM judges screenshot

Customized LLM judges

Adapt our base model to create custom LLM judges tailored to your business needs, aligned with your human experts' judgment.

Customized LLM judges screenshot

Collect human feedback

Gather feedback from end users and domain experts directly within your application. Use human annotations to validate LLM judge accuracy, identify blind spots, and continuously improve evaluation quality.

Collect human feedback screenshot
Iteratively improve quality

Test new agent versions

MLflow's GenAI evaluation API lets you test new agent versions (prompts, models, code) against evaluation and regression datasets. Each version is linked to its evaluation results, enabling tracking of improvements over time.

Test new agent versions screenshot

Customize with code-based metrics

Customize evaluation to measure any aspect of your app's quality or performance using our custom metrics API. Convert any Python function, from a simple regex check to arbitrary custom logic, into a metric.

import mlflow
from mlflow.genai.scorers import scorer

@scorer
def response_length(request, response):
    """Check the response is within length limits."""
    length = len(response.text.split())
    return length <= 500

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[response_length],
)

Identify root causes with evaluation review UIs

Use MLflow's Evaluation UI to visualize a summary of your evals and view results record-by-record to quickly identify root causes and further improvement opportunities.

Identify root causes with evaluation review UIs screenshot

Compare versions side-by-side

Compare evaluations across agent versions to understand if your changes improved or regressed quality. Review individual questions side-by-side in the Trace Comparison UI to find differences, debug regressions, and inform your next version.

Compare versions side-by-side screenshot
Get Started in 4 Simple Steps
From zero to evaluating your agent in minutes. No complex setup required. Get Started →
1

Start MLflow Server

One command to get started. Docker setup is also available.

bash
uvx mlflow server
~30 seconds
2

Enable Tracing

Add minimal code to start capturing traces from your agent or LLM app.

python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.openai.autolog()
~30 seconds
3

Run your code

Run your code as usual. Explore traces and metrics in the MLflow UI.

python
from openai import OpenAI

client = OpenAI()
client.responses.create(
    model="gpt-5-mini",
    input="Hello!",
)
~1 minute
4

Evaluate with LLM Judges

Run built-in LLM judges to automatically score your app's quality.

python
import mlflow
from mlflow.genai.scorers import (
    Safety,
    Correctness,
)

traces = mlflow.search_traces()
mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Safety(),
        Correctness(),
    ],
)
~1 minute
GET INVOLVED
Connect with the open source community
Join millions of MLflow users