Evaluation

Measure agent quality, correctness, and safety with automated evaluation pipelines.

Evaluation covers LLM-as-judge scoring, retrieval accuracy metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG), tool-use correctness, response latency SLOs, and safety guardrail pass rates. MLflow Evaluate provides built-in support for running these assessments against collected traces.
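To make the retrieval metrics concrete, here is a minimal, library-free sketch of how MRR and NDCG are typically computed. The function names and inputs are illustrative, not part of the MLflow API; MLflow Evaluate exposes these as built-in metrics rather than requiring hand-rolled implementations.

```python
import math

def mrr(ranked_relevance):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query.

    ranked_relevance: list of per-query lists of booleans, in ranked order.
    """
    total = 0.0
    for hits in ranked_relevance:
        for rank, relevant in enumerate(hits, start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_relevance)

def ndcg(relevances, k=None):
    """Normalized Discounted Cumulative Gain with a log2 position discount.

    relevances: graded relevance scores in the order the retriever returned them.
    """
    top = relevances[:k] if k else relevances
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(top))
    ideal = sorted(relevances, reverse=True)
    ideal_top = ideal[:k] if k else ideal
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal_top))
    return dcg / idcg if idcg else 0.0

# Two queries: first relevant hit at rank 1 and rank 3 → MRR = (1 + 1/3) / 2
print(round(mrr([[True, False], [False, False, True]]), 4))  # 0.6667
print(round(ndcg([3, 2, 3, 0, 1, 2]), 4))
```

NDCG divides the discounted gain of the actual ranking by that of the ideal (sorted) ranking, so a perfect ordering scores 1.0 regardless of how many relevant documents exist.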