AEB
Agentic Evaluation Benchmark · Open benchmark suite for evaluating multi-agent systems on standardized tasks.
Status
Draft
Version
Updated
Steward
Specification: https://github.com/agentic-eval/aeb
Editorial note
We are skeptical of benchmark suites in this category by editorial policy, but we cover the conversation because the standard's design choices are themselves interesting.
Our analysis
We do not publish benchmark numbers in this publication, and we are skeptical of benchmark suites as instruments: they shape what gets optimized for, and the category is young enough that what should be optimized for is still a live question. With that disclosure on the table, we cover AEB because the suite's design choices are themselves the conversation. AEB defines task families (long-horizon planning, tool use under uncertainty, multi-agent coordination, human-in-the-loop handoff) and a scoring rubric that explicitly separates capability from cost. The 0.3 release introduces a per-task constraint set (cost budget, latency budget, audit-trail requirement) that pulls the suite closer to real-world operational concerns. Whether AEB becomes the field's reference benchmark or remains one of several is unclear. The more pertinent question is whether the field develops a benchmark-skeptical culture (the LWN-flavored answer) or a benchmark-driven culture (the leaderboard-flavored answer). We have a preference.
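To make the design choice concrete, here is a minimal sketch of what a per-task constraint set and a capability-versus-cost report could look like. Every class, field, and function name below is our own invention for illustration; none of it is taken from the AEB specification, which may structure these ideas quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintSet:
    """Hypothetical per-task constraints; field names are ours, not AEB's."""
    cost_budget_usd: float
    latency_budget_s: float
    audit_trail_required: bool


@dataclass(frozen=True)
class RunRecord:
    """One attempt at one task, as a harness might record it (illustrative)."""
    task_solved: bool
    cost_usd: float
    latency_s: float
    has_audit_trail: bool


def violations(run: RunRecord, limits: ConstraintSet) -> list[str]:
    """Return the names of the constraints this run breaks."""
    broken = []
    if run.cost_usd > limits.cost_budget_usd:
        broken.append("cost")
    if run.latency_s > limits.latency_budget_s:
        broken.append("latency")
    if limits.audit_trail_required and not run.has_audit_trail:
        broken.append("audit_trail")
    return broken


def report(run: RunRecord, limits: ConstraintSet) -> dict:
    """Keep capability and cost on separate axes instead of blending them."""
    return {
        "capability": {"solved": run.task_solved},
        "cost": {"usd": run.cost_usd, "latency_s": run.latency_s},
        "violations": violations(run, limits),
    }
```

The point of the sketch is the shape of the report: a run can solve its task and still break its budget or skip its audit trail, and a rubric that keeps those axes apart does not let one hide the other. Any values fed into it would be placeholders, not results.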
What we are watching
- Whether vendors commit to publishing AEB results in their release notes
- Whether the cost-and-audit dimension survives into 1.0
- Whether MLPerf or one of the established benchmark bodies absorbs AEB
How to cite this entry
Citation format: The Agentic Review. "AEB — Agentic Evaluation Benchmark." Standards Tracker, last
updated 2026-04-08. https://agentic.review/standards/agentic-eval-bench/.
See our citation style for additional guidance.