The Agentic Review

Independent analysis of agentic AI systems, products, and standards.

Vol. I · No. 1 · Est. 2026
Standards tracker entry

AEB

Agentic Evaluation Benchmark · Open benchmark suite for evaluating multi-agent systems on standardized tasks.

Status

Draft

Version

Updated

Steward

Specification: https://github.com/agentic-eval/aeb

Editorial note

We are skeptical of benchmark suites in this category by editorial policy, but we cover the conversation because the standard's design choices are themselves interesting.

Our analysis

We do not publish benchmark numbers in this publication, and we are skeptical of benchmark suites as instruments: they shape what gets optimized for, and the category is young enough that what should be optimized for is still a live question. With that disclosure on the table, we cover AEB because the suite's design choices are themselves the conversation. AEB defines task families (long-horizon planning, tool use under uncertainty, multi-agent coordination, human-in-the-loop handoff) and a scoring rubric that explicitly separates capability from cost. The 0.3 release introduces a per-task constraint set (cost budget, latency budget, audit-trail requirement) that pulls the suite closer to real-world operational concerns. Whether AEB becomes the field's reference benchmark or remains one of several is unclear. The more pertinent question is whether the field develops a benchmark-skeptical culture (the LWN-flavored answer) or a benchmark-driven culture (the leaderboard-flavored answer). We have a preference.
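To make the design choice concrete, here is a minimal sketch of what "separating capability from cost" can look like once a per-task constraint set exists. It is our illustration, not AEB's schema: every class, field, and function name below is hypothetical, and the values in the usage lines are arbitrary placeholders, not results from any system. The point is only that the capability score and the constraint check are reported side by side and never folded into a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintSet:
    """Hypothetical per-task constraints; field names are illustrative, not AEB's."""
    cost_budget_usd: float
    latency_budget_s: float
    audit_trail_required: bool


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one run of a system on one task."""
    capability_score: float  # task success in [0, 1], judged independently of cost
    cost_usd: float
    latency_s: float
    has_audit_trail: bool


def violations(run: RunRecord, limits: ConstraintSet) -> list[str]:
    """Return the names of the constraints the run broke (empty list if none)."""
    broken = []
    if run.cost_usd > limits.cost_budget_usd:
        broken.append("cost")
    if run.latency_s > limits.latency_budget_s:
        broken.append("latency")
    if limits.audit_trail_required and not run.has_audit_trail:
        broken.append("audit_trail")
    return broken


def report(run: RunRecord, limits: ConstraintSet) -> dict:
    """Report capability and operational compliance separately, never blended."""
    broken = violations(run, limits)
    return {
        "capability": run.capability_score,
        "within_constraints": not broken,
        "violations": broken,
    }


# Placeholder values chosen only to exercise the code path.
limits = ConstraintSet(cost_budget_usd=2.00, latency_budget_s=60.0, audit_trail_required=True)
run = RunRecord(capability_score=0.9, cost_usd=3.50, latency_s=42.0, has_audit_trail=True)
print(report(run, limits))
# {'capability': 0.9, 'within_constraints': False, 'violations': ['cost']}
```

In this sketch a run can score well on capability and still fail its operational constraints, and the report shows both facts. Whether AEB's rubric reports results this way is something to confirm against the specification.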

What we are watching

  • Whether vendors commit to publishing AEB results in their release notes
  • The cost-and-audit dimension and whether it survives 1.0
  • Whether MLPerf or one of the established benchmark bodies absorbs AEB

How to cite this entry

Citation format: The Agentic Review. "AEB — Agentic Evaluation Benchmark." Standards Tracker, last updated 2026-04-08. https://agentic.review/standards/agentic-eval-bench/. See our citation style for additional guidance.

Update history