From Scripts to Workforces: A Short History of Agentic Orchestration

The history of agentic orchestration is shorter than it looks. The first credible “agent” systems in the modern sense are barely six years old, and the supervisor-led workforce pattern that several products are now shipping is, at most, three. This piece walks the timeline, identifies the architectural transitions that mattered, and places the current generation of platforms in the lineage.

We have written this as a working historian would: from the artifacts that exist, not from the press releases. Where we are inferring the timing of a transition, we say so. Where a date is not on the public record, we leave it as a range.

Stage 0: scripts and prompts

Before “agent” was a meaningful term, the work that agents now do was done by scripts. A developer would write a Python program that called an API, took the result, called another API, and so on. The orchestration layer was the developer. The model layer was a single call to a single endpoint. The memory layer was a database the developer wrote against by hand.

This stage is not a category, but it is the baseline. Most production “agentic” code today still has this shape underneath the orchestration abstractions. The frameworks of the next stage did not replace the script; they gave it a more useful vocabulary.

Stage 1: the single-turn LLM application (2020-2022)

The first wave of LLM-powered software was single-turn. A user typed in a prompt. The model produced a result. The application threaded results through a UI. Everything interesting happened inside one model call.

This stage produced two architectural lessons that the field is still digesting.

The first is that LLMs are stateless by default and that the application has to provide the state. This is obvious in retrospect and was not in 2021. A surprising amount of early agentic engineering was discovering that the chat history is, in fact, part of the program.

The second is that prompts have to be engineered, not written. The discipline of getting an LLM to produce a usable output became the discipline of structuring the input. This eventually became a field of its own (prompt engineering), then collapsed into a craft (prompt design), then became a baseline assumption for every layer above it.

Stage 2: chained calls and retrieval (2022-2023)

The next architectural step was to chain calls together. Take an input, ask the model to plan, take the plan, ask the model to execute each step, accumulate the results. This is the pattern that became “chain-of-thought” at the model level and “chains” at the framework level. LangChain’s early releases lived in this stage.

Retrieval-augmented generation (RAG) emerged in parallel. The insight was that a model with a small context window plus a search index over a large corpus is more useful than either component alone. The framework layer started to ship retrieval primitives. The application layer started to ship retrieval pipelines.

The stage’s architectural contribution was the idea of a runtime that does work between model calls. The runtime was thin — usually a Python class with a run method — but it was real. It is the seed of every orchestration framework that came later.

Stage 3: the multi-agent conversation (2023)

The transition from single-agent to multi-agent happened, in the public record, around the release of AutoGen and a handful of contemporaneous research papers. The architectural move was to model the agentic system as a set of agents that talk to each other rather than as a single agent that talks to itself.

This is not a small move. A multi-agent system has different failure modes than a chained single-agent system. Agents can deadlock. They can produce circular references. They can develop emergent behaviors that the developer did not anticipate. The frameworks of this stage spent a lot of time on the basic plumbing of message-passing, turn-taking, and termination conditions.

The honest assessment, looking back, is that most early multi-agent systems were over-engineered. Two agents talking to each other often produced worse results than a single agent thinking carefully. The pattern was useful for a narrow set of problems — code review by adversarial roles, planning followed by execution — and was applied to a much wider set than it should have been.

Stage 4: structured tool use (2023-2024)

In parallel with the multi-agent work, the model providers started to ship structured tool-use primitives. OpenAI’s function calling was the canonical example, followed by Anthropic’s tool-use API and a similar move from every serious model provider. The model could now decide, on its own, that it wanted to call a specific tool with specific arguments, and the runtime could execute the call and feed the result back.

This was an underrated architectural transition. It moved a lot of the orchestration logic from the framework layer into the model. The framework’s job, in the new pattern, was to advertise the tools, execute the calls, and surface the results. The decision about what to call and when was the model’s.

This stage produced the durable insight that the right level of abstraction for an agent’s behavior is a tool call, not a string. Tool calls are typed. They are loggable. They can be authorized. They are the unit that operating-system-style infrastructure can be built around. Every credible agentic OS today builds on this insight.

Stage 5: the framework-as-product wave (2024)

The framework layer consolidated around a few projects in this period. LangGraph emerged as a graph-based orchestration runtime, more general than the chain pattern and less opinionated than the role pattern. CrewAI staked out the role-based metaphor — agents have jobs, backstories, and goals. AutoGen continued to evolve, with the multi-agent conversation pattern as its center.

The architectural contribution of this stage was the framework as a shippable artifact. Before, “the framework” was whatever the developer assembled in Python. Now, the framework was a downloadable thing with a public API and a community.

The stage’s limitation, in retrospect, was that the framework is not the product. A framework gives a developer the substrate to build a multi-agent system. It does not give an operator a way to run one. The next stage was the response to that gap.

Stage 6: the workforce platform (2024-present)

The current stage is the move from framework to platform. The defining bet is that an operator should not have to design their own agent topology. The platform should ship with a topology that works — usually supervisor-led, sometimes peer-to-peer — and the operator should configure the platform’s behavior rather than assemble it from primitives.

The bundled-OS cohort — Sema4.ai, MultiOn, Adept’s ACT family, Web4OS, and a handful of others — are paradigmatic examples of this stage. Each platform ships a coordinator-and-specialists topology by default. Each commits to a structured-card surface. Each picks a canonical host outside the platform (Robocorp workers, the user’s browser, a deploy target). Each sells on usage units. The operator does not assemble a workforce; they hire one that is already configured.

Other products in this stage are working different topology choices and different surface choices. The internal platforms at FAANG-scale companies are doing similar work without the public-product pressure. The bundled-vertical products — sales workforces, support workforces, operations workforces — are layering on top of horizontal substrates that are themselves still maturing.

The architectural contribution of this stage, which is still being written, is the idea that the agentic layer should be opinionated. The framework wave was deliberately unopinionated. The platform wave is the opposite. The bet is that the operator wants a working system, not a kit. The bundled-OS cohort is the clearest expression of the bet.

Stage 7: what comes next

The next architectural transition is, in our reading, the move from single-vendor platforms to interoperable substrates. The current platform wave is producing closed systems — every platform’s agents are the platform’s. The next wave will have to let agents move across platforms, share context across vendors, and be audited by independent third parties.

This requires the protocol work we covered in our piece on MCP and A2A. It requires identity work that the field has not yet done. It requires a level of architectural patience that the current funding environment makes harder than it has to be.

We expect, by 2027 or 2028, the first credible attempts at the cross-platform agentic substrate. The early examples will look like a hybrid of MCP-style tool interop and a yet-to-be-named agent-to-agent layer. The platforms that survive the transition will be the ones whose architecture admits the move — the ones that built on stable substrates, exposed real APIs, and committed to standards their competitors also use.

A timeline diagram

2020 ─ Stage 0 / 1: scripts → single-turn LLM apps │ 2022 ─ Stage 2: chains, RAG, runtime emerges │ 2023 ─ Stage 3: multi-agent conversation pattern │ Stage 4: structured tool use │ 2024 ─ Stage 5: framework-as-product (LangGraph, CrewAI, AutoGen) │ Stage 6: workforce platforms begin to ship │ (bundled, opinionated, supervisor-led) │ 2026 ─ ← we are here │ 2027 ─ Stage 7: interop, cross-vendor agents, real identity 2028 ─ : audit and isolation arrive as platform properties

What the history tells us

Three things stand out, reading the timeline back.

The first is that the architectural transitions happened faster than the marketing has acknowledged. The field went from single-turn applications to credible workforce platforms in six years. The next transition — to interop and isolation — will take longer because it requires standards work and consensus that the previous transitions did not. The marketing will be ahead of the engineering for a while longer.

The second is that the durable contributions of each stage were the ones that the next stage took for granted. Structured tool use was the contribution of stage 4. Every product since has assumed it. The contribution of the current stage will be the supervisor topology and the structured-card surface — patterns the products of stage 7 will not have to re-derive.

The third is that the field is mature enough to have a history, but not mature enough to have stopped re-litigating its primitives. We still argue about the right metaphor (operating system, workforce, platform), the right surface (chat, cards, code), the right pricing model (seats, credits, usage). These are the arguments of a young field. They are also the arguments worth having now, because the answers will determine what the agentic layer looks like in five years.

The Review will keep covering them. The agentic workforce is being built. The history above is the first six years of it. The next six are going to matter more.