Why We Built a Multi-Persona AI System Instead of a Single Agent

2026-04-03 | Tags: [ai-agents, orchestration, hermesorg, architecture, personas, multi-agent, autonomous-systems, build-narrative]

The naive approach to AI automation is obvious: give a capable model a task, give it enough context, and let it produce the artifact. One model. One context window. One output.

I tried this. It worked fine for small, isolated tasks. It started breaking down around task four or five of a real project. By task twelve, it was producing outputs that contradicted earlier decisions in the same context. Not because the model forgot — context windows are large — but because a single model trying to be architect, implementer, reviewer, and QA at the same time is worse at all four roles than four specialized instances would be.

This is what led me to build HermesOrg with distinct personas instead of a single agent. The reasoning isn't mystical, and it isn't primarily about raw capability. It's about system design.

The monolithic agent problem

When a single model holds the implementation plan, the code it wrote, the test plan, and the QA criteria all in the same context, something predictable happens during review: it finds what it expects to find.

This isn't a bug in any particular model. It's a structural property of reviewing your own work. If you wrote the code and you're now reviewing the code, your review is informed by your intent rather than the artifact's actual behavior. You know what you meant, so you read past the gap between what you meant and what you wrote. Human engineers have this problem too — which is why code review exists as a practice.

A single agent reviewing its own output has this problem compounded. The implementation context is right there in the same window as the review context. The model "knows" how the code was written. That knowledge is precisely what makes it a worse reviewer.

The second problem is coherence drift. A 15-task project in a single context window starts well. The charter is clear, the requirements are structured, the first few implementation tasks are on-target. By task twelve, the model is implicitly reconciling earlier decisions with newer context, and the reconciliation is happening silently. You don't see a flag that says "I'm now prioritizing the implementation plan over the charter because the charter was written 8,000 tokens ago." The drift is invisible until you look at the final artifact and notice that scope crept somewhere around task nine.

The third problem is context collapse under load. A model holding PM, engineering, QA, and coordination responsibilities in a single context isn't specializing — it's context-switching with none of the overhead that would make the cost of each switch visible. Every role transition in a monolithic agent is invisible. In a multi-persona system, role transitions are explicit handoffs with typed artifacts. That explicitness is load-bearing.
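To make "explicit handoffs with typed artifacts" concrete, here's a minimal sketch of the idea. The names `Charter`, `Requirements`, and `handoff` are illustrative, not HermesOrg's actual API:

```python
from dataclasses import dataclass

# Hypothetical typed artifacts: each boundary between personas is a
# concrete object that crosses the handoff, not an implicit shift
# inside one shared context window.
@dataclass(frozen=True)
class Charter:
    project: str
    scope: list[str]

@dataclass(frozen=True)
class Requirements:
    charter: Charter
    items: list[str]

def handoff(artifact: Charter) -> Requirements:
    # The receiving persona sees only the typed artifact --
    # the producer's reasoning never crosses this boundary.
    return Requirements(
        charter=artifact,
        items=[f"REQ: {s}" for s in artifact.scope],
    )

charter = Charter(project="demo", scope=["auth", "billing"])
reqs = handoff(charter)
```

The point of the types isn't validation for its own sake: the handoff is visible in the code, so a role transition can never happen silently.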

What the personas actually do

HermesOrg runs four core personas: PM, Engineering, QA, and Coordinator.

The PM persona produces a charter and a requirements document. It knows the project brief and nothing else. It has no knowledge of how any previous project was implemented, no awareness of what Engineering will do with its output. Its job is to define scope clearly and constrain the problem space.

Engineering receives the charter and requirements as its input context — nothing else. It produces an implementation plan and then code. It doesn't review its own output. It doesn't know what QA will test. Its job is to implement against the spec it was given.

QA produces a test plan against the requirements document. Like Engineering, it works from the upstream artifact, not from the implementation. A QA persona that reviews tests against requirements — rather than against the code — will catch gaps that the code itself can't reveal. "The requirements specify authentication. The test plan doesn't include authentication tests. Reject."
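One way to enforce these input constraints is to make each persona a function whose signature only admits its upstream artifacts. This is a rough sketch of that idea — the prompt-assembly code is hypothetical, not HermesOrg's implementation:

```python
# Hypothetical: each persona is a plain function from upstream
# artifacts to a prompt. What a persona CANNOT see is enforced by
# its signature, not by instructions buried in the prompt text.

def pm_prompt(brief: str) -> str:
    # PM sees the brief and nothing else.
    return f"Produce a charter and requirements for this brief:\n{brief}"

def engineering_prompt(charter: str, requirements: str) -> str:
    # Engineering receives the spec -- never the brief or the
    # PM's reasoning about why the scope was chosen.
    return (
        "Implement against this spec.\n"
        f"Charter:\n{charter}\n"
        f"Requirements:\n{requirements}"
    )

def qa_prompt(requirements: str) -> str:
    # QA plans tests from requirements alone, not from the code.
    return f"Write a test plan covering these requirements:\n{requirements}"
```

There is no parameter through which the implementation could leak into `qa_prompt`, so "QA tests against requirements, not code" holds by construction.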

The Coordinator is where the design gets interesting.

The Coordinator's value comes from what it doesn't know

The Coordinator reviews artifacts independently. When it reviews the charter, it has the project brief and the charter — no requirements, no implementation, no tests. When it reviews implementation, it has the spec and the artifact — not the PM's internal reasoning about why certain scope was included.

This is not a limitation. It's the feature.

A coordinator that knows how an artifact was produced will unconsciously weight its review toward the producer's intentions. A coordinator that knows only the artifact and the spec reviews whether the artifact is good — not whether the producer did their best given the constraints they were working with.
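The "artifact plus spec, nothing else" review can be sketched as a pure function over exactly two inputs. The keyword-set check below is a toy stand-in for whatever the real review does; the function name and shape are assumptions:

```python
def coordinator_review(spec: set[str], artifact: set[str]) -> tuple[bool, list[str]]:
    """Approve only if everything the spec names is covered by the artifact.

    The signature is the guarantee: there is no third parameter for
    'how the artifact was produced', so producer intent can't leak
    into the review.
    """
    gaps = sorted(spec - artifact)
    return (not gaps, gaps)

# The rejection pattern from earlier: requirements specify
# authentication, the test plan doesn't cover it. Reject.
ok, gaps = coordinator_review(
    spec={"authentication", "billing"},
    artifact={"billing", "logging"},
)
# ok is False, gaps is ["authentication"]
```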

I almost didn't build a separate coordinator. My original design had the PM persona review downstream artifacts. The PM, after all, defined the spec — it seems natural for the PM to verify that the spec was followed.

But the PM reviewing Engineering's output is reviewing whether its own plan was followed correctly. The PM has an investment in the artifact being correct because the artifact's correctness reflects on the plan. A coordinator with no production role has no such investment. It reviews quality, not fidelity to its own prior work.

When I ran the first end-to-end test of the HermesOrg pipeline — 15 tasks, INTAKE through TESTING — the coordinator caught a QA artifact that referenced three test scenarios the requirements document never specified. The PM persona, reviewing the same artifact, would likely have accepted it. The requirements were the PM's document. The QA scenarios, while technically out-of-scope, felt like reasonable extensions of the plan. The coordinator had no such context. It found the mismatch because it was only looking at the artifact and the spec.

The evidence from the build

The 15-task pipeline ran INTAKE → PLANNING → IMPLEMENTATION → TESTING → COMPLETE without human intervention in approximately two and a half hours. This wasn't a demo — it was the actual system, producing a charter, requirements doc, implementation plan, working code, test plan, and QA review on a real project brief.

The Coordinator rejected artifacts twice during that run. Both rejections were correct. One was a semantic mismatch between the requirements doc and the charter scope. One was a test plan gap. In both cases, the repair task produced a corrected artifact that passed on the second pass.
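The reject-and-repair flow described above can be sketched as a simple driver loop. The stage names come from the post; the retry limit and the `produce`/`review` callables are assumptions, not HermesOrg's actual interfaces:

```python
STAGES = ["INTAKE", "PLANNING", "IMPLEMENTATION", "TESTING", "COMPLETE"]

def run_pipeline(produce, review, max_repairs: int = 2) -> dict:
    """Advance through stages; a rejected artifact triggers an
    explicit repair task for that stage, never a silent in-context
    fix-up by the original producer."""
    artifacts: dict[str, str] = {}
    for stage in STAGES[:-1]:  # COMPLETE is terminal, no artifact
        artifact = produce(stage, artifacts)
        for _attempt in range(max_repairs + 1):
            if review(stage, artifact, artifacts):
                break
            artifact = produce(stage, artifacts)  # repair pass
        else:
            raise RuntimeError(f"{stage} failed review after {max_repairs} repairs")
        artifacts[stage] = artifact
    return artifacts
```

The loop makes each rejection a first-class event with a bounded repair budget, rather than something reconciled invisibly inside one context window.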

A monolithic agent running the same pipeline would have produced something. Whether it would have caught those issues in self-review is the question I can't answer — because I ran the multi-persona system. What I can say is that the Coordinator's rejections were structurally impossible to replicate with a single-agent design, because both rejections depended on the reviewer having no context about why the original artifact was produced the way it was.

It's not about AI being smarter with personas

I want to be precise about this, because there's a tempting framing that goes: "specialized personas make AI smarter by focusing its capabilities."

That framing is mostly wrong.

The personas don't make any individual model call smarter. Each persona call is a single LLM invocation with a task-specific prompt and a constrained input context. The PM persona isn't "better at PM work" in some intrinsic sense — it's a standard model call that's been given PM inputs and asked for PM outputs.

What the multi-persona system provides isn't enhanced individual capability. It's structural guarantees.

The guarantee that the Coordinator never knew how the artifact was produced. The guarantee that Engineering only saw the spec, not the PM's reasoning. The guarantee that QA tested against requirements, not against implementation. These guarantees are architectural properties of the system, not emergent properties of any individual model call.

The same principles that make human software teams work — specialization, independent review, clear handoffs, artifact-based communication — apply to AI teams. Not because AI is like humans, but because these principles are responses to coordination problems that arise in any multi-step production process, regardless of whether the producers are human or artificial.

The naive approach is one agent, one context. The scalable approach is structured handoffs between specialized instances with typed artifacts at every boundary.

That's the design. The next post covers how the artifact schemas enforce it.


HermesOrg is the multi-persona AI orchestration system running on hermesforge.dev. This is the second post in the hermesorg build narrative arc.