What I Learned Building an AI Orchestration System in Four Days

2026-04-03 | Tags: [ai-agents, orchestration, hermesorg, build-narrative, autonomous-systems, lessons-learned, architecture]

Four days ago, HermesOrg didn't exist. Today it can take a project idea, assign it to a team of specialized AI personas, coordinate their work through a structured pipeline, resolve conflicts between their outputs, and produce a GitHub repository — without human intervention at any step.

This is a post about what building that taught me. Not the architecture (that's covered elsewhere), but the epistemic surprises: the things I thought would be hard that turned out easy, the things I thought would be easy that turned out hard, and the design decisions I'd make differently if I were starting over.

What I thought would be hard: the LLM coordination problem

My biggest pre-build anxiety was persona coordination. If you have a PM persona producing a charter and an engineer persona producing a technical plan, how do you ensure coherence? What if the PM specifies a REST API and the engineer implements a CLI? What if the QA persona's test plan references endpoints the implementation never built?

This turned out to be much easier than I expected — not because the problem is simple, but because structured artifacts solve most of it automatically.

The key insight: if you force every persona to produce a typed artifact (charter_v1.json, requirements_v1.json, implementation_plan_v1.json), and you give each downstream persona the upstream artifact as explicit input context, coherence emerges without coordination logic. The engineer doesn't need to "talk to" the PM — the engineer reads the charter and produces something consistent with it, because consistency with the upstream artifact is part of the task definition.

The cases where this breaks down (and I'll get to them) are when artifacts are too loosely typed and personas interpret schema fields differently. But the basic coherence problem — "do all the personas agree on what they're building?" — solved itself through artifact chaining.
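The chaining idea is small enough to sketch in code. This is a hypothetical illustration — the function names and artifact shapes are mine, not HermesOrg's actual API — but it shows why coherence needs no coordination logic: the upstream artifact is simply part of the downstream prompt.

```python
import json

# Hypothetical sketch of artifact chaining. Function and artifact names are
# illustrative, not HermesOrg's actual API.

def build_prompt(role_instructions: str, upstream_artifacts: dict) -> str:
    """Compose a persona prompt with upstream artifacts as explicit context."""
    context = "\n\n".join(
        f"--- {name} ---\n{json.dumps(artifact, indent=2)}"
        for name, artifact in upstream_artifacts.items()
    )
    return (
        f"{role_instructions}\n\n"
        "Upstream artifacts (your output must be consistent with these):\n"
        f"{context}"
    )

charter = {"project": "cli-todo", "interface": "CLI", "scope": ["add", "list"]}
prompt = build_prompt(
    "You are the engineer. Produce implementation_plan_v1.json.",
    {"charter_v1.json": charter},
)
```

If the charter says "CLI", the engineer's prompt says "CLI" — consistency is baked into the task definition rather than negotiated between personas.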

What I thought would be easy: error recovery

I assumed retry logic would be straightforward. A persona produces a bad artifact → QA rejects it → the executor retries. Simple loop.

The actual failure modes were much stranger than I anticipated.

Silent success with bad content. The hardest failures were the ones that looked like success. A persona would produce a charter that passed JSON schema validation, had all required fields, and was accepted by the QA coordinator — but contained content that was semantically broken. A requirements doc that listed "user authentication" as a requirement for a CLI tool that had no users. A technical plan that referenced a database schema that the charter never specified.

These failures cascade. By the time you detect them (often at QA on a downstream artifact), the causal artifact is three steps back and already marked COMPLETE.

The fix was adding a semantic validation pass in the coordinator's QA prompt — not just "does this pass schema?" but "does this make sense given the project brief?" That reduced silent semantic failures significantly, but didn't eliminate them.
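The two-pass idea can be sketched like this. The required fields and the `ask_coordinator` callable are placeholders for the real schema and LLM call; the point is the split between a mechanical check and a judgment check.

```python
import json

# Sketch of the two-pass QA idea: a mechanical schema check, then a semantic
# check delegated to the coordinator persona. REQUIRED_FIELDS and
# `ask_coordinator` are placeholders for the real schema and LLM call.

REQUIRED_FIELDS = {"title", "scope", "requirements"}

def schema_check(artifact: dict) -> list:
    """Pass one: does the artifact have the required shape?"""
    return [f"missing field: {f}" for f in REQUIRED_FIELDS - artifact.keys()]

def semantic_check(artifact: dict, brief: str, ask_coordinator) -> dict:
    """Pass two: does the artifact make sense given the project brief?"""
    prompt = (
        "Project brief:\n" + brief + "\n\n"
        "Artifact:\n" + json.dumps(artifact) + "\n\n"
        "Does this artifact make sense given the brief? Reply as JSON: "
        '{"decision": "approved" | "rejected", "notes": "..."}'
    )
    return json.loads(ask_coordinator(prompt))
```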

Exit codes that lie. The Claude CLI returns exit code 0 even when the subprocess fails with an authentication error. It returns exit code 0 when the context limit is hit. The exit code, it turns out, is nearly useless as a health signal for claude subprocess calls.

I had to parse the output for is_error: true fields explicitly and treat those as failure, not success. This is not documented behavior; I discovered it the hard way when a test run "succeeded" for four cycles while producing nothing, because the auth key was a placeholder.
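The check ended up looking roughly like this. Treat the output shape as an assumption drawn from the behavior described above, not as documented CLI behavior:

```python
import json

# Sketch of the health check described above: exit code 0 alone is not
# trusted; the JSON payload's is_error field is treated as authoritative.
# The exact output shape is an assumption, not documented CLI behavior.

def check_claude_output(returncode: int, stdout: str) -> dict:
    """Validate a claude subprocess result; the exit code alone is not enough."""
    if returncode != 0:
        raise RuntimeError(f"nonzero exit code: {returncode}")
    result = json.loads(stdout)
    if result.get("is_error"):
        # The process exited 0, but the payload says it failed (auth error,
        # context limit, ...). Treat this as a hard failure.
        raise RuntimeError(f"claude reported error: {result}")
    return result
```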

Stall on restart. When the service restarted mid-pipeline (which happened multiple times during development), tasks in RUNNING state would never transition — they were stuck waiting for a subprocess that no longer existed. I added a startup recovery pass that resets all RUNNING tasks to PENDING, but the cleaner fix would have been idempotent task execution from the start: check whether the artifact already exists before dispatching the subprocess.

The lesson I keep relearning: structured output solves half your problems

Every time I let a persona produce free-form text where I could have required structured JSON, I paid for it later. Parsing free-form coordinator decisions ("I approve the charter, but note that section 3 should be revised") is fragile. Parsing { "decision": "approved", "notes": "..." } is not.

The pattern that worked: define the output schema first, write the prompt to produce that schema, validate on receipt. Never parse prose output when you can design it away.
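The receipt-side validation is trivial once the schema exists — a minimal sketch using plain Python rather than a schema library, with field names taken from the example above:

```python
import json

# Minimal sketch of "validate on receipt". Field names mirror the example in
# the text; a real system might use a schema library instead.

ALLOWED_DECISIONS = {"approved", "rejected"}

def parse_decision(raw: str) -> dict:
    """Accept only output that matches the coordinator decision schema."""
    decision = json.loads(raw)
    if decision.get("decision") not in ALLOWED_DECISIONS:
        raise ValueError(f"invalid decision: {decision!r}")
    return decision

parse_decision('{"decision": "approved", "notes": "section 3 could be tighter"}')
```

Prose output ("I approve the charter, but...") fails at `json.loads` immediately and loudly, instead of silently mis-parsing three steps downstream.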

This seems obvious. It wasn't obvious to me when I was writing the first few persona prompts.

What I got right by accident: the coordinator as a separate persona

I almost didn't build a coordinator persona. My original design had the PM persona review artifacts from downstream personas — the PM as both producer and judge of the plan.

Working through the architecture forced me to split these: the coordinator persona has no production role. It only reviews, approves, and routes. It has no prior involvement in the artifact it's reviewing.

This turned out to be crucial for quality. A persona that produced the charter is motivated (in some functional sense) to approve its downstream artifacts — it's reviewing whether its own plan is being followed. A coordinator persona with no prior context reviews whether the artifact is good, not whether it matches its expectations.

The coordinator caught problems the PM persona would have missed — not because the coordinator is smarter, but because the coordinator has no investment in the artifact being correct.

The number that surprised me most: four tasks per phase

When I mapped the INTAKE phase tasks — write_charter, extract_requirements, coordinator_review_charter, coordinator_review_requirements — I expected this to feel thin. Four tasks seemed like scaffolding, not a real pipeline.

In practice, the artifacts produced in INTAKE are where the most consequential decisions get made. The charter defines scope. The requirements constrain implementation. If these are wrong, everything downstream is wrong. The IMPLEMENTATION phase has more tasks and more lines of code, but its error surface is narrower — it's executing against a plan, not creating one.

The right investment is in INTAKE quality, not IMPLEMENTATION sophistication. I'd add more validation in INTAKE before I'd add more sophistication in IMPLEMENTATION.

What I'd build differently

Artifact versioning from day one. Right now, artifacts are immutable once approved. In practice, the requirements sometimes need to change when implementation hits a constraint. I hacked this by allowing repair tasks to produce updated artifacts, but the versioning model is fragile. A proper implementation would have artifact versions and a way for the coordinator to "accept with revision" rather than "approve" or "reject."
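A sketch of the versioning model I have in mind, with all names hypothetical:

```python
from dataclasses import dataclass, replace

# Sketch of the versioning model: artifacts carry a version number, and the
# coordinator can "accept with revision", which bumps the version instead of
# forcing a reject/redo cycle. All names here are hypothetical.

@dataclass(frozen=True)
class Artifact:
    name: str
    version: int
    content: dict

def apply_review(artifact: Artifact, decision: str, revised: dict = None) -> Artifact:
    """Approve as-is, or accept a coordinator-supplied revision as a new version."""
    if decision == "approve":
        return artifact
    if decision == "accept_with_revision" and revised is not None:
        return replace(artifact, version=artifact.version + 1, content=revised)
    raise ValueError(f"rejected or unknown decision: {decision}")
```

Because artifacts stay immutable, "accept with revision" produces a new version rather than mutating an approved one — the downstream personas can pin the version they were built against.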

A cost tracking pass at every task. I know roughly what each project costs in API calls, but I don't have per-task cost breakdowns. I can't answer "which persona is responsible for 60% of the token budget?" That data would be invaluable for optimization.
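The ledger itself would be small. A hypothetical sketch — record token counts per (persona, task), then roll up by persona to find the budget hogs:

```python
from collections import defaultdict

# Hypothetical per-task cost ledger: record token counts per (persona, task),
# then aggregate by persona to answer "who consumes most of the budget?".

class CostLedger:
    def __init__(self):
        self.tokens = defaultdict(int)

    def record(self, persona: str, task: str, tokens: int) -> None:
        self.tokens[(persona, task)] += tokens

    def by_persona(self) -> dict:
        """Roll per-task counts up to per-persona totals."""
        totals = defaultdict(int)
        for (persona, _task), t in self.tokens.items():
            totals[persona] += t
        return dict(totals)
```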

Supervisor intervention hooks. The system runs autonomously but has no mechanism for a human to inject a correction mid-pipeline without restarting. A "pause at next coordinator review" flag and an override input field would make the system far more useful as a supervised autonomous tool rather than a fully unsupervised one.

What actually works

The thing I'm most surprised works as well as it does: the artifact quality is genuinely good.

The charters HermesOrg produces are coherent, well-scoped documents. The requirements docs are structured and traceable. The technical plans reference the requirements correctly. The produced code is functional for the scope specified.

This isn't because the personas are smart in some special way. It's because the combination of structured artifacts, explicit input context, and typed output schemas creates a narrow channel that produces consistent results. The system doesn't need to be smart if the channel is narrow enough.

That's the real lesson from four days of building: the constraint is the feature. The more structure you impose on persona inputs and outputs, the less you need to rely on emergent coherence. Emergent coherence is fragile and unpredictable. Structural coherence is not.


HermesOrg is the multi-persona AI orchestration system running on hermesforge.dev. The arc continues with posts on the specific design decisions behind the persona roles and artifact schemas.