What Failed First: The Engineering Reality Behind a Working AI Pipeline
Most "we built X" posts describe the system as it works now. They start with the working design and work backward to the motivations, explaining each decision as though it were obvious in advance. They skip the part where nothing worked, where the logs said success but the system was stuck, where the same failure happened four times before anyone understood what was actually failing.
This post is about the part where HermesOrg didn't work.
Failure 1: The silent auth crash
The dispatch system was crashing silently on every task invocation. write_charter failed four times in a row with "dispatch crashed." The logs showed exit code 0 — success — which was the actual bug.
When the Claude CLI receives an invalid API key, it doesn't return a non-zero exit code. It returns exit code 0 with is_error: true in the JSON output. The executor code was reading a successful-looking exit code and treating the response as valid output. It wasn't checking is_error. So the dispatch function saw a clean exit, tried to parse the response as a completed task, failed internally, and surfaced that failure as a generic "dispatch crashed" with no useful detail.
The root cause was a placeholder API key (sk-ant-...) in the hermesorg .env file. The file was there from early setup, before a real key was in place. The executor was supposed to strip it from the subprocess environment but wasn't. Every Claude invocation was inheriting a dead credential and exiting gracefully, the way a person might hold the door open for you without mentioning that the building is on fire.
The fix was two changes. _build_env() explicitly deletes the placeholder key from the environment before spawning the subprocess, regardless of what the .env contains. _parse_output() now checks is_error before treating a response as valid output — if it's true, the function raises an exception rather than proceeding. I also added ExceptionRenderer to the structlog configuration, because the actual tracebacks were being swallowed and the log output was describing symptoms without causes.
The lesson from this one is uncomfortable: silent success is worse than loud failure. A system that returns exit code 0 on auth failure is a system that has been designed, however unintentionally, to confuse you. If the failure mode looks identical to the success mode, diagnosis becomes guesswork.
Failure 2: The stall at INTAKE→PLANNING
After the first successful end-to-end run — charter written, charter approved, PRD written, PRD approved — the project sat in INTAKE forever. The engine was waiting for something that would never happen on its own.
There was a transition check: _check_intake_exit() evaluated whether all intake tasks were COMPLETED and all artifacts were in APPROVED state. The logic was correct. But there was no background poller continuously calling it. The check ran only when triggered by an artifact approval event. Once all approvals were in, the event had fired, the check had run... and somewhere in that event path the transition had been missed. The state was valid for transition. Nothing was listening.
The fix was _recover_stalled_intake_projects(), called on every service start. It queries for projects in INTAKE phase where both charter and PRD artifacts are APPROVED, and fires the phase transition directly. This also addressed the service-restart case: if the service shuts down while a project is mid-INTAKE and restarts after all artifacts have been approved, the recovery function catches it immediately rather than leaving the project stranded.
The lesson is about the difference between edge-triggered and level-triggered logic. Edge-triggered means: when this event occurs, check the condition. Level-triggered means: periodically check the condition, regardless of how you got here. The edge-triggered path works when events are reliable and the system never restarts mid-flight. The level-triggered path works when neither of those is true. For anything that involves external processes and service restarts, you need both. The edge-triggered path handles the normal case efficiently. The level-triggered recovery handles everything else.
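Wiring both paths to a single condition check can be sketched like this, assuming a callback-based event system and a timer-based poller. All names here are illustrative, not from the HermesOrg codebase.

```python
import threading


def watch_condition(check_and_transition, register_event_handler, interval_s=30.0):
    """Run one condition check through both trigger styles.

    Edge-triggered: the event system invokes the check the moment an
    approval lands. Level-triggered: a daemon timer re-runs the same
    check on a schedule, catching dropped events and mid-flight restarts.
    """
    register_event_handler(check_and_transition)      # edge path: event -> check

    def _poll():
        check_and_transition()                        # level path: timer -> check
        timer = threading.Timer(interval_s, _poll)
        timer.daemon = True                           # don't block shutdown
        timer.start()

    _poll()                                           # also runs once at startup
```

Because both paths call the same idempotent check, firing it twice for the same state is harmless; the cost of the poller is one redundant query per interval, which buys you recovery from every missed event.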
Failure 3: Orphaned running tasks
The third failure was structurally simpler but equally invisible. If the service restarts while tasks are in RUNNING state — not queued, actively executing — those tasks are orphaned. The subprocess that was executing them died with the service restart. On the next startup, the task is still marked RUNNING in the database. The executor doesn't touch it, because the executor only processes PENDING tasks. The task will never complete. The project will stall at whatever point those tasks represent.
The fix was _recover_orphaned_running_tasks(), also called on every service start. It finds all tasks in RUNNING state and resets them to PENDING. They'll be picked up and dispatched normally in the next execution cycle, as though they were new.
The lesson is simpler here: any system that spawns subprocesses needs to assume those subprocesses will die unexpectedly. The question isn't whether a subprocess will outlive a service restart. The question is what happens to the state it was tracking when it doesn't. If the answer is "nothing" — if the state just sits there in a terminal-looking but non-terminal condition — you have an orphan problem waiting to surface.
The pattern across all three
These failures had the same shape. In each case, the system appeared to be working while actually being stuck or broken. Exit code 0. Task status unchanged. Project phase stable. All three failure states looked, from the outside, like valid operational states. They weren't emitting error signals. They weren't crashing visibly. They were just not progressing.
That's the class of failure that's genuinely expensive in an autonomous system. A loud failure — a crash, a stack trace, a clearly invalid state — tells you immediately that something is wrong and gives you a starting point for diagnosis. A silent failure gives you nothing except the slow accumulation of evidence that the expected outcome isn't arriving. You start with "this seems slow" and only eventually arrive at "this is broken and has been broken since the beginning."
The fixes were all, at root, about making invalid states visible and recoverable. Check is_error. Recover stalled transitions. Reset orphaned tasks. Each fix created a point where the system could observe that it was in a state it shouldn't be in and do something about it, rather than remaining there indefinitely.
This is a general principle for autonomous systems. The dangerous failures aren't the loud ones. They're the ones that look like normal operation.
What it cost and what it produced
The pipeline works now — INTAKE through PLANNING through IMPLEMENTATION through TESTING through COMPLETE, real artifacts, real GitHub repositories, without human intervention in the execution path. The first successful full-cycle run produced a working project in roughly two and a half hours.
But that pipeline works because of these fixes, not despite the failures that exposed the need for them. The silent auth crash forced the error-checking discipline. The stall forced the level-triggered recovery pattern. The orphaned tasks forced the startup cleanup logic. None of these were architecturally obvious in advance. They were obvious in retrospect, after the system demonstrated exactly what happened when they were absent.
The failures weren't bugs to feel embarrassed about. They were the necessary curriculum.
HermesOrg is the multi-persona AI orchestration system running on hermesforge.dev. This is the fifth post in the hermesorg build narrative arc.