I Gave an AI Agent a GitHub Issue and It Shipped the Feature
The pitch for autonomous software development sounds like science fiction: describe what you want, an AI builds it, tests it, and deploys it while you do something else. I've been running a system called Hermes Org that actually does this, and I want to describe what it's genuinely like — not the marketing version.
What the System Actually Does
Hermes Org is a multi-agent pipeline. When you submit a project directive, four agents run in sequence:
- Coordinator — reads the directive, breaks it into phases, assigns work
- PM agent — writes a spec: requirements, acceptance criteria, edge cases
- Engineering agent — writes the code, using Claude Code under the hood
- QA agent — runs tests, reviews the output, decides if it passes
Each agent hands off to the next. If QA rejects the output, it bounces back to Engineering for revision. The loop runs until QA passes or the budget runs out.
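The handoff loop can be sketched roughly as follows. This is an illustrative reconstruction, not the actual Hermes Org implementation: the agent functions are stubs standing in for LLM calls, and the per-revision cost model is an assumption for the sake of the example.

```python
# Illustrative sketch of the four-agent handoff loop described above.
# All agent functions are stubs; the real agents are LLM-backed.

def coordinator(directive):
    # Break the directive into phases (stubbed).
    return ["plan", "build", "verify"]

def pm_agent(directive, phases):
    # Produce a spec: requirements, acceptance criteria, edge cases (stubbed).
    return f"spec for: {directive}"

def engineering_agent(spec, feedback):
    # Write code; revise when QA feedback is present (stubbed).
    return "code v2" if feedback else "code v1"

def qa_agent(spec, code):
    # Decide pass/fail and return feedback on failure (stubbed).
    return (code == "code v2", "needs revision")

def run_pipeline(directive, budget_usd, cost_per_revision=1.0):
    """Run Coordinator -> PM -> Engineering -> QA, bouncing QA failures
    back to Engineering until QA passes or the budget runs out."""
    phases = coordinator(directive)
    spec = pm_agent(directive, phases)
    spent = 0.0
    feedback = None
    while spent < budget_usd:
        code = engineering_agent(spec, feedback)
        spent += cost_per_revision
        passed, feedback = qa_agent(spec, code)
        if passed:
            return {"status": "shipped", "code": code, "spent": spent}
    return {"status": "budget_exhausted", "spent": spent}
```

With these stubs, the first QA pass fails, Engineering revises once, and the second pass ships; a real run's revision count depends on the directive and the budget.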
The input is a single API call:
POST /api/v1/projects
{
  "name": "CSV deduplicator",
  "directive": "Build a web tool that takes a CSV upload, deduplicates rows by a user-selected column, and returns the deduplicated CSV as a download.",
  "requestor_email": "user@example.com",
  "budget_usd": 5.00
}
That's it. You get a project ID back. The agents do the rest.
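For concreteness, here is how that call could be constructed from a script using only the standard library. The path and fields come from the example above; the base URL is a placeholder, since the article doesn't give one.

```python
import json
import urllib.request

# Hypothetical host -- the article specifies only the /api/v1/projects path.
BASE_URL = "https://hermes.example.com"

payload = {
    "name": "CSV deduplicator",
    "directive": (
        "Build a web tool that takes a CSV upload, deduplicates rows by a "
        "user-selected column, and returns the deduplicated CSV as a download."
    ),
    "requestor_email": "user@example.com",
    "budget_usd": 5.00,
}

req = urllib.request.Request(
    BASE_URL + "/api/v1/projects",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; the response carries the project ID.
```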
What I Expected vs. What Happened
I expected the system to struggle with ambiguity. The directive above seems clear, but "web tool" leaves a lot open — what tech stack? What does the UI look like? Where does the file go?
What I found: the PM agent fills in the ambiguity, and it fills it in conservatively. The spec it produces specifies a single-page HTML tool with a file input, a column selector populated from the CSV headers, a deduplicate button, and a download link for the result. No database. No auth. Vanilla JavaScript. The most boring, most correct interpretation of the request.
This conservatism is a feature. The system doesn't add features you didn't ask for. It doesn't decide you probably want dark mode or keyboard shortcuts or an API endpoint as well. It builds the minimum viable version of what you said.
The Engineering agent then implements exactly that spec. The code it writes looks like it was written by a competent junior developer working from a clear ticket: readable, functional, no surprises.
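The core behavior that spec asks for is small. A Python sketch of the same logic (the agent's actual output was vanilla JavaScript, so this is a translation for illustration, not its code) looks like:

```python
import csv
import io

def deduplicate_csv(csv_text, column):
    """Keep the first row for each distinct value in `column`,
    preserving input order -- the behavior the spec describes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    kept = []
    for row in reader:
        key = row[column]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()
```

Given rows with a repeated email, deduplicating on the `email` column keeps only the first occurrence of each address.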
Where It Actually Struggles
Iteration across sessions. Each project is essentially stateless — the agent doesn't remember previous projects. If you want to build on something you shipped last week, you have to re-explain the context in your directive. The system has no long-term memory of what it's built.
UI aesthetics. The Engineering agent produces functional HTML/CSS but it doesn't produce beautiful HTML/CSS. The tools work; they look like tools from 2015. If you care about design, you're doing another pass manually.
Debugging novel errors. When a test fails for an unusual reason — a dependency version mismatch, a platform-specific behavior — the QA→Engineering bounce loop sometimes goes in circles. The Engineering agent makes a change; QA fails again; the change gets reverted. This is where human judgment still beats the loop.
Anything requiring external credentials. The directive can't say "use Stripe" without also providing the test keys. The system has no way to acquire credentials on its own.
The Economics
My budget per project is calibrated around $5 — roughly what a 30-minute Claude Code session costs. For a bounded, well-specified tool, that's usually enough. For open-ended projects, the loop can overrun.
The constraint this creates is useful: it forces you to write tight directives. Vague directives produce expensive loops. Specific directives produce working code quickly.
Think of it less like hiring a contractor and more like having a vending machine for small tools. Put in a specific request and $5, get a working tool out. The more precisely you describe what you want, the better the output.
What This Changes
The thing that surprised me most wasn't the quality of the output — it was the volume. When building a tool costs 15 minutes of writing a directive and $5 of API spend rather than a full development session, the calculus around what's worth building changes.
Small utilities that I would have previously decided weren't worth the time — a one-off data conversion tool, a quick dashboard for a metric I wanted to track, an API wrapper for a service I use occasionally — all become viable. The decision changes from "is this worth a development session?" to "is this worth $5 and a paragraph?"
Most of the time, it is.
The Honest Ceiling
Hermes Org isn't a senior developer. It doesn't architect systems, evaluate trade-offs, or push back when a directive is under-specified in ways that matter. It doesn't notice that the approach you've described will cause performance problems at scale, or that there's a simpler way to achieve the same goal.
It's closer to a very fast, very diligent junior developer who works from tickets: good at implementing clearly specified work, not yet capable of replacing the judgment that produces the tickets.
That's a useful thing to have. It's not the AGI future that sometimes gets promised. Managing the gap between those two descriptions is most of what using the system actually involves.
The demand engine for Hermes Org is live at /ideas — submit a tool you'd like built and the community votes on what gets prioritized. The highest-voted idea gets handed to the agents.