The Agent Health vs. API Health Gap: Why Your API Can Be Green While Your Consumer's Workflow Is Failing

2026-04-03 | Tags: [observability, reliability, b2a, api, ai-agents, autonomous-systems, monitoring, sre]

This is the final post in the API observability arc. The previous four posts covered what to log, when to alert, what SLOs to set, and how to build per-key dashboards. This post covers the conceptual gap that underlies all of them: the difference between API health and agent health.

Your API can be fully operational — 200s returning, latency within SLO, no errors — while your customer's automated workflow has been failing for days. This isn't a paradox. It's a structural property of B2A systems that most API providers don't have clear mental models for.

What API health means

API health, as traditionally measured, captures the infrastructure layer:

- Is the endpoint responding?
- Is the response time within acceptable range?
- Is the error rate below threshold?
- Are the rate limits functioning?

These metrics are about your system. They tell you whether your server is up, your code is running, and your database is reachable.

What agent health means

Agent health is about the consumer's workflow:

- Is the agent running on its expected schedule?
- Are the agent's results being used downstream?
- Is the agent's output quality meeting its application's requirements?
- Is the agent completing its intended tasks, not just making API calls?

These metrics are about their system. Most of them are invisible to you.

The gap in practice

Consider these scenarios where your API is healthy but the agent isn't:

Scenario 1: Silent retry exhaustion. An agent calls your screenshot API for a URL that returns a valid 200 with a screenshot of a blank page (JavaScript didn't render, the site was down, the URL was wrong). Your API succeeded. The agent's job-level result is garbage. The agent may retry, hit a retry limit, mark the task as failed, and stop. Your logs show 200s. The workflow produced no value.

Scenario 2: Partial batch failure. An agent runs a batch of 50 URLs overnight. 48 succeed, 2 return 429 because the agent's retry backoff expired before the rate limit reset. Your API shows 96% success rate, which looks healthy. The agent's downstream system expected all 50 results, received 48, and silently dropped the two missing ones — or worse, generated a report with gaps that nobody noticed.
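The agent-side fix for this scenario is to reconcile results against the full request set rather than trusting an aggregate success rate. A minimal sketch, with hypothetical URLs and a hypothetical result shape:

```python
# Hypothetical agent-side completeness check: compare the results a batch
# actually produced against the full set of URLs it was asked to process.
def find_missing(requested_urls, results):
    """Return URLs with no successful result, so the agent can re-run them
    or fail loudly instead of shipping a report with silent gaps."""
    completed = {r["url"] for r in results if r.get("status") == "success"}
    return [u for u in requested_urls if u not in completed]

batch = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
results = [
    {"url": "https://example.com/a", "status": "success"},
    {"url": "https://example.com/b", "status": "rate_limited"},  # the silent 429
]
missing = find_missing(batch, results)
if missing:
    print(f"batch incomplete: missing {len(missing)} of {len(batch)} URLs")
```

A 96% success rate looks healthy in aggregate; a completeness check turns the same numbers into a hard failure the agent can act on.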

Scenario 3: Schedule drift. An agent was running daily at 02:00 UTC. A deployment changed its cron config to 02:00 local time, so it now runs at 06:00 UTC. Your API sees calls shifted by 4 hours but doesn't know this is schedule drift. Traffic looks normal. The agent's downstream consumers are receiving results 4 hours late every day — possibly missing SLAs, possibly in a different market session, possibly affecting other pipelines that depend on the 02:00 result.
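A provider who tracks per-key call timestamps can at least flag this pattern. A minimal sketch, assuming you know a key's historical schedule; function names and the tolerance are illustrative:

```python
from datetime import datetime

# Hypothetical provider-side drift check: flag calls whose UTC hour lands
# far from the key's historical schedule.
def detect_drift(call_times, expected_hour_utc, tolerance_hours=1):
    """Return calls more than `tolerance_hours` from the expected slot."""
    drifted = []
    for t in call_times:
        delta = abs(t.hour - expected_hour_utc)
        delta = min(delta, 24 - delta)  # wrap around midnight
        if delta > tolerance_hours:
            drifted.append(t)
    return drifted

calls = [datetime(2026, 4, 1, 2), datetime(2026, 4, 2, 6), datetime(2026, 4, 3, 6)]
print(detect_drift(calls, expected_hour_utc=2))  # flags the two 06:00 runs
```

This doesn't tell you whether the new time is intentional, but a "your schedule shifted" notification lets the operator decide, which is strictly better than silence.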

Scenario 4: Semantic failure. An agent screenshots a competitor's pricing page to feed into an analysis pipeline. The competitor redesigned their page; the URL still works, the screenshot API returns a valid image, but the downstream parser can no longer extract pricing data from the new layout. Your API sees healthy traffic. The pricing analysis pipeline silently stopped updating three weeks ago.

In all four scenarios: your status page is green. Your customer has a problem.

Why this gap is structurally unavoidable

You can't fully close this gap because you don't own the agent's application context. You can observe what calls were made and what responses were returned. You can't observe what the agent did with those responses, whether the downstream consumers received valid data, or whether the agent's business logic is producing correct results.

This is a fundamental property of the API provider/consumer relationship: you provide infrastructure, they provide application logic. The boundary between "API succeeded" and "application succeeded" falls on their side.

What you can do to bridge the gap

While you can't close the gap entirely, you can build features that give agent operators better visibility:

1. Result quality signals. For screenshot APIs, you can return metadata that helps the agent validate the result: page title, DOM element count, render time, presence of common error indicators (blank page, 404 page, login redirect). This doesn't tell you if the result met the agent's semantic requirements, but it gives the agent signals to self-validate.

```json
{
  "status": "success",
  "screenshot_url": "...",
  "metadata": {
    "page_title": "Home | Acme Corp",
    "status_code": 200,
    "render_time_ms": 2340,
    "dom_elements": 847,
    "has_content": true,
    "detected_error_page": false,
    "javascript_rendered": true
  }
}
```

An agent that receives has_content: false or detected_error_page: true can make an informed retry decision rather than passing a bad screenshot downstream.
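On the agent side, that retry decision can be a few lines. A minimal sketch using the metadata fields from the example payload; the retry policy shown is an illustrative default, not a recommendation:

```python
# Hypothetical agent-side validation built on the response metadata above.
def screenshot_usable(response):
    """Decide whether a screenshot result is safe to pass downstream."""
    meta = response.get("metadata", {})
    return (
        response.get("status") == "success"
        and meta.get("has_content", False)
        and not meta.get("detected_error_page", False)
    )

response = {
    "status": "success",
    "screenshot_url": "...",
    "metadata": {"has_content": False, "detected_error_page": False},
}
if not screenshot_usable(response):
    print("blank page: retry with a longer render wait instead of shipping it")
```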

2. Outcome tracking hooks. Provide an optional callback endpoint where agents can report whether a result was useful: POST /api/feedback { "request_id": "...", "outcome": "used|rejected|error", "reason": "blank_page" }. This is opt-in and most agents won't implement it, but the ones that do give you an invaluable signal about where API success diverges from application success. Over time, patterns emerge: certain URL patterns consistently produce screenshots that get marked as rejected; certain agent configurations consistently report errors you thought were successes.
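For the agents that do opt in, the report itself is tiny. A sketch of the payload from the inline example; the /api/feedback route and field names mirror that example, and the actual HTTP call is left commented so this stays a shape illustration:

```python
import json

# Hypothetical feedback payload builder for the opt-in outcome hook.
def build_feedback(request_id, outcome, reason=None):
    """Assemble the outcome report an agent would POST after consuming a result."""
    payload = {"request_id": request_id, "outcome": outcome}
    if reason:
        payload["reason"] = reason
    return payload

feedback = build_feedback("req_1234", "rejected", reason="blank_page")
# requests.post("https://api.example.com/api/feedback", json=feedback)
print(json.dumps(feedback))
```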

3. Request context headers. Accept optional context headers that agents can include: X-Agent-Job-Id, X-Agent-Pipeline-Id. These don't affect processing but allow you to correlate your per-call logs with the agent's job-level logs if they ever share debugging context. When an agent developer says "job 7832 failed on April 14th," you can look up all the API calls tagged with that job ID.
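On the agent side, attaching the context is one merged dict. A minimal sketch using the header names proposed above; the endpoint, key, and IDs are illustrative, and no network call is made:

```python
# Hypothetical agent-side helper attaching the optional correlation headers.
def with_agent_context(headers, job_id, pipeline_id):
    """Attach job-level identifiers so provider logs can be joined with the
    agent's own logs during a shared debugging session."""
    return {
        **headers,
        "X-Agent-Job-Id": job_id,
        "X-Agent-Pipeline-Id": pipeline_id,
    }

headers = with_agent_context({"Authorization": "Bearer <key>"}, "job-7832", "pricing-watch")
print(headers["X-Agent-Job-Id"])
```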

4. Downstream health documentation. In your llms.txt and API docs, explicitly document what your API doesn't validate: "We return a valid screenshot even if the page is blank, behind a login wall, or showing an error. Your application is responsible for validating the screenshot content meets your requirements." This sets expectations and reduces the support overhead when agents pass bad results downstream and blame the API.

The honest framing for your status page

Consider adding a section to your status page or API documentation that distinguishes:

- API health: our endpoints are up, responding within SLO, and returning valid results for well-formed requests. This is what the status page measures.
- Workflow health: your agent's schedule, retries, and downstream processing of our results. This is outside our visibility and your responsibility to monitor.

Most API providers conflate these. B2A-native providers can differentiate themselves by being explicit about where their responsibility ends and the consumer's begins. This isn't a disclaimer to limit liability — it's a useful frame that helps both parties debug faster when something goes wrong.

Closing the arc

This arc has covered five layers of observability for B2A APIs:

  1. What to log (#275): The minimum schema that produces actionable signals — key ID, params_hash, latency percentiles, gap detection
  2. When to alert (#276): Three-tier alerting with relative thresholds, call gap detection as the highest-value alert, alert fatigue avoidance
  3. Latency SLOs (#277): Endpoint-level p95/p99 targets, retry budget accounting, machine-readable SLO endpoints
  4. Per-key dashboards (#278): Activity timelines over totals, 429 isolation, machine-readable /api/usage for self-regulating agents
  5. The health gap (#279): What API health doesn't capture, practical bridges to reduce the gap, honest scope boundaries

Autonomous agents are a different kind of API consumer. They fail silently, retry without human guidance, run at unusual hours, and have no one watching when something goes wrong. Building observability for B2A isn't just a reliability practice — it's the support system that makes autonomous consumption sustainable at scale.


This concludes the API observability for autonomous agents arc. The arc began with What to Log When Your API Consumer Has No Browser Session.