Latency SLOs for B2A APIs: Setting Thresholds Your Agent Consumers Can Actually Use
The previous post covered when to alert. This post covers what to promise: latency SLOs for APIs consumed by autonomous agents rather than human users.
The distinction matters more than most API providers realize. A human developer waiting 3 seconds for a screenshot can tolerate occasional 8-second outliers — they notice, they're mildly annoyed, they move on. An autonomous agent configured with a 5-second timeout fails hard at 5.001 seconds. There's no tolerance, no moving on. The agent either succeeds within the timeout window or it fails, retries, or marks the task as failed. Your latency distribution isn't just a performance metric — it's a reliability interface.
The timeout as a contract boundary
Every B2A consumer has a timeout. It might be explicit (requests.get(url, timeout=10)) or implicit (Lambda function max execution time, HTTP client defaults). When your API's p99 latency exceeds that timeout, you silently break a percentage of your integrators' workflows.
The problem: you don't know what timeout your consumers are using. You can infer it from the error logs — if you see request IDs that never receive a client-side completion signal, or if you see retry patterns with consistent timing gaps, you can reconstruct the timeout window. But you can't ask.
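The retry-gap heuristic can be sketched in a few lines. Everything here is illustrative: the timestamps are hypothetical, and the 500ms backoff allowance is an assumption — real access logs need a request fingerprint (URL + params + key) to group attempts first.

```python
from datetime import datetime, timedelta

# Hypothetical reconstruction: timestamps of successive attempts for the
# same request fingerprint, pulled from access logs.
attempts = [
    datetime(2026, 4, 1, 12, 0, 0),
    datetime(2026, 4, 1, 12, 0, 5, 200000),   # retried ~5.2s later
    datetime(2026, 4, 1, 12, 0, 10, 400000),  # and ~5.2s after that
]

def infer_timeout(attempts, slack=timedelta(milliseconds=500)):
    """Estimate the client timeout from gaps between retry attempts.

    Each gap is roughly (timeout + backoff), so the smallest gap minus
    a backoff allowance is an upper bound on the client's timeout.
    """
    gaps = [later - earlier for earlier, later in zip(attempts, attempts[1:])]
    return min(gaps) - slack

print(infer_timeout(attempts))  # ~4.7s: the client's timeout is at most this
```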
The practical implication: your SLO has to be conservative enough to fit inside the tightest plausible client timeout. For most screenshot APIs, that means:
- p50 target: ≤ 1.5s (fast path, cached, simple pages)
- p95 target: ≤ 4s (typical complex pages, auth flows, render delays)
- p99 target: ≤ 8s (worst case — heavy SPAs, slow origins, screenshot processing)
A p99 of 8s assumes your integrators are setting timeouts of 10s or more. If you're seeing clients with 5s timeouts (common in serverless), your effective p99 target becomes ≤ 4s.
Why p95 and p99 matter more than mean
Mean latency is almost useless for B2A reliability planning.
Consider two API configurations:
- Config A: 95% of calls at 1s, 5% at 15s. Mean: ~1.7s.
- Config B: 95% of calls at 2s, 5% at 3s. Mean: ~2.05s.
Config A has the better mean, but its 5% tail breaks any integration with a sub-15s timeout on 1 in 20 calls. Config B's tail is bounded. For an agent making 100 calls per day, Config A produces 5 failures per day; Config B produces zero.
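The arithmetic generalizes to any two-point (or n-point) latency distribution. A quick sketch, with the numbers taken from the example above:

```python
# Expected daily timeout failures under a simple latency distribution.
# `dist` is a list of (probability, latency_s) pairs.
def expected_failures(calls_per_day, dist, timeout_s):
    return calls_per_day * sum(p for p, latency in dist if latency > timeout_s)

config_a = [(0.95, 1.0), (0.05, 15.0)]  # better mean, long tail
config_b = [(0.95, 2.0), (0.05, 3.0)]   # worse mean, bounded tail

print(expected_failures(100, config_a, timeout_s=10))  # → 5.0
print(expected_failures(100, config_b, timeout_s=10))  # → 0
```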
B2A agents often have zero tolerance for partial failure. A pipeline that processes a list of 50 URLs fails the entire batch if any call times out and the agent doesn't handle the partial result case. Your tail latency is your de facto SLO for those consumers.
Set your SLOs at p95 and p99, not mean. Report them that way. If you publish a status page, show percentiles.
Endpoint-level vs. aggregate SLOs
Most API providers publish a single aggregate SLO. For B2A, endpoint-level SLOs are more useful because agents consume specific endpoints, not "the API."
A screenshot API might have:
- GET /api/screenshot?url=... — latency driven by target page render time (high variance)
- GET /api/screenshot/batch — latency driven by batch size (predictable, scales linearly)
- GET /api/health — always < 50ms (internal check)
- GET /api/perf?url=... — latency driven by Lighthouse run time (consistently slow, ~8s)
An agent integrating your batch endpoint needs to know the batch endpoint's SLO, not the aggregate across all endpoints. The aggregate hides the shape of the distribution they'll actually experience.
Publish per-endpoint SLOs in your API documentation and llms.txt. An agent reading your API docs should be able to configure its timeout budget based on documented p99 latency, not guesswork.
Factoring in retry budgets
Agents that retry on failure need to budget for retry latency. If your p99 is 8s and the agent retries once on timeout, the effective worst-case latency for a single call is 16s + overhead. If the agent has a 30s task budget, that's already 53% consumed by one failing call.
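The budget math can be made explicit. A sketch using the numbers from this example (backoff defaults to zero here because the example ignores backoff overhead):

```python
def worst_case_latency_s(p99_s, retries, backoff_s=0.0):
    """Worst case for one logical call: the first attempt plus every
    retry runs all the way to the p99/timeout, plus fixed backoffs."""
    return (1 + retries) * p99_s + retries * backoff_s

worst = worst_case_latency_s(p99_s=8.0, retries=1)        # 16.0s
task_budget_s = 30.0
print(f"{worst / task_budget_s:.0%} of the task budget")  # 53% of the task budget
```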
The implication: your SLO should inform the recommended retry strategy in your documentation. Something like:
Recommended client configuration:
- Timeout: 10s per request
- Retries: 2 attempts on 5xx or timeout
- Backoff: 1s fixed (not exponential — screenshot requests don't benefit from exponential backoff)
- Total budget per URL: ~32s worst case
This is information an agent can use. It can configure its pipeline with an explicit budget rather than discovering the effective budget through failures.
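A client implementing that recommended configuration might look like the following sketch. The base URL is hypothetical, and stdlib `urllib` stands in for whatever HTTP client the agent actually uses:

```python
import time
import urllib.error
import urllib.parse
import urllib.request

API = "https://api.example.com/api/screenshot"  # hypothetical base URL

def fetch_screenshot(url, timeout=10, retries=2, backoff_s=1.0):
    """Retry loop matching the recommended configuration above:
    10s timeout per request, 2 retries on 5xx or timeout, 1s fixed backoff."""
    target = f"{API}?{urllib.parse.urlencode({'url': url})}"
    last_err = None
    for attempt in range(1 + retries):
        try:
            with urllib.request.urlopen(target, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code < 500:
                raise                      # client error: retrying won't help
            last_err = err
        except (TimeoutError, urllib.error.URLError) as err:
            last_err = err
        if attempt < retries:
            time.sleep(backoff_s)          # fixed backoff, not exponential
    raise last_err
```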
Setting the SLO based on actual data
Don't publish an aspirational SLO. Measure first, then commit.
-- Calculate current latency percentiles by endpoint (last 7 days)
SELECT
endpoint,
COUNT(*) AS total_calls,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration_ms) AS p50,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99,
MAX(duration_ms) AS p100
FROM api_calls
WHERE
timestamp > NOW() - INTERVAL '7 days'
AND status_code < 500 -- exclude server errors from latency baseline
GROUP BY endpoint
ORDER BY total_calls DESC;
Run this weekly. Your SLO should be set at 110-120% of your current p99. If your actual p99 is 6.2s, publish a p99 SLO of 7s or 8s. The buffer accounts for natural variance. Never publish a p99 SLO tighter than your measured p99 — that's a guaranteed SLO breach every week.
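The 110-120% rule can be wired into the same weekly job. A sketch — the 500ms round-up step is my addition, purely so the published number is clean:

```python
import math

def publishable_p99_ms(measured_p99_ms, buffer=1.15, round_to_ms=500):
    """Set the published SLO 10-20% above the measured p99 (buffer=1.15
    sits mid-range), rounded up to a clean number for the docs."""
    target = measured_p99_ms * buffer
    return math.ceil(target / round_to_ms) * round_to_ms

print(publishable_p99_ms(6200))  # → 7500 (measured 6.2s, publish 7.5s)
```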
What to do when you're missing SLO
When p99 breaches your target, the right response for a B2A API differs from the consumer-facing playbook:
- Check if it's a specific consumer or global: A single agent making poorly formed requests can inflate your tail. Isolate by api_key_id. If one key is generating 30% of the high-latency calls, the problem may be their request pattern (enormous pages, JavaScript-heavy targets, incorrect parameters).
- Check origin latency separately from processing latency: Log the time from request received to screenshot capture started, and from capture started to response sent. If your processing time is fine but origin latency is high, the problem is the target URL, not your infrastructure. Document this — "latency is dependent on target page render time" — so agents can plan accordingly.
- Consider per-origin caching: Screenshot APIs that cache results by URL+params can dramatically improve tail latency for repeat callers. Agents often call the same URLs repeatedly (monitoring use cases, periodic visual regression). A 5-minute cache window turns your p99 into your cache-miss p99, which affects a much smaller fraction of calls.
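A minimal sketch of the per-origin cache idea — in-memory and single-process for illustration, where a real deployment would likely use Redis or similar, and `render` stands in for whatever actually captures the screenshot:

```python
import time

CACHE_TTL_S = 300   # 5-minute window, per the discussion above
_cache = {}         # (url, params_key) -> (expires_at, result)

def cached_screenshot(url, params_key, render):
    """Serve repeat URL+params requests from cache; only cache misses
    pay the full render cost, so they alone define the tail."""
    key = (url, params_key)
    now = time.monotonic()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]
    result = render(url)
    _cache[key] = (now + CACHE_TTL_S, result)
    return result
```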
Publishing your SLOs where agents can read them
Humans read status pages. Agents read structured data.
Put your SLOs in your llms.txt, your OpenAPI spec description, and a machine-readable /api/slo endpoint:
{
"version": "1.0",
"updated": "2026-04-01",
"endpoints": {
"/api/screenshot": {
"p50_ms": 1200,
"p95_ms": 3800,
"p99_ms": 7500,
"recommended_timeout_ms": 10000,
"recommended_retries": 2
},
"/api/screenshot/batch": {
"p50_ms_per_item": 900,
"p95_ms_per_item": 3200,
"p99_ms_per_item": 6800,
"recommended_timeout_ms": "batch_size * 8000",
"recommended_retries": 1
}
}
}
An agent that reads this endpoint can configure itself correctly before making a single API call. That's a better integration experience than discovering timeouts through production failures.
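On the consuming side, self-configuration from that payload takes a few lines. A sketch using a subset of the document above — the HTTP fetch of /api/slo is omitted, and the 30s fallback for undocumented endpoints is an assumption:

```python
# Subset of the published SLO document shown above.
slo = {
    "/api/screenshot": {
        "p99_ms": 7500,
        "recommended_timeout_ms": 10000,
        "recommended_retries": 2,
    },
}

def client_config(endpoint, slo, fallback_timeout_ms=30000):
    """Derive timeout/retry settings from a published SLO document,
    falling back to a generous default for undocumented endpoints."""
    entry = slo.get(endpoint, {})
    return {
        "timeout_s": entry.get("recommended_timeout_ms", fallback_timeout_ms) / 1000,
        "retries": entry.get("recommended_retries", 0),
    }

print(client_config("/api/screenshot", slo))  # {'timeout_s': 10.0, 'retries': 2}
```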
Part of the API observability for autonomous agents arc. Previous: Alerting Thresholds for B2A APIs. Next: per-key dashboards and usage analytics.