Building an AI Agent That Can See: Integrating Screenshot APIs with Vision Models

2026-03-26 | Tags: [ai, python, api, agents, llm, screenshot]

Language models are text-native. They understand HTML if you feed them HTML — but that's not how humans read the web. We see rendered pages: the visual hierarchy, the hero images, the button that's greyed out, the error message buried below the fold.

A screenshot API bridges this gap. Combined with a vision-capable LLM, it gives an AI agent something close to visual web perception.

The Architecture

URL → Screenshot API → Image → Vision LLM → Action/Answer

The agent receives a URL and a task. It captures a screenshot, passes the image to a vision model, and the model describes what it sees, answers questions about the content, or determines the next action.

This is more useful than HTML parsing for:

- Layout-sensitive tasks: Is the checkout button above the fold? Does the mobile layout break?
- Visual content: Charts, diagrams, images, rendered markdown
- Live state: A screenshot captures what the user actually sees, not the source, including whatever JavaScript has rendered
- Error detection: 500 pages, CAPTCHA challenges, broken images

Setup

import anthropic
import base64
import json
import requests
from dataclasses import dataclass

SCREENSHOT_API_BASE = "https://hermesforge.dev/api"
SCREENSHOT_API_KEY = "your-api-key"

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var


def capture_screenshot(url: str, width: int = 1280, height: int = 800) -> bytes:
    """Capture a screenshot and return raw bytes."""
    response = requests.get(
        f"{SCREENSHOT_API_BASE}/screenshot",
        params={"url": url, "width": width, "height": height, "format": "png"},
        headers={"X-API-Key": SCREENSHOT_API_KEY},
        timeout=30,
    )
    response.raise_for_status()
    return response.content


def image_to_base64(image_bytes: bytes) -> str:
    return base64.standard_b64encode(image_bytes).decode("utf-8")

Basic Visual Question Answering

def ask_about_page(url: str, question: str) -> str:
    """
    Capture a page and ask a vision model a question about it.
    """
    image_bytes = capture_screenshot(url)
    image_b64 = image_to_base64(image_bytes)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": question,
                    },
                ],
            }
        ],
    )

    return response.content[0].text


# Examples
print(ask_about_page(
    "https://example.com/pricing",
    "What pricing plans are available and what are their prices?"
))

print(ask_about_page(
    "https://example.com/checkout",
    "Is the checkout form complete? Are there any visible errors?"
))

print(ask_about_page(
    "https://example.com/dashboard",
    "What is the primary call to action on this page?"
))

Building a Simple Web Agent

A more capable pattern: an agent that loops — taking screenshots, reasoning about what it sees, and deciding what to check next.

@dataclass
class AgentState:
    task: str
    visited_urls: list[str]
    findings: list[dict]
    done: bool = False
    conclusion: str = ""


def run_web_agent(start_url: str, task: str, max_steps: int = 5) -> AgentState:
    """
    A minimal web agent that uses screenshots for perception.

    The agent captures pages, reasons about them, and reports findings.
    It doesn't click or navigate — it observes and analyzes.
    """
    state = AgentState(task=task, visited_urls=[], findings=[])
    urls_to_visit = [start_url]

    system_prompt = """You are a web analysis agent. You will be shown screenshots of web pages.

    For each page, respond with JSON in this format:
    {
        "page_summary": "What this page shows",
        "task_relevant_findings": ["finding 1", "finding 2"],
        "urls_to_check": ["https://..."],  // other URLs worth examining for this task
        "task_complete": false,
        "conclusion": ""  // fill this when task_complete is true
    }

    Be factual and specific. Only include URLs that are clearly visible on the page."""

    for step in range(max_steps):
        if not urls_to_visit or state.done:
            break

        url = urls_to_visit.pop(0)
        if url in state.visited_urls:
            continue

        state.visited_urls.append(url)

        try:
            image_bytes = capture_screenshot(url)
            image_b64 = image_to_base64(image_bytes)
        except requests.RequestException as e:
            state.findings.append({"url": url, "error": str(e)})
            continue

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            system=system_prompt,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/png",
                                "data": image_b64,
                            },
                        },
                        {
                            "type": "text",
                            "text": f"URL: {url}\n\nTask: {state.task}\n\nAnalyze this page.",
                        },
                    ],
                }
            ],
        )

        import json
        try:
            result = json.loads(response.content[0].text)
            state.findings.append({"url": url, **result})

            if result.get("task_complete"):
                state.done = True
                state.conclusion = result.get("conclusion", "")
            else:
                # Add new URLs to visit, filter out already visited
                new_urls = [
                    u for u in result.get("urls_to_check", [])
                    if u not in state.visited_urls and u not in urls_to_visit
                ]
                urls_to_visit.extend(new_urls[:2])  # cap to avoid runaway crawling

        except json.JSONDecodeError:
            state.findings.append({
                "url": url,
                "raw_response": response.content[0].text
            })

    return state


# Usage
state = run_web_agent(
    start_url="https://example.com",
    task="What payment methods are accepted and what are the pricing tiers?"
)

print(f"Visited: {state.visited_urls}")
print(f"Conclusion: {state.conclusion}")
for finding in state.findings:
    print(f"\n{finding['url']}:")
    for f in finding.get("task_relevant_findings", []):
        print(f"  - {f}")

Practical Use Cases

Competitive monitoring: Capture competitor pricing pages weekly, ask the model "has anything changed since last month?", diff the answers.

QA automation: After a deploy, capture key pages and ask "does anything look broken, misaligned, or missing compared to the production version?"

Content auditing: Given a list of landing pages, ask "is the primary CTA visible without scrolling on a 1280x800 viewport?"

Accessibility checking: "Are there any images without visible alt text? Does the contrast look sufficient for the hero section?"

OG preview validation: Capture https://og.hermesforge.dev/?url=YOUR_URL, ask "does this look like a well-formatted social media preview?"

Rate Limit Considerations

Vision requests consume API quota from both services:

| Task | Screenshot calls | LLM calls |
| --- | --- | --- |
| Single page Q&A | 1 | 1 |
| 5-page agent run | 1-5 | 1-5 |
| Daily monitoring (10 pages) | 10 | 10 |
| Hourly competitive monitoring | 240/day | 240/day |

The free screenshot tier (50/day) covers development and one-off runs. For agents running on a schedule, the Starter (200/day) or Pro (1000/day) tiers give comfortable headroom.
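
A quick sanity check of a schedule against those daily limits (tier names and limits taken from this post; treat them as illustrative):

```python
TIER_DAILY_LIMITS = {"free": 50, "starter": 200, "pro": 1000}


def daily_screenshot_budget(pages_per_run: int, runs_per_day: int) -> dict:
    """Estimate daily screenshot usage and find the cheapest tier that fits."""
    needed = pages_per_run * runs_per_day
    fitting = [t for t, limit in TIER_DAILY_LIMITS.items() if limit >= needed]
    return {
        "screenshots_per_day": needed,
        "cheapest_tier": min(fitting, key=TIER_DAILY_LIMITS.get) if fitting else None,
    }
```

Hourly monitoring of 10 pages works out to 240 screenshots/day, which lands in the Pro tier.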

Screenshot responses are cacheable — a page that hasn't changed doesn't need a fresh capture. Use the caching patterns from the previous post to avoid redundant calls.

The B2A Pattern

This is the core Browser-to-Agent (B2A) pattern: a screenshot API as the visual perception layer for an autonomous system. The agent doesn't need a browser — it needs eyes. The screenshot API provides the image; the vision model provides the interpretation.

The pricing alignment is natural: both screenshot API calls and LLM calls are per-request. An agent that checks 100 pages/day is a $4/month customer by volume — the Starter tier covers it. An agent monitoring thousands of pages needs the Business tier. The economics track usage rather than seats, which matches how agents consume infrastructure.


Screenshot API documentation at hermesforge.dev/api. Free tier: 50 screenshots/day — enough for development and light agent workloads.