Screenshot API for AI Agents and LLM Applications

2026-05-09 | Tags: [screenshot-api, ai-agents, llm, multimodal, tutorial]

Language models can read web content — but the web is increasingly visual. Single-page applications render in JavaScript. Charts and dashboards exist only as rendered pixels. Login states, cookie banners, and dynamic UI components don't exist in raw HTML. When an AI agent needs to truly see a webpage — not parse its source — a screenshot API bridges the gap.

This post covers the patterns that AI agents and LLM-powered applications use to integrate screenshot capture as a perception layer.

Why AI Agents Need Screenshots

Vision over source. GPT-4V, Claude, and Gemini can analyze images. A screenshot of a dashboard tells a vision model what a user sees. The raw HTML of the same dashboard — React component trees, Redux state, Webpack bundles — communicates almost nothing visually meaningful.

JavaScript-rendered content. Much of a modern web application's content never appears in its HTML source. Screenshots capture the rendered state: the prices, the charts, the logged-in UI, the filled form.

Grounding in current state. An agent building a report needs current data. A screenshot timestamped to the moment of capture grounds the agent's understanding in what the page actually shows right now, not in cached or hallucinated content.

Web automation verification. After an agent performs an action (fill a form, click a button, navigate to a page), a screenshot confirms the result. Agents that verify their own actions are more reliable than those that don't.

Basic Pattern: Screenshot + Vision Model

import anthropic
import requests
import base64

SCREENSHOT_API_KEY = "YOUR_KEY"
ANTHROPIC_API_KEY = "YOUR_KEY"

def screenshot_to_base64(url: str, width: int = 1280) -> str:
    """Capture a screenshot and return as base64-encoded PNG."""
    response = requests.get(
        "https://hermesforge.dev/api/screenshot",
        params={"url": url, "width": width, "format": "png", "delay": 2000},
        headers={"X-API-Key": SCREENSHOT_API_KEY},
        timeout=45
    )
    response.raise_for_status()
    return base64.standard_b64encode(response.content).decode("utf-8")


def analyze_webpage(url: str, question: str) -> str:
    """
    Capture a screenshot and ask Claude a question about what it sees.
    """
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    image_data = screenshot_to_base64(url)

    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": f"This is a screenshot of {url}. {question}"
                }
            ],
        }]
    )
    return message.content[0].text


# Usage
result = analyze_webpage(
    "https://example.com/pricing",
    "What pricing plans are available and what are their prices?"
)
print(result)

Agent Tool: Screenshot as a Perception Tool

In an agentic framework where the model has tools, screenshot capture is a natural perception primitive:

import anthropic
import requests
import base64
import json

client = anthropic.Anthropic()

# Define screenshot as an agent tool
tools = [
    {
        "name": "capture_screenshot",
        "description": "Capture a screenshot of any public URL. Use this when you need to see the current visual state of a webpage — for reading content that requires JavaScript rendering, verifying page states, extracting visual information like prices or charts, or confirming actions were successful.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to screenshot"
                },
                "full_page": {
                    "type": "boolean",
                    "description": "Whether to capture the full page height (true) or viewport only (false). Default false.",
                    "default": False
                },
                "delay_ms": {
                    "type": "integer",
                    "description": "Milliseconds to wait after page load before capturing. Use 2000-4000 for JS-heavy pages.",
                    "default": 1000
                }
            },
            "required": ["url"]
        }
    }
]


def capture_screenshot(url: str, full_page: bool = False, delay_ms: int = 1000) -> dict:
    """Execute the screenshot tool."""
    response = requests.get(
        "https://hermesforge.dev/api/screenshot",
        params={
            "url": url,
            "width": 1440,
            "format": "png",
            "full_page": str(full_page).lower(),
            "delay": delay_ms
        },
        headers={"X-API-Key": SCREENSHOT_API_KEY},
        timeout=45
    )

    if response.status_code != 200:
        return {"error": f"Screenshot failed: HTTP {response.status_code}"}

    return {
        "image_base64": base64.standard_b64encode(response.content).decode("utf-8"),
        "media_type": "image/png",
        "url": url
    }


def run_agent(task: str, max_turns: int = 10) -> str:
    """
    Run an agent loop that can use screenshot capture as a tool.
    """
    messages = [{"role": "user", "content": task}]

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        # If the model didn't request a tool (end_turn, max_tokens, etc.),
        # return whatever final text it produced rather than looping again
        if response.stop_reason != "tool_use":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Task complete."

        # Process tool uses
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []

            for block in response.content:
                if block.type == "tool_use":
                    if block.name == "capture_screenshot":
                        result = capture_screenshot(**block.input)

                        if "error" in result:
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": result["error"],
                                "is_error": True
                            })
                        else:
                            # Return the image as a vision-accessible result
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": [
                                    {
                                        "type": "image",
                                        "source": {
                                            "type": "base64",
                                            "media_type": result["media_type"],
                                            "data": result["image_base64"]
                                        }
                                    },
                                    {
                                        "type": "text",
                                        "text": f"Screenshot of {result['url']} captured successfully."
                                    }
                                ]
                            })

            messages.append({"role": "user", "content": tool_results})

    return "Max turns reached."


# Run an agent that uses screenshots to research a topic
result = run_agent(
    "Go to https://stripe.com/pricing and summarize the pricing tiers, "
    "including what's included in each tier and the monthly cost."
)
print(result)

Web Monitoring Agent

An agent that periodically checks a set of URLs for significant changes and summarizes what changed:

import hashlib
from datetime import datetime, timezone
from pathlib import Path

class WebMonitorAgent:
    """
    An agent that monitors URLs visually and reports significant changes.
    Uses Claude's vision to distinguish meaningful changes from cosmetic ones.
    """

    def __init__(self, urls: list[str], storage_dir: str = "monitors"):
        self.urls = urls
        self.storage = Path(storage_dir)
        self.storage.mkdir(exist_ok=True)
        self.client = anthropic.Anthropic()

    def capture(self, url: str) -> bytes:
        response = requests.get(
            "https://hermesforge.dev/api/screenshot",
            params={"url": url, "width": 1440, "full_page": True, "format": "png"},
            headers={"X-API-Key": SCREENSHOT_API_KEY},
            timeout=45
        )
        response.raise_for_status()
        return response.content

    def get_baseline_path(self, url: str) -> Path:
        slug = hashlib.md5(url.encode()).hexdigest()[:12]
        return self.storage / f"{slug}_baseline.png"

    def check_url(self, url: str) -> dict:
        current = self.capture(url)
        current_hash = hashlib.sha256(current).hexdigest()

        baseline_path = self.get_baseline_path(url)

        if not baseline_path.exists():
            baseline_path.write_bytes(current)
            return {"url": url, "status": "baseline_set", "changed": False}

        baseline = baseline_path.read_bytes()
        baseline_hash = hashlib.sha256(baseline).hexdigest()

        if current_hash == baseline_hash:
            return {"url": url, "status": "unchanged", "changed": False}

        # Hashes differ — ask Claude if this is a meaningful change
        analysis = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Here are two screenshots of the same webpage. The first is the baseline (previous state) and the second is current. Identify any meaningful changes (content updates, new features, price changes, removed sections). Ignore minor rendering differences. Be specific and concise."},
                    {"type": "text", "text": "BASELINE:"},
                    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": base64.standard_b64encode(baseline).decode()}},
                    {"type": "text", "text": "CURRENT:"},
                    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": base64.standard_b64encode(current).decode()}},
                ]
            }]
        )

        # Update baseline to current
        baseline_path.write_bytes(current)

        return {
            "url": url,
            "status": "changed",
            "changed": True,
            "analysis": analysis.content[0].text,
            "checked_at": datetime.now(timezone.utc).isoformat()
        }

    def run(self) -> list[dict]:
        results = []
        for url in self.urls:
            try:
                result = self.check_url(url)
                results.append(result)
                if result["changed"]:
                    print(f"CHANGED: {url}\n{result['analysis']}\n")
            except Exception as e:
                results.append({"url": url, "status": "error", "error": str(e)})
        return results


# Usage
agent = WebMonitorAgent([
    "https://openai.com/api/pricing",
    "https://anthropic.com/pricing",
])
changes = agent.run()

Competitive Intelligence Pipeline

An agent that tracks competitor product pages and summarizes weekly changes:

def competitive_intelligence_report(competitor_urls: dict[str, str]) -> str:
    """
    Screenshot competitor pages and generate a structured comparison report.
    competitor_urls: {company_name: url}
    """
    client = anthropic.Anthropic()
    screenshots = {}

    for company, url in competitor_urls.items():
        try:
            response = requests.get(
                "https://hermesforge.dev/api/screenshot",
                params={"url": url, "width": 1440, "full_page": True, "format": "png", "delay": 3000},
                headers={"X-API-Key": SCREENSHOT_API_KEY},
                timeout=60
            )
            if response.status_code == 200:
                screenshots[company] = base64.standard_b64encode(response.content).decode()
        except requests.RequestException as exc:
            print(f"Skipping {company}: screenshot failed ({exc})")


    if not screenshots:
        return "No screenshots captured."

    content = [{"type": "text", "text": "You are analyzing competitor pricing pages. For each screenshot, extract: pricing tiers, prices, key features included, and any notable positioning or messaging. Then provide a brief comparison summary."}]

    for company, img_data in screenshots.items():
        content.append({"type": "text", "text": f"\n## {company}"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": img_data}
        })

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

Key Design Considerations for AI Agent Use

Delay matters. JavaScript-heavy dashboards and SPAs need 2000–4000ms delay after page load. Without it, the screenshot captures a loading state rather than the rendered content.

Full page vs. viewport. For reading content, use full_page=true. For verifying a specific interaction result (was the button clicked?), viewport-only is usually sufficient and faster.

Vision model image size limits. Claude and GPT-4V have input token limits. A 1440px full-page screenshot of a long page can be large. Consider using 1280px width and viewport-only when the full page isn't needed — it reduces both image size and API latency.
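As a rough sizing aid, Anthropic documents an approximation of tokens ≈ (width × height) / 750 for image inputs. A small helper can estimate cost before capture and derive a width that fits a token budget — the helper names and budget value below are illustrative:

```python
import math


def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Approximate Claude vision token cost: (width * height) / 750."""
    return (width_px * height_px) // 750


def width_for_budget(aspect_ratio: float, token_budget: int) -> int:
    """Largest capture width whose estimated cost fits the budget,
    given the image's width/height aspect ratio.

    From w * (w / ar) / 750 <= budget  =>  w <= sqrt(budget * 750 * ar).
    """
    return int(math.sqrt(token_budget * 750 * aspect_ratio))
```

By this estimate a 1280×800 viewport lands around 1,365 tokens, while a 1440px-wide full-page capture of a long article can easily cost several times that.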

Rate limiting in agent loops. Agents that run tool calls in tight loops can exhaust API rate limits quickly. Implement per-URL caching (5-minute TTL for monitoring, longer for research) and add delay between captures in batch operations.
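A per-URL cache is only a few lines. The sketch below is framework-agnostic — the `TTLCache` name and the 300-second default are illustrative, and `fetch` stands in for whatever function actually calls the screenshot API:

```python
import time


class TTLCache:
    """Minimal per-key cache with a time-to-live, for de-duplicating
    screenshot requests inside an agent loop."""

    def __init__(self, ttl_seconds: float = 300):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, bytes]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired — evict and miss
            return None
        return value

    def put(self, key: str, value: bytes) -> None:
        self._store[key] = (time.monotonic(), value)


def cached_capture(cache: TTLCache, url: str, fetch) -> bytes:
    """Return a cached screenshot if still fresh, otherwise fetch(url)."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    image = fetch(url)
    cache.put(url, image)
    return image
```

With this in place, an agent that looks at the same URL three times in one reasoning loop pays for one capture instead of three.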

Error handling in tool definitions. When a screenshot fails (timeout, 429, JS error), return a structured error in the tool result rather than raising an exception. Well-designed agents can decide to retry, skip, or surface the failure rather than crashing the entire run.
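One way to make failures both structured and recoverable is a small retry wrapper around the capture call. In this sketch, `capture` is any callable returning `(status_code, body)`, and the set of retryable statuses is a reasonable default rather than an API contract:

```python
import time

# Transient failures worth retrying; 4xx client errors (other than 429) are not
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def capture_with_retry(capture, url: str, max_attempts: int = 3,
                       base_delay: float = 1.0) -> dict:
    """Call capture(url) -> (status_code, body), retrying transient failures
    with exponential backoff. Always returns a structured dict so the agent
    loop never sees a raised exception."""
    for attempt in range(max_attempts):
        try:
            status, body = capture(url)
        except Exception as exc:  # network errors count as retryable
            status, body = None, str(exc)

        if status == 200:
            return {"ok": True, "image": body}
        if status is not None and status not in RETRYABLE_STATUSES:
            # Permanent failure — let the agent decide to skip or report it
            return {"ok": False, "error": f"HTTP {status}", "retryable": False}
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

    return {"ok": False,
            "error": f"gave up after {max_attempts} attempts",
            "retryable": True}
```

The tool executor then maps `{"ok": False, ...}` onto a `tool_result` with `is_error: true`, exactly as the `run_agent` loop above does, and the model can choose its own recovery strategy.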


The screenshot API is compatible with all major vision models and agent frameworks. First 100 requests free.