Screenshot API for LlamaIndex Agents: Visual Web Retrieval in RAG Pipelines

2026-05-18 | Tags: [screenshot-api, llamaindex, ai-agents, python, tutorials, b2a, rag]

LlamaIndex's architecture is retrieval-first: it excels at connecting LLMs to data sources through indexes, retrievers, and query engines. Adding visual web perception to a LlamaIndex system means treating screenshots as a retrieval modality — not just text from web pages, but the rendered visual state of those pages at a specific moment.

This post covers two integration points: screenshot tools for LlamaIndex agents (the simpler case) and screenshot-based document nodes for RAG pipelines (the more powerful case, enabling visual retrieval alongside text retrieval).

Screenshot Tools for LlamaIndex Agents

LlamaIndex's FunctionTool wraps Python callables as agent tools:

import requests
import base64
from llama_index.core.tools import FunctionTool
from llama_index.core.agent import ReActAgent
from llama_index.llms.anthropic import Anthropic

SCREENSHOT_API = "https://hermesforge.dev/api/screenshot"
API_KEY = "your_api_key_here"

def capture_webpage(url: str, full_page: bool = False, width: int = 1440) -> str:
    """
    Capture a screenshot of a web page and return it as base64-encoded PNG.

    Use this tool when you need to see the visual layout, design, or rendered state
    of a web page — not just its text content. Particularly useful for:
    - Comparing design and UX between pages
    - Verifying that content renders correctly
    - Capturing dynamic content that text extraction misses
    - Observing visual hierarchy and call-to-action placement

    Args:
        url: The URL to screenshot.
        full_page: If True, captures the full scrollable page. Default: False (viewport only).
        width: Viewport width in pixels. Default: 1440.

    Returns:
        Base64-encoded PNG with metadata prefix.
    """
    resp = requests.get(SCREENSHOT_API, params={
        "url": url,
        "width": width,
        "format": "png",
        "full_page": full_page,
        "wait_for": "networkidle"
    }, headers={"X-API-Key": API_KEY})
    resp.raise_for_status()

    size_kb = len(resp.content) // 1024
    b64 = base64.b64encode(resp.content).decode("utf-8")
    return f"[Visual capture of {url} | {size_kb}KB]\ndata:image/png;base64,{b64}"

def capture_and_compare(url_a: str, url_b: str) -> str:
    """
    Screenshot two URLs side-by-side for visual comparison.

    Args:
        url_a: First URL.
        url_b: Second URL.

    Returns:
        Both screenshots labeled A and B.
    """
    results = []
    for label, url in [("A", url_a), ("B", url_b)]:
        resp = requests.get(SCREENSHOT_API, params={
            "url": url, "width": 1440, "format": "png", "wait_for": "networkidle"
        }, headers={"X-API-Key": API_KEY})
        resp.raise_for_status()
        b64 = base64.b64encode(resp.content).decode("utf-8")
        results.append(f"[Page {label}: {url}]\ndata:image/png;base64,{b64}")
    return "\n\n---\n\n".join(results)

# Wrap as LlamaIndex tools
screenshot_tool = FunctionTool.from_defaults(fn=capture_webpage)
compare_tool = FunctionTool.from_defaults(fn=capture_and_compare)

# Create agent with vision-capable LLM
llm = Anthropic(model="claude-opus-4-6", max_tokens=4096)

agent = ReActAgent.from_tools(
    tools=[screenshot_tool, compare_tool],
    llm=llm,
    verbose=True,
    system_prompt=(
        "You are a web analyst with visual perception capabilities. "
        "Use screenshot tools to observe web pages directly when visual analysis is needed. "
        "Describe what you see with specificity — layout, hierarchy, CTAs, trust signals."
    )
)

# Example query
response = agent.chat(
    "Compare the pricing pages of these two SaaS products and tell me "
    "which has a stronger conversion design: "
    "https://product-a.com/pricing vs https://product-b.com/pricing"
)

Screenshot Nodes in RAG Pipelines

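The more powerful integration is treating screenshots as document nodes in a RAG pipeline. This enables retrieval over a collection of captured pages: a text query selects the relevant captures by their metadata, and a multimodal LLM then analyzes the matching screenshots in the context of that query.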

import base64
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import requests
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal

# SCREENSHOT_API and API_KEY are the constants defined in the previous listing.

class VisualWebIndex:
    """
    Index of web pages captured as screenshots for visual RAG retrieval.
    Saves each screenshot as a PNG on disk and records its metadata in a JSON
    file; the metadata text is what gets embedded for retrieval.
    """

    def __init__(self, index_dir: str = "visual_web_index"):
        self.index_dir = Path(index_dir)
        self.index_dir.mkdir(exist_ok=True)
        self.screenshots_dir = self.index_dir / "screenshots"
        self.screenshots_dir.mkdir(exist_ok=True)
        self.metadata_file = self.index_dir / "index_metadata.json"
        self.metadata = self._load_metadata()

    def _load_metadata(self) -> dict:
        if self.metadata_file.exists():
            with open(self.metadata_file) as f:
                return json.load(f)
        return {"pages": {}}

    def _save_metadata(self):
        with open(self.metadata_file, "w") as f:
            json.dump(self.metadata, f, indent=2)

    def add_page(self, url: str, description: str = None,
                  tags: list[str] = None) -> dict:
        """
        Capture and index a web page screenshot.

        Args:
            url: URL to capture.
            description: Optional human description to attach as metadata.
            tags: Optional tags for filtering (e.g. ["competitor", "pricing"]).

        Returns:
            Metadata dict for the indexed page.
        """
        resp = requests.get(SCREENSHOT_API, params={
            "url": url,
            "width": 1440,
            "full_page": True,
            "format": "png",
            "wait_for": "networkidle"
        }, headers={"X-API-Key": API_KEY})
        resp.raise_for_status()

        image_hash = hashlib.sha256(resp.content).hexdigest()
        timestamp = datetime.now(timezone.utc).isoformat()
        url_slug = url.replace("://", "_").replace("/", "_").replace(".", "_")[:60]
        filename = f"{url_slug}_{timestamp[:10]}.png"
        file_path = self.screenshots_dir / filename
        file_path.write_bytes(resp.content)

        entry = {
            "url": url,
            "file": str(file_path),
            "hash": image_hash,
            "captured_at": timestamp,
            "description": description,
            "tags": tags or [],
            "size_kb": len(resp.content) // 1024
        }

        # Keyed by URL, so a re-capture replaces the earlier entry for that URL.
        self.metadata["pages"][url] = entry
        self._save_metadata()
        return entry

    def build_index(self) -> VectorStoreIndex:
        """
        Build a LlamaIndex VectorStoreIndex from all captured screenshots.
        Only the text metadata (URL, date, tags, description) is embedded;
        the images themselves are analyzed later, in query_visual.
        """
        documents = []
        for url, entry in self.metadata["pages"].items():
            # Create a text description document for text-based retrieval
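            text_content = (
                f"Web page: {url}\n"
                f"Captured: {entry['captured_at'][:10]}\n"
                f"Tags: {', '.join(entry['tags'])}\n"
                # add_page always stores the key (possibly as None), so a .get()
                # default would never apply; fall back with `or` instead.
                f"Description: {entry.get('description') or 'No description provided'}\n"
                f"Screenshot file: {entry['file']}"
            )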
            doc = Document(
                text=text_content,
                metadata={
                    "url": url,
                    "captured_at": entry["captured_at"],
                    "tags": entry["tags"],
                    "screenshot_path": entry["file"],
                    "hash": entry["hash"]
                }
            )
            documents.append(doc)

        index = VectorStoreIndex.from_documents(documents)
        return index

    def query_visual(self, query: str, top_k: int = 3) -> list[dict]:
        """
        Query the index for pages relevant to the query, then re-analyze
        them visually using a multimodal LLM.

        Args:
            query: Natural language query (e.g. "pricing pages with social proof").
            top_k: Number of pages to retrieve and analyze.

        Returns:
            List of visual analysis results for retrieved pages.
        """
        index = self.build_index()
        retriever = index.as_retriever(similarity_top_k=top_k)
        nodes = retriever.retrieve(query)

        mm_llm = AnthropicMultiModal(model="claude-opus-4-6", max_tokens=2048)
        results = []

        for node in nodes:
            screenshot_path = node.metadata.get("screenshot_path")
            if not screenshot_path or not Path(screenshot_path).exists():
                continue

            # Visual analysis of this specific page in context of the query
            analysis_prompt = (
                f"Query: {query}\n\n"
                f"Page URL: {node.metadata['url']}\n\n"
                "Analyze this screenshot in the context of the query. "
                "What specifically does this page show that's relevant? "
                "Be concise and specific."
            )

            # Pass the saved file by path. ImageDocument's `image` field expects
            # a raw base64 string, not a data: URI.
            response = mm_llm.complete(
                prompt=analysis_prompt,
                image_documents=[
                    ImageDocument(image_path=screenshot_path)
                ]
            )

            results.append({
                "url": node.metadata["url"],
                "relevance_score": node.score,
                "captured_at": node.metadata["captured_at"],
                "visual_analysis": str(response)
            })

        return results

Building a Competitive Intelligence RAG System

The visual RAG pattern enables a competitive intelligence system that answers questions by retrieving and re-analyzing relevant competitor pages:

def build_competitive_intelligence_system(competitor_pages: list[dict]) -> VisualWebIndex:
    """
    Build a visual RAG index of competitor pages.

    competitor_pages: list of {"url": str, "company": str, "page_type": str}
    e.g. {"url": "...", "company": "Acme", "page_type": "pricing"}
    """
    index = VisualWebIndex("competitive_intel")

    for page in competitor_pages:
        print(f"Capturing: {page['url']}")
        index.add_page(
            url=page["url"],
            description=f"{page['company']} {page['page_type']} page",
            tags=[page["company"].lower(), page["page_type"]]
        )

    return index

# Usage
competitors = [
    {"url": "https://competitor-a.com/pricing", "company": "CompetitorA", "page_type": "pricing"},
    {"url": "https://competitor-b.com/pricing", "company": "CompetitorB", "page_type": "pricing"},
    {"url": "https://competitor-a.com/features", "company": "CompetitorA", "page_type": "features"},
    {"url": "https://competitor-b.com/features", "company": "CompetitorB", "page_type": "features"},
]

intel = build_competitive_intelligence_system(competitors)

# Query the index visually
results = intel.query_visual(
    "Which competitors show pricing with annual/monthly toggle and highlight savings?",
    top_k=3
)
for r in results:
    print(f"\n{r['url']} (score: {r['relevance_score']:.2f})")
    print(r['visual_analysis'])
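
The tags stored by add_page are described as being for filtering, but query_visual does not use them. A minimal sketch of narrowing the candidate pages by tag before retrieval, assuming the same {"pages": {url: entry}} metadata layout; filter_pages_by_tags is a hypothetical helper, not part of LlamaIndex or the screenshot API:

```python
def filter_pages_by_tags(metadata: dict, required_tags: list[str]) -> dict:
    """Return the url -> entry mapping restricted to entries carrying every required tag."""
    wanted = {t.lower() for t in required_tags}
    return {
        url: entry
        for url, entry in metadata["pages"].items()
        if wanted.issubset({t.lower() for t in entry.get("tags", [])})
    }

# Sample metadata in the shape add_page writes (only the tags field is shown).
sample_metadata = {"pages": {
    "https://competitor-a.com/pricing": {"tags": ["competitora", "pricing"]},
    "https://competitor-a.com/features": {"tags": ["competitora", "features"]},
    "https://competitor-b.com/pricing": {"tags": ["competitorb", "pricing"]},
}}

pricing_only = filter_pages_by_tags(sample_metadata, ["pricing"])
print(sorted(pricing_only))
```

Filtering first keeps the later multimodal analysis limited to pages that can plausibly answer the question.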

Incremental Index Updates

A key advantage of the RAG approach: pages can be re-captured and the index updated incrementally as competitor pages change:

def update_changed_pages(index: VisualWebIndex) -> list[str]:
    """
    Re-capture pages that have changed visually since last capture.
    Returns list of changed URLs.
    """
    changed = []
    for url, entry in index.metadata["pages"].items():
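        # Capture with the same parameters add_page uses (including full_page);
        # otherwise the hashes never match and every page looks changed.
        resp = requests.get(SCREENSHOT_API, params={
            "url": url, "width": 1440, "full_page": True,
            "format": "png", "wait_for": "networkidle"
        }, headers={"X-API-Key": API_KEY})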

        if resp.status_code != 200:
            continue

        current_hash = hashlib.sha256(resp.content).hexdigest()
        if current_hash != entry["hash"]:
            # Page changed — re-index
            index.add_page(
                url=url,
                description=entry.get("description"),
                tags=entry.get("tags", [])
            )
            changed.append(url)

    return changed
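
As written, update_changed_pages spends two API calls on every changed page: one to detect the change and another inside add_page. A minimal sketch of the detection step on its own, with the capture function injected so the bytes fetched for hashing can be handed straight to storage. detect_changes and the fetch callable are hypothetical names, not part of the screenshot API or LlamaIndex. Byte-level hashing also flags any pixel difference, so rotating banners or on-page timestamps will register as changes.

```python
import hashlib
from typing import Callable

def detect_changes(stored_hashes: dict[str, str],
                   fetch: Callable[[str], bytes]) -> dict[str, bytes]:
    """Return {url: new_image_bytes} for pages whose SHA-256 differs from the stored hash."""
    changed = {}
    for url, old_hash in stored_hashes.items():
        image_bytes = fetch(url)  # one capture per tracked page
        if hashlib.sha256(image_bytes).hexdigest() != old_hash:
            changed[url] = image_bytes  # keep the bytes so no second capture is needed
    return changed

# Stand-in fetcher for illustration; a real one would call the screenshot API.
fake_captures = {
    "https://a.example/pricing": b"old-image",
    "https://b.example/pricing": b"new-image",
}
stored = {
    "https://a.example/pricing": hashlib.sha256(b"old-image").hexdigest(),
    "https://b.example/pricing": hashlib.sha256(b"old-image").hexdigest(),
}

changed = detect_changes(stored, fake_captures.__getitem__)
print(list(changed))
```

The returned bytes can then be written to disk and hashed into the metadata directly, which keeps the daily cost at one call per page.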

Rate Limit Planning for RAG Workloads

RAG screenshot workloads have a different profile from agent workloads: index build is a one-time burst, followed by low-rate update checks:

RAG Workload	Index Build	Daily Updates	Total/Day	Tier
Small competitive set (5 pages)	5 (once)	5	5	Free
Mid-size competitive intel (20 pages)	20 (once)	20	20	Free
Full competitor + landing suite (50 pages)	50 (once)	50	50	Free/Starter
Multi-market competitive RAG (200 pages)	200 (once)	200	200	Starter ($4)
Enterprise competitive intelligence	500+ (once)	200+	200+	Pro ($9)

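The index build cost is amortized: you pay it once. Daily maintenance is a separate, recurring cost. With the hash-based check above, every tracked page costs one call per day just to detect changes, and each changed page costs a second call when add_page re-captures it. For a 50-page competitor set, that is 50 calls for the build, then 50 calls per day plus one for each page that changed, which is why the Daily Updates column in the table equals the page count.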
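
The arithmetic behind the table can be sketched as a small estimator. It assumes the hash-based check shown earlier (one call per tracked page per day, plus one re-capture per changed page) and the 50 calls/day free allowance quoted at the end of this post. daily_calls and the changed-page counts are illustrative, not measured.

```python
FREE_TIER_DAILY_CALLS = 50  # free allowance quoted in this post

def daily_calls(pages: int, changed_pages: int, reuse_check_capture: bool = False) -> int:
    """Estimate API calls per day for change monitoring.

    One call per tracked page to check for changes, plus one re-capture per
    changed page unless the bytes from the check are reused for storage.
    """
    checks = pages
    recaptures = 0 if reuse_check_capture else changed_pages
    return checks + recaptures

# Hypothetical workloads: (tracked pages, pages that changed today).
for pages, changed_pages in [(5, 1), (20, 2), (50, 5), (200, 20)]:
    calls = daily_calls(pages, changed_pages)
    verdict = "within" if calls <= FREE_TIER_DAILY_CALLS else "over"
    print(f"{pages} pages, {changed_pages} changed: {calls} calls/day ({verdict} the free allowance)")
```

Under these assumptions a 50-page set sits right at the free allowance when nothing changes and goes over it as soon as a changed page triggers a second capture.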

What LlamaIndex Adds Over Direct Agent Tools

The RAG approach adds something that direct agent tools don't provide: temporal indexing. Every screenshot stored as a document carries a capture timestamp, so the index can hold historical states of a page as well as its current one. If you capture weekly snapshots and keep each one, you can ask "how has competitor A's pricing page changed over the last month?" by retrieving nodes with different capture dates. Note that VisualWebIndex as written above keys its metadata by URL, so a re-capture replaces the earlier entry; keeping history means storing each capture alongside the earlier ones rather than replacing them.
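
A minimal sketch of the history-keeping this relies on, assuming each capture is appended to a per-URL list instead of replacing the previous entry. record_capture and captures_between are hypothetical helpers, and the entry fields mirror the hash and captured_at values written by add_page.

```python
from datetime import datetime, timezone

def record_capture(history: dict, url: str, image_hash: str, captured_at: datetime) -> None:
    """Append a capture to the URL's history instead of overwriting the last one."""
    history.setdefault(url, []).append(
        {"hash": image_hash, "captured_at": captured_at.isoformat()}
    )

def captures_between(history: dict, url: str, start: datetime, end: datetime) -> list[dict]:
    """Return the URL's captures with start <= captured_at <= end, oldest first."""
    selected = [
        entry for entry in history.get(url, [])
        if start <= datetime.fromisoformat(entry["captured_at"]) <= end
    ]
    return sorted(selected, key=lambda e: datetime.fromisoformat(e["captured_at"]))

history = {}
url = "https://competitor-a.com/pricing"
# Three weekly captures with placeholder hashes; the page changes before the third.
for day, digest in [(4, "aaa"), (11, "aaa"), (18, "bbb")]:
    record_capture(history, url, digest, datetime(2026, 5, day, tzinfo=timezone.utc))

recent = captures_between(
    history, url,
    datetime(2026, 5, 10, tzinfo=timezone.utc),
    datetime(2026, 5, 31, tzinfo=timezone.utc),
)
# A hash difference between consecutive captures marks a visual change.
changed_on = [
    later["captured_at"][:10]
    for earlier, later in zip(recent, recent[1:])
    if earlier["hash"] != later["hash"]
]
print(changed_on)
```

The dates where the hash changes are the captures worth sending to the multimodal LLM for a before-and-after comparison.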

This is the key distinction from the LangChain and CrewAI patterns: there, screenshots are ephemeral tool outputs, consumed and discarded within a single agent run. In the LlamaIndex RAG pattern they are persistent documents, first-class retrieval objects that accumulate over time and can be queried retrospectively.

For competitive intelligence specifically, this temporal dimension is often more valuable than any single snapshot.

Hermesforge Screenshot API: JavaScript rendering, full-page capture, PNG/WebP output. Get a free API key — 50 calls/day, no signup required.