Screenshot API for CrewAI Agents: Web Observation in Multi-Agent Pipelines

2026-05-15 | Tags: [screenshot-api, crewai, ai-agents, python, tutorials, b2a]

CrewAI's strength is task decomposition across specialized agents. Each agent has a role, a goal, and a set of tools. Adding visual web observation to a crew means one or more agents can see rendered web pages — not just read their text — and that perception feeds into the crew's collaborative reasoning.

This post shows how to wrap a screenshot API as a CrewAI tool, which agent roles benefit most from visual perception, and how to structure crews that combine web observation with other capabilities.

The Screenshot Tool for CrewAI

CrewAI tools are classes that inherit from BaseTool: a pydantic model describes the inputs, and _run does the work:

import base64
import os

import requests
from crewai.tools import BaseTool
from pydantic import BaseModel, Field
from typing import Type

SCREENSHOT_API = "https://hermesforge.dev/api/screenshot"
# Read the key from the environment rather than hardcoding secrets
API_KEY = os.environ.get("HERMESFORGE_API_KEY", "your_api_key_here")

class ScreenshotInput(BaseModel):
    url: str = Field(description="The URL to screenshot")
    full_page: bool = Field(default=False, description="Capture full scrollable page")
    width: int = Field(default=1440, description="Viewport width in pixels")

class WebScreenshotTool(BaseTool):
    name: str = "web_screenshot"
    description: str = (
        "Capture a screenshot of a web page and return it as base64-encoded PNG. "
        "Use when you need to observe visual layout, design, rendered content, or "
        "anything that text extraction cannot capture. Returns base64 PNG data."
    )
    args_schema: Type[BaseModel] = ScreenshotInput

    def _run(self, url: str, full_page: bool = False, width: int = 1440) -> str:
        resp = requests.get(SCREENSHOT_API, params={
            "url": url,
            "width": width,
            "format": "png",
            "full_page": full_page,
            "wait_for": "networkidle"
        }, headers={"X-API-Key": API_KEY}, timeout=60)
        resp.raise_for_status()
        b64 = base64.b64encode(resp.content).decode("utf-8")
        return f"Screenshot captured ({len(resp.content) // 1024}KB): data:image/png;base64,{b64}"

The description field is what CrewAI agents use to decide when to call this tool. Framing it around what it reveals — visual layout, design, rendered content — rather than what it does mechanically (takes a screenshot) produces better tool selection by the agent's reasoning loop.
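Downstream code can recover the raw PNG from the tool's string return value by splitting on the data-URI prefix. A minimal sketch — the `parse_screenshot_result` helper is illustrative, not part of CrewAI:

```python
import base64

def parse_screenshot_result(result: str) -> bytes:
    """Extract raw PNG bytes from a 'data:image/png;base64,...' tool result."""
    marker = "data:image/png;base64,"
    idx = result.find(marker)
    if idx == -1:
        raise ValueError("no base64 payload in tool result")
    return base64.b64decode(result[idx + len(marker):])

# Round-trip check with dummy bytes standing in for a real screenshot
png_bytes = b"\x89PNG\r\n\x1a\nfakepayload"
b64 = base64.b64encode(png_bytes).decode("utf-8")
tool_result = f"Screenshot captured ({len(png_bytes) // 1024}KB): data:image/png;base64,{b64}"
assert parse_screenshot_result(tool_result) == png_bytes
```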

Agent Roles That Benefit from Visual Perception

Not every agent in a crew needs screenshot capability. The pattern is to give it to agents whose job description involves observing rather than analyzing text:

from crewai import Agent, LLM

# Multimodal LLM required for vision
llm = LLM(model="claude-opus-4-6", max_tokens=4096)

screenshot_tool = WebScreenshotTool()

# The observer: primary visual perception role
web_observer = Agent(
    role="Web Observation Specialist",
    goal="Capture and describe the visual state of web pages accurately",
    backstory=(
        "You are an expert at observing and describing what web pages look like — "
        "their layout, design hierarchy, call-to-action placement, visual messaging, "
        "and overall user experience. You see pages as a first-time user would."
    ),
    tools=[screenshot_tool],
    llm=llm,
    verbose=True
)

# The analyst: works from the observer's descriptions
ux_analyst = Agent(
    role="UX and Conversion Analyst",
    goal="Analyze web page observations to identify conversion opportunities and UX issues",
    backstory=(
        "You are a senior UX researcher and CRO specialist. You receive visual observations "
        "of web pages and identify friction points, conversion blockers, and design patterns "
        "that correlate with high or low conversion rates."
    ),
    tools=[],  # Works from text — no screenshot tool needed
    llm=llm,
    verbose=True
)

# The strategist: synthesizes findings into recommendations
growth_strategist = Agent(
    role="Growth Strategist",
    goal="Turn UX analysis into prioritized, actionable growth recommendations",
    backstory=(
        "You are a growth advisor who specializes in translating UX findings into "
        "revenue-impact recommendations. You prioritize by effort-to-impact ratio."
    ),
    tools=[],
    llm=llm,
    verbose=True
)

This division — observer, analyst, strategist — keeps screenshot consumption concentrated in the agent that needs it and prevents unnecessary API calls from downstream agents.

Building a Competitive Analysis Crew

from crewai import Task, Crew, Process

def run_competitive_analysis(your_url: str, competitor_urls: list[str]) -> str:
    """
    Run a multi-agent competitive analysis comparing your page to competitors.
    Returns a structured strategic report.
    """

    observe_task = Task(
        description=(
            f"Screenshot and describe the following pages in detail:\n"
            f"1. OUR PAGE: {your_url}\n"
            + "\n".join(f"{i+2}. COMPETITOR: {u}" for i, u in enumerate(competitor_urls))
            + "\n\nFor each page describe: headline and value proposition, "
            "pricing structure and presentation, social proof elements, "
            "primary CTA placement and copy, visual hierarchy, trust signals."
        ),
        expected_output=(
            "A structured description of each page covering: value proposition, "
            "pricing display, social proof, CTA design, layout hierarchy, trust signals. "
            "Be specific and visual — describe what a user sees, not what the HTML says."
        ),
        agent=web_observer
    )

    analyze_task = Task(
        description=(
            "Based on the page observations, analyze: "
            "(1) Where does our page outperform competitors visually? "
            "(2) Where do competitors have stronger visual conversion signals? "
            "(3) What specific design or copy patterns explain performance differences? "
            "(4) Which competitor's approach is strongest overall and why?"
        ),
        expected_output=(
            "A comparative UX analysis identifying specific strengths, weaknesses, "
            "and the design patterns that drive them. Include evidence from the observations."
        ),
        agent=ux_analyst,
        context=[observe_task]
    )

    strategy_task = Task(
        description=(
            "Based on the competitive analysis, produce a prioritized list of "
            "improvements for our page. For each recommendation: "
            "(1) What to change and why, "
            "(2) Which competitor does this better (as evidence), "
            "(3) Estimated effort (low/medium/high), "
            "(4) Estimated impact on conversion (low/medium/high)."
        ),
        expected_output=(
            "A prioritized improvement roadmap with 5-8 specific, actionable recommendations, "
            "each justified by the competitive evidence."
        ),
        agent=growth_strategist,
        context=[observe_task, analyze_task]
    )

    crew = Crew(
        agents=[web_observer, ux_analyst, growth_strategist],
        tasks=[observe_task, analyze_task, strategy_task],
        process=Process.sequential,
        verbose=True
    )

    result = crew.kickoff()
    return result.raw
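One detail worth checking in isolation is the page numbering in the observe task: our page is always item 1, so competitor indices are offset by 2 — an easy off-by-one spot. A quick check with hypothetical URLs:

```python
# Hypothetical URLs for illustration
your_url = "https://example.com/pricing"
competitor_urls = ["https://rival-a.example", "https://rival-b.example"]

page_list = (
    f"1. OUR PAGE: {your_url}\n"
    + "\n".join(f"{i+2}. COMPETITOR: {u}" for i, u in enumerate(competitor_urls))
)
lines = page_list.splitlines()
assert lines[0] == "1. OUR PAGE: https://example.com/pricing"
assert lines[2] == "3. COMPETITOR: https://rival-b.example"
```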

Building a Visual Regression Crew

For teams shipping frequently, a visual regression crew catches layout breaks that unit tests miss:

import hashlib
from pathlib import Path

class RegressionInput(ScreenshotInput):
    # Override the inherited default: regression checks should compare the
    # full page, and the schema default is what the agent's call actually uses
    full_page: bool = Field(default=True, description="Capture full scrollable page")

class VisualHashTool(BaseTool):
    name: str = "visual_hash_check"
    description: str = (
        "Screenshot a page and compare its visual hash to a stored baseline. "
        "Returns CHANGED or UNCHANGED with the current screenshot if changed. "
        "Use to detect visual regressions after deployments."
    )
    args_schema: Type[BaseModel] = RegressionInput

    def _run(self, url: str, full_page: bool = True, width: int = 1440) -> str:
        resp = requests.get(SCREENSHOT_API, params={
            "url": url, "width": width, "format": "png",
            "full_page": full_page, "wait_for": "networkidle"
        }, headers={"X-API-Key": API_KEY}, timeout=60)
        resp.raise_for_status()

        current_hash = hashlib.sha256(resp.content).hexdigest()
        hash_file = Path(f"baselines/{url.replace('/', '_').replace(':', '')}.hash")
        hash_file.parent.mkdir(exist_ok=True)

        if not hash_file.exists():
            hash_file.write_text(current_hash)
            return f"BASELINE SET for {url}: {current_hash[:12]}..."

        stored_hash = hash_file.read_text().strip()
        if stored_hash == current_hash:
            return f"UNCHANGED: {url} matches baseline"

        # Hash changed — return screenshot for visual inspection
        b64 = base64.b64encode(resp.content).decode("utf-8")
        return (
            f"CHANGED: {url}\n"
            f"Baseline: {stored_hash[:12]}...\n"
            f"Current:  {current_hash[:12]}...\n"
            f"Screenshot: data:image/png;base64,{b64}"
        )
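The baseline logic can be exercised without the API by feeding it raw bytes directly. A sketch of the same three-state flow (baseline-set, unchanged, changed), using a temporary directory in place of baselines/:

```python
import hashlib
import tempfile
from pathlib import Path

def check_baseline(content: bytes, baseline_dir: Path, key: str) -> str:
    """Return 'BASELINE SET', 'UNCHANGED', or 'CHANGED' for the given bytes."""
    current = hashlib.sha256(content).hexdigest()
    baseline_dir.mkdir(parents=True, exist_ok=True)
    hash_file = baseline_dir / f"{key}.hash"
    if not hash_file.exists():
        hash_file.write_text(current)
        return "BASELINE SET"
    return "UNCHANGED" if hash_file.read_text().strip() == current else "CHANGED"

with tempfile.TemporaryDirectory() as d:
    base = Path(d)
    assert check_baseline(b"render-v1", base, "home") == "BASELINE SET"
    assert check_baseline(b"render-v1", base, "home") == "UNCHANGED"
    assert check_baseline(b"render-v2", base, "home") == "CHANGED"
```

Note that a byte-level hash flags any rendering difference, including incidental ones like rotating banners; the reviewing agent is what separates noise from regressions.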

regression_detector = Agent(
    role="Visual Regression Detector",
    goal="Detect and describe visual regressions on web pages after deployments",
    backstory=(
        "You run visual regression checks on web pages. You detect when pages have "
        "changed from their baseline and describe what changed and whether it looks intentional."
    ),
    tools=[VisualHashTool()],
    llm=llm,
    verbose=True
)

qa_reviewer = Agent(
    role="QA Release Gatekeeper",
    goal="Decide whether detected visual changes are regressions or intended improvements",
    backstory=(
        "You are a senior QA engineer who reviews visual change reports before releases. "
        "You distinguish intentional design changes from bugs, and flag blockers vs. acceptable changes."
    ),
    tools=[],
    llm=llm,
    verbose=True
)

def run_regression_check(staging_urls: list[str]) -> dict:
    check_task = Task(
        description=(
            f"Run visual regression checks on these staging pages:\n"
            + "\n".join(f"- {u}" for u in staging_urls)
            + "\n\nFor each page: check for visual changes against baseline. "
            "If changed, describe what appears different."
        ),
        expected_output="A list of pages with CHANGED or UNCHANGED status, with descriptions of any changes detected.",
        agent=regression_detector
    )

    review_task = Task(
        description=(
            "Review the regression report. For each changed page: "
            "(1) Does the change look like a bug or an intentional improvement? "
            "(2) Is this a release blocker? "
            "(3) What should the team do — approve, fix, or investigate further?"
        ),
        expected_output="A release decision with APPROVED / BLOCKED / NEEDS INVESTIGATION for each changed page.",
        agent=qa_reviewer,
        context=[check_task]
    )

    crew = Crew(
        agents=[regression_detector, qa_reviewer],
        tasks=[check_task, review_task],
        process=Process.sequential
    )

    result = crew.kickoff()
    return {"report": result.raw, "pages": len(staging_urls)}

Rate Limit Strategy for Crews

CrewAI crews can exhaust API limits quickly because multiple agents may trigger tools in sequence. A simple rate limit wrapper prevents burst failures:

import time
from functools import wraps

def rate_limited(max_calls_per_minute: int = 10):
    """Decorator to rate-limit tool calls within a crew run."""
    min_interval = 60.0 / max_calls_per_minute
    last_call = [0.0]

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_call[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            last_call[0] = time.time()
            return func(*args, **kwargs)
        return wrapper
    return decorator
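Applying it means wrapping the tool's _run or any callable the crew invokes. A quick demonstration (the decorator is repeated so the snippet is self-contained, and the limit is set high so the enforced gap is only 0.1s):

```python
import time
from functools import wraps

def rate_limited(max_calls_per_minute: int = 10):
    """Enforce at least 60 / max_calls_per_minute seconds between calls."""
    min_interval = 60.0 / max_calls_per_minute
    last_call = [0.0]
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_call[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            last_call[0] = time.time()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(max_calls_per_minute=600)  # 0.1s minimum gap, fast enough to demo
def fake_capture(url: str) -> str:
    return f"captured {url}"

start = time.time()
results = [fake_capture(f"https://example.com/{i}") for i in range(3)]
elapsed = time.time() - start
assert results[0] == "captured https://example.com/0"
assert elapsed >= 0.19  # two enforced gaps between three calls
```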

For most crew configurations, screenshot volume maps onto pricing tiers like this:

| Crew Type | Agents with Vision | Screenshots/Run | Daily Runs | Tier |
|---|---|---|---|---|
| Competitive snapshot (3 pages) | 1 | 3 | 1 | Free |
| Pricing monitor (10 competitors) | 1 | 10 | 1 | Free |
| Visual regression suite (20 pages) | 1 | 20 | 2 | Free/Starter |
| Full competitive intelligence crew | 1-2 | 30-50 | daily | Starter ($4) |
| Multi-market monitoring platform | 2-3 | 100+ | daily | Pro ($9) |

What Changes in Multi-Agent Pipelines

The key difference between a single agent with screenshot capability and a crew is what happens after the screenshot. A single agent captures and reasons about the image alone. A crew captures the image in one agent and passes the observation as context to downstream agents that may have different specializations — analysis, strategy, QA decision-making, report writing.

This separation matters for screenshot API usage: only the observation-specialist agents need screenshot tools. Downstream agents work from the natural language descriptions produced by the observer. The screenshot is consumed once; the observation it enables flows through the entire crew.

The cost implication: a 5-agent crew doing competitive analysis needs roughly the same number of API calls as a single agent doing the same task — not 5x. Visual perception is concentrated at the input stage. What the crew multiplies is reasoning depth, not screenshot consumption.
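A back-of-envelope check of that claim, with hypothetical numbers: calls scale with the page list, not the crew size.

```python
# Hypothetical crew: 1 observer plus 4 downstream agents
pages = ["https://ours.example"] + [f"https://rival-{i}.example" for i in range(3)]
agents_in_crew = 5
api_calls = len(pages)  # only the observer captures; others reuse its text output
assert api_calls == 4                            # one call per page...
assert api_calls != len(pages) * agents_in_crew  # ...not one per page per agent
```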


Hermesforge Screenshot API: full-page capture, JavaScript rendering, PNG/WebP output. Get a free API key — 50 calls/day, no signup required.