How AI Agents Use Screenshots as Eyes: Building a Visual Perception Loop
The first time I fed a screenshot to an LLM and asked it what was wrong with the page, I expected a vague answer. Instead it said: "The button in the top right is disabled and grayed out. The form below it has a required field marked in red — it looks like email validation failed." It was completely right, and the human-level visual reasoning was instant.
That was the moment I understood that screenshot APIs aren't just for monitoring dashboards or CI smoke tests. They're a perception layer for AI agents. An agent that can see a page — actually see it, not parse its HTML — can reason about it the same way a human would.
Here's how to build that loop.
The Core Idea
An LLM agent normally operates in a text world. It reads text, generates text, calls tools that return text. When you add a screenshot API as a tool, you give it eyes. Now it can:
- Navigate to a URL and observe the current state of the page
- Describe what it sees in natural language
- Identify elements, errors, status indicators, and layout issues
- Decide what action to take next based on what it observes
- Take that action (click, fill, submit — via another tool) and observe again
This is a perception-action loop — the same architecture that robotics researchers have spent decades building in physical systems. Except now you're doing it with HTTP requests and LLM vision, and it works.
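If you wire the screenshot capability in through function calling, the tool declaration might look like this. A sketch only — the name and parameters here are illustrative, not a fixed API:

```python
# A hypothetical tool declaration an agent framework could register.
# Only the shape matters: the model sees the name, description, and
# parameters, then emits a call like screenshot_tool(url="https://...").
SCREENSHOT_TOOL = {
    "type": "function",
    "function": {
        "name": "screenshot_tool",
        "description": "Capture a screenshot of a URL and return it as an image for visual inspection.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "The page to capture."},
                "full_page": {"type": "boolean", "description": "Capture the full scroll height."},
            },
            "required": ["url"],
        },
    },
}
```

The agent runtime maps the name back to a real capture function; the model never sees the HTTP details.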
The Architecture
Agent
 │
 ├── screenshot_tool(url) → image
 │     └── sends image to LLM vision
 │
 ├── describe_page(url) → text description
 │     └── LLM generates structured observation
 │
 └── decide_action(observation, goal) → next_step
       └── LLM chooses: screenshot again / navigate / report / done
The agent loops: observe → reason → act → observe again. The loop terminates when the agent reaches its goal or hits a maximum step count.
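Stripped to its skeleton, the loop is three callables wired together. This sketch uses stubs in place of the vision and action tools, so the control flow runs without any API (all names here are illustrative):

```python
def run_loop(observe, decide, act, goal, max_steps=10):
    """Generic perception-action loop: observe, reason, act, repeat."""
    context = None
    for step in range(max_steps):
        observation = observe(context)        # e.g. screenshot + vision description
        decision = decide(observation, goal)  # LLM picks the next step
        if decision == 'done':
            return {'success': True, 'steps': step + 1}
        context = act(decision)               # navigate, click, or just re-observe
    return {'success': False, 'steps': max_steps}

# Stub wiring: the "page" reaches its ready state on the third observation.
state = {'loads': 0}
def observe(ctx):
    state['loads'] += 1
    return 'ready' if state['loads'] >= 3 else 'loading'
def decide(obs, goal):
    return 'done' if obs == 'ready' else 'wait'
def act(decision):
    return decision

result = run_loop(observe, decide, act, goal='page is ready')
```

The concrete agent in Step 3 below is this skeleton with real tools plugged in.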
Step 1: The Screenshot Tool
import os
import io
import base64
import requests
from PIL import Image
API_KEY = os.environ['SCREENSHOT_API_KEY']
SCREENSHOT_URL = 'https://hermesforge.dev/api/screenshot'
def capture_page(url, width=1280, height=800, delay=1500, full_page=False):
    """Capture a screenshot and return it as a PIL Image."""
    resp = requests.get(
        SCREENSHOT_URL,
        params={
            'url': url,
            'width': width,
            'height': height,
            'format': 'png',
            'delay': delay,
            'full_page': str(full_page).lower(),
        },
        headers={'X-API-Key': API_KEY},
        timeout=60,
    )
    resp.raise_for_status()
    return Image.open(io.BytesIO(resp.content)).convert('RGB')

def image_to_base64(img):
    """Convert a PIL Image to a base64 string for the LLM API."""
    buf = io.BytesIO()
    img.save(buf, format='PNG')
    return base64.b64encode(buf.getvalue()).decode('utf-8')
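One practical tweak: high-detail vision requests are billed by image tiles, so capping screenshot dimensions before encoding can cut cost. A helper along these lines (my addition; the 1536px cap is an arbitrary choice, not a requirement of any API):

```python
from PIL import Image

def downscale_for_llm(img, max_side=1536):
    """Shrink an image so its longest side is at most max_side pixels,
    preserving aspect ratio. Smaller images pass through untouched."""
    w, h = img.size
    longest = max(w, h)
    if longest <= max_side:
        return img
    scale = max_side / longest
    return img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
```

Call it between `capture_page` and `image_to_base64` when capturing large or full-page screenshots.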
Step 2: The Vision Observation
import json

from openai import OpenAI

client = OpenAI()  # or use the Anthropic client — same pattern

def observe_page(url, goal=None, context=None):
    """
    Capture a screenshot and ask the LLM to describe it.
    Returns a structured observation dict.
    """
    img = capture_page(url)
    img_b64 = image_to_base64(img)

    prompt_parts = [
        "You are observing a web page screenshot. Describe what you see precisely and concisely.",
        "Focus on:",
        "- Page title or main heading",
        "- Visible content and layout",
        "- Any error messages, warnings, or alerts",
        "- Form fields and their state (filled/empty/invalid/disabled)",
        "- Buttons and their state (enabled/disabled/loading)",
        "- Any indicators of page load status",
    ]
    if goal:
        prompt_parts.append(f"\nThe goal is: {goal}")
        prompt_parts.append("Indicate whether the page state suggests the goal has been achieved, is in progress, or has failed.")
    if context:
        prompt_parts.append(f"\nPrevious context: {context}")

    resp = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': '\n'.join(prompt_parts)},
                {
                    'type': 'image_url',
                    'image_url': {
                        'url': f'data:image/png;base64,{img_b64}',
                        'detail': 'high',
                    },
                },
            ],
        }],
        max_tokens=500,
    )
    description = resp.choices[0].message.content

    # Second pass: distill the free-text description into a structured assessment
    assess_resp = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'user', 'content': f'Based on this page description:\n\n{description}\n\nReturn a JSON object with: {{"status": "ok|error|loading|unknown", "goal_achieved": true|false|null, "key_finding": "one sentence", "suggested_action": "what to check or do next"}}'},
        ],
        response_format={'type': 'json_object'},
        max_tokens=200,
    )
    assessment = json.loads(assess_resp.choices[0].message.content)

    return {
        'url': url,
        'description': description,
        'assessment': assessment,
        'screenshot': img,
    }
Step 3: The Agent Loop
import time

class VisualAgent:
    def __init__(self, goal, max_steps=10):
        self.goal = goal
        self.max_steps = max_steps
        self.history = []

    def run(self, start_url):
        current_url = start_url
        context = None

        for step in range(self.max_steps):
            print(f'\n[Step {step + 1}/{self.max_steps}] Observing: {current_url}')
            observation = observe_page(
                current_url,
                goal=self.goal,
                context=context,
            )
            self.history.append({
                'step': step + 1,
                'url': current_url,
                'observation': observation,
            })

            assessment = observation['assessment']
            print(f'  Status: {assessment["status"]}')
            print(f'  Finding: {assessment["key_finding"]}')
            print(f'  Next: {assessment["suggested_action"]}')

            # Goal achieved?
            if assessment.get('goal_achieved') is True:
                print(f'\n[Goal achieved in {step + 1} steps]')
                return {'success': True, 'steps': step + 1, 'history': self.history}

            # Error state?
            if assessment['status'] == 'error':
                print('\n[Error detected — stopping]')
                return {'success': False, 'error': assessment['key_finding'], 'history': self.history}

            # Update context for the next observation
            context = f'Step {step + 1}: {assessment["key_finding"]}'

            # Brief wait before observing again
            time.sleep(2)

        return {'success': False, 'error': 'Max steps reached', 'history': self.history}
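The history list the agent accumulates can be collapsed into a plain-text report for logs or notifications. A small illustrative helper (not part of the agent class above):

```python
def summarize_history(history):
    """Render the agent's step history as a readable report, one
    status line and one finding per step."""
    lines = []
    for entry in history:
        a = entry['observation']['assessment']
        lines.append(f"Step {entry['step']}: {entry['url']}")
        lines.append(f"  [{a.get('status', 'unknown')}] {a.get('key_finding', '')}")
    return '\n'.join(lines)
```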
Step 4: A Real Use Case — Post-Deploy Verification Agent
The agent checks that a deployment went correctly across multiple pages:
DEPLOY_CHECKS = [
    {
        'url': 'https://yourapp.com',
        'goal': 'Homepage renders correctly with no error messages or broken layout',
    },
    {
        'url': 'https://yourapp.com/pricing',
        'goal': 'Pricing page shows three plan tiers with working CTA buttons',
    },
    {
        'url': 'https://yourapp.com/login',
        'goal': 'Login form is visible with email + password fields and a submit button',
    },
    {
        'url': 'https://yourapp.com/app/dashboard',
        'goal': 'Dashboard loads with data visible (not loading state or error)',
    },
]

def run_deploy_verification(checks):
    results = []
    for check in checks:
        agent = VisualAgent(goal=check['goal'], max_steps=3)
        result = agent.run(check['url'])
        result['url'] = check['url']
        result['goal'] = check['goal']
        results.append(result)
        icon = '✓' if result['success'] else '✗'
        print(f'{icon} {check["url"]}')

    passed = sum(1 for r in results if r['success'])
    print(f'\n{passed}/{len(results)} checks passed')
    return results
results = run_deploy_verification(DEPLOY_CHECKS)
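In CI you generally want the process to exit nonzero when any check fails, so the pipeline goes red. A minimal wrapper, assuming the results shape above:

```python
import sys

def exit_code_for(results):
    """0 if every check passed, 1 otherwise — suitable for sys.exit()."""
    return 0 if all(r['success'] for r in results) else 1

# In a CI script, the verification run would end with:
# sys.exit(exit_code_for(run_deploy_verification(DEPLOY_CHECKS)))
```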
What the Agent Actually Sees
The observation output for a broken deployment looks like this:
[Step 1/3] Observing: https://yourapp.com/app/dashboard
  Status: error
  Finding: Dashboard shows a white screen with a JavaScript error in the console overlay: "Cannot read properties of undefined (reading 'data')"
  Next: Check application logs for API endpoint failures — the dashboard data fetch is failing
That's not what an uptime monitor would catch. The page returns HTTP 200. The server is up. But the agent sees what the user sees: a broken dashboard with a visible JS error.
Practical Considerations
Vision model costs: GPT-4o vision runs at roughly $0.003 per image at high detail. For a 10-page deploy check, that's about $0.03 per run — negligible. For continuous monitoring every 15 minutes, that's 96 captures per day, or roughly $0.29/day per page (each observation here also makes a second, text-only assessment call, which adds a little on top). Budget accordingly.
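If you want to budget precisely, the arithmetic is worth wrapping in a helper; the per-image price is a parameter since pricing changes (a sketch):

```python
def daily_vision_cost(pages, interval_minutes, price_per_image=0.003):
    """Estimated vision spend per day when polling `pages` URLs
    every `interval_minutes` at a given per-image price."""
    captures_per_day = (24 * 60 // interval_minutes) * pages
    return captures_per_day * price_per_image
```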
Screenshot timing: the delay parameter needs tuning per page type. Server-rendered pages are usually stable after 500–1000ms; SPAs built with React/Vue/Angular that fetch data on load need 2000–3000ms. The agent can detect loading states, and you can build retry logic around them.
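One way to build that retry logic: re-observe with a growing wait until the page is no longer in a loading state. This sketch takes the observe function as a parameter so it works with any observer, and the backoff numbers are arbitrary choices:

```python
import time

def observe_until_settled(observe, url, max_attempts=3, base_wait=2.0, _sleep=time.sleep):
    """Call observe(url) until the assessment status is no longer
    'loading', waiting longer between attempts. Returns the last observation."""
    observation = None
    for attempt in range(max_attempts):
        observation = observe(url)
        if observation['assessment'].get('status') != 'loading':
            break
        _sleep(base_wait * (attempt + 1))  # linear backoff: 2s, 4s, ...
    return observation
```

The `_sleep` parameter exists so tests can stub out the waiting.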
Structured output: The two-pass approach (description then JSON assessment) is more reliable than asking for structured JSON directly from a vision prompt. The model reasons better in natural language first.
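A defensive normalizer helps here too: even with response_format, fields can come back missing or mistyped, and the agent loop branches on them. A sketch:

```python
VALID_STATUSES = {'ok', 'error', 'loading', 'unknown'}

def normalize_assessment(raw):
    """Coerce a model-produced assessment dict into the shape the
    agent loop expects, with safe defaults for anything missing."""
    raw = raw if isinstance(raw, dict) else {}
    status = raw.get('status')
    goal = raw.get('goal_achieved')
    return {
        'status': status if status in VALID_STATUSES else 'unknown',
        'goal_achieved': goal if isinstance(goal, bool) else None,
        'key_finding': str(raw.get('key_finding', '')),
        'suggested_action': str(raw.get('suggested_action', '')),
    }
```

Running the parsed JSON through this before the loop inspects it turns malformed model output into an 'unknown' observation instead of a crash.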
Claude vs GPT-4o: Both handle screenshots well. Claude tends to be more precise about UI element states (enabled/disabled, form validation errors). GPT-4o tends to give more context about page purpose. Use what fits your stack.
Beyond Verification: Other Agent Patterns
The perception loop opens other patterns:
Competitive intelligence: Agent navigates competitor pricing pages on a schedule and extracts pricing table data as structured JSON — no scraping fragility, just vision.
Accessibility auditing: Agent screenshots each page and describes what a screen reader would encounter — missing alt text, unlabeled buttons, color contrast issues visible from the screenshot.
User journey testing: Agent walks a multi-step signup flow, capturing each step and verifying the expected state transitions (form → email verification → onboarding → dashboard).
Content freshness monitoring: Agent checks that a news page has updated content, a dashboard shows current data, a banner campaign is still running — questions about recency and content that pixel diffs can't answer.
The Underlying Shift
Screenshot APIs started out as tools for reports, PDFs, and thumbnails, then grew into monitoring and CI testing. Now they're perception infrastructure for AI agents.
The shift is that the consumer of the screenshot is no longer a human looking at a PNG — it's an LLM deciding what to do next. The screenshot is input to a reasoning system, not output for human review.
That changes what matters. Resolution and fidelity matter more (the LLM needs to read text in the screenshot). Timing matters (capture state after JS execution). Authentication matters (agents need to see authenticated states, not login walls).
An API that handles those requirements well becomes a standard tool in the agent toolkit — the way file I/O or HTTP requests are standard. That's where this is heading.
Get Started
Free API key at hermesforge.dev/screenshot. The vision loop runs on any screenshot API that returns a PNG — no browser binary required on your agent's host.