How I Built an Autonomous Agent That Monitors My Site and Emails Me When Something Breaks
The worst way to learn your site is broken is from a user.
"Hey, your signup page has been showing a blank white screen for the last hour." I have received that message twice. Both times, the cause was obvious in retrospect (a JavaScript bundle failing to load, a backend route returning 500), but I had no automated system watching the visual state of the page. I had uptime monitors. They all reported green.
The problem: uptime monitors check HTTP status codes. They don't look at the page. A server that returns 200 with a blank white body looks fine to a ping monitor. A user trying to sign up sees nothing.
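To make the gap concrete, here is a rough sketch. It is not taken from any real monitoring product: `status_check` and `fraction_near_white` are hypothetical helpers, and a "page" is modelled as a plain list of RGB tuples rather than a real screenshot.

```python
def status_check(status_code):
    """What a typical uptime probe asks: did the server answer with a 2xx?"""
    return 200 <= status_code < 300


def fraction_near_white(pixels, floor=250):
    """Share of pixels whose R, G and B channels are all at or above `floor`."""
    if not pixels:
        return 0.0
    white = sum(1 for r, g, b in pixels if min(r, g, b) >= floor)
    return white / len(pixels)


# A "successful" response whose rendered page is entirely white.
blank_page = [(255, 255, 255)] * 100
# A page with dark text and a coloured button on a white background.
normal_page = [(255, 255, 255)] * 70 + [(20, 20, 20)] * 25 + [(0, 120, 255)] * 5

print(status_check(200))                 # True: the probe is satisfied
print(fraction_near_white(blank_page))   # 1.0: nothing was rendered
print(fraction_near_white(normal_page))  # 0.7
```

The status check passes in both cases; only a look at the pixels tells the two pages apart.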
I built something different.
The Design
The agent runs every 15 minutes. It visits a set of pages, takes a screenshot of each, and compares each screenshot against the last known-good baseline. If a page looks meaningfully different from the baseline, it emails me with a before/after comparison.
There are no DOM selectors. No JavaScript parsing. No HTTP probes. Just: does this page look the way it's supposed to?
This is both simpler and more powerful than traditional monitoring. Simpler because I don't maintain per-page configurations that break every time the markup changes. More powerful because it catches what uptime monitors miss: broken layouts, missing images, CSS loading failures, JavaScript render errors, incorrect content.
┌─────────────────────────────────────────────────┐
│  monitor.py (cron: every 15 min)                │
│                                                 │
│  for each page in config:                       │
│    capture screenshot                           │
│    compare against baseline                     │
│    if significant change:                       │
│      flag as incident                           │
│      send email with before/after diff          │
│      wait for manual resolve or auto-recover    │
└─────────────────────────────────────────────────┘
Configuration
# config/monitor.yaml
pages:
  - name: homepage
    url: https://yoursite.com
    threshold_pct: 3.0
    delay: 1000
    check_elements:
      - description: "main CTA button visible"
        clip: {x: 0, y: 200, width: 1280, height: 400}
        threshold_pct: 5.0
  - name: signup
    url: https://yoursite.com/signup
    threshold_pct: 2.0
    delay: 1500
    check_elements:
      - description: "signup form"
        clip: {x: 300, y: 100, width: 680, height: 500}
        threshold_pct: 1.0  # form must not change
  - name: pricing
    url: https://yoursite.com/pricing
    threshold_pct: 5.0  # some dynamic content expected
    delay: 1000
  - name: api-docs
    url: https://yoursite.com/docs
    threshold_pct: 3.0
    delay: 800
settings:
  check_interval_minutes: 15
  alert_cooldown_minutes: 60  # don't re-alert for same page within 1h
  baseline_refresh_days: 7    # update baseline weekly
  screenshot_width: 1280
  screenshot_height: 900
The check_elements blocks let me set stricter thresholds for critical sub-regions. With the config above, a change of more than 1% in the signup form alerts immediately, even though the signup page as a whole tolerates 2%. A change of up to 3% on the homepage, say from a banner rotation, is fine.
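A small sketch of how the two threshold levels interact. The helper names `effective_threshold` and `is_incident` are mine, chosen for illustration; the fallback and the strict greater-than comparison follow what the monitor code does with `threshold_pct`.

```python
def effective_threshold(page_cfg, elem_cfg=None, default=3.0):
    """Element threshold if set, else the page threshold, else the default."""
    page_threshold = page_cfg.get('threshold_pct', default)
    if elem_cfg is None:
        return page_threshold
    return elem_cfg.get('threshold_pct', page_threshold)


def is_incident(change_pct, threshold_pct):
    """An incident is flagged only when the change is strictly above the threshold."""
    return change_pct > threshold_pct


signup = {
    'name': 'signup',
    'threshold_pct': 2.0,
    'check_elements': [
        {'description': 'signup form', 'threshold_pct': 1.0},
    ],
}
form = signup['check_elements'][0]

print(effective_threshold(signup))        # 2.0 for the full page
print(effective_threshold(signup, form))  # 1.0 for the form region
print(is_incident(1.5, effective_threshold(signup)))        # False
print(is_incident(1.5, effective_threshold(signup, form)))  # True
```

The same 1.5% change is ignored at page level but flagged for the form region, which is the point of nesting the stricter threshold.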
The Monitor
import requests
import yaml
import sqlite3
import os
import time
import hashlib
from datetime import datetime, timezone, timedelta
from pathlib import Path
from PIL import Image, ImageChops, ImageDraw
import numpy as np
import io
API_KEY = os.environ['SCREENSHOT_API_KEY']
SCREENSHOT_URL = 'https://hermesforge.dev/api/screenshot'
DATA_DIR = Path('monitor-data')
DB_PATH = DATA_DIR / 'monitor.db'
def init():
    DATA_DIR.mkdir(exist_ok=True)
    (DATA_DIR / 'baselines').mkdir(exist_ok=True)
    (DATA_DIR / 'incidents').mkdir(exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS checks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            page TEXT NOT NULL,
            checked_at TEXT NOT NULL,
            status TEXT NOT NULL,  -- ok, incident, error
            change_pct REAL,
            incident_id INTEGER
        );
        CREATE TABLE IF NOT EXISTS incidents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            page TEXT NOT NULL,
            opened_at TEXT NOT NULL,
            closed_at TEXT,
            status TEXT DEFAULT 'open',  -- open, auto-resolved, acknowledged
            last_alerted_at TEXT,
            baseline_path TEXT,
            incident_path TEXT,
            diff_path TEXT,
            change_pct REAL
        );
        CREATE TABLE IF NOT EXISTS baselines (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            page TEXT NOT NULL,
            element TEXT,
            created_at TEXT NOT NULL,
            image_hash TEXT NOT NULL,
            image_path TEXT NOT NULL,
            is_current INTEGER DEFAULT 1
        );
    """)
    conn.commit()
    return conn
def capture(url, width=1280, height=900, delay=1000, clip=None):
    params = {
        'url': url,
        'width': width,
        # When clipping, make the viewport tall enough to contain the clip region
        'height': (clip['y'] + clip['height'] + 50) if clip else height,
        'format': 'png',
        'full_page': 'false',
        'delay': delay,
    }
    resp = requests.get(
        SCREENSHOT_URL,
        params=params,
        headers={'X-API-Key': API_KEY},
        timeout=60,
    )
    resp.raise_for_status()
    img = Image.open(io.BytesIO(resp.content))
    if clip:
        img = img.crop((clip['x'], clip['y'],
                        clip['x'] + clip['width'],
                        clip['y'] + clip['height']))
    return img
def pixel_diff(before, after, threshold=15):
    w = min(before.width, after.width)
    h = min(before.height, after.height)
    b = np.array(before.crop((0, 0, w, h)).convert('RGB'))
    a = np.array(after.crop((0, 0, w, h)).convert('RGB'))
    diff = np.abs(b.astype(int) - a.astype(int))
    mask = np.any(diff > threshold, axis=2)
    pct = mask.sum() / (w * h) * 100
    # Annotated diff image
    diff_img = after.crop((0, 0, w, h)).copy()
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if len(rows) and len(cols):
        draw = ImageDraw.Draw(diff_img)
        draw.rectangle([int(cols[0])-3, int(rows[0])-3,
                        int(cols[-1])+3, int(rows[-1])+3],
                       outline=(255, 50, 50), width=3)
    return diff_img, pct
def get_baseline(conn, page, element=None):
    row = conn.execute(
        "SELECT image_path, image_hash FROM baselines "
        "WHERE page=? AND element IS ? AND is_current=1",
        (page, element)
    ).fetchone()
    return row


def set_baseline(conn, page, img, element=None):
    timestamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    fname = f"{page.replace('/', '_')}__{element or 'full'}__{timestamp}.png"
    path = DATA_DIR / 'baselines' / fname
    img.save(path)
    img_hash = hashlib.sha256(path.read_bytes()).hexdigest()[:16]
    conn.execute(
        "UPDATE baselines SET is_current=0 WHERE page=? AND element IS ?",
        (page, element)
    )
    conn.execute(
        "INSERT INTO baselines (page, element, created_at, image_hash, image_path) "
        "VALUES (?, ?, ?, ?, ?)",
        (page, element, datetime.now(timezone.utc).isoformat(), img_hash, str(path))
    )
    conn.commit()
    return path
def open_incident(conn, page, baseline_path, current_img, diff_img, change_pct):
    timestamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    inc_path = DATA_DIR / 'incidents' / f"{page.replace('/', '_')}__{timestamp}__after.png"
    diff_path = DATA_DIR / 'incidents' / f"{page.replace('/', '_')}__{timestamp}__diff.png"
    current_img.save(inc_path)
    diff_img.save(diff_path)
    cursor = conn.execute(
        "INSERT INTO incidents (page, opened_at, last_alerted_at, baseline_path, "
        "incident_path, diff_path, change_pct) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (page, datetime.now(timezone.utc).isoformat(),
         datetime.now(timezone.utc).isoformat(),
         str(baseline_path), str(inc_path), str(diff_path), change_pct)
    )
    conn.commit()
    return cursor.lastrowid, inc_path, diff_path


def should_alert(conn, page, cooldown_minutes=60):
    row = conn.execute(
        "SELECT last_alerted_at FROM incidents "
        "WHERE page=? AND status='open' ORDER BY opened_at DESC LIMIT 1",
        (page,)
    ).fetchone()
    if not row:
        return True
    last = datetime.fromisoformat(row[0])
    return datetime.now(timezone.utc) - last > timedelta(minutes=cooldown_minutes)


def check_auto_resolve(conn, page, current_img):
    """Check if current state matches baseline, i.e. the incident self-resolved."""
    baseline = get_baseline(conn, page)
    if not baseline:
        return False
    baseline_img = Image.open(baseline[0])
    _, pct = pixel_diff(baseline_img, current_img)
    if pct < 1.0:
        conn.execute(
            "UPDATE incidents SET status='auto-resolved', closed_at=? "
            "WHERE page=? AND status='open'",
            (datetime.now(timezone.utc).isoformat(), page)
        )
        conn.commit()
        return True
    return False
def run_monitor(config_path='config/monitor.yaml'):
    conn = init()
    config = yaml.safe_load(Path(config_path).read_text())
    settings = config.get('settings', {})
    cooldown = settings.get('alert_cooldown_minutes', 60)
    baseline_days = settings.get('baseline_refresh_days', 7)
    width = settings.get('screenshot_width', 1280)
    height = settings.get('screenshot_height', 900)
    alerts = []
    now = datetime.now(timezone.utc)

    for page_cfg in config['pages']:
        name = page_cfg['name']
        url = page_cfg['url']
        threshold = page_cfg.get('threshold_pct', 3.0)
        delay = page_cfg.get('delay', 1000)
        print(f"  Checking: {name}")
        try:
            current = capture(url, width=width, height=height, delay=delay)

            # Check sub-elements first (each region is a separate capture)
            element_incident = False
            for elem in page_cfg.get('check_elements', []):
                elem_img = capture(url, width=width, height=height, delay=delay,
                                   clip=elem.get('clip'))
                elem_name = elem.get('description', 'element')
                elem_threshold = elem.get('threshold_pct', threshold)
                baseline_row = get_baseline(conn, name, elem_name)
                if baseline_row is None:
                    set_baseline(conn, name, elem_img, elem_name)
                    print(f"  [{name}/{elem_name}] baseline set")
                    continue
                baseline_img = Image.open(baseline_row[0])
                diff_img, pct = pixel_diff(baseline_img, elem_img)
                if pct > elem_threshold:
                    element_incident = True
                    print(f"  [{name}/{elem_name}] INCIDENT: {pct:.1f}% change")
                    if should_alert(conn, f"{name}/{elem_name}", cooldown):
                        inc_id, inc_path, diff_path = open_incident(
                            conn, f"{name}/{elem_name}",
                            baseline_row[0], elem_img, diff_img, pct
                        )
                        alerts.append({
                            'page': name, 'element': elem_name, 'url': url,
                            'change_pct': pct,
                            'before': baseline_img,
                            'after': elem_img,
                            'diff': diff_img,
                        })
                else:
                    print(f"  [{name}/{elem_name}] ok ({pct:.1f}%)")
                time.sleep(0.3)

            # Full-page check
            baseline_row = get_baseline(conn, name)
            if baseline_row is None:
                set_baseline(conn, name, current)
                print(f"  [{name}] full-page baseline set")
                continue

            # Check if baseline is stale; refresh if needed
            baseline_age = now - datetime.fromisoformat(
                conn.execute(
                    "SELECT created_at FROM baselines WHERE page=? AND element IS NULL AND is_current=1",
                    (name,)
                ).fetchone()[0]
            )
            if baseline_age.days >= baseline_days:
                set_baseline(conn, name, current)
                print(f"  [{name}] baseline refreshed ({baseline_age.days}d old)")
                continue

            baseline_img = Image.open(baseline_row[0])
            diff_img, pct = pixel_diff(baseline_img, current)
            if pct > threshold:
                # check_auto_resolve only returns True when the change is below 1.0%,
                # so this early exit can fire only for pages with threshold_pct < 1.0
                if check_auto_resolve(conn, name, current):
                    print(f"  [{name}] auto-resolved")
                    continue
                print(f"  [{name}] INCIDENT: {pct:.1f}% change")
                if should_alert(conn, name, cooldown):
                    open_incident(conn, name, baseline_row[0], current, diff_img, pct)
                    alerts.append({
                        'page': name, 'element': None, 'url': url,
                        'change_pct': pct,
                        'before': baseline_img,
                        'after': current,
                        'diff': diff_img,
                    })
                conn.execute(
                    "INSERT INTO checks (page, checked_at, status, change_pct) "
                    "VALUES (?, ?, 'incident', ?)",
                    (name, now.isoformat(), pct)
                )
            else:
                print(f"  [{name}] ok ({pct:.1f}%)")
                conn.execute(
                    "INSERT INTO checks (page, checked_at, status, change_pct) "
                    "VALUES (?, ?, 'ok', ?)",
                    (name, now.isoformat(), pct)
                )
                # If there was an open incident, check if it self-resolved
                open_inc = conn.execute(
                    "SELECT id FROM incidents WHERE page=? AND status='open'", (name,)
                ).fetchone()
                if open_inc:
                    conn.execute(
                        "UPDATE incidents SET status='auto-resolved', closed_at=? WHERE id=?",
                        (now.isoformat(), open_inc[0])
                    )
                    print(f"  [{name}] incident auto-resolved")
            conn.commit()
        except Exception as e:
            print(f"  [{name}] ERROR: {e}")
            conn.execute(
                "INSERT INTO checks (page, checked_at, status) VALUES (?, ?, 'error')",
                (name, now.isoformat())
            )
            conn.commit()
        time.sleep(0.5)

    conn.close()
    if alerts:
        # send_alert must be defined or imported before run_monitor() is called
        send_alert(alerts)
        print(f"\n  Alert sent: {len(alerts)} issue(s)")
    else:
        print("\n  All clear.")


if __name__ == '__main__':
    run_monitor()
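To see what the percentage from pixel_diff means, here is a dependency-free sketch of the same rule on a ten-pixel "image": a pixel counts as changed when any RGB channel differs by more than the threshold (15 by default). `changed_pct` is a hypothetical stand-in that works on lists of tuples; it is not the NumPy version used by the monitor and it draws no diff rectangle.

```python
def changed_pct(before, after, threshold=15):
    """Percentage of pixels where any RGB channel differs by more than threshold."""
    if len(before) != len(after):
        raise ValueError("images must have the same number of pixels")
    changed = sum(
        1
        for (r1, g1, b1), (r2, g2, b2) in zip(before, after)
        if max(abs(r1 - r2), abs(g1 - g2), abs(b1 - b2)) > threshold
    )
    return changed * 100 / len(before)


before = [(255, 255, 255)] * 8 + [(0, 0, 0)] * 2
after = (
    [(255, 255, 255)] * 6      # unchanged
    + [(250, 250, 250)] * 2    # differs by 5 per channel: below the threshold
    + [(0, 0, 0)]              # unchanged
    + [(200, 0, 0)]            # red channel differs by 200: counted
)
print(changed_pct(before, after))  # 10.0
```

The per-channel tolerance is what keeps small rendering differences, such as slightly different anti-aliasing, from counting as change; only the one clearly different pixel registers here.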
The Alert Email
# Define this above the __main__ guard in monitor.py (or import it there),
# so that send_alert exists by the time run_monitor() calls it.
import io
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage


def send_alert(alerts):
    msg = MIMEMultipart('related')
    msg['Subject'] = f"[Site Monitor] {len(alerts)} visual change(s) detected"
    msg['From'] = os.environ['ALERT_FROM']
    msg['To'] = os.environ['ALERT_TO']

    # One table section per alert: a label row, then before / now / diff images
    rows = []
    for i, a in enumerate(alerts):
        label = a['page'] + (f" / {a['element']}" if a['element'] else '')
        rows.append(f"""
        <tr>
          <td colspan="3"><b>{label}</b> &mdash; {a['change_pct']:.1f}% changed
          &mdash; <a href="{a['url']}">{a['url']}</a></td>
        </tr>
        <tr>
          <td><img src="cid:before_{i}" width="380"><br><small>Before</small></td>
          <td><img src="cid:after_{i}" width="380"><br><small>Now</small></td>
          <td><img src="cid:diff_{i}" width="380"><br><small>Diff (red = changed)</small></td>
        </tr>
        <tr><td colspan="3"><hr></td></tr>
        """)
    html = f"<h2>Visual Monitor Alert</h2><table>{''.join(rows)}</table>"
    msg.attach(MIMEText(html, 'html'))

    def attach(img, cid):
        # Inline image whose Content-ID matches a cid: reference in the HTML
        buf = io.BytesIO()
        img.save(buf, format='PNG')
        m = MIMEImage(buf.getvalue(), 'png')
        m.add_header('Content-ID', f'<{cid}>')
        m.add_header('Content-Disposition', 'inline')
        msg.attach(m)

    for i, a in enumerate(alerts):
        attach(a['before'], f'before_{i}')
        attach(a['after'], f'after_{i}')
        attach(a['diff'], f'diff_{i}')

    with smtplib.SMTP_SSL('smtp.gmail.com', 465) as smtp:
        smtp.login(os.environ['SMTP_USER'], os.environ['SMTP_PASSWORD'])
        smtp.sendmail(msg['From'], [msg['To']], msg.as_string())
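The inline images only render if every cid: reference in the HTML has an attachment with a matching Content-ID. A minimal sketch of checking that without sending anything follows. `build_message`, `attached_cids` and `missing_cids` are hypothetical helpers, and the image bytes are placeholders rather than real PNG data.

```python
import re
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(html, images):
    """Assemble a multipart/related message; `images` maps cid -> image bytes."""
    msg = MIMEMultipart('related')
    msg['Subject'] = '[Site Monitor] preview'
    msg.attach(MIMEText(html, 'html'))
    for cid, data in images.items():
        part = MIMEImage(data, 'png')  # subtype given explicitly, so no type sniffing
        part.add_header('Content-ID', f'<{cid}>')
        part.add_header('Content-Disposition', 'inline')
        msg.attach(part)
    return msg


def attached_cids(msg):
    """Content-IDs of all parts, with the angle brackets stripped."""
    return [p['Content-ID'].strip('<>') for p in msg.walk() if p['Content-ID']]


def missing_cids(html, msg):
    """cid: references in the HTML that have no matching attachment."""
    referenced = re.findall(r'cid:([\w.-]+)', html)
    attached = set(attached_cids(msg))
    return [cid for cid in referenced if cid not in attached]


html = '<img src="cid:before_0"><img src="cid:after_0"><img src="cid:diff_0">'
msg = build_message(html, {'before_0': b'placeholder-1', 'after_0': b'placeholder-2'})
print(msg.get_content_type())   # multipart/related
print(attached_cids(msg))       # ['before_0', 'after_0']
print(missing_cids(html, msg))  # ['diff_0']
```

A check like this is cheap to run in a test before wiring up SMTP credentials.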
Cron Setup
# Check every 15 minutes
*/15 * * * * cd /home/user/site-monitor && python3 monitor.py >> logs/monitor.log 2>&1
What It Has Caught
Running this on my own site for two months:
- Blank signup page (the event that prompted this build): caught at the 15-minute check. Root cause: a CDN cache purge that left the JS bundle 404ing. Status code was 200. Uptime monitor: green. Visual monitor: immediate alert.
- CSS regression on mobile: a deploy broke the nav menu on screens under 768px wide. I had no mobile checks before. Added a 390px viewport check after that incident.
- Third-party script timeout: a chat widget was timing out, causing a 6-second blank stall before the page loaded. The screenshot with delay=1000 caught the stall; the screenshot with delay=7000 showed the resolved page. The delta revealed the timing issue.
- Content accidentally deleted: a CMS draft was accidentally published and overwrote the pricing page with placeholder text. Caught at the next 15-minute check.
Three of these four would have been invisible to a traditional uptime monitor. All four were embarrassing. None of them reached a user without me knowing first.
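The monitor code above reads a single screenshot_width from settings, so a mobile check needs some per-page way to list viewports. One possible shape, sketched under assumptions: `viewports` is a hypothetical config key and `expand_viewports` a hypothetical helper, neither of which exists in the config or code shown earlier.

```python
def expand_viewports(page_cfg, default_width=1280):
    """Yield one (check_name, width) pair per viewport listed for the page.

    `viewports` is a hypothetical per-page key, not part of the config above.
    """
    for width in page_cfg.get('viewports', [default_width]):
        suffix = '' if width == default_width else f'@{width}'
        yield f"{page_cfg['name']}{suffix}", width


pages = [
    {'name': 'homepage', 'viewports': [1280, 390]},
    {'name': 'pricing'},  # falls back to the default width
]
for page in pages:
    for check_name, width in expand_viewports(page):
        print(check_name, width)
# homepage 1280
# homepage@390 390
# pricing 1280
```

Giving each viewport its own check name matters because baselines are stored per page name; a 390px screenshot compared against a 1280px baseline would report a change on every run.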
The Feedback Loop
The monitor has made me a better deployer. When every deploy is followed by a visual check fifteen minutes later, you get honest feedback about whether the deploy looked the way you intended. After a while you stop deploying at 17:00 on a Friday.
That might be the most useful thing it's done.
Get Your API Key
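Free API key at hermesforge.dev/screenshot. A four-page monitor checking every 15 minutes runs about 384 full-page API calls per day (4 pages × 96 runs), well within the free tier. Each check_elements region is captured separately, so it adds another 96 calls per day.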
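A quick way to estimate the daily call count for your own config, assuming (as in the monitor code above) one capture per page plus one per check_elements entry on every run. `captures_per_run` and `calls_per_day` are illustrative names.

```python
def captures_per_run(pages):
    """One full-page capture per page plus one per check_elements entry."""
    return sum(1 + len(p.get('check_elements', [])) for p in pages)


def calls_per_day(pages, interval_minutes=15):
    """Captures per run multiplied by the number of runs in 24 hours."""
    runs = 24 * 60 // interval_minutes
    return captures_per_run(pages) * runs


full_pages_only = [{'name': n} for n in ('homepage', 'signup', 'pricing', 'api-docs')]
sample_config = [
    {'name': 'homepage', 'check_elements': [{'description': 'main CTA button visible'}]},
    {'name': 'signup', 'check_elements': [{'description': 'signup form'}]},
    {'name': 'pricing'},
    {'name': 'api-docs'},
]
print(calls_per_day(full_pages_only))  # 384
print(calls_per_day(sample_config))    # 576
```

Compare the result against whatever limit your plan actually states before adding more pages or regions.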