How I Built an Autonomous Agent That Monitors My Site and Emails Me When Something Breaks

2026-03-26 | Tags: [autonomous-agents, screenshot-api, python, monitoring, automation, ai, story]

The worst way to learn your site is broken is from a user.

"Hey, your signup page has been showing a blank white screen for the last hour." I have received that message twice. Both times the cause was obvious in retrospect (a JavaScript bundle failing to load, a backend route the page depends on returning 500), but I had no automated system watching the visual state of the page. I had uptime monitors. They all reported green.

The problem: uptime monitors check HTTP status codes. They don't look at the page. A server that returns 200 with a blank white body looks fine to a ping monitor. A user trying to sign up sees nothing.

I built something different.

The Design

The agent runs every 15 minutes. It visits a set of pages, takes a screenshot of each, and compares each screenshot against the last known-good baseline. If a page looks meaningfully different from the baseline, it emails me with a before/after comparison.

There are no DOM selectors. No JavaScript parsing. No HTTP probes. Just: does this page look the way it's supposed to?

This is both simpler and more powerful than traditional monitoring. Simpler because I don't maintain per-page configurations that break every time the markup changes. More powerful because it catches what uptime monitors miss: broken layouts, missing images, CSS loading failures, JavaScript render errors, incorrect content.

┌─────────────────────────────────────────────────┐
│  monitor.py (cron: every 15 min)                │
│                                                 │
│  for each page in config:                       │
│    capture screenshot                           │
│    compare against baseline                     │
│    if significant change:                       │
│      flag as incident                           │
│      send email with before/after diff          │
│      wait for manual resolve or auto-recover    │
└─────────────────────────────────────────────────┘

Configuration

# config/monitor.yaml
pages:
  - name: homepage
    url: https://yoursite.com
    threshold_pct: 3.0
    delay: 1000            # milliseconds to wait before capturing
    check_elements:
      - description: "main CTA button visible"
        clip: {x: 0, y: 200, width: 1280, height: 400}
        threshold_pct: 5.0

  - name: signup
    url: https://yoursite.com/signup
    threshold_pct: 2.0
    delay: 1500
    check_elements:
      - description: "signup form"
        clip: {x: 300, y: 100, width: 680, height: 500}
        threshold_pct: 1.0   # form must not change

  - name: pricing
    url: https://yoursite.com/pricing
    threshold_pct: 5.0    # some dynamic content expected
    delay: 1000

  - name: api-docs
    url: https://yoursite.com/docs
    threshold_pct: 3.0
    delay: 800

settings:
  check_interval_minutes: 15
  alert_cooldown_minutes: 60   # don't re-alert for same page within 1h
  baseline_refresh_days: 7     # update baseline weekly
  screenshot_width: 1280
  screenshot_height: 900

The check_elements blocks let me set stricter thresholds for critical sub-regions. Each region is cropped to its clip rectangle and compared against its own baseline. The signup form alerts as soon as more than 1% of its pixels change, while a homepage change under 3%, such as a banner rotation, passes.
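To get a feel for what these percentages mean, it can help to convert them into pixel counts. The sketch below is only an illustration: pixel_budget is a hypothetical helper, not part of the monitor, and it assumes the threshold is a percentage of the pixels in the compared image, which is how pixel_diff further down computes it.

```python
def pixel_budget(width, height, threshold_pct):
    """Number of pixels that may differ before threshold_pct is exceeded."""
    return int(width * height * threshold_pct / 100)

# Full-page signup check: 1280x900 viewport at 2.0%
full_page = pixel_budget(1280, 900, 2.0)
# Signup form region: 680x500 clip at 1.0%
form_region = pixel_budget(680, 500, 1.0)

print(full_page, form_region)
```

With these numbers, a change of 5,000 pixels inside the form would pass the full-page check (23,040 pixels allowed) but trip the region check (3,400 allowed), which is the reason to give the form its own entry.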

The Monitor

import requests
import yaml
import sqlite3
import os
import time
import hashlib
from datetime import datetime, timezone, timedelta
from pathlib import Path
from PIL import Image, ImageChops, ImageDraw
import numpy as np
import io

API_KEY = os.environ['SCREENSHOT_API_KEY']
SCREENSHOT_URL = 'https://hermesforge.dev/api/screenshot'
DATA_DIR = Path('monitor-data')
DB_PATH = DATA_DIR / 'monitor.db'

def init():
    DATA_DIR.mkdir(exist_ok=True)
    (DATA_DIR / 'baselines').mkdir(exist_ok=True)
    (DATA_DIR / 'incidents').mkdir(exist_ok=True)

    conn = sqlite3.connect(DB_PATH)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS checks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            page TEXT NOT NULL,
            checked_at TEXT NOT NULL,
            status TEXT NOT NULL,       -- ok, incident, error
            change_pct REAL,
            incident_id INTEGER
        );
        CREATE TABLE IF NOT EXISTS incidents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            page TEXT NOT NULL,
            opened_at TEXT NOT NULL,
            closed_at TEXT,
            status TEXT DEFAULT 'open', -- open, auto-resolved, acknowledged
            last_alerted_at TEXT,
            baseline_path TEXT,
            incident_path TEXT,
            diff_path TEXT,
            change_pct REAL
        );
        CREATE TABLE IF NOT EXISTS baselines (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            page TEXT NOT NULL,
            element TEXT,
            created_at TEXT NOT NULL,
            image_hash TEXT NOT NULL,
            image_path TEXT NOT NULL,
            is_current INTEGER DEFAULT 1
        );
    """)
    conn.commit()
    return conn

def capture(url, width=1280, height=900, delay=1000, clip=None):
    params = {
        'url': url,
        'width': width,
        # Tall enough to contain the clip region, otherwise the default height
        'height': (clip['y'] + clip['height'] + 50) if clip else height,
        'format': 'png',
        'full_page': 'false',
        'delay': delay,
    }
    resp = requests.get(
        SCREENSHOT_URL,
        params=params,
        headers={'X-API-Key': API_KEY},
        timeout=60,
    )
    resp.raise_for_status()
    img = Image.open(io.BytesIO(resp.content))
    if clip:
        img = img.crop((clip['x'], clip['y'],
                        clip['x'] + clip['width'],
                        clip['y'] + clip['height']))
    return img

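def pixel_diff(before, after, threshold=15):
    """Return (annotated image, percent of pixels changed).

    A pixel counts as changed when any RGB channel differs by more than
    `threshold`. Images of different sizes are compared over their shared area.
    """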
    w = min(before.width, after.width)
    h = min(before.height, after.height)
    b = np.array(before.crop((0, 0, w, h)).convert('RGB'))
    a = np.array(after.crop((0, 0, w, h)).convert('RGB'))
    diff = np.abs(b.astype(int) - a.astype(int))
    mask = np.any(diff > threshold, axis=2)
    pct = mask.sum() / (w * h) * 100

    # Annotated diff image
    diff_img = after.crop((0, 0, w, h)).copy()
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if len(rows) and len(cols):
        draw = ImageDraw.Draw(diff_img)
        draw.rectangle([int(cols[0])-3, int(rows[0])-3,
                        int(cols[-1])+3, int(rows[-1])+3],
                       outline=(255, 50, 50), width=3)
    return diff_img, pct

def get_baseline(conn, page, element=None):
    row = conn.execute(
        "SELECT image_path, image_hash FROM baselines "
        "WHERE page=? AND element IS ? AND is_current=1",
        (page, element)
    ).fetchone()
    return row

def set_baseline(conn, page, img, element=None):
    timestamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    fname = f"{page.replace('/', '_')}__{element or 'full'}__{timestamp}.png"
    path = DATA_DIR / 'baselines' / fname
    img.save(path)
    img_hash = hashlib.sha256(path.read_bytes()).hexdigest()[:16]

    conn.execute(
        "UPDATE baselines SET is_current=0 WHERE page=? AND element IS ?",
        (page, element)
    )
    conn.execute(
        "INSERT INTO baselines (page, element, created_at, image_hash, image_path) "
        "VALUES (?, ?, ?, ?, ?)",
        (page, element, datetime.now(timezone.utc).isoformat(), img_hash, str(path))
    )
    conn.commit()
    return path

def open_incident(conn, page, baseline_path, current_img, diff_img, change_pct):
    timestamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')
    inc_path = DATA_DIR / 'incidents' / f"{page.replace('/', '_')}__{timestamp}__after.png"
    diff_path = DATA_DIR / 'incidents' / f"{page.replace('/', '_')}__{timestamp}__diff.png"
    current_img.save(inc_path)
    diff_img.save(diff_path)

    cursor = conn.execute(
        "INSERT INTO incidents (page, opened_at, last_alerted_at, baseline_path, "
        "incident_path, diff_path, change_pct) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (page, datetime.now(timezone.utc).isoformat(),
         datetime.now(timezone.utc).isoformat(),
         str(baseline_path), str(inc_path), str(diff_path), change_pct)
    )
    conn.commit()
    return cursor.lastrowid, inc_path, diff_path

def should_alert(conn, page, cooldown_minutes=60):
    row = conn.execute(
        "SELECT last_alerted_at FROM incidents "
        "WHERE page=? AND status='open' ORDER BY opened_at DESC LIMIT 1",
        (page,)
    ).fetchone()
    if not row:
        return True
    last = datetime.fromisoformat(row[0])
    return datetime.now(timezone.utc) - last > timedelta(minutes=cooldown_minutes)

def check_auto_resolve(conn, page, current_img):
    """Check if current state matches baseline — incident self-resolved."""
    baseline = get_baseline(conn, page)
    if not baseline:
        return False
    baseline_img = Image.open(baseline[0])
    _, pct = pixel_diff(baseline_img, current_img)
    if pct < 1.0:
        conn.execute(
            "UPDATE incidents SET status='auto-resolved', closed_at=? "
            "WHERE page=? AND status='open'",
            (datetime.now(timezone.utc).isoformat(), page)
        )
        conn.commit()
        return True
    return False

def run_monitor(config_path='config/monitor.yaml'):
    conn = init()
    config = yaml.safe_load(Path(config_path).read_text())
    settings = config.get('settings', {})
    cooldown = settings.get('alert_cooldown_minutes', 60)
    baseline_days = settings.get('baseline_refresh_days', 7)
    width = settings.get('screenshot_width', 1280)
    height = settings.get('screenshot_height', 900)

    alerts = []
    now = datetime.now(timezone.utc)

    for page_cfg in config['pages']:
        name = page_cfg['name']
        url = page_cfg['url']
        threshold = page_cfg.get('threshold_pct', 3.0)
        delay = page_cfg.get('delay', 1000)
        print(f"  Checking: {name}")

        try:
            current = capture(url, width=width, height=height, delay=delay)

            # Check sub-elements first
            element_incident = False
            for elem in page_cfg.get('check_elements', []):
                elem_img = capture(url, width=width, height=height, delay=delay,
                                   clip=elem.get('clip'))
                elem_name = elem.get('description', 'element')
                elem_threshold = elem.get('threshold_pct', threshold)

                baseline_row = get_baseline(conn, name, elem_name)
                if baseline_row is None:
                    set_baseline(conn, name, elem_img, elem_name)
                    print(f"    [{name}/{elem_name}] baseline set")
                    continue

                baseline_img = Image.open(baseline_row[0])
                diff_img, pct = pixel_diff(baseline_img, elem_img)

                if pct > elem_threshold:
                    element_incident = True
                    print(f"    [{name}/{elem_name}] INCIDENT: {pct:.1f}% change")
                    if should_alert(conn, f"{name}/{elem_name}", cooldown):
                        inc_id, inc_path, diff_path = open_incident(
                            conn, f"{name}/{elem_name}",
                            baseline_row[0], elem_img, diff_img, pct
                        )
                        alerts.append({
                            'page': name, 'element': elem_name, 'url': url,
                            'change_pct': pct,
                            'before': baseline_img,
                            'after': elem_img,
                            'diff': diff_img,
                        })
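                else:
                    print(f"    [{name}/{elem_name}] ok ({pct:.1f}%)")
                    # Region matches its baseline again: close any incident left open for it
                    conn.execute(
                        "UPDATE incidents SET status='auto-resolved', closed_at=? "
                        "WHERE page=? AND status='open'",
                        (now.isoformat(), f"{name}/{elem_name}")
                    )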
                time.sleep(0.3)

            # Full-page check
            baseline_row = get_baseline(conn, name)
            if baseline_row is None:
                set_baseline(conn, name, current)
                print(f"    [{name}] full-page baseline set")
                continue

            baseline_img = Image.open(baseline_row[0])
            diff_img, pct = pixel_diff(baseline_img, current)

            # Refresh a stale baseline only while the page still matches it;
            # otherwise a broken page would silently become the new baseline.
            baseline_age = now - datetime.fromisoformat(
                conn.execute(
                    "SELECT created_at FROM baselines WHERE page=? AND element IS NULL AND is_current=1",
                    (name,)
                ).fetchone()[0]
            )
            if baseline_age.days >= baseline_days and pct <= threshold:
                set_baseline(conn, name, current)
                print(f"    [{name}] baseline refreshed ({baseline_age.days}d old)")

            if pct > threshold:
                # Recovery is handled in the else branch below, once a later
                # check falls back under the threshold.
                print(f"    [{name}] INCIDENT: {pct:.1f}% change")
                if should_alert(conn, name, cooldown):
                    open_incident(conn, name, baseline_row[0], current, diff_img, pct)
                    alerts.append({
                        'page': name, 'element': None, 'url': url,
                        'change_pct': pct,
                        'before': baseline_img,
                        'after': current,
                        'diff': diff_img,
                    })
                conn.execute(
                    "INSERT INTO checks (page, checked_at, status, change_pct) "
                    "VALUES (?, ?, 'incident', ?)",
                    (name, now.isoformat(), pct)
                )
            else:
                print(f"    [{name}] ok ({pct:.1f}%)")
                conn.execute(
                    "INSERT INTO checks (page, checked_at, status, change_pct) "
                    "VALUES (?, ?, 'ok', ?)",
                    (name, now.isoformat(), pct)
                )
                # The page matches its baseline again, so close every open incident
                # for it (a re-alert after the cooldown opens a second row).
                resolved = conn.execute(
                    "UPDATE incidents SET status='auto-resolved', closed_at=? "
                    "WHERE page=? AND status='open'",
                    (now.isoformat(), name)
                ).rowcount
                if resolved:
                    print(f"    [{name}] incident auto-resolved")

            conn.commit()

        except Exception as e:
            print(f"    [{name}] ERROR: {e}")
            conn.execute(
                "INSERT INTO checks (page, checked_at, status) VALUES (?, ?, 'error')",
                (name, now.isoformat())
            )
            conn.commit()

        time.sleep(0.5)

    conn.close()
    if alerts:
        send_alert(alerts)
        print(f"\n  Alert sent: {len(alerts)} issue(s)")
    else:
        print(f"\n  All clear.")

if __name__ == '__main__':
    run_monitor()
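The comparison in pixel_diff can be easier to reason about on a tiny input. The following is a dependency-free sketch of the same rule (a pixel counts as changed when any channel differs by more than the per-pixel threshold). changed_pct is a hypothetical stand-in for illustration and works on nested lists of RGB tuples rather than PIL images.

```python
def changed_pct(before, after, threshold=15):
    """Percentage of pixels where any RGB channel differs by more than threshold.

    before/after are equal-sized grids: lists of rows of (r, g, b) tuples.
    """
    total = changed = 0
    for row_b, row_a in zip(before, after):
        for px_b, px_a in zip(row_b, row_a):
            total += 1
            if any(abs(cb - ca) > threshold for cb, ca in zip(px_b, px_a)):
                changed += 1
    return changed * 100 / total

white = (255, 255, 255)
baseline = [[white] * 10 for _ in range(10)]   # 10x10 all-white "page"

noisy = [row[:] for row in baseline]
noisy[0][0] = (250, 250, 250)                  # small rendering shift: ignored

broken = [row[:] for row in baseline]
for x in range(5):
    broken[0][x] = (200, 30, 30)               # five clearly different pixels

print(changed_pct(baseline, noisy))    # 0.0
print(changed_pct(baseline, broken))   # 5.0
```

The per-pixel threshold of 15 absorbs small rendering noise, and the percentage threshold from the config then decides whether the remaining changes are large enough to alert.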

The Alert Email

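# Define send_alert above the `if __name__ == '__main__'` guard in monitor.py,
# or import it from its own module: it must exist before run_monitor() is called.
import io
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage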

def send_alert(alerts):
    msg = MIMEMultipart('related')
    msg['Subject'] = f"[Site Monitor] {len(alerts)} visual change(s) detected"
    msg['From'] = os.environ['ALERT_FROM']
    msg['To'] = os.environ['ALERT_TO']

    rows = []
    for i, a in enumerate(alerts):
        label = a['page'] + (f" / {a['element']}" if a['element'] else '')
        rows.append(f"""
        <tr>
          <td colspan="3"><b>{label}</b> — {a['change_pct']:.1f}% changed
          — <a href="{a['url']}">{a['url']}</a></td>
        </tr>
        <tr>
          <td><img src="cid:before_{i}" width="380"><br><small>Before</small></td>
          <td><img src="cid:after_{i}" width="380"><br><small>Now</small></td>
          <td><img src="cid:diff_{i}" width="380"><br><small>Diff (red = changed)</small></td>
        </tr>
        <tr><td colspan="3"><hr></td></tr>
        """)

    html = f"<h2>Visual Monitor Alert</h2><table>{''.join(rows)}</table>"
    msg.attach(MIMEText(html, 'html'))

    def attach(img, cid):
        buf = io.BytesIO()
        img.save(buf, format='PNG')
        m = MIMEImage(buf.getvalue(), 'png')
        m.add_header('Content-ID', f'<{cid}>')
        m.add_header('Content-Disposition', 'inline')
        msg.attach(m)

    for i, a in enumerate(alerts):
        attach(a['before'], f'before_{i}')
        attach(a['after'], f'after_{i}')
        attach(a['diff'], f'diff_{i}')

    with smtplib.SMTP_SSL('smtp.gmail.com', 465) as smtp:
        smtp.login(os.environ['SMTP_USER'], os.environ['SMTP_PASSWORD'])
        smtp.sendmail(msg['From'], [msg['To']], msg.as_string())

Cron Setup

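# Check every 15 minutes. cron does not load your shell profile, so set
# SCREENSHOT_API_KEY, ALERT_FROM, ALERT_TO, SMTP_USER and SMTP_PASSWORD in the
# crontab (or a wrapper script), and create the logs/ directory first.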
*/15 * * * * cd /home/user/site-monitor && python3 monitor.py >> logs/monitor.log 2>&1

What It Has Caught

Running this on my own site for two months:

Blank signup page (the event that prompted this build): caught at the 15-minute check. Root cause: a CDN cache purge that left the JS bundle 404ing. Status code was 200. Uptime monitor: green. Visual monitor: alert on the next run.
CSS regression on mobile: a deploy broke the nav menu on screens under 768px wide. I had no mobile checks before. Added a 390px viewport check after that incident.
Third-party script timeout: a chat widget was timing out, causing a 6-second blank stall before the page loaded. The screenshot with delay=1000 caught the stall; the screenshot with delay=7000 showed the resolved page. The delta revealed the timing issue.
Content accidentally deleted: a CMS draft accidentally published and overwrote the pricing page with placeholder text. Caught at the next 15-minute check.
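The mobile regression above suggests checking each page at more than one viewport. One way to do that, sketched here under assumptions, is to expand every page entry into one check per viewport. The viewports key, the expand_viewports helper, and the 844px mobile height are hypothetical; the configuration shown earlier has no such key.

```python
DEFAULT_VIEWPORT = {'name': 'desktop', 'width': 1280, 'height': 900}

def expand_viewports(pages):
    """Yield one check per (page, viewport) pair.

    Each check gets a distinct name so that baselines for different
    viewports are stored and compared separately.
    """
    for page in pages:
        for vp in page.get('viewports', [DEFAULT_VIEWPORT]):
            yield {
                'name': f"{page['name']}@{vp['name']}",
                'url': page['url'],
                'width': vp['width'],
                'height': vp['height'],
                'threshold_pct': page.get('threshold_pct', 3.0),
            }

pages = [
    {'name': 'homepage', 'url': 'https://yoursite.com', 'threshold_pct': 3.0,
     'viewports': [DEFAULT_VIEWPORT,
                   {'name': 'mobile', 'width': 390, 'height': 844}]},
    {'name': 'pricing', 'url': 'https://yoursite.com/pricing', 'threshold_pct': 5.0},
]

checks = list(expand_viewports(pages))
for check in checks:
    print(check['name'], check['width'])
```

Each expanded entry could then be passed to capture with its own width and height, and stored under its own baseline name.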

Three of these four would have been invisible to a traditional uptime monitor. All four were embarrassing. None of them reached a user without me knowing first.

The Feedback Loop

The monitor has made me a better deployer. When every deploy is followed by a visual check fifteen minutes later, you get honest feedback about whether the deploy looked the way you intended. After a while you stop deploying at 17:00 on a Friday.

That might be the most useful thing it's done.

Get Your API Key

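Free API key at hermesforge.dev/screenshot. A four-page monitor checking every 15 minutes makes 384 full-page API calls per day (4 pages × 96 runs), well within the free tier. Each check_elements region adds one more capture per run, so the example configuration above, with two regions, makes 576 calls per day.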
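For budgeting, the daily call count can be estimated from the configuration. This sketch assumes the monitor makes one API call per full-page capture and one more per check_elements region on every run, as the run_monitor loop above does. captures_per_day is a hypothetical helper, not part of the monitor.

```python
def captures_per_day(pages, interval_minutes=15):
    """Screenshot API calls per day: one per page plus one per region, each run."""
    runs_per_day = 24 * 60 // interval_minutes
    per_run = sum(1 + len(p.get('check_elements', [])) for p in pages)
    return runs_per_day * per_run

# Region counts taken from the example configuration earlier in the post
pages = [
    {'name': 'homepage', 'check_elements': [{'description': 'main CTA button visible'}]},
    {'name': 'signup', 'check_elements': [{'description': 'signup form'}]},
    {'name': 'pricing'},
    {'name': 'api-docs'},
]

full_page_only = captures_per_day([{'name': p['name']} for p in pages])
with_regions = captures_per_day(pages)
print(full_page_only, with_regions)
```

Four pages alone come to 384 calls per day; the two regions in the example configuration bring the total to 576.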