Build a Web Page Archive with the Screenshot API

2026-04-24 | Tags: [screenshot, archiving, automation, python, monitoring]

Web pages change and disappear. The Internet Archive captures some of them, but for pages you care about — competitor sites, portfolio pages, news articles, regulatory filings — you need your own archive. Our Screenshot API lets you build one with a few lines of code.

Simple Archive Script

Capture a page and save it with a timestamp:

#!/bin/bash
URL="https://example.com"
ARCHIVE_DIR="./archive"
TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
FILENAME="${ARCHIVE_DIR}/${TIMESTAMP}.webp"

mkdir -p "$ARCHIVE_DIR"
# -G with --data-urlencode keeps URLs containing query strings intact
curl -sG "https://hermesforge.dev/api/screenshot" \
  --data-urlencode "url=${URL}" \
  -d "full_page=true" -d "format=webp" \
  -o "$FILENAME"

echo "Archived: ${FILENAME} ($(stat -f%z "$FILENAME" 2>/dev/null || stat -c%s "$FILENAME") bytes)"

Run this daily via cron to build a visual history of any page.
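A daily capture accumulates forever, so pair it with a retention sweep. Here is a minimal sketch in Python (to match the archiver below); `prune_archive` is our own helper and the 90-day window is an arbitrary choice:

```python
import time
from pathlib import Path

def prune_archive(archive_dir="./archive", retention_days=90):
    """Delete .webp/.json captures older than retention_days, by mtime."""
    cutoff = time.time() - retention_days * 86400
    for p in Path(archive_dir).glob("**/*"):
        if p.is_file() and p.suffix in {".webp", ".json"} and p.stat().st_mtime < cutoff:
            p.unlink()
```

Run it from the same cron job, after the capture step.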

Python Archive System

A more complete solution with metadata, multiple URLs, and organized storage:

import requests
import json
from pathlib import Path
from datetime import datetime

API = "https://hermesforge.dev/api/screenshot"

class WebArchive:
    def __init__(self, base_dir="web-archive"):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def capture(self, url, full_page=True):
        """Capture a page and save with metadata."""
        timestamp = datetime.utcnow()
        date_str = timestamp.strftime("%Y-%m-%d")
        time_str = timestamp.strftime("%H%M%S")

        # Create directory structure: archive/domain/date/
        from urllib.parse import urlparse
        domain = urlparse(url).netloc.replace(":", "_")
        day_dir = self.base_dir / domain / date_str
        day_dir.mkdir(parents=True, exist_ok=True)

        # Capture screenshot
        params = {
            "url": url,
            "full_page": "true" if full_page else "false",
            "format": "webp",
            "width": 1280,
            "quality": 85,
        }

        resp = requests.get(API, params=params, timeout=60)

        if resp.status_code != 200:
            return {"status": "error", "code": resp.status_code, "url": url}

        # Save image
        img_path = day_dir / f"{time_str}.webp"
        img_path.write_bytes(resp.content)

        # Save metadata
        meta = {
            "url": url,
            "timestamp": timestamp.isoformat() + "Z",
            "size_bytes": len(resp.content),
            "format": "webp",
            "full_page": full_page,
            "file": str(img_path),
        }

        meta_path = day_dir / f"{time_str}.json"
        meta_path.write_text(json.dumps(meta, indent=2))

        return meta

    def capture_many(self, urls, full_page=True):
        """Archive multiple URLs."""
        results = []
        for url in urls:
            result = self.capture(url, full_page)
            results.append(result)
            print(f"  {'OK' if 'size_bytes' in result else 'FAIL'}: {url}")
        return results


# Usage
archive = WebArchive()

urls = [
    "https://news.ycombinator.com",
    "https://lobste.rs",
    "https://example.com",
]

print(f"Archiving {len(urls)} pages...")
results = archive.capture_many(urls)
print(f"Done. {sum(1 for r in results if 'size_bytes' in r)}/{len(urls)} captured.")

This creates an organized directory structure:

web-archive/
  news.ycombinator.com/
    2026-05-15/
      143022.webp
      143022.json
    2026-05-16/
      143015.webp
      143015.json
  lobste.rs/
    2026-05-15/
      143025.webp
      143025.json
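Because the date directories and time-based filenames sort chronologically as strings, pulling the newest capture for a domain back out of this layout is a one-line glob. A sketch (`latest_capture` is our own helper, not part of the API):

```python
from pathlib import Path

def latest_capture(base_dir, domain):
    """Return the newest .webp under base_dir/domain/, or None if empty."""
    captures = sorted(Path(base_dir, domain).glob("*/*.webp"))
    return captures[-1] if captures else None
```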

Daily Cron Archive

Add to your crontab (crontab -e):

# Archive important pages daily at 2:00 AM
0 2 * * * /usr/bin/python3 /path/to/archive_script.py >> /var/log/web-archive.log 2>&1
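When the job runs unattended, transient failures (timeouts, 5xx responses) matter more. A small retry helper with exponential backoff smooths them over; `with_retries` is an illustrative wrapper, not part of the API, and the backoff constants are arbitrary:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on exception, retry with exponential backoff.

    Returns fn()'s result, or re-raises the last error after
    `attempts` failures. Waits base_delay * 2**n seconds between tries.
    """
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** n))
```

For example: `with_retries(lambda: archive.capture(url))`.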

Change Detection

Compare today's capture with yesterday's to detect visual changes:

def has_changed(archive, url):
    """Check if a page looks different from yesterday."""
    from urllib.parse import urlparse
    domain = urlparse(url).netloc.replace(":", "_")
    domain_dir = archive.base_dir / domain

    if not domain_dir.exists():
        return True  # No previous capture

    # Find the two most recent captures
    captures = sorted(domain_dir.glob("**/*.webp"))
    if len(captures) < 2:
        return True

    # Compare file sizes as a rough change detector
    size_current = captures[-1].stat().st_size
    size_previous = captures[-2].stat().st_size

    # More than 10% size difference suggests a visual change
    ratio = abs(size_current - size_previous) / max(size_previous, 1)
    return ratio > 0.10

For pixel-level comparison, use tools like pixelmatch or ImageMagick on the captured images.
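As one concrete option, here is a Pillow-based sketch (assumes Pillow is installed; `pixel_diff_fraction` is our own helper) that reports the fraction of pixels that changed between two captures:

```python
from PIL import Image, ImageChops

def pixel_diff_fraction(path_a, path_b):
    """Fraction of pixels that differ between two same-size images."""
    a = Image.open(path_a).convert("RGB")
    b = Image.open(path_b).convert("RGB")
    if a.size != b.size:
        return 1.0  # different dimensions: treat as fully changed
    # Per-pixel absolute difference, flattened to grayscale;
    # histogram bucket 0 counts pixels that are identical.
    diff = ImageChops.difference(a, b).convert("L")
    unchanged = diff.histogram()[0]
    total = a.size[0] * a.size[1]
    return 1.0 - unchanged / total
```

A threshold like `pixel_diff_fraction(prev, curr) > 0.01` is a reasonable starting point for "the page visibly changed".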

Archive with Multiple Viewports

Capture how a page looks across devices:

def archive_responsive(archive, url):
    """Capture a page at multiple viewport sizes."""
    viewports = ["mobile", "tablet", "desktop"]

    for vp in viewports:
        params = {
            "url": url,
            "viewport": vp,
            "full_page": "true",
            "format": "webp",
            "quality": 85,
        }

        resp = requests.get(API, params=params, timeout=60)
        if resp.status_code == 200:
            timestamp = datetime.utcnow().strftime("%Y-%m-%d_%H%M%S")
            path = archive.base_dir / f"{timestamp}_{vp}.webp"
            path.write_bytes(resp.content)

Storage Considerations

Captures/day   Format   ~Size/capture   Daily storage   Yearly
10 pages       WebP     150 KB          1.5 MB          547 MB
50 pages       WebP     150 KB          7.5 MB          2.7 GB
100 pages      PNG      400 KB          40 MB           14.6 GB

At these settings WebP saves roughly 60% versus PNG (150 KB vs. 400 KB per capture in the table above). Use quality=70 to reduce size further with minimal visual loss.
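The figures in the table are straight multiplication in decimal units (1 MB = 1000 KB). A throwaway helper makes the arithmetic explicit:

```python
def yearly_storage_mb(pages_per_day, kb_per_capture, days=365):
    """Estimated archive growth in decimal MB, one capture per page per day."""
    return pages_per_day * kb_per_capture * days / 1000

# 10 pages/day at ~150 KB each: 10 * 150 * 365 / 1000 = 547.5 MB/year
```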

Getting an API Key

Anonymous: 10 captures/day. With a free API key: 100/day.

curl "https://hermesforge.dev/api/keys?email=you@example.com"

100 captures/day covers daily archiving of up to 100 pages — sufficient for most monitoring needs.