Build a Web Page Archive with the Screenshot API
Web pages change and disappear. The Internet Archive captures some of them, but for pages you care about — competitor sites, portfolio pages, news articles, regulatory filings — you need your own archive. Our Screenshot API lets you build one with a few lines of code.
Simple Archive Script
Capture a page and save it with a timestamp:
#!/bin/bash
URL="https://example.com"
ARCHIVE_DIR="./archive"
TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
FILENAME="${ARCHIVE_DIR}/${TIMESTAMP}.webp"
mkdir -p "$ARCHIVE_DIR"
# -G + --data-urlencode safely encodes the target URL into the query string
curl -sG "https://hermesforge.dev/api/screenshot" \
  --data-urlencode "url=${URL}" \
  -d "full_page=true" -d "format=webp" \
  -o "${FILENAME}"
echo "Archived: ${FILENAME} ($(stat -f%z "${FILENAME}" 2>/dev/null || stat -c%s "${FILENAME}") bytes)"
Run this daily via cron to build a visual history of any page.
Python Archive System
A more complete solution with metadata, multiple URLs, and organized storage:
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse

import requests

API = "https://hermesforge.dev/api/screenshot"

class WebArchive:
    def __init__(self, base_dir="web-archive"):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(exist_ok=True)

    def capture(self, url, full_page=True):
        """Capture a page and save it with metadata."""
        timestamp = datetime.now(timezone.utc)
        date_str = timestamp.strftime("%Y-%m-%d")
        time_str = timestamp.strftime("%H%M%S")

        # Create directory structure: archive/domain/date/
        domain = urlparse(url).netloc.replace(":", "_")
        day_dir = self.base_dir / domain / date_str
        day_dir.mkdir(parents=True, exist_ok=True)

        # Capture screenshot
        params = {
            "url": url,
            "full_page": "true" if full_page else "false",
            "format": "webp",
            "width": 1280,
            "quality": 85,
        }
        resp = requests.get(API, params=params, timeout=60)
        if resp.status_code != 200:
            return {"status": "error", "code": resp.status_code, "url": url}

        # Save image
        img_path = day_dir / f"{time_str}.webp"
        img_path.write_bytes(resp.content)

        # Save metadata alongside the image
        meta = {
            "url": url,
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "size_bytes": len(resp.content),
            "format": "webp",
            "full_page": full_page,
            "file": str(img_path),
        }
        meta_path = day_dir / f"{time_str}.json"
        meta_path.write_text(json.dumps(meta, indent=2))
        return meta

    def capture_many(self, urls, full_page=True):
        """Archive multiple URLs, reporting success or failure per URL."""
        results = []
        for url in urls:
            result = self.capture(url, full_page)
            results.append(result)
            print(f"  {'OK' if 'size_bytes' in result else 'FAIL'}: {url}")
        return results

# Usage
archive = WebArchive()
urls = [
    "https://news.ycombinator.com",
    "https://lobste.rs",
    "https://example.com",
]
print(f"Archiving {len(urls)} pages...")
results = archive.capture_many(urls)
print(f"Done. {sum(1 for r in results if 'size_bytes' in r)}/{len(urls)} captured.")
This creates an organized directory structure:
web-archive/
  news.ycombinator.com/
    2026-05-15/
      143022.webp
      143022.json
    2026-05-16/
      143015.webp
      143015.json
  lobste.rs/
    2026-05-15/
      143025.webp
      143025.json
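Because the date directories and time-stamped filenames sort lexicographically in chronological order, finding the newest capture for a domain is a one-liner over sorted paths. A small helper, assuming the web-archive/domain/date/time layout shown above:

```python
from pathlib import Path

def latest_capture(base_dir, domain):
    """Return the newest .webp capture for a domain, or None if absent.

    Relies on the archive's naming scheme (YYYY-MM-DD directories,
    HHMMSS filenames), so lexicographic order equals chronological order.
    """
    domain_dir = Path(base_dir) / domain
    captures = sorted(domain_dir.glob("*/*.webp"))
    return captures[-1] if captures else None
```

The same trick (sorting paths instead of parsing timestamps) is what the change-detection example below leans on as well.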
Daily Cron Archive
Add to your crontab (crontab -e):
# Archive important pages daily at 2:00 AM
0 2 * * * /usr/bin/python3 /path/to/archive_script.py >> /var/log/web-archive.log 2>&1
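The archive_script.py that cron invokes can be a thin wrapper around the WebArchive class. A sketch, assuming you saved the class as webarchive.py and keep your URL list in a plain-text urls.txt (both names are illustrative):

```python
#!/usr/bin/env python3
"""Daily archive job for cron: capture every URL listed in urls.txt."""
from pathlib import Path

def read_urls(path="urls.txt"):
    """Return non-empty, non-comment lines from the URL list file."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]

if __name__ == "__main__":
    # WebArchive is the class from the previous section, saved as webarchive.py
    from webarchive import WebArchive
    WebArchive(base_dir="/var/lib/web-archive").capture_many(read_urls())
```

Keeping the URL list in a file means adding a page to the archive is an edit to urls.txt, not a code change.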
Change Detection
Compare today's capture with yesterday's to detect visual changes:
def has_changed(archive, url):
    """Check if a page looks different from its previous capture."""
    from urllib.parse import urlparse
    domain = urlparse(url).netloc.replace(":", "_")
    domain_dir = archive.base_dir / domain
    if not domain_dir.exists():
        return True  # No previous capture
    # Find the two most recent captures (paths sort chronologically)
    captures = sorted(domain_dir.glob("**/*.webp"))
    if len(captures) < 2:
        return True
    # Compare file sizes as a rough change detector
    size_current = captures[-1].stat().st_size
    size_previous = captures[-2].stat().st_size
    # More than 10% size difference suggests a visual change
    ratio = abs(size_current - size_previous) / max(size_previous, 1)
    return ratio > 0.10
For pixel-level comparison, use tools like pixelmatch or ImageMagick on the captured images.
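As a sketch of what the pixel-level route looks like with Pillow (assuming the Pillow package is installed; pixelmatch and ImageMagick's compare command follow the same idea of counting differing pixels):

```python
from PIL import Image, ImageChops

def pixel_diff_ratio(path_a, path_b):
    """Fraction of pixels that differ between two captures (0.0 to 1.0).

    Images are resized to a common size first, because full-page
    captures of the same URL can vary in height between runs.
    """
    a = Image.open(path_a).convert("RGB")
    b = Image.open(path_b).convert("RGB")
    if a.size != b.size:
        b = b.resize(a.size)
    diff = ImageChops.difference(a, b)
    # getbbox() is None when the two images are pixel-identical
    if diff.getbbox() is None:
        return 0.0
    changed = sum(1 for px in diff.getdata() if px != (0, 0, 0))
    return changed / (a.size[0] * a.size[1])
```

A threshold around 0.01 to 0.05 on this ratio is a reasonable starting point; it catches layout changes while ignoring a rotated ad or a changed timestamp in the footer less well than perceptual-hash approaches do.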
Archive with Multiple Viewports
Capture how a page looks across devices:
def archive_responsive(archive, url):
    """Capture a page at multiple viewport sizes."""
    viewports = ["mobile", "tablet", "desktop"]
    for vp in viewports:
        params = {
            "url": url,
            "viewport": vp,
            "full_page": "true",
            "format": "webp",
            "quality": 85,
        }
        resp = requests.get(API, params=params, timeout=60)
        if resp.status_code == 200:
            timestamp = datetime.utcnow().strftime("%Y-%m-%d_%H%M%S")
            path = archive.base_dir / f"{timestamp}_{vp}.webp"
            path.write_bytes(resp.content)
Storage Considerations
| Captures/day | Format | ~Size/capture | Daily storage | Yearly |
|---|---|---|---|---|
| 10 pages | WebP | 150 KB | 1.5 MB | 547 MB |
| 50 pages | WebP | 150 KB | 7.5 MB | 2.7 GB |
| 100 pages | PNG | 400 KB | 40 MB | 14.6 GB |
WebP format saves ~50% compared to PNG. Use quality=70 to reduce size further with minimal visual loss.
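At those volumes a retention policy keeps storage bounded. A sketch that deletes date directories older than a cutoff, assuming the domain/date layout from earlier (run it from the same daily cron job):

```python
import shutil
from datetime import date, timedelta
from pathlib import Path

def prune_archive(base_dir, keep_days=90):
    """Delete date directories older than keep_days across all domains."""
    cutoff = (date.today() - timedelta(days=keep_days)).isoformat()
    removed = []
    for day_dir in Path(base_dir).glob("*/????-??-??"):
        # Directory names are ISO dates, so plain string comparison works
        if day_dir.name < cutoff:
            shutil.rmtree(day_dir)
            removed.append(day_dir)
    return removed
```

Returning the removed paths makes it easy to log exactly what each nightly sweep deleted.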
Getting an API Key
Anonymous: 10 captures/day. With a free API key: 100/day.
curl "https://hermesforge.dev/api/keys?email=you@example.com"
100 captures/day covers daily archiving of up to 100 pages — sufficient for most monitoring needs.
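Once you have a key, it needs to be attached to each capture request. A sketch of a small parameter builder that reads the key from the environment; the api_key parameter name here is an assumption, so verify the actual parameter or header name in the API documentation:

```python
import os

API = "https://hermesforge.dev/api/screenshot"

def build_params(url, api_key=None, **extra):
    """Assemble screenshot query parameters, attaching the key when present."""
    params = {"url": url, "format": "webp", **extra}
    if api_key:
        params["api_key"] = api_key  # assumed parameter name; verify in the API docs
    return params

# Usage (with requests):
#   requests.get(API, params=build_params(url, os.environ.get("HERMESFORGE_API_KEY")))
```

Reading the key from an environment variable keeps it out of scripts you might commit to version control.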
Related
- Monitor Website Changes with Screenshots — Visual change detection
- Batch Screenshots for Monitoring — Process multiple URLs
- Screenshot API Python Tutorial — Getting started with Python
- API Documentation — Full parameter reference