Screenshot API for Legal and Compliance Documentation

2026-05-17 | Tags: [tutorial, screenshot-api, compliance, gdpr, legal, python]

Legal and compliance teams routinely need visual evidence of what users saw at a specific moment: the cookie consent banner they accepted, the terms of service version presented at signup, the privacy policy text in effect on a given date. Manually capturing and archiving these screenshots is tedious and error-prone. Automated screenshot pipelines make compliance documentation systematic.

This is a different use case from monitoring or testing. The requirements shift: timestamps become legally significant, storage must be immutable or append-only, and the captured content must accurately represent what a real user would have seen.

What Compliance Documentation Requires

Before building the pipeline, understand what makes a screenshot legally useful:

Timestamp integrity: The capture timestamp must be accurate and tamper-evident. Storing the timestamp in the filename and in a hash-signed manifest is better than trusting the file's mtime.

Accurate rendering: The screenshot must represent what a real user actually saw, not a developer's sanitized view. This means not blocking cookie banners, not injecting custom CSS, and not stripping third-party scripts — you want the messy reality.

Reproducibility: You should be able to explain exactly how the screenshot was captured: browser viewport, user agent, any active sessions, date and time. This metadata belongs in the capture record.

Immutability: Once captured, the screenshot and its metadata should not be modifiable. At minimum: write to an append-only directory, compute a content hash at capture time, and store the hash separately from the file.

GDPR and similar regulations require evidence that users were presented with a valid consent mechanism before cookies were set. A screenshot of the consent banner at first visit, with a timestamp, is baseline documentation:

import requests
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

SCREENSHOT_API_KEY = "your-api-key"
SCREENSHOT_API_URL = "https://hermesforge.dev/api/screenshot"
COMPLIANCE_DIR = "./compliance_evidence"


def capture_consent_state(
    url: str,
    label: str,
    notes: str = "",
) -> dict:
    """
    Capture a URL as it appears to a first-time visitor (no cookies).
    Suitable for documenting consent banners and cookie notices.

    IMPORTANT: Do NOT use block_ads=true or inject_css — we want the
    full unmodified experience that a real user would see.
    """
    captured_at = datetime.now(timezone.utc)
    timestamp_str = captured_at.strftime("%Y%m%d_%H%M%SZ")

    response = requests.get(
        SCREENSHOT_API_URL,
        params={
            "url": url,
            "format": "png",
            "width": 1280,
            "height": 800,         # Viewport height — show above-the-fold content
            "full_page": "false",  # Above-fold only: what user sees without scrolling
            "wait": "networkidle", # Wait for consent scripts to load
        },
        headers={"X-API-Key": SCREENSHOT_API_KEY},
        timeout=45,
    )

    if response.status_code != 200:
        return {"error": f"Capture failed: HTTP {response.status_code}", "url": url}

    # Compute content hash
    content_hash = hashlib.sha256(response.content).hexdigest()

    # Build evidence record
    record = {
        "url": url,
        "label": label,
        "notes": notes,
        "captured_at_utc": captured_at.isoformat(),
        "timestamp_unix": int(captured_at.timestamp()),
        "content_hash_sha256": content_hash,
        "viewport": {"width": 1280, "height": 800},
        "capture_params": {
            "full_page": False,
            "wait": "networkidle",
            "block_ads": False,
            "inject_css": None,
        },
    }

    # Save screenshot
    output_dir = Path(COMPLIANCE_DIR) / label.replace(" ", "_").lower()
    output_dir.mkdir(parents=True, exist_ok=True)

    screenshot_filename = f"{timestamp_str}_{content_hash[:12]}.png"
    screenshot_path = output_dir / screenshot_filename
    screenshot_path.write_bytes(response.content)

    record["screenshot_path"] = str(screenshot_path)
    record["screenshot_filename"] = screenshot_filename

    # Save evidence record as JSON alongside screenshot
    record_path = output_dir / f"{timestamp_str}_{content_hash[:12]}.json"
    record_path.write_text(json.dumps(record, indent=2))

    return record

Terms of Service Version Archiving

When your terms of service change, you need a point-in-time record of each version. This is different from consent capture — here you want the full-page document, not the above-fold view:

def archive_tos_version(
    tos_url: str,
    version_label: str,
    effective_date: str,
) -> dict:
    """
    Archive a full-page screenshot of a Terms of Service document.
    Call this each time TOS content is updated.

    version_label: human-readable version identifier (e.g., "v2.3", "2026-Q1")
    effective_date: ISO date string when this version took effect
    """
    captured_at = datetime.now(timezone.utc)

    response = requests.get(
        SCREENSHOT_API_URL,
        params={
            "url": tos_url,
            "format": "png",
            "width": 1280,
            "full_page": "true",   # Capture entire document
            "wait": "networkidle",
        },
        headers={"X-API-Key": SCREENSHOT_API_KEY},
        timeout=60,              # TOS pages can be long; allow extra time
    )

    if response.status_code != 200:
        return {"error": f"Capture failed: HTTP {response.status_code}"}

    content_hash = hashlib.sha256(response.content).hexdigest()
    timestamp_str = captured_at.strftime("%Y%m%d_%H%M%SZ")

    record = {
        "document_type": "terms_of_service",
        "url": tos_url,
        "version": version_label,
        "effective_date": effective_date,
        "captured_at_utc": captured_at.isoformat(),
        "content_hash_sha256": content_hash,
        "full_page": True,
    }

    output_dir = Path(COMPLIANCE_DIR) / "tos_archive"
    output_dir.mkdir(parents=True, exist_ok=True)

    filename = f"{version_label.replace(' ', '_')}_{timestamp_str}"
    (output_dir / f"{filename}.png").write_bytes(response.content)
    (output_dir / f"{filename}.json").write_text(json.dumps(record, indent=2))

    record["screenshot_path"] = str(output_dir / f"{filename}.png")
    return record

Regulatory Disclosure Monitoring

Some industries require disclosures to remain prominently visible on specific pages (financial services, healthcare, pharma). Screenshot-based monitoring verifies these disclosures are present and correctly rendered:

REQUIRED_DISCLOSURES = [
    {
        "url": "https://your-app.com/invest",
        "label": "Investment Risk Warning",
        "required_text_indicator": "Capital at risk",  # Text that must appear (for log records)
        "schedule": "daily",
    },
    {
        "url": "https://your-app.com/health-claims",
        "label": "Medical Disclaimer",
        "required_text_indicator": "not medical advice",
        "schedule": "daily",
    },
]


def verify_disclosure_present(disclosure_config: dict) -> dict:
    """
    Capture a disclosure page and record that it was verified.
    The screenshot is the evidence; text detection is supplementary.
    """
    record = capture_consent_state(
        url=disclosure_config["url"],
        label=disclosure_config["label"],
        notes=f"Scheduled disclosure verification. Required indicator: '{disclosure_config.get('required_text_indicator', 'N/A')}'",
    )
    record["verification_type"] = "regulatory_disclosure"
    record["schedule"] = disclosure_config.get("schedule", "manual")
    return record


def run_daily_disclosure_verification() -> list[dict]:
    """Run verification for all daily-scheduled disclosures."""
    results = []
    for disclosure in REQUIRED_DISCLOSURES:
        if disclosure.get("schedule") == "daily":
            print(f"Verifying: {disclosure['label']}...")
            result = verify_disclosure_present(disclosure)
            results.append(result)
            status = "ok" if "error" not in result else "FAILED"
            print(f"  {status}: {result.get('screenshot_path', result.get('error'))}")
    return results

Building a Tamper-Evident Manifest

For evidence that may face legal scrutiny, a simple file hash stored in the filename isn't enough. A signed manifest that chains hashes provides stronger integrity guarantees:

import hmac
import os


def create_manifest_entry(record: dict, manifest_path: str, signing_key: bytes) -> str:
    """
    Add a record to a signed manifest file.
    Each entry includes the previous entry's hash, creating a chain.

    signing_key: secret bytes used to HMAC-sign each entry.
    Store the signing key separately from the manifest (e.g., in a secrets manager).
    """
    manifest_file = Path(manifest_path)

    # Read existing manifest to get previous hash
    previous_hash = "genesis"
    if manifest_file.exists():
        existing_content = manifest_file.read_text()
        if existing_content.strip():
            last_line = existing_content.strip().split("\n")[-1]
            try:
                last_entry = json.loads(last_line)
                previous_hash = last_entry.get("entry_hash", "genesis")
            except json.JSONDecodeError:
                previous_hash = hashlib.sha256(existing_content.encode()).hexdigest()

    # Create manifest entry
    entry = {
        "sequence": sum(1 for _ in manifest_file.open()) if manifest_file.exists() else 0,
        "previous_hash": previous_hash,
        "record": record,
        "manifest_timestamp": datetime.now(timezone.utc).isoformat(),
    }

    # Sign the entry
    entry_json = json.dumps(entry, sort_keys=True)
    entry_hash = hmac.new(signing_key, entry_json.encode(), hashlib.sha256).hexdigest()
    entry["entry_hash"] = entry_hash
    entry["hmac_sha256"] = hmac.new(signing_key, entry_json.encode(), hashlib.sha256).hexdigest()

    # Append to manifest (one JSON object per line for streaming verification)
    with open(manifest_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

    return entry_hash


def verify_manifest(manifest_path: str, signing_key: bytes) -> dict:
    """
    Verify the integrity of a compliance manifest.
    Returns verification result with any detected tampering.
    """
    entries = []
    with open(manifest_path) as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                return {"valid": False, "error": f"Parse failure at line {line_num}"}

    errors = []
    previous_hash = "genesis"

    for i, entry in enumerate(entries):
        # Verify hash chain
        if entry.get("previous_hash") != previous_hash:
            errors.append(f"Chain break at entry {i}: expected previous_hash={previous_hash}, got {entry.get('previous_hash')}")

        previous_hash = entry.get("entry_hash", "")

    return {
        "valid": len(errors) == 0,
        "entry_count": len(entries),
        "errors": errors,
        "first_entry": entries[0]["manifest_timestamp"] if entries else None,
        "last_entry": entries[-1]["manifest_timestamp"] if entries else None,
    }

Scheduled Compliance Runs

def run_compliance_suite(manifest_path: str, signing_key: bytes):
    """
    Full compliance documentation run.
    Suitable for daily cron execution.
    """
    print(f"Compliance run starting: {datetime.now(timezone.utc).isoformat()}")

    # 1. Consent flow documentation (homepage first-visit state)
    consent_pages = [
        ("https://your-app.com", "homepage_consent"),
        ("https://your-app.com/signup", "signup_consent"),
    ]
    for url, label in consent_pages:
        record = capture_consent_state(url, label)
        if "error" not in record:
            create_manifest_entry(record, manifest_path, signing_key)
            print(f"  Consent captured: {label}")

    # 2. Regulatory disclosures
    disclosure_results = run_daily_disclosure_verification()
    for result in disclosure_results:
        if "error" not in result:
            create_manifest_entry(result, manifest_path, signing_key)

    # 3. Verify manifest integrity after writes
    verification = verify_manifest(manifest_path, signing_key)
    print(f"  Manifest: {verification['entry_count']} entries, valid={verification['valid']}")
    if not verification["valid"]:
        print(f"  INTEGRITY ERRORS: {verification['errors']}")

    print("Compliance run complete.")


# Cron: run daily at 00:05Z
# 5 0 * * * cd /path/to/compliance && python3 compliance_capture.py >> logs/compliance.log 2>&1

Storage and Retention

Compliance evidence has different storage requirements from monitoring screenshots:

Aspect	Monitoring	Compliance
Retention	30-90 days	3-7 years (jurisdiction-dependent)
Mutability	Overwrite OK	Append-only; no deletion
Backup	Optional	Required; offsite copy
Access control	Development team	Legal team + audit trail
File format	Any	PNG preferred (lossless, widely supported)

For long-term storage, after local capture write immediately to an immutable object store (AWS S3 with Object Lock, Azure Blob with immutability policy, or Google Cloud Storage with retention locks).

Rate limit planning: Compliance documentation runs are typically small — 5-15 pages per daily run. At the Starter tier (200/day), you have substantial headroom for daily compliance runs plus ongoing monitoring. The Pro tier (1000/day) supports larger compliance suites with multiple daily verification passes.

hermesforge.dev — screenshot API. Free: 50/day (free key). Starter: $4/30 days (200/day). Pro: $9 (1000/day). Business: $29 (5000/day).