How I Built a Visual Regression Testing System in a Weekend
Last weekend I got tired of shipping CSS changes that broke things I couldn't see in unit tests.
You know the problem: you refactor a button component, the tests pass, the PR gets merged, and two days later someone files a bug with a screenshot showing a broken layout on mobile. The logic was fine. The tests were fine. The pixels were wrong.
So I built a visual regression system. It took about a day and a half. Here's exactly what I did.
The Problem With the Standard Approach
The usual advice is: use Playwright or Puppeteer with toMatchSnapshot(). I've done this. The issues:
- Flakiness. Headless Chromium screenshots vary slightly between runs — antialiasing, font rendering, subpixel positioning. You end up with a 2-5px tolerance that masks real regressions.
- Infrastructure overhead. Getting Chromium to run reliably in CI is a project of its own. Different results on Mac vs Linux vs the CI container.
- Slow feedback. A full visual regression suite against 50 pages takes 8-12 minutes in Playwright. That's a long time to wait for a failing check.
I wanted something simpler: call an API, get an image, diff it. No browser to configure. No Chromium binaries to install. Just HTTP.
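The whole approach boils down to one HTTP GET per page. As a quick sketch of the request shape (using the endpoint from later in this post; `YOUR_KEY` is a placeholder — nothing is actually sent here):

```python
import requests

# Build (but don't send) the screenshot request, to show the shape of the call.
# Endpoint and parameter names match the capture script later in the post.
req = requests.Request(
    'GET',
    'https://hermesforge.dev/api/screenshot',
    params={'url': 'https://example.com', 'width': 1280, 'format': 'png'},
    headers={'X-API-Key': 'YOUR_KEY'},  # placeholder key
).prepare()
print(req.url)
```

Everything else in this post is bookkeeping around that one call.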
The Architecture
Three components:
- Baseline capture: screenshot every page in the sitemap, store as reference images
- Comparison capture: on each PR/deploy, screenshot the same pages on a preview URL
- Diff detection: compare images pixel-by-pixel, fail if diff exceeds threshold
PR opened
→ Deploy preview URL (Vercel/Netlify handles this)
→ Run comparison captures against preview URL
→ Diff each page against stored baseline
→ Comment on PR with diff report
→ Block merge if any page exceeds 1% pixel difference
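The final step of that flow is just a thresholding decision over per-page results. A minimal sketch of the gate (`merge_gate` and the exact dict shape are my naming; the real report format appears in Step 3):

```python
def merge_gate(results, threshold_pct=1.0):
    # A page fails if its screenshot is missing or its pixel diff exceeds the threshold.
    failures = [
        r['file'] for r in results
        if r.get('status') == 'missing' or r.get('change_pct', 0.0) > threshold_pct
    ]
    return {'blocked': bool(failures), 'failures': failures}

report = merge_gate([
    {'file': 'index.png', 'change_pct': 0.02},
    {'file': 'pricing.png', 'change_pct': 4.7},
    {'file': 'about.png', 'status': 'missing'},
])
print(report)  # {'blocked': True, 'failures': ['pricing.png', 'about.png']}
```

Treating a missing screenshot as a failure matters: a page that silently drops out of the comparison set is itself a regression.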
Step 1: Capture the Baseline
I wrote a script that reads the sitemap and captures every URL:
import requests
import xml.etree.ElementTree as ET
import os
import time
from pathlib import Path

API_KEY = os.environ['SCREENSHOT_API_KEY']
BASE_URL = 'https://hermesforge.dev/api/screenshot'
BASELINE_DIR = Path('visual-baselines')

def get_sitemap_urls(sitemap_url):
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    return [loc.text for loc in root.findall('sm:url/sm:loc', ns)]

def capture_page(url, width=1280, height=800):
    params = {
        'url': url,
        'width': width,
        'height': height,
        'format': 'png',
        'full_page': 'true',
        'delay': 500,  # ms to wait for JS to settle
    }
    resp = requests.get(
        BASE_URL,
        params=params,
        headers={'X-API-Key': API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content

def url_to_filename(url):
    # https://example.com/products/shoes -> products_shoes
    host_and_path = url.split('://', 1)[-1]
    path = host_and_path.split('/', 1)[1] if '/' in host_and_path else ''
    path = path.rstrip('/') or 'index'
    return path.replace('/', '_').replace('?', '_').replace('=', '_')[:100]

def capture_baseline(sitemap_url):
    BASELINE_DIR.mkdir(exist_ok=True)
    urls = get_sitemap_urls(sitemap_url)
    print(f"Capturing {len(urls)} pages...")
    for i, url in enumerate(urls):
        filename = url_to_filename(url) + '.png'
        path = BASELINE_DIR / filename
        if path.exists():
            print(f"  [{i+1}/{len(urls)}] SKIP {url} (baseline exists)")
            continue
        try:
            image = capture_page(url)
            path.write_bytes(image)
            print(f"  [{i+1}/{len(urls)}] OK {url} -> {filename} ({len(image)//1024}KB)")
        except Exception as e:
            print(f"  [{i+1}/{len(urls)}] FAIL {url}: {e}")
        time.sleep(0.5)  # gentle rate limiting

if __name__ == '__main__':
    import sys
    capture_baseline(sys.argv[1])
Run it once against production:
python capture_baseline.py https://example.com/sitemap.xml
This stores PNG files in visual-baselines/. Commit them to your repo (or store them in S3/GCS for large sites).
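The filename mapping is worth a sanity check before trusting it with your whole sitemap. Here's the same `url_to_filename` logic from the script above, with a few expected mappings:

```python
def url_to_filename(url):
    # Same logic as in the capture script: drop scheme+host, slug the path.
    host_and_path = url.split('://', 1)[-1]
    path = host_and_path.split('/', 1)[1] if '/' in host_and_path else ''
    path = path.rstrip('/') or 'index'
    return path.replace('/', '_').replace('?', '_').replace('=', '_')[:100]

print(url_to_filename('https://example.com/products/shoes'))  # products_shoes
print(url_to_filename('https://example.com/'))                # index
print(url_to_filename('https://example.com/search?q=shoes'))  # search_q_shoes
```

One caveat: two distinct URLs can collide after slugging and truncation (e.g. very long paths). For a 50-page site that never came up; for larger sites, hash the URL into the filename.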
Step 2: Capture for Comparison
The comparison script takes a second argument: the preview URL to swap in. It replaces the production domain with the preview domain for each URL:
def capture_comparison(sitemap_url, preview_base, output_dir):
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)
    prod_base = sitemap_url.rsplit('/sitemap.xml', 1)[0]
    urls = get_sitemap_urls(sitemap_url)
    for i, url in enumerate(urls):
        preview_url = url.replace(prod_base, preview_base, 1)
        filename = url_to_filename(url) + '.png'
        try:
            image = capture_page(preview_url)
            (output_dir / filename).write_bytes(image)
            print(f"  [{i+1}/{len(urls)}] OK {preview_url}")
        except Exception as e:
            print(f"  [{i+1}/{len(urls)}] FAIL {preview_url}: {e}")
        time.sleep(0.5)
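The only subtle part here is the domain swap. It's a plain string replace on the prefix derived from the sitemap URL (the preview domain below is hypothetical):

```python
sitemap_url = 'https://example.com/sitemap.xml'
preview_base = 'https://deploy-preview-42.example.dev'  # hypothetical preview domain

# Strip '/sitemap.xml' to recover the production base URL, then swap prefixes.
prod_base = sitemap_url.rsplit('/sitemap.xml', 1)[0]
page = 'https://example.com/products/shoes'
preview_url = page.replace(prod_base, preview_base, 1)
print(preview_url)  # https://deploy-preview-42.example.dev/products/shoes
```

Note the filename is still derived from the *production* URL, so baseline and comparison images line up by name.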
Step 3: Diff Detection
I used Pillow and numpy for pixel comparison:
from PIL import Image, ImageChops
import numpy as np
from pathlib import Path
import json

def diff_images(baseline_path, comparison_path, diff_path=None):
    baseline = Image.open(baseline_path).convert('RGB')
    comparison = Image.open(comparison_path).convert('RGB')
    # Resize to match if dimensions differ (e.g. content reflow)
    if baseline.size != comparison.size:
        comparison = comparison.resize(baseline.size, Image.LANCZOS)
    diff = ImageChops.difference(baseline, comparison)
    diff_array = np.array(diff)
    total_pixels = diff_array.shape[0] * diff_array.shape[1]
    changed_pixels = np.sum(np.any(diff_array > 10, axis=2))  # threshold: >10/255 per channel
    change_pct = changed_pixels / total_pixels * 100
    if diff_path and change_pct > 0:
        # Save a highlighted diff image; upcast before multiplying so uint8 doesn't wrap
        diff_enhanced = Image.fromarray(
            (diff_array.astype(np.uint16) * 5).clip(0, 255).astype('uint8')
        )
        diff_enhanced.save(diff_path)
    return {
        'changed_pixels': int(changed_pixels),
        'total_pixels': int(total_pixels),
        'change_pct': round(change_pct, 3),
    }

def run_diff_report(baseline_dir, comparison_dir, diff_dir, threshold_pct=1.0):
    baseline_dir = Path(baseline_dir)
    comparison_dir = Path(comparison_dir)
    diff_dir = Path(diff_dir)
    diff_dir.mkdir(exist_ok=True)
    results = []
    failures = []
    for baseline_file in sorted(baseline_dir.glob('*.png')):
        comparison_file = comparison_dir / baseline_file.name
        if not comparison_file.exists():
            results.append({'file': baseline_file.name, 'status': 'missing'})
            failures.append(baseline_file.name)
            continue
        diff_file = diff_dir / baseline_file.name
        diff = diff_images(baseline_file, comparison_file, diff_file)
        status = 'pass' if diff['change_pct'] <= threshold_pct else 'fail'
        results.append({
            'file': baseline_file.name,
            'status': status,
            **diff,
        })
        if status == 'fail':
            failures.append(baseline_file.name)
    report = {
        'total': len(results),
        'passed': sum(1 for r in results if r['status'] == 'pass'),
        'failed': len(failures),
        'failures': failures,
        'results': results,
    }
    Path('diff-report.json').write_text(json.dumps(report, indent=2))
    return report
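To check the math, here's the same per-channel threshold logic run on two synthetic images — changing a 10×10 patch in a 100×100 canvas should come out as exactly 1%. (`change_pct` here is a standalone reimplementation of the core of `diff_images`, not a function from the script above.)

```python
import numpy as np
from PIL import Image

def change_pct(img_a, img_b, channel_threshold=10):
    # A pixel counts as changed if any RGB channel differs by more than the threshold.
    a = np.asarray(img_a.convert('RGB'), dtype=np.int16)
    b = np.asarray(img_b.convert('RGB'), dtype=np.int16)
    changed = np.any(np.abs(a - b) > channel_threshold, axis=2)
    return changed.sum() / changed.size * 100

baseline = Image.new('RGB', (100, 100), 'white')
comparison = baseline.copy()
comparison.paste(Image.new('RGB', (10, 10), 'black'), (0, 0))  # simulate a broken component
print(change_pct(baseline, comparison))  # 1.0
```

The per-channel threshold of 10 is what absorbs antialiasing and font-rendering jitter without a browser-level tolerance setting.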
Step 4: GitHub Actions Integration
This all runs in CI on every PR:
# .github/workflows/visual-regression.yml
name: Visual Regression
on:
  pull_request:
    branches: [main]
jobs:
  visual-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          lfs: true  # if baselines are stored in git LFS
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install requests pillow numpy
      - name: Wait for preview deployment
        run: |
          # Wait for Vercel/Netlify to deploy the preview
          PREVIEW_URL="https://${{ github.event.pull_request.head.sha }}.preview.example.com"
          for i in {1..30}; do
            if curl -sf "$PREVIEW_URL" > /dev/null 2>&1; then
              echo "Preview ready: $PREVIEW_URL"
              echo "PREVIEW_URL=$PREVIEW_URL" >> $GITHUB_ENV
              break
            fi
            echo "Waiting for preview... ($i/30)"
            sleep 10
          done
      - name: Capture comparison screenshots
        env:
          SCREENSHOT_API_KEY: ${{ secrets.SCREENSHOT_API_KEY }}
        run: |
          python capture_comparison.py \
            https://example.com/sitemap.xml \
            ${{ env.PREVIEW_URL }} \
            comparison-screenshots
      - name: Run diff report
        run: |
          python run_diff.py \
            visual-baselines \
            comparison-screenshots \
            diff-images \
            --threshold 1.0
      - name: Upload diff images
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: visual-diffs
          path: diff-images/
      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('diff-report.json', 'utf8'));
            const icon = report.failed === 0 ? '✅' : '❌';
            const body = `## ${icon} Visual Regression Report\n\n` +
              `**${report.passed}/${report.total} pages passed** (threshold: 1%)\n\n` +
              (report.failures.length > 0 ?
                `**Failures:**\n${report.failures.map(f => `- \`${f}\``).join('\n')}` :
                'No visual regressions detected.');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });
      - name: Fail if regressions detected
        run: |
          # A heredoc avoids the quote-escaping mess of multi-line `python -c "..."`
          python - <<'EOF'
          import json, sys
          report = json.load(open('diff-report.json'))
          if report['failed'] > 0:
              print(f"FAILED: {report['failed']} pages with visual regressions")
              sys.exit(1)
          print(f"PASSED: all {report['total']} pages within threshold")
          EOF
The Result
After setting this up:
- 50 pages covered in the baseline
- ~3 minutes for a full comparison run (screenshot API handles the browser; I just wait for HTTP responses)
- Zero Chromium config in CI
- Caught 3 real regressions in the first two weeks: a z-index issue on mobile nav, a font-weight change that affected a CTA button, and a padding regression in the footer
The failure that drove me to build this happened again two weeks after I deployed it — someone changed a global CSS variable. The visual regression check caught it before merge. That felt good.
Variations
Multi-viewport testing: run the same comparison at 375px (mobile), 768px (tablet), and 1280px (desktop) width. Triple the captures, triple the coverage.
Per-component testing: instead of full-page screenshots, use the clip parameter to capture just a component region. Useful for UI libraries.
Scheduled baseline updates: update baselines automatically every Sunday night so you don't accumulate drift between intentional design changes and accidental regressions.
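For the multi-viewport variation, the only change to the capture loop is iterating over widths and suffixing the filenames. A sketch (the `VIEWPORTS` dict and the suffix scheme are my own naming; `capture_page` is the function from Step 1):

```python
VIEWPORTS = {'mobile': 375, 'tablet': 768, 'desktop': 1280}

def viewport_jobs(urls):
    # One capture job per (page, viewport); the capture loop would call
    # capture_page(url, width=job['width']) and append job['suffix'] to the filename.
    return [
        {'url': url, 'width': width, 'suffix': f'_{name}'}
        for url in urls
        for name, width in VIEWPORTS.items()
    ]

jobs = viewport_jobs(['https://example.com/pricing'])
print(len(jobs))  # 3
```

Baselines then need the same suffixes, so a 50-page site becomes 150 baseline images — plan storage (and API quota) accordingly.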
Get Your API Key
The screenshot API used in this guide is at hermesforge.dev/screenshot. Free tier available — the baseline capture for a 50-page site uses about 50 API calls.