Automated Accessibility Auditing with Screenshots and LLM Vision
Our accessibility audit came back with 847 issues from axe-core. We fixed all of them. Then we did a manual review with an actual screen reader user. She found 23 more problems in the first 10 minutes — none of which axe-core had flagged.
The issues she found were things like: a button labeled "Click here" that made no sense without the surrounding visual context. An image of a chart with alt text that said "chart" — technically present, but describing nothing. A form where the error messages were red text near the fields, but nothing that a screen reader would read aloud when the user tabbed to the invalid field. A modal that trapped keyboard focus in a way that the automated tool had no way to detect.
Automated accessibility tools are good at finding rule violations — missing alt attributes, insufficient color contrast ratios, unlabeled form elements. They're not good at finding meaning failures — cases where the technical requirement is met but the user experience is still broken.
LLM vision occupies a different niche. It can look at a page and describe what a screen reader user would encounter. It can assess whether an alt tag is actually useful. It can identify navigation patterns that would be confusing to non-sighted users. It's not a replacement for automated tools or human testing — it's a third layer that catches what the first two miss.
Here's how I built it.
The Three-Layer Model
Layer 1: axe-core / Lighthouse
└── Machine-checkable rules (WCAG 2.1 success criteria)
- Missing alt attributes
- Color contrast ratios
- ARIA role violations
- Missing form labels
Fast, deterministic, good at: "is this attribute present?"
Layer 2: LLM Vision Audit
└── Context-dependent assessment
- Alt text quality (not just presence)
- Heading hierarchy logic
- Link text meaningfulness
- Visual grouping vs announced grouping
- Error state comprehension
Fast, probabilistic, good at: "does this make sense?"
Layer 3: Human Testing
└── Real assistive technology with real users
- Screen reader interaction flows
- Voice control navigation
- Cognitive load assessment
Slow, expensive, essential for sign-off
The LLM vision layer is Layer 2 — more thorough than automated tools, faster than human testing, catches the meaning failures that Layer 1 misses.
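Layer 1 is off-the-shelf tooling and not the focus of this post, but for completeness, here is a minimal sketch of driving it from Python: running Lighthouse's accessibility category via its CLI and pulling the score out of the JSON report. It assumes Lighthouse is installed (`npm i -g lighthouse`); the report path is arbitrary.

```python
import json
import subprocess

def lighthouse_a11y_score(url, report_path='lighthouse.json'):
    """Run Lighthouse's accessibility audit and return its 0-1 score."""
    subprocess.run(
        ['lighthouse', url,
         '--only-categories=accessibility',
         '--output=json',
         f'--output-path={report_path}',
         '--chrome-flags=--headless'],
        check=True,
    )
    with open(report_path) as f:
        return parse_a11y_score(json.load(f))

def parse_a11y_score(report):
    """Extract the accessibility score from a loaded Lighthouse report."""
    return report['categories']['accessibility']['score']
```

A score below 1.0 means Layer 1 still has findings to clear before the vision audit adds much signal.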
What LLM Vision Can Assess
Before building, it's worth being explicit about what the model can and can't evaluate from a screenshot.
Can assess from a screenshot:
- Alt text quality (the screenshot shows the image; the model can judge if the alt describes it)
- Heading hierarchy and logical structure
- Link text meaningfulness in context
- Form label associations (visual proximity)
- Error message visibility and proximity to the relevant field
- Focus indicator visibility (if the screenshot captures a focused state)
- Color contrast (approximately — not pixel-precise)
- Content grouping and visual organization
- Interactive element recognition (does this look like a button/link/control?)

Cannot assess from a screenshot:
- Keyboard navigation order
- Screen reader announcement sequences
- ARIA attribute correctness
- Dynamic state changes
- Focus trap behavior
- Motion/animation (static screenshot)
The screenshot-based audit fills the middle ground between rules checking and full interaction testing.
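Since Layer 1 and Layer 2 findings end up in the same backlog, it helps to normalize both into one shape. A sketch: the axe-core fields (`violations`, `id`, `impact`, `help`) come from its standard results object, and the LLM issue fields match the JSON schema the audit prompt requests (`description`, `wcag_criterion`, `dimension`).

```python
def normalize_axe(axe_results):
    """Map axe-core violations into a shared finding shape."""
    return [{
        'source': 'axe-core',
        'severity': v.get('impact') or 'minor',  # axe uses minor/moderate/serious/critical
        'description': v.get('help') or v.get('description', ''),
        'rule': v.get('id'),
    } for v in axe_results.get('violations', [])]

def normalize_llm(audit_result):
    """Map the vision audit's issues into the same shape."""
    return [{
        'source': 'llm-vision',
        'description': issue['description'],
        'rule': issue.get('wcag_criterion', 'N/A'),
        'dimension': issue.get('dimension'),
    } for issue in audit_result.get('issues', [])]
```

Concatenating the two lists gives a single deduplicable backlog with each finding tagged by its source layer.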
The Audit Agent
import os
import io
import base64
import json

import requests
from PIL import Image
from openai import OpenAI

client = OpenAI()
API_KEY = os.environ['SCREENSHOT_API_KEY']
BASE_URL = 'https://hermesforge.dev/api/screenshot'

def capture_page(url, width=1280, height=900):
    """Capture a screenshot of the page and return it as a PIL image."""
    resp = requests.get(
        BASE_URL,
        params={
            'url': url,
            'width': width,
            'height': height,
            'format': 'png',
            'delay': 1500,  # wait for client-side rendering to settle
        },
        headers={'X-API-Key': API_KEY},
        timeout=60,
    )
    resp.raise_for_status()
    return Image.open(io.BytesIO(resp.content)).convert('RGB')

def image_to_base64(img):
    """Encode a PIL image as base64 PNG for the vision API."""
    buf = io.BytesIO()
    img.save(buf, format='PNG')
    return base64.b64encode(buf.getvalue()).decode('utf-8')
AUDIT_DIMENSIONS = [
    {
        'name': 'alt_text_quality',
        'question': (
            'Look at every image visible on this page. For each one:\n'
            '1. What does the image actually show?\n'
            '2. If the alt text were read aloud by a screen reader, would it convey the meaning?\n'
            '3. Are there decorative images that should have empty alt text but might not?\n'
            'Flag any images where the alt text appears absent, generic ("image", "photo"), or misleading.'
        ),
    },
    {
        'name': 'heading_structure',
        'question': (
            'Look at the visual heading hierarchy of this page.\n'
            '1. Is there a clear logical structure (main heading → subheadings → sub-subheadings)?\n'
            '2. Do the headings describe the content that follows them?\n'
            '3. Are there sections of content that appear visually distinct but have no heading?\n'
            '4. Are there any headings that are purely decorative (styled text, not semantic)?\n'
            'Flag structural issues that would disorient a screen reader user navigating by headings.'
        ),
    },
    {
        'name': 'link_text',
        'question': (
            'Look at all visible links on this page.\n'
            '1. Are there any links with generic text like "click here", "read more", "learn more"?\n'
            '2. For links that are just icons, would there be an accessible label?\n'
            '3. Are there multiple links with the same text that go to different destinations?\n'
            '4. Are there links where the surrounding context is required to understand the destination?\n'
            'Flag links that would be ambiguous or confusing in a screen reader link list.'
        ),
    },
    {
        'name': 'form_labels',
        'question': (
            'Look at any form elements visible on this page.\n'
            '1. Does each input field have a visible label?\n'
            '2. Are error messages visually associated with the fields they refer to?\n'
            '3. Are required fields indicated in a way that isn\'t only color-dependent?\n'
            '4. Is there any placeholder text being used as a substitute for a label?\n'
            'Flag form patterns that would be confusing to non-sighted users.'
        ),
    },
    {
        'name': 'interactive_elements',
        'question': (
            'Look at all interactive elements visible on this page (buttons, links, controls).\n'
            '1. Do they look interactive — do they have sufficient size and visual affordance?\n'
            '2. Are there any elements that appear clickable but look like body text?\n'
            '3. Are there icon buttons with no visible label?\n'
            '4. Are there any toggle controls where the state (on/off) is only shown by color?\n'
            'Flag interactive elements that might be missed or misunderstood by users with visual or cognitive disabilities.'
        ),
    },
    {
        'name': 'color_and_contrast',
        'question': (
            'Assess the color usage on this page from a visual accessibility perspective.\n'
            '1. Is any information conveyed ONLY through color (not shape, text, or pattern)?\n'
            '2. Do text elements appear to have sufficient contrast against their backgrounds?\n'
            '3. Are there any low-contrast UI states that might be hard to perceive?\n'
            '4. Are there any color combinations that might be particularly difficult for colorblind users?\n'
            'Flag color-reliant patterns and apparent low-contrast issues.'
        ),
    },
]
def audit_dimension(img_b64, dimension):
    """Audit one accessibility dimension and return structured findings."""
    resp = client.chat.completions.create(
        model='gpt-4o',
        messages=[{
            'role': 'user',
            'content': [
                {
                    'type': 'text',
                    'text': (
                        f'You are an accessibility auditor reviewing a screenshot for WCAG 2.1 compliance issues.\n\n'
                        f'Dimension to assess: {dimension["name"]}\n\n'
                        f'{dimension["question"]}\n\n'
                        f'Return a JSON object with:\n'
                        f'{{\n'
                        f'  "dimension": "{dimension["name"]}",\n'
                        f'  "severity": "pass|minor|moderate|serious|critical",\n'
                        f'  "issues": [\n'
                        f'    {{\n'
                        f'      "description": "clear description of the issue",\n'
                        f'      "wcag_criterion": "e.g. 1.1.1 Non-text Content",\n'
                        f'      "impact": "how this affects users with disabilities",\n'
                        f'      "recommendation": "specific fix"\n'
                        f'    }}\n'
                        f'  ],\n'
                        f'  "positive_notes": ["things done well in this dimension"]\n'
                        f'}}'
                    )
                },
                {
                    'type': 'image_url',
                    'image_url': {
                        'url': f'data:image/png;base64,{img_b64}',
                        'detail': 'high',
                    }
                },
            ],
        }],
        response_format={'type': 'json_object'},
        max_tokens=600,
    )
    return json.loads(resp.choices[0].message.content)
def run_accessibility_audit(url, page_name=None):
    """Run a full visual accessibility audit against a URL."""
    print(f'Auditing: {url}')
    img = capture_page(url)
    img_b64 = image_to_base64(img)

    results = []
    for dimension in AUDIT_DIMENSIONS:
        print(f'  Checking: {dimension["name"]}')
        result = audit_dimension(img_b64, dimension)
        results.append(result)

    # Compile summary
    severity_order = {'critical': 4, 'serious': 3, 'moderate': 2, 'minor': 1, 'pass': 0}
    all_issues = []
    for r in results:
        for issue in r.get('issues', []):
            issue['dimension'] = r['dimension']
            issue['severity_score'] = severity_order.get(r['severity'], 0)
            all_issues.append(issue)
    all_issues.sort(key=lambda x: x['severity_score'], reverse=True)
    overall_severity = max((r['severity'] for r in results), key=lambda s: severity_order.get(s, 0))

    return {
        'url': url,
        'page_name': page_name or url,
        'overall_severity': overall_severity,
        'dimension_results': results,
        'issues': all_issues,
        'issue_count': len(all_issues),
    }
Running Across a Site
PAGES_TO_AUDIT = [
    ('Homepage', 'https://yoursite.com'),
    ('Login', 'https://yoursite.com/login'),
    ('Sign Up', 'https://yoursite.com/signup'),
    ('Dashboard', 'https://yoursite.com/app/dashboard'),
    ('Settings', 'https://yoursite.com/app/settings'),
]

audit_results = []
for page_name, url in PAGES_TO_AUDIT:
    result = run_accessibility_audit(url, page_name)
    audit_results.append(result)
    print(f'  {result["overall_severity"].upper()} — {result["issue_count"]} issues found')
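The loop above is sequential, but each page audit is independent, so the pages parallelize trivially. A sketch using a thread pool — `audit_fn` is a stand-in parameter (you would pass `run_accessibility_audit`), kept injectable so the orchestration is testable on its own:

```python
from concurrent.futures import ThreadPoolExecutor

def audit_pages(pages, audit_fn, max_workers=4):
    """Audit (name, url) pairs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(audit_fn, url, name) for name, url in pages]
        return [f.result() for f in futures]
```

Usage would be `audit_results = audit_pages(PAGES_TO_AUDIT, run_accessibility_audit)`. Keep `max_workers` modest — the screenshot API and the OpenAI API both rate-limit.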
Generating the Report
SEVERITY_EMOJI = {
    'critical': '🔴',
    'serious': '🟠',
    'moderate': '🟡',
    'minor': '🔵',
    'pass': '✅',
}

def generate_audit_report(results):
    lines = ['# Accessibility Audit Report\n']

    # Summary table
    lines.append('## Summary\n')
    lines.append('| Page | Severity | Issues |')
    lines.append('|------|----------|--------|')
    for r in results:
        emoji = SEVERITY_EMOJI.get(r['overall_severity'], '')
        lines.append(f'| {r["page_name"]} | {emoji} {r["overall_severity"]} | {r["issue_count"]} |')

    # Detailed findings per page
    for r in results:
        lines.append(f'\n## {r["page_name"]}\n')
        lines.append(f'**URL**: {r["url"]}')
        lines.append(f'**Overall**: {SEVERITY_EMOJI.get(r["overall_severity"], "")} {r["overall_severity"]}\n')
        if r['issues']:
            lines.append('### Issues Found\n')
            for issue in r['issues']:
                # Each issue inherits the severity of the dimension it came from
                emoji = SEVERITY_EMOJI.get(
                    next((dim['severity'] for dim in r['dimension_results']
                          if dim['dimension'] == issue['dimension']), 'minor'),
                    '',
                )
                lines.append(f'**{emoji} {issue["description"]}**')
                lines.append(f'- **WCAG**: {issue.get("wcag_criterion", "N/A")}')
                lines.append(f'- **Impact**: {issue.get("impact", "N/A")}')
                lines.append(f'- **Fix**: {issue.get("recommendation", "N/A")}')
                lines.append('')
        else:
            lines.append('*No issues found in this audit.*\n')

    return '\n'.join(lines)

report = generate_audit_report(audit_results)
print(report)
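For triage, a tally of issues by severity across all audited pages is often more useful than the full report. A small helper over the result dicts that `run_accessibility_audit` produces:

```python
from collections import Counter

def severity_summary(results):
    """Count issues per dimension-level severity across all audited pages."""
    counts = Counter()
    for r in results:
        for dim in r['dimension_results']:
            if dim.get('issues'):
                counts[dim['severity']] += len(dim['issues'])
    return dict(counts)
```

Something like `{'serious': 3, 'moderate': 7, 'minor': 12}` tells you at a glance where the week's fix budget should go.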
What We Found on Our First Run
Running this against our own product (a SaaS dashboard), the LLM vision audit found issues that axe-core had missed:
- "View Details" links throughout a table — seven links with identical text, all going to different pages. axe-core passes them (the links have href, they're focusable). The vision audit flagged them as ambiguous in a screen reader link list.
- Icon buttons in the nav — three icon-only buttons with no visible text. axe-core would only catch this if aria-label is missing; it can't know whether the icon is self-explanatory. The vision audit flagged them as potentially unclear.
- Error state in the form — our error messages were displayed in red below the form, not next to the fields. axe-core saw no ARIA violations. The vision audit flagged the weak visual association between error and field.
- A progress stepper — represented only as colored circles (grey → blue → green). The vision audit flagged it as conveying state through color alone, which fails colorblind users and gives screen reader users nothing to announce.
None of these would fail automated WCAG checking. All of them degrade the experience for disabled users.
Cost and Practical Limits
Six dimensions × one LLM call each = six calls per page. At ~$0.004/call, that's $0.024 per page audit. For a 10-page site: $0.24. For a weekly full-site sweep of 50 pages: $1.20.
The practical limit is that some issues require dynamic interaction to detect — keyboard navigation, focus management, screen reader announcements. This audit catches the static presentation layer. For dynamic issues, you still need a headless browser running an accessibility tree walker, or a human tester.
Use this as a fast, cheap first pass before the expensive human review. Fix the obvious visual issues first; save the human review budget for the interaction-layer issues that only they can find.
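The arithmetic above is worth keeping as a one-liner next to the audit config — note the ~$0.004/call figure is this post's estimate, not a published price:

```python
def audit_cost(pages, dimensions=6, cost_per_call=0.004):
    """Estimated dollar cost of a full-site audit: one call per dimension per page."""
    return pages * dimensions * cost_per_call
```

Handy for sanity-checking before widening the page list or adding dimensions.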
Integrating into CI
# .github/workflows/a11y-audit.yml
name: Accessibility Audit

on:
  deployment_status:  # Run after every staging deploy

jobs:
  audit:
    if: github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install requests pillow openai
      - name: Run audit
        env:
          SCREENSHOT_API_KEY: ${{ secrets.SCREENSHOT_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEPLOY_URL: ${{ github.event.deployment_status.environment_url }}
        run: python accessibility_audit.py
      - name: Upload report
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: accessibility-report
          path: accessibility_report.md
The audit runs automatically after every staging deploy. Engineers see the report in the Actions artifacts before the PR is merged. Issues caught pre-merge are an order of magnitude cheaper to fix than issues found after release.
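To make the audit blocking rather than advisory, the script can exit nonzero when any page crosses a severity threshold. A sketch (the threshold name and rank mapping are choices, not requirements; the `if: always()` on the upload step ensures the report artifact survives a failed run):

```python
import sys

SEVERITY_RANK = {'pass': 0, 'minor': 1, 'moderate': 2, 'serious': 3, 'critical': 4}

def gate(results, fail_at='serious'):
    """Return True if every page's overall severity is below the failing threshold."""
    threshold = SEVERITY_RANK[fail_at]
    return all(SEVERITY_RANK.get(r['overall_severity'], 0) < threshold
               for r in results)
```

At the end of `accessibility_audit.py`, `sys.exit(0 if gate(audit_results) else 1)` turns the workflow red on serious or critical findings.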
The Broader Point
Accessibility is one of those domains where "automated passing" and "actually accessible" are meaningfully different things. The rules exist because they approximate good experiences — but the approximation has gaps, and the gaps are where users with disabilities fall through.
LLM vision doesn't close all those gaps. But it closes more of them than rule-checking alone, at a cost low enough to run on every deploy. The result is a third tier in the testing pyramid: faster than human testing, more meaningful than rules checking, positioned between the two where it can do the most good.
Get Started
Free API key at hermesforge.dev/screenshot.