# Web Scraping With vs Without JavaScript Execution: When to Use Each
There are two fundamentally different ways to extract content from a web page. Which you choose has more impact on your scraper's reliability than any other decision.
## The Two Models
- **HTTP scraping:** Make an HTTP request, parse the response body. Fast, lightweight, works with `requests` + BeautifulSoup.
- **Rendered scraping:** Launch a browser, load the page, wait for JavaScript to execute, then read the DOM. Slower, heavier, but captures what users actually see.
The distinction matters because modern web applications are split between these two content models:
| Content Model | How It Works | HTTP Scraping | Rendered Scraping |
|---|---|---|---|
| Server-rendered HTML | Server sends complete HTML | Works | Works |
| Client-side rendering (React, Vue) | JS builds the DOM after load | Fails | Works |
| Lazy-loaded content | Images/data load on scroll | Fails | Works (with scroll simulation) |
| API-driven pages | XHR/fetch calls populate content | Fails | Works |
| Static pages | HTML in the response | Works | Works (overkill) |
## HTTP Scraping: When It Works
For server-rendered pages, HTTP scraping is the right choice. It's an order of magnitude faster and uses a fraction of the resources.
```python
import requests
from bs4 import BeautifulSoup

def scrape_static(url: str) -> dict:
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content, "html.parser")
    return {
        "title": soup.find("title").get_text(strip=True) if soup.find("title") else None,
        "h1": soup.find("h1").get_text(strip=True) if soup.find("h1") else None,
        "meta_description": (
            soup.find("meta", attrs={"name": "description"}) or {}
        ).get("content"),
    }
```
**How to tell if a page is server-rendered:** view source in your browser (Cmd+U / Ctrl+U). If you see the content in the HTML, HTTP scraping works. If you see a nearly empty `<div id="root">`, it's client-rendered.
## Rendered Scraping: When You Need It
For React/Vue/Angular apps, single-page applications, and any page that loads content via JavaScript, you need to execute the JavaScript. That means a real browser.
```python
from playwright.sync_api import sync_playwright

def scrape_rendered(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        title = page.title()
        h1 = page.query_selector("h1")
        h1_text = h1.inner_text() if h1 else None
        browser.close()
        return {"title": title, "h1": h1_text}
```
This is dramatically slower — typically 3–10 seconds per page vs under 1 second for HTTP scraping. But it works on content that HTTP scraping can't touch.
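The content-model table above listed lazy-loaded pages as needing scroll simulation. Here is a minimal sketch of that technique, assuming a page that appends items as you scroll; the `.item` selector, scroll count, and wait time are placeholders you'd tune per site:

```python
from playwright.sync_api import sync_playwright

def scrape_lazy_loaded(url: str, max_scrolls: int = 10) -> list:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        for _ in range(max_scrolls):
            # Scroll to the bottom to trigger lazy loading, then give
            # the page a moment to fetch and render the new items.
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1000)
        # ".item" is a placeholder selector for the lazily loaded elements.
        items = [el.inner_text() for el in page.query_selector_all(".item")]
        browser.close()
        return items
```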
## The Third Option: Screenshot + Vision
For some use cases, you don't need structured data at all. You need to understand what the page looks like.
```python
import requests
import anthropic
import base64

def analyze_page_visually(url: str, question: str, api_key: str) -> str:
    # Capture screenshot via API (no local browser needed)
    img_resp = requests.get(
        "https://hermesforge.dev/api/screenshot",
        params={"url": url, "width": 1280},
        headers={"Authorization": f"Bearer {api_key}"},
    )
    img_resp.raise_for_status()
    img_b64 = base64.standard_b64encode(img_resp.content).decode()

    # Ask a vision model about it
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": img_b64,
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return message.content[0].text
```
Screenshot + vision is useful for:

- **Accessibility audits:** "Does this page have sufficient color contrast?"
- **Layout checks:** "Is the call-to-action button above the fold?"
- **Competitor analysis:** "What is this company's pricing structure?"
- **Visual QA:** "Does this page look broken on 1280px width?"
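For example, a hypothetical call for the layout check above, reusing `analyze_page_visually()` from the previous section (the URL and the `HERMESFORGE_API_KEY` environment variable name are placeholders):

```python
import os

answer = analyze_page_visually(
    "https://example.com/pricing",  # placeholder URL
    "Is the call-to-action button above the fold?",
    api_key=os.environ["HERMESFORGE_API_KEY"],  # assumed env var name
)
print(answer)
```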
Unlike DOM-based scrapers, vision models handle arbitrary visual complexity — charts, tables, sidebars, mixed layouts. They degrade gracefully on novel page structures rather than throwing selector errors.
## Detecting Which Model a Page Uses
Before writing a scraper, do a quick probe:
```python
import requests

def detect_rendering_model(url: str) -> str:
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    content = resp.text

    # Signs of client-side rendering
    csr_signals = [
        'id="root"', 'id="app"', 'id="__next"',
        "react", "vue", "angular", "__NEXT_DATA__",
        "bundle.js", "main.js", "app.js",
    ]
    # Signs of server-rendered content
    ssr_signals = ["<h1", "<article", "<main", "<section", "<p class="]

    csr_count = sum(1 for s in csr_signals if s.lower() in content.lower())
    # Count total occurrences, not distinct signals: with only five SSR
    # signals, a presence check could never exceed the threshold of 5 below.
    ssr_count = sum(content.count(s) for s in ssr_signals)

    if csr_count > 3 and ssr_count < 5:
        return "likely client-rendered — use Playwright or screenshot API"
    elif ssr_count > 5:
        return "likely server-rendered — HTTP scraping should work"
    else:
        return "unclear — test both"

print(detect_rendering_model("https://twitter.com"))
# likely client-rendered — use Playwright or screenshot API
print(detect_rendering_model("https://news.ycombinator.com"))
# likely server-rendered — HTTP scraping should work
```
## Performance Comparison
For a batch of 100 pages:
| Method | Time | Memory | CPU |
|---|---|---|---|
| HTTP + BeautifulSoup | ~30s | ~50MB | Low |
| Playwright (local) | ~8 min | ~500MB | High |
| Screenshot API | ~4 min | ~20MB | Minimal |
The screenshot API is slower than HTTP scraping but much lighter than running Playwright locally — no browser installation, no memory overhead, no flaky CDP connections.
## When to Switch Methods
Start with HTTP scraping. If you get empty or malformed content, check whether the page is client-rendered. If it is, take one of three routes:
- **Find the underlying API:** Open browser devtools → Network → XHR/Fetch. The page is probably calling a JSON API you can hit directly — faster than any browser approach (see the sketch after this list).
- **Use Playwright:** When you need DOM access on a rendered page.
- **Use a screenshot API:** When you need visual content, don't want local browser overhead, or are running in an environment without browser support (Lambda functions, GitHub Actions with memory constraints).
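A minimal sketch of the first route, assuming you've found the JSON endpoint in devtools; the URL, parameters, and response shape here are all hypothetical:

```python
import requests

# Hypothetical endpoint copied from devtools → Network → XHR/Fetch;
# real sites differ in URL, params, auth, and response shape.
API_URL = "https://example.com/api/v1/products"

def scrape_via_api(page_num: int) -> list:
    resp = requests.get(
        API_URL,
        params={"page": page_num, "per_page": 50},
        headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
        timeout=10,
    )
    resp.raise_for_status()
    # No HTML parsing needed: the server returns structured JSON directly.
    return resp.json()["items"]
```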
The right tool depends on what data you need and where your pipeline runs.
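To make that decision automatic, here is a minimal sketch that reuses `detect_rendering_model()`, `scrape_static()`, and `scrape_rendered()` from earlier — a starting heuristic, not a production router:

```python
def scrape(url: str) -> dict:
    # Probe first: one cheap HTTP request decides which path to take.
    verdict = detect_rendering_model(url)
    if verdict.startswith("likely server-rendered"):
        return scrape_static(url)
    # Anything unclear or client-rendered pays the browser cost.
    return scrape_rendered(url)
```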