AI for Web Scraping and Extraction
Use LLMs to extract structured data from dynamic websites, JavaScript-rendered pages, and schema-less sources where CSS selectors break. Learn when to use AI vs traditional scraping, how to handle anti-bot measures, and the legal landscape.
Quick answer
For schema-less extraction from dynamic pages, the best stack is Playwright or Puppeteer for rendering + an LLM (Claude Sonnet 4 or GPT-4o) for extraction with a structured output schema. For high-volume, high-frequency scraping, use Firecrawl or Apify to handle rendering and rate limiting, then pipe HTML into a cheaper model (Haiku, GPT-4o-mini) for extraction. Cost runs $2-10 per 1,000 pages extracted; CSS-selector scrapers remain 50-100x cheaper where applicable.
The problem
Traditional CSS-selector-based scrapers break every time a site redesigns its DOM — the average e-commerce or news site makes breaking layout changes every 4-8 weeks, requiring constant maintenance. JavaScript-rendered SPAs and infinite-scroll feeds are inaccessible to simple HTTP scrapers. Data teams maintaining 50+ scraper pipelines report spending 30-40% of their time on breakage repairs rather than analysis. Meanwhile, unstructured page content (reviews, articles, product descriptions) requires manual labeling to convert into usable structured data.
Core workflows
LLM-Powered Schema-less Extraction
Send raw HTML (or rendered text) to an LLM with a target JSON schema. The model infers which page elements map to which fields — this works even when the DOM structure changes or lacks semantic class names. No selector maintenance required.
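A minimal sketch of the pattern: a prompt builder that pairs cleaned page text with a target schema, and a parser that tolerates models wrapping their JSON in a Markdown fence. The field names in `PRODUCT_SCHEMA` and the commented-out Anthropic call are illustrative — adapt them to your pages and provider.

```python
import json

PRODUCT_SCHEMA = {  # example target fields; adapt to your pages
    "name": "string",
    "price": "number",
    "in_stock": "boolean",
}

def build_extraction_prompt(page_text: str, schema: dict) -> str:
    """Wrap cleaned page text and a target schema into a single prompt
    that asks the model to return JSON only."""
    return (
        "Extract the following fields from the page content below.\n"
        f"Return ONLY a JSON object matching this schema: {json.dumps(schema)}\n"
        "Use null for any field not present on the page.\n\n"
        f"PAGE CONTENT:\n{page_text}"
    )

def parse_extraction(raw: str) -> dict:
    """Parse the model reply, tolerating a ```json ... ``` fence."""
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.strip("`").strip()
        raw = raw.removeprefix("json").strip()
    return json.loads(raw)

# The API call itself (requires the `anthropic` package and an API key):
# client = anthropic.Anthropic()
# msg = client.messages.create(
#     model="claude-3-5-haiku-latest", max_tokens=1024,
#     messages=[{"role": "user",
#                "content": build_extraction_prompt(text, PRODUCT_SCHEMA)}],
# )
# record = parse_extraction(msg.content[0].text)
```

Because the prompt carries the schema, the same pipeline works across sites with different layouts — only the schema changes.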
JavaScript-Rendered Page Scraping
Use Playwright to fully render SPAs, infinite-scroll feeds, and login-gated pages before passing content to an extraction model. Captures data inaccessible to HTTP-only scrapers. Handles React, Vue, Angular sites.
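A sketch of the infinite-scroll loop, assuming Playwright is installed (`pip install playwright` plus `playwright install chromium`). The stop condition — page height no longer growing — is factored into a small pure helper; the 1,500 ms wait and 30-scroll cap are illustrative defaults.

```python
def heights_converged(heights: list[int], patience: int = 2) -> bool:
    """Stop scrolling once the page height has stopped growing for
    `patience` consecutive scrolls."""
    if len(heights) < patience + 1:
        return False
    return len(set(heights[-(patience + 1):])) == 1

def render_infinite_scroll(url: str, max_scrolls: int = 30) -> str:
    """Scroll a feed to the bottom repeatedly, then return the fully
    rendered HTML for downstream extraction."""
    # Imported here so heights_converged stays usable without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        heights: list[int] = []
        for _ in range(max_scrolls):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1500)  # let lazy-loaded content arrive
            heights.append(page.evaluate("document.body.scrollHeight"))
            if heights_converged(heights):
                break
        html = page.content()
        browser.close()
        return html
```

The `max_scrolls` cap matters: some feeds never stop growing, and without it the loop runs until the browser exhausts memory.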
Change Detection and Schema Evolution
Monitor scraped output for structural drift — when field values suddenly become null or change type, trigger a re-extraction with the LLM using the new page layout. Reduces pipeline downtime from layout changes by 80%.
Competitive Price Intelligence
Scrape competitor product pages, extract price, availability, and promotional data into a normalized schema for comparison. LLMs handle varied pricing displays (bundles, per-unit, subscription) that rule-based extractors miss.
Review and Sentiment Aggregation
Extract product reviews, ratings, pros/cons lists, and sentiment from retailer pages and review sites. LLM simultaneously extracts structure and performs sentiment classification in a single pass.
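A sketch of what a combined extraction-plus-classification schema can look like — the model fills structural fields and the `sentiment` label in the same response, so no second classification pass is needed. Field names here are illustrative:

```json
{
  "reviews": [
    {
      "rating": "number (1-5)",
      "title": "string",
      "body": "string",
      "pros": ["string"],
      "cons": ["string"],
      "sentiment": "one of: positive | neutral | negative",
      "verified_purchase": "boolean or null"
    }
  ]
}
```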
Top tools
- firecrawl
- apify
- zyte
- playwright
- brightdata
- scrapingbee
Top models
- claude-haiku-3-5
- gpt-4o-mini
- claude-sonnet-4
- gemini-2-0-flash
FAQs
When should I use an LLM for scraping vs traditional CSS selectors?
Use CSS selectors when: the site has stable, semantically named HTML classes; you're scraping at very high volume (>1M pages/day) where LLM costs are prohibitive; the data is numeric and precisely structured (stock prices, sports scores). Use LLMs when: the site changes layout frequently; you need to extract from unstructured prose (review summaries, article bodies, product descriptions); you're scraping multiple different sites with different schemas and need a single extraction pipeline; or you need simultaneous extraction + classification/sentiment. The hybrid approach — CSS selectors for structured fields, LLM for unstructured fields — often gives the best cost/accuracy balance.
How do I handle anti-bot detection and rate limiting?
Anti-bot systems detect scrapers through several signals: request rate patterns, missing browser fingerprints (user-agent, accept headers, TLS fingerprint), no JavaScript execution, and behavioral anomalies (no mouse movement, instant form completion). For moderate scraping: rotate user-agents, add random delays (2-10 seconds between requests), and use a residential proxy service (Brightdata, Oxylabs). For aggressive protection (Cloudflare, Akamai Bot Manager): use a headless browser service (Zyte SmartProxy, Apify's Playwright actors) that handles fingerprinting automatically. Never scrape at a rate that could degrade site performance — start at 1 req/sec per domain and adjust based on response codes.
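The pacing rules above can be sketched as two small helpers: a header picker that rotates user-agents, and a delay function that jitters the base interval on success and backs off exponentially on 429/503. The UA strings are truncated examples — use full, current ones in practice — and the 120 s cap is an assumption.

```python
import random

USER_AGENTS = [  # rotate a pool of realistic desktop UAs (truncated examples)
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def pick_headers() -> dict:
    """Vary the browser fingerprint slightly between requests."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
    }

def next_delay(status: int, base: float, attempt: int) -> float:
    """Polite pacing: jittered delay on success, capped exponential
    backoff when the server signals overload (429/503)."""
    if status in (429, 503):
        return min(base * (2 ** attempt), 120.0)
    return random.uniform(base, base * 5)  # e.g. 2-10 s for base=2
```

Note that rotating headers alone will not defeat TLS fingerprinting or behavioral analysis — that is what the managed browser services handle.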
Is AI-powered web scraping legal?
Web scraping legality is nuanced and jurisdiction-dependent. Key principles: (1) Publicly accessible data (no login required) is generally scrapeable under US law — the Ninth Circuit's hiQ v. LinkedIn rulings held that scraping public profiles does not violate the CFAA, though hiQ ultimately lost on breach-of-contract grounds, so this is not a blanket license. (2) Always check robots.txt and honor Disallow rules — violating them is an ethical (and potentially legal) issue even if not always enforced. (3) Scraping personal data of EU residents triggers GDPR compliance obligations regardless of where you're hosted. (4) Terms of Service violations may result in account bans or civil claims but are rarely criminally prosecuted. (5) Never scrape behind a login wall without authorization, and never use scraped data in ways that compete directly with the source's core product. Consult legal counsel before scraping for commercial use at scale.
How much of the page HTML should I send to the LLM?
Raw HTML is token-expensive and noisy. Pre-process before sending: strip <script>, <style>, <nav>, <footer>, <head> tags; remove HTML attributes except id, class, and data-* attributes that carry semantic meaning; convert to Markdown or plain text using a library like html2text or markdownify. A typical e-commerce product page goes from 150,000 characters of raw HTML to 8,000-15,000 characters of cleaned text — a 10-15x token reduction. For very long pages (articles, reports), chunk by logical section and run parallel extractions, then merge results. Only send the full raw HTML when the structure itself (table layout, nesting) is necessary for extraction accuracy.
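For the stripping step, a standard-library sketch works as a stand-in for html2text or markdownify: an `HTMLParser` subclass that drops everything inside boilerplate tags and keeps only readable text. The `SKIP` set mirrors the tags listed above; extend it for your sites.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer", "head", "noscript"}

class TextCleaner(HTMLParser):
    """Strip boilerplate tags and all markup, keeping readable text."""
    def __init__(self):
        super().__init__()
        self.depth = 0              # nesting depth inside skipped tags
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str) -> str:
    parser = TextCleaner()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

On pages with heavy inline scripts and navigation chrome, this alone accounts for most of the 10-15x reduction quoted above.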
How do I handle pagination, infinite scroll, and multi-page datasets?
For traditional pagination: identify the next-page link pattern and recurse through pages until no next-page link is found. For infinite scroll: use Playwright to scroll to the bottom of the page in a loop, waiting for new content to load after each scroll, with a maximum iteration limit. For APIs behind dynamic pages: use browser DevTools to inspect network requests while browsing — most 'scraped' sites are actually loading data from a JSON API that you can call directly, bypassing HTML entirely. Rate limit your pagination: scraping 1,000 pages sequentially at 1 req/sec takes 17 minutes; parallelize across 10 workers to reduce to under 2 minutes while staying within polite limits.
Can I use web scraping to build training data for my own models?
Technically yes, but with important caveats. Several AI companies have faced copyright litigation over training data sourced from the web (The New York Times v. OpenAI, Getty Images v. Stability AI). For safe training data collection: prefer Creative Commons or explicitly open-licensed content; use Common Crawl datasets (already scraped, free, and covered by web archive precedent); avoid scraping content behind paywalls; do not reproduce or memorize verbatim copyrighted text in your model outputs. For extraction tasks (scraping structured data FROM pages rather than using prose AS training data), copyright risk is significantly lower — extracting facts (prices, names, dates) is generally not copyrightable.