Regex broke my scraper: Using LLMs for robust data extraction

I've been building scrapers for years. I know the drill: find the CSS selector, write a regex, test it, deploy, and hope the website doesn't change its markup next week. But last month, I hit a wall. I was tasked with extracting product prices and availability from over 200 supplier websites. Each site had its own layout; some were rendered with JavaScript, others were plain HTML.

My initial approach was the same old combo of BeautifulSoup, XPath, and regex. It worked... for about a week. Then one of the biggest suppliers rolled out a redesign. My carefully crafted selector .price--current vanished. The price was now inside a nested <span> with a dynamic class name like _3xj0a _2v9j3. Regex? Forget it. I spent two days patching it, only to have another site change the next week. I was fighting a losing battle.

First, I tried more sophisticated CSS selectors using :contains() and nth-of-type. That worked until the site added a banner ad before the price. I tried matching patterns like \$\d+\.\d{2}, but some prices were in EUR, some had discounts, and some were hidden in JavaScript-rendered content. I even used a headless browser (Playwright) to wait for elements, but that slowed things down and still broke when the DOM structure shifted. I looked into visual testing tools and machine learning approaches like object detection on screenshots, but they were too heavy and too unreliable for text extraction.

A colleague mentioned they used GPT to parse unstructured data from PDFs. I thought: why not try it on raw HTML? The idea was simple: instead of guessing selectors, send the relevant HTML snippet (or even just the text content) to an LLM with a clear instruction to extract structured fields.

I knew that sending entire pages would be expensive and slow, so I first stripped down the HTML using BeautifulSoup: remove scripts, styles, and navigation, then flatten the body into clean text while preserving some structural markers (like h1, table).
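A minimal sketch of that cleaning step, for illustration only. It swaps in the standard library's html.parser for BeautifulSoup so it runs without dependencies, and the names MarkedTextExtractor and html_to_marked_text, as well as the [h1] and [row] marker format, are assumptions made up for this example rather than anything from the original scraper.

```python
from html.parser import HTMLParser

# Tags whose content is dropped entirely (same list as the noise tags above).
NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}
# Tags that start and end a line of output.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "div", "li", "table", "tr"}
# Table cells only need a space between them.
CELL_TAGS = {"td", "th"}
# Hypothetical marker format: prefix kept in the flattened text.
MARKERS = {"h1": "[h1] ", "tr": "[row] "}


class MarkedTextExtractor(HTMLParser):
    """Flatten HTML to text lines, skipping noise and tagging h1/table rows."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0  # > 0 while inside a noise tag
        self._prefix = ""
        self._buf = []

    def _flush_line(self):
        # Collapse whitespace and emit the line only if it has real text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(self._prefix + text)
        self._prefix = ""
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1
        elif self._skip_depth:
            return
        elif tag in BLOCK_TAGS:
            self._flush_line()
            self._prefix = MARKERS.get(tag, "")
        elif tag in CELL_TAGS:
            self._buf.append(" ")

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush_line()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)


def html_to_marked_text(html: str, max_chars: int = 3000) -> str:
    """Return cleaned text with structural markers, truncated to max_chars."""
    parser = MarkedTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush_line()  # emit any trailing text
    return "\n".join(parser.lines)[:max_chars]
```

The markers give the model a hint about which line was the page heading and which lines came from a table, which a plain text dump loses. Nested block tags inside a table row would drop the row marker in this sketch, so treat it as a starting point.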
Then I'd prompt the LLM to return a JSON object with fields like name, price, availability, sku. Here's the core of what I built:

    import json

    import requests
    from bs4 import BeautifulSoup


    def extract_product_info(url, llm_api_endpoint, api_key):
        """Clean the page and send a targeted extraction prompt to the LLM."""
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # don't send error pages to the LLM
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove noise
        for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
            tag.decompose()

        # Fall back to the whole document if the page has no <body>
        body = soup.body or soup

        # Get visible text; newlines between text nodes are the only structural hint
        text = body.get_text(separator='\n', strip=True)

        # Truncate to avoid token limits (e.g., 3000 chars)
        text = text[:3000]

        prompt = f"""You are a data extraction assistant. Extract product information from the following web page text.
    Return a JSON object with these fields:
    - product_name
    - price (as a string with currency symbol if present)
    - availability (e.g., 'In stock', 'Out of stock')
    - sku (if found)
    If a field is missing, set it to null.

    Page text:
    {text}
    """

        headers = {
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json',
        }
        payload = {
            'model': 'gpt-4o-mini',  # cheaper, faster
            'messages': [{'role': 'user', 'content': prompt}],
            'temperature': 0.1,  # low for consistent output
            'response_format': {'type': 'json_object'},
        }

        # This could be any OpenAI-compatible endpoint; I used a custom one like https://ai.interwestinfo.com/v1
        r = requests.post(llm_api_endpoint, json=payload, headers=headers, timeout=60)
        r.raise_for_status()
        data = r.json()
        return json.loads(data['choices'][0]['message']['content'])


    # Example usage
    result = extract_product_info(
        'https://example-store.com/product/123',
        'https://api.openai.com/v1/chat/completions',
        'sk-your-key-here',
    )
    print(result)

I tested this on 10 different e-commerce pages. It worked on all of them, even on the site that had changed its classes.
The LLM understood context: it recognized "€39,99" as a price even without a dollar sign, and it knew that "ships in 3-5 days" meant availability was not "In stock".

Pros:

- Robust to layout changes: I haven't touched the code in weeks, even as sites updated.
- Adaptable: Want to extract rating or description? Just change the prompt.
- No more regex hell: The LLM handles currency formats, abbreviations, and missing data gracefully.

Cons:

- Cost: Each request costs a fraction of a cent, but at scale it adds up. For high-volume scraping (thousands of pages/hour), this isn't viable without caching or local models.
- Latency: LLM calls take 1-3 seconds. Compared to a quick regex match, that's slow.
- Hallucinations: Sometimes the LLM invents a SKU or misreads a discount as the main price. Always validate with a secondary rule (e.g., check that the price matches something like \d+[.,]\d{2}, so that both "39.99" and "39,99" pass).
- Prompt engineering: You need to tune the prompt to avoid over-extraction. I spent a day tweaking examples.

Looking back, there are a few things I'd do differently.

First, I'd pre-filter pages more aggressively. Some pages had thousands of lines; sending the whole thing wasted tokens. Once I had identified common patterns, I used BeautifulSoup to keep only the main content area.

Second, I'd add a validation step: after getting the JSON from the LLM, parse the price field with regex to ensure it looks like a valid number. If it doesn't, re-prompt or flag it for manual review.

I also explored using a small local model (like Phi-3) for the first pass, falling back to a cloud model only when confidence was low. This reduced cost by 80%.

Finally, I realized LLMs are terrible at counting rows in tables and at precise numeric extraction (like "5 items" vs "5,00"). For those, I still use regex on the snippet identified by the LLM.

This approach isn't always the right tool. If you're scraping a single site with a stable API or a clear sitemap, skip the LLM. If you need real-time data (sub-second), this is too slow. If your data is purely numerical and structured (like log files), regex is faster and cheaper.
And if you're on a tight budget, even gpt-4o-mini adds up; a local model might be better.

But for my use case, extracting semi-structured data from a chaotic set of web pages, this approach has been a lifesaver. I haven't touched my scraping code in three weeks, even though sites have changed several times. The LLM abstracts away the fragile DOM.

Now I'm curious: How do you handle website changes in your scrapers? Are you still wrestling with selectors, or have you tried a language model approach? What does your setup look like?
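As a footnote to the validation step described earlier in the post, here is a minimal sketch of what a secondary check on the LLM's JSON could look like. The function names parse_price and validate_record, the exact regex, and the rule that a SKU must appear verbatim in the page text are all assumptions chosen for illustration; they are not taken from the original scraper.

```python
import re
from decimal import Decimal

# Digits, optional thousands separators, then "." or "," and exactly two decimals.
# Prices without a two-digit decimal part are deliberately rejected and flagged.
PRICE_RE = re.compile(r"(\d[\d.,\s]*?)[.,](\d{2})(?!\d)")


def parse_price(raw):
    """Return the price as a Decimal, or None if it doesn't look like one."""
    if not isinstance(raw, str):
        return None
    match = PRICE_RE.search(raw)
    if match is None:
        return None
    whole = re.sub(r"\D", "", match.group(1))  # drop thousands separators
    return Decimal(f"{whole}.{match.group(2)}")


def validate_record(record: dict, page_text: str) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if parse_price(record.get("price")) is None:
        problems.append("price does not look like a number with two decimals")
    # Guard against invented SKUs: the value must occur in the source text.
    sku = record.get("sku")
    if sku is not None and str(sku) not in page_text:
        problems.append("sku not found verbatim in page text")
    return problems
```

A record that comes back with a non-empty problem list can then be re-prompted or queued for manual review, as suggested above.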
