Regex broke my scraper: Using LLMs for robust data extraction

I've been building scrapers for years. I know the drill: find the CSS selector, write a regex, test it, deploy, and hope the website doesn't change its markup next week. But last month, I hit a wall. I was tasked with extracting product prices and availability from over 200 supplier websites. Each site had its own layout; some rendered with JavaScript, others were plain HTML.

My initial approach was the same old combo of BeautifulSoup, XPath, and regex. It worked... for about a week. Then one of the biggest suppliers rolled out a redesign. My carefully crafted selector .price--current vanished. The price was now inside a nested <span> with a dynamic class name like _3xj0a _2v9j3. Regex? Forget it. I spent two days patching it, only to have another site change the next week. I was fighting a losing battle.

First, I tried more sophisticated CSS selectors using :contains() and nth-of-type. That worked until the site added a banner ad before the price. I tried matching patterns like \$\d+\.\d{2}, but some prices were in EUR, some had discounts, and some were hidden in JavaScript-rendered content. I even used a headless browser (Playwright) to wait for elements, but that slowed things down and still broke when the DOM structure shifted. I looked into visual testing tools and machine learning approaches like object detection on screenshots: too heavy and unreliable for text extraction.

A colleague mentioned they used GPT to parse unstructured data from PDFs. I thought: why not try it on raw HTML? The idea was simple: instead of guessing selectors, send the relevant HTML snippet (or even just the text content) to an LLM with a clear instruction to extract structured fields. I knew that sending entire pages would be expensive and slow, so I first stripped down the HTML using BeautifulSoup: remove scripts, styles, and navigation, then flatten the body into clean text while preserving some structural markers (like h1, table).
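To make the "flatten but keep structural markers" step concrete, here is a minimal sketch of one way it could work. It is not the code from my project: to keep it dependency-free it uses Python's built-in html.parser instead of BeautifulSoup, and the names MarkedTextExtractor and html_to_marked_text, as well as the [H1] and [ROW] marker format, are made up for illustration. It also assumes reasonably well-formed HTML (an unclosed nav tag would swallow the rest of the page).

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}
HEADING_TAGS = {"h1", "h2", "h3"}


class MarkedTextExtractor(HTMLParser):
    """Collects visible text, skipping noise tags and marking headings and table rows."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []        # finished output lines
        self.noise_depth = 0   # > 0 while inside a noise tag
        self.row_cells = None  # text chunks collected while inside a <tr>
        self.prefix = ""       # marker for the next text chunk, e.g. "[H1] "

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in HEADING_TAGS:
            self.prefix = f"[{tag.upper()}] "
        elif tag == "tr":
            self.row_cells = []

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth:
            self.noise_depth -= 1
        elif tag in HEADING_TAGS:
            self.prefix = ""
        elif tag == "tr" and self.row_cells is not None:
            # Emit the whole row as a single "cell | cell" line.
            self.lines.append("[ROW] " + " | ".join(self.row_cells))
            self.row_cells = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self.noise_depth:
            return
        if self.row_cells is not None:
            self.row_cells.append(text)
        else:
            self.lines.append(self.prefix + text)
            self.prefix = ""


def html_to_marked_text(html, max_chars=3000):
    """Return cleaned page text, truncated to max_chars characters."""
    parser = MarkedTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)[:max_chars]
```

Calling html_to_marked_text(page_html) returns one line per text chunk, with headings and table rows labeled, so the model can still tell a product title from a spec table after the markup is gone.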
Then I'd prompt the LLM to return a JSON object with fields like name, price, availability, sku. Here's the core of what I built:

    import json

    import requests
    from bs4 import BeautifulSoup


    # This function cleans the page and sends a targeted prompt to the LLM
    def extract_product_info(url, llm_api_endpoint, api_key):
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # fail early on 4xx/5xx pages
        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove noise
        for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
            tag.decompose()

        # Fall back to the whole document if the page has no <body>
        body = soup.body or soup

        # Get visible text, one text node per line
        text = body.get_text(separator='\n', strip=True)

        # Truncate to avoid token limits (e.g., 3000 chars)
        text = text[:3000]

        prompt = f"""
    You are a data extraction assistant. Extract product information from the following web page text.
    Return a JSON object with these fields:
    - product_name
    - price (as a string with currency symbol if present)
    - availability (e.g., 'In stock', 'Out of stock')
    - sku (if found)
    If a field is missing, set it to null.

    Page text:
    {text}
    """

        headers = {
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json'
        }
        payload = {
            'model': 'gpt-4o-mini',  # cheaper, faster
            'messages': [{'role': 'user', 'content': prompt}],
            'temperature': 0.1,  # low for consistent output
            'response_format': {'type': 'json_object'}
        }

        # This could be any OpenAI-compatible endpoint; I used a custom one like https://ai.interwestinfo.com/v1
        r = requests.post(llm_api_endpoint, json=payload, headers=headers, timeout=60)
        r.raise_for_status()
        data = r.json()
        return json.loads(data['choices'][0]['message']['content'])


    # Example usage
    if __name__ == '__main__':
        result = extract_product_info(
            'https://example-store.com/product/123',
            'https://api.openai.com/v1/chat/completions',
            'sk-your-key-here'
        )
        print(result)

I tested this on 10 different e-commerce pages. It worked on all of them, even on the site that had changed its classes.
The LLM understood context: it recognized "€39,99" as a price even without a dollar sign, and it knew that "ships in 3-5 days" meant availability was not "In stock".

Pros:
- Robust to layout changes: I haven't touched the code in weeks, even as sites updated.
- Adaptable: Want to extract rating or description? Just change the prompt.
- No more regex hell: The LLM handles currency formats, abbreviations, and missing data gracefully.

Cons:
- Cost: Each request costs a fraction of a cent, but at scale it adds up. For high-volume scraping (thousands of pages/hour), this isn't viable without caching or local models.
- Latency: LLM calls take 1-3 seconds. Compared to a quick regex match, that's slow.
- Hallucinations: Sometimes the LLM invents a SKU or misreads a discount as the main price. Always validate with a secondary rule (e.g., check that the price matches \d+[.,]\d{2}, so comma decimals like 39,99 pass too).
- Prompt engineering: You need to tune the prompt to avoid over-extraction. I spent a day tweaking examples.

Looking back, here's what I'd do differently. First, I'd pre-filter pages more aggressively. Some pages had thousands of lines; sending the whole thing wasted tokens. I used BeautifulSoup to keep only the main content area once I identified common patterns. Second, I'd add a validation step: after getting the JSON from the LLM, I parse the price field with regex to ensure it looks like a valid number. If not, re-prompt or flag it for manual review. I also explored using a small local model (like Phi-3) for the first pass, falling back to a cloud model only when confidence was low. This reduced cost by 80%. Finally, I realized LLMs are terrible at counting rows in tables or precise numeric extraction (like "5 items" vs "5,00"). For those, I still use regex on the snippet identified by the LLM.

If you're scraping a single site with a stable API or a clear sitemap, skip the LLM. If you need real-time data (sub-second), this is too slow. If your data is purely numerical and structured (like log files), regex is faster and cheaper.
And if you're on a tight budget, even gpt-4o-mini adds up; a local model might be better.

But for my use case (extracting semi-structured data from a chaotic set of web pages), this approach has been a lifesaver. I haven't touched my scraping code in three weeks, even though sites have changed several times. The LLM abstracts away the fragile DOM.

Now I'm curious: How do you handle website changes in your scrapers? Are you still wrestling with selectors, or have you tried a language model approach? What does your setup look like?
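A postscript on the validation step, since it's the part most people skip: below is a minimal sketch of what a regex check on the LLM's price field could look like. Treat it as an illustration under stated assumptions, not the exact code I run. The names parse_price, validate_record, price_amount, and needs_review are hypothetical, and the pattern assumes prices with exactly two decimals and no thousands separators; anything else (such as "1,299.00") is flagged for review instead of being guessed at.

```python
import re
from decimal import Decimal

# Digits, then a dot or comma, then exactly two decimals.
# The lookarounds reject matches that are part of a longer number,
# so "1,299.00" or "39.999" are not half-matched.
PRICE_PATTERN = re.compile(r"(?<![\d.,])(\d+)[.,](\d{2})(?![\d.,])")


def parse_price(raw):
    """Return the price as a Decimal, or None if it doesn't look like one."""
    if not isinstance(raw, str):
        return None
    match = PRICE_PATTERN.search(raw)
    if match is None:
        return None
    whole, cents = match.groups()
    return Decimal(f"{whole}.{cents}")


def validate_record(record):
    """Attach a parsed price and a needs_review flag to the LLM's JSON output."""
    amount = parse_price(record.get("price"))
    return {**record, "price_amount": amount, "needs_review": amount is None}
```

Records that come back with needs_review set to True are the ones to re-prompt or send to manual review; everything else carries a numeric price_amount that downstream code can compare or sum without touching the raw string again.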
This story was reported by Dev.to.



