When Regex Fails: Using LLMs to Extract Structured Data from Messy Pages

I've been doing web scraping for years. For most projects, I lean on BeautifulSoup, cssselect, and a handful of regex patterns. You know the drill: inspect the page, find the selector, extract the text, clean it up. It works great when every page follows the same template.

Then I hit a project that involved scraping product details from hundreds of small e-commerce sites. Every site had its own HTML structure. Some used `<div class="price">`, others `<span itemprop="price">`, and a few just had $29.99 buried in a paragraph with no class at all. My carefully crafted selectors broke within the first dozen sites. I was spending more time writing conditional parsers than actually using the data.

My first instinct was to throw more code at the problem. I wrote a meta-parser that tried multiple selectors and fell back to regex for patterns like prices. That worked… until a site used a different currency symbol, or showed a discount price after the original one. Debugging became a nightmare.

Next, I tried training a simple classifier to tag elements (price, name, description) based on their attributes and surrounding text. I used scikit-learn with features like class names, tag name, and text length. It worked okay on the training set but failed on new layouts. The feature space was too shallow.

I also experimented with headless browsers to grab computed styles, hoping to identify prices by their bold weight or color. That was fragile and slow.

After a few weeks of frustration, I had a crazy idea: what if I just send the raw HTML to a language model and ask it to extract the data I need? No selectors, no regex, just a prompt. I tried it with a small snippet of HTML from a product page, and it worked. Not perfectly, but it got me the right values most of the time. The key was to provide a few examples (few-shot prompting) and a clear output format.

Here's the core technique:

- Extract a small region of the page around the product (to keep token counts low).
- Build a prompt that includes 2-3 examples of HTML → JSON.
- Send the target HTML and parse the JSON response.

```python
import json

import openai  # legacy interface: openai.ChatCompletion needs openai<1.0

openai.api_key = "sk-..."  # From your account

# The system prompt carries the output contract plus the few-shot examples.
system_prompt = """You are a data extraction assistant. Given raw HTML, return a JSON object with fields: name, price, currency, and availability. Use the few-shot examples below as a guide.

Examples:

HTML:
<div class="product">
  <h2>Widget A</h2>
  <span class="price">$19.99</span>
  <p>In stock</p>
</div>
JSON: {"name": "Widget A", "price": 19.99, "currency": "USD", "availability": "in_stock"}

HTML:
<span itemprop="name">Gadget X</span>
<span itemprop="price" content="49.99">€49,99</span>
<meta itemprop="availability" content="InStock">
JSON: {"name": "Gadget X", "price": 49.99, "currency": "EUR", "availability": "in_stock"}
"""

# The target snippet, already trimmed to the region around the product.
user_html = """
<div class="item">
  <strong>Super Tool</strong>
  <span>$299.00</span>
  <span>Only 2 left</span>
</div>
"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"HTML:\n{user_html}\nJSON:"},
    ],
    temperature=0,  # keep the output as repeatable as possible
)

# The model may return text that is not valid JSON, so parse defensively.
try:
    extracted = json.loads(response.choices[0].message.content)
    print(extracted)
except json.JSONDecodeError:
    print("Failed to parse LLM output")
```

This snippet is the essence. You can adapt it for any structured data: reviews, specifications, even table rows.

This approach isn't a silver bullet. Here's what I discovered:

- Cost: Every extraction costs tokens. For high-volume scraping (thousands of products), LLM API costs can add up. I estimate $0.02–$0.05 per extraction with GPT-3.5. For a project with 10k products, that's $200–$500. For smaller jobs, it's negligible.
- Latency: Each request takes 1–3 seconds. Not great for real-time scraping, but fine for batch processing.
- Inaccuracies: LLMs hallucinate. Sometimes the price is a few cents off, or the availability is guessed. I added a validation step that checks whether the extracted price matches a regex pattern for numbers. If it doesn't, the record is flagged for manual review.
- Token limits: You can't send the whole page HTML. You need to pre-trim the region around the product. I used a simple heuristic: find the first `<article>` or `<div>` that contains a price-like pattern, then send that subtree.
- Privacy: If you're scraping internal data, sending HTML to a third-party API might be a no-go. Use local models like Llama 3 or Mistral via Ollama or llama.cpp. They're slower and less accurate but keep data on-prem.

If your target pages have consistent structure, stick with CSS selectors: they're free, fast, and deterministic. LLM extraction is for the long tail of messy, unpredictable pages. I now use a hybrid: first try a traditional parser for known sites, then fall back to an LLM for unknown layouts.

If I were starting over, I'd invest more time in the fallback strategy. Instead of sending the entire HTML region, I could pre-process it to remove noise (scripts, styles) and normalize attribute names. That would reduce tokens and improve accuracy. I'd also try a cheaper model, maybe GPT-3.5-turbo-instruct, and compare success rates.

Another improvement: use a structured output format (like JSON mode in recent models) to avoid parsing errors. Some APIs now allow specifying a JSON schema. That would remove most of the manual format checks, although a schema cannot catch a hallucinated value.

I'll never go back to writing dozens of fragile parsers again. The ability to say "just extract the data" in natural language feels like cheating, but it works. The trade-offs are real, but for the kinds of messy, one-off scraping tasks I do, this technique has saved me weeks of effort.

Have you tried using LLMs for data extraction? What was your experience? Did you hit the same pitfalls?
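The validation step described in the pitfalls list could be sketched along these lines. This is a minimal illustration under stated assumptions, not the exact check used on the project: `validate_extraction`, both regex patterns, and the extra cross-check against the source HTML are hypothetical names and choices introduced here.

```python
import re

# Assumed shape of a valid price: digits with at most two decimal places.
PRICE_RE = re.compile(r"^\d+(?:\.\d{1,2})?$")
# Any number-like token in the source HTML, with "." or "," as decimal mark.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def validate_extraction(record, source_html):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    price = record.get("price")

    # Check 1 (from the article): the price must match a numeric pattern.
    if not PRICE_RE.match(str(price)):
        problems.append(f"price does not look numeric: {price!r}")
        return problems

    # Check 2 (an added assumption): the value should appear somewhere in
    # the HTML that was sent to the model, which catches "a few cents off".
    seen = {float(tok.replace(",", ".")) for tok in NUMBER_RE.findall(source_html)}
    if float(price) not in seen:
        problems.append(f"price {price} not found in source HTML")
    return problems


html = '<div class="item"><strong>Super Tool</strong><span>$299.00</span></div>'
print(validate_extraction({"name": "Super Tool", "price": 299.0}, html))
print(validate_extraction({"name": "Super Tool", "price": 299.95}, html))
print(validate_extraction({"name": "Super Tool", "price": "$299"}, html))
```

Records that come back with a non-empty list would go to the manual review queue. Prices written with thousands separators (for example 1,299.00) are flagged by this simple tokenizer, so it errs on the side of review.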

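The pre-processing idea (dropping scripts and styles before the HTML reaches the model) might look something like the following, using only the standard library. It is a sketch, not a tested production cleaner: `NoiseStripper`, `strip_noise`, and the contents of `SKIP_TAGS` and `KEEP_ATTRS` are assumptions chosen for illustration, and it filters attributes rather than normalizing their names.

```python
from html import escape
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}    # assumed noise elements
KEEP_ATTRS = {"class", "itemprop", "content"}  # assumed useful attributes
VOID_TAGS = {"meta", "br", "img", "input", "link", "hr"}


class NoiseStripper(HTMLParser):
    """Re-emit HTML without noise elements or unlisted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth:
            self.parts.append(escape(text, quote=False))


def strip_noise(markup):
    parser = NoiseStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


page = """
<div class="item" id="p1" style="color:red" data-track="x">
  <script>var price = "$1.00";</script>
  <style>.item { font-weight: bold; }</style>
  <strong>Super Tool</strong>
  <span>$299.00</span>
</div>
"""
print(strip_noise(page))
```

Removing the script also removes the misleading `$1.00` it contains, so the model only sees the price that is actually displayed. Whether this improves accuracy in practice would need to be measured against the untrimmed input.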