When Regex Fails: Using LLMs to Extract Structured Data from Messy Pages

I've been doing web scraping for years. For most projects, I lean on BeautifulSoup, cssselect, and a handful of regex patterns. You know the drill: inspect the page, find the selector, extract the text, clean it up. It works great when every page follows the same template.

Then I hit a project that involved scraping product details from hundreds of small e-commerce sites. Every site had its own HTML structure. Some used <div class="price">, others <span itemprop="price">, and a few just had $29.99 buried in a paragraph with no class at all. My carefully crafted selectors broke within the first dozen sites. I was spending more time writing conditional parsers than actually using the data.

My first instinct was to throw more code at the problem. I wrote a meta-parser that tried multiple selectors and fell back to regex for patterns like prices. That worked... until a site used a different currency symbol, or a discount price that appeared after the original. Debugging became a nightmare.

Next, I tried training a simple classifier to tag elements (price, name, description) based on their attributes and surrounding text. I used scikit-learn with features like class names, tag name, and text length. It worked okay on the training set, but failed on new layouts. The feature space was too shallow.

I also experimented with headless browsers to grab computed styles, hoping to identify prices by their bold weight or color. That was fragile and slow.

After a few weeks of frustration, I had a crazy idea: what if I just send the raw HTML to a language model and ask it to extract the data I need? No selectors, no regex, just a prompt. I tried it with a small snippet of HTML from a product page, and it worked. Not perfectly, but it got me the right values most of the time. The key was to provide a few examples (few-shot prompting) and a clear output format.

Here's the core technique:

Extract a small region of the page around the product (to keep token counts low).
Build a prompt that includes 2-3 examples of HTML → JSON.
Send the target HTML and parse the JSON response.

    import openai
    import json

    # Note: openai.ChatCompletion is the legacy interface (openai<1.0).
    # In openai>=1.0 the equivalent call is client.chat.completions.create
    # on an OpenAI() client.
    openai.api_key = "sk-..."  # From your account

    # System prompt: task description plus two few-shot HTML -> JSON examples.
    system_prompt = """You are a data extraction assistant.
    Given raw HTML, return a JSON object with fields: name, price, currency, and availability.
    Use the few-shot examples below as a guide.

    Examples:

    HTML:
    <div class="product">
      <h2>Widget A</h2>
      <span class="price">$19.99</span>
      <p>In stock</p>
    </div>
    JSON:
    {"name": "Widget A", "price": 19.99, "currency": "USD", "availability": "in_stock"}

    HTML:
    <span itemprop="name">Gadget X</span>
    <span itemprop="price" content="49.99">€49,99</span>
    <meta itemprop="availability" content="InStock">
    JSON:
    {"name": "Gadget X", "price": 49.99, "currency": "EUR", "availability": "in_stock"}
    """

    # The pre-trimmed region of the target page.
    user_html = """
    <div class="item">
      <strong>Super Tool</strong>
      <span>$299.00</span>
      <span>Only 2 left</span>
    </div>
    """

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"HTML:\n{user_html}\nJSON:"},
        ],
        temperature=0,  # keep the output as repeatable as possible
    )

    # The model returns text, so parsing can fail; handle that explicitly.
    try:
        extracted = json.loads(response.choices[0].message.content)
        print(extracted)
    except json.JSONDecodeError:
        print("Failed to parse LLM output")

This snippet is the essence. You can adapt it for any structured data: reviews, specifications, even table rows.

This approach isn't a silver bullet. Here's what I discovered:

Cost: Every extraction costs tokens. For high-volume scraping (thousands of products), LLM API costs can add up. I estimate $0.02–$0.05 per extraction with GPT-3.5. For a project with 10k products, that's $200–500. For smaller jobs, it's negligible.

Latency: Each request takes 1–3 seconds. Not great for real-time scraping, but fine for batch processing.

Inaccuracies: LLMs hallucinate. Sometimes the price is a few cents off, or the availability is guessed.
I added a validation step that checks whether the extracted price matches a regex pattern for numbers; if it doesn't, the record is flagged for manual review.

Token limits: You can't send the whole page HTML. You need to pre-trim the region around the product. I used a simple heuristic: find the first <article> or <div> that contains a price-like pattern, then send that subtree.

Privacy: If you're scraping internal data, sending HTML to a third-party API might be a no-go. Use local models like Llama 3 or Mistral via Ollama or llama.cpp. They tend to be slower and less accurate, but they keep data on-prem.

If your target pages have consistent structure, stick with CSS selectors: they're free, fast, and deterministic. LLM extraction is for the long tail of messy, unpredictable pages. I now use a hybrid: first try a traditional parser for known sites, then fall back to an LLM for unknown layouts.

If I were starting over, I'd invest more time in the fallback strategy. Instead of sending the entire HTML region, I could pre-process it to remove noise (scripts, styles) and normalize attribute names. That should reduce tokens and improve accuracy. I'd also try a cheaper model, maybe gpt-3.5-turbo-instruct, and compare success rates.

Another improvement: use a structured output format (like JSON mode in recent models) to avoid parsing errors. Some APIs now allow specifying a JSON schema. That guarantees the shape of the response, though the values themselves can still be wrong, so the price check stays.

I'll never go back to writing dozens of fragile parsers again. The ability to say "just extract the data" in natural language feels like cheating, but it works. The trade-offs are real, but for the kinds of messy, one-off scraping tasks I do, this technique has saved me weeks of effort.

Have you tried using LLMs for data extraction? What was your experience? Did you hit the same pitfalls?
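For anyone who wants a starting point, the noise-stripping and price-validation ideas mentioned above can be sketched with the standard library alone. This is a rough illustration, not the code from the project: the names `strip_noise`, `price_spellings`, `validate_extraction`, `PRICE_RE`, `NOISE_RE`, and `REQUIRED_FIELDS` are all made up for this example, the regex-based stripping is a crude stand-in for a real HTML parser, and the "price must appear in the source HTML" check is an extra assumption that goes beyond the simple number-pattern check described in the post.

```python
import json
import re
from typing import List, Optional, Tuple

# Hypothetical names and patterns, chosen for illustration only.
REQUIRED_FIELDS = ("name", "price", "currency", "availability")
PRICE_RE = re.compile(r"^\d+(\.\d{1,2})?$")  # e.g. 299, 299.0, 19.99
NOISE_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noise(html: str) -> str:
    """Drop script/style elements and comments, then collapse whitespace."""
    cleaned = NOISE_RE.sub("", html)
    return re.sub(r"\s+", " ", cleaned).strip()


def price_spellings(price: float) -> List[str]:
    """Ways the price might be written in the page: 299.00, 299,00, 299."""
    fixed = f"{price:.2f}"
    spellings = [fixed, fixed.replace(".", ",")]
    if fixed.endswith(".00"):
        spellings.append(fixed[:-3])
    return spellings


def validate_extraction(
    raw_json: str, source_html: str
) -> Tuple[Optional[dict], List[str]]:
    """Return (parsed data or None, list of problems to review manually)."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None, ["response is not valid JSON"]
    if not isinstance(data, dict):
        return None, ["response is not a JSON object"]

    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in data]
    price = data.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("price is not a number")
    elif not PRICE_RE.match(str(price)):
        problems.append("price does not look like a decimal amount")
    elif not any(s in source_html for s in price_spellings(price)):
        # Assumption: a real price should literally appear somewhere in the page.
        problems.append("price not found in source HTML")
    return data, problems


if __name__ == "__main__":
    page = strip_noise(
        '<script>track()</script><div class="item"> <strong>Super Tool</strong>\n'
        " <span>$299.00</span></div>"
    )
    reply = '{"name": "Super Tool", "price": 299.0, "currency": "USD", "availability": "in_stock"}'
    print(validate_extraction(reply, page))
```

An empty problem list means the record passed both checks; anything else goes to the manual review pile. The substring check is deliberately loose (a price of 299 would also match inside 1299), so treat it as a cheap first filter, not proof that the value is right.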
Source: Dev.to


