How I Stopped Regexing HTML Tables and Started Using AI for Data Extraction

I've been scraping data from the web for years. You'd think I'd have learned by now: never use regex on HTML. But sometimes, when you're staring at a messy table with inconsistent classes, random whitespace, and nested elements that barely qualify as valid markup, the temptation to just throw a regex at it is overwhelming.

I found myself in that exact situation last month. I needed to extract property listings from a dozen different real estate websites. Each site had its own quirks. One used <table> tags with rowspan and colspan that made BeautifulSoup cry. Another had dynamic content loaded via JavaScript that my initial scraping setup couldn't even see. This is the story of how I gave up on perfect parsing and let an AI handle the messy middle.

My first attempt was the classic approach: Python + requests + BeautifulSoup. For sites with clean semantic HTML, this worked beautifully. But the real world is full of edge cases:

- Missing closing tags
- Inline styles overriding table structure
- Random <br> elements splitting text across rows
- Data that spans multiple cells visually but not in the DOM

I wrote custom functions for each site. They worked for a week. Then a site would update its layout, and my parser broke. Again.

I tried regex as a last resort (I know, I know). Here's a snippet of the mess I ended up with:

    import re

    def extract_price_from_html(html):
        # Matches a <span> whose double-quoted class attribute starts with
        # "price" and whose content is only digits, commas, periods, and "$".
        pattern = r'<span[^>]*class="price[^"]*"[^>]*>([0-9,.$]+)</span>'
        match = re.search(pattern, html)
        return match.group(1) if match else None

This worked for exactly one site, on a good day, with perfect formatting. Any minor change broke it. I was maintaining a fragile house of cards.

My next idea was to use lxml with XPath. More precise, but still brittle. I even tried building a custom state machine to track table cell positions, which was overengineering at its finest. I spent two days writing code that handled 80% of cases, then gave up on the long tail.
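Going back to the regex for a moment: a quick way to see how fragile it is is to run the same pattern against a few small markup variations. This is a minimal sketch, and the sample snippets are invented for illustration (they don't come from any real site); only the pattern is the one from the function above.

```python
import re

# Same pattern as in extract_price_from_html above.
PRICE_PATTERN = re.compile(
    r'<span[^>]*class="price[^"]*"[^>]*>([0-9,.$]+)</span>'
)


def extract_price_from_html(html):
    match = PRICE_PATTERN.search(html)
    return match.group(1) if match else None


# Hypothetical markup variants, all showing the same price to a human reader.
samples = {
    "original": '<span class="price">$450,000</span>',
    "single_quotes": "<span class='price'>$450,000</span>",
    "nested_tag": '<span class="price"><b>$450,000</b></span>',
    "inner_whitespace": '<span class="price"> $450,000 </span>',
}

results = {name: extract_price_from_html(html) for name, html in samples.items()}

for name, price in results.items():
    print(f"{name}: {price}")
```

Only the first variant yields a price; the other three return None even though a browser would render all four identically.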
I needed something that could understand the meaning of the data, not just its layout. I started experimenting with large language models (LLMs) to parse the HTML text directly. The idea: dump the raw HTML (or a cleaned version) into an AI API and ask it to return structured JSON. No parsing rules, no regex, no XPath, just a prompt.

I found a service that abstracts this into a simple REST endpoint. The core insight is that instead of writing code to find the price in a table, you tell the AI what the price looks like and let it figure out the context.

Here's the approach I settled on:

1. Fetch the page HTML.
2. Extract the main content area (strip out headers, footers, scripts).
3. Send that content to the AI API with a prompt describing the desired output schema.

I built a small Python class around it:

    import requests
    from bs4 import BeautifulSoup


    class AIDataExtractor:
        def __init__(self, api_key):
            self.api_key = api_key
            self.base_url = "https://ai.interwestinfo.com/extract"  # Example endpoint

        def extract_listings(self, html, schema):
            # schema is assumed to be a list of field names to request,
            # e.g. ["address", "price", "bedrooms"]
            # Preprocess: keep only the visible text
            soup = BeautifulSoup(html, 'html.parser')
            # Remove scripts, styles, and page chrome
            for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
                tag.decompose()
            clean_text = soup.get_text(separator='\n', strip=True)

            # Truncate outside the f-string so the note about the limit
            # doesn't end up inside the prompt itself
            content = clean_text[:5000]  # limit to avoid token overflow
            fields = ", ".join(schema)
            prompt = f"""Extract property listings from the following web page content.
    Return a JSON array of objects with these fields: {fields}.
    If a field is not found, set it to null.

    Content:
    {content}
    """
            response = requests.post(
                self.base_url,
                json={"prompt": prompt, "model": "default"},
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=30,
            )
            response.raise_for_status()
            return response.json()

The actual API call returns a JSON object with a listings key. I then validate and transform it into my data model.

Let's say I'm scraping a real estate site.
Here's how I'd use the extractor:

    import requests

    url = "https://example-realty.com/listings"
    page_html = requests.get(url, timeout=30).text

    # Fields to ask the API for
    schema = ["address", "price", "bedrooms", "bathrooms", "square_feet"]

    extractor = AIDataExtractor(api_key="sk-...")
    result = extractor.extract_listings(page_html, schema)

    # The response is a JSON object; the listings sit under its "listings" key
    for listing in result["listings"]:
        print(f"{listing['address']} - {listing['price']}")

The first few results were surprisingly accurate: about 90% of fields were populated correctly. The errors were mostly on edge cases like "Price Upon Request" or missing square footage. I added a second pass of validation: check that price is numeric, address exists, etc. If a field is null, I can either skip that listing or use a fallback parser.

This approach isn't a silver bullet. Here's what I discovered:

- Cost: AI API calls cost money, especially for high-volume scraping. Each request might be $0.01 to $0.05 depending on the model. For 10,000 listings at one request each, that's roughly $100 to $500.
- Latency: Calling an API is slower than local parsing. Each request takes 1 to 3 seconds. If you're scraping thousands of pages, this can take hours.
- Hallucinations: The AI sometimes invents data if it can't find it. For example, if a price is missing, it might guess "$500,000" from context or just make something up. You absolutely need validation steps.
- Rate Limits: Many AI APIs have strict rate limits. You'll need to throttle or rotate accounts.
- Consistency: The same page can return slightly different JSON each time due to model non-determinism. Not ideal for production ETL pipelines.

When not to use this:

- You have well-structured HTML that BeautifulSoup can handle (e.g., government data tables).
- You're scraping billions of pages (cost kills you).
- You need perfect, deterministic output every time.

The sweet spot is when you have a moderate volume of pages with inconsistent structure, and you can afford a few cents per page to avoid writing custom parsers.

If I were doing this again, I'd start with the AI approach from day one, but I'd also build a caching layer to avoid re-requesting the same page.
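A caching layer along those lines could be as small as the sketch below. It's an illustration under assumptions, not code from the pipeline described here: the `ExtractionCache` class, its method names, and the choice to key on a SHA-256 hash of the page content are all made up for the example, and `fake_extract` stands in for the real (paid) API call.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ExtractionCache:
    """Stores extraction results on disk, keyed by a hash of the page content."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, html):
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get_or_extract(self, html, extract_fn):
        """Return the cached result for this HTML, calling extract_fn only on a miss."""
        path = self._path_for(html)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        result = extract_fn(html)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result


# Demo with a stub in place of the API call, so nothing is sent over the network.
calls = []


def fake_extract(html):
    calls.append(html)
    return {"listings": [{"address": "1 Main St", "price": "$450,000"}]}


cache = ExtractionCache(tempfile.mkdtemp())
first = cache.get_or_extract("<html>demo</html>", fake_extract)
second = cache.get_or_extract("<html>demo</html>", fake_extract)
print(len(calls))  # the stub ran once; the second call was served from disk
```

Keying on the content hash means an unchanged page never goes to the API twice. Keying on the URL instead would also skip the fetch, at the risk of serving stale results after a site updates.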
I'd also use a smaller, cheaper model for simple extractions and reserve the powerful (expensive) models for truly messy pages. Some services let you specify model size in the request. I'd also mix approaches: use regex for high-confidence fields (like prices prefixed with '$') and the AI as a fallback for the long tail. A hybrid pipeline would reduce costs while still handling the weird edge cases.

AI isn't just about generating text or images. It's a tool for understanding context, something that's incredibly hard to do with deterministic code. For data extraction, it turns the problem from "write a parser for every layout" into "describe what you want in English and let the model figure it out." That trade-off is worth it for many real-world scraping projects.

What's your go-to approach when you hit a wall with structured data extraction? Do you roll your own parser, or have you tried the AI route? I'd love to hear about your experiences, especially the horror stories.
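A postscript on the hybrid idea: the sketch below shows one way the regex-first, AI-as-fallback flow and the validation pass could fit together. Everything in it is hypothetical, including the function names, the price pattern, and the `fake_ai` stub that stands in for the API call. It operates on already-extracted text rather than raw HTML, so the regex never has to deal with markup.

```python
import re

# Dollar amounts such as "$450,000" or "$1200" (assumed format, US-style).
PRICE_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)")


def regex_price(text):
    """Return the first dollar amount in text as an int, or None."""
    match = PRICE_RE.search(text)
    if not match:
        return None
    return int(match.group(1).replace(",", ""))


def extract_price(text, ai_fallback):
    """Try the cheap regex first; call the AI only when it finds nothing."""
    price = regex_price(text)
    if price is not None:
        return price, "regex"
    return ai_fallback(text), "ai"


def validate_listing(listing):
    """Return a list of problems; an empty list means the listing passed."""
    problems = []
    if not listing.get("address"):
        problems.append("missing address")
    price = listing.get("price")
    is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
    if not is_number or price <= 0:
        problems.append("price is not a positive number")
    return problems


def fake_ai(text):
    # Stand-in for the paid API call; here it simply reports "not found".
    return None


rows = [
    {"address": "12 Oak St", "text": "12 Oak St, listed at $450,000"},
    {"address": "9 Elm Ave", "text": "9 Elm Ave, Price Upon Request"},
]

for row in rows:
    price, source = extract_price(row["text"], fake_ai)
    listing = {"address": row["address"], "price": price}
    print(source, listing, validate_listing(listing))
```

In this setup the API is only paid for on rows the regex can't handle, and anything that fails validation can be skipped or routed to a fallback parser instead of being trusted.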
This story was reported by Dev.to.



