How I Stopped Regexing HTML Tables and Started Using AI for Data Extraction

I've been scraping data from the web for years. You'd think I'd have learned by now: never use regex on HTML. But sometimes, when you're staring at a messy table with inconsistent classes, random whitespace, and nested elements that barely qualify as valid markup, the temptation to just throw a regex at it is overwhelming.

I found myself in that exact situation last month. I needed to extract property listings from a dozen different real estate websites. Each site had its own quirks. One used `<table>` tags with rowspan and colspan that made BeautifulSoup cry. Another had dynamic content loaded via JavaScript that my initial scraping setup couldn't even see. This is the story of how I gave up on perfect parsing and let an AI handle the messy middle.

My first attempt was the classic approach: Python + requests + BeautifulSoup. For sites with clean semantic HTML, this worked beautifully. But the real world is full of edge cases:

- Missing closing tags
- Inline styles overriding table structure
- Random `<br>` elements splitting text across rows
- Data that spans multiple cells visually but not in the DOM

I wrote custom functions for each site. They worked for a week. Then the site updated its layout, and my parser broke. Again.

I tried regex as a last resort (I know, I know). Here's a snippet of the mess I ended up with:

```python
import re

def extract_price_from_html(html):
    # Match the first <span> whose class attribute starts with "price"
    # and capture its contents (digits, commas, periods, dollar signs).
    pattern = r'<span[^>]*class="price[^"]*"[^>]*>([0-9,.$]+)</span>'
    match = re.search(pattern, html)
    return match.group(1) if match else None
```

This worked for exactly one site, on a good day, with perfect formatting. Any minor change broke it. I was maintaining a fragile house of cards.

My next idea was to use lxml with XPath. More precise, but still brittle. I even tried building a custom state machine to track table cell positions: overengineering at its finest. I spent two days writing code that handled 80% of cases, then gave up on the long tail.
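To make the fragility of that price regex concrete, here is a small sketch that runs the same pattern against a few markup variants. The variants are hypothetical, invented for illustration rather than taken from any of the sites I scraped, but each one is the kind of minor change that breaks the match.

```python
import re

# The pattern from the regex snippet in this article, unchanged.
PRICE_PATTERN = r'<span[^>]*class="price[^"]*"[^>]*>([0-9,.$]+)</span>'


def extract_price_from_html(html):
    match = re.search(PRICE_PATTERN, html)
    return match.group(1) if match else None


# Hypothetical markup variants (assumptions for illustration only).
variants = {
    "original": '<span class="price">$450,000</span>',
    "single quotes": "<span class='price'>$450,000</span>",
    "nested tag": '<span class="price"><b>$450,000</b></span>',
    "inner whitespace": '<span class="price"> $450,000 </span>',
    "class renamed": '<span class="listing-price">$450,000</span>',
}

results = {name: extract_price_from_html(html) for name, html in variants.items()}

for name, value in results.items():
    print(f"{name:>16}: {value}")
```

Only the first variant yields a price; the other four return None, even though a human reader would see the same price in all five.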
I needed something that could understand the meaning of the data, not just its layout. I started experimenting with large language models (LLMs) to parse the HTML text directly. The idea: dump the raw HTML (or a cleaned version) into an AI API and ask it to return structured JSON. No parsing rules, no regex, no XPath, just a prompt.

I found a service that abstracts this into a simple REST endpoint. The core insight is that instead of writing code to find the price in a table, you tell the AI what the price looks like and let it figure out the context.

Here's the approach I settled on:

1. Fetch the page HTML.
2. Extract the main content area (strip out headers, footers, scripts).
3. Send that content to the AI API with a prompt describing the desired output schema.

I built a small Python class around it:

```python
import requests
from bs4 import BeautifulSoup


class AIDataExtractor:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://ai.interwestinfo.com/extract"  # Example endpoint

    def extract_listings(self, html, schema):
        # Assumes `schema` is a list of field names, e.g. ["address", "price"].
        # Preprocess: keep only the visible text.
        soup = BeautifulSoup(html, 'html.parser')
        # Remove scripts, styles, and page chrome (nav, header, footer).
        for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
            tag.decompose()
        clean_text = soup.get_text(separator='\n', strip=True)

        # Truncate the content to avoid token overflow.
        content = clean_text[:5000]
        fields = ", ".join(schema)
        prompt = f"""Extract property listings from the following web page content.
Return a JSON array of objects with these fields: {fields}.
If a field is not found, set it to null.

Content:
{content}
"""
        response = requests.post(
            self.base_url,
            json={"prompt": prompt, "model": "default"},
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()
```

The actual API call returns a JSON object with a listings key. I then validate and transform it into my data model. Let's say I'm scraping a real estate site.
Here's how I'd use the extractor:

```python
import requests

url = "https://example-realty.com/listings"
page_html = requests.get(url, timeout=30).text

# Field names to request from the extractor.
schema = ["address", "price", "bedrooms", "bathrooms", "square_feet"]

extractor = AIDataExtractor(api_key="sk-...")
result = extractor.extract_listings(page_html, schema)

# The response is a JSON object with a "listings" key.
for listing in result["listings"]:
    print(f"{listing['address']} - {listing['price']}")
```

The first few results were surprisingly accurate: about 90% of fields populated correctly. The errors were mostly on edge cases like “Price Upon Request” or missing square footage. I added a second pass of validation: check that price is numeric, address exists, etc. If a field is null, I can either skip that listing or use a fallback parser.

This approach isn't a silver bullet. Here's what I discovered:

- Cost: AI API calls cost money, especially for high-volume scraping. Each request might be $0.01–$0.05 depending on the model. For 10,000 listings at one request each, that's $100–$500.
- Latency: Calling an API is slower than local parsing. Each request takes 1–3 seconds. If you're scraping thousands of pages, this can take hours.
- Hallucinations: The AI sometimes invents data if it can't find it. For example, if a price is missing, it might guess “$500,000” from context or just make something up. You absolutely need validation steps.
- Rate Limits: Many AI APIs have strict rate limits. You'll need to throttle or rotate accounts.
- Consistency: The same page can return slightly different JSON each time due to model non-determinism. Not ideal for production ETL pipelines.

When not to use this:

- You have well-structured HTML that BeautifulSoup can handle (e.g., government data tables).
- You're scraping billions of pages (cost kills you).
- You need perfect, deterministic output every time.

The sweet spot is when you have a moderate volume of pages with inconsistent structure, and you can afford a few cents per page to avoid writing custom parsers.

If I were starting over, I'd start with the AI approach from day one, but I'd also build a caching layer to avoid re-requesting the same page.
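I haven't built that caching layer yet, so what follows is only a minimal sketch of one way it could look. It assumes the cache is keyed on a SHA-256 hash of the page HTML and stored as JSON files on disk. `ExtractionCache`, `get_or_extract`, and `fake_extract` are hypothetical names introduced here, and `fake_extract` stands in for the real API call so the example runs offline.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ExtractionCache:
    """Disk cache keyed by a SHA-256 hash of the page HTML (hypothetical helper)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, html):
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get_or_extract(self, html, extract_fn):
        # Return the stored result if this exact HTML was seen before;
        # otherwise call extract_fn once and store what it returns.
        path = self._path_for(html)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        result = extract_fn(html)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result


# Stand-in for the paid API call; it records how often it is invoked.
calls = []


def fake_extract(html):
    calls.append(html)
    return {"listings": [{"address": "1 Example St", "price": "$450,000"}]}


with tempfile.TemporaryDirectory() as tmp:
    cache = ExtractionCache(tmp)
    first = cache.get_or_extract("<html>page</html>", fake_extract)
    second = cache.get_or_extract("<html>page</html>", fake_extract)

print(len(calls))  # the second lookup is served from disk
```

With the extractor class from earlier, the call would look something like `cache.get_or_extract(page_html, lambda h: extractor.extract_listings(h, schema))`. Keying on content rather than URL means an unchanged page costs nothing on a re-run, while a page whose HTML changed is sent to the API again.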
I'd also use a smaller, cheaper model for simple extractions and reserve the powerful (expensive) models for truly messy pages. Some services let you specify model size in the request. I'd also mix approaches: use regex for high-confidence fields (like prices prefixed with '$') and the AI as a fallback for the long tail. A hybrid pipeline would reduce costs while still handling the weird edge cases.

AI isn't just about generating text or images. It's a tool for understanding context, something that's incredibly hard to do with deterministic code. For data extraction, it turns the problem from “write a parser for every layout” into “describe what you want in English and let the model figure it out.” That trade-off is worth it for many real-world scraping projects.

What's your go-to approach when you hit a wall with structured data extraction? Do you roll your own parser, or have you tried the AI route? I'd love to hear about your experiences, especially the horror stories.
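As a postscript to the hybrid pipeline idea above, here is a rough sketch of the regex-first, AI-second ordering for a single field. It is not code I have running in production: the `DOLLAR_PRICE` pattern, `extract_price`, and `fake_ai_fallback` are all hypothetical, and the fallback is a stub standing in for a real API call.

```python
import re

# High-confidence pattern (an assumption): a dollar sign, digits,
# optional thousands groups, optional cents.
DOLLAR_PRICE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def extract_price(text, ai_fallback):
    """Return (price, source), trying the cheap regex before the AI."""
    match = DOLLAR_PRICE.search(text)
    if match:
        return match.group(0), "regex"
    # Only pay for an API call when the regex finds nothing.
    return ai_fallback(text), "ai"


# Stub for the AI call; it records its inputs and finds no price.
ai_calls = []


def fake_ai_fallback(text):
    ai_calls.append(text)
    return None


samples = [
    "3 bed, 2 bath. Listed at $450,000, open house Sunday.",
    "Charming bungalow. Price Upon Request.",
]

results = [extract_price(sample, fake_ai_fallback) for sample in samples]

for sample, (price, source) in zip(samples, results):
    print(f"{source:>5}: {price!r}  <- {sample}")
```

In this run only the second sample reaches the fallback, which is where the cost saving would come from. Anything the fallback returns should still go through the same validation pass as every other AI result.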
