Twitch Chat Scraper: export any VOD's full chat replay for $1.05/1K

Quick answer: Twitch stores a complete timestamped chat replay for every public VOD but exposes no public API or bulk-export endpoint for it. To get the data programmatically you walk the same VideoCommentsByOffsetOrCursor GraphQL endpoint the web player uses. The Apify Actor below does that for $0.001 per message (~$1.05 per 1,000 once the $0.05 run-start fee is included), with TLS fingerprinting, residential-proxy rotation, and pagination handled for you, and no login required.

Twitch chat looks straightforward until you try to pull it at volume. The web player loads it fine; the Twitch Helix API does not expose VOD chat at all, because that endpoint was never built. A handful of third-party Apify Actors covered the gap for years, but most are now deprecated and delisted. If you are training a moderation classifier, building a hype-peak esports dashboard, or doing post-broadcast review, and you need more than you can scroll through by hand, you are writing your own extractor. Here is what that involves, and how I turned it into one API call.

A Twitch VOD is the archived recording of a live broadcast. As long as the VOD exists, Twitch retains the full chat history (every message, its timestamp, the commenter's identity, chat color, badges, and structured emote data) and presents it in its own player as a replay synced to the timeline.

What it does not give you: a download button, a CSV export, or any documented REST endpoint. The data is public, since anyone can watch it in the browser, but bulk extraction means working with an undocumented GraphQL interface that the Twitch Developer documentation does not cover.

As of 2026, the Twitch Helix API does not expose VOD chat replay. The only programmatic surface is the internal VideoCommentsByOffsetOrCursor persisted GraphQL query the web player itself calls. It requires a specific public Client-Id header and a non-obvious pagination scheme to dodge Twitch's integrity-check challenge, and it is subject to aggressive single-IP rate limits past roughly 10,000 messages.
Getting all of that right is the whole job.

Each chat message comes back as one flat, typed row. Here is a real one from a verified VOD scrape:

    {
      "vod_id": "2773625679",
      "vod_title": "never played forza but i definitely have a drivers license so it should be easy",
      "channel_login": "shroud",
      "message_id": "1292e052-0561-4db5-86c7-adfc4556d628",
      "message_offset_seconds": 12,
      "posted_at": "2026-05-16T18:42:35.297Z",
      "commenter_id": "142680597",
      "commenter_login": "tabrexs",
      "commenter_display_name": "tabrexs",
      "message_text": "PewPewPew",
      "message_fragments": [
        {
          "type": "emote",
          "text": "PewPewPew",
          "emote_id": "emotesv2_587405136a8147148c77df74baaa1bf4"
        }
      ],
      "user_color": "#DAA520",
      "badges": [],
      "is_subscriber": false,
      "scraped_at": "2026-05-16T19:00:00Z"
    }

Fifteen fields per row, the same shape every time, Pydantic-validated before it hits the dataset. Every emote is broken out into a structured fragment carrying its Twitch emote ID, so you can rebuild the CDN image URL without extra parsing.

The first pass most developers take:

1. Open Chrome DevTools on a VOD page and find the VideoCommentsByOffsetOrCursor request in the Network tab.
2. Copy it as curl and replay it with Python's requests.
3. Page through the cursor field in the response.

It fails on step 2, then again differently on step 3.

TLS fingerprinting. Twitch's GraphQL endpoint inspects the TLS ClientHello of incoming requests. Python's requests (built on urllib3), httpx, and other clients that rely on Python's default TLS stack produce a fingerprint no real browser sends, so the server sees a bot and returns 403 before it reads the body. We impersonate a real browser's TLS and HTTP/2 handshake via curl-cffi, rotating across Chrome 131, Chrome 124, Firefox 147, and Safari 180 profiles so the handshake matches what those browsers send.

The cursor pagination integrity check. The GraphQL response carries both a per-edge cursor and pageInfo.hasNextPage.
Paging forward on the cursor triggers Twitch's KPSDK browser-integrity challenge on the very next request ({"errors":[{"message":"failed integrity check"}]}), which needs a live browser to solve. We page by offset instead: pass the last message's contentOffsetSeconds + 1 as the next page's start. The integrity check does not fire on offset-based requests (live-verified 2026-05-16).

Rate limiting on sustained runs. Twitch rate-limits a single IP aggressively past roughly 10,000 messages. We thread Apify residential proxies with a fresh session_id on every retry, so each blocked request is retried from a new exit IP. On 408 / 429 / 5xx we retry with exponential backoff: 2 seconds, doubling, capped at 30, up to 5 attempts per page. Partial successes surface with a clear status message; the Actor never silently returns a half-empty dataset.

None of it is glamorous. All of it is the gap between a notebook script that worked once and a feed that survives Twitch's quarterly infra changes.

The result is packaged as an Apify Actor: Twitch VOD Chat Archive. Run it from the Apify Console by pasting a VOD URL, or call it programmatically:

    from apify_client import ApifyClient

    client = ApifyClient("YOUR_APIFY_TOKEN")

    # Start the Actor and wait for the run to finish.
    run = client.actor("DevilScrapes/twitch-vod-chat-archive").call(
        run_input={
            "vodIds": ["2773625679"],
            "maxMessagesPerVod": 5000,
            "startOffsetSeconds": 0,
            "proxyConfiguration": {
                "useApifyProxy": True,
                "apifyProxyGroups": ["RESIDENTIAL"],
            },
        }
    )

    # Stream the resulting rows from the run's default dataset.
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item["message_text"], item["commenter_login"])

Target a VOD by ID or URL, or use channel mode: pass a channel login (e.g. "shroud") and the Actor fetches that channel's most recent archive VODs (maxRecentVods, 1 to 50). The maxMessagesPerVod cap (1 to 200,000) lets you sample a long broadcast without pulling the full run; startOffsetSeconds skips to a point in the timeline.

Five concrete patterns:

Moderation-classifier training.
Pull chat from 50+ public VODs, label toxic vs benign, feed it into a fine-tuning run. A 6-hour mid-size broadcast produces 5,000 to 30,000 messages; fifty VODs is a training set in the hundreds of thousands.

Hype-peak detection for esports analytics. Join message_offset_seconds to the VOD timeline and compute message-velocity over 30-second windows. The spikes map onto game events (kills, objectives, clutch plays) with surprising precision. Commercial esports dashboards charge for exactly this; here you get the raw source.

Post-broadcast review. Streamers and mods export chat from their last 20 broadcasts and search it for usernames, coordinated raids, or emote spikes, without scrubbing the VOD player by hand.

Community sentiment analysis. Run a VOD corpus through a sentiment model and chart positivity over the timeline. Use the is_subscriber boolean to compare sub vs non-sub chat tone.

Academic research. Public Twitch chat is a rich corpus for media-studies work; the wall-clock timestamps, commenter IDs, badge context, and emote breakdowns load into R or pandas with no extra normalization.

Pricing: exact numbers

Pay-per-event. You pay for messages that land in your dataset, nothing for ones that fail.

    Event         Rate     What triggers it
    actor-start   $0.05    Once per run, at Actor boot
    result-row    $0.001   Per chat message pushed to the dataset

    Pull               Cost
    100 messages       $0.15
    1,000 messages     $1.05
    10,000 messages    $10.05
    100,000 messages   $100.05

A 6-hour broadcast from a busy channel produces roughly 10,000 to 30,000 messages, so $10.05 to $30.05 a run. Apify's free trial gives every new account $5 of credit, no card required: about 4,900 messages, enough for two or three shorter VODs.

The offset-based pagination trick is undocumented anywhere, so it is worth one paragraph. Twitch's VOD-chat endpoint supports two pagination modes. Cursor mode (get a cursor per page, pass it forward) trips the KPSDK integrity challenge on the second request of any session.
Offset mode (pass the last message's contentOffsetSeconds + 1 as the next page's start) does not (verified across repeated calls, 2026-05-16). The challenge needs a full browser to solve, so the "obvious" implementation fails at scale and the working one requires knowing which mode to pick before you write a line of code. This Actor uses offset mode exclusively. The Apify proxy documentation covers the residential-proxy rotation that keeps long-VOD runs alive.

VOD chat only, no live chat. Live chat is a separate IRC-over-WSS protocol. Once the broadcast ends and the VOD is processed, its chat becomes accessible; while live, it is out of scope.

Subscriber-only chat returns zero messages. Anonymous queries against a subscriber-gated VOD get empty edges, not an error. The Actor reports this in the run status message.

VOD expiry. Twitch deletes past broadcasts after a retention window that depends on the account type: per Twitch's help pages at the time of writing, 7 days for regular accounts, 14 days for Affiliates, and 60 days for Partners, Turbo, and Prime users. The Actor cannot recover chat for an expired VOD.

No moderator action log. Bans, timeouts, and deleted messages are not in the public chat-replay endpoint. Deleted messages may appear as <message deleted> or be absent.

Persisted-query hash dependency. The Actor relies on a specific GraphQL persisted-query hash for VideoCommentsByOffsetOrCursor. If Twitch rotates it after a schema change, the Actor returns PersistedQueryNotFound and exits non-zero; we then pull the new hash and ship a patch.

Is scraping Twitch VOD chat legal?
The Twitch Terms of Service govern what you may do with the data.

Can I export to CSV or connect this to a warehouse?
ACTOR.RUN.SUCCEEDED, or pull the dataset directly via the Apify API using the run's defaultDatasetId.

How is this different from the open-source chat-downloader tool?
chat-downloader is a Python library you run and maintain locally. This Actor runs on Apify's infrastructure: no local environment, no proxy management, no dependency pinning.
For one-off pulls the library is fine; for scheduled runs, multi-VOD batch jobs, or team use, the hosted Actor fits better.

Why does a long VOD take a while?
429. The proxyConfiguration field defaults to RESIDENTIAL for this reason.

The Actor is live on the Apify Store: apify.com/DevilScrapes/twitch-vod-chat-archive. Free $5 trial credit, no credit card. Run it on a VOD from your favorite channel and you will have the first 5,000 messages in a typed dataset in minutes. Need a field that is not there yet? Drop it in the comments. I ship what people actually need.

Built by Devil Scrapes: Apify Actors with attitude. Pay-per-event, transparent pricing.
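
Appendix: a sketch of the hype-peak pattern. The message-velocity idea from the use cases above (count messages per 30-second window, then look for spikes) can be prototyped in a few lines once the rows are exported. This is a minimal sketch, not part of the Actor: the function names message_velocity and top_windows are hypothetical, and the sample rows are made up and carry only the message_offset_seconds field from the row schema shown earlier.

```python
from collections import Counter


def message_velocity(rows, window_seconds=30):
    """Count messages per fixed window, keyed by the window's start offset (seconds)."""
    counts = Counter()
    for row in rows:
        # Floor each offset to the start of its window, e.g. 95 -> 90 for 30 s windows.
        window_start = (row["message_offset_seconds"] // window_seconds) * window_seconds
        counts[window_start] += 1
    # Windows with no messages are simply absent from the result.
    return dict(sorted(counts.items()))


def top_windows(velocity, n=3):
    """Return the n busiest windows as (window_start, count), busiest first."""
    return sorted(velocity.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample rows standing in for an exported dataset.
sample_rows = [{"message_offset_seconds": s} for s in (12, 14, 29, 31, 95, 96, 97, 98, 119)]

velocity = message_velocity(sample_rows)
peaks = top_windows(velocity, n=2)
print(velocity)
print(peaks)
```

On real data you would pass the rows from iterate_items() instead of sample_rows, and compare each window against a rolling baseline rather than taking a fixed top-n, since busy channels have a high floor.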



