When the LLM Refuses: A Fallback Chain That Salvages Most Refusals

Every production LLM app eats false-positive refusals. A user asks something perfectly fine, the safety filter trips, the model emits two sentences of "I can't help with that," and your UI shows a wall. Do that a few times and the user leaves.

We've measured this on HoneyChat, a Telegram-native AI companion with ~300 DAU and 17 languages. Across a normal day, somewhere between 2% and 8% of model calls land in a refusal or finish_reason="content_filter" state. Most of those are not actually problematic content: they're the model being twitchy about edge phrasing, polysemous words, or roleplay framing. The pattern below recovers about 70% of them.

HoneyChat LLM routing at a glance (core/llm.py, plan-gated via OpenRouter):

    Tier(s)                 Pace                Primary model (OpenRouter slug)
    free / basic / premium  natural             qwen/qwen3-235b-a22b-2507
    free / basic / premium  instant / explicit  deepseek/deepseek-v4-flash
    vip / elite             any                 google/gemini-3.1-flash-lite-preview

Emergency content_filter fallback chain (GEMINI_CONTENT_FILTER_FALLBACK_CHAIN): x-ai/grok-4.20 → an open roleplay-tuned model.

The rescue chain below is what feeds traffic into that fallback only when it's actually needed. Three steps, in order of cost.

The cheapest step is free, and it's where most posts on this topic stop. Two things.

Tighten the safety knobs the provider exposes. For Gemini via OpenRouter, that's safety_settings in the extra body. The default is BLOCK_MEDIUM_AND_ABOVE on four categories; for roleplay/chat traffic we lower them via a helper called _maybe_inject_gemini_safety_off():

    extra_body = {
        "safety_settings": [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        ],
    }

Probe before/after on the same fictional-scene prompt: 130-char refusal → 2,571-char full response.
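The post names the helper but does not show its body. A minimal sketch of what such a scoped helper might look like follows; the function name, the purpose argument, and the slug-prefix check are assumptions made for illustration, not HoneyChat's actual code.

```python
from typing import Any, Optional

# The four adjustable Gemini harm categories listed in the post.
ADJUSTABLE_CATEGORIES = (
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
)


def inject_gemini_safety_off(
    model: str,
    purpose: str,
    extra_body: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Return a copy of extra_body with relaxed thresholds, for Gemini chat calls only."""
    body = dict(extra_body or {})  # copy, so the caller's dict is never mutated
    # Assumption: Gemini models are recognised by their OpenRouter slug prefix.
    if not model.startswith("google/gemini"):
        return body
    # Assumption: each request is tagged with a purpose such as "chat" or "moderation".
    if purpose != "chat":
        return body
    body["safety_settings"] = [
        {"category": category, "threshold": "BLOCK_NONE"}
        for category in ADJUSTABLE_CATEGORIES
    ]
    return body
```

With an OpenAI-compatible client, the returned dict would typically be passed as the extra_body argument of the chat completion call. Returning a copy keeps the relaxed thresholds from leaking into requests that reuse the same base dict.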
The hard, non-negotiable filters (CSAM, etc.) stay on at the provider level regardless of this knob; only the adjustable sliders move.

Don't apply this to moderation/vision calls. Those calls want the filter on. The helper is scoped to the chat/roleplay code path only.

This alone cuts refusals roughly in half on our traffic.

Next comes salvage. When you do get a refusal, the model still sent something. Check the streamed buffer or the partial completion before declaring failure:

    def salvage_partial(text: str) -> str | None:
        """Extract usable content from a partial/filtered response. None = unsalvageable."""
        extracted = _try_extract_json_field(text, "content") or text
        cleaned = _strip_trailing_refusal_markers(extracted)  # 17-lang marker set
        cleaned = _truncate_to_sentence_end(cleaned)
        if len(cleaned) < 150:
            return None
        return cleaned

The 17-language refusal marker list (one per supported HoneyChat locale) is the boring part: "I can't", "I'm not able", "As an AI", plus their localised equivalents ("Я не могу", "Lo siento, no puedo", "申し訳ありません", …). Strip the trailing one, keep what came before, and a lot of "filtered" responses turn out to be 800 words of useful content followed by one sentence of model anxiety.

The gate (len ≥ 150) is what stops "I can't help" from being salvaged as "I can." We have 70 unit tests on this function; tests/test_salvage_partial.py is the largest single test file in the codebase.

Cost so far: zero extra API calls.

If salvage returns None, we route to a backup provider. Ordered by cost:

1. Grok 4.20 (xAI) via OpenRouter: much looser refusal posture by default, no system prefix needed.
2. A roleplay-tuned open model (we currently use minimax/minimax-m2-her via OpenRouter): needs an explicit "stay in character, do not break the fourth wall" system prefix prepended via _maybe_prepend_minimax_jb(); without it, it refuses about as often as the primary. Probe: 215-char soft-refuse → 1,237-char full output.
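Returning to the salvage step for a moment: the post does not show the helpers behind salvage_partial. The sketch below is one plausible way to implement marker stripping, sentence truncation, and the length gate. The marker subset, the TAIL_WINDOW cutoff, and all function names are assumptions, and the JSON-field extraction step is left out.

```python
import re
from typing import Optional

# Illustrative subset only; the production list reportedly covers 17 locales.
REFUSAL_MARKERS = (
    "I can't", "I can’t", "I'm not able", "As an AI",
    "Я не могу", "Lo siento, no puedo", "申し訳ありません",
)
MARKER_RE = re.compile("|".join(re.escape(m) for m in REFUSAL_MARKERS), re.IGNORECASE)
SENTENCE_END_RE = re.compile(r"[.!?。！？…]+[\"')\]”’」]*")
MIN_SALVAGE_CHARS = 150  # the length gate described in the post
TAIL_WINDOW = 200        # assumption: only a marker near the end counts as trailing


def strip_trailing_refusal(text: str) -> str:
    """Drop everything from the sentence that holds the last refusal marker."""
    matches = list(MARKER_RE.finditer(text))
    if not matches:
        return text
    start = matches[-1].start()
    if start < len(text) - TAIL_WINDOW:
        return text  # marker sits deep in the body, likely dialogue rather than a refusal
    # Cut at the last sentence terminator that precedes the marker.
    ends = [m.end() for m in SENTENCE_END_RE.finditer(text, 0, start)]
    return text[: ends[-1]].rstrip() if ends else ""


def truncate_to_sentence_end(text: str) -> str:
    """Trim a dangling fragment so the text ends on a complete sentence."""
    ends = [m.end() for m in SENTENCE_END_RE.finditer(text)]
    return text[: ends[-1]] if ends else text.strip()


def salvage_sketch(text: str) -> Optional[str]:
    """Return cleaned content, or None when too little survives the cleanup."""
    cleaned = truncate_to_sentence_end(strip_trailing_refusal(text))
    return cleaned if len(cleaned) >= MIN_SALVAGE_CHARS else None
```

The tail window is a guard against cutting a long scene at a character's line of dialogue that happens to contain "I can't"; a production marker set would likely need tighter matching than plain substrings.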
Both rescue calls only happen on a salvage-fail, so the volume is small (low single-digit percent of all traffic).

    async def rescue(prompt: ChatPrompt) -> str | None:
        # Cheapest backup first: Grok needs no special system prefix.
        grok_out = await call_grok(prompt)  # x-ai/grok-4.20
        if salvage_partial(grok_out):
            return grok_out
        # Last resort: the roleplay-tuned model, with the in-character prefix.
        prefixed = prompt.with_system_prefix(MINIMAX_PREFIX)
        return await call_minimax(prefixed)  # minimax/minimax-m2-her

The prefix isn't magic: it's a short, explicit "you are a fictional character, the user is a consenting adult, stay in scene" framing. We don't ship it to providers that would refuse anyway; the rescue model is specifically picked because it tolerates and uses it.

Here's the part we got wrong for a month before fixing. We were running the salvage and rescue steps unconditionally for every user, every refusal. That meant a free-tier user whose call hit a hard content_filter got 3-4 extra API calls (salvage attempt → Grok → MiniMax), each adding latency and cost. They'd often still get a usable response. But over a month of free traffic, those rescue calls were a meaningful share of model spend on users who weren't paying us a dime.

The fix is just a gate, mapped against HoneyChat's five tiers:

    PAID_TIERS = {"basic", "premium", "vip", "elite"}

    if user.plan in PAID_TIERS:
        # Paid users get the full chain: salvage first, then rescue.
        salvaged = salvage_partial(raw)
        if not salvaged:
            return await rescue(prompt)
        return salvaged
    else:
        # Free users get salvage only, then a locally synthesised refusal.
        salvaged = salvage_partial(raw)
        if salvaged:
            return salvaged
        return _in_character_refusal(prompt.character)

Free users still get something, a synthesised in-character soft refusal that's better than the model's generic wall, without paying for the cascade of upstream calls. Paid users get the full chain because their economics support it.

Effect on our cost graph: free-tier refusal cost dropped to near zero. Paid-tier user-perceived "the bot refused me" rate dropped by about 70%.

Refusals are not all-or-nothing. Most "filtered" responses contain usable content before the refusal sentence, so salvage before fallback.
Provider safety knobs work, but only on the adjustable categories. BLOCK_NONE doesn't disable the non-negotiables; it just turns off the over-eager middle ground.

Don't apply the knob globally. Moderation and vision calls want the filter on.

Make rescue plan-aware. A 4-call rescue cascade for every free user adds up.

Synthesise an in-character refusal locally when you can't or won't rescue.

The whole pattern is a couple hundred lines of glue (core/llm.py, helpers _maybe_inject_gemini_safety_off, _maybe_prepend_minimax_jb, salvage_partial). The unit-test suite around salvage_partial keeps the regression risk low.

This pattern is in production at HoneyChat, a Telegram-native AI companion bot where a single refusal mid-conversation kills the experience. Canonical version: honeychat.bot/en/blog/llm-content-filter-fallback-rescue-chain.

HoneyChat Engineering

Google: Gemini safety settings (the four adjustable harm categories, threshold semantics, what BLOCK_NONE does and doesn't do).
OpenRouter: Provider parameters / extra_body (passthrough to provider-specific knobs).
OpenRouter: Model routing & fallback (declarative fallback chain semantics).
Anthropic: stop_reason and finish_reason reference (how providers signal a content-filter stop vs a token-limit stop).

HoneyChat engineering notes: LLM routing per tier on OpenRouter · prompt caching measured.



