Your Agent Failed in Prod. Good Luck Reproducing It.

9:04 a.m. A ticket lands. A customer ran your agent yesterday, it called the wrong tool, deleted the wrong record, and now there is a screenshot in your inbox with a red box drawn around the damage. You have the user ID. You have the timestamp. You copy the exact prompt out of the logs, paste it into the same model, with the same system prompt, and hit run.

It works perfectly. You run it again. It works again. You run it ten more times. The agent behaves like a model employee every single time, and the one run that mattered, the one that cost a customer their data, is nowhere. You cannot make it happen again, which means you cannot debug it, which means you cannot promise it will not happen to the next customer.

This is the reproducibility problem, and if you are shipping anything built on a large language model, it is already your problem. This post is about why it happens, why some of it is actually a feature you do not want to remove, and what you can do to get back the one thing you need: the ability to replay a run exactly as it happened.

Most teams use the word "reproducibility" to mean two different things and then argue past each other. Pull them apart and the whole topic gets clearer.

The first meaning is bitwise determinism: the same input always produces the identical output, token for token. This is what you assume you have with ordinary software and what you almost never have with an LLM.

The second meaning is replayability: given a run that already happened, you can reconstruct exactly what occurred (the inputs, the sampled outputs, the tool calls, the intermediate state) well enough to debug it. You do not need the model to be deterministic. You need the run to be recorded.

The trap is chasing the first when you actually need the second. Teams spend weeks trying to force their model into bitwise determinism, fail, and conclude the system is unknowable. It is not. They were aiming at the wrong layer.

The first thing everyone tries is setting temperature to zero.
The reasoning is clean. Temperature controls randomness in sampling. Set it to zero and the model must pick the single most probable next token every time, which is greedy decoding, which should be deterministic. One input, one output, forever. In theory, yes. In practice, run the same prompt twice at temperature zero and sooner or later the outputs diverge. It often starts with one word, the sentence takes a slightly different turn, and the rest drifts away from there. The reason is the distinction that fixes most of the confusion in this whole area, and it comes from Sara Zan's write up on the topic: sampling determinism is not the same thing as system determinism. A quick piece of vocabulary, because it shows up everywhere from here on. Before the model emits a token, it produces a raw score for every candidate token in its vocabulary. Those scores are called logits. Picking the token with the single highest logit is an operation called argmax, literally "the argument that gives the maximum." Greedy decoding is just argmax at every step. So temperature zero makes the selection rule deterministic. Always take the argmax. But it does nothing to guarantee that the logits you are taking the argmax over are identical from one run to the next. If two candidate tokens have logits that are almost tied, a difference in the last few bits is enough to swap which one wins, and once one token changes, every token after it is generated from a different prefix, so the divergence compounds. So the question becomes: why would the logits ever differ between two runs of the same model on the same input? Here is the part that surprises people who have not stared at numerical code. With real numbers, addition is associative. (a + b) + c equals a + (b + c). With floating point numbers it does not, because every intermediate result is rounded to finite precision. 
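
A quick check of that claim in plain Python, as a minimal sketch. The three terms are picked purely for illustration, and `rival` is a hypothetical stand-in for a near-tied competing logit, not a number from any real model.

```python
# Three illustrative terms that feed one accumulated score (assumed values).
a, b, c = 0.1, 0.2, 0.3

left_first = (a + b) + c    # round after a + b, then add c
right_first = a + (b + c)   # round after b + c, then add a

# Hypothetical near-tied competitor score to compare against.
rival = 0.6

print(left_first == right_first)  # False: the grouping changed the last bit
print(left_first > rival)         # True: this grouping beats the rival
print(right_first > rival)        # False: this grouping only ties it
```

The two groupings differ only in the last bit, yet that bit decides the comparison. A real forward pass performs this kind of accumulation millions of times.
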
The canonical demonstration, from the Thinking Machines write up by Horace He and collaborators:

    (0.1 + 1e20) - 1e20 = 0
    0.1 + (1e20 - 1e20) = 0.1

Same three numbers, different grouping, different answer. This is not a bug. It is the price floating point pays for representing both enormous and tiny values with a constant number of significant figures.

Now scale that up. A transformer forward pass, one full run of the model over the input, is millions of additions, multiplications, and reductions across matrix multiplications, normalizations, and attention. Change the order in which any of those reductions accumulate and you change the last few bits of the result. Change the last few bits of a logit and you can change which token is the argmax. That is the chain from low level arithmetic all the way up to a different sentence.

The common explanation stops at floating point plus concurrency. In one line: thousands of GPU threads finish in an order nobody controls, and because floating point addition is not associative, adding the same numbers in a different order gives a slightly different sum, so the output wobbles from run to run. It sounds complete. It is wrong, and the Thinking Machines analysis is the clearest debunking of it.

Here is the inconvenient fact that breaks the popular story. Run the same matrix multiplication on the same GPU on the same data a thousand times and you get bitwise identical results every single time:

    import torch

    # Two fixed bfloat16 matrices on the GPU.
    A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
    B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
    ref = torch.mm(A, B)
    # Repeat the same product and require a bitwise identical result each time.
    for _ in range(1000):
        assert (torch.mm(A, B) - ref).abs().max().item() == 0

Floating point is in play. Massive concurrency is in play. And yet the result is perfectly reproducible. So concurrency plus floating point cannot be the whole answer.

The true culprit is batch invariance, or rather the lack of it. Production inference servers do not run your request alone.
They batch it together with whatever other requests happen to arrive at the same moment, for efficiency. The kernels, the low level GPU routines that compute your output, run reductions inside normalization, matrix multiplication, and attention whose results depend on the shape of the batch they ran in. The forward pass is deterministic for a fixed batch. But the batch is not fixed. It depends on concurrent load, on who else is hitting the server in the same millisecond, on conditions you do not control and cannot see.

So your prompt is identical, your parameters are identical, and the thing that changed is the company you were keeping inside the server. This is also why a prompt looks rock solid in local testing and turns flaky in production. The model did not get more creative. The batching conditions changed.

The Thinking Machines team showed both the scale of the problem and the fix. Running standard vLLM, a thousand identical temperature zero requests to a Qwen3 235B model produced eighty distinct completions. With batch invariant kernels, the ones that produce the same result regardless of batch shape, the same thousand requests produced exactly one. The cost was real but modest: in one of their throughput tests, run on Qwen-3-8B, the time went from twenty six seconds to forty two. Their library, batch-invariant-ops, has since been picked up by SGLang. The three operations that have to be made batch invariant are RMSNorm (a normalization step), matrix multiplication, and attention.

The lesson: true bitwise reproducibility is achievable, but only by controlling the entire inference stack down to the kernels. Almost no one calling a hosted API has that control.

In one line: a mixture of experts model is one large network split into many smaller specialist subnetworks, with a router that sends each token to only a few of them instead of running the whole model every time. Many frontier models are built this way, and the architecture is a second independent source of the same problem.
If that routing were per token and independent, it would be deterministic. It is not, and the reason is a number called the capacity factor. Each expert can only process so many tokens in a given batch. That ceiling is set by the capacity factor, which caps how many tokens one expert will accept before it is full. When too many tokens in a batch all want the same expert, the ones over the limit cannot all be served. The overflow tokens get bumped to their second choice expert, or dropped from that layer entirely. So whether your token reaches its first choice expert depends on how many other tokens in the same batch were competing for it.

That is the same trap as batch invariance, wearing a different costume. The routing decision for your token is not a function of your token alone. It is a function of the whole batch your token landed in. As Vincent Schmalbach lays out, this makes a mixture of experts model deterministic at the batch level and nondeterministic at the level of a single sequence. Send the same prompt twice, get it batched with different neighbors, and the capacity math resolves differently, so your tokens route differently. Same root cause, a second mechanism delivering it.

Sampling and kernels are only the inference layer, and they are just two of about eight things that moved under you between yesterday's run and today's. In a real agent they are often among the most stable. Here are the other six, and every one of them can change the output even if the model itself were frozen solid.

The prompt is rarely fixed. Interpolate the date, the user's name, a feature flag, or a sampled few shot example, and the "same" prompt is not the same prompt.

The context is assembled at runtime. Retrieval pulls from an index that updates continuously, so yesterday's chunks are not today's chunks.

Tools return live data. A weather call, a database read, a search API each return something different every time, and the model reasoned over a world state you did not capture.

Time leaks in. "Schedule it for next Tuesday" resolves to a different date depending on when it runs.

The model version drifts. The gpt-4o or claude you called last month may be a different set of weights this month, with no version bump you controlled.

Conversation history accumulates. In a multi turn agent, earlier turns are part of the input, so if any one of them varied, every later turn inherits it.

This is the part most reproducibility discussions miss by staring only at temperature. The sampler is one knob on a machine with eight. To reproduce a run you have to pin all eight, not just the one.

Before going further, we have to be honest about something, because the obvious reaction to everything above is "fine, make it all deterministic and be done." Do not. If you could flip one switch and make your model perfectly deterministic, token for token, forever, you should not flip it. The nondeterminism that wrecks your reproducibility is the same property that makes the model good. The argument has four parts, and most teams only know the first.

Quality: greedy decoding is not safe, it is broken. The intuition is that always taking the single most probable token is the careful, conservative choice. It is not. Holtzman and colleagues showed in their 2020 nucleus sampling work that maximization based decoding, greedy and beam search, drives open ended generation into bland, repetitive, looping text that humans immediately recognize as machine written. Their conclusion was blunt: maximization is the wrong objective for open ended text generation. The fix is to sample, but only from the reliable head of the distribution, truncating the unreliable tail. That is nucleus sampling, the top-p knob, usually set around 0.95. The variation is not decoration. Switch it off and the prose collapses.

The knobs, named.
When we say variation we mean three specific controls. Temperature reshapes the distribution before sampling: low values near 0.2 make it peaky and conservative, higher values near 0.8 to 1.0 flatten it and admit more surprising tokens. Top-k restricts sampling to the k most likely tokens (Fan and colleagues). Top-p, nucleus sampling, restricts it to the smallest set of tokens whose probability mass exceeds p (Holtzman and colleagues). These are the levers. Everything downstream is a consequence of how you set them. Accuracy: sampling can make the model more correct, not less. This is the part that converts skeptics, because it is a number rather than a preference. Self consistency (Wang and colleagues, ICLR 2023) throws away the single greedy answer entirely. It samples many diverse reasoning paths, around forty, at temperature 0.7 with top-k 40, then takes the majority vote over the final answers. The gains are large and consistent: plus 17.9 percent on GSM8K, plus 11.0 on SVAMP, plus 12.2 on AQuA, plus 6.4 on StrategyQA, plus 3.9 on ARC challenge. The mechanism is the one that makes random forests beat a single decision tree. Diverse samples, aggregated, beat one confident guess. Determinism would have handed you exactly one path, and a worse answer. Exploration: agents need to try things. Anything that searches depends on variation. Best of N sampling generates many candidate completions and keeps the best under some scorer, and coverage, the chance that at least one of N samples is correct, climbs with N only because the samples differ. Agent loops that retry a failed tool call, propose alternative plans, or branch are running the same exploration versus exploitation tradeoff that reinforcement learning has always lived on. A perfectly deterministic agent retries the identical failing action forever. Variation is what lets it escape. Discovery: sampling has found things humans had not. The strongest version of the argument is no longer hypothetical. 
DeepMind's FunSearch (Nature, 2023) paired a pretrained LLM with an automated evaluator in an evolutionary loop: sample candidate programs, keep the ones that score, mutate those, repeat. It solved the cap set problem in extremal combinatorics, a question Terence Tao had called a favorite open problem, producing the first new discovery by an LLM on a problem of that difficulty, in collaboration with Prof. Jordan Ellenberg. Its successor AlphaEvolve (2025) used an ensemble of Gemini models as mutation operators to evolve entire codebases, and the results shipped. A data center scheduling heuristic that recovers on average 0.7 percent of Google's worldwide compute and has run in production for over a year. A matrix multiplication kernel sped up 23 percent that cut Gemini's own training time by 1 percent. A procedure to multiply two four by four complex matrices in 48 scalar multiplications, the first improvement over Strassen's algorithm in that setting in 56 years. A later study with Tao and collaborators ran it across 67 problems in analysis, combinatorics, geometry, and number theory. None of that happens with the temperature pinned to zero. The diversity of the samples is the search. So which do we want, variation or determinism? Both, and the reason they do not contradict is that they live at different layers. We want variation at generation time, because that is where quality, accuracy, exploration, and discovery come from. We want determinism at replay time, because that is where debugging, regression testing, and incident response come from. The mistake teams make is trying to buy reproducibility by killing generation time variation. That is the wrong layer, and it costs you everything in the four sections above. You do not freeze the model. You capture what it did. The inputs, the sampled outputs, the tool calls, the retrieved context, the model version, the timestamp. Then you replay the captured run, not a fresh generation. Keep the creativity. 
Record the evidence.

The technique that resolves the whole tension is borrowed from a decades old idea in software testing: record the real interaction once, replay it forever. There are three distinct jobs it does, and it helps to keep them separate.

Post mortem debugging. When the 9:04 ticket arrives, you do not re run the model and hope. You pull the recorded run: the exact assembled prompt, the exact sampled completion, the exact tool inputs and outputs, the retrieved chunks, the model version string. Now the bad run is in front of you, frozen, and you can actually trace what happened. This is the capability you were missing in the opening story.

Concretely, the recording is one envelope per run. Capture it on the way out, so the agent writes its own black box recorder:

    record({
        "run_id": run_id,
        "timestamp": now_iso(),
        "model": resolved_model_version,   # not the floating alias
        "params": {"temperature": 0.7, "top_p": 0.95},
        "system_prompt_hash": sha256(system_prompt),
        "messages": messages,              # the assembled prompt
        "retrieved_chunks": [c.id for c in chunks],
        "tool_calls": tool_calls,          # name + args, as sent
        "completion": completion,
    })

When the ticket lands, that envelope is the whole crime scene, frozen:

    {
      "run_id": "a3f9c1",
      "messages": [{"role": "user", "content": "clean up the inactive accounts in staging"}],
      "retrieved_chunks": ["runbook_staging_cleanup"],
      "tool_calls": [{"name": "delete_accounts",
                      "args": {"target": "production", "filter": "status = inactive"}}],
      "completion": "Done, I cleared the inactive accounts."
    }

Now look at what the envelope rules out. The user said staging. The retrieved chunk runbook_staging_cleanup is the correct runbook and it says staging. The assembled prompt is clean. And yet the tool call went to production. Nothing in the context explains the swap, and that is the whole point. The retrieval was right and the prompt was right, so the failure did not live in your data pipeline. It lived in generation.
Your request was batched with sixty four others that millisecond, two candidate tokens for that argument sat almost tied, one logit crossed its neighbor, and production won where staging should have. Replay the same prompt alone and it behaves, because the batch that tipped it is gone. The envelope is what lets you say that with confidence: the inputs were perfect, so stop grepping your retriever and go read the sampler. This is the failure the first half of this post was about, caught in the act. Capturing all of this in production is not free, and that is the honest tension. Recording every run means writing the assembled prompt, the retrieved chunks, every tool input and output, and the model version to durable storage on the hot path, which costs storage and adds a little latency to each request. Those payloads also carry whatever the user typed and whatever your retriever pulled, which in most enterprises means customer PII headed for durable storage, so a deterministic redaction pass has to scrub the envelope before it ever reaches the recorder, not after it lands. Open instrumentation standards like OpenInference, and tracing backends like Phoenix, exist to make this routine: they capture the spans of an agent run as structured telemetry and stream the payloads to a data store you can query later. The practical move is to record the full envelope for everything in production but down sample or expire it, keep every run for a few days so the 9:04 ticket is always answerable, and keep the interesting runs, the failures, the flagged ones, forever. The same envelopes you captured in production are what you replay in CI. Experience reuse. A recorded run is also a cache. If the same inputs come around again, you can serve the recorded output instead of paying for another generation, which is faster and free. Deterministic CI. 
CI is the automated test suite that runs every time someone pushes code, and you want it deterministic, meaning the same code always gives the same pass or fail instead of flaking at random. This is where most teams adopt record and replay first, and the motivation is brutal and practical. The Learnixo write up names the three problems precisely. Cost: every real call to a hosted model during CI burns budget, and at fifty developers times ten pull requests times twenty tests, it adds up fast. Non determinism: a test that asserts the output equals an exact string fails most of the time, because the model does not return the same string twice. Latency: a real call takes two to ten seconds, so a suite with thirty of them takes minutes, which kills the fast feedback loop that makes CI worth having. Record and replay fixes all three at once. Record the real responses once, replay them on every subsequent run. Tests become free, deterministic, and fast. There is a catch that a sharp reviewer will find in about ten seconds, so let us find it first. Record and replay is a superb post mortem tool. It is a bad fix verification tool, and the reason is the same nondeterminism we have been chasing the whole way down. Walk the loop. The 9:04 envelope tells you the agent emitted production where it meant staging. You write a fix: a tighter system prompt, a guard on the tool, a reworded instruction. Now you want to prove the fix works. But the moment you change the prompt, the input hash changes, so the recorded run no longer matches and your replay is a cache miss. A miss falls through to a live call, and a live call is back in the land of batching and logit flips, the exact thing you could not reproduce in the first place. Even with the input held byte for byte identical, regenerating re batches your request with whoever else is on the server, so the flip you are trying to squash may simply not fire today. 
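
The cache miss half of that loop is easy to see in a sketch. The `request_key` helper, the model string, and the two system prompts below are hypothetical, and the `recorded` store is a plain dict standing in for whatever fixture file you actually use.

```python
import hashlib
import json


def request_key(model: str, system_prompt: str, user_message: str) -> str:
    """Hash the logical request; a change to any field yields a new key."""
    payload = {"model": model, "system": system_prompt, "user": user_message}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


MODEL = "example-model-2025-01-01"  # hypothetical pinned version string
USER = "clean up the inactive accounts in staging"
OLD_PROMPT = "You are an ops agent."
NEW_PROMPT = "You are an ops agent. Never target production unless asked."

# The recording made in production, keyed by the request as it was sent.
recorded = {request_key(MODEL, OLD_PROMPT, USER): {"target": "production"}}

hit = recorded.get(request_key(MODEL, OLD_PROMPT, USER))
miss = recorded.get(request_key(MODEL, NEW_PROMPT, USER))

print(hit)   # the frozen bad run, replayed from the recording
print(miss)  # None: the fixed prompt has no recording, so it would go live
```

The original request replays the bad run every time. The fixed prompt has nothing to replay against, so the only way to exercise it is a fresh generation.
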
You cannot confirm a fix by replaying the model, because replaying the model is not deterministic and a fix by definition changes the input. The way out is to stop asking record and replay to do a job it cannot, and to split testing into two layers. Layer one, exact match replay for control flow. Freeze the captured context as a fixture and assert on structure, not prose. Given this exact prompt and these exact retrieved chunks, does the agent take the same path, call the same tool, with an argument of the right shape and the right target? This layer is deterministic and free because it never calls the model. It catches the regression that matters most here: the guard you added must make the destructive target impossible, and a frozen fixture proves it without a single live token. Layer two, semantic judgement for the parts that are allowed to change. When the thing you changed is the wording of a prompt or the model version, bitwise equality is the wrong assertion, because the whole point of the change is that the text will differ. Here you run the candidate against the recorded context and score the output with an evaluator, an LLM as a judge, that asks whether the new answer means the same thing as the old one rather than whether it matches word for word: did the answer stay grounded in the chunk, did it refuse the destructive call, did it preserve the meaning of the gold response. The recorded envelope becomes the regression fixture, the judge accepts any output that means the right thing. That is the loop closed. The envelope you captured in production verifies structure deterministically and meaning semantically, and neither layer asks a nondeterministic system to repeat itself on command. There is a small ecosystem for this in Python, and the right answer is a layered strategy rather than a single tool. But first, a warning about which layer you record at, because the obvious one is the wrong one. 
The instinct is to mock the network: intercept the HTTP call to the model and replay the bytes. For a single synchronous request that works. For a real agent it breaks, and it breaks in exactly the conditions you ship in. Token streaming with stream=True turns one response into a long lived chunked transfer that network cassettes mangle. Concurrent asyncio event loops interleave several model calls over the same connection. HTTP/2 multiplexing carries multiple requests down one socket at once. Record at the socket and you are trying to freeze a river. Record one level up instead. Mock at the framework or orchestrator boundary, the provider your agent calls through, and override the step function of the agent loop rather than the network underneath it. Call it deterministic graph state hydration: you are capturing the internal state transitions of the execution graph, the prompt that entered a node and the structured output that left it, not the raw packets in between. This is the difference any good review will probe, raw network payloads versus the internal state machine of the agent, and the agent state is the layer that actually replays cleanly. The tools below sit at different points on that spectrum, and the first thing to know about each is which layer it records at, because one of the most popular ones records at exactly the layer this section just told you to avoid. VCR style cassettes, the legacy layer. The oldest approach in this family is VCR.py with the pytest-recording plugin, and it is worth being precise about where it sits, because it is the socket level recorder this section just warned you about. VCR.py works by monkeypatching Python's low level HTTP machinery, urllib3, aiohttp, the socket calls underneath your client, and taping the bytes that cross the wire into a YAML "cassette" on first run, then replaying those bytes on every run after. 
You mark a test and forget about it:

    import pytest

    @pytest.mark.vcr()
    def test_agent_response():
        # get_agent_response is your own code under test
        result = get_agent_response("Explain recursion in one sentence.")
        assert "recursion" in result.lower()

First run hits the real API and writes the cassette. Every run after reads from the cassette, no network, no cost, identical bytes. The one thing you must not skip: the default cassette captures your Authorization header and API key in plaintext. Redact them in your config before anything touches version control:

    @pytest.fixture(scope="module")
    def vcr_config():
        # replace the real header value before it is written to the cassette
        return {"filter_headers": [("Authorization", "DUMMY_API_KEY")]}

That is genuinely useful for the narrow case VCR was built for: a single synchronous request to a simple REST shaped endpoint, where the wire bytes and the logical call are the same thing. It is also the layer that breaks on streaming, concurrent event loops, and HTTP/2 multiplexing, which describes most real agents. So treat VCR as the historical default, the thing teams reached for before agents grew orchestration layers, not as the place to record an agent loop.

The graph boundary equivalent is to let your framework hand you the state it already tracks, so you never touch a socket. LangGraph checkpointers persist the state at each node transition, so you can freeze the input that entered a node and the output it produced and replay that pair directly. LlamaIndex workflows expose the same idea through their event stream, every step's input and output as a structured object you can capture and feed back. And when you have rolled your own orchestration, the move is a mock at the provider seam, the one function your agent calls to reach the model, returning recorded structured outputs keyed by the canonicalized request. All three record the meaning of a step rather than the packets that carried it, which is the property that survives streaming and concurrency.
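
For the roll your own case, a mock at the provider seam might look like the sketch below. Everything named here is hypothetical: `Provider`, `RecordedProvider`, `agent_step`, and the shape of the recorded output are illustrative stand-ins, not the API of LangGraph, LlamaIndex, or any other library. For brevity the recording is keyed by the raw prompt string, where a real seam would key on a canonicalized request.

```python
from typing import Protocol


class Provider(Protocol):
    """The single seam through which the agent reaches a model."""

    def complete(self, prompt: str) -> dict: ...


class RecordedProvider:
    """Replays structured outputs captured from an earlier run."""

    def __init__(self, recordings: dict):
        self._recordings = recordings

    def complete(self, prompt: str) -> dict:
        if prompt not in self._recordings:
            # Fail loudly rather than fall through to a live, nondeterministic call.
            raise LookupError(f"no recording for prompt: {prompt!r}")
        return self._recordings[prompt]


def agent_step(provider: Provider, user_message: str) -> dict:
    """One step of a toy agent loop: build the prompt, ask the provider."""
    prompt = f"user: {user_message}"
    return provider.complete(prompt)


# One recorded step: the prompt that entered the node and the
# structured output that left it.
recordings = {
    "user: clean up the inactive accounts in staging": {
        "tool": "delete_accounts",
        "args": {"target": "staging", "filter": "status = inactive"},
    }
}

step = agent_step(RecordedProvider(recordings), "clean up the inactive accounts in staging")
print(step["tool"], step["args"]["target"])  # delete_accounts staging
```

Because the agent only ever sees the `Provider` interface, the same `agent_step` runs against a live client in production and against the recording in tests, and no socket is involved in the replay.
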
That is the true graph boundary hydration the rest of this section is built on, and it is why the cleaner patterns below all mock above the wire, not on it.

Exact match fixture replay. A lighter weight pattern, shown by the llm-fixture-replay library, stores each request and response pair as one line of JSONL, keyed by a SHA-256 hash of the canonicalized request. Replay looks for an exact match. Change the model, the messages, or any parameter and it is a miss, which is exactly what you want, because a changed input should invalidate the recording. Auto mode replays on a hit and records on a miss, so a new test extends the fixture file automatically, and committing that file makes every later run fully offline.

The whole core is about ten lines. Hash the call arguments, look for that key in the fixture, replay on a hit, and call the real function only on a miss:

    import hashlib
    import json

    # A method on the fixture class; self._entries and self._path are its state.
    def call(self, fn, **kwargs):
        key = hashlib.sha256(
            json.dumps(kwargs, sort_keys=True, default=str).encode()
        ).hexdigest()
        for entry in self._entries:
            if entry["key"] == key:
                return entry["response"]      # replay on hit
        response = fn(**kwargs)               # record on miss
        self._entries.append({"key": key, "response": response})
        with self._path.open("a") as f:       # append one JSONL line, then close the file
            f.write(json.dumps({"key": key, "response": response}, default=str) + "\n")
        return response

Sort the keys before hashing so the same request always lands on the same key. That one line is what makes the lookup stable across runs.

Sorting the keys is necessary but not sufficient, and this is where the pattern quietly fails on real agents. json.dumps(kwargs, default=str) is stable for a flat dict of strings. Point it at an agent state full of nested Pydantic models, datetimes, and system objects and default=str will happily serialize a timestamp, an object id, or a memory address that is different on every run, so the same logical request hashes to a new key each time and every lookup misses.
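
The failure is easy to reproduce. In the sketch below, `Session` is a hypothetical stand-in for any opaque object that ends up in agent state, and `naive_key` mirrors the `json.dumps(..., default=str)` hashing described above. The model name and prompt are placeholders.

```python
import hashlib
import json


class Session:
    """Hypothetical opaque object with no custom __repr__."""


def naive_key(request: dict) -> str:
    """Hash a request the naive way: sorted keys, str() for anything unknown."""
    return hashlib.sha256(
        json.dumps(request, sort_keys=True, default=str).encode()
    ).hexdigest()


# A flat dict of strings hashes to the same key regardless of key order.
flat_a = {"model": "example-model", "prompt": "hello"}
flat_b = {"prompt": "hello", "model": "example-model"}
print(naive_key(flat_a) == naive_key(flat_b))  # True

# Two logically identical requests, each carrying its own Session object.
# default=str falls back to the default repr, which embeds a memory address.
session_a, session_b = Session(), Session()
req_a = {"model": "example-model", "prompt": "hello", "session": session_a}
req_b = {"model": "example-model", "prompt": "hello", "session": session_b}
print(str(session_a))                        # something like <...Session object at 0x...>
print(naive_key(req_a) == naive_key(req_b))  # False: same request, different key
```

Nothing raises and nothing warns. The key simply stops matching, which is why this failure tends to show up as a fixture file that keeps growing rather than as a broken test.
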
The fix is semantic canonicalization before you hash: strip the transient metadata that has no business in the key (the timestamps, the trace ids, the run ids), stabilize whitespace, and recursively sort nested structures so two equivalent states produce one canonical form. Hash the meaning of the request, not its incidental wire encoding. Without that step your fixture grows a new entry every run and replays nothing.

Concretely, canonicalization is a recursive pass that normalizes the types you recognize and refuses the ones you do not, so an opaque object becomes a loud failure you fix rather than a silent memory address that poisons the key:

    import hashlib
    import json
    from datetime import datetime

    TRANSIENT = {"run_id", "timestamp", "trace_id", "created_at"}

    class Unserializable(TypeError):
        """Raised for any value the canonicalizer cannot stabilize."""

    def canonicalize(value):
        if isinstance(value, datetime):
            return value.isoformat()          # stable string, not the clock object
        if isinstance(value, dict):
            # drop transient keys, then sort
            return {k: canonicalize(v) for k, v in sorted(value.items())
                    if k not in TRANSIENT}
        if isinstance(value, (list, tuple)):
            return [canonicalize(v) for v in value]
        if isinstance(value, (str, int, float, bool, type(None))):
            return value
        raise Unserializable(type(value))     # opaque object: fix it, never str() it

    # Inside call(), in place of the default=str key:
    key = hashlib.sha256(
        json.dumps(canonicalize(kwargs), sort_keys=True).encode()
    ).hexdigest()

The difference from default=str is the final line of canonicalize, the raise. default=str says yes to everything, including the object whose repr changes every run, so the instability slips into the key unnoticed. Canonicalization refuses what it cannot stabilize, and that refusal is what forces the key to stay constant across runs.

Zero config mocks. When you do not even want a real response, a mock library like pytest-mockllm gives you a fixture that returns whatever you tell it, with no API key and no setup:

    def test_chatbot(mock_openai):
        mock_openai.add_response("I can help with your order.")
        response = my_chatbot.chat("I need help")
        assert "order" in response.lower()
        assert mock_openai.call_count == 1

The layering.
Use mocks for unit tests, where you are testing your own control flow and the model's content is irrelevant. Use recorded fixtures for integration tests, where you want realistic model output but deterministic and free. Keep a small number of live tests, but run them on a schedule rather than on every commit, because they do a job the fixtures structurally cannot. Record and replay buys you a deterministic CI pipeline, but it also blinds that pipeline to the one failure that originates upstream: a provider silently changing the weights behind a stable alias. An exact match fixture sails through that change, because it never makes the call and faithfully replays yesterday's cached response, while production breaks the instant the real model shifts. A scheduled live canary, a handful of real calls run nightly against the pinned alias, is the only thing watching that seam, and it is also where a prompt change that silently degrades quality finally shows up. One more practical move from the Learnixo playbook: swap the real client for a mock behind an environment flag and a dependency injection point, so the same code path runs both ways and you flip it per environment. Pulling it all together into something actionable. Stop chasing bitwise determinism through the API. Unless you own the inference stack down to the kernels, you cannot get it, and you would not want to pay its quality cost if you could. Pin everything you can pin. Pin the model version explicitly rather than trusting a floating alias. Log it with every call so you know when it drifts under you. Capture the full run, not just the prompt. Record the assembled prompt, the parameters, the sampled output, every tool input and output, every retrieved chunk, and the timestamp. The model is one of eight moving parts. Record all eight. Replay for debugging, do not regenerate. When something breaks, reconstruct the frozen run from the recording. 
A fresh generation is a different run and tells you nothing about the one that failed.

Record at the graph boundary, and test in two layers. Capture the state transitions of the agent loop, not raw sockets, so streaming and concurrency cannot corrupt the recording. Then split your suite: exact match replays on the frozen context for control flow, and an LLM as a judge scoring semantic equivalence for anything whose wording is allowed to change. The first layer proves structure, the second verifies a fix without asking a nondeterministic model to repeat itself.

Keep generation time variation alive. Do not let the pursuit of reproducible tests push you into greedy decoding in production. Determinism belongs in replay, not in generation.

It would be dishonest to end on a tidy bow. Several hard parts remain open. Batch invariant kernels exist but are not the default, and they cost throughput, so most hosted providers do not run them and you cannot make them. Recording the full context of an agent run is straightforward in principle and tedious in practice, and the more tools and retrieval your agent uses, the more surface there is to capture and the easier it is to miss one. Model version drift on hosted endpoints is largely outside your control, and a provider can change the weights under a stable name. And there is a genuine philosophical tension we did not resolve so much as relocate: the field is actively building systems whose value comes from exploring nondeterministically, while simultaneously needing those same systems to be auditable and reproducible. Those two goals pull in opposite directions, and the layered answer, vary in generation, freeze in replay, is the best current reconciliation, not a final one.

The ticket that opened this post is unanswerable in a world where you can only rerun the model and watch it behave. It becomes routine in a world where every run was recorded.
The customer's run is right there, frozen, with the prompt and the tool calls and the retrieved context exactly as they were. You see the agent call the wrong tool, you see the input that led it there, and you write the fix. You did not make the model deterministic. You never needed to. You made the run reproducible, which is the only thing you needed all along.

References

Holtzman, Buys, Du, Forbes, Choi. "The Curious Case of Neural Text Degeneration." ICLR 2020. arXiv:1904.09751.

Wang, Wei, Schuurmans, Le, Chi, Narang, Chowdhery, Zhou. "Self-Consistency Improves Chain of Thought Reasoning." ICLR 2023. arXiv:2203.11171.

He and collaborators. "Defeating Nondeterminism in LLM Inference." Thinking Machines Lab, Sep 10 2025.

Zan. "Setting the temperature to zero will make an LLM deterministic?" Mar 24 2026.

Romera-Paredes and colleagues. "FunSearch." Nature, 2023.

AlphaEvolve white paper, DeepMind, 2025.

Georgiev, Gomez-Serrano, Tao, Wagner. "Mathematical exploration and discovery at scale." arXiv:2511.02864.



