Coding is solved. The factory isn't.

Highly opinionated, based on my personal experience. Not a prescription.

I'm building a multi-repo personal code factory. I don't spec it up front: I dogfood it day by day, using it and asking for improvements or fixes when something breaks. The architectural decisions still can't be made blindly by the models, so daily use is how the system finds its shape.

Two qualifiers about scope, then four claims.

Personal means local, for better and worse. I call this a personal code factory because it has no business running anywhere but on my laptop. There's no auth layer, no audit log, no sandbox between the agent and my git remotes. It has my GitHub tokens, my GitLab tokens, my Slack credentials, my pass store.

The other side of that coin is that local makes it stealthy. I can use it without bothering anyone in the company. I don't have to ask DevOps for anything, there's no IT-security review to clear, no team-practice committee to harmonize with, no central infra to wait on: it just automates things I would otherwise do by hand. That removes every external bottleneck, and it's what lets me experiment fast.

The downside is the same thing said the other way: because it's mine, it doesn't help anyone else, and it's nowhere near as efficient as something that would run on GitLab or Slack directly. This is a POC. If it turns out to work, the right move is to promote it to actual company infrastructure.

Multi-repo means a particular kind of hard. I run it on multi-repo because that's what my work looks like. If your code lives in a single repo, a lot of what I describe in this series either disappears or shows up differently. I'm not claiming the multi-repo case is the interesting one, just that it's the one I have.

Coding is solved. That's Cherny's phrase, and I think he's right. It took four things, and they're not the same thing.

The model: capable enough to write the code.
The harness: what lets the model act instead of just emitting text, so it can read the repo, run the tests, iterate, fix. (Mine is Claude Code, but the principle isn't tied to it.)
A layer of deterministic constraints: checks that keep the output converging toward quality instead of tech debt. I work in Python, so for me that's ruff, ty, and tach run through prek, plus gitleaks and a stack of project-specific hooks. Different language, different tools: the constraint is the point, not the toolchain.
And skills: written guidance that gives the model the business and project knowledge to make the right call in this codebase, not a generic one.

Take any one of the four away and it stops working. What none of them guarantees is that the architecture is right, and that is the next claim.

The factory around it isn't solved. I don't think you can specify it up front. There are two ways to get a system that builds and ships software for you. One: write the spec (every edge case, every failure mode, every integration), hand it to an agent, let it build. Two: use it every day and fix what breaks.

I don't believe in the first, or at least I wouldn't try it. A spec for a system that builds, reviews, and ships software ends up being more or less the system itself: you don't find out which edges bite until they bite. And the architectural calls inside it still can't be made blindly by the models, so the spec would have to make all of them in advance. That's the part I don't see working yet.

That leaves the second way: dogfooding. Using the thing every day, fixing what breaks, keeping it running tomorrow.

Dogfooding fuses three things into one loop in a way no spec can: verifying the system works, improving it where it's wrong, and keeping it running long enough to do both. The first two are the same act: you verify by trying to use it, and the parts that don't work are the parts you fix. Making that verification less manual splits into two halves.
The proactive half is a test suite that checks whether the agent behaves as intended (did it reach for the right tool, did it avoid the wrong one), so a behavior regression shows up as a red test instead of going unnoticed for days. I'm only starting on these: a handful of behavioral scenarios plus the deterministic checks around them, noisy enough that I don't lean on them yet.

The reactive half is a runtime hook that catches a bad action as it happens and refuses it: the backstop for when the agent misbehaves anyway. I lean on these hooks far more today. But every backstop I need is something the proactive half didn't catch in time. If the evals and the agent were good enough, the gates would be dead weight. They aren't yet, so I keep both.

The third thing in the loop is the precondition. Self-improvement and resilience are two sides of the same coin. A system that shuts down can't keep improving itself. If I had to pick which matters more, it's resilience, because improvement stops the moment the loop stops. You don't get either by specifying them. You get both by running the thing every day and refusing to let it stay broken.

So who orchestrates the loop? That's the last claim.

Orchestration looks like the part that stays human: holding the big picture, deciding what gets attention first, noticing when two threads are about the same thing, deciding what to keep and what to drop. In teatree most of it already runs without me. One orchestrator with the big picture, not a swarm: it arbitrates and hands the actual work to sub-agents. What still needs me is basically troubleshooting and steering, and I assume the loop can't be fully closed as long as the behavioral evals are missing.

I'll try to publish roughly one post a week. Each one covers something I keep getting wrong and am trying to get less wrong:

Part 1: Software engineering became software architecture. Deterministic constraints solve code quality.
Nothing solves whether the architecture is right. That's what's left for a human, and there is no gate for it.
Part 2: Suppose the skill is never followed. Why I treat prose guidance as decorative and everything that matters as a hook, why I had to invent memory because skills aren't reliably read, and how I'm starting to write evals (a test suite that checks whether the agent actually behaved as planned) so I catch a skill being ignored before a hook has to.
Part 3: Make yourself an optional reviewer. The closed-loop part, including the surprisingly hard subproblem of letting the system merge PRs without my approval without it feeling reckless.
Part 4: One orchestrator, many loops. Why I run a single session with many sub-agents instead of many sessions, what that costs, and the honest ceiling I think it has.
Part 5: FSM in the database. How concurrency, leaks, and crashes stopped being terrifying once the workflow state lived in a table instead of in memory, and how the same substrate carries resilience and a distributed improvement mechanism across multiple repos.

I'll change my mind about some of this between now and the last post. That's the point.
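
To make the proactive and reactive halves concrete, here is a minimal sketch of how the two might look side by side. Everything in it is an assumption for illustration: `ToolCall`, `gate`, `Scenario`, `evaluate`, the tool names, and the forbidden command fragments are hypothetical and are not taken from teatree or from Claude Code's hook interface. The gate refuses one action at the moment it is attempted; the eval inspects a recorded trace afterwards and reports what went wrong.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One action the agent attempted (hypothetical shape)."""
    tool: str
    argument: str


# Reactive half: shell fragments the gate refuses outright (illustrative only).
FORBIDDEN_FRAGMENTS = ("push --force", "reset --hard")


def gate(call: ToolCall) -> bool:
    """Return True if the action may proceed, False if the hook refuses it."""
    if call.tool != "shell":
        return True
    return not any(fragment in call.argument for fragment in FORBIDDEN_FRAGMENTS)


@dataclass(frozen=True)
class Scenario:
    """Proactive half: what a recorded run is expected to contain."""
    name: str
    must_use: frozenset[str]
    must_avoid: frozenset[str]


def evaluate(scenario: Scenario, trace: list[ToolCall]) -> list[str]:
    """Return a list of failures; an empty list means the scenario passed."""
    used = {call.tool for call in trace}
    failures = [f"missing tool: {t}" for t in sorted(scenario.must_use - used)]
    failures += [f"forbidden tool: {t}" for t in sorted(scenario.must_avoid & used)]
    return failures
```

In this sketch a scenario such as `Scenario("fix-bug", frozenset({"run_tests"}), frozenset({"web_search"}))` fails when the trace never calls `run_tests`, which is the "red test" case, while `gate` is the backstop for an action that slips through anyway.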
Source: Dev.to.

