AI Flow Dynamics - The Loops Don’t Get Faster On Their Own
AI made code production cheap and fast. The three feedback loops that turn code into validated value still run at their old speed. Managing that gap is what mature AI product development now means.
TL;DR
AI made writing code cheap and fast. Three feedback loops turn code into value: review, market measurement, customer absorption. None of them got faster, customer absorption least of all.
We spent fifteen years building CI/CD so that we could ship a single-line code change cheaply and quickly. Now we pump thousand-line, zero-cost changes through the same pipeline. The constraint moved from writing to reviewing, measuring, and being absorbed, but the pipeline is mostly unchanged.
Small batches only ever helped wherever a loop was closed. CI/CD closed the technical loops; we rarely bothered with the market loop.
The work is flow discipline, not a limit: build everything, but level the input to each loop (set up a stop light system), build many options and ship few (option storming), and limit customer-facing change to what you can measure and what customers can absorb.
Mature teams pace their output to the loops that validate it. The rest ship because they can. Your choice.



Setup
Happily, code now arrives faster than I can read it. An agent writes in twenty minutes what used to take me a week. The eternal busy beaver, it hands me four thousand lines and waits for me to approve them. Five features in six variants if I want: I can judge when it’s done. I don’t have to overthink whether to even get started. Cool. We spent fifteen years making code creation cheaper. And getting the code done was also what set the pace for everything before and after code creation.
The premise we built on for the last 20 years
Software is malleable in a way hardware is not. You can change it after it ships, cheaply, again and again. That property is the foundation under everything the product discipline learned and executed in the last twenty years.
In the physical world, by contrast, we cut the metal, start the CNC machine, and push the button on the expensive production line only after the design is correct, and mostly on the basis of actual customer orders (i.e. the product is already validated). Hardware usually has a solid, validated economic model before production starts. Software, though, is mostly built on assumptions and ideas - about what users need, about what will work, about what will pay for itself. Those assumptions are not known to be true when the work starts. They have to be checked. (The exception is bespoke software, where the deal is to deliver exactly what was ordered. Difficult enough, but a different game.) The rest of the software world is a different beast: a whole huge org, from Marketing and Sales through R&D to Product Development, trying to figure out what that strange animal, the customer, actually needs.
The check happens at two moments. First, before building. Code is expensive, in money and time, so it is worth asking how true an assumption is before betting on it. Second, after release, to find out whether the thing we shipped really creates value for anyone: pays the bill, gets adopted, changes behavior, etc. Each check is a loop: do something, watch what happens, adjust, go again.
That’s what Eric Ries encoded in Lean Startup’s “Build, Measure, Learn”, and what is known, more elegantly and with more far-reaching consequences, as the OODA loop. You get it.
The central lesson of that era was that smaller batches make all of this better - cleaner code, faster correction, a better, more precise market signal. Each small batch is ideally just one change, so when something changes you know why. And you know whether anything changed at all (was there an outcome to the output?). Despite all that knowledge, we mostly ignored the core requirement: even a tiny batch only helps if the loop around it exists and is closed. A small change without the measurement loop produces no learning.
While the two technical loops often got closed, even if only to monitor what those guys in IT do (the most probable reason why so much effort is spent on these loops), the third one was mostly ignored. Closing it would also reflect back on the decision-makers higher up.
CI/CD closed the technical loops. The inner loop is the developer’s own cycle - edit, run, see the result - in seconds. The outer loop is the integration cycle - integrate, test, review, release - in hours. Continuous Integration automated the testing; Continuous Delivery automated the release. Running the loops became cheap enough that a single-line change was worth shipping on its own. Small batches paid off, and feedback on correctness came back fast.
Hold that thought: we spent the last ten to fifteen years optimising for the smallest code change to be “free”, low cost, low impact. We lowered the transaction cost of small releases so that we could release the smallest changes.
The market loop, though, stayed open in most orgs. It is the slowest one - release, measure how customers behave, learn, adjust - over weeks and quarters. All lagging indicators, but required to replace the crystal ball. Some teams approached this with the occasional A/B test and/or a product-analytics tool like Pendo, and little more. Closing that loop is genuinely hard. It takes tech, grit, patience, and a lot of delayed-gratification psychology instead of the instant dopamine hit. So while we got very good at shipping small, correct batches quickly, most of us remained really bad at knowing whether, and which of them, were actually worth shipping.
What free code actually changed
Generated code is fast and nearly free. The step that used to be slow and expensive - writing the change - has collapsed, and the little agent genies do it for us, remote-controlled from the iPhone on the porch. The only patience required is waiting for Claude to come back from its work.
What we ignored is that this raises the batch size of production. An agent produces a large, coherent change in the time a person once spent on a small one. Production no longer paces itself to how fast a human can type, and that human pace had set the rhythm of the whole system for decades.
We spent ten or fifteen years building this pipeline for one purpose: to make a single small change cheap to ship. AI now pushes changes of a thousand lines at once through the same pipeline, at no cost, and they run straight through the machine we built for the opposite problem. The infrastructure that was tuned for the smallest possible batch is now being hammered with the largest batch, produced overnight for free, and nothing in it complains: the checks go green, the deploys fire. The only thing that changed is the size of what flows through, and that stays invisible until it reaches a human. The human will suffer under the PR load as long as the infra job of adding a gazillion automated tests to that step is not done. At least it forces us to define what a PR actually is. But that’s the easy part, sorry to say.
So production got faster and the loops did not. Speed up one station and leave the rest alone, and the work piles up in front of the next one, as Goldratt described forty years ago. The constraint relocates, and here it relocates to three places: review, measurement, market absorption.
Breakpoint 1: review
Changes now reach human review faster than humans can read them, and a person reviews at a fixed rate. And it’s supposed to be that way. On the Shapiro scale, frontier AI-augmented coding sees a radical “productivity” increase from level three, which means “I don’t review my code at the line level”. Keep the batches small and the number of reviews climbs until review, not production, caps the throughput: you have relocated the jam to the next stage. Make the batches big, with each one touching more code than a person can actually examine, and human review stops working and defects get through to production. Somewhere between those is a batch size that keeps the queue stable and review honest. It is larger than the small-batch optimum we are used to, and it is still bounded.
The way to raise that boundary is the same thing CI/CD did for deployment: automate the checking. Types, contracts, property-based tests, generated test cases: every tiny machine-checkable aspect that can be automated takes load off the human. Then the human can spend time checking intent (does the feature do what it’s supposed to do, what we designed it to do; does the assumption hold at least locally) while the checker agents confirm the correctness of the lines in a gazillion aspects. Building that infrastructure layer is the post-AI version of the investment we made in deployment pipelines a decade ago. In part, the old investment pays off, but we now add test infrastructure at the micro / code level for what the human guaranteed until now.
The second step is to stop fusing CI and CD. We conflated them far too often. When integration and release are the same thing, a change goes live the moment it is built. That was the art, and we were actually proud of it. Read what I mean by option storming later and you’ll see how much we now depend on breaking that connection: a judgement step, a filter, is required after CI. We can simply build too much, and the customer has only a small tolerance for change. Pull them apart and a change can be built, run, and inspected without reaching a single customer. That built-but-unreleased state is the check and decision point for everything downstream, and most of what follows depends on that filter.
Breakpoint 2: measurement
To say that a release caused a change in customer behavior, you need enough signal to separate it from noise. The sample you need grows with the inverse square of the effect you are trying to see. On a 10% baseline, detecting a 2% relative change takes roughly 350,000 observations per variant at ordinary confidence. Ron Kohavi did the work to show how few real experiments have any impact; most are underpowered and misleading, built on unfounded assumptions and guesses.
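The 350,000 figure can be checked with a few lines. This is a minimal sketch, assuming a two-sided two-proportion z-test at 5% significance and 80% power, which is my reading of "ordinary confidence"; the function name and defaults are illustrative, not taken from any specific experimentation tool.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate observations per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_2pct = sample_size_per_variant(0.10, 0.02)  # 10% baseline, 2% relative change
n_1pct = sample_size_per_variant(0.10, 0.01)  # same baseline, half the effect
print(n_2pct)           # on the order of 350,000
print(n_1pct / n_2pct)  # close to 4: half the effect, four times the sample
```

The second print is the inverse-square relationship in action: every time you halve the effect you want to detect, the traffic bill roughly quadruples.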
How much traffic do you have, and how many changes can you measure with clear attribution in a quarter? That number is fixed, and often surprisingly small. Free code has no influence on it. But the number of changes you send to the customer rises linearly with your speedup factor. The gap between what you ship and what you can measure explodes. Everything you produce beyond what you can measure reaches customers with no way to attribute and measure its effect. So: zero learning. And now we’re back to judgement, opinion, taste, the very thing we have wanted to get rid of for the last ten years. Now it’s supposed to be the last line of defence against the AI. I doubt it holds. Either it gets decided by taste, or it ships and is never really evaluated.
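To make the gap concrete, here is a rough sketch of the arithmetic. All numbers are hypothetical, and it assumes the simplest setup: two variants per experiment and each visitor enrolled in only one experiment, so experiments cannot share traffic.

```python
def measurable_changes_per_quarter(quarterly_visitors, sample_per_variant, variants=2):
    """How many cleanly attributed experiments the traffic can pay for in one quarter."""
    return quarterly_visitors // (sample_per_variant * variants)


def unmeasured_changes(shipped_changes, measurable):
    """Changes that reach customers without enough traffic to attribute their effect."""
    return max(0, shipped_changes - measurable)


# Hypothetical product: 3 million visitors per quarter, 350,000 needed per variant.
capacity = measurable_changes_per_quarter(3_000_000, 350_000)
before_ai = unmeasured_changes(10, capacity)   # hypothetical: 10 shipped changes per quarter
after_ai = unmeasured_changes(100, capacity)   # the same team with a 10x speedup
print(capacity, before_ai, after_ai)
```

The measurement capacity stays where it was while the unmeasured pile grows with the speedup. Overlapping experiments or better variance reduction would shift the numbers, not the shape of the problem.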
You can test a built option against a model of the market before it ships - a simulation, a proxy metric, a panel reacting to the real artifact - and spend no live traffic doing it. That extends how much you can evaluate. So you save your handful of real experiments for the genuinely new bets, the ones a model of the market cannot predict.
Breakpoint 3: market absorption
The outer market loop is the toughest one to handle. The hardest bottleneck. It’s beyond statistics. Metrics come in late and fuzzy, attribution is hard to find, and there is little signal in all the noise. Customers accept change only up to a limit, and that limit also depends on how radical the change is. Small changes with little impact are accepted more easily and quickly, and are easier to measure. Bigger changes, the opposite. Hence a bias for smaller changes: easier to manage, easier to measure, and we often do what’s easier to measure. Fundamental product changes can take years to show an effect, and years until that effect can be reliably measured. You know that from your own experience, when the CFO asks why a feature released this summer will only show revenue next year.
Take eBay as an example. The company had made auctions the definition of the product and marketed them as the real deal. Then, all of a sudden, after Amazon had shown what’s possible with simply selling and buying stuff, they wanted the opposite, “just buy on eBay”, to get into customers’ brains. It took years to convince people that fixed-price buying was also fine, and a couple more years to convince them that the next level, classifieds, also has a place under the same brand. There is no way that fast and free code would have made that faster. On that outer loop, faster code has close to zero impact.
Push customer-facing change faster than the base will take it and the excess returns as churn, support load, and disengagement, concentrated in your most habituated users - the ones whose established way of using the product you have just disturbed. Beyond the max absorption rate, one more release kills value even when the change itself is correct.
Consequence: Level the flow to the constraint
The fix for all three is the same: separate the rate at which you produce from the rate at which you deliver. Splitting CI and CD creates that filter. We always had it and rarely used it; now it matters more than ever. Functionality built but not released. It’s basic physics.
While AI increases the speed of code, it doesn’t change the humans’ max acceptance rate. Goldratt’s Theory of Constraints predicts the consequence: the hardest bottleneck defines the throughput of the whole system, and speeding up anything but the bottleneck has no effect on the overall output. If the humans’ max acceptance rate is the bottleneck, speeding up code like crazy does not change much. Funny: see the eBay example. With code creation ten times faster and review remaining at human speed, the constraint moves from writing to reviewing. Most amazement and AI investment today is focused on that single step, code creation. Output goes up, delivery does not.
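A toy version of that argument, as a sketch: treat the pipeline as serial stages, each with a capacity, and let the slowest stage set the delivery rate. The stage names and the numbers are made up for illustration.

```python
def system_throughput(stage_rates):
    """A serial pipeline delivers no faster than its slowest stage."""
    return min(stage_rates.values())


def bottleneck(stage_rates):
    """Name of the stage that currently limits the whole system."""
    return min(stage_rates, key=stage_rates.get)


# Hypothetical capacities in changes per week.
before = {"write": 5, "review": 8, "release": 20}
after = dict(before, write=50)  # code creation gets ten times faster

print(bottleneck(before), system_throughput(before))
print(bottleneck(after), system_throughput(after))
```

In this example a tenfold speedup in writing lifts delivery from 5 to 8 changes per week, and the constraint has moved to review. Everything generated beyond 8 per week only lengthens the queue.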
Little’s Law predicts and measures the downside. Lead time equals work-in-progress divided by throughput. Generated code increases the work-in-progress in front of review; review throughput is fixed and human (until we have done our infrastructure work); thus the lead time of every feature climbs in proportion to the pile. More work in flight produces longer cycle times - a predicted result. And the pile is dead capital: code that waits ages for review creates a mess on the next merge, and it never reaches the feedback loop that would tell you whether it was worth writing. No one would run the stamping press three times faster while the paint shop is at its limit; no one would think of filling the factory with bodies that can’t be painted.
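Little’s Law is simple enough to put into three lines. A minimal sketch with a hypothetical review rate; the law holds for long-run averages in a stable system, not for any single feature.

```python
def lead_time_days(work_in_progress, throughput_per_day):
    """Little's Law: average lead time = work in progress / throughput."""
    return work_in_progress / throughput_per_day


review_rate = 10  # hypothetical: changes the team can review per day
small_pile = lead_time_days(20, review_rate)   # 20 changes waiting
large_pile = lead_time_days(200, review_rate)  # agents have filled the queue tenfold
print(small_pile, large_pile)
```

With the review rate unchanged, ten times the pile means ten times the wait: two days become twenty.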
Toyota’s answer is heijunka - leveling. You enter work into the system at a smooth, even rate that matches the slowest step - the biggest bottleneck - instead of releasing it in the max spikes that upstream capacity can produce. The aim is flow time, not throughput, and the two are different goals. Queueing theory has shown for a century that driving utilization toward one hundred percent makes queues and cycle times grow without bound - a system loaded to the brim has the longest and least predictable lead times. Under uncertainty - and we probably agree on current uncertainty - learning rate trumps production volume. And learning rate is a function of flow time. So you deliberately hold slack and accept less than peak output in exchange for short, predictable flow.
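The utilization effect can be sketched with the simplest textbook model. I am assuming an M/M/1 queue here (one server, random arrivals, random service times); real review queues are messier, but the shape of the curve is the point.

```python
def time_in_system(utilization, service_time=1.0):
    """Mean time in an M/M/1 queue (waiting plus service), in units of service_time."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1 - utilization)


for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"{rho:.0%} utilized -> {time_in_system(rho):.0f}x the bare service time")
```

Going from 50% to 90% utilization multiplies the time an item spends in the system by five in this model, and the last few percent toward full load are the most expensive. That is the price of keeping everyone busy.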
This is where two famous flow ideas differ. Goldratt: exploit the constraint - never starve it, subordinate the rest of the system to it, get the most through it. True. But exploiting the constraint is not the same as maximizing the demand you push at it. Instead, load the bottleneck to capacity and keep the queue in front of it short, protecting flow time. Feed the constraint steadily; do not flood it.
Leveling does not need to be rigid and fixed, which always makes teams paranoid. It can be an event-driven, dynamic limit - a pull signal on the review queue. A stop light system. While the queue is clear, generation runs at full speed, no limit. As it fills, people stop starting new features and pick up a colleague’s review. When it crosses the threshold, everyone - including the fastest AI users - stops producing new code and swarms the backlog until it clears. Freed capacity is redirected to the constraint: into review, and into the automated checks that raise review throughput, rather than into more work-in-progress.
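The stop light can be as small as one function on the review queue. A minimal sketch; the thresholds are hypothetical and each team has to find its own.

```python
def stop_light(queue_length, yellow_at=5, red_at=10):
    """Map the review queue length to a pull signal for the whole team."""
    if queue_length >= red_at:
        return "red"     # everyone stops producing and swarms the backlog
    if queue_length >= yellow_at:
        return "yellow"  # start nothing new, pick up a colleague's review
    return "green"       # generation runs at full speed


for waiting in (2, 7, 12):
    print(waiting, stop_light(waiting))
```

The design choice is that the signal depends only on the state of the queue, not on a calendar or a quota, so the limit tightens and relaxes by itself as review catches up.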
None of this means build less. Build everything the tools will produce. Flow discipline tells you when building creates value and when it creates a mess. The discipline is about how output enters the system, not about capping it: level the input, and convert the speed the AI gives you into help at whatever step is currently binding. That maximises the value created. Resist the instinct to keep everyone busy. An engineer idling for an hour is cheap compared to a feature stuck in a queue for three weeks. A fully utilized pipeline is slow, a traffic jam. Pushing more traffic into it is useless and actually diminishes value. Even cheap code pays off only as smooth flow to done; a clogged pipeline converts it straight back into waste.
We’ve all been there. The first “extended workbench” experiments had the same problem: who reviews the code from the new partner? Who guarantees quality until the new partner knows our quality system and boundaries? Now the partner sits in the same room and its name is Claude Code. Same challenge.
Option storming
But here is something counterintuitive. For ages we never had enough developers to do everything we wanted to do. We had whole departments busy with overthinking: which of the million things we could do will actually get done by the engineering bottleneck? Choices, priority lists, changed priority lists, confusion about versions of priority lists. When can we trust the lists? Which one is final?
What if we use the new capacity the machine gives us to stop overthinking and just do whatever? Whatever is within the scope of the product direction, that is.
Then cheap building changes not only throughput; it fundamentally changes how we choose.
When code was expensive, you could afford to build only one version of a thing, so you chose it on paper, based on assumption, or better said: bias. No crystal ball, only intuition. Sometimes informed intuition. You ranked options in a document, argued the merits, and committed before even the first line was written. Selection happened upfront, against whatever forecast. Building all the options simply to see them was unaffordable.
When building is nearly free, you can stop choosing based on theory. You build several real versions of everything and, after the fact, keep the ones that seem to work better. I call this option storming: instead of brainstorming options and betting on one, you generate a ton of options for everything and let the strongest survive. Judgment moves from before the build, on a spec, to after the build and before release, on the working thing. That’s fundamentally different from the old pre-judgment of “I have a better glimpse of the future than you”. Now it’s about the team having a look at the feature and validating whether it is what we wanted and whether it can do what we intended. If not: kill it now.
Hence the built-but-unreleased state. The options sit there at no cost to any customer, they don’t count against the customer’s max acceptance rate, and you can compare them against each other, against your (AI-?)model of the market, against your guardrails. 90% get thrown away, and that is the point: only a few make it past this point and earn the right to count against the customer’s max acceptance rate. You drop the rest before they reach a user. Build many, ship few.
Boris Cherny has a whole system that feeds his agents: suggestions from Twitter, automated agents that fix bugs collected from GitHub threads, suggestions from colleagues. Why bother ranking? Let the agents do the work, see what fits the product’s intent, and keep those. Ignore the rest, kill them. With each of them he engages in another Claude Code terminal for 10 minutes. Why bother.
Option storming is the upside of free code. Turning things on their head like this is not often discussed, but you can copy it from the masters. The win is a wider field of real, built options, resting the final choice on visual evidence of the real thing rather than on a vague, unfounded forecast. More output is incidental.
It’s like a mood board, but for real rather than “imagine this”.
What mature AI product development looks like
With all the old-school theory applied to the new world, we can actually say more and be more specific about AI augmented product development.
It lets production run flat out and embraces the cheap build.
It has option storming at the core. That’s the best way to leverage valuable developers who never had time but now have it, as they orchestrate the agents: build several versions of whatever might be in scope, defer judgment to after the build, keep those that work and kill those that obviously don’t. Learn the right threshold. Don’t judge early, based on specs.
It scales correctness with machine-checkable verification; human review happens at the intent / feature level, away from the code level.
It paces customer-facing change to the two market rates: what you can measure and what customers can absorb. It then uses a model of the market to validate what live data can’t afford to.
The old-school version, which ignores the new playing field, is: let the production rate set the release rate. Build because we can, release because we built it, see what happens, don’t close the market loop, don’t correct your model, and trust your “product sense”, intuition, taste.
Your choice.
Both versions produce a ton of software. They differ in whether explicit loops enable learning and whether that learning is turned into future knowledge.
The actual test
None of this is new economics. It relies on the batch-size logic Don Reinertsen (The Great!) wrote down years ago, and the TOC constraint logic Goldratt wrote down before him. We learned it once, encoded one of its answers - small batches, ship continuously - and slowly came to treat it as a law of nature rather than an answer to a question about cost. Free code changes the costs, the batch size, and the flow economics. The teams that already understood why they were shipping small batches will have no problem finding the new optimum. If you were excellent at making software flow, you’ll be fine. If not, it’s a great time to learn to leverage the new systems for maximum outcome. Orgs that never cared and remained mediocre will not understand what’s happening to them or how to fix it under the new constraints. They’ll watch the traffic jams in their systems in amazement, without options to resolve them. Again: choices. But the traffic jams are much harder now that the amount of code explodes. We all had time to learn. If the new constraints are not a trigger to learn now, what will be? Without that learning, free code produces no value.
Technology made code creation free. The thinking required to turn that into value is still ours. The thinking was always our job.
In the next piece, to turn the volume to the max, I’ll sketch a set of AI models that help set up and manage the loops I described.


Great article.
While the diagnosis is correct, I think that you are missing one thing:
Time orientation. It's the only way to get to actual flow, in software development (and elsewhere): https://betacodex.org/white-papers/paper/introducing-time-oriented-software-development-26
Here's an article about how this connects with AI-aided development and the "loops" you mentioned. https://nielspflaeging.substack.com/p/ai-aided-software-development-needs