Engineering Coherence
Building Decision Architectures for the AI era
AI scaled your output. Judgment did not scale with it.
You’re three days into a quarter you expected to spend on next year’s strategy when the meeting invite comes in from customer success with a note attached:
“I think I need your help with this one.”
You know the customer before you open the calendar.
They were customer number eleven. The logo your sales team still opens enterprise pitches with, the one prospects cite back to you on discovery calls. Two years ago, when the platform went down twice in a single quarter, their team stayed through the recovery and helped instrument the fixes alongside yours.
And they are not happy.
A feature your team shipped six weeks ago, the one credited with driving first-week activation from 19% to 31%, has broken part of their reconciliation workflow. The integration technically still works, but a workaround their finance team had depended on for years changed underneath them without warning. Month-end close is now delayed. These are the numbers their leadership runs on, and the head of operations who championed your platform internally is going to be sitting in front of her CFO in three days trying to explain what happened.
She is not demanding a rollback. She is asking a harder question.
“Is this going to keep happening?”
You promise her you’ll dig in personally and call her back by the end of the day. Then you start digging.
At first, nothing looks wrong. The growth team did exactly what they were supposed to do. The onboarding experiment was reviewed by the right people, the engineering lead approved the changes, the release process completed cleanly, and the metrics moved materially in the right direction. By every visible check the team had, this was a resounding success.
Then you find the dependency nobody realized existed.
One of the integration touchpoints modified during the experiment had originally been implemented years earlier as a temporary stub. A placeholder behavior for something the platform team had intended to build out and later deferred. Over time, a workaround formed around that behavior, and eventually it became embedded inside this customer’s finance operations. None of it existed in a spec. The knowledge lived in two people’s heads. One had left the company. The other had moved teams.
No one involved in the release understood they were changing something load-bearing. Nobody asked the question that wasn’t on anyone’s checklist:
Who depends on the existing behavior, and what breaks for them when we change it?
That question existed nowhere in the system. Not in the OKR, not in the experiment review, not in the launch checklist. The team optimized exactly what the organization taught them to optimize. They went even faster than you asked them to. They did exactly the work the system rewarded.
The constraint that should have held it together, you never named.
And because you never named it, no one else would either.
You sit with that for a moment, then look at your calendar. There is already a cross-functional review on the books three weeks from now that was put there for exactly this kind of thing. But the customer is not going to wait three weeks. The unwinding, figuring out which of her workflows broke, which can be patched, which require her team to rebuild from scratch, is going to take longer than the development cycle that produced the problem in the first place. The compounding has already happened. The recovery is now its own project.
You start thinking about the other teams. The platform team has been moving at the same pace. The enterprise team has been moving even faster. The product surface area has roughly doubled in eighteen months, and you begin to suspect this is not the first time something like this has happened. It is simply the first time the cost became impossible to ignore.
What put one of your most important customers at risk was not recklessness, incompetence, or a lack of rigor. The team behaved rationally inside the system they were operating in. The problem was that nobody was measuring whether the system was still coherent while it accelerated.
That is the gap you now find yourself in.
The customer is still on her side of the phone, waiting for the call back. This is not a personal failure, even though it feels like one.
The missing work was the explicit defining and naming of constraints. It is the work most leaders in this generation were never required to do. For decades, execution friction acted as invisible governance. Building was expensive. Coordination was slow. Shipping carried enough operational weight that impactful decisions were naturally forced to confront one another before they reached customers. Scope negotiations, integration reviews, roadmap debates, and release planning all acted as filtering mechanisms that limited how much divergence could accumulate at once.
By the time something shipped, it had usually survived enough scrutiny that the critical constraints had been surfaced somewhere along the way, even if nobody had formally documented them. Leaders could remain highly involved because the pace of consequential decisions still matched human attention bandwidth. The rooms where important tradeoffs were made held still long enough for leadership presence itself to function as the coherence mechanism. Your presence was the working substitute for what no one had to write down.
That operating reality is disappearing.
The big decisions are no longer concentrated inside a handful of planning meetings or executive reviews. They are happening continuously, across teams, in parallel, at the speed AI-accelerated execution now enables. The growth team adjusts onboarding behavior. The platform team evolves an integration endpoint. The enterprise team makes a customer commitment to close a strategic account. Individually, each decision makes sense. Collectively, they can pull the system in different directions before anyone recognizes the drift.
You aren’t in those rooms because you can’t be. The rooms where you used to feel your judgment shaping the call, where you could see the fingerprints of your thinking on the outcome, are not where the important decisions are being made anymore. That work has moved, and it has multiplied. The leadership role that put you in the room when it mattered most is becoming something different.
The instinctive response is to pull decision-making closer through more reviews, more approvals, more executive oversight. That instinct is understandable. It also fails. The volume of decisions now being generated cannot be governed through leadership proximity alone, and adding more gates to an accelerated system does not restore coherence. It produces slower review processes that the organization gradually learns to route around.
Your job now is to design the rooms so they are all building toward the same thing. That design work is what coherence discipline is.
The work of coherence discipline does not stop the escalation calls from coming. Calls like this one will keep arriving as long as you are operating at the speed AI now enables. The discipline changes who makes the call first. The goal is not that mistakes stop happening. It is that you stop being the last to know.
The system is working too well
This is not just happening to your team. It is happening everywhere. And it is not what it looks like.
Coherence problems are not new. Large organizations have always struggled to keep distributed decisions aligned to shared intent. What is new is that the friction that used to absorb those problems has been removed. Acceleration did not invent fragmentation. It exposed the ambiguity organizations had been surviving through.
The team did not skip rigor. Each decision was reviewed. Each one made sense in the context of the experiment it was part of. The growth PM validated the data, the engineering lead approved the change, the release notes went out, the activation numbers came in above target. By every measure the system had, this was a resounding success.
What the system did not do, because it had never been designed to, was force anyone to ask the simple question that would have held everything together: “Who depends on this, and what promises are we making and breaking by changing it?” That is judgment, not a process step. And applying it requires the kind of attention the system never budgeted for, because that kind of attention had become too expensive to spend on every decision.
The system that produced this outcome is working exactly as designed, against conditions and assumptions that have fundamentally changed. The system is working too well.
The bet on AI you made was real, and it is paying off. Your teams ship more, faster, in more places, with more concurrent experiments running than would have been thought possible eighteen months ago. The deliberate attention that asks “What does this commit us to? What depends on the current behavior? What would it cost if it’s wrong?” did not scale alongside it. AI scaled output. Judgment did not scale with it. Not because the people got worse. Because the attention required to apply judgment is finite, and the output competing for it is no longer bounded by execution cost.
You don’t need to look far to see this happening across your organization. Talk to the senior engineers, the ones reviewing pull requests, evaluating architectural proposals, weighing tradeoffs on launches. They will confess that they are exhausted. The hours have grown, but the real exhaustion comes from the volume and complexity of what they are now being asked to evaluate. They are running a marathon at sprint cadence, and when the review queue outpaces the reviewers, the review quality drifts. Nobody decided to lower the bar. Even with the best intentions in the world, reviewers cannot apply the same depth to every decision when decisions have multiplied.
Plausibility becomes a proxy for rigor. If the proposal looks right, is internally coherent, well-reasoned, and consistent, then it has to be enough, because there are twelve more behind it. Usually this works. Most plausible things are fine. There is a small percentage where depth actually matters, the ones carrying the hidden commitment. These are the ones that are expensive to reverse and indistinguishable from all the rest in the queue. The review process still functions. It is simply no longer capable of doing what it was designed to do.
The senior engineers are not the only signal. Your most experienced operators are leaving. The ones with the judgment to see across teams, the ones who have been carrying the unwritten constraints in their heads. One every few months. Quietly. Each departure has its own story, each one has a reasonable explanation, and none of them individually warrants alarm. But the pattern is there. They are leaving because the work itself was the reward they came for, and the system has made that work impossible to do well. What disappears with them is the organizational memory the system was depending on without knowing it.
Nothing appears obviously broken. The metrics still look strong. Teams are shipping. Customers are renewing. But the coherence holding the whole system together has become dependent on a shrinking number of people manually carrying context the organization never formally encoded. That model does not survive acceleration indefinitely.
Eventually the organization reaches a point where the volume of decisions exceeds the system’s ability to evaluate whether all the decisions still collectively reinforce the same strategic intent. At that point, the problem is no longer execution speed. The problem becomes how quickly the organization can recognize and correct drift before it compounds into something much more expensive to reverse.
“Straight roads are for fast cars, turns are for fast drivers.”
- Colin McRae, 1995 World Rally Champion and former record holder for most WRC wins
The speed that matters has changed.
For a quarter of a century, technology organizations optimized for execution speed. The critical operational question was straightforward: how quickly can we build, test, and ship? An enormous amount of management discipline, tooling, and organizational structure evolved around compressing that cycle, and the discipline produced real results. Companies that mastered shipping speed won markets.
For many organizations, build, test, and deploy time is still the dominant bottleneck, but it is changing fast. AI does not simply accelerate your roadmap. It accelerates every team’s ability to independently modify the system in parallel. More experiments launch simultaneously. More architectural decisions compound concurrently. More customer-facing behavior changes before anyone fully understands how the interactions combine downstream. No leader can hold all those changes, their interactions, and their secondary effects in their head, let alone pressure-test them for what nobody on any single team thought to ask.
To thrive in this accelerated world, shipping at top speed is no longer enough.
Think of a race. There is the top speed you hit on the straightaway, and there is the cornering speed that lets you set up the next turn without losing the line. For decades, the race was on the straightaway. The winners were the ones who could ship faster than their competitors. Top speed on the straightaway is now table stakes. The competitive edge is in the cornering. The race will be won and lost based on how fast you can detect what you shouldn’t have shipped, decide on the fix, and have it in the hands of your customers before the damage compounds.
This is correction velocity. It is measured by the interval between a decision starting to drift and the moment the correction is in the hands of the customer: the shorter the interval, the higher the velocity. That interval is composed of three latencies: detection, decision, and implementation.
Detection latency. How long between the divergent decision and someone recognizing the divergence. In the opening scene, this was six weeks. The team shipped, the growth team celebrated, the dashboards looked better than they had in a long time. The signal that something was wrong arrived from the customer, not from inside the organization. In a weekly-shipping environment, that is many compounding cycles, not one.
Decision latency. How long between naming the divergence and aligning on how to correct it. In the opening scene, the clock starts when the leader sees the divergence. The scene closes while this latency is still running. The leader has not yet connected with the customer to validate her needs, or with their broader team to develop a coherent mitigation plan and assign ownership for it. Only then does the clock stop.
Implementation latency. How long between deciding on the correction and the correction reaching the customer. Each decision built on top of a divergent one makes the correction exponentially more expensive. In the opening scene, the leader’s instinct is already that this fix is going to take longer and require more effort than the updates that put them in this situation.
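The three latencies can be made concrete with a little arithmetic. What follows is a minimal sketch, not a prescribed tool: the class name, field names, and dates are all hypothetical, and only the six-week detection gap comes from the opening scene. It simply splits one correction cycle into its three parts and reports which part is leaking the most time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CorrectionCycle:
    """Timestamps for one drift-and-correction cycle (illustrative names)."""
    diverged_at: datetime   # the divergent decision shipped
    detected_at: datetime   # someone recognized the divergence
    decided_at: datetime    # the organization aligned on a correction
    corrected_at: datetime  # the correction reached the customer

    @property
    def detection_latency(self) -> timedelta:
        return self.detected_at - self.diverged_at

    @property
    def decision_latency(self) -> timedelta:
        return self.decided_at - self.detected_at

    @property
    def implementation_latency(self) -> timedelta:
        return self.corrected_at - self.decided_at

    @property
    def total(self) -> timedelta:
        # The three latencies are contiguous, so they sum to the full interval.
        return self.corrected_at - self.diverged_at

    def slowest_stage(self) -> str:
        """Name the stage where the cycle lost the most time."""
        stages = {
            "detection": self.detection_latency,
            "decision": self.decision_latency,
            "implementation": self.implementation_latency,
        }
        return max(stages, key=stages.get)


# Hypothetical dates. Only the six-week detection gap is from the opening scene;
# the decision and implementation durations are assumed for illustration.
shipped = datetime(2025, 1, 6)
cycle = CorrectionCycle(
    diverged_at=shipped,
    detected_at=shipped + timedelta(weeks=6),
    decided_at=shipped + timedelta(weeks=7),
    corrected_at=shipped + timedelta(weeks=10),
)
print(cycle.slowest_stage(), cycle.total.days)
```

Read this way, the output is a pointer to where the system needs redesign, not a score for the team that shipped.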
When most organizations say they are “moving incredibly fast” with AI, they are frequently optimizing only implementation latency. They have accelerated implementation while losing the lion’s share of their time to detection and decision latency. Implementation is mostly execution work that AI now absorbs. Detection and decision are different: both require someone to weigh what a signal means and what to do about it, and that work cannot be delegated to the tooling that accelerated everything else. Detection and decision are precisely where leadership judgment is most oversubscribed, at exactly the moment when that judgment is stretched thinnest.
The organizations that will thrive in this new reality are the ones that detect and resolve drift before it compounds, not the ones that make the fewest mistakes.
Correction velocity is a diagnostic. The failure mode is treating it as a metric to optimize. That creates the conditions for it to be gamed, for teams to hide misses or rush to declare victory before the issue is fully resolved. A long correction cycle is a signal, not a verdict on the team. It’s a map of where the system is leaking time. You are not setting a number to hit. You are finding what to redesign.
Looking back at the opening scene with this lens, you can see that the correction cycle is going to be long. Detection was external, identified by a critical customer, after weeks of compounding. The decision remains uncertain and in flight. Whatever fix is decided on will likely take longer than producing the problem did. Each of those latencies is a place where design choices either help or hurt.
Designing the rooms so they are all building toward the same thing means systematically designing to optimize each of these latencies. But correction velocity only measures whether the system is holding. It does not tell you what is actually keeping it together. Coherence discipline is the work of engineering what holds the system together.
Coherence is engineered, not cultured
Comparing how much faster AI lets you complete tasks is dominating the discourse right now. But speed benchmarks are vanity metrics. Velocity, not speed, is what matters. Velocity is motion in a specific direction. An organization optimizing for speed will accelerate toward wherever the wind blows. An organization optimizing for velocity knows where it is going.
The teams in the opening scene were optimized for speed. The growth team, the platform team, the enterprise team were all moving fast. Velocity across the organization was missing. Each team’s speed was high, but the directional coherence was not. At high speed, deviations compound before anyone catches them.
Previously, a minor deviation was caught in the next monthly program review before it had time to harden into something structural. With the speed AI enables, that same deviation multiplies into part of the system before anyone has time to look up. The unnamed constraints don’t announce themselves. The undocumented dependencies accumulate. When they do surface, more speed cannot be the answer. The fix has to be a design discipline that produces coherent direction across teams without slowing them down.
Coherence discipline is not something you install or a methodology to adopt or a framework to deploy. It is the work of designing the rooms your teams are working inside so the decisions made there compound into coherent organizational direction. The mechanisms that come out of it will look different in every organization. The discipline itself is what travels.
Amazon and Meta both engineered coherence at scale. They engineered it in opposite directions. The lesson is in the contrast.
Amazon: the machine
Amazon treats itself as a designed system, where coherence is engineered into named mechanisms that can be documented and transferred. Two of them are worth examining closely.
The first is working backwards. Before any product gets built at Amazon, the team writes a PRFAQ: a press release describing the customer experience the product will produce, paired with an FAQ that anticipates what customers and stakeholders will ask. The team starts not with what to build but with what the customer needs to feel, realize, and decide. Everyone who has shipped a complex product knows the failure mode. The team goes so deep into what they are building that they lose track of whether it still solves the problem the customer had. Working backwards keeps the customer outcome binding throughout the work.
The deeper value of the PRFAQ is something even those familiar with it can easily miss. At Amazon, PRFAQs do not get filed away after launch. The original document that secured funding for the founding team of ten still circulates years later, now inside a thousand-person business unit, shared with new team members, referenced in cross-team conversations, handed to anyone trying to understand what the team does and what it is still building toward. The customer outcome it names does not expire. That persistence is the constitution of the team, not a cultural habit.
The second mechanism is one-way and two-way doors. The framework asks one question of every consequential decision: is this reversible? A two-way door is a decision you can walk back through. A one-way door is a decision that, once made, cannot be undone without significant cost. Reversible decisions do not need to wait for the leader. The team decides and moves on. Irreversible ones get the attention they actually deserve. As Jeff Bezos put it: the problem is not that large companies make bad decisions. It is that they make good decisions slowly. Amazon doesn’t win by making better decisions. It wins by making good decisions fast and at scale.
I worked on a syndication mechanism for the Amazon catalog that enabled a product listed in one of Amazon’s marketplaces to be syndicated across all of them. The process included automatically translating the product descriptions, enforcing country-specific legal compliance, converting currency to maintain profit margins, and managing shipment time commitments. The pipeline was in the critical path for Amazon’s $500B ecommerce transaction flow, with upwards of fifty downstream teams that directly consumed the catalog data. Nobody could fully enumerate them. It’s not because we didn’t try, but because none of it was static. All of these teams had roadmaps, goals, and metrics that they were trying to move aggressively. Documenting them perfectly on Monday meant the documentation would be obsolete by Friday. There were 150 critical projects running through this pipeline at any given time and nobody had time to wait for anyone else.
If we had tried to do it the traditional way, the project would have ended up on the cutting room floor. This is the coherence problem that makes traditional validation impossible: the thing you are validating doesn’t hold still long enough to be validated against.
What enabled us to move forward despite this uncertainty was the idea that to make decisions quickly, you have to act once you have 70% of the data you would ideally want. You can’t afford to wait for a few more metrics, or one more anecdote. Any data you collect before the ‘real thing’ is only ever a proxy, so waiting for ‘better data’ is rarely worth the opportunity cost. You make the decision and rely on its reversibility to absorb the cost of being wrong about the remaining 30%.
With this in mind, we leveraged our shadow mode capability at a massive scale. Continuous deployment typically dials traffic up from 10% to 100% while watching metrics. We were using it for something different: shadow mode, running the entire system in parallel against production traffic so we could see how it would behave at scale before any customer was exposed to it. We used ‘tracer bullets’: marked events that downstream systems would ignore, so that we could follow exactly the path each one took through those systems. We built a clean separation of the shadow data from the data lake so that there wouldn’t be double-counting of events that would have skewed our insights. We implemented ‘pressure release valves’ to prevent back pressure on downstream systems that couldn’t absorb the extra load. And we negotiated with data reconciliation teams to avoid creating load spikes in already massively scaled systems.
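For readers who have not run a system this way, here is a minimal sketch of the shape of the idea. It is not the Amazon implementation, and every name in it (`ShadowRunner`, `Event`, `max_shadow_backlog`, the toy pricing functions) is hypothetical. It shows the four moves described above in miniature: the candidate path runs beside the live one, its output carries a tracer marker, shadow results are stored apart from live data, and a pressure release valve sheds shadow work when downstream is saturated.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    payload: dict
    shadow: bool = False  # tracer marker: True means "observe, do not act on"


@dataclass
class ShadowRunner:
    live: Callable[[dict], dict]        # current production behavior
    candidate: Callable[[dict], dict]   # new behavior under evaluation
    max_shadow_backlog: int = 1000      # assumed threshold for shedding shadow work
    shadow_store: List[Event] = field(default_factory=list)  # kept apart from live data
    mismatches: List[tuple] = field(default_factory=list)
    shed: int = 0

    def handle(self, request: dict, shadow_backlog: int = 0) -> Event:
        live_result = self.live(request)
        # Pressure release valve: skip the shadow run when downstream is saturated.
        if shadow_backlog >= self.max_shadow_backlog:
            self.shed += 1
            return Event(live_result)
        try:
            shadow_result = self.candidate(request)
        except Exception as exc:
            # A shadow failure is recorded as a finding and never reaches the customer.
            self.mismatches.append((request, live_result, repr(exc)))
            return Event(live_result)
        self.shadow_store.append(Event(shadow_result, shadow=True))
        if shadow_result != live_result:
            self.mismatches.append((request, live_result, shadow_result))
        # The customer only ever receives the live result.
        return Event(live_result)


def downstream_consume(events: List[Event]) -> List[dict]:
    """Downstream acts only on unmarked events, so nothing is double-counted."""
    return [e.payload for e in events if not e.shadow]


# Toy behaviors, assumed purely for illustration.
runner = ShadowRunner(
    live=lambda r: {"price": r["price"]},
    candidate=lambda r: {"price": round(r["price"] * r.get("fx", 1.0), 2)},
)
served = [
    runner.handle({"price": 10.0}),
    runner.handle({"price": 10.0, "fx": 0.9}),
]
print(len(runner.mismatches), downstream_consume(served + runner.shadow_store))
```

The design choice that matters is that a disagreement between the two paths becomes a recorded finding rather than a customer incident.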
The principal engineers, directors, and TPMs were all in standing weekly meetings where confidence levels were continuously being updated. Shadow mode caught issues: gaps in how marketplace-specific ‘partial success’ errors got reported back, images with embedded text that would need internationalized versions. These were things that would never have been caught even with the most robust test plan. Each error detected and fixed increased confidence, but nobody had more than 90% confidence at any point, and nobody pretended to or was expected to. If anyone had, the feedback would have been that we weren’t pushing aggressively enough. The real work was holding the program steady while everyone knew they were operating with incomplete information.
By the time we flipped the switch, the validation had already happened. The launch happened not because we had eliminated the risk of being wrong, but because we had bounded the cost of finding out if we were. The one-way door had become a two-way door. This is what the framework actually buys you. It makes attempting hard decisions, seemingly impossible ones, possible.
Meta: the social biome
Where Amazon engineered coherence into named, documented mechanisms, Meta engineered for the same outcome through a fundamentally different approach.
Meta made a bet most organizations are too afraid to make: that the dynamism and creative collisions of a startup were worth preserving at any scale, even if it meant accepting the disorder that came with them. This was not an accident of growth. It was a deliberate design choice, encoded from the earliest days in what became known as the hacker ethos. “I hacked it together” at Meta is a compliment. It signals that someone moved, tried, learned. At Amazon, the six-pager exists precisely because they believe “I hacked it together” is never the answer. That contrast is the difference between two theories of how coherence at scale gets produced.
For the hacker ethos to work at scale without fragmenting, it required the social infrastructure most organizations never build. At Meta, you did not reach out to another team with a request. You messaged a person for a coffee chat. The conversation was about who they were, what they were working on, what they cared about. You’d end up talking about their one-eyed cat that sleepwalked. The actual ask came later, in a separate conversation, once a relationship existed to carry it. The social fabric was the mechanism by which the ecosystem held together when the org chart would not.
Meta’s coherence mechanisms are harder to point to because they were not mechanisms in Amazon’s sense. They were the social infrastructure itself: relationship density, ambient context, shared trust. That is part of why Amazon’s approach transfers more easily and Meta’s does not. You can adopt a PRFAQ. You cannot adopt years of accumulated relationships.
The most honest account of what it actually felt like to operate inside Meta’s ecosystem came from Mark Rabkin, a longtime Meta leader who called it alien chess. Imagine sitting down at a chessboard, making your moves, building your position, and then a mischievous green alien swoops in, takes the board away, and replaces it with a different one. Sometimes the pieces have changed, sometimes a key piece is missing, sometimes it is an entirely new game. The objective is no longer about mastering any single board. It is producing impact across many boards in rapid succession, as each one gets swapped out beneath you. In a single year my teams navigated four major reorganizations, each impacting hundreds of people. The expectation after each was never “take time to recover.” It was effectiveness and impact the very next day.
Noma, the three-Michelin-star Copenhagen restaurant that has held the title of world’s best five times, publishes its recipes. You can buy the book. You can read exactly how every dish is prepared. You still cannot open a Noma in your neighborhood. The recipe is visible. The decades of sourcing relationships, kitchen culture, and instinct that make the recipe work are not. That is the risk of copying Amazon’s mechanisms or Meta’s ecosystem. The mechanisms might transfer. The conditions that make them work do not.
At Amazon, that condition is the Leadership Principles. They were more than a poster on a wall. They were the living vocabulary, woven into every conversation from the most senior leader to the new graduate. A shorthand carrying layers of meaning built over years of practice. The PRFAQ and the one-way/two-way doors framework both rest on everyone in the room reasoning from the same shared premises. Without the principles, the mechanisms would still exist on paper, but they would produce different decisions. And even at Amazon, the principles were at times weaponized, used as ammunition in disagreements rather than as a shared language for hard tradeoffs. When the conditions are real but imperfect, the mechanisms still misfire. When you copy the mechanism without the conditions, you get the bureaucracy of the mechanism without the judgment that makes it work.
At Meta, that condition is relationship density: the social trust and shared context the organization continuously reinforced as the medium everything else grew from. Without it, the hacker ethos would have produced fragmentation rather than coordination.
Amazon’s and Meta’s mechanisms have little in common. The principle that travels is what those mechanisms produce: coherent decisions made in the leader’s absence, at a scale no leader could personally supervise. Amazon and Meta got there by entirely different paths. That is the point. There is no single answer. There is not even a best answer. The right answer is the one you design with intention for your specific organization, your specific operating reality, your specific moment.
Amazon’s and Meta’s mechanisms are evidence that the problem is real and solvable. Neither is a template. The failure mode is not copying from the wrong company. The failure mode is letting it grow wild: assuming coherence will emerge on its own, that the right culture will develop without deliberate design, that the mechanisms will appear when you need them. They will not. They have to be built. The only question is whether you build them with purpose.
Coherence discipline is what you build. Correction velocity is how you know it is holding.
The authority you already distributed
The common reflex, when coherence breaks down, is to pull yourself back into the center of the decisions. More reviews, more gates, more rooms where you as the leader are present. It is wrong, and not just because it doesn’t scale. It misattributes the problem. The problem is not that authority is distributed. The problem is that it was distributed without being designed.
Your organization is already decentralized. The AI adoption you committed to made it so, and authority followed velocity. Teams acquired autonomy by moving faster than the oversight structures could track them. Nobody made a deliberate choice to distribute decision-making. It just became the operating reality, and the operating system never caught up. Fast-moving autonomous teams are not the failure mode. They are the target state. The question is whether the decentralization you already have is designed or accidental. Closing that gap is system design work, not a matter of more leadership presence.
Organizations adapting successfully to this transition converge around three structural capabilities. Each addresses a different part of the correction cycle. Each is governed by a single design principle that determines whether the mechanism actually works or just exists on paper. Each would have caught something in the opening scene that was missed.
Named ownership
The growth team had authority over the experiment. They ran it. They moved the metric. Nobody had authority over the customer-impact question: the question of what the change committed the organization to, who depended on the current behavior, and what it would cost if the assumption was wrong. The constraint existed. “Everyone owns the customer behavior” meant nobody was explicitly designated as the owner. The release went out cleanly by every check the team had. The missing check had no owner because it had never been designed.
Named ownership would have defined who to route that question to before the release by mapping the decision to the person wholly accountable for it. The work is about making the routing explicit, not adding an approval layer. Autonomy without authority creates fragmentation, with each team doing rational things that pull in different directions. Authority without autonomy creates bureaucracy, where every decision waits for approvals that cannot keep pace with the output. The target is neither extreme. It is autonomy with explicitly designated limits: a defined domain in which a team can move without escalation, and a clear boundary beyond which the decisions belong to someone else.
Most organizations run on implicit ownership. Everyone assumes someone owns it. Titles substitute for explicit ownership. Seniority substitutes for assignment. Under acceleration, implicit is a booby trap. The ambiguity that used to be absorbed by slow execution is now the thing that creates the customer-impact gaps your team cannot see.
The test is simple. Can anyone on any team answer two questions: who has authority over this, and what are they not allowed to decide? If the answer is unclear, ownership has not been made explicit. It has been inherited through some legacy path that could afford ambiguity because execution was slow enough to catch what slipped through.
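One way to make the two-question test concrete is to write the ownership map down as data and check it for gaps. The sketch below is a minimal illustration, not a prescribed format: the `DecisionRight` record, the `ownership_gaps` helper, and the example rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DecisionRight:
    """One row of a hypothetical decision-ownership map."""
    decision_class: str
    owner: str                     # a named person, not a team or a title
    out_of_bounds: Tuple[str, ...]  # adjacent decisions this owner may not make alone


# Illustrative rows only; the names and classes are invented.
OWNERSHIP_MAP = [
    DecisionRight("onboarding experiments", "growth lead",
                  ("integration behavior changes",)),
    DecisionRight("integration behavior changes", "", ()),
]


def ownership_gaps(rows: List[DecisionRight]) -> List[Tuple[str, str]]:
    """Return the decision classes that fail either of the two questions."""
    gaps = []
    for row in rows:
        if not row.owner.strip():
            gaps.append((row.decision_class, "no named owner"))
        elif not row.out_of_bounds:
            gaps.append((row.decision_class, "no stated limits"))
    return gaps
```

In this invented example, the experiment has an owner with a stated limit, while the integration change it touched has neither. That is the shape of the gap in the opening scene.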
Named ownership does two things at once. Before the divergent decision, it creates the conditions for failure to be caught before release. The customer-impact question gets asked before the release goes out. If a divergent decision is discovered later, it compresses how long the correction takes, because the context and risks are already concentrated in the leader who needs to act on them. The room is already familiar to them. They already know the problem space inside and out. Without named ownership, the organization loses time twice: first figuring out who should own the response, then bringing that person up to speed enough to make a good decision. With named ownership, the signal routes to someone already inside the problem domain.
Drift detection
The cross-functional review was already on the calendar, but three weeks out. It was put there for exactly this kind of thing. Even then, whatever it identified would have had no one ready to pick it up and run with it. There was no clear path connecting what it found to what should happen next. The wire was tripped, but there was no alarm attached to it.
Dashboards are not drift detection. They only show what’s already happened. By the time the dashboard shows the support team’s load doubling in correlation with the activation lift, the drift has been compounding for weeks. The opening scene needed something that could highlight the divergent decision. A mechanism that would have flagged the integration touchpoint’s hidden dependency before the customer felt the consequences.
In systems engineering there is a foundational concept: open-loop and closed-loop systems. An open-loop system acts and assumes the action is working as intended: a timer-based sprinkler that runs regardless of whether it is raining. A closed-loop system measures its output, compares it to its intent, and uses the difference to adjust the next input: a gymnast on a balance beam is constantly making micro-adjustments to stay centered while executing a handspring. Most organizations are running open-loop decision systems in a high-velocity environment. They act, assuming the action was coherent, and they begin the next cycle before any signal arrives to tell them otherwise.
The key principle for drift detection is the tight closed feedback loop. Each decision needs a clear connection between what shipped and whether it was coherent, and that connection must complete within the same cycle that produced it. When the loop closes that fast, divergence surfaces while the team that produced it still has the context to act on it, before the next layer of decisions has been built on top.
A feedback loop that closes every six weeks in a one-week shipping cycle is not actually feedback at all. The next cycle starts before the signal arrives, and the drift compounds into the next five releases before anyone is aware that something’s not quite right. The warning signal becomes a curious amber light that blinks and gets ignored. The team review happens, the meeting ends with celebration about the successful launch, and the next sprint begins with the system drifting a little more. Under acceleration, drift that would have taken a quarter to become obviously wrong can now compound to that point in weeks.
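The arithmetic behind that claim is simple enough to sketch. The function below is a simplified model, not a measurement method: it assumes releases ship on a fixed cadence and that feedback on a release arrives a fixed number of days after it ships. The function name and parameters are hypothetical.

```python
import math


def releases_built_on_drift(loop_days: int, ship_cycle_days: int) -> int:
    """Count releases that ship after a drifting release but before
    the feedback on that release arrives (fixed-cadence assumption)."""
    if loop_days <= 0 or ship_cycle_days <= 0:
        raise ValueError("both intervals must be positive")
    return max(0, math.ceil(loop_days / ship_cycle_days) - 1)


# Six-week loop, one-week shipping cycle: five releases stack on the drift.
six_week_loop = releases_built_on_drift(loop_days=42, ship_cycle_days=7)

# Loop closes inside the cycle that produced the decision: nothing stacks.
same_cycle_loop = releases_built_on_drift(loop_days=7, ship_cycle_days=7)
```

Under these assumptions the first call returns 5 and the second returns 0, which is the difference between a loop that informs the next decision and one that arrives after it.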
A constraint on how tight a loop you can make is worth naming. Customer feedback arrives on the customer’s cycle, not yours. The customer’s month-end closes, quarterly reports, and annual reviews are not yours to change. The tight loop instruments leading indicators: quantitative signals that are continuously available, like integration activity, error rates, and usage patterns at the key touchpoints.
Leading indicators are not a replacement for the customer signal. They tell you something is moving. The customer is still the one who tells you what it means in their business. Leading indicators give you a way to act before that slower signal arrives. You decouple your detection cadence from the customer’s cycle, but your understanding of the customer can only come from the customer.
Instrumenting your leading indicators means being disciplined in signal selection, not in how many signals you capture. To limit the noise, carefully select the signals tied to decisions that commit the organization to something it cannot easily reverse, and build the feedback loop around them. Things like integration changes, customer-facing behavior changes, architectural dependencies, and deprecations. The leading indicator needs to be quantitative, continuous, and available before the next development iteration begins. This doesn’t require new infrastructure, just tuning the purpose of what already exists by making deliberate choices about which signals are being watched and at what rhythm.
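As a rough illustration of what a quantitative, continuous check can look like, the sketch below compares this cycle's value of a single signal against the average of recent cycles. The `drifted` helper, the 25% tolerance, and the sample error counts are assumptions made for illustration; the right signals and thresholds depend on your own touchpoints.

```python
from statistics import mean
from typing import Sequence


def drifted(baseline: Sequence[float], current: float,
            tolerance: float = 0.25) -> bool:
    """Flag a signal whose current value moved more than `tolerance`
    (as a fraction) away from the mean of prior cycles."""
    if not baseline:
        raise ValueError("need at least one prior cycle to compare against")
    base = mean(baseline)
    if base == 0:
        return current != 0
    return abs(current - base) / abs(base) > tolerance


# Hypothetical weekly error counts at one integration touchpoint.
prior_cycles = [40, 42, 38, 40]
flag_this_cycle = drifted(prior_cycles, current=61)   # well above baseline
flag_quiet_cycle = drifted(prior_cycles, current=44)  # within tolerance
```

The point of the sketch is the cadence, not the statistics: the comparison can run inside every cycle using data most organizations already collect.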
A tight closed loop asks one question inside every cycle, across the teams: are the decisions we are making still pulling us all in the same direction? This is not another item on the launch checklist. It is a signal path that completes before the next cycle begins, integrating instrumented leading indicators that stand in for the customer feedback tied to slower cycles.
But a signal path that nobody acts on is not a feedback loop. It is a logging system. The signal can arrive on time, complete its loop, and still produce no change. That failure mode belongs to the system the loop reports into, not the loop itself. That is what the next mechanism addresses.
Incentive realignment
The growth team didn’t just hit their target. They beat it. The celebration was earned, by every measure the system had. The OKR was green and the metric came in above target. The month closed with clear upward momentum.
On the other end of the phone, your most important customer is not celebrating. She is not even angry anymore. She is past angry. She feels betrayed. The organization she staked her credibility on was optimizing for something that had nothing to do with her business, and that optimization is what broke her company’s month-end close. That feature was not a mistake. It was a success.
These two realities exist simultaneously inside that organization right now. They are in conflict because the system produced both of them, exactly as designed, not because someone did something wrong. The roadmap created the target. The OKR attached the reward. The team delivered. Your customer’s experience was never designed into that system.
This is the impedance mismatch incentive realignment has to resolve. The roadmap is not just a plan, it’s a commitment structure that underpins team evaluations, executive agreements, and public announcements. Customer feedback that contradicts the roadmap does not arrive as neutral information. It arrives as a threat to the roadmap. So it gets processed accordingly. It’s only one customer, or an edge case that will be addressed in the next quarter. The rationalization is not dishonest. It is the rational response to a system that rewards delivery and has no mechanism for rewarding the harder question of what the delivery actually cost.
The problem is that the OKR lacked the guardrails to keep it coherent. Activation is the growth team’s view of the customer journey. It’s blind to the customer’s month-end close, to the downstream dependency, to what the product commits to when it changes an integration point. The team moved the metric exactly as designed. The constraint that would have made that metric serve the whole instead of fragment it was never discussed. Without that constraint, your growth team will always push to find a way to move their number by design. The right question is whether moving their number also makes the product better for the customer who has staked her credibility on it.
The principle is to reward what compounds the whole. Each team’s incentives should be designed so local optimization strengthens global coherence rather than undermining it. That requires naming global coherence as a key dimension of performance. Shipping is not enough. What shipped also has to be coherent with everything else the organization has committed to.
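One hedged way to picture that principle is an objective that only counts as met when its coherence constraints also hold. The sketch below is illustrative: `Guardrail`, `evaluate_objective`, the target, and the limit are hypothetical, and an unmeasured guardrail is deliberately treated as breached.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Guardrail:
    """A coherence constraint attached to an objective (hypothetical)."""
    signal: str
    max_allowed: float


def evaluate_objective(actual: float, target: float,
                       guardrails: List[Guardrail],
                       observed: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (counts_as_met, breached_guardrails).

    A guardrail with no observed value counts as breached, so an
    unmeasured constraint cannot silently pass.
    """
    breached = [g.signal for g in guardrails
                if observed.get(g.signal, float("inf")) > g.max_allowed]
    return (actual >= target and not breached, breached)


# Invented numbers: activation beats its target while a guardrail is breached.
guardrails = [Guardrail("integration_errors_per_week", max_allowed=50)]
met, breached = evaluate_objective(
    actual=0.31, target=0.25,
    guardrails=guardrails,
    observed={"integration_errors_per_week": 61},
)
```

Here the metric clears its target but the objective does not count as met, because the constraint that keeps it coherent with the rest of the product did not hold.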
Incentive realignment does something the other two mechanisms cannot do at scale. Aligned incentives are preventive at the source. They change what gets generated in the first place. Named ownership and drift detection can catch some divergence before it ships, but their primary work is in correction. An organization that generates less divergence needs fewer corrections. And the few corrections it does need are cheaper to make. The system is designed to catch deviations before they become six weeks of compounded drift. Before what broke for the customer feels less like a broken feature and more like a broken promise.
Three mechanisms. Three principles. One system.
Looking back at the morning of the call, no single mechanism would have prevented what happened. Named ownership alone would have placed someone accountable for the customer-impact question, but the question still might not have surfaced in time. Drift detection alone would have flagged the divergence, but the signal would have been rationalized away by an organization committed to its roadmap. Incentive realignment alone would have rewarded surfacing the impact, but without an owner to act on it, the incentive would have stayed aspirational. Each mechanism on its own catches only part of the problem and lets the rest through.
Each mechanism works because it creates the conditions the others depend on. Named ownership maps the path drift detection uses to hand a signal off to someone already familiar with the problem space. Drift detection makes divergence visible in time for incentive realignment to reward catching it. And aligned incentives reduce the drift, and therefore how often decisions need to be corrected. None of these mechanisms is the answer on its own. The system that thrives is the one that balances all three.
Not the last to know
The call back is still going to be hard. The customer’s team is still going to spend weeks unwinding what broke. The head of operations is still going to sit in front of her CFO in three days and explain it in her own words. You don’t have an answer she will like. What you have is more than good intentions: an honest structural account of what happened, and a description of what is being built so the next one is caught earlier.
She might still churn. The relationship was never about being perfect. It was about being the kind of partner who could be trusted to correct fast and correct honestly. You are now in a position to actually be that kind of partner, because you started building the structure that makes it true.
Months from now there will be another call. Maybe from her, maybe from a different customer who has been with you through every release. The calls do not stop. The leaders who win this decade are the ones who change who places them, and when.
What is different now is that you make the call first.
Before the customer does. The detection latency that ran six weeks now runs in days. The decision latency that stalled because no one knew who owned the call now routes to a named owner inside the same cycle. The implementation latency was never the problem; it is the one thing AI actually accelerated for you. You name what happened, what you detected, when you detected it, what you decided, and what is already being worked on by the time you are on the phone.
That is the resolution. Not that mistakes stop happening. That you stop being the last one to know.
Where to start
Three diagnostics can begin this quarter. None of them require redesigning anything yet. Each one surfaces what the structural design will need to address.
Map your recurring decisions. Take the last quarter. List the decisions your organization makes repeatedly: feature scoping, customer commitments, deprecation calls, hiring decisions, integration changes. For each class, name two things: the person who owns it, and the decisions adjacent to it that they are not allowed to decide unilaterally. Publish the result. If the list is hard to make, that is the finding. If it is easy to make but no one on the teams could produce the same list independently, that is also the finding. Implicit ownership only feels stable until something breaks. Then it is the absence of the decision ownership map that costs you weeks.
Baseline your correction velocity. Pick the last three times your organization had to course-correct on something that mattered. For each one, measure three intervals. How long between the decision starting to drift and someone inside the building flagging it? How long between flagging and aligning on the correction? How long between deciding and the correction reaching the customer? Add them up. That number is your current correction cycle. Now look at where the time is concentrated. If detection latency dominates, the structural work is drift detection. If decision latency dominates, the structural work is named ownership. If implementation latency dominates, you are in a different operating reality where execution is still the binding constraint. The work is real, but it is a different conversation.
Audit what your incentive structures are actually rewarding. Take any team’s current OKRs or goals. For each objective, ask one question: if the team hit this number while making the product worse for the customer, would the system catch it? If the answer is yes, the constraint is named somewhere, so find out where. If the answer is no, the OKR is a fragmentation engine waiting for the right conditions. The audit does not require redesigning the OKRs. It requires only naming, for each one, what coherence constraint would have to hold for the metric to mean what the team thinks it means.
Each diagnostic produces a finding that is precise enough to act on. A decision rights map with gaps in it. A correction velocity baseline that names the dominant latency. An OKR audit that surfaces what the metrics leave uncovered. With those findings in hand, you have something the operating reality of your organization can be designed against. Without them, the design conversation stays theoretical.
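The correction velocity baseline in particular reduces to a small calculation. The sketch below assumes you have recorded the three intervals, in days, for each past correction. Averaging the cycle across incidents is an added assumption, and the `Correction` record, the `correction_baseline` helper, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Correction:
    """The three intervals, in days, for one past course correction."""
    detection_days: float       # drift begins -> someone flags it
    decision_days: float        # flagged -> aligned on the correction
    implementation_days: float  # decided -> correction reaches the customer


# Where the structural work sits for each dominant latency.
STRUCTURAL_WORK = {
    "detection": "drift detection",
    "decision": "named ownership",
    "implementation": "execution is still the binding constraint",
}


def correction_baseline(corrections: List[Correction]) -> Tuple[float, str, str]:
    """Return (average cycle in days, dominant latency, structural work)."""
    if not corrections:
        raise ValueError("need at least one past correction")
    totals = {
        "detection": sum(c.detection_days for c in corrections),
        "decision": sum(c.decision_days for c in corrections),
        "implementation": sum(c.implementation_days for c in corrections),
    }
    average_cycle = sum(totals.values()) / len(corrections)
    dominant = max(totals, key=totals.get)
    return average_cycle, dominant, STRUCTURAL_WORK[dominant]


# Invented intervals for three past corrections.
history = [Correction(42, 5, 2), Correction(30, 10, 3), Correction(21, 4, 1)]
average_cycle, dominant, work = correction_baseline(history)
```

With these invented numbers, detection accounts for most of the elapsed time, so the baseline points at drift detection as the place to start.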
The findings are not the design. They are the starting conditions for it. The harder work is what comes after: distinguishing the symptoms from the root cause that produced them. Inside any organization, every divergence has a local explanation that sounds plausible. Plausibility is exactly what got the system to this point. A decision that drifted can look like an ownership gap but may actually be an incentive problem about what people believe gets them promoted. A drift signal that wasn’t escalated can appear to be a detection failure, but may in fact be a decision-rights ambiguity that made people deprioritize acting on what was already visible. The symptoms and the causes look surprisingly similar from inside the system. The cost of designing against the wrong one is a redesign that produces a slower bureaucracy or an alert system nobody trusts. The reset on a bad first iteration is expensive.
This is the Noma problem in your own organization. The cookbook is in your hands, but the restaurant still needs to be built. The diagnostics will tell you what your kitchen looks like. The work after that is fitting the specific shape of decision rights, the specific instrumentation of leading indicators, the specific redesign of how teams are evaluated to the operating reality that your diagnostics surface. There is no template that survives first contact with reality. That work is best done partnering with someone who has watched these systems succeed and fail at scale, who has been in the rooms where they are built, and who can tell the difference between a design that fits your operating reality and one that only sounds like it does.
The question is not whether your organization will be redesigned for this operating reality. It is whether you will design it, or whether it will be imposed upon you by the next escalation call.
This is the kind of work I help leaders through directly. If what I've described is what's happening in your organization, reach out and we can talk it through.