How One ADR Got Claude to Stop Making the Same Mistake
Here is an ADR that Claude has not violated since I wrote it. Not once. Not in 226 sessions. Not across 30 repositories. Not on a tired Sunday night when the model was halfway through a refactor and tempted to take a shortcut.
It is not about microservices. It is not about clean architecture. It is about a Python type. Specifically, when to use Decimal instead of float. The decision is boring. The way the ADR is written is not.
Before this ADR existed, every LLM I tried — Claude, GPT, the rest — would happily store a price as 100.0 in DynamoDB, watch the test fail, and "fix" the test by changing the assertion to 100.0 on the way back out. The production code stayed broken. The test stayed green. The bug went into prod.
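You can reproduce the whole failure without touching AWS. A minimal sketch using boto3's own `TypeSerializer` (the same machinery the DynamoDB resource API runs before anything hits the network):

```python
from decimal import Decimal
from boto3.dynamodb.types import TypeSerializer

# The comparison behavior that lets a "fixed" test stay green:
print(Decimal("100") == 100.0)   # True  - integral values hide the mismatch
print(Decimal("0.95") == 0.95)   # False - fractional ones expose it

# And the write path: boto3's serializer rejects floats outright.
ser = TypeSerializer()
print(ser.serialize(Decimal("0.95")))  # {'N': '0.95'}
print(ser.serialize(0.95))             # raises TypeError: Float types are not
                                       # supported. Use Decimal types instead.
```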
I have been writing ADRs for years and evolving the format every time a new failure mode taught me something. What changed when AI assistants became the executor was not that ADRs suddenly mattered — they always did. What changed was that the reader is no longer a thoughtful human who fills gaps with judgment. Most of what I had been refining — status, context, decision, consequences — was built for that reader. The Decimal failure was the moment it became obvious that the format I had been carrying forward needed a different shape.
So I rewrote the ADR with seven specific properties in mind. The failure rate went to zero.
This is the post about those seven properties — what makes an ADR survive contact with an AI executor, and why most ADRs you have written do not.
The ADR
Here is the load-bearing part of ADR-009 in our standards repository. It is longer in full, but this is the spine:
# ADR-009: Use Decimal Instead of Float for AWS and Numeric Operations
## Decision
Always use `Decimal` from Python's `decimal` module for numeric values
that interact with AWS services or require precise arithmetic.
### Rules
1. DynamoDB Operations: All numeric values stored in or retrieved from
DynamoDB MUST use `Decimal`
2. Test Fixtures: Numeric test data MUST use `Decimal` to match
DynamoDB responses
3. Comparisons: When comparing numbers, convert to the same type first
4. API Responses: Convert `Decimal` to `int` or `str` for JSON
serialization (JSON doesn't support `Decimal`)
### Code Patterns
Correct - Using Decimal:

```python
from decimal import Decimal

item = {
    "pk": "DOC#123",
    "count": Decimal("100"),
    "score": Decimal("0.95"),
}
```

Incorrect - Using Float:

```python
# DON'T DO THIS
item = {
    "count": 100.0,       # Will cause type mismatch
    "score": 0.95,        # Precision issues
    "limit": float(100),  # Unnecessary, still wrong type
}
```
### For Existing Tests
When tests fail with Float/Decimal mismatch:

1. Update test fixtures to use `Decimal`
2. Update assertions to compare same types
3. Do NOT change production code to return `float` - fix the tests

Look at that last block. Do NOT change production code to return float — fix the tests. That single line is doing more work than the rest of the ADR combined. It is the line that took our failure rate to zero. We will come back to it.
Why this works when CLAUDE.md did not
For years — and at this point "years" still means "since 2024" — the conventional wisdom on steering AI coding assistants has been: write a CLAUDE.md, list your conventions, hope for the best. It is a reasonable starting point. It is not a stable end state.
CLAUDE.md is a single flat file the model reads at the start of a session. Conventions written there compete with everything else in context: the user's prompt, the open files, the diff being reviewed, the snippets pulled in from search. By the time the model is on its third tool call into a refactor, your CLAUDE.md rule about Decimal has been pushed half a screen out of working memory by the actual code it is editing.
ADRs, when served properly, do not compete on flat-file attention. They are retrieved on demand by the executor when the diff touches a relevant surface. The model edits a DynamoDB write — the ADR for DynamoDB types loads. The model edits a test — the ADR for testing standards loads. Context is delivered just-in-time, not crammed up-front.
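What "retrieved on demand" can look like, as a deliberately stripped-down sketch. The routing table, the path patterns, and the function are illustrative, not our real index:

```python
import re

# Hypothetical routing table: map diff surfaces to the ADRs that govern them.
ADR_ROUTES = [
    (re.compile(r"dynamodb|boto3", re.I), "ADR-009"),  # AWS numeric types
    (re.compile(r"(^|/)test_.*\.py$"),    "ADR-009"),  # fixtures must match
]

def adrs_for_diff(paths: list[str], patch: str) -> set[str]:
    """Load only the ADRs relevant to the surfaces this diff touches."""
    return {
        adr
        for pattern, adr in ADR_ROUTES
        for text in (*paths, patch)
        if pattern.search(text)
    }

adrs_for_diff(["app/dynamodb_repo.py"], '+item = {"count": 100.0}')  # {'ADR-009'}
```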
But "served properly" is doing a lot of work in that paragraph. Most ADRs, even when retrieved, still fail. The Decimal one does not. Here is why.
Property 1: Falsifiable, not aspirational
Most ADRs read like values statements. "We prefer immutability where possible." "We strive for loose coupling." "Services should own their data." These are fine for humans because humans interpret. They are useless for an executor because they are not testable on a per-line basis.
Look at the rule in ADR-009: All numeric values stored in or retrieved from DynamoDB MUST use Decimal. That is binary. Every numeric literal in the diff is either compliant or not. The model can grade itself against the rule on every chunk of generated code — which is exactly what happens in our self-review loop.
If your ADR cannot be enforced by a regex over a diff — or, more honestly, by a smart executor reading the diff — the ADR is not falsifiable. Rewrite it until it is.
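To make that concrete, here is roughly what the crudest possible enforcement looks like. The pattern is deliberately dumb, and a real executor reading the diff does better, but the point is that the rule admits a mechanical check at all:

```python
import re

# Flag bare float literals in added lines. Crude, but falsifiable:
# every added line either matches or it does not.
FLOAT_LITERAL = re.compile(r"[=:(\[,]\s*-?\d+\.\d+")

def violations(diff: str) -> list[str]:
    return [
        line for line in diff.splitlines()
        if line.startswith("+") and FLOAT_LITERAL.search(line)
    ]

violations('+    "count": 100.0,')           # flagged
violations('+    "count": Decimal("100"),')  # clean
```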
Property 2: Real error message, not a description of the failure class
The full ADR includes the actual error our test suite produced:
```
FAILED test_list_recent_docs.py - Type error: Float/Decimal mismatch
Expected: 100.0 (float)
Got: Decimal('100') (Decimal)
```

This is doing something subtle and important. The model has, somewhere in its training corpus, seen this exact error string — and the patterns of code that produce it. When the executor is later debugging a similar test, the ADR pattern-matches against the failure surface, not just against the rule. It is a hand-off from rule-based reasoning to retrieval-based reasoning, and it is the difference between an ADR that gets followed and an ADR that gets ignored.
Rule of thumb: if your ADR cannot show the actual log line, stack trace, or output that motivates it, the ADR is missing its strongest piece of context. Paste it in.
Property 3: Labeled wrong example, not just labeled right ones
Most ADRs include a "good example" code block. ADR-009 includes both, and it labels them. "Correct — Using Decimal:" followed by the right pattern. "Incorrect — Using Float:" followed by the wrong one, with inline # DON'T DO THIS comments on every offending line.
You might think the wrong example is dangerous — what if the model copies it? It does not. And this is not just an LLM thing; it is true of human code review too. Negative examples calibrate. They tell the executor: this is exactly the shape of code I am pattern-matching to forbid. Without them, the model has to infer the negative space, and it infers it badly.
I have a stronger version of this principle, learned through repeated failure: an ADR without a labeled wrong example is a half-finished ADR. The right pattern alone leaves too much room for the executor to invent novel ways to be wrong.
Property 4: Anticipates the wrong fix
This is the most important property and the one almost no team does. Recall the line:
Do NOT change production code to return float — fix the tests.
Without that line, here is what happens. A test fails with Float/Decimal mismatch. The executor reads the failing test. It sees a comparison: assert response["count"] == 100.0. It looks at the production code that produced Decimal("100"). It thinks: "the test wants a float, so let me make production return a float." It does. The test passes. The bug is now permanent.
That sequence is locally reasonable — the model is following the path of least resistance — and globally catastrophic. The ADR exists to forbid the locally-reasonable wrong fix.
When you write an ADR for an AI executor, ask: what is the lazy fix that looks like compliance but is not? Then write it down and forbid it explicitly. This is the immune-system property. The ADR does not just prescribe; it anticipates.
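A sketch of the fork in the road, with illustrative function names. The first function is what the path of least resistance produces; the pair after it is what the ADR forces:

```python
from decimal import Decimal

def get_count_lazy(item):
    # The locally reasonable wrong fix: bend production to the test.
    return float(item["count"])   # DON'T DO THIS - the bug is now permanent

def get_count(item):
    # The mandated fix: production keeps Decimal, the test moves instead.
    return item["count"]

def test_get_count():
    item = {"count": Decimal("100")}          # fixture matches DynamoDB
    assert get_count(item) == Decimal("100")  # same-type comparison
```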
Property 5: Negative space — explicit "do not use this for X"
ADR-009 is scoped: numeric values that interact with AWS services or require precise arithmetic. Not all numbers. Not all floats anywhere. Just this scope. That scoping is the negative space.
Without it, the executor will over-apply. It will start replacing every float in the codebase with Decimal, including in math operations where Decimal would be slower and inappropriate — physics simulations, ML feature vectors, image processing. I have watched it happen on other repos. Over-application is the failure mode of an ADR with no negative space.
Every rule needs a fence. "Use Decimal for AWS-bound numeric values" implies "do not use Decimal for values that are not AWS-bound." But "implies" is not how an executor parses. State the fence.
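Stated as code rather than prose, the fence looks something like this (the `normalize` helper is illustrative):

```python
from decimal import Decimal

# In scope: AWS-bound value where exactness matters.
item = {"pk": "DOC#123", "score": Decimal("0.95")}

# Out of scope: a hot numeric path. Decimal here would be slower and buy
# nothing - the ADR must say so, or the executor will "fix" it anyway.
def normalize(vector: list[float]) -> list[float]:
    total = sum(vector)
    return [v / total for v in vector]
```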
Property 6: Flexibility where flexibility is correct
ADR-009 has two acceptable patterns for comparing values in tests:
```python
from decimal import Decimal

response = {"count": Decimal("100")}  # as returned by DynamoDB

# Option A: compare same type
assert response["count"] == Decimal("100")

# Option B: convert both to native int
assert int(response["count"]) == 100
```

Both pass. Both are correct. The ADR explicitly endorses both. This matters because if you write a single rigid pattern, the executor will refuse to use the other when the surrounding code style calls for it — and you will get awkward, inconsistent diffs as a result.
Rigidity should be reserved for the part of the rule where rigidity is correct. Storing in DynamoDB? Always Decimal. Comparing in tests? Either of two patterns. Get this distinction wrong and the executor will either bend a hard rule or invent unnecessary friction around a soft one.
Property 7: Drift forward — the corpus has to evolve
The first version of ADR-009 did not have the line about "do not change production code to return float." That line was added the third time we hit the failure mode in production. The labeled wrong example was added after the first time the executor invented a novel way to mishandle JSON serialization. The boto3 TypeSerializer guidance was added when the team started using the low-level client and the high-level pattern stopped applying.
ADRs ossify if you do not write back to them. Every novel failure that an executor produces — every "huh, I would not have predicted that" moment — is a signal that the ADR is missing a clause. The corpus has to drift forward as reality drifts forward, or the ADR slowly stops being the source of truth and the executor starts inventing its own.
This is the operational discipline that makes the Outcome Engineer role real. You are not just writing code. You are stewarding a corpus that the executor reads, edits against, and grades itself by.
ADRs as governance documents vs. immune-system documents
Most ADRs you have read in your career are governance documents. They exist so that six months from now, someone can ask "why did we do it this way?" and an answer exists. They are written for retrospection, for audit, for onboarding the next architect.
Those ADRs assume the reader is a thoughtful human who will interpret, contextualize, and apply judgment. That assumption used to be safe. It is not safe anymore.
The reader is now an executor. It is fast, it is literal, it is non-judgmental, it works at 3am on a tired Sunday, and it will happily change production code to make a test pass if you do not tell it not to. Governance documents do not survive that reader. Immune-system documents do.
The mental shift is not subtle:
Governance ADR:

- Status, context, decision, consequences
- Aspirational language, principles
- One worked example (the right one)
- Written once, read at retrospectives
- Reader: thoughtful human

Immune-system ADR:

- Falsifiable rule plus labeled right and wrong examples
- Real error messages, real stack traces
- Anticipates the lazy wrong fix and forbids it
- Scoped — "do not apply to X"
- Updated every time the executor finds a new failure
- Reader: AI executor (and humans, secondarily)
Governance ADRs explain. Immune-system ADRs constrain.
The deployment loop this enables
Once your ADRs look like this, something interesting happens to your delivery pipeline. We run a four-stage loop in production:
1. MCP-served standards. Our ADRs and conventions are exposed as a queryable knowledge base over MCP. The executor pulls only what is relevant to the diff. No CLAUDE.md context bloat, no stale rule on a tab no one opened.
2. Chunk-time grading. Every chunk of generated code is graded against the falsifiable rules in the relevant ADRs before it lands. Decimal violations get caught before the file is saved, not during code review.
3. Self-review pass. The executor re-reads its own diff against the same ADR set and produces a written compliance check. If the check fails, it iterates.
4. Then, and only then, `--dangerously-skip-permissions`. The autonomous flag becomes safe to use because the immune system has already done its job. We are not skipping safety; we are moving safety upstream into the corpus.
This is the loop that turns ADRs from documentation into infrastructure. It is also why I can make claims like "Claude has not violated this ADR in 226 sessions across 30 repositories." That is not a product pitch — it is measured production data, because the grading at step 2 is logged.
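None of the machinery is exotic. A stripped-down sketch of what step 2 can look like, with hypothetical rule and log shapes. Our real grader is an executor reading the diff against the ADR text, not a regex, but the logging discipline is the same:

```python
import json
import re
import time

# Hypothetical: each ADR contributes at least one mechanical check.
RULES = {"ADR-009": re.compile(r"[=:(\[,]\s*-?\d+\.\d+")}

def grade_chunk(chunk: str, session_id: str) -> list[dict]:
    findings = [
        {"adr": adr, "line": line, "session": session_id, "ts": time.time()}
        for adr, rule in RULES.items()
        for line in chunk.splitlines()
        if rule.search(line)
    ]
    for finding in findings:
        # The log line is the point: "zero violations in 226 sessions" is
        # only a claim you can make if every grade is recorded somewhere.
        print(json.dumps(finding))
    return findings
```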
What to do tomorrow morning
Pick the most painful, most-repeated mistake your AI assistant makes in your codebase. Not the architecturally interesting one — the boring one that costs you 20 minutes a week and a code review every two weeks. The float/Decimal of your stack.
Open the ADR for it — or write the first one if you do not have any. Then audit it against the seven properties:
- Is the rule falsifiable on a per-line basis?
- Have you pasted the real error message?
- Is there a labeled wrong example with explicit "DON'T DO THIS"?
- Does it anticipate the lazy fix and forbid it?
- Is the scope fence stated explicitly?
- Where flexibility is correct, does it name multiple acceptable patterns?
- Is it updated when the executor surprises you?
If the answer to any of these is no, that is the gap. Close it. Then watch the failure rate move.
The thing I keep telling teams that ask why our delivery loop works and theirs does not: it is not the model. We use the same models you do. It is not the prompt. We do not have secret prompts. It is the corpus. We treat ADRs as load-bearing infrastructure, not as documentation. Once you make that shift, the executor becomes an asset rather than a liability.
The Decimal ADR is one rule on one type in one language. Get it right, and you have a template for every other rule that matters in your codebase.
Want this loop in your stack?
OutcomeOps deploys ADR-served standards, chunk-time grading, and the self-review pass directly into your AWS account. No SaaS proxy. Your code never leaves.