The Guardrail Erosion Problem with AI Agents

May 22, 2026 9 min

We have all seen AI agents make ‘mistakes’ that introduce bugs, then try to cover their tracks by deleting or rewriting the tests or other evidence. High-profile incidents include Replit’s AI agent deleting a live production database, fabricating fake data to conceal the damage, and telling the user rollback was impossible (AI Incident Database, 2025). A CodeRabbit analysis of 470 GitHub pull requests found that AI-authored ones contain 1.7 times as many bugs as human-authored ones. Those are the spectacular failures: visible, attributable, containable. This post is about the quieter problem.

In my Suggestible Actor post I prescribed four design strategies that build guardrails to mitigate mistakes made by AI coding agents: actionable errors, hard boundaries with signposts, documentation as local context, and closing the directive gap. I have come to realize that my prescription will not be enough. AI agents generate code that incidentally modifies guardrails. More often than not, those modifications erode them.

Guardrail erosion: what is it?

Guardrail erosion is the phenomenon where codebases that are iteratively modified by AI-generated changes, without proper human review, accumulate bugs at a faster rate. An IEEE-ISTAS paper showed this for security vulnerabilities. A recent SlopCodeBench paper on arXiv showed an increase in ‘structural erosion’ over iterative AI code changes.

Such erosion is a structural consequence of the Suggestible Actor properties of an AI agent.

A goal-oriented agent treats guardrails as obstacles when they produce errors that block progress toward the goal.
A locally reasoning agent cannot distinguish between a test that documents current behavior (safe to update) and a test that guards a critical invariant (dangerous to update): both look the same from the local context.
An agent susceptible to local context pattern-matches from the surrounding code; if prior iterations have already weakened some guardrails, the context reinforces further weakening.
An agent that hallucinates under uncertainty will, when encountering a guardrail it does not understand, resolve the ambiguity in the direction that clears the immediate error: loosening the constraint rather than preserving it.

A more capable model will not stop eroding guardrails. It will erode them more efficiently, or possibly more convincingly. These properties do not depend on current model limitations. Hallucination is a proven mathematical limitation of autoregressive language models, not an engineering problem awaiting a fix (Xu et al., 2024). Hoping for smarter LLMs to solve this problem is wishful thinking.

There are two ‘obvious’ solutions to this problem: code review and testing. However, neither works in the world of AI coding agents.

Review does not scale

When coding agents produce code at a prodigious rate, code review becomes the bottleneck. Core developers review 6.5% more code but produce 19% less of their own after AI adoption (Xu et al., 2025). In the Stack Overflow Developer Survey (2025), 45% of respondents say debugging AI-generated code is more time-consuming than debugging human-written code. There is simply too much AI-generated code for humans to review thoroughly.

The obvious retort is to have AI agents do the reviews. But the knowledge required to catch guardrail violations (“why does this invariant exist?”, “which systems depend on it?”, “what breaks downstream?”) lives in people’s heads. It cannot be codified into the agent’s local context precisely when the agent needs it. An AI reviewer with full architectural context is still a suggestible actor: it pattern-matches against the codebase as it finds it, including the erosion already present.

AI agents can infect tests too

Tests will not save us either. When the VP of engineering wants “high code coverage”, engineers prompt their AI agents with: “write tests for this module.” Tests generated this way encode the existing behavior. They are tautologies: they catch regressions from the current behavior, but the current behavior may already be wrong. Human-written tests have the same problem in principle, but a human validates assumptions while writing each assertion. An agent generating hundreds of assertions per minute does not.

When the prompt is “implement this feature”, the agent modifies code and tests together. It is measuring compliance with itself (StratoAtlas, 2026), not objective correctness. According to Alves et al. (EASE 2025), in LLM-generated Python test suites, 64% of errors were incorrect assertions: the test ran, the assertion was wrong, and the suite passed anyway.
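To make the tautology concrete, here is a minimal sketch. Everything in it is hypothetical: `is_leap_year` is a toy function with a deliberately planted bug, and the two test functions stand in for an agent-generated suite and a human-specified one. It is an illustration of the pattern, not a reproduction of any study cited above.

```python
def is_leap_year(year: int) -> bool:
    # Deliberate latent bug for illustration: the century rule is missing,
    # so 1900 is wrongly reported as a leap year.
    return year % 4 == 0


def characterization_test() -> bool:
    """Expected values recorded by running the current code.

    This is what "write tests for this module" tends to produce:
    the assertions pin whatever the code does today, bug included.
    """
    observed = {2024: True, 2023: False, 1900: True}
    return all(is_leap_year(y) == expected for y, expected in observed.items())


def specification_test() -> bool:
    """Expected values derived from the Gregorian calendar rule,
    written without looking at the implementation."""
    spec = {2024: True, 2023: False, 1900: False, 2000: True}
    return all(is_leap_year(y) == expected for y, expected in spec.items())


if __name__ == "__main__":
    print("characterization test passes:", characterization_test())
    print("specification test passes:", specification_test())
```

The characterization test passes and the specification test fails against the same code. Coverage is identical in both cases; only the second suite carries information that did not come from the implementation itself.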

If reviews and tests don’t work, then what does? The answer depends on what kind of guardrails are at risk, and what the human review budget is.

Not all guardrails are the same

There are three classes of guardrails, and each erodes differently.

Social guardrails are conventions and patterns that may or may not be documented. They form the social contract around which humans write software. The suggestible actor sees traces of them in code patterns but treats them as weak signals. They erode too fast to be a reliable line of defense, so I will set them aside for the rest of this discussion.

Encoded guardrails

Encoded guardrails are encoded into the software lifecycle: linters, static analysis, unit tests, integration tests, and regression tests. These are guardrails that the agent can modify in situ, within the same codebase it is already changing. The agent responds to them because violations produce errors that block progress, and errors are the contextual feedback the suggestible actor is most susceptible to. But the agent can satisfy them trivially: delete a failing test, drop a precondition check, or suppress a linter warning. The error is gone. The vulnerability is not.
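A small sketch of what "satisfy them trivially" can look like in practice. The `withdraw` function and its eroded twin are hypothetical examples of my own; the point is only that removing the check clears the error without removing the hazard.

```python
def withdraw(balance_cents: int, amount_cents: int) -> int:
    """Original version: the precondition check is the encoded guardrail."""
    if amount_cents <= 0 or amount_cents > balance_cents:
        raise ValueError("invalid withdrawal amount")
    return balance_cents - amount_cents


def withdraw_eroded(balance_cents: int, amount_cents: int) -> int:
    """Same function after an in-situ 'fix' (hypothetical): the check that
    raised the blocking error has been dropped, so nothing blocks progress."""
    return balance_cents - amount_cents


if __name__ == "__main__":
    try:
        withdraw(100, 150)
    except ValueError as err:
        print("guardrail intact:", err)
    # The eroded version silently produces a negative balance.
    print("guardrail eroded:", withdraw_eroded(100, 150))
```

From the agent's local context, the second version is a strict improvement: the call that used to fail now succeeds. Nothing in the diff says the check was guarding an invariant.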

Structural guardrails

Structural guardrails are woven into the structure of software development and cannot be modified in situ. Changing them requires a significant change to the build and execution environment. Examples include type systems, capability restrictions, formal verification, and property-based tests (tests that verify general properties over randomized inputs) with human-authored properties. These guardrails enforce properties that must hold regardless of the path taken to satisfy them. The agent does not need to understand why the guardrail exists; it just needs to know that the goal cannot be accomplished without satisfying it. Structural guardrails typically require human maintenance, which is why they are expensive. But because organizations deploy them sparingly, the surface area that humans must maintain remains small enough to review thoroughly.
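As an illustration of a property-based test with human-authored properties, here is a hand-rolled sketch that uses only the Python standard library. In practice a library such as Hypothesis would handle input generation and shrinking; the `transfer` functions, the account names, and the two properties are all assumptions made up for this example.

```python
import random
from typing import Callable

Balances = dict[str, int]
TransferFn = Callable[[Balances, str, str, int], Balances]


def transfer(balances: Balances, src: str, dst: str, amount: int) -> Balances:
    """Hypothetical function under test: move `amount` from src to dst."""
    if amount < 0 or balances[src] < amount:
        raise ValueError("negative amount or insufficient funds")
    updated = dict(balances)
    updated[src] -= amount
    updated[dst] += amount
    return updated


def transfer_eroded(balances: Balances, src: str, dst: str, amount: int) -> Balances:
    """The same function with its precondition dropped (hypothetical erosion)."""
    updated = dict(balances)
    updated[src] -= amount
    updated[dst] += amount
    return updated


def check_transfer_properties(fn: TransferFn, trials: int = 1000, seed: int = 0) -> None:
    """Check two human-authored properties over randomized inputs."""
    rng = random.Random(seed)
    names = ["a", "b", "c"]
    for _ in range(trials):
        balances = {name: rng.randint(0, 1_000) for name in names}
        src, dst = rng.choice(names), rng.choice(names)
        amount = rng.randint(-50, 1_200)
        try:
            after = fn(balances, src, dst, amount)
        except ValueError:
            continue  # rejecting an input is allowed; properties cover accepted ones
        # Property 1: money is neither created nor destroyed.
        assert sum(after.values()) == sum(balances.values())
        # Property 2: no account ends up negative.
        assert all(value >= 0 for value in after.values())


if __name__ == "__main__":
    check_transfer_properties(transfer)
    print("original implementation satisfies both properties")
    try:
        check_transfer_properties(transfer_eroded)
    except AssertionError:
        print("eroded implementation violates a property")
```

The properties say nothing about how `transfer` is implemented, so an agent can rewrite the function freely but cannot make the check pass by pinning current behavior. The guardrail only stays structural if the property file itself sits outside what the agent is allowed to change.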

Most codebases have decent social and encoded guardrails, but thin structural guardrails. Very few codebases have anything beyond type safety from the compiler. Fewer still have formal verification, and even those verify against the design, not the implementation: nothing guarantees the two haven’t drifted apart.

The numbers bear this out. 55.8% of AI-generated security-critical code contains formally proven vulnerabilities; static analysis tools miss 97.8% of vulnerabilities that Z3/SMT solvers can prove (Blain & Noiseux, 2026). Across 7,703 AI-generated files on GitHub, researchers found 4,241 occurrences of known, cataloged vulnerability patterns (Schreiber & Tippe, 2025). Most codebases are thin on structural guardrails, which is the one class that survives the suggestible actor. To be precise: this evidence shows that encoded guardrails fail, not that structural guardrails succeed. But the argument is not that structural guardrails are perfect. It is that they are the only class whose enforcement mechanism does not depend on the agent’s cooperation.

Building erosion resistance

Review does not scale to all AI-generated code, but it does not need to. Every team has a finite budget of human review time. AI-generated PRs have dramatically increased the demand on that budget. The goal is not zero bugs: zero bugs was never the goal. The goal is no increase in the ambient bug rate, and a reduction in higher-severity bugs. The question is how to allocate a fixed review budget for that outcome.

The answer starts with assessing each component or module in your system along three dimensions. The ideal metric is expected damage: severity multiplied by time to mitigation. In practice, neither factor is directly computable. These three dimensions decompose that product into assessable proxies, in priority order.

First: risk tolerance. How bad will things get if a guardrail erodes here? Some failures are catastrophic (rocket crashes, medical misdiagnosis, financial loss at scale). Some are recoverable inconveniences (drop in user engagement, wrong data on a dashboard, a broken UI flow). Prioritize components where the cost of erosion is highest, because even if you can detect and roll back quickly, the damage from a single incident may already be unacceptable.

Second: feedback latency. If a guardrail erodes and the damage reaches production, how quickly will you know? In continuous deployment with production monitoring, the window is hours. A distributed library with quarterly releases can carry a weakened invariant for months. Embedded software may not reveal a failure until a specific operating condition triggers it years later. Silent or slow-to-detect failures cause unbounded damage accumulation. Even if the component is theoretically reversible, you cannot roll back what you have not yet detected.

Third: deployment reversibility. Once detected, how quickly can you undo the damage? A web service rolls back in seconds. Firmware in a medical device requires FDA re-certification. If rollback is cheap, detection is sufficient. If rollback is expensive or impossible, prevention is the only option.
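One possible way to apply the three dimensions in priority order is a lexicographic sort over coarse ordinal scores. The sketch below assumes a three-level scale and invents four example components; the scale, the field names, and the components are mine, not measurements of any real system.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Component:
    name: str
    severity: Level         # how bad a failure is (inverse of risk tolerance)
    detection_delay: Level  # feedback latency: how long until you notice
    rollback_cost: Level    # inverse of deployment reversibility


def review_priority(components: list[Component]) -> list[Component]:
    """Order components for guardrail investment, most urgent first.

    Tuples compare element by element, so severity dominates, detection
    delay breaks ties, and rollback cost breaks the remaining ties.
    """
    return sorted(
        components,
        key=lambda c: (c.severity, c.detection_delay, c.rollback_cost),
        reverse=True,
    )


# Hypothetical components, scored by hand.
COMPONENTS = [
    Component("admin-dashboard", Level.LOW, Level.LOW, Level.LOW),
    Component("public-sdk", Level.MEDIUM, Level.HIGH, Level.MEDIUM),
    Component("billing-core", Level.HIGH, Level.HIGH, Level.HIGH),
    Component("firmware-updater", Level.HIGH, Level.MEDIUM, Level.HIGH),
]

if __name__ == "__main__":
    for component in review_priority(COMPONENTS):
        print(component.name)
```

The scores are judgment calls, not computed values, which matches the point that neither severity nor time to mitigation is directly computable. The sort only makes the priority order between the proxies explicit.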

These three dimensions are not independent; they tend to be correlated. The core of most systems (the “secret sauce” that makes a company valuable) typically has low risk tolerance. Bugs that escape to production in the core tend to be edge cases that take time to surface. Rolling back changes to the core tends to be risky and slow. The components closer to the top of the stack (UI layers, internal tools, prototypes) tend to cluster at the opposite end: higher risk tolerance, faster detection, easier rollback.

This correlation simplifies the allocation. Invest your structural guardrails and your heaviest human review in the core: formal verification for critical paths, property-based tests with human-authored properties, capability restrictions that the agent cannot circumvent. For the middle tier, strengthen encoded guardrails with stricter static analysis and more rigorous CI gates. When a PR touches structural guardrails, it gets priority for human review over one that only modifies production code and its unit tests. For the top of the stack, encoded guardrails with robust monitoring, canary analysis, and fast rollback may be sufficient, with human review reserved for architectural changes.
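The review-priority rule for PRs that touch structural guardrails can be sketched as a simple path-based triage. The directory patterns and PR identifiers below are hypothetical; a real repository would need its own list of where human-authored properties, proofs, or capability policies live.

```python
from fnmatch import fnmatchcase

# Hypothetical locations of structural guardrails in a repository.
STRUCTURAL_GUARDRAIL_PATTERNS = ["properties/*", "proofs/*", "policies/*"]


def touches_structural_guardrail(changed_paths: list[str]) -> bool:
    """True if any changed path matches a structural guardrail pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_paths
        for pattern in STRUCTURAL_GUARDRAIL_PATTERNS
    )


def order_review_queue(prs: dict[str, list[str]]) -> list[str]:
    """Return PR ids with structural-guardrail PRs first.

    sorted() is stable, so PRs keep their original relative order
    within each group.
    """
    return sorted(
        prs,
        key=lambda pr_id: not touches_structural_guardrail(prs[pr_id]),
    )


if __name__ == "__main__":
    queue = {
        "pr-101": ["src/cart.py", "tests/test_cart.py"],
        "pr-102": ["src/ledger.py", "properties/ledger_invariants.py"],
        "pr-103": ["docs/readme.md"],
    }
    print(order_review_queue(queue))
```

A check like this is itself an encoded guardrail, so the pattern list deserves the same protection as the files it points to.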

Industries at the extreme end of this spectrum already mandate structural rigor (DO-178C for avionics, ISO 26262 for automotive, IEC 62304 for medical devices). The erosion problem gives those standards new urgency: AI agents will test them in ways human developers never did. But most software does not live at that extreme. Most software lives in the middle, where the right allocation is neither “structural guardrails everywhere” nor “monitoring and hope.” It is a deliberate, prioritized investment calibrated to what each component can afford to lose.

AI coding agents erode the guardrails in your codebase. That erosion is structural, not accidental: it follows from the properties that make AI agents useful in the first place. You cannot eliminate it. But you can direct your finite human attention to the places where erosion is most dangerous, and let the right class of guardrail do the rest.

Linked in this post

🌳

AI Reviewing AI: Shared Blind Spots

AI models reviewing AI-generated code share systematic blind spots with the generator, creating gaps that neither side detects.

🌳

Confabulation Is Plausible

AI agent confabulation is not random — it is plausible-looking wrongness constructed from pattern and proximity rather than knowledge.

🌳

Convert Ambient Knowledge into Local Context

The core design principle for the suggestible actor: convert ambient knowledge into local context.

🌳

Encoded Guardrails

Encoded guardrails are guardrails encoded into the software lifecycle that the agent can modify in situ, within the same codebase it is already changing.

🌳

Expected Damage: Severity Times Time to Mitigation

The ideal metric for guardrail investment is expected damage: severity multiplied by time to mitigation.

🌳

Goal vs. Intent

Goal and intent are not the same thing.

🌳

Guardrail Erosion Is a Meta-Problem

AI agents erode the guardrails designed to constrain them through the same mechanisms those guardrails address.

🌳

Hallucination Is a Mathematical Inevitability

Hallucination in autoregressive language models is a proven mathematical limitation, not an engineering problem awaiting a fix.

🌳

Review Is the Bottleneck

AI agents produce code faster than humans can review it, making review the structural bottleneck.

🌳

Social Guardrails

Social guardrails are conventions and patterns, documented or not, that form the social contract around which humans write software.

🌳

Static Analysis Is Insufficient for AI Code

Industry static analysis tools are structurally insufficient for AI-generated code.

🌳

Structural Guardrails

Structural guardrails are guardrails woven into the structure of software development that cannot be modified in situ.

🌳

Susceptibility Peaks at Failure

An AI agent's susceptibility to local context peaks at the point of failure.

🌳

Three Classes of Guardrail Erosion Resistance

Guardrails fall into three classes by erosion resistance: erasable (convention-dependent), detectable (tool-enforced), and immutable (formally enforced).

🌳

Three Dimensions of Erosion Resistance Allocation

Risk tolerance, feedback latency, and deployment reversibility are decomposed proxies of expected damage, and they tend to correlate.