Guardrail Erosion Is a Meta-Problem
AI agents erode the guardrails designed to constrain them through the same mechanisms those guardrails address. A hard boundary encoded as a compile-time check survives. A hard boundary encoded as a convention-enforcing code pattern does not, because the agent pattern-matches from other code rather than understanding why the pattern exists. The suggestible actor does not dismantle guardrails intentionally. It simply does not notice them when they are in its way.
This creates a self-reinforcing loop: guardrails erode, the codebase drifts toward the dominant pattern, and each successive agent iteration has less guardrail signal to learn from. The mechanism is analogous to Shumailov et al.โs model collapse (Nature 2024): recursive training on generated data causes tails of the distribution to disappear, and guardrails are the tails.