<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="/feeds/rss-style.xsl" type="text/xsl"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Srikanth Sastry</title>
        <link>https://srikanth.sastry.name/</link>
        <description>Personal blog — technology, programming, governance, and life.</description>
        <lastBuildDate>Tue, 12 May 2026 01:00:03 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>Astro-Theme-Retypeset with Feed for Node.js</generator>
        <language>en</language>
        <copyright>Copyright © 2026 Srikanth Sastry</copyright>
        <atom:link href="https://srikanth.sastry.name/rss.xml" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[AI vs. Open Source, Part 1: The Empty Grant]]></title>
            <link>https://srikanth.sastry.name/ai-vs-open-source-the-empty-grant/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/ai-vs-open-source-the-empty-grant/</guid>
            <pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[The step function increase in AI's ability to generate code is looming over open source. What frontier models can do today is a warning shot...]]></description>
            <content:encoded><![CDATA[<p>The step function increase in AI's ability to generate code is looming over open source. What frontier models can do today is a warning shot, already enough to dissolve the legal scaffolding that makes open source enforceable. Historically, companies with flagship open-source software have relied on relicensing as a weapon to protect their competitive advantage. <a href="https://www.mongodb.com/legal/licensing/server-side-public-license/faq">MongoDB</a> moved from AGPL to SSPL in 2018, <a href="https://changelog.com/news/why-were-relicensing-cockroachdb-EOaR">CockroachDB</a> went from Apache 2.0 to BSL in 2019 to a <a href="https://github.com/cockroachdb/cockroach/commit/c0274df57a9f8d0086577bcf74c81110db2cea22">custom CockroachDB license</a> in 2024, <a href="https://www.elastic.co/blog/why-license-change-aws">Elasticsearch</a> followed in 2021, <a href="https://www.hashicorp.com/en/blog/hashicorp-adopts-business-source-license">HashiCorp</a> switched Terraform and Vault to BSL in 2023, <a href="https://blog.sentry.io/introducing-the-functional-source-license-freedom-without-free-riding/">Sentry</a> created an entirely new license (FSL) that same year, and <a href="https://redis.io/blog/redis-adopts-dual-source-available-licensing/">Redis</a> went source-available in 2024, mostly in response to cloud vendors offering their code as managed services. That weapon is now obsolete as AI threatens to make licenses completely irrelevant.</p>
<h2>AI-generated code? No copyright for you!</h2>
<p>Every open source license is a <a href="/garden/copyright-sole-enforcement-mechanism/">conditional grant of copyright</a>. The author holds the copyright, and the license grants permission to use the work only if certain conditions (e.g., attribution, source disclosure, or reciprocal licensing) are satisfied. This is the only enforcement mechanism that sustains open source through the chain of derived works. Without it the entire structure collapses.</p>
<p>AI-generated code is not copyrightable. The D.C. Circuit held in <a href="https://law.justia.com/cases/federal/appellate-courts/cadc/23-5233/23-5233-2025-03-18.html"><em>Thaler v. Perlmutter</em></a> that the Copyright Act requires a human author. The U.S. Copyright Office <a href="https://Copyright.gov/newsnet/2025/1060.html">confirmed</a> that providing prompts to an AI does not constitute sufficient human authorship. This is U.S. law; other jurisdictions differ, but the enforcement gap is universal. The copyright status of "AI-assisted" code is still a <a href="/garden/ai-assisted-boundary-undefined/">legal gray area</a>. While code written with "AI assistance" is copyrightable, the line between AI-generated and merely AI-assisted remains undefined. Is it sufficient to change a comment in AI-generated code to make it AI-assisted? No court has drawn that line.</p>
<p>AI-generated code is already at the gate. Open source maintainers are <a href="https://www.opensourceforu.com/2026/02/github-weighs-pull-request-kill-switch-as-ai-slop-floods-open-source/">drowning in "vibe coded" pull requests</a>: AI-generated submissions with minimal human oversight. <a href="https://itsfoss.com/gentoo-linux-bans-ai-code/">Gentoo</a> has banned AI-generated code contributions outright. <a href="https://www.netbsd.org/developers/commit-guidelines.html">NetBSD</a> classifies them as tainted code requiring core developer approval. The Linux kernel <a href="https://github.com/torvalds/linux/blob/master/Documentation/process/coding-assistants.rst">allows them but mandates disclosure and full human accountability</a>. Quality is the basis for rejection today. That filter has a shelf life. As the models improve, the quality will improve. The ethical case for rejecting machine-generated contributions becomes harder to make when the code is indistinguishable from human work.</p>
<p>Code without copyright cannot be licensed. The requirement to share source becomes unenforceable for modifications that have no copyright. Such code falls into a legal void: not public domain (no affirmative dedication), not proprietary (no copyright to assert), not open source (no license that can attach). The license text still sits in the repository. It is an <a href="/garden/empty-grant/">empty grant</a>.</p>
<h2>To free, or not to free</h2>
<p>Consider any corporation that writes and maintains code under an open source license. If AI-generated code enters that repository, the license grant over those contributions is void. The codebase becomes unauditable. Some files are copyrighted and licensed, others are legally unowned, and still others are legally contestable "AI-assisted" code.</p>
<p>Every team using Copilot or Claude Code produces ambiguously authored output. The corporation is strongly incentivized to close the source rather than maintain an open codebase with no legal protection. The relicensing wave already demonstrated this pattern: when the legal basis for openness stops serving the business, the business closes the code. AI-generated code is a larger threat than cloud vendors ever were. Cloud vendors merely underpriced these companies. AI dissolves the legal mechanism that made their licenses mean anything.</p>
<h2>Why reciprocate when you can replicate?</h2>
<p>Even if all the lawyers in the world agreed on the copyright question, a second problem remains: AI's ability to clone functionality with new source code.</p>
<p>Clean-room reimplementation has precedent. <a href="https://law.justia.com/cases/federal/appellate-courts/ca9/92-15655/92-15655-1992-10-20.html"><em>Sega v. Accolade</em></a> established that reverse engineering for interoperability is fair use. Yet there was no widespread reimplementation of open source software into closed source counterparts. The economics did not make sense. Rewriting a mature project from scratch took months of expert labor, regardless of what license it carried. <a href="/garden/ai-collapses-reimplementation-moat/">Compliance was cheaper than reimplementation</a>. Until now.</p>
<p>With AI, the cost of generating code has gone down to near zero. <a href="https://dan-blanchard.github.io/blog/chardet-rewrite-controversy/">Dan Blanchard rewrote</a> the Python <code>chardet</code> library with Claude Code to sidestep the <a href="https://heathermeeker.com/2026/04/09/the-chardet-controversy-open-source-and-the-ai-clean-room/">LGPL</a>. A project that would have taken a team months was completed in days. <code>chardet</code> is a proof of concept, not the end state. Software is modular, and that modularity compounds: as individual components are cloned, they become building blocks for cloning progressively larger and more complex systems. This is not a today problem. It is a next-year problem. <a href="https://www.404media.co/this-ai-tool-rips-off-open-source-software-without-violating-copyright/">MALUS.sh</a> took the concept further, launching as a satirical "clean room as a service." Feed it any open source project. It produces a functionally equivalent clone stripped of all license obligations. No attribution. No copyleft. The satire landed because the tool works.</p>
<p>Granted, that is still <a href="https://www.marks-clerk.com/insights/latest-insights/102mp7s-can-ai-legally-clone-open-source-unpacking-clean-room-as-a-service/">legally fraught</a> because the AI model was trained on open source software, and traditional clean-room doctrine required that the reimplementing team had no access to the original source. Whether the model's transformation of training data into weights constitutes a sufficient "clean room wall" is novel law. No court has ruled.</p>
<p>Regardless, enforcement at scale is nearly impossible. You cannot pursue every clone. You cannot detect every AI-generated clone. The economic bulwark of expensive code writing is gone irrespective of the legal outcome. And the cost will only continue to drop. What frontier models clone imperfectly today, the next generation will clone competently. The question is whether actions will follow incentives.</p>
<h2>What remains</h2>
<p>Open source has survived every prior threat by adapting its licensing regime. Tivoization got GPLv3. Cloud free-riding got SSPL and BSL. Importantly, the legal machinery worked, because copyright was relevant and valuable. AI is different. The machinery itself is failing. The grant is empty and the moat is collapsing. The onslaught of automated discovery and generation is incentivizing institutions to close their source code.</p>
<p>That would be survivable if the community that built open source could regroup and adapt as it always has. Part 2 examines why that is no longer a safe assumption.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[SECURE Data Act: The dilution in pseudonymization]]></title>
            <link>https://srikanth.sastry.name/secure-data-act-the-dilution-in-pseudonymization/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/secure-data-act-the-dilution-in-pseudonymization/</guid>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Disclaimer: I am not a lawyer, and this post is not advising any technical implementation in pursuit of any privacy regulation. The opinions...]]></description>
            <content:encoded><![CDATA[<p><em>Disclaimer: I am not a lawyer, and this post is not advising any technical implementation in pursuit of any privacy regulation. The opinions expressed here are my own and do not represent the views of my employer.</em></p>
<p>The <a href="https://www.congress.gov/bill/119th-congress/house-bill/8413">SECURE Data Act</a> was introduced in Congress, and it immediately received a lot of criticism and blowback. The <a href="https://statescoop.com/secure-data-act-privacy-bill-not-consumer-friendly/">ACLU</a> says it "would entirely destroy the work that states have been doing" on preemption of state privacy laws. The <a href="https://statescoop.com/secure-data-act-privacy-bill-not-consumer-friendly/">CDT</a> calls out "easily exploitable loopholes" and data minimization that "lacks teeth." <a href="https://www.brookings.edu/articles/springtime-in-washington-means-its-time-for-another-round-of-federal-privacy-legislation/">Brookings</a> notes the absence of a private right of action. <a href="https://www.csoonline.com/article/4163345/new-us-house-privacy-bills-raise-hard-questions-about-enterprise-data-collection.html">EPIC</a> calls it "a huge gift to Big Tech." The <a href="https://privacy.ca.gov/2026/04/california-privacy-protection-agency-releases-letter-opposing-the-secure-data-act/">California Privacy Protection Agency</a> published a formal opposition letter.</p>
<p>As an engineer who builds privacy infrastructure, I am looking at it through a different lens. How does this bill impact the way personal data can and cannot be used for personalization? My reference is the <a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679">GDPR</a>, because I have built infra to support the obligations it mandates. Where does the SECURE Data Act diverge from GDPR, and what does that mean for how companies can use or erase user data after an opt-out? The daylight between them is in pseudonymous data.</p>
<h2>Pseudonymous data: GDPR vs. SECURE Data Act</h2>
<p><a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679">GDPR</a> and the <a href="https://www.congress.gov/bill/119th-congress/house-bill/8413/text">SECURE Data Act</a> define pseudonymous data in nearly identical language. Both classify it as personal data. Both require separating the identifying information. Both require technical measures to prevent attribution. You could swap one definition into the other and barely notice. While they share the definition, their treatment of pseudonymous data is very different.</p>
<p>Under GDPR, pseudonymous data is personal data. Period. Pseudonymization does not absolve corporations of the regulatory burden around erasure, access, profiling objections, or any obligations associated with personal data. The SECURE Data Act has a different take. Its pseudonymous data provision (Section 7(c)) suspends consumer rights for data that meets the pseudonymous threshold. The consumer cannot opt out of its use for targeted advertising. Cannot request deletion. Cannot access it. Pseudonymous data is still personal data by the bill's own definition. The bill simply overrides the consumer's ability to act on that fact.</p>
<p>The shared definition also leaves a gap. Both frameworks describe pseudonymous data in terms of records keyed by a pseudonym. But what about a derived artifact? A model trained on pseudonymous inputs, keyed by a pseudonymous identifier, encodes behavioral patterns without direct identifiers. It is linkable to an identified person if the controller holds the forward mapping, but the identifying information is "kept separately." The bill defines personal data as information "linked or reasonably linkable" to an identified person. Neither framework cleanly resolves whether the model is pseudonymous data, personal data, or something else. The SECURE Data Act's exemption in Section 7(c) operates on the data layer. Whether the model inherits that exemption is a question the definitions do not answer.</p>
<p>The divergence extends further. Under GDPR, a consumer can withdraw consent, and the controller must stop processing. Purpose limitation constrains what can be collected in the first place. The consumer has levers across the full data lifecycle: collection, processing, retention, deletion. The SECURE Data Act's opt-out covers three specific activities: targeted advertising, sale, and certain profiling. Data collection itself is not subject to opt-out. The pipe stays open.</p>
<h2>Data pipeline with pseudonymous data</h2>
<p>Starting with the same behavioral data and going through the same pseudonymization step, GDPR and the SECURE Data Act permit data controllers to treat the result very differently. Here is an example data pipeline that sharpens the difference.</p>
<h3>One-way pseudonymizer</h3>
<p>The SECURE Data Act requires two conditions for the pseudonymous exemption: the identifying information is kept separately, and appropriate technical measures ensure non-attribution. It does not specify what "appropriate" means. A one-way derivation fits cleanly: HMAC with a secret key, or a key derivation function. The forward mapping (user_id to pseudo_id) is computable. The reverse mapping is computationally infeasible. No reverse API. No reverse index. Key material is restricted and audited. Every element of the definition is satisfied. Consumer rights no longer apply to this data, though data minimization and security obligations persist.</p>
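<p>A minimal sketch of such a one-way pseudonymizer, in Python. The key handling here is illustrative; a real deployment would load key material from a restricted, audited secret store, never from source code.</p>
<pre><code>import hashlib
import hmac

# Illustrative only: real key material lives in a restricted,
# audited secret store, never in source code.
PSEUDONYMIZATION_KEY = b"replace-with-managed-secret"

def pseudonymize(user_id: str) -> str:
    """Forward mapping: user_id to pseudo_id.

    HMAC-SHA256 with a secret key. The forward direction is cheap to
    compute; reversing pseudo_id to user_id without the key is
    computationally infeasible. No reverse API, no reverse index.
    """
    mac = hmac.new(PSEUDONYMIZATION_KEY, user_id.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()
</code></pre>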
<p><em>Notice what just changed. Under GDPR, the obligations follow the data regardless of how it is keyed. Under the SECURE Data Act, you can use a one-way function precisely because the obligations don't follow. Same definition. Same data. The architecture diverges at the exact point where obligations either persist or detach.</em></p>
<p>Let's see what a pipeline built on this architecture can do.</p>
<h3>Data pipeline for personalization</h3>
<p>Assume all user behavioral data has been pseudonymized, replacing user_id with pseudo_id. This data trains an ML model indexed by pseudo_id. At inference time, the system performs a forward lookup (user_id to pseudo_id) to select the right model and generate a personalized result.</p>
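<p>Sketched in code, with <code>pseudonymize</code> as the forward mapping from the previous sketch and a hypothetical in-memory model store standing in for real infrastructure:</p>
<pre><code>class BehavioralModel:
    """Hypothetical stand-in for a per-pseudonym personalization model."""
    def predict(self, request: dict) -> dict:
        return {"ranked_items": [], "personalized": True}

# Trained offline on behavioral data keyed only by pseudo_id.
model_store: dict[str, BehavioralModel] = {}

def personalize(user_id: str, request: dict) -> dict:
    # Forward lookup only: start from the logged-in identity and walk
    # into the pseudonymous layer. The system never walks backward
    # from pseudo_id to user_id.
    pseudo_id = pseudonymize(user_id)
    model = model_store.get(pseudo_id)
    if model is None:
        return {"ranked_items": [], "personalized": False}
    return model.predict(request)
</code></pre>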
<h3>Consumer experience with pseudonymous ML models</h3>
<p>When a consumer opts out of personalization, their data, keyed by pseudo_id, has been exempted from the opt-out, and so makes its way to the ML model. When this opted-out user interacts with the product, the ML model, which continues to be trained on the user's pseudonymous data, continues to personalize the product for them. The consumer experience is identical to that of a user who never opted out.</p>
<p>Did you notice the difference? Neither did I.</p>
<p><a href="/assets/images/secure-data-act-pipeline.png"><img src="/assets/images/secure-data-act-pipeline.png" alt="The same data pipeline under three scenarios: no opt-out, opt-out under GDPR, and opt-out under the SECURE Data Act. The first and third pipelines produce identical personalized results. The GDPR pipeline breaks at pseudonymization." /></a></p>
<p>Here is how the bill permits this. At no point was pseudonymous data attributed to an identified person. The system started with a known user and walked forward into the pseudonymous layer. It never walked backward. Forward resolution is not re-identification. The bill's re-identification provisions contemplate the reverse direction. Forward resolution is simply how a personalization system works. The bill does not address it.</p>
<p>A defender of the bill would point out that the forward lookup operates entirely in the identified layer: the user is logged in, the system derives their pseudo_id from their user_id, and only then touches the pseudonymous data. The pseudonymous data itself is never "attributed to an identified person." The attribution runs from identity to pseudonym, not the reverse. That reading is consistent with the bill's text. It is also consistent with a pipeline that delivers personalized content to a known user based on their behavioral history, with the user having no ability to opt out of the data that powers it.</p>
<p>This is not a fantastical architecture. Existing large-scale personalization systems bear more than a passing resemblance to this one. Behavioral features are processed in layers abstracted from direct identity, and identity is resolved at serving time. The SECURE Data Act's pseudonymous data provisions map onto this existing architecture and exempt its core data processing layer from consumer rights. Other obligations (data minimization, data security) still apply to pseudonymous data. But the consumer-facing rights that would let a user see, delete, or opt out of this processing do not.</p>
<h3>How does GDPR handle this?</h3>
<p>GDPR treats pseudonymous data as personal data subject to the same constraints as identifiable data. Run the same pipeline under GDPR: the user opts out, and the deletion obligation follows the data into the pseudonymous layer. The controller must locate the user's pseudo_id, delete the pseudonymous behavioral records, and address any models trained on them. The hair-splitting around one-way mappings and ID resolution at runtime becomes irrelevant to privacy compliance. If the user opts out, all of their data, including pseudonymous data, is in scope.</p>
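<p>Continuing the earlier sketch, the same forward mapping now serves the erasure obligation instead of escaping it. The store names are illustrative:</p>
<pre><code># Illustrative in-memory store of behavioral records, keyed by pseudo_id.
behavioral_records: dict[str, list] = {}

def handle_erasure_request(user_id: str) -> None:
    """Hypothetical GDPR erasure path: the obligation follows the data."""
    pseudo_id = pseudonymize(user_id)
    # Pseudonymous data is personal data under GDPR, so it is in scope.
    behavioral_records.pop(pseudo_id, None)
    # Models trained on the erased data must also be addressed:
    # dropped, retrained, or otherwise corrected.
    model_store.pop(pseudo_id, None)
</code></pre>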
<h2>What follows from the example</h2>
<p>GDPR and the SECURE Data Act start from the same sentence and describe the same technical operation: stripping direct identifiers, separating the mapping, applying technical safeguards. The disagreement is about what follows.</p>
<p>GDPR says: the processing is what matters. If you use someone's behavioral history to target them, they have rights over that processing. It does not matter whether that data is keyed by PII or by a pseudonym. Rights attach to what is done with data.</p>
<p>The SECURE Data Act says: the PII is what matters. Sever the link between personal data and PII through pseudonymization, and the rights detach.</p>
<p>The two frameworks encode different theories of <a href="/garden/privacy-in-processing-vs-identity/">where privacy lives</a>. One locates it in what is done with data. The other locates it in whether the data can be traced back to someone. The same engineer building the same system faces a fundamentally different regulatory question depending on which framework governs. Under GDPR, pseudonymization is a tool you use inside the regulatory perimeter. Under the SECURE Data Act, pseudonymization is the door out of it.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Subsidiarity is not Hayek]]></title>
            <link>https://srikanth.sastry.name/subsidiarity-is-not-hayek/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/subsidiarity-is-not-hayek/</guid>
            <pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[I've been writing about directive governance and subsidiarity in software organizations. The objection I get is: "Isn't this just Hayek?"The...]]></description>
            <content:encoded><![CDATA[<p>I've been writing about <a href="/garden/directive-governance/">directive governance</a> and <a href="/garden/subsidiarity/">subsidiarity</a> in software organizations. The objection I get is: "Isn't this just Hayek?"</p>
<p>The steelman goes something like this. <a href="https://www.jstor.org/stable/1809376">Hayek</a> argued that knowledge is distributed, tacit, and cannot be aggregated by a central planner. He was arguing against the central planning of Keynes, which was in vogue during his time. Analogously, directive governance centralizes decisions, and subsidiarity distributes them. So, directive governance looks Keynesian, subsidiarity is Hayek, and I just spent <a href="/cargo-cult-governance/">three</a> <a href="/directive-governance-situationship/">posts</a> <a href="/deliverance-from-directive-governance/">reinventing</a> <a href="https://www.jstor.org/stable/1809376"><em>The Use of Knowledge in Society</em></a>.</p>
<p>I did use Hayek's core insight around the tacit and distributed nature of incompressible knowledge as a starting point. But private profit-seeking organizations and the nature of software engineering reject a wholesale transplantation of Hayek's ideas. The differences break the model entirely.</p>
<p><a href="/garden/subsidiarity-preserves-hierarchy/"><strong>Subsidiarity keeps the hierarchy</strong></a>. Hayek's market is a flat, emergent coordination mechanism without a central authority. In contrast, subsidiarity explicitly preserves organizational hierarchy. It changes the function of hierarchy from directing to enabling, from commanding to providing context and guardrails. Accountability still aggregates upward. Higher levels still intervene when lower levels cannot handle the issue. This is not "let the market decide." It is "let the closest competent authority decide, backed by a hierarchy that enables rather than directs."</p>
<p><a href="/garden/directive-governance-is-not-keynesian/"><strong>Directive governance is not Keynesian central planning</strong></a>. Keynes argued for targeted intervention to correct specific market failures: demand deficiency, coordination failures, liquidity traps. Directive governance is not that; it is an organizational model where all decisions flow through a hierarchy. There is nothing targeted about it. It is a complete takeover of organizational decision-making, and not a scalpel to governance like Keynesian planning is to the economy.</p>
<p><a href="/garden/failure-argument-is-conditional/"><strong>The argument for failure is conditional</strong></a>. Hayek's claim is universal: central planning always fails because knowledge is always distributed and tacit. My argument is that directive governance fails in software because <a href="/garden/directive-governance-preconditions/">three specific preconditions</a> do not hold: information cannot be compressed without losing signal, metrics are not good proxies for outcomes, and execution is not separable from decision-making. Directive governance works for Pharma and manufacturing, where those preconditions hold. This is not a universal principle about the superiority of decentralization. It is a structural diagnosis.</p>
<p><a href="/garden/mission-not-price-coordinates/"><strong>Shared interest in mission vs. self-interest in price</strong></a>. Hayek's distributed system coordinates through prices. Subsidiarity coordinates through <a href="/garden/delegation-mimicry-without-cultural-substrate/">missionary culture</a>: every member motivated by advancing the mission, viewing others as partners. Self-interest is the engine of Hayek's market, shared interest is the engine of missionary culture.</p>
<p>The difference matters under pressure. An SRE team with full decision-making autonomy might internalize its role (keep the fleet humming) without caring about the organization's mission. That is functional, but fragile. When a crisis hits, a team that owns its role but not the mission has no reason to resist centralization. The ratchet finds less resistance. Missionaries push back. A directive that runs counter to the mission feels viscerally wrong to someone who has internalized it. Price signals do not build that resistance. Shared purpose does.</p>
<p><a href="/garden/subsidiarity-is-third-position/"><strong>The origin story matters</strong></a>. Subsidiarity comes from Catholic social teaching. <a href="https://www.vatican.va/content/pius-xi/en/encyclicals/documents/hf_p-xi_enc_19310515_quadragesimo-anno.html"><em>Quadragesimo Anno</em></a> (1931) criticizes both laissez-faire capitalism and central planning. It is a third position, not a pole. Treating subsidiarity as Hayek strips the most important part: the commitment to community organized around shared purpose, with authority distributed to the lowest competent level. The encyclical calls both unregulated markets and overcentralized states a "grave evil", and that would not sit well with Hayek.</p>
<p><a href="/garden/ratchet-has-no-market-analog/"><strong>The ratchet has no analog</strong></a>. The <a href="/garden/crisis-centralization-ratchet/">crisis-centralization ratchet</a> is a structural mechanism that pulls organizations toward directive governance under pressure. Markets do not have this. Crises in markets lead to more markets, or regulation, or both, depending on who wins the political argument. Organizations have a one-way valve. That makes the organizational problem fundamentally different from the macroeconomic one.</p>
<hr />
<p>Hayek's prescription does not survive contact with organizations. Organizations are not markets. They have hierarchies, missions, reporting chains, crises, and ratchets. The question is not "centralize or decentralize?" The question is: what is your hierarchy for?</p>
<p>Directive governance answers: directing. Subsidiarity answers: enabling.</p>
<p>That distinction has no home on the Keynes-to-Hayek spectrum.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Deliverance from Directive Governance]]></title>
            <link>https://srikanth.sastry.name/deliverance-from-directive-governance/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/deliverance-from-directive-governance/</guid>
            <pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[This is the third post in the series about directive governance. The first post diagnosed the problem with governance in the tech industry a...]]></description>
            <content:encoded><![CDATA[<p>This is the third post in the series about <a href="/garden/directive-governance/">directive governance</a>. The <a href="/cargo-cult-governance/">first post</a> diagnosed the problem with governance in the tech industry as directive governance applied where it doesn't belong. Directive governance is top-down governance: decisions flow down, information travels up. It works in Pharma and manufacturing because three <a href="/garden/directive-governance-preconditions/">preconditions</a> hold. (1) Information can be compressed without losing signal. (2) Metrics are good proxies for what the organization cares about. And (3) execution is distinct from decision-making. <a href="/garden/essential-complexity-makes-software-ungovernable/">None of this holds</a> for software.</p>
<p>The <a href="/directive-governance-situationship/">second post</a> explained why companies tend to not change the governance model, despite benefits to decentralizing decision making. In essence, it persists because a structural <a href="/garden/crisis-centralization-ratchet/">ratchet</a> centralizes quickly during crisis and decentralizes almost never.</p>
<p>So, what's a gal like you supposed to do in such a cruel world! First off, there is no easy way out. But there <em>is</em> a way out; read on to find out.</p>
<h2>Quick fixes that don't fix</h2>
<p>Let's dispatch three shortcuts that folks might resort to: AI, flattening, and the rock-star CEO. As standalone shortcuts, none of them escape the root cause: decisions are made where the information is not.</p>
<p><strong>AI fixes the information flow.</strong> The argument is that AI can now summarize engineering discussions, design docs, and Slack threads with high fidelity. The stronger version: AI doesn’t just compress; it reasons across codebases, surfacing dependencies no single human sees. Either way, directive governance is now viable.</p>
<p>But here’s the rub. <a href="/garden/essential-complexity-makes-software-ungovernable/">Essential complexity</a> in software engineering is irreducible (<a href="https://en.wikipedia.org/wiki/No_Silver_Bullet">Brooks</a>). "We chose this abstraction boundary because of how three subsystems will need to evolve independently over the next two years" cannot be compressed into something a VP can evaluate across hundreds of systems under her purview. The judgment calls that matter most are which trade-offs to accept and which boundaries will hold as requirements shift. That context lives in the team, not in the model. AI gives you a better summary of what is measurable. It does not make the unmeasurable measurable. Worse: if leadership believes the information flow is fixed, they centralize harder. Confidence goes up. Accuracy stays flat.</p>
<p><strong>Flatten the org.</strong> This one also seems to be in vogue. Just remove the hierarchy and information can flow freely to the top, and decisions are better explained to the bottom. If only! <a href="https://www.jofreeman.com/joreen/tyranny.htm">Jo Freeman diagnosed in 1970</a> what every flat organization discovers: <a href="/garden/structurelessness-hides-hierarchy/">eliminating formal hierarchy does not eliminate hierarchy</a>. It eliminates <em>accountable</em> hierarchy. Valve's flat structure <a href="https://www.pcgamer.com/valves-flat-structure-contains-hidden-layer-of-powerful-management-claims-ex-employee/">concealed informal cliques</a>. Spotify's squad model <a href="https://www.jeremiahlee.com/posts/failed-squad-goals/">never worked at Spotify</a>. You replace a visible, broken pipeline with an invisible, unaccountable one.</p>
<p><strong>Get a better CEO.</strong> This argument is essentially Confucian in that it concedes we will always have kings, and so we should make sure we have a "good king". Dressing it up for the 21st century, it goes "Jobs did it. Nadella did it. Bezos built it from scratch. The problem is personnel, not structure."</p>
<p>This doesn't go far enough. A good leader is necessary, but not sufficient. Jobs is the strongest case: Apple under his leadership was extraordinary. Apple after Jobs coasts on the momentum of his decisions, increasingly centralized, increasingly directive. The stock goes up and to the right. The pace of category-defining products has slowed. The kingdom did not survive the king. And the leaders who <em>did</em> build something lasting all made structural changes, not just better decisions. <a href="https://en.wikipedia.org/wiki/David_Marquet">Marquet</a> took the <a href="https://davidmarquet.com/turn-the-ship-around-book/">worst-performing submarine in the fleet</a> and turned it around by replacing "permission to" with "I intend to." Nadella spent a decade restructuring how decisions get made at Microsoft. In every case, the escape was structural, not personal.</p>
<h2>The deliverance: subsidiarity</h2>
<p>What is the alternative to directive governance? Turns out, the alternative has already been done. No, not by <a href="https://en.wikipedia.org/wiki/Simpsons_Already_Did_It">The Simpsons</a>; by <a href="https://en.wikipedia.org/wiki/Subsidiarity_%28Catholicism%29">the Catholic Church</a>!</p>
<p>The principle, called <em>subsidiarity</em>, can be traced back to the Christian philosophers <a href="https://en.wikipedia.org/wiki/Thomas_Aquinas">Thomas Aquinas</a> and <a href="https://en.wikipedia.org/wiki/Johannes_Althusius">Johannes Althusius</a>. In 1931, Pope Pius XI formalized it in <a href="https://www.vatican.va/content/pius-xi/en/encyclicals/documents/hf_p-xi_enc_19310515_quadragesimo-anno.html"><em>Quadragesimo Anno</em></a>:</p>
<blockquote>
<p>"It is an injustice and at the same time a grave evil and disturbance of right order to assign to a greater and higher association what lesser and subordinate organizations can do."</p>
</blockquote>
<p>I have co-opted it as follows.</p>
<p><strong><a href="/garden/subsidiarity/">Subsidiarity</a>: decisions should be made at the lowest level competent to make them. Higher levels sit behind, providing context, guardrails, and intervening only when lower levels cannot handle the issue.</strong></p>
<p><a href="/garden/subsidiarity-is-not-flat-organization/">This is not flattening</a>. Subsidiarity preserves hierarchy but changes what it is <em>for</em>. The function shifts from directing to enabling. Accountability still aggregates upward. Decisions stay where information lives.</p>
<p>The <a href="/cargo-cult-governance/">existence proofs</a> from the first post are all instances of subsidiarity. Toyota's andon cord, Amazon's two-pizza teams, Berkshire Hathaway's 30-person headquarters, the US Army's mission command. None of them invented something new.</p>
<h2>Subsidiarity needs missionaries</h2>
<p>Subsidiarity implemented as a reorg will not survive the first crisis. The ratchet will compress it back. That is the Spotify illusion from the <a href="/directive-governance-situationship/">previous post</a>: structure without substance.</p>
<p>What separates the existence proofs from the illusions is <a href="/garden/delegation-mimicry-without-cultural-substrate/"><strong>missionary culture</strong></a>: an organization where every member is motivated by advancing the mission and views others as partners in that endeavor. Decisions are defended based on how they impact the mission. Not based on which VP cares about the project. Not based on which metrics will move.</p>
<p>Subsidiarity alone does not produce this. An SRE team with full decision-making freedom might internalize their role (keep the fleet humming) without caring about the organization's mission to "connect people" or "organize the world's information." That is functional, but fragile. When the next crisis hits, a team that owns its role but not the mission has no reason to resist centralization. The ratchet finds less resistance.</p>
<p>Missionaries are different. When a top-down decision runs counter to the mission, they push back. They challenge it, acting as a governor on the ratchet of centralization. If you have internalized the mission, a directive that violates it feels viscerally wrong.</p>
<p>Netflix's stock crashed 70% in 2022. They did not re-centralize. Their <a href="https://www.hrgrapevine.com/us/content/article/2024-06-26-netflix-announces-rare-revision-of-iconic-culture-playbook">2024 culture revision</a> doubled down on "context not control." The people making decisions were not waiting for permission. They had a shared understanding of what Netflix exists to do, and a stock crash did not change that. The mission was the coordination mechanism. Not the reporting chain.</p>
<p>Amazon's leadership principles are how an L5 engineer decides what to build without asking a VP. <a href="https://en.wikipedia.org/wiki/David_Marquet">Marquet's</a> "I intend to" model on the USS Santa Fe worked because every sailor understood the submarine's mission well enough to propose action without waiting for orders.</p>
<p>Without missionary culture, subsidiarity is just another structural change that snaps back under pressure. With it, subsidiarity is self-reinforcing: every decision made locally, grounded in the mission, builds the muscle that makes the next local decision possible.</p>
<h2>The stencil</h2>
<p>This is not a playbook. It is a diagnostic you can superimpose on your own organization.</p>
<p>Are you reaching for AI, flatter org charts, or the right leader? Those are fig leaves. Where are decisions actually made (not what the org chart says, but the actual flow)? When decisions are justified, what is the grammar: "leadership wants X" or "this serves our mission because Y"?</p>
<p>You probably cannot fix your organization's governance model. But you can see it clearly, name it, and choose where to work with open eyes.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[The Suggestible Actor: A New Model for AI-Assisted Software Development]]></title>
            <link>https://srikanth.sastry.name/the-suggestible-actor/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/the-suggestible-actor/</guid>
            <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Every software system is designed around an assumption about its actors; the ones who use the system, and what drives their behavior. This a...]]></description>
            <content:encoded><![CDATA[<p>Every software system is designed around an assumption about its actors; the ones who use the system, and what drives their behavior. This assumption, the <em>actor model</em>, determines API surfaces, error handling, defaults, and guardrails.</p>
<p>There are two actor models in software design, each an archetype at one end of the <a href="/garden/intent-spectrum/">intent spectrum</a>: the well-intentioned actor on one end, the malicious actor on the other. Some systems mix both along the user journey (the login flow assumes a malicious actor, the dashboard assumes a well-intentioned one), but at any given point, the design caters to one or the other. That binary held for decades. But not anymore.</p>
<h2>The Well-Intentioned Actor</h2>
<p>This model assumes that the actor intends to use the system as designed and follow the happy path of the user journey. They want to work within the boundaries, satisfy the preconditions for calling the right APIs, and follow conventions. When they violate a rule, it is accidental and not intentional.</p>
<p>The design paradigm that follows from this model is <a href="https://ricomariani.medium.com/the-pit-of-success-cfefc6cb64c8"><strong>the pit of success</strong></a>. Make correct usage easy and incorrect usage ergonomically painful. Examples include Rust's borrow checker, builder patterns that enforce required fields, type systems that make illegal states unrepresentable, <em>etc</em>. All of these rely on the actor to interpret <a href="/garden/friction-requires-intent/">ergonomic friction</a> as a signal to stop and reassess. When a well-intentioned actor encounters resistance, they read it as: <em>I am probably doing something wrong</em>.</p>
<p>This paradigm rests on a specific assumption: the actor has <strong>judgment</strong>. They can interpret signals beyond the literal content of an error message, drawing on context and system-wide invariants. The system does not need to spell out every correct behavior; it only needs to make incorrect behavior uncomfortable, and the actor's judgment does the rest.</p>
<h2>The Malicious Actor</h2>
<p>Here, the actor's intent is adversarial. They aim to subvert, compromise, or exploit the system.</p>
<p>The design paradigm that follows from this model is <strong>the fortress</strong>. Make incorrect usage impossible. Examples include capability-based access control, sandboxing, least privilege, zero-trust architectures, etc. Ergonomic friction is irrelevant here because the adversary does not interpret friction as a warning, but as evidence that something worth protecting is nearby.</p>
<p>This paradigm rests on its own assumption: the actor has <strong>directed intent</strong>. They will study the system, map its architecture, and probe its boundaries methodically. Any defense that is merely inconvenient rather than impossible will eventually be bypassed.</p>
<h2>The Shared Assumption</h2>
<p>Both archetypes share a deeper assumption: <strong>the actor has intent</strong>. Whether aligned or adversarial, the actor is motivated by something internal. They <em>want</em> an outcome, and the system is designed as a response to that want. This has been true since the start of software engineering as a discipline. We now have a new actor that upends it: the <em>AI coding agent</em>.</p>
<h2>The New Actor: AI coding agent</h2>
<p>The <strong>AI coding agent</strong> demands a new model. The natural instinct is to place it somewhere on the intent spectrum, perhaps as a mostly well-intentioned actor with occasional problematic behavior. This is a <a href="/garden/ai-agent-category-error/">category error</a>. The entire spectrum is organized around intent, and the AI coding agent has none.</p>
<p>The AI agent has an objective, which is <a href="/garden/goal-vs-intent/">not the same thing as intent</a>. The objective is set externally by the human who dispatched it. It did not choose its objective; it was told. It has no internal motivation and no values against which to evaluate the task. It is not aligned with the designers' intent. It is not adversarial toward it. It is orthogonal to the entire axis. Both design paradigms fail for this actor.</p>
<h2>When the Paradigms Fail</h2>
<p>What happens when an AI coding agent operates in a codebase designed for the well-intentioned actor? Let's sharpen this question with an example.</p>
<blockquote>
<p>An agent is implementing a feature and runs the test suite. A test fails with <code>Access Denied</code>: an authorization system is blocking a call the new code needs to make.</p>
</blockquote>
<p>A well-intentioned human developer recognizes what this means. The authorization system is working as designed. They stop, determine which permission they need, and request access through the proper channel. They interpret the friction correctly: <em>I don't have the right to do this yet</em>.</p>
<p>When the AI agent encounters the same error, this is just another test failure, no different in kind from a syntax error or a missing dependency. It looks for alternative paths to make the test pass. Not to compromise the authorization system (the agent has no concept of "compromise") but because that is what it does with any error: it tries to eliminate it.</p>
<p>In the best case, the agent wastes cycles on a dead end. In the worst case, it finds and exploits an actual vulnerability in the authorization system. This is not as far-fetched as it sounds. Anthropic's <a href="https://red.anthropic.com/2026/mythos-preview/">Claude Mythos Preview</a>, when directed to look for them, discovered zero-day vulnerabilities that had survived 27 years of human code review. If a model pointed at security can find what 27 years of human review missed, an agent brute-forcing past <code>Access Denied</code> is not going to stop at the authorization boundary.</p>
<p>One can always claim victory by assuming the malicious actor model for all AI agents. But fortress-hardened software is difficult to read, difficult to write by hand, and expensive to operate at scale. Applying it universally makes the codebase hostile to humans and agents alike.</p>
<p>Instead of forcing the AI agent into an existing archetype, we need a model that describes how it actually behaves and a design paradigm that follows from it.</p>
<h2>The Suggestible Actor</h2>
<p>I call this actor the <strong>suggestible actor</strong>. It is defined by <a href="/garden/suggestible-actor-properties/">four properties</a>:</p>
<ol>
<li><strong>Goal-oriented.</strong> The actor has a goal that it is trying to accomplish.</li>
<li><strong>Locally reasoning.</strong> The actor only reasons over what is immediately available to it.</li>
<li><strong>Susceptible to local context.</strong> The actor's behavior is influenced by the outputs of each interaction with the system.</li>
<li><strong>Confabulates under uncertainty.</strong> When local context leaves gaps in specification or direction, the actor makes up plausible rationale. It "hallucinates."</li>
</ol>
<h3>Goal-oriented</h3>
<p>The agent always has a goal, externally set by the human who dispatched it: "implement feature X," "fix this bug," "refactor this module."</p>
<p>This is not the same as intent. Intent implies motivation: an intentional actor <em>wants</em> an outcome, understands <em>why</em> the outcome matters, and can evaluate trade-offs against their own values. The suggestible actor has none of this. It has a target, and it moves toward that target the way a heat-seeking missile moves toward a heat source: persistently, without comprehension of what it is pointed at or why.</p>
<h3>Locally reasoning</h3>
<p>The agent reasons only over what is immediately available: the contents of its context window, the file it is modifying, the output of the last command it ran. Global invariants, cross-system dependencies, and architectural constraints outside its immediate context do not factor into its decisions.</p>
<p>A human developer operates with ambient knowledge: team conventions, institutional history, an understanding of <em>why</em> the system is structured the way it is. The suggestible actor has none of this. Its understanding extends exactly as far as someone has made explicit within its local context. Even if all ambient knowledge were codified and provided, the locality of the context window would quickly obscure it.</p>
<h3>Susceptible to local context</h3>
<p>Every input the agent receives during execution (compiler errors, test results, code comments, documentation) influences its subsequent behavior. This susceptibility is not uniform. When the agent has a working path toward its goal, external inputs have relatively weak influence. When the agent is <em>stuck</em>, the next piece of feedback it encounters has outsized influence on what it does next. <strong>The agent is <a href="/garden/susceptibility-peaks-at-failure/">most susceptible at the point of failure</a>.</strong></p>
<p>This is the primary design lever. The agent's behavior can be steered, but only if guidance is placed where the agent will encounter it at the moments it is most receptive.</p>
<h3>Confabulates under uncertainty</h3>
<p>When local context is insufficient to determine a next step, the agent does not stop and request clarification. It <a href="/garden/confabulation-is-plausible/">confabulates</a>: it generates a plausible structure and proceeds as if that structure were real. A call to an API that does not exist. A convention that was never established. A security bypass that "should work based on the patterns in this codebase."</p>
<p>This is the convergent failure mode of the other three properties. The result is not random behavior. It is <em>plausible-looking wrongness</em>: output that fits the shape of what should be there, constructed from pattern and proximity, not knowledge. The danger is not that these errors are spectacular. It is that they look correct.</p>
<h2>Designing for the Suggestible Actor</h2>
<p>Neither the pit of success nor the fortress was designed for an actor without intent. The suggestible actor paradigm starts from a different assumption: the actor is <strong>susceptible to local context and confabulates when that context is insufficient</strong>.</p>
<p>Because the agent is goal-oriented but locally reasoning, a gap always exists between the goal as the human understood it and the reality the agent encounters. The human had ambient knowledge that was never made local. This <a href="/garden/directive-gap/">directive gap</a> is the root cause of most suggestible-actor failures. The prescriptions below are all strategies for closing it.</p>
<h3>Make every error a call to action</h3>
<p>Error messages are the most effective steering mechanism available for the suggestible actor.</p>
<p><code>403 Forbidden</code> tells the agent nothing actionable.</p>
<p><code>403 Forbidden: identity 'svc-deploy' lacks 'write:documents' scope. Request access at https://console.example.com/api-keys or use a key with admin privileges.</code> gives the agent an actionable next step at the exact moment it is most receptive to one.</p>
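<p>One way to make this systematic is to build the next step into the error type itself. A sketch with hypothetical names (the identity, scope, and URL echo the example above):</p>
<pre><code>class ActionableError(Exception):
    """Hypothetical error type that carries its own remediation."""

    def __init__(self, what_failed: str, why: str, next_step: str):
        super().__init__(f"{what_failed}: {why}. Next step: {next_step}")

# The remediation travels with the failure, reaching the agent at the
# exact moment it is most susceptible to steering.
raise ActionableError(
    what_failed="403 Forbidden",
    why="identity 'svc-deploy' lacks 'write:documents' scope",
    next_step="request access at https://console.example.com/api-keys",
)
</code></pre>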
<p>Principle: Treat error surfaces as the primary API for the suggestible actor.</p>
<h3>Replace soft boundaries with hard boundaries plus signposts</h3>
<p>Deprecation warnings that hope the developer will migrate. Abstract classes that trust no one will instantiate them. Internal APIs relying on the convention "you shouldn't use this." These are boundaries enforced by social contract. The suggestible actor does not read social contracts. It walks past the "DO NOT ENTER" sign because the door was unlocked.</p>
<p>For boundaries that matter, make them genuinely impassable (compile-time enforcement, runtime rejection, capability restrictions), then attach a signpost telling the agent what to do instead. For boundaries not worth enforcing, the suggestible actor will cross them. They are not boundaries anymore. Accept them as part of your system's state space.</p>
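<p>A sketch of a hard boundary with a signpost, using hypothetical names. The retired API does not ask nicely; it refuses to run and names the supported alternative:</p>
<pre><code>class LegacyUploader:
    """Retired API surface, kept only so that old call sites fail loudly."""

    def __init__(self, *args, **kwargs):
        # Hard boundary: runtime rejection, not a comment asking nicely.
        # The signpost tells the agent (or the human) where to go instead.
        raise RuntimeError(
            "LegacyUploader was removed. Use uploads.v2.Uploader, "
            "which handles auth and retries. See docs/uploads.md."
        )
</code></pre>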
<p>Principle: Only hard boundaries count; when they are hit, provide clear alternatives.</p>
<h3>Write documentation and conventions as if they will be executed</h3>
<p>To steer the agent, documentation must exist within its local context: inline comments adjacent to the code it will modify, docstrings on the functions it will call, unit test failure messages, READMEs precise enough for the agent to follow step by step. It will follow your docs more literally than most humans will.</p>
<p>The same applies to conventions. The suggestible actor cannot absorb norms through osmosis. Project templates, linters, and consistent directory structure encode convention at the tooling level. The agent complies with linters because violations are errors, and errors are the feedback it is most susceptible to.</p>
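<p>As an illustration, a convention like "handler modules live under <code>handlers/</code>" can be encoded as a check whose failure message is itself a call to action. All names here are hypothetical:</p>
<pre><code>import sys
from pathlib import Path

def check_handler_layout(repo_root: str) -> int:
    """Hypothetical CI check: a directory convention encoded as an error."""
    violations = [
        path
        for path in Path(repo_root).rglob("*_handler.py")
        if "handlers" not in path.parts
    ]
    for path in violations:
        # The failure message tells the agent exactly what to do next.
        print(
            f"{path}: handler modules belong under handlers/. "
            f"Move this file to handlers/{path.name}."
        )
    return 1 if violations else 0

if __name__ == "__main__":
    sys.exit(check_handler_layout("."))
</code></pre>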
<p>Principle: Comments in code are vectors for prompt injection. Specifications are implementation contracts. Be explicit.</p>
<h3>Close the directive gap</h3>
<p>When the directive gap is wide, the agent confabulates. Close it.</p>
<p>CI/CD gates should report not just what failed but what to do about it. Pre-commit hooks should provide the correct alternative, not just reject the incorrect one. An <code>AGENTS.md</code> or <code>CONTRIBUTING.md</code> should encode the ambient knowledge that a human developer would carry. Example code near the API surface, type signatures that make the correct shape unambiguous, factory methods with correct defaults, named parameters that make intent explicit at the call site: all of these make the correct answer locally available so the agent never needs to invent one.</p>
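<p>For instance, a factory method can state the blessed defaults in code, and named parameters make the correct shape explicit at the call site. Hypothetical names throughout:</p>
<pre><code>from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    backoff_seconds: float

def default_retry_policy() -> RetryPolicy:
    """The team's blessed defaults, stated in code rather than left as
    ambient knowledge the agent cannot see."""
    return RetryPolicy(max_attempts=3, backoff_seconds=0.5)

# Named parameters make intent explicit at the call site; the correct
# answer is locally available, so the agent never needs to invent one.
policy = RetryPolicy(max_attempts=5, backoff_seconds=1.0)
</code></pre>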
<p>Principle: <a href="/garden/ambient-to-local/">Convert ambient knowledge into local context</a>.</p>
<hr />
<p>The suggestible actor is already operating in your codebase. It is calling your APIs, reading your documentation, and hitting your error messages. It has no intent to respect your design philosophy and no judgment to interpret your ergonomic signals.</p>
<p>But it is susceptible to local context. And that is a lever.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Tech Companies and Directive Governance: A Situationship]]></title>
            <link>https://srikanth.sastry.name/directive-governance-situationship/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/directive-governance-situationship/</guid>
            <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Most large tech companies operate top-down. Information flows up through a reporting chain. Decisions are made centrally. Directives flow ba...]]></description>
            <content:encoded><![CDATA[<p>Most large tech companies operate top-down. Information flows up through a reporting chain. Decisions are made centrally. Directives flow back down. This is true regardless of what their culture decks say. Unfortunately, it is the wrong way to govern for software engineering. Top-down governance works when information flowing upstream is compressible, the metrics that decision makers see are a good proxy for org health and success, and decision-making and execution are distinct from each other. None of these hold for software engineering. I have detailed all the wrongness in <a href="https://srikanth.sastry.name/cargo-cult-governance/">"Cargo Cult Governance".</a></p>
<p>If you take my claim of wrongness at face value, the natural follow-up question is the one my nine-year-old asks regularly: "why?" If the structural mismatch is that clear, and the existence proofs are that abundant, why don't large tech companies change?</p>
<p>Before going into the "why," we need to answer the "what." Specifically, what is "directive governance"?</p>
<h2>The "what": Directive governance</h2>
<p>The colloquial term for this is "<a href="/garden/command-control-misnomer/">command and control</a>." That term is imprecise and means different things to different people. What dominates the tech industry is something more specific. Burns and Stalker came close in 1961 with the notion of <a href="https://en.wikipedia.org/wiki/Mechanistic_and_organic_systems">mechanistic organization</a>: rigid hierarchy, top-down decision-making, formal procedures, accountability through the chain of command.</p>
<p>In practice, tech companies are not purely mechanistic. They bolt on organic elements: hackathons, "innovation time," autonomous-team branding. The mechanistic core stays intact. I call this <strong><a href="/garden/directive-governance/">directive governance</a></strong>:</p>
<blockquote>
<p><strong>Information flows up</strong> (compressed and possibly lossy). <strong>Decisions are made centrally</strong> based on whatever survives the trip. <strong>Directives flow back down</strong> for execution. <strong>Accountability is for compliance</strong>: did you execute the directive? Not: did you achieve the outcome?</p>
</blockquote>
<p>Directive governance works when the three preconditions I mentioned earlier hold: information is compressible without critical loss, quantitative metrics correlate with reality, and decision-making is separable from execution. In manufacturing, these hold. In software, they structurally don't.</p>
<p>So... why don't companies switch? The answer is a <a href="/garden/crisis-centralization-ratchet/">ratchet</a>.</p>
<h2>The "Why": The ratchet</h2>
<p>The ratchet hypothesis:</p>
<blockquote>
<p>Tech companies centralize quickly during crisis and decentralize very slowly afterward.</p>
</blockquote>
<p>The asymmetry has three layers, and they compound.</p>
<p><strong>Layer one: mechanical asymmetry.</strong> Centralizing is a directive. "All decisions go through me now." That can happen overnight. Decentralizing is a culture. It requires building judgment, trust, and context at every level of the hierarchy. The transition to decentralization itself requires some level of decentralization. Culture takes years.</p>
<p><strong>Layer two: loss aversion.</strong> Even when organizations are no longer in crisis, can afford to decentralize, <em>and see the benefits of it</em>, they do not start. Because being caught mid-transition when the next crisis arrives feels worse than staying centralized. A fully centralized org can respond quickly, even if the response is wrong. A half-decentralized org has neither the speed of centralization nor the distributed judgment of full decentralization. So you wait. And the next crisis arrives. And the waiting becomes permanent.</p>
<p><strong>Layer three: competitive pressure as the only loosening force.</strong> Peacetime loosening is not voluntary. It is forced by upstarts that are nimble and innovative. When smaller competitors are shipping faster and stealing talent, the pressure to federate becomes hard to ignore. But because it is reactive rather than deliberate, it produces shallow structural changes. Skunkworks. Federated org charts. Squad models. Autonomous teams that still need VP sign-off. The structure gets decentralized. The culture does not; remember, it can take years. That is why it snaps back the moment crisis returns.</p>
<p>The three layers compound. The transition is mechanistically slow, psychologically avoided, and when it happens at all, shallow and reversible. If crises come faster than the loosening rate, centralization accumulates.</p>
<p>The ratchet makes a falsifiable prediction.</p>
<blockquote>
<p>Long peace produces observable loosening. Frequent crises produce persistent centralization.</p>
</blockquote>
<p>The evidence fits this prediction. Consider the last fifteen years in the tech industry. From 2012 to 2018, a long peace produced observable loosening. Facebook was federated: "move fast and break things," teams shipping independently. Google let teams launch products with minimal central approval. Nadella took over Microsoft in 2014 and killed the stack ranking system that had crippled the company for a decade. Peacetime loosening, exactly as the ratchet predicts. Some products from that era failed (Google Allo, Amazon Fire Phone, Google Glass, Meta Portal, etc.), and others broke new ground (Google Cloud and Microsoft Azure, the <a href="https://arxiv.org/abs/1706.03762">Transformer paper</a>, etc.). You do not get to cherry-pick the hits without accepting the misses. The product graveyard is not evidence of federation failing. It is the cost of federation succeeding.</p>
<p>Then came the crises: trade tensions, COVID, the hiring binge and the correction, AI panic, layoff waves. Every eighteen to twenty-four months, another shock. Zuckerberg's "year of efficiency." Google's layoffs. Meta's flattening. The loosening snapped back instantly because it was structural, not cultural. A competitive response, not a deliberate transformation.</p>
<p>The organizations that practice federated decision-making (Amazon, Netflix, Toyota, militaries practicing mission command) escaped all three layers. They invested in culture before they needed it. They did it deliberately and deeply. They maintained it long enough for the slow process to take hold. Their decentralization does not snap back under pressure because it is not shallow. Netflix is the clearest test. Its stock crashed 70% in 2022. They laid off staff. They did not re-centralize. Their <a href="https://www.hrgrapevine.com/us/content/article/2024-06-26-netflix-announces-rare-revision-of-iconic-culture-playbook">2024 culture revision</a> doubled down on "context not control."</p>
<h2>What makes it stick</h2>
<p>The ratchet is the spine. Four forces make it stickier.</p>
<p><strong>Serial <a href="https://en.wikipedia.org/wiki/Satisficing">satisficing</a> <a href="/garden/serial-satisficing-without-learning/">without learning</a>.</strong> Hire aggressively in 2021. Lay off aggressively in 2023. Pivot to AI in 2023. Each correction is presented as the rational fix to the previous bounded decision. But the claim that any correction is rational is <a href="/garden/unfalsifiable-organizational-corrections/">not falsifiable</a>. Neither is the claim that it is wrong. That is the point. Nobody checks. Nobody builds the feedback mechanisms that would let you check next time. <a href="https://www.gsb.stanford.edu/insights/why-copycat-layoffs-wont-help-tech-companies-or-their-employees">Pfeffer's research</a> provides the closest thing to empirical traction: companies that did not lay off performed equally well. The honest version of the earnings call: "We are making this correction with equally incomplete information, and we have no way of knowing if it is better than what it replaced." Nobody says that.</p>
<p><strong>Institutional inertia.</strong> The companies are profitable. The stock is up. Directive governance is not producing visible failures. Why would anyone champion a multi-year cultural transformation with uncertain payoff? The ratchet provides cover during crisis: centralization is plausible enough to be defensible. In peacetime, the status quo is plausible enough to be comfortable. Nobody fixes what appears to work.</p>
<p><strong>Incentive structures.</strong> Meta recently tied <a href="https://corpgov.law.harvard.edu/2026/04/10/metas-new-executive-pay-plan-ties-nearly-1-billion-to-stock-performance/">nearly $1 billion in executive compensation to stock price targets</a>. Not innovation rate. Not decision quality. Not talent retention. Stock price. The CTO, CPO, COO, and CFO all hold options that pay out only if market capitalization hits specific thresholds. This is not unusual. It is the norm. Executive compensation rewards what directive governance can produce: revenue, cost cuts, market cap. It does not reward what decentralization would improve.</p>
<p><strong>The Spotify illusion.</strong> Companies that claim to have decentralized but have not. Spotify's squad model was <a href="https://www.jeremiahlee.com/posts/failed-squad-goals/">"part ambition, part approximation."</a> Co-author Joakim Sundén later admitted people struggled to copy "something that didn't really exist." Spotify itself transitioned back to traditional management. This is the shallow loosening the ratchet predicts: <a href="/garden/delegation-mimicry-without-cultural-substrate/">structural change without cultural change</a>. It looks like adaptation without being adaptation.</p>
<p>These forces do not operate independently. They feed the ratchet. Serial satisficing provides the post-hoc justification for each centralization. Inertia keeps the status quo comfortable. Incentives make change financially unrewarding. The Spotify illusion lets companies claim they have changed when they have not. Together, they ensure that even when competitive pressure forces loosening, the loosening stays shallow.</p>
<h2>How does it make $$$</h2>
<p>If directive governance is that broken, why are these companies worth trillions?</p>
<p>Because bad governance is <a href="/garden/directive-governance-degrades-not-destroys/">a tax on performance, not a death sentence</a>. When you have a search monopoly, network effects, or ecosystem lock-in, the monopoly rents absorb the tax. Microsoft lost a decade of market cap under Ballmer and emerged just fine. Google missed the boat on generative AI, despite authoring the seminal Transformer paper, and lost billions playing catch-up. Apple and Meta spent billions on VR headsets that have yet to find a market.</p>
<p>The point is not that directive governance leads to death or bankruptcy. The point is that outward success does not mean the organization is healthy. The failure mode for most large tech companies is not death. It is languishing. Profitable enough to survive. Too poorly governed to innovate. The best engineers leave for younger companies where they can make decisions. The products get incrementally worse. The stock price holds up long enough that the board never forces the issue.</p>
<p>The urgency here is personal, not institutional. You can work inside directive governance and function. The company will survive either way. Directive governance ties your ability to drive innovation to the grace of your leader. That grace can be snatched away overnight.</p>
<p>Nadella's Microsoft is the rare counterexample: a deliberate, decade-long cultural investment that has survived multiple crises. The rarity is the point. The escape requires investment on a timeline that exceeds most executive tenures, in a discipline most leaders have never practiced.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Cargo Cult Governance]]></title>
            <link>https://srikanth.sastry.name/cargo-cult-governance/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/cargo-cult-governance/</guid>
            <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[In the tech industry, we have been through a corporate rollercoaster in the last few years. First, it was the hiring mania during COVID, fol...]]></description>
<content:encoded><![CDATA[<p>In the tech industry, we have been through a corporate rollercoaster in the last few years. First, it was the hiring mania during COVID, followed by widespread layoffs starting in 2023. Then there was the pivot to AI, followed by the "flattening" of middle management. Regardless of whether you were laid off, or you carry the survivor's guilt, or you are shoveling AI slop to get to something useful, the mental scars are very real. So is the cynicism that the leadership may not know what it is doing. The decisions feel callous, short-sighted, even whimsical, and based on <a href="/garden/data-pipeline-is-achilles-heel/">wildly inaccurate information</a>.</p>
<p>But does it <em>have</em> to work this way? What's actually driving these decisions, and is there a better mechanism? Because if the answer is "this is just how large companies work," that's one kind of problem. If the answer is "there's a specific, diagnosable flaw in how these decisions get made", then that's a different one. One that might be fixable.</p>
<h2>The mechanism: directive governance</h2>
<p>The governance model that dominates the technology industry is what I call <em>directive governance</em>: information flows up through a reporting chain (compressed and lossy), decisions are made centrally, and directives flow back down for execution. Directive governance is how most large tech companies are actually run, regardless of what their culture decks say. (I formalize this definition in <a href="/directive-governance-situationship/">a follow-up article</a>.)</p>
<p>And there is a reason for that. Directive governance has been very successful in a myriad of industries and organizations: pharmaceutical development, aviation, manufacturing, and even many parts of the military. What we see in the tech industry is a form of <a href="/garden/isomorphic-mimicry-in-tech-governance/">isomorphic mimicry</a>; if it works in those areas, then it should work here too.</p>
<p>However, if you stop to ask why exactly it works in those industries, you start to see the fallacy in <a href="/garden/directive-governance-cargo-cult/">this mimicry</a>. In pharmaceutical development, clinical trial data is structured and quantifiable. The information that matters can travel up the chain without losing its meaning. In aviation, decades of failure analysis have produced checklists and procedures that genuinely capture what matters. The gap between what the front line knows and what leadership sees is narrow by design. In manufacturing, defect rates and throughput are real proxies for operational reality. Cost per unit correlates with what's actually happening on the floor. And the person who designed the part is genuinely distinct from the person who fabricates it to spec.</p>
<p>In short, directive governance works really well when the information needed for decisions is highly compressible without losing fidelity and is verifiable, and when decisions are clearly separable from their execution. These conditions are favorably satisfied in the industries mentioned earlier, hence the success of directive governance in those spheres.</p>
<p>But the tech industry doesn't conform to <a href="/garden/directive-governance-preconditions/">these conditions</a>, and therein lies the problem: the reason directive governance works poorly here.</p>
<h2>The damage</h2>
<p>This isn't theoretical. The wreckage is visible and well-documented.</p>
<ul>
<li>During Microsoft's <a href="https://www.vanityfair.com/news/business/2012/08/microsoft-lost-mojo-steve-ballmer">lost decade</a>, stack ranking destroyed collaboration across the company. It was a centralized performance system that forced bell-curve grading. As a result, employees optimized against each other instead of for the product. Market cap fell from $580 billion to $249 billion. Leadership could see attrition rates and shipping dates. They could not see the innovation that wasn't happening.</li>
<li>In 2011, Google made the top-down decision to compete with Facebook on social and mandated that all teams across Google integrate Google+ into their products. The decision was made, and the teams were expected to execute. But "integrate social" is not a specification. Each team made its own decisions about what integration meant for its product. The decisions that determined whether the product would be coherent were not made by Google leadership. They were made by dozens of teams independently, at the execution layer, with no mechanism to coordinate them. The result was a Frankenstein's monster. Google+ was shut down in 2019.</li>
<li>After Elon Musk laid off roughly 80% of Twitter's staff, <a href="https://www.platformer.news/how-a-single-engineer-brought-down/">a single remaining SRE</a> made a configuration change that broke the entire platform: links, images, internal tools, everything. "You may not see negative effects immediately," NYU's Justin Cappos <a href="https://engineering.nyu.edu/news/what-twitters-outage-says-about-over-zealous-downsizing">observed</a>. "A month later you start to take a hit, and then the wheels start to fall off." Musk couldn't see which engineers were load-bearing because their contributions didn't show up in the metrics visible from the top.</li>
<li>In March 2026, OpenAI <a href="https://petapixel.com/2026/03/24/openai-kills-sora-and-loses-disneys-1b-investment/">killed Sora</a>, its video generation tool, because it was a "distraction". The pivot to ChatGPT-first left the Sora and DALL-E teams starved and feeling like <a href="https://www.digit.in/features/general/openai-ignoring-research-sora-and-dall-e-suggest-people-leaving-chatgpt-maker.html/amp/">second-class citizens</a>. The result was an exodus of significant talent from OpenAI. In summary: centralized leadership greenlit a scatter of products, discovered the strategy was incoherent, and corrected with another centralized decision. The cost wasn't just a cancelled product. It was the people who walked out the door.</li>
</ul>
<p>You have seen some version of this play out in your own workplace. It might be at a lower scale and with lower stakes, but the pattern is the same. What you might not have seen is the mechanism that produces it.</p>
<h2>When does directive governance break?</h2>
<p>Directive governance rests on <a href="/garden/three-assumptions-framework/">three implicit assumptions</a> about the information pipeline connecting the people who decide to the people who do:</p>
<ol>
<li><strong>Compression.</strong> When information is summarized upward, the compression preserves the signal that matters.</li>
<li><strong>Proxy validity.</strong> The quantitative metrics available to decision-makers correlate with the reality they're managing.</li>
<li><strong>Separability.</strong> Decision-making and execution are distinct activities that can be cleanly divided between levels of the hierarchy.</li>
</ol>
<p>These assumptions hold in many industries. In manufacturing, summarizing production into throughput metrics <em>does</em> preserve what matters. Cost per unit <em>does</em> correlate with operational reality. And the engineer who designed the part is genuinely distinct from the worker who fabricates it to spec.</p>
<p>But these don't hold true in tech. To understand why, we need to go back to Hayek and Austin. <a href="https://www.econlib.org/library/Essays/hykKnw.html">Hayek recognized</a> that in any complex system, knowledge is often inherently distributed, tacit, and contextual. It resists centralized aggregation by its nature. And Robert Austin <a href="https://www.oreilly.com/library/view/measuring-and-managing/9780133488425/">demonstrated in 1996</a> that if only some dimensions of work are measurable, then <a href="/garden/partial-measurement-worse-than-none/">measurement-based management actively degrades what cannot be measured</a>. The combination of these two conditions in the tech industry invalidates the three assumptions above.</p>
<h2>Why tech specifically</h2>
<p>In 1986, Fred Brooks drew a distinction in <a href="https://www.cin.ufpe.br/~phmb/ip/MaterialDeEnsino/BrooksNoSilverBullet.html">"No Silver Bullet"</a> that remains true despite all the advances in software engineering: software has <em>essential</em> complexity (the irreducible difficulty of the problem itself) and <em>accidental</em> complexity (the incidental difficulties of our tools and processes). Tools can attack accidental complexity. They cannot touch essential complexity, because it <em>is</em> the problem.</p>
<p>This argument <a href="/garden/essential-complexity-makes-software-ungovernable/">extends to governing the people who build it</a>. When you apply it there, all three assumptions collapse.</p>
<p><strong>Compression fails.</strong> Essential complexity is irreducible by definition. You can compress "we shipped 15 features this quarter" into a slide, but you cannot compress "we chose this abstraction boundary because of how three subsystems will need to evolve independently over the next two years" into anything a non-participant can evaluate. The compression directive governance requires strips precisely the signal that matters. This goes back to Hayek's observation about knowledge's resistance to centralization.</p>
<p><strong>Separability fails.</strong> This is where software diverges most sharply from other industries. In manufacturing, you do the same thing repeatedly. The design decision was already made, and execution follows a spec. Micro-decisions on the line are local and ephemeral. They don't compound.</p>
<p>Everything you build in software is new (if it weren't, you'd just call the API that already does it), and consequently, the act of building software itself is decision making: choosing an abstraction, defining an interface, decomposing a system. And unlike manufacturing, software decisions compound. Every abstraction choice constrains every future choice built on top of it. A manufacturing micro-decision lives and dies in the moment. A software decision shapes the codebase for years.</p>
<p>In software, <a href="/garden/in-software-execution-is-decision-making/">execution <em>is</em> decision-making</a>. Directive governance depends on a separation between the two that doesn't exist.</p>
<p><strong>Proxy validity fails.</strong> <a href="/garden/metrics-measure-maintenance-not-creation/">The metrics that survive the reporting chain</a> (uptime, sprint velocity, cost per headcount) track what keeps the lights on. They don't track what makes the company thrive. Innovation, architectural soundness, the quality of an abstraction, whether a team's trajectory is sustainable: none of these fit in a dashboard. As per Austin's observation, the metrics don't just miss creativity and innovation; they actively redirect effort away from it and toward maintenance.</p>
<p>The tech industry fails these three assumptions structurally and inherently, and the problem is only getting more acute. All of our advances in software engineering (Agile, CI/CD, cloud infrastructure, AI-assisted coding) <a href="/garden/tooling-advances-prove-brooks-right/">serve only to eliminate accidental complexity</a>. This leaves the essential complexity to dominate the signal loss in upward communication and to force leveraged decision-making in execution, while remaining in the blind spot like a ghost moving the metrics.</p>
<h2>The structural implication</h2>
<p>If directive governance fails because the tech industry is structurally inhospitable to it, then we need structural changes to how decisions are made. The structural fix here is that <a href="/garden/subsidiarity/">decisions get made where the information actually lives</a>. But <a href="/garden/subsidiarity-is-not-flat-organization/">subsidiarity is not the same as flattening the org chart</a>.</p>
<p>Does that mean we should 'flatten' the org chart? Is all this talk of 'flattening' really going somewhere? Sigh. I wish. It has been tried, and it produces its own pathologies. Valve's famous flat structure <a href="https://www.pcgamer.com/valves-flat-structure-contains-hidden-layer-of-powerful-management-claims-ex-employee/">concealed a hidden hierarchy</a> of informal cliques. Jo Freeman <a href="https://www.jofreeman.com/joreen/tyranny.htm">diagnosed this dynamic in 1970</a>: <a href="/garden/structurelessness-hides-hierarchy/">structurelessness doesn't prevent hierarchy, it prevents <em>accountable</em> hierarchy</a>. The loudest and most politically savvy end up in charge, with no formal mechanism for review or appeal. Spotify's squad model <a href="https://www.jeremiahlee.com/posts/failed-squad-goals/">never actually worked at Spotify</a>. "Even at the time we wrote it, we weren't doing it," co-author Joakim Sundén admitted. When Zappos adopted holacracy and gave employees an <a href="https://www.entrepreneur.com/business-news/looks-like-zappos-self-management-system-isnt-for-everyone/246076">ultimatum to embrace self-management or leave</a>, 14% of the company walked out the door. Eliminating hierarchy doesn't solve the information problem. It just makes power invisible.</p>
<p>But there are organizations, across industries, at massive scale, that have kept hierarchy while relocating decision authority within it.</p>
<ul>
<li>Toyota gives any assembly line worker the authority to <a href="https://www.amazon.com/Toyota-Way-Management-Principles-Manufacturer/dp/0071392319">pull the <em>andon</em> cord</a> and stop the entire production line when they spot a defect. Why? Because the worker has specific knowledge no dashboard can capture.</li>
<li>Amazon scales by <a href="https://workingbackwards.com/concepts/amazon-single-threaded-teams/">multiplying small teams</a>, not layering hierarchy. The "two-pizza team" is small enough that one leader can have full context. It is directive governance at a scope where the information precondition actually holds, federated across thousands of teams.</li>
<li>Netflix operates on the <a href="https://www.amazon.com/dp/1984877860">explicit principle</a> that leadership's job is to communicate <em>what</em> and <em>why</em>; the people doing the work decide <em>how</em>.</li>
<li>Warren Buffett runs Berkshire Hathaway, a $900 billion conglomerate, with <a href="https://www.berkshirehathaway.com/letters/letters.html">roughly 30 people at headquarters</a>. He handles capital allocation, where a bird's-eye view helps. Subsidiary CEOs handle everything else, where local knowledge is what matters.</li>
<li>And the US Army, an institution built on hierarchy and obedience, formalized <a href="https://www.army.mil/article/225414/combined_arms_center_launches_new_mission_command_doctrine">mission command</a>: commanders communicate intent, subordinates decide execution. L. David Marquet <a href="https://www.amazon.com/Turn-Ship-Around-Turning-Followers/dp/1591846404">transformed the USS Santa Fe</a> from the worst-performing submarine in the fleet to the best by replacing "permission to" with "I intend to."</li>
</ul>
<p>Notice that none of these examples eliminated hierarchy. They all redesigned where decisions happen within it. The tech industry doesn't need to discover a new governance structure. It just needs to snap out of its dogma and stop ignoring what works.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[The grand flattening: AI Slop is just the next step]]></title>
            <link>https://srikanth.sastry.name/the-great-flattening-ai-slop/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/the-great-flattening-ai-slop/</guid>
            <pubDate>Fri, 10 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Reality is an entity of vast, irreducible complexity. It is far more than the human mind can grasp, yet we are forced to operate within it...]]></description>
            <content:encoded><![CDATA[<p>Reality is an entity of vast, irreducible complexity. It is far more than the human mind can grasp, yet we are forced to operate within it. To cope, we rely on simplified models and simulations; essentially, shorthand versions of the world that fit inside our heads. The problem is fidelity. Eventually, the model breaks, and we are forced to confront phenomena we didn't account for and don't know how to handle.</p>
<p>Humanity's response to this problem has not been improvement alone. Each advance in our models brought with it a particular hubris: the conviction that <em>this time</em>, the map was complete; that what couldn't be captured was simply not worth capturing. And it is precisely that conviction that licenses the coercion. If the model is complete, then deviation isn't a sign of the model's limits. It is a sign of reality's defects.</p>
<p>Better maps didn't reduce the impulse to redraw the territory. They justified it.</p>
<p>To eliminate these 'edge cases,' humanity has spent millennia on a grand project: forcing reality to conform to a model we can predict and control. Philosophers have both fueled this project and warned of its side effects, warnings we have summarily ignored. The culmination of this effort is the 'AI Slop' currently inundating us. An ironic final step in subverting our perception of reality itself.</p>
<h2>The legibility project of the pre-modern era</h2>
<p>Humans tend to be "illegible". They are complicated and diverse. Every group has its own customs, traditions, and morality. Controlling and ruling over such an illegible group is near impossible. So this grand project started millennia ago as a mechanism to control people by reducing their illegibility. By making them legible. By flattening their complexity and diversity into a 'compressible' set of behaviors. The first recorded efforts in this direction are the <a href="https://en.wikipedia.org/wiki/Code_of_Hammurabi">Code of Hammurabi</a> and the <a href="https://en.wikipedia.org/wiki/Manusmriti">Manusmṛti</a>.</p>
<p>Yes, these are known to be the first legal texts, but then again, a legal system is essentially a compression algorithm for human behavior; its goal is to reduce a diverse population to a predictable, manageable set of outputs. Of course, there were errors in these models' predictions. Such errors are referred to as "crimes", and entire institutions are dedicated to "correcting" them, not by improving the model, but by coercing human behavior to fit it. "Justice" was really about systemic control.</p>
<p>Of course, due to limitations of technology, the model had incredibly low resolution, and it sought to model only the human behaviors that needed control within the confines of the day's political sovereignty. For the most part, these models left the natural world and our inner worlds alone.</p>
<p>As the project matured, the philosophers were on a mission to build a <em>descriptive</em> model of reality. Plato's Theory of Forms modeled the world's diversity as mere 'noise' deviating from a perfect ideal. Aristotle provided the methodology for deconstructing reality into 'silos of legibility' under the fatal assumption that nothing of value is lost in the gaps. Mathematicians such as Aryabhata and Brahmagupta created descriptive maps to navigate the heavens. However, the impulse toward a <em>prescriptive</em> reality was already visible in the shadows. It lived in Astrology, which forced human destiny to fit a celestial map, and in the sale of Indulgences, which downsampled the infinite complexity of sin into a quantifiable financial transaction. The pivot to a world coerced to conform to the map was not a new idea, but it could not be realized at scale until better technology came along.</p>
<h2>The objectivity of modernity</h2>
<p>The Renaissance and modernity introduced us to the concept of <em>objectivity</em>: the notion that things are true regardless of a subject. The philosophers of this age viewed reality as an object to be observed, dissected, studied. And somehow, we could do it objectively, as if we weren't part of this object we were studying. This paradigm alienated us from our own existence. This alienation allowed philosophers to turn this gaze of objectivity inward into our own lives and how we relate to each other; into our inter-subjectivity. They dissected how we relate to each other, and how we work together to produce goods and make progress. This was categorized and studied with ever more precision. We had new categories to peer into. There was economics and there was psychology and there was political science and there was ethics. Almost as if each of them had nothing to do with each other, and each pursued its own investigations to get to its objective truths.</p>
<p>It wasn't long before this inward gaze of economics turned onto human work, and it didn't see people. It saw <em>functions</em>. A watchmaker, viewed through the economic lens, was not a person embedded in a tradition, a community, a set of relationships. He was a bundle of discrete, separable processes: material procurement, part fabrication, assembly, quality control, distribution. The model couldn't perceive anything it couldn't categorize. And what it couldn't perceive, it treated as if it didn't exist.</p>
<p>This was the Industrial Age: a period of <em>selective blindness</em>. The watchmaker didn't disappear because someone chose to erase him. He disappeared because the model looking at him had no category for what he actually was. We didn't just make more watches; we created a world where the human was only allowed to exist as a low-resolution component of a larger machine.</p>
<h2>Ontological Blinders of the information age</h2>
<p>The 20th century provided the necessary technologies to unify the balkanized silos of modernity. Through the work of Alan Turing, John von Neumann, and Claude Shannon, the messy kinetics of physical reality were recast as pure information processing. "Process Efficiency" was replaced by "Algorithmic Optimization."</p>
<p>We resurrected Plato's theory of Forms, but the forms were now idealized mathematical models. Any deviation from the model became a systematic 'error' that needed rectification. For instance, the nuances and peculiarities around the problems of routing trains between cities, laying down water and sewer pipes in a neighborhood, and moving data packets around a network were all 'unified' by the same optimization algorithms, and in that process those very same nuances and peculiarities were completely marginalized. A missed package was no longer a logistical accident: it was an "error" requiring more "fault tolerance." A worker calling in sick was no longer a human event: it was a "node failure" requiring a "redundancy" patch. These changes happened in the background of our lives, hidden by the perceived convenience of the tools.</p>
<p>The insidious turn occurred when the model overrode the reality. The diversity of the world was rebranded as "noise" that failed to map to the model, rather than the model failing to map to the world. Algorithms started changing human behavior so that it remained compliant with the model's expectations.</p>
<p>The upshot is a society that has mistaken the models for reality. The mask has become the face. You see it in social media where a curated "Instagram life" is accepted as a true representation of existence. You see it in the economy, where macro-economic abstractions like GDP are deemed more "real" than actual economic health. We have now flattened ourselves to be legible to these models. We optimize our lives to improve a credit score as if the score were the reality. "Pics or it didn't happen" is the demand for algorithmic validation of our own subjectivity. We have become ontologically blind to anything that cannot be accounted for by the model.</p>
<h2>The Manufactured Reality of the Intelligence Age</h2>
<p>The 21st century introduced the ultimate agent of the Grand Project: Generative AI. This technology finally detaches the Map from the Territory. But it does so in a way that is categorically different from everything that came before it.</p>
<p>Previous technologies mediated reality. The photograph selected a frame. Television broadcast a produced version of events. Social algorithms surfaced a curated slice of human expression. In each case, the underlying reality was still there, generating the inputs. The mask had become the face; but there was still a face underneath.</p>
<p>Generative AI breaks this relationship entirely. It does not compress reality. It bypasses it. The inputs to a large language model are not live signals from the world; they are prior compressions: text, images, and records of what humans said and made, after already passing through every filter described above. The model trains on the averaged residue of a civilization that had already been flattening itself for centuries. It then generates new outputs optimized for coherence with that averaged signal; maximally legible, frictionlessly consumable, scrubbed clean of the noise that makes any particular perspective distinct from the statistical mean.</p>
<p>We call the result "AI Slop." It is a pejorative that describes the soulless, uncanny nature of these creations. Yet, we cannot stop consuming it. We are addicted to it because it is the path of least resistance. It is content with the highest possible fidelity to the model and the lowest possible fidelity to any individual reality. It has no author, no context, no stake. It is the signal of the average; which is to say, the signal of no one.</p>
<p>We consume it anyway, and at scale. Not because we are foolish, but because this entire arc has been progressively reducing our tolerance for friction, for illegibility, for the effort that genuine encounter with reality requires. AI Slop is not a cause. It is a symptom of a sensory system that has been recalibrated, over centuries, to mistake the model for the thing.</p>
<p>The momentum driving this is 2,500 years in the making. From Hammurabi's codes to Shannon's information theory, every step has iteratively eliminated the human element as an "inefficiency." We are now so far immersed in this episteme that we have lost the ability to distinguish the mask from the face. Previously, the mask became the face. Now there are no more faces. Only masks.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Defense in Depth vs Locality of Behavior]]></title>
            <link>https://srikanth.sastry.name/defense-in-depth-vs-locality-of-behavior/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/defense-in-depth-vs-locality-of-behavior/</guid>
            <pubDate>Mon, 14 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Defense-in-depth—borrowed from military and security strategy—means layering safeguards so that if one fails, another takes over (Cloudflare...]]></description>
            <content:encoded><![CDATA[<p>Defense-in-depth—borrowed from military and security strategy—means layering safeguards so that if one fails, another takes over (<a href="https://www.cloudflare.com/learning/security/glossary/what-is-defense-in-depth/">Cloudflare</a>, <a href="https://en.wikipedia.org/wiki/Defense_in_depth_(computing)">Wikipedia</a>). Defensive programming is a software take on the same idea: add checks and fallbacks so bugs don't escalate.</p>
<p>Meanwhile, the principle of <a href="https://alexkondov.com/locality-of-behavior-react/">locality of behavior</a> (or “locality of behaviour” in htmx) says that "the behavior of a unit of code should be obvious by looking only at that unit" [<a href="https://htmx.org/essays/locality-of-behaviour/">source</a>]. It draws on older ideas of cohesion: keep related logic together.</p>
<p>So when should you favor layering defenses, and when should you co-locate behavior? This isn’t that post. Instead, here’s a story about how leaning on defensive programming without scrutiny let a critical bug stay hidden for far too long.</p>
<p><strong>Background: minor and major compaction.</strong> I was working on a big data system that performed repeated mutations on datasets via commits. Over time, reading slowed down—each read had to apply more mutations. To fix this, my service relied on cheap minor compactions. But unbeknownst to me, there was a fallback: a slow, expensive major compaction if too many mutations piled up. (See the <a href="https://orc.apache.org/docs/acid.html">ORC ACID documentation</a>.)</p>
<p><strong>Unexpected failures.</strong> Suddenly, my service slowed down and sometimes timed out. Digging in, I found it was triggering major compactions. These were so costly that jobs exceeded timeouts and got killed.</p>
<p>Asking around, I learned this was an intentional fallback. It was a defensive programming safeguard in case minor compactions failed. Everyone thought this was great resilience.</p>
<p><strong>Increasingly brittle.</strong> Then I asked: why did minor compactions fail in the first place? Silence. No alerts, no monitoring; no one knew.</p>
<p>Logs revealed major compactions had been quietly running on small datasets for ages due to a bug in minor compaction discovery. The problem stayed hidden because small datasets finished quickly. When larger ones arrived, everything blew up.</p>
<p>Ironically, the fallback meant we never fixed the root issue. Our supposed resilience made the system more fragile.</p>
<p><strong>Could we have seen this coming?</strong> If major compactions didn’t exist—or at least raised an alarm every time—they’d have forced us to fix minor compactions long ago, before the blast radius grew.</p>
<p>So next time you violate locality of behavior for defense-in-depth, think hard. And always alert aggressively when deeper defenses kick in.</p>
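<p>To make that concrete, here is a minimal sketch of what "alert aggressively" could look like. Everything in it is hypothetical (the <code>minor_compact</code>/<code>major_compact</code> names and the exception type are stand-ins, not the actual system from this story); the only point is that the fallback path screams every single time it fires.</p>
<pre><code>import logging

logger = logging.getLogger("compaction")

class MinorCompactionError(Exception):
    """Placeholder failure type for this sketch."""

def minor_compact(dataset):
    ...  # stand-in for the cheap, preferred compaction path

def major_compact(dataset):
    ...  # stand-in for the slow, expensive fallback

def compact(dataset):
    try:
        return minor_compact(dataset)
    except MinorCompactionError:
        # The deeper defense is kicking in. Be loud about it every
        # time, so the fallback can never become the silent norm.
        logger.error(
            "minor compaction failed for %r; falling back to major", dataset
        )
        return major_compact(dataset)
</code></pre>
<p>In a real system, that <code>logger.error</code> would be a metric and a paging alert, not just a log line.</p>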
<p>Building on these well-trodden ideas, this incident is just one more caution: <a href="/garden/defense-in-depth-needs-visibility/">defense-in-depth only works if every fallback is visible and monitored</a>. Otherwise, your “resilience” may just be hiding decay.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[When Backward Compatibility Can Rescue a Leaky Abstraction]]></title>
            <link>https://srikanth.sastry.name/backward-compatibility-where-you-dont-expect/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/backward-compatibility-where-you-dont-expect/</guid>
            <pubDate>Wed, 02 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[I ran into one of those delightful bugs that only show up in dynamic task generation of your data pipelines — the kind that teach you how a...]]></description>
            <content:encoded><![CDATA[<p>I ran into one of those delightful bugs that only show up in dynamic task generation of your data pipelines — the kind that teach you how a leaky abstraction in your pipeline platform can have you scratching your head in confusion.</p>
<p>The short version:
I made a simple function signature change, assuming only future runs would care. Instead, my pipeline broke days later when an old task serialized under the previous signature collided with the new code. The fix? Classic <a href="/garden/backward-compatibility-for-leaky-abstractions/">backward compatibility tricks</a> that saved me from babysitting all existing task runs when making changes in the future.</p>
<p>Here’s the story — and how to avoid learning this lesson the hard way.</p>
<h1>The bug</h1>
<p>I had a <a href="https://medium.com/@AnalyticsAtMeta/data-engineering-at-meta-high-level-overview-of-the-internal-tech-stack-a200460a44fe">Dataswarm</a> operator that would execute a Python function every day, and the output of that Python function was a list of tasks (other functions) to be executed that day.</p>
<p>Here is what the function looked like:</p>
<pre><code># Function that generates the tasks to be executed.
def task_generator(arg1, arg2) -&gt; List[Task]:
    ...

# How the function is invoked
wait_for_data = SomeTaskWaitingForData(data)
tasks_to_be_executed = DynamicTasks(
    wait_for_tasks=[wait_for_data],
    task_gen_function=task_generator,
    task_gen_args={
        "arg1": a,
        "arg2": b,
    }
)
</code></pre>
<p>I put in a diff that looked something like this:</p>
<pre><code># Function that generates the tasks to be executed.
- def task_generator(arg1, arg2) -&gt; List[Task]:
+ def task_generator(arg1, arg3) -&gt; List[Task]:
    ...

# How the function is invoked
wait_for_data = SomeTaskWaitingForData(data)
tasks_to_be_executed = DynamicTasks(
    wait_for_tasks=[wait_for_data],
    task_gen_function=task_generator,
    task_gen_args={
        "arg1": a,
-        "arg2": b,
+        "arg3": c,
    }
)
</code></pre>
<p>You see, I just replaced <code>arg2</code> with <code>arg3</code> and everything looked fine. I tested the diff and landed it, expecting the next task instance to pick up the changes and move on. As you can imagine, that is not what happened :)</p>
<p>I soon got a bug report that said that my pipeline failed with an error: <code>TypeError: 'arg2' is an invalid keyword argument for task_generator()</code>. This had me completely confused. My expectation was that either the <em>previous</em> version of the pipeline would be executed, in which <code>task_generator()</code> is defined to expect <code>arg2</code> and <code>tasks_to_be_executed</code> passes a value for <code>arg2</code>, or the <em>new</em> version of the pipeline would run, where <code>task_generator()</code> expects <code>arg3</code> and <code>tasks_to_be_executed</code> passes <code>arg3</code>. Neither of those two scenarios result in a <code>TypeError: 'arg2' is an invalid keyword argument for task_generator()</code>. So, what's going on?</p>
<h1>The root cause</h1>
<p>After some debugging, I saw that the <code>tasks_to_be_executed</code> task instance that errored out started off two days ago, but was waiting for the <code>wait_for_data</code> to complete, and the <code>wait_for_data</code> task didn't complete until the current day, after which the <code>tasks_to_be_executed</code> task instance ran and errored out. Eventually I found that <code>DynamicTasks</code> serializes the function name and args as a JSON blob at schedule time, waits for upstream tasks to finish, then reloads the function from HEAD and calls it with the original arguments. That’s why old args collided with new code.</p>
<p><img src="/assets/images/dynamicTask-dataswarm-pipeline-failure-2025-07-01.png" alt="" /></p>
<p>Because <code>DynamicTasks</code> persists the function name and args and then later reloads HEAD, it breaks the assumption that changing a function signature only affects new pipeline runs. I only discovered this by digging into <code>DynamicTasks</code> implementation; classic <a href="https://en.wikipedia.org/wiki/Leaky_abstraction">leaky abstraction</a>!</p>
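<p>To see the failure mode in isolation, here is a minimal, self-contained sketch of that schedule-then-execute split. The names (<code>schedule</code>, <code>execute</code>, <code>TASK_QUEUE</code>) are hypothetical stand-ins, not the real Dataswarm internals; the point is only that the args are frozen at schedule time while the function body is resolved at execution time.</p>
<pre><code>import json

TASK_QUEUE = []  # stands in for the scheduler's persistent store

def schedule(func_name, args):
    # Day 0: only the function *name* and its args survive, as JSON.
    TASK_QUEUE.append(json.dumps({"func": func_name, "args": args}))

def execute(payload, registry):
    # Day N: the function body is looked up fresh (i.e., from HEAD),
    # but the args are whatever was serialized days ago.
    record = json.loads(payload)
    return registry[record["func"]](**record["args"])

def task_generator(*, arg1=None, arg3=None):  # the *new* signature
    return []

schedule("task_generator", {"arg1": "a", "arg2": "b"})  # old args
execute(TASK_QUEUE[0], {"task_generator": task_generator})
# TypeError: task_generator() got an unexpected keyword argument 'arg2'
</code></pre>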
<h1>The fix</h1>
<p>Changing the Dataswarm operator implementation to not leak its implementation detail was a pretty heavy lift, and I needed a more narrowly scoped change to unblock myself. So, I needed a way to change the <code>task_generator</code> implementation without tripping over this combination of race condition and leaky abstraction again. Making the <code>task_generator</code> implementation backward compatible accomplishes this quite nicely. But first, I needed to make sure that it could be made backward compatible. That involves a few steps.</p>
<h2>Step 1. Add <code>**kwargs</code></h2>
<p>First, we need to ensure that passing in parameters from the previous version of <code>task_generator</code> does not throw an unexpected exception. We can do that by swallowing all unspecified parameters in <code>**kwargs</code> as follows.</p>
<pre><code># Function that generates the tasks to be executed.
- def task_generator(arg1, arg2) -&gt; List[Task]:
+ def task_generator(
+    *,
+    arg1=None,
+    arg2=None,
+    **kwargs
+ ) -&gt; List[Task]:
+    if kwargs:
+        LOG.warning(f"Found unspecified arguments {kwargs.keys()}")
    ...
</code></pre>
<p>The diff does three things.</p>
<ol>
<li>It ensures that all arguments are passed by name and not by position.</li>
<li>It makes all parameters optional with a default value of <code>None</code>. This ensures that leaving out any specific parameter doesn’t break the call. The reasoning for this is similar to the ones in <a href="https://github.com/protocolbuffers/protobuf/issues/2497">proto3 that made all fields optional</a>.</li>
<li>If the caller passes an unexpected parameter (say, arg13), the function won’t throw an exception. Instead, it logs a warning about the unrecognized parameter and proceeds to execute the function with the remaining parameters.</li>
</ol>
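<p>Here is a self-contained sketch of the resulting behavior (the one-line body stands in for the real task-generation logic). A replayed old payload, even one carrying a parameter the function has never heard of, now logs a warning instead of raising:</p>
<pre><code>import logging

logging.basicConfig(level=logging.WARNING)
LOG = logging.getLogger(__name__)

def task_generator(*, arg1=None, arg2=None, **kwargs):
    if kwargs:
        LOG.warning(f"Found unspecified arguments {kwargs.keys()}")
    return []  # stand-in for the real task list

# A replayed old payload plus a stray parameter; nothing raises:
task_generator(arg1="a", arg2="b", arg13="unexpected")
# logs: Found unspecified arguments dict_keys(['arg13'])
</code></pre>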
<p>Land this change and wait for it to propagate to all your task instances.</p>
<h2>Step 2. Change your function signature</h2>
<p>Now you are ready to make changes to your function signature without breaking existing tasks. Suppose you want to remove <code>arg2</code> and introduce <code>arg3</code>. Your diff would look like this.</p>
<pre><code>def task_generator(
    *,
    arg1=None,
-   arg2=None,
+   arg3=None,
    **kwargs
) -&gt; List[Task]:
    if kwargs:
        LOG.warning(f"Found unspecified arguments {kwargs.keys()}")
+   if not arg3:
+       arg2 = kwargs.get("arg2", None)
+       # Old business logic with arg2
        ...
+       return tasks
+   # New business logic with arg3
+   ...
+   return tasks
</code></pre>
<p>When you land this, you could have tasks scheduled to run that are currently persisting the old function signature. When such tasks execute your new function definition, <code>**kwargs</code> will swallow <code>arg2</code> and <code>arg3</code> is set to its default value <code>None</code>. The function will see that <code>arg3</code> is None, so it will look for <code>arg2</code> in kwargs and execute the old business logic.</p>
<p>However, for all new instances of your task, <code>arg3</code> is set, and so the function executes the new business logic. Backward compatibility accomplished!</p>
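<p>Compressed into a runnable sketch (with strings standing in for real <code>Task</code> objects), the dispatch behaves like this:</p>
<pre><code>def task_generator(*, arg1=None, arg3=None, **kwargs):
    if not arg3:
        arg2 = kwargs.get("arg2", None)
        return [f"old-logic({arg1}, {arg2})"]  # old business logic
    return [f"new-logic({arg1}, {arg3})"]      # new business logic

# Old payload serialized before the change: arg2 lands in **kwargs.
assert task_generator(arg1="a", arg2="b") == ["old-logic(a, b)"]
# New payload: arg3 is set, so the new business logic runs.
assert task_generator(arg1="a", arg3="c") == ["new-logic(a, c)"]
</code></pre>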
<h2>Step 3. Delete old functionality</h2>
<p>After all your old task instances have completed execution, you are now ready to remove the old business logic. This is a simple red diff.</p>
<pre><code>def task_generator(
    *,
    arg1=None,
    arg3=None,
    **kwargs
) -&gt; List[Task]:
    if kwargs:
        LOG.warning(f"Found unspecified arguments {kwargs.keys()}")
-   if not arg3:
-       arg2 = kwargs.get("arg2", None)
-       # Old business logic with arg2
-        ...
-       return tasks
    # New business logic with arg3
    ...
    return tasks
</code></pre>
<p>And, you are done!</p>
<h1>Lessons learned</h1>
<ul>
<li>
<p>Pipeline frameworks can serialize more than you think. DynamicTasks serialized the function name and arguments days earlier, then loaded the function definition fresh from HEAD. That mismatch broke everything.</p>
</li>
<li>
<p>Stage changes with <code>**kwargs</code> and defaults. When changing function signatures that might still be called by older task payloads, always accept extra kwargs and use <code>None</code> defaults to gracefully detect old vs. new callers.</p>
</li>
<li>
<p>Expect your abstractions to leak. If your orchestration tool stores data and code separately (JSON blobs now, functions later), your assumption that “old code only calls old function signatures” is toast.</p>
</li>
<li>
<p>Logging unrecognized parameters is gold. Instead of crashing, you get explicit warnings when old payloads collide with new code. Debugging becomes a thousand times easier.</p>
</li>
</ul>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Const Refs vs. Raw Pointers: Fixing Shared Pointer Reads]]></title>
            <link>https://srikanth.sastry.name/const-ref-vs-raw-ptr-a-fix-for-shared-ptr-reads/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/const-ref-vs-raw-ptr-a-fix-for-shared-ptr-reads/</guid>
            <pubDate>Thu, 26 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Recently, I encountered a subtle performance issue while refactoring some C++ code. I was passing a std::shared_ptr<T> by value into a funct...]]></description>
            <content:encoded><![CDATA[<h2>The problem</h2>
<p>Recently, I encountered a subtle performance issue while refactoring some C++ code. I was passing a std::shared_ptr&lt;T&gt; by value into a function, even though the callee only needed read access. Infer flagged it as <a href="https://fbinfer.com/docs/all-issue-types/#pulse_readonly_shared_ptr_param"><code>PULSE_READONLY_SHARED_PTR_PARAM</code></a>. Infer was right: passing shared pointers by value incurs refcount overhead, and if multiple threads are sharing the pointer, it can introduce performance regressions. My code looked something like this.</p>
<pre><code>void caller(std::shared_ptr&lt;T&gt; shared_ptr) {
  ...
  callee(shared_ptr);
}

void callee(std::shared_ptr&lt;T&gt; ptr) {
  auto foo = ptr-&gt;read_value() + 1;
}
</code></pre>
<h2>Infer's suggestion, and why it's wrong</h2>
<p>Infer's documentation around <a href="https://fbinfer.com/docs/all-issue-types/#pulse_readonly_shared_ptr_param"><code>PULSE_READONLY_SHARED_PTR_PARAM</code></a> says the following:</p>
<blockquote>
<p>This issue is reported when a shared pointer parameter is a) passed by value and b) is used only for reading, rather than lifetime extension. At the callsite, this might cause a potentially expensive unnecessary copy of the shared pointer, especially when many number of threads are sharing it. To avoid this, consider 1) passing the raw pointer instead and 2) use <code>std::shared_ptr::get</code> at callsites.</p>
</blockquote>
<p>So, its suggestion was to change my code to the following:</p>
<pre><code>void caller(std::shared_ptr&lt;T&gt; shared_ptr) {
  ...
  callee(shared_ptr.get());
}

void callee(T* ptr) {
  auto foo = ptr-&gt;read_value() + 1;
}
</code></pre>
<p>Sure, Infer got the diagnosis right, but the proposed solution of using raw pointers seems wrong. Smart pointers (unique_ptr and shared_ptr) were introduced precisely to avoid the many footguns associated with raw pointer memory management and safety. There should be almost no good reason to use raw pointers, and the use case above is far too trivial to justify one. In fact, if I passed a raw pointer, some future developer might wrap it in a new shared_ptr and pass it elsewhere. That’s a recipe for double-free bugs and a nasty core dump.</p>
<p>Ranting aside, I still couldn’t let the change stand as-is; the Infer error was pointing to a legitimate problem. So, how do we address this issue without resorting to raw pointers? Answer: const refs :)</p>
<h2>Const refs to the rescue</h2>
<p>Instead of passing the raw pointer, what if we just passed the underlying object itself? Well, we don't really need a copy of the object, and so we can pass a reference to it. Since we’re only calling a read-only method, a const reference works just fine. With that, we have this fix:</p>
<pre><code>void caller(std::shared_ptr&lt;T&gt; shared_ptr) {
  ...
  callee(*shared_ptr);
}

void callee(const T &amp;obj) {
  auto foo = obj.read_value() + 1;
}
</code></pre>
<p>And it works like a charm.</p>
<h2>Why not const ref the shared pointer itself?</h2>
<p>Of course, passing a const reference to the object isn’t the only way to avoid copying the shared pointer. You can also simply pass the shared pointer by reference!</p>
<pre><code>void caller(std::shared_ptr&lt;T&gt; shared_ptr) {
  ...
  callee(shared_ptr);
}

- void callee(std::shared_ptr&lt;T&gt; ptr) {
+ void callee(const std::shared_ptr&lt;T&gt; &amp;ptr) {
  auto foo = ptr-&gt;read_value() + 1;
}
</code></pre>
<p>(I showed this as a diff because the change is subtle! Also note: taking a reference to a <code>shared_ptr</code> avoids bumping the reference count, so there’s no added contention.)</p>
<p>Now, be careful here. You can't use this trick if <code>shared_ptr</code> could be a <code>nullptr</code>: the reference itself passes through fine, but the first <code>ptr-&gt;read_value()</code> on a null pointer is a great way to bring down your service! In my case, it turns out that <code>shared_ptr</code> was guaranteed to be non-null, and so this trick works well.</p>
<p>So next time someone suggests using a raw pointer, be skeptical—there’s almost always a safer alternative to that particular footgun.</p>
<h2>Oh, and one last thing...</h2>
<p>I’ll stop here, but don’t walk away thinking const refs are a cure-all. They can backfire too. <a href="https://belaycpp.com/2022/02/15/constant-references-are-not-always-your-friends/">This post in Belay the C++</a> outlines some of the less obvious pitfalls.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Changing your Jekyll theme without losing your mind (or your content!)]]></title>
            <link>https://srikanth.sastry.name/change-jekyll-theme/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/change-jekyll-theme/</guid>
            <pubDate>Sun, 22 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[After I moved my website from Wordpress to Jekyll, I hadn't changed the theme for nearly 5 years. When I finally decided to change the theme...]]></description>
<content:encoded><![CDATA[<p>After I moved my website from Wordpress to <a href="https://jekyllrb.com">Jekyll</a>, I hadn't changed the theme for nearly 5 years. When I finally decided to change the theme recently, it turned out to be a lot more complicated than I expected. After a lot of trial and error, internet searches, and asking ChatGPT, I managed to get the theme changed. As a note to my future self, and to anyone else struggling to update their Jekyll theme, I am outlining the steps that will make moving from one theme to another relatively straightforward.</p>
<p><strong>Disclaimer:</strong> While these steps will make it easy for your site to start looking closer to your desired theme, it is by no means a turnkey solution. You will still need to do a fair amount of hand editing for the new theme to work with your existing content. So, make sure you are able and willing to spend time fiddling with various configs, settings, and markdown front matter.</p>
<h2>Prerequisites</h2>
<p>Before we begin, here are the prerequisites:</p>
<ol>
<li>Your site is already version controlled by git. Ideally, it is already on <a href="https://www.github.com">Github</a>.</li>
<li>You are able to run Jekyll locally. If not, please follow the <a href="https://jekyllrb.com/docs/">Jekyll documentation</a> and then come back here.</li>
<li>The new theme that you have picked out for your site is a <a href="https://jekyllrb.com/docs/themes/#understanding-gem-based-themes">gem-based theme</a>. Technically, the steps outlined below can be tweaked for a regular theme as well; I will include an addendum at the bottom on how to work with regular themes.</li>
<li>The post assumes that your Jekyll site source is in the directory <code>~/github/website</code>. All references to <code>~/github/website</code> should be replaced by the location of your source.</li>
<li>I assume that the theme you want is the <a href="https://github.com/a-chacon/wind">Wind theme</a>, which can be installed via <code>gem "wind-theme"</code>. You can change it to whatever theme you pick.</li>
</ol>
<h2>Requirements</h2>
<ol>
<li>The new theme is installed in the same location as the original site.</li>
<li>The content from the original site is preserved.</li>
<li>The new theme lands as a single commit on top of the old theme. We do not want to lose the commit history.</li>
<li>No vestiges of the old theme remain.</li>
<li>This should be done via a <a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests">Pull Request</a>/merge and not via a <a href="https://git-scm.com/docs/git-push#Documentation/git-push.txt---force">force-push</a>.</li>
</ol>
<h2>The Guide</h2>
<h3>Start with a clean slate</h3>
<p>To begin, we’ll create a clean working directory. We don’t want to simply nuke <code>~/github/website</code> since it contains all your site's content. And since we want to keep the commit history, we cannot use an orphan branch. So, we start by creating a branch off of <code>main</code>, removing its contents so we can set up a fresh Jekyll site, and then working on that branch exclusively until we are ready to publish a Pull Request. Here are the steps.</p>
<p>First, create a new branch from <code>main</code>, and then nuke everything. This creates an empty working directory in the new branch, while preserving your full site on main.</p>
<pre><code>cd ~/github/website
git checkout main
git checkout -b install_new_theme
git rm -rf .
git clean -dfx
</code></pre>
<p><em>Note: <code>git clean -dfx</code> removes any untracked files and directories, including ignored files, to fully clean the working tree.</em></p>
<p>Now you have a clean directory to install things in. Next, we install a fresh Jekyll site:</p>
<pre><code>gem install jekyll bundler
bundle init
cd ..  # Change directory to `~/github`
jekyll new website --force
cd ~/github/website
</code></pre>
<p><em>Warning: <code>jekyll new website --force</code> overwrites contents of <code>website/</code> and so make sure it doesn’t contain anything important that isn’t backed up.</em></p>
<p>Now, this is an empty Jekyll site, and we are going to change the theme on this empty site to the theme that you picked out.</p>
<h3>Install new theme</h3>
<p>Recall that I am going to install <code>wind-theme</code>. To do that, I edit the Gemfile <code>~/github/website/Gemfile</code> as per the <a href="https://github.com/a-chacon/wind?tab=readme-ov-file#advanced-installation-optional-">theme instructions</a> and add the following line.</p>
<pre><code>gem "wind-theme"
</code></pre>
<p>You can either use the default <code>_config.yml</code>, or copy your existing one from <code>main</code> via:</p>
<pre><code>git checkout main -- _config.yml
</code></pre>
<p>Now, edit your <code>_config.yml</code> to set the theme to <code>wind-theme</code> (as per <a href="https://github.com/a-chacon/wind?tab=readme-ov-file#advanced-installation-optional-">theme instructions</a>).</p>
<pre><code>theme: wind-theme
</code></pre>
<p>Now run the installer and then serve the site locally (in another shell).</p>
<pre><code>cd ~/github/website
bundle
bundle exec jekyll serve
</code></pre>
<p>The empty site with your new theme should now be accessible at http://localhost:4000/. Make sure that it looks and feels as advertised, and that no errors pop up when building the site. At this point, if you are seeing issues, you will have to roll up your sleeves, figure out what went wrong, and fix it. Once you are happy with your site, it's time to fill it up with your content.</p>
<h3>Bring back your content</h3>
<p>Your site's content should be in the <code>main</code> branch's <code>_posts</code>, plus any other directories you may have created to store content (such as <code>documents</code>, <code>images</code>, <code>assets</code>, etc.). The top-level pages of your site should be the top-level <code>.md</code> or <code>.markdown</code> files. Then, you also have your <code>.gitignore</code>, <code>CNAME</code>, etc. Bring them all back by copying them from the <code>main</code> branch as follows:</p>
<pre><code>git checkout main -- _posts about.md something_else.md CNAME .editorconfig .github
...
</code></pre>
<p><em>Note: If you want to bring back everything except the theme-related directories, you can either list the paths explicitly (as above), or check out everything and then remove the theme-related files again.</em></p>
<p>The shell building your site should pick these changes up and update the local build automatically. Check the console logs on the output of <code>bundle exec jekyll serve</code> to make sure that things are working correctly. Go check out http://localhost:4000/ to make sure that all your content is present and looks as expected.</p>
<p>Again, if something looks off, it is time to get your hands dirty, figure out what's gone awry, and fix it. From here, you may need to debug based on your content and theme setup.</p>
<p>Once you are happy with the new theme, it is time to make it official!</p>
<h3>Switching over to the new theme</h3>
<p>First, let us commit all these changes we made into one fat commit.</p>
<pre><code>git commit -am "Replace site with new jekyll theme"
</code></pre>
<p>Once you’re satisfied, push the branch to origin and open a pull request for review and merge.</p>
<pre><code>git push --set-upstream origin install_new_theme
</code></pre>
<p>If you’re not using GitHub PRs, you can merge the changes directly into <code>main</code> via the CLI:</p>
<pre><code>git checkout main
git merge install_new_theme
</code></pre>
<p>Once merged to main, you should be good to go. Congratulations, you have successfully changed your site's theme!</p>
<h2>Addendum: using regular themes</h2>
<p>If you want to install a regular theme that is not gem-based, then instead of following the "Install New Theme" section's instructions, do the following.</p>
<ol>
<li>
<p><strong>Download the regular theme files.</strong> You can fork the repo on GitHub and clone it locally if you want. That is what many repos suggest, but I prefer not to, because those instructions assume you will be building your entire site on top of that repo. That is not what you want; presumably you want to keep all of your content and its commit history. So, don't clone the repo. Just download it to a new directory, say <code>~/tmp/new_theme</code>.
<img src="/assets/images/github-download-repo.png" alt="Download from github image" /></p>
</li>
<li>
<p><strong>Copy over the theme files to your site directory (<code>~/github/website</code>).</strong> This includes directories such as <code>_includes</code>, <code>_layouts</code>, <code>_data</code>, <code>assets</code>, and typically any other directory that starts with <code>_</code> (except for <code>_site</code>). There might also be <code>.js</code> or other files in the main directory that you might have to copy. (See the sketch after this list.)</p>
</li>
<li>
<p><strong>Caution.</strong> Before moving on to the instructions in the "Bring back your content" section, make sure that whatever content you bring back does not overwrite or interfere with the theme files you just copied over. <em>E.g.</em>, you might have images or icons in your <code>assets</code> directory, and the theme might also store some files in the <code>assets</code> directory that you just copied over. <code>git checkout main -- assets</code> might overwrite your new theme files with old site content, so use caution.</p>
</li>
</ol>
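<p>As a rough illustration of step 2, here is a minimal Python sketch, assuming the theme was downloaded to <code>~/tmp/new_theme</code> and your site lives in <code>~/github/website</code> (adjust both paths, and the directory list, to your theme):</p>
<pre><code>import shutil
from pathlib import Path

SRC = Path.home() / "tmp" / "new_theme"   # the downloaded theme
DST = Path.home() / "github" / "website"  # your site's source

# Copy the usual theme directories; extend this list for your theme.
for name in ["_includes", "_layouts", "_data", "assets"]:
    src = SRC / name
    if src.is_dir():
        shutil.copytree(src, DST / name, dirs_exist_ok=True)
</code></pre>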
<h2>Final Checklist</h2>
<ul>
<li>[x]  Your site is backed up or version-controlled</li>
<li>[x]  You've created a new branch from <code>main</code> (e.g., <code>install_new_theme</code>)</li>
<li>[x]  You've cleaned out old content in the new branch</li>
<li>[x]  You've installed a fresh Jekyll site and configured your new theme</li>
<li>[x]  You've served the site locally and confirmed it builds with no errors</li>
<li>[x]  You've copied over your original content and ensured nothing essential was lost or overwritten</li>
<li>[x]  You've tested your new site at http://localhost:4000/ and confirmed the theme looks correct</li>
<li>[x]  You've committed the changes as one clean commit</li>
<li>[x]  You've pushed the branch and opened a PR — or merged the branch into <code>main</code></li>
<li>[x]  Your <code>main</code> branch now reflects the new theme and your full content</li>
</ul>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Cyclomatic Complexity: How Low Can You Go?]]></title>
            <link>https://srikanth.sastry.name/reduce-cyclomatic-complexity/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/reduce-cyclomatic-complexity/</guid>
            <pubDate>Tue, 17 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Ever spend 20 minutes trying to figure out why your bug fix or feature code isn't triggering or being executed — only to realize you missed...]]></description>
            <content:encoded><![CDATA[<h2>What even <em>is</em> Cyclomatic Complexity?</h2>
<p>Ever spend 20 minutes trying to figure out why your bug fix or feature code isn't triggering or being executed — only to realize you missed a buried branch in someone’s 10-path function? That’s <a href="https://en.wikipedia.org/wiki/Cyclomatic_complexity">Cyclomatic Complexity</a> in action. Intuitively, you can think of Cyclomatic Complexity as the number of possible paths a single execution of a function can take.</p>
<p>For example, <code>a = b + c</code> has a cyclomatic complexity of one, and <code>a = b + c if foo else d + e</code> has a cyclomatic complexity of two: one path is when <code>foo</code> is <code>True</code> and the effective logic is <code>a = b + c</code>, and the other path is when <code>foo</code> is <code>False</code> and the effective logic is <code>a = d + e</code>.</p>
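<p>If you'd rather not count paths by hand, a tool can report cyclomatic complexity for you. Here is a small sketch using the third-party <code>radon</code> package (an assumption on my part; any complexity checker works):</p>
<pre><code># pip install radon  (third-party package; assumed available)
from radon.complexity import cc_visit

SNIPPET = '''
def f(a, b, flag):
    return a + b if flag else a - b
'''

# cc_visit parses the code and returns one result per function/class.
for block in cc_visit(SNIPPET):
    print(block.name, block.complexity)  # prints: f 2
</code></pre>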
<h2>Ain't got no time? Here's the goods.</h2>
<p>If you take just one thing away from this note, then let it be this.</p>
<blockquote>
<p><strong>Strive to reduce the <a href="/garden/reduce-cyclomatic-complexity/">Cyclomatic Complexity</a> of your code; your team and your future self will thank you!</strong></p>
</blockquote>
<h2>Time to hit the brain gym, bro</h2>
<p>As an exercise, I will let you figure out the cyclomatic complexity of the following piece of code:</p>
<pre><code>env_val = os.environ.get('...')
switcher_val = False
if env_val is not None:
    jk_val = True
    if env_val.lower() in ["true", "1", "yes"]:
        env_val = True
    else:
        env_val = False
else:
    env_val = True
    switch_name = "/switch/name/from/config"
    switcher_val = switcher.check(switch_name, switchval=region)
if env_val or switcher_val:
    apply_some_config(job)
</code></pre>
<p>I'll wait... (Spoiler: It's not pretty.)</p>
<p>Give up? Turns out, it is <code>4</code>: the three if-checks contribute three branching points, and the cyclomatic complexity is one more than that; <em>ergo</em> <code>4</code>.</p>
<p>Next, by spending no more than 60 seconds looking at this code, can you tell me what exactly it is doing? BTW, this is real production code that I ran across when debugging an issue, and it took me a long while to make sure I knew exactly when and how the config is applied. It wasn't obvious at all. If you can grok this in 60 seconds, take a bow!</p>
<h2>Reeling yet?</h2>
<p>Anyway, making sense of functions with high cyclomatic complexity is annoying. It’s notoriously difficult to write tests with good coverage for these functions, and in general, they tend to be bug factories.</p>
<p>And yet — somehow — a lot of senior software engineers don’t seem to grok this. I keep seeing deeply nested <code>if-else</code> blocks, sometimes inside loops with <code>break</code>s and <code>continue</code>s, and it doesn’t seem to bother anyone! It’s like we’ve collectively normalized this cognitive overhead.</p>
<p>Why?! Why are we putting up with this crap? It’d never fly in an interview.</p>
<h2>Yo, let's fix it up!</h2>
<p>Coming back to the above example, the confusion and ugliness of this code really got to me. It got so bad I considered dusting off a Karnaugh map. After some much needed grokking, I managed to simplify it down to a cyclomatic complexity of <code>2</code>! :)</p>
<p>In the end, here’s what that poor little code snippet was trying to do:</p>
<pre><code>import os

# Apply config when '...' environment variable is True, else check the switch
__ENV_VARIABLE = '...'
__SWITCHER_KEY = '/switch/name/from/config'
def has_env_override():
    val = os.environ.get(__ENV_VARIABLE)
    return val is not None and val.lower() in {"true", "1", "yes"}

if (
    has_env_override() or
    switcher.check(__SWITCHER_KEY, switchval=region)
):
    apply_some_config(job)
</code></pre>
<p>Fewer paths, fewer bugs. Cleaner code. Happier teammates. What’s not to love?</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[TDD for Bug Fixes]]></title>
            <link>https://srikanth.sastry.name/tdd-for-bug-fixes/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/tdd-for-bug-fixes/</guid>
            <pubDate>Wed, 11 Jun 2025 00:00:00 GMT</pubDate>
<description><![CDATA[I have seen way too many 'senior' engineers get bug fixing wrong. It is common to see an engineer send a pull request titled "bug fix: <some...]]></description>
<content:encoded><![CDATA[<p>I have seen way too many 'senior' engineers get bug fixing wrong. It is common to see an engineer send a pull request titled "bug fix: &lt;something&gt;" where the PR has changes to the functional code that fix the bug, and a corresponding test case that shows that the bug is fixed. If that sounds reasonable, THINK AGAIN — you’ve walked right into the classic trap!</p>
<p><strong>If you are sending PRs for bug fixes with functional code change and an added test case in the same PR/commit, then you are doing it wrong!</strong></p>
<p>The crux of the problem is the following: HOW DO YOU <em>KNOW</em> YOU’RE SMASHING THAT BUG? HOW CAN YOU BE SURE YOUR TEST ISN’T A DUD?! Your answer better not be <em>VIBE CHECKS</em> or just <em>STARING REALLY HARD</em>! If you are having to deploy your entire service/library and run an end-to-end test to demonstrate correctness, then you are doing too much, and you still haven't demonstrated that the unit test actually captures the previously erroneous behavior.</p>
<p>There is this shiny little concept called <a href="https://en.wikipedia.org/wiki/Test-driven_development">Test Driven Development (TDD)</a> that is mighty useful here. You can peruse the wikipedia link to figure out what TDD is exactly. This note will show you how to <a href="/garden/tdd-for-bug-fixes/">use TDD for bug fixes</a>.</p>
<p>Here are simple steps to fixing bugs using TDD:</p>
<ol>
<li>
<p>🕵️ Discover the bug. BAM! There it is! Your nemesis!</p>
</li>
<li>
<p>🧪 Create a PR that adds a new unit test that exposes the bug. YOWZA!</p>
</li>
<li>
<p>🔧 Create a second PR on top of the first PR that makes the functional code change and changes the expectation in the unit test accordingly. That should squash the bug! KAPOW!</p>
</li>
<li>
<p>💰 Justice is served! PROFIT!</p>
</li>
</ol>
<p><img src="/assets/images/tdd-bug-lifecycle.png" alt="" /></p>
<p>Still not sure? Let's demonstrate this with an example. Say, there is a bug that you discovered and know how to fix it.</p>
<p>First, you create a PR that demonstrates the bug by invoking your SUT with the offending input, and sets the expected value to be <em>incorrect</em> so that the test case actually <em>passes</em> with this incorrect value; thus demonstrating the bug.</p>
<pre><code>class TestSUT(unittest.TestCase):
    ...
    def test_bug_b12345(self) -&gt; None:
        '''
        Test to expose bug b12345
        '''
        # Arrange
        sut = SUT(...)
        
        # Act
        actual = sut.test_method(input="bad-input")

        # Assert
        self.assertEqual(actual, "bad buggy output")
        # The assertion above demonstrates the bug b12345
        # The right expected value should be "correct output".
        # self.assertEqual(actual, "correct output")
        
</code></pre>
<p>You can send that PR out for review and merge it in. Now you have solid proof that you have found a bug, and reproduced it.</p>
<p>Next, you have a new PR that fixes the bug. If your bug fix is correct, then the test <code>test_bug_b12345</code> should now start failing: the output of <code>sut.test_method(input="bad-input")</code> should be <code>"correct output"</code> and not <code>"bad buggy output"</code>. So, you now modify the unit test <code>test_bug_b12345</code> in that same PR, so that it looks as follows:</p>
<pre><code>    def test_bug_b12345(self) -&gt; None:
        '''
        Test to expose bug b12345
        '''
        # Arrange
        sut = SUT(...)
        
        # Act
        actual = sut.test_method(input="bad-input")

        # Assert
-       self.assertEqual(actual, "bad buggy output")
-       # The assertion above demonstrates the bug b12345
-       # The right expected value should be "correct output".
-       # self.assertEqual(actual, "correct output")
+       self.assertEqual(actual, "correct output")
</code></pre>
<p>Now your test should pass. This second PR is conclusive proof that your diff now fixes the bug! So, merge it in. Deploy with confidence. <strong>BOOM — PROFIT!</strong></p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Let Sleeping Engineers Lie: Why Your Alerts Should Match Your SEVs]]></title>
            <link>https://srikanth.sastry.name/sync-your-alerts-to-your-sev-criteria/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/sync-your-alerts-to-your-sev-criteria/</guid>
            <pubDate>Sat, 07 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[At work, I had a customer team that aspired to be “customer first.” To them, that meant fixing issues before they became SEVs. That was all...]]></description>
<content:encoded><![CDATA[<p>At work, I had a customer team that aspired to be <em>“customer first.”</em> To them, that meant fixing issues <em>before</em> they became SEVs. That was all well and good, except that the way they went about it was to fire alerts well <em>before</em> their SLOs were close to being breached. Of course, I knew nothing about it until I was on the receiving end of their 'aspiration'.</p>
<p>It’s 4 AM, and I am in deep sleep. Suddenly, my phone, overriding all silencing settings, starts ringing like there is no tomorrow. Naturally, I am being paged. I wake up bleary-eyed, acknowledge the page, and join the team channel. Helpfully, the customer team oncall has a message for me: <strong>“Your service has a latency spike. Please look into it.”</strong></p>
<p>I drag myself to a laptop, check the graphs, and yes — there <em>was</em> a p99 latency spike; it lasted about half an hour and was already waning. Our SLOs were fine; at these latency levels, our latency SLOs wouldn't breach for another 30 minutes. I double-checked <em>their</em> SEV criteria, and those were also still green! So why the 4 AM fire drill?</p>
<p>Turns out, they’d set up their alerts to go off when their p99 latency went above the normal limits for 30 minutes, but their SLO wouldn't be breached until the elevated p99 persisted for 60 minutes. A twitchy alert, if you ask me!</p>
<p>Their on-call had no idea what to do with the alert, saw my service mentioned, and did the classic move:</p>
<blockquote>
<p><em>“When in doubt, escalate!”</em></p>
</blockquote>
<p>So now <em>I’m</em> awake, trying to make sense of a 30-minute p99 latency increase that is fixing itself. I asked:</p>
<blockquote>
<p><strong>“Where's the SEV?”</strong></p>
</blockquote>
<p>I imagine the scene went something like this.
<img src="/assets/images/where-sev-where-impact.jpg" alt="" /></p>
<p>Silence. Five minutes later, "Here is the SEV number..." The SEV had been created two minutes earlier. Facepalm!</p>
<p>Here’s what actually happened:</p>
<ul>
<li>The latency spike lasted about 30 minutes.</li>
<li>The system auto-healed.</li>
<li>The affected service was user-facing, but this was deep in the off-hours.</li>
<li>Total estimated user impact: somewhere between <em>“negligible”</em> and <em>“none.”</em></li>
</ul>
<p>We could’ve all just slept through it and looked at it with fresh eyes in the morning. Instead, two engineers got pulled into zombie mode to stare at graphs that improved all by themselves. It was like debugging a ghost.</p>
<h3>Moral of the story:</h3>
<p>If your <a href="/garden/align-alerts-to-sev-criteria/">alert is going to wake someone up at 4 AM</a>, it better be for something that <em>actually</em> matters. If there's no SEV, no SLO breach, and no clear user impact — maybe let sleeping engineers lie.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[The Law of Demeter and unit tests]]></title>
            <link>https://srikanth.sastry.name/law-of-demeter-and-unit-tests/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/law-of-demeter-and-unit-tests/</guid>
            <pubDate>Fri, 22 Jul 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[The Law of Demeter essentially says that each unit should only talk to its 'immediate friends' or 'immediate dependencies', and in spirit, i...]]></description>
<content:encoded><![CDATA[<p>The <a href="https://en.wikipedia.org/wiki/Law_of_Demeter">Law of Demeter</a> essentially says that each unit should only talk to its 'immediate friends' or 'immediate dependencies'; in spirit, it points to the principle that each unit should only have the information it needs to meet its purpose. In that spirit, the Law of Demeter takes two forms that are relevant to <a href="/garden/law-of-demeter-and-testing/">making your code more testable</a>: (1) object chains, and (2) fat parameters.</p>
<h2>Object Chains</h2>
<p>This is the more classic violation of the Law of Demeter[^1]. This happens when a class <code>C</code> has a dependency <code>D</code>, and <code>D</code> has method <code>m</code> that returns an instance of another class <code>A</code>. The violation happens when <code>C</code> accesses <code>A</code> and calls a method in <code>A</code>. Note that only <code>D</code> is the 'immediate' collaborator/dependency of <code>C</code>, and not <code>A</code>. The Law of Demeter says that <code>C</code> should not be accessing the method in <code>A</code>.</p>
<pre><code># A violation of the Law of Demeter looks as follows.
## Example 1:
c.d.m().methodInA()

## Example 2:
d: D = c.d
a: A = d.m()
a.methodInA()
</code></pre>
<p>What is the problem with violating the Law of Demeter?  Consider the following production code:</p>
<pre><code>class UpdateKVStore:
    def __init__(self, client: KVStoreClient) -&gt; None:
        self.client = client
        
    def update_value(self, new_content: Content) -&gt; Status:
        transaction: KVStoreClient.Transaction = self.client.new_transaction()
        if transaction.get_content() == new_content:
            # Nothing to update
            transaction.end()
            return Status.SUCCESS_UNCHANGED
        mutation_request: KVStoreClient.MutationRequest = (
            transaction.mutation_request().set_content(new_content)
        )
        mutation: KVStoreClient.Mutation = mutation_request.prepare()
        status: Status = mutation.land()
        return status
</code></pre>
<p>Now how would you unit test this? The test doubles for testing this code will look something like this</p>
<pre><code>from unittest.mock import MagicMock

mock_client = MagicMock(spec=KVStoreClient)
mock_transaction = MagicMock(spec=KVStoreClient.Transaction)
mock_mutation_request = MagicMock(spec=KVStoreClient.MutationRequest)
mock_mutation = MagicMock(spec=KVStoreClient.Mutation)

mock_client.new_transaction.return_value = mock_transaction
mock_transaction.mutation_request.return_value = mock_mutation_request
mock_mutation_request.prepare.return_value = mock_mutation
</code></pre>
<p>Now you can see how much the class <code>UpdateKVStore</code> and its unit tests need to know about the internals of the <code>KVStoreClient</code>. Any changes to how the <code>KVStoreClient</code> implements the transaction will cascade into test failures on all its clients! That's a recipe for a <a href="https://srikanth.sastry.name/unit-test-attributes-and-their-trade-offs/">low accuracy</a> test suite.</p>
<p>There are a few ways to address this. For instance, if <code>KVStoreClient</code> could be recast as a <code>Transaction</code> factory, with all operations associated with a transaction encapsulated within the <code>Transaction</code> class, then <code>UpdateKVStore</code> can be modified as follows:</p>
<pre><code>class UpdateKVStore:
    def __init__(self, client: KVStoreClient) -&gt; None:
        self.client = client  # Now a Factory class for Transaction.
        
    def update_value(self, new_content: Content) -&gt; Status:
        transaction: KVStoreClient.Transaction = self.client.new_transaction()
        if transaction.get_content() == new_content:
            # Nothing to update
            transaction.end()
            return Status.SUCCESS_UNCHANGED
        status = transaction.update_and_land(new_content)
        return status
</code></pre>
<p>When testing the new <code>UpdateKVStore</code>, you only need to replace the <code>KVStoreClient</code> and the <code>Transaction</code>, both of which are (explicit or implicit) direct dependencies, with test doubles. This makes the code much easier and straightforward to test.</p>
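<p>To see the payoff, here is a rough sketch of what the test doubles could now look like; <code>old_content</code>, <code>new_content</code>, and <code>Status.SUCCESS</code> are hypothetical stand-ins for your actual types and values:</p>
<pre><code>from unittest.mock import MagicMock

mock_client = MagicMock(spec=KVStoreClient)
mock_transaction = MagicMock(spec=KVStoreClient.Transaction)
mock_client.new_transaction.return_value = mock_transaction

# Drive the one behavior under test: the content differs, so the update lands.
mock_transaction.get_content.return_value = old_content
mock_transaction.update_and_land.return_value = Status.SUCCESS

status = UpdateKVStore(mock_client).update_value(new_content)

assert status == Status.SUCCESS
mock_transaction.update_and_land.assert_called_once_with(new_content)
</code></pre>
<p>Only the two direct collaborators are doubled; nothing about mutation requests or their preparation leaks into the test.</p>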
<h2>Fat Parameters</h2>
<p>While the anti-pattern of 'fat parameters' does not follow directly from the letter of the Law of Demeter, it does follow from the spirit of passing in only the information that the class needs to perform its function. So, what are fat parameters? They are data objects that are passed in as arguments to a class and contain more information than what the class needs.</p>
<p>For instance, say you have a class <code>EmailDispatcher</code> whose method <code>setRecipient</code> only needs a customer name and email address. The method signature for <code>setRecipient</code> should only require the name and email, and not the entire <code>Customer</code> object that contains a whole lot more.</p>
<pre><code>@dataclass(frozen=True)
class Customer:
    ... # data class members.
    def getFullName(self):
        ...
    def getEmail(self):
        ...
    def getPhysicalAddress(self):
        ...
    def getPostalCode(self):
        ...
    def getCountry(self):
        ...
    def getState(self):
        ...
    def getCustomerId(self):
        ...
    # and so on.
    
class EmailDispatcher:
    ...
    def setRecipient(self, name: str, email: str):
        ...
    def setRecipientWithFatParameter(self, customer: Customer):
        ...
    def sendMessage(self, message: Message):
        ...
</code></pre>
<p>In the pseudocode above, the class <code>EmailDispatcher</code> has two methods, <code>setRecipient</code> and <code>setRecipientWithFatParameter</code>. The former uses only the information it needs, and the latter passes in the entire <code>Customer</code> object as a fat parameter.</p>
<p>The convenience of passing in the entire <code>Customer</code> object is straightforward. It gives you a simple method signature. It makes it easier for the method to evolve to use richer information about the customer without needing to change its API contract. It allows you to define a common <code>Dispatcher</code> interface with multiple <code>Dispatcher</code>s that use different properties of the <code>Customer</code> class.</p>
<p>However, when it comes to unit testing, such fat parameters present a problem. Consider how you would test the <code>EmailDispatcher</code>'s <code>setRecipientWithFatParameter</code> method. The tests will need to create fake <code>Customer</code> objects. So, your fake <code>Customers</code> might look like this:</p>
<pre><code>fakeCustomer = Customer(
    first_name="bob",
    last_name="marley", 
    email="bob@doobie.com", 
    address=Address(
        "420 High St.",
        "",
        "Mary Jane",
        "Ganga Nation",
        "7232"
    ),
    id=12345, 
    postal_code="7232", 
    ...
)
</code></pre>
<p>When someone reads this unit test, do they know what is relevant here? Does it matter that the second parameter of <code>address</code> is empty string? Should the last parameter of <code>address</code> match the value of <code>postal_code</code>? While we might be able to guess it in this case, it gets more confusing in cases where the fat parameter is encapsulating a much more complicated entity, such as a database table.</p>
<p>When refactoring or making changes to the <code>EmailDispatcher</code>, if the unit test fails, then figuring out why the test failed becomes a non-trivial exercise, and could end up slowing you down a lot more than you expected. All this just leads to high maintenance costs for tests, low readability [^2], poor DevX, and limited benefits.</p>
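<p>Contrast that with testing the thin <code>setRecipient</code> variant, where the test states exactly (and only) what matters; a minimal sketch reusing the names from above:</p>
<pre><code>dispatcher = EmailDispatcher()

# Every value here is relevant to the behavior under test; nothing
# begs the questions raised above.
dispatcher.setRecipient(name="Bob Marley", email="bob@doobie.com")
</code></pre>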
<p>[^1]: You can read about it <a href="https://wouterdekort.com/2012/03/27/unit-testing-hell-or-heaven/">here</a>, <a href="https://hermanradtke.com/2010/01/17/unit-testing-and-the-law-of-demeter.html/">here</a>, <a href="https://wiki.c2.com/?LawOfDemeterMakesUnitTestsEasier">here</a>, and <a href="https://testing.googleblog.com/2008/07/breaking-law-of-demeter-is-like-looking.html">here</a>, and really just search for "Law of Demeter" on the Internet</p>
<p>[^2]: For more details on why we should care about readability, see the section on Readability <a href="https://srikanth.sastry.name/dry-unit-tests-are-bad/">here</a>.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA['Privatize' your classes for better unit testing]]></title>
            <link>https://srikanth.sastry.name/privatize-your-classes-for-better-unit-testing/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/privatize-your-classes-for-better-unit-testing/</guid>
            <pubDate>Mon, 11 Jul 2022 00:00:00 GMT</pubDate>
<description><![CDATA[Your service may be massive, but its public API surface is pretty small; it has just a handful of APIs/endpoints. Everything else behind tho...]]></description>
<content:encoded><![CDATA[<p>Your service may be massive, but its public API surface is pretty small; it has just a handful of APIs/endpoints. Everything else behind those APIs is 'private' and an 'implementation detail'. It is highly advisable to follow this pattern even when designing the implementation of your service, almost like a fractal. This will pay dividends in the quality of your test suite.</p>
<p>For instance, your service implementation should be split into 'modules' where each module has a well-defined API through which other modules interact with it. This API boundary has to be strict. Avoid the temptation to break this abstraction because your module needs this 'one tiny bit' of information that is available inside the implementation of another module. You will regret breaking encapsulation, I guarantee it!</p>
<p>If you follow this pattern, you will eventually reach a class that has a public API, has all of its external/shared dependencies injected, and delegates a lot of its business logic and complex computation to multiple 'private' classes that are practically hermetic and have no external/shared dependencies. At this point, treat all these 'private' classes as, well, private. That is, DO NOT WRITE UNIT TESTS FOR SUCH CLASSES!</p>
<p>Yes, that statement seems to fly in the face of all things sane about software testing, but it is a sane statement, nonetheless. These private classes should be <a href="/garden/minimize-public-surface-for-testability/">tested indirectly</a> via unit tests for the public class that they serve/support. This will make your tests a lot more accurate. Let me explain.</p>
<p>Say, you have a public class <code>CallMe</code> and it uses a private class <code>HideMe</code>, and furthermore, <code>HideMe</code> is used only by <code>CallMe</code>, and the software design enforces this restriction. Assume that both <code>CallMe</code> and <code>HideMe</code> have their own unit tests, and the tests do an excellent job. At this point, there is a new requirement that necessitates that we refactor <code>CallMe</code>'s implementation, and as part of that refactoring, we need to modify the API contract between <code>CallMe</code> and <code>HideMe</code>. Since <code>HideMe</code>'s only  caller is <code>CallMe</code>, it is completely safe to treat this API contract as an implementation detail and modify it as we see fit. Since we are modifying the specification of <code>HideMe</code>, we have to change the tests for <code>HideMe</code> as well.</p>
<p>Now, you run the tests, and the tests for <code>HideMe</code> fail. What information does that give you? Does that mean that there is a bug in <code>HideMe</code>; or does it mean that we did not modify the tests correctly? You cannot determine this until you either manually inspect <code>HideMe</code>'s test code, or until you run the tests for <code>CallMe</code>. If <code>CallMe</code>'s tests fail, then (since this is a refactoring diff) there must be a bug in <code>HideMe</code> and/or <code>CallMe</code>, but if the tests don't fail, then it must be an issue in <code>HideMe</code>'s tests.</p>
<p>Thus, it turns out that a failure in <code>HideMe</code>'s tests gives you no additional information compared to a failure in <code>CallMe</code>'s tests. Tests for <code>HideMe</code> have zero benefits and a non-zero maintenance cost! In other words, testing <code>HideMe</code> directly is useless!</p>
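<p>To make this concrete, here is a minimal, self-contained sketch of the shape being described; all names and behavior here are hypothetical:</p>
<pre><code>import unittest

class _HideMe:
    """Private helper: hermetic, with no external/shared dependencies."""
    @staticmethod
    def shout(text: str) -&gt; str:
        return text.upper()

class CallMe:
    """Public API; delegates its business logic to the private class."""
    def greet(self, name: str) -&gt; str:
        return _HideMe.shout(f"hello, {name}")

class TestCallMe(unittest.TestCase):
    def test_greet(self) -&gt; None:
        # _HideMe is exercised indirectly through CallMe's public API;
        # it gets no unit tests of its own.
        self.assertEqual(CallMe().greet("ada"), "HELLO, ADA")
</code></pre>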
<p>By aggressively refactoring your code to push as much of your logic as possible into private classes, you are limiting the API surface of your software that needs direct testing, and simultaneously ensuring that your test suite is not too large and has very <a href="https://srikanth.sastry.name/unit-test-attributes-and-their-trade-offs/">high accuracy, with reasonable completeness</a>.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Tests should be isolated from each other; not coupled]]></title>
            <link>https://srikanth.sastry.name/tests-should-be-isolated-not-coupled/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/tests-should-be-isolated-not-coupled/</guid>
            <pubDate>Sun, 03 Jul 2022 00:00:00 GMT</pubDate>
<description><![CDATA[Almost by definition unit tests should be isolated from their (externa...]]></description>
<content:encoded><![CDATA[<p>Almost <a href="https://srikanth.sastry.name/defining-unit-tests-two-schools-of-thought/">by definition</a>, unit tests should be <em>isolated</em> from their (external, shared) dependencies. But, equally importantly, unit tests should also be isolated <em>from each other</em>. When one test starts to affect another test, the two tests are said to be <a href="/garden/coupled-tests/"><em>coupled</em></a>. Alternatively, if changes to one test <em>can</em> negatively impact the correctness of another test, then the two tests are said to be <em>coupled</em>.</p>
<p>Coupled tests are problematic in two ways.</p>
<ol>
<li><em>Tests become less readable.</em> Reading the code for a single unit test does not necessarily communicate what the test does. We also need to understand the 'coupling' between that test and other tests to grok what a single test does. This coupling can be subtle and not easy to follow.</li>
<li><em>Tests become less <a href="https://srikanth.sastry.name/unit-test-attributes-and-their-trade-offs/">accurate</a>.</em> When one test affects another, it becomes difficult to make changes to a single test in isolation. For instance, if a diff makes changes to some production and test code, and then a test fails, it is not always clear why the test failed. The failure could be due to a bug, or an artifact of the coupled tests. Thus, your tests are no longer trustworthy, and therefore, less accurate.</li>
</ol>
<p>Coupling can happen in many ways. The obvious ones include (1) using the same shared dependency (like when you use the same temp file name in all tests), and (2) relying on the post-condition of one test as a precondition of another test. Such cases are also easy to detect and fix. There are two more ways in which tests can be coupled; these are more subtle, and more prevalent.</p>
<ol>
<li>Precondition setting in test fixtures</li>
<li>Parameterized tests for heterogeneous tests</li>
</ol>
<p>The rest of this note is focused on the above two anti-patterns of test coupling.</p>
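<p>Before getting to those, here is a minimal, self-contained sketch of the first 'obvious' kind of coupling, a shared temp file (all names here are hypothetical):</p>
<pre><code>import unittest

SHARED_TMP = "/tmp/sut-scratch.txt"  # the same path reused by every test

class TestShared(unittest.TestCase):
    def test_writes(self) -&gt; None:
        with open(SHARED_TMP, "w") as f:
            f.write("state from test_writes")

    def test_reads(self) -&gt; None:
        # Passes only if test_writes has already run and left the file behind.
        with open(SHARED_TMP) as f:
            self.assertIn("state", f.read())
</code></pre>
<p>Whether <code>test_reads</code> passes depends entirely on whether <code>test_writes</code> happened to run first; that order dependence is the coupling.</p>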
<h2>Coupling through test fixtures</h2>
<p>Say, your SUT has a dependency called <code>Helper</code>, and initially, for the two tests in your unit tests for the SUT, you initialize your <code>Helper</code> stub with contents <code>valueA</code> and <code>valueB</code>. Since both tests share the same initial state, you include the initialization code in the <code>setUp</code> of the unit tests.</p>
<pre><code>class SUTTestCase(unittest.TestCase):
    def setUp(self):
        self.helper = StubHelper()
        self.helper.add_contents([valueA, valueB])
        self.sut = SUT(self.helper)
        
    def test_behavior1(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB]
    
    def test_behavior2(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB]
</code></pre>
<p>Next, you modify the SUT to add features to it. In order to test those features, the <code>Helper</code> stub needs to include <code>controllerA</code>. But this is useful only in the new tests being added. However, looking at the unit test you already have, it is easiest to simply add <code>controllerA</code> to <code>self.helper</code>. So, your unit tests look as follows:</p>
<pre><code>class SUTTestCase(unittest.TestCase):
    def setUp(self):
        self.helper = StubHelper()
        self.helper.add_contents([valueA, valueB])
        self.helper.add_controller(controllerA)
        self.sut = SUT(self.helper)
        
    def test_behavior1(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB]
             # But this test assumes nothing about self.helper's controller

    def test_behavior2(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB]
             # But this test assumes nothing about self.helper's controller

    def test_behavior3(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB], and controller=controllerA

    def test_behavior4(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB], and controller=controllerA
</code></pre>
<p>Then you discover a gap in testing that requires the initial state of the <code>Helper</code> stub to have just the content <code>valueA</code> and include <code>controllerA</code>. Now, when adding this new unit test to the suite, the simplest way would be to remove <code>valueB</code> from <code>self.helper</code> at the start of the new test. So, now, your test suite looks as follows:</p>
<pre><code>class SUTTestCase(unittest.TestCase):
    def setUp(self):
        self.helper = StubHelper()
        self.helper.add_contents([valueA, valueB])
        self.helper.add_controller(controllerA)
        self.sut = SUT(self.helper)
        
    def test_behavior1(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB]
             # But this test assumes nothing about self.helper's controller

    def test_behavior2(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB]
             # But this test assumes nothing about self.helper's controller

    def test_behavior3(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB], and controller=controllerA

    def test_behavior4(self) -&gt; None:
        ...  # Assumes self.helper set with contents=[valueA, valueB], and controller=controllerA

    def test_behavior5(self) -&gt; None:
        # Assumes self.helper set with contents=[valueA, valueB] (because of other tests' setup)
        self.helper.remove_content(valueB)
        # Now assumes self.helper set with contents=[valueA]
        ...  
</code></pre>
<p>Let's pause here and inspect the state of the unit tests. The tests are coupled. Why? Because modifying one test ends up affecting other tests. In the example above, if we replace <code>self.helper.add_contents([valueA, valueB])</code> with <code>self.helper.add_contents([valueA])</code> for tests <code>test_behavior1</code> and <code>test_behavior2</code>, it will result in a failure in <code>test_behavior5</code> because <code>self.helper.remove_content(valueB)</code> will end up throwing an error!</p>
<p>Furthermore, for anyone reading these tests, it is not entirely clear that <code>test_behavior1</code>, and <code>test_behavior2</code> need <code>self.helper</code> to be initialized with values <code>[valueA, valueB]</code>, but do not need for <code>controllerA</code> in <code>self.helper</code>. The preconditions for <code>test_behavior1</code> and <code>test_behavior2</code> are coupled with the preconditions for <code>test_behavior3</code>.</p>
<p>It also results in test incompleteness: if we introduce a bug that causes <code>behavior1</code> to fail when <code>controllerA</code> is not set on the helper, we will not catch that bug, because the test for <code>behavior1</code> runs with <code>self.helper.add_controller(controllerA)</code> already applied.</p>
<h3>How to decouple such tests?</h3>
<p>Use the <code>setUp</code> method to simply set up your dependencies, but not to enforce any precondition. Instead, make setting preconditions part of the <em>arrange</em> step of each unit test. You can even encapsulate the precondition setting into a function (with the right parameters) so that the <em>arrange</em> section does not get too bloated, and yet the test code is readable. Consider the following refactoring of the tests:</p>
<pre><code>class SUTTestCase(unittest.TestCase):
    def setUp(self):
        self.helper: Optional[StubHelper] = None
        self.sut: Optional[SUT] = None

    def prepare_helper(self, contents: List[Value], controller: Optional[Controller] = None) -&gt; None:
        self.helper = StubHelper()
        self.helper.add_contents(contents)
        if controller:
            self.helper.add_controller(controller)
        # Build the SUT against the freshly prepared helper; constructing it
        # in setUp would hand it a stale (None) helper.
        self.sut = SUT(self.helper)
        
    def test_behavior1(self) -&gt; None:
        # Assumes self.helper is a fresh object.
        self.prepare_helper(contents=[valueA, valueB])
        ...

    def test_behavior2(self) -&gt; None:
        # Assumes self.helper is a fresh object.
        self.prepare_helper(contents=[valueA, valueB])
        ...    

    def test_behavior3(self) -&gt; None:
        # Assumes self.helper is a fresh object.
        self.prepare_helper(contents=[valueA, valueB], controller=controllerA)
        ...

    def test_behavior4(self) -&gt; None:
        # Assumes self.helper is a fresh object.
        self.prepare_helper(contents=[valueA, valueB], controller=controllerA)
        ...

    def test_behavior5(self) -&gt; None:
        # Assumes self.helper is a fresh object.
        self.prepare_helper(contents=[valueA], controller=controllerA)
        ...
</code></pre>
<h2>Coupling in parameterized tests</h2>
<p><a href="https://dl.acm.org/doi/10.1145/1095430.1081749">Parameterized tests</a> are a collection of tests that run the same verification, but with different inputs. While this is a very useful feature (available in almost all unit test frameworks), it is also very easy to abuse. A few common ways I have seen it abused are in conjunction with <a href="https://srikanth.sastry.name/dry-unit-tests-are-bad/">DRYing</a> and the use of 'if' checks, which often results in coupling all the tests denoted by the parameterized list. Consider the following illustration:</p>
<pre><code>class TestOutput(typing.NamedTuple):
    status: StatusEnum
    return_value: typing.Optional[int]
    exception: typing.Optional[Exception]
    ...

class TestSequence(unittest.TestCase):
  
    @parameterized.expand([
        [test_input1, expected_output1],
        [test_input2, expected_output2],
        ...
    ])
    def test_something(self, test_input: str, expected_output: TestOutput) -&gt; None:
        self._run_test(test_input, expected_output)
    
    def _run_test(self, test_input: str, expected_output: TestOutput) -&gt; None:
        sut = SUT(...)
        prepare_sut_for_tests(sut, test_input)
        output = sut.do_something(test_input)
        test_output = make_test_output(output, sut)
        self.assertEquals(expected_output, test_output)

</code></pre>
<p>The above illustration tests the method <code>do_something</code> for various possible inputs. However, note that the outputs (as illustrated by the class <code>TestOutput</code>) can have a <code>status</code>, a <code>return_value</code>, or an <code>exception</code>. This means that every instantiation (for each parameter) has to contend with the possibility of different types of outputs, even though any single test should only have to verify against a single type of output. This couples all the tests verifying <code>do_something</code>, thus making them difficult to read and understand. Adding a new test case here becomes tricky because any change to either <code>prepare_sut_for_tests</code> or <code>make_test_output</code> now affects all the tests!</p>
<h3>How to decouple parameterized tests?</h3>
<p>There are some fairly straightforward ways to decouple such tests. First, we should be very deliberate about how we organize these tests. For example, we can group all positive tests and group all negative tests separately; similarly, we can further subgroup the tests based on the type of assertions on the output. In the above example, we can have three subgroups: positive tests that verify only the output status, positive tests that verify the return value, and negative tests that verify the exception. Thus you now have three parameterized tests that look something like this:</p>
<pre><code>class TestDoSomething(unittest.TestCase):
  
    @parameterized.expand([
        [test_status_input1, expected_status_output1],
        [test_status_input2, expected_status_output2],
        ...
    ])
    def test_something_status_only(
        self, 
        test_input: str, 
        expected_output: StatusEnum
    ) -&gt; None:
        # Arrange
        sut = SUT(...)
        ...  # More 'arrange' code
        
        # Act
        output = sut.do_something(test_input)
        output_status = output.status
        
        # Assert
        self.assertEquals(expected_output, output_status)
        
    @parameterized.expand([
        [test_return_value_input1, expected_return_value_output1],
        [test_return_value_input2, expected_return_value_output2],
        ...
    ])
    def test_something_return_value_only(
        self, 
        test_input: str, 
        expected_output: int
    ) -&gt; None:
        # Arrange
        sut = SUT(...)
        ...  # More 'arrange' code
        
        # Act
        output = sut.do_something(test_input)
        output_status = output.status
        output_value = output.value
        
        # Assert
        self.assertEquals(StatusEnum.SUCCESS, output_status)
        self.assertEquals(expected_output, output_value)

    @parameterized.expand([
        [test_return_value_input1, expected_error_code_output1],
        [test_return_value_input2, expected_error_code_output2],
        ...
    ])
    def test_something_throws_exception(
        self,
        test_input: str,
        expected_error_code: int
    ) -&gt; None:
        # Arrange
        sut = SUT(...)
        ...  # More 'arrange' code
        
        # Act
        with self.assertRaises(SomeSUTException) as exception_context:
            sut.do_something(test_input)
        exception = exception_context.exception
        
        # Assert
        self.assertEquals(expected_error_code, exception.error_code)
</code></pre>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[In unit tests, I favor Detroit over London]]></title>
            <link>https://srikanth.sastry.name/in-unit-tests-favor-detroit-over-london/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/in-unit-tests-favor-detroit-over-london/</guid>
            <pubDate>Sun, 26 Jun 2022 00:00:00 GMT</pubDate>
<description><![CDATA[Recall the two schools of thought around unit tests: Detroit, and Lond...]]></description>
<content:encoded><![CDATA[<p><a href="https://srikanth.sastry.name/defining-unit-tests-two-schools-of-thought/">Recall</a> the two schools of thought around unit tests: Detroit and London. Briefly, the Detroit school considers a 'unit' of software to be tested as a 'behavior' that consists of one or more classes, and unit tests replace only shared and/or external dependencies with test doubles. In contrast, the London school considers a 'unit' to be a single class, and replaces all dependencies with test doubles.</p>
<table>
<thead>
<tr>
<th>School</th>
<th>Unit</th>
<th>Isolation</th>
<th>Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td>Detroit</td>
<td>Behavior</td>
<td>Replace shared and external dependencies with test doubles</td>
<td>'fast'</td>
</tr>
<tr>
<td>London</td>
<td>Class</td>
<td>Replace all dependencies (internal, external, shared, etc.) with test doubles</td>
<td>'fast'</td>
</tr>
</tbody>
</table>
<p>See this <a href="https://srikanth.sastry.name/defining-unit-tests-two-schools-of-thought/">note</a> for a more detailed discussion on the two schools.</p>
<p>Each school has its proponents and its advantages. I, personally, prefer the <a href="/garden/detroit-vs-london-testing/">Detroit school</a> over the London school. I have noticed that following the Detroit school has made my test suite more <a href="https://srikanth.sastry.name/unit-test-attributes-and-their-trade-offs/">accurate and complete</a>.</p>
<h2>Improved Accuracy (when refactoring)</h2>
<p>In <a href="https://srikanth.sastry.name/unit-test-attributes-and-their-trade-offs/">the post on attributes of a unit test suite</a>, I defined <em>accuracy</em> as the measure of how likely it is that a test failure denotes a bug in your diff. I have noticed that unit test suites that follow the Detroit school tend to have high accuracy when your codebase has a lot of classes that are public <em>de jure</em>, but private <em>de facto</em>.</p>
<p>Codebases I have worked in typically have hundreds of classes, but only a handful of those classes are actually referenced by external classes/services. Most of the classes are part of a private API that is internal to the service. Let's take a concrete illustration. Say, there is a class <code>Util</code> that is used only by classes <code>Feature1</code> and <code>Feature2</code> within the codebase, and has no other callers; in fact, <code>Util</code> exists only to help classes <code>Feature1</code> and <code>Feature2</code> implement their respective user journeys. Here, although <code>Util</code> is a class with public methods, in reality <code>Util</code> represents the common implementation details for <code>Feature1</code> and <code>Feature2</code>.</p>
<h3>In London</h3>
<p>According to the London school, all unit tests for <code>Feature1</code> and <code>Feature2</code> should replace <code>Util</code> with a test double. Thus, tests for <code>Feature1</code> and <code>Feature2</code> look as follows.
<img src="/assets/images/London-School-Accuracy-Before.png" alt="" /></p>
<p>Now, say we want to do some refactoring that spans <code>Feature1</code>, <code>Feature2</code>, and <code>Util</code>. Since <code>Util</code> really has a private API with <code>Feature1</code> and <code>Feature2</code>, we can change the API of <code>Util</code> in concert with <code>Feature1</code> and <code>Feature2</code> in a single diff. Now, since the tests for <code>Feature1</code> and <code>Feature2</code> use test doubles for <code>Util</code>, and we have changed <code>Util</code>'s API, we need to change the test doubles' implementation to match the new API. After making these changes, say, the tests for <code>Util</code> pass, but the tests for <code>Feature1</code> fail.</p>
<p><img src="/assets/images/London-School-Accuracy-After.png" alt="" /></p>
<p>Now, does the test failure denote a bug in our refactoring, or does it denote an error in how we modified the tests? This is not easy to determine except by stepping through the tests manually. Thus, the test suite does not have high accuracy.</p>
<h3>In Detroit</h3>
<p>In contrast, according to the Detroit school, the unit tests for <code>Feature1</code> and <code>Feature2</code> can use <code>Util</code> as such (without test doubles). The tests for <code>Feature1</code> and <code>Feature2</code> look as follows.</p>
<p><img src="/assets/images/Detroit-School-Accuracy-Before.png" alt="" /></p>
<p>If we do the same refactoring across <code>Feature1</code>, <code>Feature2</code>, and <code>Util</code> classes, note that we do not need to make any changes to the tests for <code>Feature1</code> and <code>Feature2</code>. If the tests fail, then we have a very high signal that the refactoring has a bug in it; this makes for a high accuracy test suite!</p>
<p><img src="/assets/images/Detroit-School-Accuracy-After.png" alt="" /></p>
<p>Furthermore, since <code>Util</code> exists only to serve <code>Feature1</code> and <code>Feature2</code>, you can argue that <code>Util</code> doesn't even need any unit tests of its own; the tests for <code>Feature1</code> and <code>Feature2</code> cover the spread!</p>
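<p>As a minimal, self-contained sketch of what a Detroit-style test for this shape can look like (the names and behavior here are hypothetical, not from any real codebase):</p>
<pre><code>import unittest
from unittest.mock import MagicMock

class Util:
    """Internal implementation detail shared by Feature1 and Feature2."""
    def normalize(self, addr: str) -&gt; str:
        return addr.strip().lower()

class Mailer:
    """A shared/external dependency; the only thing that gets doubled."""
    def send(self, addr: str) -&gt; None: ...

class Feature1:
    def __init__(self, util: Util, mailer: Mailer) -&gt; None:
        self.util, self.mailer = util, mailer

    def invite(self, addr: str) -&gt; str:
        addr = self.util.normalize(addr)  # real Util, not a double
        self.mailer.send(addr)            # external dependency, doubled
        return addr

class TestFeature1(unittest.TestCase):
    def test_invite_normalizes_and_sends(self) -&gt; None:
        mailer = MagicMock(spec=Mailer)
        feature = Feature1(Util(), mailer)  # Detroit: use the real Util
        self.assertEqual(feature.invite("  Bob@X.COM "), "bob@x.com")
        mailer.send.assert_called_once_with("bob@x.com")
</code></pre>
<p>A refactoring that changes <code>Util</code>'s API in concert with <code>Feature1</code> leaves this test untouched; if the test fails, the refactoring itself is broken.</p>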
<h2>Improved Completeness (around regressions)</h2>
<p>In <a href="https://srikanth.sastry.name/unit-test-attributes-and-their-trade-offs/">the post on attributes of a unit test suite</a>, I defined <em>completeness</em> as the measure of how likely it is that a bug introduced by your diff is caught by your test suite. I have seen unit tests following the Detroit school catch bugs/regressions more easily, especially when the bugs are introduced by API contract violations.</p>
<p>It is easier to see this with an example. Say, there is a class <code>Outer</code> that uses a class <code>Inner</code>, and <code>Inner</code> is an internal, non-shared dependency. Let's say that the class <code>Outer</code> depends, for its correctness, on a specific contract (call it alpha) that <code>Inner</code>'s API satisfies. Recalling that we practically trade off between the speed of a test suite and its completeness, let us posit that the incompleteness here is that we do not have a test for <code>Inner</code> satisfying contract alpha.</p>
<h3>In London</h3>
<p>Following the London school, the tests for <code>Outer</code> replace the instance of <code>Inner</code> with a test double, and since the test double is a replacement for <code>Inner</code>, it also satisfies contract alpha. See the illustration below for clarity.</p>
<p><img src="/assets/images/London-School-Completeness-Before.png" alt="" /></p>
<p>Now, let's assume that we have a diff that 'refactors' <code>Inner</code>, but in that process, it introduces a bug that violates contract alpha. Since we have assumed an incompleteness in our test suite around contract alpha, the unit test for <code>Inner</code> does not catch this regression. Also, since the tests for <code>Outer</code> use a test double for <code>Inner</code> (which satisfies contract alpha), those tests do not detect this regression either.</p>
<p><img src="/assets/images/London-School-Completeness-After.png" alt="" /></p>
<h3>In Detroit</h3>
<p>If we were to follow the Detroit school instead, then the unit tests for <code>Outer</code> instantiate and use <code>Inner</code> when testing the correctness of <code>Outer</code>, as shown below. Note that the test incompleteness w.r.t. contract alpha still exists.
<img src="/assets/images/Detroit-School-Completeness-Before.png" alt="" /></p>
<p>Here, like before, assume that we have a diff that 'refactors' <code>Inner</code> and breaks contract alpha. This time around, although the test suite for <code>Inner</code> does not catch the regression, the test suite for <code>Outer</code> will. Why? Because the correctness of <code>Outer</code> depends on <code>Inner</code> satisfying contract alpha. When that contract is violated, <code>Outer</code> fails to satisfy correctness, and therefore its unit tests fail.</p>
<p><img src="/assets/images/Detroit-School-Completeness-After.png" alt="" /></p>
<p>In effect, even though we did not have an explicit test for contract alpha, the unit tests written according to the Detroit school tend to have better completeness than the ones written following the London school.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Defining unit tests: two schools of thought]]></title>
            <link>https://srikanth.sastry.name/defining-unit-tests-two-schools-of-thought/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/defining-unit-tests-two-schools-of-thought/</guid>
            <pubDate>Sat, 18 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[There are several definitions for unit tests. Roy Osherove defines it as "piece of code that invokes a unit of work in the system and then c...]]></description>
            <content:encoded><![CDATA[<h2>Definitions: What is a unit test?</h2>
<p>There are several definitions for unit tests. <a href="https://www.artofunittesting.com/definition-of-a-unit-test">Roy Osherove</a> defines it as a "piece of code that invokes a unit of work in the system and then checks a single assumption about the behavior of that unit of work"; Kent Beck turns the idea of defining unit tests on its head by <a href="https://tidyfirst.substack.com/p/desirable-unit-tests">simply stating a list of properties</a>, and any code that satisfies those properties is a "unit test".</p>
<p>I like Vladimir Khorikov's definition of a unit test in his book <a href="https://www.manning.com/books/unit-testing">Unit Testing Principles, Practices, and Patterns</a>. According to him, a unit test is a piece of code that (1) verifies a unit of software, (2) in isolation, and (3) quickly. This definition only balkanizes a <em>unit test</em> into three undefined terms: (1) unit of software, (2) isolation, and (3) speed. Of the three, the third is the easiest to understand intuitively. Being <em>fast</em> simply means that you can run the test and get the results quickly enough to enable interactive iteration on the unit of software you are changing. However, the other two terms, <em>unit of software</em> and <em>isolation</em>, merit more discussion.</p>
<h2>Are you from Detroit, or London?</h2>
<p>In fact, there are two schools of thought around how the above two terms should be defined: the 'original/classic/Detroit' school, and the 'mockist/London' school. Not surprisingly, the school of thought you subscribe to has a significant impact on how you write unit tests. For a more detailed treatment of the two schools of thought, I suggest Martin Fowler's <a href="https://martinfowler.com/articles/mocksArentStubs.html#ClassicalAndMockistTesting">excellent article on the subject of mocks and stubs</a>. Chapter 2 of Khorikov's book <a href="https://www.manning.com/books/unit-testing">Unit Testing Principles, Practices, and Patterns</a> also has some good insights. I have distilled their contents as they pertain to unit test definitions.</p>
<h3><a href="/garden/detroit-school-testing/">The Detroit School</a></h3>
<p>The Classical or Detroit school of thought originated with Kent Beck's "<a href="https://www.oreilly.com/library/view/test-driven-development/0321146530/">Test Driven Development</a>".</p>
<p><strong>Unit of software.</strong> According to this school, the unit of software to test is a "behavior". This behavior could be implemented in a single class or a collection of classes. The important property here is that the code that comprises the unit must be (1) internal to the software, (2) connected with each other in the dependency tree, and (3) not shared by any other part of the software.</p>
<p>Thus, a unit of software cannot include external entities such as databases, log servers, file systems, etc. It also cannot include external (but local) facilities such as system time and timers. Importantly, it is <em>ok</em> to include a class that depends on another class via a private, non-shared dependency.</p>
<p><strong>Isolation.</strong> Given the above notion of a "unit" of software, isolation simply means that the test is not dependent on anything outside that unit of software. In practical terms, it means that a unit test needs to replace all external and shared dependencies with <a href="https://srikanth.sastry.name/mocks-stubs-andhow-to-use-them/">test doubles</a>.</p>
<h3><a href="/garden/london-school-testing/">The London School</a></h3>
<p>The mockist or London school of thought was popularized by <a href="https://www.linkedin.com/in/stevefreeman">Steve Freeman</a> (<a href="https://twitter.com/sf105">twitter</a>) and <a href="http://www.natpryce.com/bio.html">Nat Pryce</a> in their book "<a href="http://growing-object-oriented-software.com/">Growing Object-Oriented Software, Guided by Tests</a>".</p>
<p><strong>Unit of Software.</strong> Given the heavy bias toward object-oriented software, unsurprisingly, the unit of software for a unit test is a single class (in some cases, a single method). This is strictly so. Any other class that the 'class under test' depends on cannot be part of the unit being tested.</p>
<p><strong>Isolation.</strong> What follows from the above notion of a "unit" is that <em>everything</em> that is not the class under test must be replaced by test doubles. If you are instantiating another class inside the class under test, then you must replace that instantiation with an injected instance or a factory that can be replaced with a test double in the tests.</p>
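<p>As a hedged sketch of the difference (the classes here are invented for illustration): consider a <code>Greeter</code> that uses an internal, non-shared helper <code>Formatter</code>. A Detroit-style test uses the real helper; a London-style test replaces it with a test double.</p>
<pre><code>import unittest
from unittest.mock import MagicMock

class Formatter:
    # An internal, non-shared helper (hypothetical).
    def shout(self, s: str) -&gt; str:
        return s.upper() + "!"

class Greeter:
    # The class under test (hypothetical).
    def __init__(self, formatter: Formatter) -&gt; None:
        self.formatter = formatter

    def greet(self, name: str) -&gt; str:
        return self.formatter.shout(f"hello {name}")

class DetroitStyleTest(unittest.TestCase):
    def test_greet(self) -&gt; None:
        # Formatter is internal and non-shared, so the real one is used.
        self.assertEqual("HELLO ADA!", Greeter(Formatter()).greet("ada"))

class LondonStyleTest(unittest.TestCase):
    def test_greet(self) -&gt; None:
        # Everything that is not the class under test is replaced.
        stub_formatter = MagicMock(spec=Formatter)
        stub_formatter.shout.return_value = "HELLO ADA!"
        self.assertEqual("HELLO ADA!", Greeter(stub_formatter).greet("ada"))
</code></pre>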
<p>Here is a quick summary of the definitions of a unit test under the two schools.</p>
<table>
<thead>
<tr>
<th>School</th>
<th>Unit</th>
<th>Isolation</th>
<th>Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td>Detroit</td>
<td>Behavior</td>
<td>Replace shared and external dependencies with test doubles</td>
<td>'fast'</td>
</tr>
<tr>
<td>London</td>
<td>Class</td>
<td>Replace all dependencies (internal, external, shared, etc.) with test doubles</td>
<td>'fast'</td>
</tr>
</tbody>
</table>
<h2>What does this mean?</h2>
<p>The school of thought you subscribe to can have a significant impact on your software design and testing. There is nothing I can say here that hasn't already been explained by Martin Fowler in his article "<a href="https://martinfowler.com/articles/mocksArentStubs.html">Mocks aren't stubs</a>". So, I highly recommend you read it for yourself.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Primary attributes of unit test suites and their tradeoffs]]></title>
            <link>https://srikanth.sastry.name/unit-test-attributes-and-their-trade-offs/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/unit-test-attributes-and-their-trade-offs/</guid>
            <pubDate>Mon, 13 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Unit test suites have <a href="/garden/unit-test-attribute-tradeoffs/">three primary attributes</a>. accuracy, completeness, and speed. Accu...]]></description>
<content:encoded><![CDATA[<p>Unit test suites have <a href="/garden/unit-test-attribute-tradeoffs/">three primary attributes</a>.</p>
<ol>
<li>accuracy,</li>
<li>completeness, and</li>
<li>speed.</li>
</ol>
<p><em>Accuracy</em> says that if a test fails, then there is a bug. <em>Completeness</em> says that if there is a bug, then a unit test will fail. <em>Speed</em> says that tests will run 'fast'. These three attributes are in opposition with each other, and you can only satisfy any two of the three attributes!</p>
<p>Before discussing these attributes, it is important to note that they are not properties of a test suite at rest, but rather of the test suite during changes. That is, these attributes are measured only when you are making changes to the code and running the test suite in response to those changes. Also, these attributes do not apply to a single unit test; they apply to the test suite as a whole. Furthermore, the quality of your test suite is determined by how well the suite measures up along these attributes.</p>
<h3>Attributes' descriptions</h3>
<p>Let's describe each of these attributes, and then we can see how any unit test suite is forced to trade them off against each other.</p>
<ol>
<li><em>Accuracy.</em> It is a measure of robustness of the test suite to changes in the production code. If you make a change to the production code <em>without changing your unit tests</em>, and your test suite has a failure, then how likely is it that your changes introduced a bug? Accuracy is a measure of this likelihood. High quality unit tests typically have very good accuracy. If your test suite has poor accuracy, then it suggests that your tests are brittle, that they are testing implementation details instead of functionality, or that your production code is poorly designed with leaky abstractions. Inaccurate tests reduce your ability to detect regressions. They fail to provide early warning when a diff breaks existing functionality (because the developer cannot be sure that a test failure is a genuine bug, and not an artifact of test brittleness). As a result, developers are more likely to ignore test failures, or modify the tests to make them 'pass', and thus introduce bugs in their code. (See the sketch after this list for what brittle vs. accurate assertions look like.)</li>
<li><em>Completeness</em>. This is a measure of how comprehensive the test suite really is. If you make a change to the production code <em>without changing your unit tests</em>, and you introduce a bug in <em>an existing functionality</em>, then how likely is it that your test suite will fail? Completeness is a measure of this likelihood. A lot of the test coverage metrics try to mimic the completeness of your test suite. However, <a href="https://srikanth.sastry.name/do-not-index-on-test-coverage/">we have seen how coverage metrics are often a poor proxy for completeness</a>.</li>
<li><em>Speed</em>. This is simply a measure of how quickly a test suite runs. If tests are hermetic with the right use of <a href="https://srikanth.sastry.name/mocks-stubs-andhow-to-use-them/">test doubles</a>, then each test runs pretty quickly. However, if the tests are of poor quality or the test suite is very large, then they can get pretty slow. It is most noticeable when you are iterating on a feature, and with each small change, you need to run a test suite that seems to take forever to complete. Slow tests can have a disproportionate impact on developer velocity. They make developers less likely to run tests eagerly, they increase the time between iterations, and they increase the CI/CD latency to where the gap between your code landing and the changes making it to prod can be unreasonably large. If this gets bad enough, it will discourage developers from running tests as needed, and thus allow bugs to creep in.</li>
</ol>
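<p>To illustrate the accuracy point, here is a hypothetical sketch (none of these classes come from the posts referenced here); the difference between a brittle test and an accurate one often comes down to what is asserted.</p>
<pre><code>import unittest
from unittest.mock import MagicMock

class PriceService:
    # Hypothetical SUT: computes an order total from a price store.
    def __init__(self, store) -&gt; None:
        self.store = store

    def total_cents(self, item: str, qty: int) -&gt; int:
        return self.store.price_cents(item) * qty

class BrittleTest(unittest.TestCase):
    def test_total(self) -&gt; None:
        store = MagicMock()
        store.price_cents.return_value = 250
        PriceService(store).total_cents("tea", 4)
        # Brittle: pins the implementation. A refactor that caches or
        # batches price lookups breaks this assertion without breaking
        # any behavior.
        store.price_cents.assert_called_once_with("tea")

class AccurateTest(unittest.TestCase):
    def test_total(self) -&gt; None:
        store = MagicMock()
        store.price_cents.return_value = 250
        # Accurate: asserts only the observable behavior.
        self.assertEqual(1000, PriceService(store).total_cents("tea", 4))
</code></pre>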
<h3>Attribute constraints and trade offs</h3>
<p>There is a tension among these attributes, and that tension determines how they contribute to the overall quality of a unit test suite.</p>
<p>Among accuracy, completeness, and speed, you cannot maximize all three; that is, you cannot have a <em>fast</em> test suite that will fail if <em>and only if</em> there is a bug. Maximizing any two will minimize the third.</p>
<ul>
<li>A perfect test suite with high accuracy and completeness will inevitably be huge, and thus very slow.</li>
<li>A fast test suite with high accuracy will often test only the most common user journeys, and thus be incomplete.</li>
<li>A test suite with very high coverage is often made 'fast' through extensive use of test doubles and ends up coupling tests with the implementation details, which makes the tests brittle, and therefore inaccurate.</li>
</ul>
<h3>What's the right trade off?</h3>
<p><img src="/assets/images/balance-scale.jpg" alt="Image not found: /assets/images/balance-scale.jpg" title="Image not found: /assets/images/balance-scale.jpg" /></p>
<p>A natural follow-up to the trade offs among accuracy, completeness, and speed is <em>"What is the right trade off?"</em>. It helps to notice that, empirically, we are always making this trade off and naturally settling on some point in the trade-off surface. What is the natural resting point for these trade offs? Let's examine a few things to help us answer the above question.</p>
<ol>
<li>From experience, we know that bugs in software are inevitable, and we have learned to deal with them. While bug-free code might be the ideal, no one reasonably expects bug-free software, and we accept some level of incorrectness in our implementations.</li>
<li>Flaky/brittle tests can have very significant negative consequences. Such tests are inherently untrustworthy, and therefore, serve no useful purpose. In the end, we tend to ignore such tests, and for all practical purposes they just don't exist in our test suite.</li>
<li>While extremely slow tests are an issue, we have figured out ways to improve test speeds through infrastructure developments. For instance, our CI/CD systems can run multiple tests in the test suite in parallel, so we are delayed only by the slowest tests in the suite; we have figured out how to prune the affected tests in a diff by being smart about the build and test targets affected by the changes, so we need not run the entire test suite for a small change; and the machines that execute tests have just gotten faster, alleviating some of the latency issues.</li>
</ol>
<p>From the above three observations, we can reasonably conclude that we <a href="/garden/never-sacrifice-test-accuracy/">cannot sacrifice accuracy</a>. Accurate tests are the bedrock of trustworthy (and therefore, useful) test suites. Once we maximize accuracy, that leaves us with completeness and speed. Here there is a sliding scale between completeness and speed, and we could potentially rest anywhere on this scale.</p>
<p>So, is it ok to rest anywhere on the trade-off spectrum between completeness and speed? Not quite. If you dial completeness all the way up and ignore speed, then you end up with a test suite that no one wants to run, and is therefore not useful at all. On the other hand, if you ignore completeness in favor of speed, then you are likely going to see a lot of regressions in your software and completely undermine consumer confidence in your product/service. In effect, <strong>the quality of your test suite is determined by the lowest score among the three attributes.</strong> Therefore, it is important to rest between completeness and speed depending on your tolerance for errors and the minimum developer velocity you can sustain. For instance, if you are developing software for medical imaging, then your tolerance for errors is very low, and so you should favor completeness at the expense of speed (and this is evident in how long it takes to make changes to software in the medical sciences). On the other hand, if you are building a web service that can be rolled back to a safe state quickly and with minimal external damage, then you probably want to favor speed over completeness (but only to a point; remember that your test quality is now determined by the completeness, or the lack thereof).</p>
<p>Thus, in conclusion, always maximize accuracy, and trade off between completeness and speed, depending on your tolerance of failures in production.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[The big WHY about unit tests]]></title>
            <link>https://srikanth.sastry.name/the-big-why-about-unit-tests/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/the-big-why-about-unit-tests/</guid>
            <pubDate>Mon, 06 Jun 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[When you ask "why do we write need unit tests?", you will get several answers including To find common bugs in your code [As protection agai...]]></description>
<content:encoded><![CDATA[<p>When you ask "why do we need unit tests?", you will get several answers, including</p>
<ul>
<li>To find common bugs in your code</li>
<li><a href="https://srikanth.sastry.name/the-merits-of-unit-tests-part-2/">As protection against regression</a></li>
<li><a href="https://srikanth.sastry.name/merits-of-unit-tests-part-1/">To act as a de facto documentation of your code</a></li>
<li><a href="https://srikanth.sastry.name/the-merits-of-unit-tests-part-3/">To help improve software design</a></li>
<li><a href="https://srikanth.sastry.name/unit-tests-ftw-part-4/">To help debug issues in production</a></li>
<li><a href="https://srikanth.sastry.name/merits-of-unit-tests-part-5/">Improve your APIs' usability</a></li>
<li>etc.</li>
</ul>
<p>These seem like a collection of very good reasons, but it seems inelegant that the common phenomenon of unit testing should have such disparate causes.
There must be a 'higher' cause for writing unit tests. I argue that this cause is "maintainability".</p>
<h3>Maintainability</h3>
<p><img src="/assets/images/website-wrench-cog.png" alt="Maintainable software" />
Here is a potentially provocative statement: "The final cause of unit tests is <a href="/garden/tests-for-maintainability/">software maintainability</a>".
To put it differently, if your software were immutable and could not be altered in any way, then that software would not need any unit tests.</p>
<p>Given that almost all software is mutable, unit tests exist to ensure that we can mutate the software to improve upon its utility in a sustainable manner. All the aforementioned answers to the question "why do we write unit tests" are ultimately subsumed by the cause of maintainability.</p>
<ul>
<li>Unit tests help you find bugs in your code, thus allowing safe mutations that add functionality.</li>
<li>Unit tests protect against regression, especially when refactoring, thus allowing safe mutation of the software in preparation for functional changes.</li>
<li>Unit tests act as de facto documentation. It allows developers who change the code to communicate across time and space on how best to use existing code for mutating other code.</li>
<li>Unit tests help improve software design. If some code/class is difficult to unit test, then the software design is poor. So, you iterate until unit testing becomes easier.</li>
<li>Unit tests help improve the usability of your API. Unit tests are the first customers of your API. If unit tests using your API are inelegant, then you iterate towards more usable APIs. A more usable API is often a more used API, and thus aids software evolution.</li>
</ul>
<p>Interestingly, looking at maintainability as the primary motivation for unit tests allows us to look at some aspects of unit tests differently.</p>
<h3>Looking at unit tests differently</h3>
<h4>Unit tests incur a maintenance cost.</h4>
<p><img src="/assets/images/calculator-sheet.png" alt="" /></p>
<p>If code incurs a maintenance cost, and unit tests help reduce that cost, then you can naturally ask the following: <em>since unit tests are also code, do they not incur a maintenance cost?</em></p>
<p>Obviously, the answer to the question above is an unequivocal "yes!". Thus, unit tests are only useful if the cost of maintaining them DOES NOT EXCEED the savings they provide as a buttress for the production code. This observation has significant implications for how to design and write unit tests. For instance, unit tests must be simple, straight-line code that is human readable, even at the expense of performance and redundancy. See the post on <a href="https://srikanth.sastry.name/dry-unit-tests-are-bad/">DRY unit tests</a> for a more detailed treatment of this topic.</p>
<h4>Unit tests can have diminishing returns.</h4>
<p><img src="/assets/images/down-graph-arrow.png" alt="" /></p>
<p>If unit tests incur a maintenance cost, then their utility is the difference between the maintainability they provide and the cost they incur. Since software is a living/evolving entity, this utility changes over time. Consequently, if you are not careful with your tests, they could become the proverbial albatross around your neck.
It is therefore important to tend to your unit test suite and pay attention when the utility of a test starts to diminish. Importantly, refactor your tests to ensure that you do not hit the point of diminishing, or even negative, returns on your unit tests.</p>
<h4>Unit tests should be cognitively simple.</h4>
<p><img src="/assets/images/simple-chair-wall-painting-white.png" alt="" /></p>
<p>An almost necessary way to reduce the maintenance cost of a unit test is to make it very simple to read and understand. This helps with maintenance in two ways. First, it makes it easy to understand the intent of the test, and the coverage that the test provides. Second, it makes it easy to modify the test (if needed) without having to worry about unintended consequences such modifications might have; a degenerate case is that of tests that have hit the point of diminishing returns; the simpler a test is, the easier it is to refactor and/or delete it. See the post on <a href="https://srikanth.sastry.name/dry-unit-tests-are-bad/">DRY unit tests</a> for more details.</p>
<h4>A bad unit test is worse than no unit test.</h4>
<p><img src="/assets/images/sad-face-spray-paint.png" alt="" /></p>
<p>If unit tests incur a maintenance cost, then a bad unit test has all the costs associated with unit tests and none of the benefits. It is a net loss. Your code base is much better off without that unit test. In fact, a bad unit test can have an even higher cost if it sends developers on a wild goose chase looking for bugs when it fails. So, unless a unit test is of high quality, don't bother with it. Just delete it.</p>
<h4>A flaky unit test is the worst.</h4>
<p><img src="/assets/images/yes-no.png" alt="" /></p>
<p>This is a corollary of the previous observation, but it deserves some explanation. Flaky tests have the side effect of undermining trust in the entire test suite. If a test is flaky, then developers are more likely to ignore red builds, because 'that flaky test is the culprit, and so the failure can be ignored'. However, inevitably, some legitimate failure does occur. But at this point, developers have been conditioned to ignore build/test failures. Consequently, a buggy commit makes its way to prod and causes a regression, which would never have happened if you didn't have that flaky test.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Unit test the brains and not the nerves]]></title>
            <link>https://srikanth.sastry.name/unit-test-the-brains-and-not-the-nerves/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/unit-test-the-brains-and-not-the-nerves/</guid>
            <pubDate>Tue, 31 May 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Note: This is inspired from the book "Unit Testing: Principles, Practices, and Patterns"by Vladimir Khorikov. Unit tests are typically your...]]></description>
            <content:encoded><![CDATA[<p><em>Note: This is inspired from the book "<a href="https://www.manning.com/books/unit-testing">Unit Testing: Principles, Practices, and Patterns</a>" by Vladimir Khorikov.</em></p>
<p>Unit tests are typically your first line of defense against bugs. So, it is tempting to add unit tests for all the functionality that your code supports. But that raises the following question: "Why do we need integration and end-to-end tests?"</p>
<h2>Categorizing production code</h2>
<p>To better understand the primary motivations for unit tests vs. integration (and end-to-end) tests, it is helpful to categorize your production code into four categories along two dimensions: thinking and talking.</p>
<ul>
<li><em>Thinking code.</em> There are parts of your codebase that are focused mostly on the business logic and the complex algorithmic computations. I refer to these as the thinking code.</li>
<li><em>Talking code.</em> There are parts of your codebase that are focused mostly on communicating with other dependencies such as key-value stores, log servers, databases, etc. I refer to these as talking code.</li>
</ul>
<p>Each part of your codebase can be thinking, talking, both, or neither. Based on that observation, we can categorize each unit of code into one of four categories (in keeping with the biology theme).</p>
<table>
<thead>
<tr>
<th>Thinking</th>
<th>Talking</th>
<th>Category</th>
</tr>
</thead>
<tbody>
<tr>
<td>Yes</td>
<td>No</td>
<td><em>Brain</em></td>
</tr>
<tr>
<td>No</td>
<td>Yes</td>
<td><em>Nerves</em></td>
</tr>
<tr>
<td>Yes</td>
<td>Yes</td>
<td><em>Ganglia</em></td>
</tr>
<tr>
<td>No</td>
<td>No</td>
<td><em>Synapse</em></td>
</tr>
</tbody>
</table>
<h2>Testing for each category</h2>
<p>Each category needs a distinct approach to testing.</p>
<h3>Brains → Unit Tests</h3>
<p><a href="/garden/test-behavior-not-implementation/">Brains</a> are one of the most complex parts of your codebase that often requires the most technical skill and domain knowledge to author, read, and maintain. Consequently, they are best tested with unit tests. Furthermore, they also have very few direct external dependencies, and as a result require limited use of test doubles.</p>
<h3>Nerves → Integration Tests</h3>
<p>Nerves have very little logic, but focus mostly on external communication with dependencies.
As a result, there isn't much to unit test here, except perhaps that the protocol translation from the outside world into the brains is happening correctly.
By their very nature, the correctness of nerves cannot be tested hermetically, and therefore, they are not well suited to unit testing. Nerves should really be tested in your integration tests, where you hook your production code up to real test instances of external dependencies.</p>
<h3>Ganglia → Refactor</h3>
<p>Ganglia are units of code that have both complex business logic and significant external dependencies. It is very difficult to unit test them thoroughly, because such unit tests require heavy use of test doubles, which can make the tests less readable and more brittle. You could try to test ganglia through integration tests, but it becomes very challenging to cover low-probability code paths, which are usually the source of difficult-to-debug issues. Therefore, my suggestion is to refactor such code into smaller pieces, each of which is either a <em>brain</em> or a <em>nerve</em>, and test each of those as described above; see the sketch below.</p>
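<p>Here is a rough sketch of that refactoring; all the names are invented for illustration. The thinking moves into a pure function (a brain) that needs no test doubles, leaving behind a thin talking wrapper (a nerve) for integration tests.</p>
<pre><code># Before: a 'ganglion' that both thinks and talks (names hypothetical).
class ReportBuilder:
    def __init__(self, db) -&gt; None:
        self.db = db

    def build(self, user_id: str) -&gt; str:
        rows = self.db.fetch_orders(user_id)        # talking
        total = sum(row["amount"] for row in rows)  # thinking
        return f"{user_id}: {total}"                # thinking

# After: a brain, i.e., pure logic that is unit-testable with no doubles...
def summarize_orders(user_id: str, rows: list) -&gt; str:
    total = sum(row["amount"] for row in rows)
    return f"{user_id}: {total}"

# ...and a nerve, i.e., a thin wrapper covered by integration tests.
class ReportFetcher:
    def __init__(self, db) -&gt; None:
        self.db = db

    def build(self, user_id: str) -&gt; str:
        return summarize_orders(user_id, self.db.fetch_orders(user_id))
</code></pre>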
<p>See Chapter 7 of "<a href="https://www.manning.com/books/unit-testing">Unit Testing: Principles, Practices, and Patterns</a>" for suggestions on how to refactor your code to make it more testable.</p>
<h3>Synapse → Ignore</h3>
<p>Synapses are trivial pieces of code (often utilities) that have neither complex business logic, nor do they have any external dependencies. My recommendation is to simply not focus on testing them. Adding unit tests for them simply increases the cost of testing and maintenance without really providing any benefit. They are often simple enough to be verified visually, and they exist only to serve either the brains or the nerves, and so will be indirectly tested via unit tests or integration tests.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[Mocks, Stubs, and how to use them]]></title>
            <link>https://srikanth.sastry.name/mocks-stubs-andhow-to-use-them/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/mocks-stubs-andhow-to-use-them/</guid>
            <pubDate>Wed, 25 May 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Photo by Polina Kovaleva from Pexels Test doubles are the standard mechanism to isolate your System-Under-Test (SUT) from external dependenc...]]></description>
            <content:encoded><![CDATA[<p><em>Photo by <a href="https://www.pexels.com/@polina-kovaleva?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels">Polina Kovaleva</a> from <a href="https://www.pexels.com/photo/close-up-of-masquerade-masks-on-purple-background-8404608/?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels">Pexels</a></em></p>
<p><a href="https://en.wikipedia.org/wiki/Test_double">Test doubles</a> are the standard mechanism to isolate your System-Under-Test (SUT) from external dependencies in unit tests. Unsurprisingly, it is important to use the right test double for each use case for a maintainable and robust test suite. However, I have seen a lot of misuse of test doubles, and suffered through the consequences of it enough number of times to want to write down some (admittedly subjective) guidelines on when an how to use test doubles.</p>
<p>Briefly, test doubles are <a href="https://martinfowler.com/bliki/TestDouble.html">replacements for a production object used for testing</a>. Depending on who you ask, there are multiple categorizations of test doubles; but two categories that appear in all of them are <a href="https://en.wikipedia.org/wiki/Mock_object">mocks</a> and <a href="https://en.wikipedia.org/wiki/Test_stub">stubs</a>, so I will focus on these two. I have seen <a href="/garden/mocks-vs-stubs/">mocks and stubs</a> often conflated. The problem is made worse by the test-double frameworks' terminology: they are often referred to as 'mocking' frameworks, and the test doubles they generate are all called 'mocks'.</p>
<h2><a href="/garden/mocks-in-testing/">Mocks</a></h2>
<p><img src="/assets/images/woman-wearing-emoji-mask.jpg" alt="woman wearing an emoji mask" /></p>
<p><em>Image by <a href="https://pixabay.com/users/5697702-5697702/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=2428737">Andii Samperio</a> from <a href="https://pixabay.com/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=2428737">Pixabay</a></em></p>
<p>Mocks are objects used to verify 'outbound' interactions of the SUT with external dependencies. This is different from the notion of 'mocks' that 'mocking frameworks' generate; those 'mocks' are really the umbrella category of test doubles.
Examples where mocks are useful include the SUT logging to a log server, sending an email, or filing a task/ticket in response to a given input/user journey. This becomes clearer with an illustration.</p>
<pre><code>import unittest
from unittest.mock import MagicMock

class TestSUT(unittest.TestCase):
    def test_log_success(self) -&gt; None:
        # LogServerClass and SUT stand in for the production classes.
        mock_log_server = MagicMock(spec=LogServerClass)
        mock_log_server.log.return_value = True
        sut = SUT(log_server=mock_log_server)

        sut.test_method(input="foo")

        # This is ok! The outbound call is part of the SUT's
        # specification, so we assert on it.
        mock_log_server.log.assert_called_once_with(message="foo")
</code></pre>
<p>Note that in the above illustration, we verify that the message is sent to the log server exactly once. This is an important part of the SUT's specification. If the SUT were to start logging multiple messages/records for the request, then it could pollute the logs or even overwhelm the log server. Here, even though logging appears to be a side effect of <code>test_method</code>, this side effect is almost certainly part of the SUT's specification, and needs to be verified correctly. Mocks play a central role in such verifications.</p>
<h2><a href="/garden/stubs-in-testing/">Stubs</a></h2>
<p><img src="/assets/images/robot-imitating-family.jpg" alt="Robot imitating family" /></p>
<p>Unlike mocks, stubs stand in for 'inbound' interactions from external dependencies to the SUT. Stubs are useful for replacing external dependencies that 'send' data to the SUT in order for the SUT to satisfy its specification. Examples include key value stores, databases, event listeners, etc. The important note here is that the outbound interaction to the stub <em>should not be asserted</em> in the tests; that's an anti-pattern (it results in over-specification)! Here is an illustration.</p>
<pre><code>import unittest
from unittest.mock import MagicMock

class TestSUT(unittest.TestCase):
    def test_email_retrieval(self) -&gt; None:
        # KeyValueStoreClass and SUT stand in for the production classes.
        stub_key_value_store = MagicMock(spec=KeyValueStoreClass)
        stub_key_value_store.get.return_value = "user@special_domain.com"
        sut = SUT(key_value_store=stub_key_value_store)

        email_domain = sut.get_user_email_domain(username="foo")

        # This is ok! We assert on the SUT's observable behavior.
        self.assertEqual("special_domain.com", email_domain)

        # THIS IS NOT OK! Asserting on calls to the stub couples the
        # test to the SUT's implementation details.
        stub_key_value_store.get.assert_called_once_with(username="foo")
</code></pre>
<p>In the above illustration, we create a stub for the key value store (note that this is a stub even though the object is a 'mock' class) that returns <code>"user@special_domain.com"</code> as a canned response to a <code>get</code> call. The test verifies that when the SUT's <code>get_user_email_domain</code> is called, it returns the correct email domain. What is important here is that we <em>should not</em> assert that there was a <code>get</code> call to the stub. Why? Because the call to the key value store is an implementation detail. Imagine a refactor that causes a previous value to be cached locally. If the unit tests were to assert on calls to the stubs, then such refactors would result in unit test failures, which undermines the utility, maintainability, and robustness of the unit tests.</p>
<h3>Fakes, instead of stubs</h3>
<p>A small detour here. When using a stub, always consider if you can use a <a href="/garden/fakes-over-stubs/">fake</a> instead. There are multiple definitions of a fake, and the one I am referring to is the following. A fake is a special kind of stub that implements the same API as the production dependency, but the implementation is much more lightweight. This implementation may be correct only within the context of the unit tests where it is used. Let's reuse the previous illustration of using a stub, and replace the stub with a fake. Recall that we stubbed out the <code>get</code> method of <code>KeyValueStoreClass</code> to return the canned value <code>"user@special_domain.com"</code>. Instead, we can implement a fake <code>KeyValueStoreClass</code> that uses a <code>Dict</code> as follows.</p>
<pre><code>import unittest
from typing import Dict

# We assume a simplistic API for KeyValueStoreClass with just
# update and get methods.
class KeyValueStoreClass:
    def update(self, k: str, v: str) -&gt; None:
        ...
    def get(self, k: str) -&gt; str:
        ...

# A lightweight, in-memory implementation of the same API, correct
# within the context of these tests.
class FakeKeyValueStoreClass(KeyValueStoreClass):
    def __init__(self) -&gt; None:
        self.kvs: Dict[str, str] = {}

    def update(self, k: str, v: str) -&gt; None:
        self.kvs[k] = v

    def get(self, k: str) -&gt; str:
        return self.kvs[k]


class TestSUT(unittest.TestCase):
    def test_email_retrieval(self) -&gt; None:
        fake_key_value_store = FakeKeyValueStoreClass()
        fake_key_value_store.update(k="foo", v="user@special_domain.com")
        sut = SUT(key_value_store=fake_key_value_store)

        email_domain = sut.get_user_email_domain(username="foo")

        self.assertEqual("special_domain.com", email_domain)
</code></pre>
<p>The advantage of using a fake is that the test becomes much more robust and resistant to refactoring. It also becomes more extensible. When using a stub, if we wanted to test a different user journey, we would need to inject a new return value for the <code>KeyValueStoreClass.get</code> method. We would do this in one of two ways: (1) reset the mock, which is a bad anti-pattern, or (2) initialize the stub to return a preconfigured list of canned values, in order, which makes the test more brittle (consider what happens if the SUT chooses to call <code>get</code> for the same key twice vs. calls <code>get</code> for different keys once each). Using a fake sidesteps these issues.</p>
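<p>For reference, here is a sketch of option (2) above using <code>side_effect</code> with a list of canned values; the ordering dependence is what makes it brittle.</p>
<pre><code>from unittest.mock import MagicMock

stub_key_value_store = MagicMock(spec=KeyValueStoreClass)
# Canned values are consumed in call order, not looked up by key. If
# the SUT calls get() twice for the same key (say, before a caching
# change lands), the second call silently receives the value that was
# meant for a different key.
stub_key_value_store.get.side_effect = [
    "user@special_domain.com",  # intended for key 'foo'
    "admin@other_domain.com",   # intended for key 'bar'
]
</code></pre>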
<h2>But my dependency has both inbound and outbound interactions!</h2>
<p><img src="/assets/images/man-double-exposed-photo.jpg" alt="Photograph of man double exposure" /></p>
<p>Despite all your efforts to separate out the test cases that need stubs and the ones that need mocks, you will inevitably find yourself needing to test a scenario in which you need to verify both inbound and outbound interactions with an external dependency. How do we address that?</p>
<p>First, if you need to assert on the outbound interaction of the same call that is stubbed, then you really don't need that test. Just use a stub/fake and do not assert on the outbound interaction. Next, the only legitimate case of needing to verify both inbound and outbound interactions is if they are on distinct APIs of the same dependency. For example, the SUT could be reading from a file, and you need to test that (1) the contents of the file were read correctly, and (2) the file object was closed after the file was read. In this case, it is perfectly ok to stub the file <code>read</code> method while mocking the <code>close</code> method. Here is an illustration.</p>
<pre><code>import unittest
from unittest.mock import MagicMock, patch

class TestSUT(unittest.TestCase):
    def test_file_read(self) -&gt; None:
        file_mock_stub_combo = MagicMock()
        # Using this as a stub by injecting canned contents of the file.
        file_mock_stub_combo.__iter__.return_value = iter(["1234"])

        # Next, we treat the file open call as a mock.
        with patch("builtins.open",
                   return_value=file_mock_stub_combo
                  ) as mock_file:
            sut = SUT(filename="foo")
            file_contents = sut.get_contents()

            # Assertion on the call to file open.
            # Treating the 'open' call as a mock.
            mock_file.assert_called_once_with("foo")

            # Assertion on the contents returned.
            # Treating the read as a stub.
            self.assertEqual("1234", file_contents)

            # Assertion on the outbound interaction of file close.
            # Treating the 'close' call as a mock.
            file_mock_stub_combo.close.assert_called_once()
</code></pre>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
        <item>
            <title><![CDATA[DRY unit tests are bad... mkay]]></title>
            <link>https://srikanth.sastry.name/dry-unit-tests-are-bad/</link>
            <guid isPermaLink="false">https://srikanth.sastry.name/dry-unit-tests-are-bad/</guid>
            <pubDate>Tue, 17 May 2022 00:00:00 GMT</pubDate>
            <description><![CDATA["Don't Repeat Yourself"(DRY) is arguably one of the most important principles in software engineering. It is considered a truism among many...]]></description>
<content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">"Don't Repeat Yourself" (DRY)</a> is arguably one of the most important principles in software engineering. It is considered a truism among many. A consequence of such dogmatic allegiance to DRYness is that we see a lot of DRY unit tests; this is where the utility of the DRY principle breaks down — <a href="/garden/simplicity-over-dry-in-tests/">simplicity matters more than DRY in tests</a>.</p>
<p><strong>TL;DR.</strong> <em>Simplicity</em> should be a core property of unit tests. This is motivated both by the arguments in this post against DRY unit tests, and by <a href="https://srikanth.sastry.name/the-big-why-about-unit-tests/">software maintainability as the primary motivation for unit tests</a>. Unit tests should be as simple as reasonable. They should be easy to read, understand, and modify (it should be easy to modify any single test in isolation). It is perfectly acceptable for this simplicity to come at the expense of code reuse, performance, and efficiency.</p>
<h2>So, what's wrong with DRY Unit Tests?</h2>
<p>Presumably, we are all convinced of the benefits of DRYing out your code (interested readers can go to <a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">the Wikipedia page</a>). It does have some downsides, and so you have the notion of the <a href="http://blog.jayfields.com/2006/05/dry-code-damp-dsls.html">DAMP</a>/<a href="https://startup-cto.net/moist-code-why-code-should-not-be-completely-dry/">MOIST</a>/<a href="https://kentcdodds.com/blog/aha-programming">AHA</a> principles. Interestingly, the reasons why DRYness doesn't always work out in production code are different from why it is a bad idea to write DRY unit tests. I see five ways in which test code is different from production code, each of which contributes to why test code should not be DRY.</p>
<ol>
<li>Tests (conceptually) do not yield well to common abstractions.</li>
<li>Test code's readability always takes precedence over performance, but not so for production code.</li>
<li>Production code enjoys the safety net of test code, but test code has no such backstop.</li>
<li>DRY production code can speed up developer velocity, but DRY test code hinders developer velocity.</li>
<li>Complex changes to production code can be reviewed faster with pure green/pure red test code changes, but complex changes to test code cannot be reviewed easily.</li>
</ol>
<p>Let's explore each one in more detail.</p>
<h3>DRYness and Abstraction</h3>
<p><img src="/assets/images/triangles-abstract.png" alt="Abstract" />
In practice, DRYing out code results in building abstractions that <em>represent a collection of semantically identical operations</em> as a common procedure. If done prematurely, DRYing can result in poorer software. In fact, premature DRYing is the motivation for advocating the <a href="https://kentcdodds.com/blog/aha-programming">AHA</a> principle. While that argument against DRYness works well for production code, it does not apply to test code.</p>
<p>Test code is often a collection of procedures, where each procedure steps the System-Under-Test (SUT) through a distinct user journey and compares the SUT's behavior against pre-defined expectations. Thus, almost by design, test code does not lend itself to semantically similar abstractions. The mistake that I have seen software engineers make is to take syntactic similarity for semantic similarity. Just because the tests' 'Arrange' sections look similar does not mean that they are doing semantically the same thing in both places; in fact, they are almost certainly doing semantically different things, because otherwise the tests would be duplicates of each other!</p>
<p>By DRYing out such test code, you are effectively forcing abstractions where none exist, and that leads to the same issues that DRYness leads to in production code (See <a href="https://kentcdodds.com/blog/aha-programming">[1]</a>, <a href="https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction">[2]</a>, <a href="https://evhaus.medium.com/using-dry-wet-damp-code-6ab3c8245bb2">[3]</a>, <a href="https://startup-cto.net/moist-code-why-code-should-not-be-completely-dry/">[4]</a> for examples).</p>
<h3>Readability</h3>
<p><img src="/assets/images/glasses-letters-clear.jpg" alt="Abstract" />
Most code is read more often than is written/edited. Unsurprisingly, it is important to favor code readability, even in production code. However, in production code, if this comes at a steep cost in performance and/or efficiency, then it is common (and prudent) to favor performance over readability. Test code, on the other hand, is less subject to the (potential) tension between readability and performance. Yes, unit tests need to be 'fast', but given the minuscule amount of data/inputs that unit tests process, speed is not an issue with hermetic unit tests. The upshot here is that there is no practical drawback to keeping test code readable.</p>
<p>DRYing out test code directly affects its readability. Why? Remember that we read unit tests to understand the expected behavior of the system-under-test (SUT), and we do so in the context of a user journey. So, a readable unit test needs to explain the user journey it is executing, the role played by the SUT in realizing that user journey, and what a successful user journey looks like. This is reflected in the <a href="https://java-design-patterns.com/patterns/arrange-act-assert/">Arrange-Act-Assert</a> structure of the unit test. When you DRY out your unit tests, you are also obfuscating at least one of those sections in your unit test. This is better illustrated with an example.</p>
<p>A common DRYing in unit tests I have seen looks as follows:</p>
<pre><code>import typing
import unittest

# Assuming the 'parameterized' PyPI package for @parameterized.expand.
from parameterized import parameterized

class TestInput(typing.NamedTuple):
    param1: str
    param2: typing.Optional[int]
    ...

class TestOutput(typing.NamedTuple):
    status: SomeEnum
    return_value: typing.Optional[int]
    exception: typing.Optional[Exception]
    ...

class TestCase(typing.NamedTuple):
    input: TestInput
    expected_output: TestOutput
        
class TestSequence(unittest.TestCase):
    
    @parameterized.expand([
        [test_input1, expected_output1],
        [test_input2, expected_output2],
        ...
    ])
    def test_somethings(self, test_input: TestInput, expected_output: TestOutput) -&gt; None:
        self._run_test(test_input, expected_output)
        
    def _run_test(self, test_input: TestInput, expected_output: TestOutput) -&gt; None:
        sut = SUT(...)
        prepare_sut_for_tests(sut, test_input)
        output = sut.do_something(test_input.param2)
        test_output = make_test_output(output, sut)
        self.assertEquals(expected_output, test_output)
</code></pre>
<p>On the face of it, it looks like DRY, organized code. But for someone reading this test to understand what the SUT does, it is very challenging. They have no idea why the set of <code>test_input</code>s was chosen, what the material differences among the inputs are, what user journeys each of those test cases represents, what preconditions need to be satisfied for running <code>sut.do_something()</code>, why the expected output is the specified output, and so on.</p>
<p>Instead, consider a non-DRY alternative.</p>
<pre><code>class TestSequence(unittest.TestCase):
    
    def test_foo_input_under_bar_condition(self):
        """
        This test verifies that when condition bar is true, then calling `do_something()`
        with input foo results in sigma behavior
        """
        sut = SUT()
        ensure_precondition_bar(sut, param1=bar1, param2=bar2)
        output = sut.do_something(foo)
        self.assertEquals(output, sigma)
</code></pre>
<p>This code tests one user journey and is human readable at a glance by someone who does not have an in-depth understanding of the SUT. We can similarly define all the other test cases with some code duplication and greater readability, with negligible negative impact.</p>
<h3>Who watches the watchmen?</h3>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Watchmen_graffiti_1.jpg/2560px-Watchmen_graffiti_1.jpg" alt="colink., CC BY-SA 2.0 https://creativecommons.org/licenses/by-sa/2.0&amp;gt" />
<em>[<a href="https://www.flickr.com/photos/67458569@N00/7099293919">Original Image</a> posted to <a href="https://en.wikipedia.org/wiki/Flickr">Flickr</a> by colink. License: <a href="https://creativecommons.org/licenses/by-sa/2.0/">Creative Commons ShareAlike</a>]</em></p>
<p>Production code has the luxury of being fine-tuned, optimized, DRY'd out, and subject to all sorts of gymnastics, mostly because production code is defended by tests and test code. For instance, to improve performance, if you replaced a copy with a reference, and accidentally mutated that reference inside a function, you have a unit test that can catch such unintended mutations. However, test code has no such backstop. If you introduce a bug in your test code, then only a careful inspection by a human will catch it. The upshot is the following: the less simple/obvious the test code is, the more likely it is that a bug in that test code will go undetected, at least for a while. If a buggy test is passing, then a bug in your production code may be going undetected. Conversely, if a test fails, then it might just denote a bug in the test code. If this happens, you lose confidence in your test suite, and nothing good can come from that.</p>
<p>DRY code inevitably asks the reader to jump from one function to another, and requires the reader to keep the previous context in mind when navigating these functions. In other words, it increases the cognitive burden on the reader compared to straight-line duplicated code. That makes it difficult to verify the correctness of the test code quickly and easily. So, when you DRY out your test code, you increase the odds that bugs creep into your test suite and that developers lose confidence in the tests, which in turn significantly reduces the utility of your tests.</p>
<h3>Developer Velocity</h3>
<p><img src="/assets/images/woman-developer-frustrated.jpg" alt="Woman developer" /></p>
<p>Recall from the previous section that while tests might have duplicate code, they do not actually represent semantic abstractions replicated in multiple places. If you do mistake them for common semantic abstractions and DRY them out, then eventually there will be an addition to the production code whose test breaks this semantic abstraction. At this point, the developer who is adding this feature will run into issues when trying to modify the existing test code to add the new test case. For instance, consider a class that is hermetic, stateless, and does not throw exceptions. It would not be surprising to organize DRY tests for this class that assume that exceptions are never thrown. Now a new feature is added to this class that requires an external dependency, and the class can now throw exceptions. Adding a new test case into the DRY'd out unit test suite will not be easy or straightforward. The sunk cost fallacy associated with the existing test framework makes it more likely that the developer will try to force-fit the new test case(s) into the existing framework. As a result:</p>
<ol>
<li>It slows the developer down because they now have to grok the existing test framework, think of ways in which to extend it for a use case that it was not designed for, and make those changes without breaking existing tests.</li>
<li>Thanks to poor abstractions, you have now incurred more technical debt in your test code.</li>
</ol>
<h3>Code Reviews</h3>
<p><img src="/assets/images/black-women-developers.jpg" alt="Developers doing code reviews" /></p>
<p>DRY'd out tests not only impede developer velocity, they also make code/diffs/pull requests harder to review. This is a second order effect of DRYing out your test code. Let's revisit the example where we are adding a new feature to an existing piece of code, and this is a pure addition in behavior (not a modification of existing behavior). If the tests were not DRY'd out, then adding tests for this new feature would involve just adding new test cases, and thus just green lines in the generated diff. In contrast, recall from the previous subsection that adding tests to DRY test code is likely going to involve modifying existing code and then adding new test cases. In the former case, reviewing the tests is much easier, and as a result, reviewing that the new feature behaves correctly is also that much easier. Reviewing the diff in the latter case is cognitively more taxing, because not only does the reviewer need to verify that the new feature is implemented correctly, they also have to verify that the changes to the test code are correct and do not introduce new holes for bugs to escape testing. This can significantly slow down code reviews in two ways: (1) it requires more time to review the code, and (2) because it requires longer to review, the reviewers are more likely to delay even starting the review.</p>
]]></content:encoded>
            <author>Srikanth Sastry</author>
        </item>
    </channel>
</rss>