Srikanth Sastry A Techie in Boston

Are you building for Survival or Excellence?

Source: https://www.youtube.com/attribution?v=oW2i6QpnmyY

In my experience, the approach to building a software artifact often falls into one of two types: building for survival, or building for success.

When building for survival, your only goal is to get the product working for the specific usecase(s) that will save your skin. In contrast, when building for success, you are building to solve more than just the immediate problem; you are setting up building blocks that are incidentally used to solve the immediate problem, but can also be adapted to solve a larger class of problems within the same context.

This post is not about when to choose what approach. Instead, it is about what each of the two approaches look like, and what purposes they serve. A subsequent post will talk about when I think each approach is appropriate.

In theory, specific circumstances should determine which of these two approaches ought to be used. Unfortunately, all too often, the developer's temperament determines the approach, and this IMHO is a mistake. I have seen consequences of such mistakes last through multiple years and impact the morale of multiple teams and engineers.

Building for survival

Source: https://www.flickr.com/photos/bandrews/6028945055

Building for survival often translates to 'being fast', taking shortcuts, and solving for the immediate use case. However, remember that when you do this, your software incurs a debt that will have to be paid eventually. Every incremental functional change you make on top of it incurs interest on the existing debt. Refusal to address it makes it incredibly difficult for your software to evolve and improve. This has a direct impact on your team’s morale. Ask any team that is left supporting ‘legacy’ code, or has some ‘black box’, ‘sacred cow’ modules that no one understands, but are business critical.

What does building for survival look like?

How do you know you are now in the regime of building for survival? There are many clues to look for. I'll list three.

  • Typically, when building for survival, your deadlines are sacred. Think about everything that you or your company has had to do to meet GDPR deadlines. The odds are that all of that was done in the mode of building for survival.
  • The second clue is that you deem it more important that some specific usecase work end-to-end than that things are done the 'right' way (what constitutes the 'right' way is a topic for a whole other discussion). You see this often in early stage startups that have an alpha customer who has promised to use your product/service for some specific purpose, where your next round of funding is contingent upon demonstrating the utility of your product/service within that (isolated) context.
  • The third, and perhaps the strongest, clue is that to (the collective) you, the end product is more important than the software development process. If your engineering culture is to 'get things done' by 'hook or crook', then you are most definitely building for survival.

What does building for survival get you?

You survive, period. It gets you to where you want to be, and within a reasonable amount of time, with potentially steep discounting of the future. There really isn't much to show beyond that.

What it doesn't give you

It is important to realize the trade-off you are making when building for survival, and not be under illusions.

  • For starters, do not mistake hitting your milestones under this approach for success. Sure, you may have succeeded in getting where you want to be, but that's not the end of the story.
  • Presumably, the software you just delivered is not going to be abandoned imminently. So, what you need is a path forward, and that is exactly what this approach will not provide. Building for survival does not necessarily tell you how and where to go next. It shines no light on the landscape of possibilities that could have been unlocked.
  • It doesn't tell you what else your artifact can be used for, or how it fits into the larger ecosystem. In the pursuit of 'moving fast', the odds are that you have built so many assumptions into your code that even cognitively extricating the underlying technological innovation from the business logic, and the business logic from the use cases, becomes challenging.

Building for Success

source: https://medium.com/deliberate-data-science/deliberate-data-science-intro-eac1b1a06568

Building for success is a much more deliberate process that includes grokking the true context of the problem you are solving, and being critical of everything you choose to build. But it is important to be sure that you actually have such a luxury; otherwise, your software will likely become part of the vast graveyard of failed projects, products, and companies.

What does building for success look like?

There are lots of ways building for success is different from building for survival.

  • You deliberate before execution. You ask questions such as:
    • What is the problem we are solving?
    • Are we solving the right problem?
    • Is our proposal the right way to solve the problem?
  • You deconstruct the problem to understand the larger context and nature of the sub-problems you are solving. You tease out the incidental nature of how these sub-problems combine versus the essential nature of the overall problem to be solved.
  • Your execution is heavily informed by the aforementioned analysis. You apply the deconstruction and analysis to each sub-problem recursively until the actual act of writing the code becomes a rote exercise. The 'magic' and 'innovation' in your execution is really in how you compose such 'simple pieces of code' to solve your non-trivial problem across layers of abstractions (which are translated directly from your deconstructions).
  • The code paths within your subsystems and modules are constrained to the supported use cases, but that is the result of intentional plumbing across the data flow. Addition of new use cases and flows is often a matter of easily understandable and incremental changes.
  • Your control flow mimics your data flow within the system. (Unless there is a very good reason for it not to be the case.)
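The last two points can be made concrete with a small sketch (hypothetical names, not from the post): when each stage consumes the previous stage's output, the control flow of the composed function reads exactly like the data flow through the system.

```java
import java.util.function.Function;

public class PipelineDemo {
    // Control flow mimics data flow: reading the composition top-to-bottom
    // tells you exactly how data moves through the stages.
    static int run(String raw) {
        Function<String, String> parse = String::trim;
        Function<String, Integer> validate = Integer::parseInt;
        Function<Integer, Integer> transform = n -> n * 2;

        Function<String, Integer> pipeline =
            parse.andThen(validate).andThen(transform);
        return pipeline.apply(raw);
    }

    public static void main(String[] args) {
        System.out.println(PipelineDemo.run(" 21 ")); // prints 42
    }
}
```

Each stage here is trivial on its own; the 'magic' is only in the composition, which is exactly the point made above.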

What does building for success give you?

Despite it not being the 'fast' way to build software, there is a lot to be said for building for success.

  • The deliberation process before you build should result in a decent understanding of the context within which you are solving your problem. This often means your team and your software are now in a much better position to solve more problems faster. Effectively, you have expanded your 'pie'.
  • Almost always, problems do not occur or manifest in isolation. They are part of a larger landscape of issues, utilities, and benefits. Deconstructing the problem through this lens will help you build a solution that is more likely to have reusable and sustainable components and modules, which will lower the incremental effort associated with the evolution of your systems and their adaptation to solve proximate and associated problems.
  • A well thought out design allows you to shard your development across multiple developers. This will help in three ways:
    1. You can 'move fast' with concurrent execution.
    2. Each developer can work on multiple workstreams, and is less likely to be completely stuck.
    3. Your software's bus factor is much improved with more engineers on the code.
  • You can pivot better and faster because a lot of what you wrote is reusable and reconfigurable. You can migrate from one upstream dependency to another much more smoothly. A good composition-based design allows you to make disruptive changes without actually disrupting :)

What it doesn't give you

You will not have a quick start. You will be a little slower starting from square one. It will take time to start putting together code that actually does something real and concrete.

You are vulnerable to analysis paralysis. The bar for action is much higher when building for success. It takes a certain type of decisiveness, and ability to disagree and commit, to be able to flourish under this approach.

Object Composition for Service Migration

Object Composition is a very powerful and pervasive software design technique. Yet, paradoxically, it is an underutilized design pattern whose lack of usage is the root of many anti-patterns in software development. One that I continue to come across regularly has to do with not using composition to test and migrate a piece of software from one service to another.

Briefly, Object Composition is combining two or more objects to create more complex objects while keeping the definitions of the constituent objects unchanged (unlike inheritance, which extends those definitions).
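As a minimal sketch of that definition (hypothetical names, not from the migration example below): RetryingClient adds retry behavior by wrapping another ServiceClient, leaving the wrapped class's definition untouched; inheritance would instead have extended FlakyClient itself.

```java
interface ServiceClient {
    String process(String request);
}

// A client that fails on its first call and succeeds afterwards.
class FlakyClient implements ServiceClient {
    private int calls = 0;

    @Override
    public String process(String request) {
        if (calls++ == 0) {
            throw new RuntimeException("transient failure");
        }
        return "ok:" + request;
    }
}

// Composition: retry behavior is layered on top of any ServiceClient
// without modifying (or inheriting from) the wrapped class.
class RetryingClient implements ServiceClient {
    private final ServiceClient inner;
    private final int maxAttempts;

    RetryingClient(ServiceClient inner, int maxAttempts) {
        this.inner = inner;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public String process(String request) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return inner.process(request);
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }
}

public class CompositionDemo {
    public static void main(String[] args) {
        ServiceClient client = new RetryingClient(new FlakyClient(), 3);
        System.out.println(client.process("ping")); // prints "ok:ping"
    }
}
```

Note that RetryingClient works for any ServiceClient implementation, which is exactly the property the migration below exploits.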

Set Up

Say you have an existing backend service that your code currently uses. It has evolved over time to become a chimera that needs replacing, and you have a brand new implementation of that service that can replace your old service.

Say the two client implementations look something like the following:

class OldAndBusted implements ServiceClient {
  @Override
  Response process(Request request) {
    // Hacky code.
    if (request.type == A) {
      // Ugly code.
    } else if (request.type == B) {
      // Even uglier code.
    } else {
      // A monstrosity that needs to be killed with fire.
    }
    return response;
  }
}

class NewHotness implements ServiceClient {
  @Override
  Response process(Request request) {
    // Best code ever written.
    return response;
  }
}

The goal is to migrate your code from using OldAndBusted to NewHotness. There are several ways to do this wrong. So it is easier if I demonstrate a right way to do this using Object Composition.

A right way

There are really three steps to such a migration.

  1. Verify equivalence: Shadow a percentage of your calls to the new service, log mismatches in the response, and fix all such mismatches.
  2. Configure migration: Setup service migration to proceed in phases.
  3. Migrate and clean up: Complete migration and delete the old service.

Step 1. Verify equivalence

The goal here is to ensure that, before we start migration, the new service is functionally identical to the old service. We accomplish this through composition of the old and new services, as sketched out next.

class ClientWithShadow implements ServiceClient {
  ClientWithShadow(ServiceClient oldAndBusted,
                   ServiceClient newHotness) {
    this.oldAndBusted = oldAndBusted;
    this.newHotness = newHotness;
  }

  @Override
  Response process(Request request) {
    oldResponse = oldAndBusted.process(request);
    if (shouldShadow(request)) {
      newResponse = newHotness.process(request);
      if (!oldResponse.equals(newResponse)) {
        logMismatch(oldResponse, newResponse);
      }
    }
    // Always return the old service's response; the new
    // service is only being shadowed at this point.
    return oldResponse;
  }
}

The pseudocode above simply delegates calls to the old service, and if shadowing is required, it additionally delegates to the new service as well and compares the two responses. It logs any mismatches it sees so that the developer can then take a look and ensure they are addressed.

You simply replace all calls to OldAndBusted with calls to ClientWithShadow.

Step 2. Configure migration

After you have determined that the two services are indeed functionally alike, you can then prep for migration. Again, object composition helps us set this up cleanly.

Pseudocode for setting up such a migration follows next. Here, I assume that there is a Config object that contains the migration related config.

class MigrationClient implements ServiceClient {
  MigrationClient(ServiceClient oldAndBusted,
                  ServiceClient newHotness,
                  Config migrationConfig) {
    this.oldAndBusted = oldAndBusted;
    this.newHotness = newHotness;
    this.migrationConfig = migrationConfig;
  }

  @Override
  Response process(Request request) {
    if (migrationConfig.useNewService(request)) {
      return newHotness.process(request);
    }
    return oldAndBusted.process(request);
  }
}

You simply replace all instances of ClientWithShadow with MigrationClient. Yes, it really is that simple! The migration config has all the info it needs to figure out whether a given request should use the new service or the old service.
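For illustration, here is a hedged sketch of what such a config might look like; the Config class and its percentage-based routing are my assumptions, not from the post. Bucketing on a stable request key makes routing deterministic for a given request, and ramping up or rolling back is just a change to the configured percentage.

```java
// Hypothetical request type with a stable routing key.
class Request {
    final String key;
    Request(String key) { this.key = key; }
}

// Hypothetical migration config: routes a configured percentage of
// traffic to the new service. Rollback is just a config change.
class Config {
    private final int newServicePercent; // 0 = all old, 100 = all new

    Config(int newServicePercent) {
        this.newServicePercent = newServicePercent;
    }

    boolean useNewService(Request request) {
        // Deterministic bucket in [0, 100) derived from the request key,
        // so the same request always takes the same path.
        int bucket = Math.floorMod(request.key.hashCode(), 100);
        return bucket < newServicePercent;
    }
}

public class ConfigDemo {
    public static void main(String[] args) {
        Request r = new Request("user-42");
        System.out.println(new Config(0).useNewService(r));   // prints false
        System.out.println(new Config(100).useNewService(r)); // prints true
    }
}
```

A phased rollout then amounts to raising newServicePercent in steps (1, 5, 25, 50, 100) while monitoring, with no code changes in between.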

Step 3. Migrate and clean up

Here, we do the actual migration. We set up the config to slowly start shifting some of the load from the old service to the new one, while monitoring to make sure everything is going well. We can always roll back the migration by editing the config without actually modifying the code, which is a big deal here.

After migration to the new service is at 100%, you can simply replace MigrationClient instances with NewHotness instances, and delete all the old code (OldAndBusted, ClientWithShadow, and MigrationClient). And you are all cleaned up. Profit!

So many wrong ways

Unfortunately, I have seen this done in way too many wrong ways.

  • I have seen use of inheritance to extend OldAndBusted to NewHotness, and some hacky switch inside the NewHotness implementation to do shadowing and migration.
  • I have seen hacky if-else modification of OldAndBusted, with the new if-block implementing the NewHotness functionality.
  • I have seen developers skip shadowing entirely only to cause major service incidents.
  • And many more ways that are not that interesting, except as disaster tourism.

So: object composition is useful, it is powerful, and please use it more!

folly::Future, onTimeout(), and a race condition

TL;DR. The inability to cancel threads in C++ can result in bizarre semantics even in seemingly straightforward (and almost declarative) code. folly::Future is an interesting case in point.

Folly Futures is an async C++ framework from Facebook. It has an interesting function, onTimeout(), which essentially allows you to stop waiting on a Future forever. You would typically use it as follows.

provider_.getOperationFuture(Request r)
    .then([&](Response response) {
      doFoo(); // Accesses variables in the surrounding scope
    })
    .onTimeout(milliseconds(500), [&] {
      doBar(); // Accesses variables in the surrounding scope
    })
    .get();

The semantics that I expected from this piece of code was the following:

if there is no response within 500 milliseconds, then
    the future throws a timeout, thus executing doBar()
else
    the future executes the then() block, thus executing doFoo()

Importantly, I was expecting exactly one of the two functions doFoo() or doBar() to be executed. And it turns out not to be true!

Race Condition

It turns out that the Future has a background thread waiting for the response, and this thread is not cancelled upon timeout because:

  1. This thread is spawned first, and it in turn waits on the timeout, and
  2. C++ does not support canceling threads.

So, we now have a race condition between the Future's response and timeout, thus potentially causing memory overruns and segfaults. How do you get around this? How do you use folly::Future with the semantics I outlined above?

Remedies

I found two possible ways around this.

Swap onTimeout() and then()

provider_.getOperationFuture(Request r)
    .onTimeout(milliseconds(500), [&] {
      doBar(); // Accesses variables in the surrounding scope
      return Response::onTimeout();
    })
    .then([&](Response response) {
      if (response == Response::onTimeout()) {
        return;
      }
      doFoo(); // Accesses variables in the surrounding scope
    })
    .get();

Essentially, you force the onTimeout block to return a special instance of the Response object (called Response::onTimeout() here); this then becomes the input to the then block, and within the then block you can check whether the response is valid and proceed accordingly. Yes, I know it's ugly. Worse, what if the Response object is complex enough that you cannot simply build a special instance of it? Or what if every possible instance of the Response object is potentially valid? Then you can go for the next remedy.

Open up onTimeout()

It is useful to remember that onTimeout() is just syntactic sugar for the following.

provider_.getOperationFuture(Request r)
    .within(milliseconds(500))
    .onError([](const TimedOut& e) {
      doBar();
      return Response::onTimeout();
    })
    .then(...);

So, you can use this to refactor your code to look something like this:

provider_.getOperationFuture(Request r)
    .within(milliseconds(500))
    .then([&](Response response) {
      doFoo(); // Accesses variables in the surrounding scope
    })
    .onError([&](const TimedOut&) {
      doBar(); // Accesses variables in the surrounding scope
    })
    .get();

This essentially raises an exception after 500 milliseconds of no response, and that exception ensures that the then block is never executed! So, yeah, folly::Future can be tricky.

Scripts and their undo

TL;DR. Scripts are a great way to automate the mundane. But be sure you give yourself a way out --- an undo --- when running them.

Some time ago, I had to carry out a long sequence of manual changes in the deployment of my ‘cloud’ service, and so, like a good software engineer, I automated large chunks of these changes with shell scripts. Here I learned the importance of building an ‘undo’ into all your shell scripts that mutate the state of the world.

A bit of background first. I discovered that one of the services in a collection of co-located services was over-provisioned by a lot. But due to interdependence among services and second-order effects, I wasn’t sure by how much. A quick way to find out was to shrink the size of this service while monitoring its resource utilization. For multiple reasons, I had to go through a very specific sequence of replica turn-downs, and this sequence was accounted for in the automation as well.

After writing and testing the script, I unleashed it on the deployment, and things seemed to be going well.

Midway through, an engineer from a partner team pinged me to say that they were having a service incident, and that my changes were introducing a lot of noise in their monitoring dashboard, making it difficult to debug their issue. So they asked me to undo my changes, and resume after they had fixed their issue.

Well, as it turns out, I did not have an undo script, and worse, I hadn't even thought of an undo short of resetting the entire service (which was scheduled to happen at the end of the day anyway). So I halted my existing script, made some quick changes that I thought would effectively undo it, and, given how short I was on time, just let it run.

You can guess what happened. Instead of undoing the changes, a bug in the script caused it to be more aggressive about shutting down replicas, and I now had a new service incident on my hands! :)

If only I had spent enough time figuring out the undo operation, and had a handy command that executed it, this could have been completely avoided. So my advice to you is this. When writing and launching a script that mutates the state of the world, please ensure that the script logs (either to stdout, stderr, or a log file) the exact command that can be pasted into your shell prompt to undo all the mutations. You may not have to use it often (or at all); but when you do, it will definitely be worth the effort.

Merits of unit tests — part 5

Cross posted on LinkedIn.

This is the fifth, and final, post in my series of notes on unit tests. So far, we've talked about how unit tests help us in documenting our code, reliably refactor software, build better code, and even help debugging in prod. In this post, we'll discuss how unit tests (more precisely, the act of writing unit tests) help us improve the usability of our code.

Usability

It is fairly accurate to state that the simpler and more usable your API is, the less likely it is to be misunderstood, misused, and abused. Also, a simpler API constrains your possible code paths, making it more testable and less bug-prone. I claim that the very act of writing unit tests will help you write more usable code/APIs. (Of course, this assumes an earnest effort in writing good quality unit tests, which can be a topic of discussion in its own right.)

The reason for my claim is simple: by writing extensive unit tests that account for all your use cases, you effectively become your own first customer. This forces you to wear your customer's hat and really probe the user experience of your API. In fact, it is not uncommon for me to iterate on my APIs multiple times simply because I am not happy with how difficult it is to set up and execute my unit tests.

by writing extensive unit tests that account for all your use cases, you effectively become your own first customer

Let's take a fictional example of a class that does the following: It retrieves either a URL, or the content of the URL for a given handle that could potentially need to be authenticated as a specific user, and it can do so periodically. Here is a first crack at the API and usage for it.

class UrlRetriever {
  // Unauthenticated, one-time
  UrlRetriever(Handle handle);
  // Authenticated, one-time
  UrlRetriever(String user, Handle handle);
  // Authenticated, periodic
  UrlRetriever(String user, Handle handle, int periodInSeconds);
  // Unauthenticated, periodic
  UrlRetriever(Handle handle, int periodInSeconds);

  String getUrl();
  Blob getContents();
  void getContentsPeriodically(Callback cb);
  void getUrlPeriodically(Callback cb);
}

When you start writing unit tests for this, you start seeing issues with usability. For example, you have to consider all possible constructions of UrlRetriever combined with each getUrl() or getContents() call. Worse, what happens if the UrlRetriever is constructed without a periodInSeconds argument, and someone invokes getContentsPeriodically() on it? Sure, it is nonsensical, but you still need a test case for it, right? Which means the clients could potentially misuse the class in this fashion, in part because the usability of this API is poor.
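To make that concrete, here is a hedged sketch (a hypothetical stub with simplified types, not the real class) of the nonsensical combination a test suite would still have to pin down: a one-time retriever that nevertheless exposes the periodic method.

```java
// Hypothetical stub of the first UrlRetriever API, just enough to show
// the problem: a retriever built without periodInSeconds still exposes
// getContentsPeriodically(), so *some* behavior must be defined and tested.
class UrlRetriever {
    private final Integer periodInSeconds; // null when one-time

    UrlRetriever(String handle) { this.periodInSeconds = null; }
    UrlRetriever(String handle, int periodInSeconds) {
        this.periodInSeconds = periodInSeconds;
    }

    void getContentsPeriodically(Runnable cb) {
        if (periodInSeconds == null) {
            // The API permits this call, so we are forced to pick a behavior.
            throw new IllegalStateException("no period configured");
        }
        // ... schedule cb every periodInSeconds ...
    }
}

public class UsabilityDemo {
    public static void main(String[] args) {
        try {
            new UrlRetriever("handle").getContentsPeriodically(() -> {});
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The test you are forced to write here is a test of behavior nobody should ever want, which is the usability smell.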

Making an honest attempt at writing unit tests can actually help you detect such usability issues! Consider the next iteration for the same use case, informed (or constrained) by the unit tests.

class UrlRetriever {
  UrlRetriever(Handle handle);
  AuthenticatedUrlRetriever withAuth(String user);
  String getUrl();
  Blob getContents();
  void getPeriodically(ContentCallback cb, int periodInSeconds);
  void getPeriodically(UrlCallback cb, int periodInSeconds);
}

class AuthenticatedUrlRetriever inherits UrlRetriever {
  // Does not make sense to authenticate with another user.
  AuthenticatedUrlRetriever withAuth(String user) throws exception;
}

You will see that writing unit tests for it is much easier, and furthermore, there is less chance of misusing this API. Both of these are true because the API is more usable. The clients can use it in multiple, but limited and tractable, ways. Here is how it ends up looking in the unit tests.

// one-time unauthenticated 
url = UrlRetriever(handle).getUrl();
// periodic unauthenticated (1)
// -- MyCallback inherits UrlCallback.
urlCallback = new MyCallback();
UrlRetriever(handle).getPeriodically(urlCallback, 30);
// periodic unauthenticated (2)
// -- MyCallback inherits ContentCallback.
contentCallback = new MyCallback();
UrlRetriever(handle).getPeriodically(contentCallback, 30);
// Authenticated
foo = UrlRetriever(handle).withAuth(user);
// -- one-time
blob = foo.getContents();
// -- periodic
foo.getPeriodically(contentCallback, 60);

Much better eh? :)