
Git may not be the best for SaaS companies

Yes, I realize that this is going to be pretty controversial :) Let’s dive in, shall we?

In the past half decade or so, Git has skyrocketed in popularity and has become the de facto choice for version control. While I understand its popularity, I have found it to be a poor fit for one specific, but popular, environment: SaaS development in a medium to large enterprise.

Let's take a quick look at what developing software in a SaaS enterprise is like. For the most part, it involves a significant number of developers concurrently committing code to a single master branch, with frequent releases to prod and no long-lived branches or versions. In a services environment, it also includes coordinated deployment of multiple services that have complementary server and client API specs.

While Git has been used in such environments with varying degrees of success, I have seen teams working around Git rather than Git just working for them. I have seen Git get in the way when teams get large, when teams get siloed into their own branches, when teams start working with junior developers, and when developing across multiple services. While there are mechanisms in Git workflows and tools to mitigate this, they only add complexity to developing software instead of taking it away.

Scaling Git with team size

Git branches for large teams

Git serializes all commits on a branch, so it does not natively scale well for large orgs. The common way to get around this is with more branching. Each team and developer ends up creating dependent branches off master, and either does a best-effort merge into master, or designates one engineer to deal with the mess of merging multiple feature branches into master. Note that all this overhead makes sense if you have named releases and have to maintain multiple versions of the software. But in the SaaS world, you are continuously releasing to prod. There is no need to keep an 8-week-old v2.1 around! So all this overhead becomes an artifact of using Git.

Instead, consider Mercurial or Perforce, where you can have concurrent commits as long as the commits don't touch the same files or modules. Here it is much easier to support concurrent updates to master across engineers who are not working on the same set of files. Granted, this can potentially break the build on master even though none of the individual commits breaks the build, but with a good CI setup, it should be quick to catch and easy to fix. And as long as master is in good health, you cut another release and move on.
The use of branches in Git brings me to the next issue that I have seen with Git in SaaS software development environments.

Git encourages branching despite conceptually not needing it


Anyone who has used Git has heard "never commit directly to master". The common pattern is to create a feature branch for development. Let's consider this for a minute. In a SaaS deployment with continuous releases, you want every diff to go into master asap, and have it deployed so that you can iterate faster. With Git, you end up doing this by creating a new branch, committing changes to it, getting it reviewed, and then (if you are smart) squash-merging it into master. Now, if a team is developing a feature together, the pattern becomes that of creating a feature branch that everyone commits to, and after the feature is complete, merging it into master. Conceptually, you really just want to "commit directly to master", but in Git bad things can happen with less-than-well-seasoned developers if you allow it. So you create branches. Effectively, Git forces you to create branches.

This pattern leads to two problems: (1) Large features do not hit prod iteratively in small merges; instead, they land in one large merge to master. This hurts stability and delays bug discovery in the SDLC. (2) Unless an org is extremely vigilant against branching, you are likely to see long-running 'feature branches' that compete with the main branch for attention. In practice, this means all fixes end up needing porting across multiple branches, and the longer this goes on, the more the feature branches diverge from master, which makes merging that much more unpleasant, so it tends to get postponed, making the problem worse. Are there workarounds to prevent this from happening? Sure! You can write any sort of framework and tooling around Git. Have I seen it often enough to think it's standard? Not even close!

Really, all you need is to be able to release from master, and if there is a bug, just fix it in master, and keep adding features to master. There is no need for anything more sophisticated.

The above issue with branches is actually the symptom of a bigger attribute of Git: complexity.

Git is complicated; a little too complicated.

Git is complicated, and there is no way around it. Git (much like C++) makes it way too easy to shoot yourself in the foot. It gives way too much power to all developers, including ones who shouldn't have it and don't want it. As a junior developer, I cannot count the number of times I ended up with a detached HEAD despite doing very simple operations. In general, Git makes it very hard to follow simple workflows. It is way too easy to get into a bad state from which recovery is very difficult, if not impossible. But that's the benign case. Git allows you to do all kinds of crazy stuff, including rewriting history. This can be very dangerous in the hands of an inexperienced developer. A bad merge to master can put your repo in a state from which rollback is near impossible.

It is easy to dismiss this argument with "With great power comes great responsibility", but the fact of the matter is that (1) having inexperienced developers is to be expected, and your VCS should be robust to their shenanigans, and (2) most developers don't want to be Git experts. They just want to be able to write good software without the tools getting in their way, which Git steadfastly refuses to do.

Even with experienced developers, Git can get contentious and tricky. Just ask a bunch of developers whether you should merge or rebase, and watch the fireworks that ensue.

Git encourages multiple repos, and that’s not always a good thing

Git does not like large repos; the recommended way around the complexity of a large repo is to split your codebase into multiple repos. This multi-repo approach creates a problem in the (micro)services world. Where do you define the source of truth for a service endpoint's API spec? This problem is accentuated when using frameworks such as gRPC or Thrift. Where do you store the Thrift/protobuf message definitions that need to be accessed by both the client code and the server code? Servers and clients are often distinct services that are implemented in different repos. If you define them in the server repo, then how do you ensure that when you change the API spec on the server side, the client will pick that change up as well? (Remember that this is an enterprise environment where we control both client and server.) Sure, you can use Git submodules, but that is not without its hassles. Developers could pinky swear to keep the definitions in sync, but we all know how that often goes.

More generally, Git has encouraged us to move away from mono-repos, and I think that's a mistake. Note that mono-repo vs. multi-repo and monolith vs. microservices are completely orthogonal issues. You can have a mono-repo with all microservices implemented in that single repo. Alternatively, you can have a single service implemented across multiple repos. Git's constraints on scaling to larger codebases have created a new set of problems that need working around, but if you were to use a different solution like Mercurial or even a centralized version control system, the problem wouldn't exist in the first place.

When should you build for survival?


Previously, I wrote about building for survival vs. success. Briefly, when building for survival, your only goal is to get the product working for the specific use case; in contrast, when building for success, you are building to solve a bigger class of problems within the broader context of your solution space. In this post, I will talk about when you should build for survival, and when for success.

A Straw Man


On the face of it, it seems like an easy answer: "build for survival when survival is at stake; otherwise, build for success." Unfortunately, that answer hides a multitude of assumptions, and oversimplifies the real world within which the software development process operates. So, let's first break down the assumptions, and then address the oversimplification.

The assumptions

The first assumption here is the notion that we have a common understanding of what it means to say that a project/product's "survival is at stake". And the second assumption is that building for survival in all such cases will actually help.

Is your survival at stake?

Let's examine the first assumption: we can agree on what it means for survival to be at stake. Sure, in the extreme cases, we can all agree on this notion (e.g., a startup has a runway of 6 months, and additional funding depends on delivering an alpha in 3 months), but moving past that, things become a lot more subjective.

Consider a new project/product incubating within a well-established company such as Facebook or Google. Is its survival ever at stake? How about a project that involves building/dismantling infrastructure with a fixed, slightly aggressive deadline; would its survival be at stake? The answers to these questions are not always obvious, and they can be different depending on who you ask. They can differ depending on where you are in the organizational hierarchy; they can even differ among developers within the team.

Ok, so your survival is at stake. So should you build for survival?

Now on to the second assumption: when survival is at stake, building for survival is actually the right thing to do. Again, there are some obvious cases where this is the right call. But what about a case where the survival of your medical diagnostic software is at stake; would building for survival actually be the right thing (given that 'break things' is a corollary of 'move fast')? How about when you realize that you were a little too optimistic about what the product could accomplish with the limited resources you have; is building for survival still the right thing (ask Microsoft about Windows Vista)?

The oversimplification made here is that we only ever have two choices: survival, or success. This is almost never the case. You can almost always negotiate. You can negotiate on deadlines, on scope, on resources, on expectations, and on outcomes. Without taking all of the above into account, the discussion of survival vs success is meaningless.

So, when should you build for survival vs success?

While you have to evaluate every situation independently and holistically to determine which approach is the right one to take, here are some rule-of-thumb symptoms that suggest that you should be building for survival.

You are resource constrained, and failure is an option


When you are resource constrained, there is a good chance that you cannot afford the time, effort, and resources that a principled approach to software development demands. Recall that in such a case, I talked about renegotiating the original parameters and expectations. However, they are not always negotiable. (E.g., you might have only a few months of runway, and your investor might not be willing to fund you in case of milestone slippage.) In such cases, failure becomes a better option than renegotiation (almost vacuously). Here, building for survival makes sense.

Your environment is highly uncertain


High uncertainty is often a good trigger to build for survival. High uncertainty often requires you to 'fail fast'. If you are working on experimental technology, or on nascent problem spaces, there really isn't much to grok without actually building something and testing things out. In other cases, it might not be possible to know if you are solving the right problem; this happens often when your customers tend to "know it when they see it".

Your survival is more important than stakeholders' risks


This one is less obvious. It could well be the case that your survival is at stake, you are resource constrained, and there is no negotiating. However, there still are situations when you should not build for survival. One big situation is when the stakeholders' risks trump your survival.

The most egregious example I can think of is in medical technology. If the software you are building is for (say) medical diagnosis, and a wrong diagnosis can mean the difference between life and death for a patient, then you should never, ever, ever build for survival. From here, you can extrapolate to all other situations where your stakeholders' risk outweighs yours.

You are solving a one-time problem


This one is tricky, because solutions to one-time problems have a nasty tendency of sticking around a lot longer than they should. However, in principle, if you are writing software that is going to be used just once and then discarded, then you should consider building for survival. However, please ensure that that software will NOT persist past its primary use. Incidentally, if your work involves building prototypes and proof-of-concept work, then you are almost definitely building for survival.

There are multiple ways to enforce this: (1) do not put it into version control at all, (2) put the code in a new repo that is nuked on a timer, (3) prevent importing modules from this codebase to anywhere else, etc.

Can you think of any other situations where building for survival is warranted? Let me know in the comments.

Are you building for survival or success?


In my experience, the approach to building a software artifact often falls into one of two types: building for survival, or building for success.

When building for survival, your only goal is to get the product working for the specific use case(s) that will save your skin. In contrast, when building for success, you are building to solve more than just the immediate problem; you are setting up building blocks that are incidentally used to solve the immediate problem, but can also be adapted to solve a larger class of problems within the same context.

This post is not about when to choose what approach. Instead, it is about what each of the two approaches look like, and what purposes they serve. A subsequent post will talk about when I think each approach is appropriate.

In theory, specific circumstances should determine which of these two approaches ought to be used. Unfortunately, all too often, the developer's temperament determines the approach, and this IMHO is a mistake. I have seen consequences of such mistakes last through multiple years and impact the morale of multiple teams and engineers.

Building for survival


Building for survival often translates to 'being fast', taking shortcuts, and solving for the immediate use case. However, remember that when you do this, your software incurs a debt that will have to be paid eventually. Every incremental functional change you make on top of it incurs interest on the existing debt. Refusing to address it makes it incredibly difficult for your software to evolve and improve. This has a direct impact on your team's morale. Ask any team that is left supporting 'legacy' code, or has some 'black box', 'sacred cow' modules that no one understands, but are business critical.

What does building for survival look like?

How do you know you are now in the regime of building for survival? There are many clues to look for. I'll list three.

  • Typically, when building for survival, your deadlines are sacred. Think about everything that you or your company has had to do to meet GDPR deadlines. The odds are that all of that was done in the mode of building for survival.
  • The second clue is that you deem it more important that some specific use case works end-to-end than that things are done the 'right' way (what counts as the 'right' way is a topic for a whole other discussion). You see this often in early-stage startups that have an alpha customer who has promised to use their product/service for some specific purpose, and whose next round of funding is contingent upon demonstrating the utility of that product/service within that (isolated) context.
  • The third, and perhaps the strongest, clue is that to (the collective) you, the end product is more important than the software development process. If your engineering culture is to 'get things done' by 'hook or crook', then you are most definitely building for survival.

What does building for survival get you?

You survive, period. It gets you to where you want to be, within a reasonable amount of time, with potentially steep discounting of the future. There really isn't much to show for it beyond that.

What it doesn't give you

It is important to realize the trade-off you are making when building for survival, and not be under illusions.

  • For starters, do not mistake hitting your milestones under this approach for success. Sure, you may have succeeded in getting where you want to be, but that's not the end of the story.
  • Presumably, the software you just delivered is not going to be abandoned imminently. So, what you need is a path forward, and that is exactly what this approach will not provide. Building for survival does not necessarily tell you how and where to go next. It shines no light on the landscape of possibilities that could have been unlocked.
  • It doesn't tell you what else your artifact can be used for, or how it fits into the larger ecosystem. In the pursuit of 'moving fast', the odds are that you have built so many assumptions into your code that even cognitively extricating the underlying technological innovation from the business logic, and the business logic from the use cases, becomes challenging.

Building for Success


Building for success is a much more deliberate process that includes grokking the true context of the problem you are solving, and being critical of everything you choose to build. But it is important to be sure that you actually have such a luxury; otherwise, your software will likely become part of the vast graveyard of failed projects, products, and companies.

What does building for success look like?

There are lots of ways building for success is different from building for survival.

  • You deliberate before execution. You ask questions such as:
    • What is the problem we are solving?
    • Are we solving the right problem?
    • Is our proposal the right way to solve the problem?
  • You deconstruct the problem to understand the larger context and nature of the sub-problems you are solving. You tease out the incidental nature of how these sub-problems combine versus the essential nature of the overall problem to be solved.
  • Your execution is heavily informed by the aforementioned analysis. You apply the deconstruction and analysis to each sub-problem recursively until the actual act of writing the code becomes a rote exercise. The 'magic' and 'innovation' in your execution is really in how you compose such 'simple pieces of code' to solve your non-trivial problem across layers of abstractions (which are translated directly from your deconstructions).
  • The code paths within your subsystems and modules are constrained to the supported use cases, but that is the result of intentional plumbing across the data flow. Addition of new use cases and flows is often a matter of easily understandable and incremental changes.
  • Your control flow mimics your data flow within the system. (Unless there is a very good reason for it not to be the case.)

What does building for success give you?

Despite it not being the 'fast' way to build software, there is a lot to be said for building for success.

  • The deliberation process before you build should result in a decent understanding of the context within which you are solving your problem. This often means your team and your software are now in a much better position to solve more problems faster. Effectively, you have expanded your 'pie'.
  • Almost always, problems do not occur or manifest in isolation. They are part of a larger landscape of issues, utilities, and benefits. Deconstructing the problem through this lens will help you build a solution that is more likely to have reusable and sustainable components and modules, which lower the incremental effort associated with the evolution of your systems and their adaptation to solve proximate and associated problems.
  • A well thought out design allows you to shard your development across multiple developers. This will help in three ways:
    1. You can 'move fast' with concurrent execution.
    2. Each developer can work on multiple workstreams, and is less likely to be completely stuck.
    3. Your software's bus factor is much improved with more engineers on the code.
  • You can pivot better and faster because a lot of what you wrote is reusable and reconfigurable. You can migrate from one upstream dependency to another much more smoothly. A good composition-based design allows you to make disruptive changes without actually disrupting :)

What it doesn't give you

You will not have a quick start. You will be a little slower starting from square one. It will take time to start putting together code that actually does something real and concrete.

You are vulnerable to analysis paralysis. The bar for action is much higher when building for success. It takes a certain type of decisiveness, and ability to disagree and commit, to be able to flourish under this approach.

Object Composition for Service Migration

Object Composition is a very powerful and pervasive software design technique. Yet, paradoxically, it is an underutilized design pattern whose lack of usage is the root of many anti-patterns in software development. One that I continue to come across regularly has to do with not using composition to test and migrate a piece of software from one service to another.

Briefly, Object Composition is combining two or more objects to create more complex objects while keeping the definitions of the constituent objects unchanged (unlike inheritance, which extends those definitions).
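As a throwaway illustration (the class names here are mine, not from the post), composition looks like this:

// Composition: Car uses an Engine without changing Engine's definition.
class Engine {
    void start() { /* ... */ }
}

class Car {
    private final Engine engine; // composed, not inherited

    Car(Engine engine) { this.engine = engine; }

    void drive() { engine.start(); /* ... */ }
}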

Set Up

Say you have an existing backend service that your code currently uses. It has evolved over time to become a chimera that needs replacing, and you have a brand new implementation of that service that can replace your old service.
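For concreteness, assume a minimal ServiceClient interface along these lines (the Request and Response types are placeholders I am assuming; the original post does not spell them out):

// Placeholder types; the real Request/Response come from your service definition.
class Request { /* ... */ }
class Response { /* ... */ }

// Interface that both the old and new clients implement.
interface ServiceClient {
    Response process(Request request);
}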

Say the two client implementations look something like the following:

class OldAndBusted implements ServiceClient {
    @Override
    public Response process(Request request) {
        // Hacky code.
        if (request.type == A) {
            // Ugly code.
        } else if (request.type == B) {
            // Even uglier code.
        } else {
            // A monstrosity that needs to be killed with fire.
        }
        return response;
    }
}

class NewHotness implements ServiceClient {
    @Override
    public Response process(Request request) {
        // Best code ever written.
        return response;
    }
}

The goal is to migrate your code from using OldAndBusted to NewHotness. There are several ways to do this wrong. So it is easier if I demonstrate a right way to do this using Object Composition.

A right way

There are really three steps to such a migration.

  1. Verify equivalence: Shadow a percentage of your calls to the new service, log mismatches in the responses, and fix all such mismatches.
  2. Configure migration: Set up the service migration to proceed in phases.
  3. Migrate and clean up: Complete the migration and delete the old service.

Step 1. Verify equivalence

The goal here is to ensure that, before we start the migration, the new service is functionally identical to the old service. We accomplish this through composition of the old and new services, as sketched out next.

class ClientWithShadow implements ServiceClient {
    private final ServiceClient oldAndBusted;
    private final ServiceClient newHotness;

    ClientWithShadow(ServiceClient oldAndBusted,
                     ServiceClient newHotness) {
        this.oldAndBusted = oldAndBusted;
        this.newHotness = newHotness;
    }

    @Override
    public Response process(Request request) {
        // Always serve the response from the old service.
        Response oldResponse = oldAndBusted.process(request);
        if (shouldShadow(request)) {
            // Shadow the call to the new service and compare the results.
            Response newResponse = newHotness.process(request);
            if (!oldResponse.equals(newResponse)) {
                logMismatch(oldResponse, newResponse);
            }
        }
        return oldResponse;
    }
}

The pseudocode above always delegates calls to the old service and, if shadowing is required, additionally delegates to the new service and compares the two responses. It logs any mismatches it sees so that the developer can take a look and ensure they are addressed. Note that the old service's response is the one returned; the new service only shadows.

You simply replace all calls to OldAndBusted with calls to ClientWithShadow.
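Wiring this up is itself just composition at the call site; a minimal sketch (the construction details here are illustrative, not from the original post):

// The shadowing client wraps both implementations behind the same ServiceClient interface.
ServiceClient client = new ClientWithShadow(new OldAndBusted(), new NewHotness());
Response response = client.process(request);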

Step 2. Configure migration

After you have determined that the two services are indeed functionally alike, we can then prep for migration. Again, object composition helps us set this up cleanly.

Pseudocode for setting up such a migration follows. Here, I assume that there is a Config object that contains the migration-related config.

class MigrationClient implements ServiceClient {
    private final ServiceClient oldAndBusted;
    private final ServiceClient newHotness;
    private final Config migrationConfig;

    MigrationClient(ServiceClient oldAndBusted,
                    ServiceClient newHotness,
                    Config migrationConfig) {
        this.oldAndBusted = oldAndBusted;
        this.newHotness = newHotness;
        this.migrationConfig = migrationConfig;
    }

    @Override
    public Response process(Request request) {
        // The config decides, per request, which service handles it.
        if (migrationConfig.useNewService(request)) {
            return newHotness.process(request);
        }
        return oldAndBusted.process(request);
    }
}

You simply replace all instances of ClientWithShadow with MigrationClient. Yes, it really is that simple! The migration config has all the info it needs to figure out whether a given request should use the new service or the old service.
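As an illustration (this Config is hypothetical; the field and method names are mine, not from the original post), the config could be as simple as a percentage-based dial read from whatever dynamic configuration system you already use:

// Hypothetical Config: routes a percentage of requests to the new service.
class Config {
    private volatile int newServicePercent; // 0..100, updated from your config system

    boolean useNewService(Request request) {
        // Map the request onto [0, 100) and compare against the rollout percentage.
        int bucket = Math.floorMod(request.hashCode(), 100);
        return bucket < newServicePercent;
    }
}

Ramping the migration up (or rolling it back) is then just a matter of moving newServicePercent between 0 and 100.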

Step 3. Migrate and clean up

Here, we do the actual migration. We set up the config to slowly start shifting some of the load from the old service to the new one, while monitoring to make sure everything is going well. We can always roll back the migration by editing the config without actually modifying the code, which is a big deal here.

After the migration to the new service is at 100%, you can simply replace MigrationClient instances with NewHotness instances, and delete all the old code (OldAndBusted, ClientWithShadow, and MigrationClient). And you are all cleaned up. Profit!

So many wrong ways

Unfortunately, I have seen this done in way too many wrong ways.

  • I have seen use of inheritance to extend OldAndBusted to NewHotness, and some hacky switch inside the NewHotness implementation to do shadowing and migration.
  • I have seen hacky if-else modifications of OldAndBusted, with the new if-block implementing the NewHotness functionality.
  • I have seen developers skip shadowing entirely only to cause major service incidents.
  • Many more ways that are not that interesting, except for disaster tourism.

So, object composition is useful, it is powerful, and please use it more!

folly::Future, onTimeout(), and a race condition

TL;DR. The inability to cancel threads in C++ can result in bizarre semantics even in seemingly straightforward (and almost declarative) code. folly::Future is an interesting case in point.

Folly Futures is an async C++ framework from Facebook. It has an interesting function, onTimeout(), which essentially allows you to stop waiting on a Future forever. So you would typically use it as follows.

provider_.getOperationFuture(Request r)
    .then([&](Response response) {
      doFoo(); // Accesses variables in the surrounding scope
    })
    .onTimeout(milliseconds(500), [&] {
      doBar(); // Accesses variables in the surrounding scope
    })
    .get();

The semantics that I expected from this piece of code was the following:

if there is no response within 500 milliseconds, then
    the future throws a timeout, thus executing doBar()
else
    the future executes the then() block, thus executing doFoo()

Importantly, I was expecting exactly one of the two functions, doFoo() or doBar(), to be executed. And that turns out not to be true!

Race Condition

It turns out that the Future has a background thread waiting for the response, and this thread is not cancelled upon timeout because:

  1. This thread is spawned first, and it in turn waits on the timeout, and
  2. C++ does not support canceling threads.

So, we now have a race condition between the Future's response and the timeout: both callbacks can end up running, and since they capture the surrounding scope by reference, this can cause memory overruns and segfaults. How do you get around this? How do you use folly::Future with the semantics I outlined above?

Remedies

I found two possible ways around this.

Swap onTimeout() and then()

provider_.getOperationFuture(Request r)
    .onTimeout(milliseconds(500), [&] {
      doBar(); // Accesses variables in the surrounding scope
      return Response::onTimeout();
    })
    .then([&](Response response) {
      if (response == Response::onTimeout()) {
        return;
      }
      doFoo(); // Accesses variables in the surrounding scope
    })
    .get();

Essentially, you force the onTimeout block to return a special instance of the Response object (called Response::onTimeout() here); this then becomes the input to the then block, and within the then block you can check whether the response is valid and proceed accordingly. Yes, I know it's ugly. Worse, what if the Response object is complex enough that you cannot simply build a special instance of it? Or what if every possible instance of the Response object is potentially valid? Then you can go for the next remedy.

Open up onTimeout()

It is useful to remember that onTimeout() is just syntactic sugar for the following.

provider_.getOperationFuture(Request r)
    .within(milliseconds(500))
    .onError([](const TimedOut& e) {
      doBar();
      return Response::onTimeout();
    })
    .then(...);

So, you can use this to refactor your code to look something like this:

provider_.getOperationFuture(Request r)
    .within(milliseconds(500))
    .then([&](Response response) {
      doFoo(); // Accesses variables in the surrounding scope
    })
    .onError([&](const TimedOut& e) {
      doBar(); // Accesses variables in the surrounding scope
    })
    .get();

This essentially raises an exception after 500 milliseconds of no response, and that exception ensures that the then block is never executed! So, yeah, folly::Future can be tricky.