Site Antifragility Engineering

This is part 2 of a 3-part series on things I’ve learned working in an incubator together with a high-performing, truly agile software development team.

When I was working at an incubator a while ago, the distributed software teams released new features to production a few times a week. New functionality was released continuously, whenever the team assessed that it was ready for the public. It was the responsibility of the teams to coordinate releases among themselves, and also their collective responsibility to ensure the stability of the product. In other words, and as explained in Part 1, the teams were empowered by ownership of the release management pipeline.

After about a year, when the incubator was “graduated” into the sponsoring corporation, a top-down decision was made to show complete trust in the team’s own abilities to manage themselves and nothing whatsoever about the release pipeline was changed by the management.

The end.

Just kidding

Shortly after the takeover, engineers were called into a meeting with a new Career Manager™. The manager explained that since the FinTech product was now mature enough to be part of the larger corporation, there was “brand risk” to consider. The corporation could not afford the risky Wild West style of the current release process, and the teams now had to conform to the one dictated by IT management.

The team was then presented with a slide. Without boring you with too many details, just imagine a large circular diagram that starts with the words “new feature”, continues on with concepts such as stakeholder sign-off, risk assessments, (manual) user tests, regression tests, promotion to staging environments, coordination with infrastructure teams, freeze periods, deployment windows, go/no-go meetings, steering committees and so forth. At the completion of the circle, six months later, the diagram ended with a figure saying “working release to production”.

At least in theory.

Having multiple teams all releasing things whenever they felt like it was an extreme risk which had to be mitigated as soon as possible. One cannot trust the complicated process of enterprise-grade release management to meagre footmen. Proper enterprise release management of complicated systems demanded the introduction of The Coordinated Release Cycle: one of the biggest misconceptions about release management still flourishing in the corporate software industry.

The Career Manager™ of the Month ended the introduction by issuing a statement which I have never forgotten:

This will not mean much to your daily work. I do not believe in micro-management. Unless production errors increase – then you’ll start seeing more of me.

In one single sentence, the window into the mind of the management was revealed: the solution to errors is control, and more errors equals more control.

About a year later, the release of new features had almost come to a complete halt. It probably didn’t help that another top-down decision was made to limit releases to once every six months. Despite this, the average time from when a feature was finished until it was in production was often a lot longer than that. Worse still, many production releases ended in rollbacks due to unforeseen or unchecked bugs, and production errors still happened regularly. So what happened?

The Coordinated Release Cycle

Imagine you have made a back-end feature A and a related front-end feature B. In a coordinated release process, one cannot release feature A to production without also releasing B, since they are interconnected. Before a release to production can be done, features A and B must go through the various bureaucratic steps described above, which are created to act as the ultimate assessor and mitigator of the risk of your release breaking something in production.

If the risk of releasing a feature is deemed too high by the risk assessment controls in place and the time it takes to mitigate these risks exceeds the time until the next freeze period, neither feature A nor B will be released until the next cycle six months later.

If, however, fortune favoured the bold who developed the features, and the release request was successfully transitioned through all the controls, features A and B will be bundled together with every other feature that has been developed by other teams and accepted for release since the last freeze period. The result is a massive bundle of code and features which will all be released at the exact same moment sometime later: a Big Bang Release.

The Big Bang Release

A big bang release can be massive, and not just in terms of code. Often, a new feature will also require changes not directly related to the software of the product: for instance, updates to documentation, or changes to databases and other kinds of infrastructure surrounding the product. In this case, responsibility for infrastructure changes was taken away from the software teams and put in the hands of a completely detached infrastructure team, who came with a substantially different view on site reliability methodologies and sustainable infrastructure design for their on-premise cloud environment.

This also meant that rapid reaction to change was severely hampered, since changes to databases, new servers and the like, as well as insight into the logs and metrics of infrastructure already in place, had to be requested through the corporation’s already established support systems.

Therefore, if even a minor thing went wrong during a deployment, everything had to be rolled back and aborted until the next available time slot. Per policy, releases had to be done in the early morning hours and often on weekends, further inhibiting emergency response capabilities.

If the big bang release did happen without problems, it was usually a warning sign, as production errors would then surface a few hours later, when traffic to the site increased towards peak hours. It is a classic scenario: code developed locally works fine, since the developer is the sole consumer, but crashes in a flaming inferno when exposed to the traffic of the masses.
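If anything, the pattern argues for rehearsing that kind of traffic before release day rather than discovering it at peak hours. Below is a minimal sketch of such a rehearsal, assuming nothing about the actual product: it simply fires concurrent requests at a staging endpoint and counts the failures. The endpoint URL, request count and concurrency level are made-up placeholders.

```python
# A minimal sketch of rehearsing "traffic of the masses" before release:
# fire concurrent requests at a staging endpoint and count the failures.
# TARGET_URL, REQUESTS and WORKERS are hypothetical placeholders.
import concurrent.futures
import urllib.request

TARGET_URL = "https://staging.example.internal/api/health"  # placeholder endpoint
REQUESTS = 500
WORKERS = 50

def hit(url: str) -> bool:
    """Return True if the endpoint answered 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:  # timeouts, connection errors, bad responses: all count as failures
        return False

def main() -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = list(pool.map(hit, [TARGET_URL] * REQUESTS))
    failures = results.count(False)
    print(f"{failures}/{REQUESTS} requests failed at {WORKERS}-way concurrency")

if __name__ == "__main__":
    main()
```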

Since a profusion of features were deployed simultaneously, pinpointing which one of them was the culprit could be time consuming and quite complicated, and a full rollback was usually needed to buy time.

Rolling back schema changes to a database that has already been running in production can be a nerve-racking ordeal, which can potentially lead to loss of data for the end user. It is often easier to keep production burning while the people on watch scramble like F-16s, attempting to wake up sleeping developers and submitting panicked support tickets to the infrastructure teams.

Courage under fire

As The Career Manager™ had foretold, when a crisis happened the teams would start seeing more of the management. The blame game started, and the events were interesting to observe: if a production environment went down like the Hindenburg, rationality dictated that oversight of the release process must have been too lax. It had to be investigated how to better keep developer-daredevils and erratic release managers on a tighter leash, since they obviously were not mature enough to handle the demanding challenges of corporate-grade enterprise systems.

After all, it is not too much to expect talented developers to write bug-free code that does not cause production blow-ups. As promised, this meant more control: prolonged freeze periods, more risk assessments, more meetings about release coordination, more risk-prediction forms to fill in and more stakeholders to provide sign-offs.

This caused developer demotivation and a feeling of being punished for following a task from inception to production, instead of a sense of reward for a job well done. Furthermore, it dissuaded developers from making smaller, yet very important, continuous fixes to the code, such as refactoring, paying down technical debt or fixing simple issues like documentation typos, since it all involved the risk of scrutiny, blame and reduced freedom.

Take this example: say a production bug is introduced by a developer forgetting to update a database schema in production. It happens. I know many companies that do not want database schema changes automatically propagated upwards from development environments, or that have not yet gotten around to being purely infrastructure-as-code compatible. This causes management to introduce a tightened risk mitigation control, where developers are no longer allowed to make database changes themselves. Instead they have to fill out a change-request document, which is sent to a newly appointed person responsible for infrastructure changes, as well as a newly appointed person responsible for database infrastructure: someone dedicated to coordinating between developers and database operators.

This also creates a tunnel vision of impending threat from schema changes, causing everyone to focus entirely on one highly specific type of risk, when in reality the next bug is probably something completely different. Where the situation previously was one developer with one problem, there are now three people with three problems. Not only must everyone remember to fill out the new forms, they must also ensure that the other two are doing their jobs before the third can do theirs. Otherwise the chain snaps, and a lapse of attention causes yet another production breakdown.

If that happens, a new decree will be enforced with even tighter control, perhaps introducing a new manager in charge of coordinating between the managers of other parts of the release coordination effort. This creates an infinite loop of ever tightening control and micro-management of the release process, in turn causing an even more error-prone and fragile production system, which in turn causes more management oversight. Do you see where this is going?
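For contrast, the purely infrastructure-as-code alternative mentioned above does not have to be elaborate. Below is a minimal, hypothetical sketch of schema changes expressed as versioned files that travel through the same review, test and release pipeline as the feature itself; the migrations/ directory layout, the table name and the use of SQLite are assumptions chosen to keep the example self-contained, not a description of the product in this story.

```python
# A sketch of "schema changes as code": numbered SQL files applied in order,
# with the applied versions recorded in the database itself, so every
# environment (developer laptop, staging, production) converges on the same
# schema. Directory, file naming and SQLite are illustrative assumptions.
import sqlite3
from pathlib import Path

MIGRATIONS_DIR = Path("migrations")  # e.g. 001_create_users.sql, 002_add_email.sql

def apply_pending_migrations(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    for script in sorted(MIGRATIONS_DIR.glob("*.sql")):
        if script.stem in applied:
            continue  # this environment already has the change
        conn.executescript(script.read_text())  # apply the schema change
        conn.execute(
            "INSERT INTO schema_migrations (version) VALUES (?)", (script.stem,)
        )
        conn.commit()
        print(f"applied {script.stem}")

if __name__ == "__main__":
    apply_pending_migrations(sqlite3.connect("product.db"))
```

Because the migration files are reviewed, tested and released together with the feature that needs them, the “forgot to update the schema in production” class of bug disappears without adding a single coordinator.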

The myth of predictability

The Coordinated Release Cycle takes root in a fallacy that permeates many levels of IT management in enterprise-level corporations: the idea that risks can be predicted, or even controlled. The tightened control of the release process is above all put in place to predict specific future disasters and to design the system so it can withstand those specific disasters.

The only problem is that no matter how hard you try, nobody can predict the next disaster. Many claim to, such as investment bankers, weather forecasters, insurance and real-estate agents, and so forth. But that still hasn’t kept some of the most cataclysmic events in history from going entirely unpredicted by each and every expert in disaster prediction. Notable examples include the 2008 global financial crisis, the Fukushima Daiichi nuclear disaster, Hurricane Katrina and World War I. Surely all disasters that, had they been predicted, everyone would have had an interest in avoiding.
One thing all these calamities have in common is that after they had occurred, they were all used by the aforementioned risk-prediction experts as examples of worst-case scenarios, which then act as benchmarks far and wide for predicting future risks. Such as when Real Money’s Jim Cramer offers his bit on The Worst-Case Scenario for Dow Jones, making his predictions about the future of the financial markets based on the historically worst-case scenarios of the past.

The blatantly obvious and yet mostly ignored contradiction in these kinds of risk assessments is that in every single case, without exception, the event presently thought to be the worst-case scenario was, when it happened, an event that exceeded the previous worst-case scenario of its time.

The opposite is also true: in late 2013 the Bitcoin cryptocurrency reached the unprecedented height of $1,000 per coin, which was followed shortly thereafter by the bankruptcy scandal of the Mt. Gox exchange in early 2014, causing the value to plummet to around $300. Financial futurologists all over the world jumped at the chance to declare Bitcoin dead and announce that the golden days were over, as $1,000 was now the de facto best-case scenario which investors would never see again. All that was left for the hopeful day-trader investors was to try to become rich inside the $300 to $1,000 window.
In December 2017, the price of one Bitcoin was about $20,000, before it crashed to about half that a short while later, causing yet another batch of futurologists to come down with collective amnesia and declare that the new all-time best-case scenario had passed. All that was left for the hopeful day-trader investors was to try to become rich inside the $1,000 to $20,000 window.

Antifragility

That finally brings me to the point: you can’t predict disasters, so stop trying. Instead, try to assess how well your system responds to disasters when they occur.

The term antifragility was coined by author, scholar, ex-trader, ex-risk analyst and all-around amazing guy Nassim Nicholas Taleb in one of my favourite philosophical works of the 21st century: Antifragile: Things That Gain from Disorder. In short, it means the opposite of fragile: where the fragile is harmed by damage, the antifragile grows stronger from it. It is not to be confused with resilience or robustness, which, though undoubtedly preferable to fragility, do not gain from adversity or damage. Antifragility does.

Taleb highlights the curious case of humanity’s ceaseless attempts to mitigate risk and volatility and to protect our lives from them through attempts at prediction, despite the fact that this goes directly against the survival strategy of the longest-living systems that surround us. For example: species evolve through random mutations, the human body grows stronger from physical stressors, and forests are renewed by small fires.

And so on.

Antifragility in Software Engineering

The point is that small stressors and volatility provide key insights into the overall stability of your system, whereas large-scale, unpredictable, cataclysmic events do not. Life on Earth does not improve from a direct meteor hit, but small and random changes to its DNA through mutations will. Your muscles do not get stronger by trying to lift an elephant, but perhaps an 80-pound barbell will. The complete incineration of the Amazon does not help nature, but small fires do.

This is also true for your IT infrastructure. When you submit a PR for review, many companies have automated test suites that try to break your code and detect weaknesses. If they find any, your software becomes better. Developers who are more experienced in the system than you are will review your code, perhaps asking you to do things differently. This will probably stress your sense of self-confidence, but the overall system will benefit in the form of your increased competence.
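To make “tries to break your code” concrete, here is a deliberately tiny sketch of such a test. The parse_amount function and its edge cases are invented for illustration; the point is simply that every weakness a hostile test uncovers here is one that never reaches production.

```python
# A tiny example of a test that actively tries to break code rather than
# confirm the happy path. parse_amount and its edge cases are hypothetical.
import random
import string

def parse_amount(text: str) -> int:
    """Parse a monetary amount in cents from user input such as '12.50'."""
    cleaned = text.strip().replace(",", ".")
    whole, _, frac = cleaned.partition(".")
    return int(whole) * 100 + int((frac + "00")[:2])

def test_parse_amount_survives_hostile_input() -> None:
    # Known tricky cases first ...
    for case in ["", " ", "-1", "1,5", "0.999", "NaN", "1e9"]:
        try:
            parse_amount(case)
        except ValueError:
            pass  # rejecting bad input loudly is acceptable behaviour
    # ... then a storm of random garbage: any exception other than ValueError
    # would propagate and fail the test, i.e. reveal a weakness.
    for _ in range(1_000):
        garbage = "".join(random.choices(string.printable, k=random.randint(0, 12)))
        try:
            parse_amount(garbage)
        except ValueError:
            pass

if __name__ == "__main__":
    test_parse_amount_survives_hostile_input()
    print("no unexpected crashes")
```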

Unfortunately, automated tests can only assess the risks that the developer who wrote them imagined could happen. As history has shown us, even the biggest players in the IT world have not found a solution for 100% disaster-free software. In 2013, a series of unfortunate events caused Google to go offline, dropping global internet traffic by 40%. In early 2017, an alleged typo caused Amazon’s S3 storage servers to go down, breaking half the Internet. And finally, in one of my favourite examples of unpredictability, an unlucky JavaScript developer inadvertently brought down even more of the Internet by pulling a tiny package from NPM.

Despite all this, people responsible for their corporation’s uptime still obsess about predicting and assessing the risk to their infrastructure’s stability by exercising tight and rigorous control of the release process.

Chaos and control: The false dilemma

For The Career Manager™, the primary motivator is the ascent of the corporate ladder while avoiding blame and responsibility. The best way to do that is to assess the immediate gains of a given action while disregarding the long-term consequences. In the case of corporate infrastructure, this means tightening and centralising control. In the view of a Career Manager™, the situation is binary: either a situation is chaos and anarchy (self-managed teams) or it is order and discipline (top-down control).

As previously stated, this false dilemma is easily disproved in our own society, which offers a plethora of examples where the exercise of strict control resulted in massive disasters. Looking at previous and existing nation states, it is easy to see that the nations which exercised the most autocratic control are the ones most prone to destruction through revolution, after going through never-ending loops of ever-tightening control (e.g. Czarist Russia, the Shah’s Iran, French Haiti, French Indochina). On the contrary, countries governed in a more distributed, bottom-up manner usually tend to be relatively stable (e.g. Switzerland, Denmark, Finland). In the first category, small stressors and volatility (e.g. public dissidence, demonstrations, opposition party activity, critical media) were almost non-existent, but giant, unpredictable disasters (e.g. violent revolution, civil war, coup d’état) were not. In the latter category, we observe the opposite.

What most Career Managers™ fail to see is the important fact that volatility equals information, not weakness. If you deploy a small, isolated feature to production and it reveals a bug, this provides you with key information about the overall ability of your system to handle failure. If you submit 20 features to production all at once, you are gambling with unfavourable odds.

Suppressing volatility causes the system to increase in fragility while hiding risks. The sole purpose of The Career Manager™ is to stabilise the system by inhibiting fluctuations (i.e., small errors) until he is promoted and moved somewhere else. The result is the exact opposite. Such environments end in massive blowups that catch everyone off guard, can potentially undo years of work (think broken databases running in production for prolonged periods), and always leave the system in a worse state than before.

When such collapses happen, the blame is rarely placed on fragility, but on poor forecasting. It cannot be the control system that was broken; it must be the individual members of said system who were incompetent at predicting the disaster. How many times have you heard these words uttered after a failure has occurred:

How did we not see this coming?

What they should have been asking was:

How could our system have been this fragile?

The good news

This may all sound very draconian. The good news is that a system that can grow from mistakes is really not that difficult to create. Take your own system and think about these three points:

1) It is a lot easier to assess how fragile your system is than it is to predict the next disaster

You cannot state which event is most likely to happen, but you can state that one system is more fragile than another, should a certain event happen. How does my system react to a failure in the code? How does my system bring itself back up after a failure? How does my system roll back after failures? Does it do so without human intervention? How does my system handle partial failures? Can some parts of the system still be up while other parts are down? How can my system improve overall after a failure has happened? Do we respond to incidents with tighter control or with greater insight? Can we do a decoupled release of front-end and back-end features without actually changing what the user sees? Do we have feature flippers, or something that works like them (a minimal sketch follows below)?
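As a hint of what such a feature flipper can look like, here is a minimal sketch: the back-end half of a feature ships to production dark, and the front-end half is switched on later by flipping a flag rather than by deploying again. The flag name, the JSON file and the fail-closed fallback are assumptions for illustration only.

```python
# A minimal feature-flag sketch: code ships to production "dark" and is
# switched on later by editing a flag file, not by deploying again.
# The flag name and file location are hypothetical.
import json
import os
from pathlib import Path

FLAGS_FILE = Path(os.environ.get("FEATURE_FLAGS_FILE", "feature_flags.json"))

def is_enabled(flag: str, default: bool = False) -> bool:
    """Read the flag state at call time so it can be flipped without a redeploy."""
    try:
        flags = json.loads(FLAGS_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return default  # fail closed: a broken flag file never enables a dark feature
    return bool(flags.get(flag, default))

def render_dashboard(user_id: str) -> str:
    if is_enabled("new_payments_widget"):
        return f"dashboard for {user_id} with the new payments widget"
    return f"dashboard for {user_id} (old layout)"

if __name__ == "__main__":
    print(render_dashboard("demo-user"))
```

With something like this in place, feature A and feature B from the earlier example no longer have to ride the same release train: the back end can go out first, dark, and the front end can follow whenever it is ready.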

2) Don’t try to change the world; make your system more robust to defects, or even able to exploit the errors.

What did previous errors tell me about the fragility of my system? Did the system display a single point of failure at any point in time? Furthermore: if I have not detected an error in my system for a given amount of time, how well does my system report errors back to me? How well is my system continuously supervised? Do I have sensible alerts and metrics? Are my firefighters drowning in false positives, or are all detected incidents actual incidents? Could we use a gradual rolling deployment of new features with blue-green deployments, feature flippers and automatic rollback policies (sketched below)? Could we create automated tests of infrastructure changes the same way we create automated tests of code?
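As one possible shape of such an automatic rollback policy, here is a sketch that polls a health endpoint after a deployment and rolls back without waiting for a human once an error budget is exceeded. The endpoint, the thresholds and the rollback script are placeholders, not a description of any real pipeline.

```python
# A sketch of an automatic rollback policy: after a deployment, poll a health
# endpoint and roll back without human intervention if too many checks fail.
# HEALTH_URL, the thresholds and the rollback script are placeholders.
import subprocess
import time
import urllib.request

HEALTH_URL = "https://example.internal/healthz"  # hypothetical endpoint
CHECKS, INTERVAL_S, MAX_FAILURES = 10, 30, 2     # assumed policy, tune to taste

def healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:  # timeouts, connection errors, bad responses: all unhealthy
        return False

def watch_and_maybe_roll_back() -> None:
    failures = 0
    for _ in range(CHECKS):
        if not healthy(HEALTH_URL):
            failures += 1
            if failures > MAX_FAILURES:
                # Placeholder command; in practice this would point the load
                # balancer back at the previous (blue) environment.
                subprocess.run(["./rollback_to_previous_release.sh"], check=True)
                return
        time.sleep(INTERVAL_S)
    print("deployment looks healthy, keeping it")

if __name__ == "__main__":
    watch_and_maybe_roll_back()
```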

3) Antifragility is evolution

Move your focus away from:

Why did this major disaster happen and who’s to blame?

To:

Why did we build a system so fragile to unforeseen events?

Are the people responsible for your database administration completely detached from the people responsible for your system’s back end? Is the person responsible for making the deployment a different person from the one who made the feature? Are they communicating on a regular basis? Are they aligned on how to do deployments? Are you required to shut down your production environment in order to deploy a new version? Do I lead by fear and control instead of trust and empowerment? Am I a Career Manager™?

Attempts are being made to make this way of thinking more mainstream, such as Chaos Engineering, Antifragile Manifestos and Antifragile Cloud Infrastructure, among others. But the world of corporate IT is for the most part still not convinced. Its Career Managers™ still spend most of their time trying to understand the ordinary, while the real world keeps producing events of massive and unpredictable scale. And even after such events occur, they usually feel that they almost predicted them, because the events are retrospectively explainable.

Human beings have tried to make these predictions for as long as there has been IT, and yet Amazon and Google still manage to crash their systems every now and then. What is unmeasurable and unpredictable will remain so, no matter how many Career Managers™ you hire to change it.

Up next

This leaves us with the third and final piece in the trinity of technological takeaways: transparency. In Part 1, I theorised about the challenge of pulling classical corporations into the fold of modern software development and out of the comfort of the known. In this piece, I have expanded on that issue by highlighting the limiting belief that exercising tight control will make your overall system more resilient, when the truth is the exact opposite. If the companies doing sprint-based waterfall development are truly serious about staying in the game, I argue that they can do so with relative ease by implementing the points expressed here and in Part 1. And a shortcut to doing so efficiently is to radically change their internal mindset and communication through transparent decision-making. That’s coming up in Part 3.
