Applying Agile Techniques to Big Infrastructure Recovery

While we become more productive as people and companies, the administration of our lives is being transferred to computers. It is simply more convenient, scalable and competitive.

We talk about this a great deal, facilitated appropriately enough by computers on social media. People comment, signal their virtue, reveal their biases and attack the audacity of others on discussion platforms that will never forget.

But far more important things are also stored digitally, aspects of our lives that really matter: property rights, money and identity to name just three. If these are disrupted or their data hacked, it can throw our modern existence into utter disarray.

Multiple industries have arisen to protect our online world: anti-virus software, cyber threat hunting, and all types of computer security. These all try to detect, test and protect. Other approaches, built into application software itself, tolerate a limited amount of damage while remaining functional, as the internet and blockchain do.

But what if the bad guys succeed?

What if an attack is so well planned that they get through? Or what if they have sufficient resources and technology to cripple the major infrastructures on which we depend with multiple coordinated attacks?

It has happened already and will happen again.

Extortionists, cyber terrorists and aggressive, slighted governments are among the bad actors that want to see the failure of anything that is established. This includes nation states, enterprise infrastructures and other critical facilities.

Big Infra

Our focus turns to protecting big digital infrastructures. The scope of big infra is the whole concern of a data centre: all the equipment in it and the software running on it. It represents a considerable investment that needs to be fully utilised from day one and then kept that way.

But it is also fragile.

It is all too easy to break through misconfiguration, hardware failure or environmental damage.

Agile Maybe?

For Big Infrastructure to work, it is planned within an inch of its life.

Upgrades, downgrades, contingency plans, ‘refreshes’: all aspects of the lifecycle are covered by spreadsheets, planning committees and business plans.

It is a considerable task to keep a description of an infrastructure up to date and accurate. There can be many participants and many changes, traditionally requiring a permanent central team for maintenance, large enough to cope with the workload.

But is it possible that we can use agile techniques to increase accuracy and improve feedback? Can we adapt some of the lessons learnt from software engineering?

The good news is that it is entirely possible to simplify the status quo: to produce a system that increases the direct participation of stakeholders and can feed back status automatically, and to handle greater complexity while limiting the size of the core team.

Something Borrowed

Two aspects are borrowed from the world of agile DevOps, modified to meet the requirements of infrastructure and adapted to take on a more significant systems integration role.

Firstly, a dependency database that relates the major components to one another. This is similar to the top-level build systems in software, with a source repository (like GitHub) and dependency mapping held in code.

Secondly, a method of testing, so that everything new and everything existing is checked as working automatically and frequently. It mimics the idea of Continuous Testing and should spot issues quickly on every build event.

Using these techniques, we can maintain the relationships between services and the ability to test that these connections are correct and working. The same information can also be used to start services in sequence, which is useful for establishing basic infrastructure and services.
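As a rough illustration (the service names and structure here are hypothetical, not a prescribed model), the dependency database can be held as a simple adjacency map and walked in topological order to bring services up in sequence:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services it relies on.
dependencies = {
    "dns":      [],
    "storage":  [],
    "database": ["storage", "dns"],
    "app":      ["database"],
    "web":      ["app", "dns"],
}

# graphlib (Python 3.9+) yields a start order in which every service
# appears only after everything it depends on.
start_order = list(TopologicalSorter(dependencies).static_order())
print(start_order)  # e.g. ['dns', 'storage', 'database', 'app', 'web']
```

The same map, read in reverse, also gives an orderly shutdown sequence.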

In addition to these, log analytics should be used to gather management information and allow the system to scale. Appropriate levels of detail can be delivered to all participants, immediately reflecting changes against a state of readiness.

Mutual Dependency

Looking at the dependency database in more detail reveals a graph of every technology component. Each component needs to explain how to construct (or reconstruct) itself, which in turn supports continuous testing and integration.

It needs to list the services (running autonomous instances) and environmental aspects on which it relies. It also needs to show how to test the resulting cluster adequately.
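A component's entry might therefore carry its dependencies, its environmental needs, how to reconstruct it and how to test the result. The following descriptor is a minimal sketch; the field names and commands are assumptions for illustration, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    depends_on: list[str] = field(default_factory=list)   # services it relies on
    environment: list[str] = field(default_factory=list)  # e.g. network zones, storage, DNS
    rebuild: list[str] = field(default_factory=list)      # commands to (re)construct the service
    tests: list[str] = field(default_factory=list)        # commands that prove the cluster works

# Illustrative entry, maintained directly by the owning service team.
customer_db = Component(
    name="customer-db",
    depends_on=["storage", "dns"],
    environment=["prod-network", "backup-vault"],
    rebuild=["provision.sh customer-db", "restore_backup.sh customer-db"],
    tests=["smoke_test.sh customer-db", "recovery_test.sh customer-db"],
)
```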

The service owners should directly maintain the dependency database to improve accuracy. It also helps with scale and reduces the need for a large administration team.

Two corporate functions can make use of this resource: the more traditional Disaster Recovery and the newer concept of establishing the organisation from scratch.

The latter technique is essential when the bad guys have won and we are unable to run our business. The ability to ‘reboot’ the company (or the parts that were damaged) turns defeat in a war into the loss of a single battle. Damage is taken, but the enterprise lives to fight on.

Test Early, Test Often

The other crucial aspect is testing.

With software, testing is critical but straightforward. It is easy to wire up code checks on source check-in or to create test instances to evaluate your code. Teams of developers can work in parallel, and with cloud, additional resources can be generated dynamically for deep or lengthy checks.

The ubiquity of testing allows an entire suite of checks to be carried out every time, even if success is expected. Unconditionally running all the tests every time is simple and allows us to catch unexpected failures.

The Trouble with Big Infrastructure

With infrastructure, the resources are much harder to come by, so testing in the same way as software development is difficult or sometimes impossible.

There are three major approaches we could take.

Firstly, infrastructure for all: a complete set of test infrastructure that mirrors the whole of production and any other facility that needs checking. This is required if you wish to perform end-to-end testing in real time. In essence, you are creating a complete clone of the organisation.

This option is expensive and difficult to justify, as it does not offer the same freedom as its software-only cousin. With only one set of hardware, it is not possible to run multiple instances of tests simultaneously.

The second approach is to provide virtual images for testing those services that run on physical hardware. These could be created with a ‘p-to-v’ conversion if you are not already virtual, and would be enough to test logic but perhaps not the full volume. Where this is possible, it makes for a much more flexible environment, with the option of running tests in parallel. It is also the most economical solution and scales with the testing profile; when you are not testing, you don’t get billed for equipment.

The third and final approach is to divide the enterprise into a set of test groups and buy hardware that can run every group, one at a time. Each test run would involve many services and would check out resources from the pool of equipment. Other tests requiring the same hardware would need to be held in a queue until the resources became ready.
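A minimal sketch of this scheduling, assuming a single shared set of test equipment and illustrative group names; a real test group would provision hardware and run recovery tests rather than print:

```python
import queue
import threading

hardware_pool = threading.Semaphore(1)   # one shared set of test equipment
pending_runs = queue.Queue()

def run_test_group(group):
    with hardware_pool:                  # wait until the equipment is free
        print(f"checking out hardware for {group}")
        # ... provision the group's services and run its recovery tests ...
        print(f"releasing hardware after {group}")

def worker():
    while True:
        run_test_group(pending_runs.get())
        pending_runs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Later groups simply wait in the queue until the resources become free.
for group in ["payments", "trading", "reporting"]:
    pending_runs.put(group)
pending_runs.join()
```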

Reducing the amount of hardware needed decreases the cost, while still allowing all the tests to be executed eventually, which is an advantage in many situations. Creating a composite from all the groups would yield an answer for many enterprises. The disadvantage is that a complete enterprise end-to-end test is not possible, and a full run through every group would be much slower.

The remaining question is whether a composite of multiple test groups is a valid test. Work would need to be carried out to make sure this approach would provide a viable recovery strategy.

Real Time Feedback

While potentially slower than software testing, it is still important to show results quickly. Therefore, an additional proposal is the continuous monitoring of service status logs and their real-time analysis. The conclusions can be pushed as alarms to participants and stakeholders. More senior management would be interested in a report presented as dashboards, showing how ready the organisation is to react to adverse events.
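As a hedged sketch of that analysis, assuming test results arrive as one JSON object per log line (both the log format and the alert threshold are assumptions), a readiness figure can be derived and pushed when it falls below target:

```python
import json

# Sample log lines standing in for a real log-management feed.
sample_log = [
    '{"service": "customer-db", "passed": true}',
    '{"service": "orders-api", "passed": true}',
    '{"service": "web-frontend", "passed": false}',
]

results = [json.loads(line) for line in sample_log]
passed = sum(1 for r in results if r["passed"])
readiness = 100 * passed / len(results)

print(f"Recovery readiness: {readiness:.0f}%")   # feeds a management dashboard
if readiness < 90:                               # alert threshold is an assumption
    print("ALERT: readiness below target, push to service owners")
```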

Confidence in a recovery system stems from being able to interpret test status and track it quickly in real time.

How it Works

Following the Agile and DevOps ethos, each service team should take control of their app and profile it directly in the graph database. They should work out not just the recovery model but also the testing model. Ideally, they would be able to make use of virtual resources to test, which would allow scaling.

Third-party and non-application components such as networking equipment, cluster hosts and storage systems also need to be considered in most recovery plans. Each of the owning operations teams needs to take ownership of how they record the information and how they recover.

Whenever there is a new service release, a recovery and integration test should be scheduled. However, a synchronous test process may take too long, as recovery-style tests are often more involved and work at a higher level. Additionally, there may be multiple service changes to consider in the test process.

Consequently, it is assumed that many will opt for an asynchronous approach or even a hybrid, depending on criticality. The primary functionality is not affected by the nature of asynchronous testing.

An execution system is needed to carry out the recovery testing, which will traverse the dependency graph set up by the service owners. When one or more releases are ready to be evaluated, all precursors are extracted and established into a running state. Then the candidate services can be staged, and the high-level recovery tests carried out. Finally, an orderly shutdown of the resources taken by the test can be carried out.
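Under the same assumptions as the earlier sketches, the execution flow might look like the following; start_service, run_recovery_tests and stop_service are hypothetical stand-ins for real provisioning and test hooks:

```python
from graphlib import TopologicalSorter

# Illustrative dependency map maintained by the service owners.
dependencies = {
    "storage": [],
    "dns": [],
    "customer-db": ["storage", "dns"],
    "orders-api": ["customer-db"],
}

def start_service(name):      print(f"starting {name}")
def run_recovery_tests(name): print(f"testing {name}"); return True
def stop_service(name):       print(f"stopping {name}")

def evaluate_release(candidate):
    started = []
    try:
        # Establish precursors in dependency order, then stage the candidate.
        for svc in TopologicalSorter(dependencies).static_order():
            start_service(svc)
            started.append(svc)
        if not run_recovery_tests(candidate):
            print(f"recovery test failed for {candidate}")
    finally:
        # Orderly shutdown of everything the test run brought up.
        for svc in reversed(started):
            stop_service(svc)

evaluate_release("orders-api")
```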

The results from the test can be made available using log management tools, with their consoles and proactive push events. If there is an adverse result, the environment, configuration, code and other aspects can be updated until there is a clean test. In most situations, this can take place independently of the successful running of an existing service, although it will need to be sympathetically implemented.

Recovery is a Public Good

To close the detailed aspects of this article: I have discussed how we can borrow continuous integration and agile techniques from modern, mainstream software development and introduce them into traditional IT recovery, adding to and even replacing some aspects.

There are challenges, of course, and the precise dogma may have to be changed. But the principles involved show that it is possible to:

  1. Increase the accuracy of the recovery process through self-administration
  2. Reduce the number of people required in a central recovery function with the same mission
  3. Improve confidence of successful recovery through ‘near’ continual testing
  4. Implement recovery dashboards for senior management, showing recovery readiness
  5. Reduce overall recovery time through practice and automation

There are costs to these techniques, chiefly the expense of sufficient recovery infrastructure, impacting capital and P+L budgets. Service teams will also have to take on the maintenance of, and responsibility for, their part of the recovery puzzle. However, the final advantage in the list, the reduction of overall recovery time, comes with a massive benefit. The boon of restarting the enterprise quickly and reactivating the franchise before people turn away cannot be overstated.

As security threats grow and state actors with cyberattack capabilities become more numerous, we have to take the view that, very occasionally, the bad guys will win. Add to this the threats from insiders, or even the defeat of certain key mathematical or algorithmic reliances, and no security situation can be 100% risk-free.

It is clear that this approach is not only profitable and enterprise-saving, but in the days of digital reliance, it is also a public good.

About the Author

Architecture, Strategy, Innovation, PM. Follow me on Twitter @nigelstuckey

System Garden

Agile Infrastructure for Enterprise DevOps. Design from diagrams, document and deploy to your cloud.
systemgarden.com, Twitter @systemgarden


© 2018-9 - System Garden