The recent Microsoft Azure outage prompted me to collect some thoughts on the challenges of upgrading cloud infrastructure.

The inherent challenge of cloud infrastructures is also their biggest advantage – a consistent landscape of compute, storage and networking resources that is elastic for its subscribers. The more common the infrastructure is, the better the economies of scale for the subscribers and the provider. The challenge comes when pieces of the infrastructure are supposed to be – or need to be (think of the recent security risk in VMs) – upgraded. So it makes sense to distinguish between planned upgrades and emergency upgrades in the further considerations.

  • Testing – Needless to say, upgrades should be tested before being exposed at all, whether to a limited or to the overall subscriber community. But anyone who has ever built software knows that you can never test the real thing, the go-live, to 100%. That is not a cop-out though, and every serious cloud provider needs to test infrastructure upgrades before rolling them out. Labs, test instances and test data centers are a good vehicle to do so. The next phase needs to be a pilot phase – with the awareness of customers. That works great when the rollout is a new, optional piece. Things change when the upgrade is related to a core piece of the architecture and de facto must be used by all subscribers (e.g. security, networking etc.). Of course all these testing questions pretty much fly out of the window in emergency situations, triggered by threats – or by bad upgrades. Rollbacks on infrastructure are a must, but have limitations in the real world. 
  • Redundancy – Having enough redundancy in the infrastructure is key, e.g. being able to operate a cloud data center on two sets of switches / switch software releases – at least for the test data centers. But all that costs money and drives up subscription prices, something cloud providers don't want. And it can't be done for all upgrades: note the noise in the cloud space when e.g. Amazon AWS was not able to provide enough 'safely patched' VMs to transfer the load from the ones that had to be patched. The same must have happened at all other cloud providers (though observers were less vocal there), as there was simply not enough compute around to have enough freshly patched servers available. Finally, redundancy considerations are largely irrelevant in emergency situations. 
  • Isolation – A good public cloud infrastructure should work like the human organism. When bad things happen, the rest of the infrastructure steps up to carry the load while isolating and insulating the affected part (i.e. servers, network infrastructure, data centers, backup sites etc.). Whenever isolation fails – see this week's Azure problem – challenges to service levels get out of hand. But then, to be fair, there are upgrades where you cannot afford isolation, e.g. when remedying a day-one exploit. 
  • Recovery vs Upgrade Speed – At the end of the day it all comes back to two speeds at which IT resources can be transformed. With upgrade speed I refer to the speed at which a provider can roll out an upgrade across its infrastructure; with recovery speed I refer to the speed at which a provider can undo the latest change in case there was an issue. If recovery speed exceeds upgrade speed by orders of magnitude, the provider is usually on the safe side. If the two speeds become similar, it gets increasingly risky, as e.g. a 2 hour planned downtime requires another 2 hour unplanned downtime to recover. When recovery speed is slower than upgrade speed (as in Azure's case this week, though the details are not fully public – maybe not even fully understood), then a provider is in trouble. When recovery takes much, much longer than the upgrade, then it's a pretty risky upgrade (see the sketch after this list for one way to reason about the two speeds). 
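
To make the two-speed argument a bit more concrete, here is a minimal sketch in Python – the function name, thresholds and numbers are purely my own illustration, not anything a cloud provider actually runs – that classifies a planned change by comparing its estimated rollout and rollback durations:

    def classify_change_risk(rollout_hours, rollback_hours):
        # Rough risk classification of a planned infrastructure change, based on
        # the ratio of recovery (rollback) time to upgrade (rollout) time.
        # Thresholds are illustrative only, not any provider's real policy.
        if rollout_hours <= 0 or rollback_hours <= 0:
            raise ValueError("durations must be positive")
        ratio = rollback_hours / rollout_hours
        if ratio <= 0.1:
            return "low risk: a bad change can be undone far faster than it was applied"
        if ratio <= 1.0:
            return "medium risk: a 2 hour rollout may need another ~2 hours of unplanned downtime"
        return "high risk: recovery is slower than the upgrade itself"

    # Example: a 2 hour network rollout that would take 6 hours to roll back
    print(classify_change_risk(rollout_hours=2, rollback_hours=6))

The point is not the exact thresholds, but that the ratio of the two speeds is something a provider – and a customer – can ask about before the change window opens.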

The cloud GAU

I am lacking internet access while typing this, so I am using the German term from nuclear engineering – Groesster Anzunehmender Unfall (GAU) – the largest thinkable accident. The good news is that for compute and network upgrades, downtime solves the issue. Restart the server, give it the right software and all should be fine. Take the network down, roll it back and all should be fine. Ugly – downtimes, bad press, SLAs out of the window – but a remedy that is well understood. The same works for basic storage issues.

But for storage issues that are discovered late and at high cost, it may be game over for a cloud infrastructure. Anyone in traditional on-premise enterprise software knows the scenario. A customer upgrades, including a data conversion with non-backward-compatible schema changes, and a major issue is only uncovered months into the usage of the upgraded system. So all of that storage – and it is a lot of storage – needs to be rolled back, while keeping the lights on and while fixing the issue. And because it's data that clients cannot afford to lose, the provider needs to make a lot of copies of it. If you simply don't have the storage capacity for it, it's game over. I doubt any cloud provider has the same amount of storage available just for that case. The above isn't purely theoretical: I have seen this twice in my 25 year enterprise software career, and if it can happen on premise, it can happen in the cloud – at least theoretically – too. A quick back-of-the-envelope sketch of the capacity math follows below.
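
This is a hypothetical sketch only – the figures and the helper function are mine, not any provider's real numbers – but it shows how quickly the required spare capacity adds up when every affected tenant's pre-conversion data has to be kept around while the fix is developed:

    def rollback_copies_fit(affected_data_pb, spare_capacity_pb, safety_copies=2):
        # Can the provider keep enough pre-conversion copies of the affected data
        # around to roll back safely? All numbers are illustrative only.
        return affected_data_pb * safety_copies <= spare_capacity_pb

    # Hypothetical: 40 PB of converted tenant data, 2 copies each, 50 PB of idle capacity
    print(rollback_copies_fit(affected_data_pb=40, spare_capacity_pb=50))  # False - the game over scenario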

Questions cloud customers should ask

So here are a few questions customers should ask their cloud providers:
  • How do you test infrastructure upgrades?
  • How do you isolate data centers (and us as a customer) from upgrades?
  • What level of redundancy do you have on the compute, storage and network level to test and isolate upgrades?
  • What is your approach and philosophy in regard to upgrade vs recovery speed?
  • How fast can I get my vital data out of your infrastructure?
  • ...

    MyPOV

    It’s the early days of the cloud and both customers and vendors need to learn and grow up. That includes the industry observers, as I was appalled to find nothing written on the topic of cloud upgrades. Not sure if the vendors will tell the public, but it’s overdue that influencers on the media and analyst side start asking these questions. I will. 

     
    ----------
     
    More recent posts on cloud:
     
    • Event Report - AWS re:Invent - AWS becomes more PaaS on own IP - read here
    • Musings - Are we witnessing the birth of the enterprise cloud? Read here
    • News Analysis - SAP and IBM partner for cloud success - a good move - read here
    • Event Report - Oracle OpenWorld - Oracle's vision and remaining work become clear - and both are big - read here
    • Market Move - Cisco wants to acquire MetaCloud - getting more in the cloud game - read here
    • News Analysis - HP acquires Eucalyptus - Genius or panic at Page Mill Road? Read here
    • News Analysis - IBM and Intel partner for cloud security - Read here
    • Event Report - VMware makes a lot of progress - but the holy grail is still missing - read here
    • News Analysis - SAP commits to OpenStack and CloudFoundry - Key Steps - but what's the direction? Read here
    • Event Report - Microsoft TechEd - Day 1 Keynote takeaways - read here