Sorry you are down, wait we are down, too! (Or: The sad state of HA)
This week I learned about an outage that happened at a provider of Microsoft Office applications, which reminded me of the sad state of high availability across the industry.
More on the provider
The provider is a medium-sized infrastructure vendor that is successful in providing hosted Microsoft Office applications, basically running the servers for Outlook for their clients. They are not small, with 5 corporate locations on both sides of the Atlantic and 10 datacenters in the US and Europe. The provider is professional and has, for example, achieved SOC2 and SOC3 compliance.
The vendor offers a 99.999% up-time guarantee - but that was definitely broken by being down for most of a workday, from 7 AM till 3:30 PM.
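To put that guarantee in perspective, here is a rough back-of-the-envelope sketch of the arithmetic. It assumes a 365-day year and an SLA measured over a full calendar year; the vendor's actual measurement window is not known to me, and the function names are made up for illustration.

```python
# Back-of-the-envelope SLA arithmetic. Assumptions (mine, not the vendor's):
# a 365-day year and an SLA measured over a full calendar year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(uptime_pct: float) -> float:
    """Downtime budget per year, in minutes, for a given uptime percentage."""
    return (100.0 - uptime_pct) / 100.0 * MINUTES_PER_YEAR


def achieved_uptime_pct(downtime_minutes: float) -> float:
    """Uptime percentage for the year if this were the only downtime."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


# The outage described above: 7 AM till 3:30 PM = 8.5 hours.
outage_minutes = 8.5 * 60
budget_minutes = allowed_downtime_minutes(99.999)

print(f"Yearly budget at 99.999%: {budget_minutes:.2f} minutes")
print(f"Outage length: {outage_minutes:.0f} minutes")
print(f"Years of budget used: {outage_minutes / budget_minutes:.0f}")
print(f"Best possible uptime this year: {achieved_uptime_pct(outage_minutes):.3f}%")
```

Under those assumptions, five nines leaves a little over five minutes of downtime per year, so a single 8.5-hour outage burns through roughly 97 years of budget and caps the year at about three nines.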
What happened
Clients noticed in the morning that they were not able to get their emails, send emails, or work with their calendars. Calls to the provider went dead, and the provider's website and support applications were not available. The first provider-to-client communication then happened over... Twitter. And Twitter remained the lifeline between provider and customers till - you may have guessed it - the Twitter account went into Twitter jail for hitting the daily limit of 1000 tweets. And while that seems generous - it's not much if Twitter is your only means of communication with several hundred customers.
What went right
The provider got the system back, tried all they could to keep customers informed (so they were obviously in the dark), sent the usual letter from the CEO within the next 24 hours, and had that followed up by one from the COO. The vendor communicated proactively that they had broken the service levels, that they would waive the requirement to ask for reimbursement, and that they would reimburse customers directly based on their SLAs.
What went wrong
In the CEO letter, the provider already named an issue with their routers as the root cause of the outage. And while it's fine not to have the ultimate reasons 24 hours after an outage - you need to do better than the following from the letter of the COO, 48 hours later:
The sorry state of HA
MyPOV