[With comments from Holger Mueller]
Atlassian hired a new CTO - Rajeev Rajan from Meta (ex-Microsoft). While this by itself is not the complete answer, it is a solid first step: bringing in someone with a strong background in enterprise engineering teams to address some of their issues. As I stated in the conclusion of this blog earlier, "Atlassian cloud has growing pains." I hope he is their answer, and I hope they will continue to address the issue at hand to instill confidence in their customers.
The latest Atlassian outage goes to show that every cloud provider is prone to unplanned downtime sooner or later. While every company strives to achieve that unicorn status of zero downtime, it is almost impossible to achieve that in the face of “Unknown Unknowns.” Especially with the need and demand for “always-on,” there are more opportunities than ever for things to break, and incidents do not wait for a convenient time.
What actually happened?
On April 4th, a small portion of Atlassian customers (400 of the roughly 226,000 customers) experienced an outage on a number of Atlassian Cloud sites for Jira Software, Jira Work Management, Jira Service Management, Confluence, Opsgenie, Statuspage, and Atlassian Access. While the number of customers affected was low, those customers completely lost access to all of their Atlassian services, and the number of affected users could be in the hundreds of thousands. If an enterprise depended on the Atlassian cloud suite for DevOps, enterprise service, or incident management, it was at a standstill until the issues were resolved. (Per Atlassian, the issues were finally resolved on 4/18/22, after 2 weeks.)
The outage, which started on April 4th, lasted almost 2 weeks for some customers. The timing couldn’t have been worse: the executive team was pitching how great their cloud services are and how they are putting an enterprise sales/service model in place for large enterprise customers at their Atlassian Team ’22 conference in Vegas during April 5-7. All this was happening while Atlassian executives were on stage talking about how they are building a resilient cloud, bar none.
Why is it bad?
This outage from Atlassian is pretty bad for a few reasons.
They were putting on a big show in Vegas, talking about their strategy, direction, vision, and mission for their cloud services while those cloud services were down. I was attending the conference in person, and my tweets and many others’ were met with angry replies from customers. Bad publicity.
Atlassian's primary solution set is focused on helping customers prepare for exactly such unplanned outages: agile development, the DevOps cycle, issue/bug tracking, incident management, ITSM, incident response, Statuspage, etc.
The issue is self-inflicted. The damage was not due to misconfiguration, hacking, or a failure in another provider's service that Atlassian depends on. As Atlassian's CTO stated, two critical errors were committed. First, because of a communication gap, the entire cloud site with all its apps was deactivated for certain customers instead of one specific app. Second, the scripts deleted them “permanently” (an option that exists for compliance reasons) rather than as “temporarily recoverable.” The combination of these two errors led to the colossal mishap.
They took a long time to respond with something meaningful. Granted, they were all busy with the big show in Vegas. Atlassian claims they figured out the issue and root cause within hours, but the messaging to customers and the status pages were not clear on the situation. Only a cryptic “your service is down” message with no ETA was relayed to the affected customers. Only about a week later did Atlassian's CTO write a detailed post on what went wrong. Until then, customers were scrambling to figure out what went wrong and were doing patchwork with spreadsheets, Word docs, and other collaboration tools like Slack to manage the gap.
Atlassian announced that they will no longer sell new licenses for on-prem Server installations (but will continue their Data Center offerings) and will discontinue support for on-prem Server in 2024 for all existing customers, effectively forcing Server customers to move to the cloud within 2 years. It makes sense on their part to maintain only SaaS and Data Center versions instead of a fragmented solution set, but an outage like this one makes that forced move much harder to accept.
Incidentally, Atlassian's CTO stated in his last blog (before this fiasco): "At our engineering Town Hall meeting, I announced that we were in a "Five-Alarm Fire" due to poor reliability and cloud operations. Our customers needed to trust that we could provide the next level of reliability, security, and operational maturity to support our business transition to the cloud in the coming years." That call specifically named reliability, security, and operational maturity as what is needed to support the transition to the cloud. With this incident, they unfortunately failed in two of those three categories.
Finally, more importantly, Atlassian claims an SLA of 99.99% for Premium and 99.95% for enterprise cloud products. They also claim a 6-hour RTO (recovery time objective) for tier 1 customers. Unfortunately, neither held up this time.
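To put those numbers in perspective, here is a minimal sketch of the arithmetic behind an uptime SLA. It uses the SLA and RTO figures quoted above; the 30-day measurement period and the function names are my assumptions for illustration, not Atlassian's contract terms.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Maximum downtime a given uptime SLA allows over a period."""
    return period * (1 - sla_percent / 100)


def achieved_availability(outage: timedelta, period: timedelta) -> float:
    """Availability (in percent) over a period that contains one outage."""
    return 100 * (1 - outage / period)


month = timedelta(days=30)    # assumed measurement period
outage = timedelta(days=14)   # roughly 2 weeks for the worst-hit customers

for sla in (99.99, 99.95):
    minutes = downtime_budget(sla, month).total_seconds() / 60
    print(f"{sla}% uptime allows {minutes:.1f} minutes of downtime per 30 days")

print(f"Availability in a month with a 14-day outage: "
      f"{achieved_availability(outage, month):.1f}%")
print(f"14 days is {outage / timedelta(hours=6):.0f} times a 6-hour RTO")
```

Under these assumptions, the budget is minutes per month (about 4.3 at 99.99% and 21.6 at 99.95%), while a 14-day outage leaves roughly 53% availability for that month and is 56 times the stated 6-hour RTO.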
Why did it happen?
Rather than me trying to paraphrase, you can read Atlassian CTO’s blog that explains what happened in detail here.
Accept that no cloud service is invincible. High-profile outages are becoming more and more common. Even the mighty AWS had their US-East region down for many hours recently. Unplanned downtimes are expected and will happen at the most unfortunate time – holidays, nights, weekends, or during flagship events. The following steps, while not a complete solution, can help mitigate the situation somewhat:
SLAs: Most cloud-based SaaS SLAs are written with either four or five 9s (such as 99.99% or 99.999%). While those contracts won’t stop incidents from happening, they will at least give some financial recourse when such events happen. While it might be preferable to write large SaaS contracts around business outage costs rather than technical outage and data loss costs, most SaaS vendors may not accept such language in contracts. The higher the penalty for such incidents, or for long resolution times, the faster the incident gets attended to. In events like this, where vendors restore a few customers per batch, you want to be the first in line and your contract needs to reflect that.
Have a backup option. Ideally, you would have a backup solution that is either from a different provider or on a different cloud for such occasions. But that can be expensive. Multi-cloud solutions are easier said than done. If your business is that critical, those options must be considered.
Have a plan for such extended downtime. When such long outages happen, part of the issue is the lost productivity of your employees, partners, and services. Your business cannot be at a standstill because your service provider is down. Whether it is a backup service or document-based notes, there has to be a plan in place ahead of time to act on.
A lot of Atlassian competitors were using this opportunity to pitch their solutions on Twitter, LinkedIn, and other social media, claiming this would NEVER happen to their product lines. Don’t jump from the frying pan into the fire in the hour of immediate need just because of this incident. However, it is time to evaluate other worthy offerings to see if they might fit your business model better.
As discussed in my Incident Management report, “Break Things Regularly” and see how your organization responds. Most digital enterprises make a lot of assumptions about their services, and breaking things regularly is a great exercise for validating those assumptions. A couple of options discussed in my report involved either breaking things and seeing how long it takes for support/SRE/resiliency teams to fix them (in the style of Chaos Monkey), or creating game-day exercises (from the AWS Well-Architected Framework) in which teams react to a controlled exercise and build the “muscle memory” to react fast in such situations. Assumption is a dangerous thing in the digital economy. You are one major incident away from disaster, which can happen anytime.
Measure what matters. If you are just checking “health” of your services and your provider services, your customers will unearth a lot of incidents before your SRE team can. I discuss a lot of instrumentation, observability, and customer real situation monitoring ideas in my Incident Management report.
Review the SaaS vendor’s resiliency, backup, failover, restoration, architecture, data protection, and security measures in detail. A mere claim of x hours of restoration time is not good enough. If you have architected a reliable solution on-prem or on another cloud, make sure that the SaaS vendor’s plan and design at the very least match your capabilities.
When deleting "permanently," make sure it is a staged delete even if it is for compliance reasons. Build in a grace period of 24 or 48 hours to confirm that the deletion didn't do more damage than intended.
Before you execute a script that does mass operations, test it many times first to make sure it produces the intended results. Triple-check mass scripts, delete operations, and any major modifications.
Finally, as discussed in my report, take ownership and communicate well. Customers do appreciate that. Such incidents do happen occasionally; what matters more is how the vendor communicates with customers, how soon they fix the incident, how detailed their postmortem is, and, most importantly, what they do so such incidents don’t occur in the future.
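The staged-delete and test-before-you-run advice above can be sketched in a few lines. This is a toy illustration under my own assumptions, not how Atlassian implements deletion: the `Store` class, the 48-hour `GRACE_PERIOD`, and the `MAX_BATCH` limit are all hypothetical names and values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=48)   # assumed holding period before a purge
MAX_BATCH = 50                       # assumed blast-radius limit per run


@dataclass
class Store:
    """Toy in-memory store; a real system would sit on a database."""
    live: dict = field(default_factory=dict)      # id -> record
    pending: dict = field(default_factory=dict)   # id -> (record, deleted_at)

    def stage_delete(self, ids, now, dry_run=True):
        """Move records to a recoverable holding area instead of erasing them."""
        targets = [i for i in ids if i in self.live]
        if len(targets) > MAX_BATCH:
            raise ValueError(f"{len(targets)} targets exceeds limit of {MAX_BATCH}")
        if dry_run:
            return targets                        # report only, change nothing
        for i in targets:
            self.pending[i] = (self.live.pop(i), now)
        return targets

    def restore(self, record_id):
        """Undo a staged delete while the record is still in the holding area."""
        record, _ = self.pending.pop(record_id)
        self.live[record_id] = record

    def purge_expired(self, now):
        """Permanently remove only records whose grace period has elapsed."""
        expired = [i for i, (_, t) in self.pending.items()
                   if now - t >= GRACE_PERIOD]
        for i in expired:
            del self.pending[i]
        return expired


store = Store(live={"app-1": "data", "app-2": "data", "site-9": "data"})
t0 = datetime(2022, 4, 4, tzinfo=timezone.utc)

preview = store.stage_delete(["app-1", "site-9"], t0)       # dry run by default
store.stage_delete(["app-1", "site-9"], t0, dry_run=False)  # the real run
store.restore("site-9")                                     # mistake caught in time
purged = store.purge_expired(t0 + timedelta(hours=49))
print(preview, sorted(store.live), purged)
```

The design choice worth noting is that the dry run is the default and the permanent purge is a separate, later step, so a wrong target list can be caught either before the run or during the grace period.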
Atlassian cloud has growing pains. It may be a tough pill to swallow, but they need to go back to the drawing board and reassess the situation. They need to take a hard look at their cloud architecture, their processes, their operations, and, more importantly, their mandate to move all customers to the cloud or Data Center by 2024. It is such a shame, as I like their suite of products: a solid line that has performed well for large enterprise customers in its on-prem Data Center version for many years until now.
They also need to automate a lot of their cloud operations such as restoring deleted customer sites in one batch rather than painful small batches. They should have fully automated rollbacks for any changes whether it is configuration, functions, features, or code changes. If something didn’t work, they should be able to roll back to a previous version in an automated fashion quickly in a matter of hours – not days or weeks.
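As a minimal sketch of what an automated rollback could look like, assuming a simple key-value configuration and a health check supplied by the caller (the `VersionedConfig` class and `has_enough_replicas` check are hypothetical, not Atlassian's tooling):

```python
from typing import Callable, Dict, List


class VersionedConfig:
    """Keeps every deployed version so the last known-good one is always at hand."""

    def __init__(self, initial: Dict[str, object]) -> None:
        self.history: List[Dict[str, object]] = [dict(initial)]

    @property
    def current(self) -> Dict[str, object]:
        return self.history[-1]

    def deploy(self, change: Dict[str, object],
               health_check: Callable[[Dict[str, object]], bool]) -> bool:
        """Apply a change, verify it, and roll back automatically if the check fails."""
        self.history.append({**self.current, **change})
        if health_check(self.current):
            return True
        self.history.pop()   # automated rollback to the previous version
        return False


# Hypothetical health check: the service must keep at least two replicas.
def has_enough_replicas(cfg: Dict[str, object]) -> bool:
    return int(cfg["replicas"]) >= 2


config = VersionedConfig({"replicas": 3, "feature_x": False})
ok = config.deploy({"feature_x": True}, has_enough_replicas)   # passes the check
bad = config.deploy({"replicas": 0}, has_enough_replicas)      # fails, rolled back
print(ok, bad, config.current)
```

The point of the sketch is that the rollback path is part of the deploy step itself, so nobody has to decide under pressure how to get back to the previous version.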
This happens when companies grow too soon, too fast. An added complexity in the case of Atlassian is the string of acquisitions they made, which they are trying to bring together on one cloud platform.
This too shall pass. How they respond to this event by putting in place newer processes, controls, approvals, automation, and, more importantly, automated rollbacks will tell whether they can be trusted going forward. Once the picture is clear, enterprises can decide whether it is worth continuing with Atlassian or looking at some worthy alternatives.
It is too soon to tell at this point.
PS: I had a call with their head of engineering (Mike Tria) on 4/19/2022, who addressed some of these concerns and explained in detail some of the measures they are taking to fix this issue so it won’t happen again. These included staged permanent deletes, operational quality, mass auto rollbacks, customer restoration across product lines, etc. He also discussed at length what they are doing for the customers who cleverly instantiated instances in parallel while waiting for the issues to be resolved, and how those instances can be merged into their main service.
I was assured by Atlassian that this incident will be reviewed in detail and that the measures they are taking going forward will be covered in a PIR (Post-Incident Review) that is scheduled to be released soon (before the end of the month per Atlassian).