Lessons for IT teams from the Atlassian outage incident

Atlassian's IT teams took a whopping 14 days to finish recovering from the outage that started on April 4th, 2022. I recommend going through Atlassian's outage update page before reading on.

The Atlassian team communicated clearly that the outage was caused by a failure while sunsetting and migrating one of their legacy applications. The root cause analysis is interesting: it shows how communication between teams failed and how a faulty script further aggravated the situation. Even with an incident as severe as accidental permanent data deletion, the Atlassian teams quickly came together and took control. They deserve credit for avoiding further damage and restoring customer data within the constraints they had.

The incident teaches us a few things that we perhaps already know but put off getting in order.

Regular data backup with automated testing

The Atlassian team explained in their status report how they recovered using immutable backup data. Regular testing of backup data is an essential practice to which most IT teams give the least importance. Disaster recovery KPIs such as RPO (recovery point objective) and RTO (recovery time objective) must be tested and measured periodically.
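As a rough illustration of measuring these KPIs, the sketch below scores a single restore drill against RPO and RTO targets. The `RestoreDrill` class, the `evaluate_drill` function and the timestamps are hypothetical names chosen for this example; they are not taken from Atlassian's report or from any backup tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RestoreDrill:
    """Timestamps recorded during one restore drill (hypothetical structure)."""
    last_backup_at: datetime       # when the restored backup was taken
    failure_at: datetime           # simulated moment of data loss
    restore_finished_at: datetime  # when the service was usable again


def evaluate_drill(drill: RestoreDrill,
                   rpo_target: timedelta,
                   rto_target: timedelta) -> dict:
    """Compare the achieved RPO and RTO of a drill with the agreed targets."""
    if not (drill.last_backup_at <= drill.failure_at <= drill.restore_finished_at):
        raise ValueError("drill timestamps are out of order")

    # Achieved RPO: how much data (measured in time) would have been lost.
    achieved_rpo = drill.failure_at - drill.last_backup_at
    # Achieved RTO: how long the service was unavailable.
    achieved_rto = drill.restore_finished_at - drill.failure_at

    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }
```

Running a check like this after every scheduled restore drill, and failing the drill when either flag is false, turns the KPIs into something that is measured rather than assumed.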

Segregated Development, Test and Staging environments

With ever-decreasing computing costs, budget won't be a serious constraint for IT teams that want multiple lower environments. Teams must have stable lower environments and a defined deployment process. Production deployment without validation and certification in lower environments must be discouraged. The staging environment must hold data close to production data, with proper masking of sensitive and confidential information. That way, tests in staging would surface potential data issues that can happen in production. It is not clear from the Atlassian report whether the development team tested the change in at least one lower environment before deleting the legacy application in production. Doing so could have saved them all this trouble.
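One possible way to mask sensitive fields while keeping staging data realistic is keyed hashing, sketched below. The field names in `SENSITIVE_FIELDS`, the `masked_` prefix and the function names are assumptions made for this example; a real schema and masking policy will differ.

```python
import hashlib
import hmac

# Hypothetical list of columns treated as sensitive in this example.
SENSITIVE_FIELDS = {"email", "full_name", "phone"}


def mask_value(value: str, secret: bytes) -> str:
    """Replace a value with a keyed hash so the original cannot be read back."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "masked_" + digest[:12]


def mask_record(record: dict, secret: bytes) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {
        key: mask_value(str(value), secret) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }
```

Because the same input and secret always give the same masked value, records that share an email address in production still match each other in staging, so joins and duplicate checks keep behaving as they would on real data.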

Uniqueness of IT applications

Distinguish applications by unique IDs in an enterprise library. Define the dependencies between applications and maintain detailed documentation. Remove ambiguity in application naming. This helps teams communicate better.
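A minimal sketch of such a library follows, assuming each application is keyed by a unique ID and records the IDs it depends on. The `Application` and `Registry` classes and the `APP-001` style IDs are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    app_id: str                         # unique and never reused (format is an assumption)
    name: str                           # human-readable, may clash with other names
    depends_on: frozenset = frozenset() # IDs of applications this one relies on


class Registry:
    """In-memory application library keyed by unique ID."""

    def __init__(self) -> None:
        self._apps = {}

    def register(self, app: Application) -> None:
        # Reject duplicate IDs so an ID always points at exactly one application.
        if app.app_id in self._apps:
            raise ValueError(f"duplicate application id: {app.app_id}")
        self._apps[app.app_id] = app

    def dependents_of(self, app_id: str) -> list:
        """List the IDs of applications that depend on the given one."""
        if app_id not in self._apps:
            raise KeyError(f"unknown application id: {app_id}")
        return sorted(
            app.app_id for app in self._apps.values() if app_id in app.depends_on
        )
```

Two applications may share the name "Billing", but their IDs differ, so a request to change or remove one of them cannot be confused with the other.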

Automation

We all know how important automation is for business. This incident is solid proof that it saves not just cost but reputation as well. The Atlassian IT team had not automated its backup data recovery process. This lack of automation, along with non-segregated customer data, cost them 14 days to recover fully.

Auditing

We often neglect to follow established governance processes. Production deployment review and approval by a team with a broader business view often meets stiff resistance and gets labelled counterproductive or bureaucratic. Regular audits of security and of non-functional requirements such as high availability and disaster recovery are critical to business continuity. They must not be deprioritised in favour of routine work.

Application sunset process

Application sunset is not a trivial task. Just because a legacy application has lost the focus of customers and the business does not mean that it no longer carries data and dependencies. This task must be given due diligence. Dependency checking, data archiving, compliance guidelines and stakeholder communication are some of the subtasks that must be completed before deleting anything.
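The subtasks listed above can be enforced as a simple gate that refuses deletion until every step is recorded as done. This is only a sketch: the `SunsetStep` values mirror the subtasks named in the text, while the function names and the returned message are invented for the example.

```python
from enum import Enum


class SunsetStep(Enum):
    DEPENDENCY_CHECK = "dependency check"
    DATA_ARCHIVED = "data archived"
    COMPLIANCE_REVIEW = "compliance review"
    STAKEHOLDERS_NOTIFIED = "stakeholders notified"


def missing_steps(completed: set) -> list:
    """Return the sunset steps that have not been completed, in checklist order."""
    return [step for step in SunsetStep if step not in completed]


def approve_deletion(app_id: str, completed: set) -> str:
    """Approve deletion only when every sunset step has been completed."""
    missing = missing_steps(completed)
    if missing:
        names = ", ".join(step.value for step in missing)
        raise RuntimeError(f"{app_id}: cannot delete, missing steps: {names}")
    return f"{app_id}: approved for deletion"
```

A deletion script that calls `approve_deletion` first fails loudly when, for example, the data has been archived but stakeholders have not been notified.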

Clear Communication

Nothing is as important as this in any aspect of our lives. A lot has been said and taught about communication. Still, we fail at it.

Mistakes are inevitable. To counter the fear of failure, we must anticipate it and be prepared to face it. Any system can fail. Designing systems for failure is one of the best approaches the industry is focusing on. There are various techniques, and you must adopt the ones that suit your system design. This will ensure quick recovery and minimal downtime.

Tech Work

Sunday, May 29, 2022
