Friday, January 20, 2023

Should chaos engineering practice be made part of the DevOps CI/CD pipeline?


Answering this question with a plain Yes or No would be tough. At first glance it looks straightforward: there are tools like Gremlin, Litmus, or Chaos Monkey to simulate different types of failures, and you can integrate these tools into your pipeline so that they run automatically as part of your testing and deployment process. However, it’s not that simple. Let us dive deep to see what the challenges are in including these fault injection experiments in a CD pipeline.
We need to start with the basics of chaos engineering and the purpose of DevOps to arrive at a detailed answer. We must start with the official definition:
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Next we need to look at the principles. The principles of chaos engineering are well defined to help users understand the practice in detail. They include:

  • Good understanding of the system
  • Including stakeholders
  • Formulating hypothesis
  • Preparing experiments
  • Planning for game day
  • Running experiments
  • Monitoring
  • Collecting metrics
  • Increasing blast radius 

Automating fault injection in a DevOps pipeline would miss certain aspects of the principles stated above. Monitoring the system’s behaviour on a scheduled day, while running these experiments together with all the stakeholders, is critical to the success of this practice. The learning that comes from live monitoring and analysis alongside other stakeholders provides a better perspective on the issues that are uncovered. It also helps in detecting the weak points in the system.


Simple fault injections whose results are easy to predict without monitoring may be considered better candidates for a CD pipeline. But there are not many such simple faults. Even the simplest injected fault can manifest in unexpected ways and become a complex issue within the ecosystem. Monitoring the system at every layer, with every possible means, is important while faults are being injected. Today’s distributed systems are complex, with multiple layers and interdependent modules. The same yardstick cannot be used at every stage.
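To make the idea of a "simple, predictable" fault concrete, here is a minimal sketch of what a pipeline-friendly experiment might look like: one fault, one metric, and a hypothesis that reduces to pass or fail. Everything in it is hypothetical. `Experiment`, `run_experiment`, and `FakeService` are illustrative names and not the API of Gremlin, Litmus, or Chaos Monkey, and the error-rate numbers are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A fault-injection experiment with a pass/fail steady-state hypothesis."""
    name: str
    inject: Callable[[], None]    # starts the fault (hypothetical hook)
    rollback: Callable[[], None]  # undoes the fault
    probe: Callable[[], float]    # reads one metric, here an error rate
    max_allowed: float            # hypothesis: metric stays at or below this


def run_experiment(exp: Experiment, samples: int = 5) -> bool:
    """Return True if the hypothesis held for every sample taken during the fault."""
    exp.inject()
    try:
        readings = [exp.probe() for _ in range(samples)]
    finally:
        exp.rollback()  # roll back even if a probe raises
    return all(r <= exp.max_allowed for r in readings)


class FakeService:
    """Stand-in for a real system: error rate rises while a replica is down."""

    def __init__(self) -> None:
        self.replica_down = False

    def kill_replica(self) -> None:
        self.replica_down = True

    def restore_replica(self) -> None:
        self.replica_down = False

    def error_rate(self) -> float:
        # Assumed numbers, for illustration only.
        return 0.02 if self.replica_down else 0.0


service = FakeService()
experiment = Experiment(
    name="kill-one-replica",
    inject=service.kill_replica,
    rollback=service.restore_replica,
    probe=service.error_rate,
    max_allowed=0.05,
)
passed = run_experiment(experiment)
print("hypothesis held" if passed else "hypothesis violated")
```

Note what the sketch leaves out: it watches a single metric at a single layer, so any side effect that shows up elsewhere in the ecosystem goes unnoticed. That is exactly the limitation described above.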


Game day planning includes activities that cannot be automated easily. The fire drill concept, team gathering, and war room setup are some of the activities that every chaos engineering practitioner must experience. These rehearsals build confidence to deal with real-life issues. Loose, in-name-only automation could defeat the preparedness objective. Real-life chaotic surprises require strong collaboration between people so that the system springs back as early as possible, proving its resiliency. Mere dependence on a rigid CD process, without room for continual evolution, would not yield better results.


After every iteration of the fault injection exercise, teams must plan to increase the blast radius and learn more. Automation that uses the same input criteria and parameters for every run would defeat the objective of continuous improvement. Automating a gradually increasing time variable is not tough in itself. However, it requires complex logic and becomes unsustainable sooner than you think. DevOps CD pipelines are not designed to handle such a highly variable, time-consuming process that demands human intervention.
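As a rough illustration of the extra logic involved, the sketch below widens the blast radius only after a passing run. It assumes the pipeline keeps a small JSON state file between runs. The function names, the starting value of 5 percent, the step of 10, and the cap of 50 are all assumptions made for this example, not values from any tool.

```python
import json
from pathlib import Path


def next_blast_radius(current: int, last_passed: bool,
                      step: int = 10, cap: int = 50) -> int:
    """Widen the blast radius after a passing run; hold it after a failure."""
    if not last_passed:
        return current
    return min(current + step, cap)


def load_state(state_file: Path) -> dict:
    """Read the previous run's outcome, or an assumed starting point."""
    if state_file.exists():
        return json.loads(state_file.read_text())
    # No history yet: start small and do not widen on the first run.
    return {"percent": 5, "last_passed": False}


def plan_next_run(state_file: Path, step: int = 10, cap: int = 50) -> int:
    """Decide what percentage of targets the next run should affect."""
    state = load_state(state_file)
    return next_blast_radius(state["percent"], state["last_passed"], step, cap)


def record_result(state_file: Path, percent: int, passed: bool) -> None:
    """Persist the outcome so the next pipeline run can build on it."""
    state_file.write_text(json.dumps({"percent": percent, "last_passed": passed}))
```

Even this toy version needs state that survives between pipeline runs, and it still encodes only one fixed escalation rule. What the team learned from the last run, and which fault to vary next, remain decisions for people.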


Time is a crucial parameter, and in chaos engineering practice every minute of variation could increase the chances of uncovering a new issue. The main motive of automation is to save time. Chaos engineering, however, uses time as a varying parameter to alter the characteristics of the fault injection and to push the system toward failure, with time as the catalyst. Also, running a time-consuming process in a DevOps CD pipeline is not recommended. It slows down deployment and undermines modularity by adding dependencies.


Conclusion


The definition and principles above establish that this practice is detailed experimentation, not mere testing that can simply be automated. Chaos engineering is about uncovering the inherent chaos in the system using a unique approach in every trial. Without variable parameters, human collaboration, and real-time monitoring, the process may not serve the real purpose of chaos engineering.
