Sunday, November 6, 2022

Chaos Engineering Clarity


My recent conversations with the technical community made me feel that Chaos Engineering (CE) concepts are being misunderstood. Since this is an emerging practice in the industry, a deeper understanding should eventually clear up the confusion. Comprehensive documentation already exists that defines its objectives, principles, and implementation. This documentation is widely available through the various open source and commercial CE tools on the market.

Technical folks new to this practice still have doubts about it, and I believe more clarity on CE needs to be provided widely. I have gathered many questions through my interactions with technical teams, and I'm trying to address some of them in this blog post.

Before I begin, you should read through the definition of chaos engineering. As per Wikipedia:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production


Does chaos engineering go against compliance and regulation?

No! In fact, CE experiments help in meeting security and compliance standards by uncovering hidden issues. Compliance requirements are generally set up to ensure data privacy, data protection, security, adherence to country- or region-specific laws, and the following of standard processes.
The objective of chaos engineering is to gain a deeper understanding of IT systems and to eliminate or minimize application and service outages. Faults are supposed to be injected in a controlled manner, with a rollback plan as part of the test.
As far as tool usage goes, CE tools are required to be compliant with an organization's security and licensing standards, just like any other third-party tool used in its applications. The scans that enterprises mandate usually check third-party tools for known CVEs.
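
To make "controlled manner" and "rollback plan" concrete, here is a minimal sketch in Python. The run_experiment function, the replicas dictionary, and the health check are all hypothetical names invented for illustration; they do not come from any particular CE tool.

```python
from typing import Callable, Dict


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> Dict[str, bool]:
    """Run one controlled experiment and always roll the fault back."""
    # Abort before touching anything if the system is already unhealthy.
    if not steady_state():
        return {"ran": False, "held_under_fault": False, "recovered": False}

    held_under_fault = False
    try:
        inject_fault()
        # Observe: does the steady-state hypothesis still hold under the fault?
        held_under_fault = steady_state()
    finally:
        # The rollback runs even if injection or observation raises.
        rollback()

    return {
        "ran": True,
        "held_under_fault": held_under_fault,
        "recovered": steady_state(),
    }


# Toy system (an assumption): two replicas, healthy if at least one is up.
replicas = {"a": True, "b": True}

result = run_experiment(
    steady_state=lambda: any(replicas.values()),
    inject_fault=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
print(result)
```

The point of the sketch is the shape of the experiment: check the steady state first, inject, observe, and roll back no matter what happens in between.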


Is chaos engineering a replacement for security tests such as static and dynamic/penetration testing?

General CE practice is not meant to replace security tests such as static, dynamic, or interactive application security testing (SAST, DAST, or IAST). Both security testing and CE involve injecting faults into the system. The method of analysing a system by trying to break it is the same in both practices, but the objectives differ.

I see attempts being made to extend CE to cover certain types of security testing. A new name, Security Chaos Engineering, has even been coined for this purpose. It uses the same fault injection approaches to inject security flaws into the system. However, one should not confuse general CE practice with security testing. Their objectives are different even if the same tool is used for both.


Is chaos engineering suitable only for cloud infrastructure?

No. Any type of infrastructure that hosts your applications can be considered, whether it is cloud, containers, virtual machines, or physical servers. Most of the examples in the available CE documentation target cloud infrastructure, but that doesn't mean the practice applies only to the cloud.


Can it be automated through CI pipelines?

Injecting faults automatically at various layers, such as application, network, and compute resources, is quite possible. However, the CE process requires observability through instrumentation, and the observation itself is better carried out manually, together with all stakeholders. Fully automating the process of breaking the system could defeat the objective of gaining a deeper understanding of the system and its behaviour when faults are deliberately induced.
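
As an illustration of the part that does automate well, here is a small sketch of an application-level fault check that a CI job could run with a test runner such as pytest. The fetch_profile function, its cache, and both backends are hypothetical and exist only for this example. A check like this guards against regressions; it does not replace watching the real system together as a team.

```python
from typing import Callable, Dict, Optional

# Hypothetical local cache used as a fallback when the backend is down.
_cache: Dict[str, str] = {"user-1": "cached profile"}


def fetch_profile(user_id: str, backend: Callable[[str], str]) -> Optional[str]:
    """Return a profile from the backend, falling back to the local cache."""
    try:
        value = backend(user_id)
        _cache[user_id] = value
        return value
    except ConnectionError:
        # Degrade gracefully: serve stale data if we have it, else None.
        return _cache.get(user_id)


def failing_backend(user_id: str) -> str:
    # Injected fault: the dependency is unreachable.
    raise ConnectionError("injected fault: backend unavailable")


def test_serves_cached_profile_when_backend_is_down() -> None:
    assert fetch_profile("user-1", failing_backend) == "cached profile"


def test_returns_none_when_nothing_is_cached() -> None:
    assert fetch_profile("user-2", failing_backend) is None
```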

The crucial phase of CE practice is the game day (a mock drill), where faults are injected and the system's behaviour is captured. This helps teams be prepared when such a situation arises for real in the future.


How is chaos engineering different from load testing or performance testing?

The objective of load testing is to determine system performance under load and benchmark it. This helps teams understand the system's capacity and design the system accordingly. Performance testing helps identify and fulfil the system's scalability requirements.

This is entirely different from the objectives and approach of CE. However, a CE experiment can include load testing tools to examine how the system behaves under load while a fault is injected. Apart from this, there is no connection between CE and performance testing; they differ in both objectives and approach.
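
The one overlap mentioned above, applying load while a fault is active, can be sketched as follows. Everything here is an assumption made for illustration: handle_request stands in for a real service call, and the "fault" simply fails every fourth request so that the outcome is easy to reason about.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def handle_request(request_id: int, fault_active: bool) -> str:
    """Hypothetical service call; the injected fault fails every fourth request."""
    if fault_active and request_id % 4 == 0:
        raise TimeoutError("injected fault: dependency timed out")
    return "ok"


def run_load(total_requests: int, fault_active: bool, workers: int = 8) -> Dict[str, int]:
    """Send requests concurrently and count successes and failures."""

    def attempt(request_id: int) -> bool:
        try:
            handle_request(request_id, fault_active)
            return True
        except TimeoutError:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(attempt, range(total_requests)))

    succeeded = sum(outcomes)
    return {"succeeded": succeeded, "failed": total_requests - succeeded}


# Same load, with and without the fault, so the two runs can be compared.
baseline = run_load(100, fault_active=False)
under_fault = run_load(100, fault_active=True)
print(baseline, under_fault)
```

The load generator is only there to make the fault's effect visible; the comparison between the two runs is the CE part, not the throughput number.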


Is it necessary to use a chaos engineering tool?

Tools help you get started quickly on achieving the CE objectives. There are various open source and commercial tools available on the market. These tools make ready-made fault injections available to you with easy setup. Most tools come packaged with resource exhaustion tests (RAM, disk, I/O), network tests (latency), infrastructure tests (pods, containers, VMs, servers), and application-level tests.

Writing these tests from scratch requires a lot of time and effort. I would recommend using one of these tools if you are new to CE. Once you are proficient, you can start building your own tooling.
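
To give a feel for what such tools package up, here is a hand-rolled latency fault written as a Python decorator. This is only a sketch: inject_latency and lookup_price are made-up names, and a real tool would typically inject latency at the network level rather than inside application code.

```python
import functools
import time
from typing import Any, Callable


def inject_latency(delay_seconds: float) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator factory that delays every call to the wrapped function."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            time.sleep(delay_seconds)  # the injected fault
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_latency(0.05)
def lookup_price(item: str) -> float:
    """Hypothetical dependency call that normally returns immediately."""
    return {"apple": 0.5}.get(item, 0.0)


start = time.perf_counter()
price = lookup_price("apple")
elapsed = time.perf_counter() - start
print(f"price={price}, took {elapsed:.3f}s")
```

Even this tiny fault needs care around scope, cleanup, and measurement, which is why starting with an existing tool usually saves time.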


I hope this post helps you. Thanks for reading; I look forward to your feedback.


References

http://principlesofchaos.org/

https://arxiv.org/abs/2006.04444

https://www.ibm.com/cloud/architecture/architecture/practices/chaos-engineering-principles/