Chaos Engineering Clarity
My recent conversations with technical community made me feel that Chaos Engineering(CE) concepts are being misunderstood. This being emerging practice in the industry, very soon deeper understanding would eventually clear all the confusions. There are comprehensive documentation written about this practice already defining its objectives, principles, and implementation. These documentation are widely available through various open source and commercial CE tools available in market.
Technical folks new to this practice are having doubts and confusions on this practice. I believe further clarities need to be provided on the CE practice widely. I have gathered many questions through my interaction with technical teams. I'm trying to address some of them in this blog post.
Before I begin, you should read through the chaos definition. As per wikipedia:
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production
Is chaos engineering practice against compliance and regulation?
Is chaos engineering replacement for security related tests like static and dynamic/penetration test practice?
The general CE practice is not meant to replace the security tests like static, dynamic or interactive application security tests (SAST, DAST or IAST). Both security testing and CE practice includes injecting fault into the system. The method of analysing the system by trying to break is same in both types of practices but objectives differ.
I see attempt being made to extend CE practice to cover certain types of security testing. There is even a new name Security Chaos Engineering coined for this purpose. They use the same fault injection approaches to inject security flaws into the system. However, one should not confuse general CE practice with security testing. Their objectives are different even if the same tool is used for both types of practices.
Is chaos engineering suitable only for cloud infrastructure?
Any type of infrastructure that hosts your applications can be considered. Whether it is cloud, containerized, virtual machines or physical servers infrastructure. The widely available examples in the documentation of CE provides more examples on cloud infrastructure. That doesn't mean it is only applicable for cloud infrastructure.
Can it be made part of automation through CI pipelines?
Automation of injecting faults at various layers like application, network, and computing resources is quite possible. However, CE process require observability through instrumentation which is better if carried out manually together with all stakeholders. Automating the systems breaking process could fail the objective of deeper understanding on the system and its behaviour when faults are deliberately induced.
The crucial phase of CE practice is to have a game plan day (mock drill) where faults are injected and systems behavior is captured. This would help the teams to be prepared when such real time situation arises in future.
How is chaos engineering different from load testing or performance testing?
Is it necessary to use chaos engineering tool?
Tools help you to get started quickly on achieving the CE objectives. There are various open source and commercial tools available in the market. These tools provide make readymade fault injections available to you with easy setup. Most of the tools comes packaged with resource exhaustion tests (RAM, Disk, I/O), network test(Latency), Infrastructural(pods, containers, VM, server) or application level tests.
Writing these tests from scratch requires lot of time and effort. I would recommend to use one of such tools if you are new to CE practice. Once you are proficient, you can start building your own tool.
Hope this post helps you. Thanks for reading and look forward to your feedback.
References
https://arxiv.org/abs/2006.04444
https://www.ibm.com/cloud/architecture/architecture/practices/chaos-engineering-principles/