Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.
We want to let operators safely run randomized downtime scenarios in pre-production and production environments, starting with the smallest unit (a pod) and working up to the largest (a cluster). Once operators have built and configured a sound fail-over plan, they should be able to run those scenarios in production. Relevant metrics should be provided alongside incidents.
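To make the intended guardrails concrete, here is a minimal Python sketch of how a scenario could be planned: pick a random target within a scope, and refuse production runs until a fail-over plan has been verified. Everything in it is an illustrative assumption rather than an existing API: the `Scenario` type, the `plan_scenario` function, the `failover_plan_verified` flag, and the environment names are all hypothetical. Only the two scopes named above (pod and cluster) are modelled; intermediate units could be added to the tuple.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical scope ladder, smallest blast radius first.
SCOPES = ("pod", "cluster")


@dataclass(frozen=True)
class Scenario:
    environment: str  # assumed values: "pre-prod" or "prod"
    scope: str        # one of SCOPES
    target: str       # name of the randomly chosen pod or cluster


def plan_scenario(
    environment: str,
    scope: str,
    targets: Sequence[str],
    failover_plan_verified: bool,
    rng: Optional[random.Random] = None,
) -> Scenario:
    """Pick a random target for a downtime scenario, enforcing the guardrails."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope {scope!r}; expected one of {SCOPES}")
    if not targets:
        raise ValueError("no candidate targets supplied")
    # Production runs are only allowed once a fail-over plan is in place.
    if environment == "prod" and not failover_plan_verified:
        raise PermissionError("fail-over plan must be verified before prod runs")
    rng = rng or random.Random()
    return Scenario(environment, scope, rng.choice(list(targets)))


if __name__ == "__main__":
    # Pod names are made up for the example.
    print(plan_scenario("pre-prod", "pod", ["web-1", "web-2", "web-3"], False))
```

Passing a seeded `random.Random` makes a run reproducible, which helps when correlating an injected failure with the metrics reported for the resulting incident.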
Interested in joining the conversation for this category? Please join us in our public epic where we discuss this topic and can answer any questions you may have. Your contributions are more than welcome.
We want to give our Kubernetes users an easy way to get started with chaos engineering. Litmus is a toolset for cloud-native chaos engineering. It provides tools to orchestrate chaos on Kubernetes and help SREs find weaknesses in their deployments. SREs use Litmus to run chaos experiments, initially in a staging environment and eventually in production, to find bugs and vulnerabilities. Fixing those weaknesses increases the resilience of the system.
Once users can easily install Litmus, we plan to incorporate these capabilities into Auto DevOps.
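As a rough illustration of what a Litmus experiment looks like to an operator, the Python sketch below builds a manifest for a pod-deletion experiment and prints it as JSON. The field names reflect the `ChaosEngine` custom resource in the `litmuschaos.io/v1alpha1` API as best we understand it and should be checked against the installed Litmus version. The helper `chaos_engine_manifest` and the example names (`web-chaos`, `staging`, `app=web`, `pod-delete-sa`) are hypothetical.

```python
import json


def chaos_engine_manifest(name, namespace, app_label, service_account,
                          duration_seconds=30):
    """Build a ChaosEngine manifest for a pod-delete experiment (a sketch)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Which application the experiment targets.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Service account assumed to hold the experiment's RBAC permissions.
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                # Env values are strings in Kubernetes manifests.
                                {"name": "TOTAL_CHAOS_DURATION",
                                 "value": str(duration_seconds)},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = chaos_engine_manifest(
        "web-chaos", "staging", "app=web", "pod-delete-sa")
    print(json.dumps(manifest, indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed output could in principle be applied to a staging cluster where Litmus and the pod-delete experiment are already installed.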
Gremlin provides a framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks.
Chaos Toolkit is a project whose mission is to provide a free, open and community-driven toolkit and API to all the various forms of chaos engineering tools that the community needs.