Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.
We want operators to be able to safely run randomized downtime scenarios in pre-production and production environments, starting with the smallest unit (a pod) and scaling up to the largest (a cluster). Once operators have built and configured a sound failover plan, they should be able to run downtime scenarios in production environments. Relevant metrics should be provided alongside incidents.
Interested in joining the conversation for this category? Please join us in our public epic where we discuss this topic and can answer any questions you may have. Your contributions are more than welcome.
We want to give our Kubernetes users an easy way to get started with chaos engineering. GitLab has previously used kube-monkey internally, with good results, as part of testing our Helm charts. kube-monkey is an implementation of Netflix's Chaos Monkey for Kubernetes clusters. It randomly deletes Kubernetes pods in the cluster, encouraging and validating the development of failure-resilient services.
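To make the idea of random pod deletion concrete, here is a minimal in-memory sketch in Python. It does not talk to a cluster and is not kube-monkey's actual implementation. The `Pod` class, the `chaos_enabled` opt-in flag, and the `pick_victims` and `simulate_deletion` helpers are all hypothetical names introduced for illustration, and the rule of at most one victim per app is an assumption chosen to keep the blast radius small.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    app: str
    chaos_enabled: bool  # hypothetical opt-in flag, not a real Kubernetes field


def pick_victims(pods, rng, max_per_app=1):
    """Randomly choose up to max_per_app opted-in pods from each app."""
    by_app = {}
    for pod in pods:
        if pod.chaos_enabled:
            by_app.setdefault(pod.app, []).append(pod)

    victims = []
    # Sort by app name so that a seeded rng gives a reproducible result.
    for _app, candidates in sorted(by_app.items()):
        count = min(max_per_app, len(candidates))
        victims.extend(rng.sample(candidates, count))
    return victims


def simulate_deletion(pods, victims):
    """Return the pods that remain after the victims are removed."""
    doomed = {victim.name for victim in victims}
    return [pod for pod in pods if pod.name not in doomed]


# Hypothetical inventory: two opted-in apps and one that has not opted in.
pods = [
    Pod("web-1", "web", True),
    Pod("web-2", "web", True),
    Pod("api-1", "api", True),
    Pod("api-2", "api", True),
    Pod("db-1", "db", False),
]

rng = random.Random(42)  # seeded so a run can be replayed
victims = pick_victims(pods, rng)
survivors = simulate_deletion(pods, victims)

print("deleted:", [pod.name for pod in victims])
print("remaining:", [pod.name for pod in survivors])
```

In a real cluster the deleted pods would be recreated by their controllers, and the experiment would check that the service stayed healthy in the meantime. Pods that have not opted in (here `db-1`) are never selected.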
Once users can easily install kube-monkey, we plan to incorporate these capabilities into Auto DevOps.
Gremlin provides a framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks.
Chaos Toolkit is a project whose mission is to provide a free, open and community-driven toolkit and API to all the various forms of chaos engineering tools that the community needs.