Chaos Monkey, the first member of Netflix's "Simian Army", has been released as open source. The software runs on the Amazon Web Services (AWS) cloud platform and can be used to stress test cloud deployments. Chaos Monkey randomly disables virtual machines in Auto Scaling Groups (ASGs), letting support engineers test their contingency plans for outages under realistic circumstances and giving administrators the opportunity to learn from the problems they encounter. Chaos Monkey's schedule can be configured, but by default it runs during work hours so that engineers are notified and can react quickly.
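The behaviour described above can be illustrated with a minimal Python sketch. It is not Chaos Monkey's actual code: the function names, the 9:00 to 17:00 weekday window and the `terminate` callback are all assumptions made for illustration, and a real deployment would call the AWS API instead of a local callback.

```python
import random
from datetime import datetime, time


def within_work_hours(now, start=time(9, 0), end=time(17, 0)):
    """Return True on Monday to Friday between start and end.

    The 9:00 to 17:00 window is an assumed default for this sketch.
    """
    return now.weekday() < 5 and start <= now.time() < end


def pick_victim(instances, rng=random):
    """Pick one instance ID at random, or None if the group is empty."""
    if not instances:
        return None
    return rng.choice(instances)


def run_once(groups, now, terminate, rng=random):
    """Terminate at most one random instance per group during work hours.

    groups maps a group name to a list of instance IDs; terminate is a
    hypothetical callback that would disable the chosen instance.
    Returns the list of instance IDs that were terminated.
    """
    if not within_work_hours(now):
        return []
    victims = []
    for instances in groups.values():
        victim = pick_victim(instances, rng)
        if victim is not None:
            terminate(victim)
            victims.append(victim)
    return victims
```

Restricting the run to a work-hours window is what ensures that any resulting failure surfaces while engineers are at their desks rather than in the middle of the night.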
According to Netflix, the idea behind Chaos Monkey is to make sure engineers and administrators are prepared for problems and can solve them efficiently when they occur. In its announcement, the company summarises the reasons for deploying the software as follows: "Failures happen and they inevitably happen when least desired or expected. If your application can't tolerate an instance failure would you rather find out by being paged at 3am or when you're in the office and have had your morning coffee?"
Netflix has been using this approach for some time and says that, over the last year, Chaos Monkey has disabled over 65,000 virtual machine instances in its network. In many cases, AWS handles these outages seamlessly and nobody notices a problem, but the approach has also uncovered bugs and other problems, which could then be eliminated.
To make the approach less painful, Chaos Monkey has several configuration options. It can be used in either opt-in or opt-out mode, and the mode is set separately for each application the service is run on. This allows organisations to test the software without putting their whole infrastructure at risk. Administrators can also tune the probability with which Chaos Monkey terminates instances. As reference points, a setting of 100% corresponds to one terminated instance a day, while 20% corresponds to one instance per week on average.
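The relationship between the probability setting and the termination rate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Chaos Monkey's configuration API: the function names are hypothetical, and the five-day working week is an assumption that makes the 20% figure come out at one termination per week.

```python
import random

# Assumption: the tool only runs on the five working days of the week.
WORK_DAYS_PER_WEEK = 5


def expected_terminations_per_week(daily_probability,
                                   work_days=WORK_DAYS_PER_WEEK):
    """Expected number of terminated instances per group per week."""
    return daily_probability * work_days


def should_terminate_today(daily_probability, rng=random):
    """Draw once per day; True means an instance is terminated today."""
    return rng.random() < daily_probability


def is_eligible(group_name, mode, listed_groups):
    """Hypothetical opt-in/opt-out check for one application group.

    opt-in:  only the listed groups are targeted.
    opt-out: every group is targeted unless it is listed.
    """
    if mode == "opt-in":
        return group_name in listed_groups
    if mode == "opt-out":
        return group_name not in listed_groups
    raise ValueError(f"unknown mode: {mode!r}")
```

With these assumptions, a daily probability of 1.0 yields five expected terminations per working week (one a day), and 0.2 yields one per week on average. Starting in opt-in mode with a short list of groups keeps the rest of the infrastructure out of reach while the tool is being evaluated.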
The source code for Chaos Monkey is available from GitHub under the Apache 2.0 Licence. More information on the software is available from its documentation wiki, which is also hosted on GitHub.
Netflix is planning to release more members of its Simian Army in the future. The next candidate will most likely be Janitor Monkey, a program that helps clean up unneeded assets from AWS environments and thus save running costs.