Release the Chaos Monkey
Yesterday Netflix finally released the Chaos Monkey to the community. I had expected them to do it at some point and was overjoyed to see it happen. For those of you who don’t know the Chaos Monkey, I’ll elaborate: Chaos Monkey is a service that runs in the production environment on AWS, seeks out servers, and terminates instances (virtual machines). You might want to do a double take.
You read that right: Chaos Monkey randomly kills running servers. It deliberately causes random outages, which means that the code must be robust and fault tolerant enough to handle them. It also means that services must be able to repair themselves for the most part; with the number of VMs Netflix is running, it would be impossible for sysadmins to fix systems manually.
Failures happen, and they inevitably happen when least desired or expected. So rather than wait for random events to happen, Chaos Monkey nudges the system in the wrong direction to make the exceptional part of the routine.
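To make the idea concrete, here is a toy simulation in Python. It is only a sketch of the concept and not Netflix's code: nothing here talks to AWS, and the names `ServiceGroup`, `chaos_monkey`, and `simulate` are made up for illustration. A group of fake instances gets one member killed at random, then repairs itself by launching a replacement.

```python
import random


class ServiceGroup:
    """Toy stand-in for a self-healing group of instances (hypothetical)."""

    def __init__(self, desired):
        self.desired = desired      # how many instances should be running
        self.next_id = 0            # counter used to mint unique fake ids
        self.instances = set()
        self.heal()                 # start at full strength

    def heal(self):
        """Launch replacements until the group is back at its desired size."""
        launched = 0
        while len(self.instances) < self.desired:
            self.instances.add(f"i-{self.next_id:04d}")
            self.next_id += 1
            launched += 1
        return launched


def chaos_monkey(group, rng, kill_probability):
    """With the given probability, terminate one randomly chosen instance."""
    if group.instances and rng.random() < kill_probability:
        victim = rng.choice(sorted(group.instances))
        group.instances.remove(victim)
        return victim
    return None


def simulate(rounds=10, desired=3, kill_probability=0.5, seed=42):
    """Run the monkey for a number of rounds, healing after every kill."""
    rng = random.Random(seed)
    group = ServiceGroup(desired)
    killed = []
    for _ in range(rounds):
        victim = chaos_monkey(group, rng, kill_probability)
        if victim is not None:
            killed.append(victim)
            group.heal()            # no sysadmin involved in the repair
    return group, killed


if __name__ == "__main__":
    group, killed = simulate()
    print("killed:", killed)
    print("still running:", sorted(group.instances))
```

However many instances the monkey kills, the group ends every round back at its desired size. That is the property a real service has to provide for itself before you let something like this loose in production.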
Chaos Monkey released into the wild