Designing for resilience: A case for a network Chaos Monkey

Netflix runs a massive data center. And would you believe they do it with a team of monkeys?

In 2011, Netflix quite famously talked about their Simian Army. The Simian Army is essentially a bunch of tools that randomly wreak havoc in their production environment. The name refers to unleashing an army of armed monkeys on your data center to smash devices and chew through cables. By driving random failures into the middle of highly monitored systems in the middle of the day, they identify weaknesses in their production environment at a time when their staff is on-hand and alert to respond.

The Simian Army started with Chaos Monkey, which randomly shuts down production instances in the Netflix cloud. After the success of Chaos Monkey, they added a number of other simian servants: Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, and the Chaos Gorilla. For a description of what these all do, see Netflix's blog post on the Simian Army.
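
To make the concept concrete, here is a minimal sketch of the Chaos Monkey idea in Python. This is not Netflix's actual tool (theirs is open source and far more sophisticated); it simply illustrates the principle of terminating one randomly chosen instance. The boto3 calls are real AWS APIs, but the region and the "chaos-eligible" tag are hypothetical choices for this sketch.

```python
# Minimal illustration of the Chaos Monkey idea -- NOT Netflix's actual tool.
# Assumes boto3 is installed and AWS credentials are configured; the region
# and the "chaos-eligible" tag are hypothetical choices for this sketch.
import random
import boto3

def pick_random_instance(ec2, tag_key="chaos-eligible"):
    """Return the ID of a random running instance tagged as fair game."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag-key", "Values": [tag_key]},
        ]
    )
    ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    return random.choice(ids) if ids else None

def unleash_the_monkey(region="us-east-1", dry_run=True):
    """Pick one eligible instance at random and terminate it."""
    ec2 = boto3.client("ec2", region_name=region)
    instance_id = pick_random_instance(ec2)
    if instance_id is None:
        print("No eligible instances; the monkey goes hungry.")
        return
    if dry_run:
        print(f"[dry run] Chaos Monkey would terminate {instance_id}")
        return
    print(f"Chaos Monkey terminates {instance_id}")
    ec2.terminate_instances(InstanceIds=[instance_id])

if __name__ == "__main__":
    unleash_the_monkey(dry_run=True)
```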

The whole point of employing an army of monkeys to assault the data center is to embrace the fact that your environment simply cannot be defect-free. Rather than trying to test your way out of the existence of bugs, a different approach is to agree to live alongside those bugs but demand that the system be resilient enough that their presence is not debilitating.

Modern networking equipment consists of thousands of features supported by millions of lines of code. Even within a single device, it is prohibitively expensive to test every combination imaginable. The best that vendors can do is test representative use cases deployed across reference architectures. And even these tests typically yield a large number of defects that are deemed cosmetic, tolerable, or unlikely to happen in the real world. Put simply, vendors ship things that they know do not work in all conditions.

For customers, the best you can hope for is that your environment is somehow covered in the tests that do exist. But the other largely unspoken truth is that each deployment is subtly but importantly different. If you use 100 features and someone else uses 100 features, the odds that you use them together in exactly the same way are slim. That is to say that what might be a valid test for one environment is unlikely to translate directly to even very similar environments. And unless you are one of the largest accounts (typically measured by spend with the vendor), there is little chance that your exact environment has been replicated in your vendor QA labs.

So what should the industry do?

The first inclination is to broaden the tests. But you cannot test your way out of a quality problem. It is quite expensive to find all the bugs that exist, and even if you could find them all, for every three that you fix, you introduce a new one. Reaching Defect Zero is a moving target, and in large systems it is not even a worthwhile goal. But what is the alternative?

There are basically two things we can do: converge on a small set of very specific reference architectures, or design with more resiliency in mind.

The proliferation of features in our industry is by our own design. Customers are uniquely susceptible to point features as gateway drugs. Once you get a knob or doohickey that you specifically ask for, getting off that drug is nigh impossible. And even if you do not get your own feature, the fact that you can design your network to use any of the 40,000 features available means that your environment is a snowflake – a unique network instance suitable for exactly what you need.

Snowflake networks are challenging though. Vendors don't test them all. They look at 50 snowflakes and then create a reference architecture that represents those snowflakes. The problem here is that any testing done on that reference architecture is really only valid for the specific deployment being tested. But because you are told that your deployment is included in the vendor's testing processes, you get a false sense of security. Note that even small deviations in architecture, configuration, traffic, or other network conditions can render even the best reference tests useless.

To combat that, our industry could consolidate on a small subset of total features that are actually used in production environments. This is why I love Cumulus so much. By building a decoupled software offering essentially from scratch, they simply cannot replicate the history of network features. They will naturally have to select a subset of features. And as they gain market traction, that will help identify the set of features that really matter. 

Ultimately though, I think this is unlikely to yield true industry-wide convergence on a small set of architectures. There is an awful lot of inertia behind the snowflake model our industry currently uses, and I am not confident a startup pushing a new paradigm is enough to buck the trend. So while I like having Cumulus in our industry, I think they will be unable (by themselves) to solve this particular problem.

And if we cannot converge on a small number of reference architectures, then the only alternative is to design with resiliency in mind. 

This is where I think the Chaos Monkey is interesting. We ought to embrace the fact that failures are going to happen. The best we can do is not create an environment free of error but rather produce an environment that continues to be highly functional even in the presence of defects. We ought to be breaking random links, flapping routes, dropping nodes, simulating power outages, and otherwise wreaking our own havoc on our networks.
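
As a rough illustration of what such a monkey might look like, here is a lab-only Python sketch that flaps a randomly chosen link: it takes an interface down, waits, and brings it back up. The swp-style interface names (borrowed from Cumulus Linux convention) and the timing are purely hypothetical, the ip link commands require root on whatever box you point this at, and this is a thought experiment in code rather than something to aim at a production fabric.

```python
# Lab-only sketch of a "network Chaos Monkey": randomly flap a Linux link.
# The interface list and down time are hypothetical; requires root, and
# should only ever be pointed at gear you are prepared to break.
import random
import subprocess
import time

CANDIDATE_LINKS = ["swp1", "swp2", "swp3"]  # hypothetical lab interfaces

def flap_random_link(down_seconds=30):
    """Take one randomly chosen interface down, wait, then restore it."""
    link = random.choice(CANDIDATE_LINKS)
    print(f"Monkey pulls the cable on {link} for {down_seconds} seconds")
    subprocess.run(["ip", "link", "set", "dev", link, "down"], check=True)
    try:
        time.sleep(down_seconds)
    finally:
        # Always bring the link back, even if the nap is interrupted.
        subprocess.run(["ip", "link", "set", "dev", link, "up"], check=True)

if __name__ == "__main__":
    flap_random_link()
```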

It probably starts with vendors. It is fairly innocuous to drop a few of these Simian Warriors into a QA lab. But the reality is that production environments are rarely homogeneous, and they virtually never resemble a sanitized QA lab. At some point, our Chaos Monkey will need to be let out into the wild. It likely begins on weekends in parts of the network that are less critical, and always with staff on hand to support. But if we can design meaningful resiliency into our systems, the presence of this kind of disruption ought not to impact broad service.

And by adopting a philosophy that embraces chaos rather than fears it, we might be able to move a bit faster collectively. New deployments are a lot less scary when you are not worried that a bug will take down the entire system. It could be that agility is the end product of such an approach. For an industry that has been plagued by inertia and inactivity for so long, a little chaos might be just what the doctor ordered.

[As a related aside, I got into a Twitter conversation with Bromium CEO Simon Crosby. He was a bit more bearish on the idea of a networking chaos monkey, suggesting that using random failures as a crutch for QA was a poor substitute for good design. I don't want to suggest that we reduce QA in favor of painting over bugs. I view this as an AND proposition, not an OR.]

[Today's fun fact: Canadians eat more Kraft macaroni and cheese packaged dinners than any nationality in the world. This explains so much.]
