Test Driven Infrastructure: Why configuration management needs testing

Posted by ProgrammingAce on Wed 04 March 2015

A friend and I had a long discussion today about Test Driven Infrastructure, specifically what types of tests are important. We were talking about configuration management with Ansible and testing with serverspec, but the concepts are more important than the tools. I ended up with a vastly different outlook on infrastructure testing from when I started the conversation, so I thought it would be good to write down my thoughts.

It’s important to know that both Ansible and serverspec use declarative languages; meaning, you describe the state you want, and the system works out how to get there. So with that, I seriously doubted the value in using one declarative system to check another. If I wanted to be truly lazy, I could just write a parser to convert Ansible code into serverspec configs. There should be a 1:1 relationship. Except there’s not, and that disparity is what ultimately changed my mind.

The main argument was around how important it is to test how you accomplished a task, versus whether the result is what you want it to be. To me, the ‘what’ is far more important to test than the ‘how’. Meaning, I want to make sure my webserver is serving up the website I expect, and I don’t care if it’s using port 80 or port 443, nginx or apache. As long as the users are getting the website they expect, it really doesn’t matter how it gets there.

But it turns out Ansible isn’t purely declarative. I mean, it can be, and in most cases it even should be, but it isn’t. Configuration management, and Ansible specifically, allows for branching code, and that’s where we start to run into problems. If I tell Ansible to open the firewall on port 80, I can reasonably expect it to follow through without bothering to test it. But if I tell Ansible to open port 80 if the hostname=’webserver’, then I’d better verify the name of the server… because DNS might not be routing to this subnet. If you put your configuration management system under the control of outside stimuli, you’re not guaranteed to get the same results every time. Here’s an example:

I want to use my configuration management system to run a shell script. This shell script creates a PID file that keeps it from running multiple instances at once, so this script is safe to run repeatedly without problems. So I build the server and the service is running happily, and I walk away. Later, the service crashes unexpectedly. Configuration management comes by and happily runs the shell script again and it thinks it’s done a good job. Unfortunately the shell script doesn’t actually start because the PID file wasn’t deleted from the last run. So now your configuration management system is silently failing, and you have no way to know.

In this example, you could use your configuration management system to delete the PID file before each run, and hey, self-solving problem. But that’s an awful idea; you need a system that can tell you when your server has changed in an unexpected way. Ansible could be manipulated into testing for this condition, but it’s clearly not the right tool for the job. Enter serverspec! You define the condition you want the server to be in, and check that it matches. In this example, there had better not be a PID file if the shell script isn’t running.
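
To make that check concrete, here is a minimal sketch of the "no PID file unless the script is running" condition. It is plain Python rather than serverspec's Ruby DSL, it assumes a POSIX system, and the function names are made up for illustration; a real serverspec test would express the same idea with its own resource types.

```python
import os
from pathlib import Path


def process_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists, it just belongs to another user
    return True


def pid_file_state(pid_file: Path) -> str:
    """Classify a PID file as 'absent', 'running', or 'stale'."""
    if not pid_file.exists():
        return "absent"
    try:
        pid = int(pid_file.read_text().strip())
    except ValueError:
        return "stale"  # unreadable contents cannot point at a live process
    if pid <= 0:
        return "stale"  # not a valid process ID
    return "running" if process_is_alive(pid) else "stale"
```

A monitoring check built on this would alert whenever the state is "stale", which is exactly the silent failure described above: the file is there, the process is not, and configuration management believes everything is fine.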

But this example is a bit contrived, so let’s use something a little more real-world:

You have a webserver that’s hosting multiple sites. Each of these sites has a config file, and all of the sites are set up through your configuration management system. Now someone with no knowledge of your systems comes in and installs a new config for a new website without going through config management. You’re left with unknown production data running on a production system. This is something that serverspec (or Aide or Tripwire) could alert you to before you accidentally lose the data when you migrate the server.
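
The core of that kind of check is just a comparison between what is on disk and what configuration management knows about. Here is a rough sketch in plain Python (again, not serverspec syntax); the function name and the idea of passing in a list of managed file names are assumptions for illustration, not how Aide or Tripwire actually work.

```python
from pathlib import Path
from typing import Iterable, List


def unmanaged_configs(config_dir: Path, managed: Iterable[str]) -> List[str]:
    """Return config files present in config_dir that are not in the managed list."""
    # Only regular files are considered; subdirectories are ignored.
    present = {p.name for p in config_dir.iterdir() if p.is_file()}
    return sorted(present - set(managed))
```

You would point this at wherever your site configs live (for example a hypothetical /etc/nginx/sites-enabled) and feed it the site list from your configuration management inventory. Anything it returns is a site nobody told config management about.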

If configuration management worked in a vacuum, I don’t think declarative systems would need testing. Since they have to operate in the real world where things change and break unexpectedly, you had better verify the state of your systems.