For the last 6 months I’ve been consulting on a project to build a monitoring metrics storage service to store several hundred thousand metrics that are updated every ten seconds. We decided to build the service in a way that could be continuously deployed and use as many existing Open Source tools as possible.

There is a growing body of evidence to show that continuous deployment of applications lowers defect rates and improves software quality. However, the significant corpus of literature and talks on continuous delivery and deployment focuses primarily on applications - there is scant information available on applying these CD principles to the work that infrastructure engineers do every day.

Through the process of building a monitoring service with a continuous deployment mindset, we’ve learnt quite a bit about how to structure infrastructure services so they can be delivered and deployed continuously. In this article we’ll look at some of the principles you can apply to your infrastructure to start delivering it continuously.

How to CD your infrastructure successfully

There are two key principles for doing CD with infrastructure services successfully:

  1. Optimise for fast feedback. This is essential for quickly validating that your changes match the business requirements, and for eliminating technical debt and sunk cost before they spiral out of control.
  2. Chunk your changes. A CD mindset forces you to think about creating the shortest and smoothest path to production for changes to go live. Anyone who has worked on public facing systems knows that many big changes made at once rarely result in happy times for anyone involved. Delivering infrastructure services continuously doesn’t absolve you from good operational practice - it’s an opportunity to create a structure that reinforces such practices.

Definitions

  • Continuous Delivery is different from Continuous Deployment in that Continuous Delivery requires some sort of human intervention to promote a change from one stage of the pipeline to the next. In Continuous Deployment no such breakpoint exists - changes are promoted automatically. The speed of Continuous Deployment comes at the cost of potentially pushing a breaking change live. Most discussion of “CD” rarely qualifies which of the two is meant.
  • An infrastructure service is a configuration of software and data that is consumed by other software - not by end users themselves. Think of them as “the gears of the internet”. Examples of infrastructure services include DNS, databases, Continuous Integration systems, or monitoring.

What the pipeline looks like

  1. Push. An engineer makes a change to the service configuration and pushes it to a repository. There may be ceremony around how the changes are reviewed, or they could be pushed directly into master.
  2. Detect and trigger. The CI system detects the change and triggers a build. This can be through polling the repository regularly, or a hosted version control system (like GitHub) may call out via a webhook.
  3. Build artifacts. The build sets up dependencies and builds any required software artifacts that will be deployed later.
  4. Build infrastructure. The build talks to an IaaS service to build the necessary network, storage, compute, and load balancing infrastructure. The IaaS service may be run by another team within the business, or an external provider like AWS.
  5. Orchestrate infrastructure. The build uses some sort of configuration management tool to string the provisioned infrastructure together to provide the service.
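The stages above can be sketched as a single CI job script. This is a minimal illustration only - the stage bodies are hypothetical placeholders you’d replace with your real build, provisioning, and orchestration commands:

```shell
#!/bin/sh
# Minimal CI job sketch: run each pipeline stage in order, verifying
# after each one. Stage bodies are placeholders for illustration.
set -e  # abort the pipeline on the first failure

build_artifacts()      { echo "building artifacts"; }
build_infrastructure() { echo "provisioning IaaS resources"; }
orchestrate()          { echo "running configuration management"; }
verify()               { echo "verifying after: $1"; }  # run smoke tests here

for stage in build_artifacts build_infrastructure orchestrate; do
    "$stage"          # run the stage
    verify "$stage"   # fast feedback: test immediately after each stage
done
```

The `set -e` at the top is what gives you the abort-on-failure behaviour: a failing stage or verification stops the pipeline rather than ploughing on.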

There is a testing step between almost all of these steps. Automated verification of the changes about to be deployed and the state of the running service after the deployment is crucial to doing CD effectively. Without it, CD is just a framework for continuously shooting yourself in the foot faster and not learning to stop. You will fail if you don’t build feedback into every step of your CD pipeline.

Defining the service for quality feedback

  • Decide what guarantees you are providing to your users. A good starting point for thinking about what those guarantees should be is the CAP theorem. Decide if the service you’re building is an AP or CP system. Infrastructure services generally tend towards AP, but there are cases where CP is preferred (e.g. databases).
  • Define your SLAs. This is where you quantify the guarantees you’ve just made to your users. These SLAs will relate to service throughput, availability, and data consistency (note the overlap with the CAP theorem). “The 95th percentile response time for monitoring metric queries over a one-hour window is < 1 second” and “a single storage node failure does not result in graph unavailability” are examples of SLAs.
  • Codify your SLAs as tests and checks. Once you’ve quantified your guarantees as SLAs, this is how you get automated feedback throughout your pipeline. These tests must be executed while you’re making changes. Use your discretion as to whether you run all of the tests after every change, or a subset.
  • Define clear interfaces. It’s extremely rare you have a service that is one monolithic component that does everything. Infrastructure services are made of multiple moving parts that work together to provide the service, e.g. multiple PowerDNS instances fronting a MySQL cluster. Having clear, well-defined interfaces is important for verifying expected interactions between parts before and after changes, as well as during the normal operation of the service.
  • Know your data. Understanding where the data lives in your service is vital to understanding how failures will cascade throughout your service when one part fails. Relentlessly eliminate state within your service by pushing it to one place and front access with horizontally scalable immutable parts. Your immutable infrastructure is then just a stateless application.
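To make “codify your SLAs” concrete, here’s one way the 95th percentile latency SLA could be turned into a check. The latency samples are inlined for illustration - in a real pipeline they’d come from your monitoring system:

```shell
#!/bin/sh
# Sketch: codify the SLA "95th percentile query latency < 1 second"
# as an automated check. Sample latencies are in milliseconds and
# hardcoded here purely for illustration.
samples="420 380 510 940 860 450 980 390 610 700"

# Nearest-rank 95th percentile: sort the samples and pick the value
# at rank round(N * 0.95).
p95=$(printf '%s\n' $samples | sort -n | awk '
    { v[NR] = $1 }
    END { print v[int(NR * 0.95 + 0.5)] }')

if [ "$p95" -lt 1000 ]; then
    echo "OK: 95th percentile ${p95}ms is within SLA"
else
    echo "CRITICAL: 95th percentile ${p95}ms breaches SLA"
    exit 2
fi
```

The same shape works for any quantified guarantee: measure, compare against the SLA threshold, and exit non-zero so the pipeline stops.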

Making it fast

Getting iteration times down is the most important goal for achieving fast feedback. Going from pushing a change to version control to having the change live should take less than 5 minutes (excluding cases where you have to build compute resources). Track execution time on individual stages in your pipeline with time(1), logged out to your CI job’s output. Analyse this data to determine the min, max, median, and 95th percentile execution time for each stage. Identify which steps are taking the longest and optimise them.
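A simple way to get those per-stage timings into your job output is to wrap each stage with the shell’s `time` keyword. The stage functions below are hypothetical stand-ins:

```shell
#!/bin/bash
# Sketch: wrap each pipeline stage in `time` so per-stage durations
# land in the CI job's output, ready for later min/max/median/p95
# analysis. `provision` and `orchestrate` are placeholder stages.
provision()   { sleep 0.2; }
orchestrate() { sleep 0.1; }

for stage in provision orchestrate; do
    echo "=== timing stage: $stage ==="
    time "$stage"    # real/user/sys lines appear in the job log
done
```

Grepping your job logs for the `real` lines then gives you a time series per stage that you can analyse for trends and outliers.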

Get your CI system close to the action. One nasty aspect of working with infrastructure services is the latency between where you are making changes from and where the service you’re making changes to is hosted. By moving your CI system into the same point of presence as the service, you minimise latency between the systems.

This is especially important when you’re interacting with an IaaS API to inventory compute or storage resources at the beginning of a build. Before you can act on any compute resources to install packages or change configuration files you need to ensure those compute resources exist, either by building up an inventory of them or creating them and adding them to said inventory.

Every time your CD pipeline runs it has to talk to your IaaS provider to do these three steps:

  1. Does the thing exist?
  2. Maybe make a change to create the thing
  3. Get info about the thing

Each of these steps requires sending and receiving often non-trivial amounts of data that will be affected by network and processing latency.
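That check/create/describe loop is just an idempotent “ensure it exists” operation. Here’s a sketch that mocks the IaaS inventory with a local file - against a real provider API every one of these round trips would pay network latency, which is exactly why colocating CI with the API helps:

```shell
#!/bin/sh
# Sketch: idempotent check/create/describe, mocked against a local
# inventory file instead of a real IaaS API. Names and the fixed
# mock address are illustrative only.
INVENTORY=$(mktemp)

ensure_instance() {
    name="$1"
    # 1. Does the thing exist?
    if ! grep -q "^$name " "$INVENTORY"; then
        # 2. Maybe make a change to create the thing
        echo "$name 10.0.0.5" >> "$INVENTORY"   # fixed mock address
    fi
    # 3. Get info about the thing
    grep "^$name " "$INVENTORY"
}

ensure_instance metrics-store-01   # created on first run
ensure_instance metrics-store-01   # no-op on second run (idempotent)
```

Because the function converges on the same state no matter how many times it runs, the pipeline can call it on every build without special-casing “first run” versus “subsequent runs”.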

By moving your CI close to the IaaS API, you get a significant boost in run time performance. By doing this on the monitoring metrics storage project we reduced the CD pipeline build time from 20 minutes to 5 minutes.

Push all your changes through CI. It’s tempting when starting out your CD efforts to push some changes through the pipeline, but still make ad-hoc changes outside the pipeline, say from your local machine.

This results in several problems:

  • You don’t receive the latency-reducing benefits of having your CI system close to the infrastructure.
  • You limit visibility to other people in your team as to what changes have actually been made to the service. That quick fix you pushed from your local machine might contribute to a future failure that your colleagues will have no idea about. The team as a whole benefits from having an authoritative log of all changes made.
  • You end up with divergent processes - one for ad-hoc changes and another for Real Changes™. Now you’re optimising two processes, and those optimisations will likely clobber one another. Have fun.
  • You reduce your confidence that changes made in one environment will apply cleanly to another. If you’re pushing changes through multiple environments before they reach production, one-off changes made outside the pipeline reduce your certainty that a change passing in one environment won’t fail in another.

There’s no point in lying: pushing all changes through CI is hard but worth it. It requires thinking about changes differently and embracing a different way of working.

The biggest initial pushback you’ll probably get is having to context switch between your terminal where you’re making changes and the web browser where you’re tracking the CI system output. This context switch sounds trivial but I dare you to try it for a few hours and not feel like you’re working more slowly.

Netflix Skunkworks’ jenkins-cli is an absolute godsend here - it allows you to start, stop, and tail jobs from your command line. Your workflow for making changes now looks something like this:

git push && jenkins start $job && jenkins tail $job

The tail is the real killer feature here - you get the console output from Jenkins on your command line without the need to switch away to your browser.

Chunking your changes

Change one, test one is a really important way of thinking about how to apply changes so they are more verifiable. When starting out with CD, the easiest path is to make all your changes and then test them straight away, e.g.

  • Change app
  • Change database
  • Change proxy
  • Test app
  • Test database
  • Test proxy

What happens when your changes cause multiple tests to fail? You’re faced with having to debug multiple moving parts without solid information on what is contributing to the failure.

There’s a very simple solution to this problem - test immediately after you make changes:

  • Change app
  • Test app
  • Change database
  • Test database
  • Change proxy
  • Test proxy

When you make changes to the app that fail the tests, you’ll get fast feedback and automatically abort all the other changes until you debug and fix the problem in the app layer.

If you were applying changes by hand you would likely be doing something like this anyway, so encode that good practice into your CD pipeline.
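Encoded in a pipeline script, the change-then-test ordering might look like this. The `change_*`/`test_*` functions are hypothetical placeholders for your real deployment and verification steps:

```shell
#!/bin/sh
# Sketch: apply and verify one component at a time, aborting on the
# first failure so a broken app change never touches the database or
# proxy. All change_*/test_* bodies are illustrative placeholders.
set -e

change_app()      { echo "app changed"; }
test_app()        { echo "app ok"; }
change_database() { echo "database changed"; }
test_database()   { echo "database ok"; }
change_proxy()    { echo "proxy changed"; }
test_proxy()      { echo "proxy ok"; }

for component in app database proxy; do
    "change_$component"
    "test_$component"   # fail here and later components are never touched
done
```

With `set -e`, a failing `test_app` stops the loop dead, so you’re debugging exactly one change rather than three.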

Tests must finish quickly. If you’ve worked on a code base with good test coverage you’ll know that slow tests are a huge productivity killer. Exactly the same applies here - the tests should be a help, not a hindrance. Aim to keep each test executing in under 10 seconds, preferably under 5 seconds.

This means you must make compromises in what you test. Test for really obvious things like “Is the service running?”, “Can I do a simple query?”, “Are there any obviously bad log messages?”. You’ll likely see the crossover here with “traditional” monitoring checks. You know, those ones railed against as being bad practice because they don’t sufficiently exercise the entire stack.

In this case, they are a pretty good indication your change has broken something. Aim for “good enough” fast coverage in your CD pipeline which complements your longer running monitoring checks to verify things like end-to-end behaviour.
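As an example of a “good enough” fast check, here’s a sketch of the “are there any obviously bad log messages?” test. The log content and pattern list are illustrative - point `LOGFILE` at your real service log and tune the patterns to your stack:

```shell
#!/bin/sh
# Sketch: fast smoke check scanning a service log for obviously bad
# messages. Log content is inlined purely for illustration.
LOGFILE=$(mktemp)
cat > "$LOGFILE" <<'EOF'
2024-01-01T00:00:01 INFO  service started
2024-01-01T00:00:02 INFO  accepting connections
EOF

if grep -Eq 'ERROR|FATAL|panic' "$LOGFILE"; then
    echo "CRITICAL: bad log messages found"
    exit 2
fi
echo "OK: no obviously bad log messages"
```

It won’t catch subtle regressions, but it runs in well under a second and catches the loud failures immediately after a change.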

Serverspec is your friend for quickly writing tests for your infrastructure.

Make the feedback visual. The raw data is cool, but graphs are better. If you’re doing a simple threshold check and you’re using something like Librato or Datadog, link to a dashboard.

If you want to take your visualisation to the next level, use gnuplot’s dumb terminal output to graph metrics on the command line:



  1480 ++---------------+----------------+----------------+---------------**
       +                +                +                + ************** +
  1460 ++                                            *******              ##
       |                                      *******                 #### |
  1440 ++                    *****************                 #######    ++
       |                  ***                                ##            |
  1420 *******************                                  #             ++
       |                                                   #               |
  1400 ++                                                ##               ++
       |                                             ####                  |
       |                                          ###                      |
  1380 ++                                      ###                        ++
       |                                     ##                            |
  1360 ++                               #####                             ++
       |                            ####                                   |
  1340 ++                    #######                                      ++
       |                  ###                                              |
  1320 ++          #######                                                ++
       ############     +                +                +                +
  1300 ++---------------+----------------+----------------+---------------++
       0                5                10               15               20


CRITICAL: Deviation (116.55) is greater than maximum allowed (100.00)
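A plot like the one above can be produced with a short gnuplot script along these lines (assuming a two-column data file of sample index and metric value - the filename and columns here are illustrative):

```gnuplot
# Render an ASCII chart straight to the terminal / CI job output.
set terminal dumb size 79,25
plot 'metrics.dat' using 1:2 with lines notitle
```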

Conclusion

CD of infrastructure services is possible provided you stick to the two guiding principles:

  1. Optimise for fast feedback.
  2. Chunk your changes.

Focus on constantly identifying and eliminating bottlenecks in your CD pipeline to get your iteration time down.