The two key take-home messages from Monitorama PDX are these:
- We are mistakenly developing monitoring tools for ops people, not the developers who need them most.
- Our over-reliance on strip charts as a method for visualising numerical data is hurting ops as a craft.
Death to strip charts
Two years ago, when I received my hard copy of William S. Cleveland's The Elements of Graphing Data, I eagerly opened it and scoured its pages for content on how to better visualise time series data. There were a few interesting methods to improve the visual perception of data in strip charts (banking to 45˚, limiting the colour palette), but to my disappointment no more than ~30 pages of the 297-page tome addressed visualising time series data.
Flapjack assumes a constant stream of events from upstream event producers, and this is fundamental to Flapjack's design.
Flapjack asks a fundamentally different question to other notification and alerting systems: "How long has a check been failing for?". Flapjack cares about the elapsed time, not the number of observed failures.
Alerting systems that depend on counting the number of observed failures to decide whether to send an alert suffer problems when the observation interval is variable.
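The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only, not Flapjack's actual implementation: the `Event` type, the function names, and the thresholds (3 failures, 120 seconds) are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """A single check result from an upstream event producer (hypothetical shape)."""
    timestamp: float  # seconds since an arbitrary start
    ok: bool


def alert_by_count(events: List[Event], max_failures: int = 3) -> Optional[float]:
    """Fire once `max_failures` consecutive failures have been observed."""
    consecutive = 0
    for event in events:
        consecutive = 0 if event.ok else consecutive + 1
        if consecutive >= max_failures:
            return event.timestamp
    return None


def alert_by_elapsed(events: List[Event], max_seconds: float = 120.0) -> Optional[float]:
    """Fire once the check has been failing for at least `max_seconds`."""
    failing_since: Optional[float] = None
    for event in events:
        if event.ok:
            failing_since = None  # a recovery resets the clock
            continue
        if failing_since is None:
            failing_since = event.timestamp
        if event.timestamp - failing_since >= max_seconds:
            return event.timestamp
    return None


# The same outage, observed every 10 seconds and every 300 seconds.
fast = [Event(t, ok=False) for t in range(0, 200, 10)]
slow = [Event(t, ok=False) for t in (0, 300, 600)]

print(alert_by_count(fast), alert_by_count(slow))
print(alert_by_elapsed(fast), alert_by_elapsed(slow))
```

With these inputs the count-based alerter fires 20 seconds into the outage on the fast stream but 600 seconds in on the slow one. The elapsed-time alerter fires at the first observation at or after the 120 second mark in both cases, so its behaviour depends far less on how often the check happens to run.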
Take this scenario with a LAMP stack running in a large cluster:
On Monday I gave a talk at Puppet Camp Sydney 2014 about managing Flapjack data (specifically: contacts, notification rules) with Puppet + Hiera.
There was a live demo of some new Puppet types I've written to manage the data within Flapjack. This is incredibly useful if you want to configure how your on-call are notified from within Puppet.
The code is a little rough around the edges, but you can try it out in the puppet-type branch on vagrant-flapjack.
The NSW Roads and Maritime Services' driver and vehicle registration service suffered a full-day outage on Wednesday due to human error during a routine exercise, an initial review has determined.
Insiders told iTnews that the outage, which affected services for most of Wednesday, was triggered by an error made by a database administrator employed by outsourced IT supplier, Fujitsu.
The technician had made changes to what was assumed to be the test environment for RMS' Driver and Vehicle system (DRIVES), which processes some 25 million transactions a year, only to discover the changes were being made to a production system, iTnews was told.
There is a lot to digest here, so let's start our analysis with two simple and innocuous words in the opening paragraph: "routine exercise".
In October @rodjek asked on Twitter:
"I've got a working Nagios (and maybe Pagerduty) setup at the moment. Why and how should I go about integrating Flapjack?"
Flapjack will be immediately useful to you if:
- You want to identify failures faster by rolling up your alerts across multiple monitoring systems.
- You monitor infrastructures that have multiple teams responsible for keeping them up.
- Your monitoring infrastructure is multitenant, and each customer has a bespoke alerting strategy.
- You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.
At Bulletproof, we are increasingly finding home brew systems tools are critical to delivering services to customers.
These tools are generally wrapping a collection of libraries and other general Open Source tools to solve specific business problems, like automating a service delivery pipeline.
Traditionally these systems tools tend to lack good tests (or simply any tests) for a number of reasons:
- The tools are quick and dirty
- The tools model business processes that are often in flux
- The tools are written by systems administrators
Sysadmins don't necessarily have a strong background in software development. They are likely proficient in Bash, and have hacked a little Python or Ruby. If they've really gotten into the infrastructure as code thing they might have delved into the innards of Chef and Puppet and been exposed to those projects' respective testing frameworks.
In a lot of cases, testing is seen as "something I'll get to when I become a real developer".
Earlier this week I gave a talk at Monitorama EU on psychological factors that should be considered when designing alerts.
Dave Zwieback pointed me to a great blog post of his on managing the human side of post mortems, which bookends nicely with my talk:
Imagine you had to write a postmortem containing statements like these:
We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.
We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.
We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.
While the above scenarios are entirely realistic, it's hard to find many postmortem write-ups that even hint at these "human factors." Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages.
Dave's third example dovetails well with some of the examples in Dekker's Just Culture.
Monitoring dependency graphs are fine for small environments, but they are not a good fit for nested complex environments, like those that make up modern web infrastructures.
DAGs are a very alluring data structure to represent monitoring relationships, but they fall down once you start using them to represent relationships at scale:
- There is an assumption that there is a direct causal link between edges of the graph. It's very tempting to believe that you can trace failure from one edge of the graph to another. Failures in one part of a complex system all too often have weird effects and induce failure in other components of the same system that are quite removed from one another.
- Complex systems are almost impossible to model. With time and an endless stream of money you can sufficiently model the failure modes within complex systems in isolation, but fully understanding and predicting how complex systems interact with and relate to one another is almost impossible. The only way to model this effectively is to have a closed system with very few external dependencies, which is the opposite of the situation every web operations team is in.
- The cost of maintaining the graph is non-trivial. You could employ a team of extremely skilled engineers to understand and model the relationships between each component in your infrastructure, but their work would never be done. On top of that, given the sustained growth most organisations experience, whatever you model will likely change within 12-18 months. Fundamentally it would not provide a good return on investment.
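To make the first point concrete, here is a minimal sketch of dependency-based alert suppression in Python. It is not taken from any particular monitoring tool: the `DEPENDS_ON` graph, the check names, and both functions are hypothetical and exist only to show the causal assumption baked into this kind of model.

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: each check maps to the checks it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "web": ["app"],
    "app": ["db", "cache"],
    "db": ["storage"],
    "cache": [],
    "storage": [],
}


def ancestors(check: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every check reachable from `check` by following dependency edges."""
    seen: Set[str] = set()
    stack = list(graph.get(check, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def alerts_to_send(failing: Set[str], graph: Dict[str, List[str]]) -> Set[str]:
    """Suppress any failing check that has a failing dependency somewhere upstream."""
    return {check for check in failing if not (ancestors(check, graph) & failing)}


# "web" and "storage" fail at the same time; only "storage" is alerted on.
print(alerts_to_send({"web", "storage"}, DEPENDS_ON))
```

The suppression rule assumes the `web` failure was caused by the `storage` failure simply because a path exists between them. If `web` is actually down for an unrelated reason, such as a bad deploy, that failure stays hidden until `storage` recovers. Every edge in `DEPENDS_ON` also has to be kept in sync with the real infrastructure by hand.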
At my day job, I run a distributed team of infrastructure coders spread across Australia + one in Vietnam. Our team is called the Software team, but we're more analogous to a product-focused Research & Development team.
Other teams at Bulletproof are a mix of office and remote workers, but our team is a little unique in that we're fully distributed. We do daily standups using Google Hangouts, and try to do face to face meetups every few months at Bulletproof's offices in Sydney.
Intra-team communication is something we're good at, but I've been putting a lot of effort lately into improving how our team communicates with others in the business.
This is a post I wrote on our internal company blog explaining how we schedule work, and why we work this way.
What on earth is this?
This is a Kanban board.
Back in 2009 when I was backpacking around Europe I remember waking up on the morning of June 1 and reading about how an Air France flight had disappeared somewhere over the Atlantic.
The lack of information on what happened to the flight intrigued me, and given the traveling I was doing, I was left wondering "what if I was on that plane?"
Keeping an ear out for updates, in December 2011 I stumbled upon the Popular Mechanics article describing the final moments of the flight. I was left fascinated by how a technical system so advanced could fail so horribly, apparently because of the faulty meatware operating it.