If alarms are more often false than true, a culture emerges on the unit in which staff may delay their response to alarms, especially when they are engaged in other patient care activities, and more important critical alarms may be missed.
One of the most difficult challenges we face in the operations field right now is "alert fatigue". Alert fatigue is a term the tech industry has borrowed from the medical industry's "alarm fatigue": a phenomenon in which people become so desensitised to the noise of monitor alarms that they fail to notice them or react in time.
In an on-call scenario, I posit two main factors contribute to alert fatigue:
- The accuracy of the alert.
- The volume of alerts received by the operator.
Alert fatigue can manifest itself in many ways:
- Operators delaying a response to an alert they've seen before because "it'll clear itself".
- Impaired reasoning and creeping bias, due to physical or mental fatigue.
- Poor decision making during incidents, due to an overload of alerts.
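Both contributing factors are measurable. A minimal sketch of quantifying them, assuming a hypothetical alert log where each entry records the check name and whether the alert turned out to be actionable:

```python
from collections import Counter

# Hypothetical alert log: (timestamp, check name, was_actionable)
alerts = [
    ("2014-03-01T02:10", "disk-usage", False),
    ("2014-03-01T02:40", "disk-usage", False),
    ("2014-03-01T03:05", "app-5xx-rate", True),
    ("2014-03-01T09:30", "disk-usage", False),
    ("2014-03-01T11:00", "app-5xx-rate", True),
]

def precision(alerts):
    """Fraction of alerts that were actually actionable (factor 1)."""
    actionable = sum(1 for _, _, ok in alerts if ok)
    return actionable / len(alerts)

def volume_per_check(alerts):
    """How many alerts each check generated (factor 2)."""
    return Counter(name for _, name, _ in alerts)

print(precision(alerts))          # 0.4 -- most alerts were noise
print(volume_per_check(alerts))   # disk-usage dominates the volume
```

Tracking even these two crude numbers per check makes the worst offenders obvious.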
Earlier this year a story popped up about a Boston hospital that silenced alarms to improve the standard of care. It sounded counter-intuitive, but in the context of the alert fatigue problems we're facing, I wanted to get a better understanding of what they actually did, and how we could potentially apply it to our domain.
The two key take-home messages from Monitorama PDX are these:
- We are mistakenly developing monitoring tools for ops people, not the developers who need them most.
- Our over-reliance on strip charts as a method for visualising numerical data is hurting ops as a craft.
Death to strip charts
Two years ago when I received my hard copy of William S. Cleveland's The Elements of Graphing Data, I eagerly opened it and scoured its pages for content on how to better visualise time series data. There were a few interesting methods to improve the visual perception of data in strip charts (banking to 45˚, limiting the colour palette), but to my disappointment no more than ~30 pages of the 297-page tome addressed visualising time series data.
Flapjack assumes a constant stream of events from upstream event producers, and this assumption is fundamental to its design.
Flapjack asks a fundamentally different question to other notification and alerting systems: "How long has a check been failing for?". Flapjack cares about the elapsed time, not the number of observed failures.
Alerting systems that depend on counting the number of observed failures to decide whether to send an alert suffer problems when the observation interval is variable.
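The difference can be sketched in a few lines. This is not Flapjack's implementation, just a toy illustration of elapsed-time versus count-based alerting when the observation interval varies:

```python
class ElapsedTimeAlerter:
    """Alert based on how long a check has been failing,
    not how many failures were observed."""
    def __init__(self, window=300):
        self.window = window          # seconds a check must fail before alerting
        self.failing_since = None

    def observe(self, ok, now):
        if ok:
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = now
        return (now - self.failing_since) >= self.window

class CountingAlerter:
    """Alert after N consecutive observed failures --
    sensitive to how often checks happen to run."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def observe(self, ok, now):
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.threshold

# Same outage, but an overloaded scheduler only ran the check
# every 10 minutes instead of every minute:
counting = CountingAlerter(threshold=3)
elapsed = ElapsedTimeAlerter(window=300)
for t in (0, 600, 1200):
    counting.observe(False, t)
    elapsed.observe(False, t)
# The counting alerter only reaches its threshold 20 minutes in,
# while the elapsed-time alerter fired after 10 minutes of failure.
```

With a fixed check interval the two behave identically; the elapsed-time approach only pays off when intervals drift, which is exactly the variable-interval problem described above.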
Take this scenario with a LAMP stack running in a large cluster:
On Monday I gave a talk at Puppet Camp Sydney 2014 about managing Flapjack data (specifically: contacts, notification rules) with Puppet + Hiera.
There was a live demo of some new Puppet types I've written to manage the data within Flapjack. This is incredibly useful if you want to configure how your on-call staff are notified from within Puppet.
The code is a little rough around the edges, but you can try it out in the puppet-type branch on vagrant-flapjack.
The NSW Roads and Maritime Services' driver and vehicle registration service suffered a full-day outage on Wednesday due to human error during a routine exercise, an initial review has determined.
Insiders told iTnews that the outage, which affected services for most of Wednesday, was triggered by an error made by a database administrator employed by outsourced IT supplier Fujitsu.
The technician had made changes to what was assumed to be the test environment for RMS' Driver and Vehicle system (DRIVES), which processes some 25 million transactions a year, only to discover the changes were being made to a production system, iTnews was told.
There is a lot to digest here, so let's start our analysis with two simple and innocuous words in the opening paragraph: "routine exercise".
In October @rodjek asked on Twitter:
"I've got a working Nagios (and maybe Pagerduty) setup at the moment. Why and how should I go about integrating Flapjack?"
Flapjack will be immediately useful to you if:
- You want to identify failures faster by rolling up your alerts across multiple monitoring systems.
- You monitor infrastructures that have multiple teams responsible for keeping them up.
- Your monitoring infrastructure is multitenant, and each customer has a bespoke alerting strategy.
- You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.
At Bulletproof, we are increasingly finding that home-brewed systems tools are critical to delivering services to customers.
These tools are generally wrapping a collection of libraries and other general Open Source tools to solve specific business problems, like automating a service delivery pipeline.
Traditionally these systems tools tend to lack good tests (or any tests at all) for a number of reasons:
- The tools are quick and dirty
- The tools model business processes that are often in flux
- The tools are written by systems administrators
Sysadmins don't necessarily have a strong background in software development. They are likely proficient in Bash, and have hacked a little Python or Ruby. If they've really gotten into the infrastructure-as-code thing, they might have delved into the innards of Chef and Puppet and been exposed to those projects' respective testing frameworks.
In a lot of cases, testing is seen as "something I'll get to when I become a real developer".
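Testing these tools doesn't have to be heavyweight. A minimal sketch, with a hypothetical `hosts_to_alert` helper standing in for the kind of function a home-brewed tool might contain, runnable with pytest:

```python
def hosts_to_alert(hosts, maintenance):
    """Hypothetical helper from a home-brewed tool: which hosts
    should page on-call, excluding ones under maintenance."""
    return sorted(set(hosts) - set(maintenance))

def test_maintenance_hosts_are_excluded():
    assert hosts_to_alert(["web1", "web2", "db1"], ["db1"]) == ["web1", "web2"]

def test_no_maintenance_window():
    assert hosts_to_alert(["web1"], []) == ["web1"]
```

Even "quick and dirty" tools usually contain a few pure functions like this, and pinning their behaviour down costs minutes, not days.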
Earlier this week I gave a talk at Monitorama EU on psychological factors that should be considered when designing alerts.
Dave Zwieback pointed me to a great blog post of his on managing the human side of post mortems, which bookends nicely with my talk:
Imagine you had to write a postmortem containing statements like these:
We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.
We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.
We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.
While the above scenarios are entirely realistic, it's hard to find many postmortem write-ups that even hint at these "human factors." Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages.
Dave's third example dovetails well with some of the examples in Dekker's Just Culture.
Monitoring dependency graphs are fine for small environments, but they are not a good fit for nested complex environments, like those that make up modern web infrastructures.
DAGs are a very alluring data structure to represent monitoring relationships, but they fall down once you start using them to represent relationships at scale:
There is an assumption of a direct causal link between nodes of the graph. It's very tempting to believe that you can trace a failure from one part of the graph to another. In reality, failures in one part of a complex system all too often have weird effects, inducing failures in components far removed from the original fault.
Complex systems are almost impossible to model. With time and an endless stream of money you can sufficiently model the failure modes within complex systems in isolation, but fully understanding and predicting how complex systems interact and relate with one another is almost impossible. The only way to model this effectively is to have a closed system with very few external dependencies, which is the opposite of the situation every web operations team is in.
The cost of maintaining the graph is non-trivial. You could employ a team of extremely skilled engineers to understand and model the relationships between each component in your infrastructure, but their work would never be done. On top of that, given the sustained growth most organisations experience, whatever you model will likely change within 12-18 months. Fundamentally, it would not provide a good return on investment.
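To make the allure concrete, here is a minimal sketch of the parent-suppression logic such a DAG invites (the topology and check names are hypothetical):

```python
# Model checks as a DAG and suppress alerts for children whose
# (transitive) dependency is already failing.
dependencies = {
    "app":      ["database", "cache"],
    "database": ["network"],
    "cache":    ["network"],
    "network":  [],
}

def suppressed(check, failing, deps):
    """True if any transitive dependency of `check` is failing,
    so its own alert would be silenced as a downstream symptom."""
    stack = list(deps[check])
    while stack:
        dep = stack.pop()
        if dep in failing:
            return True
        stack.extend(deps[dep])
    return False

failing = {"network", "app", "database"}
# "app" and "database" are suppressed because "network" is failing;
# "network" itself still alerts. Tidy -- but nothing in the graph
# captures the weird, non-local interactions described above, and
# someone has to keep this structure in sync with reality.
```

The traversal is trivial; the unsustainable part is the `dependencies` dictionary itself, which is exactly the maintenance burden argued against above.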
At my day job, I run a distributed team of infrastructure coders spread across Australia + one in Vietnam. Our team is called the Software team, but we're more analogous to a product focused Research & Development team.
Other teams at Bulletproof are a mix of office and remote workers, but our team is unusual in that we're fully distributed. We do daily standups using Google Hangouts, and try to do face-to-face meetups every few months at Bulletproof's offices in Sydney.
Intra-team communication is something we're good at, but I've been putting a lot of effort lately into improving how our team communicates with others in the business.
This is a post I wrote on our internal company blog explaining how we schedule work, and why we work this way.
What on earth is this?
This is a Kanban board.