Just post mortems

Earlier this week I gave a talk at Monitorama EU on psychological factors that should be considered when designing alerts.

Dave Zwieback pointed me to a great blog post of his on managing the human side of post mortems, which bookends nicely with my talk:

Imagine you had to write a postmortem containing statements like these:

We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.

We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.

We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.

While the above scenarios are entirely realistic, it’s hard to find many postmortem write-ups that even hint at these “human factors.” Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages.

Dave’s third example dovetails well with some of the examples in Dekker’s Just Culture.

Dekker posits that people fear the consequences of reporting mistakes because:

They don’t know what the consequences will be
The consequences of reporting can be really bad

The last point can be especially important when you consider how things like hindsight bias elevate the importance of proximity.

Simply put: when looking at the consequences of an accident, we tend to blame people who were closest to the thing that went wrong.

In the middle of an incident, unless you know your organisation has your back if you volunteer mistakes you have made or witnessed, you are more likely to withhold situationally helpful but professionally damaging information.

This limits the team’s operational effectiveness and perpetuates a culture of secrecy, thwarting any organisational learning.

I think for Dave’s first example to work effectively (“our decision making was impacted by extreme stress”), you would need to quantify what the causes and consequences of that stress are.

At Bulletproof we are very open with customers in our problem analyses about the technical details of what fails, because our customers are deeply technical themselves, appreciate the detail, and would cotton on quickly if we were pulling the wool over their eyes.

This works well for all parties because all parties have comparable levels of technical knowledge.

There is risk when you talk about stress in general terms because psychological knowledge is not evenly distributed.

Because every man and his dog has experienced stress, every man and his dog feels qualified to talk about and comment on other people’s reactions to stress. Furthermore, it’s a natural reaction to distance yourself from bad qualities you recognise in yourself by attacking and ridiculing those qualities in others.

I’d wager that outsiders would be more reserved in passing judgement when unfamiliar concepts or terminology are used (e.g. confirmation bias, the Semmelweis reflex, etc).

You could reasonably argue that by using those concepts or terminology you are deliberately using jargon to obfuscate information from those outsiders and Cover Your Arse. I would counter that it’s a good opportunity to open a dialogue with those outsiders on building just cultures, eschewing the use of labels like human error, and understanding how cognitive biases are amplified in stressful situations.

Lindsay Holmwood
