Lindsay Holmwood is an engineering manager living in the Australian Blue Mountains. He is the creator of Visage, Flapjack & cucumber-nagios, and organises the Sydney DevOps Meetup.

Why do you want to lead people?

Understanding your motivations for a career change into management is vitally important to understanding what kind of manager you want to be.

When I made the transition into management, I didn't have a clear idea of what my motivations were. I had vague feelings of wanting to explore the challenges of managing people. I also wanted to test myself and see if I could do as good a job as the role models I'd had throughout my career.

But these were vague, unquantifiable feelings that took a while to get a handle on. Understanding, questioning, and clarifying my motivations was something I put a lot of thought into in the first year of my career change.

People within your teams will spend much more time than you realise watching and analysing what you do, and they will pick up on what your motivations are and where your priorities lie.

They will mimic these behaviours and motivations, both positive and negative. You are a signalling mechanism to the team about what's important and what's not.

This is a huge challenge for people making the career change! You're still working all this shit out, and you've got the ever-gazing eye of your team examining and dissecting all of your actions.

These are some of the motivations I've picked up on in myself and others when trying to understand what drew me to the management career change.

Read more...

It's not a promotion - it's a career change

The biggest misconception engineers have when thinking about moving into management is that it's a promotion.

Management is not a promotion. It is a career change.

If you want to do your leadership job effectively, you will be exercising a vastly different set of skills on a daily basis from those you exercise as an engineer. Skills you likely haven't developed, and may not even be aware of.

Your job is not to be an engineer. Your job is not to be a manager. Your job is to be a multiplier.

You exist to remove roadblocks and eliminate interruptions for the people you work with.

You exist to listen to people (not just hear them!), to build relationships and trust, to deliver bad news, to resolve conflict in a just way.

You exist to think about the bigger picture, ask provoking and sometimes difficult questions, and relate the big picture back to something meaningful, tangible, and actionable to the team.

You exist to advocate for the team, to promote group and individual achievements, to gaze into unconstructive criticism and see the underlying motivations, and sometimes even to give up control and make sacrifices you are uncomfortable with or disagree with.

You exist to make systemic improvements with the help of the people you work with.

Does this sound like engineering work?

The truth of the matter is this: you are woefully unprepared for a career in management, and you are unaware of how badly unprepared you are.

There are two main contributing factors that have put you in this position:

  • The Dunning-Kruger effect
  • Systemic undervaluation of non-technical skills in tech

Read more...

Applying cardiac alarm management techniques to your on-call

If alarms are more often false than true, a culture emerges on the unit in that staff may delay response to alarms, especially when staff are engaged in other patient care activities, and more important critical alarms may be missed.

One of the most difficult challenges we face in the operations field right now is "alert fatigue". Alert fatigue is a term the tech industry has borrowed from a similar term used in the medical industry, "alarm fatigue" - a phenomenon of people being so desensitised to the alarm noise from monitors that they fail to notice or react in time.

In an on-call scenario, I posit two main factors contribute to alert fatigue (a rough way to put numbers on both is sketched after this list):

  • The accuracy of the alert.
  • The volume of alerts received by the operator.
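
As a hedged illustration, the sketch below walks a pager history and computes both numbers: alert accuracy (what fraction of alerts were actionable) and per-operator volume. The record format is invented for the example; substitute whatever your paging tool can export.

    # Rough sketch: quantifying the two factors from a pager history.
    # The record fields (:operator, :actionable) are invented for this
    # example - map them onto whatever your paging tool exports.
    alerts = [
      { operator: "alice", actionable: false },
      { operator: "alice", actionable: true  },
      { operator: "bob",   actionable: false },
      { operator: "bob",   actionable: false },
    ]

    # Factor 1: accuracy - what fraction of alerts actually required action?
    accuracy = alerts.count { |a| a[:actionable] }.to_f / alerts.size

    # Factor 2: volume - how many alerts did each operator receive?
    volume = alerts.group_by { |a| a[:operator] }
                   .map { |operator, list| [operator, list.size] }.to_h

    puts "alert accuracy: #{(accuracy * 100).round}%"
    volume.each { |operator, count| puts "#{operator}: #{count} alerts" }

Tracked over time, falling accuracy or a climbing per-operator count is an early warning sign of the manifestations below.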

Alert fatigue can manifest itself in many ways:

  • Operators delaying a response to an alert they've seen before because "it'll clear itself".
  • Impaired reasoning and creeping bias, due to physical or mental fatigue.
  • Poor decision making during incidents, due to an overload of alerts.

Earlier this year a story popped up about a Boston hospital that silenced alarms to improve the standard of care. It sounded counter-intuitive, but in the context of the alert fatigue problems we're facing, I wanted to get a better understanding of what they actually did, and how we could potentially apply it to our domain.

Read more...

Rethinking monitoring post-Monitorama PDX

The two key take-home messages from Monitorama PDX are these:

  • We are mistakenly developing monitoring tools for ops people, not the developers who need them most.
  • Our over-reliance on strip charts as a method for visualising numerical data is hurting ops as a craft.

Death to strip charts

Two years ago when I received my hard copy of William S. Cleveland's The Elements of Graphing Data, I eagerly opened it and scoured its pages for content on how to better visualise time series data. There were a few interesting methods for improving the visual perception of data in strip charts (banking to 45°, limiting the colour palette), but to my disappointment no more than ~30 pages of the 297-page tome addressed visualising time series data.

Read more...

Flapjack, heartbeating, and one off events

Flapjack assumes a constant stream of events from upstream event producers, and this is fundamental to Flapjack's design.

Flapjack asks a fundamentally different question from other notification and alerting systems: "How long has a check been failing for?" Flapjack cares about the elapsed time, not the number of observed failures.

Alerting systems that depend on counting the number of observed failures to decide whether to send an alert suffer problems when the observation interval is variable.
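
To make the difference concrete, here is a minimal sketch (not Flapjack's implementation, just the idea) of the two questions side by side:

    # Sketch only - not Flapjack code. Contrasts "how many failures have
    # we observed?" with "how long has the check been failing?".

    class CountBasedCheck
      def initialize(max_failures)
        @max_failures = max_failures
        @failures     = 0
      end

      # Alerts after N consecutive failed observations, so a slow or
      # irregular check interval directly stretches the time to alert.
      def observe(ok)
        @failures = ok ? 0 : @failures + 1
        puts "ALERT: #{@failures} consecutive failures" if @failures >= @max_failures
      end
    end

    class TimeBasedCheck
      def initialize(max_duration)
        @max_duration  = max_duration
        @failing_since = nil
      end

      # Alerts once the check has been failing for longer than
      # max_duration, however many observations arrived in that window.
      def observe(ok, at = Time.now)
        if ok
          @failing_since = nil
          return
        end
        @failing_since ||= at
        puts "ALERT: failing for #{(at - @failing_since).to_i}s" if at - @failing_since >= @max_duration
      end
    end

With a five-check threshold and a one-minute interval, the count-based check needs roughly five minutes of failures before it alerts; if the event stream backs up and observations arrive ten minutes apart, that blows out to nearly an hour. The time-based check needs only one observation after the configured duration has elapsed, regardless of how many events it saw in between.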

Take this scenario with a LAMP stack running in a large cluster:

Read more...

Data driven alerting with Flapjack + Puppet + Hiera

On Monday I gave a talk at Puppet Camp Sydney 2014 about managing Flapjack data (specifically: contacts, notification rules) with Puppet + Hiera.

There was a live demo of some new Puppet types I've written to manage the data within Flapjack. This is incredibly useful if you want to configure how your on-call staff are notified from within Puppet.

Video:

Slides:

The code is a little rough around the edges, but you can try it out in the puppet-type branch of vagrant-flapjack.

Read more...

The questions that should have been asked after the RMS outage

Routine error caused NSW Roads and Maritime outage

The NSW Roads and Maritime Services' driver and vehicle registration service suffered a full-day outage on Wednesday due to human error during a routine exercise, an initial review has determined.

Insiders told iTnews that the outage, which affected services for most of Wednesday, was triggered by an error made by a database administrator employed by outsourced IT supplier, Fujitsu.

The technician had made changes to what was assumed to be the test environment for RMS' Driver and Vehicle system (DRIVES), which processes some 25 million transactions a year, only to discover the changes were being made to a production system, iTnews was told.

There is a lot to digest here, so let's start our analysis with two simple and innocuous words in the opening paragraph: "routine exercise".

Read more...

The How and Why of Flapjack

In October @rodjek asked on Twitter:

"I've got a working Nagios (and maybe Pagerduty) setup at the moment. Why and how should I go about integrating Flapjack?"

Flapjack will be immediately useful to you if:

  • You want to identify failures faster by rolling up your alerts across multiple monitoring systems.
  • You monitor infrastructures that have multiple teams responsible for keeping them up.
  • Your monitoring infrastructure is multitenant, and each customer has a bespoke alerting strategy.
  • You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.
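
Whichever of those applies, the integration surface is deliberately small: event producers push JSON events onto a queue in Redis and Flapjack consumes them. The sketch below is from my recollection of the v1 event format, so treat the field names, the 'events' queue name, and the port of the bundled Redis as assumptions and check the Flapjack documentation for the authoritative schema.

    require 'redis'
    require 'json'

    # Hedged sketch of a hand-rolled event producer (the bundled Nagios
    # receiver does something similar). Queue name, fields, and the Redis
    # port are assumptions - verify against your Flapjack config and docs.
    redis = Redis.new(host: "localhost", port: 6380)  # omnibus package's bundled Redis

    event = {
      "entity"  => "app01.example.com",
      "check"   => "HTTP",
      "type"    => "service",
      "state"   => "critical",   # ok / warning / critical / unknown
      "summary" => "HTTP CRITICAL: connection refused",
      "time"    => Time.now.to_i
    }

    redis.lpush("events", event.to_json)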

Read more...

CLI testing with RSpec and Cucumber-less Aruba

At Bulletproof, we are increasingly finding that home-brew systems tools are critical to delivering services to customers.

These tools generally wrap a collection of libraries and other general open source tools to solve specific business problems, like automating a service delivery pipeline.

Traditionally these systems tools tend to lack good tests (or any tests at all), for a number of reasons:

  • The tools are quick and dirty
  • The tools model business processes that are often in flux
  • The tools are written by systems administrators

Sysadmins don't necessarily have a strong background in software development. They are likely proficient in Bash, and have hacked on a little Python or Ruby. If they've really gotten into the infrastructure-as-code thing, they might have delved into the innards of Chef and Puppet and been exposed to those projects' respective testing frameworks.

In a lot of cases, testing is seen as "something I'll get to when I become a real developer".
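
As a starting point, here is roughly what a Cucumber-less Aruba spec can look like. It's a sketch assuming a recent Aruba release with its built-in RSpec integration (require 'aruba/rspec' and the :aruba spec type); older releases expose a slightly different API, and the backup CLI being exercised is a hypothetical stand-in for your own tool.

    # spec/cli/backup_spec.rb
    # Sketch: driving a home-brew CLI from RSpec via Aruba, no Cucumber
    # features required. The `backup` command and its flags are invented.
    require 'aruba/rspec'

    RSpec.describe 'backup CLI', type: :aruba do
      before do
        # Aruba runs each example in an isolated working directory, so
        # fixtures written here never touch the real filesystem.
        write_file 'etc/backup.conf', "target: /var/backups\n"
      end

      it 'prints usage and exits non-zero when run with no arguments' do
        run_command 'backup'
        expect(last_command_started).to_not be_successfully_executed
        expect(last_command_started.output).to include('Usage:')
      end

      it 'reads the config file it is pointed at' do
        run_command 'backup --config etc/backup.conf --dry-run'
        expect(last_command_started).to be_successfully_executed
        expect(last_command_started.output).to include('/var/backups')
      end
    end

Isolated working directories, fixture helpers, and matchers over exit status and output cover most of what these quick-and-dirty tools need, without the ceremony of writing Cucumber features first.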

Read more...

Just post mortems

Earlier this week I gave a talk at Monitorama EU on psychological factors that should be considered when designing alerts.

Dave Zwieback pointed me to a great blog post of his on managing the human side of post mortems, which bookends nicely with my talk:

Imagine you had to write a postmortem containing statements like these:

We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.

We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.

We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.

While the above scenarios are entirely realistic, it's hard to find many postmortem write-ups that even hint at these "human factors." Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages.

Dave's third example dovetails well with some of the examples in Dekker's Just Culture.

Read more...