Using a first gen iPad mini as a Grafana dashboard in 2024

2024-01-14T00:00:00+00:00

This project had a very simple goal:

Display local weather measurements on an old tablet in our kitchen.

I have:

A first generation iPad mini that has been gathering dust since 2019.
An outdoor weather station.
A Raspberry Pi with a DVB receiver.
And Grafana & Prometheus on a Vultr VPS that scrapes the Raspberry Pi.

There are obviously some resource constraints here — device and operating system age, memory limits — so it was interesting to solve this problem within these constraints.

Trial and error

First step after wiping the device was to try visiting the existing Grafana dashboards, to see if it would load. This failed immediately because the certificates had expired.

This uncovered the first problem: the device hasn’t received software updates since August 2016, and certificates issued by Let’s Encrypt no longer work.

Fortunately other people have dealt with this issue, back when the certs in the iOS trust store expired in 2021. This very helpful post on the Let’s Encrypt Community Support forum explains how to manually install the current Let’s Encrypt certs from https://letsencrypt.org/certs/isrgrootx1.pem.

With the certificates fixed, I tried visiting the existing graphs dashboard again, but now it showed a new error:

If you're seeing this Grafana has failed to load its application files

Not particularly helpful, especially when you don’t have access to Safari’s dev console on the iPad.

But! I was able to use ios-webkit-debug-proxy to show the errors on a desktop browser. This surfaced a bunch of JavaScript errors for unsupported browser features.

Based on this, I stumbled across this GitHub Issue that showed Grafana 7.0.6 was the last version to support the Safari shipped with iOS 9.3.

This left me with two choices.

Crazy or annoying: pick one

1. Keep using the latest Grafana, but use wrp (as suggested in this Reddit post) to render the page as an image.

This is a wild approach, and it was crazy enough that I had to try it.

While I kinda got it working, I found there were too many moving parts, and I quickly ran into memory limits on the VPS (it’s running a headless Chrome).

2. Set up a standalone instance of Grafana running 7.0.6, the last version that worked with iOS 9.3.5.

This had a few downsides:
- Grafana won’t be patched when there are bugs. To mitigate, I could use Nginx + basic auth to protect it from the public web.
- It’s a separate set of dashboards to maintain. I couldn’t export the dashboard from the newer Grafana and import into 7.0.6, because the dashboard schema had changed. So I would have to recreate dashboards manually, and some of the panel types (like the stat panel) have fewer features.

At the end of the day, option 2 was the least terrible, and the end users (my family) don’t need to know or care about how inelegant the setup is behind the scenes.

The last software thing to set up was Kiosk mode for iPad, to show the dashboard in full screen. Fortunately Kiosk mode still works on older iOS — thanks to the maintainers!

Finally, I had to safely mount the iPad to the wall. I used the Dockem Koala Mount 2.0, mounted directly into the stud.

In conclusion, it’s still possible in 2024 to use older iPads with a bit of work. My recommendation is to stick to things in the browser, or use some of the few apps that still work on the first gen iPads.

Using MikroTik Netinstall on Linux

2022-01-30T00:00:00+00:00

If you’ve used MikroTik network gear long enough, you’ve likely run into devices bricking themselves after RouterOS software upgrades. Maybe you’ve set some configuration that has inadvertently made your device unusable. Or maybe you’ve inherited a device and want to start with a clean slate.

How do you re-install RouterOS, and maybe reset the device’s configuration too?

MikroTik provide the Netinstall tool to do network-based RouterOS installs. For a long time you could only run Netinstall on Windows, but MikroTik recently released a Linux CLI version.

As of writing, it’s only been available for a few months, and it has quite a few rough edges, which I have attempted to document here.

The Linux version of Netinstall is janky in several ways:

It only works on a single network interface, and you cannot control which one it selects.
It generally fails if you have multiple active network interfaces.
It fails with obscure messages if there is no default route on the interface it selects.
It often doesn’t serve up the images correctly the first time.

You need to set Linux networking up in a very particular way to make Netinstall work.

But before we start, a little background on Netinstall, and another less well-known RouterBOARD subsystem called Etherboot.

Netinstall is only one half of the solution. The other is Etherboot.

Netinstall is a binary that rolls a BOOTP/TFTP server into a single executable. The other half of the equation is Etherboot, which is a low-level system built into MikroTik devices for installing RouterOS onto the device’s flash memory.

Check the documentation for how to trigger Etherboot for your specific device, but it generally boils down to:

Power off the device
Hold the reset button
Power on the device

Then watch the output of Netinstall to see the device fetch an image and reboot.

I highly recommend running a packet sniffer like Wireshark or tcpdump when you’re doing this, to identify any configuration errors.

If everything is working correctly, you’ll see Netinstall and Etherboot do a standard BOOTP/TFTP dance.
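If you want to sanity-check the BOOTP half of that dance from a capture, here is a minimal sketch that unpacks the fixed 300-byte BOOTP message layout from RFC 951. The function name and the sample packet are made up for illustration, and this only reads the standard header fields; it says nothing about what Netinstall does internally.

```python
import struct

# Fixed BOOTP layout from RFC 951: 300 bytes in total.
BOOTP_FORMAT = "!BBBBIHH4s4s4s4s16s64s128s64s"
BOOTP_SIZE = struct.calcsize(BOOTP_FORMAT)


def parse_bootp(payload: bytes) -> dict:
    """Unpack the fields most useful when eyeballing a capture."""
    if len(payload) < BOOTP_SIZE:
        raise ValueError("payload too short to be a BOOTP message")
    (op, htype, hlen, hops, xid, secs, _unused,
     ciaddr, yiaddr, siaddr, giaddr,
     chaddr, sname, bootfile, _vend) = struct.unpack(
        BOOTP_FORMAT, payload[:BOOTP_SIZE])
    return {
        "op": {1: "BOOTREQUEST", 2: "BOOTREPLY"}.get(op, "unknown"),
        "xid": xid,
        "client_mac": chaddr[:hlen].hex(":"),
        "your_ip": ".".join(str(b) for b in yiaddr),
        "server_ip": ".".join(str(b) for b in siaddr),
        "boot_file": bootfile.split(b"\x00", 1)[0].decode("ascii", "replace"),
    }


# A hand-built request from a hypothetical client, not a real capture.
sample = struct.pack(
    BOOTP_FORMAT, 1, 1, 6, 0, 0x1234, 0, 0,
    bytes(4), bytes(4), bytes(4), bytes(4),
    bytes.fromhex("012345678910") + bytes(10), b"", b"", b"",
)
print(parse_bootp(sample))
```

In practice you would feed it the UDP payload of a packet on port 67 or 68 exported from Wireshark or tcpdump.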

In my particular case, I was doing this on a cAP ac that had bricked itself after an automated upgrade, and was stuck in an Etherboot reboot loop. I have also used this process to reinstall RouterOS on a hAP ac lite with corrupted configuration.

How to run netinstall on Linux

Before you start:

Fetch the latest netinstall binary for Linux. At time of writing, I was using netinstall 7.1.1.
Fetch the appropriate RouterOS image for your device. You can find the latest image linked from the product page of your device. Pay attention to whether the image is MIPS or ARM.

Once you’ve downloaded these, you need to set up a wired network, with a default route. The netinstall Linux binary will not work if you do not have a default route set, and will output FAILED TO REPLY which is an awesomely unhelpful error message.

To set up the network on Ubuntu, configure /etc/netplan/50-cloud-init.yaml:

network:
  version: 2
  ethernets:
    eno1:
      addresses:
        - 192.168.88.100/24
      routes:
        - to: 0.0.0.0/0
          via: 192.168.88.1

Then apply with:

sudo netplan generate
sudo netplan apply

If you have other interfaces (like wifi) shut them down with something like:

sudo ip link set dev wlp0s20f3 down
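Before starting the server, it can save some head scratching to confirm the default route really is on the wired interface. This is a minimal sketch that reads the kernel routing table text from /proc/net/route; the function name is my own, and the sample table just mirrors the netplan config above (eno1, gateway 192.168.88.1).

```python
import socket
import struct


def default_route_interfaces(route_table: str) -> dict:
    """Map interface name -> gateway for every default route.

    `route_table` is the text of /proc/net/route, where addresses
    are little-endian hex (192.168.88.1 appears as 0158A8C0).
    """
    found = {}
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 8:
            continue
        iface, destination, gateway, mask = (
            fields[0], fields[1], fields[2], fields[7])
        if destination == "00000000" and mask == "00000000":
            packed = struct.pack("<L", int(gateway, 16))
            found[iface] = socket.inet_ntoa(packed)
    return found


# Sample table matching the netplan config above (illustrative only).
SAMPLE = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "eno1\t00000000\t0158A8C0\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
    "eno1\t0058A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
)
print(default_route_interfaces(SAMPLE))

# On a real machine:
# with open("/proc/net/route") as f:
#     print(default_route_interfaces(f.read()))
```

If the result is empty, or names an interface other than the wired one, expect the FAILED TO REPLY message described above.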

Then start the netinstall server:

sudo ./netinstall -a 192.168.88.1 routeros-mipsbe-7.1.1.npk

Replace routeros-mipsbe-7.1.1.npk with your image name.

The -a flag says what IP address should be assigned to Etherboot clients when doing the Netinstall dance.

You should see output that looks something like this:

Using server IP: 192.168.88.100
Starting PXE server
Waiting for RouterBOARD...
PXE client: 01:23:45:67:89:10
Sending image: mips
Discovered RouterBOARD...
Formatting...
Sending package routeros-mipsbe-7.1.1.npk ...
Ready for reboot...
Sent reboot command

Remember that depending on what your device is, when the device comes back up after the RouterOS install, the default configuration may have a firewall on the ethernet interface, so you won’t be able to connect to it.

The default behaviour is to serve up a RouterOS image, but keep the existing configuration on the device.

If you have uploaded broken configuration to the device, or the configuration has become corrupted, a RouterOS install via Netinstall/Etherboot won’t be enough. You will need to wipe all config on the target device, by running the previous command with -r.
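If you script this, a small helper keeps the two variants straight. This is only a sketch: the helper name is hypothetical, and putting -r ahead of -a is my assumption about flag order, so check the usage text of your netinstall version before relying on it.

```python
def netinstall_command(client_ip: str, image: str,
                       wipe_config: bool = False) -> list:
    """Build the argv for the netinstall invocation shown earlier."""
    argv = ["sudo", "./netinstall"]
    if wipe_config:
        argv.append("-r")  # also reset the device's configuration
    argv += ["-a", client_ip, image]
    return argv


# Keep existing configuration (the default behaviour).
print(" ".join(netinstall_command("192.168.88.1", "routeros-mipsbe-7.1.1.npk")))
# Reinstall and wipe configuration.
print(" ".join(netinstall_command("192.168.88.1", "routeros-mipsbe-7.1.1.npk",
                                  wipe_config=True)))
```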

More detail about the Linux version of netinstall can be found on the MikroTik help site.

You can’t use non-MikroTik tools (like dnsmasq) to serve up the RouterOS images

You might be thinking “why use a proprietary tool like Netinstall when I can use open source tools like dnsmasq to serve up the RouterOS images?”

The short answer is: I’ve tried this and it doesn’t work.

The longer answer is: Netinstall isn’t serving up just the RouterOS image, it’s also repackaging it in a way that the RouterBOARD on the other end can use.

The hint to the magic it’s doing is in these two lines of netinstall output:

Formatting...
Sending package routeros-mipsbe-7.1.1.npk ...

This suggests it’s not just sending an image. At the very least, it’s also packaging up configuration to run on first boot (the -s flag), or signalling to wipe existing configuration (the -r flag).

If you set up a dnsmasq instance and try serving up RouterOS images via TFTP, you’ll find that the device will not install that RouterOS image.

I have no interest in working out exactly what it’s doing, nor maintaining a working open source-based alternative.

If you don’t value your time and you want to investigate how to go full open source, I saw one creative solution to this problem on a forum that boiled down to:

Set up a legit Netinstall server
Packet capture a valid Netinstall/Etherboot session
Extract the binary served by Netinstall from the pcap
Then serve it up from dnsmasq

My philosophy on work

2018-07-30T00:00:00+00:00

👋, I’m Lindsay.

I wrote this so you understand my philosophy on work.

You can use this as a quick debugging guide in case you see something in the wild that surprises you.

This document is a set of promises I intend to keep. If I don’t, I expect you to call me out.

I’m here to route information, remove roadblocks, and shield the team

At its core, my job is to do three things for the team:

Route information to the right places at the right time
Remove roadblocks stopping us getting things done
Shield the team from interruptions and distractions

I have the responsibility for:

People. This means your health (mental and physical) and wellbeing at work, your relationship with work (including the dark side of this – keeping burnout at bay), and creating opportunities for you to grow
Systems. I am the single point of accountability for the upkeep, operations, and cost effectiveness of our socio-technical systems. I am the one on the hook for those systems. I will take responsibility when things go bad. I will ensure we work hard to make sure the likelihood of those bad things happening again is reduced.
Delivery. Ensuring we have a good pipeline of work to get on with. Ensuring that work is well defined and well sized. This is the part of work I find really fun!

I am here to leave this world a little better than I found it. If I’m doing my job well, when I step away from the team for long periods of time, things will continue to function well, and adapt and improve.

I value fairness, context, and pride in work

Fairness

To be blunt – the main motivation for me doing a career change into leadership was because I experienced a real mixed bag of bosses. I thought “I can do a better job”, and here we are.

What gets me up every morning is creating a fair and just environment for the people I am responsible for.

I will call out things I think are unfair both in the workplace and for our customers, and I won’t hesitate to take a stand on principles.

Context

Local rationality rules everything around me. People make what they consider to be the best decisions, given the information they have at the time.

Good judgement comes from experience. And experience comes from bad judgement. In my experience, disagreements often come from seeing the same thing from multiple, sometimes conflicting, perspectives.

My job is to facilitate building context for the team, so we can make more right decisions, and only make new mistakes.

Pride in work

I don’t have high standards, I have extreme standards.

I expect great work from the people around me, and I will push you to do the best work you have done in your career.

If I make you feel like I’m constantly disappointed in your work, that means I’m doing a bad job of setting expectations about what those high standards are.

My expectations are few but firm

I don’t have all the answers, and I don’t expect you to either. We’ll work together to build context that is a better approximation of reality.

Family first, work second. I am a firm believer in working-to-live, not living-to-work. If things are not on an even keel at home, your ability to do work is compromised.

Work must never trump your home responsibilities. If your partner asks you to do something important for them during work hours, I expect you to take time to do it. If you have kids, I expect you to take time off for special events.

I lead by example – I take time off during the school holidays so my wife can continue to study.

You know how to manage your time. I blanket approve leave requests. You know what is best for you, and I trust you to make the right call for the team and yourself. I view people going on leave as a chaos monkey that tests the anti-fragility of the team.

Feedback will be direct, prompt, and humane

If there is a problem, you will hear about it directly, promptly, and humanely from me. I don’t hold on to feedback. Delays in feedback create anxiety. My priority is to feed it back to you as quickly as possible.

I’ll provide feedback throughout the day, mostly through Slack. If it’s something particularly sensitive, we will do a video call.

When you have negative feedback for me, I expect it directly, promptly, and privately. Making mistakes is part of being human, and I am no exception to this.

I take a very dim view of hearing criticisms of me second hand. When I do hear second hand criticism, you’ll be hearing from me pretty quickly. This goes back to one of my values – fairness.

I go out of my way to take public responsibility for my mistakes. They are often teachable moments that are applicable to an audience bigger than me.

When you have positive feedback, please deliver it publicly. I like my work to speak for itself, and I appreciate when you say nice things about my work in public.

My office hours are 10.00 to 17.30

You won’t have my full attention before 10.00.

I have a young family, and my priority in the morning is getting them up and going for the day. If I’m in meetings before 10.00, you won’t be getting the best of me. Deal with that as you will. You’ll have a <50% hit rate if you schedule meetings with me before 10.00.

1:1s are the most important conversations I have

1:1s take priority in my calendar. This is where you have the opportunity to ask me anything. I will help you build context about what’s happening more broadly across the organisation.

We will use the full time. Sometimes there will be things we need to talk about, sometimes there won’t be. Even if you don’t think we have things to talk about, we will use the time.

I will hold you accountable for actions that come out of our 1:1s, and I expect you to do the same for me.

Scheduling wise, we can do 1:1s once a week, or once a fortnight.

With direct reports who have people management responsibilities, I want to meet once a week. With direct reports who are individual contributors, once a fortnight is fine – but if you find it valuable, I will do them weekly.

When I assume responsibility for you, I tend to do 1:1s weekly for two-to-three months, before we mutually decide to adjust the frequency.

Slack is the best way to contact me

Given I work remotely, Slack is my lifeline to the team. I am super responsive during the day.

Calendar: best to hit me up on Slack before you book anything. I will blanket reject meeting invites that don’t have agendas.

Email: This is where information goes to die.

I have some quirks. I’m working on them.

I focus on eliminating the negatives way more than I focus on accentuating the positives. I see problems pretty much everywhere I look. I work relentlessly to eliminate those problems. This sometimes means I don’t pay attention to the good things that are happening.

It’s something I’m working on and getting better at. Pull me up if you think I’m being too negative on something.

I spot dysfunction way sooner than most people. I’m like a hound dog when it comes to sniffing out dysfunction. I have found myself the canary in a coal mine more than once in my career. This has taken a personal toll more than once, and it’s something I’m very mindful of limiting the impact of in the future. If you see me withdrawing, it’s sometimes me pattern matching against past experiences that had Very Bad Outcomes.

I think holistically, which sometimes means I hold contradictory views on the same topic. To me, this is a strength when navigating complex systems. I am very good at navigating complex systems. To people around me, this can be frustrating! I can argue three contradictory points in the same number of minutes. I can appear to be hard to pin down on a position.

I actively seek contrarian views, sometimes to a fault. Healthy debate and conflict is the lifeblood of the team. I actively create opportunities for dissent. This can be uncomfortable for people who have never worked in environments like this! Apologies in advance – I will do everything I can to make your introduction to this as gentle as possible.

I am very mindful that I am often wrong, and I want to hear what I am wrong about, and why I’m wrong about it, as soon as possible. I don’t have all the answers, but I will relentlessly question until we get a closer approximation of reality.

I have a preference for working on product and delivery problems over technical ones. Sometimes this means the team ends up focusing on shipping and going faster, and work to manage tech debt and tech growth gets de-prioritised.

Sometimes this leaks into 1:1s (particularly if you are motivated by the same thing). I’m aware of it, but sometimes I still get blindsided by it. When you see this happen, let me know.

I am an integrator in a segmenter’s clothing. I have a natural tendency to blend the boundaries of home and work, and seamlessly transition between the two. Given I work from home, this can result in overwork and burnout.

I have lots of strategies to create boundaries between the two, like:

Separate physical workspace
Separate devices (my work phone goes on top of the coffee machine at the end of every day, so if you Slack me after hours, you won’t get a response until 8am the next day)
Wearing different clothes for work and home

Call me out if I’m being naughty and working while sick.

I won’t add you on Facebook. There is a power dynamic in our relationship, whether we choose to acknowledge it or not. Me adding you on Facebook puts you in an awkward position if you don’t really want to share that information with me.

The choice is up to you. If you friend me on Facebook, I will accept.

Finally, I think to talk. I don’t often talk to think.

This document, like me, is a work in progress

I try to update it frequently and appreciate your feedback.

A simple proxy service for scrapers running on Morph

2016-12-11T00:00:00+00:00

When I originally launched Got Gastro back in 2008, New South Wales was the only Australian jurisdiction that published data about food safety problems.

Since then, several other Australian jurisdictions have started publishing their food safety data: Victoria, WA, ACT, and SA.

As part of building the new Got Gastro, I have been slowly adding scrapers for new data sets from across Australia and the UK.

The low-hanging fruit has been picked – NSW and Victoria publish a page/URL per notice, WA publishes a PDF per notice – now it’s time for the harder data sources.

South Australia Health publishes a register of food prosecutions. The data is not structured. Every single entry is formatted differently. Business names, addresses, dates, and even field names are different for each entry.

The clue to what’s going on is in the class name on the content:

<div class="wysiwyg">...</div>

This is a pretty common thing about public data: often the only reason it has been published is because legislation requires it.

If the scale of the data being published is small, folk-systems spring up to handle the demand (in SA Health’s case, a WYSIWYG field on a CMS to handle 6 notices). When the scale of reporting is big, you get a more structured, consistent approach (like the NSW Food Authority’s ASP.NET application that handles ~1500 notices).

The challenge becomes: how do you build a scraper that handles the variations in data from artisanal, hand-crafted data sources?

The scraper

But it turns out that’s not even the most challenging problem with writing a scraper for this dataset. Sure, there are some annoying inconsistencies that require handling a few special cases, but nothing impossible.

The problem lies with how the scraper runs.

The sa_health_food_prosecutions_register scraper runs on Morph, a truly excellent scraping infrastructure run by the Open Australia Foundation.

The scraper scrapes the food prosecutions register from the SA Health website. The SA Health website sits behind some sort of Web Application Firewall. It’s assumed this WAF is meant to block nasty requests to the website.

Unfortunately, the WAF blocks legitimate requests from Morph, which means the scraper fails to run. The WAF sometimes returns an HTTP status code of 200 but with an error message in the body. Sometimes it just silently drops the TCP connection altogether. This behaviour only shows up on Morph, not when running from within Australia.
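Because a blocked request can still come back as a 200, checking the status code alone isn’t enough. Here is a minimal sketch of how a scraper might tell the three cases apart. The function name and the marker string are hypothetical; I haven’t reproduced the WAF’s real error text here, so you would substitute whatever it actually returns.

```python
# Hypothetical text to look for in a block page (an assumption).
BLOCK_MARKERS = ["request rejected"]


def classify_response(status, body, block_markers=BLOCK_MARKERS):
    """Return 'dropped', 'blocked', 'error', or 'ok' for one fetch attempt.

    `status` is None when the connection was dropped before any
    HTTP response arrived.
    """
    if status is None:
        return "dropped"
    lowered = body.lower()
    if any(marker.lower() in lowered for marker in block_markers):
        return "blocked"  # a 200 can still carry the WAF's error page
    return "ok" if status == 200 else "error"


print(classify_response(None, ""))
print(classify_response(200, "<p>Request Rejected</p>"))
print(classify_response(200, '<div class="wysiwyg">...</div>'))
```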

Bugs that only show up in production? The best.

To make the scraper work on Morph, we can build a simple Tinyproxy-based proxy service running in AWS to proxy requests from Morph to SA Health’s website. The proxy is locked down to only accept requests originating from Morph.

Designed to be cheap, resilient, and open

The proxy service must be:

low cost
resilient to failure
open source and reproducible

The last point is key.

When I originally tested this proxying approach, I did it with a Digital Ocean droplet in Singapore. I forgot about it for a couple of weeks, then accidentally killed the droplet when I was cleaning up something else in my DO account. Aside from the fact that the proxy’s existence and behaviour was opaque to anyone but me, I wanted other people to be able to use this proxying approach. More selfishly, I didn’t want future Lindsay to have to remember how this house of cards was stacked.

To keep costs low and the service resilient, the proxy service uses the AWS free tier, and autoscaling groups.

There is Terraform config in the scraper’s repo to build a proxy instance and supporting environment. The Terraform config:

Sets up a single VPC, with a single public subnet, routing tables, and a single internet gateway.
Sets up an ELB to publicly terminate requests, locked down with a security group to only accept requests from Morph (don’t want to be running an open proxy).
Sets up an autoscaling group of a single t2.micro (free tier) instance, with a launch config that boots the latest Ubuntu Xenial AMI, and links the ELB to the ASG.

When the scraper runs on Morph with the MORPH_PROXY environment variable set, it connects through the ELB to the Tinyproxy instance, which then proxies the request on to SA Health’s website.
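To illustrate the idea, here is a rough sketch of switching on the proxy from an environment variable using only the Python standard library. The real scraper may well do this differently; the host:port format for MORPH_PROXY and the function names are assumptions of mine.

```python
import urllib.request


def proxies_from_env(environ) -> dict:
    """Build a urllib proxy mapping from MORPH_PROXY, if it is set."""
    proxy = environ.get("MORPH_PROXY", "").strip()
    if not proxy:
        return {}  # no proxy configured: connect directly
    if "://" not in proxy:
        proxy = "http://" + proxy  # assumed host:port form
    return {"http": proxy, "https": proxy}


def opener_from_env(environ):
    """Return an opener that sends requests via the proxy when configured."""
    return urllib.request.build_opener(
        urllib.request.ProxyHandler(proxies_from_env(environ))
    )


# proxy.example.org is a placeholder, not the real ELB hostname.
print(proxies_from_env({"MORPH_PROXY": "proxy.example.org:8888"}))
```

Running locally without the variable set gives a direct connection, which matches the “works from within Australia” case.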

Drive changes with `make` and environment variables

Once you clone the repo and set some environment variables, you can start planning your changes:

make plan

To apply the plan:

make apply

To destroy the environment:

make destroy

This Makefile approach was borrowed from hectcastro/terraform-aws-vpc, from which this Terraform config was forked.

Wrap it with a Continuous Deployment pipeline

To keep Terraform changes consistent, all changes to the proxy service are run through a Continuous Deployment pipeline on Travis. This means no changes to the “production” service are done locally. This is important for creating visibility for new contributors of how the service runs and changes.

Terraform relies on .tfstate files to track state and changes between Terraform runs. Because Travis starts with a clean git clone every build (and thus no .tfstate), terraform config is used to push/pull persistent state across builds.

The pipeline is very simple – it just runs proxy/cibuild.sh and proxy/cideploy.sh.

These environment variables must be exported for proxy/cibuild.sh and proxy/cideploy.sh to work:

BUCKET, the name of the S3 bucket the config will be sync’d with by terraform config
AWS_ACCESS_KEY_ID, access key for the IAM user, used by terraform config
AWS_SECRET_ACCESS_KEY, access key secret for the IAM user, used by terraform config
TF_VAR_aws_access_key, access key for the IAM user, used by terraform plan and terraform apply
TF_VAR_aws_secret_key, access key secret for the IAM user, used by terraform plan and terraform apply

In the sa_health_food_prosecutions_register proxy service case, these environment variables are exported as encrypted environment variables in .travis.yml. This keeps the config and most of the data open, and easily reproducible.
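If you run the scripts outside Travis, a quick pre-flight check for those variables saves a confusing failure halfway through a Terraform run. A minimal sketch (the function name is my own; the variable names are the ones listed above):

```python
import os

REQUIRED_VARIABLES = (
    "BUCKET",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "TF_VAR_aws_access_key",
    "TF_VAR_aws_secret_key",
)


def missing_variables(environ=os.environ) -> list:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```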

Civic hacking for government shortfalls

This was a huge amount of work for a very small data set (6 notices!), but I believe it was worth it.

The approach allows the scraper to reliably run on Morph, and behave in a way that’s consistent with other scrapers. The costs are minimal, which is important if I’m picking up the tab for poor government IT.

(Side note: if you were a member of the public with an urgent enquiry for SA Health, but you were being silently dropped by their WAF, how would you contact them to let them know? Their contact numbers are on their website, after all)

Most importantly, the service is open source and reproducible. When I asked on the Open Australia Slack about other cases of Morph scrapers failing because of active blocking of requests, nobody could think of any.

I hope nobody ever has to do anything like this to make their scrapers run, but if they do, there’s now a Terraform project to set up a proxy that costs less than $5/month.

Happy civic hacking!

AWS in government: risks, myths, and misconceptions

2016-10-12T00:00:00+00:00

The opinions in this post are my own, and do not represent my employer.

When undertaking digital transformation initiatives in our organisations, we need infrastructure technology that can adapt to user needs as fast as we do.

Because this type of technology is fairly new to government, there is a lot of uncertainty about how it can be used, as well as a lot of optimism about the opportunity it brings.

Let’s discuss some of the risks, myths, and misconceptions we’ve encountered on our journey to fully cloud based infrastructure.

Myth: We can’t store data securely!

AWS is on Australian Signals Directorate’s Certified Cloud Services List alongside several other Infrastructure as a Service (IaaS) providers like Azure. AWS has four services IRAP accredited by ASD up to Unclassified DLM: EBS, EC2, S3, and VPC. If you’ve used AWS in the private sector you might find this catalogue limited, but there are a heap of workloads you can run on AWS with just these four services.

It’s worth noting that ASD acknowledges the risks that come with existing in house systems compared to cloud services. From ASD’s Cloud Computing Security for Tenants guide:

Organisations need to perform a risk assessment and implement associated mitigations before using cloud services. Risks vary depending on factors such as the sensitivity and criticality of data to be stored or processed, how the cloud service is implemented and managed, how the organisation intends to use the cloud service, and challenges associated with the organisation performing timely incident detection and response. Organisations need to compare these risks against an objective risk assessment of using in house computer systems which might: be poorly secured; have inadequate availability; or, be unable to meet modern business requirements.

While these AWS services are accredited up to Unclassified DLM, if you have protected data, there are some strategies you can use to make parts of this data available on AWS.

Most data is classified at the row level in databases. While you can’t put a protected row on AWS, you’ll often find that individual columns in that protected row have a lower classification. This means you can put unclassified columns from protected rows on AWS, and work out a way to match up data between your public systems on AWS, and your private systems on protected networks.
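As a rough illustration of the mechanics only, here is a sketch that splits a row into a public part and a protected part joined by a random identifier. The column names and function name are hypothetical, and deciding which columns really are unclassified is a job for your security assessors, not for code.

```python
import uuid


def split_row(row: dict, public_columns: set) -> tuple:
    """Split one row into a public part and a protected part.

    Both parts share a random `link_id`, so the two systems can be
    matched up later without the public side holding anything
    derived from protected values.
    """
    link_id = uuid.uuid4().hex
    public = {k: v for k, v in row.items() if k in public_columns}
    protected = {k: v for k, v in row.items() if k not in public_columns}
    public["link_id"] = link_id
    protected["link_id"] = link_id
    return public, protected


# Made-up example row.
public, protected = split_row(
    {"name": "Example Person", "postcode": "2000"},
    public_columns={"postcode"},
)
print(sorted(public), sorted(protected))
```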

Misconception: We’ll run it like physical infrastructure!

Once you’ve procured AWS, often you’ll want to go for the biggest cost savings immediately. Reserved Instances are a great way to achieve these cost savings, especially if you buy them for a three year period.

But the value of AWS to government is not low-cost compute, it’s on-tap capacity. We can’t extract this value unless we build and run services like AWS recommends. To do this, we have to think differently about our software architectures.

The risk with buying RIs up front for three years is you don’t know what your workloads are going to be three years from now, let alone what architecture you’ll build to deal with them.

You might optimise your code to run in parallel across many cheaper instances. You might shift your workloads to spot instances for ad-hoc calculations.

To achieve a sustainable, controlled spend, you have to start with On-Demand instances, track your spend over several months, and identify instance types that are constantly used.

Then buy RIs for a year.
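The “identify instance types that are constantly used” step can be sketched like this, assuming you have already exported hourly running-instance counts per instance type from your billing or monitoring data. The function name and the numbers are made up.

```python
def reservation_candidates(hourly_usage: dict, min_instances: int = 1) -> dict:
    """Find instance types with a constant baseline worth reserving.

    `hourly_usage` maps instance type -> list of running-instance
    counts, one per hour of the tracking period. The baseline is the
    lowest count seen, i.e. instances that were always running.
    """
    candidates = {}
    for instance_type, counts in hourly_usage.items():
        baseline = min(counts) if counts else 0
        if baseline >= min_instances:
            candidates[instance_type] = baseline
    return candidates


# Illustrative numbers only.
usage = {"t2.micro": [2, 3, 2, 4], "m4.large": [0, 1, 0, 0]}
print(reservation_candidates(usage))
```

Anything above the baseline stays On-Demand, which keeps the on-tap capacity that makes the platform worth using.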

This works well for both lifting-and-shifting traditional workloads and greenfields projects using cloud native architectures. If you’re really keen, purchase RIs for three years, but beware the risk of premature investment in an architecture that may not match your workloads.

If you do find that you’re not using the RIs you’ve purchased, you can sell them on the RI marketplace.

Risk: Our spend is getting out of control!

If you don’t spend time monitoring and analysing your usage of IaaS services, you can very quickly find yourself spending more than you planned.

Use multiple accounts to segment and control your spend.

Consolidated Billing allows you to logically separate services you’re delivering across multiple accounts, but see costs in one place: on the parent account’s billing page. Even better, you can grant your finance teams the ability to view billing information in the AWS console, so they can get straight to the information they need to make informed, financially prudent decisions.

You can take this even further by using blended rates across your AWS accounts, where On-Demand and Reserved Instance are averaged across linked accounts using consolidated billing. This allows you to make reservations once, but automatically get the cost savings across all linked accounts.

Separate accounts are also useful if your service ever gets mogged (moved to another agency in a machinery of government change) – just unlink the account and re-link it to the new parent account.

One technical approach to controlling spend is to automatically shut down non-production environments every night, and rebuild automatically in the morning. Because On-Demand instance pricing is calculated hourly, you’ll halve your cost by only running instances for half the day.
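The arithmetic behind that is simple enough to sketch. The hourly rate below is a made-up number rather than a real AWS price, and the model assumes you pay only for hours an instance is running.

```python
def on_demand_cost(hourly_rate: float, hours_per_day: int, days: int = 30) -> float:
    """Cost of one On-Demand instance billed per hour of runtime."""
    return hourly_rate * hours_per_day * days


RATE = 0.10  # hypothetical dollars per hour
always_on = on_demand_cost(RATE, hours_per_day=24)
office_hours = on_demand_cost(RATE, hours_per_day=12)
print(f"always on: {always_on:.2f}, 12 hours a day: {office_hours:.2f}")
```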

The side effect of this is incentivising a culture of technical resilience. When you’re creating and destroying whole replicas of production systems every day, you become really good at creating and destroying environments, and more importantly automatically handling failure.

There’s also the benefit of better security posture through short-lived environments. By having ephemeral environments, we reduce the impact of one of the three resources required for effective cyber attacks: time. When we rebuild environments daily from fully patched base images, we time limit the window of opportunity for attacks to take hold.

Risk: Our stuff is getting hacked!

As we mentioned before, we can’t extract the value from IaaS providers like AWS unless we build and run services like they recommend. One of the fastest ways to do this is to give your developers full AWS access, to experiment with different ways of building services. By using IAM users, groups, and roles, you can selectively grant your developers the ability to create, update, and destroy environments, fostering that culture of technical resilience.

One side effect of this delegation of responsibility is services and data can be accidentally exposed to the world. You just need to take a look at the AISI daily observations per open service family graphs to see how prevalent this problem is across the Australian address space.

One solution is to audit publicly exposed services on your AWS accounts hourly, and automatically notify your people when the automated security monitoring detects something amiss. This is surprisingly easy to do with IaaS: just query the APIs to get a list of IP addresses in use, run a scan against these addresses, and notify owners immediately, all based on tag information on vulnerable hosts, also exposed through the API.
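A rough sketch of that audit loop follows. Everything in it is an assumption for illustration: the shape of the `instances` dicts (you would build these from your provider's inventory API), the `owner` tag name, the port list, and the helper names. The network probe is injectable so the example runs against a fake without touching a network.

```python
import socket
from collections import defaultdict

# Ports to check. A hypothetical starting list, not an authoritative one.
RISKY_PORTS = {22: "ssh", 3306: "mysql", 5432: "postgres", 6379: "redis"}


def tcp_port_open(ip, port, timeout=1.0):
    """Return True if a TCP connection to ip:port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def audit(instances, probe=tcp_port_open, ports=RISKY_PORTS):
    """Group exposed services by the (assumed) "owner" tag on each instance."""
    findings = defaultdict(list)
    for instance in instances:
        ip = instance.get("public_ip")
        if not ip:
            continue  # no public address, so nothing to scan
        owner = instance.get("tags", {}).get("owner", "unowned")
        for port, service in ports.items():
            if probe(ip, port):
                findings[owner].append((instance["id"], ip, port, service))
    return dict(findings)


def format_notification(owner, exposed):
    """Build the message text to send to an owner."""
    lines = [f"To {owner}: {len(exposed)} publicly reachable service(s) found"]
    lines += [f"  {iid} {ip}:{port} ({svc})" for iid, ip, port, svc in exposed]
    return "\n".join(lines)


# Fake inventory and probe, using a documentation-only IP range.
inventory = [
    {"id": "i-aaa", "public_ip": "203.0.113.10", "tags": {"owner": "payments-team"}},
    {"id": "i-bbb", "public_ip": None, "tags": {"owner": "search-team"}},
]


def fake_probe(ip, port):
    return (ip, port) == ("203.0.113.10", 6379)


for owner, exposed in audit(inventory, probe=fake_probe).items():
    print(format_notification(owner, exposed))
```

Run hourly from a scheduler, with the fake inventory and probe swapped for real ones, this covers the three steps: list addresses, scan them, notify owners by tag.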

Misconception: We aren’t getting the reliability benefits!

As we mentioned before, we can’t extract the value from IaaS providers like AWS unless we build and run services like they recommend. On IaaS like AWS, we do this by building highly reliable systems from unreliable components.

One very simple way of achieving this is with Auto Scaling Groups. You can use Auto Scaling Groups to ensure a minimum number of like-instances are running, and to scale the number of instances in the group up and down based on demand. This ensures that if any of the underlying instances that make up the Auto Scaling Group fail, they are automatically recreated.
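To make the behaviour concrete, here is a toy model of the reconciliation an Auto Scaling Group performs. It is not how AWS implements it; `launch_instance` and `reconcile` are hypothetical stand-ins that only mimic the outcome described above: unhealthy instances are replaced, and the group size follows demand within a minimum and maximum.

```python
import itertools

_ids = itertools.count(1)


def launch_instance():
    """Stand-in for booting a pre-baked image; returns a healthy instance."""
    return {"id": f"i-{next(_ids):04d}", "healthy": True}


def reconcile(instances, desired, minimum, maximum):
    """Return the group after one pass: drop failures, then match the target size."""
    target = max(minimum, min(desired, maximum))
    survivors = [i for i in instances if i["healthy"]]
    while len(survivors) < target:
        survivors.append(launch_instance())
    # Scaling in: this toy keeps the oldest instances and drops the rest.
    return survivors[:target]


group = [launch_instance() for _ in range(3)]
group[1]["healthy"] = False  # simulate an instance failure

group = reconcile(group, desired=3, minimum=2, maximum=6)
print([i["id"] for i in group])  # the failed instance has been replaced

scaled_up = reconcile(group, desired=5, minimum=2, maximum=6)
print(len(scaled_up))
```

The point of the model is that nobody repairs the failed instance. The group only ever compares what is running against what should be running, and launches or removes instances to close the gap.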

To get the full benefit of Auto Scaling Groups, you need to pre-bake your applications into your instances. Think of it as a frozen pizza – you automate the hard work up-front to get your application and environment ready to go, then the Auto Scaling Group warms them up at the last minute for consumption.

The caveat for this to work is this: you need a strong continuous delivery capability that is highly automated – everything must go to production through the pipeline.

The effect of this is that changes and releases become non-events. You can very quickly reach a point where you’re deploying tens, if not hundreds of changes a day – all with minimal human intervention.

Whenever you make a change to the application or the underlying environment, your systems automatically build new images that can be used in your Auto Scaling Groups. This requires new tools and ways of addressing your infrastructure programmatically.

Having a fully automated change process can make satisfying regulatory requirements easier, because all changes are highly controlled and logged, and each part of the automation has limited access, courtesy of IAM. When you combine this with CloudTrail, you get a pretty powerful combination of access control and auditing.

Now you’re getting multiple benefits: reliability, auditability, scalability, and pertinently, cost reduction – you don’t pay for what you’re not using, calculated by the hour.

Conclusion

The opportunity IaaS provides is immense.

IaaS providers like AWS help make doing the right thing easy. IaaS eliminates classes of problems, freeing up your teams to focus on the bigger picture. Most importantly, it frees people up to help your organisation learn about modern technology practices for building highly reliable government services.

Help! I’ve just been made a manager

2016-01-25T00:00:00+00:00

Your boss calls you into her office.

“Congratulations - I’m promoting you to team lead!”

Your mouth goes dry.

“You’ve been doing such a great job on the last few projects, the leadership team thought you could help other people in your team perform just as well as you.”

Your stomach turns to stone.

“Your new role starts now. We’ll see how it works out, and come back in a few weeks to review.”

This experience may feel too familiar – and perhaps painful – to you. You get thrown into the deep end with a life jacket/anchor labelled “team lead”/“supervisor”/“acting manager”.

As we’ve read before, moving into a management position is not a promotion, it’s a career change. But fate (or your boss) may not agree.

How do you survive your first few weeks in your management role?

Get a job description

Getting a job description written down helps clarify your boss’s expectations about what functions your role is meant to perform, and what is expected of you. They’re ground rules that help you understand the parameters of your work.

Make sure you have a conversation with your boss about each part of the job description, to clarify your boss’s interpretation. Note down anything that was different or required clarification, then send an updated version to your boss, for both your records.

If you can’t get a job description, write your own.

You’re probably thinking right now “Oh no, it’s a trap!” or “Isn’t this what my boss should be doing?”, and you’re right, it is your boss’s responsibility. There’s a very real chance your boss doesn’t have time - that’s part of the reason why you’re getting your “promotion”. That’s not a justification, it’s just a fact. Your boss may not have been in this position before either, and is just making it up as they go along. A lot of companies don’t have good processes for how this sort of role change is meant to work, and thus don’t have any pre-canned job descriptions they can hand to new managers. Don’t even ask about training.

Writing your own job description is a fantastic opportunity to define exactly what you’re going to be doing, in your own words, while demonstrating your communication and goal setting abilities.

Get 1-2 peers to review what you’ve written, and consider getting your new team to review the job description. This can help build rapport with them, and get their buy-in to what the new team is going to look like. But beware: if you’ve been given this new role over someone else who now reports to you, there may be tension – modify your technique to your audience.

Once the job description is written, send it to your boss with a “I know you’re busy, so if you don’t see any problems with it, no need to reply”.

Managing your workload

“I’m already overloaded” you’re thinking, “How am I supposed to look after all these people while doing my existing work?” – this is the biggest fear, and biggest challenge, when moving into a management role.

You have three options:

Keep trying to do your own work while managing others. This almost certainly will end in you doing a mediocre job of both. You’ll spend 30% of your time on engineering, 10% on people management, and the remaining 60% on context switching and self loathing.
Aggressively cut the scope of your personal technical workload, and manage the stakeholder expectations for those cuts. This frees you up to spend some of your time on people work.
Make your old engineering workload your team’s workload. Still cut the scope of the work, and manage your stakeholder expectations. Help the team become better at doing some of the work you were doing previously. Don’t forget to manage the stakeholder relationships for their work too.

Remember why this role change is happening: as a leader, you provide more value to the team as a multiplier than as an individual contributor. If you free up each person in your team to focus more clearly on their work, and complete that work more efficiently – that’s greater than any engineering contribution you can make as an individual.

The performance of the team will drop when the team is in this transition phase. The team is reconfiguring itself, working out what’s important, what’s not, how work gets done, who has what responsibilities.

If you can successfully manage the transition, the team will be more productive than it is as a collection of individual contributors. That is your goal.

Create feedback loops

Part of being good at people work is creating strong feedback loops from the people in the team to you. Have a reliable, predictable avenue for them to report problems and suggest changes, then act on their information to build trust.

Organise regular one-on-ones with your team. Once a week, half an hour, away from the office if you can.

Be honest and upfront with them that you’re new to this, and are trying to work it out.

Be vulnerable about your limits and abilities. Ask them to raise problems with how you’re managing them immediately, and show you’re listening to them by acting on their feedback promptly. Build trust by listening and changing behaviour.

Find out their biggest fears and anxieties about the new work situation. Ask simply:

“Is there anything I can do to help make things better?”

Create a feedback loop from the team to the rest of the organisation by publicly praising good work from individuals. Raise the profile of their work to the broader audience in your org by publicly calling out good work or congratulating them on the successful delivery of features, projects, or quick bug fixes.

Make sure you pass all the credit down to the team.

Hard truths

In my first year as a manager I spent most of the time trying to make sense of the new context I found myself operating in. Expectations were re-calibrated (sometimes brutally), people were disappointed (some even left), deadlines were missed (occasionally by wide margins).

These were realisations I came to that helped me cope in my first year:

Demand will always exceed capacity. Doesn’t matter how good you are at managing workload – team or personal – there will always be more to do. There will always be someone disappointed you’re not doing the exact work they want, when they want it. Fuck the haters.
Competence is rewarded with more work. If your boss or your boss’s boss sees you doing a good job, they will want to see how much further you can go. Put more cynically: no good deed goes unpunished.
There will always be a tension between doing technical work and doing people work. When you’re not in a pure-management role (e.g. team lead, supervisor) you still have engineering work to do, and finding the balance will be messy – especially as you’re new to this whole people management thing. You think you’ve got a handle on it, and something will knock you for six. Keep going, reflect, try out different things, and you’ll get there.
Technical output is no longer the sole measurement of your job success. Your own technical output is a false measurement for your responsibilities to the team. You will always be disappointed if you measure your current self against your past self, that past self that only had technical delivery responsibilities. Your disappointment will lead you to prioritising technical work over people work, consequently screwing over the people who are looking to you for help and guidance. Prioritise people work.

Finally: everyone else is just making it up. Nobody comes into a management role with all the answers. The leaders you look up to have made heaps of mistakes that shaped their leadership and management style.

Get to it.

PreAccident Investigation Podcast Highlights, Sep-Oct 2015

2015-11-11T00:00:00+00:00

These are notes I’ve taken while binge listening to the last two months of the PreAccident Investigation Podcast, which you should subscribe to.

Kent Whipple – The power of the story

When we investigate an accident, we don’t tell the story of what happened, we tell a story about what didn’t happen.
Identifying what didn’t happen doesn’t help you fix what did happen.

Listen to the episode.

Dr. Alan Frankfurt - High Reliability, Safety, and Delivering Babies

Highly reliable teams don’t realise they’re highly reliable, they don’t set off to become highly reliable, they set off to become more stable, safer, more effective, or learn.
Destroy vertical silos, create horizontal integrations. The silos stop us from working together and becoming as good as we can get. Horizontal integrations help people take ownership.
Prepare for events with a pre-brief: use a template, identify what the threats are, verbalise and share concerns.
Hold a post-action review soon after the operation, schedule around the surgeon because of time demands.
There are rarely technical issues, but there are always communication issues that come up.
Everyone in the team needs a role, and needs to understand how that role fits into the goal of the team. “I’m gonna do my doctor thing, you’re gonna do your nurse thing, but I’m not any more important than you are.”

Listen to the episode.

Dr Jim Joy - Critical Controls

Risk registers end up being a list of problems on paper that are useless as a management tool.
Critical controls are a more effective management tool for dealing with risks and events.
Controls are anything that prevents or mitigates an unwanted event, that we can use to improve our resilience when things go wrong.
Controls can be acts, objects, and systems.
Acts are behaviours we mandate or encourage.
Objects are tools that work by themselves.
Systems are combinations of acts and objects.
Training is not a control, supervision is not a control. We can’t measure it, we can’t validate it, we can’t audit it.
Once we have controls, we can define performance requirements (pressure valve is released at x pressure, the operator understands how to perform x task in context), measure, then validate those performance requirements are being met.
Once we have requirements, we can set targets to assess the reliability of the controls, which is more of an objective discussion around metrics.
We can then feed these metrics into the design of the controls based on what happens in our organisations.
We need to move beyond thinking about risks as likelihood x consequence.
Risk is the degree to which your controls aren’t working.
Health and Safety Critical Control Management Good Practice Guide from ICMM publications

Listen to the episode.

Dr Jim Barker - Complexity

We don’t manage complexity, we move with it.
Think about complexity as fluidity instead of non-linearity, because there are linear aspects to our complex systems (like time).

Listen to the episode.

Martha Acosta - The 4 Things Leaders Control

When leaders say “come to me with solutions, not problems” it seems like a great empowerment move, but they’re creating a distance between workers and management.
If people come up with solutions, wouldn’t it be more empowering to just let them go and implement them, and only come to you when they have problems they can’t solve?
The value leaders provide to their organisations is helping the people at the pointy end ask the right questions, and helping them create a solution at the pointy end.
“When significant change comes up against significant culture, culture always wins” - Edgar Schein
Culture is something that arises from behaviour. That behaviour tells us what matters, how we do things around here, what works and what doesn’t. The internalisation of that behaviour is what becomes culture.
Outsiders see culture. Insiders have difficulty seeing it.
Once we see culture externally, we think we can change culture externally.
The four things leaders control are Roles (what people do), Processes (how we do work), Norms (how we interact with one another), Metrics (what we measure and incentivise).
Anxiety in the leadership structure is contagious, and can turn into fear lower down in the org structure.
Social anthropologists see culture as a bunch of intertwining narratives. If you add an anxiety narrative to your culture’s story, that changes the story.
When you get people talking about their narrative in your culture’s story, that reflection produces surprises and uncovers how things work.

Listen to the episode.

Dr. Eric Young – Patient Safety, Surgery

Checklists have become overused to the point where they’re causing more harm than good (Dr Young has seen 84 items on one list). They need to be kept down to a page, per The Checklist Manifesto.
The best way to reduce error rates is to ensure a consistent team is working together to perform the surgeries.
This isn’t always a possibility, so ensuring consistent skills across all team members is the next best thing.
Dr Young is surprised more patients don’t get actively involved in their medical care by asking their doctors questions (e.g. why this brand of joint replacement over another?) and finding out more about their treatments.

Listen to the episode.

Blame. Language. Sharing.

2015-10-30T00:00:00+00:00

Failure can lead to blame or inquiry in your organisation.

When failure leads to blame, organisations subscribe to the old view of human error. They construct a narrative that’s far worse than the reality, a narrative that focuses on a single root cause, which is inevitably human error. This reductionist and deconstructive process has us go down-and-in, treating people and systems as separate entities, with people at the root of the cause.

When failure leads to inquiry, organisations subscribe to the new view of human error. People are part of the systems, inquiry is angled up-and-out, focused on understanding the relationships and bigger picture ideas at play. This is difficult, because it involves acknowledging and embracing complexity.

When failure leads to inquiry, we embrace different perspectives, different stories, different interests - and often these contradict one another. By embracing these differences, we create an opportunity for learning for people inside the organisation, navigating the delta between how we imagine work is completed in our organisation, and how it is actually done.

Learning organisations have three distinct advantages:

They have feedback loops that deliver high quality feedback from the front lines,
Which allows people performing the work to focus on quality and delivery,
Which reduces the amount of defending of decisions by practitioners.

These three advantages minimise the likelihood of a Cover Your Arse culture emerging, where people focus more on implementing insulation against potential blowback from performing work, than actually performing the work itself.

I posit there are three contributing factors that inhibit learning in organisations:

Language we use when talking about and contextualising failure
Blame and the tainted narrative we construct via cognitive biases
Sharing of experiences in our organisations to uncover understanding

Language

The words we use when talking about events are really important.

Words are framing devices that can both expand and limit the scope of inquiry. These words are used during your investigations, retrospectives, learning reviews, brainstorming sessions, and post-mortems. But most importantly they’re used when having daily conversations with your colleagues.

Why

Why is used to force people to justify actions, to attribute and apportion blame. Why goes down-and-in, focuses the inquiry on people, and is often used to phrase counterfactuals that focus attention on a past that didn’t happen – “why didn’t you answer the page?”, “why didn’t you check the backups?”.

Why plays right into the hands of the Fundamental Attribution Error, where we explain other people’s actions by their personality, not the context they find themselves in, but we explain our own actions by our context, not our personality.

How

How is about articulating the mechanics of a situation, which is helpful for distancing people from the actions they took. How clarifies technical details - “how did the site go down?”, “how did the team react?” – but it can also limit the scope of the inquiry, as we focus on the mechanics, not the relationships at play in the larger system.

What

What uncovers reasoning, which is important for building empathy with people in complex systems – “what did you think was happening?”, “what did you do next?”. What makes it easier to point our investigations up-and-out, on the bigger picture contributing factors to an outcome. What encourages explaining in terms of foresight, and helps us take into account local rationality:

“people make what they consider to be the best decision given the information available to them at the time”

Dekker describes explaining an incident in terms of foresight as understanding what people inside the tunnel saw, as they journeyed through it during an incident. What helps us uncover what the inside of the tunnel looked like.

Blame

Blame assigns responsibility for an outcome to a person. Often we use blame to say that people were neglectful, inattentive, or derelict in their duty. It plays into this idea of bad apples, amoral actors in our midst who are working against the sanctity of the pristine system the dirty humans keep fucking up.

But assigning responsibility for an outcome to a person ignores a truth – sometimes bad things happen, and nobody is to blame. Furthermore, things go right more often than they go wrong.

There are two cognitive biases at play when assigning blame to people: confirmation bias, and hindsight bias.

But what is a cognitive bias? Simply, a cognitive bias is a mental shortcut your brain unconsciously takes when processing information. Your brain optimises for timeliness over accuracy when processing information, and applies heuristics to make decisions and form judgements. If those heuristics produce an incorrect result, we say that’s an example of a cognitive bias.

Confirmation bias

With the confirmation bias, we seek information that reinforces existing positions, and ignore alternative explanations. Worse still, we interpret ambiguous information in favour of our existing assumptions.

Simply put: if you are looking for a human to blame, you’re going to find one, regardless of contrary information.

We can counter the confirmation bias by appointing people to play the devil’s advocate and take contrarian viewpoints during conversations and investigations.

Hindsight bias

The hindsight bias alters our recollection of memories to fit a narrative of how we perceived and reacted to events. It’s a type of memory distortion where we recall events to form a judgement, and talk about and contextualise events with knowledge of the outcome – often making ourselves look better in the process.

The hindsight bias is dangerous because it can taint all your interactions with your team. It is your culture killer, altering how you recall your perception of events and actions in stressful situations, driving a self-defensive wedge between you and your colleagues.

It’s important to eliminate hindsight bias when conducting post-mortems and investigations if we want a just outcome. The simplest way to achieve this is to explain events in terms of foresight, and this is made easier by using questions that start with “how” and “what”. Start the review at a point before the incident, and work your way forward. Resist the urge to jump ahead to the outcome and work your way back from that.

Doing this is hard and requires a lot of self-restraint and practice. You’ll make a lot of mistakes, and it takes time to get good at it. Even when you’re good at it, you’ll still occasionally find yourself slipping into old habits. It’s the responsibility of the whole team to call each other out when they see each other fall into the hindsight bias trap by using words like why and who.

We can also harness hindsight bias to give us insights into how things might break in the future.

Before you take a new service live, gather the team together and ask them to brainstorm on a whiteboard or post-it notes what they think will break when they go live. Then clear away any notes you’ve collectively taken, and ask them to imagine themselves 5 minutes after the feature has gone live. Now ask “what has just broken?”.

You’ll find the answers you get can be quite different.

Sharing

Sharing our experiences after an incident happens is vital for the organisation to learn from individual and shared experiences. By sharing our experiences we have the opportunity to embrace different and often contradictory perspectives, stories, and interests.

From these we can better understand what our organisation’s capabilities and weaknesses are, both when things go wrong and when things go right. This creates an opportunity to understand the delta between Work-as-Imagined and Work-as-Done in our organisations.

We do this by holding retrospectives, investigations, post-mortems, or learning reviews – but the label we apply to the event is irrelevant.

These events must be environments where people in your organisation feel they can speak their truth and experiences free of persecution or backlash. If you’re in a leadership or management position, and people in your team are participating in these sharing experiences, be the shit umbrella you want to see in the world.

Other people in your organisation will likely be skeptical of the findings (especially if there is a blameful culture of finding and singling out bad apples), so it’s your responsibility to your people to shield them from the repercussions of being honest. Again, we are all locally rational:

“people make what they consider to be the best decision given the information available to them at the time”

You have a limited window of opportunity to create an expectation that if you share there won’t be blow back - if you fuck it up early on, people will be reluctant to share anything vaguely compromising about their experiences and actions in the future, and thus the organisation as a whole suffers from the missed opportunity to learn.

Know the audience of the report you produce after you’ve shared experiences. Sometimes this means you have to construct multiple reports, one for each audience. The story you tell across these reports should be the same, but alter the level of detail for the audience who is reading it. You may also need to omit different findings for different audiences so details don’t get misconstrued.

Beware of weasel words that show up in the report:

“the team should have…” (counterfactual describing a past that never happened)
“the root cause of the outage was…” (there is never one cause, there are many contributing factors)
“human error led to…” (our world is humans and systems, not humans or systems)

Creating opportunities for sharing our experiences of accidents, incidents, and outages is mandatory if we want to learn about what our organisation’s capabilities and weaknesses are when things go wrong.

To do this we have to hold retrospectives or learning reviews or post-mortems, start at the beginning, and relentlessly eliminate our own and collective cognitive biases when talking about events, by using what and how, not why and who.

Things go right more often than they go wrong, and we owe it to ourselves and our colleagues to understand what made our course of action the right one at the time, in spite of the outcome.

This piece is a writeup of the talk I gave at Velocity Amsterdam 2015.

Management skills for new leaders

2015-07-30T00:00:00+00:00

At DevOpsDays Melbourne 2015 I facilitated an Open Space session on management and leadership.

The purpose of the session was for experienced leaders to share their stories and discuss what skills are required for effective leadership, so new managers and people who are thinking about making the switch could learn from people who have been before them.

You can find the original audio on SoundCloud.

This is a writeup of what was discussed in the session.

Reviews

At some stage you’re going to have to do a performance review.

Being honest in the review can be hard when you start, but the people in your team really appreciate feedback because it’s personal.

Reviews should contain no surprises. If you get to a review and your team member is surprised by something said, it’s too late. When you give feedback, be specific and call out specific good work – it shows that you as a leader are noticing and appreciating the work that’s being done. Generic platitudes of “good job” aren’t effective.

But don’t just wait for a review - force yourself to give feedback whenever you can. Showing that you notice what people in your team are doing is really powerful, and is a big influencer of behaviour. You might feel like you’re giving too much feedback to your team, but every person on your team is only seeing a slice of the feedback.

1:1s

Sometimes 1:1s are the last thing you want to do if you’re an introvert.

It shows up in your calendar, and you think “I wonder if I just don’t show up today?”. But when your boss doesn’t show up, you both stop talking about the 1:1, and mutually implicitly decide not to do more 1:1s. Force yourself to do them even when you don’t want to – you’ll often find within a minute that you’re enjoying it, because you’re feeding off the response you’re getting: that it’s important to them.

People might do 1:1s because they feel they have to.

Why should I do it? Why should I talk about family, weekend, work? We’re engineers! We know what each other do!

But people are more engaged if you know about their family, and they know about yours. It’s surprising to know these things work, but also relieving – it’s not black magic, it’s just how humans work. We attach to each other more when we’re socially connected.

You should be meeting people at a minimum every fortnight, so there are no nasty surprises.

But separate 1:1s-as-a-status-update from 1:1s-as-a-personal-update. It’s important to do both, but separate the concerns.

Mentoring

Get team leads together in your organisation, and get them sharing experiences and lessons. Make sure new managers talk to one another, and bounce ideas around.

It’s also important to pair up new managers with an experienced manager to bounce ideas around.

These pairings don’t have to be in the same department, because they’re dealing with people problems, not technical ones.

Also think about finding someone to bounce ideas off outside of the company – talk to ex-bosses, find mentors, individuals. They give you a perspective because they’re not trapped in your day-to-day.

Best bit of advice ever given to David Spriggs, CEO of Infoxchange: “Treat all your staff as if they’re volunteers”:

Listen to people, look after them. Don’t try and do too much work as a manager – you’re there to be a multiplier.

Role models

Everyone was likely managed by a good manager at some point.

We’ve all seen and experienced good things when working for good management. We often forget what those good things are when we make the change to management. You’ve got to re-learn it all. You’re not receiving it, you’re giving it. This is super hard.

Conversely, a lot of people have never had good technical management, so they may imitate what they’ve seen, perpetuating bad practices because it’s all they’ve ever known. For example: “I haven’t seen my manager in 3 months”.

Promotion vs career change

There’s a preconception that you have to do a lot of tasks as a manager.

You’re going up to get more responsibility, be paid more, and have greater input into the company direction.

Often people will get to a point in their technical career where they are unable to advance any further, and there are few companies that help people go further on the technical career development track. Rackspace have the technical career track, where engineers can be paid more and have more input in the company than some execs.

As managers we need to create the opportunity for our people to grow in the direction they want.

Traditionally, people in tech have been promoted due to technical ability and intelligence – less on emotional intelligence, and their ability to communicate with people. The higher up in the org structure you go, the more your ability to socially interact with other people becomes important.

You have to critically assess how good you are at that. Make sure you’re getting feedback, as well as giving feedback.

Celebrating successes

Linda Rising’s “Do Food” pattern from Fearless Change talks about introducing food into celebration and retros, as a way of bringing the team together.

Just a coffee goes a long way. Removing people from the office environment allows them to open up more about problems and successes.

How do you make food or drink celebrations work with remote teams? Everyone buys their own food and drink, gets reimbursed, shares on the video conference.

If you’re going to cater celebrations, be aware of dietary requirements. If you’re doing regular 1:1s, you’ll know people’s dietary requirements ahead of time.

Distributed teams, remote workers, co-located offices

What works: dedicated chat rooms, per-product streams, that people can opt into as they please. Non-technical folk should hang out in those rooms too. Also have general themed rooms, like “devops”. Having management in those rooms is a great way to identify problems in the organisation before they spiral out of control.

There’s a certain etiquette when working remotely. People don’t realise that they need to digitally note things that are said face-to-face, so other people who aren’t in the same room are up to speed. To prime co-located teams for working remotely, spread the team across the office so they’re not physically located together.

Sometimes you need to pull people together into the same place regularly (e.g. every 3 months). The frequency and size of events will vary from team to team.

If you don’t have regular face to face communication, the co-located offices and people can fall into old patterns. Even if you send someone to the other office permanently, they adopt the culture of that office, become “one of them”. Exchanges need to be two-way.

There are cultural concerns around remote work in different countries. In some cultures there is a certain level of prestige around having your manager come and visit you. If your boss doesn’t come to visit, and other teams have their bosses visit, that’s a negative mark against you.

Teams that are partially distributed are harder to manage than fully distributed teams. As a manager you can send people that are together home to work, so they work like the people who are distributed. Consider walking around the office before/during/after standups to show the office environment beyond the normal meeting rooms. It makes people feel included.

When you’ve made the wrong move

What happens when you’ve made the career change, and you realise you don’t want it any more? How do you address the impression that you aren’t successful when you go back to your prior role?

It’s important for your organisation to recognise the challenges people are facing in their new roles and support them.

“The fact is, if you’re miserable in a leadership role, you’re probably not doing a good job. Save your team the pain, and change.” – Hannah Browne

Move towards a position you provide the most value in, and you are the most happy in.

“It’s not a promotion, it’s a career change” helped people realise that the move is not “I have all this responsibility and stuff I need to look after”, it’s “I have different things to think about, I have new stuff to learn about people, communication, relationships, how to be the multiplier”.

Leadership is a fundamentally different set of skills to engineering.

We don’t look at a hairdresser, or a carpenter, and say the hairdresser should be able to knock you up a house, and the carpenter can dye your hair.

In our knowledge industry we expect that people can change the work they do at the drop of a hat – with minimal mentoring, support, training, guidance, advice – and be successful. We all have a role to play to help people who take the step and decide they don’t love it to not feel like they’re losing face.

Go back to something you love and are passionate about. You have invested years of development in those skills.

Leadership vs Management

Aren’t leadership and management separate things?

The feeling in the room was that to be a good manager you have to be a good leader, but you can be a good leader without being a good manager.

Leadership isn’t a position, it’s a function that anyone can adopt. David Marquet’s Turn The Ship Around is a great case study of encouraging a culture of distributed responsibility, creating leaders at all levels of your organisation.

Management has a bad name, leadership is the trendy alternative, but they are distinct things. When leading you have to look after your people, because you’re working for them.

We’ve heard the term “management heavy”, but you don’t really hear the term “leadership heavy” - maybe that’s an insight into the differences between the two functions?

Leadership cuts across different jobs and industries. Management less so, because it can be focused on technical details. Leaders that are great working with, encouraging, motivating, collaborating with people are successful.

Story time from Chris Madden:

Startup during the dotcom, hired 25 engineers, had a great team. But there was nobody in the team that was suitable to lead.

Startup leadership went out, found someone who had run restaurants, and had herded cats in the film industry. Although he didn’t have any technical experience, he was parachuted into this great team of engineers.

The engineers respected him because he was successfully doing work that they weren’t good at.

One thing new managers can be bad at is knowing when to be direct. You don’t want to tell anyone what to do because “you should just decide”.

But eventually you’ll get feedback from the team that sometimes they just want you to make a decision, set a direction, so they can get on with the job.

Regardless of whether you’re in a management or leadership position, make sure you have good feedback loops from the team.

Understanding your team

Understand how your people are working.

If you’re in a leadership position but come from an operations background, spend time understanding how developers work and think.

“No-one gets out of bed in the morning with the express purpose of making your life miserable”

Sometimes people will drive you insane. But everyone has their own stuff going on. People behave how they do for a reason, so spend time understanding why.

Recognise that everyone is different. A technique that works for one person won’t be guaranteed to work with the next person in your team. By having social interaction with the people in your team, you can work out what’s required of you to work with that person effectively.

Devops and changing roles within a team is a useful mechanism for building empathy. There’s the difference between understanding what someone is going through and actually living it.

Dan Pink’s Drive discusses a new model for understanding people’s motivations. Once you remove money as a concern, people are driven by autonomy, mastery, purpose. When you have 1:1s, use Drive as a framework for working out which category they fit into.

Involve the team in the decision making process. Don’t be “my way or the highway”.

Talk about ideas, collectively own it, build consensus around decisions. The work we’re all doing is hard, and you have to be pretty clever to do it well. As the team grows and gets better, you can start thinking “I wouldn’t be able to be part of this team”.

If you leave the team, there should be someone in place to take over your responsibilities.

The value you provide as a leader

You know you’re doing your job as a leader right when you realise that there’s more value in the communication you facilitate than the tasks you’re performing.

You’re not the leader because you’re the best at every job. You’re not delegating tasks because you’ve run out of time to do the tasks.

You’re delegating because you genuinely believe the people you’re giving the work to can do it better than you. Your responsibility is to create a context in which people in your team can succeed. You do this by talking to them, understanding their motivations, giving them purpose.

Sometimes people think that management or leadership isn’t something they can do, because they’re engineers, and it’s not their responsibility, and they’ll need to change the org structure to achieve the outcome.

But leadership is the responsibility of everyone in a team, and it’s within all your abilities.

Resources

Meetup: Melbourne Agile Dev Managers on Meetup, “MAD managers”
Meetup: Once you’re in a leadership position, or are aspiring, there’s the Melbourne CTO and Sydney CTO schools.
Podcast: Manager Tools. Two US-based management consultants talking about new topics every week.
Newsletter: Software Lead Weekly is a free weekly newsletter of curated management and leadership articles. Oren Ellenbogen maintains a Trello board of all the articles included in the newsletter over the last few years.
Book: Oren wrote a book called Leading Snowflakes off the back of the Software Lead Weekly newsletter. It’s specific to people making the transition from engineer to engineering manager.
Book: Talking With Tech Leads by Patrick Kua on Leanpub, lots of interviews with managers
Book: Behind Closed Doors by Esther Derby & Johanna Rothman is a good introduction to software development team management.
Book: Turn The Ship Around by David Marquet. A great case study of encouraging a culture of distributed responsibility, creating leaders at all levels of your organisation.
Tool: iDoneThis, SaaS, sends an email to each person in your team asking “what have you done?”, you itemise everything you’ve done, sends back to your team or your manager. At review time you can go over accomplishments in fine detail, and raise issues as they come up.
Photos in this post are from the DevOps Australia Flickr set, under a CC BY-NC 2.0 license.

Talk-To-Think, Think-To-Talk, and leadership

2015-07-10T00:00:00+00:00

This cop threw me to the ground, cos hip hop is violent, Said “You got freedom of speech, just choose to remain silent”

– Hilltop Hoods, Mic Felon

Effectively communicating with people in and around your team is the most important skill you need to develop as a leader.

How you communicate with people in your team defines how you build relationships and trust. Understanding how you communicate with people is key to being an effective leader and multiplier.

There are two main communication styles: talk to think, and think to talk.

Talk-to-think

Talk-to-think values speed over accuracy.

It is rapid fire brainstorming.

You say what comes to mind, no filter.

Don’t hold back. Be bold. Agitate.

It’s messy, chaotic, beautiful.

This communication style is excellent for covering lots of ground quickly, especially if you’re trying to sketch a picture of a problem domain and potential solutions amongst a group of people. You use approximations of terms and ideas - the details don’t matter as much, as long as you communicate the gist.

Think-to-talk

Think-to-talk values accuracy over speed.

It is measured, sometimes slow, but always methodical. You don’t say what comes to mind immediately - you spend time thinking about articulating your ideas and arguments before saying them. You choose your words carefully, and embrace the ebbing silence.

Think-to-talk is excellent for covering a smaller, sometimes sensitive topic area with depth and nuance.

Your communication style

You tend to use one style over the other, but the styles are not mutually exclusive.

I am firmly in the Think-To-Talk camp. When I started my career change I spent a lot of time befuddled about why some conversations flowed effortlessly, and others felt like a bucking horse I could barely hold onto.

Realising that my experience was not universal and I needed to level up on my communication styles is one of the things I wish I was told when I became a manager.

Everyone is different. Some people’s communication is dominated by one style, others are somewhere in between. Some can fluidly move between styles, others take a while to transition, if at all. Fluidity and style are intersecting spectrums.

The good news is that either style can be learned - they just require practice and patience.

And you need to be adept at both if you’re going to be an effective leader.

Leadership and the two styles

You likely are proficient with one of the styles. Now you’re in the midst of your career change, you need to start cultivating the other style.

Why? Because your job will have you in situations where you’ll need to pick and choose the style based on the problem you’re dealing with. Also, it’s not about you - the people you lead are going to employ a mix of styles that you’ll need to match and adapt to.

You’ll often find that conversations stick to one communication style. Through experience you’ll get better at predicting ahead of time what style is called for, based on the topics, and who you’re talking to in your team.

Always be cautious of what you think you know - the situation may change and you’ll need to change your style. As Mark Twain said:

It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

Sometimes you’ll need to switch styles mid-conversation. This can happen mid-1:1 when you’re switching from The Idea to The Person. Making that switch can be hard, and you’ll probably mess it up. That’s cool, we’ve all been there. You’ll get better with practice, just keep at it.

One of the most important things you can do to cultivate your skill is to spend time every day reflecting on the conversations you’re having with your people. Do it at the end of every conversation, or at the end of the day. Just make sure you do it.

The two questions you need to ask:

What style was in use by people in the room?
Was the style appropriate, given the topic?

What style should I use?

The context of the conversation determines what style you use. It’s your job to identify that context. Practice, practice, practice.

When picking the style, ask yourself: What are the implications of the conversation for the people involved? Are you talking about ideas, or people?

Talk-to-think is brilliant for discussing ideas. You’ll use it heavily for technical problem solving, when sketching out a problem and devising potential solutions as a team. Talk-to-think can also be used for organisational problem solving, when discussing org structure problems, organisational debt or inefficiencies. The caveat is that you need to be really fucking clear with the team that the conversation is a hypothetical brainstorm, and nothing is changing. It’s risky, and I would avoid having those sort of organisational problem solving discussions unless you know your team exceptionally well, and are confident in your ability to reroute the conversation when things get dicey. Tread with care.

Think-to-talk is brilliant for discussing people. This is the style you’ll want to use in your 1:1s when talking about reporting lines, career development, rates and salaries. Slow, methodical, precise conversations are important for setting expectations and not creating confusion and uncertainty about people’s positions within the company.

You need to be aware that the people you’re talking to may simply not be comfortable communicating in your preferred style. Sometimes the other person isn’t that good yet at using your preferred style. This will feel like a drag to you, because you want to use the most efficient style for the situation.

But it’s not about you.

It’s your responsibility to identify what’s going on and compromise. Look for cues that your style isn’t working. When using Talk-To-Think, the other person will be talking less and less, withdrawing from the conversation. When Think-To-Talking, the other person can be frustrated their conversational energy isn’t being matched.

How you are perceived

Sometimes you’ll misjudge the conversation and pick the wrong style.

If you’re using Think-To-Talk with a Talk-To-Thinker you’ll appear haughty, aloof, coldly calculating, surgical, and uncaring.

Talk-to-Thinking with a Think-To-Talker will paint you as scatterbrained, flippant, irrationally vigorous, overbearing, interrupting, and uncaring.

You’ll note that uncaring is shared. An empathy gap is at the root of the miscalibration.

Nobody wants to be perceived as any of these things. Watch for cues, be aware of how people are reacting to what you and others are saying.

Are you slowing the conversation down by not engaging more vigorously? Are you getting too caught up in detail? Switch to Talk-To-Think.

Are you confusing the other person by using lots of potentially conflicting ideas? Are they growing more concerned with every word that comes out of your mouth? Switch to Think-To-Talk.

Be the Talk-To-Think umbrella

People spend a lot of time looking up at what the people above them in the org structure are doing, what they’re saying, who they’re saying it to, and how often they say it. Couple that with a Talk-To-Think communication style up the chain, and it constantly creates and cultivates concerning confusion and uncertainty.

It is damaging as fuck to people in your team because they don’t know how seriously to take ideas from people further up the chain, forcing them into a terrible feedback loop of watching for more cues that have them despairing further.

If you see Talk-To-Think communication coming from above, especially around strategic direction, it’s your duty to shield your team from that and turn that noise into signal. Distill those opinions into facts, create certainty for your team.

Be prepared

Before you walk into a conversation, you owe it to your people to mentally prepare for the style you need to use. This doesn’t have to be an ornate, time consuming ritual - some simple priming is enough.

If you are a Think-To-Talker going into a Talk-To-Think melee, try listening to uptempo trigger music, or going for a walk or quick jog around the block.

Talk-To-Thinker going to the Think-To-Talk doldrums? Listen to downtempo trigger music, or limit sensory inputs by nesting yourself in a quiet dark room.

Have a routine. Prime yourself, have triggers, experiment and change. Where possible integrate some sort of physical activity into the trigger, and avoid screens.

Understanding your communication strengths and weaknesses is one of the hardest but most rewarding things you can do in your management career change.

Diligent and disciplined mastery of this alone puts you head and shoulders above the rest, and the people you lead will respect you for treating them how they want to be treated.

CD for infrastructure services

2015-05-22T00:00:00+00:00

For the last 6 months I’ve been consulting on a project to build a monitoring metrics storage service to store several hundred thousand metrics that are updated every ten seconds. We decided to build the service in a way that could be continuously deployed and use as many existing Open Source tools as possible.

There is a growing body of evidence to show that continuous deployment of applications lowers defect rates and improves software quality. However, the significant corpus of literature and talks on continuous delivery and deployment is primarily focused on applications - there is scant information available on applying these CD principles to the work that infrastructure engineers do every day.

Through the process of building a monitoring service with a continuous deployment mindset, we’ve learnt quite a bit about how to structure infrastructure services so they can be delivered and deployed continuously. In this article we’ll look at some of the principles you can apply to your infrastructure to start delivering it continuously.

How to CD your infrastructure successfully

There are two key principles for doing CD with infrastructure services successfully:

Optimise for fast feedback. This is essential for quickly validating your changes match the business requirements, and eliminating technical debt and sunk cost before it spirals out of control.
Chunk your changes. A CD mindset forces you to think about creating the shortest and smoothest path to production for changes to go live. Anyone who has worked on public facing systems knows that many big changes made at once rarely result in happy times for anyone involved. Delivering infrastructure services continuously doesn’t absolve you from good operational practice - it’s an opportunity to create a structure that reinforces such practices.

Definitions

Continuous Delivery is different from Continuous Deployment in that in Continuous Delivery there is some sort of human intervention required to promote a change from one stage of the pipeline to the next. In Continuous Deployment no such breakpoint exists - changes are promoted automatically. The speed of Continuous Deployment comes at the cost of potentially pushing a breaking change live. Most discussions of “CD” don’t qualify which of the two they mean.
An infrastructure service is a configuration of software and data that is consumed by other software - not by end users themselves. Think of them as “the gears of the internet”. Examples of infrastructure services include DNS, databases, Continuous Integration systems, or monitoring.
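
To make the Delivery/Deployment distinction concrete, here is a minimal Python sketch of the promotion decision. The stage names, the mode strings and the next_stage function are all hypothetical, invented for illustration - they don't come from any particular CI tool.

```python
STAGES = ["build", "staging", "production"]  # hypothetical stage names


def next_stage(current, tests_passed, mode, approved=False):
    """Return the stage a change sits in after a promotion decision.

    mode is "delivery" (a human must approve each promotion) or
    "deployment" (passing tests is enough to promote).
    """
    if mode not in ("delivery", "deployment"):
        raise ValueError(f"unknown mode: {mode}")
    if not tests_passed:
        return current  # a failing change never moves forward
    if mode == "delivery" and not approved:
        return current  # Continuous Delivery: wait for a human
    position = STAGES.index(current)
    # Promote one stage, staying put once the last stage is reached.
    return STAGES[min(position + 1, len(STAGES) - 1)]


if __name__ == "__main__":
    print(next_stage("staging", True, "delivery"))
    print(next_stage("staging", True, "delivery", approved=True))
    print(next_stage("staging", True, "deployment"))
```

The only difference between the two modes is the approved check: in delivery mode a passing change waits for a person, in deployment mode it moves on its own.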

What the pipeline looks like

Push. An engineer makes a change to the service configuration and pushes it to a repository. There may be ceremony around how the changes are reviewed, or they could be pushed directly into master.
Detect and trigger. The CI system detects the change and triggers a build. This can be through polling the repository regularly, or a hosted version control system (like GitHub) may call out via a webhook.
Build artifacts. The build sets up dependencies and builds any required software artifacts that will be deployed later.
Build infrastructure. The build talks to an IaaS service to build the necessary network, storage, compute, and load balancing infrastructure. The IaaS service may be run by another team within the business, or an external provider like AWS.
Orchestrate infrastructure. The build uses some sort of configuration management tool to string the provisioned infrastructure together to provide the service.

There is a testing step between almost all of these steps. Automated verification of the changes about to be deployed and the state of the running service after the deployment is crucial to doing CD effectively. Without it, CD is just a framework for continuously shooting yourself in the foot faster and not learning to stop. You will fail if you don’t build feedback into every step of your CD pipeline.

Defining the service for quality feedback

Decide what guarantees you are providing to your users. A good starting point for thinking about what those guarantees should be is the CAP theorem. Decide if the service you’re building is an AP or CP system. Infrastructure services generally tend towards AP, but there are cases where CP is preferred (e.g. databases).
Define your SLAs. This is where you quantify the guarantees you’ve just made to your users. These SLAs will relate to service throughput, availability, and data consistency (note the overlap with CAP theorem). “95e response time for monitoring metric queries in a one hour window is < 1 second” and “a single storage node failure does not result in graph unavailability” are examples of SLAs.
Codify your SLAs as tests and checks. Once you’ve quantified your guarantees as SLAs, this is how you get automated feedback throughout your pipeline. These tests must be executed while you’re making changes. Use your discretion as to whether you run all of the tests after every change, or a subset.
Define clear interfaces. It’s extremely rare you have a service that is one monolithic component that does everything. Infrastructure services are made of multiple moving parts that work together to provide the service, e.g. multiple PowerDNS instances fronting a MySQL cluster. Having clear, well defined interfaces is important for verifying expected interactions between parts before and after changes, as well as during the normal operation of the service.
Know your data. Understanding where the data lives in your service is vital to understanding how failures will cascade throughout your service when one part fails. Relentlessly eliminate state within your service by pushing it to one place and front access with horizontally scalable immutable parts. Your immutable infrastructure is then just a stateless application.
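
As a rough illustration of codifying an SLA as a check, here is a Python sketch of the response time example above. It reads “95e” as the 95th percentile and uses the nearest-rank method, both of which are assumptions; the function names and the sample data are made up for the example.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_query_latency_sla(response_times_s, max_p95_s=1.0):
    """Return (ok, message) for the '95e response time < 1 second' SLA."""
    p95 = percentile(response_times_s, 95)
    ok = p95 < max_p95_s
    status = "OK" if ok else "CRITICAL"
    return ok, f"{status}: 95e response time {p95:.3f}s (limit {max_p95_s:.3f}s)"


if __name__ == "__main__":
    # Hypothetical query timings (seconds) collected over a one hour window.
    timings = [0.2] * 19 + [3.0]
    print(check_query_latency_sla(timings))
```

A check like this can run after every change in the pipeline, and the same function can back a regular monitoring check.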

Making it fast

Getting iteration times down is the most important goal for achieving fast feedback. From pushing a change to version control to having the change live should take less than 5 minutes (excluding cases where you’ve gotta build compute resources). Track execution time on individual stages in your pipeline with time(1), logged out to your CI job’s output. Analyse this data to determine the min, max, median and 95e execution time for each stage. Identify what steps are taking the longest and optimise them.
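
Once the timings are out of the CI logs, the analysis is only a few lines. The sketch below assumes you have already extracted per-stage durations in seconds (parsing the time(1) output is not shown), and again reads “95e” as the nearest-rank 95th percentile. The stage names and numbers are invented.

```python
import math
import statistics


def summarise(durations):
    """Return min, max, median and 95e for a non-empty list of durations."""
    ordered = sorted(durations)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "median": statistics.median(ordered),
        "95e": ordered[rank - 1],
    }


def slowest_stage(timings):
    """Return the name of the stage with the highest median duration."""
    return max(timings, key=lambda stage: summarise(timings[stage])["median"])


if __name__ == "__main__":
    # Hypothetical per-stage durations (seconds) from recent builds.
    timings = {
        "build artifacts": [40, 42, 45, 41],
        "build infrastructure": [300, 280, 610, 295],
        "orchestrate": [90, 85, 88, 120],
    }
    for stage, durations in timings.items():
        print(stage, summarise(durations))
    print("optimise first:", slowest_stage(timings))
```

Ranking by median rather than max keeps one freak slow build from steering your optimisation effort.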

Get your CI system close to the action. One nasty aspect of working with infrastructure services is the latency between where you are making changes from, and where the service you’re making changes to is hosted. By moving your CI system into the same point of presence as the service, you minimise latency between the systems.

This is especially important when you’re interacting with an IaaS API to inventory compute or storage resources at the beginning of a build. Before you can act on any compute resources to install packages or change configuration files you need to ensure those compute resources exist, either by building up an inventory of them or creating them and adding them to said inventory.

Every time your CD pipeline runs it has to talk to your IaaS provider to do these three steps:

Does the thing exist?
Maybe make a change to create the thing
Get info about the thing

Each of these steps requires sending and receiving often non-trivial amounts of data that will be affected by network and processing latency.
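
The three steps can be sketched in Python as follows. FakeIaaS is an in-memory stand-in written for this example, not a real provider SDK, and its method names are hypothetical. It counts calls because with a real provider each call is a network round trip.

```python
class FakeIaaS:
    """In-memory stand-in for an IaaS API that counts every call made."""

    def __init__(self):
        self.servers = {}
        self.calls = 0

    def exists(self, name):
        self.calls += 1
        return name in self.servers

    def create(self, name, size):
        self.calls += 1
        address = f"10.0.0.{len(self.servers) + 1}"
        self.servers[name] = {"name": name, "size": size, "ip": address}

    def describe(self, name):
        self.calls += 1
        return dict(self.servers[name])


def ensure_server(api, name, size):
    """Does the thing exist? Maybe create it. Then get info about it."""
    if not api.exists(name):
        api.create(name, size)
    return api.describe(name)


def build_inventory(api, wanted):
    """Return {name: details} for every wanted server, creating any missing."""
    return {name: ensure_server(api, name, size) for name, size in wanted.items()}


if __name__ == "__main__":
    api = FakeIaaS()
    wanted = {"db1": "large", "web1": "small"}
    build_inventory(api, wanted)
    print("first run calls:", api.calls)
    build_inventory(api, wanted)
    print("total calls after second run:", api.calls)
```

Even when nothing needs creating, every build still pays for two calls per resource, which is why the round trip time between CI and the IaaS API dominates.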

By moving your CI close to the IaaS API, you get a significant boost in run time performance. By doing this on the monitoring metrics storage project we reduced the CD pipeline build time from 20 minutes to 5 minutes.

Push all your changes through CI. It’s tempting when starting out your CD efforts to push some changes through the pipeline, but still make ad-hoc changes outside the pipeline, say from your local machine.

This results in several problems:

You don’t receive the latency reducing benefits of having your CI system close to the infrastructure.
You limit visibility to other people in your team as to what changes have actually been made to the service. That quick fix you pushed from your local machine might contribute to a future failure that your colleagues will have no idea about. The team as a whole benefits from having an authoritative log of all changes made.
You end up with divergent processes - one for ad-hoc changes and another for Real Changes™. Now you’re optimising two processes, and those optimisations will likely clobber one another. Have fun.
You reduce your confidence that changes made in one environment will apply cleanly to another. If you’re pushing changes through multiple environments before they are applied to your production environment, a one-off change in one environment can make later changes pass there but fail elsewhere.

There’s no point in lying: pushing all changes through CI is hard but worth it. It requires thinking about changes differently and embracing a different way of working.

The biggest initial pushback you’ll probably get is having to context switch between your terminal where you’re making changes and the web browser where you’re tracking the CI system output. This context switch sounds trivial but I dare you to try it for a few hours and not feel like you’re working more slowly.

Netflix Skunkworks’ jenkins-cli is an absolute godsend here - it allows you to start, stop, and tail jobs from your command line. Your workflow for making changes now looks something like this:

git push && jenkins start $job && jenkins tail $job

The tail is the real killer feature here - you get the console output from Jenkins on your command line without the need to switch away to your browser.

Chunking your changes

Change one, test one is a really important way of thinking about how to apply changes so they are more verifiable. When starting out CD the easiest path is to make all your changes and then test them straight away, e.g.

Change app

Change database

Change proxy

Test app

Test database

Test proxy

What happens when your changes cause multiple tests to fail? You’re faced with having to debug multiple moving parts without solid information on what is contributing to the failure.

There’s a very simple solution to this problem - test immediately after you make each change:

Change app

Test app

Change database

Test database

Change proxy

Test proxy

When you make changes to the app that fail the tests, you’ll get fast feedback and automatically abort all the other changes until you debug and fix the problem in the app layer.

If you were applying changes by hand you would likely be doing something like this anyway, so encode that good practice into your CD pipeline.
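
Here is a minimal Python sketch of change one, test one. The run_pipeline function and the dictionary standing in for real infrastructure are illustrative assumptions, not how any particular CI system does it.

```python
def run_pipeline(steps):
    """Apply each change and test it straight away.

    steps is a list of (name, change, test) tuples, where change and test
    are callables and test returns True on success. Returns a tuple of
    (names that passed, name that failed or None).
    """
    passed = []
    for name, change, test in steps:
        change()
        if not test():
            # Abort: later changes are never applied on top of a broken layer.
            return passed, name
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    state = {}  # stand-in for real infrastructure

    def change(layer, version):
        return lambda: state.update({layer: version})

    def test(layer, expected):
        return lambda: state.get(layer) == expected

    steps = [
        ("app", change("app", "v2"), test("app", "v2")),
        # This test expects v3, so it fails on purpose.
        ("database", change("database", "v2"), test("database", "v3")),
        ("proxy", change("proxy", "v2"), test("proxy", "v2")),
    ]
    print(run_pipeline(steps))
    print(state)
```

Because the database test fails, the proxy change is never applied, so there is only one layer to debug.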

Tests must finish quickly. If you’ve worked on a code base with good test coverage you’ll know that slow tests are a huge productivity killer. Exactly the same applies here - the tests should be a help, not a hindrance. Aim to keep each test executing in under 10 seconds, preferably under 5 seconds.

This means you must make compromises in what you test. Test for really obvious things like “Is the service running?”, “Can I do a simple query?”, “Are there any obviously bad log messages?”. You’ll likely see the crossover here with “traditional” monitoring checks. You know, those ones railed against as being bad practice because they don’t sufficiently exercise the entire stack.

In this case, they are a pretty good indication your change has broken something. Aim for “good enough” fast coverage in your CD pipeline which complements your longer running monitoring checks to verify things like end-to-end behaviour.
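
One way to keep yourself honest about test speed is to time every check and flag the ones that blow the budget. This is a Python sketch under assumptions: the run_check wrapper, the 10 second default and the log patterns are examples chosen for illustration, not part of Serverspec or any other tool.

```python
import time


def has_bad_log_lines(lines, patterns=("ERROR", "FATAL", "panic")):
    """Return True if any log line contains an obviously bad pattern."""
    return any(pattern in line for line in lines for pattern in patterns)


def run_check(name, check, budget_s=10.0):
    """Run a check callable, recording its result and how long it took."""
    start = time.monotonic()
    try:
        passed = bool(check())
    except Exception:
        passed = False  # a check that blows up counts as a failure
    elapsed = time.monotonic() - start
    return {
        "name": name,
        "passed": passed,
        "elapsed_s": elapsed,
        "within_budget": elapsed <= budget_s,
    }


if __name__ == "__main__":
    # Hypothetical log lines collected after a deploy.
    log = ["INFO service started", "INFO listening on :8080"]
    print(run_check("no bad log messages", lambda: not has_bad_log_lines(log)))
```

A check that passes but is not within budget is a candidate for moving out of the pipeline and into your longer running monitoring.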

Serverspec is your friend for quickly writing tests for your infrastructure.

Make the feedback visual. The raw data is cool, but graphs are better. If you’re doing a simple threshold check and you’re using something like Librato or Datadog, link to a dashboard.

If you want to take your visualisation to the next level, use gnuplot’s dumb terminal output to graph metrics on the command line:

  1480 ++---------------+----------------+----------------+---------------**
       +                +                +                + ************** +
  1460 ++                                            *******              ##
       |                                      *******                 #### |
  1440 ++                    *****************                 #######    ++
       |                  ***                                ##            |
  1420 *******************                                  #             ++
       |                                                   #               |
  1400 ++                                                ##               ++
       |                                             ####                  |
       |                                          ###                      |
  1380 ++                                      ###                        ++
       |                                     ##                            |
  1360 ++                               #####                             ++
       |                            ####                                   |
  1340 ++                    #######                                      ++
       |                  ###                                              |
  1320 ++          #######                                                ++
       ############     +                +                +                +
  1300 ++---------------+----------------+----------------+---------------++
       0                5                10               15               20


CRITICAL: Deviation (116.55) is greater than maximum allowed (100.00)
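The check that produced the message above is not shown, so the following is only a sketch of what a deviation threshold check might look like. It assumes "deviation" means the largest absolute gap between two series sampled at the same points (for example, a metric before and after a change). That definition, and the function names, are assumptions. The exit code of 2 follows the usual Nagios plugin convention for CRITICAL.

```python
def max_deviation(baseline, candidate):
    """Largest absolute gap between two equally long, non-empty series."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("series must be non-empty and the same length")
    return max(abs(b - c) for b, c in zip(baseline, candidate))


def check_deviation(baseline, candidate, maximum=100.0):
    """Return a Nagios-style (exit_code, message) pair."""
    deviation = max_deviation(baseline, candidate)
    if deviation > maximum:
        return 2, "CRITICAL: Deviation (%.2f) is greater than maximum allowed (%.2f)" % (
            deviation,
            maximum,
        )
    return 0, "OK: Deviation (%.2f) is within maximum allowed (%.2f)" % (
        deviation,
        maximum,
    )
```

Returning the message alongside the exit code keeps the check easy to test and easy to print underneath a graph.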

Conclusion

CD of infrastructure services is possible provided you stick to the two guiding principles:

Optimise for fast feedback.
Chunk your changes.

Focus on constantly identifying and eliminating bottlenecks in your CD pipeline to get your iteration time down.

Why do you want to lead people?

2014-10-03T00:00:00+00:00

Understanding your motivations for a career change into management is vitally important to understanding what kind of manager you want to be.

When I made the transition into management, I didn’t have a clear idea of what my motivations were. I had vague feelings of wanting to explore the challenges of managing people. I also wanted to test myself and see if I could do as good a job as role models throughout my career.

But all of this was vague, unquantifiable feelings that took a while to get a handle on. Understanding, questioning, and clarifying my motivations was something I put a lot of thought into in the first year of my career change.

People within your teams will spend much more time than you realise looking at and analysing what you are doing, and they will pick up on what your motivations are, and where your priorities lie.

They will mimic these behaviours and motivations, both positive and negative. You are a signalling mechanism to the team about what’s important and what’s not.

This is a huge challenge for people making the career change! You’re still working all this shit out, and you’ve got the ever gazing eye of your team examining and dissecting all of your actions.

These are some of the motivations I’ve picked up on in myself and others when trying to understand what drew me to the management career change.

Money

It is undeniable that there is a pay bump when moving to management. In most organisations, the pay ceiling is much higher in management than in engineering.

Many engineers who rise through the ranks get to a point where the only way they will earn more is if they switch from engineering to management, so that becomes the primary motivation.

The pay is higher for a good reason though - it’s actually difficult to do the job well! Management looks easy from the outside, but it’s difficult on the inside. Again, our friends Dunning and Kruger posit that for a given skill, incompetent people will:

tend to overestimate their own level of skill

fail to recognize genuine skill in others

fail to recognize the extremity of their inadequacy

recognize and acknowledge their own previous lack of skill, if they are exposed to training for that skill

Poor decisions are obvious and easy to criticise. Because we spend a lot of time looking at those above us in our organisation, we’re finely attuned to mistakes and inadequacies, and tend to gloss over the good things they do.

Understanding what about those decisions and behaviours makes sense to the people making them is difficult but vital to effectively working with others, regardless of whether you’re in management, engineering, sales, finance, or operations.

More often than not, there are good reasons behind bad decisions. We are all locally rational.

The pay bump has strings attached - you’re going to be making plenty of decisions, both good and bad, and wearing the consequences of them.

You are being paid to be empathetic - to understand how people are feeling, how implementing change will affect people, how to keep them motivated and working towards the big picture goal. None of these tasks are simple!

If you’re primarily motivated to move into management by better pay, then you need to seriously consider how that motivation will affect the people that report to you, how mimicry of those motivations and behaviours by people in your team flows on to other teams you work with, and what you need to do to meet the commitments you have to your team.

Will you be doing the bare minimum to collect your paycheck? What’s stopping you from becoming an example of the Peter Principle? What skills do you need to develop to meet your people’s needs and expectations?

The hard problems in tech are not technology, they’re people. That is why management pays more.

Influence

Being in management grants you power and influence in your organisation to build and run things as you see fit.

This is often a key motivation for people who want to transition from engineering to management - they have a clarity of vision and they want the power to mandate how things should be built, and implement that vision.

The motivation is always rooted in good intentions (“things could be so much more efficient if everyone just listened and did what I said”), and often results in an industrialist approach to managing people - “manager smart, worker stupid”.

The influence trap

Your influence can be wielded as a lever (leadership) or as a vise (management). Levers are useful at moving heavy objects but lack precision. Vises are very precise but a weight too heavy will slip from them.

Vises are an alluring way for first time managers to work. The vise management style is prescriptive, centrally co-ordinated, command and control. And if you watch carefully you’ll soon realise it limits the potential of the team.

Prescriptive, vise-like management assumes you are the smartest person in the room, and know best how things should be done.

It doesn’t multiply the team’s effectiveness. The point of being a manager is to be a lever that multiplies the effectiveness of the team - to synthesise different and conflicting ideas to come to decisions and solutions nobody could have anticipated or come up with by themselves. This is near impossible if you solely wield your influence as a vise.

Studies show people’s individual problem-solving performance lifts after being exposed to teamwork situations and training.

Prescriptive management increases the gap between Work As Imagined vs Work As Done. While conceptually you may have a great idea about how to solve a problem or operate a system daily, the people implementing your plans always discover gaps between the concept and implementation. Over time these gaps become larger, to the point you have a distorted view of how work is being done compared to how it’s actually being carried out.

You optimise the effectiveness of the system by having tight feedback loops and open communication channels, where people are rewarded for providing both negative and positive feedback about the design and operation of the system. As a manager, this means you need to be actively engaging with the people in your team - finding out what they think and feel about the work.

Finally, prescriptive management is an empirically bad way of retaining creative talent. Constant overruling and minimisation of feedback is a great way to piss people off. If you hire creative, intelligent, capable people and keep them locked in a box, they’re going to break out.

Multiplying, trust, and happiness

Maybe you are the smartest person in the room, but others will bring knowledge and experience to the table you simply don’t have.

You get the best out of the team by creating a safe space for people to put forward ideas, argue them without recriminations, and build consensus.

The goal for people leading high performing teams should be to have the output of the team be greater than the sum of the individual efforts of people in the team.

Your status as a manager grants you power within your organisation. That power must be wielded responsibly. You won’t know if you’re wielding that power responsibly in the first 12 months of the career change, at best.

You must constantly assess whether the decisions you’re making are the best for the people who report to you. It’s a constant tightrope act to balance the needs of your people against the needs of the business.

It’s easy to pass the policy buck and say “I’m just following orders” when implementing unpopular changes, but you do have a responsibility to identify and push back on change that negatively affects people before you roll it out, and minimise the unavoidable negative effects of that change.

It does not take long for things to come apart when you take your eye off the ball and stop looking out for the team. Trust is hard to build, and easy to lose. People spend a lot of time looking at you and analysing your behaviour. They will notice much earlier than you realise when you take your eye off the ball.

It takes at least 5 positive interactions to start re-establishing trust after you’ve breached it.

Being in a management position grants you the power to shape how people within your organisation do their work. This means you have a direct influence over their happiness and wellbeing. Blindly implementing policy and not empathising with the people in your team can cause irreparable damage and create emotional scar tissue that will stay with people for years, if not decades.

Your power must be wielded responsibly. Do not fuck this up. When you do (and don’t worry, you will, we all have), own your mistakes, apologise, and rebuild the trust.

Personal development / Career change

Personal development is a pretty good motivation for a career change to management! You want to challenge yourself to do a better job than those before and around you.

A huge personal motivation for me when moving into management was to treat others better than I had been treated throughout my career until that point.

Working in environments where the happiness of people was not the primary concern of those in charge is not a fun experience. Shared negative and stressful experiences helped me form close bonds and develop a camaraderie with the people I worked with. I couldn’t say the same about the people I worked for.

Those relationships are something I value, but I wouldn’t want anyone else to have to go through what we did just to obtain that sort of relationship.

The challenge for me was clear: was it possible to develop that camaraderie within the team I lead through purely positive experiences?

Looking back at how particular decisions and behaviours I experienced affected me and other people in the teams I worked in in these stressful environments, there were some obvious things that I could improve on.

There were other decisions I considered to be poor at the time, but after finding myself in similar positions I made similar choices.

I failed fairly terribly at the transition during the first 12 months of my career change. Someone in my team described my management style as “absent father”. That really put into perspective that my priorities were misplaced, and I needed to focus on the team and not my own individual performance.

My first experience working in tech was overwhelmingly positive. The working environment and management I experienced on a daily basis in the first 3 years of working in tech is the experience I aspire to create for people in the teams I lead every day.

The times I had a “good boss” are some of my best memories in my career. I was focused on the work, consistently delivered things I was excited about, and rarely worried about troubles elsewhere in the business (and it turned out there were a lot of them).

The enduring attitude from that time is the feeling of working with, not for my manager. We worked as a team to solve problems together, not as individuals off doing our own thing. That’s the feeling I want to create in the teams I lead.

Understanding what motivates your career change is not an easy task.

At the end of the first year of my career change, my motivations lay somewhere between influence and personal development.

These motivations have morphed over time. Today, my focus is the happiness of the people I work with.

You need to undertake a constant process of self-reflection to develop your understanding of your motivations. It’s important you create the time and space to do this!

The simplest trap to fall into in your first year is to be focused on the daily grind, the tactical details, and not think about the bigger picture.

This is something that affects experienced and novice managers alike, and it’s important to establish good personal habits early on so you have time to reflect on what motivates you, and what sort of leader you’re going to be.

It's not a promotion - it's a career change

2014-09-19T00:00:00+00:00

The biggest misconception engineers have when thinking about moving into management is they think it’s a promotion.

Management is not a promotion. It is a career change.

If you want to do your leadership job effectively, you will be exercising a vastly different set of skills on a daily basis to what you are exercising as an engineer. Skills you likely haven’t developed and are unaware of.

Your job is not to be an engineer. Your job is not to be a manager. Your job is to be a multiplier.

You exist to remove roadblocks and eliminate interruptions for the people you work with.

You exist to listen to people (not just hear them!), to build relationships and trust, to deliver bad news, to resolve conflict in a just way.

You exist to think about the bigger picture, ask provoking and sometimes difficult questions, and relate the big picture back to something meaningful, tangible, and actionable to the team.

You exist to advocate for the team, to promote the group and individual achievements, to gaze into unconstructive criticism and see underlying motivations, and sometimes even give up control and make sacrifices you are uncomfortable or disagree with.

You exist to make systemic improvements with the help of the people you work with.

Does this sound like engineering work?

The truth of the matter is this: you are woefully unprepared for a career in management, and you are unaware of how badly unprepared you are.

There are two main contributing factors that have put you in this position:

The Dunning-Kruger effect
Systemic undervaluation of non-technical skills in tech

Systemic undervaluation of non-technical skills

Technical skills are emphasised above all in tech. It is part of our mythology.

Technical skill is the dominant currency within our industry. It is highly valued and sought after. If you haven’t read all the posts on the Hacker News front page today, or you’re not running the latest releases of all your software, or you haven’t recently pulled all-nighter coding sessions to ship that killer feature, you’re falling behind bro.

Naturally, for an industry so unhealthily focused on technical skills, they tend to be the deciding factor for hiring people.

Non-technical skills that are lacking, like teamwork, conflict resolution, listening, and co-ordination, are often overlooked and excused away in engineering circles. They are seen as being of lesser importance than technical skills, and organisations frequently compensate for, minimise the effects of, and downplay the importance of these skills.

If you really want to see where our industry places value, just think about the terms “hard” and “soft” we use to describe and differentiate between the two groups of skills. What sort of connotations do each of those words have, and what implicit biases do they feed into and trigger?

If you’re an engineer thinking about going into management, you are a product of this culture.

There are a handful of organisations that create cultural incentives to develop these non-technical skills in their engineers, but these organisations are, by and large, unicorns.

And if you want to lead people, you’re in for a rude shock if you haven’t developed those non-technical skills.

Because guess what - you can’t lead people in the same way you write code or manage machines. If you could, management would have been automated a long time ago.

The Dunning-Kruger effect

The identification of the Dunning-Kruger effect is one of the most interesting developments of modern psychology, and one of the most revelatory insights available to our industry.

In 1999 David Dunning and Justin Kruger started publishing the results of experiments on the ability of people to self-assess competence:

Dunning and Kruger proposed that, for a given skill, incompetent people will:

tend to overestimate their own level of skill

fail to recognize genuine skill in others

fail to recognize the extremity of their inadequacy

recognize and acknowledge their own previous lack of skill, if they are exposed to training for that skill

If you’ve had a career in tech without any leadership responsibilities, you’ve likely had thoughts like:

“Managing people can’t be that hard.”
“My boss has no idea what they are doing.”
“I could do a better job than them.”

Congratulations! You’ve been partaking in the Dunning-Kruger effect.

The bad news: Dunning-Kruger is exacerbated by the systemic devaluation of non-technical skills within tech.

The good news: soon after going into leadership, the scope of your lack of skill, and unawareness of your lack of skill, will become plain for you to see.

Also, everyone else around you will see it.

Multiplied impact

This is the heart of the matter: by being elevated into a position of leadership, you are being granted a responsibility over people’s happiness and wellbeing.

Mistakes made due to lack of skill and awareness can cause people irreparable damage and create emotional scar tissue that will stay with people for years, if not decades.

Conversely, by developing skills and helping your team row in the same direction, you can also create positive experiences that will last with people their entire careers.

The people in your team will spend a lot of time looking up at you - far more time than you realise. Everything you do will be analysed and dissected, sometimes fairly, sometimes not.

If you’re not willing to push yourself, develop the skills, and fully embrace the career change, maybe you should stay on the engineering career development track.

But it’s not all doom and gloom.

By striving to be a multiplier, the effects of the hard work you and the team put in can be far greater than what you can achieve individually.

You only reap the benefits of this if you shift your measure of job satisfaction from your own performance to the group’s.

“Real work”

Many engineers who change into management feel disheartened because they’re not getting as much “real work” done.

If you dig deeper, “real work” is always linked to their own individual performance. Of course you’re not going to perform to the same level as an engineer - you’re working towards the same goals, but you are each working on fundamentally different tasks to get there!

Focusing on your own skills and performance can be a tough loop to break out of - individual achievement is bound up in the same mythology as technical skills - it’s something highly prized and disproportionately incentivised in much of our culture.

If you’ve decided to undertake this career change, it’s important to treat your lack of skill as a learning opportunity, develop a hunger for learning more and developing your skills, and routinely reflect on your experiences and compare yourself to your cohort.

None of these things are easy - I struggled with feelings of inadequacy in meeting the obligations of my job for the first 3 years of being in a leadership position. Once I worked out that I was tying job satisfaction to engineering performance, it was a long and hard struggle to re-link my definition of success to group performance.

If everything you’ve read here hasn’t scared you, and you’ve committed to the change to management, there are three key things you can start doing to start skilling up:

Do professional training.
Get mentors.
Educate yourself.

Training

Tech has a bias against professional training that doesn’t come from universities. Engineering organisations tend to value on-the-job experience over training and certification. A big part of that comes from a lot of technical training outside of universities being a little bit shit.

Our experience of bad training in the technical domain doesn’t apply to management - there is plenty of quality short course management training available, the development of which other industries have been financing for the last couple of decades.

In Australia, AIM provide several courses ranging from introductory to advanced management and leadership development.

Do your research, ask around, find what people would recommend, then make the case for work to pay for it.

Mentors

Find other people in your organisation you can talk to about the challenges you are facing developing your non-technical skills. This person doesn’t necessarily need to be your boss - in fact diversifying your mentors is important for developing skills to entertain multiple perspectives on the same situation.

If you’re lucky, your organisation assigns new managers a buddy to act as a mentor, but professional development maturity for management skills varies widely across organisations.

If you don’t have anyone in your organisation to act as a mentor or buddy, then seek out old bosses and see if they’d be willing to chat for half an hour every few weeks.

I have semi-regular breakfast catchups with a former boss from very early on in my career that are always a breath of fresh air - to the point where my wife actively encourages me to catch up because of how much less stressed I am afterwards.

Another option is to find other people in your organisation also going through the same transition from engineer to manager as you. You won’t have all the answers, but developing a safe space to bounce ideas around and talk about problems you’re struggling with is a useful tool.

Self-education

I spend a lot of time reading and sharing articles on management and leadership - far more time than I spend on any technical content.

At the very beginning of your journey it’s difficult to identify what is good and what is bad, what is gold and what is fluff. I have read a lot of crappy advice, but four years into the journey my barometer for advice is becoming more accurate.

Also, be careful of only reading things that reinforce your existing biases and leadership knowledge. If there’s a particular article I disagree with, I’ll often spend 5 minutes jotting down a brief critique. I’ll either get better at articulating to others what about that idea is flawed, or my perspective will become more nuanced.

It’s also pertinent to note how the article made you feel, and reflect for a moment on what about the article made you feel that way.

If you’re scratching your head for where to start, I recommend Bob Sutton’s “The No Asshole Rule”, then “Good Boss, Bad Boss”. Sutton’s work is rooted in evidence-based management (he’s not talking out of his arse - he’s been to literally thousands of companies and observed how they work), but he writes in an engaging and entertaining way.

Almost four years into my career change, I can say that it’s been worth it. It has not been easy. I have made plenty of mistakes, have prioritised incorrectly, and hurt people accidentally.

But so has everyone else. Nobody else has this nailed. Even the best managers are constantly learning, adapting, improving.

Think about it this way: you’re going to accumulate leadership skills faster than people who have already made the change because you’re starting with nothing. The difference is nuance and tact that comes from experience, something you can develop by sticking with your new career.

This will only happen when you fully commit to your new career, and you change your definition for success to meet your new responsibilities as a manager.

Applying cardiac alarm management techniques to your on-call

2014-08-26T00:00:00+00:00

If alarms are more often false than true, a culture emerges on the unit in that staff may delay response to alarms, especially when staff are engaged in other patient care activities, and more important critical alarms may be missed.

One of the most difficult challenges we face in the operations field right now is “alert fatigue”. Alert fatigue is a term the tech industry has borrowed from a similar term used in the medical industry, “alarm fatigue” - a phenomenon of people being so desensitised to the alarm noise from monitors that they fail to notice or react in time.

In an on-call scenario, I posit two main factors contribute to alert fatigue:

The accuracy of the alert.
The volume of alerts received by the operator.

Alert fatigue can manifest itself in many ways:

Operators delaying a response to an alert they’ve seen before because “it’ll clear itself”.
Impaired reasoning and creeping bias, due to physical or mental fatigue.
Poor decision making during incidents, due to an overload of alerts.

Earlier this year a story popped up about a Boston hospital that silenced alarms to improve the standard of care. It sounded counter-intuitive, but in the context of the alert fatigue problems we’re facing, I wanted to get a better understanding of what they actually did, and how we could potentially apply it to our domain.

The Study

When rolling out new cardiac telemetry monitoring equipment in 2008 to all adult inpatient clinical units at Boston Medical Center (BMC), a Telemetry Task Force (TTF) was convened to develop standards for patient monitoring. The TTF was a multidisciplinary team drawing people from senior management, cardiologists, physicians, nursing practitioners and directors, clinical instructors, and a quality and patient safety specialist.

BMC’s cardiac telemetry monitoring equipment provides configurable limit alarms (we know this as “thresholding”), with alarms for four levels: message, advisory, warning, crisis. These alarms can either be visual or auditory.

As part of the rollout, TTF members observed nursing staff responding to alarms from equipment configured with factory default settings. The TTF members observed that alarms were frequently ignored by nursing staff, but for a good reason - the alarms would self-reset and stop firing.

To frame this behaviour from an operations perspective, this is like a Nagios check passing a threshold for a CRITICAL alert to fire, the on-call team member receiving the alert, sitting on it for a few minutes, and the alert recovering all by itself.

When the nursing staff were questioned about this behaviour, they reported that more often than not the alarms self-reset, and answering every alarm pulled them away from looking after patients.

Fast forward 3 years, and in 2011 BMC started an Alarm Management Quality Improvement Project that experimented with multiple approaches to reducing alert fatigue:

Widen the acceptable thresholds for patient vitals so alarms would fire less often.
Eliminate all levels of alarms except “message” and “crisis”. Crisis alarms would emit an audible alert, while message history would build up on the unit’s screen for the next nurse to review.
Alarms that had the ability to self-reset (recover on their own) were disabled.
If false positives were detected, nursing staff were required to tune the alarms as they occurred.

The approaches were applied over the course of 6 weeks, with buy-in from all levels of staff, most importantly with nursing staff who were responding to the alarms.

Results from the study were clear:

The number of total audible alarms decreased by 89%. This should come as no surprise, given the alarms were tuned to not fire as often.
The number of code blues decreased by 50%. This indicates that the reduction of work from the elimination of constant alarms freed up nurses to provide more proactive care, and that lower priority alarms for precursor problems for code blues are more likely to be responded to.
The number of Rapid Response Team activations on the unit stayed constant. It’s reasonable to assert that the operational effectiveness of the unit was maintained even though alarms fired less often.
Anonymous surveys of nurses on the unit showed an increase in satisfaction with the level of noise on the unit, with night staff reporting they “kept going back to the central station to reassure themselves that the central station was working”. One anonymous comment stated “I feel so much less drained going home at the end of my shift”.

At the conclusion of the study, the nursing staff requested that the previous alarming defaults not be restored.

Analysis

The approach outlined in the study is pretty simple: change the default alarm thresholds so they don’t fire unless action must be taken, and give the operator the power to tune the alarms if the alarm is inaccurate.

Alerts should exist in two states: nothing is wrong, and the world is on fire.

But the elimination of alarms that have the ability to recover is a really surprising solution. Can we apply that to monitoring in an operations domain?

Two obvious methods to make this happen:

Remove checks that have the ability to self-recover.
Redesign checks so they can’t self-recover.

For redesigning checks, I’ve yet to encounter a check designed to not recover when thresholds are no longer exceeded. That would be a very surprising alerting behaviour to stumble upon in the wild, one that most operators, myself included, would likely attribute to a bug in the check. Socially, a check redesign like that would break many fundamental assumptions operators have about their tools.

From a technical perspective, a non-recovering check would require the check having some sort of memory about its previous states and acknowledgements, or at least have the alerting mechanism do this. This approach is totally possible in the realm of more modern tools, but is not in any way commonplace.
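To make the "memory" idea concrete, here is a minimal sketch of a latching check: once the threshold is exceeded the check stays CRITICAL until a person acknowledges it, even if the measurement drops back under the threshold. The class and method names are hypothetical, and this is not how any particular monitoring tool behaves.

```python
class LatchingCheck:
    """A threshold check that cannot recover by itself."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.alerting = False  # the check's memory of having fired

    def observe(self, value):
        """Record a measurement and return the current state."""
        if value > self.threshold:
            self.alerting = True
        # Deliberately no else branch: a healthy value does not clear the latch.
        return "CRITICAL" if self.alerting else "OK"

    def acknowledge(self):
        """Called when an operator has looked at the alert; clears the latch."""
        self.alerting = False


if __name__ == "__main__":
    check = LatchingCheck(threshold=90)
    for value in (50, 95, 40):
        print(value, check.observe(value))
    check.acknowledge()
    print(40, check.observe(40))
```

In a real system the latch would need to live somewhere durable, such as the alerting mechanism's own state store, rather than in process memory.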

Regardless of the problems above, I believe adopting this approach in an operations domain would be achievable and I would love to see data and stories from teams who try it.

As for removing checks, that’s actually pretty sane! The typical CPU/memory/disk utilisation alerts engineers receive can be handy diagnostics during outages, but in almost all modern environments they are terrible indicators for anomalous behaviour, let alone something you want to wake someone up about. If my site can take orders, why should I be woken up about a core being pegged on a server I’ve never heard of?

Looking deeper though, the point of removing alarms that self-recover is to eliminate the background noise of alarms that are ignorable. This ensures each and every alarm that fires actually requires action, is investigated, acted upon, or is tuned.

This is only possible if the volume of alerts is low enough, or there are enough people to distribute the load of responding to alerts. Ops teams that meet both of these criteria do exist, but they’re in the minority.

Another consideration is that checks for operations teams are cheap, but physical equipment for nurses is not. I can go and provision a couple of thousand new monitoring checks in a few minutes and have them alert me on my phone, and do all that without even leaving my couch. There are capacity constraints on the telemetry monitoring in hospitals - budgets limit the number of potential alarms that can be deployed and thus fire, and a person physically needs to move and act on a check to silence it.

Also consider that hospitals are dealing with pets, not cattle. Each patient is a genuine snowflake, and the monitoring equipment has to be tuned for size, weight, health. We are extremely lucky in that most modern infrastructure is built from standard, similarly sized components. The approach outlined in this study may be more applicable to organisations who are still looking after pets.

There are constraints and variations in physical systems like hospitals that simply don’t apply to the technical systems we’re nurturing, but there is a commonality between the fields: thinking about the purpose of the alarm, and how people are expected to react to it firing, is an extremely important consideration when designing the interaction.

One interesting anecdote from the study was that extracting alarm data was a barrier to entry, as manufacturers often don’t provide mechanisms to easily extract data from their telemetry units. We have a natural advantage in operations in that we tend to own our monitoring systems end-to-end and can extract that data, or have access to APIs to easily gather the data.

The key takeaway the authors of the article make clear is this:

Review of actual alarm data, as well as observations regarding how nursing staff interact with cardiac monitor alarms, is necessary to craft meaningful quality alarm initiatives for decreasing the burden of audible alarms and clinical alarm fatigue.

Regardless of whether you think any of the methods employed above make sense in the field of operations, it’s difficult to argue against collecting and analysing alerting data.
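As a rough illustration of what analysing alerting data can look like, the sketch below summarises a list of alert records per check. The record shape, the check names, and the sample data are all invented for the example. In practice you would pull these records from your own monitoring system's API or database.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    check: str       # name of the check that fired
    acted_on: bool   # did a person actually have to do something?


def summarise(alerts: List[Alert]) -> Dict[str, Dict[str, float]]:
    """Count alerts per check and the share that needed no action."""
    totals = Counter(a.check for a in alerts)
    ignored = Counter(a.check for a in alerts if not a.acted_on)
    return {
        check: {
            "count": count,
            "no_action_ratio": ignored[check] / count,
        }
        for check, count in totals.items()
    }


# Invented sample data, not real measurements.
alerts = [
    Alert("cpu_load", acted_on=False),
    Alert("cpu_load", acted_on=False),
    Alert("cpu_load", acted_on=True),
    Alert("cpu_load", acted_on=False),
    Alert("orders_failing", acted_on=True),
    Alert("orders_failing", acted_on=True),
]
summary = summarise(alerts)
```

Checks with a high `no_action_ratio` are the obvious candidates for tuning or removal.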

The thing that excites me so much about this study is there is actual data to back the proposed techniques up! This is something we really lack in the field of operations, and it would be amazing to see more companies publish studies analysing different alert management techniques.

Finally, the authors lay out some recommendations other institutions can use to reduce alarm fatigue without requiring additional resources or technology.

To adapt them to the field of operations:

Establish a multidisciplinary alerting work group (dev, ops, management).
Extract and analyse alerting data from your monitoring system.
Eliminate alerts that are inactionable, or are likely to recover themselves.
Standardise default thresholds, but allow local variations to be made by people responding to the alerts.
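The last recommendation, standard defaults with local variations, can be sketched as a simple merge of per-host overrides onto a shared set of thresholds. The threshold values, host names, and function name below are hypothetical.

```python
from typing import Dict

# Organisation-wide defaults (hypothetical values).
DEFAULT_THRESHOLDS: Dict[str, float] = {"warning": 80.0, "critical": 90.0}

# Local variations set by the people who respond to the alerts.
LOCAL_OVERRIDES: Dict[str, Dict[str, float]] = {
    "batch-01": {"critical": 98.0},
}


def thresholds_for(host: str) -> Dict[str, float]:
    """Return the defaults with any local overrides for this host applied."""
    merged = dict(DEFAULT_THRESHOLDS)
    merged.update(LOCAL_OVERRIDES.get(host, {}))
    return merged


batch = thresholds_for("batch-01")
web = thresholds_for("web-01")
```

Hosts without an override fall back to the defaults, so the standard stays the norm and each variation is an explicit, reviewable exception.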

Rethinking monitoring post-Monitorama PDX

2014-05-10T00:00:00+00:00

The two key take home messages from Monitorama PDX are these:

We are mistakenly developing monitoring tools for ops people, not the developers who need them most.
Our over-reliance on strip charts as a method for visualising numerical data is hurting ops as a craft.

Death to strip charts

Two years ago when I received my hard copy of William S. Cleveland’s The Elements of Graphing Data, I eagerly opened it and scoured its pages for content on how to better visualise time series data. There were a few interesting methods to improve the visual perception of data in strip charts (banking to 45˚, limiting the colour palette), but to my disappointment there were no more than ~30 pages in the 297 page tome that addressed visualising time series data.

In his talk at Monitorama PDX, Neil Gunther goes on a whirlwind tour of visualising data used by ops daily with visual tools other than time series strip charts. By ignoring time, looking at the distribution, and applying various transformations to the axes (linear-log, log-log, log-linear), Neil demonstrates how you can expose patterns in data (like power law distributions) that were simply invisible in the traditional linear time series form.
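Here is a small sketch of the log-log idea, using only the Python standard library. It is not code from Neil's talk. The samples are synthetic and deliberately follow a power law with exponent -2, so the example shows what the transformation reveals rather than proving anything about real data.

```python
import math
import random
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit, returning (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic samples: value = 1000 * rank^-2, shuffled to mimic arrival order.
samples = [1000.0 * rank ** -2.0 for rank in range(1, 51)]
random.Random(0).shuffle(samples)

# Ignore time: sort by size and pair each value with its rank.
ranked = sorted(samples, reverse=True)
log_rank = [math.log(rank) for rank in range(1, len(ranked) + 1)]
log_value = [math.log(value) for value in ranked]

# On log-log axes a power law is a straight line; its slope is the exponent.
slope, intercept = fit_line(log_rank, log_value)
```

Plotted against time, these samples would look like noise with a few spikes. Ranked and log-transformed, they fall on a straight line whose slope is the power law exponent.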

Neil’s talk explains why Cleveland’s Elements gives so little time to time series strip charts - they are a limited tool that obfuscates any data that doesn’t match a very limited set of patterns.

Strip charts are the PHP Hammer of monitoring.

We have been conditioned to accept strip charts as the One True Way to visualise time series data, and it is fucking us over without us even realising it. Time series strip charts are the single biggest engineering problem holding monitoring as a craft back.

It’s time to shape our future by building new tools and extending existing ones to visualise data in different ways.

This requires improving the statistical and visual literacy of tool developers (who are providing the generalised tools to visualise the data), and the people who are using the graphs to solve problems.

There is another problem here, which Rashid Khan touched on during his time on stage: many people are using logstash & Kibana directly and avoid numerical metric summaries of log data because that numerical data is just an abstraction of an abstraction.

The textual logs provide far more insight into what’s happening than numbers:

As an ops team, you have one job: provide a platform app developers can wire up logs, checks, and metrics to (in that order). Expose that to them in a meaningful way for analysis later on.

The real target audience for monitoring (or, How You Can Make Money In The Monitoring Space)

Adrian Cockcroft made a great point in his keynote: we are building monitoring tools for ops people, not the developers who need them most. This is a piercing insight that fundamentally reframes the problem domain for people building monitoring tools.

Building monitoring tools and clean integration points for developers is the most important thing we can do if we want to actually improve the quality of people’s lives on a day to day basis.

Help your developers ship a Sensu config & checks as part of their app. You can even leverage existing testing frameworks they are already familiar with.

This puts the power & responsibility of monitoring applications into the hands of people who are closest to the app. Ops still provide value: delivering a scalable monitoring platform, and working with developers to instrument & check their apps. You are reducing duplication of effort and have time to educate non-ops people on how to get the best insight into what’s happening.

There is still room for monitoring tools as we’ve traditionally used them, but that’s mostly limited to providing insight into the platforms & environments that ops are providing to developers to run their applications.

The majority of application developers don’t care about the internal functioning of the platform though, and they almost certainly don’t want to be alerted about problems within the platform, other than “the platform has problems, we’re working on fixing them”.

The money in the monitoring industry is in building monitoring tools that eliminate the friction for developers to get better insight into how their applications are performing and behaving in the real world. New Relic is living proof of this, but the market is far larger than what New Relic is currently catering to, and it’s a far larger market than the ops tools market because developers are much more willing to adopt new tools, experiment, and tinker.

If you can provide a method for developers to expose application state in a meaningful way while lowering the barrier to entry, they will jump at it.

So are you building monitoring tools for the future?

Using a first gen iPad mini as a grafana dashboard in 2024

Trial and error

Crazy or annoying: pick one

Using MikroTik Netinstall on Linux

Netinstall is only one half of the solution. The other is Etherboot.

How to run netinstall on Linux

You can’t use non-MikroTik tools (like dnsmasq) to serve up the RouterOS images

My philosophy on work

I wrote this so you understand my philosophy on work.

I’m here to route information, remove roadblocks, and shield the team

I value fairness, context, and pride in work

Fairness

Context

Pride in work

My expectations are few but firm

Feedback will be direct, prompt, and humane

My office hours are 10.00 to 17.30

1:1s are the most important conversations I have

Slack is the best way to contact me

I have some quirks. I’m working on them.

This document, like me, is a work in progress

A simple proxy service for scrapers running on Morph

The scraper

Designed to be cheap, resilient, and open

Drive changes with make and environment variables

Wrap it with a Continuous Deployment pipeline

Civic hacking for government shortfalls

AWS in government: risks, myths, and misconceptions

Myth: We can’t store data securely!

Misconception: We’ll run it like physical infrastructure!

Risk: Our spend is getting out of control!

Risk: Our stuff is getting hacked!

Misconception: We aren’t getting the reliability benefits!

Conclusion

Help! I’ve just been made a manager

Get a job description

Managing your workload

Create feedback loops

Hard truths

PreAccident Investigation Podcast Highlights, Sep-Oct 2015

Kent Whipple – The power of the story

Dr. Alan Frankfurt - High Reliability, Safety, and Delivering Babies

Dr Jim Joy - Critical Controls

Dr Jim Barker - Complexity

Martha Acosta - The 4 Things Leaders Control

Dr. Eric Young – Patient Safety, Surgery

Blame. Language. Sharing.

Language

Why

How

What

Blame

Confirmation bias

Hindsight bias

Sharing

Management skills for new leaders

Reviews

1:1s

Mentoring

Role models

Promotion vs career change

Celebrating successes

Distributed teams, remote workers, co-located offices

When you’ve made the wrong move

Leadership vs Management

Understanding your team

The value you provide as a leader

Resources

Talk-To-Think, Think-To-Talk, and leadership

Talk-to-think

Think-to-talk

Your communication style

Leadership and the two styles

What style should I use?

How you are perceived

Be the Talk-To-Think umbrella

Be prepared

CD for infrastructure services

How to CD your infrastructure successfully

Definitions

Drive changes with `make` and environment variables