Incident Management
As many of you are already aware, the most recent AO3 deploy did not go as smoothly as we hoped, and we've sometimes had issues on previous major releases. The big items are all fixed now, but it reminded me that we deal with similar issues in a few other places I know (both work projects at my day job and Dreamwidth). Here are a few ideas I've been thinking about around the principles of managing incidents on an IT service.

Sometimes, when a technical group is trying to deal with a major problem or a code release that's gone wrong, management and task prioritisation become an issue. You have everyone putting out little fires with buckets, when actually it needs someone to go, "Wait, guys, this is a pretty big building and it's all on fire. I'm ringing the fire service - they have trucks with big hoses." But to do that, you have to have one person let go of a bucket in order to pick up the phone.

The general part

At my day job, to support a live IT service, there are usually several levels of support: 1st line support, 2nd line support, 3rd line support, and incident management. For a major code release, there is also the deployment manager and support team for that release, who may or may not be the same people as the 2nd or 3rd line team.

There are three main ways a problem can be discovered and handled in this model:

  1. Sometimes a user reports it to a helpline or support form, which goes to 1st line (i.e. the support committee). They give a friendly helpful response, and pass on details of the bug/issue to 2nd line if it's working hours, or incident management if it's out of hours.

  2. Sometimes an automated tracker spots a problem. Dreamwidth, like many places, has a server monitoring tool tied to their official chat. Theirs is called Nagios, and says, "HEY MARK, HEY MARK, THE SERVER IS DOWN!" like a toddler trying to get your attention. In other places, your friendly systems person's customised script goes, "Email alert: that key bit of the system is running out of memory". Either way, that goes to 2nd line, and if it's critical, copies incident management automatically.

  3. Sometimes 2nd line are browsing a server, doing their day job and checking stuff while they're at it, and spot an issue. If it’s likely to affect the live service, they tell IM.
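The routing rules in those three paths can be sketched as a tiny dispatcher. To be clear, this is just my illustration - the names (`FIRST_LINE`, `route`, the `critical`/`in_hours` flags) are all invented for the sketch, not anything AO3 or Dreamwidth actually runs:

```python
# Hypothetical sketch of the three incident-routing paths described above.
# All names, fields, and thresholds here are invented for illustration.
from dataclasses import dataclass

FIRST_LINE = "1st line"
SECOND_LINE = "2nd line"
INCIDENT_MGMT = "incident management"

@dataclass
class Report:
    source: str         # "user", "monitoring", or "2nd line"
    critical: bool = False
    in_hours: bool = True

def route(report: Report) -> list[str]:
    """Return who gets told about a problem, per the three paths above."""
    if report.source == "user":
        # Path 1: 1st line responds to the user, then passes details on -
        # to 2nd line in working hours, to incident management otherwise.
        return [FIRST_LINE, SECOND_LINE if report.in_hours else INCIDENT_MGMT]
    if report.source == "monitoring":
        # Path 2: automated alerts go to 2nd line; critical ones copy
        # incident management automatically.
        recipients = [SECOND_LINE]
        if report.critical:
            recipients.append(INCIDENT_MGMT)
        return recipients
    # Path 3: 2nd line spotted it themselves while doing their day job;
    # they tell IM only if it's likely to affect the live service.
    return [SECOND_LINE, INCIDENT_MGMT] if report.critical else [SECOND_LINE]

# Example: an out-of-hours user report reaches 1st line and IM.
print(route(Report(source="user", in_hours=False)))
```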

Whichever way the problem surfaces, you now have two teams talking to each other - IM and 2nd line. Incident management's job is to co-ordinate, prioritise and make the tricky decisions. 2nd line do the actual fixing.

IM are the people who sometimes go, "Actually it's not a big deal, 1st line, go tell the users to stop whining," and 1st line make it all tactful and then tell the users how to work around it. Sometimes IM go, "Hey, this is a big problem, it may be 3am but we need to ring 2nd line NOW."

If 2nd line spot the issue, IM are the people who go, "Hey, maybe we should warn 1st line about this, since they're about to get a ton of angry phone calls that the system is down." And IM do the phoning, leaving 2nd line alone to get on with fixing it.

Sometimes 2nd line look at it and go, "The server is up and running, it's all telling me it's okay, but it's not working - there must be a bug in the code." And they send it to 3rd line. 3rd line are the coders - 2nd line are the sysadmins. In smaller organisations, 2nd line and 3rd line may be a combined team with a mixture of skills.

After a big release, the deployment manager will also be there straight afterwards, talking to 1st line via Incident Management and talking direct to 2nd line, monitoring things proactively. If a problem is found, they'll throw it straight to the release team to investigate the bug, but at the same time, the deployment manager will work with IM to decide on a plan of action. The coders can carry on coding, because the bug needs to be fixed anyway, while the deployment manager and IM make the big decision of whether to roll back the release now or live with it until the fix is ready. In a company, that involves talking to the business owner who says how much of a commercial impact the bug is having, as well as assessing the impact on users.

The AO3 part

In the OTW, we don't really have any formal Incident Management. In theory, AD&T chairs do some of it for the AO3, but at the moment, they're too busy doing all the 3rd line stuff as well. And part of the point is that one person can't do both at the same time.

In a crisis, 3rd line have to concentrate on figuring out where the bug is and why and how to fix it. IM concentrate on telling 1st line to do admin posts, buying cake for 2nd line and coffee for 3rd line, and insulating them from each other so the crucial information gets through but everyone has somewhere to rant that's free from people yelling at them.

IM (or the deployment manager for a big deployment, where that manager works closely with IM and does some of this) know everyone well enough that they can go, "Jane Bloggs from 3rd line has just added Pro Plus to her Red Bull, I'd better tell 2nd line that it's going to be at least 6 hours so they can go and get some rest." They're the ones who can say to themselves, "Sue Jones from 1st line has now eaten at least 5 chocolate bars and is gazing longingly at the vodka bottle, I'd better warn 3rd line that the users are now 'really upset' not just 'a little bit upset', and see if there's anything we can do to get them another helper."

I would love to get someone from AD&T officially as the IM-type person, or ideally a couple of people in different timezones, but it has to be someone who's unlikely to be coding in a crisis, and at the moment, that's not true of the chair. It also doesn't need to be the chair - anyone can do it, so long as they know what type of decisions need to be approved by the chair. Sometimes rolling back a release is so obvious that IM can take that decision themselves, and sometimes the cost vs. benefit is not so clear, and the AD&T chair needs to make the decision with input from Support or Testers or Coders.

We could also have a discussion at some point about how 2nd line and 3rd line work is split between AD&T and Systems. We have an advantage where several of our senior people are familiar with both types of work, allowing them to analyse the root cause of a problem more effectively - e.g. Sidra is both Systems co-chair and a senior coder - but that also means that people can end up trying to do two jobs at once in a crisis.

Next AD&T meeting, we'll be discussing the last deploy and what we can learn for the future. Having seen some of the discussions, both internally and externally, I’m hopeful. We’ve got a lot of good processes in place already, so long as we continue to follow them, and we have people around with the expertise to advise us where we can improve things further.
