Major Incident Management 

What is a major incident?


In theory, a major incident is a highest-impact, highest-urgency incident. It affects a large number of users, depriving the business of one or more crucial services. Business and IT have to agree on what constitutes a major incident. It is one of the rare occasions where ITIL is strict in terms of definition: it MUST be agreed on. ISO 20000 requirements on major incident management are short, but demanding: agreement, separate procedure, responsibility and review.

In practice, you know a major incident when you see it: a flood of Service Desk calls, impatient customers, angry management, panic. All the more reason to get the definition straight before it happens. In most cases, it will simply be the highest-priority incident in the impact/urgency matrix. You might have a look at my Incident Classification article. In some cases, IT and the business can decide that only particular types of high-priority incidents will be marked as major incidents. This can be due to different SLA parameters agreed with different businesses. For example, when you support a chain of pharmacies or tobacco shops, they will want a cash register service malfunction to be marked as priority one, with strict resolution times defined. If you support another organization, say the finance or marketing department in the same corporation, its SLA will tend to address different issues, different response and resolution times, and probably a different amount of resources for the resolution.
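To make the idea concrete, here is a minimal sketch of how such an agreement could be written down. Everything in it is an assumption for illustration only: the three-level matrix, the priority numbers, the customer and service names, and the rule that a customer may restrict major incidents to specific services. Your own matrix and SLA terms will differ.

```python
from dataclasses import dataclass

# Hypothetical 3x3 impact/urgency matrix; 1 is the highest priority.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("medium", "high"): 2,
    ("high", "low"): 3,
    ("medium", "medium"): 3,
    ("low", "high"): 3,
    ("medium", "low"): 4,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}

# Hypothetical per-customer agreement: the services that may be declared
# a major incident. A customer without an entry accepts any priority-1
# incident as a major incident.
MAJOR_INCIDENT_SERVICES = {
    "pharmacy-chain": {"cash-register"},
}


@dataclass(frozen=True)
class Incident:
    customer: str
    service: str
    impact: str   # "high", "medium" or "low"
    urgency: str  # "high", "medium" or "low"


def priority(incident: Incident) -> int:
    """Look up the priority in the impact/urgency matrix."""
    return PRIORITY_MATRIX[(incident.impact, incident.urgency)]


def is_major(incident: Incident) -> bool:
    """A major incident is priority 1 and covered by the customer's agreement."""
    if priority(incident) != 1:
        return False
    agreed = MAJOR_INCIDENT_SERVICES.get(incident.customer)
    return agreed is None or incident.service in agreed


register_down = Incident("pharmacy-chain", "cash-register", "high", "high")
email_down = Incident("pharmacy-chain", "email", "high", "high")
print(is_major(register_down), is_major(email_down))
```

The point of keeping the agreement as data rather than logic is that the business and IT can review and sign off on the table itself.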

Who should be involved?
When a major incident occurs, roles and the process should be strictly defined. Mind you, we are talking about the roles here, not the actual day-to-day jobs. Roles will differ according to the size of the IT service management organization and the scope of its service management. Smaller organizations will tend to aggregate a few roles into one job definition, while larger organizations will elaborate sub-roles for each major incident type, customer or technical expertise field.



Major incident manager. Accountable for managing the overall procedure, making sure that the resources required for incident resolution are engaged and that the customer is appropriately informed about the progress. This person should also have a basic technical understanding of the outage. In smaller service management organizations with a lower frequency of major incidents, this role will be taken by the Service Desk manager, who also acts as the Incident Manager. In larger organizations, the appointment of the major incident manager will depend on the particular area of expertise. It could be the technical account manager best acquainted with the specifics of the affected business organization, or someone from the Technical management function or the Application management function.

Problem manager. This role will often have to be involved, since major incident resolution usually requires finding the underlying cause (root cause analysis) of the major incident. This role can’t be combined with the incident management role, due to the well-known conflict of interest between the incident management and problem management processes: the major incident team is pushing to restore the service as quickly as possible, while problem management tends to take its time finding the root cause.

Change manager. Involved in case some urgent changes have to be implemented to restore the service.

SLA manager. Must be informed in order to keep a record of the downtime and to inform the customer if the procedure requires this.

Service Desk. Responsible for keeping incident records up to date and for primary customer communication.

Communication

We have mentioned the main roles in the process. Guess whose role is the most important, and who is often left out of the loop? The customer! The most common mistake of growing service management organizations is to get so deeply involved in incident resolution that communication with the customer is neglected.

The moment the customer calls you to ask about the resolution progress, you should know that something is wrong. The frequency, form and scope of communication with the customer should be clearly stated in the SLA. The customer should always know what to expect. A vital business process is endangered, so the customer will be on edge. Short, concise updates every half an hour, or at least every hour, should contain information about:

Start of downtime
Short description of the known cause of the downtime
The impact of the downtime
Estimated time for restoration
Time of the next scheduled update

The major incident team should focus its resources on service restoration, so the Service Desk should regularly ask the team for a quick update on the progress and then forward it formally to the customer.
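The update itself can be as simple as a fixed template that the Service Desk fills in. The sketch below is one possible shape, not a prescribed format: the class name, field names, the 30-minute default and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TIME_FORMAT = "%Y-%m-%d %H:%M"


def next_update_time(sent_at: datetime, interval_minutes: int = 30) -> datetime:
    """Time of the next scheduled update (30 minutes is an assumed default)."""
    return sent_at + timedelta(minutes=interval_minutes)


@dataclass
class StatusUpdate:
    downtime_start: datetime
    known_cause: str
    impact: str
    estimated_restoration: Optional[datetime]  # None while no estimate exists
    next_update: datetime

    def render(self) -> str:
        """Render the five items from the list above as plain text."""
        if self.estimated_restoration is None:
            eta = "not yet known"
        else:
            eta = self.estimated_restoration.strftime(TIME_FORMAT)
        return "\n".join([
            f"Start of downtime: {self.downtime_start.strftime(TIME_FORMAT)}",
            f"Known cause: {self.known_cause}",
            f"Impact: {self.impact}",
            f"Estimated restoration: {eta}",
            f"Next update: {self.next_update.strftime(TIME_FORMAT)}",
        ])


# Sample values, invented for the example.
sent_at = datetime(2024, 1, 15, 9, 30)
update = StatusUpdate(
    downtime_start=datetime(2024, 1, 15, 9, 5),
    known_cause="Under investigation",
    impact="Cash registers offline in all shops",
    estimated_restoration=None,
    next_update=next_update_time(sent_at),
)
print(update.render())
```

Stating "not yet known" explicitly is deliberate: an honest update without an estimate is still better than silence.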

The after party

The incident is resolved, the service is restored, and the customer returns to his day-to-day business. The aftertaste remains. Why did it happen? What is to be expected going forward – have we done anything to prevent these downtimes in the future? How do we deal with these questions?

In short, the best practice is to resolve the incident and to continue working on a related problem ticket. The outcome is a so-called problem report, or at least a root cause analysis (RCA) report, delivered to the customer within a short, SLA-defined period of time. This report should contain at least the following:

Short description of the incident
Downtime duration
SLA impact
Short incident history
How we resolved the incident
What is the root cause
A set of activities scheduled in order to prevent this kind of downtime
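
If you keep these reports in a tool rather than in a document, a small structure with a completeness check helps make sure nothing from the list above is skipped before the report goes out. This is only a sketch under my own assumptions: the class, its field names and the sample data are hypothetical and not taken from any particular ITSM product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class RcaReport:
    description: str
    downtime_start: datetime
    downtime_end: datetime
    sla_impact: str
    history: List[str]
    resolution: str
    root_cause: str
    preventive_actions: List[str]

    @property
    def downtime(self) -> timedelta:
        """Downtime duration, derived from the start and end timestamps."""
        return self.downtime_end - self.downtime_start

    def missing_sections(self) -> List[str]:
        """Names of the sections that are still empty."""
        sections = {
            "description": self.description,
            "SLA impact": self.sla_impact,
            "incident history": self.history,
            "resolution": self.resolution,
            "root cause": self.root_cause,
            "preventive actions": self.preventive_actions,
        }
        return [name for name, value in sections.items() if not value]


# Sample draft, invented for the example: root cause not yet established.
draft = RcaReport(
    description="Cash register service unavailable",
    downtime_start=datetime(2024, 1, 15, 9, 5),
    downtime_end=datetime(2024, 1, 15, 11, 20),
    sla_impact="Resolution time target exceeded",
    history=["09:05 first call received", "11:20 service restored"],
    resolution="Failed over to the standby server",
    root_cause="",
    preventive_actions=[],
)
print(draft.downtime, draft.missing_sections())
```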