This guide explains what major incidents are and helps you prepare your organization to face them with a well-defined, planned major incident management process.
- Major incident management: An overview
- What is a major incident?
- The four stages of a major incident
- The major incident management process
- ITIL® major incident management process flow chart
- Major incident management roles and responsibilities
- Common mistakes in major incident management
- Major incident management best practices
- Major incident management metrics and KPIs
- Major incident scenario
Major incident management: An overview
It's Monday morning and things are pretty normal at your service desk. Suddenly, you get an alert ticket that a critical service is down, and within the next 15 minutes you start getting an influx of tickets reporting the same issue. It could be that your website is down, your point of sale software has stopped working, or something even more far-reaching, like the stock exchange going down or planes being grounded. When your business is severely impacted by an IT issue causing loss of revenue and/or reputation, you have a major incident on your hands.
How you react to a major incident makes all the difference in minimizing the impact of the incident and bringing services back up. As they say, time is money, and in this case, that couldn't be more true. If your organization has a major incident management (MIM) process in place, you can swiftly respond to and resolve major incidents. If you don't have such a process in place, it's time to draw up an emergency response plan, also known as a major incident response process.
The stakes of a major incident are higher than ever. According to a study by Information Technology Intelligence Consulting, 98 percent of organizations lose at least $100,000 from an hour of downtime, which reinforces the importance of setting up a MIM process that can tackle major incidents effectively and efficiently.
Every organization aims to eliminate major incidents, but they are impossible to prevent completely; the best you can do is be prepared for them.
In this guide, we'll look at how to set up an effective MIM process, common mistakes that can affect your organization's MIM, and best practices for improving your MIM process.
But first, what makes an incident a major incident?
What is a major incident?
A major incident is a high-impact, urgent issue that usually affects the whole organization or a major part of it. A major incident almost always results in an organization's services becoming unavailable, which causes the organization's business to take a hit and ultimately affects its financial standing. There are two ways a major incident can affect an organization's services:
- By preventing customers from accessing the organization's services. The Cloudflare outage in July 2019 is an example of customers being affected by a major incident. This major outage affected almost half the internet and left millions of internet users unable to access various services.
- By disrupting employees' ability to complete their work on time, leading to a business disruption. IndiGo's outage in November 2019 affected the airline's check-in process, which led to long delays and affected thousands of passengers.
A well-prepared service desk is equipped to assess major incidents and come up with solutions or workarounds to reduce and control the impact of a major incident.
The 4 stages of a major incident
Major incidents are considered to have four main stages: identification, containment, resolution, and maintenance.
The major incident management process
A major incident management process is a must-have for organizations, as it helps them minimize the business impact of a major incident. The major incident management process primarily consists of the following steps:
Stage 1: Identification
Declaring the major incident
The first step is to identify possible major incidents. It is important for organizations to set up multiple methods of identifying threats. Major incidents can be flagged by technicians when they come across unusual tickets, or they can be detected by solutions like network monitoring tools that can automatically flag a network issue and create a ticket to alert the service desk. Organizations can also set up a dedicated hotline for service desk personnel to flag suspected major incidents.
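As a rough illustration of automated flagging, the sketch below counts tickets that report the same service within a sliding time window and flags the service once the count crosses a threshold. The `Ticket` fields, the 15-minute window, and the threshold of five tickets are assumptions chosen for illustration; they are not values prescribed by ITIL or by any particular service desk tool.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    # Hypothetical ticket shape; real service desk tools expose richer records.
    service: str
    created_at: datetime


def find_suspected_major_incidents(tickets, window=timedelta(minutes=15), threshold=5):
    """Return the services that received `threshold` or more tickets within `window`."""
    recent = defaultdict(deque)  # service -> creation times inside the current window
    flagged = set()
    for ticket in sorted(tickets, key=lambda t: t.created_at):
        times = recent[ticket.service]
        times.append(ticket.created_at)
        # Drop tickets that have fallen out of the sliding window.
        while ticket.created_at - times[0] > window:
            times.popleft()
        if len(times) >= threshold:
            flagged.add(ticket.service)
    return flagged
```

A flagged service is only a suspected major incident; a technician or the major incident manager would still need to confirm it before the incident is declared.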
Once a major incident has been identified, it needs to be communicated to all key stakeholders. There are four main groups that need to be informed of major incidents:
- Technical team: It is important to inform the technical team immediately so they can start deciding on a course of action to fix the issue.
- Management: Keeping upper management, like the CIO, informed about major incidents helps with accountability. Organizations should also keep management informed of all the steps taken to fix major incidents.
- Key stakeholders: The department heads and service-level business management staff also need to be informed of major incidents and receive regular status updates.
- Users: Users need to know which services may be unavailable due to a major incident.
Stage 2: Containment
Assembling the major incident team
A major incident team, or MIT for short, consists of technicians, service-level management heads, and other key stakeholders; sometimes highly skilled external personnel are brought in to tackle a major incident. The MIT works together to find a fix for the major incident and bring operations back to normal.
Setting up a conference bridge
A conference bridge, more commonly known as a conference call, helps with effective troubleshooting and centralized communication. It acts as a clear, fast channel of communication between members of the MIT.
Preparing a designated war room
Having a designated war room allows all members of the MIT to gather and troubleshoot the incident. This increases collaboration efforts, helping the MIT come up with a solution faster.
Creating a problem ticket to identify underlying issues
A problem ticket can be created to discover and understand the root cause of the major incident. This can help prevent similar major incidents in the future by addressing the causes of the major incident.
Stage 3: Resolution
Implementing the resolution plan as a change
It is good practice to implement the fix for the major incident as a change to ensure that the resolution is properly documented and implemented. Implementing the resolution as a change minimizes the risk of a botched resolution disrupting other services.
Stage 4: Maintenance
Performing a post-implementation review
It is important to take stock of the incident over a period of time to make sure it's truly resolved. If underlying issues are left unresolved, they could lead to another major incident.
Producing clear documentation
Documenting the entire process of resolving the major incident helps the organization prepare for similar incidents in the future. With proper documentation of past incidents, the organization can implement the tried and tested solution immediately when faced with another similar major incident, reducing its impact.
Measuring the performance of the service desk helps gauge the effectiveness of the service desk and the MIM process. Some important metrics to measure are mean time to acknowledge (MTTA), mean time to resolve (MTTR), total number of major incidents, and average downtime for major incidents.
ITIL® major incident management process flow chart
Major incident management roles and responsibilities
A major incident calls for a special group of personnel to tackle the incident and resolve it. MIM roles include:
Service desk technicians
Service desk technicians are the first line of defense against major incidents. They analyze incident tickets and escalate them to the incident manager. Service desk technicians are also involved in the implementation of resolutions.
Major incident manager
The major incident manager is the owner of the major incident. Their role includes declaring the incident as a major incident, ensuring that the MIM process is followed, and making sure the incident is resolved as early as possible. They act as the main point of contact for any information about the major incident, and manage the MIT.
Major incident team (MIT)
An MIT is a specialized team that is responsible for analyzing the major incident and formulating an action plan to handle the threat. The MIT ideally consists of service desk technicians, service-level management personnel, technical staff, other relevant stakeholders, and external consultants if the situation requires it.
Technical staff
The technical staff are the specialized personnel responsible for the upkeep of infrastructure and operations, including sysadmins, network administrators, and information security staff. The technical staff help troubleshoot the major incident and are primarily responsible for implementing the major incident resolution.
Change manager
The change manager is the owner of the change that is created to implement the fix for the major incident. The change manager takes full ownership of the change ticket and is accountable for it.
Problem manager
If a problem is created in response to the major incident, the problem manager owns the problem ticket. The problem manager tries to ascertain the root causes of the incident and ensure it doesn't occur again, or that the organization is at least prepared for the next time the incident occurs.
External consultants or third-party vendors
In some cases, the major incident may require highly specialized personnel to help understand and troubleshoot the incident. The major incident manager identifies the required personnel and adds them to the MIT to help reduce the impact of the major incident.
A RACI matrix defines the responsibilities of various stakeholders in a process. The table below defines the roles and responsibilities of the major incident stakeholders throughout the MIM process.
|Process/roles||Service desk technicians||Major incident manager||MIT||Technical staff||Change manager||Problem manager||External consultants|
|Declaring the major incident||C||A||R||C||I||I||I|
|Assembling the MIT||I||R/A||C||C||I||C||I|
|Setting up a conference bridge||I||A||R||C||I||C||I|
|Preparing a designated war room||I||A||R||I||I||C||I|
|Creating a problem ticket to identify underlying issues||I||A||R||C||I||I||I|
|Implementing the resolution plan as a change||I||I||I||R||A||C||C|
|Performing post-implementation review||I||C||I||R||A||C||I|
|Producing clear documentation||C||A||R||C||C||C||C|
* R - Responsible, A - Accountable, C - Consulted, I - Informed
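If you keep a matrix like this in a tool or script, it can be encoded as plain data and queried. The sketch below transcribes the rows of the table above and adds two helper functions. The function names are hypothetical, and the check for exactly one Accountable role per activity reflects a common RACI convention rather than a rule stated in this guide.

```python
ROLES = [
    "Service desk technicians", "Major incident manager", "MIT",
    "Technical staff", "Change manager", "Problem manager", "External consultants",
]

# Rows transcribed from the table above, one code per role in ROLES order.
RACI = {
    "Declaring the major incident": ["C", "A", "R", "C", "I", "I", "I"],
    "Assembling the MIT": ["I", "R/A", "C", "C", "I", "C", "I"],
    "Setting up a conference bridge": ["I", "A", "R", "C", "I", "C", "I"],
    "Preparing a designated war room": ["I", "A", "R", "I", "I", "C", "I"],
    "Creating a problem ticket to identify underlying issues": ["I", "A", "R", "C", "I", "I", "I"],
    "Implementing the resolution plan as a change": ["I", "I", "I", "R", "A", "C", "C"],
    "Performing post-implementation review": ["I", "C", "I", "R", "A", "C", "I"],
    "Producing clear documentation": ["C", "A", "R", "C", "C", "C", "C"],
}


def roles_with(activity, letter):
    """Return the roles whose code for `activity` includes `letter` (R, A, C, or I)."""
    return [role for role, code in zip(ROLES, RACI[activity]) if letter in code.split("/")]


def activities_without_single_owner():
    """List activities that do not have exactly one Accountable role."""
    return [activity for activity in RACI if len(roles_with(activity, "A")) != 1]
```

For example, `roles_with("Implementing the resolution plan as a change", "A")` returns `["Change manager"]`, which matches the change manager's ownership of the change ticket described earlier.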
5 Common mistakes in major incident management
Here are 5 common mistakes that can hinder your MIM process:
Manual communication and escalation
By far the biggest challenge to MIM is communication. In the event of a major incident, various stakeholders need to be informed of the status of the incident, its severity, and what troubleshooting has been done to fix it. Communicating all this manually is an arduous task, and can lead to inconsistent communication, which only makes matters worse. Automating the process ensures that key stakeholders are notified throughout the entire ticket life cycle, and lets the major incident manager focus their entire attention on fixing the issue.
Ineffective channels for reporting major incidents
Every service desk receives tens or even hundreds of tickets a day, ranging from laptop issues to service requests; among this mountain of tickets, there could be a few potential major incidents. Not setting up a separate channel to report major incidents delays the identification of major incidents.
Duplication of efforts
Failure to delegate tasks in an organized manner can cause duplication of efforts within the MIT. It is important to assign tasks and keep the MIT informed of what each member is tasked with.
Lack of proper documentation
Without proper documentation, the MIT is forced to reinvent the wheel every time a similar major incident occurs, leading to delays in resolving major incidents and causing unnecessary downtime.
Failure to analyze the root cause
Similar to incident management, MIM can be myopic in scope, as its primary focus is to fix the issue and get services up and running within the shortest possible time. If MIM is not combined with problem management to identify underlying issues, the root cause of a major incident will continue to leave the organization vulnerable to repeat incidents.
5 Major incident management best practices
Here are the best ways to approach the MIM process:
Enable multiple channels for reporting major incidents
When it comes to handling major incidents, time is of the essence. It is vital for organizations to identify and classify major incidents as soon as they are detected. Offering users multiple ways to report incidents will make the entire process faster and more accessible. You can enable ticket creation through email or a web portal, or even set up a dedicated hotline to report suspected major incidents. Setting up network monitoring software to detect anomalies can help you proactively deal with major incidents.
Automate service desk processes
Speed and efficiency play a vital role in controlling the impact of a major incident, and automating various service desk processes helps achieve this by freeing up your technicians from repetitive tasks such as notifying stakeholders. Automating the notification system and setting up major incident workflows are good ways of automating service desk processes to improve resolution time and bring structure to your MIM process.
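To make the idea of automated notifications concrete, here is a minimal sketch that builds one status message for each of the four audiences named in the identification stage. The `MajorIncident` fields, the detail level assigned to each audience, and the `send` callback are all hypothetical; a real service desk tool would provide its own notification rules and delivery channels.

```python
from dataclasses import dataclass

# The four audiences named in the identification stage; the detail level
# assigned to each one is an illustrative assumption.
AUDIENCES = {
    "technical team": "full",
    "management": "summary",
    "key stakeholders": "summary",
    "users": "impact",
}


@dataclass
class MajorIncident:
    # Hypothetical record; field names are not taken from any specific tool.
    ticket_id: str
    service: str
    status: str
    details: str


def build_notifications(incident):
    """Return one message per audience, tailored to the detail level it needs."""
    messages = {}
    for audience, level in AUDIENCES.items():
        text = f"[{incident.ticket_id}] {incident.service}: {incident.status}"
        if level == "full":
            text += f" | {incident.details}"
        elif level == "impact":
            text = f"{incident.service} may be unavailable. Current status: {incident.status}"
        messages[audience] = text
    return messages


def notify_all(incident, send):
    """Call `send(audience, message)` for every audience; `send` is supplied by the caller."""
    for audience, message in build_notifications(incident).items():
        send(audience, message)
```

Passing the delivery function in as `send` keeps the sketch independent of any particular channel, so the same messages could go out by email, chat, or a status page.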
Strive for prompt, relevant communication
It is important to keep your organization's management and important stakeholders informed of every major incident. Keeping management in the loop will help with getting necessary approvals and permissions required to fix the major incident. Prompt communication ensures that all the major incident personnel are on the same page and allows for smooth, effective collaboration; it also keeps end users informed of any possible downtime so they can prepare for it.
Create clear documentation
Clear documentation helps the major incident manager record all the work done to fix the major incident, its impact, the affected services, and other key information about the major incident. This documentation is important to show management the benefit of having a MIM process, including its ROI. Clear documentation will also help with any similar major incident in the future.
Utilize deep integrations with ITOM software
Strong integrations with ITOM software enable the IT department to proactively handle major incidents. Reactive major incident identification relies on an influx of tickets to raise a red flag that a major incident is in progress. On the other hand, a proactive MIM process that utilizes ITOM integrations has systems in place to monitor networks and services, and can automatically flag anomalies that could be potential major incidents.
Major incident management metrics and KPIs
When it comes to MIM, below are some important metrics and KPIs to track.
|Mean time to resolution (MTTR)||The average time from when a major incident is reported to when it is resolved.||This indicates how quickly your service desk can resolve major incidents. A shorter MTTR is a sign that your MIT is effective and efficient.|
|Mean time to acknowledge (MTTA)||The average time to respond to a major incident.||A shorter MTTA is a sign that your service desk is quick to respond to major incidents.|
|Mean time between failures (MTBF)||The average time between failures. It is calculated by dividing the total uptime by the total number of failures.||This indicates your IT infrastructure's performance. A higher MTBF is a sign that your IT infrastructure is performing well.|
|Mean time to detect (MTTD)||The average time taken to detect major incidents or anomalies.||This measures how quickly a major incident is identified. A smaller MTTD is a sign that the service desk is quick to detect major incidents.|
|Percent increase or decrease of major incidents||The percent increase or decrease of major incidents in subsequent months relative to the first month.||This helps you identify trends in the occurrence of major incidents.|
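The definitions in the table translate into simple arithmetic. The sketch below is one way to compute them, assuming each incident record carries three timestamps. The `IncidentRecord` fields are hypothetical, and measuring MTTA from the time of report to the time of acknowledgment is an assumption about what "respond" means here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    # Hypothetical timestamps; real tools may name or derive these differently.
    reported_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_duration(durations):
    """Average a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def mtta(incidents):
    """Mean time from report to acknowledgment."""
    return mean_duration([i.acknowledged_at - i.reported_at for i in incidents])


def mttr(incidents):
    """Mean time from report to resolution."""
    return mean_duration([i.resolved_at - i.reported_at for i in incidents])


def mtbf(total_uptime, failure_count):
    """Total uptime divided by the number of failures, as defined in the table."""
    return total_uptime / failure_count


def percent_change(baseline_count, current_count):
    """Percent increase (positive) or decrease (negative) relative to the baseline month."""
    return (current_count - baseline_count) / baseline_count * 100
```

Note that `percent_change` is undefined when the baseline month has zero major incidents, and `mean_duration` needs at least one record, so a production version would have to handle those cases explicitly.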
Major incident scenario
It is important to remember that not all high-priority incidents are major incidents. Since the MIM process involves a sizable commitment of resources like implementing a separate MIT, it is important to carefully classify major incidents.
The 2019 Cloudflare outage is a very good example of what defines a major incident. In this case, a routine update to a managed rule for the web application firewall (WAF) spiked the usage of the CPUs dedicated to serving HTTP/HTTPS traffic to nearly 100 percent across the servers in Cloudflare's network. The outage that followed cut Cloudflare's traffic by 80 percent and affected millions of internet users around the world.
The outage resulted in Cloudflare customers (and their customers) seeing a 502 error page when visiting any Cloudflare domain. The 502 errors were generated by the front-end Cloudflare web servers that still had CPU cores available but were unable to reach the processes that serve HTTP/HTTPS traffic. It's estimated that at least half of the entire internet was inaccessible for the twenty-seven minutes of downtime.
All Cloudflare websites were inaccessible, causing service disruptions for thousands of organizations and millions of users. The outage affected the internal operations of Cloudflare, too, preventing the Cloudflare employees from accessing various services like the company's change management tool and internal control panel. The outage had to be dealt with to resume normal service operations.
Timeline of events from detection to resolution:
The WAF managed rule was implemented at 13:42; three minutes later, Cloudflare's network operation tools started flagging the drop in traffic, many other end-to-end tests of Cloudflare services began failing, end users noticed various 502 errors, and Cloudflare received many reports of CPU exhaustion from its points of presence in cities worldwide.
The site reliability engineering team, London engineering team, and other relevant teams were brought together to troubleshoot and come up with a fix. At 14:00, the WAF was identified as the cause of the incident. And at 14:07, a global WAF kill was implemented to bring traffic levels back to normal.
By 14:52, Cloudflare was 100-percent satisfied that it understood the cause of the outage and had a fix in place, so the WAF was re-enabled globally.
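To relate this timeline to response metrics such as time to detect, the short sketch below computes the minutes elapsed between the rule deployment and each later milestone. It uses only the times quoted above; the 13:45 alert time is derived from the "three minutes later" statement, and the variable names are illustrative.

```python
from datetime import datetime


def at(hhmm):
    """Parse an HH:MM time; the date part is a placeholder and is not used."""
    return datetime.strptime(hhmm, "%H:%M")


def minutes_between(start, end):
    """Whole minutes elapsed from `start` to `end`."""
    return int((end - start).total_seconds() // 60)


rule_deployed = at("13:42")
alerts_started = at("13:45")  # 13:42 plus the three minutes stated in the timeline
cause_identified = at("14:00")
waf_disabled = at("14:07")
waf_reenabled = at("14:52")

elapsed = {
    "deployment to first alerts": minutes_between(rule_deployed, alerts_started),
    "deployment to cause identified": minutes_between(rule_deployed, cause_identified),
    "deployment to global WAF kill": minutes_between(rule_deployed, waf_disabled),
    "deployment to WAF re-enabled": minutes_between(rule_deployed, waf_reenabled),
}

for milestone, minutes in elapsed.items():
    print(f"{milestone}: {minutes} min")
```

By these figures, detection took 3 minutes, identifying the cause took 18 minutes, and the global WAF kill came 25 minutes after the rule was deployed.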
Glossary
Change
The addition, modification, or removal of anything that can have a direct or indirect effect on services.
Change management
The process of taking changes to completion with minimum disruptions and collisions.
Escalation
The act of transferring ownership of a ticket based on a functional or hierarchical need.
Event
An occurrence that has significance for the management of a service or asset.
Failure
An occurrence where a service or asset does not function according to the agreed SLA.
Hierarchical escalation
The act of transferring ownership vertically to a higher tier service desk technician or relevant authority.
Impact
A measure of the severity of an incident.
Incident
An unplanned interruption to an IT service, or a reduction in the quality of an IT service. Failure of a configuration item, even if it has not yet affected a service, is also an incident (e.g. failure of one disk from a mirror set).
Incident management
The process of managing the life cycle of all incidents to restore normal service operations as quickly as possible and minimize business impact.
Assigning priorities to incidents and defining what constitutes a major incident.
Major incident
An incident that has a high impact and high urgency, requiring a separate process from incident management.
Major incident manager
The person who is responsible for the MIT and the implementation of the MIM process.
Mean time to acknowledge (MTTA)
A measurement of how quickly an incident is acknowledged by the service desk.
Mean time to detect (MTTD)
A measurement of how quickly a potential threat to a service or configuration item is detected.
Mean time between failures (MTBF)
A measurement of how frequently a service or asset fails.
Mean time to repair/resolve/respond/recover (MTTR)
A measurement of how quickly a service is restored after failure.
Normal service operation
A service operation that adheres to the service level agreement (SLA).
Problem
A cause or potential cause of one or more incidents.
RACI matrix
It defines the roles and responsibilities in cross-functional or departmental projects and processes.
Service desk
The point of communication between service providers and the organization's users.
Service desk manager
The one who oversees day-to-day activities of the service desk and is responsible for its performance.
Service-level objective (SLO)
It defines the objective of the service providers, and is a means of measuring their performance.
Service-level agreement (SLA)
An agreement between the service provider and the customer about the expected level of service and the expected time in which it is delivered.
Urgency
A measure of how quickly an incident needs to be resolved.