
Event, Incident, and Problem Management
This is a guide on event, incident, and problem management.
Monitoring and Event Management
The first of our service management practices is monitoring and event management.
Its goal is to systematically observe services and service components, and to record and report changes of state identified as events. So it would be monitoring infrastructure, services, and so on, looking for events. And the first thing we need to do then is define what an event actually is.
And again, as an exam tip, when the examination asks about definitions, always try to look for a particular word in the definition that jumps out at you. For me, significance is the significant word in this particular definition. So an event can be defined as any change of state that has significance for the management of a configuration item or an IT service. So let's say the change of state in an IT service is going from down to up. The significance is that it can now be used.
So when there's a change of state, there's something significant about that service. So if an event says the number of log-ons has reached its maximum, then nobody else can log on. There is a significance to that event.
Event Classifications
So what best practice does with events is classify them into three different classifications.
The first ones are informational events, and they don't require action at the time they are identified. But they can be quite useful. So let's say Barry logged on. Barry logged off. Fred logged on. Fred logged off. Abdul logged on. Abdul logged off, and so on. Then, looking at that, if you're having a particular issue, say with a network slowing down at a particular time, the informational events can help you to see perhaps how many people are logging on at a particular time and whether they're logging on in a pattern together. So the informational events, even though they don't require any action at the time they're identified, can certainly be used afterwards by other practices within service management to enable them to make decisions. And the events are also good, particularly if you're running through a sequence of events in a value chain, for seeing what good looks like. If you've got a failure, you can match your particular chain of events against how it should work. So even though informational events don't actually require action at the time they're recorded, they're certainly of use, and we should still be recording them. Warning events allow action to be taken before any negative impact is actually experienced by the business.
Now, to give you a non-IT example of that: in a lot of European cars, when they're starting to run out of petrol or gas, and there's maybe 50 miles or 75 miles left in the tank, an orange light, an amber light, will come on to tell us, look, guys, you've got to fill up. You've only got 50 miles or so. So it's giving us a warning. We don't actually have to heed that warning, but it allows us to take action before something more serious happens. And that more serious thing would obviously be running out of gas or petrol, and your car actually stalling and stopping. To give you an IT example, it could be to do with a database filling up or some disk space filling up. So obviously, the manufacturers of disks give us an event that tells us there's an issue. It's called full.
But what we don't necessarily want to do is get to that point, because that tends to be terminal. So quite often, what we will do is put our own events in there that say this is 90% full, to give us enough time to do something about it before the failure actually occurs. So we've got warning events. And then the final ones are the exception events. They indicate that a breach of an established norm has happened. It's been identified and requires action. So in my car, the red petrol light comes on, and I've got around about 10 miles before I'm pushing. Or it could well be that the disk is full, and then the service falls over and won't let other people onto it.
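To make the three classifications concrete, here is a minimal sketch using the disk example. The 90% figure comes from the example above; treating "full" as 100%, and the names `EventType` and `classify_disk_event`, are assumptions made purely for illustration and are not ITIL terminology.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"  # no action needed when identified
    WARNING = "warning"              # act before the business is impacted
    EXCEPTION = "exception"          # an established norm has been breached


# Assumed thresholds: 90% is the example from the talk, 100% stands in for "full".
WARNING_PERCENT = 90.0
FULL_PERCENT = 100.0


def classify_disk_event(percent_used: float) -> EventType:
    """Map a disk-usage reading to one of the three event classifications."""
    if percent_used >= FULL_PERCENT:
        return EventType.EXCEPTION
    if percent_used >= WARNING_PERCENT:
        return EventType.WARNING
    return EventType.INFORMATIONAL
```

A reading of 42% would be recorded as informational, 93% would raise a warning while there is still time to act, and 100% is the exception.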
So the exception event has occurred. And what ITIL suggests is that you have these three categories, and event management manages those events. Very often, it manages those events through what's called a correlation engine, so we can see different events coming in. And it will say, ah, we've got a warning event there, a warning event there, a warning event there. If we don't do something about that here, then all three of those are going to fail. So it builds an intelligence into event management.
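A real correlation engine is far richer than this, but the basic idea of spotting several warnings that point at the same place can be sketched in a few lines. The event shape (a type plus the shared component it relates to), the threshold of three, and the function name are all assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical event shape: (event type, shared component the event relates to)
Event = Tuple[str, str]


def correlate_warnings(events: Iterable[Event], threshold: int = 3) -> List[str]:
    """Return the components that have attracted at least `threshold` warnings."""
    warning_counts = Counter(
        component for kind, component in events if kind == "warning"
    )
    return sorted(name for name, count in warning_counts.items() if count >= threshold)
```

Three warnings against the same storage array would be flagged together, while a single warning on a network switch would not.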
How Event Management Interacts with Other Practices
So event management is highly interactive with other practices. For example, an exception event might give rise directly to an incident. An exception event might give rise directly to a change going through the system. So event management, and monitoring, does get used by a lot of other practices, and from an input perspective as well. So mostly, event management benefits from being automated. OK. You can have somebody sat there looking at a screen, but that's not so cool. What we'd really want to do is make sure that we've got it as automated as possible.
But there still needs to be a level of intervention. For example, it's very, very important, when you're putting in early warnings for performance issues, that the human intervention is there, making sure we know what those early warnings are, and that you put the warning in early enough that it gives you time to do something about it, but not so early that the warning light is always on. To go back to my car analogy, if my car warning light came on 200 miles or 300 kilometers before my car was actually going to run out of fuel, then it would be on most of the time. And it would be no use.
If it only came on around about five miles or seven kilometers, eight kilometers before I run out of petrol or gas or fuel, then I might not have enough time to actually get filled up. So it wouldn't be much use there. So the human intervention is very, very important to make sure that you put those events in at the right point. And finally, on event management, it can be active or it can be passive.
So active event management will go out and it will poll, and it will say, are you OK, are you OK, are you OK, are you OK, to the service or the CI. Whereas passive event management is just going to sit back and relax and wait for stuff to happen. So that's going to wait for the actual CI or the service to raise an event or an alert.
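The difference between the two styles can be sketched as polling versus subscribing. The `ConfigurationItem` class and its methods are hypothetical stand-ins, not any real monitoring tool's API.

```python
from typing import Callable, List


class ConfigurationItem:
    """Hypothetical CI that can be polled and can also raise its own events."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self._listeners: List[Callable[[str, str], None]] = []

    def is_ok(self) -> bool:
        # Active monitoring: the monitor asks "are you OK?"
        return self.healthy

    def subscribe(self, listener: Callable[[str, str], None]) -> None:
        # Passive monitoring: the monitor registers and then waits.
        self._listeners.append(listener)

    def fail(self) -> None:
        # The CI itself raises the event to everyone who is listening.
        self.healthy = False
        for listener in self._listeners:
            listener(self.name, "down")


def poll(items: List[ConfigurationItem]) -> List[str]:
    """Active monitoring: go out and ask every CI, return the ones not OK."""
    return [item.name for item in items if not item.is_ok()]
```

With polling, a failure is only noticed the next time `poll` runs; with a subscription, the listener is called at the moment the CI fails.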
Incident Management
Incident management is perhaps the seminal IT service management practice. Put quite simply, if the service isn't there, then users and customers tend not to be satisfied, and incident management's role is to minimize the negative impact of incidents by restoring normal service operation as quickly as possible. Now, that normal service operation is normal service operation as defined by the service level agreements, or experience level agreements, that your organization uses. So it's all about getting the service back as quickly as possible and hopefully, in doing that, promoting user and customer satisfaction. The first thing we're going to do is look at the definition of an incident: an unplanned interruption to a service or reduction in the quality of that service. So if the service fails, then, obviously, it's not there. There's an unplanned interruption. But it might also be a reduction in the quality: if the service is running very, very slowly, or you've got intermittent issues with connection and so on and so forth, then it's not the quality that you're expecting. And so that can also be classified as an incident. There's a whole host of activities that incident management gets involved in. The incident management process gets very, very heavily used by the service desk, and they're the people that tend to run and manage incident management for the organization. So the first thing that they're likely to do is a kind of triage.
What you don't want is things that aren't really incidents entering into the incident management process. So for example, somebody rings up your organization and says, I can't print. Now, there could be a whole host of reasons that they ring your service desk, but they can't all necessarily be classified as incidents. And this is just an example. Somebody rings up and says, I can't print. They've had access to a printer, and they've lost permissions to it. That's most definitely an unplanned interruption to the service, and a reduction in the quality of that service. I can't print: it's an incident. Somebody else rings up, though, and they've never had permissions to a particular printer. Well, actually, that's not an incident, because it's not an interruption to a service that they've had or should have.
So actually, that might be a request for service, a service request, something that we deal with slightly differently. And if somebody rings up who's never even had a printer and hasn't got a printer, then that might be a change request that they put in for that to happen.
So one of the first things it's important to do, right at the very top of an incident management process, should be this triage, to make sure anything that actually enters the incident process is definitively an incident, and the service desk is absolutely the right place to be able to do that. Some of that is now being replaced by automation, and you get chatbots, for example, being able to make some of those decisions. The next thing is to log and manage the incidents, making sure that you've got the right information about the incident. And typically, your organization will have a minimum data set that they collect about the incident. Now, logging the incident might well be the traditional user ringing a service desk. It could be somebody using a portal through a laptop. It could be somebody using a portal through a cell phone, a mobile phone, a handheld device. It could be incidents automatically raised and logged through event management. So there are a whole host of ways. It could even be that you've got incidents being raised by the support community, by the people who are actually supporting service management.
So what will happen there is that these incidents need to be logged and managed, but we need to be consistent in the way that we log incidents, so that we get the same information. So part of the early management will be making sure that if we get an incident logged automatically, and we get one logged by the service desk, we can actually spot, hey, guys, these are the same incident. Let's make sure that we join them together and that they're dealt with together.
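The printer example above can be written down as a small decision function. This is only a sketch of that one example: the two yes/no questions and the function name are assumptions, and a real service desk would ask far more than this.

```python
def triage_print_call(has_printer: bool, previously_had_access: bool) -> str:
    """Classify an "I can't print" call, following the printer example."""
    if not has_printer:
        # Never had a printer at all: something new has to be provided.
        return "change request"
    if not previously_had_access:
        # Never had permissions: not an interruption to a service they had.
        return "service request"
    # Had access and lost it: an unplanned interruption to the service.
    return "incident"
```

Only the last case should enter the incident management process; the other two are routed elsewhere.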
Prioritization of incidents is important, and that should always be done based on business impact. I hold my hands up: in my early career in IT, I was probably guilty of the wrong type of prioritization of incidents. I'd have my pile of incidents to look through. That one's boring. That one's boring. That one's boring. That one's boring. I'll do this one.
It's not the most business-focused approach to prioritization of incidents. So please, I beg you, don't use that. What else will incident management get involved in? Escalation: making sure who resolves the incidents. Sometimes we need to resolve the incident outside of, maybe, a service desk environment, and it needs to be escalated to other teams. Sometimes there needs to be escalation because the management profile of an incident needs to be raised. So it could well be that an escalation happens because the service desk hasn't got the admin access to a particular system to resolve it. So that's one type of escalation that will occur. The other type of escalation might be because we have to get a third party involved, and we need a higher level of management to sign off getting a third party involved. So there could be a kind of hierarchical escalation, or an escalation between different functions. Hopefully, then, we'll go on and resolve and close out the incident. And the incident always needs to be closed out by the customer, the user that actually raised it.
There are different ways of achieving that. There is the classic way of ringing the user back and saying, hey, guys, we've resolved your incident. Can we close this off now? Yeah, fine, and it's done. But equally, there are a lot of organizations where it's not always possible to contact the user, so the incident will still be closed by the user, but through an SLA. An example of that: I did some work many years ago with an education authority who ran the IT for schools in a particular area in England. And a teacher would raise a call about some of their IT kit.
And the IT technician would maybe log on remotely and fix it, but then it was going to be difficult to get back to the teacher to close the call because they're teaching classes and so on. So what they did was they had an SLA that effectively said they would close the call and send the teacher an email, and if the teacher didn't respond within 24 hours, then the incident would be deemed over, and they would close it. And when I've been auditing service desks and so on, then from my perspective, that's deemed a user closure, because it's been agreed in the service level agreement that that would be the process. So, resolution and closure: quite often, with the closure, if it's an incident that might reoccur, then a knowledge article could be created to make sure that the next time that incident occurs, we've actually got that.
This is what the symptoms were. This is how we might close it. So it's always important that good closure notes are added to the incident to enhance the knowledge of the organization. There's absolutely nothing worse than when you're sat at a service desk, you're in an incident management practice, and you see exactly the same symptoms in an incident you're looking at as in one that somebody resolved last week. So you go in, and you look, and you say, right, what did they do to resolve it? And the closure note just says: fixed.
So the wheel needs to be reinvented for that particular incident. So it's important that we record that knowledge. And what we've got now are some very, very clever toolsets out there in the service management space that can do a lot of this knowledge management, creating knowledge articles, escalation, prioritization, and alerting. And those toolsets also give us access to service level information to allow us to prioritize. They give us access to configuration information so as to allow us to understand the potential impact of an incident. So if an incident affects a particular server, then we can see what people are connected to that server or that service, so we can look at the potential impacts of an incident through configuration. And obviously, as I've just mentioned, there's that knowledge data. And certainly, the automation of service management has really found a lot of benefit in incident management. Incident management gets a lot from having various parts of its process and its practice automated.
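Going back to the closure-by-SLA arrangement from the schools example, that rule is simple enough to automate. The sketch below assumes the 24-hour window described in the story; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta

# The 24-hour window comes from the schools example; treat it as an SLA setting.
RESPONSE_WINDOW = timedelta(hours=24)


def deemed_closed(notified_at: datetime, now: datetime, user_responded: bool = False) -> bool:
    """True once the user has had the agreed window to respond and has not."""
    return (not user_responded) and (now - notified_at >= RESPONSE_WINDOW)
```

Because the window is agreed in the service level agreement, a closure made this way still counts as a closure by the user.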
Incident Escalation and Resolution
We talked previously about how automation and the use of technology can make a big difference to incident management. And certainly, incidents are now being resolved by scripting, self-healing systems, and so on, and they're making a big, big difference, much more so than they've ever done in the service management and technology space. It could be that people use self-help to resolve incidents. That's really going to depend on the culture of your organization. I would always be really careful with self-help, because the toolsets are very good at self-help, but it does depend on the demographic of your user base, whether they're happy to use self-help or, if you've got a good service desk, whether they'd rather say, well, why should I trawl through self-help when I can just push that to the service desk and those guys can resolve it pretty quickly?
So as I've said, the service desk may actually do the resolution at that point of first contact. How many incidents you can resolve at that first point of contact is always a good metric that organizations use for the health of the incident management practice. An incident might be moved through support groups, to management, to suppliers, in terms of escalation. So we've got that kind of functional escalation to other support groups.
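As a quick aside on the first-contact metric mentioned above, it is usually expressed as a simple ratio. The function name and the choice to return 0.0 when there are no incidents are assumptions for this sketch.

```python
def first_contact_resolution_rate(resolved_at_first_contact: int, total_incidents: int) -> float:
    """Share of incidents resolved at the first point of contact (0.0 to 1.0)."""
    if total_incidents == 0:
        return 0.0  # assumption: report zero rather than fail when nothing was logged
    return resolved_at_first_contact / total_incidents
```

For example, 45 incidents resolved at first contact out of 60 logged gives a rate of 0.75.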
It may involve escalation through management, for example, if we do need to get suppliers involved, so we've got a kind of hierarchical escalation through management. Just finishing off this little portion of incident management, I'm going to mention a couple of things about major incidents. Major incidents are very, very interesting, in that a lot of organizations talk about actually designating what constitutes a major incident.
And to be honest, there's an old saying about rules being for the obedience of fools and the guidance of wise men. I think you should have guidelines, and we say that there should be guidelines for what constitutes a major incident. But in most organizations, you kind of have that gut feel, and you know what's a major incident. Now, what best practice definitely does say is that major incidents might involve dedicated or temporary teams.
Bigger organizations have got dedicated major incident managers; smaller organizations have people who can step into the role of major incident manager, if need be, when that happens.
They might pull together a team temporarily, and that's where some of the newer aspects of incident management come in, such as swarming, which is going to be covered in more depth later on, when the incident management practice is published in full; that will certainly involve major incidents. What you need are different processes and different procedures because a major incident is critical to your organization. There's a critical impact, and you need to make sure that it's resolved quickly. Finally, it might be that major incidents actually tip over into service continuity, with disaster recovery being needed or invoked in some of the more extreme cases.
How Incident Management Assists the Value Chain
As you can see, the two dark ones there, with the darker background, engage and deliver and support, are the two areas where incident management does the most work and has the most effect. The obvious one there, down at the bottom, is deliver and support, where obviously incident management makes a significant contribution. And that value chain activity includes resolving incidents and their input into problem management, which we'll look at at a later stage when we cover the problem management practice.
So deliver and support is a major area of involvement for incident management. Also, engage, because when we're actually getting information about incidents, when we're communicating to users and customers about incidents, incidents are always visible to users and the people using the services.
But also, if there are significant incidents, major incidents, they're going to be visible to customers as well, the people who are responsible for the outcomes. The customer isn't always the user, so as I say, the incidents are going to be visible to them, and that engagement is going to need to happen as part of incident management.
That might be through the service desk practice, through the service level management practice, or through the relationship management practice. So they're important.
Other areas that incident management gets involved in: improve. Incident records are a key input to improvement activities, and they help the improve activity to prioritize in terms of the number of incidents, their severity, and so on. So we're into that improve arena, and that's where one of the big links with problem management is going to come.
It's incident management activity, it's incident management records, that are some of the biggest links into improvement. Design and transition: incidents can occur in test environments as well as in service release and deployment. So obviously incident management gets involved there. I think it also makes sense, when you're doing training and testing of new services, to get incident management involved.
I was involved with an organization some years ago who, whenever a new service went live and they were doing the training for that, what they had was the training taking place in one room on the new service, but then they had two people from incident management typically one from the service desk and one from second level support sat in a room next to them.
And whenever somebody had an issue in the training, they didn't put their hand up and ask the trainer. They rang the service desk, who were actually in the next room. And it meant that the service desk and incident management got to build up their knowledge very, very quickly of how to take calls on the new service, what information they were going to need, and what typical issues people were having.
And it made a huge difference. So incident management can get involved in design and transition. Obtain and build: incident management, again, gets involved, because incidents might occur in deployment environments as part of obtain and build. So again, it would be involved as a practice.
Problem Management
So imagine the scenario: I'm tapping away on my laptop, and all of a sudden, it freezes. I can resolve that perhaps by power cycling my laptop, switching it off, switching it on. So I've done that. I've power cycled my laptop. It comes back, I log on again, and everything's working. My incident has been resolved. But what caused it? What actually caused my laptop to lock up? If you've got that situation where you don't know what the cause was and why that incident actually happened, that's the time you need problem management. And its job is to reduce the likelihood and impact of incidents by identifying actual and potential causes of incidents and managing workarounds and known errors, so making sure that we can do something about it that's either temporary or, indeed, permanent. Now, what that does straight away is give us three terms that I need to define for you.
The first one is the definition of a problem. The problem is the cause, or potential cause, of one or more incidents. So it could be a single significant incident, or it could be multiple like incidents that have got exactly the same symptoms. A known error is the state where a problem has been analyzed and perhaps we know what's caused it, but we don't necessarily know what to do about it on a temporary or a permanent basis. Or we've analyzed it and we've got a temporary fix for it.
They both fall into the category of being a known error: a problem that's been analyzed, but has not yet been resolved. Now, if we've got a temporary fix, that's where the workaround comes into the situation: a solution that reduces or eliminates the impact of an incident or problem for which a full resolution is not yet available. And some workarounds are actually going to reduce the likelihood of an incident occurring. What I'm going to show you next, using the problem management process and the problem management practice, is what all that looks like when it's put together.
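The three definitions fit together as a small data model. This is a minimal sketch, assuming a record with simple `analyzed` and `resolved` flags; the class and field names are illustrative and do not come from any particular toolset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    """The cause, or potential cause, of one or more incidents."""

    description: str
    incident_ids: List[str] = field(default_factory=list)
    analyzed: bool = False
    resolved: bool = False
    workaround: Optional[str] = None  # temporary fix, if one has been found

    @property
    def is_known_error(self) -> bool:
        # A known error: analyzed, but not yet resolved.
        return self.analyzed and not self.resolved
```

A problem can be a known error with or without a workaround; the workaround is just the temporary fix recorded against it while a full resolution is not yet available.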
The Approach to Problem Management
This next part, we're going to look at problem management, and we're going to do it in around about five minutes by telling you a story. Once upon a time, there was an incident. So I'm tapping away on my laptop. All of a sudden, nothing, it's frozen. Hmm! Three-fingered salute, Control-Alt-Delete, nothing. Mm. Little bit of a worry. I ring the service desk up. Service desk said, have you tried... Yeah, yeah, tried that, three-fingered salute. Nothing, nothing. OK, we'll send one of the guys around. So they come around. They look at my laptop. They try the three-fingered salute. Nothing. Nothing happening for those guys either, and they decide to power cycle my machine, which is where you switch it off and switch it back on, but power cycle just sounds a lot sexier, I guess. So they power cycle my machine, and away we go. We're working again. The incident can be closed. And they've restored my service, normal service operation. I'm happy. So the question is, do we need to find the underlying cause of that? Well, it's only happened to you once, Barry. It's not a big deal, so, OK, let's leave it. The same thing happens around about two or three weeks later. So I'm a good boy. I go back to my service desk, and I ring them up. Yeah, tried that, tried that. Power cycle it. They close it off. The incident is again closed. And it's starting to be a bit of an issue for me now that this is happening.
And, you know, for me, it always seems to be at the same time. So anyway, a couple of weeks later, the same thing happens again. Now, we do that, and we close the incident off. And I say, look, guys, this is an issue for me now. I'm a very, very senior user in this organization (it's my course, I can be). I think it's time you did something about it. And they said, I'll tell you what, Barry. We've got problem management. And it's funny that you should start talking about that now, because those guys in problem management have been doing a bit of trend analysis, and they've actually spotted that this incident has happened to you three times. And what they've done is they've raised a problem. They have identified a problem. So what we've got then is problem identification. And they said what we're going to try and do now is find an underlying cause to see if we can do something about it. So they've identified the problem, and they go away and they start to look at it. This is what we do in problem management, trying to find this underlying cause.
It's what we identify as problem control. They're trying to find what's caused it. And eventually, they come back to me, and they say, Barry, we think you've got a memory problem; not you personally, your laptop. What's happening, we can see, is that you're using a particular application, and you've got lots of other applications open, and you're just blowing the memory completely, so the whole thing is actually locking up on you. So that's what the issue is. And I'm saying to them, well, OK, what do you think you can do for me? Well, they say, look, we've got a few options here. We've not really looked at what the final solution is, but we think in the short term, if you close all of your other applications down, and you just use the one you're using, you should be OK.
It's not ideal for me because I've got to remember, and I'm getting older, so it's a little more difficult. I've got to remember to close all of these other applications down before I actually go ahead and use this particular application. But what they've got is a way of me reducing the impact of this particular problem. Now, from an ITIL perspective, what we've got there is a workaround. It's actually going to hopefully stop my incident from occurring. They spotted that workaround, and I can actually use it to stop the incident occurring and minimize the impact. Now, one of the things to notice about that workaround: it's not a technical workaround. It's actually a process workaround. And in problem management, we've got to think very, very holistically about what some of the solutions might actually be.
Now, one of the things that they've got with that workaround is that they've now finished that investigation. We know what's caused it. We've got to think about something else. And again, from an ITIL perspective, we can now declare that we've got a known error. Now, that known error can be used. How can that known error be used? Well, that known error can certainly be used by the service desk, for example, in letting other people who've got the same setup as me, and who are having the same incident, know that they can use that same workaround if they're getting the same issue. Because one of the things we will have done is linked all of these incidents to the particular problem that we're having. We've now got a known error. Are we going to do something about that? Well, the third part of problem management is error control, okay?
So, so far we've had problem identification and problem control. We've now got error control. And part of error control's job is to make sure that that known error gets to people who could use it. So let's roll out that known error. Let's make sure that people can use the workaround that's related to it, and reduce the impact of any incidents occurring. But what they can also do in error control is look for a final structural solution that they can pass through to change. So it might well be that they decide we're going to give you a new laptop. I like that. They might decide they're going to give me some more memory, or they might decide to play around with the application to make it less memory hungry. So error control will be looking at what a structural solution might be.
They might decide, well, we're replacing all of those laptops in the next six months anyway, so let's just stick with the workaround until we replace the laptops. So we've got a whole host of things going on there. Now, one of the things is they make a final change. Let's say they decided to give us new laptops. They give us new laptops, and everything gets closed off the errors, the known errors, the problems, all of the incidents. Everything is now being closed. The final part of problem management will be to conduct a review into that problem. What did we do well? What did we do badly? Can we do anything to prevent that from happening next time?
And one of the conclusions they might draw from this is that actually, if we'd tested this particular application more realistically in the first place, and we'd actually done a stress test where we had a typical person like me who's got hundreds of applications open when they were in that particular application, we'd have actually spotted that, wouldn't we? And we would never have had that problem in the first place. So what you've got there is effectively problem management's life story. We've got trend analysis going on up here. We've got the identification of problems. And it's not just multiples. If we get a single significant incident, yeah, that's in there, that could almost have a one-to-one relationship with a problem. Different goals, though: the incident is about getting the user back working; the problem is about finding the underlying cause, and that starts in problem identification. Problem control: what's gone wrong? What's the cause? Is there a chance of a workaround? Error control: making sure that the known errors get to where they need to be, perhaps through the service desk. Doing those reviews and coming up with the answer: what might be the change that actually makes a difference and stops the incidents from occurring in the first place? And that there, in maybe a little bit over five minutes, is effectively what problem management does.
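The trend analysis that kicked off the story, spotting that the same incident had happened to the same user three times, can be sketched as a simple grouping. The incident shape (id, user, symptom) and the threshold of three are assumptions taken from the story, not a rule from ITIL, and a single significant incident could still justify a problem on its own.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical incident shape: (incident id, user, symptom)
Incident = Tuple[str, str, str]


def identify_problems(
    incidents: Iterable[Incident], min_occurrences: int = 3
) -> Dict[Tuple[str, str], List[str]]:
    """Group closed incidents by user and symptom; flag repeats as candidate problems."""
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for incident_id, user, symptom in incidents:
        grouped[(user, symptom)].append(incident_id)
    return {key: ids for key, ids in grouped.items() if len(ids) >= min_occurrences}
```

The incident ids returned for each candidate are the ones that would be linked to the problem record once it is raised.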
Planning and Interfaces of Problem Management
In fact, it makes absolute sense for incident management and problem management to be planned together, since they share things like categorization and certain levels of prioritization. There are also interfaces between problem management and other areas: interfaces with risk management, with change control, with knowledge management, and with continual improvement. For example, with risk management, it's often better within problem management to take a risk-based approach to prioritization rather than necessarily follow exactly what's in incident management. In fact, if your problem management prioritization follows incident management exactly, that's actually going to be a recipe for disaster. It's probably better to look at risk management for your prioritization of problems rather than incident management. With change control, there's an obvious link: when we need to resolve a problem, it might be that a workaround has to go through change control. It could be that the structural solution at the end needs to go through change control.
There's an obvious link into knowledge management: known errors and workarounds. They're knowledge, and they need to be made available to all the practices within service management, particularly incident management and the service desk. And finally, continual improvement. There are very many organizations now who have actually, from a team perspective, abandoned the name problem management because it's got negative connotations (it's got the word problem in it, after all) and renamed those teams continual improvement, because that's what they actually do.
So there's always a very, very close link between problem management and continual improvement in many, many organizations. It makes sense for the two to work very closely together. Coming around to the people side of things, many problem management activities rely on the knowledge and experience of staff. Quite often, the things that have got through to problem management, that haven't been resolved and had their cause found in incident management, are likely to be of a more complex nature, which doesn't necessarily lend itself to the type of automation and kind of low-level problem solving that we might employ in incident management.
So rather than following those detailed procedures, chances are what you're going to use is the experience of staff, and what I sometimes call the techie's hunch, where, yeah, I think it might be that. That's the sort of thing that you can't automate. I think one last thing I would say here about problem management is to make sure, when you're looking for those causes, that you don't get blinkers on, and you don't just look at technology for the underlying cause of a particular incident.
Actually, the cause of an incident could be in any one of the four dimensions of service management. So when you're looking at teams and people to do problem management, then it makes a lot of sense to bring people in who've got business experience, who've got business process experience, who've got technology experience, who know about the suppliers, and so on. It makes far more sense to do things that way and have a very holistic view when looking at potential problem causes.
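The risk-based approach to prioritizing problems mentioned earlier in this section could be sketched along these lines. This is only an illustration under stated assumptions: the 1-to-5 scales, the likelihood-times-impact score, and the example problems are all hypothetical, and a real organization would take its scales from its own risk management practice.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Problem:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain) to recur
    impact: int      # assumed scale: 1 (minor) to 5 (severe) business impact


def risk_score(problem: Problem) -> int:
    """Simple likelihood x impact score (an assumption, not a standard)."""
    return problem.likelihood * problem.impact


def prioritize(problems: List[Problem]) -> List[Problem]:
    """Return problems ordered from highest to lowest risk score."""
    return sorted(problems, key=risk_score, reverse=True)


# Hypothetical backlog of open problems.
backlog = [
    Problem("Typo on intranet page", likelihood=3, impact=1),
    Problem("Maximum log-ons reached", likelihood=2, impact=5),
    Problem("Memory-hungry application", likelihood=4, impact=3),
]

for p in prioritize(backlog):
    print(f"{risk_score(p):>2}  {p.name}")
```

Note that the ordering comes from the risk to the business, not from the priority that was given to the original incidents, which is the distinction being drawn above.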
How Problem Management Assists the Value Chain
Well again, looking at this, the darker the color of the background, the more involvement it has. So we're looking there at improve and deliver and support as the two areas where problem management makes the most impact. In improve, effective problem management provides the understanding needed to reduce the number of incidents and the impact of incidents that cannot be prevented. And in deliver and support, problem management makes a significant contribution by preventing incident repetition and supporting timely incident resolution through things like known errors and workarounds. As for the other areas: in engage, problems have a significant impact on services.
They're going to be visible to customers and users because they won't reach problem status unless it's been a single significant incident or multiple incidents that add up to something that's more significant. So there's certainly going to be some engage activity at an operational, a tactical, and quite often a strategic level.
In design and transition, problem management provides information that helps to improve testing and knowledge transfer. And on knowledge transfer, certainly now we see from Agile that we don't let errors move forward. But every so often, an organization does decide that actually, we will move forward with an error. And when that knowledge transfer happens, problem management is certainly going to be involved there.
And finally, in obtain/build, problem management activities can identify product defects by doing the kind of trend analysis work we've talked about. And I think it's fair to say that the techniques that problem management and good problem analysts use are some of the most sought after in any organization.