Resolving Problems

This is a guide on resolving problems.

Resolving Problems

In any technological environment, there are going to be problems; it really is inevitable. And clearly, when something goes wrong, our initial reaction is to simply fix it. But it's really not always that simple, particularly if you are in a corporate environment. So in this presentation, we'll take a look at some detailed approaches to resolving problems, and ultimately this does require some kind of a plan with well-documented steps.

Now, this does depend on the degree of what went wrong. In other words, if a single user is having difficulty with their computer and a simple reboot fixes the problem, well, that's not really anything that requires a plan with well-documented steps. A plan matters more when wide-scale changes are going to be implemented, or when there is some kind of wide-scale failure that is affecting a lot of users.

So, this is where we start to see a plan with those documented steps. And really it should follow some kind of logical path to whatever the solution is, and that path must consider the corporate policies that might be in place, the corporate procedures with respect to what you can and cannot do, and, of course, the possible impact that may result from whatever approach you take to resolving this problem. So again, it does generally come down to the scale, but in a lot of cases you can't just rush in and fix a problem if there are a lot of other things to consider.

[Video description begins] Managing Change. [Video description ends]

Now, primarily what this involves is the idea of change. Whether that's in response to a problem or just something that's changing because you need to update or upgrade something, there really still should be a formal process for the change, and this will help to avoid disruption of regular operations. For example, if it is an upgrade, you really shouldn't just show up one day and decide that you're going to replace someone's computer if they have no idea that it's happening. So clearly we need some kind of implementation of change control.

There needs to be a formal process to control the change. Again, you can't just start doing wide-scale changes without anyone else knowing what's going on. So you have to develop a plan for the change. You have to determine the possible risks from the change. You need to develop a backup and a rollback plan so that if something goes wrong, you can either recover or you can just go back to where you were before you started in the first place. Ideally, you should test those changes in what's known as a sandbox environment, which is essentially a lab. And ensure that all of the steps that were taken are well documented, so that if and when you encounter problems implementing this change in the production environment, hopefully it's something that you would have encountered before in the sandbox environment.

And you can refer to the documentation so that you can, hopefully, get past that as easily as possible and ultimately implement those changes. But what you will likely find in most organizations is that there is a fairly well-structured path when it comes to handling problems and managing change. So you just need to familiarize yourself with those processes and, essentially, don't just immediately react to every problem with, let's just fix this and move on. You want to try to reduce the number of problems, of course, and ensure that when they are encountered, people are aware of what's being done. You know what the reasons are for these problems as best as possible. And you know which approach to take so that you can minimize all of this and the overall impact on your day-to-day operations.
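
To make the change control steps a little more concrete, here is a minimal sketch in Python of a checklist for a planned change. The ChangeRequest class and all of its field names are hypothetical and made up for illustration; a real organization would have its own change management system and its own required items.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical record of a planned change; field names are illustrative."""
    summary: str
    risks: list[str] = field(default_factory=list)
    backup_plan: str = ""
    rollback_plan: str = ""
    tested_in_sandbox: bool = False
    documented_steps: list[str] = field(default_factory=list)

    def missing_items(self) -> list[str]:
        """Return the change-control items that are still outstanding."""
        missing = []
        if not self.risks:
            missing.append("risk assessment")
        if not self.backup_plan:
            missing.append("backup plan")
        if not self.rollback_plan:
            missing.append("rollback plan")
        if not self.tested_in_sandbox:
            missing.append("sandbox test")
        if not self.documented_steps:
            missing.append("documented steps")
        return missing

    def ready_for_production(self) -> bool:
        """A change is ready only when nothing on the checklist is missing."""
        return not self.missing_items()


# Example: a change that has a plan and risks, but no sandbox test yet.
upgrade = ChangeRequest(
    summary="Replace workstations in accounting",
    risks=["users lose local files"],
    backup_plan="image each disk before replacement",
    rollback_plan="restore the old workstation from its image",
    documented_steps=["notify users", "back up", "swap hardware"],
)
print(upgrade.missing_items())
```

The point of the sketch is simply that the change does not move to production until every item, including the sandbox test, has been addressed.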


Identifying the Problem

Now, of course, we know that in any computing environment, there are going to be problems. So in this presentation, we'll take a look at some of the key activities to help identify those problems and to assist with formulating an approach to dealing with them. And it really does begin with gathering as much relevant information as possible. Now, it might stand to reason that you would want to gather only relevant information. But what is relevant is not always obvious at first, because there's not always a direct connection between the symptoms and the problem.

There might be a number of problems that could result in the same symptoms. So it can take a while to really nail everything down, if you will. Questioning the users certainly helps. They are the ones that are in front of these systems day in and day out. And if you can gather as much information as possible from them, then this can certainly assist in building a bigger picture. And ideally, if you can attempt to recreate the issue then this usually can really help to focus on what it is. Now, again, easier said than done. But if you can manually recreate the same problem, then, of course, you would have a very good idea as to what the problem actually is.

[Video description begins] Other Helpful Activities. [Video description ends]

Some other helpful activities include identifying changes that led to the problem, and a common example here would be system updates. Updates, of course, are designed to address problems or vulnerabilities. But they don't always work. There have been situations where updates themselves have caused problems, maybe with compatibility, for example, with an application or a hardware device. So they can be problematic. And this could be fairly wide-scale as well. So it's not a bad idea to try to break up large problems into smaller segments or units and address one aspect at a time.

Now, if you can, of course, it would certainly help to be able to back up the environment. In other words, take a snapshot, if you will, prior to making changes such as the updates. This enables what we call rollback. You can simply undo what was done, and revert everything back to that point in time where it was stable. And studying log files is also a good idea because they tend to record what kind of changes were made. So that can help you to narrow down what the change was that led to this problem. And virtually every log file will tell you when it occurred as well. So you would have a much better idea of where to focus your efforts.
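
As a rough illustration of using log timestamps to narrow things down, the sketch below keeps only the entries recorded shortly before the problem appeared. It assumes the log has already been parsed into (timestamp, message) pairs; the function name changes_before, the 24-hour default window, and the sample entries are all made up for the example.

```python
from datetime import datetime, timedelta


def changes_before(entries, problem_time, window=timedelta(hours=24)):
    """Return entries recorded within `window` before `problem_time`, newest first."""
    start = problem_time - window
    recent = [(ts, msg) for ts, msg in entries if start <= ts <= problem_time]
    return sorted(recent, key=lambda entry: entry[0], reverse=True)


# Hypothetical, already-parsed log entries.
log = [
    (datetime(2024, 3, 4, 9, 0), "driver update installed"),
    (datetime(2024, 3, 5, 22, 15), "security patch applied"),
    (datetime(2024, 3, 6, 8, 5), "application crash reported"),
]

# The users first saw the problem at 08:00 on March 6.
for ts, msg in changes_before(log, datetime(2024, 3, 6, 8, 0)):
    print(ts.isoformat(), msg)
```

Here only the security patch falls inside the window, so that is where you would likely look first. Widening the window is the obvious next step if nothing in the first pass explains the problem.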

[Video description begins] Understand System Changes. [Video description ends]

Now, when you come up with some kind of resolution or some kind of change, you do need to consider the impact. This might include environmental changes, if it's a large-scale update or a large-scale change from one technology to another. There certainly can be environmental changes. Now, they may be good, but you just need to be aware of that. And you also need to be mindful of the infrastructure changes. So in other words, what will the impact be on your current infrastructure, and how will this affect your day-to-day operations? Is it going to be very disruptive? Is it not going to be very disruptive at all? And, ideally, we don't want to be disruptive, but in some cases, it is unavoidable. So the more you know, the more you can plan before implementing these changes.

[Video description begins] Identify Likely Cause. [Video description ends]

And try to just identify what the most likely cause is for any kind of problem. Obviously, again, that's easier said than done. But there are certain things that really should not be related while others might have some kind of an effect. So try to just narrow down the causes as much as possible. And prioritize those which seem to be the most likely. Or conversely, discard those that would not be likely at all. And maybe you could rank them from the most probable to the least probable to help you focus. And ultimately, sometimes there's just no immediate obvious solution.

So if you can't really determine what the cause is, then escalate this. Get more people involved, and you will likely find that many environments do have tiers of support. So there would be sort of a first responder type of tier. And if they are unable to determine the cause, then it's escalated to the next tier, and again, more people get involved, more senior people get involved. Those who are maybe a little more experienced and a little more familiar with the environment. So escalate as much as necessary until a cause can be determined.
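
The tiered escalation just described can be sketched in a few lines of Python. This is only an illustration: the escalate function, the tier names, and the idea of representing each tier as a function that either returns a cause or returns None are all assumptions made for the example.

```python
def escalate(problem, tiers):
    """Offer the problem to each support tier in order until one finds a cause.

    `tiers` is a list of (name, diagnose) pairs. Each `diagnose` function
    returns a cause as a string, or None if that tier cannot determine one.
    """
    for name, diagnose in tiers:
        cause = diagnose(problem)
        if cause is not None:
            return name, cause
    return None, None  # no tier found a cause; more help is needed


# Hypothetical tiers: the help desk cannot diagnose this one, tier 2 can.
tiers = [
    ("help desk", lambda problem: None),
    ("tier 2", lambda problem: "expired certificate on the mail server"),
    ("tier 3", lambda problem: "not reached in this example"),
]

print(escalate("users cannot send email", tiers))
```

Notice that the loop stops at the first tier that determines a cause, so the more senior tiers only get involved when they are actually needed.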


Establishing a Theory of Probable Cause

Now, when it comes to solving problems, really, one of your most effective tools is knowledge. But knowledge is something that has to be acquired. And in many cases, this involves research.

[Video description begins] Internal Research Based on Symptoms. [Video description ends]

So when it comes to trying to ascertain the cause of a problem, in most cases, you want to begin with internal research that is based on the symptoms. And this is because it is most likely that people within your own organization have encountered this problem before. Or they are just so much more familiar with the setup of your organization that it simply makes more sense to start with internal research, as opposed to just immediately going out onto the Internet, for example. So try to begin your research based on the symptoms that you are seeing. In other words, investigate the most obvious causes.

And clearly that makes sense. But you do need to consider other causes as well, because the links may not always be apparent. Relatively complex relationships may exist, and that really is an understatement. I mean, consider just trying to access, let's say, a database on a server. Well, problems, of course, can begin with your own computer. And there is just a huge number of possible problems that can exist right there alone. Then we need to access the network. The network can be very complex. Then we need to arrive at the server.

And there can be as many things wrong with the server as with your own client system, possibly more. So when you put them all together, it's not always that easy to determine what the problem is. So you kind of need to capture all possible reasons leading to an issue, and maybe even develop a list of causes. Start with the most obvious causes that are easy to test, and use the process of elimination. Essentially just saying, well, it clearly is not this, because this works in a different scenario. That way you can help to focus your efforts. Then you can move to the more complex issues to test once you have eliminated those possible obvious causes.
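
One way to picture this process of elimination is the sketch below, which tests a list of candidate causes starting with the most obvious and easiest to check. The Candidate class, the 1 to 5 scoring scales, and the sample checks are all hypothetical; in practice the ranking is a judgment call and each check is real diagnostic work.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    description: str
    likelihood: int               # 1 (unlikely) to 5 (most obvious); subjective
    test_effort: int              # 1 (trivial to check) to 5 (complex to check)
    is_cause: Callable[[], bool]  # runs the check; True means this is the cause


def find_cause(candidates):
    """Test candidates, most likely and easiest first.

    Returns (cause, eliminated), where `cause` is None if nothing was confirmed
    and `eliminated` lists the candidates that were ruled out along the way.
    """
    ordered = sorted(candidates, key=lambda c: (-c.likelihood, c.test_effort))
    eliminated = []
    for candidate in ordered:
        if candidate.is_cause():
            return candidate.description, eliminated
        eliminated.append(candidate.description)
    return None, eliminated


# Hypothetical candidates for "cannot reach the database server".
candidates = [
    Candidate("database service stopped", 3, 4, lambda: False),
    Candidate("unplugged network cable", 5, 1, lambda: False),
    Candidate("misconfigured switch port", 3, 3, lambda: True),
]

print(find_cause(candidates))
```

The list of eliminated candidates is worth keeping, because knowing what the cause was not is also useful information for whoever looks at the problem next.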

[Video description begins] External Research Based on Symptoms. [Video description ends]

Now, internal research certainly will help, but in some cases you do need to go to external sources to capture more information. This might include interdepartmental knowledge bases. Maybe third-party knowledge bases such as documentation from a vendor. Of course, the Internet is a wealth of information, and industry or academic literature will also help. But ultimately, gathering as much knowledge as possible can really only be a good thing. And even if that doesn't immediately lead you to a cause of this exact problem, it will usually help in understanding the entire relationship of all the parts.

So again you can at least focus a little bit better. Or maybe it will even help to address a future problem because you do understand now that this relates to that and this connects to that. And everything is always interconnected. So again, the more knowledge you can gather, the better. And this should always involve documentation as well, so that if you do find something, let's just say through a basic Internet search, you don't just find that solution, implement it, and walk away. Make sure you document what you found, where you found it, and how it was applicable. And, of course, what you did to apply that research in terms of correcting this problem so that it's there for future purposes.


Test the Theory to Determine Cause

Now when it comes time to implement some kind of resolution to a problem, again this doesn't mean that you just always rush right in and implement whatever you feel is the solution. You still need to be sure that you have identified the correct cause. So, test the likely cause of the problem, and this of course may result in your theory being proven correct or being disproved. And you really don't know until you try.

So again, the idea here is not the fact that I know that this is the solution to the problem. It's, have I identified the correct problem? Because you might actually be addressing something that isn't the problem. So, if the likely cause is proven to be correct, then you should still verify your results and maybe submit whatever your plan is for approval. Or essentially still escalate this to a higher authority whereby they can say, okay, yes, everything looks good, you've identified the problem. You've identified a solution, so let's move forward in implementing this solution.

But if, of course, it's disproved then you really start again. You have to re-examine the problems and the causes again, and develop a new theory. Or again, escalate to a higher authority if you just can't seem to get anywhere with respect to the original problem in the first place. But again, you really just need to make sure that you have identified the correct problem because implementing a solution for a problem that really doesn't exist doesn't get you anywhere.

[Video description begins] Evaluation of Test Results. [Video description ends]

So then of course, once you have implemented some kind of theory with respect to correcting this problem, then you still need to test that out and you need to evaluate the results, and really this is just an ongoing process in many cases, because something will undoubtedly happen again in the future. Or the resolution that you've implemented simply may not have worked. So, you still have to begin with developing a theory. You test that theory, you evaluate the results and if the issue is not yet resolved, you go back to square one. You develop a new theory, you test that theory. You evaluate the results.

And if, then, the issue does finally get resolved, then you develop a plan of action. Again, this is not something that is typically done in production. You should try to reproduce the problem in a lab environment, so that you can test and test without disrupting anything and without getting in anyone's way, quite simply. Once you then are able to resolve that, this is when you develop the plan of action because again, you can't just say all right, I figured out what the problem is, and then rush in and disrupt everyone in terms of correcting it. Obviously, everybody wants problems corrected as quickly as possible.

But you really can't disrupt their day-to-day activities either. If they're still able to do most of their work then, again, it makes sense to just hold off until you can address this when maybe everyone has gone home or just during some kind of downtime. So you still need to develop some kind of plan of action that outlines what is going to be done. And then, that plan itself may still need to be approved. So ultimately, you have figured this all out. But you still need to ensure that you take the correct approach with respect to implementing what your resolution is.


Establish a Plan of Action

In this presentation, we'll talk about establishing a plan of action, which, of course, begins with developing the plan in the first place, and regardless of what the issue might be or what type of solution you're implementing, ideally you want to attempt to fix any issues with minimal operating impact. Now, of course, this doesn't mean that you always have to wait for off hours, for example, to address an issue. Many may be addressed during your regular operating hours, particularly if it's a fairly simple problem.

Now I'll exaggerate a little bit. But if somebody is having trouble accessing the Internet because their Ethernet cable fell out, clearly, it's just a matter of plugging the cable back in. And you're certainly not going to wait for downtime to do something very simple. Really, that's the point. But other issues will need to be addressed outside of regular operating hours, and this is typically where a much larger problem has been identified. Or maybe it's not necessarily a problem, but something you have anticipated, such as an upgrade. Upgrades can take a long time to complete, so typically they are addressed during downtime.

And you always have to consider the possible effects of the change. And as I'm sure you are aware in just any aspect of daily life, things do not always go as planned. So you really need to have some kind of backup plan. Something that allows you to just revert back to wherever you were before you started. So that at least everything can end up being the same, and not worse than when you started.

[Video description begins] Complexity of Plans. [Video description ends]

Now, any kind of plan, again, will involve something that is a little more involved. As mentioned, you aren't going to draw up a plan just to plug someone's Ethernet cable back in. So when you are talking about a plan, that alone tends to mean it's something a little more significant. But there can still be simple action plans and fairly complex action plans. And this generally comes down to how widespread the problem is.

And of course, it comes down to its overall severity and what kind of impact it might have. But once you're talking about something that does have to be written down, if you will, or documented, then you do want to try to keep it as simple as possible. The more complex the plan, the more difficult the implementation. So break it down into its constituent components, perhaps. Maybe try to consider if you can do it in a bit of a phased approach, or in stages. But ultimately a more complex plan, of course, will require much more forethought, a little more analysis, and more care in its implementation. So again try to keep things as simple as you can.

[Video description begins] Implement the Action Plan. [Video description ends]

So when it does come time to implement your action plan, you do need to be mindful of things like maintenance windows that have been established within your organization. Again, if it's a little more of a complex plan, then in most cases you can't just run out and start implementing it. This will disrupt your regular operations. So in many organizations, they have set aside specific times where maintenance can happen.

And typically this is when you execute your action plan. Now, you do also need to be mindful of what is happening while you're implementing it. And determine whether or not anything should be escalated. And this typically involves multiple tiers of internal support, but there may be external support as well. Maybe there has been some degree of outsourcing in your organization. Or maybe you are dealing with vendors directly. Whether it's for an application or some sort of hardware, you may require that external support.

And of course, be mindful of the time frame that is necessary to make the change or implement the plan. You just have to ensure that you have sufficient resources available to meet any deadlines. Now, a deadline is not typically something you would find when it's just correcting a problem. You know, if something has failed, then generally the deadline is just get it back up as quickly as you can. But when you are talking about a plan that has been drawn up, then this typically does tend to have a deadline. So you want to ensure that you have the personnel available to be able to meet that deadline. Because the longer it goes on beyond that, the more difficult things will become. So try to be mindful of all of those before you start implementing your action plan.
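
As a small illustration of respecting maintenance windows, the sketch below checks whether a given moment falls inside one of the windows an organization has set aside. The function name in_maintenance_window, the (weekday, start, end) layout, and the Saturday 01:00 to 05:00 window are all made up for the example, and to keep things short it assumes a window does not cross midnight.

```python
from datetime import datetime, time


def in_maintenance_window(moment, windows):
    """Return True if `moment` falls inside any (weekday, start, end) window.

    Weekdays use datetime.weekday() numbering, where Monday is 0.
    The start time is inclusive and the end time is exclusive.
    Assumption: no window crosses midnight.
    """
    for weekday, start, end in windows:
        if moment.weekday() == weekday and start <= moment.time() < end:
            return True
    return False


# Hypothetical policy: maintenance is allowed on Saturdays from 01:00 to 05:00.
windows = [(5, time(1, 0), time(5, 0))]

print(in_maintenance_window(datetime(2024, 1, 6, 2, 30), windows))   # a Saturday, 02:30
print(in_maintenance_window(datetime(2024, 1, 8, 10, 0), windows))   # a Monday, 10:00
```

A check like this could sit in front of whatever script carries out the action plan, so that a larger change is not started by accident in the middle of the working day.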


Verify Functionality

So then once you have identified a problem, formulated some kind of plan, and implemented that plan, it culminates in, of course, verifying that you have restored full system functionality post changes. So in other words, you've done something to correct the issue, but you really need to verify that this has in fact fixed the problem. So testing should always be part of the action plan. In other words, you would not go to a particular system and implement a particular type of fix, and then just assume that everything is fine. You really need to determine if the problem is actually fixed.

And you should try to determine this for yourself as a support person, but you should also have the users verify that the problem is fixed because it may not always appear immediately. Now, I'll go with a fairly simple example here, but imagine if it was just a component that was overheating. So you've done something to correct the problem, but a component typically does not overheat immediately. So you may need to wait a while. So you can certainly still implement your fix and try your best to determine if it has corrected the issue. But check in with the users periodically and make sure that it is not happening again and again.
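
To illustrate why one quick check after the fix is not enough, here is a minimal sketch based on the overheating example. The function name fix_holds, the 80 degree limit, and the requirement of five readings are assumptions made for illustration; the real threshold and observation period depend on the component.

```python
def fix_holds(readings_c, limit_c, min_readings):
    """Return True only if enough post-fix readings exist and all stay below the limit.

    `readings_c` are temperatures in degrees Celsius collected after the fix.
    With fewer than `min_readings` samples it is too early to call it fixed.
    """
    if len(readings_c) < min_readings:
        return False  # too early to tell; keep monitoring
    return all(reading < limit_c for reading in readings_c)


# Right after the fix: only two readings, so it is too soon to say.
print(fix_holds([60, 62], limit_c=80, min_readings=5))

# After checking in periodically: enough readings, all under the limit.
print(fix_holds([60, 62, 61, 63, 64], limit_c=80, min_readings=5))
```

The same idea applies to checking in with the users: the fix is only considered verified once it has held up over a reasonable period, not just in the first few minutes.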

[Video description begins] System Functionality. [Video description ends]

So with respect to whether or not you have addressed the problem, in other words, has full functionality been restored, yes or no? If it hasn't, then clearly you need to go back to the drawing board. You need to re-examine the likely problems and the causes again. And if, of course, it has been restored, then, if you can, try to implement possible preventative procedures. Now again, this is clearly going to depend on what the problem was in the first place.

There are a lot of cases where you just can't prevent certain things from happening. You know, hardware, for example, may just fail due to manufacturing problems. And there's really not much you can do about that. But if you can, then try your best to implement something that will prevent this in the future. If it was maybe a process that was being performed incorrectly, then clearly, you can educate that user on how to perform that process correctly in the future. And ultimately you want documentation that says, we identified this problem, we implemented this plan, and we verified that functionality was or wasn't restored. And then whatever was done after that will dictate what needs to be done in the future. But the more information, the more documentation, the easier it will be to run down problems in the future.


Document Findings

Now in this presentation, we'll talk about documenting the methodology that was implemented to correct any kind of problem. And this really is something that should get as much attention as any other aspect of troubleshooting. Because, as I've said earlier, the more you know, the more easily you can address problems. Knowledge really is your best tool. So, when a problem arises, you need to document the findings. What have we discovered? You need to document the actions that were taken.

And this can include both successful and failed actions because it's certainly as valuable to know that trying this, generally, won't succeed. So, don't bother trying that. And of course, what the outcome is and that really relates back to the actions as to whether or not they succeeded or failed. But again, the more information, the better. And all of these get assembled into creating what's often referred to as a knowledge base.

[Video description begins] Knowledge base comprises findings, actions, and outcomes. [Video description ends]

And really this is just an ever growing collection of all of the problems that have been encountered and what was done to address them.
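
A knowledge base built from findings, actions, and outcomes could be sketched very simply, as below. The KnowledgeBaseEntry class, its field names, and the keyword search are hypothetical; a real knowledge base would more likely be a searchable database or a wiki, but the idea of recording both the successful and the failed actions is the same.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeBaseEntry:
    findings: str    # what was discovered about the problem
    action: str      # what was tried
    succeeded: bool  # the outcome of that action


def search(entries, keyword):
    """Return entries whose findings mention `keyword`, successful actions first."""
    keyword = keyword.lower()
    hits = [entry for entry in entries if keyword in entry.findings.lower()]
    # False sorts before True, so entries that succeeded come first.
    return sorted(hits, key=lambda entry: not entry.succeeded)


# Hypothetical entries, including a failed action that is still worth recording.
knowledge_base = [
    KnowledgeBaseEntry("Printer offline after driver update", "rebooted the printer", False),
    KnowledgeBaseEntry("Printer offline after driver update", "rolled back the driver", True),
    KnowledgeBaseEntry("Laptop overheating under load", "cleaned the fan", True),
]

for entry in search(knowledge_base, "printer"):
    print(entry.action, "->", "worked" if entry.succeeded else "did not work")
```

Keeping the failed attempt in the results is deliberate, since it tells the next person what not to bother trying.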

[Video description begins] Key Benefits of Documentation. [Video description ends]

So there are many benefits to thorough documentation, particularly when it comes to any kind of recurring issue. This is something that of course tends to happen often. So we can go to the documentation and discover the problem and/or the root causes, maybe how many users are affected and/or which types of users. Maybe examine the history of the equipment with respect to how reliable it's been. We can also see corrective measures that have been taken and maybe even preventative procedures that have also been implemented.

And then of course with respect to those measures and procedures, which of them had positive versus negative outcomes, because again, it's just as useful to know that trying this typically will not help versus trying something else that will. So, of course, this can help you to much more quickly resolve these types of recurring issues. And maybe those prior issues can be avoided altogether, or at least addressed more easily.

[Video description begins] Documentation Sharing. [Video description ends]

Now, obviously documentation then needs to be made available, so you do want to consider methods of sharing. And ideally a centralized knowledge base is typically something that is desirable. So that everyone is just feeding into the same source of information and drawing from it as well. In other words, you don't want to have a bunch of decentralized sources that don't relate to each other. Maybe it's some kind of searchable database. Maybe it's wiki-based.

And that means that it is assembled by the users. That's essentially the idea of a wiki: it's built by the people, if you will. And of course, any kind of case notes will always help for any specific instance. Those are usually much more specific as opposed to maybe a centralized knowledge base, which might be a little more broad. But the more information, the better. And of course, you can always include external resources as well. Maybe it's support from vendors or plain old Internet searches. But ultimately making sure that the data is available, and easily accessible by the people who need it, will facilitate more easily addressing these problems when they arise in the future.