Programmer Musings: August 2013 Archives


August 16, 2013

Troubleshooting Questions

Between my earlier career in electronics and my current career in software, I have found a number of questions that have been helpful in troubleshooting problems reported by others. None of these questions are original to me. My wife described troubleshooting a hardware problem over the phone and mentioned some of her troubleshooting questions. I thought I should write down some of the questions I have used.

Is it plugged in?

This is obviously a hardware troubleshooting question. Once you've spent time travelling to a customer location a few times to solve a hardware failure, you begin asking this question up front. Be prepared to ask the person to check, because their immediate reaction will be to respond without checking.

Is it turned on?

This is the other hardware-only troubleshooting question. People are likely to answer this one without checking as well. At a previous job, I heard our tech support guy go through this in a really wonderful way. (Remember, I could only hear half of the conversation.)

"Is the printer turned on?"

"Are you sure?"

"Is the little LED on the front of the printer on?"

"Reach all the way to the back of the printer on the right side and push the switch there."

"I'm happy your printer works now. Thanks for calling."

Was it working before?

This question is very easy to forget, but it can be critical. If the device or software has never worked before, you may need to check whether it has ever been set up or configured. You might need to ask about prerequisite hardware or software.

I remember when this question helped me get a very irate customer to realise that he had installed a piece of our hardware on the same port as his printer, confusing both devices. In software, this is often where you expose that someone has installed software for the wrong system. (That Windows program the customer wants to use is not running on their new MacBook.)

What changed?

It seems that every time any software or hardware fails, it does so spontaneously. Nobody did anything different. It was working fine and then it just failed.

Asking about changes to the software, environment, machine, or usage will often point to something critical to finding the problem. This change is often hard to identify, because the person reporting the problem may not think it was important.

I remember one case where some proprietary web servers for a company began failing one morning. Over half of the servers were locking up, but a few were still functional. I had been part of writing the servers, so I was brought in to find out why they were failing. After running through the code and checking off everything I could think of, I asked the operations staff if they had changed anything that morning. They said "No".

So, I asked about the day before. After a bit of questioning, I found that they had changed the path in the default TEMP environment variable to point to a second disk. It turned out that all of the failing servers only had one drive. One of the libraries that the servers depended on wrote a temporary file in the temporary directory and failed if it couldn't write. The problem was only noticed when the servers were restarted, which happened early in the morning.
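A check for this kind of failure is easy to script. What follows is only a minimal sketch in Python, not the code from those servers. The function name check_temp_dir is made up for illustration, and it assumes the directory comes from the TEMP environment variable mentioned above (on Unix-like systems the variable is more often TMPDIR).

```python
import os
import tempfile


def check_temp_dir(path):
    """Return (ok, detail) describing whether a temporary file can be written in path.

    Hypothetical helper for illustration; not taken from any real server code.
    """
    if not path:
        return False, "no temporary directory is configured"
    if not os.path.isdir(path):
        return False, f"{path} does not exist or is not a directory"
    try:
        # Create, write, and automatically delete a small probe file.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
    except OSError as exc:
        return False, f"cannot write to {path}: {exc}"
    return True, f"{path} is writable"


if __name__ == "__main__":
    # Assumption: the directory is named by TEMP, as in the story above.
    ok, detail = check_temp_dir(os.environ.get("TEMP"))
    print("OK" if ok else "PROBLEM", "-", detail)
```

The sketch reads the environment variable directly rather than calling tempfile.gettempdir(), because gettempdir() quietly falls back to another directory when the configured one is unusable, which would hide exactly the problem described here.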

What did you do right before it stopped working?

The idea is to walk the person through the steps leading up to the problem, along with anything special in the environment or process, in order to trigger a memory of what was different this time. Sometimes this results in the person mentioning something that they don't think matters but is actually critical to solving the problem.

What did you expect it to do? What did it actually do?

Every now and then, someone reports a problem caused by a misunderstanding about the intended behaviour of the hardware or software. Working to understand and possibly modify an expectation is much better than trying to fix what is actually the intended behaviour.

Were there any error messages? What were they?

Usually software provides some kind of message when things go wrong. The exact error message (including any numeric codes) is often a major help in identifying a problem. On the other hand, "There was an error message, but I don't know what it said" is pretty useless.

What exact steps did you follow to trigger the problem?

I often find that any report of a problem is missing the critical information of the exact steps needed to reproduce the problem. Having the person list everything they did sometimes helps to identify a bad process or strange usage that causes the problem.

This is particularly important if the person is using the software or hardware in an unexpected way. They may not realise that what they are trying to do is unsupported or unexpected.

Walking through the steps leading to the problem will sometimes identify this issue.

Does it happen every time?

Intermittent problems are the most difficult problems to troubleshoot. Beyond the obvious difficulty of reproducing them, the worst part I've found is being certain that the problem is actually gone.

Many intermittent problems actually become consistent once you document all of the steps needed to reproduce them. Sometimes the inconsistency is that the user is doing something slightly different each time.

In any case, knowing that the problem is intermittent is one way to avoid dismissing the problem as "can't reproduce".

Does it act the same on a different computer/browser/office?

This one is much more common with web apps, although it has been getting better in recent years. Each browser has its own capabilities and its own bugs. The problem may not be in the app; it may be in the browser. Determining that up front is a good way to save you from pulling out your hair.

I have seen cases where hardware works differently in different offices. This often turns out to be bad power or noisy wiring. Don't assume that your power or network connections are clean.

Although computers have certainly stabilised in capabilities in the last couple of decades, you can still run into cases where a system has too little memory or a flaky hard drive that causes the problem in question.

Specific Questions

Obviously, there are many more questions that can help when troubleshooting. Anyone supporting a particular app or piece of hardware often discovers questions that apply only to that product.

If your program needs to talk to another server, asking if you can see that server (use ping) may solve a host of problems (no pun intended). Asking if the drive is full may diagnose problems for a disk-intensive program.
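Both of those checks can be scripted. Here is a rough sketch in Python, with a few assumptions stated up front. The function names can_reach and disk_is_nearly_full are made up for illustration, and the 5% free-space threshold is arbitrary. Note also that this swaps ping for a plain TCP connection attempt: ping uses ICMP, which normally needs special privileges to send from a program, and a TCP connect tests the service port the program actually needs.

```python
import shutil
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds.

    Hypothetical helper; a TCP connect stands in for ping here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def disk_is_nearly_full(path, min_free_fraction=0.05):
    """Return True if the free space on the disk holding path is below the threshold.

    The 5% default is an arbitrary value chosen for this sketch.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report no capacity; treat that as full.
        return True
    return usage.free / usage.total < min_free_fraction
```

A support script might call can_reach with the name and port of the server the program depends on, and disk_is_nearly_full with the program's data directory, before anyone starts digging into the code.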

Conclusion

Often the key to good troubleshooting is knowing which questions to ask, rather than knowing what is wrong with the system. If you approach the problem with an assumption that you know what is happening, you can waste time that would have been saved with the right question.

The most important point is to remember that you should start with the really basic questions to save unnecessary troubleshooting effort. Some of the funniest tech support questions I've seen have been resolved after someone finally asks if the hardware is installed or if the software has ever worked.

Posted by GWade at 04:15 PM.