Earlier in this weblog, I listed some of my basic troubleshooting rules. I thought it was probably time to come back and spend a little more time on this topic.
I plan to expand on the points I made earlier and add a few more thoughts along the way.
I left off the most important rule last time.
If you can't reproduce the symptom, you will never fix it. (Or, at least you will never be sure it's fixed.) This one is pretty basic, but I notice people forgetting it all of the time.
I also try to make the distinction between symptom and bug/problem at this point, because the bug itself is almost never visible. What you can see is a symptom of the actual problem. This distinction becomes much more important later.
Almost every problem can be solved fastest through a divide and conquer approach. Try to "trap" the symptom in a smaller and smaller box so that you can see it. Finding a bug in a 10 line function is much easier than finding it in a 100,000 line program.
Different approaches to this technique include generating smaller sets of actions needed to reproduce the problem, and adding logging statements or breakpoints to find code that executes before and after the bug. Any code that you can eliminate reduces the space you have to search for the bug.
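As a minimal sketch of the logging version (step_one() and step_two() are hypothetical stand-ins for whatever the real code does):

    #include <iostream>

    void step_one();   // hypothetical stages of the real work
    void step_two();

    void suspect_routine()
    {
        std::cerr << "before step_one" << std::endl;
        step_one();
        std::cerr << "before step_two" << std::endl;
        step_two();
        std::cerr << "after step_two" << std::endl;
    }

If "before step_two" never shows up, the symptom is trapped inside step_one(). Move the trace statements into that function and repeat until the box is small enough to see the bug.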
A friend of mine who has a lot of experience in information theory pointed this out. One of the good things about a 50/50 test is that whichever way the test goes, you learn the same amount. If a test can divide the program in half, and you can tell which half the bug is in, you now have half the code to check. If you can repeat this, a 1,000,000 line program can be reduced to 1 line in around 20 tests; after all, every programmer should recognize 2^20 as just over a million. (Gives a whole new aspect to the game Twenty Questions, doesn't it?) So each 50/50 test gives you about 1 bit of information.
Often when debugging, it's easy to fall into the trap of testing to see if the bug is in one small piece of code. This turns out to be a really bad idea. If your test would limit the bug to 5% of the program when it succeeds, how much will you learn (on average)? Well, if the test succeeds, you eliminate 95% of the code. If the test fails, you eliminate only 5% of the code. In information theory terms, this test gains you about 0.29 bits (on average), as the calculation below shows. So if the test succeeds, it localizes the problem about as well as 4 simple 50/50 tests. But if it fails, you don't really know much more than you did to start with.
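For the curious, here is the expected-information arithmetic behind that 0.29 figure (just the standard weighted sum over the two outcomes):

    I = 0.05 * log2(1/0.05) + 0.95 * log2(1/0.95)
      ≈ 0.05 * 4.32         + 0.95 * 0.074
      ≈ 0.29 bits

The log2(1/0.05) term, about 4.32 bits, is also where the "4 simple tests" comparison comes from: a success is worth a little over four 50/50 answers, but it only happens 5% of the time.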
Some of the most effective tests I know are tests for things that can't happen. The classic hardware test is to see if the device is turned on/plugged in.
At one point, I was told that some code had a problem because it took a lot longer to complete a remote process with one machine than it did with another. I suggested that we ping the two servers to make sure the network acted the same for both machines. I was told that the network was the same, so it couldn't possibly be the problem. We tested it anyway. It turned out there was a significant difference in the ping times. The network group was then able to find and fix the configuration problem.
Many times, the "can't happen" case is actually the "I don't believe it could do that" case. This may mean that it really can't happen. But it may also indicate a blind spot for that problem. Blind spots are good places to check for bugs. If you can't see it now, it's possible the original programmer didn't see it when coding.
One mistake that many new debuggers make is to try to guess at the bug, fix what they think they've found, and hope it works. A few years ago, I realized that there seem to be three main causes for this behavior.
Not all programmers are driven by all of these causes, and some are affected more strongly by one than by another. The first is the easiest to spot: the programmer does not want to waste time finding and fixing bugs. "There's too much real work to do." Anyone who gets very good at programming eventually learns that debugging is part of the business.
Part of what makes a good programmer is ego. Larry Wall describes this as the great programmer virtue of Hubris. This is basically the belief that you can do it, in spite of evidence to the contrary. If it weren't for this kind of ego, no code would ever get written. Most systems are so complex that if we ever really thought about what we were getting ourselves into, we would run screaming into the night. Fortunately, programmers do suffer from hubris.
However, the process of debugging a system you don't understand is frustrating and humbling. You have no idea where, in thousands of lines of code, the problem lies. To keep from bruising the ego, the programmer will sometimes guess in order to appear to reach a swift conclusion. This approach usually fails and ends up making that programmer look worse.
If we go back to the divide and conquer approach with 50/50 tests, you can reduce the size of the problem to 1/1024 of its original size in about 10 tests. In 10 more tests, the area to search could be less than 1 in a million.
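The arithmetic is just repeated halving:

    (1/2)^10 = 1/1024
    (1/2)^20 = 1/1048576, a bit better than 1 in a million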
In practice, you don't usually get exact halving like that on a real problem. And you can usually spot the problem quite a ways before you get it down to one machine instruction. But the truth is that this kind of steady progress narrows down to the bug more consistently than any other method. Moreover, it's more useful to be able to find any bug than to guess right on a few.
Sometimes, when debugging, you recognize some symptoms or get a hunch about what the problem could be. If you can test it simply, go ahead. But if your guess doesn't pan out, don't keep guessing. This is where many people go wrong in debugging. They spend a lot of time chasing spurious hunches when they should be whittling down the problem space.
The fact that your hunch didn't find it probably means that you don't understand the problem as well as you thought. Don't despair; a few more simple tests may give you the information you need to make better guesses later. More likely, the tests will help you find the bug in a more systematic manner.
Finding the code that generates the symptom is not the hardest part of the problem. Now is the time to identify the real problem. How do you find the root cause? For example, say you have a program written in C++ that slowly increases its memory consumption. This is an obvious symptom of a classic memory leak. After a significant amount of effort, you find the area where the memory is allocated and not freed.
The quick fix is to slap in a delete at the right place and pat yourself on the back. But you haven't really found the root of the problem. Why wasn't the memory freed at that time? Possibly, under some circumstance, that object is used elsewhere. Even if it isn't, the patch is not exception-safe. If an exception occurs before the delete, the memory leak is back. A better idea would be to use some form of smart pointer, like auto_ptr. (Before you get up and scream "garbage collection would have fixed that", I've seen Java programs that leaked memory like crazy. Garbage collection doesn't fix all memory problems.)
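Here's a minimal sketch of the difference, assuming a hypothetical Widget class and a process() function that may throw:

    #include <memory>

    struct Widget { /* ... */ };

    void process(Widget&);   // hypothetical; may throw an exception

    // The quick fix: if process() throws, the delete never runs
    // and the leak is back.
    void quick_fix()
    {
        Widget* w = new Widget;
        process(*w);
        delete w;
    }

    // The better fix: the auto_ptr deletes the Widget when it goes
    // out of scope, whether we leave normally or via an exception.
    void exception_safe()
    {
        std::auto_ptr<Widget> w(new Widget);
        process(*w);
    }

The point isn't auto_ptr specifically; it's that the cleanup is tied to the scope, so no path out of the function can skip it.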
Well, you've fixed the bug. Now is a good time to look in the immediate area for similar mistakes. For many reasons, some pieces of code seem to collect more than their fair share of bugs. If you've found one, there are likely others nearby.
Posted by GWade at January 30, 2004 10:59 PM