

December 16, 2007

Some Thoughts About Generating Random Numbers

In my last essay, I mentioned the Shuffling essay from Jeff Atwood's blog. This time, my essay is a little less philosophical and more technical. Several of the comments on the Shuffling essay talked about the importance of knowing the random number generator you are using.

Generating random number sequences is a particularly interesting subject for me. The comments on that essay contained both good and bad information about generating random sequences. Because of this, I decided to explain some of what I know about the subject, in the hope of giving you a little more to think about.

Random Numbers

Most people don't think about the fact that the term random number is relatively useless. Is the number 17 random? How about 7183? How about 1? Depending on how a number was generated and the context in which it was generated, the answer could be either yes or no.

A better question would be to ask what someone is looking for when they ask for a random number. Sometimes when someone says they need a random number, what they really want is a number that's not related to anything in particular in their system. They probably also want it to be different each time they ask for one. If that is all they need, almost any method of picking numbers is reasonable.

More often, people are looking for a random sequence of numbers.

Random Number Sequences

What do we mean by a random number sequence?

The main attribute that makes a number sequence random is that it is unpredictable. Most of the other attributes of random number sequences boil down to measurements of the difficulty of guessing the next number in the sequence. Some of these measures include periodicity, correlation between values, and various kinds of spectral analysis. Although random sequences are unpredictable, the fact that a sequence is unpredictable (by any given person) does not mean that it is random.

In many cases, a sequence does not actually need to be random. It just needs to be sufficiently random for the purpose. Different problems have different requirements for random sequences. If I need a single number, even a low-quality sequence will do. For a quick game that needs 100 random numbers, we would need a higher-quality sequence. For a Monte Carlo simulation needing thousands of random numbers, we need a high-quality random sequence. For cryptographic applications, only the highest-quality random sequences are acceptable.

How do we measure the quality of a random sequence?

A Measure of Randomness

Information theory measures the information in a process in terms of entropy. A particular sequence has low entropy if not much information is encoded in it. A sequence with high entropy contains a lot of information. In the case of a random number sequence, the harder it is to guess the next number, the higher the entropy.
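For a source that produces the value x with probability p(x), Shannon's definition gives the entropy, in bits, as

    H = -Σ p(x) log₂ p(x)    (summed over all possible values of x)

A fair coin flip carries exactly one bit of entropy; a coin that almost always lands heads carries far less, because its next value is easy to guess.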

This gives us a way to talk about the quality of a random number sequence. A high-entropy sequence is higher quality than a low-entropy sequence.

The problem is that entropy is hard to measure. It is also difficult to harvest entropy from high-quality sources. Most good sources of entropy are difficult to access from a computer without sacrificing some of the entropy. The time between the decays of two nuclei in a radioactive isotope is very unpredictable, but the rate at which we can measure that time is not.

Trying to measure this time as a source of random numbers either wastes entropy by not being precise enough or is very expensive. There is also a maximum rate at which we can extract individual numbers in the sequence. The same holds true for Johnson Noise, Brownian motion, or any other random physical process.

Uncorrelated Numbers

Except in very specialized circumstances, people normally don't need random sequences; they need uncorrelated sequences. Back in 1947, the RAND Corporation was doing a lot of work using Monte Carlo simulations. When doing simulations, you need sequences of numbers with no correlation between them. However, truly random numbers are less useful in a simulation, because there is no way to repeat a particular simulation run. They solved this problem by creating a list of one million random digits that they could reference for their simulations.

The RAND Corporation sold this table in published form for people to use in experiments. Once the table was generated, these numbers were no longer random. Anyone recognizing part of the sequence could predict the next number by looking it up in the table. You could do the same thing with the digits of π, e, or other irrational numbers. (After convincing yourself that the correlation between digits is negligible for your purposes.)

Despite the fact that the RAND table was generated as a high-quality random sequence, it is no longer a random sequence because it has been generated before (and published).
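Modern code gets the same repeatability without a printed table by seeding a deterministic generator with a known value, a preview of the pseudo-random generators discussed next. A minimal C++ sketch (the seed value is arbitrary, and rand() is a low-quality generator used only for illustration):

    #include <cstdio>
    #include <cstdlib>

    int main()
    {
        // Seeding with a fixed value makes the run repeatable.
        std::srand(12345);
        for (int i = 0; i < 5; ++i)
            std::printf("%d\n", std::rand());

        // Re-seeding with the same value replays the same sequence,
        // which is exactly what a repeatable simulation run needs.
        std::srand(12345);
        for (int i = 0; i < 5; ++i)
            std::printf("%d\n", std::rand());
        return 0;
    }

Running this prints the same five numbers twice; record the seed, and you can replay the whole run.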

Pseudo-Random Number Generators

The difficulty of using real randomness in a computer led to the creation of pseudo-random number generators (PRNGs). If we can algorithmically generate an unpredictable-looking sequence of numbers, we get most of the benefits of a random sequence without the downsides. Of course, any algorithmic method of generating a sequence is predictable given enough information.

Algorithmic generators started relatively simple and have become more sophisticated over the years. Interestingly, all PRNGs contain a certain amount of state. This state is a very good representation of the entropy of the generator. Early PRNGs had a single integer as internal state. Current state-of-the-art generators contain much more.
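To make that concrete, here is a sketch of a classic linear congruential generator, one of those early PRNGs; the multiplier and increment are one commonly published parameterization. Its entire internal state is a single 32-bit integer, so anyone who learns that one value can predict every number it will ever produce.

    #include <cstdint>

    // A classic linear congruential generator. The whole state of the
    // generator is one 32-bit integer.
    struct Lcg
    {
        std::uint32_t state;

        explicit Lcg(std::uint32_t seed) : state(seed) {}

        std::uint32_t next()
        {
            // One common LCG parameterization; the modulus of 2^32
            // comes free from 32-bit unsigned overflow.
            state = 1664525u * state + 1013904223u;
            return state;
        }
    };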

Real Random Generators

The best class of random generators today harvests small amounts of entropy from many sources. These generators extract random bits from a pool of entropy to make random numbers. As we extract bits, we use up the entropy, so these generators must harvest new entropy almost continuously. Unfortunately, we still have problems with the speed at which we can acquire new entropy and with storing the entropy we harvest. It turns out that attempting to extract entropy from truly random sources loses some of that entropy in the measurement process.

Most systems don't store all of the entropy they harvest. Instead they use a stirring function to continually merge the new entropy into a pool. At any given point in time, the pool contains a bit less entropy than we have put into it. These stirring functions are carefully designed to mix the new data into the pool in such a way that we don't generate regular patterns.
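As a toy sketch of the stirring idea only (real systems use carefully analyzed cryptographic mixing functions, nothing like this), new bytes might be folded into a fixed-size pool like so:

    #include <cstddef>
    #include <cstdint>

    // Toy entropy pool: each new byte is mixed into a different pool
    // position as pos advances, so no input byte lands in a fixed place.
    class EntropyPool
    {
        std::uint8_t pool[64];
        std::size_t  pos;

    public:
        EntropyPool() : pos(0)
        {
            for (std::size_t i = 0; i < sizeof(pool); ++i)
                pool[i] = 0;
        }

        void stir(const std::uint8_t* data, std::size_t len)
        {
            for (std::size_t i = 0; i < len; ++i)
            {
                // Rotate the existing byte left by one bit, then fold
                // in the new byte with xor.
                std::uint8_t rotated = static_cast<std::uint8_t>(
                    (pool[pos] << 1) | (pool[pos] >> 7));
                pool[pos] = static_cast<std::uint8_t>(rotated ^ data[i]);
                pos = (pos + 1) % sizeof(pool);
            }
        }
    };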

This approach is much more sophisticated than the simple unconnected sound card or lava-lamp type generators of a few years ago. In addition to being a good source of random sequences, there is solid research behind the amount of entropy that can be collected and used this way.

Unfortunately, this approach still has a downside. There is a limit to how fast we can extract random numbers from the pool without depleting its entropy. Different systems may either give lower quality numbers in the sequence for a while or pause until enough entropy has been collected.
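Linux exposes both behaviors: historically, /dev/random pauses (blocks) when the kernel judges its entropy pool depleted, while /dev/urandom keeps producing output from the same pool without waiting for fresh entropy. A sketch that reads a few bytes from the non-blocking device:

    #include <cstdio>
    #include <fstream>

    int main()
    {
        // /dev/urandom never blocks; /dev/random (historically) would
        // pause here whenever the kernel's entropy estimate ran low.
        std::ifstream dev("/dev/urandom", std::ios::binary);
        if (!dev)
            return 1;

        unsigned char buf[16];
        dev.read(reinterpret_cast<char*>(buf), sizeof(buf));

        for (unsigned i = 0; i < sizeof(buf); ++i)
            std::printf("%02x", static_cast<unsigned>(buf[i]));
        std::printf("\n");
        return 0;
    }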

Concluding Thoughts

The universe contains many sources of randomness, but that randomness is difficult to harvest for use in a computer program. Randomness is a more complicated concept than most people realize. As Donald Knuth pointed out in Seminumerical Algorithms (The Art of Computer Programming, Vol. 2), you should never randomly pick a method for generating random numbers.

Posted by GWade at 07:41 PM.

December 09, 2007

The Show Me Response

This week, there was an interesting post on Jeff Atwood's Coding Horror blog, on Shuffling. Shuffling is an application of random numbers, which is a particular interest of mine. I will write more on that subject in an upcoming essay.

One of the comments caught my eye because it echoed a sentiment I have fought several times in my career. The commenter referred to a particular simple solution to the problem and said

Random removal from a mutable array is the same approach I used last time, and the same approach I'll use next time, unless someone shows me why it's wrong.

This answer really bothered me. It wasn't particularly about the solution the person chose, and I would prefer not to pick on a particular person; after all, depending on the domain, it might be a reasonable answer. My problem is with the assumption that it is someone else's job to prove that the simple (naive) solution is not the best one.

I often run into this problem with code written by domain experts. The idea that the naive solution they created in 30 seconds of thought is probably better than the one created by a computer science expert after a great deal of research is baffling to me. The fact that the documented solution is easy to implement and supplied by many libraries makes this comment even more amazing. Just because the computer scientist doesn't know about your domain is no reason to discount their knowledge of the computer science domain.
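In the shuffling case, the documented solution is the Fisher-Yates shuffle: a single O(n) pass that, given an unbiased random source, produces every permutation with equal probability. A sketch:

    #include <cstdlib>
    #include <vector>

    // Fisher-Yates shuffle: walk backward through the array, swapping
    // each element with a uniformly chosen element at or before it.
    void shuffle(std::vector<int>& a)
    {
        for (std::size_t i = a.size(); i > 1; --i)
        {
            // NOTE: rand() % i has slight modulo bias; fine for a
            // sketch, not for anything needing exact uniformity.
            std::size_t j = static_cast<std::size_t>(std::rand()) % i;
            int tmp = a[i - 1];
            a[i - 1] = a[j];
            a[j] = tmp;
        }
    }

Most standard libraries already ship an implementation of this, which is part of the point.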

As a professional programmer, I feel that improving my knowledge of algorithms is part of the craft of writing code. When looking at a new problem, I often try to check my books and online references for better solutions, especially if the problem seems particularly general. After all, many of the general problems in computer science have been studied for 30 or 40 years. Many of the people doing that research are extremely bright. They were also probably pretty knowledgeable about the area of their own research. It is relatively safe to assume that they knew more about the subject than I do.

Granted, our field often involves trade-offs. Sometimes the well-studied general algorithm is not the best solution to the particular problem you are working on. But, without understanding the general algorithm, you can't know that. I have often heard comments along the lines of "our problem is much more complicated than this toy research problem, so we'll use our solution." (As an interesting footnote, almost every place I've worked in about two decades of software development was convinced that it was solving harder problems than everyone else.)

If the person making the comment has carefully considered the algorithm in question and can point to specific issues with the general algorithm, that is one thing. We all make design trade-offs all the time. However, if you don't bother to look at the general algorithm and assume that your approach is good enough, you are missing an opportunity to learn and you are potentially generating worse software in the bargain.

I have repeatedly worked on code where someone implemented a horrible naive algorithm without bothering to look at the implications. For example, a recent performance problem was caused by an O(n²) algorithm that was easily converted into an O(n) algorithm with a small amount of thought. Unfortunately, the original programmer could not be bothered to look for a better approach. After all, on the small test sets he had checked, the naive solution was fine.
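The original code isn't shown here, but the shape of such a conversion is usually something like replacing a nested scan with a single pass and a hash lookup (duplicate detection is an invented stand-in for the actual problem):

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // O(n^2): for each element, scan the rest of the vector for a match.
    bool has_duplicate_slow(const std::vector<int>& v)
    {
        for (std::size_t i = 0; i < v.size(); ++i)
            for (std::size_t j = i + 1; j < v.size(); ++j)
                if (v[i] == v[j])
                    return true;
        return false;
    }

    // O(n) expected: remember what we've seen in a hash set.
    bool has_duplicate_fast(const std::vector<int>& v)
    {
        std::unordered_set<int> seen;
        for (std::size_t i = 0; i < v.size(); ++i)
            if (!seen.insert(v[i]).second) // false => already present
                return true;
        return false;
    }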

I don't think it is unreasonable to expect a professional programmer to improve his or her craft by studying classical solutions to known problems. I also feel that, as professionals, we should welcome the opportunity to improve our understanding and our tools when a problem we are studying turns out to have a standard, well-researched solution. Understanding the solution improves our ability to analyze algorithms in the future. Seeing how the experts build and analyze algorithms helps us improve our ability to do the same.

Deciding that it is someone else's job to prove that the standard solution is better than your quick thought is handing the job of improving your skills to someone else. As a professional programmer, I realize that knowledge is what allows me to do my job. Seeking knowledge and understanding is, therefore, the most important thing I can do to improve my skills as a programmer.

Although I have been called arrogant many times, even I am not so arrogant as to believe that I am smarter and know more than all of the people working in software over the last 40 years.

Posted by GWade at 07:01 PM.

December 01, 2007

Debugging Without a Debugger, Part II

Last week, I wrote about a technique for debugging without using a debugger in Debugging Without a Debugger. I talked a bit about the advantages of instrumenting code and how it can be used to supplement the use of a debugger.

Those of you who have always used a debugger might consider this vaguely interesting, but not particularly critical. I first learned this technique in an environment where we could not have used a debugger even if we had wanted to. Several times in my career I have been in situations where a debugger was not available and this technique was the only way to debug code.

Background Processes

The first debugger-intolerant environment I developed in involved writing a TSR program under DOS that communicated with a foreground program written by another company. The equivalent type of program today would be a daemon process under one of the various flavors of Unix, or a Windows service.

At the time, debuggers did not support attaching to a running process. Since we couldn't start the process under the debugger, there was no way to run the program inside one. Additionally, the program we were working on was communicating with a device that was not under our control, so stopping the program in the debugger would not have been an option even if we could have run under one. That leads us to the next class of programs that do not work well with debuggers.

Real-Time Code

If the program has real-time requirements, running under a debugger may not be an option. Real-time does not always mean lightning fast, but almost all real-time systems have requirements on how long they are allowed to let certain operations wait. Stepping through code in a debugger or stopping on a breakpoint will almost certainly violate these requirements.

If the only real-time requirement is user responsiveness, this is not a real problem. But, if the program is communicating with another program or an external device, pausing in the debugger might cause the whole application to fail or behave mysteriously. The TSR I talked about above communicated with another computer over a special interface card. If the card wasn't serviced in a timely fashion, messages would be lost and the application would fail.

If the program has really tight, hard real-time requirements, even printing to a log might be too disruptive. In those cases, I've seen systems that log to a buffer in memory that is written to disk when the system has time.
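A sketch of that buffered-logging idea (all names invented for illustration): each log call is a cheap copy into a fixed-size ring buffer in memory, and the slow disk write happens later, when the system has slack.

    #include <cstdio>
    #include <cstring>

    // In-memory ring buffer for log lines. log() is cheap enough for
    // tight real-time loops; flush() does the slow I/O later.
    class RingLog
    {
        enum { SLOTS = 256, WIDTH = 128 };
        char lines[SLOTS][WIDTH];
        int  head, count;

    public:
        RingLog() : head(0), count(0) {}

        void log(const char* msg)
        {
            std::strncpy(lines[head], msg, WIDTH - 1);
            lines[head][WIDTH - 1] = '\0';
            head = (head + 1) % SLOTS;
            if (count < SLOTS)
                ++count; // once full, the oldest entries are overwritten
        }

        void flush(std::FILE* out)
        {
            int start = (head - count + SLOTS) % SLOTS;
            for (int i = 0; i < count; ++i)
                std::fprintf(out, "%s\n", lines[(start + i) % SLOTS]);
            count = 0;
        }
    };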

Multi-Threaded Code

Although there are a few debuggers on the market that deal with multi-threaded code, this kind of system plays havoc with debuggers. First of all, there is the question of what happens when a breakpoint is hit. Do we stop all threads or just the one? If we allow the other threads to continue, what happens if a second thread hits a breakpoint while we are looking at the breakpoint on the first thread?

If that isn't confusing enough, think about the changes to the timing of the interactions between threads. Race conditions may appear and disappear at random because of the interactions we are having with one or more threads. How does access to a shared object work when a thread changes the object we are inspecting in the debugger?

Server Code

The next kind of system that I worked on without debugger help was a server in an on-line system. Like many on-line systems, this one had multiple threads to deal with incoming requests. There was a real-time component in the time required to service the request and respond to the client. If that weren't enough, the servers needed to stay running pretty close to 24/7. We could rotate servers into and out of service to load new code and fix problems, but they tended to run for hours, days, or weeks at a time.

A debugger is practically useless in this scenario. How do you watch a breakpoint that is only hit once every few hours? How do you catch problems that only occur on certain kinds of requests when you aren't sure which request triggers the problem?

Instrumentation

In each of these scenarios, we found that by carefully instrumenting the code we were able to troubleshoot and solve problems despite the lack of a visual debugging environment. In some cases, we logged lots of information in the hopes of spotting the problem in the reams of collected data. In other cases, we put very specific instrumentation in place to catch the rare times when the problem occurred.

One benefit of this approach is the ability to bring the full power of your programming language to bear on recognizing a problem and logging the appropriate information. If we knew that the problem was related to a certain area of memory becoming corrupted, we could write a function that tested that area of memory, call the test at various points in the code, and log when the error was detected. Most debuggers today support some form of conditional breakpoint. In most cases, though, they support only counting or simple conditional expressions. Only if your debugger supports a condition based on a function call can it match this feature.
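For example, if a buffer should always end in a known sentinel byte, the check might look like this sketch (the sentinel value and function names are invented for illustration):

    #include <cstddef>
    #include <cstdio>

    const unsigned char SENTINEL = 0xA5;

    // True if the region still ends with the sentinel we wrote there.
    bool region_intact(const unsigned char* buf, std::size_t len)
    {
        return len > 0 && buf[len - 1] == SENTINEL;
    }

    // Call at suspect points; logs where corruption was first noticed.
    void check_region(const unsigned char* buf, std::size_t len,
                      const char* where)
    {
        if (!region_intact(buf, len))
            std::fprintf(stderr, "corruption detected at %s\n", where);
    }

    // Usage: sprinkle check_region(buffer, size, "after parse_input")
    // through the code and watch the log to bracket the corruption.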

You can also easily write a test that saves earlier program state to compare with the current state, to see when things change. For example, you might want to log only in the destructor of the ABC object that was created by the function abc(), not in the other hundred or so ABC objects in the system. Most debuggers support some method of testing whether a small number of simple variables change, but if this information is not already in a variable, most debuggers cannot track the change.
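A sketch of that example (ABC and abc() are hypothetical names from the scenario above): tag the one interesting instance when it is created, and let the destructor log only for that instance.

    #include <cstdio>

    class ABC
    {
        bool traced; // set only on the one instance we care about

    public:
        ABC() : traced(false) {}

        void trace() { traced = true; }

        ~ABC()
        {
            // Logs for exactly one object out of the hundreds in the
            // system; a plain variable watchpoint can't express this.
            if (traced)
                std::fprintf(stderr, "traced ABC destroyed\n");
        }
    };

    ABC* abc()
    {
        ABC* obj = new ABC;
        obj->trace(); // mark the instance created by this function
        return obj;
    }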

With the ability to write arbitrarily complex tests and to log anything that you can access from the code, instrumenting the code is a very powerful technique.

Wrap up

Between this article and the last, I hope I've given you some reasons to consider troubleshooting without a debugger. Maybe the next time you find yourself bouncing on the step or next command in your debugger, you might consider a more automated way to troubleshoot.

Posted by GWade at 03:19 PM.