In my last essay, I mentioned the Shuffling essay from Jeff Atwood's blog. This time, my essay is a little less philosophical and more technical. Several of the comments on the Shuffling essay talked about the importance of knowing the random number generator you are using.
Generating random number sequences is a particularly interesting subject for me. The comments on that essay contained both good and bad information about generating random sequences. Because of this, I decided to explain some of what I know about the subject in the hopes of giving a little more to think about.
Most people don't think about the fact that the term random number is relatively useless on its own. Is the number 17 random? How about 7183? How about 1? Depending on how a number was generated and the context in which it is used, we could answer either yes or no.
A better question would be to ask what someone is looking for when they ask for a random number. Sometimes when someone says they need a random number, what they really want is a number that's not related to anything in particular in their system. They probably also want it to be different each time they ask for one. If that is all they need, almost any method of picking numbers is reasonable.
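For example (a minimal Python sketch; the function name is mine), even the low bits of a high-resolution clock can serve:

    import time

    # A throwaway "random enough" number: the low bits of a
    # high-resolution clock. Different on nearly every call and
    # unrelated to the rest of the program, but trivially
    # predictable. Fine for a temporary file suffix; useless for
    # anything that actually demands randomness.
    def quick_number(upper=100):
        return time.time_ns() % upper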
More often, people are looking for a random sequence of numbers.
What do we mean by a random number sequence?
The main attribute that makes a number sequence random is that it is unpredictable. Most of the other attributes of random number sequences boil down to measurements of the difficulty of guessing the next number in the sequence. Some of these measures include periodicity, correlation between values, and various kinds of spectral analysis. Although random sequences are unpredictable, the fact that a sequence is unpredictable (by any given person) does not mean that it is random.
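Correlation between successive values is one of the easier of these measurements to compute. Here is a rough sketch in Python of a lag-1 serial correlation test (my own naive version, not a standard library routine):

    import random

    # Lag-1 serial correlation: how strongly each value predicts the
    # next one. Near 0 means consecutive values look unrelated;
    # near +/-1 means each value says a lot about its successor.
    def serial_correlation(seq):
        x, y = seq[:-1], seq[1:]
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5

    print(serial_correlation([random.random() for _ in range(10000)]))  # near 0
    print(serial_correlation(list(range(10000))))                       # near 1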
In many cases, a sequence does not need to actually be random. It just needs to be sufficiently random for the purpose. Different problems have different requirements for random sequences. If we need a single number, even a low-quality sequence will do. For a quick game that needs 100 random numbers, we would need a higher-quality sequence. For a Monte Carlo simulation needing thousands of random numbers, we need a high-quality random sequence. For cryptographic applications, only the highest-quality random sequences are acceptable.
How do we measure the quality of a random sequence?
Information theory measures the information in a process in terms of entropy. A particular sequence has low entropy if not much information is encoded in it. A sequence with high entropy contains a lot of information. In the case of a random number sequence, the harder it is to guess the next number, the higher the entropy.
This gives us a way to talk about the quality of a random number sequence. A high-entropy sequence is higher quality than a low-entropy sequence.
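As a deliberately naive illustration, here is a frequency-based entropy estimate in Python (real entropy estimation is much subtler than this):

    import math
    from collections import Counter

    # Estimate bits of entropy per symbol from symbol frequencies
    # alone. Note the weakness: this ignores order entirely, so a
    # perfectly predictable sequence can still score well.
    def entropy_per_symbol(seq):
        total = len(seq)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(seq).values())

    print(entropy_per_symbol(b"aaaaaaaa"))        # ~0 bits: one symbol, no surprise
    print(entropy_per_symbol(bytes(range(256))))  # 8.0 bits, yet fully predictable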
The problem is that entropy is hard to measure. It is also difficult to harvest entropy from high-quality sources. Most good sources of entropy are difficult to access from a computer without sacrificing some of the entropy. The time between the decays of two nuclei in a radioactive isotope is very unpredictable, but the rate and precision with which we can measure that time are not.
Trying to measure this time as a source of random numbers either wastes entropy by not being precise enough or is very expensive. There is also a maximum rate at which we can extract individual numbers in the sequence. The same holds true for Johnson noise, Brownian motion, or any other random physical process.
Except in very specialized circumstances, people normally don't need random sequences; they need uncorrelated sequences. Back in 1947, the RAND Corporation was doing a lot of work using Monte Carlo simulations. When doing simulations, you need sequences of numbers with no correlation between the numbers. However, truly random numbers are less useful in a simulation, because there is no way to repeat a particular simulation run. They solved this problem by creating a list of one million random digits that they could reference for their simulations.
The RAND Corporation sold this table in published form for people to use in experiments. Once the table was generated, these numbers were no longer random. Anyone recognizing part of the sequence could predict the next number by looking it up in the table. You could do the same thing with the digits of π, e, or other irrational numbers. (After convincing yourself that the correlation between digits is negligible for your purposes.)
Although the RAND table was generated as a high-quality random sequence, it is no longer a random sequence because it has been generated before (and published).
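The modern descendant of the RAND table is the seeded generator: the same seed replays exactly the same sequence, so a simulation run can be repeated at will. A quick Python sketch:

    import random

    # A fixed seed makes a "random" simulation run repeatable,
    # which is exactly the property RAND wanted from its table.
    def simulate(seed, trials=5):
        rng = random.Random(seed)    # an independent generator
        return [rng.random() for _ in range(trials)]

    assert simulate(42) == simulate(42)   # identical seeds, identical runs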
The difficulty of using real randomness in a computer led to the creation of pseudo-random number generators (PRNGs). If we can algorithmically generate an unpredictable sequence of numbers, we get most of the benefits of a random sequence without the downsides. Of course, any algorithmic method of generating a sequence is predictable given enough information, so a PRNG can never be truly random.
Algorithmic generators started relatively simple and have become more sophisticated over the years. Interestingly, all PRNGs contain a certain amount of internal state, and that state is a good measure of how much entropy the generator can hold. Early PRNGs had a single integer as internal state; current state-of-the-art generators contain much more.
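A linear congruential generator shows just how little state the early PRNGs carried. This sketch uses the textbook constants from the sample rand() in the C standard:

    # A linear congruential generator: the entire state of the
    # generator is one integer. Knowing that integer at any point
    # makes every later value perfectly predictable.
    class LCG:
        def __init__(self, seed):
            self.state = seed

        def next(self):
            self.state = (1103515245 * self.state + 12345) % 2**31
            return self.state

    rng = LCG(1)
    print([rng.next() for _ in range(3)])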
The best class of random generators today harvests small amounts of entropy from many sources. These generators extract random bits from a pool of entropy to make random numbers. As we extract bits, we use up the entropy, so these generators must harvest new entropy more or less continuously. Unfortunately, we still have problems with the speed at which we can acquire new entropy and with storing the entropy we harvest. As noted above, attempting to extract entropy from really random sources loses some of that entropy in the measurement process.
Most systems don't store all of the entropy they harvest. Instead they use a stirring function to continually merge the new entropy into a pool. At any given point in time, the pool contains a bit less entropy than we have put into it. These stirring functions are carefully designed to mix the new data into the pool in such a way that we don't generate regular patterns.
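A toy version of this idea, assuming a cryptographic hash as the stirring function (real kernels use more careful constructions and also track how much entropy the pool is believed to hold):

    import hashlib
    import time

    # A toy stirred entropy pool. The hash spreads each new sample's
    # influence across every bit of the pool without creating
    # regular patterns; extraction also consumes entropy, so the
    # pool is re-stirred afterward.
    class EntropyPool:
        def __init__(self):
            self.pool = b"\x00" * 32

        def stir(self, sample):
            self.pool = hashlib.sha256(self.pool + sample).digest()

        def extract(self, n=16):
            out = hashlib.sha256(b"out" + self.pool).digest()[:n]
            self.stir(b"extract")     # account for the entropy we removed
            return out

    pool = EntropyPool()
    pool.stir(time.time_ns().to_bytes(8, "big"))   # e.g., timing jitter
    print(pool.extract().hex())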
This approach is much more sophisticated than the simple generators of a few years ago that read noise from an unconnected sound card or pointed a camera at lava lamps. In addition to being a good source of random sequences, there is solid research behind the amount of entropy that can be collected and used this way.
Unfortunately, this approach still has a downside. There is a limit to how fast we can extract random numbers from the pool without depleting its entropy. Different systems handle depletion differently: some give lower-quality numbers for a while, others pause until enough entropy has been collected.
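Linux shows both behaviors: reading /dev/random can pause until more entropy arrives, while /dev/urandom keeps producing output from the same pool without waiting. Python's os.urandom() uses the non-blocking source:

    import os

    # os.urandom() draws from the operating system's entropy pool
    # without blocking; reading /dev/random directly can pause when
    # the kernel's entropy estimate runs low.
    nonce = os.urandom(16)
    print(nonce.hex())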
The universe contains many sources of randomness, but that randomness is difficult to harvest for use in a computer program. Randomness is a more complicated concept than most people realize. As Donald Knuth pointed out in Seminumerical Algorithms (The Art of Computer Programming, Vol. 2), you should never randomly pick a method for generating random numbers.
Posted by GWade at December 16, 2007 07:41 PM