Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | wgould@stata.com (William Gould, StataCorp LP) |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: RE: Random start to random number sequence |
Date | Tue, 17 Aug 2010 16:43:27 -0500 |
Allan Reese posted today on initializing Stata's random-number generator. Allan wrote, > Using the clock has been suggested several time but with a complex > parsing. Here is a simple program that can be added to your ado > library: > > program define randomseed > version 7 > local seed = clock( c(current_time), "hms" ) > set seed `seed' > di "Clock based seed = " c(seed) > end Here's a better one: . display %15.0g tc(`c(current_date' `c(current_time)') 52750000 It's better given that Alan said, "you may simply require a different random sequence each session". If Alan had said, "you may simply require a different seed each day", every bit as good would be . display td(`c(current_date)') 18491 I want to explain why. I'm not desirous of singling out Allan. Indeed, from the way Allan wrote his post, I suspect he already knows most or all of what I'm about to say. The problem with what Alan wrote is not what he said, but what he didn't say. He didn't give all the warnings on how -randomseed- must not be used. I am going to explain how automatic seed setters can be misused. Moreover, I going to explain the underlying issues so that you can generate your own warnings from now on. What's a pseudo-random-number generator? ---------------------------------------- It's not a random-number generator. In fact, it's deterministic. In most computer software -- Stata included -- the pseudo-random-number generator is a formula-based approach that generates a sequence of numbers that could pass for random for some applications. The one in Stata is designed for doing simulations in statistical research. It would not appropriate for generating encryption keys. What's a pseudo-random-number-generator seed? --------------------------------------------- The initial value that start off the formula that manufactures pseudo-random numbers is sometimes referred to as "the seed", although as we will see, that usage is a little sloppy. In fact, in modern pseudo-random-number generators, there is not just a single seed; there is a vector of numbers that describe the state of the system. Think of the formulas as S = (v1, v2, ..., vK) (state of system) ... S(n+1) = F(S_n) (transition) We have a state vector S. From any State vector, we have a function F() which will generate another state vector. Note, we haven't even addressed random-numbers yet. We have another function G(S), which will generate random numbers based on a state vector. Thus, the sequence is, S_0 = (v1_0, v2_0, ... vK_0) S_1 = F(S_0) 1st random number is G(S_1) S_2 = F(S_1) 2nd random number is G(S_2) ... S_0 is in fact the seed of our random number generator, but it's not called the seed. It called the initial state. We have another function, let's call it H(), that creates S_0 from a scalar: S_0 = H(seed) and so here is the entire system in general form: S_0 = H(seed) S_1 = F(S_0) 1st random number is G(S_1) S_2 = F(S_1) 2nd random number is G(S_2) ... In the above, remember that S is a vector. The seed is a scalar, which I will write as s from now on. A pseudo-random-number generator is described by one scalar, the seed, and three functions: H(), F(), and G(). I want to talk about H(), but let me point out that F() is periodic. That is, given any S_0 and F(), eventually there will be an N such that S_N equal to the original S_0. Mathematically, F(S_N) = F(S_(N-1)) = F(F(S_(N-2))) = F^N(S_0). Once that happens, we will get the same "random" numbers we got the first time around. Modern pseudo-random-number generators have a long period. When we choose a seed s we produce the sequence (F(H(s)), F^2(H(s)), F^3(H(s)), ...). In fact, that sequence is just a subset of the sequence (F(H()), F^2(H()), F^3(H()), ..., F^N(H()). In effect, we are selecting an entry into the full sequence generated by F(H()). In this way, we can use the same pseudo-random-number system again and again, or at least we can if n<<N, and if we somehow choose the different seeds appropriately, in a way I will explain. Do the seeds have to be random? ------------------------------- Randomness of the seed is something required only because we are re-using the same pseudo-random-number generator system. If we used the system only once, even to generate billions of random numbers, we could use any seed. s=1 would be a perfectly good choice. It is because we are using the same system again and again that we need to randomize the choice of seed. Even then, randomization is not really required. Modern H() functions attempt to to produce states that are far away from each other even if the seeds are in close proximity. Given a good H() function, you should be able to choose s=1 on Monday, s=2 on Tuesday, and so on, and generate perfectly good random numbers (i.e., the sequences will not be correlated). In fact, you should not tax H()'s in this way because the claim may not be true. It is not uncommon that a pattern is discovered much later after function design when one chooses s=1, s=2, ..., despite the designer's attempt to avoid such patterns. Do seeds have to be long? ------------------------- No. Small seeds a perfectly good. H() is designed to allow seeds over a certain range, and all seeds within the range are expected to perform as well. Is resetting the seed a good idea? ---------------------------------- That very simple question has a complicated answer. First, you must not reset the seed too often. The proof for the limiting case is easy: Say George resets the seed before generating every random number. That is, George, in a quest for really good pseudo-random numbers, sets the seed, generates one pseudo-random number, sets the seed, generates another pseudo-random number, and so on. The sequence of random numbers George obtains is G(F(H(s0))) G(F(H(s1))) G(F(H(s2))) ... where s0, s1, s2, ..., are the seeds George sets. Taking G(F(H())) does nothing for George because it is the same deterministic function every time. George's random numbers are as random as s0, s1, s2, ... were random. I hope George used a good method for selecting his seeds. This next will surprise you until you think about it, but the best way to use a pseudo-random-number generator is to set the seed only once, the day you get it, and to just let it continue on its merry way until you've used it up! Because you set the seed only once, we do not need to discuss randomness. Randomness is a property of sequences of numbers. We will still recommend you set the seed randomly, however, because we will want randomness in numbers generated across researchers. Most pseudo-random-number generator designers would prefer it if you used their generators in this way. But what about reproducibility? ------------------------------- Stata will let you display the current state of the random number generator in an encoded string. That is, Stata display S, not s. You can even use that encoded string to reset S. When you do that, you are resetting the state S directly. H() plays no role. Closing remarks --------------- So what was wrong with Allan's original suggestion? Allen based the seed on the time of day. Let's say Allan gets to the office around the same time every day. Let's assume Allan runs simulations around the same time on days he runs them. Perhaps he starts them right after lunch, or just before going home. Alan is now drawing seeds in close proximity to each other. He is trusting H() to jumble that for him. Moreover, he is drawing from such a reduced set that over a period of time, Allan is likely to choose the same seed! That's why I offered an improvement based on the date and time. The seeds are still ordered, however, so I'm still trusting H(). In my daily set-the-seed suggestion, I just used the number of days since January 1, 1960. I am really trusting to the design of H(). Can't I give you a good, deterministic way to set the seed, no matter how complicated? No. It's the deterministic part that's the problem. Complication has nothing to do with it. Complication may confuse you, but it does not confuse the universe. Here's one way to generate seeds for those of you that want a process. I warn you, it's random: Get a spinner. Spin. Use the digit as the first digit. Spin. Use the digit as the second. And so on. Do it about 12 times. Even biased spinners will do a good enough job. Me, I set the seed using all sorts of methods. Sometimes I just type. That's not very good, but I don't do it often. Sometimes I flip open a thick book, maybe twice. Sometimes, I look at the time. I choose my method, well, at random. It's not really random, but it's pretty good. When I run simulations, however, I don't willy-nilly reset the seed. I set the seed and try to keep it set, using Stata's -display c(seed)- to handle my problems with reproducibility. -- Bill wgould@stata.com * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/