Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: RE: Random start to random number sequence

From	[email protected] (William Gould, StataCorp LP)
To	[email protected]
Subject	Re: st: RE: Random start to random number sequence
Date	Tue, 17 Aug 2010 16:43:27 -0500
Allan Reese posted today on initializing Stata's random-number
generator.  Allan wrote,

> Using the clock has been suggested several time but with a complex
> parsing.  Here is a simple program that can be added to your ado
> library:
> 
>         program define randomseed
>            version 7
>            local seed = clock( c(current_time), "hms" )
>            set seed  `seed'
>            di "Clock based seed = " c(seed)
>         end     

Here's a better one:

        . display %15.0g tc(`c(current_date' `c(current_time)')
               52750000

It's better given that Alan said, "you may simply require a different
random sequence each session".  If Alan had said, "you may simply
require a different seed each day", every bit as good would be

        . display td(`c(current_date)')
        18491

I want to explain why.

I'm not desirous of singling out Allan.  Indeed, from the way Allan
wrote his post, I suspect he already knows most or all of what I'm
about to say.

The problem with what Alan wrote is not what he said, but what he
didn't say.  He didn't give all the warnings on how -randomseed- must
not be used.

I am going to explain how automatic seed setters can be misused.
Moreover, I going to explain the underlying issues so that you can
generate your own warnings from now on.


What's a pseudo-random-number generator?
----------------------------------------

It's not a random-number generator.  In fact, it's deterministic.
In most computer software -- Stata included -- the
pseudo-random-number generator is a formula-based approach that
generates a sequence of numbers that could pass for random for some
applications.  The one in Stata is designed for doing simulations in
statistical research.  It would not appropriate for generating 
encryption keys.


What's a pseudo-random-number-generator seed?
---------------------------------------------

The initial value that start off the formula that manufactures
pseudo-random numbers is sometimes referred to as "the seed", although
as we will see, that usage is a little sloppy.  In fact, in modern
pseudo-random-number generators, there is not just a single seed;
there is a vector of numbers that describe the state of the system.
Think of the formulas as

              S = (v1, v2, ..., vK)       (state of system) 

              ...

              S(n+1) = F(S_n)             (transition)

We have a state vector S.  From any State vector, we have a function
F() which will generate another state vector.  Note, we haven't even
addressed random-numbers yet.  We have another function G(S), which
will generate random numbers based on a state vector.  Thus, the
sequence is,


              S_0 = (v1_0, v2_0, ... vK_0) 

              S_1 = F(S_0)
              1st random number is G(S_1)

              S_2 = F(S_1)
              2nd random number is G(S_2)

              ...


S_0 is in fact the seed of our random number generator, but it's not
called the seed.  It called the initial state.  We have another
function, let's call it H(), that creates S_0 from a scalar:

              S_0  = H(seed)

and so here is the entire system in general form:

              S_0  = H(seed)

              S_1 = F(S_0)
              1st random number is G(S_1)

              S_2 = F(S_1)
              2nd random number is G(S_2)

              ...

In the above, remember that S is a vector.

The seed is a scalar, which I will write as s from now on.

A pseudo-random-number generator is described by one scalar, the seed,
and three functions:  H(), F(), and G().

I want to talk about H(), but let me point out that F() is periodic.
That is, given any S_0 and F(), eventually there will be an N such
that S_N equal to the original S_0.  Mathematically, F(S_N) =
F(S_(N-1)) = F(F(S_(N-2))) = F^N(S_0).  Once that happens, we will get
the same "random" numbers we got the first time around.

Modern pseudo-random-number generators have a long period.  

When we choose a seed s we produce the sequence (F(H(s)), F^2(H(s)),
F^3(H(s)), ...).  In fact, that sequence is just a subset of the
sequence (F(H()), F^2(H()), F^3(H()), ..., F^N(H()).  In effect, we
are selecting an entry into the full sequence generated by F(H()).
In this way, we can use the same pseudo-random-number system again and
again, or at least we can if n<<N, and if we somehow choose the
different seeds appropriately, in a way I will explain.


Do the seeds have to be random?
-------------------------------

Randomness of the seed is something required only because we are 
re-using the same pseudo-random-number generator system.

If we used the system only once, even to generate billions of random
numbers, we could use any seed.  s=1 would be a perfectly good choice.

It is because we are using the same system again and again that we 
need to randomize the choice of seed.

Even then, randomization is not really required.  Modern H() functions
attempt to to produce states that are far away from each other even if
the seeds are in close proximity.  Given a good H() function, you
should be able to choose s=1 on Monday, s=2 on Tuesday, and so on, and
generate perfectly good random numbers (i.e., the sequences will not
be correlated).

In fact, you should not tax H()'s in this way because the claim may
not be true.  It is not uncommon that a pattern is discovered much
later after function design when one chooses s=1, s=2, ..., despite
the designer's attempt to avoid such patterns.


Do seeds have to be long?
-------------------------

No.  Small seeds a perfectly good.  H() is designed to allow seeds 
over a certain range, and all seeds within the range are expected 
to perform as well.


Is resetting the seed a good idea?
----------------------------------

That very simple question has a complicated answer. 

First, you must not reset the seed too often.  The proof for the
limiting case is easy:

Say George resets the seed before generating every random number.
That is, George, in a quest for really good pseudo-random numbers, 
sets the seed, generates one pseudo-random number, sets the seed, 
generates another pseudo-random number, and so on.

The sequence of random numbers George obtains is

        G(F(H(s0)))
        G(F(H(s1)))
        G(F(H(s2)))
	...

where s0, s1, s2, ..., are the seeds George sets.

Taking G(F(H())) does nothing for George because it is the same 
deterministic function every time. 

George's random numbers are as random as s0, s1, s2, ...
were random.  I hope George used a good method for selecting his 
seeds.

This next will surprise you until you think about it, but the best way
to use a pseudo-random-number generator is to set the seed only once,
the day you get it, and to just let it continue on its merry way until
you've used it up!  Because you set the seed only once, we do not need
to discuss randomness.  Randomness is a property of sequences of
numbers.  We will still recommend you set the seed randomly, however,
because we will want randomness in numbers generated across
researchers.

Most pseudo-random-number generator designers would prefer it if you 
used their generators in this way.


But what about reproducibility?
-------------------------------

Stata will let you display the current state of the random number
generator in an encoded string.  That is, Stata display S, not s.  You
can even use that encoded string to reset S.  When you do that, you
are resetting the state S directly.  H() plays no role.


Closing remarks
---------------

So what was wrong with Allan's original suggestion?

Allen based the seed on the time of day.  Let's say Allan gets to the
office around the same time every day.  Let's assume Allan runs
simulations around the same time on days he runs them.  Perhaps he
starts them right after lunch, or just before going home.  Alan is now
drawing seeds in close proximity to each other.  He is trusting H() to
jumble that for him.  Moreover, he is drawing from such a reduced set
that over a period of time, Allan is likely to choose the same seed!

That's why I offered an improvement based on the date and time.  
The seeds are still ordered, however, so I'm still trusting H().

In my daily set-the-seed suggestion, I just used the number of days 
since January 1, 1960.  I am really trusting to the design of H().

Can't I give you a good, deterministic way to set the seed, no matter 
how complicated?  No.  It's the deterministic part that's the problem.
Complication has nothing to do with it.  Complication may confuse you, 
but it does not confuse the universe.  Here's one way to generate seeds
for those of you that want a process.  I warn you, it's random:

         Get a spinner.  Spin.  Use the digit as the first digit. 
         Spin.  Use the digit as the second.  And so on.  Do it 
         about 12 times.  Even biased spinners will do a good 
         enough job.

Me, I set the seed using all sorts of methods.  Sometimes I just type.
That's not very good, but I don't do it often.  Sometimes I flip open
a thick book, maybe twice.  Sometimes, I look at the time.  I choose
my method, well, at random.  It's not really random, but it's pretty
good.

When I run simulations, however, I don't willy-nilly reset the seed.
I set the seed and try to keep it set, using Stata's -display c(seed)- 
to handle my problems with reproducibility.

-- Bill
[email protected]
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Prev by Date: Re: st: Current Population Survey and Stata, 2003-2005
Next by Date: Re: st: Re: test of the equality of coefficients in two models
Previous by thread: st: Current Population Survey and Stata, 2003-2005
Next by thread: Re: st: RE: Random start to random number sequence
Index(es):
- Date
- Thread