Re: st: Default Seed of Stata 12


From   "William Gould, StataCorp LP" <wgould@stata.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Default Seed of Stata 12
Date   Mon, 29 Oct 2012 12:58:11 -0500

Stephen Jenkins <S.Jenkins@lse.ac.uk> asked, 

SJ> Given the "infinitely long" sequence which repeats, and Bill's
SJ> reference to "entry points", does it ever matter what number one
SJ> chooses to be the initial seed and hence enters the sequence?

Stas Kolenikov <skolenik@gmail.com> replied 

SK> Pretty much every number is as good as any other number for a 
SK> starting value.

Right, but still be cautious.  For reasons I didn't explain, the seed
needs to be roughly random.  Never do this:

        local seed = 11682614

        forvalues i = 1(1)5 {
                local seed = `seed' + 1
                set seed `seed'
                do simulation`i' 
        }

In addition, do not set the seed too often.  The reason for this 
recommendation is explained in -help seed-.  

In the above example, I do not need to set the seed in each simulation
to ensure reproducibility.  A better solution would be,

        set seed 11682614
        forvalues i = 1(1)5 {
                display c(seed)
                do simulation`i'
        }

The simulations I usually run produce datasets; I analyze the results
separately.  For instance, a simulation might create 10,000 samples
and estimate coefficients in each.  Each of the simulation#.do files
would then create a 10,000-observation dataset of coefficients.

I organize my problem like this:

        simulation1.do:
                do simulation 1 10000 1000 10 5 0 0.0

        simulation2.do:
                do simulation 2 10000 1000 10 5 0 0.2

        ...

        simulation5.do:
                do simulation 5 10000 1000 10 5 0 0.8

The simulation#.do files merely specify the parameters that control 
the simulation.  File simulation.do will perform the simulation using 
those parameters.

Importantly, notice that none of the simulation#.do files set the
random-number seed.  Neither does simulation.do.  It looks like this:

        simulation.do:
                args simul_no N n alpha0 alpha1 alpha2 rho

                local initial_seed = c(seed)

                ...   // Code that actually performs the simulations
                ...   // goes here.
                ...   // The result of running simulation.do is 
                ...   // to produce an N-observation dataset of 
                ...   // results.  Each observation of this dataset 
                ...   // is itself based on running some particular 
                ...   // statistical problem on a separate n-observation 
                ...   // dataset created with parameters alpha0, ..., rho. 

                /*
                    After all simulations are run and the N-observation 
                    dataset of results is created, the code ends with 
                */

                note: File results`simul_no' 
                note: Ran simulation `0'
                note: Seed was `initial_seed'

                save results`simul_no', replace

and then the file that drives all of this is just as I showed you:

        set seed 11682614
        forvalues i = 1(1)5 {
                display c(seed)
                do simulation`i'
        }

I set the seed once for all the simulations I plan to run overnight, or
over the weekend.  One simulation follows on the heels of the next and
each merely continues to use the random-number generator.  In
simulation.do, however, I do save the value of c(seed) as it
was at the start of the simulation in case I should ever need to re-run
the simulation.

Say I want to rerun simulation3.do.  Here's what I do:

        . use results3

        . notes 

        _dta:
        1.  File results3
        2.  Ran simulation 3 10000 1000 10 5 0 0.4
        3.  Seed was X075bcd151f123bb5159a55e50022865700043e55


        . clear all

        . set seed X075bcd151f123bb5159a55e50022865700043e55

        . do simulation 3 10000 1000 10 5 0 0.4

I use copy-and-paste to reset the seed and repeat the simulation command.
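
Why this works can be seen in a minimal sketch.  It assumes Stata 12, 
where c(seed) returns the generator's current state as a string code 
beginning with X and -set seed- accepts that code.  Later releases may 
expose the state as c(rngstate) and restore it with -set rngstate-, so 
treat the exact commands as version-dependent.  The names state, u1, 
and u2 are made up for illustration.

```stata
clear all
set seed 11682614
local state = c(seed)           // Stata 12: a string code beginning with X

set obs 5
generate double u1 = runiform() // first set of draws

set seed `state'                // put the generator back where it was
generate double u2 = runiform() // same draws again

assert u1 == u2                 // fails if the sequence was not reproduced
```

Resetting to the recorded state replays the same random numbers, which 
is all that rerunning a simulation from its saved note relies on.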

For most readers, reproducibility is important to prove you did things
right, or to fix them later when you did them wrong.  You probably do
not test reproducibility often, but it's reassuring to know that you
could reproduce results.

For my work, I often need to reproduce results. 

Say I'm working on a new estimation command for Stata.  The steps are

    1.  Write code; get it into shape where I think it is working.

    2.  Prove that the code produces correct results by running 
        simulations on artificial datasets based on assumed 
        parameters, and then verify that the estimated coefficients 
        are on average correct and verify that the coverage is 
        correct.  This last step is about verifying that the true 
        coefficients lie outside 95% confidence intervals 5% of the time.

    3.  Write a certification script that casts in cement the answers 
        produced by the code on a sample of problems, thus ensuring 
        that results are the same across computers as they are on mine 
        today, and that results do not change in the future. 
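
Step 2 can be sketched on a toy problem.  What follows is a minimal 
illustration, not the verification code for any real command: it 
assumes a linear regression with a true slope of 2, and the program 
name onesample and all variable names are hypothetical.  It uses 
-simulate- rather than the do-file arrangement described above only to 
keep the example short.

```stata
clear all
set seed 11682614               // set once, at the outset

* Hypothetical program: draw one artificial sample and fit the model.
program define onesample, rclass
        drop _all
        set obs 100
        generate double x = rnormal()
        generate double y = 1 + 2*x + rnormal()   // true slope is 2
        regress y x
        return scalar b  = _b[x]
        return scalar se = _se[x]
        return scalar df = e(df_r)
end

* One observation of results per replication.
simulate b=r(b) se=r(se) df=r(df), reps(1000) nodots: onesample

* Flag replications whose 95% confidence interval misses the truth.
generate byte outside = abs(b - 2) > invttail(df, 0.025)*se

summarize b outside
```

If the estimator is behaving, the mean of b should be close to 2 and 
the mean of outside close to 0.05.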

We'll skip step 3.  Step 2 is about running simulations.  But between 
steps 1 and 2, something else happens.

What happens is that I think I've started step 2, but simulation3,
replication 4,233 blows up because my code has a bug or my code is just
not up to handling that particular problem.  I have to figure out
which, and fix it, so I need the dataset for simulation3,
replication 4,233.

Following the procedure outlined above, I could rerun simulation3.do.
I could type 

        . clear all

        . set seed X075bcd151f123bb5159a55e50022865700043e55

        . set more off

        . do simulation 3 10000 1000 10 5 0 0.4

and just wait as 4,232 replications run by and replication 4,233 begins.
And blows up.  

Actually, I do better than that because in simulation.do, although I
didn't show you this, I display the c(seed) associated with each
replication.  So I look at the log and discover replication 4,233 began
when seed was Xe0b8c1f53fc5940d2f25041582eadac200043d25.

So I type 

        . set seed Xe0b8c1f53fc5940d2f25041582eadac200043d25

        . do simulation 3 1 1000 10 5 0 0.4
                         /
                        /
                 I set N, the number of replications, to 1.

I'm able to jump right to the problem.
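
The per-replication display is not shown in simulation.do above.  A 
hypothetical sketch of how the replication loop might log it follows.  
The loop body is only a placeholder that draws a sample and summarizes 
it, not the actual simulation code, and the macro name rep is an 
assumption.

```stata
* Hypothetical sketch: log the generator state at the start of each replication.
local N = 3                     // number of replications (tiny, for illustration)
local n = 1000                  // observations per replication

forvalues rep = 1(1)`N' {
        display "replication `rep' began with seed " c(seed)

        * Placeholder for the real work: draw one sample and use it.
        quietly drop _all
        quietly set obs `n'
        generate double x = rnormal()
        quietly summarize x
}
```

Searching the log for the replication number then gives the code to 
paste after -set seed-.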

My point is that you do not have to specify seeds often to ensure 
reproducibility.  Instead, display or record in the data the 
value of c(seed).

You do need to specify the seed at the outset.  Stas Kolenikov
suggested,

SK>    1. use today's date: set seed 20121026
SK> 
SK>    2. pull a bill out of your pocket, and copy its numbers
SK>
SK>    3. take a look at your RSA key and use the digits from there
SK>       (sh-h-h... I hope my IT department is not listening to this)
SK>
SK>    4. use an actual random number from random.org
SK>
SK>    5. use a Dilbert-like random number generator
SK>       (http://dilbert.com/strips/comic/2001-10-25/)

I love suggestion #3 because I carry an RSA key and yet this never 
occurred to me!  

I do not suggest you use suggestion #1.  There's nothing wrong with the 
suggestion as presented, but the problem is that you will be tempted 
to use it tomorrow, and the next day, and so on.  

-- Bill
wgould@stata.com