Statalist


Fw: R: st: odd results after insample


From   "Carlo Lazzaro" <carlo.lazzaro@tin.it>
To   <statalist@hsphsun2.harvard.edu>
Subject   Fw: R: st: odd results after insample
Date   Wed, 30 Sep 2009 18:39:32 +0200

Dear Statalisters,
thanks to Brian Poi, some days ago I solved a problem in drawing random
samples from a given dataset with Stata 9.2/SE.
I would like to share Brian's kind reply with anyone who might be interested
in the same topic.
I also take the chance to thank Martin Weiss once more for his valuable
support along the way.

Kind Regards,
Carlo


-----Original Message-----
From: Brian P. Poi [mailto:bpoi@stata.com]
Sent: Monday, 28 September 2009 18:31
To: Carlo Lazzaro
Subject: Re: R: st: odd results after insample

> I take the chance to ask you whether Stata 9.2 SE (I don't know about other
> more recent releases) can be programmed to run -sample- repeatedly (and not
> just one time) for drawing, say, 10,000 random samples from a given dataset,

Yes, you could do

    . sysuse auto
    . sample 50
    . sample 50
    . sample 50

or put -sample- in a -forvalues- loop.  But you'd have a hard time 
convincing me that's the right thing to do.
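
A minimal sketch of what repeated -sample- calls do to the data in memory, which may be one reason for the caution above: each call draws from whatever the previous call left behind, so the dataset keeps shrinking. The 50 percent rate and the three passes are illustrative assumptions, not part of the original question.

```stata
sysuse auto, clear
set seed 1
* Each pass keeps a random 50% of the observations still in memory
forvalues i = 1/3 {
    sample 50
    * show how many observations are left after this pass
    count
}
```

Wrapping each draw in -preserve- and -restore- avoids the shrinkage, because every draw then starts from the full dataset.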

Or, do you mean something like this:

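set seed 1
sysuse auto
gen mean = .
* auto has 74 observations, so draw 74 samples and store one mean per row
quietly forvalues i = 1/74 {
    preserve
    * -sample 50- keeps a random 50% of the data, drawn without replacement
    sample 50
    summ mpg
    scalar mpgm = r(mean)
    restore
    replace mean = mpgm in `i'
}

su mean
di %20.16f r(mean)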

That is perfectly valid, as long as you keep in mind that -sample- samples 
without replacement.  On the other hand,

sysuse auto, clear
set seed 1
* size(50) draws 50 observations per replication, with replacement
bootstrap mu = r(mean), size(50) reps(74) saving(mybs, replace): summ mpg
use mybs, clear
summ mu
di %20.16f r(mean)

will give you a slightly different answer because -bootstrap- samples with 
replacement.

Thus, the $64,000 question is whether you want to sample with or without 
replacement.
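
If sampling with replacement is what is wanted but an explicit loop is preferred, one hedged option is to swap -sample- for -bsample-, which draws with replacement. This is only a sketch: the variable name bsmean and the scalar name m are made up for illustration, and 50 observations per draw is chosen to mirror size(50) above.

```stata
sysuse auto, clear
set seed 1
gen bsmean = .
quietly forvalues i = 1/74 {
    preserve
    * -bsample 50- draws 50 observations with replacement
    bsample 50
    summarize mpg
    scalar m = r(mean)
    restore
    * data are back in their original order, so row `i' is well defined
    replace bsmean = m in `i'
}
summarize bsmean
```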

*************************************************************************
    ___  ____  ____  ____  ____
   /__    /   ____/   /   ____/                       Brian P. Poi, Ph.D.
  ___/   /   /___/   /   /___/                           Senior Economist
                                                             StataCorp LP
                                                       4905 Lakeway Drive
                                                College Station, TX 77845
                                                           bpoi@stata.com
*************************************************************************

On Mon, 28 Sep 2009, Carlo Lazzaro wrote:

> Dear Brian,
> thanks a lot for your kind reply. I was actually banging my head against the
> wall in trying to understand what went wrong with my code lines and you shed
> light on this.
> I take the chance to ask you whether Stata 9.2 SE (I don't know about other
> more recent releases) can be programmed to run -sample- repeatedly (and not
> just one time) for drawing, say, 10,000 random samples from a given dataset,
> no matter the underlying distribution: in fact, this is the need I am
> currently facing.
> I am very fond of -simulate- as far as my programming skills allow me to
> invoke it, but it requires Stata users to know (or to mimic) the underlying
> distribution of the population.
>
> Thanks a lot again for your kindness and for your time.
>
> Kind Regards,
> Carlo
> -----Original Message-----
> From: Brian P. Poi [mailto:bpoi@stata.com]
> Sent: Monday, 28 September 2009 16:07
> To: Carlo Lazzaro
> Subject: Re: st: odd results after insample
>
> Carlo,
>
> I don't think anyone on statalist actually answered the question of why
> your code doesn't produce 2000 observations like you expect.  It had me
> stumped for a bit, so I just had to try the code myself to figure it out.
>
> Here's why.  In the first part of your loop you randomly sort the data and
> summarize the first 20 observations.  In the second part of your loop you
> try to store the mean and standard deviation in the `i'th observation,
> assuming that `i' runs from 1 to 2000 so that you will fill in the 1st
> observation, then the 2nd, and so on up to the 2000th.  But that won't
> work, because in every iteration of the loop you change the order of your
> data.  Therefore, you essentially are sticking the mean and s.d. into a
> random observation of your dataset.  Given the luck of the draw, some
> observations of ln_g_20 are being filled in more than once, and others
> never do get filled in like you expect.
>
> Also, note that because you generate A for only 972 observations, your
> mean and s.d. will on average be computed using (972/2000)*20 = 9.72
> observations, not 20 observations.
>
> You could make your loop work with -preserve- and -restore, preserve-
> statements or perhaps with some contorted logic, but it's easier to just
> let -simulate- do it.
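
A minimal sketch of the -preserve-/-restore- route mentioned above, under two assumptions that are not in the original do-file: A is generated for all 2000 rows so that every sample has 20 non-missing values, and the scratch names u, m and s are made up for illustration. The shuffle happens inside -preserve-, so by the time the results are stored the data are back in their original order and row `i' is always the i-th row.

```stata
drop _all
set obs 2000
set seed 999
* assumption: fill A in every row, not only in 1/972
gen double A = 5.37 + 1.19*invnorm(uniform())
gen double ln_g_20 = .
gen double ln_sd_g_20 = .
quietly forvalues i = 1/2000 {
    preserve
    * shuffle a scratch copy and summarize the first 20 rows
    gen double u = uniform()
    sort u
    summarize A in 1/20
    scalar m = r(mean)
    scalar s = r(sd)
    restore
    * original order is back, so each result lands in its own row
    replace ln_g_20 = m in `i'
    replace ln_sd_g_20 = s in `i'
}
count if !missing(ln_g_20)
```

With the ordering fixed, the final -count- should report 2000.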
>
> On Sat, 26 Sep 2009, Carlo Lazzaro wrote:
>
>> Dear Statalisters,
>> as an alternative to -simulate-, I have written the following do-file
>> (for Stata 9.2/SE) to draw 2000 random samples, 20 observations each, from a
>> normal distribution:
>>
>> drop _all
>> set more off
>> set obs 2000
>> * (Stata reports: obs was 0, now 2000)
>> g double ln_g_20=.
>> g double ln_sd_g_20=.
>> set seed 999
>> qui gen A=5.37 + 1.19*invnorm(uniform()) in 1/972
>> qui forvalues i = 1(1)2000 {
>> qui gen ln_20`i'=A
>> qui generate random`i' = uniform()
>> qui sort random`i'
>> qui generate insample`i' = _n <= 20
>> qui sum ln_20`i' if insample`i' == 1
>> replace ln_g_20=r(mean)  in `i'
>> replace ln_sd_g_20=r(sd) in `i'
>> drop ln_20`i'
>> drop random`i'
>> drop insample`i'
>> }
>> drop A
>>
>> However, as a result I have obtained 1271 observations instead of the
>> expected 2000.
>>
>> sum ln_g_20 ln_sd_g_20
>>
>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>> -------------+--------------------------------------------------------
>>      ln_g_20 |      1271    5.314033    .3800687    3.79247   6.587941
>>   ln_sd_g_20 |      1271    1.101084    .2835007   .0260279   2.161299
>>
>>
>> Besides, results are even more puzzling when I increase the number of
>> samples (again 20 observations each), in that I get a different number of
>> observations for ln_g and ln_sd_g.
>>
>> Comments are gratefully acknowledged.
>>
>> Thanks a lot for your kindness and for your time.
>>
>> Kind Regards,
>> Carlo
>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>
>


