Statalist The Stata Listserver



Re: st: random selection across files


From   "Austin Nichols" <[email protected]>
To   [email protected]
Subject   Re: st: random selection across files
Date   Thu, 25 Jan 2007 10:33:00 -0500

This may have been answered already, but if so, I missed it.

I think this problem is easy if you simply -merge- datasets with no
matching variable (i.e., one-to-one by observation number).  After
sampling the first dataset, you have 3000
obs with some distribution of s2 and you can tab s2 to find out what
distribution of s2 the merged file should have.  Sort s2 and save the
first file (with a new name to indicate it is now 3000 obs instead of
1.3e6 obs).  Now the trick is to generate a random sample (with
replacement) from the second file that has exactly the same
distribution of s2.  One way is to use -bsample- with the -strata-
option and some extra futzing, but the -gsample- command available
from ssc is more expedient.  From its help file: "For stratified
sampling, # units will be selected from each stratum identified by the
strata() option. Alternatively, specify varname instead of #, where
varname is a variable containing for each stratum a specific sample
size. varname is assumed to be constant within strata."

Here's an illustrative example:

ssc install gsample
clear
set seed 12345
*make artificial 1st dataset
range s2 1 12 1300000
replace s2=round(s2,1)
gen s1=ceil(uniform()*3)
tab s2 s1
*now implement plan of attack
sample 1000, count by(s1)
tab s2 s1
sort s2
compress
save /test1, replace
collapse (count) countvar=s1, by(s2)
sort s2
compress
save /countvar, replace
*now make artificial 2nd dataset
clear
range s2 1 12 200
replace s2=round(s2,1)
gen extravar=uniform()
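sort s2
*attach each stratum's target sample size (countvar) to the 2nd dataset
merge s2 using /countvar
tab _m
drop _m
*draw countvar obs from each s2 stratum
gsample countvar, strata(s2)
sort s2
ren s2 test
*merge by observation number; both files are sorted by s2
merge using /test1
tab _m
*diff should be zero everywhere if the s2 strata line up
g diff=test-s2
su
keep s1 s2 extravar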
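
For anyone without -gsample-, the same idea can probably be done with built-in commands only: number the observations within each s2 stratum of the small file, draw a random row number for each sampled observation, and merge on s2 plus that row number. This is only a sketch, not the code above. It assumes Stata 11 or later (for -merge m:1-), and the names pick, n2, small, and sizes, as well as the reduced 30,000-obs first dataset, are made up for illustration.

```stata
* Sketch only: all names and dataset sizes here are illustrative assumptions
clear
set seed 12345

* artificial 2nd (small) dataset: 200 obs, s2 takes values 1..12
set obs 200
gen s2 = ceil(12 * _n / _N)
gen extravar = runiform()
* number the rows within each s2 stratum
bysort s2: gen long pick = _n
tempfile small sizes
save `small'

* one row per stratum holding the stratum size n2
collapse (count) n2 = pick, by(s2)
save `sizes'

* artificial 1st dataset (kept small here so the sketch runs quickly)
clear
set obs 30000
gen s1 = floor(3 * runiform()) + 1
gen s2 = floor(12 * runiform()) + 1
sample 1000, count by(s1)

* attach the stratum size, then draw a row number in 1..n2 for each obs
merge m:1 s2 using `sizes', keep(match) nogenerate
gen long pick = floor(runiform() * n2) + 1

* fetch the randomly chosen row from the same s2 stratum of the small file
merge m:1 s2 pick using `small', keep(match) nogenerate
tab s2 s1
su extravar
```

Because each observation draws its own row number independently, this samples with replacement within s2. It also assumes every s2 value in the sample occurs in the small file; otherwise keep(match) would silently drop those observations.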

On 1/23/07, Richard Goldstein <[email protected]> wrote:
I have a data set (about 1.3 million lines) that is divided
into sets of strata (call them s1 w/3 categories and s2 w/12
categories).  I want to randomly draw a sample of 1000 from
each of s1 (the 3 category) -- no problem.  Then I want to
take this sample of 3000 and go to another, much smaller, file
that has the s2 stratification (12 categories) and randomly
select, with replacement, for each of the 3000 one piece of
information from the same s2 stratum.

It is the issue of going to the second file and grabbing one
piece of information and taking it back to the first file
that is causing me a problem.