Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Creating a second output data set

From	Steven Samuels <[email protected]>
To	[email protected]
Subject	Re: st: Creating a second output data set
Date	Sat, 10 Sep 2011 08:05:36 -0400

The last one was not correct either.  Here's the working version, which replicates the examples in Levy & Lemeshow, p.334 as stratum "1".  Code from http://www.ats.ucla.edu/stat/stata/faq/allpairs_faq.htm
was helpful. Generalizing to n-tuples

Steve

*************CODE BEGINS*************
/* Calculate joint & marginal probabilities 
   for sampling with unequal probabilities without replacement
   with draw selections proportional to size
   formula 11.1. p. 334 
   Levy & Lemeshow, Sampling of Populations (2008)
   Wiley. */
clear
input stratum psu size
1  1 3000
1  2  4000
1  3  10000
2  1 8000
2  2 1500
end
egen tsize = total(size), by(stratum)


tempfile t0
save `t0'
gen psu1 = psu
gen size1 = size
tempfile t1
keep stratum tsize psu1 size1
save `t1'

use `t0'
gen psu2 = psu
gen size2 = size
tempfile t2
keep stratum psu2 size2
save `t2'

use `t1'
joinby stratum using `t2'
sort stratum psu1 psu2

drop if psu1 ==psu2
// http://www.ats.ucla.edu/stat/stata/faq/allpairs_faq.htm
egen d1 = concat(psu1 psu2)  
egen d2 = concat(psu2 psu1)
order stratum psu1 psu2 size1 size2 d1 d2

replace d1 = d2 if psu2>psu1

bys stratum d1: keep if _n==1

order stratum psu1 psu2 size1 size2 d1

gen jprob = ((size1*size2)/tsize)*(1/(tsize-size1)+1/(tsize-size2))
label var jprob "Joint Probability"
// check
egen st = total(jprob), by(stratum)  // should be 1
by stratum: sum st
keep stratum psu1 psu2 size1 size2 jprob d1

save djoint, replace

list
tempfile t3
save `t3'
gen psu = psu1
gen size = size1

tempfile t4
save `t4'
list

use `t3', clear
gen psu = psu2
gen size = size2
append using `t4'
sort psu 

egen mprob = total(jprob), by(stratum psu)


label var mprob "Marginal Probability"
bys stratum psu: keep if _n==1

keep stratum psu size mprob

save dmarginal, replace
list
**************CODE ENDS**************



Aside: The method of Levy and Lemeshow is not used in serious practice. Many people sample PPS with the systematic sampling algorithms  orginally due to Madow (1949), implemented in Stephen Jenkins's -samplepps-;  see Brewer and Hanif (1983) and Cochran (1977, p. 265) who cites Hartley and Rao (1962) and Madow (1949). But practitioners then use formulas appropriate for sampling with replacement.  A few packages do compute the joint probabilities, for example the SURVEYSELECT PROCEDURE in SAS and Yves Tille's -sampling- package in R.   Ken Brewer  (2002) has some interesting results on approximating the variance of the Horvitz-Thompson estimator without knowing the joint probabilities. 

References:

Ken Brewer, 2002 Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold Books, 

Madow, William G. 1949. On the theory of systematic sampling. II. Annals of Mathematical Statistics, 19: 535-545.

Brewer, K. R. W. and Muhammad Hanif. 1983. Sampling with Unequal Probabilities. New York: Springer.

Cochran, William G. 1977. Sampling Techniques, 3rd Edition. New York: Wiley.

Hartley, H.O. and J.N.K. Rao. 1962. Sampling with unequal probabilities and
without replacement.  Annals of Mathematical Statistics, 33: 350-374.

Steve



On Sep 9, 2011, at 9:07 PM, Bryan Sayer wrote:

No, if you look at the bottom of the message you will see an example.
The memory data set is composed of N observations.  This is a probability proportional to size without replacement (stratified) sample.  For the time being I am just doing one stratum at a time.  See Levy and Lemeshow "Sampling of Populations" chapter 11.

Input (memory) data set (with marginal probability added)

District Size 	pi(i)
LUWEERO 12,466 	0.916858
KAMPALA 3,459 	0.542857
TORORO 	2,815 	0.448739
KAMULI 	549 	0.091546
Total 	19,289

Output (postfile) data set:

COMBINATIONS pi(I,j)
LUWEERO,KAMPALA 0.468854
LUWEERO,TORORO 0.377069
LUWEERO,KAMULI 0.070934
KAMPALA,TORORO 0.062531
KAMPALA,KAMULI 0.011473
TORORO,KAMULI 0.009139

Bryan Sayer



On 9/9/2011 8:13 PM, Steven Samuels wrote:
> 
> 
> Bryan,
> 
> You don't need to create any loops. I assume that the  "memory" data
> set  consists of  N x (N-1)/2 observations.    If you haven't done so
> already, create two separate variables, e.g.  id_1  id__2 that
> uniquely identify the  PSU pair. Call the joint probability  j_prob.
> ************************** sort id_1 id_2 **************************
> will give data that look like:
> 
> id_1  id_2 j_prob 1      2    .07 1      3    .05 . . . N-1    N
> .04
> 
> Then the following code will create the data set you are requesting.
> ******************************************* egen marg_prog =
> total(j_prob), by(id_1) bys id_1: keep if _n==1 drop id_2 j_prob
> save *******************************************
> 
> The same procedure will work if the output data set consists of any
> number of tuples.  For 4-tuples, for example, change the -drop-
> statement to:
> 
> ******************** drop id_2 id_3 id_4 j_prob *****************
> 
> Out of curiosity: what selection algorithms are you studying?
> 
> Steve
> 
> 
> On Sep 9, 2011, at 6:30 PM, Nick Cox wrote:
> 
> Too much survey sampling in this for me to understand what you want.
> Someone else will probably help.
> 
> 
> On Fri, Sep 9, 2011 at 11:08 PM, Bryan Sayer<[email protected]>
> wrote:
>> Great, thanks!  I tend to think more along the lines of FORTRAN.
>> 
>> A local for the total makes sense.  The total is used in the
>> calculation of the joint probability.
>> 
>> So locals seem to make sense for what I send to -postfile- (though
>> I realize I probably don't have it correct yet), provided it is ok
>> to change their value during the loops.  But maybe it should be
>> scalers?
>> 
>> But I am still stuck on the new variable I need to add to the input
>> data set (the one in memory).  I'm going to call that "memory data
>> set" to distinguish it from "postfile data set".
>> 
>> The memory data set is the input for this calculation.  Each
>> observation in the memory data set is one PSU.  The postfile data
>> set consists of each possible pair of PSUs from the memory data
>> set, thus the number of observations is the combination of _N taken
>> 2 at a time from the memory data set (without replacement).  The
>> memory data set will gain one variable, the marginal probability
>> (margprob ), which is the sum of the joint probabilities involving
>> each PSU.
>> 
>> Can I act on margprob in the memory data set one observation at a
>> time?  I presume I would generate a new variable first, assigning
>> it a value of zero before the loops start
>> 
>> Basically, in the loop, margprob looks like this:
>> 
>> replace margprob[J] = margprob[J] + jointprob replace margprob[K] =
>> margprob[K] + jointprob
>> 
>> Where J and K are the observation number of the memory data set.
>> 
>> Does this work?
>> 
>> Bryan Sayer Monday to Friday, 8:30 to 5:00 Phone: (614) 442-7369
>> FAX:  (614) 442-7329 [email protected]
>> 
>> 
>> On 9/9/2011 5:25 PM, Nick Cox wrote:
>>> 
>>> Your code shades between Stata and incomplete Stata, as you will
>>> know.
>>> 
>>> However, a key principle here is that -postfile- never sees the
>>> locals in your calling program. It just gets passed their values.
>>> That's not a problem. It's the way to get round the basic fact
>>> that one program's locals are invisible to another program.
>>> 
>>> Also these two lines definitely won't work
>>> 
>>> local N_total egen double `N_total'=total(`count')
>>> 
>>> The first defines the local N_total as blank, which is equivalent
>>> to not defining it at all. So, Stata will read the second line
>>> as
>>> 
>>> egen double = total(`count')
>>> 
>>> which will fail, as no new variable name is supplied.
>>> 
>>> That said, there is no need to create a variable just to hold a
>>> total.
>>> 
>>> su `count', meanonly
>>> 
>>> will leave r(sum) in memory and the value of that can be put
>>> somewhere appropriate, into a local or a scalar or directly into
>>> another file.
>>> 
>>> On Fri, Sep 9, 2011 at 8:49 PM, Bryan Sayer<[email protected]>
>>> wrote:
>>> 
>>>> So I am still a bit confused about how -postfile- works when I
>>>> want to preserve the data in memory.  Specifically, how I
>>>> generate the variables that I want in my -postfile- output
>>>> versus the new one I do want to add to the data set in memory.
>>>> 
>>>> I'm thinking I want to use a local (maybe macro?) variable for
>>>> my results that go to -postfile-?  In other words, how do I
>>>> distinguish variables between the two files.
>>>> 
>>>> Also, how do I accumulate results for my new variable that goes
>>>> in my memory data set.  I need to accumulate a sum for two
>>>> observations in memory on each post to -postfile-.
>>>> 
>>>> Here is what I have so far, but with the last part calculating
>>>> the marginal probability (note that the joint probability
>>>> calculation should be on one line):
>>>> 
>>>> program jointprob args design infile outfile psu count
>>>> margprob tempvar psu1 psu2 pi_one pi_joint tempfile results /*
>>>> set up the file with the joint probabilities */ postfile
>>>> `results' `psu1' `psu2' using "`outfile'" ,replace /* get the
>>>> number of observations and the total count */ local N=_N local
>>>> N_total egen double `N_total'=total(`count')
>>>> 
>>>> quietly { /* read the input data set and create combinations of
>>>> N items taken 2 at a time, without replacement */ forvalues J =
>>>> 1/`N'{ forvalues K = 1/`N'{ if `K'>`J'{ psu1=`psu'[`J']
>>>> psu2=`psu'[`K']
>>>> 
>>>> pi_joint=(`count[`J']'*`count[`K']'/`N_total') *
>>>> ((1/(`N_total'-`count[`J']')+(1/(`N_total'-`count[`K']')) post
>>>> `results' psu1 psu2 pi_joint } } } }
>>>> 
>>>> 
>>>> Bryan Sayer Monday to Friday, 8:30 to 5:00 Phone: (614)
>>>> 442-7369 FAX:  (614) 442-7329 [email protected]
>>>> 
>>>> 
>>>> On 9/7/2011 9:44 AM, Roger Newson wrote:
>>>>> 
>>>>> -postfile- will still work if there is an existing dataset in
>>>>> the memory. However, the new dataset will be built in a
>>>>> file.
>>>>> 
>>>>> Best wishes
>>>>> 
>>>>> Roger
>>>>> 
>>>>> 
>>>>> Roger B Newson BSc MSc DPhil Lecturer in Medical Statistics
>>>>> Respiratory Epidemiology and Public Health Group National
>>>>> Heart and Lung Institute Imperial College London Royal
>>>>> Brompton Campus Room 33, Emmanuel Kaye Building 1B Manresa
>>>>> Road London SW3 6LR UNITED KINGDOM Tel: +44 (0)20 7352 8121
>>>>> ext 3381 Fax: +44 (0)20 7351 8322 Email:
>>>>> [email protected] Web page:
>>>>> http://www.imperial.ac.uk/nhli/r.newson/ Departmental Web
>>>>> page:
>>>>> 
>>>>> 
>>>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
Opinions expressed are those of the author, not of the institution.
>>>>> 
>>>>> On 07/09/2011 14:40, Bryan Sayer wrote:
>>>>>> 
>>>>>> -postfile- will post my results, but my reading of how it
>>>>>> works seems to indicate that my original data set cannot be
>>>>>> open at the same time. The examples appear to me to clear
>>>>>> the existing data set from memory.
>>>>>> 
>>>>>> Admittedly, this is without me having tried anything yet,
>>>>>> but am I not reading it correctly?
>>>>>> 
>>>>>> What I need to do is a double loop through the input data
>>>>>> set, outputting a record on each iteration of each loop. So
>>>>>> I need the input data set open in memory, and a second file
>>>>>> to post the results to.
>>>>>> 
>>>>>> Are there any examples of something similar?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> Bryan Sayer Monday to Friday, 8:30 to 5:00 Phone: (614)
>>>>>> 442-7369 FAX: (614) 442-7329 [email protected]
>>>>>> 
>>>>>> 
>>>>>> On 9/6/2011 4:58 PM, Roger Newson wrote:
>>>>>>> 
>>>>>>> I think you are looking for the -postfile- utility. In
>>>>>>> Stata, type
>>>>>>> 
>>>>>>> help postfile
>>>>>>> 
>>>>>>> to find out more.
>>>>>>> 
>>>>>>> HTH.
>>>>>>> 
>>>>>>> Best wishes
>>>>>>> 
>>>>>>> Roger
>>>>>>> 
>>>>>>> 
>>>>>>> Roger B Newson BSc MSc DPhil Lecturer in Medical
>>>>>>> Statistics Respiratory Epidemiology and Public Health
>>>>>>> Group National Heart and Lung Institute Imperial College
>>>>>>> London Royal Brompton Campus Room 33, Emmanuel Kaye
>>>>>>> Building 1B Manresa Road London SW3 6LR UNITED KINGDOM
>>>>>>> Tel: +44 (0)20 7352 8121 ext 3381 Fax: +44 (0)20 7351
>>>>>>> 8322 Email: [email protected] Web page:
>>>>>>> http://www.imperial.ac.uk/nhli/r.newson/ Departmental Web
>>>>>>> page:
>>>>>>> 
>>>>>>> 
>>>>>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
Opinions expressed are those of the author, not of the institution.
>>>>>>> 
>>>>>>> On 06/09/2011 21:53, Bryan Sayer wrote:
>>>>>>>> 
>>>>>>>> I need to create an output data set that will differ in
>>>>>>>> the content and number of observations from the input
>>>>>>>> file. The observations will be created one at a time,
>>>>>>>> based on the input data set.
>>>>>>>> 
>>>>>>>> Specifically, I am creating all combinations of N
>>>>>>>> objects taken two at a time. I will probably also do
>>>>>>>> permutations.
>>>>>>>> 
>>>>>>>> The input data set (to start with) consists of N
>>>>>>>> records with two variables, the primary sampling unit
>>>>>>>> (PSU) and a size variable associated with the PSU (a
>>>>>>>> count variable). I want to create two output data sets.
>>>>>>>> One is each combination of PSU with the associated
>>>>>>>> joint probability. The second has the same structure as
>>>>>>>> the input data set but includes the marginal
>>>>>>>> probability, calculated as the sum of the joint
>>>>>>>> probabilities associated with the PSU (which are
>>>>>>>> accumulated as each combination is created).
>>>>>>>> 
>>>>>>>> The part I am stuck on is how to output the data set of
>>>>>>>> combinations. Can someone point me to a program that
>>>>>>>> outputs a file as calculations are made?
>>>>>>>> 
>>>>>>>> (For those interested, this is for probability
>>>>>>>> proportional to size (PPS) sampling. See, for example,
>>>>>>>> Levy and Lemeshow "Sampling of Populations, chapter
>>>>>>>> 11).
>>>>>>>> 
>>>>>>>> Here is an example of one stratum:
>>>>>>>> 
>>>>>>>> Input data set (with marginal probability added)
>>>>>>>> 
>>>>>>>> District Size pi(i) LUWEERO 12,466 0.916858 KAMPALA
>>>>>>>> 3,459 0.542857 TORORO 2,815 0.448739 KAMULI 549
>>>>>>>> 0.091546 Total 19,289
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Output data set:
>>>>>>>> 
>>>>>>>> COMBINATIONS pi(I,j) LUWEERO,KAMPALA 0.468854
>>>>>>>> LUWEERO,TORORO 0.377069 LUWEERO,KAMULI 0.070934
>>>>>>>> KAMPALA,TORORO 0.062531 KAMPALA,KAMULI 0.011473
>>>>>>>> TORORO,KAMULI 0.009139
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: Creating a second output data set
  - From: Bryan Sayer <[email protected]>
- Re: st: Creating a second output data set
  - From: Roger Newson <[email protected]>
- Re: st: Creating a second output data set
  - From: Bryan Sayer <[email protected]>
- Re: st: Creating a second output data set
  - From: Roger Newson <[email protected]>
- Re: st: Creating a second output data set
  - From: Bryan Sayer <[email protected]>
- Re: st: Creating a second output data set
  - From: Nick Cox <[email protected]>
- Re: st: Creating a second output data set
  - From: Bryan Sayer <[email protected]>
- Re: st: Creating a second output data set
  - From: Nick Cox <[email protected]>
- Re: st: Creating a second output data set
  - From: Steven Samuels <[email protected]>
- Re: st: Creating a second output data set
  - From: Bryan Sayer <[email protected]>

Prev by Date: Re: st: Polynomial Fitting and RD Design
Next by Date: st: Return-based style analysis / Shape's Quadratic Programming technique
Previous by thread: Re: st: Creating a second output data set
Next by thread: st: SAS vs Stata boxcox output [SEC: UNCLASSIFIED]
Index(es):
- Date
- Thread