Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Creating a second output data set

| Field | Value |
|---|---|
| From | Steven Samuels |
| To | statalist@hsphsun2.harvard.edu |
| Subject | Re: st: Creating a second output data set |
| Date | Fri, 9 Sep 2011 20:13:12 -0400 |

```
Bryan,

You don't need to create any loops. I assume that the data set in memory is the pairs (-postfile-) data set, with N x (N-1)/2 observations. If you haven't done so already, create two separate variables, e.g. id_1 and id_2, that uniquely identify the PSU pair. Call the joint probability j_prob.
**************************
sort id_1 id_2
**************************
will give data that look like:

id_1  id_2 j_prob
1      2    .07
1      3    .05
.
.
.
N-1    N    .04

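Then the following code will create the data set you are requesting. Each pair is stacked twice, so that every PSU appears as id_1 once for each pair it belongs to; otherwise a PSU would be credited only with the pairs in which it comes first.
*******************************************
expand 2
bysort id_1 id_2: gen byte copy = _n
replace id_1 = id_2 if copy == 2
egen marg_prob = total(j_prob), by(id_1)
bys id_1: keep if _n==1
drop id_2 j_prob copy
save
*******************************************

The same procedure will work if the output data set consists of tuples of any size, provided each tuple is stacked once per member. For 4-tuples, for example, use -expand 4-, copy id_2, id_3 and id_4 into id_1 in copies 2, 3 and 4, and change the -drop- statement to:

********************
drop id_2 id_3 id_4 j_prob copy
********************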

Out of curiosity: what selection algorithms are you studying?

Steve

On Sep 9, 2011, at 6:30 PM, Nick Cox wrote:

Too much survey sampling in this for me to understand what you want.
Someone else will probably help.

On Fri, Sep 9, 2011 at 11:08 PM, Bryan Sayer <bsayer@chrr.osu.edu> wrote:
> Great, thanks!  I tend to think more along the lines of FORTRAN.
>
> A local for the total makes sense.  The total is used in the calculation of
> the joint probability.
>
> So locals seem to make sense for what I send to -postfile- (though I realize
> I probably don't have it correct yet), provided it is ok to change their
> value during the loops.  But maybe it should be scalars?
>
> But I am still stuck on the new variable I need to add to the input data set
> (the one in memory).  I'm going to call that "memory data set" to
> distinguish it from "postfile data set".
>
> The memory data set is the input for this calculation.  Each observation in
> the memory data set is one PSU.  The postfile data set consists of each
> possible pair of PSUs from the memory data set, thus the number of
> observations is the combination of _N taken 2 at a time from the memory data
> set (without replacement).  The memory data set will gain one variable, the
> marginal probability (margprob ), which is the sum of the joint
> probabilities involving each PSU.
>
> Can I act on margprob in the memory data set one observation at a time?  I
> presume I would generate a new variable first, assigning it a value of zero
> before the loops start.
>
> Basically, in the loop, margprob looks like this:
>
> replace margprob[J] = margprob[J] + jointprob
> replace margprob[K] = margprob[K] + jointprob
>
> Where J and K are the observation number of the memory data set.
>
> Does this work?
>
> Bryan Sayer
> Monday to Friday, 8:30 to 5:00
> Phone: (614) 442-7369
> FAX:  (614) 442-7329
> BSayer@chrr.osu.edu
>
>
> On 9/9/2011 5:25 PM, Nick Cox wrote:
>>
>> Your code shades between Stata and incomplete Stata, as you will know.
>>
>> However, a key principle here is that -postfile- never sees the locals
>> in your calling program. It just gets passed their values. That's not
>> a problem. It's the way to get round the basic fact that one program's
>> locals are invisible to another program.
>>
>> Also these two lines definitely won't work
>>
>>        local N_total
>>        egen double `N_total'=total(`count')
>>
>> The first defines the local N_total as blank, which is equivalent to
>> not defining it at all. So, Stata will read the second line as
>>
>> egen double = total(`count')
>>
>> which will fail, as no new variable name is supplied.
>>
>> That said, there is no need to create a variable just to hold a total.
>>
>> su `count', meanonly
>>
>> will leave r(sum) in memory and the value of that can be put somewhere
>> appropriate, into a local or a scalar or directly into another file.
>>
>> On Fri, Sep 9, 2011 at 8:49 PM, Bryan Sayer <bsayer@chrr.osu.edu> wrote:
>>
>>> So I am still a bit confused about how -postfile- works when I want to
>>> preserve the data in memory.  Specifically, how do I generate the
>>> variables that I want in my -postfile- output versus the new one I do
>>> want to add to the data set in memory?
>>>
>>> I'm thinking I want to use a local (maybe macro?) variable for my results
>>> that go to -postfile-?  In other words, how do I distinguish variables
>>> between the two files?
>>>
>>> Also, how do I accumulate results for my new variable that goes in my
>>> memory data set?  I need to accumulate a sum for two observations in
>>> memory on each post to -postfile-.
>>>
>>> Here is what I have so far, but without the last part calculating the
>>> marginal probability (note that the joint probability calculation should
>>> be on one line):
>>>
>>> program jointprob
>>> tempvar psu1 psu2 pi_one pi_joint
>>> tempfile results
>>>        /* set up the file with the joint probabilities */
>>>        postfile `results' `psu1' `psu2' using "`outfile'" ,replace
>>>        /* get the number of observations and the total count */
>>>        local N=_N
>>>        local N_total
>>>        egen double `N_total'=total(`count')
>>>
>>> quietly {
>>>        /* read the input data set and create combinations of N items
>>>           taken 2 at a time, without replacement */
>>>        forvalues J = 1/`N'{
>>>                forvalues K = 1/`N'{
>>>                        if `K'>`J'{
>>>                                psu1=`psu'[`J']
>>>                                psu2=`psu'[`K']
>>>
>>>  pi_joint=(`count[`J']'*`count[`K']'/`N_total') *
>>> ((1/(`N_total'-`count[`J']')+(1/(`N_total'-`count[`K']'))
>>>                                post `results' psu1 psu2 pi_joint
>>>                                }
>>>                }
>>>        }
>>> }
>>>
>>>
>>> Bryan Sayer
>>>
>>>
>>> On 9/7/2011 9:44 AM, Roger Newson wrote:
>>>>
>>>> -postfile- will still work if there is an existing dataset in the
>>>> memory. However, the new dataset will be built in a file.
>>>>
>>>> Best wishes
>>>>
>>>> Roger
>>>>
>>>>
>>>> Roger B Newson BSc MSc DPhil
>>>> Lecturer in Medical Statistics
>>>> Respiratory Epidemiology and Public Health Group
>>>> National Heart and Lung Institute
>>>> Imperial College London
>>>> Royal Brompton Campus
>>>> Room 33, Emmanuel Kaye Building
>>>> London SW3 6LR
>>>> UNITED KINGDOM
>>>> Tel: +44 (0)20 7352 8121 ext 3381
>>>> Fax: +44 (0)20 7351 8322
>>>> Email: r.newson@imperial.ac.uk
>>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/
>>>> Departmental Web page:
>>>>
>>>> Opinions expressed are those of the author, not of the institution.
>>>>
>>>> On 07/09/2011 14:40, Bryan Sayer wrote:
>>>>>
>>>>> -postfile- will post my results, but my reading of how it works seems
>>>>> to indicate that my original data set cannot be open at the same time.
>>>>> The examples appear to me to clear the existing data set from memory.
>>>>>
>>>>> Admittedly, this is without me having tried anything yet, but am I not
>>>>>
>>>>> What I need to do is a double loop through the input data set,
>>>>> outputting a record on each iteration of each loop. So I need the input
>>>>> data set open in memory, and a second file to post the results to.
>>>>>
>>>>> Are there any examples of something similar?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Bryan Sayer
>>>>>
>>>>>
>>>>> On 9/6/2011 4:58 PM, Roger Newson wrote:
>>>>>>
>>>>>> I think you are looking for the -postfile- utility. In Stata, type
>>>>>>
>>>>>> help postfile
>>>>>>
>>>>>> to find out more.
>>>>>>
>>>>>> HTH.
>>>>>>
>>>>>> Best wishes
>>>>>>
>>>>>> Roger
>>>>>>
>>>>>>
>>>>>> On 06/09/2011 21:53, Bryan Sayer wrote:
>>>>>>>
>>>>>>> I need to create an output data set that will differ in the content
>>>>>>> and number of observations from the input file. The observations will
>>>>>>> be created one at a time, based on the input data set.
>>>>>>>
>>>>>>> Specifically, I am creating all combinations of N objects taken two
>>>>>>> at a time. I will probably also do permutations.
>>>>>>>
>>>>>>> The input data set (to start with) consists of N records with two
>>>>>>> variables, the primary sampling unit (PSU) and a size variable
>>>>>>> associated with the PSU (a count variable). I want to create two
>>>>>>> output data sets. One is each combination of PSUs with the associated
>>>>>>> joint probability. The second has the same structure as the input data
>>>>>>> set but includes the marginal probability, calculated as the sum of
>>>>>>> the joint probabilities associated with the PSU (which are accumulated
>>>>>>> as each combination is created).
>>>>>>>
>>>>>>> The part I am stuck on is how to output the data set of combinations.
>>>>>>> Can someone point me to a program that outputs a file as calculations
>>>>>>>
>>>>>>> (For those interested, this is for probability proportional to size
>>>>>>> (PPS) sampling. See, for example, Levy and Lemeshow, "Sampling of
>>>>>>> Populations", chapter 11.)
>>>>>>>
>>>>>>> Here is an example of one stratum:
>>>>>>>
>>>>>>> Input data set (with marginal probability added)
>>>>>>>
>>>>>>> District     Size     pi(i)
>>>>>>> LUWEERO    12,466  0.916858
>>>>>>> KAMPALA     3,459  0.542857
>>>>>>> TORORO      2,815  0.448739
>>>>>>> KAMULI        549  0.091546
>>>>>>> Total      19,289
>>>>>>>
>>>>>>>
>>>>>>> Output data set:
>>>>>>>
>>>>>>> COMBINATIONS       pi(i,j)
>>>>>>> LUWEERO,KAMPALA   0.468854
>>>>>>> LUWEERO,TORORO    0.377069
>>>>>>> LUWEERO,KAMULI    0.070934
>>>>>>> KAMPALA,TORORO    0.062531
>>>>>>> KAMPALA,KAMULI    0.011473
>>>>>>> TORORO,KAMULI     0.009139
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> *
>>>>>> * For searches and help try:
>>>>>> * http://www.stata.com/help.cgi?search
>>>>>> * http://www.stata.com/support/statalist/faq
>>>>>> * http://www.ats.ucla.edu/stat/stata/
```
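The loop approach discussed in the thread can be sketched as a short do-file. This is only an illustration under stated assumptions, not the code any poster actually ran: the output file name `pairs` and the variable names `size`, `id_1`, `id_2`, `j_prob` and `margprob` are made up for the example, the four districts come from the example stratum in the original post, and the joint probability formula is the one in Bryan's draft program. It uses `summarize, meanonly` for the total, as Nick suggested, posts one record per pair with -postfile- while the input data stay in memory, and accumulates the marginal probability with `replace ... in`, which is the valid Stata form of updating a single observation.

```stata
* Sketch only: the file name "pairs" and all variable names are illustrative.
clear
input str8 psu long size
"LUWEERO" 12466
"KAMPALA"  3459
"TORORO"   2815
"KAMULI"    549
end

* Total size: -summarize, meanonly- leaves r(sum) behind; no new variable needed
summarize size, meanonly
local T = r(sum)
local N = _N

* The marginal probability is accumulated in the data set in memory
generate double margprob = 0

* Joint probabilities go to a separate file, one record per pair
tempname pairs pij
postfile `pairs' str8 id_1 str8 id_2 double j_prob using pairs, replace

forvalues j = 1/`=`N'-1' {
    forvalues k = `=`j'+1'/`N' {
        * joint probability as written in the thread's draft program
        scalar `pij' = (size[`j']*size[`k']/`T') * ///
            (1/(`T'-size[`j']) + 1/(`T'-size[`k']))
        post `pairs' (psu[`j']) (psu[`k']) (`pij')
        * -replace ... in- updates one observation at a time
        quietly replace margprob = margprob + `pij' in `j'
        quietly replace margprob = margprob + `pij' in `k'
    }
}
postclose `pairs'

list psu size margprob
```

Running it should leave the input data in memory with `margprob` added and write the six pairs to `pairs.dta` in the working directory. A temporary scalar is used for the joint probability rather than a local so that the value is passed at full precision and cannot be confused with a variable name.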