Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Creating a second output data set

From	Steven Samuels <[email protected]>
To	[email protected]
Subject	Re: st: Creating a second output data set
Date	Fri, 9 Sep 2011 20:13:12 -0400


Bryan, 

You don't need to create any loops. I assume that the data set in memory is the one with the PSU pairs (your "postfile" data set), so it consists of N x (N-1)/2 observations. If you haven't done so already, create two separate variables, e.g. id_1 and id_2, that together uniquely identify the PSU pair. Call the joint probability j_prob.
**************************
sort id_1 id_2
**************************
will give data that look like:

id_1  id_2 j_prob
1      2    .07
1      3    .05
.
.
.
N-1    N    .04
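
The pair data set itself can also be built without loops. The following is only a sketch: the variable names psu and count, the scalar N_total, and the tempfile name are assumptions, the four districts are the ones in Bryan's example, and the joint-probability formula is the one in his posted program. The group form of -rename- needs Stata 12 or later.

```stata
* Sketch only: psu, count, N_total and the tempfile name are assumed names
clear
input str8 psu long count
"LUWEERO" 12466
"KAMPALA" 3459
"TORORO"  2815
"KAMULI"  549
end

* numeric identifier for each PSU and the overall total of the size variable
gen long id_1 = _n
summarize count, meanonly
scalar N_total = r(sum)

* save one copy with suffix 1, then rename the copy in memory to suffix 2
rename (psu count) (psu_1 count_1)
tempfile first
save `first'
rename (id_1 psu_1 count_1) (id_2 psu_2 count_2)

* -cross- forms all N x N ordered pairs; keep each unordered pair once
cross using `first'
keep if id_1 < id_2

* joint probability, using the formula from Bryan's program
gen double j_prob = (count_1*count_2/scalar(N_total)) * ///
    (1/(scalar(N_total) - count_1) + 1/(scalar(N_total) - count_2))

sort id_1 id_2
list psu_1 psu_2 j_prob
```

For the first pair (LUWEERO, KAMPALA) this should give about 0.468854, as in Bryan's table.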

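Then the following code will create the data set you are requesting, provided every PSU appears as id_1 in all of its pairs. With each pair stored only once (id_1 < id_2), -by(id_1)- totals only the pairs in which a PSU is listed first, so first append a copy of the data with id_1 and id_2 swapped.
*******************************************
* marginal probability = sum of the joint probabilities within id_1
egen marg_prob = total(j_prob), by(id_1)
* keep one observation per PSU
bys id_1: keep if _n==1
drop id_2 j_prob
* the file name is only an example
save marg_prob, replace
*******************************************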
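
If the pair data hold each pair only once, one way to get every PSU into the id_1 position is to stack a swapped copy before totalling. This is a sketch, not a tested program: the tempfile name and the final variable name id are assumptions, and the six joint probabilities are copied from Bryan's example with the districts numbered 1 to 4. The group form of -rename- needs Stata 12 or later.

```stata
* Pair data from Bryan's example: one observation per unordered pair
clear
input id_1 id_2 j_prob
1 2 .468854
1 3 .377069
1 4 .070934
2 3 .062531
2 4 .011473
3 4 .009139
end

* Stack a copy with the two identifiers swapped, so that every PSU
* appears as id_1 in each pair that involves it
tempfile swapped
preserve
rename (id_1 id_2) (id_2 id_1)
save `swapped'
restore
append using `swapped'

* One observation per PSU holding the sum of its joint probabilities
collapse (sum) marg_prob = j_prob, by(id_1)
rename id_1 id
list
```

The four totals should agree with the pi(i) column in Bryan's example up to rounding in the last digit. -collapse- is used here in place of -egen- plus -keep- only because it does both steps at once.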

The same procedure will work if the output data set consists of tuples of any size.  For 4-tuples, for example, change the -drop- statement to:

********************
drop id_2 id_3 id_4 j_prob
*****************
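
For tuples stored once each, an alternative that does not depend on which position a PSU occupies is to reshape the identifiers to long form and total by PSU. This is a sketch under assumptions: the variable names tuple and position are made up for the illustration, and the 3-tuple probabilities below are invented toy numbers, not results from any real design.

```stata
* Toy 3-tuple data (invented numbers): one observation per unordered triple
clear
input id_1 id_2 id_3 j_prob
1 2 3 .4
1 2 4 .3
1 3 4 .2
2 3 4 .1
end

* One observation per (tuple, position); j_prob is carried along
gen long tuple = _n
reshape long id_, i(tuple) j(position)

* Sum the joint probabilities over every tuple that contains the PSU
collapse (sum) marg_prob = j_prob, by(id_)
rename id_ id
list
```

The same code should work unchanged for pairs or 4-tuples, since -reshape- picks up however many id_# variables are present.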

Out of curiosity: what selection algorithms are you studying?

Steve


On Sep 9, 2011, at 6:30 PM, Nick Cox wrote:

Too much survey sampling in this for me to understand what you want.
Someone else will probably help.


On Fri, Sep 9, 2011 at 11:08 PM, Bryan Sayer <[email protected]> wrote:
> Great, thanks!  I tend to think more along the lines of FORTRAN.
> 
> A local for the total makes sense.  The total is used in the calculation of
> the joint probability.
> 
> So locals seem to make sense for what I send to -postfile- (though I realize
> I probably don't have it correct yet), provided it is ok to change their
> value during the loops.  But maybe it should be scalars?
> 
> But I am still stuck on the new variable I need to add to the input data set
> (the one in memory).  I'm going to call that "memory data set" to
> distinguish it from "postfile data set".
> 
> The memory data set is the input for this calculation.  Each observation in
> the memory data set is one PSU.  The postfile data set consists of each
> possible pair of PSUs from the memory data set, thus the number of
> observations is the combination of _N taken 2 at a time from the memory data
> set (without replacement).  The memory data set will gain one variable, the
> marginal probability (margprob ), which is the sum of the joint
> probabilities involving each PSU.
> 
> Can I act on margprob in the memory data set one observation at a time?  I
> presume I would generate a new variable first, assigning it a value of zero
> before the loops start.
> 
> Basically, in the loop, margprob looks like this:
> 
> replace margprob[J] = margprob[J] + jointprob
> replace margprob[K] = margprob[K] + jointprob
> 
> Where J and K are the observation number of the memory data set.
> 
> Does this work?
> 
> Bryan Sayer
> Monday to Friday, 8:30 to 5:00
> Phone: (614) 442-7369
> FAX:  (614) 442-7329
> [email protected]
> 
> 
> On 9/9/2011 5:25 PM, Nick Cox wrote:
>> 
>> Your code shades between Stata and incomplete Stata, as you will know.
>> 
>> However, a key principle here is that -postfile- never sees the locals
>> in your calling program. It just gets passed their values. That's not
>> a problem. It's the way to get round the basic fact that one program's
>> locals are invisible to another program.
>> 
>> Also, these two lines definitely won't work:
>> 
>>        local N_total
>>        egen double `N_total'=total(`count')
>> 
>> The first defines the local N_total as blank, which is equivalent to
>> not defining it at all. So, Stata will read the second line as
>> 
>> egen double = total(`count')
>> 
>> which will fail, as no new variable name is supplied.
>> 
>> That said, there is no need to create a variable just to hold a total.
>> 
>> su `count', meanonly
>> 
>> will leave r(sum) in memory and the value of that can be put somewhere
>> appropriate, into a local or a scalar or directly into another file.
>> 
>> On Fri, Sep 9, 2011 at 8:49 PM, Bryan Sayer<[email protected]>  wrote:
>> 
>>> So I am still a bit confused about how -postfile- works when I want to
>>> preserve the data in memory.  Specifically, how I generate the variables
>>> that I want in my -postfile- output versus the new one I do want to add
>>> to
>>> the data set in memory.
>>> 
>>> I'm thinking I want to use a local (maybe macro?) variable for my results
>>> that go to -postfile-?  In other words, how do I distinguish variables
>>> between the two files.
>>> 
>>> Also, how do I accumulate results for my new variable that goes in my
>>> memory
>>> data set.  I need to accumulate a sum for two observations in memory on
>>> each
>>> post to -postfile-.
>>> 
>>> Here is what I have so far, but with the last part calculating the
>>> marginal
>>> probability (note that the joint probability calculation should be on one
>>> line):
>>> 
>>> program jointprob
>>> args design infile outfile psu count margprob
>>> tempvar psu1 psu2 pi_one pi_joint
>>> tempfile results
>>>        /* set up the file with the joint probabilities */
>>>        postfile `results' `psu1' `psu2' using "`outfile'" ,replace
>>>        /* get the number of observations and the total count */
>>>        local N=_N
>>>        local N_total
>>>        egen double `N_total'=total(`count')
>>> 
>>> quietly {
>>>        /* read the input data set and create combinations of N items
>>>           taken 2 at a time, without replacement */
>>>        forvalues J = 1/`N'{
>>>                forvalues K = 1/`N'{
>>>                        if `K'>`J'{
>>>                                psu1=`psu'[`J']
>>>                                psu2=`psu'[`K']
>>> 
>>>  pi_joint=(`count[`J']'*`count[`K']'/`N_total') *
>>> ((1/(`N_total'-`count[`J']')+(1/(`N_total'-`count[`K']'))
>>>                                post `results' psu1 psu2 pi_joint
>>>                                }
>>>                }
>>>        }
>>> }
>>> 
>>> 
>>> Bryan Sayer
>>> 
>>> 
>>> On 9/7/2011 9:44 AM, Roger Newson wrote:
>>>> 
>>>> -postfile- will still work if there is an existing dataset in the
>>>> memory. However, the new dataset will be built in a file.
>>>> 
>>>> Best wishes
>>>> 
>>>> Roger
>>>> 
>>>> 
>>>> Roger B Newson BSc MSc DPhil
>>>> Lecturer in Medical Statistics
>>>> Respiratory Epidemiology and Public Health Group
>>>> National Heart and Lung Institute
>>>> Imperial College London
>>>> Royal Brompton Campus
>>>> Room 33, Emmanuel Kaye Building
>>>> 1B Manresa Road
>>>> London SW3 6LR
>>>> UNITED KINGDOM
>>>> Tel: +44 (0)20 7352 8121 ext 3381
>>>> Fax: +44 (0)20 7351 8322
>>>> Email: [email protected]
>>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/
>>>> Departmental Web page:
>>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>>>> 
>>>> Opinions expressed are those of the author, not of the institution.
>>>> 
>>>> On 07/09/2011 14:40, Bryan Sayer wrote:
>>>>> 
>>>>> -postfile- will post my results, but my reading of how it works seems
>>>>> to
>>>>> indicate that my original data set cannot be open at the same time. The
>>>>> examples appear to me to clear the existing data set from memory.
>>>>> 
>>>>> Admittedly, this is without me having tried anything yet, but am I not
>>>>> reading it correctly?
>>>>> 
>>>>> What I need to do is a double loop through the input data set,
>>>>> outputting a record on each iteration of each loop. So I need the input
>>>>> data set open in memory, and a second file to post the results to.
>>>>> 
>>>>> Are there any examples of something similar?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Bryan Sayer
>>>>> 
>>>>> 
>>>>> On 9/6/2011 4:58 PM, Roger Newson wrote:
>>>>>> 
>>>>>> I think you are looking for the -postfile- utility. In Stata, type
>>>>>> 
>>>>>> help postfile
>>>>>> 
>>>>>> to find out more.
>>>>>> 
>>>>>> HTH.
>>>>>> 
>>>>>> Best wishes
>>>>>> 
>>>>>> Roger
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 06/09/2011 21:53, Bryan Sayer wrote:
>>>>>>> 
>>>>>>> I need to create an output data set that will differ in the content
>>>>>>> and
>>>>>>> number of observations from the input file. The observations will be
>>>>>>> created one at a time, based on the input data set.
>>>>>>> 
>>>>>>> Specifically, I am creating all combinations of N objects taken two
>>>>>>> at a
>>>>>>> time. I will probably also do permutations.
>>>>>>> 
>>>>>>> The input data set (to start with) consists of N records with two
>>>>>>> variables, the primary sampling unit (PSU) and a size variable
>>>>>>> associated with the PSU (a count variable). I want to create two
>>>>>>> output
>>>>>>> data sets. One is each combination of PSU with the associated joint
>>>>>>> probability. The second has the same structure as the input data set
>>>>>>> but
>>>>>>> includes the marginal probability, calculated as the sum of the joint
>>>>>>> probabilities associated with the PSU (which are accumulated as each
>>>>>>> combination is created).
>>>>>>> 
>>>>>>> The part I am stuck on is how to output the data set of combinations.
>>>>>>> Can someone point me to a program that outputs a file as calculations
>>>>>>> are made?
>>>>>>> 
>>>>>>> (For those interested, this is for probability proportional to size
>>>>>>> (PPS) sampling. See, for example, Levy and Lemeshow, "Sampling of
>>>>>>> Populations", chapter 11.)
>>>>>>> 
>>>>>>> Here is an example of one stratum:
>>>>>>> 
>>>>>>> Input data set (with marginal probability added)
>>>>>>> 
>>>>>>> District   Size     pi(i)
>>>>>>> LUWEERO    12,466   0.916858
>>>>>>> KAMPALA     3,459   0.542857
>>>>>>> TORORO      2,815   0.448739
>>>>>>> KAMULI        549   0.091546
>>>>>>> Total      19,289
>>>>>>> 
>>>>>>> 
>>>>>>> Output data set:
>>>>>>> 
>>>>>>> COMBINATIONS       pi(i,j)
>>>>>>> LUWEERO,KAMPALA    0.468854
>>>>>>> LUWEERO,TORORO     0.377069
>>>>>>> LUWEERO,KAMULI     0.070934
>>>>>>> KAMPALA,TORORO     0.062531
>>>>>>> KAMPALA,KAMULI     0.011473
>>>>>>> TORORO,KAMULI      0.009139
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> *
>>>>>> * For searches and help try:
>>>>>> * http://www.stata.com/help.cgi?search
>>>>>> * http://www.stata.com/support/statalist/faq
>>>>>> * http://www.ats.ucla.edu/stat/stata/