Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Steven Samuels <sjsamuels@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Creating a second output data set |

Date |
Fri, 9 Sep 2011 20:13:12 -0400 |

Bryan, You don't need to create any loops. I assume that the "memory" data set consists of N x (N-1)/2 observations. If you haven't done so already, create two separate variables, e.g. id_1 id__2 that uniquely identify the PSU pair. Call the joint probability j_prob. ************************** sort id_1 id_2 ************************** will give data that look like: id_1 id_2 j_prob 1 2 .07 1 3 .05 . . . N-1 N .04 Then the following code will create the data set you are requesting. ******************************************* egen marg_prog = total(j_prob), by(id_1) bys id_1: keep if _n==1 drop id_2 j_prob save ******************************************* The same procedure will work if the output data set consists of any number of tuples. For 4-tuples, for example, change the -drop- statement to: ******************** drop id_2 id_3 id_4 j_prob ***************** Out of curiosity: what selection algorithms are you studying? Steve On Sep 9, 2011, at 6:30 PM, Nick Cox wrote: Too much survey sampling in this for me to understand what you want. Someone else will probably help. On Fri, Sep 9, 2011 at 11:08 PM, Bryan Sayer <bsayer@chrr.osu.edu> wrote: > Great, thanks! I tend to think more along the lines of FORTRAN. > > A local for the total makes sense. The total is used in the calculation of > the joint probability. > > So locals seem to make sense for what I send to -postfile- (though I realize > I probably don't have it correct yet), provided it is ok to change their > value during the loops. But maybe it should be scalers? > > But I am still stuck on the new variable I need to add to the input data set > (the one in memory). I'm going to call that "memory data set" to > distinguish it from "postfile data set". > > The memory data set is the input for this calculation. Each observation in > the memory data set is one PSU. The postfile data set consists of each > possible pair of PSUs from the memory data set, thus the number of > observations is the combination of _N taken 2 at a time from the memory data > set (without replacement). The memory data set will gain one variable, the > marginal probability (margprob ), which is the sum of the joint > probabilities involving each PSU. > > Can I act on margprob in the memory data set one observation at a time? I > presume I would generate a new variable first, assigning it a value of zero > before the loops start > > Basically, in the loop, margprob looks like this: > > replace margprob[J] = margprob[J] + jointprob > replace margprob[K] = margprob[K] + jointprob > > Where J and K are the observation number of the memory data set. > > Does this work? > > Bryan Sayer > Monday to Friday, 8:30 to 5:00 > Phone: (614) 442-7369 > FAX: (614) 442-7329 > BSayer@chrr.osu.edu > > > On 9/9/2011 5:25 PM, Nick Cox wrote: >> >> Your code shades between Stata and incomplete Stata, as you will know. >> >> However, a key principle here is that -postfile- never sees the locals >> in your calling program. It just gets passed their values. That's not >> a problem. It's the way to get round the basic fact that one program's >> locals are invisible to another program. >> >> Also these two lines definitely won't work >> >> local N_total >> egen double `N_total'=total(`count') >> >> The first defines the local N_total as blank, which is equivalent to >> not defining it at all. So, Stata will read the second line as >> >> egen double = total(`count') >> >> which will fail, as no new variable name is supplied. >> >> That said, there is no need to create a variable just to hold a total. >> >> su `count', meanonly >> >> will leave r(sum) in memory and the value of that can be put somewhere >> appropriate, into a local or a scalar or directly into another file. >> >> On Fri, Sep 9, 2011 at 8:49 PM, Bryan Sayer<bsayer@chrr.osu.edu> wrote: >> >>> So I am still a bit confused about how -postfile- works when I want to >>> preserve the data in memory. Specifically, how I generate the variables >>> that I want in my -postfile- output versus the new one I do want to add >>> to >>> the data set in memory. >>> >>> I'm thinking I want to use a local (maybe macro?) variable for my results >>> that go to -postfile-? In other words, how do I distinguish variables >>> between the two files. >>> >>> Also, how do I accumulate results for my new variable that goes in my >>> memory >>> data set. I need to accumulate a sum for two observations in memory on >>> each >>> post to -postfile-. >>> >>> Here is what I have so far, but with the last part calculating the >>> marginal >>> probability (note that the joint probability calculation should be on one >>> line): >>> >>> program jointprob >>> args design infile outfile psu count margprob >>> tempvar psu1 psu2 pi_one pi_joint >>> tempfile results >>> /* set up the file with the joint probabilities */ >>> postfile `results' `psu1' `psu2' using "`outfile'" ,replace >>> /* get the number of observations and the total count */ >>> local N=_N >>> local N_total >>> egen double `N_total'=total(`count') >>> >>> quietly { >>> /* read the input data set and create combinations of N items >>> taken 2 at a time, without replacement */ >>> forvalues J = 1/`N'{ >>> forvalues K = 1/`N'{ >>> if `K'>`J'{ >>> psu1=`psu'[`J'] >>> psu2=`psu'[`K'] >>> >>> pi_joint=(`count[`J']'*`count[`K']'/`N_total') * >>> ((1/(`N_total'-`count[`J']')+(1/(`N_total'-`count[`K']')) >>> post `results' psu1 psu2 pi_joint >>> } >>> } >>> } >>> } >>> >>> >>> Bryan Sayer >>> Monday to Friday, 8:30 to 5:00 >>> Phone: (614) 442-7369 >>> FAX: (614) 442-7329 >>> BSayer@chrr.osu.edu >>> >>> >>> On 9/7/2011 9:44 AM, Roger Newson wrote: >>>> >>>> -postfile- will still work if there is an existing dataset in the >>>> memory. However, the new dataset will be built in a file. >>>> >>>> Best wishes >>>> >>>> Roger >>>> >>>> >>>> Roger B Newson BSc MSc DPhil >>>> Lecturer in Medical Statistics >>>> Respiratory Epidemiology and Public Health Group >>>> National Heart and Lung Institute >>>> Imperial College London >>>> Royal Brompton Campus >>>> Room 33, Emmanuel Kaye Building >>>> 1B Manresa Road >>>> London SW3 6LR >>>> UNITED KINGDOM >>>> Tel: +44 (0)20 7352 8121 ext 3381 >>>> Fax: +44 (0)20 7351 8322 >>>> Email: r.newson@imperial.ac.uk >>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/ >>>> Departmental Web page: >>>> >>>> >>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/ >>>> >>>> >>>> Opinions expressed are those of the author, not of the institution. >>>> >>>> On 07/09/2011 14:40, Bryan Sayer wrote: >>>>> >>>>> -postfile- will post my results, but my reading of how it works seems >>>>> to >>>>> indicate that my original data set cannot be open at the same time. The >>>>> examples appear to me to clear the existing data set from memory. >>>>> >>>>> Admittedly, this is without me having tried anything yet, but am I not >>>>> reading it correctly? >>>>> >>>>> What I need to do is a double loop through the input data set, >>>>> outputting a record on each iteration of each loop. So I need the input >>>>> data set open in memory, and a second file to post the results to. >>>>> >>>>> Are there any examples of something similar? >>>>> >>>>> Thanks! >>>>> >>>>> Bryan Sayer >>>>> Monday to Friday, 8:30 to 5:00 >>>>> Phone: (614) 442-7369 >>>>> FAX: (614) 442-7329 >>>>> BSayer@chrr.osu.edu >>>>> >>>>> >>>>> On 9/6/2011 4:58 PM, Roger Newson wrote: >>>>>> >>>>>> I think you are looking for the -postfile- utility. In Stata, type >>>>>> >>>>>> help postfile >>>>>> >>>>>> to find out more. >>>>>> >>>>>> HTH. >>>>>> >>>>>> Best wishes >>>>>> >>>>>> Roger >>>>>> >>>>>> >>>>>> Roger B Newson BSc MSc DPhil >>>>>> Lecturer in Medical Statistics >>>>>> Respiratory Epidemiology and Public Health Group >>>>>> National Heart and Lung Institute >>>>>> Imperial College London >>>>>> Royal Brompton Campus >>>>>> Room 33, Emmanuel Kaye Building >>>>>> 1B Manresa Road >>>>>> London SW3 6LR >>>>>> UNITED KINGDOM >>>>>> Tel: +44 (0)20 7352 8121 ext 3381 >>>>>> Fax: +44 (0)20 7351 8322 >>>>>> Email: r.newson@imperial.ac.uk >>>>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/ >>>>>> Departmental Web page: >>>>>> >>>>>> >>>>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/ >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Opinions expressed are those of the author, not of the institution. >>>>>> >>>>>> On 06/09/2011 21:53, Bryan Sayer wrote: >>>>>>> >>>>>>> I need to create an output data set that will differ in the content >>>>>>> and >>>>>>> number of observations from the input file. The observations will be >>>>>>> created one at a time, based on the input data set. >>>>>>> >>>>>>> Specifically, I am creating all combinations of N objects taken two >>>>>>> at a >>>>>>> time. I will probably also do permutations. >>>>>>> >>>>>>> The input data set (to start with) consists of N records with two >>>>>>> variables, the primary sampling unit (PSU) and a size variable >>>>>>> associated with the PSU (a count variable). I want to create two >>>>>>> output >>>>>>> data sets. One is each combination of PSU with the associated joint >>>>>>> probability. The second has the same structure as the input data set >>>>>>> but >>>>>>> includes the marginal probability, calculated as the sum of the joint >>>>>>> probabilities associated with the PSU (which are accumulated as each >>>>>>> combination is created). >>>>>>> >>>>>>> The part I am stuck on is how to output the data set of combinations. >>>>>>> Can someone point me to a program that outputs a file as calculations >>>>>>> are made? >>>>>>> >>>>>>> (For those interested, this is for probability proportional to size >>>>>>> (PPS) sampling. See, for example, Levy and Lemeshow "Sampling of >>>>>>> Populations, chapter 11). >>>>>>> >>>>>>> Here is an example of one stratum: >>>>>>> >>>>>>> Input data set (with marginal probability added) >>>>>>> >>>>>>> District Size pi(i) >>>>>>> LUWEERO 12,466 0.916858 >>>>>>> KAMPALA 3,459 0.542857 >>>>>>> TORORO 2,815 0.448739 >>>>>>> KAMULI 549 0.091546 >>>>>>> Total 19,289 >>>>>>> >>>>>>> >>>>>>> Output data set: >>>>>>> >>>>>>> COMBINATIONS pi(I,j) >>>>>>> LUWEERO,KAMPALA 0.468854 >>>>>>> LUWEERO,TORORO 0.377069 >>>>>>> LUWEERO,KAMULI 0.070934 >>>>>>> KAMPALA,TORORO 0.062531 >>>>>>> KAMPALA,KAMULI 0.011473 >>>>>>> TORORO,KAMULI 0.009139 >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> * >>>>>> * For searches and help try: >>>>>> * http://www.stata.com/help.cgi?search >>>>>> * http://www.stata.com/support/statalist/faq >>>>>> * http://www.ats.ucla.edu/stat/stata/ >>>>> >>>>> * >>>>> * For searches and help try: >>>>> * http://www.stata.com/help.cgi?search >>>>> * http://www.stata.com/support/statalist/faq >>>>> * http://www.ats.ucla.edu/stat/stata/ >>>> >>>> * >>>> * For searches and help try: >>>> * http://www.stata.com/help.cgi?search >>>> * http://www.stata.com/support/statalist/faq >>>> * http://www.ats.ucla.edu/stat/stata/ >>> >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/statalist/faq >>> * http://www.ats.ucla.edu/stat/stata/ >>> >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/statalist/faq >> * http://www.ats.ucla.edu/stat/stata/ > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Creating a second output data set***From:*Bryan Sayer <bsayer@chrr.osu.edu>

**References**:**st: Creating a second output data set***From:*Bryan Sayer <bsayer@chrr.osu.edu>

**Re: st: Creating a second output data set***From:*Roger Newson <r.newson@imperial.ac.uk>

**Re: st: Creating a second output data set***From:*Bryan Sayer <bsayer@chrr.osu.edu>

**Re: st: Creating a second output data set***From:*Roger Newson <r.newson@imperial.ac.uk>

**Re: st: Creating a second output data set***From:*Bryan Sayer <bsayer@chrr.osu.edu>

**Re: st: Creating a second output data set***From:*Nick Cox <njcoxstata@gmail.com>

**Re: st: Creating a second output data set***From:*Bryan Sayer <bsayer@chrr.osu.edu>

**Re: st: Creating a second output data set***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**Re: st: SPSS to STATA** - Next by Date:
**Re: st: Creating a second output data set** - Previous by thread:
**Re: st: Creating a second output data set** - Next by thread:
**Re: st: Creating a second output data set** - Index(es):