Re: st: Creating a second output data set

From	Bryan Sayer <[email protected]>
To	[email protected]
Subject	Re: st: Creating a second output data set
Date	Fri, 09 Sep 2011 18:08:41 -0400

Great, thanks!  I tend to think more along the lines of FORTRAN.

A local for the total makes sense. The total is used in the calculation of the joint probability.

So locals seem to make sense for what I send to -postfile- (though I realize I probably don't have it correct yet), provided it is OK to change their values during the loops. But maybe they should be scalars?

But I am still stuck on the new variable I need to add to the input data set (the one in memory). I'm going to call that the "memory data set" to distinguish it from the "postfile data set".

The memory data set is the input for this calculation. Each observation in the memory data set is one PSU. The postfile data set consists of each possible pair of PSUs from the memory data set, thus the number of observations is the number of combinations of _N taken 2 at a time from the memory data set (without replacement). The memory data set will gain one variable, the marginal probability (margprob), which is the sum of the joint probabilities involving each PSU.

Can I act on margprob in the memory data set one observation at a time? I presume I would generate a new variable first, assigning it a value of zero before the loops start.


Basically, in the loop, margprob looks like this:

replace margprob[J] = margprob[J] + jointprob
replace margprob[K] = margprob[K] + jointprob

where J and K are observation numbers in the memory data set.

Does this work?
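
A minimal sketch of how such a per-observation update might be written, assuming a variable margprob that starts at zero. Stata does not accept a subscript on the left-hand side of -replace-, so the sketch restricts each command to one observation with -in- instead. The observation numbers and the joint probability value below are hypothetical placeholders, chosen only for illustration.

```stata
* Sketch only: J, K and jointprob hold made-up values.
clear
set obs 4
generate double margprob = 0

local J 1
local K 3
local jointprob 0.25

* -replace margprob[`J'] = ...- is not valid syntax;
* -in- limits the update to a single observation.
replace margprob = margprob + `jointprob' in `J'
replace margprob = margprob + `jointprob' in `K'

list margprob
```

Inside a double loop the same two -replace- commands would run once per pair, so each observation accumulates the joint probabilities of every pair it belongs to.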

Bryan Sayer
Monday to Friday, 8:30 to 5:00
Phone: (614) 442-7369
FAX:  (614) 442-7329
[email protected]


On 9/9/2011 5:25 PM, Nick Cox wrote:

Your code shades between Stata and incomplete Stata, as you will know.

However, a key principle here is that -postfile- never sees the locals
in your calling program. It just gets passed their values. That's not
a problem. It's the way to get round the basic fact that one program's
locals are invisible to another program.
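
The point about values being passed can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the data, the variable names obs and x_sq, and the macro names are all hypothetical. The handle comes from -tempname-, which is what the -postfile- documentation recommends, and each posted value is an expression in parentheses.

```stata
* Sketch only: variable names below are hypothetical.
clear
set obs 5
generate double x = _n * 10
local n_before = _N

tempname results
tempfile pairs
postfile `results' int obs double x_sq using "`pairs'", replace

forvalues i = 1/3 {
    * The local is expanded to its value before -post- runs, so
    * -post- only ever receives expressions in parentheses.
    local sq = x[`i']^2
    post `results' (`i') (`sq')
}
postclose `results'

* Posting leaves the data in memory as they were.
assert _N == `n_before'
describe using "`pairs'"
```

Changing the value of a local between iterations is harmless here, because each -post- command is handed the current value at the moment it executes.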

Also these two lines definitely won't work

        local N_total
        egen double `N_total'=total(`count')

The first defines the local N_total as blank, which is equivalent to
not defining it at all. So, Stata will read the second line as

egen double = total(`count')

which will fail, as no new variable name is supplied.

That said, there is no need to create a variable just to hold a total.

su `count', meanonly

will leave r(sum) in memory and the value of that can be put somewhere
appropriate, into a local or a scalar or directly into another file.
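
A short sketch of that suggestion, using the four sizes from the example stratum quoted further down the thread. The variable name size and the names N_total and N_total_s are assumptions made for illustration.

```stata
* Sketch only: sizes taken from the example stratum in this thread.
clear
input long size
12466
3459
2815
549
end

* -summarize, meanonly- leaves the total in r(sum) without creating a variable.
summarize size, meanonly

* Copy r(sum) straight away, before another r-class command overwrites it.
local N_total = r(sum)
scalar N_total_s = r(sum)

display "Total as local:  `N_total'"
display "Total as scalar: " N_total_s
```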

On Fri, Sep 9, 2011 at 8:49 PM, Bryan Sayer<[email protected]>  wrote:

So I am still a bit confused about how -postfile- works when I want to
preserve the data in memory.  Specifically, how I generate the variables
that I want in my -postfile- output versus the new one I do want to add to
the data set in memory.

I'm thinking I want to use a local (maybe macro?) variable for my results
that go to -postfile-?  In other words, how do I distinguish variables
between the two files?

Also, how do I accumulate results for my new variable that goes in my memory
data set?  I need to accumulate a sum for two observations in memory on each
post to -postfile-.

Here is what I have so far, but with the last part calculating the marginal
probability (note that the joint probability calculation should be on one
line):

program jointprob
args design infile outfile psu count margprob
tempvar psu1 psu2 pi_one pi_joint
tempfile results
        /* set up the file with the joint probabilities */
        postfile `results' `psu1' `psu2' using "`outfile'" ,replace
        /* get the number of observations and the total count */
        local N=_N
        local N_total
        egen double `N_total'=total(`count')

quietly {
        /* read the input data set and create combinations of N items
           taken 2 at a time, without replacement */
        forvalues J = 1/`N'{
                forvalues K = 1/`N'{
                        if `K'>`J'{
                                psu1=`psu'[`J']
                                psu2=`psu'[`K']

  pi_joint=(`count[`J']'*`count[`K']'/`N_total') *
((1/(`N_total'-`count[`J']')+(1/(`N_total'-`count[`K']'))
                                post `results' psu1 psu2 pi_joint
                                }
                }
        }
}
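
For comparison, here is one way the program above might be completed. It is a hedged sketch rather than the poster's final code, and it makes several assumptions: the PSU identifier is a string variable, the unused design and infile arguments are dropped, the handle comes from -tempname-, and the loop bounds replace the if `K'>`J' test. The joint probability formula is the one in the posted code, with the parentheses balanced.

```stata
capture program drop jointprob
program define jointprob
    * Assumed arguments: output file, string PSU variable,
    * size variable, name for the new marginal-probability variable.
    args outfile psu size margprob
    tempname results

    postfile `results' str32 psu1 str32 psu2 double pi_joint using "`outfile'", replace

    summarize `size', meanonly
    local N_total = r(sum)
    local N = _N
    generate double `margprob' = 0

    quietly {
        * All pairs J < K, i.e. N items taken 2 at a time.
        forvalues J = 1/`=`N'-1' {
            forvalues K = `=`J'+1'/`N' {
                local cJ = `size'[`J']
                local cK = `size'[`K']
                local pij = (`cJ'*`cK'/`N_total') * (1/(`N_total'-`cJ') + 1/(`N_total'-`cK'))
                post `results' (`psu'[`J']) (`psu'[`K']) (`pij')
                * Accumulate the marginal probability for both PSUs in the pair.
                replace `margprob' = `margprob' + `pij' in `J'
                replace `margprob' = `margprob' + `pij' in `K'
            }
        }
    }
    postclose `results'
end

* Usage on the example stratum from this thread.
clear
input str12 district long size
"LUWEERO" 12466
"KAMPALA" 3459
"TORORO"  2815
"KAMULI"  549
end
tempfile pairs
jointprob "`pairs'" district size margprob
list district size margprob

preserve
use "`pairs'", clear
list
restore
```

Holding the intermediate results in locals keeps the sketch short; scalars obtained from -tempname- would serve equally well and avoid the conversion of numbers to text.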


Bryan Sayer


On 9/7/2011 9:44 AM, Roger Newson wrote:


-postfile- will still work if there is an existing dataset in the
memory. However, the new dataset will be built in a file.

Best wishes

Roger


Roger B Newson BSc MSc DPhil
Lecturer in Medical Statistics
Respiratory Epidemiology and Public Health Group
National Heart and Lung Institute
Imperial College London
Royal Brompton Campus
Room 33, Emmanuel Kaye Building
1B Manresa Road
London SW3 6LR
UNITED KINGDOM
Tel: +44 (0)20 7352 8121 ext 3381
Fax: +44 (0)20 7351 8322
Email: [email protected]
Web page: http://www.imperial.ac.uk/nhli/r.newson/
Departmental Web page:

http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/


Opinions expressed are those of the author, not of the institution.

On 07/09/2011 14:40, Bryan Sayer wrote:


-postfile- will post my results, but my reading of how it works seems to
indicate that my original data set cannot be open at the same time. The
examples appear to me to clear the existing data set from memory.

Admittedly, this is without me having tried anything yet, but am I not
reading it correctly?

What I need to do is a double loop through the input data set,
outputting a record on each iteration of each loop. So I need the input
data set open in memory, and a second file to post the results to.

Are there any examples of something similar?

Thanks!

Bryan Sayer


On 9/6/2011 4:58 PM, Roger Newson wrote:


I think you are looking for the -postfile- utility. In Stata, type

help postfile

to find out more.

HTH.

Best wishes

Roger



On 06/09/2011 21:53, Bryan Sayer wrote:


I need to create an output data set that will differ in the content and
number of observations from the input file. The observations will be
created one at a time, based on the input data set.

Specifically, I am creating all combinations of N objects taken two at a
time. I will probably also do permutations.

The input data set (to start with) consists of N records with two
variables, the primary sampling unit (PSU) and a size variable
associated with the PSU (a count variable). I want to create two output
data sets. One is each combination of PSU with the associated joint
probability. The second has the same structure as the input data set but
includes the marginal probability, calculated as the sum of the joint
probabilities associated with the PSU (which are accumulated as each
combination is created).

The part I am stuck on is how to output the data set of combinations.
Can someone point me to a program that outputs a file as calculations
are made?

(For those interested, this is for probability proportional to size
(PPS) sampling. See, for example, Levy and Lemeshow, "Sampling of
Populations", chapter 11.)

Here is an example of one stratum:

Input data set (with marginal probability added)

District     Size      pi(i)
LUWEERO      12,466    0.916858
KAMPALA       3,459    0.542857
TORORO        2,815    0.448739
KAMULI          549    0.091546
Total        19,289


Output data set:

COMBINATIONS        pi(i,j)
LUWEERO,KAMPALA     0.468854
LUWEERO,TORORO      0.377069
LUWEERO,KAMULI      0.070934
KAMPALA,TORORO      0.062531
KAMPALA,KAMULI      0.011473
TORORO,KAMULI       0.009139
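
As a hand check on the two tables, the first joint probability and the first marginal probability can be recomputed with scalars. The formula used is the one that appears in the program posted later in the thread; the scalar names are hypothetical. The results should agree with the tables to roughly six decimal places, allowing for rounding in the tabulated values.

```stata
* Sketch only: recompute two entries of the example by hand.
scalar Ntot = 19289
scalar cLUW = 12466
scalar cKAM = 3459

* Joint probability for the pair LUWEERO, KAMPALA.
scalar pi12 = (cLUW*cKAM/Ntot) * (1/(Ntot - cLUW) + 1/(Ntot - cKAM))
display "pi(LUWEERO,KAMPALA) = " %9.6f pi12

* Marginal for LUWEERO as the sum of its three tabulated joint probabilities.
scalar piLUW = 0.468854 + 0.377069 + 0.070934
display "pi(LUWEERO)         = " %9.6f piLUW
```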

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/


