Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Controling precision for multiple runs of same code

From	Melanie Leis <[email protected]>
To	statalist <[email protected]>
Subject	Re: st: Controling precision for multiple runs of same code
Date	Mon, 3 Jun 2013 22:04:13 -0400

Thank you, I appreciate your recommendations about not using m:m.

Nevertheless, in each run, out of 47,267,047 observations and 14
variables, I only get a difference in 102 observations and 2
variables. This seems to me like something that should be fixable.

I insist on using the merge m:m because, overall, it yields the most
reasonable numbers for what I'm trying to do. Joinby gave me a total
population about 60 million higher than what I need (and what I get
with the merge m:m).
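
To put that gap in concrete terms, here is a toy sketch of how the two commands count rows within a key group. The variables id, a and b and the tempfile names are made up for illustration and are not from my files. With 2 master rows and 3 using rows in one group, merge m:m should leave 3 rows (rows are paired by position and the last row of the shorter side is reused), while joinby should leave 6 (every pairwise combination).

```stata
* Toy data only: id, a and b are made-up names, not variables from my files
clear
input id b
1 10
1 20
1 30
end
tempfile usingdata
save `usingdata'

clear
input id a
1 100
1 200
end
tempfile masterdata
save `masterdata'

* merge m:m pairs rows by position within id: 1st with 1st, 2nd with 2nd,
* and the last master row is reused once the master side runs out
merge m:m id using `usingdata'
local n_merge = _N
display "rows after merge m:m: " `n_merge'

* joinby forms every pairwise combination within id
use `masterdata', clear
joinby id using `usingdata'
local n_joinby = _N
display "rows after joinby: " `n_joinby'
```

Because merge m:m pairs by position, which rows end up together depends on the order of the rows inside each group, whereas the joinby result does not.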

I understand that the lack of context and the fact that I insist on
using a merge m:m make this difficult.  Nevertheless, any ideas on how
I could fix the code below to get replicable results without taking
out the merge m:m would be greatly appreciated.
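
One possibility I can think of, untested on the full data: fix the row order within each statefips agecat_census group before the merge. This rests on two assumptions I have not confirmed for my files: that merge sorts a dataset on the key variables only when it is not already sorted on them, and that the run-to-run differences come from -sort- ordering tied observations randomly unless the stable option is given. The sketch below uses toy data with made-up values (share stands in for whatever ABCD.dta contributes) rather than baseline_4.dta and ABCD.dta.

```stata
* Toy stand-in for ABCD.dta; share is a made-up variable
clear
input statefips agecat_census share
1 1 .25
1 1 .75
end
* stable keeps tied rows in their current relative order
sort statefips agecat_census, stable
tempfile abcd_toy
save `abcd_toy'

* Toy stand-in for baseline_4.dta
clear
input statefips agecat_census pop
1 1 500
1 1 300
end
sort statefips agecat_census, stable

* Both sides now have a defined order within each key group,
* so the positional pairing done by m:m has something fixed to work from
merge m:m statefips agecat_census using `abcd_toy'
list
```

This only helps if both files themselves arrive in the same row order on every run; otherwise the sort would need extra variables that identify each row uniquely.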

Thank you!

Melanie


On Mon, Jun 3, 2013 at 8:05 PM, Phil Clayton
<[email protected]> wrote:
> The first thing that stands out is:
> merge m:m ...
>
> You should pretty much never do this. Don't take my word for it - the manual entry for -merge- says "m:m specifies a many-to-many merge and is a bad idea" and explains why, including that you might get non-reproducible results.
>
> You probably want m:1. If for some reason you definitely need to join all records in a many-to-many fashion based on one or more ID variables, you should use -joinby-.
>
> Phil
>
> On 04/06/2013, at 9:55 AM, Melanie Leis <[email protected]> wrote:
>
>> Hello,
>>
>> I'm having trouble with a section of my code that yields different
>> results each time I run it.
>>
>> I start out with a dataset, baseline_4.dta, which has 47,267,047
>> observations and 16 variables, and run this:
>>
>> merge m:m statefips agecat_census using "ABCD.dta"
>> assert _merge==3
>> drop _merge
>> egen tot_pop=sum(pop), by(statefips countyfips agecat_census sexcat ///
>>     racecat iprcat_mpact iprcat coverage groupsize)
>> checkpop
>> rename pop oldpop
>> gen pop=tot_pop*prob_agecat_mpact
>> checkpop
>> collapse (sum) pop, by(statefips countyfips agecat_mpact sexcat ///
>>     racecat iprcat_mpact iprcat coverage groupsize)
>> checkpop
>> sum
>> sort _all
>> save "baseline_5.dta", replace
>>
>> checkpop is a program that tells me what my total population is each
>> time I run it. My total population is the same before and after the
>> collapse function (see results below).
>>
>> At the end, my total population and my number of observations in
>> baseline_5.dta are different every time I run this. I suspect the
>> difference is in rounding when it executes the gen pop line, but I've
>> tried replacing it with
>>
>> gen double pop=tot_pop*prob_agecat_mpact
>>
>> and
>>
>> gen float pop=tot_pop*prob_agecat_mpact
>>
>> And I still get differences.
>>
>> I tried using
>>
>> gen long pop=tot_pop*prob_agecat_mpact
>>
>> But I lost too much precision by doing this.
>>
>> Could you please recommend a solution to obtain the exact same numbers
>> in each run, without sacrificing precision?
>>
>> Thanks!
>>
>> Melanie
>>
>> The log file for 2 of the runs I've done:
>>
>> ************* RUN A ***********************
>>
>> . merge m:m statefips agecat_census using "ABCD.dta"
>>
>>    Result                           # of obs.
>>    -----------------------------------------
>>    not matched                             0
>>    matched                        47,267,047  (_merge==3)
>>    -----------------------------------------
>>
>> . assert _merge==3
>>
>> . drop _merge
>>
>> . egen tot_pop=sum(pop), by(statefips countyfips agecat_census sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>
>> . checkpop
>>
>> Total pop:       347,095,179
>> Observations:     47,267,047
>> Missing:                   0
>>
>> . rename pop oldpop
>>
>> . gen pop=tot_pop*prob_agecat_mpact
>>
>> . checkpop
>>
>> Total pop:       332,455,972
>> Observations:     47,267,047
>> Missing:                   0
>>
>> . collapse (sum) pop, by(statefips countyfips agecat_mpact sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>
>> . checkpop
>>
>> Total pop:       332,455,972
>> Observations:     36,351,520
>> Missing:                   0
>>
>>
>> ************** RUN B *************
>> . merge m:m statefips agecat_census using "ABCD.dta"
>>
>>    Result                           # of obs.
>>    -----------------------------------------
>>    not matched                             0
>>    matched                        47,267,047  (_merge==3)
>>    -----------------------------------------
>>
>> . assert _merge==3
>>
>> . drop _merge
>>
>> . egen tot_pop=sum(pop), by(statefips countyfips agecat_census sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>
>> . checkpop
>>
>> Total pop:       347,095,179
>> Observations:     47,267,047
>> Missing:                   0
>>
>> . rename pop oldpop
>>
>> . gen pop=tot_pop*prob_agecat_mpact
>>
>> . checkpop
>>
>> Total pop:       332,455,928
>> Observations:     47,267,047
>> Missing:                   0
>>
>> . collapse (sum) pop, by(statefips countyfips agecat_mpact sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>
>> . checkpop
>>
>> Total pop:       332,455,928
>> Observations:     36,351,515
>> Missing:                   0
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
