Re: st: Controling precision for multiple runs of same code

From	Phil Clayton <[email protected]>
To	[email protected]
Subject	Re: st: Controling precision for multiple runs of same code
Date	Tue, 4 Jun 2013 13:10:34 +1000

Sorry, but you've already received consistent advice from several experienced users as well as references to the manual, all saying the same thing:
Do not use merge m:m

If merging m:m is the answer then the question is probably wrong. We can't help you further without you describing the structure of the two datasets and what you're trying to achieve.

Trying to "fix" the problem caused by the -merge- is treating the symptom, not the disease. The manual gives some advice in this regard (e.g., using a stable sort before merging), but the best you're likely to achieve with this approach is reproducibly misleading results. Just because your numbers are "reasonable" doesn't mean they are correct.
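
To see why the sort order matters, here is a minimal sketch on made-up data (the variables id, m and u and the tempfile name are hypothetical, not taken from your datasets). As the manual describes it, within each key -merge m:m- pairs the first master observation with the first using observation, the second with the second, and so on, reusing the last observation of the shorter group. It does not form all pairwise combinations the way -joinby- does.

```stata
* Hypothetical using data: two rows sharing the same key
clear
input byte id str1 u
1 "a"
1 "b"
end
tempfile using_data
save `using_data'

* Hypothetical master data: three rows sharing the same key
clear
input byte id str1 m
1 "x"
1 "y"
1 "z"
end

* Rows are paired by position within id, so the result has 3 rows, not 2 x 3 = 6
merge m:m id using `using_data'
assert _N == 3
list id m u, sepby(id)

* For comparison, -joinby- forms every pairwise combination within id
drop _merge u
joinby id using `using_data'
assert _N == 6
```

Because the pairing depends on the order of the rows within the key at the time of the merge, any step that reorders tied observations (such as a sort without the stable option) can change the merged result from run to run.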

My guess is that you're essentially trying to "look up" values in ABCD.dta based on the categories defined by statefips & agecat_census, but that you have duplicate entries in ABCD. If this is the case I would recommend cleaning up ABCD so that it only has one entry for each combination of statefips & agecat_census (or finding a third ID variable), then repeating the merge as m:1.
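
A minimal sketch of that clean-up, on made-up data standing in for ABCD.dta and baseline_4.dta (all values are invented). It assumes the duplicates are exact copies; if they differ in other variables, you would need to decide which entry is correct rather than rely on -duplicates drop-.

```stata
* Hypothetical toy data standing in for ABCD.dta (one exact duplicate row)
clear
input byte statefips byte agecat_census double prob_agecat_mpact
1 1 0.40
1 1 0.40
1 2 0.60
2 1 0.55
end

* Report how many statefips/agecat_census combinations are repeated
duplicates report statefips agecat_census

* Drop only rows that are identical on every variable
duplicates drop

* Stop with an error unless the key now identifies each row uniquely
isid statefips agecat_census

tempfile lookup
save `lookup'

* Hypothetical toy data standing in for baseline_4.dta
clear
input byte statefips byte agecat_census double pop
1 1 100
1 1 250
1 2 300
2 1 175
end

* Many master rows per key, exactly one lookup row per key
merge m:1 statefips agecat_census using `lookup'
assert _merge == 3
drop _merge
```

The -isid- line is the check that makes m:1 safe: it fails loudly if the key still does not identify observations uniquely, instead of letting the merge pair rows arbitrarily.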

Phil

On 04/06/2013, at 12:04 PM, Melanie Leis <[email protected]> wrote:

> Thank you, I appreciate your recommendations about not using m:m.
> 
> Nevertheless, in each run, out of 47,267,047 observations and 14
> variables, I only get a difference in 102 observations and 2
> variables. This seems to me like something that should be fixable.
> 
> I insist on using the merge m:m because, overall, it yields the most
> reasonable numbers for what I'm trying to do. Joinby gave me a total
> population about 60 million higher than what I need (and what I get
> with the merge m:m).
> 
> I understand that the lack of context and the fact that I insist on
> using a merge m:m make this difficult.  Nevertheless, any ideas on how
> I could fix the code below to get replicable results without taking
> out the merge m:m would be greatly appreciated.
> 
> Thank you!
> 
> Melanie
> 
> 
> On Mon, Jun 3, 2013 at 8:05 PM, Phil Clayton
> <[email protected]> wrote:
>> The first thing that stands out is:
>> merge m:m ...
>> 
>> You should pretty much never do this. Don't take my word for it - the manual entry for -merge- says "m:m specifies a many-to-many merge and is a bad idea" and explains why, including that you might get non-reproducible results.
>> 
>> You probably want m:1. If for some reason you definitely need to join all records in a many-to-many fashion based on one or more ID variables, you should use -joinby-.
>> 
>> Phil
>> 
>> On 04/06/2013, at 9:55 AM, Melanie Leis <[email protected]> wrote:
>> 
>>> Hello,
>>> 
>>> I'm having trouble with a section of my code that yields different
>>> results each time I run it.
>>> 
>>> I start out with a dataset, baseline_4.dta, which has 47,267,047
>>> observations and 16 variables, and run this:
>>> 
>>> merge m:m statefips agecat_census using "ABCD.dta"
>>> assert _merge==3
>>> drop _merge
>>> egen tot_pop=sum(pop), by(statefips countyfips agecat_census sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>> checkpop
>>> rename pop oldpop
>>> gen pop=tot_pop*prob_agecat_mpact
>>> checkpop
>>> collapse (sum) pop, by(statefips countyfips agecat_mpact sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>> checkpop
>>> sum
>>> sort _all
>>> save "baseline_5.dta", replace
>>> 
>>> checkpop is a program that tells me what my total population is each
>>> time I run it. My total population is the same before and after the
>>> collapse command (see results below).
>>> 
>>> At the end, my total population and my number of observations in
>>> baseline_5.dta are different every time I run this. I suspect the
>>> difference is in rounding when it executes the gen pop line, but I've
>>> tried replacing it with
>>> 
>>> gen double pop=tot_pop*prob_agecat_mpact
>>> 
>>> and
>>> 
>>> gen float pop=tot_pop*prob_agecat_mpact
>>> 
>>> And I still get differences.
>>> 
>>> I tried using
>>> 
>>> gen long pop=tot_pop*prob_agecat_mpact
>>> 
>>> But I lost too much precision by doing this.
>>> 
>>> Could you please recommend a solution to obtain the exact same numbers
>>> in each run, without sacrificing precision?
>>> 
>>> Thanks!
>>> 
>>> Melanie
>>> 
>>> The log file for 2 of the runs I've done:
>>> 
>>> ************* RUN A ***********************
>>> 
>>> . merge m:m statefips agecat_census using "ABCD.dta"
>>> 
>>>   Result                           # of obs.
>>>   -----------------------------------------
>>>   not matched                             0
>>>   matched                        47,267,047  (_merge==3)
>>>   -----------------------------------------
>>> 
>>> . assert _merge==3
>>> 
>>> . drop _merge
>>> 
>>> . egen tot_pop=sum(pop), by(statefips countyfips agecat_census sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>> 
>>> . checkpop
>>> 
>>> Total pop:       347,095,179
>>> Observations:     47,267,047
>>> Missing:                   0
>>> 
>>> . rename pop oldpop
>>> 
>>> . gen pop=tot_pop*prob_agecat_mpact
>>> 
>>> . checkpop
>>> 
>>> Total pop:       332,455,972
>>> Observations:     47,267,047
>>> Missing:                   0
>>> 
>>> . collapse (sum) pop, by(statefips countyfips agecat_mpact sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>> 
>>> . checkpop
>>> 
>>> Total pop:       332,455,972
>>> Observations:     36,351,520
>>> Missing:                   0
>>> 
>>> 
>>> ************** RUN B *************
>>> . merge m:m statefips agecat_census using "ABCD.dta"
>>> 
>>>   Result                           # of obs.
>>>   -----------------------------------------
>>>   not matched                             0
>>>   matched                        47,267,047  (_merge==3)
>>>   -----------------------------------------
>>> 
>>> . assert _merge==3
>>> 
>>> . drop _merge
>>> 
>>> . egen tot_pop=sum(pop), by(statefips countyfips agecat_census sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>> 
>>> . checkpop
>>> 
>>> Total pop:       347,095,179
>>> Observations:     47,267,047
>>> Missing:                   0
>>> 
>>> . rename pop oldpop
>>> 
>>> . gen pop=tot_pop*prob_agecat_mpact
>>> 
>>> . checkpop
>>> 
>>> Total pop:       332,455,928
>>> Observations:     47,267,047
>>> Missing:                   0
>>> 
>>> . collapse (sum) pop, by(statefips countyfips agecat_mpact sexcat racecat iprcat_mpact iprcat coverage groupsize)
>>> 
>>> . checkpop
>>> 
>>> Total pop:       332,455,928
>>> Observations:     36,351,515
>>> Missing:                   0
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
