
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Tolerance for -merge- variable

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: Tolerance for -merge- variable
Date	Thu, 29 Mar 2012 16:50:18 +0100

On your bottom line:

I don't disagree that people often want merging with tolerance, or
think they do. It would be all too likely to bite them one way or the
other. The problem is that it _cannot_ be implemented as people would
want in terms of decimals without pitfalls. There is now a substantial
expository literature on precision, in the Stata Journal, FAQs and the
StataCorp blog, but all too much evidence that many Stata users don't
understand that most decimal fractions cannot be held exactly in binary.

I'd argue that it's best that they take responsibility for
implementing their own tolerance. If StataCorp move one bit (so to
speak) on this point I will (pretend to) eat one of my hats.

Nick
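
One way to take responsibility for your own tolerance, sketched here
under assumptions: if the identifiers are meant to carry exactly one
decimal place, scale and round them to an integer key in each dataset
and merge on that key. The variable names (id, id10, x, y) and the toy
data are hypothetical, not taken from the original poster's files.

```stata
* Second dataset: id held as a double.
clear
input double id x
1.1 10
2.1 20
3.1 30
end
* Integer key: 1.1 becomes 11, 2.1 becomes 21, and so on.
generate long id10 = round(id * 10)
drop id
tempfile second
save `second'

* First dataset: id held as a float, so its stored values differ
* slightly from the doubles above.
clear
input float id y
1.1 100
2.1 200
3.1 300
end
generate long id10 = round(id * 10)
drop id

* Integers of this size are held exactly, so the match is exact.
merge 1:1 id10 using `second'
assert _merge == 3
```

The tolerance is then explicit in the multiplier: 10 for one decimal
place, 100 for two. Whether rounding is appropriate for your
identifiers is a judgement only you can make.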

On Thu, Mar 29, 2012 at 4:37 PM, Rob Ploutz-Snyder
<[email protected]> wrote:
> Thank you Nick for your prompt reply to my post.  You have clarified
> my problem exactly--and more clearly than I.
>
> The precision problem  becomes even more troublesome when different
> software play the game. In my case, I rec'd one data set with ID's
> that were generated in Excel... I've no idea what the precision is
> there but I know that it doesn't align nicely with the other dataset
> that generated those (decimal) IDs in Stata.
>
> Alas--I guess I am stuck with converting IDs to string for the merge.
> I had been avoiding this solution because I had assumed that I
> couldn't then use a string ID variable as an identifier in Stata's
> -xtmixed- or other xt routines, and so would have to back-convert
> into a numeric ID.
>
> The good news is that I seem to be able to use a string variable for
> that too, so I suppose Stata's -merge- behavior is alright in the end.
>
> ...I stubbornly admit that I still wish it had a tolerance option
> that we could tweak so that, with our instruction, it would treat ID's
> within ?? decimals as equal.

> On Wed, Mar 28, 2012 at 1:43 PM, Nick Cox <[email protected]> wrote:
>> My understanding is that there is _no_ tolerance. Equal values match;
>> unequal values don't. What implies otherwise?
>>
>> More specifically,
>>
>> 1. Like you, I wouldn't by preference use a non-integer numeric
>> variable as an identifier, largely because of worries that things like
>> this might happen.
>>
>> 2. This is expectable if one variable is -float- and the other
>> -double- as then x.1 (or whatever) will be stored as different binary
>> approximations. See documentation on precision, passim.
>>
>> 3. If the variables are the same type, please show us (a) minimal
>> datasets and (b) -merge- syntax that show your problem. But you
>> should first use hexadecimal formats to see if the identifiers really
>> are identical. If not, -merge- is behaving as expected.
>>
>> 4. Otherwise, my best advice is that conversion to string must use an
>> explicit format argument to maximise your chances, e.g. -string(myvar,
>> "%18.1f")-.
>>
>> Nick
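
Points 2 to 4 above can be checked on a toy example. This is only a
sketch: the variable names (idf, idd, ids_f, ids_d) are made up for
illustration. It stores 1.1 as a float and as a double, displays both
in hexadecimal format, and then converts each to string with an
explicit format.

```stata
clear
set obs 1
generate float  idf = 1.1
generate double idd = 1.1

* %21x shows the binary value actually stored.
format idf idd %21x
list idf idd

* The float and the double are different approximations to 1.1.
assert idf != idd
* float() rounds the literal to float precision before comparing.
assert idf == float(1.1)

* An explicit one-decimal format gives the same string for both.
generate ids_f = string(idf, "%18.1f")
generate ids_d = string(idd, "%18.1f")
assert ids_f == ids_d
```

If the hexadecimal displays differ for identifiers you believe to be
equal, -merge- is doing exactly what it should in not matching them.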
>>
>> On Wed, Mar 28, 2012 at 7:26 PM, Rob Ploutz-Snyder
>> <[email protected]> wrote:
>>
>>> I notice that when I have an ID variable stored with 1 decimal place
>>> (ex. id=id+0.1) in two separate data files, the merge command
>>> sometimes fails to equate ID values that are equal within rounding
>>> error.  This is particularly problematic if Stata generated one of
>>> these id variables (ex. gen idnew=id+0.1) and Excel or some other
>>> software generated the id variable in the other dataset (including
>>> hand data entry).
>>>
>>> Is there a way to adjust the tolerance that -merge- uses on the ID var
>>> that is in both data sets so that it links properly out to (for
>>> example) 1 or 2 or 3 digits past the decimal??
>>>
>>> My only solution so far is to generate a string variable from the
>>> numeric ID variables in each dataset and then use the string variable
>>> for the -merge- but it seems like there should be a simpler way to
>>> tweak the tolerance within -merge-.  My other solution is to try to
>>> avoid circumstances when the unique ID is a non-integer, but that's
>>> not always an option for me.
>>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
