Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | László Sándor <sandorl@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: reversible -destring-, precision, longs v doubles |
Date | Wed, 7 Aug 2013 08:30:45 -0400 |
Thanks, Sergiy, Nick.

FWIW, I can just say that I have many, many data sources with the same
identifiers. Actually, -egen group()- would not guarantee the same
numerical ids for the same strings if the files to be merged cover
different universes. And it is error-prone to leave IDs behind. I am not
sure the risk would be lower than with converting to numbers. But I
appreciate the advice, and I will keep it in mind.

Thanks!

On Tue, Aug 6, 2013 at 6:15 PM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
> Laszlo, why don't you create your own ids? Consider the following example:
>
> sysuse auto, clear
> isid make
> egen long id=group(make)
> isid id
> list id make
>
> Your generated ids will be 'nice' in the sense that you don't need to
> worry about leading zeroes, they will be numeric, and they will fit
> within the limits of the long type. Even if you have to merge multiple
> different files it's doable with a few more lines, and it saves you
> a lot of headache later on. It is also a more universal approach, as
> neither float nor double would be able to accommodate something like a
> 12-char wide region followed by a 12-char wide PSU followed by a 12-char
> wide HH number id, which is easily handled by -egen, group()-. And if you
> need any components of the ID separately (like a region code in the
> previous example), extract them before converting the IDs into
> numeric form.
>
> All credit of course goes to NJC:
> http://www.stata.com/support/faqs/data-management/creating-group-identifiers/
>
> Best, Sergiy Radyakin
>
> PS: It seems that statistical offices are not just 'fond of 10 digits
> or more' as you write, but they are simply using software that
> handles large numbers as strings. CSPro is one such example. One
> simply declares the width of the field in digits, whether a decimal
> point is present or implied, etc. That is very flexible, and you can
> have an ID of any length.
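The "few more lines" for multiple files could look something like the sketch below. It builds one crosswalk from the union of the string ids in all files, so that the same string gets the same numeric id everywhere. The file names (a.dta, b.dta, crosswalk.dta) and the variable name idstr are hypothetical, and this is only one possible approach, not necessarily what Sergiy had in mind.

```stata
* Sketch only: a.dta, b.dta, crosswalk.dta and idstr are assumed names.

* 1. Pool the string ids from every file and keep one row per id.
use idstr using a.dta, clear
append using b.dta, keep(idstr)
duplicates drop idstr, force

* 2. Number the pooled ids once, so the mapping is shared by all files.
egen long id = group(idstr)
isid id
save crosswalk.dta, replace

* 3. Attach the shared numeric id to each file in turn.
use a.dta, clear
merge m:1 idstr using crosswalk.dta, keep(master match) nogenerate
```

The numbering still depends on which files went into the crosswalk, so the crosswalk would have to be kept (with the original strings in it) and extended with care if new files arrive later.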
>
>
> On Tue, Aug 6, 2013 at 5:36 PM, László Sándor <sandorl@gmail.com> wrote:
>> Thanks, Nick, as always.
>>
>> I am actually still confused, and maybe it is not just me: Could you
>> discuss when the reversibility check would fail?
>>
>> From Bill's penultimate guide to precision, esp. points 4.3, 4.4, I
>> gather that the IDs will be unique (not rounded) if my system is set
>> to double as the default datatype. Permanently.
>> http://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/
>>
>> Still, it is a bit scary to risk rounding your identifiers by a
>> mistaken float somewhere. On the other hand, string identifiers cannot
>> be panel IDs for xtset, so I need to bite the bullet.
>>
>> Thanks again,
>>
>> Laszlo
>>
>> On Tue, Aug 6, 2013 at 12:29 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> I guess I wrote some zeroth version of that.
>>>
>>> Conversion is reversible if real(string(<original>)) = <original> or
>>> string(real(<original>)) = <original> where <original> is whatever you
>>> feed in and -string()- can use whatever format is specified.
>>>
>>> What this amounts to is a stipulation that you must lose no
>>> information, crucial if you change your mind about what should be done
>>> to the data.
>>>
>>> So, a reversible potato peeler or university education would restore
>>> the potatoes or the students to their original state.
>>> Nick
>>> njcoxstata@gmail.com
>>>
>>>
>>> On 6 August 2013 17:03, László Sándor <sandorl@gmail.com> wrote:
>>>> I ran into an error with identifiers longer than -maxlong()- before
>>>> (blame statistical offices fond of 10 digits or more). So now I wanted
>>>> to be careful while destringing, but you cannot specify the type for
>>>> the result; however, -destring- breaks if the process is not
>>>> "reversible." What does it mean exactly? I cannot find it documented.
>>>> (Actually, the default type for -destring- is double, so it is surely
>>>> not the case that -destring- only produces longs unless forced to.)
>>>>
>>>> Do I need to worry about my identifiers becoming imprecise or rounded
>>>> if -destring- did not warn me?
>>>>
>>>> The documentation of -tostring- does contain the following, but this
>>>> is not exactly the same thing.
>>>>
>>>> Conversion of numeric data to string equivalents can be problematic.
>>>> Stata, like most software, holds numeric data to finite precision and
>>>> in binary form. See the discussion in [U] 13.11 Precision and problems
>>>> therein. If no format() is specified, tostring uses the format %12.0g.
>>>> This format is, in particular, sufficient to convert integers held as
>>>> bytes, ints, or longs to string equivalent without loss of precision.
>>>> However, users will often need to specify a format themselves,
>>>> especially when the numeric data have fractional parts and for some
>>>> reason a conversion to string is required.
>>>>
>>>> Thanks!
>>>>
>>>> *
>>>> * For searches and help try:
>>>> * http://www.stata.com/help.cgi?search
>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> * http://www.ats.ucla.edu/stat/stata/
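
A minimal sketch of the reversibility condition Nick describes, string(real(<original>)) = <original>, applied by hand to a few made-up string ids. The variable names (idstr, idnum, reversible), the toy values, and the choice of the %20.0f format are assumptions for illustration; this is not a description of what -destring- does internally.

```stata
* Sketch only: toy data and variable names are made up for illustration.
clear
input str16 idstr
"1234567890"
"9007199254740993"
"0012345"
end

* Convert to double explicitly, so the result does not depend on the
* default storage type (-generate- would otherwise create a float).
generate double idnum = real(idstr)

* Round trip: convert back to string and compare with the original.
generate byte reversible = string(idnum, "%20.0f") == idstr

list idstr idnum reversible
```

On this toy data I would expect only the first id to be flagged as reversible: the second is larger than 2^53, beyond which a double can no longer hold every integer exactly, and the third loses its leading zeros on the way back. The same comparison with a float in place of the double should start failing much earlier, for integers above 2^24.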