
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: accuracy and preserving uniqueness of id
Date: Wed, 26 Feb 2003 09:41:46 -0000

Radu Ban wrote

> i'm using -infix- to read in a large dataset into stata. each line of the
> dataset begins with an 18 character, numeric, company identification
> block. each company occupies several lines, that all start with the same
> identification code. to make things clearer here's my sample code:
>
> infix id 1-18 reccat 19-20 var1 21-25 var2 26-30 ... if reccat=11
> infix id 1-18 reccat 19-20 var3 21-23 var4 24-27 ... if reccat=12
>
> after i ran this i took a look at my resulting dataset and to my surprise,
> the id displayed by Stata looked very different from the id i originally
> had in my flat text file.
>
> for example:
>
> in text,  id = 200101380110999991
> in stata, id = 200101375269404672
>
> or
>
> in text,  id = 200101380206999991 (different from above)
> in stata, id = 200101375269404672 (same as above)
>
> what's bothering me is that ids that are different in text become the same
> in stata. is there a way to preserve the accuracy and hence uniqueness of
> the ids in this situation?

and Devra Golbe, Phil Ryan and Erik Sorensen all firmly advised the use of a string variable for this purpose. I concur.

Here are some extracts from a paper "On numbers and strings" in Stata Journal 2(3):314--329 (2002).

... unique identifiers will often conveniently be held in string variables. There is little point in defining a value label if that value label occurs once only. It is also less likely that you would want to use such a variable as defining one axis of a graph.

Less obviously, identifiers which consist entirely of numeric codes are often better held as string variables. U.S. Social Security Numbers (SSNs) are one of the most frequently discussed examples on Statalist. .... When stored without hyphens, these SSNs can be read into Stata as numeric variables, but small problems often arise later.
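To see why the two ids in the question collapse into one, here is a minimal sketch in Python (not Stata). It assumes that -infix- stored id as a -float- and that a Stata -float- behaves like a 4-byte IEEE 754 single; the helper name as_float32 is made up for this illustration.

```python
import struct


def as_float32(text: str) -> int:
    """Return the integer actually stored when `text` is held in a 4-byte float.

    Assumption: a 4-byte IEEE 754 single stands in for a Stata -float-.
    """
    packed = struct.pack("<f", float(int(text)))  # squeeze into 4 bytes
    return int(struct.unpack("<f", packed)[0])    # read back what survived


# The two ids from the question, as they appear in the text file.
ids_in_text = ["200101380110999991", "200101380206999991"]

# Held as 4-byte floats: both become 200101375269404672, so uniqueness is lost.
stored_as_float = [as_float32(s) for s in ids_in_text]

# Held as 8-byte doubles: the ids stay distinct, but not every digit survives,
# because 18 digits is more than a double can hold exactly.
stored_as_double = [int(float(int(s))) for s in ids_in_text]

# Held as strings: every character is kept.
stored_as_string = list(ids_in_text)

for text, f, d, s in zip(ids_in_text, stored_as_float,
                         stored_as_double, stored_as_string):
    print(f"text {text}  float {f}  double {d}  string {s}")
```

Under those assumptions the float column reproduces the value reported in the question for both ids, and only the string column matches the text file digit for digit.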
More generally, to hold multi-digit identifiers without numeric precision problems (that is, holding every digit exactly) may require the use of a -long- or -double- variable. To display such a variable (as with -list-) may require changing format to avoid most digits being lost whenever identifiers are presented in scientific notation. (See [R] format.) For example, a -float- numeric variable set equal to 123456789 will by default be -list-ed as 1.23e+08, shorthand for 1.23 * 10^8.

These are small and soluble problems, but they often cause puzzlement to Stata users. Holding such identifiers as strings, even though every character is numeric, solves those problems, with no apparent downside.

Nick
n.j.cox@durham.ac.uk

References: st: accuracy and preserving uniqueness of id (From: Radu Ban <rban@nber.org>)

