The Stata listserver

st: RE: accuracy and preserving uniqueness of id


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: accuracy and preserving uniqueness of id
Date   Wed, 26 Feb 2003 09:41:46 -0000

Radu Ban wrote
>
> i'm using -infix- to read in a large dataset into stata. each line of the
> dataset begins with an 18 character, numeric, company identification
> block. each company occupies several lines, that all start with the same
> identification code. to make things clearer here's my sample code:
>
> infix id 1-18 reccat 19-20 var1 21-25 var2 26-30 ... if reccat=11
> infix id 1-18 reccat 19-20 var3 21-23 var4 24-27 ... if reccat=12
>
> after i ran this i took a look at my resulting dataset and to my surprise,
> the id displayed by Stata looked very different from the id i originally
> had in my flat text file.
>
> for example:
>
> in text, id = 200101380110999991
> in stata, id= 200101375269404672
>
> or
>
> in text, id = 200101380206999991 (different from above)
> in stata, id= 200101375269404672 (same as above)
>
> what's bothering me is that ids that are different in text become the same
> in stata. is there a way to preserve the accuracy and hence
> uniqueness of the ids in this situation?

and Devra Golbe, Phil Ryan and Erik Sorensen all firmly
advised the use of a string variable for this purpose.

I concur.
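
A minimal sketch of that advice, adapted from the first -infix- line
in the question. The file name companies.raw is hypothetical, and the
variables elided by "..." in the question are simply left out here. The
only change that matters is the -str- in front of id. The /// line
continuation assumes the command is run from a do-file.

```stata
* Sketch only: companies.raw is a hypothetical file name.
* -str- reads columns 1-18 into a string variable, character by character,
* so no digit of the identifier is rounded away.
infix str id 1-18 reccat 19-20 var1 21-25 var2 26-30 ///
    using companies.raw if reccat == 11, clear

* id should now be reported with a string storage type
describe id
```

Because id is then a string, -sort-, -by- and -merge- compare it
character by character, which is all that an identifier needs.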

Here are some extracts from a paper "On numbers
and strings" in Stata Journal 2(3):314--329 (2002).

... unique identifiers will often conveniently be held in
string variables.  There is little point in defining a
value label if that value label occurs once only. It is
also less likely that you would want to use such a
variable as defining one axis of a graph.

Less obviously, identifiers which consist entirely of numeric
codes are often better held as string variables. U.S. Social
Security Numbers (SSNs) are one of the most frequently
discussed examples on Statalist. .... When stored without
hyphens, these SSNs can be read into Stata as numeric variables,
but small problems often arise later. More generally, to hold
multi-digit identifiers without numeric precision problems
(that is, holding every digit exactly) may require the use of a
-long- variable.  To display such a variable (as with -list-) may
require changing format to avoid most digits being lost
whenever identifiers are presented in scientific notation.
(See [R] format.) For example, a -float- numeric variable set equal
to 123456789 will by default be -list-ed as 1.23e+08, shorthand for
1.23 * 10^8. These are small and soluble problems, but they often
cause puzzlement to Stata users.  Holding such identifiers as
strings, even though every character is numeric, solves those
problems, with no apparent downside.
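
To connect the extract to the example in the question: -infix- stores
a numeric variable as -float- unless told otherwise, and a float holds
only about 7 significant digits. A -long- stops near 2.147 billion,
enough for 9-digit SSNs but not for 18-digit ids, and even a -double-
holds integers exactly only up to 2^53 (about 9.007e15). The sketch
below uses the float() function, which rounds its argument to float
precision. Both -display- lines should show 200101375269404672, the
value reported in the question. The variable names id_f, id_l and id_s
are made up for illustration.

```stata
* Both ids from the question round to the same float
display %20.0f float(200101380110999991)
display %20.0f float(200101380206999991)

* The 9-digit example from the extract: float cannot hold it exactly, long can
clear
set obs 1
generate float id_f = 123456789
generate long  id_l = 123456789
format id_f id_l %12.0f
list

* A string copy keeps every digit, including any leading zeros
generate str9 id_s = string(id_l, "%09.0f")
list id_s
```

Converting after the fact, as in the last step, only helps when the
numeric variable still holds every digit exactly. For the 18-digit ids
the damage is done on input, so they need to be read as strings in the
first place.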


Nick
n.j.cox@durham.ac.uk
