RE: st: another data cleaning question


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: another data cleaning question
Date   Mon, 24 Jun 2002 14:31:44 +0100

Babigumira Ronnie

> Thanks for the help. The code produces the same result as what Hakon
> suggested. However, you make a comment ("More generally, whenever codes
> are pseudo-numeric, there are several advantages to holding them as
> strings") which has generated interest in me, so I would like to pursue
> it further.
>
> You suggest
>
> list if string(cropcode) != substr(string(varcode),1,3)
>
> Now that the code works, I would like to know the underlying principles.
> Please throw some more light especially on the right hand side of the =.
>

My general remark merely echoes a comment often made on
Statalist. I will write down what springs
to mind. Others should feel very free as usual
to amplify and correct. Perhaps this is an FAQ
in embryonic form. Also, as mentioned before
on Statalist, the topic of numeric and string
variables will be the subject of the next
"Speaking Stata" column in the Stata Journal, so
extra comments will be gratefully received.

1. Identifiers which are all numeric often cause small
problems. U.S. social security numbers appear
to be the most common example mentioned on
Statalist. To hold such identifiers without
precision problems (i.e. every digit held
exactly) may require the use of a -long- or -double-
variable, because the default -float- type holds
integers exactly only up to 16,777,216.
Along with that, so to speak, to display such a variable
may require changing format to avoid most
digits being lost whenever identifiers are presented in
scientific notation. These are small and soluble problems,
but frequently cause puzzlement to Stata users.
Holding such identifiers as strings, even though every
character is numeric, solves those problems, with
no apparent downside.
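
As a minimal sketch of point 1, assuming a made-up nine-digit
identifier (the variable names id_float, id_long and id_str are
hypothetical and not from the original thread), the following
compares the three ways of holding it:

```stata
* Hypothetical data: one made-up nine-digit identifier held three ways
clear
set obs 1
generate float id_float = 123456789
generate long  id_long  = 123456789
generate str9  id_str   = "123456789"

* With default formats the float may be shown in scientific notation
list

* A wider fixed format shows every digit that is actually stored
format id_float id_long %12.0f
list

* -float- cannot hold this value exactly; -long- and the string can
assert id_float != 123456789
assert id_long  == 123456789
assert id_str   == "123456789"
```

The string version needs neither a storage-type decision nor a
format change.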

2. Categorical codes which are multi-digit numbers
are often constructed hierarchically: that is, successive
digits take you to finer detail within some
classification system. When such codes are held
as numbers, working from fine to coarse categories, or
vice versa, can be done via tricks with -int()- and
occasionally -mod()-. These tricks strike many users as neat when they
are familiar but indirect or obscure when they are not.
However, the corresponding operations on such codes held as strings
can be done via -substr()- and occasionally -index()-
and these operations are often more transparent to users.
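
A small sketch of point 2 may help. It assumes a hypothetical
five-digit code in which the first three digits identify the crop
and the last two the variety; the data values and the names
crop_num, variety_num, varcode_s, crop_str and variety_str are
invented for illustration.

```stata
* Hypothetical codes: digits 1-3 identify the crop, digits 4-5 the variety
clear
input long varcode
10203
10417
20511
end

* Numeric route: integer division and remainder
generate int  crop_num    = int(varcode/100)
generate byte variety_num = mod(varcode, 100)

* String route: pick out characters by position
generate str5 varcode_s   = string(varcode)
generate str3 crop_str    = substr(varcode_s, 1, 3)
generate str2 variety_str = substr(varcode_s, 4, 2)

list
```

Both routes recover the same crop, but the string route also keeps
the leading zero of a variety such as "03", which the numeric route
reduces to 3.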

3. A more elementary error is to forget, especially
for statistical rather than data management commands,
that a variable may be numeric to Stata without being
a variable which may fairly be included in a statistical
model as is. It is arguable that a habit of holding arbitrary
numeric codes as strings provides some protection against
foolish statistics of this kind.
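
To illustrate point 3, here is a sketch with invented crop codes
(the name cropcode_s is hypothetical). It only contrasts
-summarize- and -tabulate-; how other commands react to string
variables is not shown here.

```stata
* Hypothetical data: arbitrary numeric crop codes
clear
input long cropcode
101
101
204
307
end

* Stata computes a mean of the codes without complaint,
* although that mean has no substantive meaning
summarize cropcode

* Held as a string, the same codes contribute no observations
* to -summarize-
generate str3 cropcode_s = string(cropcode)
summarize cropcode_s

* Frequencies, which are meaningful, remain available
tabulate cropcode_s
```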

Nick
n.j.cox@durham.ac.uk

P.S. I am not clear on Ronnie's specific question. The -string()-
function converts a number to its string equivalent, whereafter
-substr()- extracts specified characters: here, the three
characters starting at position 1.
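
The pieces of that expression can be checked one at a time with
-display-. The values 10203 and 102 are made up, standing in for
one observation's varcode and cropcode.

```stata
* Step 1: the number becomes the string "10203"
display string(10203)

* Step 2: the first three characters of that string, "102"
display substr(string(10203), 1, 3)

* Step 3: the full comparison; 0 means the two strings agree,
* so -list if ...- would not list such an observation
display string(102) != substr(string(10203), 1, 3)
```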
