The Stata listserver

RE: st: another data cleaning question

From   Babigumira Ronnie
Subject   RE: st: another data cleaning question
Date   Mon, 24 Jun 2002 22:36:43 -0700 (PDT)

Thanks, Nick. The notes on string vs. numeric were very insightful. As for
the specific question, I looked it up in the manual and all is good (I
hadn't figured out what the ,1,3 meant).

--- Nick Cox wrote:
> Babigumira Ronnie
> > Thanks for the help. The code produces the same result as what Hakon
> > suggested. However, you make a comment, "More generally, whenever codes
> > are pseudo-numeric, there are several advantages to holding them as
> > strings", which has generated interest in me, so I would like to pursue
> > it further.
> >
> > You suggest
> >
> > list if string(cropcode) != substr(string(varcode),1,3)
> >
> > Now that the code works, I would like to know the underlying
> > principles. Please throw some more light, especially on the right-hand
> > side of the =.
> >
> My general remark merely echoes a comment often made on
> Statalist. I will write down what springs
> to mind. Others should feel very free as usual
> to amplify and correct. Perhaps this is an FAQ
> in embryonic form. Also, as mentioned before
> on Statalist, the topic of numeric and string
> variables will be the subject of the next
> "Speaking Stata" column in the Stata Journal, so
> extra comments will be gratefully received.
>
> 1. Identifiers which are all numeric often cause small
> problems. U.S. social security numbers appear
> to be the most common example mentioned on
> Statalist. To hold such identifiers without
> precision problems (i.e. every digit held
> exactly) may require the use of a -long- variable.
> Along with that, so to speak, to display such a variable
> may require changing format to avoid most
> digits being lost whenever identifiers are presented in
> scientific notation. These are small and soluble problems,
> but frequently cause puzzlement to Stata users.
> Holding such identifiers as strings, even though every
> character is numeric, solves those problems, with
> no apparent downside.
>
> 2. Categorical codes which are multi-digit numbers
> are often constructed hierarchically: that is, successive
> digits take you to finer detail within some
> classification system. When such codes are held
> as numbers, working from fine to coarse categories, or
> vice versa, can be done via tricks with -int()- and
> occasionally -mod()-. These tricks strike many users as neat when they
> are familiar but indirect or obscure when they are not.
> However, the corresponding operations on such codes held as strings
> can be done via -substr()- and occasionally -index()-
> and these operations are often more transparent to users.
>
> 3. A more elementary error is to forget, especially
> for statistical rather than data management commands,
> that a variable may be numeric to Stata without being
> a variable which may fairly be included in a statistical
> model as is. It is arguable that a habit of holding arbitrary
> numeric codes as strings provides some protection against
> foolish statistics of this kind.
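
Points 1 and 2 above can be illustrated with a minimal sketch. The identifier 123456789, the five-digit code 10477, and all variable names are made up for illustration; they are not from the original exchange.

```stata
* Point 1: a 9-digit identifier does not fit exactly in a -float-.
clear
set obs 1
generate float id_float = 123456789
generate long  id_long  = 123456789
generate str9  id_str   = "123456789"

* The float copy has been rounded, so it no longer matches the long copy.
list id_float id_long id_str
count if id_float != id_long

* Point 2: a hypothetical hierarchical code, first 3 digits coarse,
* last 2 digits fine.
generate long code = 10477

* Numeric route: arithmetic with int() and mod().
generate coarse_num = int(code/100)
generate fine_num   = mod(code, 100)

* String route: pick out character positions with substr().
generate str5 code_str   = string(code)
generate str3 coarse_str = substr(code_str, 1, 3)
generate str2 fine_str   = substr(code_str, 4, 2)

list code coarse_num fine_num coarse_str fine_str
```

Both routes give the same split of the code; the string version states the character positions directly, which is the transparency argument made in point 2.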
>
> Nick
>
> P.S. I am not clear on Roni's specific question. The -string()-
> function converts to string, whereafter -substr()- extracts
> specified characters.
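
As a rough illustration of the P.S., the sketch below shows what the ,1,3 does. The three observations are invented for the example, on the assumption that the first three digits of varcode should match cropcode.

```stata
* Hypothetical data: varcode is a 5-digit code whose first 3 digits
* are meant to equal cropcode.
clear
input cropcode varcode
101 10112
102 10250
103 10477
end

* string() converts the number to text; substr(s, 1, 3) then keeps
* 3 characters starting at position 1.
generate str3 prefix = substr(string(varcode), 1, 3)

* List the observations where the two codes disagree.
list if string(cropcode) != prefix
```

With these made-up values only the third observation is listed, because its varcode begins with 104 rather than 103.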