
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: another data cleaning question
Date: Mon, 24 Jun 2002 14:31:44 +0100

Babigumira Ronnie

> Thanks for the help. The code produces the same result as what Hakon
> suggested however you make a comment
>
> More generally, whenever codes are
> pseudo-numeric, there are several advantages to holding them as strings
>
> which has generated interest in me so I would like to pursue it further.
>
> You suggest
>
> list if string(cropcode) != substr(string(varcode),1,3)
>
> Now that the code works, I would like to know the underlying principles.
> Please throw some more light especially on the right hand side of the =.

My general remark merely echoes a comment often made on Statalist. I will write down what springs to mind. Others should feel very free as usual to amplify and correct. Perhaps this is an FAQ in embryonic form. Also, as mentioned before on Statalist, the topic of numeric and string variables will be the subject of the next "Speaking Stata" column in the Stata Journal, so extra comments will be gratefully received.

1. Identifiers which are all numeric often cause small problems. U.S. social security numbers appear to be the most common example mentioned on Statalist. To hold such identifiers without precision problems (i.e. every digit held exactly) may require the use of a -long- variable. In addition, displaying such a variable may require changing its format; otherwise most digits are lost whenever identifiers are presented in scientific notation. These are small and soluble problems, but they frequently puzzle Stata users. Holding such identifiers as strings, even though every character is numeric, solves those problems, with no apparent downside.

2. Categorical codes which are multi-digit numbers are often constructed hierarchically: that is, successive digits take you to finer detail within some classification system. When such codes are held as numbers, working from fine to coarse categories, or vice versa, can be done via tricks with -int()- and occasionally -mod()-. These tricks strike many users as neat when they are familiar but indirect or obscure when they are not. However, the corresponding operations on such codes held as strings can be done via -substr()- and occasionally -index()-, and these operations are often more transparent to users.

3. A more elementary error is to forget, especially for statistical rather than data management commands, that a variable may be numeric to Stata without being a variable which may fairly be included in a statistical model as is. It is arguable that a habit of holding arbitrary numeric codes as strings provides some protection against foolish statistics of this kind.

Nick
n.j.cox@durham.ac.uk

P.S. I am not clear on Roni's specific question. The -string()- function converts its numeric argument to a string, whereafter -substr()- extracts the specified characters: in the expression above, the three characters starting at position 1.
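
A minimal sketch of points 1 and 2 and of the expression in the P.S., for readers who want to try them. The data are invented for illustration: the variable names cropcode and varcode come from the thread, but the values, the assumption that varcode is a five-digit code whose first three digits should match cropcode, and the helper variables (id_f, id_l, id_s, coarse_n, varcode_s, coarse_s) are all hypothetical.

```stata
* Hypothetical data: values are invented for illustration only.
clear
input long cropcode long varcode
123 12345
123 12399
456 12345
end

* Point 1: a nine-digit identifier does not survive storage as -float-,
* but is held exactly as -long- or as a string.
generate float id_f = 123456789
generate long  id_l = 123456789
generate str9  id_s = string(id_l, "%9.0f")
list id_f id_l id_s in 1

* Point 2: fine-to-coarse on a numeric code uses -int()- ...
generate long coarse_n = int(varcode / 100)

* ... whereas on a string code it is a plain -substr()-.
generate str5 varcode_s = string(varcode, "%5.0f")
generate str3 coarse_s  = substr(varcode_s, 1, 3)

* The comparison from the thread: -string()- converts each number to a
* string, then -substr()- keeps characters 1 to 3 of the longer code.
list if string(cropcode) != substr(string(varcode), 1, 3)
```

With these invented values only the third observation is listed by the last command, because its cropcode does not match the first three characters of its varcode.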

