



RE: st: destring ignores more than what specified in ignore()


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: destring ignores more than what specified in ignore()
Date   Mon, 21 Nov 2011 10:36:19 +0000

Given the information here, either

destring <varlist>, replace ignore("nas")

or 

destring <varlist>, replace force 

should work. Note that you don't need to set up your own loop or a prior filter of numeric variables; -destring- will do both for you. 
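A minimal sketch of the second form on a toy dataset, to show that no loop or numeric filter is needed. The variable names price, qty and id are made up for illustration. Note that force turns every nonnumeric entry into missing, not only "n.a." and "n.s.", so it is only safe when those are the sole nonnumeric values.

```stata
clear
input str8 price str8 qty id
"12.5" "3"    1
"n.a." "n.s." 2
end

* id is already numeric, so destring skips it with a note;
* with force, entries that cannot be read as numbers become missing
destring _all, replace force

list
```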

Nick 
n.j.cox@durham.ac.uk 


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
Sent: 21 November 2011 08:22
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: destring ignores more than what specified in ignore()

-destring- ignores characters, not substrings. The problem is at most
that this is not clear to you when you read the help. -destring- did
what you told it to do, which was, among other things, to remove ".".
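A small sketch of what "characters, not substrings" means in practice. The variable x and its two values are invented for illustration; ignore("n.a.") is read as the set of characters n, . and a, each of which is removed wherever it occurs.

```stata
clear
input str8 x
"12.5"
"n.a."
end

* ignore("n.a.") strips the characters n, . and a individually,
* so the decimal point in "12.5" is removed as well
destring x, generate(x_num) ignore("n.a.")

list x x_num
```

Here "12.5" should come out as 125 and "n.a." as missing, which is the behaviour reported in the original question.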

You need to fix your "n.a." and "n.s." first, e.g. within a loop

replace `var' = subinstr(`var', "n.a.", ".", .)
replace `var' = subinstr(`var', "n.s.", ".", .)

or as you did it.
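Put together, the idea might look like the sketch below. The toy data and the names price, qty and id are assumptions for illustration. subinstr() is applied to the variable itself, so only the contents change, and the final -destring- needs no ignore() because "." on its own is read as missing.

```stata
clear
input str8 price str8 qty id
"12.5" "3"    1
"n.a." "n.s." 2
end

foreach var of varlist * {
    capture confirm string variable `var'
    if !_rc {
        * turn the two markers into ".", leaving real decimal points alone
        replace `var' = subinstr(`var', "n.a.", ".", .)
        replace `var' = subinstr(`var', "n.s.", ".", .)
    }
}

* no ignore() needed now
destring price qty, replace

list
```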

-destring- is just a wrapper for -real()-, so -real()- is not really
an alternative except in so far as -destring- is not understood. Your
code is shorter and more efficient than -destring- as it can be
tailored to your problem.  In fact your last code segment can be
shortened as -real("n.a.")- for example results in numeric missing.
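A sketch of that shortening, on invented data (price, qty and id are hypothetical names). Because real() returns missing for any string it cannot read as a number, the line that first replaces "n.a." and "n.s." with "." can be dropped. Storing the result as double is a choice made here, not something stated in the thread, and as in the original loop the new variables carry an r suffix.

```stata
clear
input str8 price str8 qty id
"12.5" "3"    1
"n.a." "n.s." 2
end
label variable price "Unit price"

foreach var of varlist * {
    capture confirm string variable `var'
    if !_rc {
        local lbl : variable label `var'
        * real("n.a.") and real("n.s.") are both numeric missing
        generate double `var'r = real(`var')
        label variable `var'r `"`lbl'"'
        order `var'r, after(`var')
        drop `var'
    }
}

describe
list
```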

Nick

On Mon, Nov 21, 2011 at 1:51 AM, Impavido, Gregorio <GImpavido@imf.org> wrote:
> I looked at the many FAQs on destring but could not find an answer to my problem. Hence this post; hopefully it is not a duplicate.
>
> I have a dataset with an unknown (ex ante) number of string variables containing entries of the following three types: (i) "###.###"; (ii) "n.a."; and (iii) "n.s.".
>
> These variables should be numeric and I would like to destring them by coding:
>
> foreach var of varlist * {
>    capture confirm numeric variable `var'
>    if _rc {
>       destring `var', replace ignore("n.a." "n.s.")
>       }
> }
>
> This does not work as destring, for some inexplicable (to me) reason, treats "." as a separate non numeric character from "n.a." or "n.s.".
>
> Therefore, it drops the "." in entries like "###.###", changing them into the double numeric ######. The same happens if the option is specified as ignore("n.a" "n.s") (i.e., without the final ".").
>
>
> First question (of two): Why is destring ignoring more than what is specified in the option ignore()?
>
> I found two ways around this odd behaviour of destring.
>
> The first option uses an extra line of code and it is:
>
> foreach var of varlist * {
>    capture confirm numeric variable `var'
>    if _rc {
>       replace `var' = "na" if inlist(`var', "n.a.", "n.s.")  // this gets rid of the "."
>       destring `var', replace ignore("na")  // no "." here!!!
>    }
> }
>
> This preserves both the order and the variable labels of my original string variables (which I need in subsequent code) but it uses again the dreaded destring command (after seeing how it treats "n.a.", I don't "trust" it anymore).
>
> The second option uses generate with the real() function but also more lines of code as real() does not work with replace.
>
> foreach var of varlist * {
>    capture confirm numeric variable `var'
>    if _rc {
>       replace `var' = "." if inlist(`var', "n.a.", "n.s.")
>       local lbl : variable label `var'
>       gen `var'r = real(`var')
>       label var `var'r `"`lbl'"'
>       order `var'r, after(`var')
>       drop `var'
>    }
> }
>
> Both loops seem to end up with numeric only variables in the same order and with the same variable labels as the original dataset. My second question is: should we use real() instead of destring when possible, which is more "fool proof" (my third loop is much faster than the other two)?
>
> Finally, is there a more efficient way to get where I want without writing all this code (especially the last loop)?
>
