



Re: st: destring ignores more than what specified in ignore()


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: destring ignores more than what specified in ignore()
Date   Mon, 21 Nov 2011 16:55:23 +0000

The documentation should be sufficient to understand what a program
does. StataCorp are responsible for that!

What may have misled you is that examples for -ignore()- include " "
as delimiters. That is only needed if space " " is one of the
characters to be ignored, which it often is.

You are correct. ".." as a string value does not map to numeric
missing with -destring-, even though -real("..")- does. This is one of
several ways in which -destring- is fussier than -real()-, and the
fussiness is intended as a feature.
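
A minimal sketch of the behaviour described above, using a made-up toy variable `x` (the data and the names `x1` and `x2` are assumptions for illustration, not from the original post). It shows -ignore()- stripping individual characters, the leftover ".." not being accepted as missing, and two ways around it:

```stata
* Toy data: two numbers plus the two missing-value codes from the thread
clear
input str8 x
"12.5"
"n.a."
"n.s."
"3.75"
end

* ignore() removes the characters n, a and s one by one, so "n.a." and
* "n.s." both become "..", which -destring- does not treat as missing.
* Expected here: x1 is not generated and x stays a string variable.
destring x, generate(x1) ignore("nas")

* force converts anything still nonnumeric (here "..") to missing.
destring x, generate(x2) ignore("nas") force

* Alternative: map the codes to "." first; then no ignore() is needed.
replace x = "." if inlist(x, "n.a.", "n.s.")
destring x, replace
```

Note that -force- is the blunter tool: it turns every unconvertible value into missing, not only the two codes, so replacing the known codes first is the more conservative route.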

But you shouldn't change -destring-; at most clone it.

Nick

On Mon, Nov 21, 2011 at 3:27 PM, Impavido, Gregorio <GImpavido@imf.org> wrote:
> Thank you Nick. It indeed wasn't clear to me that destring works with characters and not substrings (I should have looked at the ado file first...). It is now clear that destring creates local macros of each individual character specified in ignore() (lines 51-59 of destring.ado) and replaces them with "" in lines 229-230 before applying real().  This means (if understood correctly) that your last suggestion:
>
> destring <varlist>, replace ignore("nas")
>
> does not work because, starting with "n.a." or "n.s.", I am still left with ".." after the substitution. However, by adding
>
> | `temp'==".."
>
> in line 238 of destring, your suggestion works like a charm. This is (I believe) equivalent to using the force option, as you also suggest.
>
> All your other suggestions work perfectly. So thank you again.
>
> Gregorio
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
> Sent: Monday, November 21, 2011 5:36 AM
> To: 'statalist@hsphsun2.harvard.edu'
> Subject: RE: st: destring ignores more than what specified in ignore()
>
> On the information here
>
> destring <varlist>, replace ignore("nas")
>
> or
>
> destring <varlist>, replace force
>
> should work. Note that you don't need to set up your own loop or a prior filter of numeric variables; -destring- will do both for you.
>
> Nick
> n.j.cox@durham.ac.uk
>
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
> Sent: 21 November 2011 08:22
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: destring ignores more than what specified in ignore()
>
> -destring- ignores characters, not substrings. The problem is at most
> that this is not clear to you when you read the help. -destring- did
> what you told it to do, which was, among other things, to remove ".".
>
> You need to fix your "n.a." and "n.s." first, e.g. within a loop
>
> replace `var' = subinstr(`var', "n.a.", ".", .)
> replace `var' = subinstr(`var', "n.s.", ".", .)
>
> or as you did it.
>
> -destring- is just a wrapper for -real()-, so -real()- is not really
> an alternative except in so far as -destring- is not understood. Your
> code is shorter and more efficient than -destring- as it can be
> tailored to your problem.  In fact your last code segment can be
> shortened as -real("n.a.")- for example results in numeric missing.
>
> Nick
>
> On Mon, Nov 21, 2011 at 1:51 AM, Impavido, Gregorio <GImpavido@imf.org> wrote:
>> I looked at the many FAQ on destring but could not find an answer for my problem.  Hence, the post and hopefully, it is not a  duplicate.
>>
>> I have a dataset with an unknown (ex ante) number of string variables containing entries of the following three types: (i) "###.###"; (ii) "n.a."; and "n.s.".
>>
>> These variables should be numeric and I would like to destring them by coding:
>>
>> foreach var of varlist * {
>>    capture confirm numeric variable `var'
>>    if _rc {
>>       destring `var', replace ignore("n.a." "n.s.")
>>       }
>> }
>>
>> This does not work as destring, for some inexplicable (to me) reason, treats "." as a separate non numeric character from "n.a." or "n.s.".
>>
>> Therefore, it drops the "." in entries like "###.###", changing them into double numeric ######. The same happens if the option is specified as ignore("n.a" "n.s") (i.e., without the final ".").
>>
>>
>> First question (of two): Why is destring ignoring more than what is specified in the option ignore()?
>>
>> I found two ways around this odd behaviour of destring.
>>
>> The first option uses an extra line of code and it is:
>>
>> foreach var of varlist * {
>>    capture confirm numeric variable `var'
>>    if _rc {
>>       replace `var' = "na" if inlist(`var', "n.a.", "n.s.")  // this gets rid of the "."
>>       destring `var', replace ignore("na")  // no "." here!!!
>>    }
>> }
>>
>> This preserves both the order and the variable labels of my original string variables (which I need in subsequent code) but it uses again the dreaded destring command (after seeing how it treats "n.a.", I don't "trust" it anymore).
>>
>> The second option uses generate with the real() function but also more lines of code as real() does not work with replace.
>>
>> foreach var of varlist * {
>>    capture confirm numeric variable `var'
>>    if _rc {
>>       replace `var' = "." if inlist(`var', "n.a.", "n.s.")
>>       local lbl : variable label `var'
>>       gen `var'r = real(`var')
>>       label var `var'r `"`lbl'"'
>>       order `var'r, after(`var')
>>       drop `var'
>>    }
>> }
>>
>> Both loops seem to end up with numeric only variables in the same order and with the same variable labels as the original dataset. My second question is: should we use real() instead of destring when possible, which is more "fool proof" (my third loop is much faster than the other two)?
>>
>> Finally, is there a more efficient way to get where I want without writing all this code (especially the last loop)?


