Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From |
Sergiy Radyakin <serjradyakin@gmail.com> |

To |
"statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |

Subject |
Re: st: Fwd: Fastest way to identify values that start and end with a 9? |

Date |
Thu, 3 Oct 2013 11:16:58 -0400 |

Paul (or Evan) is using a dataset where missing values (DK and REF) are coded as values of the type 9...9 and 9...8. This is similar to the convention used in the DHS datasets; see e.g. page 3 here:
http://www.measuredhs.com/pubs/pdf/DHSG4/Recode6_DHS_22March2013_DHSG4.pdf

Paul (Evan) must check with the data provider whether the other convention also holds: that the missing-value codes are at least one digit wider than the widest (in terms of digits) possible valid value. Otherwise, e.g. if the value 999 denotes a missing age, ages 9 and 99 will also be caught by the recoding schemes he is using based on the proposed regular expressions.

In general, I don't see how he will be able to determine which 9...9 patterns indeed correspond to missing without prior knowledge of the variable contents, or instructions from the data provider, or careful inspection of the individual values of each variable to determine the range of widths of the values. Values of income of 99USD or 998USD might in the end be actual data, etc.

Best, Sergiy Radyakin

On Thu, Oct 3, 2013 at 5:08 AM, Evan DeFilippis <defilippis@gmail.com> wrote:
> Values in my data set contain different numerical representations for
> "Don't Know" and "Refusal"
>
> A "Don't Know" will always start and end with a '9', but there can be
> as many '9's in between as possible, up to the maximum length of a
> string (244).
>
> A "Refusal" will always start with a '9' and end with an '8', and
> there can be as many '9's' in between as possible, up to the maximum
> length of a string (244).
>
> The data set contains strings, integers, bytes, etc..
>
> I want to be able to convert the numerical representations of 'Don't
> Know' and 'Refusal's' into DK and REF, respectively.
>
> My current strategy for doing this looks like so:
>
> quietly tostring _all, replace
> ds, has(type string)
> di "`r(varlist)'"
> unab string_vars : `r(varlist)'
> foreach j in `string_vars' {
> quietly replace `j'= regexr(`j', "^[9]*[9]$","DK")
> quietly replace `j' = regexr(`j', "^[9]*[8]$", "REF")
> }
>
> However, this is slow because it converts the entire data set into
> strings, which takes about 5 minutes, and then it has to do has(type
> string) in order to get r(varlist) to iterate over all those strings
> which takes about 4 minutes.
>
> Is there a faster way to do this that perhaps does not involve
> converting everything to strings?
>
> Paul
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
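To make the width concern in the reply concrete, here is a minimal Stata sketch on toy data. The variable name `age`, the codes 999 and 998, and the code width of 3 are all hypothetical; in real data they would have to come from the codebook or the data provider, not from the values themselves. The sketch compares a pattern test on the digits with a test against codes of a known width.

```stata
* Sketch only. The variable "age" and the codes 999/998 are hypothetical.
clear
input int age
9
42
98
99
998
999
end

* Pattern test on the digits: flags every value made of nines
* (or nines followed by an 8), whether it is a code or real data.
generate byte re_dk  = regexm(string(age), "^9+$")
generate byte re_ref = regexm(string(age), "^9+8$")

* Width rule, assuming the codebook says the codes are 3 digits wide
* (one digit wider than the widest valid age).
local width 3
generate byte code_dk  = age == (10^`width' - 1)
generate byte code_ref = age == (10^`width' - 2)

* Ages 9, 98 and 99 are flagged by the pattern tests but not by the width rule.
list, clean noobs

* Once the codes are confirmed, a numeric variable can be recoded in place
* to extended missing values, with no conversion to strings.
mvdecode age, mv(999=.a \ 998=.b)
```

If the provider confirms a single code width per variable, the numeric comparison avoids `tostring` altogether for numeric variables, and `.a` and `.b` can carry value labels such as DK and REF. String variables would still need a separate pass.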

**Follow-Ups**: **Re: st: Fwd: Fastest way to identify values that start and end with a 9?** *From:* Robert Picard <picard@netbox.com>

**References**: **st: Fwd: Fastest way to identify values that start and end with a 9?** *From:* Evan DeFilippis <defilippis@gmail.com>
