Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Robert Picard <picard@netbox.com> |

To |
"statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |

Subject |
Re: st: Fwd: Fastest way to identify values that start and end with a 9? |

Date |
Thu, 3 Oct 2013 12:19:04 -0400 |

For the variables that are numeric, assuming that the DK and REF codes are always larger than any other value for the variable, the following code will look for the max value of each variable and convert it to an extended missing value if it matches the pattern (all digits are 9). Then the procedure is repeated the next pattern (either 8 or a series of 9 ending with an 8. Labels are used to identify each type of missing value. Robert * ----------------- begin example --------------- clear input v1 v2 v3 v4 v5 str1 v6 1 22 333 4444 55555 "a" 8 98 998 9998 99998 "b" 9 99 999 9999 99999 "c" 2 33 444 5555 66666 "d" 2 33 444 998 99 "e" end label def dkref .a "DK" .b "REF" ds, has(type numeric) local numvars `r(varlist)' foreach v of varlist `numvars' { sum `v', meanonly local vmax = r(max) local dk = regexm("`vmax'", "^9+$") if `dk' replace `v' = .a if `v' == `vmax' sum `v', meanonly local vmax = r(max) local ref = regexm("`vmax'", "(^8$|^9+8$)") if `ref' replace `v' = .b if `v' == `vmax' label values `v' dkref } list * ----------------- end example ----------------- On Thu, Oct 3, 2013 at 11:16 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote: > Paul (or Evan) is using a dataset where missing values (DK and REF) > are coded as values of the type 9...9 and 9...8. This is similar to > the convention used in the DHS datasets, see eg here page 3: > http://www.measuredhs.com/pubs/pdf/DHSG4/Recode6_DHS_22March2013_DHSG4.pdf > > Paul (Evan) must check with the data provider whether the other > convention is also true - that the missing values should be at least > one digit wider than the widest (in terms of digits) possible value. > Otherwise, e.g. if the value 999 determines the missing age, ages 9 > and 99 will also be caught in the recoding schemes he is using based > on the proposed regular expressions. > > In general, I don't see how he will be able to determine which 9...9 > patterns indeed correspond to missing without having a prior knowledge > of the variable contents, or instructions from the data provider, or > carefully inspecting individual values of each variable trying to > determine the ranges of widths of the values. Values of income of > 99USD or 998USD might in the end be actual data, etc. > > Best, Sergiy Radyakin > > > > > On Thu, Oct 3, 2013 at 5:08 AM, Evan DeFilippis <defilippis@gmail.com> wrote: >> Values in my data set contain different numerical representations for >> "Don't Know" and "Refusal" >> >> A "Don't Know" will always start and end with a '9', but there can be >> as many '9's in between as possible, up to the maximum length of a >> string (244). >> >> A "Refusal" will always start with a '9' and end with an '8', and >> there can be as many '9's' in between as possible, up to the maximum >> length of a string (244). >> >> The data set contains strings, integers, bytes, etc.. >> >> I want to be able to convert the numerical representations of 'Don't >> Know' and 'Refusal's' into DK and REF, respectively. >> >> My current strategy for doing this looks like so: >> >> quietly tostring _all, replace >> ds, has(type string) >> di "`r(varlist)'" >> unab string_vars : `r(varlist)' >> foreach j in `string_vars' { >> quietly replace `j'= regexr(`j', "^[9]*[9]$","DK") >> quietly replace `j' = regexr(`j', "^[9]*[8]$", "REF") >> } >> >> However, this is slow because it converts the entire data set into >> strings, which takes about 5 minutes, and then it has to do has(type >> string) in order to get r(varlist) to iterate over all those strings >> which takes about 4 minutes. >> >> Is there a faster way to do this that perhaps does not involve >> converting everything to strings? >> >> Paul >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**References**:**st: Fwd: Fastest way to identify values that start and end with a 9?***From:*Evan DeFilippis <defilippis@gmail.com>

**Re: st: Fwd: Fastest way to identify values that start and end with a 9?***From:*Sergiy Radyakin <serjradyakin@gmail.com>

- Prev by Date:
**st: Update to -weakiv- on SSC** - Next by Date:
**st: Single graph from multiple datasets or memory** - Previous by thread:
**Re: st: Fwd: Fastest way to identify values that start and end with a 9?** - Next by thread:
**st: Looping over datetimes for simultaneous variable generation** - Index(es):