Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Fwd: Fastest way to identify values that start and end with a 9?

From	Robert Picard <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: Fwd: Fastest way to identify values that start and end with a 9?
Date	Thu, 3 Oct 2013 12:19:04 -0400

For the variables that are numeric, assuming that the DK and
REF codes are always larger than any other value for the
variable, the following code will look for the max value of
each variable and convert it to an extended missing value if
it matches the pattern (all digits are 9). Then the
procedure is repeated the next pattern (either 8 or a series
of 9 ending with an 8. Labels are used to identify each type
of missing value.

Robert

* ----------------- begin example ---------------
clear
input v1 v2 v3 v4 v5 str1 v6
1 22 333 4444 55555 "a"
8 98 998 9998 99998 "b"
9 99 999 9999 99999 "c"
2 33 444 5555 66666 "d"
2 33 444 998 99 "e"
end

label def dkref .a "DK" .b "REF"

ds, has(type numeric)
local numvars `r(varlist)'

foreach v of varlist `numvars' {
sum `v', meanonly
local vmax = r(max)
local dk = regexm("`vmax'", "^9+$")
if `dk' replace `v' = .a if `v' == `vmax'
sum `v', meanonly
local vmax = r(max)
local ref = regexm("`vmax'", "(^8$|^9+8$)")
if `ref' replace `v' = .b if `v' == `vmax'
label values `v' dkref
}

list

* ----------------- end example -----------------



On Thu, Oct 3, 2013 at 11:16 AM, Sergiy Radyakin <[email protected]> wrote:
> Paul (or Evan) is using a dataset where missing values (DK and REF)
> are coded as values of the type 9...9 and 9...8. This is similar to
> the convention used in the DHS datasets, see eg here page 3:
> http://www.measuredhs.com/pubs/pdf/DHSG4/Recode6_DHS_22March2013_DHSG4.pdf
>
> Paul (Evan) must check with the data provider whether the other
> convention is also true - that the missing values should be at least
> one digit wider than the widest (in terms of digits) possible value.
> Otherwise, e.g. if the value 999 determines the missing age, ages 9
> and 99 will also be caught in the recoding schemes he is using based
> on the proposed regular expressions.
>
> In general, I don't see how he will be able to determine which 9...9
> patterns indeed correspond to missing without having a prior knowledge
> of the variable contents, or instructions from the data provider, or
> carefully inspecting individual values of each variable trying to
> determine the ranges of widths of the values. Values of income of
> 99USD or 998USD might in the end be actual data, etc.
>
> Best, Sergiy Radyakin
>
>
>
>
> On Thu, Oct 3, 2013 at 5:08 AM, Evan DeFilippis <[email protected]> wrote:
>> Values in my data set contain different numerical representations for
>> "Don't Know" and "Refusal"
>>
>> A "Don't Know" will always start and end with a '9', but there can be
>> as many '9's in between as possible, up to the maximum length of a
>> string (244).
>>
>> A "Refusal" will always start with a '9' and end with an '8', and
>> there can be as many '9's' in between as possible, up to the maximum
>> length of a string (244).
>>
>> The data set contains strings, integers, bytes, etc..
>>
>> I want to be able to convert the numerical representations of 'Don't
>> Know' and 'Refusal's' into DK and REF, respectively.
>>
>> My current strategy for doing this looks like so:
>>
>> quietly tostring _all, replace
>> ds, has(type string)
>> di "`r(varlist)'"
>> unab string_vars : `r(varlist)'
>> foreach j in `string_vars'  {
>>   quietly replace `j'= regexr(`j', "^[9]*[9]$","DK")
>>   quietly replace `j' = regexr(`j', "^[9]*[8]$", "REF")
>> }
>>
>> However, this is slow because it converts the entire data set into
>> strings, which takes about 5 minutes, and then it has to do has(type
>> string) in order to get r(varlist) to iterate over all those strings
>> which takes about 4 minutes.
>>
>> Is there a faster way to do this that perhaps does not involve
>> converting everything to strings?
>>
>> Paul
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: Fwd: Fastest way to identify values that start and end with a 9?
  - From: Evan DeFilippis <[email protected]>
- Re: st: Fwd: Fastest way to identify values that start and end with a 9?
  - From: Sergiy Radyakin <[email protected]>

Prev by Date: st: Update to -weakiv- on SSC
Next by Date: st: Single graph from multiple datasets or memory
Previous by thread: Re: st: Fwd: Fastest way to identify values that start and end with a 9?
Next by thread: st: Looping over datetimes for simultaneous variable generation
Index(es):
- Date
- Thread