Re: st: Fwd: Fastest way to identify values that start and end with a 9?

From	Nick Cox <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: Fwd: Fastest way to identify values that start and end with a 9?
Date	Thu, 3 Oct 2013 10:40:14 +0100

You seem to have multiple identities. You signed this "Paul", but your
identifier is Evan DeFilippis. The Statalist FAQ, which you were asked
to read before posting, explains that we request the use of full real
names.

This question also seems to overlap with a question posted by "Parseltongue" at

http://stackoverflow.com/questions/19092766/stata-regex-search-and-replace-on-integer-variables

That being so, our policy on cross-posting applies, also explained in
the FAQ you were asked to read before posting:

http://www.stata.com/support/faqs/resources/statalist-faq/#crossposting

Here is the relevant part:

"People posting on Statalist may also think about posting the same
question on other listservers or in web forums. There is absolutely no
rule against doing that; it is not our business to constrain what you
do elsewhere.

But if you do post elsewhere, we ask that you provide cross-references
in URL form to searchable archives. That way, people interested in
your question can quickly check what has been said elsewhere and avoid
posting similar comments. Being open about cross-posting saves
everyone time."

This question arises at least in part because you didn't explain what
you are doing well enough on Stack Overflow for anyone to provide a
complete answer. Anyone inclined to try to answer this would do well
to look at the SO thread cited above.

All that said, I have some comments on your code.

quietly tostring _all, replace
ds, has(type string)

If you convert _all_ variables to string, then there is precisely no
need to fire up -ds- to find out _which_ variables are string. As
said, they all are. So, from that point of view, your code could be
shortened to

quietly tostring _all, replace
quietly foreach j of var * {
  replace `j' = regexr(`j', "^[9]*[9]$", "DK")
  replace `j' = regexr(`j', "^[9]*[8]$", "REF")
}

But your main question is whether you need to convert all your
variables to string. The answer is that, at most, only those variables
that might contain these patterns need converting. Also, as already
indicated on SO, if such variables are numeric, you don't _need_ to
convert them to string at all. It might be sufficient to check the
first digit and the last digit. Otherwise I don't think you've
explained your data fully enough to allow a detailed answer. I remain
unclear whether these 9...8 or 9...9 patterns are within numeric
variables or string variables holding numeric characters.
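
As a rough illustration of the two cases above, here is a minimal
sketch, not a tested solution for this data set. It assumes that
"Don't Know" codes are runs of 9s (9, 99, ...) and that "Refusal" codes
are one or more 9s followed by a single 8 (98, 998, ...). Note that the
pattern "^9+8$" is stricter than "^[9]*[8]$", which would also match a
lone "8". For numeric variables, recoding to the extended missing
values .d and .r is an assumed convention, not something stated in the
thread, and only the integer storage types are touched, so that
non-integer values are never examined.

```stata
* Sketch only: assumes DK codes are runs of 9s (9, 99, ...) and REF codes
* are one or more 9s followed by a single 8 (98, 998, ...).

* 1. String variables: edit in place; no -tostring- is needed.
ds, has(type string)
local strvars `r(varlist)'
foreach v of local strvars {
    quietly replace `v' = "DK"  if regexm(`v', "^9+$")
    quietly replace `v' = "REF" if regexm(`v', "^9+8$")
}

* 2. Integer-type variables: keep them numeric and recode to extended
*    missing values (.d and .r are an assumed convention, not from the thread).
ds, has(type byte int long)
local intvars `r(varlist)'
foreach v of local intvars {
    * string() builds a temporary text image of each value;
    * the storage type of the variable is unchanged
    quietly replace `v' = .d if regexm(string(`v'), "^9+$")
    quietly replace `v' = .r if regexm(string(`v'), "^9+8$")
}
```

Either loop only visits variables of the relevant type, so nothing is
converted wholesale. As with the original code, a legitimate value such
as 9 or 99 would be recoded too, so the variable lists may need
narrowing by hand.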

Nick
[email protected]

On 3 October 2013 10:08, Evan DeFilippis <[email protected]> wrote:
> Values in my data set contain different numerical representations for
> "Don't Know" and "Refusal"
>
> A "Don't Know" will always start and end with a '9', but there can be
> as many '9's in between as possible, up to the maximum length of a
> string (244).
>
> A "Refusal" will always start with a '9' and end with an '8', and
> there can be as many '9's in between as possible, up to the maximum
> length of a string (244).
>
> The data set contains strings, integers, bytes, etc.
>
> I want to be able to convert the numerical representations of "Don't
> Know" and "Refusal" into DK and REF, respectively.
>
> My current strategy for doing this looks like so:
>
> quietly tostring _all, replace
> ds, has(type string)
> di "`r(varlist)'"
> unab string_vars : `r(varlist)'
> foreach j in `string_vars'  {
>   quietly replace `j'= regexr(`j', "^[9]*[9]$","DK")
>   quietly replace `j' = regexr(`j', "^[9]*[8]$", "REF")
> }
>
> However, this is slow because it converts the entire data set into
> strings, which takes about 5 minutes, and then it has to do has(type
> string) in order to get r(varlist) to iterate over all those strings
> which takes about 4 minutes.
>
> Is there a faster way to do this that perhaps does not involve
> converting everything to strings?
>
> Paul