Statalist



RE: st: regexr and missing values


From   Howard Lempel <HLempel@brookings.edu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: regexr and missing values
Date   Mon, 20 Oct 2008 10:37:33 -0400

Mike,

Thanks for the suggestions!  Both of the workarounds that you proposed work perfectly.

If possible, I'd still like to understand why these unexpected results are occurring (if only to restore my confidence when using regular expressions in the future).  I therefore further explored my data along the lines you suggested:

1. The unexpected results do not appear to be the result of "funny" characters in lfpatfin that appear as blanks when listed.  I issued -outsheet- and then -hexdump-.  The only characters in lfpatfin were the letters that I expected and the expected number of quotation marks.

2.  The unexpected results also do not appear to be the result of the size of my dataset.  I -sort-ed the data so that the "problem blanks" were the first two observations, while the "non-problem blanks" were distributed throughout the dataset (from observation 39 through observation 30,333).  When I recreated test, the same two observations remained "problem blanks."  Dropping all observations except for the two that become "problem blanks" and then recreating test continues to produce "E"s in those observations where I expected to create blanks.
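The two checks described above can be sketched roughly as follows. This is a reconstruction, not the exact commands from the session: the file name lfpatfin_check.txt is hypothetical, and the second part assumes test already exists from the original -gen- command. Run it on a copy of the data, since -keep- discards every other observation.

```stata
* Check 1: write the variable to a text file and inspect the raw bytes.
* (lfpatfin_check.txt is a hypothetical file name.)
outsheet lfpatfin using lfpatfin_check.txt, replace
hexdump lfpatfin_check.txt, analyze

* Check 2: keep only the observations that misbehave, then rebuild test.
* Assumes test was created earlier with the original regexr() command.
keep if missing(lfpatfin) & test == "E"
drop test
gen test = regexr(lfpatfin, "U$", "E")
list lfpatfin test
```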

Thanks again,
Howie

Howie Lempel
Research Assistant
The Brookings Institution | Economic Studies

1775 Massachusetts Ave NW | Washington DC 20036
hlempel@brookings.edu | p: (202) 238-3576

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Michael Hanson
Sent: Friday, October 17, 2008 8:36 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: regexr and missing values

Howie:

I don't have any special insight into the problem you document, but I
can suggest two potential work-arounds to try. (Note: these are
untested!)


1) Use wildcards to match additional characters and thus (hopefully)
avoid blanks, i.e.:

gen test = lfpatfin
replace test = regexs(1)+"E" if regexm(lfpatfin,"^([A-Z]*)U$")


2) Wrap your -regexr()- command in a -cond()- statement, i.e.:

gen test = cond(missing(lfpatfin), "", regexr(lfpatfin,"U$", "E"))
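One way to confirm that either workaround leaves the empty strings alone is to follow it with an -assert-. A minimal sketch, assuming test was just created by one of the two approaches above:

```stata
* Stops with an error if any empty lfpatfin ended up with a non-empty test.
assert test == "" if lfpatfin == ""

* Counts the observations that were actually changed (trailing U -> E).
count if test != lfpatfin
```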


Two other things to investigate, in light of your inability to
reproduce the problem with other data:

1. It may be the case that what appears as a "blank" is really some
"funny" character (or characters) that appears blank when listed to
the screen, but is not seen as blank by the -regex- functions.
Perhaps -tab lfpatfin, missing- might turn up multiple "blanks".
Otherwise, you could write the data series in question to a file,
then examine that with -hexdump-.  (Or examine it with a high-end
text editor -- though you might not have one at your disposal.)  I
think this explanation is unlikely, however, as it would have to fool
the -missing()- function as well as -list-, but not -regex?-.

2. I notice that both you and Yun Liu seem to run into this problem
at observation numbers above 20000.  Is it possible that some
internal limit is inadvertently triggering this problem? One
potential way to test that theory is to -sort- your data so that
"problem blanks" appear as lower observations, while "non-problem
blanks" now end up above 20000, and repeat your command.  (Obviously,
use a copy of your dataset!)  If the problem is invariant to sort
order, you can likely eliminate the observation number as a
contributor to this problem.  (Unless your generic dataset for
testing was sufficiently large, you might not have triggered this
source of error should it exist.)
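
These two checks might look something like the following. This is an untested sketch: the variable names len, problem and test2 are made up for illustration, and it assumes test already exists from the original -gen- command. Work on a copy of the dataset.

```stata
* 1. Look for values that only appear to be blank.
tab lfpatfin, missing
gen len = length(lfpatfin)
* An "E" in test should only come from a one-character "U";
* anything listed here has an unexpected length.
list lfpatfin len test if test == "E" & len != 1

* 2. Move the problem observations to the top and repeat the command.
gen byte problem = missing(lfpatfin) & test == "E"
gsort -problem
gen test2 = regexr(lfpatfin, "U$", "E")
list lfpatfin test2 in 1/5
```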

HTH,
Mike

On Oct 17, 2008, at 4:34 PM, Howard Lempel wrote:

> Hello all,
>
> I'm using Stata 10 (last updated 10/10/07) and am having a bit of
> trouble with the -regexr- function.  I can't tell if I've stumbled
> on a bug or if I'm doing something wrong.
>
> I am trying to use -regexr- to transform a string variable called
> lfpatfin.  I'd like to take every observation where the last letter
> in lfpatfin is "U" and substitute an "E" for the "U".  The code
> appears to work except that two observations where lfpatfin was
> missing have been replaced with an "E".  This appears to be similar
> to a problem Yun Liu had with -regexm- on July 16 in this thread:
> http://www.stata.com/statalist/archive/2008-07/msg00596.html, but I
> can't tell if Yun's issue was ever resolved.  I have been unable to
> reproduce the problem using the auto dataset or a generic dataset
> I created.  My code and some output follow.  I did nothing
> to test in between generating it and the -list- command.
>
> gen test = regexr(lfpatfin,"U$", "E")
> list lfpatfin test in 1/1000 if lfpatfin != test
>
>       +-----------------+
>       | lfpatfin   test |
>       |-----------------|
>   70. |      FRU    FRE |
>  105. |      RFU    RFE |
>  148. |        U      E |
>  161. |        U      E |
>  554. |       FU     FE |
>       |-----------------|
>  861. |       FU     FE |
>  914. |        U      E |
>       +-----------------+
>
> list lfpatfin test if missing(lfpatfin) & !missing(test)
>
>        +-----------------+
>        | lfpatfin   test |
>        |-----------------|
> 20074. |               E |
> 24067. |               E |
>        +-----------------+
>
> list lfpatfin test in 16000/16200 if missing(lfpatfin)
>
>        +-----------------+
>        | lfpatfin   test |
>        |-----------------|
> 16156. |                 |
> 16162. |                 |
> 16166. |                 |
> 16170. |                 |
> 16175. |                 |
>        |-----------------|
> 16176. |                 |
> 16179. |                 |
> 16180. |                 |
> 16183. |                 |
> 16186. |                 |
>        |-----------------|
> 16197. |                 |
>        +-----------------+
>
> For what it's worth, I make similar changes to lfpatfin
> (substituting "B"s for final "D"s) later in my code and have the
> same problem.
>
> I'd appreciate it a lot if anyone has any explanation.  I also do
> not know how to see what Stata has updated since my last update,
> but I would be grateful if anyone knows where to go for that - I'd
> like to check whether the -regex- functions have been changed.
> Unfortunately, I don't have the admin rights to update my version
> of Stata.
>
> Thanks for your consideration.
> Howie
>
>
> Howie Lempel
> Research Assistant
> The Brookings Institution | Economic Studies
>
> 1775 Massachusetts Ave NW | Washington DC 20036
> hlempel@brookings.edu | p: (202) 238-3576
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
