Statalist



Re: st: regexr and missing values


From   Michael Hanson <mshanson@mac.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: regexr and missing values
Date   Fri, 17 Oct 2008 20:35:46 -0400

Howie:

I don't have any special insight into the problem you document, but I can suggest two potential work-arounds to try. (Note: these are untested!)


1) Use wildcards to match additional characters and thus (hopefully) avoid blanks, i.e.:

gen test = lfpatfin
replace test = regexs(1)+"E" if regexm(lfpatfin,"^([A-Z]*)U$")


2) Wrap your -regexr()- command in a -cond()- statement, i.e.:

gen test = cond(missing(lfpatfin), "", regexr(lfpatfin,"U$", "E"))
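A minimal sketch for checking the two work-arounds above against each other before running them on the real data. The toy values and the variable names test1 and test2 are made up for illustration; only lfpatfin and the two commands come from this thread. On the real data, the three -assert- lines at the end can be run after generating test1 and test2.

```stata
* Toy data standing in for the real lfpatfin (values are invented for illustration)
clear
input str3 lfpatfin
"FRU"
"RFU"
"U"
""
"FE"
end

* Work-around 1: only change observations that actually match the pattern
gen test1 = lfpatfin
replace test1 = regexs(1) + "E" if regexm(lfpatfin, "^([A-Z]*)U$")

* Work-around 2: explicitly pass missing strings through untouched
gen test2 = cond(missing(lfpatfin), "", regexr(lfpatfin, "U$", "E"))

* Blanks should stay blank, and the two results should agree
assert missing(test1) if missing(lfpatfin)
assert missing(test2) if missing(lfpatfin)
assert test1 == test2
```

Note that work-around 1 as written only matches values made up entirely of capital letters, so the two results could differ if lfpatfin contains anything else before the final "U".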


Two other things to investigate, in light of your inability to reproduce the problem with other data:

1. It may be the case that what appears as a "blank" is really some "funny" character (or characters) that appears blank when listed to the screen, but is not seen as blank by the -regex- functions. Perhaps -tab lfpatfin, missing- might turn up multiple "blanks". Otherwise, you could write the data series in question to a file, then examine that with -hexdump-. (Or examine it with a high-end text editor -- though you might not have one at your disposal.) I think this explanation is unlikely, however, as it would have to fool the -missing()- function as well as -list-, but not -regex?-.

2. I notice that both you and Yun Liu seem to run into this problem at observation numbers above 20000. Is it possible that some internal limit is inadvertently triggering this problem? One potential way to test that theory is to -sort- your data so that "problem blanks" appear as lower observations, while "non-problem blanks" now end up above 20000, and repeat your command. (Obviously, use a copy of your dataset!) If the problem is invariant to sort order, you can likely eliminate the observation number as a contributor to this problem. (Unless your generic dataset for testing was sufficiently large, you might not have triggered this source of error should it exist.)
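The two checks above could be sketched roughly as follows. This is an untested outline, not a definitive procedure: it assumes the dataset containing lfpatfin is in memory, the file name lfpatfin_dump.txt and the variables origobs, isblank, newobs, and test_sorted are hypothetical, and the range 20070/20078 is simply a window around observation 20074 from the output quoted below. The sort step is a variant of the suggestion above: it moves every blank to the top of the data rather than swapping particular observations.

```stata
* Check 1: write a few observations around a problem blank to a text file
* and look at the raw bytes for unexpected characters
outfile lfpatfin using lfpatfin_dump.txt in 20070/20078, noquote replace
hexdump lfpatfin_dump.txt, analyze

* Check 2: see whether the problem follows the observation number
preserve                                  // work on a temporary copy of the data
gen long origobs = _n                     // remember the original observation number
gen byte isblank = missing(lfpatfin)
gsort -isblank origobs                    // move all blank observations to the top
gen long newobs = _n
gen test_sorted = regexr(lfpatfin, "U$", "E")
* List any blanks that still came back non-blank, with old and new positions
list origobs newobs lfpatfin test_sorted if missing(lfpatfin) & !missing(test_sorted)
restore                                   // return to the original data
```

If the same original observations (20074 and 24067) are still listed after the sort, the observation number is probably not the cause; if nothing is listed, position in the dataset looks more suspicious.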

HTH,
Mike

On Oct 17, 2008, at 4:34 PM, Howard Lempel wrote:

Hello all,

I'm using Stata 10 (last updated 10/10/07) and am having a bit of trouble with the -regexr- function. I can't tell if I've stumbled on a bug or if I'm doing something wrong.

I am trying to use -regexr- to transform a string variable called lfpatfin. I'd like to take every observation where the last letter in lfpatfin is "U" and substitute an "E" for the "U". The code appears to work, except that in two observations where lfpatfin was missing, test has been set to "E". This appears to be similar to a problem Yun Liu had with -regexm- on July 16 in this thread: http://www.stata.com/statalist/archive/2008-07/msg00596.html, but I can't tell if Yun's issue was ever resolved. I have been unable to reproduce the problem using the auto dataset or a generic dataset I created. My code and some output follow. I did nothing to test between generating it and running the -list- commands.

gen test = regexr(lfpatfin,"U$", "E")
list lfpatfin test in 1/1000 if lfpatfin != test

      +-----------------+
      | lfpatfin   test |
      |-----------------|
  70. |      FRU    FRE |
 105. |      RFU    RFE |
 148. |        U      E |
 161. |        U      E |
 554. |       FU     FE |
      |-----------------|
 861. |       FU     FE |
 914. |        U      E |
      +-----------------+
list lfpatfin test if missing(lfpatfin) & !missing(test)

       +-----------------+
       | lfpatfin   test |
       |-----------------|
20074. |               E |
24067. |               E |
       +-----------------+

list lfpatfin test in 16000/16200 if missing(lfpatfin)

       +-----------------+
       | lfpatfin   test |
       |-----------------|
16156. |                 |
16162. |                 |
16166. |                 |
16170. |                 |
16175. |                 |
       |-----------------|
16176. |                 |
16179. |                 |
16180. |                 |
16183. |                 |
16186. |                 |
       |-----------------|
16197. |                 |
       +-----------------+

For what it's worth, I tried to make similar changes to lfpatfin (substituting "B"s for final "D"s) later in my code and had the same problem.

I'd appreciate it a lot if anyone has any explanation. I also do not know how to see what Stata has updated since my last update, but I would be grateful if anyone knows where to go for that - I'd like to check whether the -regex- functions have been changed. Unfortunately, I don't have the admin rights to update my version of Stata.

Thanks for your consideration.
Howie


Howie Lempel
Research Assistant
The Brookings Institution | Economic Studies

1775 Massachusetts Ave NW | Washington DC 20036
hlempel@brookings.edu | p: (202) 238-3576



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


