From: Howard Lempel <HLempel@brookings.edu>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: RE: st: regexr and missing values
Date: Mon, 20 Oct 2008 10:37:33 -0400

Mike,

Thanks for the suggestions! Both of the workarounds that you proposed work perfectly. If possible, I'd still like to understand why these unexpected results are occurring (if only to restore my confidence when using regular expressions in the future). I therefore further explored my data along the lines you suggested:

1. The unexpected results do not appear to be the result of "funny" characters in lfpatfin that appear as blanks when listed. I issued -outsheet- and then -hexdump-. The only characters in lfpatfin were the letters that I expected and the expected number of quotation marks.

2. The unexpected results also do not appear to be the result of the size of my dataset. I -sort-ed the data so that the "problem blanks" were the first two observations, while the "non-problem blanks" were distributed throughout the dataset (from observation 39 through observation 30,333). When I recreated test, the same two observations remained "problem blanks." Dropping all observations except for the two that become "problem blanks" and then recreating test continues to produce "E"s in those observations where I expected to create blanks.

Thanks again,
Howie

Howie Lempel
Research Assistant
The Brookings Institution | Economic Studies
1775 Massachusetts Ave NW | Washington DC 20036
hlempel@brookings.edu | p: (202) 238-3576

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Michael Hanson
Sent: Friday, October 17, 2008 8:36 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: regexr and missing values

Howie:

I don't have any special insight into the problem you document, but I can suggest two potential work-arounds to try. (Note: these are untested!)

1) Use wildcards to match additional characters and thus (hopefully) avoid blanks, i.e.:

    gen test = lfpatfin
    replace test = regexs(1)+"E" if regexm(lfpatfin,"^([A-Z]*)U$")

2) Wrap your -regexr()- command in a -cond()- statement, i.e.:

    gen test = cond(missing(lfpatfin), "", regexr(lfpatfin,"U$", "E"))

Two other things to investigate, in light of your inability to reproduce the problem with other data:

1. It may be the case that what appears as a "blank" is really some "funny" character (or characters) that appears blank when listed to the screen, but is not seen as blank by the -regex- functions. Perhaps -tab lfpatfin, missing- might turn up multiple "blanks". Otherwise, you could write the data series in question to a file, then examine that with -hexdump-. (Or examine it with a high-end text editor -- though you might not have one at your disposal.) I think this explanation is unlikely, however, as it would have to fool the -missing()- function as well as -list-, but not -regex?-.

2. I notice that both you and Yun Liu seem to run into this problem at observation numbers above 20000. Is it possible that some internal limit is inadvertently triggering this problem? One potential way to test that theory is to -sort- your data so that "problem blanks" appear as lower observations, while "non-problem blanks" now end up above 20000, and repeat your command. (Obviously, use a copy of your dataset!) If the problem is invariant to sort order, you can likely eliminate the observation number as a contributor to this problem. (Unless your generic dataset for testing was sufficiently large, you might not have triggered this source of error should it exist.)

HTH,
Mike

On Oct 17, 2008, at 4:34 PM, Howard Lempel wrote:

> Hello all,
>
> I'm using Stata 10 (last updated 10/10/07) and am having a bit of
> trouble with the -regexr- function. I can't tell if I've stumbled
> on a bug or if I'm doing something wrong.
>
> I am trying to use -regexr- to transform a string variable called
> lfpatfin. I'd like to take every observation where the last letter
> in lfpatfin is "U" and substitute an "E" for the "U". The code
> appears to work except that two observations where lfpatfin was
> missing have been replaced with an "E". This appears to be similar
> to a problem Yun Liu had with -regexm- on July 16 in this thread:
> http://www.stata.com/statalist/archive/2008-07/msg00596.html, but I
> can't tell if Yun's issue was ever resolved. I have been unable to
> reproduce the problem using the auto dataset or a generic
> dataset I created. My code and some output follows. I did nothing
> to test in between generating it and the -list- command.
>
> gen test = regexr(lfpatfin,"U$", "E")
> list lfpatfin test in 1/1000 if lfpatfin != test
>
>         +-----------------+
>         | lfpatfin   test |
>         |-----------------|
>     70. |      FRU    FRE |
>    105. |      RFU    RFE |
>    148. |        U      E |
>    161. |        U      E |
>    554. |       FU     FE |
>         |-----------------|
>    861. |       FU     FE |
>    914. |        U      E |
>         +-----------------+
>
> list lfpatfin test if missing(lfpatfin) & !missing(test)
>
>         +-----------------+
>         | lfpatfin   test |
>         |-----------------|
> 20074.  |               E |
> 24067.  |               E |
>         +-----------------+
>
> . list lfpatfin test in 16000/16200 if missing(lfpatfin)
>
>         +-----------------+
>         | lfpatfin   test |
>         |-----------------|
> 16156.  |                 |
> 16162.  |                 |
> 16166.  |                 |
> 16170.  |                 |
> 16175.  |                 |
>         |-----------------|
> 16176.  |                 |
> 16179.  |                 |
> 16180.  |                 |
> 16183.  |                 |
> 16186.  |                 |
>         |-----------------|
> 16197.  |                 |
>         +-----------------+
>
> For what it's worth, I try to make similar changes to lfpatfin
> (substituting "B"s for final "D"s) later in my code and had the
> same problem.
>
> I'd appreciate it a lot if anyone has any explanation. I also do
> not know how to see what Stata has updated since my last update,
> but I would be grateful if anyone knows where to go for that - I'd
> like to check whether the -regex- functions have been changed.
> Unfortunately, I don't have the admin rights to update my version
> of Stata.
>
> Thanks for your consideration.
> Howie
>
> Howie Lempel
> Research Assistant
> The Brookings Institution | Economic Studies
> 1775 Massachusetts Ave NW | Washington DC 20036
> hlempel@brookings.edu | p: (202) 238-3576
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
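
For readers who want to try the two work-arounds from this thread side by side, here is a minimal sketch. The toy values are made up for illustration; only the variable name lfpatfin and the two commands come from the messages above, and test1 and test2 are hypothetical names. It does not reproduce the unexpected "E"s (Howie could not reproduce them on other data either). It only checks that both work-arounds agree and leave missing values of lfpatfin missing.

```stata
* Toy data: hypothetical values, chosen only for illustration.
clear
input str3 lfpatfin
"FRU"
"RFU"
"U"
"FU"
""
"FD"
end

* Work-around 1 (from the thread): change only strings that match in full.
generate test1 = lfpatfin
replace test1 = regexs(1) + "E" if regexm(lfpatfin, "^([A-Z]*)U$")

* Work-around 2 (from the thread): never pass a missing string to regexr().
generate test2 = cond(missing(lfpatfin), "", regexr(lfpatfin, "U$", "E"))

* Both work-arounds should agree, and missing inputs should stay missing.
assert test1 == test2
assert missing(test1) & missing(test2) if missing(lfpatfin)

list lfpatfin test1 test2
```

Note that work-around 1 assumes lfpatfin contains only uppercase letters, as the pattern "^([A-Z]*)U$" will not match anything else before the final "U".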

**References**:

- **st: regexr and missing values** *From:* Howard Lempel <HLempel@brookings.edu>
- **Re: st: regexr and missing values** *From:* Michael Hanson <mshanson@mac.com>

