Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Re: Cleaning messy data |
Date | Tue, 29 Nov 2011 08:13:15 +0000 |
For the record, moss dpr, match("([0-9]+) *y") regex is an example weakening the assumption about spaces to zero or more. Nick On Mon, Nov 28, 2011 at 7:49 PM, Nick Cox <njcoxstata@gmail.com> wrote: > Interesting problem. > > Having sometimes sung the song "regex functions are good, but more > basic string functions can be as good or better", I am now going to > reverse that. > > This is a good test problem for -moss- (SSC)! > > I started something like this, with the idea that the pattern was a > number, possibly a space and then a word: > > . moss dpr, match("([0-9]+ ?[a-z]+)") regex > > . drop _pos* > > . l > > +-----------------------------------------------------------------+ > | dpr _count _match1 _match2 _match3 | > |-----------------------------------------------------------------| > 1. | 2 yrs 5months 26 days 3 2 yrs 5months 26 days | > 2. | 3 yrs 2 months 2 3 yrs 2 months | > 3. | 1yr 9 months 2 1yr 9 months | > 4. | 1 yr 8 months 2 1 yr 8 months | > 5. | 1 yr 11 months 28 days 3 1 yr 11 months 28 days | > |-----------------------------------------------------------------| > 6. | 1 yr 12 days 2 1 yr 12 days | > 7. | 3 yrs 3 months12 days 3 3 yrs 3 months 12 days | > 8. | 3yrs 4 months 26 days 3 3yrs 4 months 26 days | > 9. | 1 yr 9mnths 8 days 3 1 yr 9mnths 8 days | > +-----------------------------------------------------------------+ > > That is a fair start, but there is still some cleaning up to do. An > alternative, which I prefer, extracts each element separately. > > . keep dpr > > . moss dpr , match("([0-9]+) ?y") regex prefix(y) > > . moss dpr , match("([0-9]+) ?m") regex prefix(m) > > . moss dpr , match("([0-9]+) ?d") regex prefix(d) > > . drop *count *pos* > > . l > > +------------------------------------------------------+ > | dpr ymatch1 mmatch1 dmatch1 | > |------------------------------------------------------| > 1. | 2 yrs 5months 26 days 2 5 26 | > 2. | 3 yrs 2 months 3 2 | > 3. | 1yr 9 months 1 9 | > 4. | 1 yr 8 months 1 8 | > 5. | 1 yr 11 months 28 days 1 11 28 | > |------------------------------------------------------| > 6. | 1 yr 12 days 1 12 | > 7. | 3 yrs 3 months12 days 3 3 12 | > 8. | 3yrs 4 months 26 days 3 4 26 | > 9. | 1 yr 9mnths 8 days 1 9 8 | > +------------------------------------------------------+ > > The assumption of "at most one space" between number and word works > for the example, but might be too strong for the whole dataset. > > Nick > > On Mon, Nov 28, 2011 at 6:12 PM, daniel klein > <klein.daniel.81@googlemail.com> wrote: >> Martyn, >> >> based on what you have now, something like the following could work. >> >> replace dpr4 = subinstr(dpr4, "y", "*365", .) >> replace dpr4 = subinstr(dpr4, "m", "*28", .) >> replace dpr4 = subinstr(dpr4, "d", "", .) >> replace dpr4 = subinstr(dpr4, " ", "+") >> // note the missing spaces between "*" and "#" >> >> encode dpr4 ,g(new) >> qui su new ,mean >> forv j = 1/`r(max)' { >> qui replace new = `: lab new `j'' if new == `j' >> } >> >> This is a little ad-hoc and might neither be the most elegant nor the >> fastest solution. * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/