Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Re: Cleaning messy data
From 
 
Nick Cox <[email protected]> 
To 
 
[email protected] 
Subject 
 
Re: st: Re: Cleaning messy data 
Date 
 
Mon, 28 Nov 2011 19:49:03 +0000 
Interesting problem.
Having sometimes sung the song "regex functions are good, but more
basic string functions can be as good or better", I am now going to
reverse that.
This is a good test problem for -moss- (SSC)!
I started something like this, with the idea that the pattern was a
number, possibly a space and then a word:
. moss dpr, match("([0-9]+ ?[a-z]+)") regex
. drop _pos*
. l
    +-----------------------------------------------------------------+
    |                    dpr   _count   _match1     _match2   _match3 |
    |-----------------------------------------------------------------|
 1. |  2 yrs 5months 26 days        3     2 yrs     5months   26 days |
 2. |         3 yrs 2 months        2     3 yrs    2 months           |
 3. |           1yr 9 months        2       1yr    9 months           |
 4. |          1 yr 8 months        2      1 yr    8 months           |
 5. | 1 yr 11 months 28 days        3      1 yr   11 months   28 days |
    |-----------------------------------------------------------------|
 6. |           1 yr 12 days        2      1 yr     12 days           |
 7. |  3 yrs 3 months12 days        3     3 yrs    3 months   12 days |
 8. |  3yrs 4 months 26 days        3      3yrs    4 months   26 days |
 9. |     1 yr 9mnths 8 days        3      1 yr      9mnths    8 days |
    +-----------------------------------------------------------------+
That is a fair start, but there is still some cleaning up to do. An
alternative, which I prefer, extracts each element separately.
. keep dpr
. moss dpr , match("([0-9]+) ?y") regex prefix(y)
. moss dpr , match("([0-9]+) ?m") regex prefix(m)
. moss dpr , match("([0-9]+) ?d") regex prefix(d)
. drop *count *pos*
. l
     +------------------------------------------------------+
     |                    dpr   ymatch1   mmatch1   dmatch1 |
     |------------------------------------------------------|
  1. |  2 yrs 5months 26 days         2         5        26 |
  2. |         3 yrs 2 months         3         2           |
  3. |           1yr 9 months         1         9           |
  4. |          1 yr 8 months         1         8           |
  5. | 1 yr 11 months 28 days         1        11        28 |
     |------------------------------------------------------|
  6. |           1 yr 12 days         1                  12 |
  7. |  3 yrs 3 months12 days         3         3        12 |
  8. |  3yrs 4 months 26 days         3         4        26 |
  9. |     1 yr 9mnths 8 days         1         9         8 |
     +------------------------------------------------------+
The assumption of "at most one space" between number and word works
for the example, but might be too strong for the whole dataset.
Nick
On Mon, Nov 28, 2011 at 6:12 PM, daniel klein
<[email protected]> wrote:
> Martyn,
>
> based on what you have now, something like the following could work.
>
> replace dpr4 = subinstr(dpr4, "y", "*365", .)
> replace dpr4 = subinstr(dpr4, "m", "*28", .)
> replace dpr4 = subinstr(dpr4, "d", "", .)
> replace dpr4 = subinstr(dpr4, " ", "+")
>    // note the missing spaces between "*" and "#"
>
> encode dpr4 ,g(new)
> qui su new ,mean
> forv j = 1/`r(max)' {
>    qui replace new = `: lab new `j'' if new == `j'
> }
>
> This is a little ad-hoc and might neither be the most elegant nor the
> fastest solution.
>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/