Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Re: Cleaning messy data


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Re: Cleaning messy data
Date   Mon, 28 Nov 2011 19:49:03 +0000

Interesting problem.

Having sometimes sung the song "regex functions are good, but more
basic string functions can be as good or better", I am now going to
reverse that.

This is a good test problem for -moss- (SSC)!

I started something like this, with the idea that the pattern was a
number, possibly a space and then a word:

. moss dpr, match("([0-9]+ ?[a-z]+)") regex

. drop _pos*

. l

    +-----------------------------------------------------------------+
    |                    dpr   _count   _match1     _match2   _match3 |
    |-----------------------------------------------------------------|
 1. |  2 yrs 5months 26 days        3     2 yrs     5months   26 days |
 2. |         3 yrs 2 months        2     3 yrs    2 months           |
 3. |           1yr 9 months        2       1yr    9 months           |
 4. |          1 yr 8 months        2      1 yr    8 months           |
 5. | 1 yr 11 months 28 days        3      1 yr   11 months   28 days |
    |-----------------------------------------------------------------|
 6. |           1 yr 12 days        2      1 yr     12 days           |
 7. |  3 yrs 3 months12 days        3     3 yrs    3 months   12 days |
 8. |  3yrs 4 months 26 days        3      3yrs    4 months   26 days |
 9. |     1 yr 9mnths 8 days        3      1 yr      9mnths    8 days |
    +-----------------------------------------------------------------+

That is a fair start, but there is still some cleaning up to do. An
alternative, which I prefer, extracts each element separately.

. keep dpr

. moss dpr , match("([0-9]+) ?y") regex prefix(y)

. moss dpr , match("([0-9]+) ?m") regex prefix(m)

. moss dpr , match("([0-9]+) ?d") regex prefix(d)

. drop *count *pos*

. l

     +------------------------------------------------------+
     |                    dpr   ymatch1   mmatch1   dmatch1 |
     |------------------------------------------------------|
  1. |  2 yrs 5months 26 days         2         5        26 |
  2. |         3 yrs 2 months         3         2           |
  3. |           1yr 9 months         1         9           |
  4. |          1 yr 8 months         1         8           |
  5. | 1 yr 11 months 28 days         1        11        28 |
     |------------------------------------------------------|
  6. |           1 yr 12 days         1                  12 |
  7. |  3 yrs 3 months12 days         3         3        12 |
  8. |  3yrs 4 months 26 days         3         4        26 |
  9. |     1 yr 9mnths 8 days         1         9         8 |
     +------------------------------------------------------+

The assumption of "at most one space" between number and word works
for the example, but might be too strong for the whole dataset.

Nick

On Mon, Nov 28, 2011 at 6:12 PM, daniel klein
<[email protected]> wrote:
> Martyn,
>
> based on what you have now, something like the following could work.
>
> replace dpr4 = subinstr(dpr4, "y", "*365", .)
> replace dpr4 = subinstr(dpr4, "m", "*28", .)
> replace dpr4 = subinstr(dpr4, "d", "", .)
> replace dpr4 = subinstr(dpr4, " ", "+")
>    // note the missing spaces between "*" and "#"
>
> encode dpr4 ,g(new)
> qui su new ,mean
> forv j = 1/`r(max)' {
>    qui replace new = `: lab new `j'' if new == `j'
> }
>
> This is a little ad-hoc and might neither be the most elegant nor the
> fastest solution.
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index