Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Re: Cleaning messy data

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: Re: Cleaning messy data
Date	Mon, 28 Nov 2011 19:49:03 +0000

Interesting problem.

Having sometimes sung the song "regex functions are good, but more
basic string functions can be as good or better", I am now going to
reverse that.

This is a good test problem for -moss- (SSC)!

I started something like this, with the idea that the pattern was a
number, possibly a space and then a word:

. moss dpr, match("([0-9]+ ?[a-z]+)") regex

. drop _pos*

. l

    +-----------------------------------------------------------------+
    |                    dpr   _count   _match1     _match2   _match3 |
    |-----------------------------------------------------------------|
 1. |  2 yrs 5months 26 days        3     2 yrs     5months   26 days |
 2. |         3 yrs 2 months        2     3 yrs    2 months           |
 3. |           1yr 9 months        2       1yr    9 months           |
 4. |          1 yr 8 months        2      1 yr    8 months           |
 5. | 1 yr 11 months 28 days        3      1 yr   11 months   28 days |
    |-----------------------------------------------------------------|
 6. |           1 yr 12 days        2      1 yr     12 days           |
 7. |  3 yrs 3 months12 days        3     3 yrs    3 months   12 days |
 8. |  3yrs 4 months 26 days        3      3yrs    4 months   26 days |
 9. |     1 yr 9mnths 8 days        3      1 yr      9mnths    8 days |
    +-----------------------------------------------------------------+

That is a fair start, but there is still some cleaning up to do. An
alternative, which I prefer, extracts each element separately.

. keep dpr

. moss dpr , match("([0-9]+) ?y") regex prefix(y)

. moss dpr , match("([0-9]+) ?m") regex prefix(m)

. moss dpr , match("([0-9]+) ?d") regex prefix(d)

. drop *count *pos*

. l

     +------------------------------------------------------+
     |                    dpr   ymatch1   mmatch1   dmatch1 |
     |------------------------------------------------------|
  1. |  2 yrs 5months 26 days         2         5        26 |
  2. |         3 yrs 2 months         3         2           |
  3. |           1yr 9 months         1         9           |
  4. |          1 yr 8 months         1         8           |
  5. | 1 yr 11 months 28 days         1        11        28 |
     |------------------------------------------------------|
  6. |           1 yr 12 days         1                  12 |
  7. |  3 yrs 3 months12 days         3         3        12 |
  8. |  3yrs 4 months 26 days         3         4        26 |
  9. |     1 yr 9mnths 8 days         1         9         8 |
     +------------------------------------------------------+

The assumption of "at most one space" between number and word works
for the example, but might be too strong for the whole dataset.

Nick

On Mon, Nov 28, 2011 at 6:12 PM, daniel klein
<[email protected]> wrote:
> Martyn,
>
> based on what you have now, something like the following could work.
>
> replace dpr4 = subinstr(dpr4, "y", "*365", .)
> replace dpr4 = subinstr(dpr4, "m", "*28", .)
> replace dpr4 = subinstr(dpr4, "d", "", .)
> replace dpr4 = subinstr(dpr4, " ", "+")
>    // note the missing spaces between "*" and "#"
>
> encode dpr4 ,g(new)
> qui su new ,mean
> forv j = 1/`r(max)' {
>    qui replace new = `: lab new `j'' if new == `j'
> }
>
> This is a little ad-hoc and might neither be the most elegant nor the
> fastest solution.
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: Re: Cleaning messy data
  - From: Nick Cox <[email protected]>
- Re: st: Re: Cleaning messy data
  - From: Martyn Sherriff <[email protected]>

References:
- Re: st: Re: Cleaning messy data
  - From: daniel klein <[email protected]>

Prev by Date: RE: st: Graphing medians with xtgraph bug
Next by Date: st: Dataset of means from the three largest values of a group
Previous by thread: Re: st: Re: Cleaning messy data
Next by thread: Re: st: Re: Cleaning messy data
Index(es):
- Date
- Thread