Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Re: Cleaning messy data


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Re: Cleaning messy data
Date   Tue, 29 Nov 2011 07:16:11 +0000

-moss- (SSC) grew out of a thread on this list and was publicised on
its appearance and since.

Nick

On Mon, Nov 28, 2011 at 9:27 PM, Martyn Sherriff <statams48@gmail.com> wrote:
> Daniel & Nick,
> Thank you for your advice. I thought that I had exhausted the use of
> 'subinstr' and I did not know about 'moss'.
> Martyn
>
> On 28 November 2011 19:49, Nick Cox <njcoxstata@gmail.com> wrote:
>> Interesting problem.
>>
>> Having sometimes sung the song "regex functions are good, but more
>> basic string functions can be as good or better", I am now going to
>> reverse that.
>>
>> This is a good test problem for -moss- (SSC)!
>>
>> I started something like this, with the idea that the pattern was a
>> number, possibly a space and then a word:
>>
>> . moss dpr, match("([0-9]+ ?[a-z]+)") regex
>>
>> . drop _pos*
>>
>> . l
>>
>>     +-----------------------------------------------------------------+
>>     |                    dpr   _count   _match1     _match2   _match3 |
>>     |-----------------------------------------------------------------|
>>  1. |  2 yrs 5months 26 days        3     2 yrs     5months   26 days |
>>  2. |         3 yrs 2 months        2     3 yrs    2 months           |
>>  3. |           1yr 9 months        2       1yr    9 months           |
>>  4. |          1 yr 8 months        2      1 yr    8 months           |
>>  5. | 1 yr 11 months 28 days        3      1 yr   11 months   28 days |
>>     |-----------------------------------------------------------------|
>>  6. |           1 yr 12 days        2      1 yr     12 days           |
>>  7. |  3 yrs 3 months12 days        3     3 yrs    3 months   12 days |
>>  8. |  3yrs 4 months 26 days        3      3yrs    4 months   26 days |
>>  9. |     1 yr 9mnths 8 days        3      1 yr      9mnths    8 days |
>>     +-----------------------------------------------------------------+
>>
>> That is a fair start, but there is still some cleaning up to do. An
>> alternative, which I prefer, extracts each element separately.
>>
>> . keep dpr
>>
>> . moss dpr , match("([0-9]+) ?y") regex prefix(y)
>>
>> . moss dpr , match("([0-9]+) ?m") regex prefix(m)
>>
>> . moss dpr , match("([0-9]+) ?d") regex prefix(d)
>>
>> . drop *count *pos*
>>
>> . l
>>
>>     +------------------------------------------------------+
>>     |                    dpr   ymatch1   mmatch1   dmatch1 |
>>     |------------------------------------------------------|
>>  1. |  2 yrs 5months 26 days         2         5        26 |
>>  2. |         3 yrs 2 months         3         2           |
>>  3. |           1yr 9 months         1         9           |
>>  4. |          1 yr 8 months         1         8           |
>>  5. | 1 yr 11 months 28 days         1        11        28 |
>>     |------------------------------------------------------|
>>  6. |           1 yr 12 days         1                  12 |
>>  7. |  3 yrs 3 months12 days         3         3        12 |
>>  8. |  3yrs 4 months 26 days         3         4        26 |
>>  9. |     1 yr 9mnths 8 days         1         9         8 |
>>     +------------------------------------------------------+
>>
>> The assumption of "at most one space" between number and word works
>> for the example, but might be too strong for the whole dataset.
>>
>> Nick
>>
>> On Mon, Nov 28, 2011 at 6:12 PM, daniel klein
>> <klein.daniel.81@googlemail.com> wrote:
>>> Martyn,
>>>
>>> based on what you have now, something like the following could work.
>>>
>>> replace dpr4 = subinstr(dpr4, "y", "*365", .)
>>> replace dpr4 = subinstr(dpr4, "m", "*28", .)
>>> replace dpr4 = subinstr(dpr4, "d", "", .)
>>> replace dpr4 = subinstr(dpr4, " ", "+")
>>>    // note the missing spaces between "*" and "#"
>>>
>>> encode dpr4 ,g(new)
>>> qui su new ,mean
>>> forv j = 1/`r(max)' {
>>>    qui replace new = `: lab new `j'' if new == `j'
>>> }
>>>
>>> This is a little ad-hoc and might neither be the most elegant nor the
>>> fastest solution.
>>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index