Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Re: Cleaning messy data


From   Martyn Sherriff <statams48@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: Re: Cleaning messy data
Date   Mon, 28 Nov 2011 16:53:20 +0000

I have a large dataset in which the dates have been entered as
freetext with random spaces, spellings and characters:
. list dpr dpr4 in 1/10

  . list dpr dpr4 in 1/10

     +-------------------------------------+
     |                    dpr         dpr4 |
     |-------------------------------------|
  1. |         2 yrs 5months 26 days    2y 5m 26d |
  2. |         3 yrs 2 months                 3y 2m  |
  3. |           1yr 9 months                  1y 9m  |
  4. |          1 yr 8 months                   1y 8m  |
  5. | 1 yr 11 months 28 days            1y 11m 28d |
     |-------------------------------------|
  6. |           1 yr 12 days                 1y 12d |
  7. |  3 yrs 3 months12 days         3y 3m 12d |
  8. |  3yrs 4 months 26 days          3y 4m 26d |
  9. |     1 yr 9mnths 8 days             1y 9m 8d |
 10. |                                     |
     +-------------------------------------+
I have managed to convert dpr to dpr4 in which I have separated the
years months and days in a very pedestrian way:

gen dpr1= subinstr(dpr," y","y",.)
replace dpr1= subinstr(dpr1,"yrs","y",.)
replace dpr1= subinstr(dpr1,"yr","y",.)
replace dpr1= subinstr(dpr1,"year","y",.)

gen dpr2= subinstr(dpr1,"months","m",.)
replace dpr2= subinstr(dpr2,"month","m",.)
replace dpr2= subinstr(dpr2,"mnths","m",.)
replace dpr2= subinstr(dpr2," m","m",.)

gen dpr3= subinstr(dpr2," days","d",.)
replace dpr3= subinstr(dpr3,"days","d",.)
replace dpr3= subinstr(dpr3," d","d",.)

gen dpr4= subinstr(dpr3,"+","",.)
replace dpr4= subinstr(dpr4,"y","y ",.)
replace dpr4= subinstr(dpr4,"m","m ",.)
replace dpr4= subinstr(dpr4,"  "," ",.)
replace dpr4= subinstr(dpr4,"  "," ",.)

I now need to convert these values in dpr4 into a total number of
days. The problem I have is that in some cases either the year, month
or day is missing. I have to assume that a month is 28 days.
I was thinking of using 'split' and then processing the results based
on the character.

Is this a reasonable approach or is there a more efficient process?

I would be grateful for some advice on this step. I am using Stata 12

Thank you.
Martyn
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index