Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.

st: Re: Cleaning messy data

 From Martyn Sherriff To statalist@hsphsun2.harvard.edu Subject st: Re: Cleaning messy data Date Mon, 28 Nov 2011 16:53:20 +0000

```I have a large dataset in which the dates have been entered as
freetext with random spaces, spellings and characters:
. list dpr dpr4 in 1/10

. list dpr dpr4 in 1/10

+-------------------------------------+
|                    dpr         dpr4 |
|-------------------------------------|
1. |         2 yrs 5months 26 days    2y 5m 26d |
2. |         3 yrs 2 months                 3y 2m  |
3. |           1yr 9 months                  1y 9m  |
4. |          1 yr 8 months                   1y 8m  |
5. | 1 yr 11 months 28 days            1y 11m 28d |
|-------------------------------------|
6. |           1 yr 12 days                 1y 12d |
7. |  3 yrs 3 months12 days         3y 3m 12d |
8. |  3yrs 4 months 26 days          3y 4m 26d |
9. |     1 yr 9mnths 8 days             1y 9m 8d |
10. |                                     |
+-------------------------------------+
I have managed to convert dpr to dpr4 in which I have separated the
years months and days in a very pedestrian way:

gen dpr1= subinstr(dpr," y","y",.)
replace dpr1= subinstr(dpr1,"yrs","y",.)
replace dpr1= subinstr(dpr1,"yr","y",.)
replace dpr1= subinstr(dpr1,"year","y",.)

gen dpr2= subinstr(dpr1,"months","m",.)
replace dpr2= subinstr(dpr2,"month","m",.)
replace dpr2= subinstr(dpr2,"mnths","m",.)
replace dpr2= subinstr(dpr2," m","m",.)

gen dpr3= subinstr(dpr2," days","d",.)
replace dpr3= subinstr(dpr3,"days","d",.)
replace dpr3= subinstr(dpr3," d","d",.)

gen dpr4= subinstr(dpr3,"+","",.)
replace dpr4= subinstr(dpr4,"y","y ",.)
replace dpr4= subinstr(dpr4,"m","m ",.)
replace dpr4= subinstr(dpr4,"  "," ",.)
replace dpr4= subinstr(dpr4,"  "," ",.)

I now need to convert these values in dpr4 into a total number of
days. The problem I have is that in some cases either the year, month
or day is missing. I have to assume that a month is 28 days.
I was thinking of using 'split' and then processing the results based
on the character.

Is this a reasonable approach or is there a more efficient process?

I would be grateful for some advice on this step. I am using Stata 12

Thank you.
Martyn
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```