# st: re: complex data cleaning issue (well, complex for me)

 From Kit Baum To statalist@hsphsun2.harvard.edu Subject st: re: complex data cleaning issue (well, complex for me) Date Tue, 29 Apr 2008 07:56:57 -0400

Stephen said

For each employee, I have employee number (e.g., 109123) the start date (e.g., 07Aug07), start day (e.g., Monday) , end date (e.g., 09Aug07), end day (e.g., Wednesday), number of hours (e.g., 21), and number of days (e.g., 3). There is other info as well. Each case of absenteeism is on one line, with some employees being represented on one line only (only one recorded absenteeism entry), and other employees with multiple records of absenteeism, for example being absent on several different days across the year.

Unfortunately, for many records, there are then 'corrections'. (A correction could occur for several reasons, such as that entry no longer being regarded as absenteeism but sick leave). A correction shows up as a re-entry of that data with negative values for hours and days. Below is an example of the original entry, plus a correction.

(snip)

I don't see anything complex here. If you -collapse- on employee#, start date and end date, computing sums of the hours and days field, everything should work properly.

In Example A, you will get hours = days = 0, and you can screen out records that have hours = days = 0

In Example B, the first two records will negate each other, and you will be left with the third.

In Example C, you will again be left with hours = days = 0.

I don't think you need any -duplicates- logic to handle this.

You also don't want to do any reshaping. You can usie mofd() to generate a month variable, and then just -collapse- on that variable, generating sums, to make what you have into a monthly time series per employee. Just two applications of -collapse-. In fact you could probably get by with one.

One possible wrinkle: the start and end dates might span a month-end, so you might need to be somewhat canny about that in terms of classifying each record as belonging to a particular month.

Kit

Kit Baum, Boston College Economics and DIW Berlin
http://ideas.repec.org/e/pba1.html
An Introduction to Modern Econometrics Using Stata:
http://www.stata-press.com/books/imeus.html

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/