Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: complex data cleaning issue (well, complex for me)


From   Stephen Cox <[email protected]>
To   "'[email protected]'" <[email protected]>
Subject   st: complex data cleaning issue (well, complex for me)
Date   Tue, 29 Apr 2008 17:45:33 +1000

Folks - I have a data set I have to clean up (it was supposed to completely cleaned prior to coming to me, but that's just an opportunity for me to whinge). Data management in these ways is not my strength. Any help would be greatly appreciated. I have tried to communicate the issues as clearly as possible, but let me know if I need to add more detail.

The data reflects employee absenteeism.

Data structure.

The data are in long form.

For each employee, I have employee number (e.g., 109123) the start date (e.g., 07Aug07), start day (e.g., Monday) , end date (e.g., 09Aug07), end day (e.g., Wednesday), number of hours (e.g., 21), and number of days (e.g., 3). There is other info as well. Each case of absenteeism is on one line, with some employees being represented on one line only (only one recorded absenteeism entry), and other employees with multiple records of absenteeism, for example being absent on several different days across the year.

Unfortunately, for many records, there are then 'corrections'. (A correction could occur for several reasons,  such as that entry no longer being regarded as absenteeism but sick leave). A correction shows up as a re-entry of that data with negative values for hours and days. Below is an example of the original entry, plus a correction.

EXAMPLE A.

employee#    startdate    startday    enddate    endday    hours    days
109123          07Aug07    Monday    09Aug07    Wednesday    21    3
109123          07Aug07    Monday    09Aug07    Wednesday    -21    -3

What I want to do is find those matching pairs of records and remove them from my file.

But it gets a little more difficult also. Inspection indicates that quite a few of the corrections are are of the form above. In some cases however, there is a correction, and then a re-entry again! So it might look like this, below:

EXAMPLE B.

employee#    startdate    startday    enddate    endday    hours    days
109123           07Aug07    Monday    09Aug07    Wednesday    21    3
109123           07Aug07    Monday    09Aug07    Wednesday    -21    -3
109123           07Aug07    Monday    09Aug07    Wednesday    21    3

For this reason, just finding the duplicate dates (using -duplicate-) and deleting both of them wont necessary work: sometimes I want to remove all the records (as in EXAMPLE A), sometimes I do not (EXAMPLE B, where I want to keep one record).

Problem extension 2.

Sometimes the original entry will be corrected over several records. For example, EXAMPLE A might be corrected as follows:

employee#         startdate    startday    enddate    endday    hours    days
109123            07Aug07    Monday    09Aug07    Wednesday  21           3
109123            07Aug07    Monday    07Aug07    Monday     -7       -1
109123            08Aug07    Monday    08Aug07    Monday     -7       -1
109123            09Aug07    Monday    09Aug07    Monday     -7       -1

I think I might be able to deal with these by hand if necessary. I can't deal with all of it by hand (12000 records).

I will then need to collapse hours and days into the months in which they occurred (I think -reshape- and then -generate- new variables is the way to go there).... but removing the invalid data will be a start.


Cheers,

Stephen

______________________________________

Stephen Cox |
Faculty of Business | Queensland University of Technology | <http://www.bus.qut.com/> www.bus.qut.edu.au <http://www.bus.qut.edu.au/>
phone: +61 7 3138 1776 | Office: Z758 | email: [email protected] | CRICOS No. 00213J



*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index