
From: "Lachenbruch, Peter" <Peter.Lachenbruch@oregonstate.edu>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: Data consistency heuristics
Date: Tue, 7 Oct 2008 14:46:23 -0700

A simple do-file should work:

    list caseno age if (age < 0 | age > 120) & age < .   // may want to print missing ages
    list caseno gender if gender ~= a & gender ~= b      // a and b are the unique values (could be strings so you'd want to fix that up)
    list caseno if age < NNN & mother == 1               // mother is an indicator
    etc.

An interesting question is whether you want to correct these, e.g. convert them to missing or an error code (I first typed coed - but that's NOT what I meant!). In a study earlier this summer I did just this. Initially I printed all the missing-value cases, but the data came from medical records and about half of 2000 cases were missing, so I stopped printing them and gave a count for each variable instead. Some of the variables had many possible legal values (e.g., which of 30 drugs were being taken), so the checking became very complicated, especially when the dosage and schedule were being checked.

Svend Juul has a nice chapter on this in his book.

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sergiy Radyakin
Sent: Tuesday, October 07, 2008 2:02 PM
To: statalist@hsphsun2.harvard.edu
Subject: st: Data consistency heuristics

Hello All,

this is a more or less general question, not related to Stata itself, but to data processing. I wonder if anyone could point me to a good source of heuristics/rules on checking the data for consistency/plausibility. I am looking for something like:

* age of a person must be within the range 0-120
* gender must have no more than 2 unique values
* person younger than NNN years may not be a mother
* if a person is reporting not working, wage must be missing/zero
* if a person is attending primary school, occupation may not be "manager"
* if a person is attending university, [s]he may not report being illiterate
etc.

Note that these are more or less flexible rules and there might be exceptions. But if a rule is valid for 99% of cases, it's what I am looking for. The context topics include economics (employment/earnings/wages/sector/hours of work etc.), education (years of education/enrollment/completion), family structure and composition, and other related topics commonly found in family, household or labor force surveys.

I believe a significant amount of such checking is done by data collectors before releasing the data to the public, and I wouldn't want to reinvent the wheel here.

Thank you,
Sergiy Radyakin

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
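
As an illustration only, the kinds of checks discussed in this thread could be collected in one do-file along the following lines. This is a minimal sketch, not code from either poster: the toy data, the variable names (caseno, age, gender, mother, working, wage), the coding of gender as 1/2, and the cutoff of 15 standing in for NNN are all assumptions made for the example. It flags violations in 0/1 variables rather than changing the data, and gives a count per rule before listing the flagged cases.

```stata
* Hypothetical toy data; all variable names and codings are illustrative
clear
input caseno age gender mother working wage
1  34 1 1 1 2500
2 130 2 0 1 1800
3  12 2 1 0    .
4  45 3 0 0  900
5   . 1 0 1 2100
end

* NNN is left open in the thread; 15 is a placeholder assumption
local min_mother_age 15

* Flag each rule violation in a 0/1 variable instead of editing the data.
* Missing values are excluded here so they can be counted separately.
generate byte bad_age    = (age < 0 | age > 120) & !missing(age)
generate byte bad_gender = !inlist(gender, 1, 2) & !missing(gender)
generate byte bad_mother = age < `min_mother_age' & mother == 1
generate byte bad_wage   = working == 0 & wage > 0 & !missing(wage)

* Give a count for each rule
foreach v of varlist bad_* {
    quietly count if `v'
    display "`v': " r(N) " case(s)"
}

* List only the cases that break at least one rule
egen byte n_flags = rowtotal(bad_*)
list caseno age gender mother working wage if n_flags > 0, noobs
```

With the toy data above, each rule flags one case, and cases 2, 3 and 4 are listed (case 4 breaks both the gender and the wage rule). Case 5, whose age is missing, is not flagged; a separate count of missing values per variable would cover it. Keeping the flags as variables leaves open the question of whether to correct the values, set them to missing, or assign an error code.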

Follow-Ups:
* Re: st: RE: Data consistency heuristics. From: "Sergiy Radyakin" <serjradyakin@gmail.com>
* Re: st: RE: Data consistency heuristics. From: Maarten buis <maartenbuis@yahoo.co.uk>
* Re: st: RE: Data consistency heuristics. From: Maarten buis <maartenbuis@yahoo.co.uk>

References:
* st: Data consistency heuristics. From: "Sergiy Radyakin" <serjradyakin@gmail.com>

