
Re: st: RE: Data consistency heuristics

From   Steven Samuels <[email protected]>
To   [email protected]
Subject   Re: st: RE: Data consistency heuristics
Date   Tue, 7 Oct 2008 18:22:46 -0400

Also take a look at -ckvar- by Bill Rising of StataCorp, available from SSC. It is an amazing package for data validation.


On Oct 7, 2008, at 6:05 PM, Maarten Buis wrote:

An alternative or complementary approach would be to use -assert-, as
advocated by Gould (2003):

William Gould (2003). Stata tip 3: How to be assertive. Stata Journal 3(4): 448.
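
For readers who have not used -assert-, here is a minimal sketch of the idea. The toy data and the variable names (caseno, age, gender) are only assumptions borrowed from the examples in this thread; they are not taken from the Stata tip itself.

```stata
* Hypothetical toy data; variable names follow the examples in this thread
clear
input caseno age gender
1 34 1
2 .  2
3 67 2
end

* -assert- is silent when the condition holds for every observation
* and stops the do-file with an error as soon as it does not.
* Missing ages pass this check because of the missing() clause.
assert inrange(age, 0, 120) | missing(age)
assert inlist(gender, 1, 2)

* With -capture-, a failed check sets _rc to 9 instead of stopping,
* so the offending cases can be listed for inspection.
capture assert age < .
if _rc {
    list caseno age if missing(age)
}
```

Putting such assertions at the top of a do-file means later steps never run on data that break the stated assumptions.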

--- "Lachenbruch, Peter" <[email protected]> wrote:

A simple do file should work.

* may want to print missing ages as well
list caseno age if (age<0 | age>120) & age<.
* a and b are the unique values (could be strings so you'd want to fix that up)
list caseno gender if gender~=a & gender~=b
* mother is an indicator
list caseno if age<NNN & mother==1

An interesting question is whether you want to correct these - e.g.
convert them to missing or an error code  (I first typed coed - but
that's NOT what I meant!)
In a study earlier this summer I did just this.  Initially I printed
the cases with missing values, but the data came from medical records
and about half of the 2000 cases had missing values, so I stopped
printing them and gave a count for each variable instead.
Some of the variables had many possible legal values (e.g., which
drugs were being taken), so the checking became very complicated -
especially when the dosage and schedule were being checked.
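
A rough sketch of both ideas, converting bad values to an error code and counting missing values instead of printing them, might look like the following. The data, the variable dose, and the choice of the extended missing value .a as the error code are assumptions made for illustration only.

```stata
* Hypothetical toy data: -99 stands in for an invalid recorded value
clear
input caseno age dose
1 34  10
2 -99 .
3 130 20
4 .   .
end

* Recode out-of-range ages to the extended missing value .a,
* which keeps them distinct from values that were never recorded (.)
replace age = .a if (age < 0 | age > 120) & !missing(age)

* Report a count of missing values per variable rather than listing cases
foreach v of varlist age dose {
    quietly count if missing(`v')
    display "`v': " r(N) " of " _N " missing"
}
```

Using .a rather than plain missing makes it possible to tell later which values were set aside by the check and which were never recorded.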

Svend Juul has a nice chapter on this in his book.


Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Sergiy
Sent: Tuesday, October 07, 2008 2:02 PM
To: [email protected]
Subject: st: Data consistency heuristics

Hello All,

This is a more or less general question, not related to Stata itself,
but to data processing. I wonder if anyone could point me to a good
source of heuristics/rules on checking the data for
consistency/plausibility. I am looking for something like:

* age of a person must be within the range 0-120
* gender must have no more than 2 unique values
* person younger than NNN years may not be a mother
* if a person is reporting not working, wage must be missing/zero
* if a person is attending primary school, occupation may not be
* if a person is attending university, [s]he may not report being

Note that these are more or less flexible rules and there might be
exceptions. But if it is valid for 99% of cases - it's what I am
looking for.
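
Because such rules allow exceptions, one option is to flag violations and look at their share rather than to enforce each rule strictly. Below is a minimal sketch for the wage rule only; the toy data and the variable names (working, wage, bad_wage) are hypothetical.

```stata
* Hypothetical toy data; working is a 0/1 indicator, wage is earnings
clear
input caseno age working wage
1 34 1 1200
2 15 0 .
3 52 0 0
4 41 0 300
5 29 1 950
end

* Flag violations of a soft rule instead of asserting it:
* a person who reports not working should have missing or zero wage
generate byte bad_wage = working == 0 & wage > 0 & !missing(wage)

* Share of observations violating the rule, and the cases to review
quietly summarize bad_wage
display "rule violated in " %4.1f 100*r(mean) "% of cases"
list caseno working wage if bad_wage
```

A small share points to exceptions worth reviewing case by case, while a large share suggests the rule itself, or the coding of the variables, needs another look.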

The topics of interest include economics
(employment/earnings/wages/sector/hours of work etc.), education (years
of education/enrollment/completion), family structure and composition,
and other related topics commonly found in family, household or labor
force surveys.

I believe a significant number of such checks are done by data
collectors before releasing the data to the public, and I wouldn't want
to reinvent the wheel here.

Thank you, Sergiy Radyakin
*   For searches and help try:

Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

visiting address:
Buitenveldertselaan 3 (Metropolitan), room N515

+31 20 5986715
