
st: RE: Data consistency heuristics

From   "Nick Cox" <>
To   <>
Subject   st: RE: Data consistency heuristics
Date   Wed, 8 Oct 2008 15:32:54 +0100

I too have wanted to find a theory of data cleaning, but in practice
it's mightily elusive. I think this is the most bottom-up part of
statistical science in which at best you have rules that work most of
the time for your kind of data. 

A colleague worked with records on glaciers which supposedly had been
reviewed very carefully. He found many things that the quality control
had missed, including glaciers that were just in the wrong places, as
shown by a scatter of latitude and longitude; glaciers reported twice,
by different countries; and many Russian glaciers reported to face East
when they faced West and vice versa. (Apparently, Sergiy, that was a
transliteration/translation problem.) He found these things by slow
scrutiny and started building up ad hoc a list of things that could be

As for gender having two known categories and one unknown category,
there are plenty of datasets in which that classification misses really
important distinctions. 


Sergiy Radyakin

This is a more or less general question, not related to Stata itself,
but to data processing. I wonder if anyone could point me to a good
source of heuristics/rules on checking the data for
consistency/plausibility. I am looking for something like:

* age of a person must be within the range 0-120
* gender must have no more than 2 unique values
* person younger than NNN years may not be a mother
* if a person is reporting not working, wage must be missing/zero
* if a person is attending primary school, occupation may not be
* if a person is attending university, [s]he may not report being

Note that these are more or less flexible rules and there might be
exceptions. But if a rule is valid for 99% of cases, it's what I am
looking for.
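
The complete rules in the list above could be sketched roughly as follows. This is only a minimal illustration in Python, not a vetted rule set: the field names (age, gender, working, wage) and the helpers check_record and check_dataset are hypothetical, and the rules whose details are not given above (the minimum age of a mother, the two truncated items) are left out. Because exceptions are expected, violations are flagged for review rather than treated as errors.

```python
# Sketch only: the field names and helper functions are assumptions
# made for illustration, not taken from any particular survey.

def check_record(rec):
    """Return the names of the row-level rules that this record violates."""
    flags = []
    age = rec.get("age")
    # Rule: age of a person must be within the range 0-120.
    if age is not None and not (0 <= age <= 120):
        flags.append("age_out_of_range")
    # Rule: if a person reports not working, wage must be missing or zero.
    wage = rec.get("wage")
    if rec.get("working") is False and wage not in (None, 0):
        flags.append("wage_without_work")
    return flags


def check_dataset(records):
    """Return (row_flags, dataset_flags); flagged rows are for review, not deletion."""
    row_flags = {}
    for i, rec in enumerate(records):
        flags = check_record(rec)
        if flags:
            row_flags[i] = flags
    dataset_flags = []
    # Rule: gender must have no more than 2 unique non-missing values.
    genders = {rec["gender"] for rec in records if rec.get("gender") is not None}
    if len(genders) > 2:
        dataset_flags.append("gender_more_than_two_values")
    return row_flags, dataset_flags


# Made-up records purely to exercise the checks.
sample = [
    {"age": 34, "gender": "F", "working": True, "wage": 2100},
    {"age": 140, "gender": "M", "working": False, "wage": 0},
    {"age": 25, "gender": "m", "working": False, "wage": 900},
]
row_flags, dataset_flags = check_dataset(sample)
print(row_flags)
print(dataset_flags)
```

In the made-up sample, the second record is flagged for its age, the third for a wage reported without work, and the dataset as a whole for having three distinct gender codes ("M" and "m" differ only in case), which is the kind of coding slip such a check tends to surface.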

The context topics include economics
(employment/earnings/wages/sector/hours of work etc.), education (years
of educ/enrollment/completion), family structure and composition, and
other related topics commonly found in family, household or labor
force surveys.

I believe a significant number of such checks are done by data
collectors before releasing the data to the public, and I wouldn't want
to reinvent the wheel here.
