Statalist



Re: st: RE: Data consistency heuristics


From   David Radwin <radwin@berkeley.edu>
To   <statalist@hsphsun2.harvard.edu>
Subject   Re: st: RE: Data consistency heuristics
Date   Wed, 8 Oct 2008 09:14:31 -0700

I agree with Nick Cox that there is probably no system for creating such rules that is independent of the domain/topic, though I too would be happy to find out otherwise.

Two minor ideas that may or may not help: 1) for time series data, look for large changes between consecutive time periods, and 2) for interval-level data, look for outliers. Both will still produce some false positives (e.g. Zimbabwe's current annual inflation rate is reportedly 24,000,000%) and false negatives, but that is true of any such check.
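
As a rough illustration of those two checks, here is a minimal Python sketch. The function names, the factor-of-2 change threshold, and the 1.5 x IQR fence are all assumptions chosen for the example, not rules from this thread; sensible values depend entirely on the kind of data.

```python
from statistics import quantiles


def flag_large_changes(series, max_ratio=2.0):
    """Return indices t where series[t] differs from series[t-1]
    by more than a factor of max_ratio (an assumed threshold)."""
    flagged = []
    for t in range(1, len(series)):
        prev, curr = series[t - 1], series[t]
        if prev == 0 or curr == 0:
            continue  # ratio is undefined or zero; needs a separate check
        ratio = abs(curr / prev)
        if ratio > max_ratio or ratio < 1 / max_ratio:
            flagged.append(t)
    return flagged


def flag_outliers(values, k=1.5):
    """Return indices of values outside the fences
    [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical data: 1030 looks like 103 with an extra digit typed.
yearly = [100, 104, 103, 1030, 107]
print(flag_large_changes(yearly))

# Hypothetical data: one value far from the rest.
scores = [10, 12, 11, 13, 12, 11, 95]
print(flag_outliers(scores))
```

Flagged rows are candidates for inspection only, not confirmed errors: a genuine hyperinflation series would trip the first check on nearly every period.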

With respect to his anecdote about his colleague discovering numerous errors in someone else's dataset, I fear this is probably far more common (at least in social science) than is generally recognized. The article I recently cited in http://www.stata.com/statalist/archive/2008-09/msg01363.html describes a study where the researchers replicated a year's worth of articles from an economics journal and found them rife with such errors.

The recommended solution, if I may repeat myself, is to document one's research carefully and deliberately (do-files and otherwise) so that someone else can easily replicate the procedures and conclusions. This approach not only makes errors easier to detect but also tends to prevent them in the first place.

David

At 3:32 PM +0100 10/8/08, Nick Cox wrote:
> I too have wanted to find a theory of data cleaning, but in practice
> it's mightily elusive. I think this is the most bottom-up part of
> statistical science in which at best you have rules that work most of
> the time for your kind of data.
>
> A colleague worked with records on glaciers which supposedly had been
> reviewed very carefully. He found many things that the quality control
> had missed, including glaciers that were just in the wrong places, as
> shown by a scatter of latitude and longitude; glaciers reported twice,
> by different countries; and many Russian glaciers reported to face East
> when they faced West and vice versa. (Apparently, Sergiy, that was a
> transliteration/translation problem.) He found these things by slow
> scrutiny and started building up ad hoc a list of things that could be
> wrong.


--
David Radwin // radwin@berkeley.edu
Office of Student Research, University of California, Berkeley


