Statalist



Re: st: RE: Data consistency heuristics


From   "Sergiy Radyakin" <serjradyakin@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: Data consistency heuristics
Date   Thu, 9 Oct 2008 23:00:40 -0400

Thank you very much to everyone who responded. From what I understand,
there is no such cookbook as the one I am looking for, and one has to
create the checks from scratch each time. Best regards, Sergiy
Radyakin
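For anyone starting such checks from scratch, a minimal Stata sketch of the kinds of ad hoc tests discussed in this thread might look like the following. All variable names here (country, year, inflation, latitude, longitude, glacier_id) are hypothetical placeholders, and the cutoffs (3 standard deviations, 1.5 times the IQR) are rules of thumb, not recommendations.

```stata
* Hypothetical panel data: numeric id "country", time variable "year",
* and variables inflation, latitude, longitude, glacier_id (all assumed).
tsset country year

* 1) Time series: flag unusually large changes between periods
generate double change = D.inflation
quietly summarize change
local sd = r(sd)
list country year inflation change ///
    if abs(change) > 3*`sd' & !missing(change)

* 2) Interval-level data: flag outliers by a 1.5*IQR rule of thumb
quietly summarize inflation, detail
local lo = r(p25) - 1.5*(r(p75) - r(p25))
local hi = r(p75) + 1.5*(r(p75) - r(p25))
list country year inflation ///
    if !inrange(inflation, `lo', `hi') & !missing(inflation)

* 3) Range checks: coordinates must be physically possible
assert inrange(latitude, -90, 90) if !missing(latitude)
assert inrange(longitude, -180, 180) if !missing(longitude)

* 4) Duplicates: the same record reported twice
duplicates report glacier_id
duplicates list glacier_id
```

As the thread notes, screens like these will still produce false positives (a genuine hyperinflation looks like a data error) and false negatives, so they are a starting point for scrutiny rather than a substitute for it.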

On Wed, Oct 8, 2008 at 12:14 PM, David Radwin <radwin@berkeley.edu> wrote:
> I agree with Nick Cox that there is probably no system for creating such
> rules that is independent of the domain/topic, though I too would be happy
> to find out otherwise.
>
> Two minor ideas, which may or may not help, are 1) for time-series data,
> looking for large changes between time periods and 2) for interval-level
> data, looking for outliers. These will still net some false positives (e.g.
> Zimbabwe's current annual inflation rate is reportedly 24,000,000%) and
> false negatives, but that is always the case.
>
> With respect to his anecdote about his colleague discovering numerous errors
> in someone else's dataset, I fear this is probably far more common (at least
> in social science) than is generally recognized. The article I recently
> cited in http://www.stata.com/statalist/archive/2008-09/msg01363.html
> describes a study where the researchers replicated a year's worth of
> articles from an economics journal and found them rife with such errors.
>
> The recommended solution, if I may repeat myself, is to carefully and
> deliberately document one's research (do files and otherwise) so that the
> procedures and conclusions could be easily replicated by someone else. This
> approach not only allows errors to be more easily detected but also tends to
> prevent them in the first place.
>
> David
>
> At 3:32 PM +0100 10/8/08, Nick Cox wrote:
>>
>> I too have wanted to find a theory of data cleaning, but in practice
>> it's mightily elusive. I think this is the most bottom-up part of
>> statistical science in which at best you have rules that work most of
>> the time for your kind of data.
>>
>> A colleague worked with records on glaciers which supposedly had been
>> reviewed very carefully. He found many things that the quality control
>> had missed, including glaciers that were just in the wrong places, as
>> shown by a scatter of latitude and longitude; glaciers reported twice,
>> by different countries; and many Russian glaciers reported to face East
>> when they faced West and vice versa. (Apparently, Sergiy, that was a
>> transliteration/translation problem.) He found these things by slow
>> scrutiny and started building up ad hoc a list of things that could be
>> wrong.
>
>
> --
> David Radwin // radwin@berkeley.edu
> Office of Student Research, University of California, Berkeley
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index