Statalist



Re: st: RE: Data consistency heuristics


From   "Sergiy Radyakin" <serjradyakin@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: Data consistency heuristics
Date   Thu, 9 Oct 2008 23:00:40 -0400

Thank you very much to everyone who responded. From what I understand,
there is no such cookbook as the one I am looking for, and one has to
create the checks from scratch each time. Best regards, Sergiy
Radyakin
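For anyone starting such checks from scratch, a minimal Stata sketch of the kinds of ad hoc tests discussed in this thread might look like the following. All variable names here (country, year, inflation, latitude, longitude, glacier_id) are hypothetical placeholders, and the cutoffs (3 standard deviations, 1.5 times the IQR) are rules of thumb, not recommendations.

```stata
* Hypothetical panel data: numeric id "country", time variable "year",
* and variables inflation, latitude, longitude, glacier_id (all assumed).
tsset country year

* 1) Time series: flag unusually large changes between periods
generate double change = D.inflation
quietly summarize change
local sd = r(sd)
list country year inflation change ///
    if abs(change) > 3*`sd' & !missing(change)

* 2) Interval-level data: flag outliers by a 1.5*IQR rule of thumb
quietly summarize inflation, detail
local lo = r(p25) - 1.5*(r(p75) - r(p25))
local hi = r(p75) + 1.5*(r(p75) - r(p25))
list country year inflation ///
    if !inrange(inflation, `lo', `hi') & !missing(inflation)

* 3) Range checks: coordinates must be physically possible
assert inrange(latitude, -90, 90) if !missing(latitude)
assert inrange(longitude, -180, 180) if !missing(longitude)

* 4) Duplicates: the same record reported twice
duplicates report glacier_id
duplicates list glacier_id
```

As the thread notes, screens like these will still produce false positives (a genuine hyperinflation looks like a data error) and false negatives, so they are a starting point for scrutiny rather than a substitute for it.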

On Wed, Oct 8, 2008 at 12:14 PM, David Radwin <radwin@berkeley.edu> wrote:
> I agree with Nick Cox that there is probably no system for creating such
> rules that is independent of the domain/topic, though I too would be happy
> to find out otherwise.
>
> Two minor ideas, which may or may not help, are 1) for time-series data,
> looking for large changes between time periods and 2) for interval-level
> data, looking for outliers. These will still net some false positives (e.g.
> Zimbabwe's current annual inflation rate is reportedly 24,000,000%) and
> false negatives, but that is always the case.
>
> With respect to his anecdote about his colleague discovering numerous errors
> in someone else's dataset, I fear this is probably far more common (at least
> in social science) than is generally recognized. The article I recently
> cited in http://www.stata.com/statalist/archive/2008-09/msg01363.html
> describes a study where the researchers replicated a year's worth of
> articles from an economics journal and found them rife with such errors.
>
> The recommended solution, if I may repeat myself, is to carefully and
> deliberately document one's research (do files and otherwise) so that the
> procedures and conclusions could be easily replicated by someone else. This
> approach not only allows errors to be more easily detected but also tends to
> prevent them in the first place.
>
> David
>
> At 3:32 PM +0100 10/8/08, Nick Cox wrote:
>>
>> I too have wanted to find a theory of data cleaning, but in practice
>> it's mightily elusive. I think this is the most bottom-up part of
>> statistical science in which at best you have rules that work most of
>> the time for your kind of data.
>>
>> A colleague worked with records on glaciers which supposedly had been
>> reviewed very carefully. He found many things that the quality control
>> had missed, including glaciers that were just in the wrong places, as
>> shown by a scatter of latitude and longitude; glaciers reported twice,
>> by different countries; and many Russian glaciers reported to face East
>> when they faced West and vice versa. (Apparently, Sergiy, that was a
>> transliteration/translation problem.) He found these things by slow
>> scrutiny and started building up ad hoc a list of things that could be
>> wrong.
>
>
> --
> David Radwin // radwin@berkeley.edu
> Office of Student Research, University of California, Berkeley
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index