



Re: st: Data cleaning challenge


From   "N. S. Miceli, Ph.D." <[email protected]>
To   [email protected], [email protected], [email protected]
Subject   Re: st: Data cleaning challenge
Date   Wed, 07 Aug 2013 00:48:42 -0500

Hi,

I hope that these comments are somewhat helpful to you in working out your algorithm.

Look at the last transaction set and the second transaction set:

--TX---ITEM----ITEM$---TOTAL$--TOTAL$GS--DIF---MARKER-
  29   50044     1.5     2.5       2     .5    0
  29   50003      .5     2.5       2     .5    1
  29   50003      .5     2.5       2     .5    1
------------------------------------------------
  32   15001    1.25     3.2     2.7     .5    0
  32   50005      .5     3.2     2.7     .5    1
  32   50005      .5     3.2     2.7     .5    1
  32   50035     .95     3.2     2.7     .5    0 <==
-------------------------------------------------
  207   56879   1.18    5.11    3.43   1.68    0
  207   56879   1.18    5.11    3.43   1.68    0
  207   50005     .5    5.11    3.43   1.68    0 <==
  207   50035     .5    5.11    3.43   1.68    0 <==
  207   50168   1.75    5.11    3.43   1.68    0

The two flagged line items in transaction 207 (#50005 and #50035) are not, strictly speaking, duplicates: their item numbers differ. Item #56879 is clearly a duplicate, since the item number and extended amount repeat within the transaction. So the question is whether you meant that #50005 or #50035 in transaction 207 is a potential duplicate in the sense of an error carried over from an earlier sales transaction of the same item.

In this case, the only way to tell for certain whether a line is a duplicate entered in error would be to physically inventory the items and reconcile the on-hand balances against the inventory records.

My suggestion would be that your logic has to pick up on _all_ items sold in a quantity greater than one (that is, the same item appearing on more than one line) in transactions where the detail total in one file does not agree with the GS amount; you're trying to reconcile detail to totals. In the data above you can rule out #50035 as the duplicate because the amount at which it sold in transaction 32 ($0.95) does not match the amount in transaction 207 ($0.50). Therefore, the duplicate in transaction 207 is item #50005, not item #50035.
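
If it helps, a rough (untested) sketch of that check in Stata could look like the lines below. I am assuming the columns have been read into dollar-free Stata variables tx, item, itemamt, total, and totalgs, since the $ in the listing is not legal in a Stata variable name.

* flag items that appear on more than one line within a transaction
sort tx item
by tx item: gen byte repeated = (_N > 1)

* flag transactions whose detail total does not agree with the gold
* standard (a small tolerance guards against float rounding in prices)
gen byte unreconciled = abs(total - totalgs) > 0.001

* candidates for review: repeated items inside unreconciled transactions
list tx item itemamt if repeated & unreconciled, sepby(tx)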

At least, that is what the logic looks like to me. Please excuse me if I've overlooked anything (and any lack of clarity), but it seems you have problems between transactions as well as within transactions.

Regards,

Nick Miceli







Date: Mon, 5 Aug 2013 15:10:38 -0400
From: Michael Barker<[email protected]>
Subject: Re: st: Data cleaning challenge

Hi Doug,

You could start by dropping items in transactions where all duplicates
are invalid. This would cover all of the examples that you gave, but
it would not cover transactions with some valid and some invalid
duplicates.

Assuming you have a single data set like the examples, the code would
look something like this:

* number each occurrence of an item within a transaction; 2 and greater
* are duplicates (variable names follow the example columns; in practice
* drop the $, which is not legal in a Stata variable name)
sort TX ITEM
by TX ITEM: gen dupitem = _n

* generate total dollar amount of excess duplicates within each transaction
bys TX: egen duptotal = total(ITEM$) if dupitem>=2

* check if removing the duplicates would make TOTAL$ equal to TOTAL$GS
* (with non-integer prices, an exact == comparison may need a small tolerance)
gen invalid_dup = 0
replace invalid_dup = 1 if dupitem>=2 & TOTAL$-duptotal==TOTAL$GS

* drop all invalid duplicates
drop if invalid_dup==1

That would get the easy ones, and it would not drop any valid
duplicates. It would leave the more difficult task of finding the
invalid duplicates in transactions that contain both valid and
invalid duplicates. But I would do the easy part first, then see
what problems remain.
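
A rough, untested sketch of one way to go after the harder cases
(like transaction #207) would be to list pairs of lines whose prices
sum to the register-versus-gold-standard gap. The dollar-free
variable names (tx, item, itemamt, total, totalgs) are assumed, and
any pair of prices that happens to sum to the gap will be listed, so
the output still needs judgment:

* build every pair of lines within a transaction and keep the pairs
* whose prices sum to the difference between the register total and
* the gold standard
preserve
keep tx item itemamt total totalgs
gen long line = _n
gen dif = total - totalgs
tempfile lines
save `lines'
rename (item itemamt line) (item2 itemamt2 line2)
joinby tx using `lines'
* keep each unordered pair of distinct lines once
keep if line < line2
keep if abs(itemamt + itemamt2 - dif) < 0.001
list tx item itemamt item2 itemamt2 dif, sepby(tx)
restore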

Mike



On Mon, Aug 5, 2013 at 9:47 AM, Douglas Levy
<[email protected]>  wrote:

Hi All,
I have a data cleaning challenge where data files have some duplicate
lines and I'm trying to automate the elimination of those duplicates.
Consider the following. File A has item-level cash register data and
some of the items are duplicated in error, though some items are
duplicated legitimately (e.g. someone bought more than one of an
item). Therefore, I can't just eliminate all duplicates. File B has
total transaction amounts (a gold standard), but does not have
individual item-level data. See below for example data from a merged
file.

For transaction #29, I can see that File A's transaction total ($2.50)
is $0.50 over File B's total ($2.00). It is clear that item #50003 was
entered twice in error. I tagged all items whose price was equal to
the difference between cash register total and the gold standard total
using the variable "marker." Now I can eliminate the first item in the
transaction that is tagged with "marker." I can do similarly with
transaction #32. However, consider transaction #207. There, it is
clear that there are *two* duplicates in the transaction (an extra
item #56879 at $1.18 and an extra item #50005 at $0.50, totaling a
difference of $1.68). Here the "marker" strategy does not work.
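
A minimal sketch of that tagging step, assuming (as in the replies
above) that the columns sit in dollar-free Stata variables tx, item,
itemamt, total, and totalgs:

* tag lines whose price equals the register-vs-gold-standard gap,
* then mark only the first tagged line in each transaction
gen dif = total - totalgs
gen byte marker = dif > 0.001 & abs(itemamt - dif) < 0.001
bysort tx (item): gen byte dropme = marker & sum(marker) == 1
drop if dropme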

Does anyone have suggestions on how to create an algorithm in Stata
that would tag excess items for deletion in this more complicated example?
Many thanks!
-Doug

--TX---ITEM----ITEM$---TOTAL$--TOTAL$GS--DIF---MARKER-
  29   50044     1.5     2.5       2     .5    0
  29   50003      .5     2.5       2     .5    1
  29   50003      .5     2.5       2     .5    1
------------------------------------------------
  32   15001    1.25     3.2     2.7     .5    0
  32   50005      .5     3.2     2.7     .5    1
  32   50005      .5     3.2     2.7     .5    1
  32   50035     .95     3.2     2.7     .5    0
-------------------------------------------------
  207   56879   1.18    5.11    3.43   1.68    0
  207   56879   1.18    5.11    3.43   1.68    0
  207   50005     .5    5.11    3.43   1.68    0
  207   50035     .5    5.11    3.43   1.68    0
  207   50168   1.75    5.11    3.43   1.68    0
Tx = transaction number - a unique transaction identifier (File A)

Item = item number - a unique identifier of specific types of items
sold (File A)
Item$ = the item's price (File A)
Total$ = the transaction total based on cash register data (File A)
Total$GS = an external gold standard for the transaction total (File B)
Dif = the difference between the cash register total and the gold standard total
marker = a marker for items that could explain the difference

--
================================================
Nicholas S. Miceli, Ph.D.
Associate Professor of Human Resource Management
================================================
Park University
School of Business
8700 N. W. River Park Drive
Parkville, MO 64152
================================================
[email protected]
================================================


