Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down on April 23, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: RE: Removing (partial) duplicate obs with lots of vars


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Removing (partial) duplicate obs with lots of vars
Date   Thu, 21 Oct 2010 11:43:56 +0100

Within each -key cusip- set, you need to replace missings by non-missings. Then check for -duplicates-. 

For example, consider -exchange-, presumably a string variable, in which your blanks are presumably empty strings "".  (This is a guess on both counts.) 

bysort key cusip (exchange) : replace exchange = exchange[_N] if missing(exchange) 

You need to use the _last_ observation in each block because empty strings are sorted to first upwards within blocks. 

It may well be that this problem arises only for string variables, in which case 

ds, has(type string)

foreach v in `r(varlist)' { 
	bysort key cusip (`v') : replace `v' = `v'[_N] if missing(`v') 
} 

does this for all string variables. 

If the same problem arises for numeric variables, 

ds, has(type numeric)
foreach v in `r(varlist)' { 
	bysort key cusip (`v') : replace `v' = `v'[1] if missing(`v')
}

as numeric missings are sorted to the end of each block. 

Then apply -duplicates-. 

But only you can say whether this is exactly what you want. 

Nick 
n.j.cox@durham.ac.uk 

Poliquin, Christopher

I have a dataset with about 73,000 observations and close to 400 variables.  The observations are companies and their actively trading issues in a given year.

Some of these 73,000 observations are duplicates, but none of the observations actually match on all variables, so I am faced with the problem of which observation to drop.

Does anyone have a suggestion of how to compare the duplicate observations, consolidate the information each has in certain variables, and then keep one observations?

For example, consider the following...

+--------------------------------------------------------------------------------+
KEY      CUSIP     EXCHANGE   YEAR   RATIO   NOTES
011    03467110      NYSE     2007           Name change --> Acme Corp. in 2005
011    03467110               2007           
011    03467110      NYSE     2007    1:2    
+--------------------------------------------------------------------------------+

So, let's say I label these duplicates because it is the same company and issue in the same year.  What I now need is one observation in my dataset that looks like this...

+--------------------------------------------------------------------------------+
KEY      CUSIP     EXCHANGE   YEAR   RATIO   NOTES
011    03467110      NYSE     2007    1:2    Name change --> Acme Corp. in 2005
+--------------------------------------------------------------------------------+

As I said, the dataset is 73,000 obs and 400 vars.  Minimizing typing *and* minimizing errors/data loss is very important.


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index