Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Removing (partial) duplicate obs with lots of vars


From   "Poliquin, Christopher" <cpoliquin@hbs.edu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   st: Removing (partial) duplicate obs with lots of vars
Date   Wed, 20 Oct 2010 15:30:32 -0400

Hello,

I have a dataset with about 73,000 observations and close to 400 variables.  The observations are companies and their actively trading issues in a given year.

Some of these 73,000 observations are duplicates, but none of the observations actually match on all variables, so I am faced with the problem of which observation to drop.

Does anyone have a suggestion of how to compare the duplicate observations, consolidate the information each has in certain variables, and then keep one observations?

For example, consider the following...

+--------------------------------------------------------------------------------+
KEY      CUSIP     EXCHANGE   YEAR   RATIO   NOTES
011    03467110      NYSE     2007           Name change --> Acme Corp. in 2005
011    03467110               2007           
011    03467110      NYSE     2007    1:2    
+--------------------------------------------------------------------------------+

So, let's say I label these duplicates because it is the same company and issue in the same year.  What I now need is one observation in my dataset that looks like this...

+--------------------------------------------------------------------------------+
KEY      CUSIP     EXCHANGE   YEAR   RATIO   NOTES
011    03467110      NYSE     2007    1:2    Name change --> Acme Corp. in 2005
+--------------------------------------------------------------------------------+

As I said, the dataset is 73,000 obs and 400 vars.  Minimizing typing *and* minimizing errors/data loss is very important.

I would greatly appreciate anyone's advice/help.


Best wishes,
Chris

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index