Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <n.j.cox@durham.ac.uk> |
To | "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu> |
Subject | st: RE: Removing (partial) duplicate obs with lots of vars |
Date | Thu, 21 Oct 2010 11:43:56 +0100 |
Within each -key cusip- set, you need to replace missings by non-missings. Then check for -duplicates-. For example, consider -exchange-, presumably a string variable, in which your blanks are presumably empty strings "". (This is a guess on both counts.) bysort key cusip (exchange) : replace exchange = exchange[_N] if missing(exchange) You need to use the _last_ observation in each block because empty strings are sorted to first upwards within blocks. It may well be that this problem arises only for string variables, in which case ds, has(type string) foreach v in `r(varlist)' { bysort key cusip (`v') : replace `v' = `v'[_N] if missing(`v') } does this for all string variables. If the same problem arises for numeric variables, ds, has(type numeric) foreach v in `r(varlist)' { bysort key cusip (`v') : replace `v' = `v'[1] if missing(`v') } as numeric missings are sorted to the end of each block. Then apply -duplicates-. But only you can say whether this is exactly what you want. Nick n.j.cox@durham.ac.uk Poliquin, Christopher I have a dataset with about 73,000 observations and close to 400 variables. The observations are companies and their actively trading issues in a given year. Some of these 73,000 observations are duplicates, but none of the observations actually match on all variables, so I am faced with the problem of which observation to drop. Does anyone have a suggestion of how to compare the duplicate observations, consolidate the information each has in certain variables, and then keep one observations? For example, consider the following... +--------------------------------------------------------------------------------+ KEY CUSIP EXCHANGE YEAR RATIO NOTES 011 03467110 NYSE 2007 Name change --> Acme Corp. in 2005 011 03467110 2007 011 03467110 NYSE 2007 1:2 +--------------------------------------------------------------------------------+ So, let's say I label these duplicates because it is the same company and issue in the same year. What I now need is one observation in my dataset that looks like this... +--------------------------------------------------------------------------------+ KEY CUSIP EXCHANGE YEAR RATIO NOTES 011 03467110 NYSE 2007 1:2 Name change --> Acme Corp. in 2005 +--------------------------------------------------------------------------------+ As I said, the dataset is 73,000 obs and 400 vars. Minimizing typing *and* minimizing errors/data loss is very important. * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/