Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: RE: Removing (partial) duplicate obs with lots of vars

From	Nick Cox <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	st: RE: Removing (partial) duplicate obs with lots of vars
Date	Thu, 21 Oct 2010 11:43:56 +0100

Within each -key cusip- set, you need to replace missings by non-missings. Then check for -duplicates-.

For example, consider -exchange-, presumably a string variable, in which your blanks are presumably empty strings "". (This is a guess on both counts.)

bysort key cusip (exchange) : replace exchange = exchange[_N] if missing(exchange)

You need to use the _last_ observation in each block because empty strings are sorted to first upwards within blocks.

It may well be that this problem arises only for string variables, in which case

ds, has(type string)

foreach v in `r(varlist)' {
bysort key cusip (`v') : replace `v' = `v'[_N] if missing(`v')
}

does this for all string variables.

If the same problem arises for numeric variables,

ds, has(type numeric)
foreach v in `r(varlist)' {
bysort key cusip (`v') : replace `v' = `v'[1] if missing(`v')
}

as numeric missings are sorted to the end of each block.

Then apply -duplicates-.

But only you can say whether this is exactly what you want.

Nick
[email protected]

Poliquin, Christopher

I have a dataset with about 73,000 observations and close to 400 variables. The observations are companies and their actively trading issues in a given year.

Some of these 73,000 observations are duplicates, but none of the observations actually match on all variables, so I am faced with the problem of which observation to drop.

Does anyone have a suggestion of how to compare the duplicate observations, consolidate the information each has in certain variables, and then keep one observations?

For example, consider the following...

+--------------------------------------------------------------------------------+
KEY CUSIP EXCHANGE YEAR RATIO NOTES
011 03467110 NYSE 2007 Name change --> Acme Corp. in 2005
011 03467110 2007
011 03467110 NYSE 2007 1:2
+--------------------------------------------------------------------------------+

So, let's say I label these duplicates because it is the same company and issue in the same year. What I now need is one observation in my dataset that looks like this...

+--------------------------------------------------------------------------------+
KEY CUSIP EXCHANGE YEAR RATIO NOTES
011 03467110 NYSE 2007 1:2 Name change --> Acme Corp. in 2005
+--------------------------------------------------------------------------------+

As I said, the dataset is 73,000 obs and 400 vars. Minimizing typing *and* minimizing errors/data loss is very important.

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/

References:
- st: Removing (partial) duplicate obs with lots of vars
  - From: "Poliquin, Christopher" <[email protected]>

Prev by Date: RE: st: MLE for each day
Next by Date: RE: st: Data management
Previous by thread: st: Removing (partial) duplicate obs with lots of vars
Next by thread: st: Accounting for measurement error in regression
Index(es):
- Date
- Thread