Stata The Stata listserver

st: RE: eliminating multiple identical observations


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: eliminating multiple identical observations
Date   Fri, 3 Oct 2003 10:26:04 +0100

Kripa Freitas

> I'm working with the SIPP data. I have multiple 
> observations per person
> per wave which are all identical. So to eliminate this I create the
> variable x in the following way:
> 
> . sort ssuid epppnum eentaid wave
> . qui by ssuid epppnum eentaid wave: gen x=_N
> . drop if x>1
> 
> What I'm left with is:
> . sort ssuid epppnum eentaid wave
> 
> . tab x
> 
>           x |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           0 |         45        0.01        0.01
>           1 |    327,053       99.99      100.00
> ------------+-----------------------------------
>       Total |    327,098      100.00
> 
> To recheck I create variable y in the exact same way:
> . sort ssuid epppnum eentaid wave
> 
> . by ssuid epppnum eentaid wave: gen y=_N
> 
> . tab y
> 
>           y |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |    327,047       99.98       99.98
>          51 |         51        0.02      100.00
> ------------+-----------------------------------
>       Total |    327,098      100.00
> 
> Would someone be able to help me with a reason why it gives 
> me different results?

First note that your first block of code 

. sort ssuid epppnum eentaid wave
. qui by ssuid epppnum eentaid wave: gen x=_N
. drop if x>1

could be telescoped to 

. bysort ssuid epppnum eentaid wave : gen x = _N 
. drop if x > 1

However, I don't understand how -x- can ever be 
created as 0. Under -by:-, _N is the number of 
observations in the current group, so it can only 
be a positive integer. Are you sure you did nothing 
else to -x-? 
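
One way to check, as a minimal sketch only: recompute 
the count under a new name and assert that it is never 
below 1 or missing. The name -chk- is purely 
illustrative and is assumed not to exist in your data. 

. bysort ssuid epppnum eentaid wave : gen chk = _N 
. assert chk >= 1 & chk < . 
. count if x == 0 

If the -assert- passes but the -count- is not zero, 
then -x- was presumably changed by something after 
it was created. 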

Your second block of code is credible. 

That said, it seems that your code will 
not do what you really want, as it will drop 
_all_ repeated copies. I guess that you would 
prefer to keep one from each set of repeated copies. 
For that a first principles solution could be 

. bysort ssuid epppnum eentaid wave : keep if _n == 1 

or 

. bysort ssuid epppnum eentaid wave : drop if _n > 1 
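
The difference between dropping every copy and keeping 
one copy can be seen in a small sketch. This assumes 
the auto dataset shipped with Stata, in which -make- 
is distinct in each of the 74 observations; -copies- 
is just an illustrative name. 

. sysuse auto, clear 
. expand 2 in 1/5 
. bysort make : gen copies = _N 
. tab copies 
. bysort make : keep if _n == 1 
. count 

After -expand- there are 79 observations, 10 of them 
in groups of two. -drop if copies > 1- at this point 
would leave 69, losing those five cars altogether, 
whereas -keep if _n == 1- leaves the original 74. 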

In addition, there are various programs dedicated 
to this problem. Official Stata (from version 8) 
now has a -duplicates- command. 
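
As a sketch of how that might be applied here, assuming 
the same four identifiers define a repeated observation: 

. duplicates report ssuid epppnum eentaid wave 
. duplicates drop ssuid epppnum eentaid wave, force 

-duplicates report- only tabulates the number of copies 
and changes nothing. -duplicates drop- keeps the first 
observation in each group, and the -force- option is 
required whenever a variable list is given. If the 
repeated observations really are identical on every 
variable, -duplicates drop- with no variable list 
should do the same job without -force-. 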

Nick 
[email protected] 
