# st: RE: eliminating multiple identical observations

 From "Nick Cox" <[email protected]> To <[email protected]> Subject st: RE: eliminating multiple identical observations Date Fri, 3 Oct 2003 10:26:04 +0100

```
Kripa Freitas

> I'm working with the SIPP data. I have multiple
> observations per person
> per wave which are all identical. So to eliminate this I create the
> variable x in the following way:
>
> . sort ssuid epppnum eentaid wave
> . qui by ssuid epppnum eentaid wave: gen x=_N
> . drop if x>1
>
> What I'm left with is:
> . sort ssuid epppnum eentaid wave
>
> . tab x
>
>           x |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           0 |         45        0.01        0.01
>           1 |    327,053       99.99      100.00
> ------------+-----------------------------------
>       Total |    327,098      100.00
>
> To recheck I create variable y in the exact same way:
> . sort ssuid epppnum eentaid wave
>
> . by ssuid epppnum eentaid wave: gen y=_N
>
> . tab y
>
>           y |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |    327,047       99.98       99.98
>          51 |         51        0.02      100.00
> ------------+-----------------------------------
>       Total |    327,098      100.00
>
> Would someone be able to help me with a reason why it gives
> me different results?

First note that your first block of code

. sort ssuid epppnum eentaid wave
. qui by ssuid epppnum eentaid wave: gen x=_N
. drop if x>1

could be telescoped to

. bysort ssuid epppnum eentaid wave : gen x = _N
. drop if x > 1

However, I don't understand how -x- can ever be
created as 0. _N, as I understand it, can only
be a positive integer. Are you sure you did nothing
else to -x-?

Your second block of code is credible.

That said, it seems that your code will
not do what you really want, as it will drop
_all_ repeated copies. I guess that you would
prefer to keep one from each set of repeated copies.
For that a first principles solution could be

. bysort ssuid epppnum eentaid wave : keep if _n == 1

or

. bysort ssuid epppnum eentaid wave : drop if _n > 1

In addition, there are various programs dedicated
to this problem. Official Stata (from version 8)
now has a -duplicates- command.

Nick
[email protected]

```
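
The difference between dropping every repeated copy and keeping one copy per set can be checked on a toy dataset. What follows is only a sketch as a do-file: the variables `id` and `wave` and the six observations are made up for illustration, standing in for `ssuid epppnum eentaid wave` in the real SIPP data. The `force` option is used because `duplicates drop` requires it when a varlist is given.

```stata
* Toy data (hypothetical): id 1 appears 3 times, id 2 once, id 3 twice
clear
input id wave
1 1
1 1
1 1
2 1
3 1
3 1
end

* Under by:, _N is the number of observations in the group, so it is at least 1
bysort id wave : gen x = _N
assert x >= 1

* The approach in the question: every copy of a repeated observation is dropped
preserve
drop if x > 1
count
assert r(N) == 1        // only id 2 is left
restore

* Keep one observation from each set of repeated copies
preserve
bysort id wave : keep if _n == 1
assert _N == 3          // one observation each for ids 1, 2 and 3
restore

* Same result with official -duplicates- (Stata 8 and later)
duplicates drop id wave, force
assert _N == 3
```

If the asserts pass silently, the first approach has discarded ids 1 and 3 entirely, while the other two keep one observation for each id.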