# Re: st: eliminating multiple identical observations

From: David Kantor <[email protected]>
To: [email protected]
Subject: Re: st: eliminating multiple identical observations
Date: Tue, 07 Oct 2003 10:25:26 -0400

At 02:46 PM 10/2/2003 -0500, Kripa Freitas wrote:

```
Hi,
I'm working with the SIPP data. I have multiple observations per person
per wave which are all identical. So to eliminate this I create the
variable x in the following way:

. sort ssuid epppnum eentaid wave
. qui by ssuid epppnum eentaid wave: gen x=_N
. drop if x>1

what i'm left with is:
. sort ssuid epppnum eentaid wave

. tab x

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         45        0.01        0.01
          1 |    327,053       99.99      100.00
------------+-----------------------------------
      Total |    327,098      100.00

To recheck I create variable y in the exact same way:
. sort ssuid epppnum eentaid wave

. by ssuid epppnum eentaid wave: gen y=_N

. tab y

          y |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |    327,047       99.98       99.98
         51 |         51        0.02      100.00
------------+-----------------------------------
      Total |    327,098      100.00

Would someone be able to help me with a reason why it gives me different
results?
```
(I don't know if anyone else responded.)

These are indeed strange results, which I cannot explain.

sort ssuid epppnum eentaid wave
qui by ssuid epppnum eentaid wave: gen x=_N
drop if x>1

will drop every observation in each duplicated set, not just the extra copies. Perhaps you want to retain one observation per duplicated set. In that case, write
... gen x = _n

which will retain the first observation in each duplicated set.
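To make the difference between _N and _n concrete, here is a minimal sketch on a made-up dataset. The variables id and wave and their values are hypothetical and stand in for the SIPP identifiers.

```stata
* Toy data: id 1 appears three times, id 2 once, id 3 twice (all hypothetical).
clear
input id wave
1 1
1 1
1 1
2 1
3 1
3 1
end

* _N is the number of observations in the by-group;
* _n is the position of the observation within the by-group.
bysort id wave: gen long groupsize = _N
bysort id wave: gen long position  = _n

list, sepby(id)

* "drop if groupsize > 1" would remove all 5 observations of ids 1 and 3.
count if groupsize > 1

* "drop if position > 1" removes only the 3 extra copies.
count if position > 1
drop if position > 1
```

After the last command, one observation remains for each of the three ids.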

And of course, you have only tested for duplication on some identifiers; this does not check whether the other variables also have duplicated values within these sets. If any of them have distinct values, then this (with the change that I have suggested) makes an arbitrary choice of which observation to retain. So, depending on what you want to do, you may need to be more careful.
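One way to check this, sketched here on the assumption that your Stata version includes the official -duplicates- and -isid- commands and that the SIPP extract is in memory:

```stata
* How many observations share the same identifiers?
duplicates report ssuid epppnum eentaid wave

* How many observations are identical on ALL variables?
* (No varlist means every variable is compared.)
duplicates report

* Drop only the copies that are identical on every variable,
* keeping one observation from each such set.
duplicates drop

* Verify that the identifiers now uniquely identify observations.
* -isid- exits with an error if they do not, which would mean some
* "duplicates" differ on other variables and need a closer look.
isid ssuid epppnum eentaid wave
```

If the two reports disagree, the duplicated sets are not truly identical, and which observation to keep becomes a substantive decision rather than a mechanical one.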

Putting these matters aside, whenever you create such a counting variable,

by ... : gen x = _n

(or _N, whichever is appropriate)

I would advocate storing it as an integer type, usually int or long:

by ... : gen long x = _n

though I don't see how this difference would cause the strange results you got.
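One general reason to prefer an integer type, offered as background rather than as an explanation of the results above: unless told otherwise, -generate- stores new variables as float, and a float cannot hold every integer above 16,777,216 exactly. A small sketch with made-up variable names:

```stata
* Hypothetical one-observation dataset to compare storage types.
clear
set obs 1

gen float as_float = 16777217
gen long  as_long  = 16777217

* The float copy is rounded to the nearest representable value.
display %12.0f as_float[1]    // shows 16777216
display %12.0f as_long[1]     // shows 16777217
```

With about 327,000 observations the counts here are far below that limit, which is consistent with the storage type not being the cause.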

Incidentally, you can shorten these kinds of constructs to a single command, without even generating the counting variable:
bysort ... : drop if _n >1

(or _N>1)

On the other hand, it is sometimes useful to keep the counting variable around, to see what is actually happening before dropping observations.
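A sketch of that inspect-then-drop approach, assuming the SIPP extract is in memory; the variable name copy is hypothetical:

```stata
* Number the observations within each identifier group.
bysort ssuid epppnum eentaid wave: gen long copy = _n

* Look before dropping: how large do the duplicated sets get,
* and how many observations would be removed?
tabulate copy
count if copy > 1

* Once satisfied, keep the first observation of each set
* and discard the helper variable.
drop if copy > 1
drop copy
```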

I hope this is useful.
-- David

David Kantor
Institute for Policy Studies
Johns Hopkins University
[email protected]
410-516-5404
