Stata The Stata listserver

Re: st: eliminating multiple identical obervations

From   David Kantor <[email protected]>
To   [email protected]
Subject   Re: st: eliminating multiple identical obervations
Date   Tue, 07 Oct 2003 10:25:26 -0400

At 02:46 PM 10/2/2003 -0500, Kripa Freitas wrote:
Hi ,
I'm working with the SIPP data. I have multiple observations per person
per wave which are all identical. So to eliminate this I create the
variable x in the following way:

. sort ssuid epppnum eentaid wave
. qui by ssuid epppnum eentaid wave: gen x=_N
. drop if x>1

what i'm left with is:
. sort ssuid epppnum eentaid wave

. tab x

          x |      Freq.     Percent        Cum.
          0 |         45        0.01        0.01
          1 |    327,053       99.99      100.00
      Total |    327,098      100.00

To recheck I create variable y in the exact same way:
. sort ssuid epppnum eentaid wave

. by ssuid epppnum eentaid wave: gen y=_N

. tab y

          y |      Freq.     Percent        Cum.
          1 |    327,047       99.98       99.98
         51 |         51        0.02      100.00
      Total |    327,098      100.00

Would someone be able to help me with a reason why it gives me different
(I don't know if anyone else responded.)

These are indeed strange results, which I cannot explain.

I will add a few comments, though.
Your code...

sort ssuid epppnum eentaid wave
qui by ssuid epppnum eentaid wave: gen x=_N
drop if x>1

will drop whole sets of duplicates, leaving no observation from any duplicated set. Perhaps you want to retain one observation per duplicated set. Then you should write,
... gen x = _n

which will retain the first observation in each duplicated set.
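To make the difference between _N and _n concrete, here is a small sketch on toy data. The variables id and wave and the names n_in_group and seq are made up for illustration; they stand in for the SIPP identifiers.

```stata
* Toy data: hypothetical variables id and wave stand in for the SIPP keys
clear
input id wave
1 1
1 1
1 1
2 1
3 1
3 1
end

* _N is the number of observations in the by-group;
* _n is the position of the observation within the by-group
bysort id wave: gen long n_in_group = _N
bysort id wave: gen long seq = _n
list, sepby(id wave)

* "drop if n_in_group > 1" would remove ids 1 and 3 entirely.
* "drop if seq > 1" keeps exactly one observation per (id, wave) group.
drop if seq > 1
count
```

With this toy data, dropping on seq leaves one observation for each of the three groups, whereas dropping on n_in_group would leave only id 2.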

And of course, you have only tested duplication on some identifiers; this does not check whether other variables have duplicated values within these sets. If any have distinct values, then this (with the change that I have suggested) makes an arbitrary choice of which observation to retain. So, depending on what you want to do, you may need to be more careful.
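One way to check this, assuming your Stata has the duplicates command, is to compare duplication on the identifiers with duplication on every variable. This is only a sketch; the local macro keys and the variable dup_on_keys are names I have made up.

```stata
* Identifiers taken from the original post
local keys ssuid epppnum eentaid wave

* How many observations are duplicated on the identifiers alone?
duplicates report `keys'

* How many are duplicated on every variable in the dataset?
duplicates report

* Drop only the observations that are identical on all variables,
* keeping the first of each identical set
duplicates drop

* Anything still duplicated on the identifiers must differ on some
* other variable, so it deserves a look before you drop anything more
duplicates tag `keys', generate(dup_on_keys)
tab dup_on_keys
```

If the two reports agree, the duplicates really are identical and the choice of which one to keep does not matter.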

Putting these matters aside, whenever I create such a counting variable...

by ... : gen x = _n

(or _N, whichever is appropriate)

I use an integer type, usually int or long:

by ... : gen long x = _n

though I don't see how this difference would cause the strange results you got.
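For what it is worth, here is a sketch of why I prefer an integer type. Unless you have changed the default with -set type-, generate creates a float, and a float holds integers exactly only up to 16,777,216. The variable names as_float and as_long are made up for illustration.

```stata
* One observation is enough to show the difference
clear
set obs 1

* Default storage type (float, assuming -set type- has not been changed)
gen as_float = 16777217

* Explicit long storage type
gen long as_long = 16777217

* Show both values with a fixed format so nothing is hidden by %9.0g
format as_float as_long %12.0f
list
```

The float copy cannot hold 16,777,217 exactly, while the long copy can. With about 327,000 observations you are nowhere near that limit, which is why I doubt the storage type explains your results.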

Incidentally, you can shorten these kinds of constructs to a single command, without even generating the counting variable:
bysort ... : drop if _n >1

(or _N>1)

But on the other hand, sometimes it is useful to have the counting variable around, to see what is actually happening before I drop observations.
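A sketch of that more cautious route, using the identifiers from the original post: create the counter, look at it, drop, and then verify. The name seq is made up; isid exits with an error if the listed variables do not uniquely identify the observations.

```stata
* Number the observations within each identifier group
bysort ssuid epppnum eentaid wave: gen long seq = _n

* Inspect before dropping: how many extra copies are there?
tab seq
count if seq > 1

* Keep the first observation of each group
drop if seq > 1

* Verify that the identifiers are now unique
isid ssuid epppnum eentaid wave
```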

I hope this is useful.
-- David

David Kantor
Institute for Policy Studies
Johns Hopkins University
[email protected]
