
From: David Kantor <[email protected]>
To: [email protected]
Subject: Re: st: eliminating multiple identical obervations
Date: Tue, 07 Oct 2003 10:25:26 -0400

At 02:46 PM 10/2/2003 -0500, Kripa Freitas wrote:

(I don't know if anyone else responded.)

Hi, I'm working with the SIPP data. I have multiple observations per person per wave which are all identical. So to eliminate this I create the variable x in the following way:

    . sort ssuid epppnum eentaid wave
    . qui by ssuid epppnum eentaid wave: gen x=_N
    . drop if x>1

What I'm left with is:

    . sort ssuid epppnum eentaid wave
    . tab x

              x |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         45        0.01        0.01
              1 |    327,053       99.99      100.00
    ------------+-----------------------------------
          Total |    327,098      100.00

To recheck I create variable y in the exact same way:

    . sort ssuid epppnum eentaid wave
    . by ssuid epppnum eentaid wave: gen y=_N
    . tab y

              y |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |    327,047       99.98       99.98
             51 |         51        0.02      100.00
    ------------+-----------------------------------
          Total |    327,098      100.00

Would someone be able to help me with a reason why it gives me different results?

These are indeed strange results, which I cannot explain.

I will add a few comments, though.

Your code...

sort ssuid epppnum eentaid wave

qui by ssuid epppnum eentaid wave: gen x=_N

drop if x>1

will drop whole sets of duplicates, so that no observation from a duplicated set survives. Perhaps you want to retain one observation per duplicated set. Then you should write,

... gen x = _n

which will retain the first observation in each duplicated set.
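To make the difference between _N and _n concrete, here is a minimal sketch on a made-up toy dataset. The variables id and wave and the counters n and N are hypothetical stand-ins for illustration, not the SIPP variables.

```stata
* Toy data: id 1 appears three times in wave 1, id 2 appears once
clear
input id wave
1 1
1 1
1 1
2 1
end

* _n is the position of the observation within its by-group
bysort id wave: gen long n = _n
* _N is the total number of observations in its by-group
bysort id wave: gen long N = _N

* n is 1, 2, 3 for id 1 and 1 for id 2; N is 3, 3, 3 for id 1 and 1 for id 2
list, sepby(id wave)
```

Here, drop if N > 1 would leave only the single observation for id 2, while drop if n > 1 would leave one observation each for id 1 and id 2.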

And of course, you have only tested for duplication on some identifiers; this does not check whether the other variables also have duplicated values within these sets. If any of them have distinct values, then this (with the change that I have suggested) makes an arbitrary choice of which observation to retain. So, depending on what you want to do, you may need to be more careful.
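One way to check this might be the -duplicates- command, assuming your version of Stata includes it. The sketch below compares duplication on the four identifiers with duplication on every variable; dup_id and dup_all are names I made up for illustration.

```stata
* Duplicates judged on the four identifiers only
duplicates report ssuid epppnum eentaid wave

* Duplicates judged on all variables (no varlist means every variable)
duplicates report

* Tag each observation with the number of other observations that match it
duplicates tag ssuid epppnum eentaid wave, generate(dup_id)
duplicates tag, generate(dup_all)

* Observations that repeat on the identifiers but are not fully identical
count if dup_id > dup_all
```

If that count is zero, the duplicated sets are identical on every variable and it does not matter which observation is kept.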

Putting these matters aside, I would advocate that whenever I create such a counting variable...

by ... : gen x = _n

(or _N, whichever is appropriate)

I would use an integer type -- usually int or long.

by ... : gen long x = _n

though I don't see how this difference would cause the strange results you got.

Incidentally, you can shorten these kinds of constructs to a single command, without even generating the counting variable:

bysort ... : drop if _n >1

(or _N>1)

But on the other hand, sometimes it is useful to have the counting variable around, to see what is actually happening before I drop observations.
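As a rough sketch of that inspect-first workflow, using the identifiers from your post (the variable name n_in_set is hypothetical):

```stata
* Count the size of each identifier set, stored as an integer type
bysort ssuid epppnum eentaid wave: gen long n_in_set = _N

* See how many observations fall in sets of each size
tab n_in_set

* Look at the duplicated sets before dropping anything
list ssuid epppnum eentaid wave if n_in_set > 1, sepby(ssuid epppnum eentaid wave)

* Keep the first observation of each set
bysort ssuid epppnum eentaid wave: keep if _n == 1

* Confirm the identifiers are now unique; isid exits with an error if not
* (it also objects to missing values in the identifiers unless missok is given)
isid ssuid epppnum eentaid wave
```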

I hope this is useful.

-- David

David Kantor

Institute for Policy Studies

Johns Hopkins University

[email protected]

410-516-5404

