The Stata listserver

RE: st: RE: -finddup- for panel?

From   Fred Wolfe <[email protected]>
To   [email protected]
Subject   RE: st: RE: -finddup- for panel?
Date   Wed, 21 Apr 2004 06:53:38 -0500

As the author of -finddup-, I have to agree with Nick that -duplicates- has more bells and whistles and seems to do everything that -finddup- does.

Joe J., however, finds that -finddup- is useful when one has to decide which among the duplicates to include and which to exclude. I agree. I haven't used -duplicates- much and I may be mistaken about its capabilities in tagging duplicates. I believe that -duplicates- tags each duplicated observation with a number that represents the number of duplicates. -finddup- tags the duplicates with a sequential number based on a sorted list, such that if there are 3 duplicates they will be numbered 1, 2, 3 (for example).

I find that feature to be very useful. In situations where there are duplicated keys but not duplicated observations, one may need to decide which of the duplicates to retain and which to drop. Being able to tag them with a sequential number facilitates that task. For example, -drop if inrange(dupval, 2, 99)-.
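
A minimal sketch of the two tagging styles described above, using only official commands rather than -finddup- itself. The toy data and the variable names -score-, -ndup- and -seq- are made up for illustration; sorting within each key by -score- is an assumption that simply makes the sequence reproducible.

```stata
* Toy data: key (id 1, year 2001) appears three times with different scores
clear
input id year score
1 2001 10
1 2001 12
1 2001 15
2 2001 20
3 2002 30
3 2002 31
end

* Count-style tag: number of surplus copies of each key
* (0 if the key is unique, 2 for every member of a triple)
duplicates tag id year, generate(ndup)

* Sequential tag: 1, 2, 3, ... within each key, ordered by score
bysort id year (score): generate seq = _n

list, sepby(id year)

* Keep only the first observation for each key
drop if seq > 1
```

With a sequential tag the rule for which copy survives is explicit in the sort order, whereas the count-style tag only says how many competitors an observation has.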

Here is an example. We survey people with arthritis. Inexplicably, some persons complete 2 surveys (!) and are assigned duplicate keys for the major data set keys. The question arises of which observation should be deleted (or retained), as they are not true duplicates. One might want to make a rule to delete the first observation or the second, or might want to look at the data before making such a choice. For me, -finddup- is a little easier to use in that circumstance. Nick will correct me if I have misread -duplicates-. Perhaps sequential numbering of the duplicates could be added to -duplicates-.
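
One way such a workflow might look with official commands only: inspect the competing observations first, then apply a rule. This is a sketch under assumptions, not a description of how the survey data are actually handled. The toy data and the variable -received- (a hypothetical day-of-year on which each survey arrived) are invented, and "keep the latest survey" is just one possible rule.

```stata
* Toy data: persons 101 and 103 each returned two surveys for 2003
clear
input id year received score
101 2003 15 42
101 2003 48 40
102 2003 20 35
103 2003 11 50
103 2003 12 51
end

* How many keys are duplicated, and in groups of what size?
duplicates report id year

* Look at the competing observations before choosing a rule
duplicates tag id year, generate(ndup)
sort id year received
list if ndup > 0, sepby(id year)

* Rule (an assumption): keep the most recently received survey per key
bysort id year (received): keep if _n == _N
drop ndup

* Stops with an error unless id and year now uniquely identify observations
isid id year
```

If two surveys for the same key share the same value of -received-, the sort order between them is arbitrary, so a tie-breaking variable would be needed in the parentheses.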

-finddup- also does an un-Stata thing. It automatically creates a variable called -dupval-. -duplicates- forces you to name the new variable. I like -dupval- because I always remember its name, sort of like -_merge-, which Stata creates automatically.


At 05:24 AM 4/21/2004, you wrote:

If observations are duplicates, the choice of
which to keep can be difficult...

-duplicates- arrived with Stata 8. Some
users were already in the habit of using
various user-written programs published
in the STB or on SSC, including -unique-,
-finddup-, -dups- and various others.
If they serve your purpose, fine.

But you no doubt are aware that observations
can be duplicates with respect to some
variables -- in your case -id- and -year- --
but differ with respect to other variables.

-finddup- offers no facilities for dropping
duplicates. It is an inspection program,
and gives information which can be used
to decide on what to -drop-.

The intent of -duplicates- is to provide
a more general tool, including functionality
for -drop-ping duplicates. But -duplicates-
will not let you go

. duplicates drop id year

whenever other variables also exist. You
must spell out

. duplicates drop id year, force

as a reminder that you may be losing information.
In this way -duplicates- is designed to be
potentially destructive, but also to inhibit
accidental loss of real information.

[email protected]

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of joe J.
> Sent: 21 April 2004 11:08
> To: [email protected]
> Subject: RE: st: RE: -finddup- for panel?
> Stata's official -duplicates- command also helps to identify
> duplicate
> observations. But I have a feeling that -finddup- is useful
> when one has to
> decide over which among the duplicates to include and which
> to exclude (for
> later use, say) while generating a duplicate-free data set.

*   For searches and help try:

Fred Wolfe
National Data Bank for Rheumatic Diseases
Wichita, Kansas
Tel (316) 263-2125     Fax (316) 263-0761
[email protected]
