RE: st: RE: -finddup- for panel?
As the author of -finddup-, I have to agree with Nick that -duplicates- has
more bells and whistles and seems to do everything that -finddup- does.
Joe J., however, finds that -finddup- is useful when one has to decide
over which among the duplicates to include and which to exclude. I agree. I
haven't used -duplicates- much and I may be mistaken about its capabilities
in tagging duplicates. I believe that -duplicates- tags the duplicated
observation with a number that represents the number of duplicates.
-finddup- tags the duplicates with a sequential number based on a sorted
list such that if there are 3 duplicates they will be numbered 1,2,3 (for
I find that feature to be very useful. In situations where there are
duplicated keys but not duplicated observations, one may need to decide
which of the duplicates to retain and which to drop. Being able to tag them with a
sequential number facilitates that task. For example, -drop if
Here are some examples. We survey people with arthritis. Inexplicably, some
persons complete 2 surveys (!) and are assigned duplicate values of the
major data set keys. The question arises: which observation should be
deleted (or retained), given that they are not true duplicates? One might want to make
a rule to delete the first observation or the second, or might want to look
at the data before making such a choice. For me, -finddup- is a little easier
to use in that circumstance. Nick will correct me if I have misread
-duplicates-. Perhaps sequential numbering of the duplicates could be added
-finddup- also does an un-Stata thing. It automatically creates a variable
called -dupval-. -duplicates- forces you to name the new variable. I like
-dupval- because I always remember its name, much like the -_merge- variable that
Stata creates automatically.
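
A minimal sketch of the two tagging styles contrasted above, using only
built-in Stata commands rather than -finddup- itself. The toy data and the
variable names -score-, -ndup- and -seq- are hypothetical. -duplicates tag-
records how many surplus copies each key has, while -by:- with -_n- gives a
sequential number within each key, which can then drive a keep/drop rule.

```stata
* Hypothetical toy data: person 2 completed the survey twice
clear
input id year score
1 2003 10
2 2003 14
2 2003 17
3 2003 12
end

* Count tag: 0 for a unique key, 1 for each member of a pair, and so on
duplicates tag id year, generate(ndup)

* Sequential tag: 1, 2, 3, ... within each key, ordered by score
bysort id year (score): generate seq = _n

* Inspect the records before deciding which to discard
list id year score ndup seq, sepby(id)

* One possible rule (an assumption, not a recommendation): keep the first
drop if seq > 1
```

Sorting on a further variable inside the parentheses makes the numbering
reproducible; without it the order within a key is arbitrary.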
At 05:24 AM 4/21/2004, you wrote:
If observations are duplicates, the choice of
which to keep can be difficult...
-duplicates- arrived with Stata 8. Some
users were already in the habit of using
various user-written programs published
in the STB or on SSC, including -unique-,
-finddup-, -dups- and various others.
If they serve your purpose, fine.
But you no doubt are aware that observations
can be duplicates with respect to some
variables -- in your case -id- and -year- --
but differ with respect to other variables.
-finddup- offers no facilities for dropping
duplicates. It is an inspection program,
and gives information which can be used
to decide on what to -drop-.
The intent of -duplicates- is to provide
a more general tool, including functionality
for -drop-ping duplicates. But -duplicates-
will not let you go
. duplicates drop id year
whenever other variables also exist. You
must spell out
. duplicates drop id year, force
as a reminder that you may be losing information.
In this way -duplicates- is designed to be
potentially destructive, but also to inhibit
accidental loss of real information.
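
A short sketch of that workflow with official -duplicates-, on a
hypothetical panel where one id-year key is repeated but -wage- differs.
The data and variable names are illustrative only.

```stata
* Hypothetical panel: id 1 has two records for 2002
clear
input id year wage
1 2001 20
1 2002 22
1 2002 25
2 2001 30
end

* Inspect before dropping anything
duplicates report id year
duplicates list id year

* Refused without -force-, because a varlist is given;
* -capture noisily- shows the error without stopping a do-file
capture noisily duplicates drop id year

* Explicit override: keeps the first occurrence of each key
* in the current sort order and drops the rest
duplicates drop id year, force
```

Because the first occurrence is the one kept, sorting the data
deliberately before the forced drop decides which record survives.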
> -----Original Message-----
> From: firstname.lastname@example.org
> [mailto:email@example.com]On Behalf Of joe J.
> Sent: 21 April 2004 11:08
> To: firstname.lastname@example.org
> Subject: RE: st: RE: -finddup- for panel?
> Stata's official -duplicates- command also helps to identify
> observations. But I have a feeling that -finddup- is useful
> when one has to
> decide over which among the duplicates to include and which
> to exclude (for
> later use, say) while generating a duplicate-free data set.
* For searches and help try:
National Data Bank for Rheumatic Diseases
Tel (316) 263-2125 Fax (316) 263-0761