Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

st: RE: Duplicate observations

From	Joe Canner <[email protected]>
To	"[email protected]" <[email protected]>
Subject	st: RE: Duplicate observations
Date	Mon, 10 Mar 2014 18:58:44 +0000

Emanuele,

Nick provided a good solution to your problem, but it's probably worth noting why you had a problem to begin with.

The statement:

by reporter partner year (x_1 -date), sort: gen duplicates=_n

is probably not doing what you want it to do. It looks like you want to sort by x_1 (ascending) and date (descending). However, as far as I am aware, the minus sign to indicate a descending sort can only be used in a -gsort- command. In this context the minus sign is interpreted as a hyphen, so "x_1 -date" is read as a variable list (variables x_1 through date). Accordingly, the data are not sorted in descending date order, which results in the problem you noted.

If you need to do something like this in the future and Nick's solution doesn't apply, try the following:

gsort reporter partner year x_1 -date
bysort reporter partner year:  gen duplicates=_n
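
An alternative that avoids -gsort- is sketched below, assuming date is a numeric Stata daily date. It negates date into a helper variable (negdate, a name introduced here only for illustration) and sorts on that in ascending order, which amounts to a descending sort on date. The toy data mimic rows quoted in this thread; they are not the real dataset.

```stata
* Toy data modeled on the rows quoted in this thread (illustrative only)
clear
input str7 reporter str7 partner int year str9 sdate str3 x_1
"Albania" "Germany" 1967 "08apr1967" "n_1"
"Albania" "Germany" 1967 "17dec1967" "n_1"
"Albania" "Austria" 1980 "06dec1980" "n_1"
"Albania" "Austria" 1980 "15nov1980" "n_1"
end

* Convert the string dates to numeric daily dates
gen date = daily(sdate, "DMY")
format date %td

* Helper variable (hypothetical name): ascending negdate = descending date
gen negdate = -date

* Within each reporter-partner-year group, order by x_1, then newest date first
bysort reporter partner year (x_1 negdate): gen duplicates = _n

* Keep only the most recent observation in each group
keep if duplicates == 1
drop negdate sdate
list reporter partner year date x_1
```

If only the most recent observation per group is needed, sorting on date in ascending order and keeping the last row of each group (keep if _n == _N under the same by prefix) should give the same result without a helper variable.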

Regards,
Joe Canner
Johns Hopkins University School of Medicine

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of emanuele mazzini
Sent: Monday, March 10, 2014 2:31 PM
To: [email protected]
Subject: st: Duplicate observations

Hello to everybody,

I have an issue with duplicate observations that I find puzzling.
I have data on country pairs by year and I am interested in two
specific variables: a date and a variable that I will call x_1.

Specifically, my data look like this:

reporter  partner  year  date       x_1

Albania   Austria  1980  6dec1980   n_1
Albania   Austria  1980  15nov1980  n_1
.         .        .
.         .        .
.         .        .

As you may have noticed, the observations differ only by date,
and I need to drop duplicates so as to keep the most recent one (hence, in
this case, the first one, dated 6dec1980).

I ran the following commands:

duplicates tag reporter partner year, generate(dup)

by reporter partner year (x_1 -date), sort: gen duplicates=_n

so as to be able to identify duplicates and then, among those with
dup > 0, drop those for which duplicates > 1.
This method was suggested in this thread (I take this opportunity to
say thanks again), but it seems not to work for some observations.
Take, for instance, the following example:

reporter  partner  year  date       x_1  dup  duplicates
Albania   Germany  1967  08apr1967  n_1  1    1
Albania   Germany  1967  17dec1967  n_1  1    2

As you may notice, Stata flags the observation dated 17dec1967
as the one with duplicates > 1 (which would then be dropped),
whereas I would have expected the opposite.

Can anyone explain why and, if possible, tell me how to deal with this issue?

Thank you very much in advance,

Emanuele
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: RE: Duplicate observations
  - From: Nick Cox <[email protected]>

References:
- st: Duplicate observations
  - From: emanuele mazzini <[email protected]>
