If you have duplicates, then the -duplicates-
command should be useful.
Doing it from first principles is instructive,
but you have to be clear on some basics.
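In your case, Stata is working as advertised.
Your duplicates must be identified in terms
of both -code- and -year-. Sorting on -code-
alone does not determine the order of
observations that share a -code- but differ
in -year-: -sort- breaks such ties arbitrarily,
so which observation comes first within each
-code-, and hence which one survives
-drop if dup>1-, can differ from run to run.
That is why your count for 2004 changes.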
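
Here is a minimal sketch of the first-principles
approach with duplicates defined on both variables.
The toy data are made up for illustration; only
-code- and -year- come from your description.

```stata
* Toy data (hypothetical values): institution code and year
clear
input code year
1 2003
1 2004
1 2004
2 2004
3 2003
3 2004
end

* Number observations within each code-year pair, not within code alone
bysort code year: gen dup = cond(_N == 1, 0, _n)

* Keep one observation per code-year pair
drop if dup > 1

* Count institutions observed in 2004
count if year == 2004
```

Because every observation within a code-year pair
has the same -code- and the same -year-, it no
longer matters which one is kept, so the count is
reproducible. -duplicates drop code year, force-
should do the same job in one line.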
Nick
Mosca, Ilaria
> I have been working with a database in which I had to identify
> duplicates of institutions and afterwards count the number of
> institutions per year. I therefore wrote the following commands:
>
> . sort code
> . quietly by code: gen dup=cond(_N==1,0,_n)
> . drop if dup>1
> . gen id04=1 if year==2004
> . count if id04==1
>
> My problem is that EVERY TIME that I was running these commands, I
> obtained different results! Once the count command was 749, once 753,
> and so on. And this without any apparent reason.
>
> In order to cope with this problem I therefore used the command tag,
> namely:
> . sort code
> . egen tag=tag(code)
> . count if tag==1
>
> I ran these commands several times and the results shown are always
> the same.
>
> My question to you is thus the following: why does the command for
> duplicates seem not to work in this case? I frequently have to identify
> duplicates in my databases, and I use these commands pretty often. But
> getting different results every time casts doubt on their effectiveness.