Statalist The Stata Listserver

Re: st: how to assign new identifier numbers with duplicates

From   "Svend Juul" <SJ@SOCI.AU.DK>
Subject   Re: st: how to assign new identifier numbers with duplicates
Date   Wed, 28 Mar 2007 09:35:55 +0200

Linda wrote:

I am using firm-level panel data for performance analysis
of firms. But I found that my dataset has duplicated identifiers
(e.g. the same identity numbers for two firms in two different
regions in a certain year). My dataset looks as follows
(a1 is a code for high-level region, a2 is the subregion, a3
is the firm identity number):

year a1 a2 a3
1995 450 57 206141
1995 450 54 206141
1996 450 57 206141
1996 450 54 206141
1997 450 57 206141
1997 450 54 206141
1995 470 41 223243
1995 470 43 223243
1995 470 44 223243
1996 470 41 223243
1996 470 44 223243
1997 470 41 223243
1998 470 41 223243
2000 470 41 223243

At the moment, I don't want to consider the differences between subregions.
So, I want to change the identity number such that I have uniquely
identified observations by the identifier variable a3 and year....

I am not sure what you want. You tell us that two different firms
in two different regions (a1) can have the same id number (a3). You
then consider it a problem that the same firm id occurs several
times in a given year. From your sample data it seems that this
occurs because the same firm has a record for each year and subregion,
but you don't want to consider subregions.

If the problem is, as you describe, that the same firm id is used for
different firms in different regions, you could combine firm id (a3) 
and region id (a1) to get a unique firm id. But first a couple of
comments:

1) In your attempts you -replace-d the original firm id by a modified
one, thus destroying the original information. This is dangerous behaviour.

2) In long id numbers you may get precision problems; use a string
variable to prevent that.
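
The precision problem in 2) can be seen with a made-up nine-digit id.
The six-digit ids in the sample data are not affected, since a float
holds integers exactly up to 16,777,216. A minimal sketch; the variable
names id_f, id_l and id_s are hypothetical:

   * one observation, the same id stored in three ways
   clear
   set obs 1
   generate float id_f = 123456789
   generate long  id_l = 123456789
   generate str9  id_s = "123456789"
   * show all digits instead of the default display format
   format id_f id_l %12.0f
   list

-list- shows id_f as 123456792, while id_l and id_s keep 123456789.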

Here I construct the string variable -newid-; for the first observation
in the sample data it becomes "206141-450":

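   * region code as a 3-character string, zero-padded
   generate sa1=string(a1,"%03.0f")
   * firm id as a string without decimals
   generate sa3=string(a3,"%6.0f")
   * combined id: firm id, hyphen, region code
   generate newid=sa3 + "-" + sa1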
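
Whether -newid- and -year- together identify the observations can be
checked with -duplicates report- or -isid-. The following is only a
sketch: it types in the first six sample observations with -input-, and
the variable name -firm- is hypothetical, used for an alternative
numeric id made with -egen, group()-.

   clear
   input year a1 a2 a3
   1995 450 57 206141
   1995 450 54 206141
   1996 450 57 206141
   1996 450 54 206141
   1997 450 57 206141
   1997 450 54 206141
   end
   generate sa1=string(a1,"%03.0f")
   generate sa3=string(a3,"%6.0f")
   generate newid=sa3 + "-" + sa1
   * count observations sharing the same newid and year
   duplicates report newid year
   * -isid- exits with an error if the variables do not
   * uniquely identify the observations
   capture noisily isid newid year
   isid newid year a2
   * alternative: numeric id 1, 2, ... for each a1-a3 combination
   egen long firm = group(a1 a3)

With these six observations -isid newid year- fails, because each year
has one record per subregion (a2), while -isid newid year a2- passes.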

Hope this helps



Svend Juul
Institut for Folkesundhed, Afdeling for Epidemiologi
(Institute of Public Health, Department of Epidemiology)
Vennelyst Boulevard 6
DK-8000  Aarhus C, Denmark
Phone: +45 8942 6090
Home:  +45 8693 7796
