How do I create individual identifiers numbered from 1 upwards?
Title: Creating group identifiers
Author: Nicholas J. Cox, Durham University, UK
        William Gould, StataCorp
Date: December 1999; minor revisions March 2001
Case 1. I want to create variable id containing 1, 2, 3, ...
Type
. gen id = _n
_n is the Stata way of referring to the observation number.
In a 10-observation dataset, _n takes on the values 1, 2, ..., 10.
Case 2. I already have an id variable, and I have multiple observations
per id, but I want a new id variable containing 1 for the first id, 2 for
the second, and so on.
Such questions often arise with panel data and in other circumstances.
Perhaps the identifier variable is a string — id "numbers" 1A038,
2B217, ... — and you need numeric identifiers — 1, 2, ...
— because some Stata commands require them. Perhaps the original id
is numeric — of the form 102938, 149384, 150394, ... — but you
want to draw a graph using the identifier as one of the axes and want the
data points equally spaced.
Answer 1.
To create a new variable newid from the existing variable
oldid, whether oldid is string or numeric, type
. egen newid = group(oldid)
The new variable newid will contain 1 for the first value of
oldid in sorted order, 2 for the second value, and so on.
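As a minimal sketch of answer 1, the do-file fragment below builds a small toy dataset with string identifiers and numbers them with egen. The data values are made up for illustration, loosely following the "1A038, 2B217" pattern mentioned above; only oldid and newid are names taken from the text.

```stata
* Hypothetical toy data: a string identifier with repeated observations
clear
input str5 oldid
"2B217"
"1A038"
"2B217"
"1A038"
"3C001"
end

* group() numbers the distinct values of oldid in sorted order
egen newid = group(oldid)

* Sort so that observations with the same identifier are listed together
sort oldid
list oldid newid, sepby(oldid)
```

Under these assumptions, "1A038" should receive newid 1, "2B217" should receive 2, and "3C001" should receive 3, regardless of the order in which the observations were entered.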
Answer 2.
To create a new variable newid from the existing variable
oldid, whether oldid is string or numeric, type
. sort oldid
. by oldid: gen newid = 1 if _n==1
. replace newid = sum(newid)
. replace newid = . if missing(oldid)
Both answers yield the same results: the four lines of answer 2 amount to
what egen does. It is, however, worth understanding answer 2.
We start with the existing identifier oldid, which may be either a numeric
variable or a string variable.
. sort oldid
This command puts the observations in the order of oldid.
. by oldid: gen newid = 1 if _n == 1
This command creates a new variable newid that is 1 for the first
observation for each individual and missing otherwise. _n is the
Stata way of referring to the observation number; in a 10-observation
dataset, _n takes on the values 1, 2, ..., 10. When _n is
combined with by, however, _n is the observation number within
by-group, in this case, within oldid. If there were three
oldid==1 observations followed by two oldid==2 observations in
the dataset, _n would take on the values 1, 2, 3, 1, 2. Thus,
by ...: ... if _n==1 is a way to refer to the first
observation in each by-group. See the sections of [U] indexed under
by varlist: prefix.
by oldid: gen newid=1 if _n==1 sets newid to 1 in the first
observation of each oldid.
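To make the difference between the two meanings of _n concrete, here is a small sketch using the three-plus-two example described above. The variable names obsno, seq, and first are hypothetical and chosen only for this illustration.

```stata
* Toy data: three observations with oldid==1 followed by two with oldid==2
clear
input oldid
1
1
1
2
2
end

sort oldid

* Without by, _n is the observation number in the whole dataset
gen obsno = _n

* With by, _n restarts at 1 within each value of oldid
by oldid: gen seq = _n

* Flag the first observation in each by-group
by oldid: gen first = (_n == 1)

list, sepby(oldid)
```

Here obsno should run 1, 2, 3, 4, 5, while seq should run 1, 2, 3, 1, 2, and first should be 1 only where seq is 1.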
. replace newid = sum(newid)
This command replaces newid by its cumulative or running sum.
. replace newid = . if missing(oldid)
This command sets newid to missing wherever oldid is missing. This step
is probably unnecessary because if oldid really is an ID variable, it
should never contain missing values anyway.
Let us see how that works for a simple dataset. Missing values (.)
make no difference to a cumulative sum. In that context, they are treated as
numerically equal to 0.
oldid   newid (as created)   newid (as replaced)
    1                    1                     1
    1                    .                     1
    1                    .                     1
    1                    .                     1
   22                    1                     2
   22                    .                     2
   22                    .                     2
   33                    1                     3
   33                    .                     3
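
The table above can be reproduced with a short do-file. This is a sketch rather than a definitive recipe: it enters the same nine oldid values, applies the four lines of answer 2, and then compares the result with egen. The variable name check is hypothetical.

```stata
* Enter the nine oldid values from the table
clear
input oldid
1
1
1
1
22
22
22
33
33
end

* The four steps of answer 2
sort oldid
by oldid: gen newid = 1 if _n == 1
* sum() treats missing as 0, so the running total rises only at the
* first observation of each group
replace newid = sum(newid)
replace newid = . if missing(oldid)

* Cross-check against answer 1; assert stops with an error if they differ
egen check = group(oldid)
assert newid == check

list, sepby(oldid)
```

If the assert line passes silently, the two answers agree for this dataset, and the listing should match the last column of the table.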