The following is a 3-page extract from the 32 pages of Lecture 4,
NetCourse®
101, An Introduction to Stata.
NetCourses: Sample lecture NC101
Ensuring that identifiers are unique
For a match–merge to work, the identifier or identifiers must uniquely
identify each observation. Ensuring that they do is your responsibility,
and I want to show you how to do that. Here’s what you should do before you
merge:
. use one, clear
. by id: assert _N==1
. use two, clear
. by id: assert _N==1
I’ll explain what the following commands do, but, here’s what would have
happened if the identifier had not been unique:
. use baddata, clear
. by id: assert _N==1
r(9);
Not much useful information there, unless you know how to read it. Typing
search r(9) in Stata reveals quite a bit. Now I will explain.
assert is a simple little command that tells me if an expression is
true or false. For instance, to find out if the number 2 is equal to 1 + 1, I could type
. assert 2==1+1
assert showed nothing, and that’s assert’s way of saying, yes it
is, (assert is the strong, silent type). Let’s try a patently false
statement:
. assert 2+2 == 5
assertion is false
r(9)
assert, a nothing, gains its power through Stata’s standard syntax just
as count, another nothing, does.
If I type assert _N==1, I am checking that _N is 1. _N (as you
remember from Lecture 2) is Stataspeak for the total number of observations in
the dataset. You may also remember from Lecture 2 that _n refers to the
observation number in the dataset. In Stataspeak, the observations are
numbered _n, where _n = 1, 2, 3, ..., _N.
Now, remember the definition of by:
-
by is formally defined as producing the same result as if you formed
separate datasets for each by-group and ran the command separately on each
group’s data.
Thus, with a "by id", _N refers to the total number of observations in the
dataset that have that id number. Therefore, by id: assert _N==1 asserts
that, for each id number, there is one and only one observation. Let’s try it
with the data for which it was not true:
. use baddata, clear
. by id: assert _N==1
4 contradictions in 5 by-groups
assertion is false
r(9);
. list
+----+
| id |
|----|
1. | 1 |
2. | 1 |
3. | 1 |
4. | 2 |
5. | 2 |
|----|
6. | 3 |
7. | 3 |
8. | 4 |
9. | 5 |
10. | 5 |
+----+
by id: assert _N==1 checked out every id number and, for each one,
reported whether the assertion _N==1 was true or false and kept a
running total for us. We can obviously see that the only unique id is in
observation 8. Thus we confirm that "4 contradictions in 5 by-groups" is
correct.
If I put a quietly in front of by id: assert _N==1, I will
suppress the detail but still report the overall summary of the truth value of
my assertion:
. quietly by id: assert _N==1
r(9);
There are other ways you could verify that an id is really unique. One of
the alternatives would be duplicates report id.
. duplicates report id
Duplicates in terms of id
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 1 0
2 | 6 3
3 | 3 2
--------------------------------------
The duplicates help file tells us:
-
"duplicates report produces a table showing observations
that occur as one or more copies and indicating how many observations are `surplus' in the sense
that they are the second (third, ...) copy of the first of each group
of duplicates."
A second alternative would be to use the isid command, which
checks whether the specified variables identify the observations uniquely.
. isid id
variable id does not uniquely identify the observations
r(459);
If the id variable contained all unique values, isid would not give any
output at all.
A third alternative would be to tabulate id:
. tabulate id
id | Freq. Percent Cum.
------------+-----------------------------------
1 | 3 30.00 30.00
2 | 2 20.00 50.00
3 | 2 20.00 70.00
4 | 1 10.00 80.00
5 | 2 20.00 100.00
------------+-----------------------------------
Total | 10 100.00
While we can see the problems, just as we could in the above example when we
listed the data, imagine that we have several thousand or even several million
observations. Using the tabulate or listing method would be time
consuming, if not futile.
- tabulate id might not work. There might be too many values for
tabulate to list, or if there are more than 12,000 unique values
(12,000 for Stata/SE and 3,000 for Intercooled Stata), tabulate will
give an error. Remember, we are verifying uniqueness before a merge.
The identifier used with merge can take on any number of values (fewer
than 12,000 in Stata/SE and 3,000 in Intercooled Stata if you want to
tabulate them), and merge is perfectly happy if the id variable
is a string.
- Even if tabulate id works, the listing might be so long that
I do not spot the problem. by id: assert _N==1 looks down
the list for me.
Typically, one variable identifies the observations, such as my id, which I
have said is the id number of the patient. Sometimes, however, it takes more
than one variable to uniquely identify observations. Suppose I have a dataset of patients across hospitals. I
would have one variable, hid, which identifies the hospital and another variable,
pid, which is the hospital’s identification number for the patient. In that
case, hid and pid together would uniquely identify the patients. merge will
allow this,
. merge hid pid using ...
Before performing the merge, however, I must remember to sort the data
(of course merge will remind me if I forget):
. sort hid pid
When I sort the data, I should verify that hid and pid together really do
uniquely identify the patients:
. by hid pid: assert _N==1
In this case, they do.
NetCourse is a registered trademark of StataCorp LP.
|