Stata: Data Analysis and Statistical Software
   >> Home >> Products >> Training >> NetCourses >> Sample lecture NC101
The following is a 3-page extract from the 32 pages of Lecture 4, NetCourse® 101, An Introduction to Stata.

NetCourses: Sample lecture NC101

Ensuring that identifiers are unique

For a match–merge to work, the identifier or identifiers must uniquely identify each observation. Ensuring that they do is your responsibility, and I want to show you how to do that. Here’s what you should do before you merge:

        . use one, clear
        
        . by id:  assert _N==1
        
        . use two, clear
        
        . by id:  assert _N==1

I’ll explain what the following commands do, but, here’s what would have happened if the identifier had not been unique:

        . use baddata, clear
        
        . by id:  assert _N==1
        r(9);

Not much useful information there, unless you know how to read it. Typing search r(9) in Stata reveals quite a bit. Now I will explain. assert is a simple little command that tells me if an expression is true or false. For instance, to find out if the number 2 is equal to 1 + 1, I could type

        . assert 2==1+1

assert showed nothing, and that’s assert’s way of saying, yes it is, (assert is the strong, silent type). Let’s try a patently false statement:

        . assert 2+2 == 5
        assertion is false
        r(9)

assert, a nothing, gains its power through Stata’s standard syntax just as count, another nothing, does.

If I type assert _N==1, I am checking that _N is 1. _N (as you remember from Lecture 2) is Stataspeak for the total number of observations in the dataset. You may also remember from Lecture 2 that _n refers to the observation number in the dataset. In Stataspeak, the observations are numbered _n, where _n = 1, 2, 3, ..., _N.

Now, remember the definition of by:

by is formally defined as producing the same result as if you formed separate datasets for each by-group and ran the command separately on each group’s data.

Thus, with a "by id", _N refers to the total number of observations in the dataset that have that id number. Therefore, by id: assert _N==1 asserts that, for each id number, there is one and only one observation. Let’s try it with the data for which it was not true:

        . use baddata, clear
        
        . by id:  assert _N==1
        4 contradictions in 5 by-groups
        assertion is false
        r(9);
        
        . list 
        
             +----+
             | id |
             |----|
          1. |  1 |
          2. |  1 |
          3. |  1 |
          4. |  2 |
          5. |  2 |
             |----|
          6. |  3 |
          7. |  3 |
          8. |  4 |
          9. |  5 |
         10. |  5 |
             +----+

by id: assert _N==1 checked out every id number and, for each one, reported whether the assertion _N==1 was true or false and kept a running total for us. We can obviously see that the only unique id is in observation 8. Thus we confirm that "4 contradictions in 5 by-groups" is correct.

If I put a quietly in front of by id: assert _N==1, I will suppress the detail but still report the overall summary of the truth value of my assertion:

        . quietly by id: assert _N==1
        r(9);

There are other ways you could verify that an id is really unique. One of the alternatives would be duplicates report id.

        . duplicates report id
        
        Duplicates in terms of id
        
        --------------------------------------
           copies | observations       surplus
        ----------+---------------------------
                1 |            1             0
                2 |            6             3
                3 |            3             2
        --------------------------------------

The duplicates help file tells us:

"duplicates report produces a table showing observations that occur as one or more copies and indicating how many observations are `surplus' in the sense that they are the second (third, ...) copy of the first of each group of duplicates."

A second alternative would be to use the isid command, which checks whether the specified variables identify the observations uniquely.

        . isid id
        variable id does not uniquely identify the observations
        r(459);

If the id variable contained all unique values, isid would not give any output at all.

A third alternative would be to tabulate id:

        . tabulate id
        
                 id |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |          3       30.00       30.00
                  2 |          2       20.00       50.00
                  3 |          2       20.00       70.00
                  4 |          1       10.00       80.00
                  5 |          2       20.00      100.00
        ------------+-----------------------------------
              Total |         10      100.00

While we can see the problems, just as we could in the above example when we listed the data, imagine that we have several thousand or even several million observations. Using the tabulate or listing method would be time consuming, if not futile.

  1. tabulate id might not work. There might be too many values for tabulate to list, or if there are more than 12,000 unique values (12,000 for Stata/SE and 3,000 for Intercooled Stata), tabulate will give an error. Remember, we are verifying uniqueness before a merge. The identifier used with merge can take on any number of values (fewer than 12,000 in Stata/SE and 3,000 in Intercooled Stata if you want to tabulate them), and merge is perfectly happy if the id variable is a string.
  2. Even if tabulate id works, the listing might be so long that I do not spot the problem. by id: assert _N==1 looks down the list for me.

Typically, one variable identifies the observations, such as my id, which I have said is the id number of the patient. Sometimes, however, it takes more than one variable to uniquely identify observations. Suppose I have a dataset of patients across hospitals. I would have one variable, hid, which identifies the hospital and another variable, pid, which is the hospital’s identification number for the patient. In that case, hid and pid together would uniquely identify the patients. merge will allow this,

        . merge hid pid using ...

Before performing the merge, however, I must remember to sort the data (of course merge will remind me if I forget):

        . sort hid pid

When I sort the data, I should verify that hid and pid together really do uniquely identify the patients:

        . by hid pid: assert _N==1

In this case, they do.

NetCourse is a registered trademark of StataCorp LP.

Bookmark and Share 
NetCourses
Overview
Schedule
Enroll
FAQ
NC101
NC151
NC152
NC461
What is NetCourseNow?
Sample lecture
User comments
Like us on Facebook Follow us on Twitter Follow us on LinkedIn Google+ Watch us on YouTube
Follow us
© Copyright 1996–2013 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index   |   View mobile site