


RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable

From   "Sarah Edgington" <>
To   <>
Subject   RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
Date   Tue, 12 Jun 2012 18:02:01 -0700

One thing I don't think you've mentioned is whether you have any
duplicate observations per person in the data set that you are trying to merge
onto the visit data.  If you have multiple visits for each ID in your master
data set but the using data set has only one record per ID, you can simply do
an m:1 merge and you shouldn't have any problems.  If your other file has
multiple records per ID, then your problem is more complicated, and merging
the files as-is is probably not a good idea.
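
As a rough sketch of how you might check for duplicates in the using file before merging (the toy data and the variable names IDcode and female are made up for illustration, not taken from your files):

```stata
* Toy "using" data: one record per person (hypothetical variables)
clear
input IDcode female
1 0
2 1
3 1
end

* -isid- exits with an error if IDcode does not uniquely
* identify the observations, so it is a quick yes/no check
isid IDcode

* -duplicates report- shows how many IDs appear once, twice, etc.
duplicates report IDcode
```

If -isid- runs silently, the file has one record per ID and an m:1 merge on that variable is appropriate.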

Nick is right that the correct merge should not create duplicates.  There
are a number of ways to confirm this for yourself without having to
-reshape- the data to wide form.
For me the best place to start is by looking carefully at the created _merge
variable.  Are there cases that didn't match?  Did you expect that?  If not,
that bears investigating.
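
Something like the following would do for that first look at _merge.  It is only a sketch on toy data; the variable names (IDcode, visit, score, female) are assumptions for the example.

```stata
* Toy using file: one record per person (ID 4 has no visits)
clear
input IDcode female
1 0
2 1
4 1
end
tempfile persons
save `persons'

* Toy master file: one record per visit (ID 3 is not in the using file)
clear
input IDcode visit score
1 1 10
1 2 12
2 1 9
3 1 11
3 2 14
end

merge m:1 IDcode using `persons'

* 1 = master only, 2 = using only, 3 = matched
tabulate _merge

* List the records that did not match, to see whether that was expected
list IDcode visit if _merge == 1
list IDcode if _merge == 2
```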

Next, look at the overall number of observations.  First, count how many
observations are in the master dataset in long form (that is, the data with
ID codes and multiple visits per ID).  Then, if you do a many to one merge
using your second data set you should find that [original observations] =
[number matched] + [number in master only].  If that isn't the case,
something is likely wrong.
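
Here is one way that count check could be written, again as a sketch on made-up data (all variable and macro names are hypothetical):

```stata
* Toy using file: one record per person
clear
input IDcode female
1 0
2 1
4 1
end
tempfile persons
save `persons'

* Toy master file: one record per visit
clear
input IDcode visit score
1 1 10
1 2 12
2 1 9
3 1 11
3 2 14
end

* Number of observations in the master data before the merge
count
local n_master = r(N)

merge m:1 IDcode using `persons'

* Matched observations and master-only observations after the merge
count if _merge == 3
local n_matched = r(N)
count if _merge == 1
local n_master_only = r(N)

* Stops with an error if the merge changed the number of master records
assert `n_master' == `n_matched' + `n_master_only'
```

Observations with _merge == 2 come only from the using file, so they are left out of the comparison.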

Finally, if you're still worried and want to be sure that you have the exact
same records in your merged data as you did before the merge, try looking at
the means of some important variables from the master file before and after
the merge.  If your ID field is a numeric variable (though it's often best
if it isn't) then you can look at the N and mean of that variable before and
after the merge too.  If the distribution of variables from the master file
remains the same before and after the merge then you have some pretty good
evidence that you have not somehow introduced extra records.  (This assumes
that all the data in your master file matches a record in the using file; if
this isn't the case, go back to the first step and make sure you understand
why some records did not match.)
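
A minimal sketch of the before-and-after comparison, assuming toy data in which every master ID has a match (the variable names and the tolerance are my own choices for the example):

```stata
* Toy using file: one record per person
clear
input IDcode female
1 0
2 1
3 1
end
tempfile persons
save `persons'

* Toy master file: one record per visit
clear
input IDcode visit score
1 1 10
1 2 12
2 1 9
3 1 11
3 2 14
end

* N and mean of a master variable before the merge
summarize score
scalar n_before = r(N)
scalar mean_before = r(mean)

merge m:1 IDcode using `persons'

* The same summary after the merge should be unchanged
summarize score
assert r(N) == n_before
assert reldif(r(mean), mean_before) < 1e-8
```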

I know merging sometimes seems complicated, but as long as you pay very
close attention to the details of the output and make sure you understand
why some IDs matched and some didn't, it's generally going to be ok.  Unless
you're doing a many to many merge.  Then it's complicated and, in nearly all
cases, the wrong approach entirely.

Hope that helps.


-----Original Message-----
On Behalf Of Nick Cox
Sent: Tuesday, June 12, 2012 5:29 PM
Subject: Re: st: RE: RE: RE: RE: Combining multiple observations by an ID variable

Your original data structure strikes me as far better for the majority of
purposes for which it might be used within Stata. Whether -reshape
wide- is possible is thus secondary. It is almost certainly not a good idea.

Incidentally, -reshape- is a command, not a function. Also, I see no reason
why the correct -merge- command should create extra observations as you
imply here.


On Tue, Jun 12, 2012 at 11:31 PM, Claude Beaty <> wrote:

> Reshape was something I considered as well. Unfortunately, every time I
> attempt to run this code I get the error "too many macros". I have Stata 12,
> which I believe is the most updated version. If anyone knows of a way around
> this, please let me know.

Swanquist, Quinn Thomas

> Fair enough,
> If you need the observations to equal the number of visits and you need to
> keep the data from each visit, you are going to need to use the reshape wide
> function on the master dataset before the merge. Since you said that you
> have 70 variables for each visit, you will now have 70 * the max number of
> visits variables. Depending on your version of Stata you may or may not be
> able to work with that many variables.
> You can get help with this function using:
> help reshape

Claude Beaty

> It looks like the merger attempt was likely successful, though I'm sure
> there are some duplicates. However, your suggested code did not help to
> shift the data so that the total observations equal the number of ID codes
> instead of the number of visits. I have tried reshaping etc., but there are
> too many macros to reshape all of the variables. Is there another way? If I
> can arrange the data in this way, it is easier to compare with my previous
> file and find duplicate ID codes. As it stands now, it is difficult to tell
> if duplicate ID codes are due to successive visits or duplications created
> by the file merger.

Swanquist, Quinn Thomas

> Do you have an identifier for visit number? (If not, you could use date.)
> Sort as follows:
> sort IDcode visit
> then merge many to one as follows:
> merge m:1 IDcode using "usingfile"

Claude Beaty

> I have a large dataset of observations in which individuals (~40,000 
> ID codes) were evaluated multiple times (5-10 visit numbers per 
> individual) on over 70 variables. However, the data has been arranged 
> so that each visit number is an observation, instead of each 
> individual ID code as an observation. I need to merge this file with 
> another file sorted by individual ID codes. How do I rearrange this 
> data so that it is arranged by ID codes with consecutive follow up 
> visits? Thanks
