Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable

From	Claude Beaty <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
Date	Wed, 13 Jun 2012 14:35:12 +0000

Sarah,

Thank you for your suggestions on how to rule out duplicates. It appears that my merge succeeded without adding any unanticipated observations.

Claude A. Beaty Jr., M.D.
Halsted Surgical Resident
Cardiac Surgery Research Fellow
The Johns Hopkins Hospital

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Sarah Edgington
Sent: Tuesday, June 12, 2012 9:02 PM
To: [email protected]
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable

Claude,
One thing I don't think you've mentioned is whether you have any duplicate observations per person in the dataset that you are trying to merge onto the visit data.  If you have multiple visits for each ID in your master dataset but the using dataset has only one record per ID, you can simply do an m:1 merge and you shouldn't have any problems.  If your other file has multiple records per ID, then your problem is more complicated and merging the files as-is is probably not a good idea.
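
A minimal sketch of that check, assuming the ID variable is called IDcode as elsewhere in this thread; the toy records and the variable age are made up for illustration.

```stata
* Hypothetical using file: toy data standing in for the person-level file
clear
input IDcode age
1 54
2 61
3 47
end

* isid stops with an error unless IDcode uniquely identifies the observations
isid IDcode

* duplicates report shows how many IDcode values occur once, twice, and so on
duplicates report IDcode
```

If -isid- fails, -duplicates list IDcode- or -duplicates tag IDcode, generate(dup)- will show which IDs are repeated.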

Nick is right that the correct merge should not create duplicates.  There are a number of ways to confirm this for yourself without having to -reshape- the data to wide form.

For me the best place to start is by looking carefully at the created _merge variable.  Are there cases that didn't match?  Did you expect that?  If not, that bears investigating.
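
As a rough illustration of inspecting _merge, here is a sketch on invented toy data (the variables visit, sbp and age and the tempfile name person are assumptions, not from the real files).

```stata
* Hypothetical using file: one record per IDcode
clear
input IDcode age
1 54
2 61
4 39
end
tempfile person
save `person'

* Hypothetical master file: one record per visit, several visits per IDcode
clear
input IDcode visit sbp
1 1 120
1 2 118
2 1 135
3 1 128
end

merge m:1 IDcode using `person'

* _merge: 1 = master only, 2 = using only, 3 = matched
tabulate _merge

* Look at the records that did not match and decide whether that was expected
list IDcode visit _merge if _merge != 3
```

In this toy example IDcode 3 appears only in the master file and IDcode 4 only in the using file, so both show up in the listing.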

Next, look at the overall number of observations.  First, count how many observations are in the master dataset in long form (that is, the data with ID codes and multiple visits per ID).  Then, if you do a many-to-one merge using your second dataset, you should find that [original observations] = [number matched] + [number in master only].  If that isn't the case, something is likely wrong.
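
The count check can be sketched as follows, again on invented toy data (visit, sbp, age and the local macro names are assumptions for illustration).

```stata
* Hypothetical using file: one record per IDcode
clear
input IDcode age
1 54
2 61
4 39
end
tempfile person
save `person'

* Hypothetical master file in long form: one record per visit
clear
input IDcode visit sbp
1 1 120
1 2 118
2 1 135
3 1 128
end

* Number of observations in the master data before the merge
count
local n_master = r(N)

merge m:1 IDcode using `person'

count if _merge == 3
local n_matched = r(N)
count if _merge == 1
local n_master_only = r(N)

* Stops with an error if the merge changed the number of master records
assert `n_master' == `n_matched' + `n_master_only'
```

Records that exist only in the using file (_merge == 2) are added as new observations, so the total after the merge can exceed the original count even when the merge is correct. Adding the option keep(master match) to -merge- drops them.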

Finally, if you're still worried and want to be sure that you have the exact same records in your merged data as you did before the merge, try looking at the means of some important variables from the master file before and after the merge.  If your ID field is a numeric variable (though it's often best if it isn't) then you can look at the N and mean of that variable before and after the merge too.  If the distribution of variables from the master file remains the same before and after the merge, then you have some pretty good evidence that you have not somehow introduced extra records.  (This assumes that all the data in your master file matches a record in the using file; if this isn't the case, go back to the first step and make sure you understand why.)
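
A sketch of the before-and-after comparison, assuming a hypothetical master variable sbp and using keep(master match) so that using-only records do not enter the comparison; the data are invented.

```stata
* Hypothetical using file: one record per IDcode
clear
input IDcode age
1 54
2 61
4 39
end
tempfile person
save `person'

* Hypothetical master file in long form: one record per visit
clear
input IDcode visit sbp
1 1 120
1 2 118
2 1 135
3 1 128
end

* N and mean of a master variable before the merge
summarize sbp
scalar n_before = r(N)
scalar mean_before = r(mean)

* Keep master-only and matched records; drop using-only records
merge m:1 IDcode using `person', keep(master match)

* The same variable should have the same N and mean afterwards
summarize sbp
assert r(N) == n_before
assert reldif(r(mean), mean_before) < 1e-12
```

The small tolerance in the last line allows for floating-point differences if the merge changes the sort order.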

I know merging sometimes seems complicated, but as long as you pay very close attention to the details of the output and make sure you understand why some IDs matched and some didn't, it's generally going to be ok.  Unless you're doing a many-to-many merge.  Then it's complicated and, in nearly all cases, the wrong approach entirely.

Hope that helps.

-Sarah

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
Sent: Tuesday, June 12, 2012 5:29 PM
To: [email protected]
Subject: Re: st: RE: RE: RE: RE: Combining multiple observations by an ID variable

Your original data structure strikes me as far better for the majority of purposes for which it might be used within Stata. Whether -reshape wide- is possible is thus secondary. It is almost certainly not a good idea.

Incidentally, -reshape- is a command, not a function. Also, I see no reason why the correct -merge- command should create extra observations as you imply here.

Nick

On Tue, Jun 12, 2012 at 11:31 PM, Claude Beaty <[email protected]> wrote:

> Reshape was something I considered as well. Unfortunately, every time I attempt to run this code I get the error "too many macros". I have Stata 12, which I believe is the most recent version. If anyone knows of a way around this, please let me know.

Swanquist, Quinn Thomas

> Fair enough,
>
> If you need the observations to equal the number of visits and you need to keep the data from each visit, you are going to need to use the reshape wide function on the master dataset before the merge. Since you said that you have 70 variables for each visit, you will now have 70 * the max number of visits variables. Depending on your version of Stata you may or may not be able to work with that many variables.
>
> You can get help with this function using:
>
> help reshape

Claude Beaty

> It looks like the merger attempt was likely successful, though I'm sure there are some duplicates. However, your suggested code did not help to shift the data so that the total observations equal the number of ID codes instead of the number of visits. I have tried reshaping etc, but there are too many macros to reshape all of the variables. Is there another way? If I can arrange the data in this way, it is easier to compare with my previous file and find duplicate ID codes. As it stands now, it is difficult to tell if duplicate ID codes are due to successive visits or duplications created by the file merger.

Swanquist, Quinn Thomas

> Do you have an identifier for visit number? (If not, you could use the date.)
>
> Sort as follows:
>
> sort IDcode visit
>
> then merge many to one as follows:
>
> merge m:1 IDcode using "usingfile"

Claude Beaty

> I have a large dataset of observations in which individuals (~40,000 
> ID codes) were evaluated multiple times (5-10 visit numbers per
> individual) on over 70 variables. However, the data has been arranged 
> so that each visit number is an observation, instead of each 
> individual ID code as an observation. I need to merge this file with 
> another file sorted by individual ID codes. How do I rearrange this 
> data so that it is arranged by ID codes with consecutive follow up 
> visits? Thanks

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


