Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable

From   "Sarah Edgington" <>
To   <>
Subject   RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable
Date   Wed, 13 Jun 2012 09:50:55 -0700

What is your actual motivation for wanting the data in wide form?  Your
previous messages seem to suggest that you've been able to successfully do
the merge of the two datasets without reshaping.  Is there any reason, then,
to translate the data into wide form?  It seems to me like it's going to be
much harder to work with wide data, particularly if your number of visits
per person is not constant. 

When your data is in long form as it is now, doing calculations for
individuals will require understanding how to use -by- effectively.  Once
you get that logic, though, I think you'll find many calculations fairly
straightforward.  However, if you make the data wide I predict that you'll
find yourself often having to loop through variables for each follow-up
visit and figure out how to deal with missing values in a way that's tedious
and confusing.

Take as an example something as simple as merely counting how many follow-up
visits each patient has.  If each record is a single follow-up visit then in
long form this is simply:  - by trr_id_code: gen nvisits=_N -
Trying to do the same thing in wide form requires somewhat more complicated
maneuvering involving checking to see whether there are missing values for
certain variables to determine whether a visit happened or not (without
knowing more about the data I can't say exactly how you'd construct a number
of visits variable in wide form, but probably it would involve counting
non-missing dates or something similar).  And it's just going to get more
complicated from there.

Now you may be thinking to yourself that the nvisits variable I just
suggested is useless because you can't, for instance, - sum nvisits - and
get the mean number of visits per person because you have multiple records
per person.  However, this too is easily handled with long data.  If you
construct measures that are constant within observations (as the nvisits
variable I just suggested would be) you can simply create a flag for the
first record per person to use for generating descriptive statistics.  For
instance -by trr_id_code: gen indexrec=(_n==1) - would create an indicator
equal to one for the first observation for a particular patient and zero for
all others. 

Of course what makes most sense depends entirely on what sort of analysis
you're doing.  However, even if you ultimately want to end up with only one
observation per patient, it really does often make more sense to start in
long form.  I know it may seem counter-intuitive but it may be easiest to
create your analytic variables that summarize individual experiences while
the data is long and then -keep- only a single observation per person.  I
think this is particularly likely to be true if there's a lot of variation
in visits per individual.  If you're in wide form not only do you risk
running out of room to create new variables, you also have to think
carefully through how to construct measures differently for a patient who
has 2 visits and one who has 10.


-----Original Message-----
[] On Behalf Of Claude Beaty
Sent: Wednesday, June 13, 2012 8:33 AM
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID


As suggested, I am including more information about my master dataset. My
individual ID variable is " trr_id_code". My follow up visit ID variable is
"trr_fol_id_code". Both variables are string. As previously mentioned, I
have about 50,000 "trr_id_code" observations and over 350,000
"trr_fol_id_code" observations. Currently, the dataset is in the long form
by the "trr_fol_id_code" variable. I would like this dataset to be in the
long form by the "trr_id_code" variable instead (the wide form of the
"trr_fol_id_code" variable), as I currently have another dataset which is
organized in this way and would like to merge the two files. I am using the
following code to accomplish this task:

sort trr_id_code
unab vlist:_all
reshape wide `vlist', i(trr_id_code) j(trr_fol_id_code) string

When this code is applied to the master dataset (approximately 70 variables
in the variable list), I receive the error code "too many macros". I have
attempted to -reshape- after merging by " trr_id_code" and paring down the
database to approximately 13,000 "trr_id_code" observations and 30,000
"trr_fol_id_code" observations, but the increased number of variables in my
second dataset (460) results in the same error message. Is my code
incorrect, or have I reached the limit of Stata's capabilities by having so
many variables and/or observations? Any thoughts would be appreciated.

Claude A. Beaty Jr., M.D.
Halsted Surgical Resident
Cardiac Surgery Research Fellow
The Johns Hopkins Hospital

*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index