Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable

 From "Sarah Edgington" To Subject RE: st: RE: RE: RE: RE: Combining multiple observations by an ID variable Date Wed, 13 Jun 2012 10:54:13 -0700

```Claude,
If the comorbidity measures are constant across visits than the solution is
simply to keep only 1 visit.  If they are not constant you'll have to create
a rule for whether an individual patient has or does not have a given
comorbidity (or does or does not have some other factor) no matter what form
the data is in.  Implementing whatever rules you need to create measures
that are constant for an individual is probably going to be easier in a long
data set than it would be in a wide one for exactly the reasons my number of
visits example suggests.  Even if you don't need number of visits, thinking
through that example (and how you would calculate that sort of variable in a
wide dataset) is likely to help you figure out how to construct other
measures.

I've already addressed the issue of how to deal with multiple lines of data
per person making it impossible to calculate
prevalence/incidence/means/standard deviations/whatever other descriptive
statistic you want.  Just because you have multiple lines per person doesn't
mean you need to use all them in every command.

-Sarah

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Claude Beaty
Sent: Wednesday, June 13, 2012 10:28 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID
variable

Sarah,

I am attempting to investigate a clinical outcome based on patient
parameters prior to an intervention and then subsequent to that
intervention. I would like to incorporate and adjust for factors such as
existing comorbidities and the prevalence/incidence of these factors is
necessary for thorough analysis. One database has patient parameters prior
to the intervention while the other has patient outcomes subsequent to the
intervention (master). The master data set is in the long form by return
visit ID, and does not include visit dates. The problem is that when looking
at the data in the long form, individual patients are repeated multiple
times depending on the number of visits post-intervention, which is not
consistent. This will artificially affect both the prevalence and incidence
of all comorbidities. Likewise, an outcome that may be noticed in the first
post-intervention visit is subsequently re-noted in every other follow-up
visit, artificially affecting this measure !
as well. If I could look at the data in wide form, then each comorbidity
and subsequent outcome would only be counted once, instead of multiple
times. Incidentally, you may be asking why I made the database this way. It
is in fact a database that was coded and given to me by a national
organization and I had no input on its coding style.

I appreciate your examples regarding the potential difficulty in managing
the data in wide format and I admit, I am not proficient with the
-by-command. However, I am not interested in how many follow-up visits a
patient had. I am merely concerned with whether or not a certain outcome was
reached and by how many people. I think the purest way (that I know of) to
assess this is to arrange the data so that each person is counted only once.
However, if there are more efficient methods to accomplish these same goals
in the current layout, I am more than happy to use them. With my current
track record, I am skeptical that a wide reshaping is even possible in this
data set.

Claude A. Beaty Jr., M.D.
Halsted Surgical Resident
Cardiac Surgery Research Fellow
The Johns Hopkins Hospital

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sarah Edgington
Sent: Wednesday, June 13, 2012 12:51 PM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID
variable

Claude,
What is your actual motivation for wanting the data in wide form?  Your
previous messages seem to suggest that you've been able to successfully do
the merge of the two datasets without reshaping.  Is there any reason, then,
to translate the data into wide form?  It seems to me like it's going to be
much harder to work with wide data, particularly if your number of visits
per person is not constant.

When your data is in long form as it is now, doing calculations for
individuals will require understanding how to use -by- effectively.  Once
you get that logic, though, I think you'll find many calculations fairly
straightforward.  However, if you make the data wide I predict that you'll
find yourself often having to loop through variables for each follow-up
visit and figure out how to deal with missing values in a way that's tedious
and confusing.

Take as an example something as simple as merely counting how many follow-up
visits each patient has.  If each record is a single follow-up visit then in
long form this is simply:  - by trr_id_code: gen nvisits=_N - Trying to do
the same thing in wide form requires somewhat more complicated maneuvering
involving checking to see whether there are missing values for certain
variables to determine whether a visit happened or not (without knowing more
about the data I can't say exactly how you'd construct a number of visits
variable in wide form, but probably it would involve counting non-missing
dates or something similar).  And it's just going to get more complicated
from there.

Now you may be thinking to yourself that the nvisits variable I just
suggested is useless because you can't, for instance, - sum nvisits - and
get the mean number of visits per person because you have multiple records
per person.  However, this too is easily handled with long data.  If you
construct measures that are constant within observations (as the nvisits
variable I just suggested would be) you can simply create a flag for the
first record per person to use for generating descriptive statistics.  For
instance -by trr_id_code: gen indexrec=(_n==1) - would create an indicator
equal to one for the first observation for a particular patient and zero for
all others.

Of course what makes most sense depends entirely on what sort of analysis
you're doing.  However, even if you ultimately want to end up with only one
observation per patient, it really does often make more sense to start in
long form.  I know it may seem counter-intuitive but it may be easiest to
create your analytic variables that summarize individual experiences while
the data is long and then -keep- only a single observation per person.  I
think this is particularly likely to be true if there's a lot of variation
in visits per individual.  If you're in wide form not only do you risk
running out of room to create new variables, you also have to think
carefully through how to construct measures differently for a patient who
has 2 visits and one who has 10.

-Sarah

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Claude Beaty
Sent: Wednesday, June 13, 2012 8:33 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: RE: RE: RE: RE: Combining multiple observations by an ID
variable

Steve,

individual ID variable is " trr_id_code". My follow up visit ID variable is
"trr_fol_id_code". Both variables are string. As previously mentioned, I
have about 50,000 "trr_id_code" observations and over 350,000
"trr_fol_id_code" observations. Currently, the dataset is in the long form
by the "trr_fol_id_code" variable. I would like this dataset to be in the
long form by the "trr_id_code" variable instead (the wide form of the
"trr_fol_id_code" variable), as I currently have another dataset which is
organized in this way and would like to merge the two files. I am using the
following code to accomplish this task:

sort trr_id_code
unab vlist:_all
reshape wide `vlist', i(trr_id_code) j(trr_fol_id_code) string

When this code is applied to the master dataset (approximately 70 variables
in the variable list), I receive the error code "too many macros". I have
attempted to -reshape- after merging by " trr_id_code" and paring down the
database to approximately 13,000 "trr_id_code" observations and 30,000
"trr_fol_id_code" observations, but the increased number of variables in my
second dataset (460) results in the same error message. Is my code
incorrect, or have I reached the limit of Stata's capabilities by having so
many variables and/or observations? Any thoughts would be appreciated.

Claude A. Beaty Jr., M.D.
Halsted Surgical Resident
Cardiac Surgery Research Fellow
The Johns Hopkins Hospital

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```