Tools for longitudinal data management
In Stata, longitudinal data are usually stored in long form: each set of
measurements at a new time point constitutes a new record, and all records
for a subject share the same subject id. When exploring such data
interactively, most simple operations refer to records, but the answers
required often refer to subjects. The most obvious example is the question
"How many subjects are there?" The answer is the number of unique codes for
the subject id, which is returned by the Stata command codebook id, along
with much else. A simple alternative is the new command unique id, which
generalizes to, for example, unique id visit, which reports the number of
unique combinations of id and visit. In general, the command
. unique varlist, by(varname) gen(newvar)
will give the number of unique combinations of varlist. When the by()
option is present, the command also creates a new variable, newvar, which
contains the number of unique combinations of varlist for each level of
varname. For example,
. unique job, by(id) gen(jobvar)
reports the overall number of unique values of the variable job and
creates the variable jobvar, which contains the number of different
job codes for each subject.
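The same counts can be cross-checked with built-in commands. The following is a minimal sketch, not the implementation of unique: it assumes a small made-up dataset, and the variable names first_id, first_idvisit, first_job, first_idjob, and njobs are hypothetical. It relies on the egen function tag(), which flags one record in each distinct group.

```stata
* Sketch using built-in commands only; the data below are invented
clear
input id visit job
1 1 10
1 2 10
1 3 20
2 1 30
2 2 30
3 1 10
end

* Number of subjects: flag one record per id, then count the flags
egen byte first_id = tag(id)
count if first_id

* Number of unique combinations of id and visit
egen byte first_idvisit = tag(id visit)
count if first_idvisit

* Overall number of unique job codes
egen byte first_job = tag(job)
count if first_job

* Number of different job codes per subject, stored on every record
egen byte first_idjob = tag(id job)
bysort id: egen njobs = total(first_idjob)
list id visit job njobs, sepby(id)
```

Note that tag() skips records with missing values in the listed variables unless its missing option is given, so counts obtained this way may differ from a tool that treats missing as a value in its own right.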
Slightly more complex questions take the form: how many records satisfy the
condition C, where C refers to a single variable? An example is the
condition height == . (that is, height is missing). The command longch
answers such questions for both records and subjects, and takes the form
. longch id, c(height == .)
where id is the subject id variable name and c( ) contains the
condition. The output looks like this:
71 records fulfill the condition height == .
some : 46 subjects have height == . in at least one record
none : 51 subjects have height == . in no records
every: 0 subjects have height == . in every record
In addition, three logical variables called _some, _none, and _every are
created for convenience in further manipulation (e.g., dropping or keeping
records). These flag all records belonging to subjects with some, no, or
every record satisfying the condition, respectively.
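The logic behind these three flags can be illustrated with built-in commands. This is a sketch under stated assumptions rather than the code of longch itself: the data are invented, and the names miss_ht, some_c, every_c, none_c, and first_id are hypothetical stand-ins chosen to avoid clashing with the variables longch creates.

```stata
* Sketch using built-in commands only; the data below are invented
clear
input id visit height
1 1 150
1 2 .
2 1 160
2 2 162
3 1 .
3 2 .
end

* Record-level indicator of the condition (1 = true, 0 = false)
generate byte miss_ht = (height == .)
count if miss_ht

* Subject-level flags, copied to every record of the subject:
* the maximum is 1 if any record satisfies the condition,
* the minimum is 1 only if all records satisfy it
bysort id: egen byte some_c = max(miss_ht)
bysort id: egen byte every_c = min(miss_ht)
generate byte none_c = !some_c

* Count subjects rather than records in each category
egen byte first_id = tag(id)
count if first_id & some_c
count if first_id & none_c
count if first_id & every_c
```

With flags of this kind, a command such as drop if every_c would remove all records of subjects whose height is never recorded. As in the longch output, subjects counted under "every" are also counted under "some".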