I am having problems with the reshape command. Can you give further guidance?
|
Title
|
|
Problems with reshape
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
September 2001; updated February 2003, December 2003; minor revisions August 2005
|
reshape
is a very powerful command for restructuring your datasets. Like many other
powerful commands, it can seem rather complicated until you have
experience with it. Most questions about its use can be answered from a
careful reading of the manual entry at [D] reshape, but some further
notes and examples are added here.
For the specific problem of reshaping (e.g., pharmacokinetic)
Latin-square and crossover data, also check out
pkshape.
Is it really a problem for reshape?
reshape moves back and forth between what Stata calls wide data
structures and long data structures. In a wide structure, rectangular blocks
of data are held in several variables (i.e., several columns), whereas in a
long data structure, such blocks are stretched out into single variables
(one column for each block). In both cases, one or more identifier variables
exist alongside—at a minimum, a one row identifier for a wide structure
and a one row and one column identifier for a long structure.
Perhaps the most common kind of problem in practice is to reshape
from wide to long. Your data may come compactly stored with information for
each individual in one row (case, observation), but Stata shows a marked
preference for long structures, especially when observations are repeated
across time for each individual in a dataset. One good reason for this
preference is that a long structure is more general and easier to handle,
which is often true in many fields where the number of observations
may vary from individual to individual.
If you want to transform from one wide or block structure to another, this
can often be done with two applications of reshape, and examples are
given below. However, make sure first that your problem is not better
handled by
xpose, which
transposes the dataset so that observations become variables, and vice
versa.
reshape works with stubnames
A key point is that in reshaping from wide to long, reshape
expects to find one or more groups of variables so that names in each
group all begin with the same stubname. Thus the variable names
inc60, inc70, inc80, and inc90 all share the
stubname inc.
The syntax diagram for the command in the manual and in the online help
gives patterns for the so-called basic syntax (setting aside the options) of
reshape wide stubnames
reshape long stubnames
Thus, in the example above, the stubname would be inc. If we also
have variables pop60, pop70, pop80, and pop90,
then pop would be an extra stubname.
Suffixes should be numeric unless specified otherwise with string
Moreover, reshape expects what follows the stub—the
suffix—will be numeric, unless you specify otherwise with the
string option. Any nonnumeric characters may mess things up if you
overlook this rule, including not only letters but also the _
(underscore), which you might be using in your variable names. Suppose
you had variables pop90f and pop90m for male and female
population. You have three choices given the presence of the letters
f and m in the suffixes:
- specify pop as stub together with string
- rename your variables, say, to mpop90 and fpop90, or whatever
makes sense
- use the @ character, as explained in the manual.
On occasion, people use numeric suffixes with leading zeros, such as
01, 02, and so forth. reshape will understand these
properly only if they are declared as string.
Whatever you do, your choice of stubname will determine how the data can be
restructured.
The following example is based on an exchange on
Statalist.
Question
If I have many variables all occurring in pairs for two years 1997 and 1998,
so that the dataset looks like A97, A98, B97,
B98, and so on, is there any easy way to reshape the data to
long without typing all the stub names?
Answer
The variable names are collectively *97 *98, so we need a way
of expanding that list of wildcard names automatically and then
removing the suffix. We can work on either *97 or *98.
unab is usually
billed as a programmer's command, but it can be used interactively. It
unabbreviates a varlist and puts the result in a local macro.
. unab vars : *97
Then we zap all the occurrences of the suffix "97":
. local stubs : subinstr local vars "97" "", all
In other words, each occurrence of "97" is replaced by an empty
string; that is, they are removed. See
macro.
Then we can
. reshape long `stubs', options
You may need to rename some variable sets before running reshape
We have seen data in which monthly temperatures were stored as variables
January through December. Although the names are logical and
clear, reshape can do nothing with such a variable set until these
variables are renamed to a stub plus suffix form. The official Stata command
rename can only rename one variable at a time, so it is worth knowing
about other ways of renaming several variables at once. Here we will suppose
that January through December are stored in that order in your
dataset.
One way is to use the renvars command published in STB-60.
. renvars Jan-Dec \ temp1-temp12
Although it is not obvious from the syntax, renvars checks that the
new variable names (specified after the \) are really all new (and
all legal). renvars has many other options as well, so it could be
of use for other renaming problems.
Column identifiers split between variables must be combined
The help for reshape explains that i() can specify one or more
variables. However, this is not true of j(), which can specify only
one variable. If your column identifier is split between variables, you need
to put them together in a single variable. The easiest solution, especially
if the variables are categorical or discrete, is to use egen,
concat() or egen, group(). See
egen.
The following example is based on an exchange on
Statalist.
Question
I have an identifier that is split between variables. This situation is
common in ANOVA where treatment cells may be identified by more than one
factor. I have the following data:
. list, sep(6)
+----------------------------------------+
| level delay animal peak |
|----------------------------------------|
1. | 0 50 1_1_0F 773.75 |
2. | 0 100 1_1_0F 1001.63 |
3. | 75 50 1_1_0F 472.5 |
4. | 75 100 1_1_0F 927.875 |
5. | 85 50 1_1_0F 611.375 |
6. | 85 100 1_1_0F 654.375 |
|----------------------------------------|
7. | 0 50 1_1_1F 1116.88 |
8. | 0 100 1_1_1F 1101.38 |
9. | 75 50 1_1_1F 544.875 |
10. | 75 100 1_1_1F 567.875 |
11. | 85 50 1_1_1F 443.875 |
12. | 85 100 1_1_1F 466 |
+----------------------------------------+
The first two variables, level and delay, define treatment
conditions for which peak was measured for each animal. To
calculate a ratio from the conditions within each animal, I would like to
reshape this dataset to one that looks like animal peak0_50
peak0_100 peak75_50 peak75_100 peak85_50 peak85_100.
Answer
. egen treatment = concat(level delay), p(_)
. drop level delay
. reshape wide peak, i(animal) j(treatment) string
The factors level and delay take integer values,
so there is no difficulty in concatenating them into a string variable.
Should we later desire the original data structure, we can type
. reshape long
. split treatment, p(_) destring
. rename treatment1 level
. rename treatment2 delay
split is, broadly speaking, the inverse of egen, concat().
See
split.
You may need two reshapes to get to where you want to be
A traveller asks, “How do I get to X from here?”, and a local
replies, “If I was going to X, I wouldn't start from here”. Many
reshaping problems provoke this feeling. One broad strategic comment is to
underline a point exemplified at [D] reshape: although
reshape apparently offers only wide to long and long to wide
reshapes, either a reshape long followed by a reshape wide, or
the reverse, may be a good way of solving many problems, possibly with some
manipulations in between. Having said that, some users ask for ways of
getting from a wide structure to another wide structure. The best advice is
often to question why they really want to do that, as Stata generally offers
many more ways of working with long data structures than with wide data
structures. The long-term view is that you are likely to be better off with
a long structure.
Another possibility sometimes encountered is the need for reshape
long followed by reshape long, a mapping to what may be called a
"long long" structure. An example is given later.
A simple example of a double reshape is the calculation of a row
median, calculated for each observation across a set of variables. Although
user-written programs exist to do this, the double reshape device
provides a trick you may use for other problems. Let's
imagine a stubname x, and generate an identifier variable if it does
not exist:
. generate id = _n
. reshape long x, i(id) j(varno)
. egen medianx = median(x), by(id)
. reshape wide
We are back where we started but with the extra variable medianx.
The following examples are based on exchanges on
Statalist.
Question
I have a question that deals with rearranging a dataset. My variables are
electoral data: v1, v2, v3 ... are vote percentages for
parties, s1, s2, s3 ... are seat percentages for
parties, and p1, p2, p3 ... are names of parties.
The observations are single elections in given countries. I would like to
sort the variables so that v1 contains the highest vote percentage
for each election, v2 the second highest, and so on. Above all, I
have to sort the s* and p* accordingly, so that I am
still able to match vote percentages, seat percentages, and party names.
Answer
This solution is easy with a reshape back and forth, but perhaps not
otherwise.
First, as usual, generate an identifier variable if you do not have one:
. gen id = _n
Now reshape to long
. reshape long v s p , i(id) j(order)
and then sort to put highest vote first within each election:
. gsort id -v
Now recalculate rankings
. by id : replace order = _n
and reshape back to wide again:
. reshape wide
Question
I have a dataset with multiple observations for each identifier id
according to a second variable pos. I want to gather information by
id and pos on a set of other variables co*. Here is an
example:
id pos co1 co2 co3
1. 1 1 56 86 65
2. 1 1 44 55 66
3. 1 2 33 . .
What I want is to gather all the data in co* by id and
pos into a new dataset:
id pos co1 co2 co3 co4 co5 co6
1. 1 1 56 86 65 44 55 66
2. 1 2 33 . . . . .
Answer
From this example, neither id nor id and pos combined
identify observations uniquely, so we need to produce a new identifier. We
can then reshape to long:
. gen ID = _n
. reshape long co, i(ID) j(no)
It is the combinations of id and pos that define new
observations in the data structure we want, so we
. egen group = group(id pos)
. list
+---------------------------------+
| ID no id pos co group |
|---------------------------------|
1. | 1 1 1 1 56 1 |
2. | 1 2 1 1 86 1 |
3. | 1 3 1 1 65 1 |
4. | 2 1 1 1 44 1 |
5. | 2 2 1 1 55 1 |
|---------------------------------|
6. | 2 3 1 1 66 1 |
7. | 3 1 1 2 33 2 |
8. | 3 2 1 2 . 2 |
9. | 3 3 1 2 . 2 |
+---------------------------------+
Now let’s get the column numbers we want and reshape again:
. by group, sort: replace no = _n
. drop ID
. reshape wide co, i(group) j(no)
. drop group
. list
+----------------------------------------------+
| co1 co2 co3 co4 co5 co6 id pos |
|----------------------------------------------|
1. | 56 86 65 44 55 66 1 1 |
2. | 33 . . . . . 1 2 |
+----------------------------------------------+
Think of reshape of being like a Fourier transform, or even a
logarithm: transform, do some stuff, and transform again.
Question
I have a dataset on which services are provided and which counties are
covered by various agencies. Variables ES1-ES10 are 1 or 0 if a
particular service is or is not provided by an agency, and variables
A1-A5 are 1 or 0 if a particular county is or is not covered by an
agency. I want to tabulate service by county to show how many
providers of each service operate in each county.
Answer
What is important here—and what can easily be missed, as we did when
first trying to understand this question—is that the two stubs
ES and A are not on the same footing. The dataset is a
three-way array, agencies X service X counties, even though each agency is
taken to offer the same mix of services in the counties it covers. (If that
were not true, then the data structure we start from would be
misrepresenting reality.) Thus the reshaping is another double
reshape, but this time long followed by long.
Assuming an identifier variable id, which, as always, we can easily
create if absent, we should first preserve because we shall probably want
to return to the dataset in its present form:
. preserve
and then
. reshape long ES, i(id) j(service)
. reshape long A, i(id service) j(county)
The “long long” data structure has ES equal to 1 and
A equal to 1 when (a) a service is provided by an agency and (b) a
particular county is covered; thus we count only observations for which
both (a) and (b) are true, which we can specify via the weights for the
table:
. tabulate service county [w = ES * A]
If desired to return to the original data,
. restore
Acknowledgment
David Airey made helpful suggestions.
|